CMPUT 653 Deep Policy Gradient Methods

Syllabus

Term:

Fall, 2020

Lecture Date and Time:

MW 2:00 - 3:20 p.m.

Lecture Location:

Zoom link

Instructor:

Rupam Mahmood (armahmood@ualberta.ca)

Office Hours:

MW 3:30 - 4:00 p.m. Office hours can be booked on my website.

Overview

Once the input-output interface of a robot is determined, can we simply deploy a general-purpose system to control the robot without extensive hand-engineering? Neural networks with parameters learned by policy gradient methods are a candidate for such systems, and they have already been shown to learn to control real robots from scratch.

In this course, we learn the foundations of policy gradient methods and some of the fundamental differences between standard policy gradient methods, such as actor-critic, and those that combine well with neural networks and achieve practical success, such as Proximal Policy Optimization (PPO). We discuss a number of recent papers on policy gradient methods and conclude the course with a project, guided by the instructor, aimed at developing a mini-research contribution. Throughout the course, there will be a focus on computational frugality and compatibility with real-time updates.

Objectives

Derive and implement deep policy learning methods
Design experiments for policy learning, especially in real-time
Summarize research in policy gradient methods
Produce novel research in policy gradient methods

Prerequisites

This course requires knowledge of basic probability theory, linear algebra, and introductory reinforcement learning, as well as experience programming deep neural networks using PyTorch in Python 3.

Course Topics

Objectives, estimation, gradient bandit & function approximation
Markov decision processes, value functions & dynamic programming
Temporal difference learning
Off-policy learning
REINFORCE and batch actor-critic
Lambda returns, advantage estimation and PPO
Online representation search
Policy gradient methods for continuous actions such as DDPG and SAC

Course Work and Evaluation

Written assignment on basics of reinforcement learning 15%
Midterm on basics of reinforcement learning 10%
Programming assignment on policy gradient methods 15%
Course participation (forum) 10%
Guided course project 50%
- Project initial report 10%
- Project final presentation 10%
- Project final report 20%
- Project code 10%

Course Materials

All course reading material will be available online. We will be using the following textbook extensively: Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press. The book is available from the bookstore or online as a pdf here: http://www.incompleteideas.net/book/the-book-2nd.html

Academic Integrity

All assignments, both written and programming, are to be done individually. No exceptions. Students must write their own answers and code. Students are permitted and encouraged to discuss assignment problems and the contents of the course. However, the discussion should always be about high-level ideas. Students should not discuss with each other (or tutors) while writing answers to written questions or programming. Absolutely no sharing of answers or code with other students or tutors. All sources used in a problem solution must be acknowledged, e.g., web sites, books, research papers, personal communication with people, etc.

The University of Alberta is committed to the highest standards of academic integrity and honesty. Students are expected to be familiar with these standards regarding academic honesty and to uphold the policies of the University in this respect. Students are particularly urged to familiarize themselves with the provisions of the Code of Student Behaviour and avoid any behaviour which could potentially result in suspicions of cheating, plagiarism, misrepresentation of facts and/or participation in an offence. Academic dishonesty is a serious offence and can result in suspension or expulsion from the University. (GFC 29 SEP 2003)