ML-EL1 | Semester 7 | 4 (3-0-2) | Elective

Reinforcement Learning


Syllabus


Unit 1: Reinforcement Learning Foundations

Markov Decision Processes (MDPs) formalization (states, actions, transition probabilities, rewards), Bellman equations and optimality principle, Model-free vs. model-based RL, Exploration vs. exploitation tradeoff (ε-greedy, UCB, entropy regularization), Discounted future rewards and value iteration, Policy iteration algorithm, Generalized Policy Iteration (GPI).
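The value iteration algorithm covered in this unit can be sketched in a few lines. Below is a minimal illustration on a hypothetical two-state, two-action MDP (the transition table `P` and discount `gamma` are invented for demonstration); each sweep applies the Bellman optimality backup until the value function stops changing.

```python
# Hypothetical MDP: P[s][a] = list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9  # discount factor

def value_iteration(P, gamma, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: max over actions of expected return.
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in P[s]]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:   # stop once the largest update is below tolerance
            return V

V = value_iteration(P, gamma)
```

For this toy MDP the optimal policy stays in state 1 collecting reward 2 forever, so V(1) = 2/(1-0.9) = 20 and V(0) = 1 + 0.9·20 = 19.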


Unit 2: Model-Free Temporal Difference Learning

Monte Carlo methods (first-visit, every-visit, importance sampling), Temporal Difference (TD) learning (TD(0), TD(λ)), Q-Learning algorithm and off-policy learning, SARSA and on-policy control, Double Q-Learning and expected SARSA, N-step bootstrapping methods, Eligibility traces implementation.
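Tabular Q-learning, the central off-policy algorithm in this unit, fits in a short sketch. The five-state chain environment below is invented for illustration (move left/right, reward 1 at the rightmost state); the learner follows an ε-greedy behavior policy while the TD target bootstraps from the greedy action.

```python
import random

# Hypothetical 5-state chain: action 0 moves left, action 1 moves right;
# reaching state 4 yields reward 1 and ends the episode.
N, GOAL, gamma, alpha, eps = 5, 4, 0.95, 0.1, 0.1
Q = [[0.0, 0.0] for _ in range(N)]  # Q[state][action]

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

random.seed(0)
for _ in range(2000):
    s, done = 0, False
    while not done:
        # epsilon-greedy behavior policy (random tie-break early on)
        if random.random() < eps or Q[s][0] == Q[s][1]:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda a: Q[s][a])
        s2, r, done = step(s, a)
        # off-policy TD target: bootstrap from the greedy value of s2
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2
```

After training, the greedy policy moves right from every non-terminal state, and Q(3, right) approaches the true value 1.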


Unit 3: Function Approximation and Deep RL

Linear function approximation (average reward, gradient TD), Tile coding and radial basis functions, Deep Q-Networks (DQN) architecture (experience replay, target networks), Double DQN improvements, Dueling DQN and distributional RL, Rainbow DQN combining multiple improvements, Policy gradient methods (REINFORCE, actor-critic).
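The two DQN stabilizers named above, experience replay and a target network, can be illustrated without a deep-learning framework. The sketch below (a hypothetical setup, not the full DQN algorithm) uses a linear approximator Q(s, a) = w[a]·s in place of a neural net; the online weights are updated by semi-gradient TD while the target copy stays frozen between syncs.

```python
import random
from collections import deque

def q(w, s):
    """Linear Q-values: one weight vector per action, dotted with features s."""
    return [sum(wi * si for wi, si in zip(w[a], s)) for a in (0, 1)]

buffer = deque(maxlen=10_000)          # experience replay buffer
w = [[0.0, 0.0], [0.0, 0.0]]           # online weights: 2 actions x 2 features
w_target = [row[:] for row in w]       # frozen target-network copy
gamma, alpha = 0.99, 0.01

def train_step(batch_size=4):
    if len(buffer) < batch_size:
        return
    # sample a decorrelated minibatch from the replay buffer
    for s, a, r, s2, done in random.sample(buffer, batch_size):
        # TD target bootstraps from the *target* weights, not the online ones
        target = r if done else r + gamma * max(q(w_target, s2))
        td_err = target - q(w, s)[a]
        for i in range(len(s)):        # semi-gradient update on online weights
            w[a][i] += alpha * td_err * s[i]

# Every C environment steps one would sync: w_target = [row[:] for row in w]
```

Double DQN changes only the target line (the online net selects the argmax action, the target net evaluates it), which is why it slots cleanly into this loop.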


Unit 4: Policy Gradient Methods and Actor-Critic

REINFORCE algorithm with baseline subtraction, Advantage Actor-Critic (A2C/A3C), Proximal Policy Optimization (PPO) - clipped objective, Trust Region Policy Optimization (TRPO), Soft Actor-Critic (SAC), Deterministic Policy Gradients (DPG), Twin Delayed DDPG (TD3), Continuous control benchmarks (MuJoCo, PyBullet).
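PPO's clipped surrogate objective, listed above, reduces to a one-line per-sample loss. The sketch below shows just that term (a minimal illustration with invented inputs, omitting the value and entropy losses of a full PPO implementation); clipping the probability ratio removes the incentive to move the policy far outside the trust region in a single update.

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped objective, negated for gradient descent.

    ratio = pi_new(a|s) / pi_old(a|s); eps is the clip range.
    """
    clipped = max(min(ratio, 1 + eps), 1 - eps)   # clamp ratio to [1-eps, 1+eps]
    # pessimistic minimum: large ratios cannot increase the objective further
    return -min(ratio * advantage, clipped * advantage)
```

With a positive advantage, any ratio above 1 + eps is clipped, so the gradient with respect to the policy vanishes there; with a negative advantage, the minimum keeps the unclipped term when it is more pessimistic.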


Unit 5: Advanced RL Topics and Applications

Multi-agent RL (MARL) cooperation/competition, Centralized training with decentralized execution (CTDE), Self-play and population-based training, Model-based RL (Dyna, MBPO, Dreamer), Hierarchical RL (options framework, Feudal Networks), Offline RL (conservative Q-learning, behavior regularization), RLHF (RL from Human Feedback) for LLM alignment.
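Among the offline RL methods above, conservative Q-learning (CQL) adds a regularizer to the standard TD loss that pushes Q-values down on all actions while pushing them up on actions actually seen in the dataset. A minimal sketch of that penalty term (hypothetical per-state inputs; a real implementation applies it over batches alongside the TD loss):

```python
import math

def cql_penalty(q_values, data_action, weight=1.0):
    """CQL regularizer for one state: logsumexp over all action values
    minus the value of the action taken in the offline dataset."""
    logsumexp = math.log(sum(math.exp(q) for q in q_values))
    return weight * (logsumexp - q_values[data_action])
```

When the dataset action already dominates the other Q-values, the penalty shrinks toward zero; when out-of-distribution actions look spuriously good, the penalty grows, which is what keeps the learned policy close to the data.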