Introduction to Reinforcement Learning
Welcome to the repository for the Introduction to Reinforcement Learning course at Leiden University (4032IRLRN). This repository contains all assignment reports, each exploring a different aspect of reinforcement learning: from bandit problems and dynamic programming to model-free and model-based learning. The full code for each report is available in the GitHub repository linked below.
Authors
- Adrien Joon-Ha Im
- Bence Válint
Table of Contents
Assignment 1A – Exploration Strategies in Bandits
Comparative Analysis of Exploration Techniques in Multi-Armed Bandits
In this assignment, we explored the exploration-exploitation trade-off, a fundamental challenge in reinforcement learning, through multi-armed bandit problems. We implemented and evaluated three key strategies: ε-Greedy, Optimistic Initialization, and Upper Confidence Bound (UCB). Each method was tested across a range of parameter settings to assess its effectiveness in maximizing cumulative reward. Our experiments showed that while ε-Greedy is a simple and effective baseline, Optimistic Initialization and UCB performed better, converging faster and yielding higher long-term returns. We also highlighted the importance of parameter tuning for learning efficiency.
Grade: 8.0 / 10.0
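As a rough illustration of the strategies compared in this report, the sketch below shows ε-greedy and UCB action selection on a small Bernoulli bandit. The arm probabilities, ε, and the exploration constant c are arbitrary placeholder values for the example, not the settings used in the assignment.

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    # With probability epsilon pick a random arm, otherwise the greedy arm.
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

def ucb(Q, counts, t, c):
    # Upper Confidence Bound: add an exploration bonus that shrinks
    # as an arm is pulled more often.
    bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-9))
    return int(np.argmax(Q + bonus))

def run_bandit(select, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    p_true = np.array([0.2, 0.5, 0.7])   # hypothetical Bernoulli arm means
    Q = np.zeros(len(p_true))            # estimated action values
    counts = np.zeros(len(p_true))       # pull counts per arm
    total = 0.0
    for t in range(n_steps):
        a = select(Q, counts, t, rng)
        r = float(rng.random() < p_true[a])
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]    # incremental sample-average update
        total += r
    return total

eg = run_bandit(lambda Q, counts, t, rng: epsilon_greedy(Q, 0.1, rng))
ub = run_bandit(lambda Q, counts, t, rng: ucb(Q, counts, t, c=1.0))
print(f"epsilon-greedy return: {eg:.0f}, UCB return: {ub:.0f}")
```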
Assignment 1B – Dynamic Programming
Solving Markov Decision Processes with Policy and Value Iteration
This assignment focused on solving Markov Decision Processes (MDPs) with Dynamic Programming (DP) in the Windy Gridworld environment. We implemented and compared two fundamental DP algorithms: Policy Iteration and Value Iteration. Both methods use full knowledge of the environment to derive optimal policies through iterative evaluation and improvement of value functions. Our results demonstrated the strength of DP in small, deterministic environments and showed how the discount factor shapes the agent's behavior. While DP guarantees convergence, we also discussed its limited scalability and its restricted applicability to real-world problems.
Grade: 8.0 / 10.0
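For illustration, here is a minimal sketch of Value Iteration on a toy MDP. The transition table, rewards, and discount factor are made-up placeholders, not the Windy Gridworld setup used in the report.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Value Iteration on a tabular MDP.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a]
    is the expected immediate reward for taking action a in state s.
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                 for a in range(len(P[s]))]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:   # stop once the value function has converged
            break
    # Derive the greedy (optimal) policy from the converged values.
    policy = [int(np.argmax([R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                             for a in range(len(P[s]))]))
              for s in range(n_states)]
    return V, policy

# Tiny 3-state chain: action 0 stays, action 1 moves right; state 2 is terminal.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 1)], [(1.0, 2)]],
    [[(1.0, 2)], [(1.0, 2)]],
]
R = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
V, pi = value_iteration(P, R)
print("V:", np.round(V, 3), "policy:", pi)
```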
Assignment 2 – Model-Free Reinforcement Learning
Comparative Study of Model-Free Algorithms in Grid-Based RL
In this assignment, we implemented and compared four fundamental model-free reinforcement learning algorithms: Q-Learning, SARSA, Expected SARSA, and n-step SARSA. Each algorithm was studied in a stochastic grid environment called the ShortCut Environment. Our results highlighted key differences between on-policy and off-policy methods: Q-Learning consistently found riskier but optimal trajectories, while SARSA and Expected SARSA preferred safer paths. n-step SARSA offered a flexible middle ground, with its performance depending heavily on the chosen step size n. This assignment deepened our understanding of exploration trade-offs and of robustness in dynamic environments.
Grade: 8.9 / 10.0
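As a quick reference, the sketch below shows the tabular Q-Learning, SARSA, and Expected SARSA update rules side by side. The function names and hyperparameters are illustrative assumptions and are not taken from the ShortCut Environment code.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy action in the next state.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action actually selected next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def expected_sarsa_update(Q, s, a, r, s_next, epsilon=0.1, alpha=0.1, gamma=0.99):
    # On-policy, but bootstrap from the expectation under an epsilon-greedy policy.
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    target = r + gamma * np.dot(probs, Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Example: apply a single update to a 5-state, 2-action Q-table.
Q = np.zeros((5, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
print(Q[0])
```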
Assignment 3 – Model-Based Reinforcement Learning
Planning with Dyna and Prioritized Sweeping in Stochastic Gridworlds
In this final assignment, we investigated two model-based reinforcement learning algorithms: Dyna and Prioritized Sweeping (PS). Both agents use simulated experience to improve learning efficiency, updating Q-values with an internal model of state transitions and rewards. We compared the two algorithms under varying levels of environment stochasticity and planning budgets, with Q-learning included as a model-free baseline. Our results showed that both Dyna and PS learned faster than Q-learning, especially in deterministic settings: PS learned faster initially thanks to its prioritized value propagation, while Dyna eventually reached higher episode returns. The assignment helped us understand the trade-off between sample efficiency and computational cost, as well as the limitations of model-based planning under uncertainty.
Grade: 8.0 / 10.0
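To make the planning idea concrete, here is a minimal sketch of a Dyna-Q style step: one real update followed by several simulated updates drawn from a learned deterministic model. The model representation and the number of planning updates are illustrative assumptions, not the exact implementation used in the report.

```python
import random
import numpy as np

def dyna_q_step(Q, model, s, a, r, s_next, n_planning=5, alpha=0.1, gamma=0.99):
    """One real Q-learning update followed by n_planning simulated updates.

    `model` maps (state, action) -> (reward, next_state), i.e. a simple
    deterministic model built from observed transitions.
    """
    # 1. Direct RL update from the real transition.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    # 2. Record the observed transition in the model.
    model[(s, a)] = (r, s_next)
    # 3. Planning: replay randomly sampled remembered transitions.
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps_next]) - Q[ps, pa])

# Example: feed two transitions through the agent and inspect the Q-table.
Q = np.zeros((4, 2))
model = {}
dyna_q_step(Q, model, s=0, a=1, r=0.0, s_next=1)
dyna_q_step(Q, model, s=1, a=1, r=1.0, s_next=2)
print(np.round(Q, 3))
```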