Q-Learning Basics
Learn how Q-Learning teaches agents to make optimal decisions through trial and error in grid-world environments
Introduction
Q-Learning is a fundamental reinforcement learning algorithm that teaches agents to make optimal decisions through trial and error. Unlike supervised learning where we provide correct answers, or unsupervised learning where we find patterns, reinforcement learning is about learning from rewards and punishments.
Imagine teaching a robot to navigate a maze. You don't tell it exactly which moves to make (supervised learning), and you don't just show it maze patterns (unsupervised learning). Instead, you give it rewards when it gets closer to the goal and penalties when it hits walls. The robot learns the best strategy through experience.
Q-Learning is elegant and powerful: it learns the "quality" (Q) of taking each action in each state, building a table of values that guide optimal decision-making.
What You'll Learn
By the end of this module, you will:
- Understand the fundamentals of reinforcement learning and agent-environment interaction
- Learn how Q-learning uses temporal difference learning to estimate action values
- Understand the exploration-exploitation tradeoff and epsilon-greedy policies
- Interpret Q-values and how they guide optimal decision-making
- Recognize when Q-learning converges and how to evaluate learned policies
- Visualize Q-value evolution and optimal paths in grid-world environments
Reinforcement Learning Fundamentals
Reinforcement learning (RL) is about learning to make sequential decisions to maximize cumulative reward.
Key Components
Agent: The learner or decision-maker (e.g., robot, game player, trading algorithm)
Environment: Everything the agent interacts with (e.g., maze, game board, stock market)
State (s): The current situation the agent is in
Action (a): A choice the agent can make
Reward (r): Immediate feedback from the environment (positive or negative)
Policy (π): The agent's strategy - a mapping from states to actions
The RL Loop
1. Agent observes current state s
2. Agent selects action a based on policy
3. Environment transitions to new state s'
4. Environment gives reward r
5. Agent updates its knowledge
6. Repeat
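In code, this loop is only a few lines. The sketch below assumes a Gym-style interface where `env.reset()` returns the initial state and `env.step(action)` returns the next state, the reward, and a done flag; `choose_action` stands in for whatever policy the agent is using, so treat the names as placeholders rather than a specific library's API.

```python
# A minimal sketch of the agent-environment loop (Gym-style interface assumed).
def run_episode(env, choose_action):
    state = env.reset()                              # 1. observe the initial state
    done = False
    total_reward = 0.0
    while not done:
        action = choose_action(state)                # 2. pick an action from the policy
        next_state, reward, done = env.step(action)  # 3-4. environment responds
        # 5. learning update would go here (e.g. the Q-learning rule below)
        total_reward += reward
        state = next_state                           # 6. repeat from the new state
    return total_reward
```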
Real-World Examples
- Game Playing: Chess, Go, video games
- Robotics: Navigation, manipulation, locomotion
- Autonomous Vehicles: Driving decisions
- Resource Management: Power grid optimization, inventory control
- Finance: Trading strategies, portfolio management
- Healthcare: Treatment planning, drug dosing
What is Q-Learning?
Q-Learning learns a Q-function Q(s, a) that estimates the expected cumulative reward of taking action a in state s and following the optimal policy thereafter.
Think of Q-values as "quality scores" for state-action pairs:
- High Q-value → Good action in this state
- Low Q-value → Poor action in this state
The agent's goal is to learn Q-values that lead to maximum total reward.
The Q-Table
Q-Learning stores Q-values in a table:
|         | Action 0 | Action 1 | Action 2 | Action 3 |
|---------|----------|----------|----------|----------|
| State 0 | 2.5      | 1.8      | 3.2      | 0.9      |
| State 1 | 4.1      | 3.7      | 2.3      | 4.5      |
| State 2 | 1.2      | 5.8      | 4.3      | 2.1      |
| ...     | ...      | ...      | ...      | ...      |
Each cell Q(s, a) represents the expected value of taking action a in state s.
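In code, the Q-table is usually just a 2-D array indexed by (state, action). A minimal NumPy sketch, filled with the illustrative numbers from the table above:

```python
import numpy as np

# One row per state, one column per action.
n_states, n_actions = 3, 4
Q = np.zeros((n_states, n_actions))
Q[0] = [2.5, 1.8, 3.2, 0.9]
Q[1] = [4.1, 3.7, 2.3, 4.5]
Q[2] = [1.2, 5.8, 4.3, 2.1]

print(Q[2, 1])        # Q(state 2, action 1) = 5.8
print(Q[2].argmax())  # greedy action in state 2 -> action 1
```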
The Q-Learning Algorithm
Q-Learning updates Q-values using the Bellman equation and temporal difference learning.
The Update Rule
Q(s, a) ← Q(s, a) + α [ r + γ max Q(s', a') − Q(s, a) ]

The bracketed term, r + γ max Q(s', a') − Q(s, a), is the temporal difference (TD) error.
Where:
- s: Current state
- a: Action taken
- r: Reward received
- s': Next state
- α: Learning rate (0 to 1)
- γ: Discount factor (0 to 1)
- max Q(s', a'): The highest Q-value over all actions a' available in the next state
Breaking Down the Update
Current Estimate: Q(s, a)
- What we currently think this action is worth
Target: r + γ max Q(s', a')
- Immediate reward + discounted best future value
Error: Target - Current Estimate
- How wrong our current estimate is
Update: Current + α × Error
- Move current estimate toward target
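Put together, the update is only a few lines. The sketch below assumes `Q` is a NumPy array as in the earlier Q-table example, with terminal states handled by dropping the future term; the default `alpha` and `gamma` are illustrative.

```python
def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """One Q-learning update: move Q[s, a] toward r + gamma * max_a' Q[s', a']."""
    # Terminal states have no future value, so the target is just the reward.
    target = r if done else r + gamma * Q[s_next].max()
    td_error = target - Q[s, a]        # how wrong the current estimate is
    Q[s, a] += alpha * td_error        # move the estimate toward the target
    return td_error
```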
Learning Rate (α)
Controls how much new information overrides old:
- α = 0: Never learn (keep old Q-values)
- α = 1: Completely replace old values
- α = 0.1: Typical value, gradual learning
High α: Fast learning but unstable
Low α: Slow learning but stable
Discount Factor (γ)
Controls how much the agent values future rewards:
- γ = 0: Only care about immediate reward (myopic)
- γ = 1: Future rewards as important as immediate (far-sighted)
- γ = 0.9-0.99: Typical values
High γ: Agent plans ahead
Low γ: Agent focuses on short-term gains
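To see what γ does numerically, here is a tiny worked example with a made-up reward sequence (three -1 steps followed by a +100 goal):

```python
# The return is r_0 + gamma*r_1 + gamma^2*r_2 + ...
rewards = [-1, -1, -1, 100]

def discounted_return(rewards, gamma):
    return sum(r * gamma**t for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))   # -1.0   (only the first reward counts)
print(discounted_return(rewards, 0.9))   # ~70.19 (the goal still dominates)
print(discounted_return(rewards, 1.0))   # 97.0   (all rewards count equally)
```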
Exploration vs Exploitation
A fundamental challenge in RL: should the agent:
Exploit: Choose the best known action (maximize immediate reward)
Explore: Try other actions (discover potentially better options)
Too much exploitation → Agent gets stuck in suboptimal behavior
Too much exploration → Agent wastes time on bad actions
Epsilon-Greedy Policy
A simple solution to balance exploration and exploitation:
With probability ε:
Choose random action (explore)
With probability 1-ε:
Choose action with highest Q-value (exploit)
ε = 0.1: 10% exploration, 90% exploitation (typical)
ε = 1.0: Pure exploration (random)
ε = 0.0: Pure exploitation (greedy)
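An epsilon-greedy action selector is a few lines of NumPy. The shared `rng` and the assumption that `Q` is a (states × actions) array carry over from the Q-table sketch above.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniform random action
    return int(Q[state].argmax())              # exploit: highest Q-value
```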
Epsilon Decay
Start with high exploration, gradually reduce it:
ε = ε × decay_rate
ε = max(ε, minimum_epsilon)
This allows:
- Early training: Explore to discover good strategies
- Late training: Exploit learned knowledge
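A typical implementation multiplies ε by a decay rate after each episode and clips it at a floor; the 0.995 and 0.01 below are illustrative values.

```python
# Multiplicative epsilon decay with a floor, applied once per episode.
epsilon, decay_rate, min_epsilon = 1.0, 0.995, 0.01

for episode in range(2000):
    # ... run one training episode with the current epsilon ...
    epsilon = max(epsilon * decay_rate, min_epsilon)

print(epsilon)   # has bottomed out at 0.01 (0.995**2000 is far below the floor)
```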
Grid-World Example
Let's understand Q-Learning through a simple grid-world:
S . . . G
. X . X .
. . . . .
. X X X .
. . . . .
- S: Start position
- G: Goal (reward = +100)
- X: Obstacles (can't enter)
- .: Empty cells (reward = -1 per step)
Actions: Up, Right, Down, Left
Initial Q-Table
All Q-values start at 0:
Q(any state, any action) = 0
Episode 1
Agent explores randomly, eventually reaches goal:
- Path: S → Right → Right → Down → Up → Right → Right → G (a random detour down and back before reaching the goal)
- Total reward: -5 (five -1 steps) + 100 (goal) = 95
Q-values update along the path, learning that actions leading to the goal are valuable.
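To make "values update along the path" concrete, here is the arithmetic for the first meaningful update, using illustrative hyperparameters α = 0.1 and γ = 0.9:

```python
# After episode 1, only the move that entered the goal gets a meaningful
# update, because every Q-value for the next states is still 0.
alpha, gamma = 0.1, 0.9
q_old = 0.0                    # Q(s, a) for the final, goal-reaching move
target = 100 + gamma * 0.0     # r + gamma * max Q(s', a'); the goal is terminal
q_new = q_old + alpha * (target - q_old)
print(q_new)                   # 10.0 -- value begins to propagate backward
```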
Episode 100
After many episodes:
- Q-values converge to optimal values
- Agent learns shortest path
- High Q-values point toward goal
- Low Q-values point toward obstacles or away from goal
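To tie the pieces together, here is a compact, self-contained sketch of tabular Q-learning on this grid. The layout and rewards follow the example above; the state indexing, episode count, and hyperparameters are illustrative choices, not a definitive implementation.

```python
import numpy as np

GRID = ["S...G",
        ".X.X.",
        ".....",
        ".XXX.",
        "....."]
N_ROWS, N_COLS = 5, 5
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # Up, Right, Down, Left
START, GOAL = (0, 0), (0, 4)

def step(pos, action):
    """Apply an action; moves into walls or obstacles leave the agent in place."""
    r, c = pos[0] + MOVES[action][0], pos[1] + MOVES[action][1]
    if not (0 <= r < N_ROWS and 0 <= c < N_COLS) or GRID[r][c] == "X":
        r, c = pos
    reward = 100 if (r, c) == GOAL else -1
    return (r, c), reward, (r, c) == GOAL

def state_id(pos):
    return pos[0] * N_COLS + pos[1]          # flatten (row, col) into 0..24

rng = np.random.default_rng(0)
Q = np.zeros((N_ROWS * N_COLS, 4))
alpha, gamma, epsilon = 0.1, 0.9, 1.0

for episode in range(500):
    pos, done = START, False
    while not done:
        s = state_id(pos)
        a = int(rng.integers(4)) if rng.random() < epsilon else int(Q[s].argmax())
        pos, reward, done = step(pos, a)
        target = reward if done else reward + gamma * Q[state_id(pos)].max()
        Q[s, a] += alpha * (target - Q[s, a])
    epsilon = max(epsilon * 0.99, 0.05)      # decay exploration over episodes

# After training: follow the greedy policy from the start.
pos, path = START, [START]
for _ in range(20):
    pos, _, done = step(pos, int(Q[state_id(pos)].argmax()))
    path.append(pos)
    if done:
        break
print(path)   # should trace the 4-step route along the top row to the goal
```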
Convergence and Optimality
When Does Q-Learning Converge?
Q-Learning is guaranteed to converge to optimal Q-values if:
- All state-action pairs are visited infinitely often
  - Ensured by sufficient exploration (ε > 0)
- The learning rate decreases appropriately
  - Formally, the α values must sum to infinity while their squares sum to a finite value; in practice, a small fixed α usually works well even though it weakens the formal guarantee
- Rewards are bounded
  - Prevents Q-values from diverging
Signs of Convergence
- Q-values stop changing significantly
- Policy (action choices) stabilizes
- Reward per episode plateaus
- Agent consistently finds optimal path
Evaluating the Learned Policy
Success Rate: Percentage of episodes reaching the goal
Average Reward: Mean total reward over recent episodes
Path Length: Number of steps in learned optimal path
Q-Value Stability: How much Q-values change between episodes
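A simple evaluation routine runs the greedy policy with exploration switched off and reports these metrics. The sketch below reuses the `Q`, `step`, and `state_id` names from the grid-world sketch above, so treat it as a continuation of that example.

```python
import numpy as np

def evaluate_policy(Q, step, state_id, start, n_episodes=100, max_steps=200):
    """Run the greedy policy (no exploration) and report simple metrics.
    For a deterministic environment and greedy policy one rollout is enough;
    averaging matters when either is stochastic."""
    rewards, lengths, successes = [], [], 0
    for _ in range(n_episodes):
        pos, total, done = start, 0.0, False
        for t in range(max_steps):
            pos, r, done = step(pos, int(Q[state_id(pos)].argmax()))
            total += r
            if done:
                successes += 1
                break
        rewards.append(total)
        lengths.append(t + 1)
    return {"success_rate": successes / n_episodes,
            "avg_reward": float(np.mean(rewards)),
            "avg_path_length": float(np.mean(lengths))}
```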
Advantages of Q-Learning
Model-Free
Doesn't need to know environment dynamics:
- No need for transition probabilities
- No need for reward function model
- Learns directly from experience
Off-Policy
Learns optimal policy while following exploratory policy:
- Can learn from random actions
- Can learn from demonstrations
- Separates learning from behavior
Simple and Effective
- Easy to implement
- Works well for many problems
- Foundation for advanced methods
Limitations and Challenges
Scalability
Q-table grows with state and action space:
- 1000 states × 4 actions = 4000 entries (manageable)
- 1,000,000 states × 10 actions = 10,000,000 entries (problematic)
Solution: Deep Q-Networks (DQN) use neural networks instead of tables
Continuous Spaces
Q-Learning requires discrete states and actions:
- Can't directly handle continuous positions, velocities, etc.
Solution: Discretize spaces or use function approximation
Slow Convergence
May need many episodes to learn:
- Especially in large environments
- Especially with sparse rewards
Solutions: Better exploration, reward shaping, transfer learning
Credit Assignment
Hard to know which actions led to reward:
- Reward comes at end of episode
- Which earlier actions were responsible?
Solution: Eligibility traces (Q(λ))
Tips for Better Q-Learning
1. Tune Hyperparameters
Experiment with:
- Learning rate α (try 0.01 to 0.5)
- Discount factor γ (try 0.9 to 0.99)
- Exploration rate ε (try 0.1 to 0.3)
2. Use Epsilon Decay
Start with high exploration (ε = 1.0), decay to low (ε = 0.01):
ε = ε × 0.995 after each episode
3. Initialize Q-Values Optimistically
Start with high Q-values (e.g., 10) instead of 0:
- Encourages exploration of all actions
- Agent discovers which actions are actually bad
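In code this is a one-line change to the initialization (the value 10 is the illustrative figure from above):

```python
import numpy as np

# Optimistic initialization: start Q-values higher than most actions will turn
# out to be worth, so untried actions keep looking attractive until tried.
n_states, n_actions = 25, 4
Q = np.full((n_states, n_actions), 10.0)   # instead of np.zeros((n_states, n_actions))
```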
4. Reward Shaping
Add intermediate rewards to guide learning:
- Small reward for moving toward goal
- Penalty for moving away
- Be careful not to change optimal policy!
5. Visualize Learning
Plot:
- Reward per episode
- Steps per episode
- Q-value heatmaps
- Learned paths
6. Run Multiple Seeds
Q-Learning can be sensitive to randomness:
- Run with different random seeds
- Average results across runs
- Report mean and standard deviation
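A small harness for this might look like the sketch below, where `train(seed)` is a placeholder for a training routine such as the grid-world example earlier, assumed to return a final-performance score.

```python
import numpy as np

def run_seeds(train, seeds=(0, 1, 2, 3, 4)):
    """Train once per seed and summarize the final scores."""
    scores = [train(seed) for seed in seeds]
    return float(np.mean(scores)), float(np.std(scores))

# mean, std = run_seeds(train)
# print(f"final reward: {mean:.1f} +/- {std:.1f}")
```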
Extensions and Variants
SARSA (State-Action-Reward-State-Action)
On-policy version of Q-Learning:
- Updates based on action actually taken
- More conservative than Q-Learning
- Better for risky environments
Double Q-Learning
Reduces overestimation of Q-values:
- Maintains two Q-tables
- Uses one to select action, other to evaluate
- More accurate value estimates
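A sketch of the update, assuming two NumPy Q-tables `QA` and `QB`; which table selects and which evaluates is chosen at random on every step.

```python
import numpy as np

rng = np.random.default_rng(0)

def double_q_update(QA, QB, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """Double Q-learning: one table picks the best next action, the other scores it."""
    if rng.random() < 0.5:
        QA, QB = QB, QA                   # update the other table half the time
    best = int(QA[s_next].argmax())       # QA selects the action...
    target = r if done else r + gamma * QB[s_next, best]   # ...QB evaluates it
    QA[s, a] += alpha * (target - QA[s, a])
```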
Q(λ) with Eligibility Traces
Faster credit assignment:
- Tracks which states contributed to reward
- Updates multiple states per step
- Bridges between TD and Monte Carlo
Deep Q-Networks (DQN)
Uses neural networks instead of tables:
- Handles large state spaces
- Can process images directly
- Enabled breakthroughs in Atari games
Real-World Applications
Game Playing
- Atari games (DQN)
- Board games (AlphaGo uses policy gradients, but Q-Learning principles apply)
- Real-time strategy games
Robotics
- Robot navigation
- Manipulation tasks
- Quadcopter control
- Humanoid walking
Resource Management
- Traffic light control
- Power grid optimization
- Data center cooling
- Inventory management
Finance
- Algorithmic trading
- Portfolio optimization
- Option pricing
- Risk management
Healthcare
- Treatment planning
- Drug dosing
- Clinical trial design
- Hospital resource allocation
Summary
Q-Learning is a foundational reinforcement learning algorithm that:
- Learns optimal action values (Q-values) through experience
- Uses temporal difference learning to update estimates
- Balances exploration and exploitation with epsilon-greedy
- Converges to optimal policy under appropriate conditions
- Works without knowing environment dynamics (model-free)
- Serves as foundation for advanced RL methods
Understanding Q-Learning provides:
- Foundation for deep reinforcement learning
- Intuition for value-based methods
- Framework for sequential decision-making
- Basis for understanding policy gradients and actor-critic methods
Next Steps
After mastering Q-Learning, explore:
- Deep Q-Networks (DQN): Scale to large state spaces
- Policy Gradients: Learn policies directly
- Actor-Critic Methods: Combine value and policy learning
- Multi-Armed Bandits: Simpler RL problem
- Markov Decision Processes: Theoretical foundations
- Advanced Exploration: UCB, Thompson sampling, curiosity-driven