Q-Learning Basics

Learn how Q-Learning teaches agents to make optimal decisions through trial and error in grid-world environments

Intermediate · 40 min

Introduction

Q-Learning is a fundamental reinforcement learning algorithm that teaches agents to make optimal decisions through trial and error. Unlike supervised learning where we provide correct answers, or unsupervised learning where we find patterns, reinforcement learning is about learning from rewards and punishments.

Imagine teaching a robot to navigate a maze. You don't tell it exactly which moves to make (supervised learning), and you don't just show it maze patterns (unsupervised learning). Instead, you give it rewards when it gets closer to the goal and penalties when it hits walls. The robot learns the best strategy through experience.

Q-Learning is elegant and powerful: it learns the "quality" (Q) of taking each action in each state, building a table of values that guide optimal decision-making.

What You'll Learn

By the end of this module, you will:

  • Understand the fundamentals of reinforcement learning and agent-environment interaction
  • Learn how Q-learning uses temporal difference learning to estimate action values
  • Understand the exploration-exploitation tradeoff and epsilon-greedy policies
  • Interpret Q-values and how they guide optimal decision-making
  • Recognize when Q-learning converges and how to evaluate learned policies
  • Visualize Q-value evolution and optimal paths in grid-world environments

Reinforcement Learning Fundamentals

Reinforcement learning (RL) is about learning to make sequential decisions to maximize cumulative reward.

Key Components

Agent: The learner or decision-maker (e.g., robot, game player, trading algorithm)

Environment: Everything the agent interacts with (e.g., maze, game board, stock market)

State (s): The current situation the agent is in

Action (a): A choice the agent can make

Reward (r): Immediate feedback from the environment (positive or negative)

Policy (π): The agent's strategy - a mapping from states to actions

The RL Loop

1. Agent observes current state s
2. Agent selects action a based on policy
3. Environment transitions to new state s'
4. Environment gives reward r
5. Agent updates its knowledge
6. Repeat
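
In code, this loop is only a few lines. The sketch below is a minimal version; env is assumed to follow a Gym-style reset/step interface, and choose_action and update stand in for the policy and the learning rule introduced later in this module.

def run_episode(env, choose_action, update):
    """One pass through the RL loop (env assumed to offer reset() and step(action))."""
    state = env.reset()                                  # 1. observe the current state
    done, total_reward = False, 0.0
    while not done:
        action = choose_action(state)                    # 2. select an action from the policy
        next_state, reward, done = env.step(action)      # 3-4. environment transitions and rewards
        update(state, action, reward, next_state, done)  # 5. agent updates its knowledge
        state = next_state                               # 6. repeat from the new state
        total_reward += reward
    return total_reward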

Real-World Examples

  • Game Playing: Chess, Go, video games
  • Robotics: Navigation, manipulation, locomotion
  • Autonomous Vehicles: Driving decisions
  • Resource Management: Power grid optimization, inventory control
  • Finance: Trading strategies, portfolio management
  • Healthcare: Treatment planning, drug dosing

What is Q-Learning?

Q-Learning learns a Q-function Q(s, a) that estimates the expected cumulative reward of taking action a in state s and following the optimal policy thereafter.

Think of Q-values as "quality scores" for state-action pairs:

  • High Q-value → Good action in this state
  • Low Q-value → Poor action in this state

The agent's goal is to learn Q-values that lead to maximum total reward.

The Q-Table

Q-Learning stores Q-values in a table:

         Action 0  Action 1  Action 2  Action 3
State 0    2.5       1.8       3.2       0.9
State 1    4.1       3.7       2.3       4.5
State 2    1.2       5.8       4.3       2.1
...

Each cell Q(s, a) represents the expected value of taking action a in state s.
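
In practice the Q-table is just a 2-D array indexed by state and action. A minimal sketch (the state and action counts here are illustrative):

import numpy as np

n_states, n_actions = 25, 4          # e.g., a 5x5 grid with Up/Right/Down/Left
Q = np.zeros((n_states, n_actions))  # Q[s, a] = current estimate for action a in state s

value = Q[7, 2]                      # value of action 2 in state 7
best_action = int(np.argmax(Q[7]))   # greedy action for state 7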

The Q-Learning Algorithm

Q-Learning updates Q-values using the Bellman equation and temporal difference learning.

The Update Rule

Q(s, a) ← Q(s, a) + α[r + γ max Q(s', a') - Q(s, a)]
                      └─────────────────────────┘
                         Temporal Difference Error

Where:

  • s: Current state
  • a: Action taken
  • r: Reward received
  • s': Next state
  • α: Learning rate (0 to 1)
  • γ: Discount factor (0 to 1)
  • max Q(s', a'): The best Q-value available in the next state (maximum over actions a')

Breaking Down the Update

Current Estimate: Q(s, a)

  • What we currently think this action is worth

Target: r + γ max Q(s', a')

  • Immediate reward + discounted best future value

Error: Target - Current Estimate

  • How wrong our current estimate is

Update: Current + α × Error

  • Move current estimate toward target
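
The whole update is one line of arithmetic. A sketch, assuming the NumPy Q-table from the earlier snippet; the done flag is an extra detail that handles terminal states, where there is no future value to discount:

def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One Q-learning update: move Q[s, a] toward the TD target."""
    best_next = 0.0 if done else np.max(Q[s_next])   # no future value from a terminal state
    target = r + gamma * best_next                   # immediate reward + discounted best future value
    td_error = target - Q[s, a]                      # how wrong the current estimate is
    Q[s, a] += alpha * td_error                      # move the estimate toward the target
    return td_error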

Learning Rate (α)

Controls how much new information overrides old:

  • α = 0: Never learn (keep old Q-values)
  • α = 1: Completely replace old values
  • α = 0.1: Typical value, gradual learning

  • High α: Fast learning but unstable
  • Low α: Slow learning but stable

Discount Factor (γ)

Controls how much the agent values future rewards:

  • γ = 0: Only care about immediate reward (myopic)
  • γ = 1: Future rewards as important as immediate (far-sighted)
  • γ = 0.9-0.99: Typical values

  • High γ: Agent plans ahead
  • Low γ: Agent focuses on short-term gains

Exploration vs Exploitation

A fundamental challenge in RL is deciding whether the agent should:

  • Exploit: Choose the best known action (maximize immediate reward)
  • Explore: Try other actions (discover potentially better options)

Too much exploitation → Agent gets stuck in suboptimal behavior
Too much exploration → Agent wastes time on bad actions

Epsilon-Greedy Policy

A simple solution to balance exploration and exploitation:

With probability ε:
    Choose random action (explore)
With probability 1-ε:
    Choose action with highest Q-value (exploit)

  • ε = 0.1: 10% exploration, 90% exploitation (typical)
  • ε = 1.0: Pure exploration (random)
  • ε = 0.0: Pure exploitation (greedy)
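
A direct translation into code, assuming the NumPy Q-table from earlier:

import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, state, epsilon):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniform over all actions
    return int(np.argmax(Q[state]))            # exploit: best known action in this state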

Epsilon Decay

Start with high exploration, gradually reduce it:

ε = ε × decay_rate
ε = max(ε, minimum_epsilon)

This allows:

  • Early training: Explore to discover good strategies
  • Late training: Exploit learned knowledge
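
A sketch of the schedule (the decay rate 0.995 and floor 0.01 are illustrative values): starting from ε = 1.0, this gives roughly 0.61 after 100 episodes and roughly 0.08 after 500, then sits at the floor.

epsilon, decay_rate, min_epsilon = 1.0, 0.995, 0.01
for episode in range(1000):
    # ... run one episode using the current epsilon ...
    epsilon = max(epsilon * decay_rate, min_epsilon)   # decay after each episode, never below the floor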

Grid-World Example

Let's understand Q-Learning through a simple grid-world:

S . . . G
. X . X .
. . . . .
. X X X .
. . . . .

  • S: Start position
  • G: Goal (reward = +100)
  • X: Obstacles (can't enter)
  • .: Empty cells (reward = -1 per step)

Actions: Up, Right, Down, Left

Initial Q-Table

All Q-values start at 0:

Q(any state, any action) = 0

Episode 1

Agent explores randomly, eventually reaches goal:

  • Path: S → Right → Right → Down → Down → Right → Right → Up → Up → G (a roundabout route that avoids the obstacles)
  • Total reward: -8 (steps) + 100 (goal) = 92

Q-values update along the path, learning that actions leading to the goal are valuable.

Episode 100

After many episodes:

  • Q-values converge to optimal values
  • Agent learns shortest path
  • High Q-values point toward goal
  • Low Q-values point toward obstacles or away from goal
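
Putting the pieces together, here is a compact end-to-end sketch of tabular Q-learning on the 5x5 grid above. The layout, rewards, and actions follow the text; the hyperparameters, random seed, and step cap are illustrative choices rather than prescribed values.

import numpy as np

# The 5x5 grid from above: S at (0, 0), G at (0, 4), X marks obstacles.
GRID = ["S...G",
        ".X.X.",
        ".....",
        ".XXX.",
        "....."]
N = 5
ACTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]          # Up, Right, Down, Left
GOAL = (0, 4)

def step(pos, a):
    """Move if the target cell is on the grid and not an obstacle; otherwise stay put."""
    r, c = pos[0] + ACTIONS[a][0], pos[1] + ACTIONS[a][1]
    if 0 <= r < N and 0 <= c < N and GRID[r][c] != "X":
        pos = (r, c)
    reward = -1.0                                     # every step costs -1
    done = pos == GOAL
    if done:
        reward += 100.0                               # +100 for reaching the goal
    return pos, reward, done

def to_state(pos):
    return pos[0] * N + pos[1]                        # flatten (row, col) to an index 0..24

rng = np.random.default_rng(0)
Q = np.zeros((N * N, 4))
alpha, gamma, epsilon = 0.1, 0.95, 1.0

for episode in range(500):
    pos, done, steps = (0, 0), False, 0
    while not done and steps < 100:                   # step cap is a safeguard against wandering forever
        s = to_state(pos)
        a = int(rng.integers(4)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        pos, reward, done = step(pos, a)
        target = reward if done else reward + gamma * np.max(Q[to_state(pos)])
        Q[s, a] += alpha * (target - Q[s, a])         # temporal difference update
        steps += 1
    epsilon = max(epsilon * 0.99, 0.05)               # decay exploration over episodes

# Follow the greedy policy after training; it should head straight along the top row.
pos, path = (0, 0), [(0, 0)]
while pos != GOAL and len(path) < 20:
    pos, _, _ = step(pos, int(np.argmax(Q[to_state(pos)])))
    path.append(pos)
print(path)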

Convergence and Optimality

When Does Q-Learning Converge?

Q-Learning is guaranteed to converge to optimal Q-values if:

  1. All state-action pairs are visited infinitely often
    • Ensured by sufficient exploration (ε > 0)
  2. Learning rate decreases appropriately
    • Can use fixed small α or decreasing schedule
  3. Rewards are bounded
    • Prevents infinite Q-values

Signs of Convergence

  • Q-values stop changing significantly
  • Policy (action choices) stabilizes
  • Reward per episode plateaus
  • Agent consistently finds optimal path

Evaluating the Learned Policy

Success Rate: Percentage of episodes reaching the goal

Average Reward: Mean total reward over recent episodes

Path Length: Number of steps in learned optimal path

Q-Value Stability: How much Q-values change between episodes
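
The first three metrics are straightforward to compute from greedy rollouts. A sketch, reusing step, to_state, and the trained Q from the grid-world code above; with a deterministic environment and a greedy policy one rollout suffices, but averaging matters once either is stochastic.

def evaluate(Q, episodes=100, max_steps=100):
    """Success rate, average reward, and average path length under the greedy policy."""
    successes, rewards, lengths = 0, [], []
    for _ in range(episodes):
        pos, total, steps, done = (0, 0), 0.0, 0, False
        while not done and steps < max_steps:
            pos, r, done = step(pos, int(np.argmax(Q[to_state(pos)])))
            total += r
            steps += 1
        successes += int(done)           # episode counts as a success only if the goal was reached
        rewards.append(total)
        lengths.append(steps)
    return successes / episodes, float(np.mean(rewards)), float(np.mean(lengths))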

Advantages of Q-Learning

Model-Free

Doesn't need to know environment dynamics:

  • No need for transition probabilities
  • No need for reward function model
  • Learns directly from experience

Off-Policy

Learns optimal policy while following exploratory policy:

  • Can learn from random actions
  • Can learn from demonstrations
  • Separates learning from behavior

Simple and Effective

  • Easy to implement
  • Works well for many problems
  • Foundation for advanced methods

Limitations and Challenges

Scalability

Q-table grows with state and action space:

  • 1000 states × 4 actions = 4000 entries (manageable)
  • 1,000,000 states × 10 actions = 10,000,000 entries (problematic)

Solution: Deep Q-Networks (DQN) use neural networks instead of tables

Continuous Spaces

Q-Learning requires discrete states and actions:

  • Can't directly handle continuous positions, velocities, etc.

Solution: Discretize spaces or use function approximation

Slow Convergence

May need many episodes to learn:

  • Especially in large environments
  • Especially with sparse rewards

Solutions: Better exploration, reward shaping, transfer learning

Credit Assignment

Hard to know which actions led to reward:

  • Reward comes at end of episode
  • Which earlier actions were responsible?

Solution: Eligibility traces (Q(λ))

Tips for Better Q-Learning

1. Tune Hyperparameters

Experiment with:

  • Learning rate α (try 0.01 to 0.5)
  • Discount factor γ (try 0.9 to 0.99)
  • Exploration rate ε (try 0.1 to 0.3)

2. Use Epsilon Decay

Start with high exploration (ε = 1.0), decay to low (ε = 0.01):

ε = ε × 0.995 after each episode

3. Initialize Q-Values Optimistically

Start with high Q-values (e.g., 10) instead of 0:

  • Encourages exploration of all actions
  • Agent discovers which actions are actually bad
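
With a NumPy table this is a one-line change; the value 10.0 follows the example above and is just an illustrative optimistic guess.

n_states, n_actions = 25, 4
Q = np.full((n_states, n_actions), 10.0)   # optimistic start: untried actions look good until proven otherwise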

4. Reward Shaping

Add intermediate rewards to guide learning:

  • Small reward for moving toward goal
  • Penalty for moving away
  • Be careful not to change optimal policy!

5. Visualize Learning

Plot:

  • Reward per episode
  • Steps per episode
  • Q-value heatmaps
  • Learned paths
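
A minimal matplotlib sketch covering the first and third of these, assuming you collected a list called episode_rewards during training and have the 5x5 grid-world Q-table from the earlier code:

import matplotlib.pyplot as plt
import numpy as np

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Learning curve: episode_rewards is assumed to be recorded during training.
ax1.plot(episode_rewards)
ax1.set_xlabel("Episode")
ax1.set_ylabel("Total reward")
ax1.set_title("Reward per episode")

# Heatmap of the best available Q-value in each cell of the 5x5 grid.
im = ax2.imshow(Q.max(axis=1).reshape(5, 5), cmap="viridis")
ax2.set_title("max Q(s, a) per cell")
fig.colorbar(im, ax=ax2)

plt.show()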

6. Run Multiple Seeds

Q-Learning can be sensitive to randomness:

  • Run with different random seeds
  • Average results across runs
  • Report mean and standard deviation

Extensions and Variants

SARSA (State-Action-Reward-State-Action)

On-policy version of Q-Learning:

  • Updates based on action actually taken
  • More conservative than Q-Learning
  • Better for risky environments
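
The only difference from Q-learning is the target: SARSA uses the Q-value of the action it will actually take next (a'), not the maximum. A sketch in the same style as q_update above:

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """On-policy update: the target uses the action actually chosen in the next state."""
    next_value = 0.0 if done else Q[s_next, a_next]   # no max here; follow the behavior policy
    target = r + gamma * next_value
    Q[s, a] += alpha * (target - Q[s, a])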

Double Q-Learning

Reduces overestimation of Q-values:

  • Maintains two Q-tables
  • Uses one to select action, other to evaluate
  • More accurate value estimates
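
A sketch of the update: flip a coin to decide which table to update; the table being updated selects the argmax action, and the other table evaluates it.

def double_q_update(QA, QB, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Update one randomly chosen table using the other table's value estimate."""
    if np.random.random() < 0.5:
        QA, QB = QB, QA                                # update QB half of the time
    best = int(np.argmax(QA[s_next]))                  # action selected by the table being updated
    value = 0.0 if done else QB[s_next, best]          # value estimated by the other table
    QA[s, a] += alpha * (r + gamma * value - QA[s, a])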

Q(λ) with Eligibility Traces

Faster credit assignment:

  • Tracks which states contributed to reward
  • Updates multiple states per step
  • Bridges between TD and Monte Carlo

Deep Q-Networks (DQN)

Uses neural networks instead of tables:

  • Handles large state spaces
  • Can process images directly
  • Enabled breakthroughs in Atari games

Real-World Applications

Game Playing

  • Atari games (DQN)
  • Board games (AlphaGo combines policy and value networks with tree search rather than a Q-table, but the same value-learning ideas apply)
  • Real-time strategy games

Robotics

  • Robot navigation
  • Manipulation tasks
  • Quadcopter control
  • Humanoid walking

Resource Management

  • Traffic light control
  • Power grid optimization
  • Data center cooling
  • Inventory management

Finance

  • Algorithmic trading
  • Portfolio optimization
  • Option pricing
  • Risk management

Healthcare

  • Treatment planning
  • Drug dosing
  • Clinical trial design
  • Hospital resource allocation

Summary

Q-Learning is a foundational reinforcement learning algorithm that:

  • Learns optimal action values (Q-values) through experience
  • Uses temporal difference learning to update estimates
  • Balances exploration and exploitation with epsilon-greedy
  • Converges to optimal policy under appropriate conditions
  • Works without knowing environment dynamics (model-free)
  • Serves as foundation for advanced RL methods

Understanding Q-Learning provides:

  • Foundation for deep reinforcement learning
  • Intuition for value-based methods
  • Framework for sequential decision-making
  • Basis for understanding policy gradients and actor-critic methods

Next Steps

After mastering Q-Learning, explore:

  • Deep Q-Networks (DQN): Scale to large state spaces
  • Policy Gradients: Learn policies directly
  • Actor-Critic Methods: Combine value and policy learning
  • Multi-Armed Bandits: Simpler RL problem
  • Markov Decision Processes: Theoretical foundations
  • Advanced Exploration: UCB, Thompson sampling, curiosity-driven
