Text Generation & Language Models

Learn how machines generate text using recurrent neural networks

Advanced · 45 min

Introduction

Text generation is one of the most fascinating applications of machine learning - teaching computers to write like humans. From autocomplete suggestions to creative writing assistants, language models have revolutionized how we interact with text.

In this module, you'll learn how machines can generate coherent text by learning patterns from training data. You'll explore character-level language models, understand how context influences predictions, and see how temperature controls creativity in generation.

What is Text Generation?

Text generation is the task of automatically producing human-like text based on learned patterns from a training corpus. Language models learn the statistical structure of language and use it to predict what comes next.

Types of Text Generation

Character-Level Generation

Input: "The quick brow"
Output: "n fox jumps over..."

Predicts one character at a time based on previous characters.

Word-Level Generation

Input: "The quick brown"
Output: "fox jumps over the lazy dog"

Predicts one word at a time based on previous words.

Sentence-Level Generation

Input: "Machine learning is"
Output: "a subset of artificial intelligence that enables computers to learn from data."

Generates complete sentences or paragraphs.

Why is Text Generation Important?

Applications

  1. Autocomplete and Suggestions
    • Email composition (Gmail Smart Compose)
    • Search query suggestions
    • Code completion (GitHub Copilot)
    • Messaging apps
  2. Content Creation
    • Article writing assistance
    • Creative writing tools
    • Marketing copy generation
    • Social media posts
  3. Conversational AI
    • Chatbots and virtual assistants
    • Customer service automation
    • Language tutoring
    • Interactive storytelling
  4. Translation and Summarization
    • Machine translation
    • Text summarization
    • Paraphrasing
    • Style transfer
  5. Accessibility
    • Text-to-speech systems
    • Augmentative communication
    • Writing assistance for disabilities

How Language Models Work

The Core Idea

A language model learns the probability distribution over sequences of text:

P(next_char | previous_chars)

Given "The quick bro", what's the probability of:

  • "w" (high - completes "brown")
  • "t" (medium - could be "brother")
  • "x" (very low - unlikely)

Character-Level Models

Character-level models predict one character at a time:

Advantages:

  • Small vocabulary (letters, digits, and punctuation; typically under 100 symbols)
  • Can generate any word, even made-up ones
  • Handles misspellings naturally
  • Works for any language

Disadvantages:

  • Longer sequences to model
  • May generate nonsense words
  • Slower generation
  • Harder to capture long-range dependencies

Word-Level Models

Word-level models predict one word at a time:

Advantages:

  • Shorter sequences
  • Always generates real words
  • Faster generation
  • Better long-range dependencies

Disadvantages:

  • Large vocabulary (tens of thousands of words)
  • Cannot generate new words
  • Struggles with rare words
  • Requires more memory
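
The trade-off is easy to see by tokenizing the same sentence both ways; the snippet below is a minimal comparison.

sentence = "the quick brown fox jumps over the lazy dog"

# Character-level: small symbol set, long sequence
char_tokens = list(sentence)
print(len(char_tokens), len(set(char_tokens)))   # 43 tokens drawn from 27 symbols (26 letters + space)

# Word-level: short sequence, but every distinct word enlarges the vocabulary
word_tokens = sentence.split()
print(len(word_tokens), len(set(word_tokens)))   # 9 tokens drawn from 8 unique words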

Recurrent Neural Networks (RNNs)

RNNs are designed for sequential data like text. They maintain a "hidden state" that captures information from previous inputs.

Architecture

Input → [RNN Cell] → Output
           ↓
      Hidden State
           ↓
      (feeds back)

At each step:

  1. Take current input (character/word)
  2. Combine with previous hidden state
  3. Produce output and new hidden state
  4. Repeat for next input

Mathematical Formulation

h_t = tanh(W_xh × x_t + W_hh × h_{t-1} + b_h)
y_t = W_hy × h_t + b_y

Where:

  • x_t: Current input
  • h_t: Hidden state at time t
  • y_t: Output at time t
  • W: Weight matrices
  • b: Bias vectors
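
These two equations translate directly into a few lines of NumPy. The sketch below runs a single RNN step with randomly initialized weights; the vocabulary and hidden sizes are illustrative.

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h); y_t = W_hy h_t + b_y
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

vocab_size, hidden_size = 50, 64           # illustrative sizes
rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.01, (hidden_size, vocab_size))
W_hh = rng.normal(0, 0.01, (hidden_size, hidden_size))
W_hy = rng.normal(0, 0.01, (vocab_size, hidden_size))
b_h, b_y = np.zeros(hidden_size), np.zeros(vocab_size)

x_t = np.zeros(vocab_size); x_t[3] = 1.0   # one-hot encoding of the current character
h_prev = np.zeros(hidden_size)             # initial hidden state
h_t, y_t = rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y)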

Why RNNs for Text?

  1. Variable Length: Handle sequences of any length
  2. Memory: Hidden state remembers previous context
  3. Parameter Sharing: Same weights for all positions
  4. Sequential Processing: Natural for text

Training a Language Model

Data Preparation

  1. Collect Training Text
    "To be or not to be, that is the question."
    
  2. Build Vocabulary
    Characters: {'T', 'o', ' ', 'b', 'e', 'r', 'n', 't', ',', '.', ...}
    Vocabulary size: ~50 characters for a typical English corpus
    
  3. Create Training Sequences
    Input:  "To be or"
    Target: "o be or "
    
    Input:  "o be or "
    Target: " be or n"
    

Training Process

  1. Forward Pass
    • Feed input sequence through RNN
    • Get predictions for each position
    • Compute probabilities with softmax
  2. Loss Calculation
    • Compare predictions to actual next characters
    • Use cross-entropy loss
    • Measure how wrong predictions are
  3. Backward Pass
    • Compute gradients
    • Update weights to reduce loss
    • Use backpropagation through time (BPTT)
  4. Repeat
    • Process all training sequences
    • Multiple epochs until convergence

Cross-Entropy Loss

Measures how well predictions match actual distribution:

Loss = -Σ(actual_i × log(predicted_i))

Lower loss = better predictions
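
For a single prediction, the loss is simply the negative log-probability the model assigned to the correct next character. A minimal sketch with an illustrative distribution:

import numpy as np

def cross_entropy(predicted_probs, target_index):
    # Negative log-probability assigned to the correct next character
    return -np.log(predicted_probs[target_index])

# Suppose the model predicts this distribution over {'w', 't', 'x'} after "The quick bro"
probs = np.array([0.85, 0.10, 0.05])
print(cross_entropy(probs, 0))   # correct char is 'w': low loss (~0.16)
print(cross_entropy(probs, 2))   # correct char is 'x': high loss (~3.0)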

Text Generation Process

Step-by-Step Generation

  1. Start with Seed Text
    Seed: "The quick"
    
  2. Encode Seed
    • Convert characters to one-hot vectors
    • Feed through RNN to get hidden state
  3. Predict Next Character
    • Use current hidden state
    • Compute probability distribution
    • Sample next character
  4. Update Context
    • Append generated character
    • Update hidden state
    • Repeat from step 3
  5. Stop Condition
    • Reach maximum length
    • Generate end-of-sequence token
    • Natural stopping point (period, newline)
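
The loop below sketches this procedure. The predict_probs argument stands in for a trained model; here a toy, fixed distribution is used so the example runs on its own.

import random

def generate(seed, predict_probs, max_len=50, stop_chars=".\n"):
    # Repeatedly sample the next character and append it to the running text
    text = seed
    while len(text) < max_len:
        probs = predict_probs(text)                            # {char: probability}
        chars, weights = zip(*probs.items())
        next_char = random.choices(chars, weights=weights)[0]  # sample rather than argmax
        text += next_char
        if next_char in stop_chars:
            break
    return text

# Toy stand-in for a trained model: a fixed distribution over a few characters
dummy_model = lambda context: {" ": 0.2, "t": 0.2, "o": 0.2, "b": 0.2, "e": 0.1, ".": 0.1}
print(generate("To be", dummy_model, max_len=30))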

Example Generation

Seed: "To be"

Step 1: Predict after "To be"
  Probabilities: {' ': 0.4, ',': 0.2, 'o': 0.15, ...}
  Sample: ' ' (space)
  Generated: "To be "

Step 2: Predict after "To be "
  Probabilities: {'o': 0.3, 'a': 0.2, 't': 0.15, ...}
  Sample: 'o'
  Generated: "To be o"

Step 3: Predict after "To be o"
  Probabilities: {'r': 0.5, 'n': 0.2, 'f': 0.1, ...}
  Sample: 'r'
  Generated: "To be or"

... continue until done

Temperature Sampling

Temperature controls the randomness of generation by adjusting the probability distribution.

How It Works

adjusted_prob_i = prob_i^(1/temperature)

Then renormalize to sum to 1.
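
A minimal sketch of this adjustment on an illustrative four-character distribution:

import numpy as np

def apply_temperature(probs, temperature):
    # Sharpen (T < 1) or flatten (T > 1) the distribution, then renormalize
    adjusted = np.power(probs, 1.0 / temperature)
    return adjusted / adjusted.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(apply_temperature(probs, 0.3))   # low T: mass concentrates on the top choice
print(apply_temperature(probs, 1.0))   # T = 1: unchanged
print(apply_temperature(probs, 1.8))   # high T: distribution flattens out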

Temperature Effects

Low Temperature (0.2 - 0.5)

  • More conservative
  • Picks high-probability characters
  • More predictable and coherent
  • Less creative
Temperature = 0.3
"To be or not to be, that is the question."
(follows training data closely)

Medium Temperature (0.7 - 1.0)

  • Balanced approach
  • Some randomness
  • Good coherence with variety
  • Recommended for most uses
Temperature = 0.8
"To be or not to be, that is the matter of great importance."
(reasonable variation)

High Temperature (1.5 - 2.0)

  • Very creative
  • More random choices
  • May generate nonsense
  • Explores unusual combinations
Temperature = 1.8
"To be or nox to qe, zhat is the quibble."
(creative but less coherent)

Choosing Temperature

  • Autocomplete: Low (0.3-0.5) - predictable
  • Creative Writing: Medium-High (0.8-1.2) - interesting
  • Exploration: High (1.5+) - discover patterns
  • Formal Text: Low (0.2-0.4) - conservative

Challenges in Text Generation

1. Long-Range Dependencies

Problem: RNNs struggle to remember information from far back

"The cat, which was sitting on the mat that was placed near the window, [was/were] sleeping."

Need to remember "cat" (singular) to use "was"

Solutions:

  • LSTM (Long Short-Term Memory)
  • GRU (Gated Recurrent Unit)
  • Transformer models (attention mechanism)

2. Vanishing Gradients

Problem: Gradients become very small during backpropagation through time

Effect: Model can't learn long-term patterns

Solutions:

  • LSTM/GRU architectures
  • Gradient clipping
  • Better initialization

3. Exposure Bias

Problem: Training uses real data, but generation uses model's own predictions

Effect: Errors compound during generation

Solutions:

  • Scheduled sampling
  • Reinforcement learning
  • Teacher forcing variations

4. Repetition

Problem: Models may generate repetitive text

"The cat sat on the mat. The cat sat on the mat. The cat sat..."

Solutions:

  • Repetition penalties
  • Diverse beam search
  • Nucleus sampling

5. Coherence

Problem: Generated text may be grammatical but nonsensical

"The colorless green ideas sleep furiously."

Solutions:

  • Larger models
  • More training data
  • Better architectures (Transformers)
  • Fine-tuning on specific domains

Interactive Demo

Use the controls to train and generate text:

  1. Choose Training Corpus: Select text style (Shakespeare, poetry, code)
  2. Set Sequence Length: How much context to use (longer = more context)
  3. Configure Hidden Size: Model capacity (larger = more complex patterns)
  4. Adjust Temperature: Control creativity (low = safe, high = creative)
  5. Set Max Length: How much text to generate
  6. Training Parameters: Learning rate and epochs

Observe:

  • Generated Text: See character-by-character generation
  • Step Details: Probability distribution for each character
  • Top Predictions: Most likely next characters at each step
  • Training Loss: Monitor model learning
  • Animation: Watch generation unfold in real-time

Advanced Techniques

LSTM and GRU

Long Short-Term Memory (LSTM)

  • Special gates control information flow
  • Cell state carries long-term memory
  • Better at long-range dependencies

Gated Recurrent Unit (GRU)

  • Simplified version of LSTM
  • Fewer parameters
  • Often similar performance

Attention Mechanisms

Allow model to focus on relevant parts of input:

"The cat, which was very fluffy, [sat] on the mat."

Attention helps model focus on "cat" when predicting "sat"

Transformer Models

Modern architecture that replaced RNNs:

  • Self-attention mechanism
  • Parallel processing (faster training)
  • Better long-range dependencies
  • State-of-the-art results

Examples: GPT-2, GPT-3, GPT-4, BERT, T5

Transfer Learning

Use pre-trained models:

  1. Pre-train on massive corpus (billions of words)
  2. Fine-tune on specific task
  3. Achieve better results with less data
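
As a quick illustration, a pre-trained model can generate text with no training of your own. This sketch assumes the Hugging Face Transformers library is installed and the gpt2 checkpoint can be downloaded.

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("To be or not to be", max_new_tokens=20, do_sample=True, temperature=0.8)
print(result[0]["generated_text"])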

Sampling Strategies

Greedy Sampling: Always pick highest probability

  • Deterministic
  • May be repetitive

Top-k Sampling: Sample from top k most likely

  • More diverse
  • Controlled randomness

Nucleus (Top-p) Sampling: Sample from smallest set with cumulative probability p

  • Adaptive
  • Balances quality and diversity

Beam Search: Keep track of multiple hypotheses

  • Better for translation
  • More coherent output
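
Top-k and nucleus sampling are straightforward to sketch with NumPy; the five-token distribution below is illustrative.

import numpy as np

def top_k_sample(probs, k, rng=np.random.default_rng()):
    # Keep only the k most likely tokens, renormalize, then sample
    top = np.argsort(probs)[-k:]
    p = probs[top] / probs[top].sum()
    return rng.choice(top, p=p)

def nucleus_sample(probs, p_threshold, rng=np.random.default_rng()):
    # Keep the smallest set of tokens whose cumulative probability reaches p_threshold
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p_threshold) + 1
    kept = order[:cutoff]
    p = probs[kept] / probs[kept].sum()
    return rng.choice(kept, p=p)

probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])
print(top_k_sample(probs, k=2))                  # index of a token sampled from the top 2
print(nucleus_sample(probs, p_threshold=0.8))    # index sampled from the smallest set covering 80%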

Use Cases

Code Generation

GitHub Copilot

  • Suggests code completions
  • Generates functions from comments
  • Learns from billions of lines of code
# Function to calculate fibonacci
def fibonacci(n):
    # Copilot might suggest a completion like this:
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

Creative Writing

AI Writing Assistants

  • Story generation
  • Poetry creation
  • Dialogue writing
  • Character development

Email and Messaging

Smart Compose

  • Gmail suggests completions
  • Saves time typing
  • Learns your style

Chatbots

Conversational AI

  • Customer service
  • Virtual assistants
  • Language learning
  • Therapy bots

Content Creation

Marketing and SEO

  • Product descriptions
  • Blog posts
  • Social media content
  • Ad copy

Best Practices

1. Quality Training Data

  • Use clean, well-formatted text
  • Ensure data matches target domain
  • Remove noise and errors
  • Consider data diversity

2. Appropriate Model Size

  • Start small, scale up if needed
  • Balance capacity and overfitting
  • Consider computational constraints
  • Monitor training/validation loss

3. Hyperparameter Tuning

  • Experiment with sequence length
  • Try different hidden sizes
  • Adjust learning rate
  • Test various temperatures

4. Evaluation

  • Human evaluation (coherence, quality)
  • Perplexity (how surprised model is)
  • BLEU score (for translation)
  • Diversity metrics

5. Generation Strategy

  • Choose appropriate temperature
  • Use sampling strategies
  • Implement repetition penalties
  • Set reasonable length limits

Common Pitfalls

Overfitting

Problem: Model memorizes training data
Solution: More data, regularization, early stopping

Underfitting

Problem: Model too simple for task
Solution: Larger model, more training, better features

Poor Seed Text

Problem: Generation starts poorly
Solution: Use representative seeds, multiple attempts

Inappropriate Temperature

Problem: Too conservative or too random
Solution: Experiment with different values

Evaluation Metrics

Perplexity

Measures how well model predicts test data:

Perplexity = exp(average_loss)

Lower perplexity = better model
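
As a worked example, an average cross-entropy loss of 1.5 on held-out text corresponds to a perplexity of about 4.5, as if the model were choosing among roughly 4.5 equally likely characters at each step (the loss value here is illustrative).

import numpy as np

average_loss = 1.5             # illustrative average loss on a validation set
print(np.exp(average_loss))    # ~4.48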

BLEU Score

For translation and generation:

  • Compares generated text to references
  • Measures n-gram overlap
  • 0 to 1 (higher is better)

Human Evaluation

Most important for generation:

  • Fluency: Is it grammatical?
  • Coherence: Does it make sense?
  • Relevance: Is it on-topic?
  • Creativity: Is it interesting?

Further Reading

Research Papers

  • "Generating Sequences With Recurrent Neural Networks" - Graves (2013)
  • "Attention Is All You Need" - Vaswani et al. (2017)
  • "Language Models are Few-Shot Learners" - Brown et al. (2020) GPT-3
  • "BERT: Pre-training of Deep Bidirectional Transformers" - Devlin et al. (2018)

Books

  • "Deep Learning" by Goodfellow, Bengio & Courville
  • "Speech and Language Processing" by Jurafsky & Martin
  • "Natural Language Processing with Transformers" by Tunstall et al.

Tools and Libraries

  • PyTorch/TensorFlow: Deep learning frameworks
  • Hugging Face Transformers: Pre-trained models
  • OpenAI API: GPT models
  • spaCy: NLP pipeline
  • NLTK: Text processing

Summary

Text generation with language models is a powerful technique:

  • Character-Level Models: Generate text one character at a time
  • RNNs: Process sequences with hidden state memory
  • Training: Learn patterns from corpus using cross-entropy loss
  • Temperature: Control randomness and creativity
  • Step-by-Step: Generate by repeatedly predicting next character

Modern language models like GPT have revolutionized NLP, but understanding the fundamentals of character-level RNNs provides crucial insights into how these systems work. Start with simple models, experiment with parameters, and observe how context and temperature affect generation quality!
