
Lesson 5: Text Generation & Streaming UIs

Topics Covered
  • The API Call: Anatomy of a chat completion request.
  • Provider APIs: OpenAI, Anthropic Claude, and Google Gemini side-by-side.
  • Streaming: Server-Sent Events (SSE) for real-time token delivery.
  • Frontend Integration: Building responsive chat UIs that stream.
  • Backend Patterns: FastAPI endpoints that proxy and stream LLM responses.
  • Provider Abstraction: Writing code that doesn't care which LLM it's calling.

You've mastered tokens, prompts, system messages, and generation parameters.

Now it's time to actually call an API and build something real. In this lesson, we'll create a streaming chat interface—the foundation of every AI-powered application.

1. The Anatomy of an API Call

Every LLM API follows the same basic pattern:

The request contains:

  1. Messages: The conversation history (system, user, assistant roles)
  2. Model: Which model to use (gpt-4o, claude-sonnet-4-20250514, gemini-1.5-pro)
  3. Parameters: Temperature, max_tokens, etc. (from Lesson 4)

The response contains:

  1. Content: The generated text
  2. Usage: Token counts (for billing)
  3. Finish reason: Why generation stopped (length, stop, tool_calls)
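Concretely, that anatomy maps onto plain data. Here is a rough sketch using OpenAI-style field names (other providers differ slightly; see the comparison in section 3):

# Sketch of the request/response anatomy as plain Python data.
# Field names follow the OpenAI API; other providers vary slightly.
request = {
    "model": "gpt-4o-mini",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the capital of France?"},
    ],
    "temperature": 0.7,
    "max_tokens": 1000,
}

response = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 25, "completion_tokens": 3, "total_tokens": 28},
}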

2. Your First API Call: OpenAI

Let's start with OpenAI—the most common starting point.

Setup

mkdir streaming-chat
cd streaming-chat
uv init
uv add openai python-dotenv
touch chat_openai.py

Create a .env file:

OPENAI_API_KEY=sk-your-key-here

Basic Chat Completion

chat_openai.py
"""
Basic OpenAI Chat Completion
============================
Your first API call. No streaming yet—just request/response.
"""

import os
import json
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = OpenAI() # Automatically uses OPENAI_API_KEY from environment

def chat(user_message: str, system_prompt: str = "You are a helpful assistant.") -> str:
"""
Send a message and get a response.

This is the simplest possible API call.
"""
response = client.chat.completions.create(
model="gpt-4o-mini", # Fast and cheap for testing
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message}
],
temperature=0.7,
max_tokens=1000,
)

# Extract the response
return response.choices[0].message.content


def chat_with_response_object(
user_message: str,
system_prompt: str = "You are a helpful assistant."
) -> dict:
"""
Send a message and return the full response object.

Useful for understanding the complete API response structure.
"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message}
],
temperature=0.7,
max_tokens=1000,
)

# Pydantic models have model_dump() to serialize to dict
return response.model_dump()


def chat_with_history(
messages: list[dict],
system_prompt: str = "You are a helpful assistant."
) -> tuple[str, list[dict]]:
"""
Chat with conversation history.

Returns the response AND the updated message history.
"""
# Prepend system prompt if not already there
if not messages or messages[0].get("role") != "system":
messages = [{"role": "system", "content": system_prompt}] + messages

response = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
temperature=0.7,
max_tokens=1000,
)

assistant_message = response.choices[0].message.content

# Add assistant response to history
messages.append({"role": "assistant", "content": assistant_message})

return assistant_message, messages


if __name__ == "__main__":
# Test basic chat
print("=== Basic Chat ===")
response = chat("What's the capital of France?")
print(f"Response: {response}")

# Test with full response object
print("\n=== Full Response Object ===")
full_response = chat_with_response_object("What's the capital of France?")
print(json.dumps(full_response, indent=2))

# Test with history
print("\n=== Chat with History ===")
history = []

# Turn 1
history.append({"role": "user", "content": "My name is Alex."})
response, history = chat_with_history(history)
print(f"Assistant: {response}")

# Turn 2
history.append({"role": "user", "content": "What's my name?"})
response, history = chat_with_history(history)
print(f"Assistant: {response}")

print(f"\nTotal messages in history: {len(history)}")

Run It

uv run chat_openai.py

Understanding the Response Object

# The full response object looks like this:
{
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "created": 1699000000,
    "model": "gpt-4o-mini",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The capital of France is Paris."
            },
            "finish_reason": "stop"  # or "length", "tool_calls"
        }
    ],
    "usage": {
        "prompt_tokens": 25,
        "completion_tokens": 8,
        "total_tokens": 33
    }
}

Key fields:

  • choices[0].message.content — The actual response text
  • choices[0].finish_reason — Why it stopped ("stop" = natural end, "length" = hit max_tokens)
  • usage — Token counts for billing
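
In the Python SDK these fields are attributes on the response object, so with the client from chat_openai.py you can read them directly (a quick sketch):

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)

print(response.choices[0].message.content)  # The generated text
print(response.choices[0].finish_reason)    # "stop", "length", or "tool_calls"
print(response.usage.total_tokens)          # prompt_tokens + completion_tokens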

3. Provider Differences: OpenAI vs Claude

The core concept is identical, but APIs differ slightly:

| Aspect | OpenAI | Claude |
| --- | --- | --- |
| Client | OpenAI() | Anthropic() |
| Method | client.chat.completions.create() | client.messages.create() |
| System prompt | In messages array | Separate system parameter |
| Response content | response.choices[0].message.content | response.content[0].text |
| Streaming | stream=True | client.messages.stream() context manager |

# Claude equivalent (key differences only)
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    system="You are a helpful assistant.",  # Separate parameter!
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.content[0].text)  # Different response structure

4. Streaming: The Real-Time Experience

Non-streaming responses make users wait 2-10 seconds staring at a loading spinner. Streaming delivers tokens as they're generated, creating a "typing" effect that feels instant.

How Streaming Works

Server-Sent Events (SSE): A standard protocol for servers to push data to clients over HTTP. Each chunk arrives as a text line starting with the prefix "data: ".
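
On the wire, an SSE stream is plain text. With the JSON payload format used by the server built later in this lesson, the events look roughly like this:

event: message
data: {"content": "The", "done": false}

event: message
data: {"content": " capital", "done": false}

event: message
data: {"content": "", "done": true}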

OpenAI Streaming

chat_openai_streaming.py
"""
OpenAI Streaming Chat
=====================
Watch tokens arrive in real-time.
"""

import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI()

def chat_streaming(
    user_message: str,
    system_prompt: str = "You are a helpful assistant."
):
    """
    Stream a response token by token.

    Yields chunks as they arrive.
    """
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ],
        temperature=0.7,
        max_tokens=1000,
        stream=True,  # This is the magic flag!
    )

    for chunk in stream:
        # Each chunk has a delta with partial content
        if chunk.choices[0].delta.content is not None:
            yield chunk.choices[0].delta.content


def chat_streaming_full(
    user_message: str,
    system_prompt: str = "You are a helpful assistant."
) -> str:
    """
    Stream but also return the complete response.

    Useful when you need to both stream to UI AND save the full response.
    """
    full_response = ""

    for chunk in chat_streaming(user_message, system_prompt):
        print(chunk, end="", flush=True)  # Print without newline
        full_response += chunk

    print()  # Final newline
    return full_response


if __name__ == "__main__":
    print("=== Streaming Response ===")
    print("Assistant: ", end="")

    response = chat_streaming_full(
        "Write a haiku about programming."
    )

    print(f"\n[Full response saved: {len(response)} characters]")

Claude Streaming

Claude uses a context manager: with client.messages.stream(...) as stream: and then iterates stream.text_stream. The concept is identical.
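
A minimal sketch of that pattern, assuming the anthropic package is installed and ANTHROPIC_API_KEY is set in your environment:

from anthropic import Anthropic

client = Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
) as stream:
    for text in stream.text_stream:  # Yields text deltas as they arrive
        print(text, end="", flush=True)
print()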

5. Building a Streaming Backend (FastAPI)

For production, you'll need a backend that:

  1. Receives requests from your frontend
  2. Calls the LLM API
  3. Streams responses back to the client

Setup

uv add fastapi uvicorn sse-starlette
touch server.py

Streaming API Server

server.py
"""
Streaming Chat API Server
=========================
A FastAPI backend that streams LLM responses to clients.
"""

import os
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from sse_starlette.sse import EventSourceResponse
from openai import OpenAI
from dotenv import load_dotenv
import json
import asyncio

load_dotenv()

app = FastAPI(title="Streaming Chat API")

# Enable CORS for frontend
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Configure properly in production!
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

client = OpenAI()


class ChatRequest(BaseModel):
    """Request body for chat endpoint."""
    messages: list[dict]
    model: str = "gpt-4o-mini"
    temperature: float = 0.7
    max_tokens: int = 1000
    system_prompt: str = "You are a helpful assistant."


class ChatResponse(BaseModel):
    """Non-streaming response."""
    content: str
    finish_reason: str
    usage: dict


# ─────────────────────────────────────────────────────────────────────────────
# Non-Streaming Endpoint
# ─────────────────────────────────────────────────────────────────────────────

@app.post("/api/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """
    Non-streaming chat endpoint.

    Use this when you don't need real-time streaming.
    """
    try:
        # Prepend system message
        messages = [{"role": "system", "content": request.system_prompt}]
        messages.extend(request.messages)

        response = client.chat.completions.create(
            model=request.model,
            messages=messages,
            temperature=request.temperature,
            max_tokens=request.max_tokens,
        )

        return ChatResponse(
            content=response.choices[0].message.content,
            finish_reason=response.choices[0].finish_reason,
            usage={
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
                "total_tokens": response.usage.total_tokens,
            }
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


# ─────────────────────────────────────────────────────────────────────────────
# Streaming Endpoint
# ─────────────────────────────────────────────────────────────────────────────

@app.post("/api/chat/stream")
async def chat_stream(request: ChatRequest):
    """
    Streaming chat endpoint using Server-Sent Events.

    The frontend connects and receives chunks as they arrive.
    """

    async def generate():
        try:
            # Prepend system message
            messages = [{"role": "system", "content": request.system_prompt}]
            messages.extend(request.messages)

            stream = client.chat.completions.create(
                model=request.model,
                messages=messages,
                temperature=request.temperature,
                max_tokens=request.max_tokens,
                stream=True,
            )

            for chunk in stream:
                if chunk.choices[0].delta.content is not None:
                    # Send each chunk as an SSE event
                    data = json.dumps({
                        "content": chunk.choices[0].delta.content,
                        "done": False
                    })
                    yield {"event": "message", "data": data}

                # Small delay to prevent overwhelming the client
                await asyncio.sleep(0.01)

            # Signal completion
            yield {"event": "message", "data": json.dumps({"content": "", "done": True})}

        except Exception as e:
            yield {"event": "error", "data": json.dumps({"error": str(e)})}

    return EventSourceResponse(generate())


# ─────────────────────────────────────────────────────────────────────────────
# Health Check
# ─────────────────────────────────────────────────────────────────────────────

@app.get("/health")
async def health():
    return {"status": "ok"}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Run the Server

uv run server.py

Test with curl:

Non-streaming

curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

Streaming

curl -X POST http://localhost:8000/api/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a short poem."}]}'
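
You can also test the streaming endpoint from Python. Below is a minimal client sketch (a hypothetical stream_client.py) using httpx, which is not among the lesson's dependencies; add it with uv add httpx. It assumes the server above is running on localhost:8000 and parses the data: lines that sse-starlette emits.

stream_client.py
"""
Minimal SSE test client for the streaming endpoint above.
"""

import json

import httpx

payload = {"messages": [{"role": "user", "content": "Write a short poem."}]}

with httpx.stream(
    "POST", "http://localhost:8000/api/chat/stream", json=payload, timeout=60
) as response:
    for line in response.iter_lines():
        # sse-starlette sends lines like "event: message" and "data: {...}"
        if not line.startswith("data:"):
            continue
        event = json.loads(line[len("data:"):].strip())
        if event.get("done"):
            break
        print(event.get("content", ""), end="", flush=True)

print()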

6. Provider Abstraction

For production, abstract away provider differences:

llm_client.py
from abc import ABC, abstractmethod
from typing import Generator


class LLMClient(ABC):
    @abstractmethod
    def chat(self, messages: list[dict]) -> str: pass

    @abstractmethod
    def chat_stream(self, messages: list[dict]) -> Generator[str, None, None]: pass


class OpenAIClient(LLMClient):
    def __init__(self, model: str = "gpt-4o-mini"):
        from openai import OpenAI
        self.client = OpenAI()
        self.model = model

    def chat(self, messages: list[dict]) -> str:
        response = self.client.chat.completions.create(
            model=self.model, messages=messages
        )
        return response.choices[0].message.content

    def chat_stream(self, messages: list[dict]) -> Generator[str, None, None]:
        stream = self.client.chat.completions.create(
            model=self.model, messages=messages, stream=True
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content


def get_client(provider: str = "openai") -> LLMClient:
    if provider == "openai":
        return OpenAIClient()
    elif provider == "anthropic":
        return AnthropicClient()  # Similar implementation
    raise ValueError(f"Unknown provider: {provider}")
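
The factory above references an AnthropicClient that isn't shown. Here is a sketch of what it could look like, added to the same llm_client.py; the _split_system helper is a made-up name, and it exists because Claude takes the system prompt as a separate parameter rather than as a message:

class AnthropicClient(LLMClient):
    def __init__(self, model: str = "claude-sonnet-4-20250514", max_tokens: int = 1000):
        from anthropic import Anthropic
        self.client = Anthropic()
        self.model = model
        self.max_tokens = max_tokens  # Required by the Claude API

    def _split_system(self, messages: list[dict]) -> tuple[str, list[dict]]:
        # Pull the system prompt out of the message list for Claude's API shape
        if messages and messages[0].get("role") == "system":
            return messages[0]["content"], messages[1:]
        return "You are a helpful assistant.", messages

    def chat(self, messages: list[dict]) -> str:
        system, rest = self._split_system(messages)
        response = self.client.messages.create(
            model=self.model, max_tokens=self.max_tokens,
            system=system, messages=rest,
        )
        return response.content[0].text

    def chat_stream(self, messages: list[dict]) -> Generator[str, None, None]:
        system, rest = self._split_system(messages)
        with self.client.messages.stream(
            model=self.model, max_tokens=self.max_tokens,
            system=system, messages=rest,
        ) as stream:
            yield from stream.text_stream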

7. Error Handling & Retries

Retry decorator for API calls
import time
from functools import wraps

def retry_with_backoff(max_retries=3, initial_delay=1.0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries:
                        raise
                    time.sleep(delay)
                    delay *= 2  # Exponential backoff
        return wrapper
    return decorator


@retry_with_backoff(max_retries=3)
def chat_with_retry(messages):
    return client.chat.completions.create(model="gpt-4o-mini", messages=messages)
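
Retrying every exception will also retry permanent failures like an invalid API key. A variation that retries only the transient error types raised by the OpenAI SDK (the rest of the decorator is unchanged):

import time
from functools import wraps

import openai

RETRYABLE = (openai.RateLimitError, openai.APITimeoutError, openai.APIConnectionError)

def retry_with_backoff(max_retries=3, initial_delay=1.0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except RETRYABLE:  # Only transient failures are worth retrying
                    if attempt == max_retries:
                        raise
                    time.sleep(delay)
                    delay *= 2  # Exponential backoff
        return wrapper
    return decorator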

8. Common Pitfalls

| Symptom | Cause | Fix |
| --- | --- | --- |
| No response, then everything at once | Not using stream=True | Enable streaming on both backend and frontend |
| CORS errors in browser | Backend not configured for cross-origin | Add CORS middleware with correct origins |
| Response cuts off mid-sentence | Hit max_tokens limit | Increase limit or check finish_reason |
| "Invalid API key" errors | Key not in environment | Check .env file and load_dotenv() call |
| Memory grows with long conversations | Sending full history every time | Implement conversation truncation or summarization |
| Rate limit errors | Too many requests | Implement retry with exponential backoff |
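
For the history-growth pitfall, the simplest fix is truncation. A sketch that keeps the system message plus the most recent turns (the default of 20 messages is an arbitrary choice; summarization is the more sophisticated alternative):

def truncate_history(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep the system message (if any) plus the most recent turns."""
    if messages and messages[0].get("role") == "system":
        system, rest = messages[:1], messages[1:]
    else:
        system, rest = [], messages
    return system + rest[-max_messages:]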

9. Key Takeaways

  1. Streaming transforms UX. Users perceive streaming responses as faster even when total time is the same.

  2. APIs are similar but not identical. OpenAI and Claude have different message formats—abstract early.

  3. SSE is the standard. Server-Sent Events are how you stream from backend to frontend.

  4. Frontend state is tricky. Append to the last message as chunks arrive; don't create new messages.

  5. Always implement retries. Rate limits and timeouts are normal—handle them gracefully.

  6. Provider abstraction pays off. Write to an interface, swap implementations easily.

10. What's Next

You've built your first real AI integration! In Lesson 6: Structured Data Extraction, we'll learn how to get reliable JSON output from LLMs—turning messy text into typed objects your code can actually use.

We'll cover:

  • JSON mode and response formats
  • Schema validation with Pydantic and Zod
  • Retry strategies for malformed output
  • Building a document parser that extracts structured entities

11. Additional Resources