Lesson 5: Text Generation & Streaming UIs
- The API Call: Anatomy of a chat completion request.
- Provider APIs: OpenAI, Anthropic Claude, and Google Gemini side-by-side.
- Streaming: Server-Sent Events (SSE) for real-time token delivery.
- Frontend Integration: Building responsive chat UIs that stream.
- Backend Patterns: FastAPI endpoints that proxy and stream LLM responses.
- Provider Abstraction: Writing code that doesn't care which LLM it's calling.
You've mastered tokens, prompts, system messages, and generation parameters.
Now it's time to actually call an API and build something real. In this lesson, we'll create a streaming chat interface—the foundation of every AI-powered application.
1. The Anatomy of an API Call
Every LLM API follows the same basic pattern:
The request contains:
- Messages: The conversation history (system, user, assistant roles)
- Model: Which model to use (gpt-4o, claude-sonnet-4-20250514, gemini-1.5-pro)
- Parameters: Temperature, max_tokens, etc. (from Lesson 4)
The response contains:
- Content: The generated text
- Usage: Token counts (for billing)
- Finish reason: Why generation stopped (length, stop, tool_calls)
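Concretely, an OpenAI-style request body looks roughly like this on the wire (other providers use a slightly different shape, as we'll see in section 3):
{
  "model": "gpt-4o-mini",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"}
  ],
  "temperature": 0.7,
  "max_tokens": 1000
}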
2. Your First API Call: OpenAI
Let's start with OpenAI—the most common starting point.
Setup
mkdir streaming-chat
cd streaming-chat
uv init
uv add openai python-dotenv
touch chat_openai.py
Create a .env file:
OPENAI_API_KEY=sk-your-key-here
Basic Chat Completion
"""
Basic OpenAI Chat Completion
============================
Your first API call. No streaming yet—just request/response.
"""
import os
import json
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
# Initialize the client
client = OpenAI() # Automatically uses OPENAI_API_KEY from environment
def chat(user_message: str, system_prompt: str = "You are a helpful assistant.") -> str:
"""
Send a message and get a response.
This is the simplest possible API call.
"""
response = client.chat.completions.create(
model="gpt-4o-mini", # Fast and cheap for testing
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message}
],
temperature=0.7,
max_tokens=1000,
)
# Extract the response
return response.choices[0].message.content
def chat_with_response_object(
user_message: str,
system_prompt: str = "You are a helpful assistant."
) -> dict:
"""
Send a message and return the full response object.
Useful for understanding the complete API response structure.
"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message}
],
temperature=0.7,
max_tokens=1000,
)
# Pydantic models have model_dump() to serialize to dict
return response.model_dump()
def chat_with_history(
messages: list[dict],
system_prompt: str = "You are a helpful assistant."
) -> tuple[str, list[dict]]:
"""
Chat with conversation history.
Returns the response AND the updated message history.
"""
# Prepend system prompt if not already there
if not messages or messages[0].get("role") != "system":
messages = [{"role": "system", "content": system_prompt}] + messages
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
temperature=0.7,
max_tokens=1000,
)
assistant_message = response.choices[0].message.content
# Add assistant response to history
messages.append({"role": "assistant", "content": assistant_message})
return assistant_message, messages
if __name__ == "__main__":
# Test basic chat
print("=== Basic Chat ===")
response = chat("What's the capital of France?")
print(f"Response: {response}")
# Test with full response object
print("\n=== Full Response Object ===")
full_response = chat_with_response_object("What's the capital of France?")
print(json.dumps(full_response, indent=2))
# Test with history
print("\n=== Chat with History ===")
history = []
# Turn 1
history.append({"role": "user", "content": "My name is Alex."})
response, history = chat_with_history(history)
print(f"Assistant: {response}")
# Turn 2
history.append({"role": "user", "content": "What's my name?"})
response, history = chat_with_history(history)
print(f"Assistant: {response}")
print(f"\nTotal messages in history: {len(history)}")
Run It
uv run chat_openai.py
Understanding the Response Object
# The full response object looks like this:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1699000000,
"model": "gpt-4o-mini",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop" # or "length", "tool_calls"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 8,
"total_tokens": 33
}
}
Key fields:
- choices[0].message.content — The actual response text
- choices[0].finish_reason — Why it stopped ("stop" = natural end, "length" = hit max_tokens)
- usage — Token counts for billing
3. Provider Differences: OpenAI vs Claude
The core concept is identical, but APIs differ slightly:
| Aspect | OpenAI | Claude |
|---|---|---|
| Client | OpenAI() | Anthropic() |
| Method | client.chat.completions.create() | client.messages.create() |
| System prompt | In messages array | Separate system parameter |
| Response content | response.choices[0].message.content | response.content[0].text |
| Streaming | stream=True | client.messages.stream() context manager |
# Claude equivalent (key differences only)
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1000,
system="You are a helpful assistant.", # Separate parameter!
messages=[{"role": "user", "content": "Hello"}],
)
print(response.content[0].text) # Different response structure
4. Streaming: The Real-Time Experience
Non-streaming responses make users wait 2-10 seconds staring at a loading spinner. Streaming delivers tokens as they're generated, creating a "typing" effect that feels instant.
How Streaming Works
Server-Sent Events (SSE): A standard protocol for servers to push data to clients over HTTP. Each chunk arrives as a text line prefixed with data:.
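A raw OpenAI stream looks roughly like this (chunk payloads trimmed for readability); the stream ends with a literal [DONE] sentinel:
data: {"choices":[{"delta":{"content":"The"}}]}

data: {"choices":[{"delta":{"content":" capital"}}]}

data: {"choices":[{"delta":{"content":" of France"}}]}

data: [DONE]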
OpenAI Streaming
"""
OpenAI Streaming Chat
=====================
Watch tokens arrive in real-time.
"""
import os
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
def chat_streaming(
user_message: str,
system_prompt: str = "You are a helpful assistant."
):
"""
Stream a response token by token.
Yields chunks as they arrive.
"""
stream = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message}
],
temperature=0.7,
max_tokens=1000,
stream=True, # This is the magic flag!
)
for chunk in stream:
# Each chunk has a delta with partial content
if chunk.choices[0].delta.content is not None:
yield chunk.choices[0].delta.content
def chat_streaming_full(
user_message: str,
system_prompt: str = "You are a helpful assistant."
) -> str:
"""
Stream but also return the complete response.
Useful when you need to both stream to UI AND save the full response.
"""
full_response = ""
for chunk in chat_streaming(user_message, system_prompt):
print(chunk, end="", flush=True) # Print without newline
full_response += chunk
print() # Final newline
return full_response
if __name__ == "__main__":
print("=== Streaming Response ===")
print("Assistant: ", end="")
response = chat_streaming_full(
"Write a haiku about programming."
)
print(f"\n[Full response saved: {len(response)} characters]")
Claude streams through a context manager rather than a stream=True flag: you open with client.messages.stream(...) as stream: and iterate stream.text_stream. The concept is identical.
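A sketch of the Claude streaming equivalent (assuming the anthropic package is installed and ANTHROPIC_API_KEY is set in your environment):
from anthropic import Anthropic

client = Anthropic()

def chat_streaming_claude(
    user_message: str,
    system_prompt: str = "You are a helpful assistant."
):
    """Stream a Claude response chunk by chunk."""
    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        system=system_prompt,  # system prompt is a separate parameter
        messages=[{"role": "user", "content": user_message}],
    ) as stream:
        for text in stream.text_stream:  # yields text deltas as they arrive
            yield text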
5. Building a Streaming Backend (FastAPI)
For production, you'll need a backend that:
- Receives requests from your frontend
- Calls the LLM API
- Streams responses back to the client
Setup
uv add fastapi uvicorn sse-starlette
touch server.py
Streaming API Server
"""
Streaming Chat API Server
=========================
A FastAPI backend that streams LLM responses to clients.
"""
import os
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from sse_starlette.sse import EventSourceResponse
from openai import OpenAI
from dotenv import load_dotenv
import json
import asyncio
load_dotenv()
app = FastAPI(title="Streaming Chat API")
# Enable CORS for frontend
app.add_middleware(
CORSMiddleware,
allow_origins=["*"], # Configure properly in production!
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
client = OpenAI()
class ChatRequest(BaseModel):
"""Request body for chat endpoint."""
messages: list[dict]
model: str = "gpt-4o-mini"
temperature: float = 0.7
max_tokens: int = 1000
system_prompt: str = "You are a helpful assistant."
class ChatResponse(BaseModel):
"""Non-streaming response."""
content: str
finish_reason: str
usage: dict
# ─────────────────────────────────────────────────────────────────────────────
# Non-Streaming Endpoint
# ─────────────────────────────────────────────────────────────────────────────
@app.post("/api/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
"""
Non-streaming chat endpoint.
Use this when you don't need real-time streaming.
"""
try:
# Prepend system message
messages = [{"role": "system", "content": request.system_prompt}]
messages.extend(request.messages)
response = client.chat.completions.create(
model=request.model,
messages=messages,
temperature=request.temperature,
max_tokens=request.max_tokens,
)
return ChatResponse(
content=response.choices[0].message.content,
finish_reason=response.choices[0].finish_reason,
usage={
"prompt_tokens": response.usage.prompt_tokens,
"completion_tokens": response.usage.completion_tokens,
"total_tokens": response.usage.total_tokens,
}
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
# ─────────────────────────────────────────────────────────────────────────────
# Streaming Endpoint
# ─────────────────────────────────────────────────────────────────────────────
@app.post("/api/chat/stream")
async def chat_stream(request: ChatRequest):
"""
Streaming chat endpoint using Server-Sent Events.
The frontend connects and receives chunks as they arrive.
"""
async def generate():
try:
# Prepend system message
messages = [{"role": "system", "content": request.system_prompt}]
messages.extend(request.messages)
stream = client.chat.completions.create(
model=request.model,
messages=messages,
temperature=request.temperature,
max_tokens=request.max_tokens,
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content is not None:
# Send each chunk as an SSE event
data = json.dumps({
"content": chunk.choices[0].delta.content,
"done": False
})
yield {"event": "message", "data": data}
# Small delay to prevent overwhelming the client
await asyncio.sleep(0.01)
# Signal completion
yield {"event": "message", "data": json.dumps({"content": "", "done": True})}
except Exception as e:
yield {"event": "error", "data": json.dumps({"error": str(e)})}
return EventSourceResponse(generate())
# ─────────────────────────────────────────────────────────────────────────────
# Health Check
# ─────────────────────────────────────────────────────────────────────────────
@app.get("/health")
async def health():
return {"status": "ok"}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
Run the Server
uv run server.py
Test with curl:
curl -X POST http://localhost:8000/api/chat \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Hello!"}]}'
curl -X POST http://localhost:8000/api/chat/stream \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Write a short poem."}]}'
6. Provider Abstraction
For production, abstract away provider differences:
from abc import ABC, abstractmethod
from typing import Generator
class LLMClient(ABC):
@abstractmethod
def chat(self, messages: list[dict]) -> str: pass
@abstractmethod
def chat_stream(self, messages: list[dict]) -> Generator[str, None, None]: pass
class OpenAIClient(LLMClient):
def __init__(self, model: str = "gpt-4o-mini"):
from openai import OpenAI
self.client = OpenAI()
self.model = model
def chat(self, messages: list[dict]) -> str:
response = self.client.chat.completions.create(
model=self.model, messages=messages
)
return response.choices[0].message.content
def chat_stream(self, messages: list[dict]) -> Generator[str, None, None]:
stream = self.client.chat.completions.create(
model=self.model, messages=messages, stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
yield chunk.choices[0].delta.content
def get_client(provider: str = "openai") -> LLMClient:
if provider == "openai":
return OpenAIClient()
elif provider == "anthropic":
return AnthropicClient() # Similar implementation
raise ValueError(f"Unknown provider: {provider}")
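The AnthropicClient referenced above could look like this. It's a sketch under one assumption: if a system message is present, it is the first entry in messages (Claude takes it as a separate parameter):
class AnthropicClient(LLMClient):
    def __init__(self, model: str = "claude-sonnet-4-20250514"):
        from anthropic import Anthropic
        self.client = Anthropic()
        self.model = model

    def _split_system(self, messages: list[dict]) -> tuple[str, list[dict]]:
        # Pull the system prompt out of the message list for Claude's API.
        if messages and messages[0].get("role") == "system":
            return messages[0]["content"], messages[1:]
        return "You are a helpful assistant.", messages

    def chat(self, messages: list[dict]) -> str:
        system, rest = self._split_system(messages)
        response = self.client.messages.create(
            model=self.model, max_tokens=1000, system=system, messages=rest
        )
        return response.content[0].text

    def chat_stream(self, messages: list[dict]) -> Generator[str, None, None]:
        system, rest = self._split_system(messages)
        with self.client.messages.stream(
            model=self.model, max_tokens=1000, system=system, messages=rest
        ) as stream:
            for text in stream.text_stream:
                yield text
With this in place, swapping providers is a one-line change: get_client("anthropic").chat_stream(messages) behaves exactly like the OpenAI version.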
7. Error Handling & Retries
LLM API calls fail routinely: rate limits, timeouts, and transient network errors are all normal. Wrap your calls in retry logic with exponential backoff:
import time
from functools import wraps
def retry_with_backoff(max_retries=3, initial_delay=1.0):
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
delay = initial_delay
for attempt in range(max_retries + 1):
try:
return func(*args, **kwargs)
except Exception as e:
if attempt == max_retries:
raise
time.sleep(delay)
delay *= 2 # Exponential backoff
return wrapper
return decorator
@retry_with_backoff(max_retries=3)
def chat_with_retry(messages):
return client.chat.completions.create(model="gpt-4o-mini", messages=messages)
8. Common Pitfalls
| Symptom | Cause | Fix |
|---|---|---|
| No response, then everything at once | Not using stream=True | Enable streaming on both backend and frontend |
| CORS errors in browser | Backend not configured for cross-origin | Add CORS middleware with correct origins |
| Response cuts off mid-sentence | Hit max_tokens limit | Increase limit or check finish_reason |
| "Invalid API key" errors | Key not in environment | Check .env file and load_dotenv() call |
| Memory grows with long conversations | Sending full history every time | Implement conversation truncation or summarization (see the sketch after this table) |
| Rate limit errors | Too many requests | Implement retry with exponential backoff |
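For the history-growth pitfall, the simplest fix is to truncate old turns before each request. A minimal sketch (the keep_last default is an arbitrary choice; production apps often summarize the dropped turns instead):
def truncate_history(messages: list[dict], keep_last: int = 10) -> list[dict]:
    """Keep the system message (if present) plus the most recent turns."""
    if messages and messages[0].get("role") == "system":
        system, rest = messages[:1], messages[1:]
    else:
        system, rest = [], messages
    return system + rest[-keep_last:]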
9. Key Takeaways
- Streaming transforms UX. Users perceive streaming responses as faster even when total time is the same.
- APIs are similar but not identical. OpenAI and Claude have different message formats—abstract early.
- SSE is the standard. Server-Sent Events are how you stream from backend to frontend.
- Frontend state is tricky. Append to the last message as chunks arrive; don't create new messages.
- Always implement retries. Rate limits and timeouts are normal—handle them gracefully.
- Provider abstraction pays off. Write to an interface, swap implementations easily.
10. What's Next
You've built your first real AI integration! In Lesson 6: Structured Data Extraction, we'll learn how to get reliable JSON output from LLMs—turning messy text into typed objects your code can actually use.
We'll cover:
- JSON mode and response formats
- Schema validation with Pydantic and Zod
- Retry strategies for malformed output
- Building a document parser that extracts structured entities
11. Additional Resources
- OpenAI API Reference — Official documentation
- Anthropic API Reference — Claude API docs
- Server-Sent Events (MDN) — SSE specification
- Vercel AI SDK — Production-ready React hooks for AI chat