
Lesson 5: Text Generation & Streaming UIs

Topics Covered
  • The API Call: Anatomy of a chat completion request.
  • Provider APIs: OpenAI, Anthropic Claude, and Google Gemini side-by-side.
  • Streaming: Server-Sent Events (SSE) for real-time token delivery.
  • Frontend Integration: Building responsive chat UIs that stream.
  • Backend Patterns: FastAPI endpoints that proxy and stream LLM responses.
  • Provider Abstraction: Writing code that doesn't care which LLM it's calling.

You've mastered tokens, prompts, system messages, and generation parameters.

Now it's time to actually call an API and build something real. In this lesson, we'll create a streaming chat interface—the foundation of every AI-powered application.

1. The Anatomy of an API Call

Every LLM API follows the same basic pattern:

The request contains:

  1. Messages: The conversation history (system, user, assistant roles)
  2. Model: Which model to use (gpt-4o, claude-sonnet-4-20250514, gemini-1.5-pro)
  3. Parameters: Temperature, max_tokens, etc. (from Lesson 4)

The response contains:

  1. Content: The generated text
  2. Usage: Token counts (for billing)
  3. Finish reason: Why generation stopped (length, stop, tool_calls)
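Concretely, that anatomy maps onto plain data. Here is a rough sketch using OpenAI-style field names (other providers differ slightly; see the comparison in section 3):

# Sketch of the request/response anatomy as plain Python data.
# Field names follow the OpenAI API; other providers vary slightly.
request = {
    "model": "gpt-4o-mini",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the capital of France?"},
    ],
    "temperature": 0.7,
    "max_tokens": 1000,
}

response = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 25, "completion_tokens": 3, "total_tokens": 28},
}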

2. Your First API Call: OpenAI

Let's start with OpenAI—the most common starting point.

Setup

mkdir streaming-chat
cd streaming-chat
uv init
uv add openai python-dotenv
touch chat_openai.py

Create a .env file:

OPENAI_API_KEY=sk-your-key-here

Basic Chat Completion

chat_openai.py
"""
Basic OpenAI Chat Completion
============================
Your first API call. No streaming yet—just request/response.
"""

import os
import json
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = OpenAI() # Automatically uses OPENAI_API_KEY from environment

def chat(user_message: str, system_prompt: str = "You are a helpful assistant.") -> str:
"""
Send a message and get a response.

This is the simplest possible API call.
"""
response = client.chat.completions.create(
model="gpt-4o-mini", # Fast and cheap for testing
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message}
],
temperature=0.7,
max_tokens=1000,
)

# Extract the response
return response.choices[0].message.content


def chat_with_response_object(
user_message: str,
system_prompt: str = "You are a helpful assistant."
) -> dict:
"""
Send a message and return the full response object.

Useful for understanding the complete API response structure.
"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message}
],
temperature=0.7,
max_tokens=1000,
)

# Pydantic models have model_dump() to serialize to dict
return response.model_dump()


def chat_with_history(
messages: list[dict],
system_prompt: str = "You are a helpful assistant."
) -> tuple[str, list[dict]]:
"""
Chat with conversation history.

Returns the response AND the updated message history.
"""
# Prepend system prompt if not already there
if not messages or messages[0].get("role") != "system":
messages = [{"role": "system", "content": system_prompt}] + messages

response = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
temperature=0.7,
max_tokens=1000,
)

assistant_message = response.choices[0].message.content

# Add assistant response to history
messages.append({"role": "assistant", "content": assistant_message})

return assistant_message, messages


if __name__ == "__main__":
# Test basic chat
print("=== Basic Chat ===")
response = chat("What's the capital of France?")
print(f"Response: {response}")

# Test with full response object
print("\n=== Full Response Object ===")
full_response = chat_with_response_object("What's the capital of France?")
print(json.dumps(full_response, indent=2))

# Test with history
print("\n=== Chat with History ===")
history = []

# Turn 1
history.append({"role": "user", "content": "My name is Alex."})
response, history = chat_with_history(history)
print(f"Assistant: {response}")

# Turn 2
history.append({"role": "user", "content": "What's my name?"})
response, history = chat_with_history(history)
print(f"Assistant: {response}")

print(f"\nTotal messages in history: {len(history)}")

Run It

uv run chat_openai.py

Understanding the Response Object

# The full response object looks like this:
{
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "created": 1699000000,
    "model": "gpt-4o-mini",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The capital of France is Paris."
            },
            "finish_reason": "stop"  # or "length", "tool_calls"
        }
    ],
    "usage": {
        "prompt_tokens": 25,
        "completion_tokens": 8,
        "total_tokens": 33
    }
}

Key fields:

  • choices[0].message.content — The actual response text
  • choices[0].finish_reason — Why it stopped ("stop" = natural end, "length" = hit max_tokens)
  • usage — Token counts for billing
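
In the Python SDK these fields are attributes on the response object, so with the client from chat_openai.py you can read them directly (a quick sketch):

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)

print(response.choices[0].message.content)  # The generated text
print(response.choices[0].finish_reason)    # "stop", "length", or "tool_calls"
print(response.usage.total_tokens)          # prompt_tokens + completion_tokens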

3. Provider Differences: OpenAI vs Claude

The core concept is identical, but APIs differ slightly:

| Aspect | OpenAI | Claude |
| --- | --- | --- |
| Client | OpenAI() | Anthropic() |
| Method | client.chat.completions.create() | client.messages.create() |
| System prompt | In messages array | Separate system parameter |
| Response content | response.choices[0].message.content | response.content[0].text |
| Streaming | stream=True | client.messages.stream() context manager |

# Claude equivalent (key differences only)
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    system="You are a helpful assistant.",  # Separate parameter!
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.content[0].text)  # Different response structure

4. Streaming: The Real-Time Experience

Non-streaming responses make users wait 2-10 seconds staring at a loading spinner. Streaming delivers tokens as they're generated, creating a "typing" effect that feels instant.

How Streaming Works

Server-Sent Events (SSE): A standard protocol for servers to push data to clients over HTTP. Each chunk arrives as a text line starting with the prefix "data: ".
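
On the wire, an SSE stream is plain text. With the JSON payload format used by the server built later in this lesson, the events look roughly like this:

event: message
data: {"content": "The", "done": false}

event: message
data: {"content": " capital", "done": false}

event: message
data: {"content": "", "done": true}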

OpenAI Streaming

chat_openai_streaming.py
"""
OpenAI Streaming Chat
=====================
Watch tokens arrive in real-time.
"""

import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI()

def chat_streaming(
    user_message: str,
    system_prompt: str = "You are a helpful assistant."
):
    """
    Stream a response token by token.

    Yields chunks as they arrive.
    """
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ],
        temperature=0.7,
        max_tokens=1000,
        stream=True,  # This is the magic flag!
    )

    for chunk in stream:
        # Each chunk has a delta with partial content
        if chunk.choices[0].delta.content is not None:
            yield chunk.choices[0].delta.content


def chat_streaming_full(
    user_message: str,
    system_prompt: str = "You are a helpful assistant."
) -> str:
    """
    Stream but also return the complete response.

    Useful when you need to both stream to UI AND save the full response.
    """
    full_response = ""

    for chunk in chat_streaming(user_message, system_prompt):
        print(chunk, end="", flush=True)  # Print without newline
        full_response += chunk

    print()  # Final newline
    return full_response


if __name__ == "__main__":
    print("=== Streaming Response ===")
    print("Assistant: ", end="")

    response = chat_streaming_full(
        "Write a haiku about programming."
    )

    print(f"\n[Full response saved: {len(response)} characters]")

Claude Streaming

Claude uses a context manager: with client.messages.stream(...) as stream: and then iterates stream.text_stream. The concept is identical.
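
A minimal sketch of that pattern, assuming the anthropic package is installed and ANTHROPIC_API_KEY is set in your environment:

from anthropic import Anthropic

client = Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
) as stream:
    for text in stream.text_stream:  # Yields text deltas as they arrive
        print(text, end="", flush=True)
print()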

5. Building a Streaming Backend (FastAPI)

For production, you'll need a backend that:

  1. Receives requests from your frontend
  2. Calls the LLM API
  3. Streams responses back to the client

Setup

uv add fastapi uvicorn sse-starlette
touch server.py

Streaming API Server

server.py
"""
Streaming Chat API Server
=========================
A FastAPI backend that streams LLM responses to clients.
"""

import os
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from sse_starlette.sse import EventSourceResponse
from openai import OpenAI
from dotenv import load_dotenv
import json
import asyncio

load_dotenv()

app = FastAPI(title="Streaming Chat API")

# Enable CORS for frontend
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Configure properly in production!
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

client = OpenAI()


class ChatRequest(BaseModel):
    """Request body for chat endpoint."""
    messages: list[dict]
    model: str = "gpt-4o-mini"
    temperature: float = 0.7
    max_tokens: int = 1000
    system_prompt: str = "You are a helpful assistant."


class ChatResponse(BaseModel):
    """Non-streaming response."""
    content: str
    finish_reason: str
    usage: dict


# ─────────────────────────────────────────────────────────────────────────────
# Non-Streaming Endpoint
# ─────────────────────────────────────────────────────────────────────────────

@app.post("/api/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """
    Non-streaming chat endpoint.

    Use this when you don't need real-time streaming.
    """
    try:
        # Prepend system message
        messages = [{"role": "system", "content": request.system_prompt}]
        messages.extend(request.messages)

        response = client.chat.completions.create(
            model=request.model,
            messages=messages,
            temperature=request.temperature,
            max_tokens=request.max_tokens,
        )

        return ChatResponse(
            content=response.choices[0].message.content,
            finish_reason=response.choices[0].finish_reason,
            usage={
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
                "total_tokens": response.usage.total_tokens,
            }
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


# ─────────────────────────────────────────────────────────────────────────────
# Streaming Endpoint
# ─────────────────────────────────────────────────────────────────────────────

@app.post("/api/chat/stream")
async def chat_stream(request: ChatRequest):
    """
    Streaming chat endpoint using Server-Sent Events.

    The frontend connects and receives chunks as they arrive.
    """

    async def generate():
        try:
            # Prepend system message
            messages = [{"role": "system", "content": request.system_prompt}]
            messages.extend(request.messages)

            stream = client.chat.completions.create(
                model=request.model,
                messages=messages,
                temperature=request.temperature,
                max_tokens=request.max_tokens,
                stream=True,
            )

            for chunk in stream:
                if chunk.choices[0].delta.content is not None:
                    # Send each chunk as an SSE event
                    data = json.dumps({
                        "content": chunk.choices[0].delta.content,
                        "done": False
                    })
                    yield {"event": "message", "data": data}

                # Small delay to prevent overwhelming the client
                await asyncio.sleep(0.01)

            # Signal completion
            yield {"event": "message", "data": json.dumps({"content": "", "done": True})}

        except Exception as e:
            yield {"event": "error", "data": json.dumps({"error": str(e)})}

    return EventSourceResponse(generate())


# ─────────────────────────────────────────────────────────────────────────────
# Health Check
# ─────────────────────────────────────────────────────────────────────────────

@app.get("/health")
async def health():
    return {"status": "ok"}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Run the Server

uv run server.py

Test with curl:

Non-streaming

curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

Streaming

curl -X POST http://localhost:8000/api/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a short poem."}]}'
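
You can also test the streaming endpoint from Python. Below is a minimal client sketch (a hypothetical stream_client.py) using httpx, which is not among the lesson's dependencies; add it with uv add httpx. It assumes the server above is running on localhost:8000 and parses the data: lines that sse-starlette emits.

stream_client.py
"""
Minimal SSE test client for the streaming endpoint above.
"""

import json

import httpx

payload = {"messages": [{"role": "user", "content": "Write a short poem."}]}

with httpx.stream(
    "POST", "http://localhost:8000/api/chat/stream", json=payload, timeout=60
) as response:
    for line in response.iter_lines():
        # sse-starlette sends lines like "event: message" and "data: {...}"
        if not line.startswith("data:"):
            continue
        event = json.loads(line[len("data:"):].strip())
        if event.get("done"):
            break
        print(event.get("content", ""), end="", flush=True)

print()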

6. Provider Abstraction

For production, abstract away provider differences:

llm_client.py
from abc import ABC, abstractmethod
from typing import Generator


class LLMClient(ABC):
    @abstractmethod
    def chat(self, messages: list[dict]) -> str: pass

    @abstractmethod
    def chat_stream(self, messages: list[dict]) -> Generator[str, None, None]: pass


class OpenAIClient(LLMClient):
    def __init__(self, model: str = "gpt-4o-mini"):
        from openai import OpenAI
        self.client = OpenAI()
        self.model = model

    def chat(self, messages: list[dict]) -> str:
        response = self.client.chat.completions.create(
            model=self.model, messages=messages
        )
        return response.choices[0].message.content

    def chat_stream(self, messages: list[dict]) -> Generator[str, None, None]:
        stream = self.client.chat.completions.create(
            model=self.model, messages=messages, stream=True
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content


def get_client(provider: str = "openai") -> LLMClient:
    if provider == "openai":
        return OpenAIClient()
    elif provider == "anthropic":
        return AnthropicClient()  # Similar implementation
    raise ValueError(f"Unknown provider: {provider}")
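
The factory above references an AnthropicClient that isn't shown. Here is a sketch of what it could look like, added to the same llm_client.py; the _split_system helper is a made-up name, and it exists because Claude takes the system prompt as a separate parameter rather than as a message:

class AnthropicClient(LLMClient):
    def __init__(self, model: str = "claude-sonnet-4-20250514", max_tokens: int = 1000):
        from anthropic import Anthropic
        self.client = Anthropic()
        self.model = model
        self.max_tokens = max_tokens  # Required by the Claude API

    def _split_system(self, messages: list[dict]) -> tuple[str, list[dict]]:
        # Pull the system prompt out of the message list for Claude's API shape
        if messages and messages[0].get("role") == "system":
            return messages[0]["content"], messages[1:]
        return "You are a helpful assistant.", messages

    def chat(self, messages: list[dict]) -> str:
        system, rest = self._split_system(messages)
        response = self.client.messages.create(
            model=self.model, max_tokens=self.max_tokens,
            system=system, messages=rest,
        )
        return response.content[0].text

    def chat_stream(self, messages: list[dict]) -> Generator[str, None, None]:
        system, rest = self._split_system(messages)
        with self.client.messages.stream(
            model=self.model, max_tokens=self.max_tokens,
            system=system, messages=rest,
        ) as stream:
            yield from stream.text_stream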

7. Error Handling & Retries

Retry decorator for API calls
import time
from functools import wraps

def retry_with_backoff(max_retries=3, initial_delay=1.0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries:
                        raise
                    time.sleep(delay)
                    delay *= 2  # Exponential backoff
        return wrapper
    return decorator


@retry_with_backoff(max_retries=3)
def chat_with_retry(messages):
    return client.chat.completions.create(model="gpt-4o-mini", messages=messages)
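
Retrying every exception will also retry permanent failures like an invalid API key. A variation that retries only the transient error types raised by the OpenAI SDK (the rest of the decorator is unchanged):

import time
from functools import wraps

import openai

RETRYABLE = (openai.RateLimitError, openai.APITimeoutError, openai.APIConnectionError)

def retry_with_backoff(max_retries=3, initial_delay=1.0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except RETRYABLE:  # Only transient failures are worth retrying
                    if attempt == max_retries:
                        raise
                    time.sleep(delay)
                    delay *= 2  # Exponential backoff
        return wrapper
    return decorator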

8. Common Pitfalls

| Symptom | Cause | Fix |
| --- | --- | --- |
| No response, then everything at once | Not using stream=True | Enable streaming on both backend and frontend |
| CORS errors in browser | Backend not configured for cross-origin | Add CORS middleware with correct origins |
| Response cuts off mid-sentence | Hit max_tokens limit | Increase limit or check finish_reason |
| "Invalid API key" errors | Key not in environment | Check .env file and load_dotenv() call |
| Memory grows with long conversations | Sending full history every time | Implement conversation truncation or summarization |
| Rate limit errors | Too many requests | Implement retry with exponential backoff |
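
For the history-growth pitfall, the simplest fix is truncation. A sketch that keeps the system message plus the most recent turns (the default of 20 messages is an arbitrary choice; summarization is the more sophisticated alternative):

def truncate_history(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep the system message (if any) plus the most recent turns."""
    if messages and messages[0].get("role") == "system":
        system, rest = messages[:1], messages[1:]
    else:
        system, rest = [], messages
    return system + rest[-max_messages:]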

9. Key Takeaways

  1. Streaming transforms UX. Users perceive streaming responses as faster even when total time is the same.

  2. APIs are similar but not identical. OpenAI and Claude have different message formats—abstract early.

  3. SSE is the standard. Server-Sent Events are how you stream from backend to frontend.

  4. Frontend state is tricky. Append to the last message as chunks arrive; don't create new messages.

  5. Always implement retries. Rate limits and timeouts are normal—handle them gracefully.

  6. Provider abstraction pays off. Write to an interface, swap implementations easily.

10. What's Next

You've built your first real AI integration! In Lesson 6: Structured Data Extraction, we'll learn how to get reliable JSON output from LLMs—turning messy text into typed objects your code can actually use.

We'll cover:

  • JSON mode and response formats
  • Schema validation with Pydantic and Zod
  • Retry strategies for malformed output
  • Building a document parser that extracts structured entities

11. Additional Resources