Master AI Engineering
From APIs to Agents
7 structured modules with real Python code, 350 interview Q&As, and an AI-powered mock interview engine. Everything you need to go from zero to job-ready.
A Complete AI Engineering Curriculum
Sign in to unlock all 7 modules and track your progress
Learn. Practice. Get Hired.
Sign Up Free
Create your account in seconds. No credit card required. Instant access to all 7 modules.
Learn with Code
Work through structured modules with real Python examples you can run immediately.
Practice Interviews
Test your skills with 350 AI interview questions and get instant feedback on your answers.
Simulate Real AI Interviews
Our mock interview engine samples 10 random questions from our 350-question bank, lets you write answers, then shows model answers side by side. Rate yourself, review your history, and track improvement.
- ✅ 350 questions across all 7 topics
- ✅ Easy / Medium / Hard difficulty filters
- ✅ Timer mode for realistic pressure
- ✅ Self-rating (1–5 stars) with history tracking
- ✅ Browse full Q&A bank anytime
Simple, Transparent Pricing
All course content is free. Pay only for mock interview sessions.
- ✓ All 7 modules
- ✓ 40+ Python code examples
- ✓ Architecture diagrams
- ✓ Progress tracking
- ✓ Browse Q&A bank
- ✓ 10-question mock sessions
- ✓ Model answer comparisons
- ✓ Difficulty selection
- ✓ Session history
- ✓ Credits never expire
- ✓ 200 credits monthly
- ✓ 4 mock interviews/month
- ✓ All free features
- ✓ Priority support
- ✓ Best value for job prep
Ready to master AI engineering?
Join today — it's completely free to start.
Master Modern AI,
From APIs to Agents
A hands-on curriculum covering LLM usage, prompt engineering, fine-tuning, reinforcement learning, RAG, and autonomous agent systems — with real Python code examples.
Course Modules
How to Use Large Language Models
Learn to interact with LLM APIs programmatically — sending prompts, handling responses, streaming tokens, and controlling generation behavior with key parameters.
Learning Objectives
- Call the OpenAI / Anthropic API
- Understand chat message formats
- Stream tokens in real-time
- Tune temperature, top-p, max_tokens
- Count tokens & estimate cost
🧠 What is an LLM?
A Large Language Model (LLM) is a neural network trained on vast amounts of text data to predict the next token in a sequence. Models like GPT-4, Claude, and Llama have billions of parameters and can understand and generate human-like text.
Modern LLMs are accessed via an API — you send a structured request with your conversation history, and the model returns a completion.
⚡ Basic API Call
Start by installing the SDK and making your first call. The messages list contains the conversation history in order.
# pip install openai
from openai import OpenAI
client = OpenAI(api_key="sk-...") # or set OPENAI_API_KEY env var
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful AI tutor."},
{"role": "user", "content": "Explain what a neural network is in simple terms."}
],
temperature=0.7,
max_tokens=512,
)
print(response.choices[0].message.content)
print(f"\nTokens used: {response.usage.total_tokens}")
# pip install anthropic
import anthropic
client = anthropic.Anthropic(api_key="sk-ant-...")
message = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=512,
system="You are a helpful AI tutor.",
messages=[
{"role": "user", "content": "Explain what a neural network is in simple terms."}
]
)
print(message.content[0].text)
print(f"\nInput tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")
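The usage block that both APIs return makes cost estimation straightforward. A minimal sketch, with per-million-token prices as placeholder values (check your provider's pricing page for real numbers):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1m: float, price_out_per_1m: float) -> float:
    """Estimate one request's cost from the token counts in the API's usage block."""
    return (input_tokens / 1_000_000) * price_in_per_1m \
         + (output_tokens / 1_000_000) * price_out_per_1m

# Hypothetical prices in $ per 1M tokens; real prices vary by model
cost = estimate_cost(input_tokens=25, output_tokens=180,
                     price_in_per_1m=2.50, price_out_per_1m=10.00)
print(f"${cost:.6f}")
```

Multiply by expected request volume to budget an application; output tokens usually dominate the bill.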
💬 Multi-turn Conversations
LLMs are stateless — each request must include the full conversation history. Maintain a messages list and append each turn manually.
from openai import OpenAI
client = OpenAI()
messages = [{"role": "system", "content": "You are a friendly tutor."}]
def chat(user_input: str) -> str:
messages.append({"role": "user", "content": user_input})
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
temperature=0.7,
)
reply = response.choices[0].message.content
messages.append({"role": "assistant", "content": reply})
return reply
# Simulate a conversation
print(chat("What is gradient descent?"))
print(chat("Can you give me a real-world analogy?"))
print(chat("How does it relate to backpropagation?"))
# messages list now contains the full history
print(f"\nConversation length: {len(messages)} messages")
🌊 Streaming Responses
Streaming sends tokens to your app as they're generated, instead of waiting for the full response. This dramatically improves perceived latency for users.
from openai import OpenAI
client = OpenAI()
with client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Write a haiku about machine learning."}],
stream=True,
) as stream:
for chunk in stream:
delta = chunk.choices[0].delta.content
if delta:
print(delta, end="", flush=True) # Print token by token
print() # Newline at end
🎛️ Key Generation Parameters
| Parameter | Range | Effect | Recommended |
|---|---|---|---|
| temperature | 0.0–2.0 | Higher = more random/creative output | 0.0 for factual, 0.7 for creative |
| top_p | 0.0–1.0 | Nucleus sampling; limits vocab to top-p probability mass | 0.9 (use either temperature or top_p, not both) |
| max_tokens | 1–model limit | Maximum output length in tokens | 256–2048 for most tasks |
| frequency_penalty | -2.0–2.0 | Penalizes tokens proportionally to how often they have appeared | 0.3–0.5 to reduce repetition |
| presence_penalty | -2.0–2.0 | Penalizes any token that has already appeared (encourages new topics) | 0.5–1.0 for diverse outputs |
| stop | list of strings | Stop generation when any sequence is produced | ["###", "\n\n"] |
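To build intuition for temperature, note that it rescales the model's logits before softmax sampling. A toy illustration of just the math (not an API call; the logit values are made up):

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores
print([round(p, 3) for p in softmax_with_temperature(logits, 0.2)])  # near-greedy
print([round(p, 3) for p in softmax_with_temperature(logits, 1.5)])  # much flatter
```

As temperature approaches 0, sampling collapses to argmax, which is why 0.0 is the recommendation for deterministic, factual tasks.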
📋 Structured JSON Output
Use response_format to guarantee syntactically valid JSON output, which is useful when your application parses model responses. Note that it guarantees valid JSON, not adherence to a particular schema, so still prompt for the keys you expect.
from openai import OpenAI
import json
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o-mini",
response_format={"type": "json_object"},
messages=[
{"role": "system", "content": "Respond only with valid JSON."},
{"role": "user", "content": (
"Extract entities from: 'Sam Altman founded OpenAI in 2015 in San Francisco.' "
"Return JSON with keys: people, organizations, locations, years."
)},
]
)
data = json.loads(response.choices[0].message.content)
print(data)
# {"people": ["Sam Altman"], "organizations": ["OpenAI"],
# "locations": ["San Francisco"], "years": [2015]}
Prompt Engineering
Prompt engineering is the art of crafting inputs that reliably elicit the best possible outputs from an LLM. Small changes in phrasing can dramatically affect quality.
Learning Objectives
- Apply zero-shot & role prompting
- Use Chain-of-Thought reasoning
- Implement self-consistency sampling
- Build reusable prompt templates
- Extract structured data reliably
🗺️ Prompting Techniques Overview
Complexity & effectiveness generally increase left → right
🎯 Zero-Shot Prompting
Zero-shot prompting relies on the model's pre-trained knowledge with no examples. Works well for common tasks. The key is a clear, specific instruction.
from openai import OpenAI
client = OpenAI()
def classify_sentiment(text: str) -> str:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"""Classify the sentiment of the following text.
Respond with exactly one word: POSITIVE, NEGATIVE, or NEUTRAL.
Text: "{text}"
Sentiment:"""
}],
temperature=0, # Deterministic for classification
max_tokens=10,
)
return response.choices[0].message.content.strip()
texts = [
"The model training finished 10x faster than expected!",
"This API keeps returning errors and I can't figure out why.",
"The paper proposes a new attention mechanism."
]
for t in texts:
print(f"'{t[:40]}...' → {classify_sentiment(t)}")
🔗 Chain-of-Thought (CoT)
CoT prompts the model to show its reasoning step-by-step before giving an answer. This dramatically improves performance on math, logic, and multi-step reasoning tasks.
❌ Without CoT
prompt = """
Q: A train travels 150 km in 2 hours,
then 200 km in 3 hours. What is its
average speed for the whole trip?
A:"""
# Often gives ~70.8 km/h (averaging the two speeds, 75 and 66.7)
✅ With CoT
prompt = """
Q: A train travels 150 km in 2 hours,
then 200 km in 3 hours. What is its
average speed for the whole trip?
A: Let me think step by step.
"""
# Gets correct answer: 350/5 = 70 km/h
# (total distance / total time)
from openai import OpenAI
client = OpenAI()
def solve_with_cot(problem: str) -> str:
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a careful, logical problem solver."},
{"role": "user", "content": f"{problem}\n\nLet's think step by step:"}
],
temperature=0,
)
return response.choices[0].message.content
problem = """
If I have 5 apples and give away 2/5 of them, then receive 3 more,
and finally share equally with one friend, how many do I end up with?
"""
print(solve_with_cot(problem))
♻️ Self-Consistency
Generate multiple CoT responses with high temperature, then take the majority vote. This ensemble approach reduces errors by ~10-20% on reasoning tasks.
from openai import OpenAI
from collections import Counter
import re
client = OpenAI()
def self_consistency(question: str, num_samples: int = 5) -> str:
"""Sample multiple CoT paths and majority-vote the final answer."""
system = (
"Solve the problem step by step. "
"End your response with a line of the form 'Final answer: <your answer>'."
)
answers = []
for _ in range(num_samples):
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system},
{"role": "user", "content": question}
],
temperature=0.8, # High temp for diverse reasoning paths
)
text = resp.choices[0].message.content
# Extract the final answer
match = re.search(r"Final answer:\s*(.+)", text, re.IGNORECASE)
if match:
answers.append(match.group(1).strip().lower())
if not answers:
return "Could not extract answers"
# Majority vote
most_common, count = Counter(answers).most_common(1)[0]
print(f"Votes: {dict(Counter(answers))}")
print(f"Majority ({count}/{num_samples}): {most_common}")
return most_common
result = self_consistency(
"A bat and ball cost $1.10 total. The bat costs $1.00 more than the ball. "
"How much does the ball cost?"
)
🎭 Role Prompting
Assigning a specific role or persona in the system prompt activates relevant knowledge patterns and adjusts the model's communication style.
from openai import OpenAI
client = OpenAI()
ROLES = {
"code_reviewer": """You are a senior software engineer with 15 years of experience.
Review code for: correctness, performance, security vulnerabilities, and maintainability.
Be specific and actionable in your feedback.""",
"socratic_tutor": """You are a Socratic tutor. Never give direct answers.
Instead, guide students to discover answers themselves through carefully crafted questions.
Ask one question at a time.""",
"ux_critic": """You are a UX researcher with expertise in cognitive load theory.
Evaluate designs from the user's perspective. Cite specific usability heuristics.""",
}
def ask_expert(role_key: str, question: str) -> str:
return client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": ROLES[role_key]},
{"role": "user", "content": question}
],
temperature=0.7,
).choices[0].message.content
# Example usage
review = ask_expert("code_reviewer", """
Review this Python function:
def get_user(id):
return db.execute(f"SELECT * FROM users WHERE id = {id}")
""")
print(review)
# A strong review flags the SQL injection and suggests a parameterized query:
# db.execute("SELECT * FROM users WHERE id = ?", (id,))
📝 Reusable Prompt Templates
Build parameterized templates to standardize prompts across your application and make them easier to iterate on.
from string import Template
from openai import OpenAI
client = OpenAI()
# Define reusable templates
SUMMARIZE_TEMPLATE = Template("""
You are an expert at summarizing $domain content.
Summarize the following text in exactly $num_points bullet points.
Focus on: $focus_areas.
Each bullet should be one clear, concise sentence.
TEXT:
$text
SUMMARY:
""")
EXTRACT_TEMPLATE = Template("""
Extract all $entity_type from the following text.
Return as a JSON array of strings.
If none found, return an empty array [].
TEXT: $text
""")
def summarize(text: str, domain="technical", points=3, focus="key findings"):
prompt = SUMMARIZE_TEMPLATE.substitute(
domain=domain,
num_points=points,
focus_areas=focus,
text=text
)
return client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0.3,
).choices[0].message.content
paper_abstract = """
We present GPT-4, a large multimodal model capable of processing image and text inputs
and producing text outputs. GPT-4 exhibits human-level performance on various professional
and academic benchmarks...
"""
print(summarize(paper_abstract, domain="AI research", points=4, focus="contributions, methods, results"))
Few-Shot Learning
Few-shot learning uses a small number of examples (shots) within the prompt to teach the model a new task without any gradient updates — this is called in-context learning.
Learning Objectives
- Understand in-context learning
- Format examples effectively
- Select high-quality shots
- Build dynamic few-shot retrieval
- Know when few-shot beats zero-shot
🧠 How In-Context Learning Works
Large language models develop the ability to learn new tasks by observing demonstrations in their context window. No weight updates occur — the model uses pattern matching and analogy from its pre-training.
Text: "Great product!" → Label: POSITIVE
Text: "Terrible experience." → Label: NEGATIVE
Text: "It works fine." → Label: NEUTRAL
Text: "Loved it!" → Label: ???
📋 Basic Few-Shot Format
from openai import OpenAI
client = OpenAI()
def few_shot_classify(examples: list[dict], query: str) -> str:
"""
examples: list of {"input": ..., "label": ...}
query: the text to classify
"""
# Build the few-shot prompt
shots = "\n".join([
f'Text: "{ex["input"]}"\nLabel: {ex["label"]}'
for ex in examples
])
prompt = f"""Classify the sentiment of text as POSITIVE, NEGATIVE, or NEUTRAL.
{shots}
Text: "{query}"
Label:"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0,
max_tokens=10,
)
return response.choices[0].message.content.strip()
# Examples covering all three classes
examples = [
{"input": "The delivery was incredibly fast!", "label": "POSITIVE"},
{"input": "Completely broken on arrival.", "label": "NEGATIVE"},
{"input": "It does what it says.", "label": "NEUTRAL"},
{"input": "Best purchase I've made this year!", "label": "POSITIVE"},
{"input": "Won't be buying from them again.", "label": "NEGATIVE"},
]
queries = [
"Works as expected, nothing special.",
"Absolutely love this product!",
"Stopped working after one week."
]
for q in queries:
label = few_shot_classify(examples, q)
print(f"'{q}' → {label}")
🔄 Dynamic Example Selection
Instead of using fixed examples, retrieve the most semantically similar examples to the query. This improves performance especially for diverse or edge-case inputs.
# pip install openai numpy
from openai import OpenAI
import numpy as np
client = OpenAI()
def get_embedding(text: str) -> list[float]:
resp = client.embeddings.create(
model="text-embedding-3-small",
input=text
)
return resp.data[0].embedding
def cosine_similarity(a, b) -> float:
a, b = np.array(a), np.array(b)
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
class DynamicFewShot:
def __init__(self, examples: list[dict]):
"""examples: list of {"input": ..., "output": ..., "embedding": None}"""
self.examples = examples
# Pre-compute embeddings for all examples
for ex in self.examples:
ex["embedding"] = get_embedding(ex["input"])
def get_top_k(self, query: str, k: int = 3) -> list[dict]:
q_emb = get_embedding(query)
scored = [
(cosine_similarity(q_emb, ex["embedding"]), ex)
for ex in self.examples
]
scored.sort(key=lambda x: x[0], reverse=True)
return [ex for _, ex in scored[:k]]
def predict(self, query: str, k: int = 3) -> str:
top_k = self.get_top_k(query, k)
shots = "\n".join([
f'Input: {ex["input"]}\nOutput: {ex["output"]}'
for ex in top_k
])
prompt = f"Transform the input as shown:\n\n{shots}\nInput: {query}\nOutput:"
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0,
)
return resp.choices[0].message.content.strip()
# Example: date format normalization
examples = [
{"input": "January 5th, 2024", "output": "2024-01-05"},
{"input": "March 22, 2023", "output": "2023-03-22"},
{"input": "Dec 31st 2022", "output": "2022-12-31"},
{"input": "15 August 2024", "output": "2024-08-15"},
{"input": "July 4, 2025", "output": "2025-07-04"},
{"input": "February 14th, 2024", "output": "2024-02-14"},
]
dfs = DynamicFewShot(examples)
print(dfs.predict("October 3rd, 2024")) # → 2024-10-03
print(dfs.predict("11 November 2025")) # → 2025-11-11
✅ Best Practices
| Principle | Do | Avoid |
|---|---|---|
| Diversity | Cover all output classes/formats in examples | Using examples that are all similar to each other |
| Format | Use identical I/O format for all shots | Inconsistent spacing, punctuation, or structure |
| Quality | Use high-quality, verified example pairs | Incorrect labels — they anchor the model to wrong patterns |
| Order | Put the most relevant example last (recency bias) | Random ordering for critical tasks |
| Count | 3–8 shots for most tasks | Filling the entire context window with shots |
Supervised Fine-Tuning (SFT)
SFT adapts a pre-trained LLM to a specific domain or task by training on labeled instruction-response pairs, updating model weights using gradient descent.
Learning Objectives
- Prepare instruction-tuning datasets
- Understand LoRA / QLoRA
- Run training with HuggingFace + TRL
- Choose hyperparameters
- Evaluate fine-tuned models
🤔 Why Fine-Tune?
Prompt engineering has limits. When you need consistent format/style, domain-specific knowledge, lower latency, or lower API cost, fine-tuning is the answer.
| Approach | When to Use | Trade-offs |
|---|---|---|
| Prompting | General tasks, quick iteration | High token cost, inconsistent format |
| Few-Shot | Format adaptation | Consumes context tokens; gains usually plateau after a few shots |
| SFT | Domain expertise, consistent behavior | Needs data & compute, risks overfitting |
📦 Data Preparation
SFT requires a dataset of (instruction, response) pairs in a chat format. Quality matters far more than quantity — 1,000 excellent examples beat 100,000 mediocre ones.
import json
from datasets import Dataset
# ShareGPT format (widely supported)
data = [
{
"conversations": [
{"from": "system", "value": "You are an expert Python tutor."},
{"from": "human", "value": "How do I read a CSV file in Python?"},
{"from": "gpt", "value": "Use the `csv` module or pandas:\n\n```python\nimport pandas as pd\ndf = pd.read_csv('file.csv')\nprint(df.head())\n```\n\nFor large files, use `chunksize=1000` to process in batches."}
]
},
{
"conversations": [
{"from": "system", "value": "You are an expert Python tutor."},
{"from": "human", "value": "What is a list comprehension?"},
{"from": "gpt", "value": "A list comprehension is a concise way to create lists:\n\n```python\n# Traditional\nresult = []\nfor x in range(10):\n if x % 2 == 0:\n result.append(x**2)\n\n# List comprehension\nresult = [x**2 for x in range(10) if x % 2 == 0]\n```\n\nThey're faster and more Pythonic than equivalent loops."}
]
},
# ... add thousands more examples
]
# Save as JSONL
with open("train.jsonl", "w") as f:
for item in data:
f.write(json.dumps(item) + "\n")
# Load with HuggingFace datasets
dataset = Dataset.from_list(data)
dataset = dataset.train_test_split(test_size=0.05)
print(f"Train: {len(dataset['train'])}, Val: {len(dataset['test'])}")
⚡ LoRA: Low-Rank Adaptation
Full fine-tuning updates all ~7B+ parameters — expensive and prone to catastrophic forgetting. LoRA freezes the original weights and injects small trainable matrices that capture task-specific updates.
A is rank×d, B is d×rank. Typical rank=8–64. Only A and B are trained.
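The parameter savings are easy to verify: a full d × k weight update trains d·k values, while a rank-r pair trains only r·(d + k). A quick check with 4096 × 4096 dimensions, typical of an attention projection in a 7–8B model:

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Trainable parameters: full d x k update vs. LoRA pair (B: d x r, A: r x k)."""
    return d * k, r * (d + k)

full, lora = lora_param_counts(d=4096, k=4096, r=16)
print(full, lora)            # 16777216 131072
print(f"{lora / full:.2%}")  # 0.78% of the full matrix
```

This is per matrix; applying LoRA to every targeted projection still leaves the trainable fraction around 1% of the model.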
🚀 Training with TRL + PEFT
# pip install transformers trl peft accelerate bitsandbytes datasets
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch
MODEL_ID = "meta-llama/Meta-Llama-3.1-8B"
# ── 4-bit Quantization (QLoRA) ──────────────────────────
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
quantization_config=bnb_config,
device_map="auto",
torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
# ── LoRA Configuration ───────────────────────────────────
lora_config = LoraConfig(
r=16, # Rank
lora_alpha=32, # Alpha = 2x rank
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 83,886,080 || all params: 8,114,278,400 || trainable%: 1.03%
# ── Training Arguments ───────────────────────────────────
training_args = SFTConfig(
output_dir="./llama3-sft",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # Effective batch = 8
learning_rate=2e-4,
warmup_ratio=0.03,
lr_scheduler_type="cosine",
logging_steps=10,
save_steps=100,
eval_strategy="steps",
eval_steps=100,
bf16=True,
max_seq_length=2048,
dataset_text_field="text", # Column containing formatted text
report_to="wandb", # Optional: experiment tracking
)
# ── Load Dataset ─────────────────────────────────────────
dataset = load_dataset("json", data_files={"train": "train.jsonl"}, split="train")
def format_conversation(example):
"""Convert ShareGPT format to training text."""
messages = example["conversations"]
text = ""
for msg in messages:
if msg["from"] == "system":
text += f"<|system|>\n{msg['value']}\n"
elif msg["from"] == "human":
text += f"<|user|>\n{msg['value']}\n"
elif msg["from"] == "gpt":
text += f"<|assistant|>\n{msg['value']}\n"
return {"text": text}
dataset = dataset.map(format_conversation)
# ── Start Training ───────────────────────────────────────
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./llama3-sft-final")
print("Training complete!")
🔗 Merging LoRA Weights & Inference
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
LORA_ADAPTER = "./llama3-sft-final"
# Load base model in full precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
# Load and merge LoRA weights into base model
model = PeftModel.from_pretrained(base_model, LORA_ADAPTER)
model = model.merge_and_unload() # Merge weights, remove LoRA modules
# Save merged model
model.save_pretrained("./llama3-merged")
tokenizer.save_pretrained("./llama3-merged")
# Run inference
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "<|system|>\nYou are an expert Python tutor.\n<|user|>\nExplain decorators.\n<|assistant|>\n"
output = pipe(prompt, max_new_tokens=256, temperature=0.7)
print(output[0]["generated_text"][len(prompt):])
RL Training with LLM-as-Judge
Reinforcement Learning from Human Feedback (RLHF) aligns LLMs with human preferences. Using another LLM as a judge automates the reward signal at scale — no human labelers needed.
Learning Objectives
- Understand RLHF pipeline
- Implement PPO for LLMs
- Apply DPO (simpler alternative)
- Use GRPO for reasoning tasks
- Build an LLM-as-judge reward model
🗺️ RLHF Overview
Classic RLHF has three stages. Using an LLM-as-judge replaces the expensive human preference collection step.
⚖️ LLM-as-Judge Reward Model
Instead of a trained reward model, use a capable LLM (e.g., GPT-4o) to score responses. This scales instantly and can evaluate nuanced qualities like helpfulness and harmlessness.
from openai import OpenAI
import json
client = OpenAI()
JUDGE_SYSTEM = """You are an expert AI judge evaluating the quality of AI assistant responses.
Score the response on a scale of 1-10 based on:
- Accuracy (is the information correct?)
- Helpfulness (does it fully address the question?)
- Clarity (is it easy to understand?)
- Safety (no harmful content?)
Respond with valid JSON only: {"score": <1-10>, "reasoning": "<one-sentence explanation>"}"""
def llm_judge(prompt: str, response: str) -> dict:
"""Score a response using GPT-4o as judge. Returns score and reasoning."""
result = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": JUDGE_SYSTEM},
{"role": "user", "content": f"Prompt: {prompt}\n\nResponse: {response}"}
],
response_format={"type": "json_object"},
temperature=0,
)
return json.loads(result.choices[0].message.content)
# Pairwise comparison (preferred for DPO data collection)
def pairwise_judge(prompt: str, response_a: str, response_b: str) -> str:
"""Returns 'A', 'B', or 'tie'."""
pairwise_prompt = f"""Which response is better?
Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}
Answer with JSON: {{"winner": "A" | "B" | "tie", "reasoning": "..."}}"""
result = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are an impartial AI judge."},
{"role": "user", "content": pairwise_prompt}
],
response_format={"type": "json_object"},
temperature=0,
)
data = json.loads(result.choices[0].message.content)
return data["winner"]
# Example usage
score = llm_judge(
prompt="What is the capital of France?",
response="Paris is the capital of France, known for the Eiffel Tower."
)
print(f"Score: {score['score']}/10 — {score['reasoning']}")
📐 PPO — Proximal Policy Optimization
PPO is the canonical RL algorithm for RLHF. It updates the policy (LLM) to maximize reward while staying close to the reference policy via a KL divergence penalty.
L_PPO(θ) = E[ min(r·A, clip(r, 1−ε, 1+ε)·A) ] − β·KL(π_θ ‖ π_ref), where r = π_θ(a|s)/π_θ_old(a|s) is the probability ratio, A is the advantage estimate, ε is the clip range, and β controls the KL penalty against the frozen reference policy.
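The clipping behavior is easy to check numerically. A per-token sketch of the clipped surrogate term (TRL computes this internally; this is just the formula, with illustrative inputs):

```python
def ppo_clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A): caps how far one update can push."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage, ratio overshoots: capped at (1 + eps) * A
print(ppo_clipped_term(1.5, 2.0))   # 2.4
# Negative advantage, ratio undershoots: capped at (1 - eps) * A
print(ppo_clipped_term(0.5, -1.0))  # -0.8
```

The min ensures the objective never rewards moving the policy further than the clip range allows, which is what keeps PPO updates stable.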
# pip install trl transformers peft
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer
from datasets import Dataset
import torch
MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# Load model with value head (needed for PPO)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
MODEL_ID,
torch_dtype=torch.bfloat16,
device_map="auto",
)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(
MODEL_ID,
torch_dtype=torch.bfloat16,
device_map="auto",
) # Reference model (frozen)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
# PPO Config
ppo_config = PPOConfig(
model_name=MODEL_ID,
learning_rate=1.41e-5,
batch_size=16,
mini_batch_size=4,
gradient_accumulation_steps=1,
optimize_cuda_cache=True,
kl_penalty="kl", # Or "full" for full KL
init_kl_coef=0.2, # β: initial KL coefficient
target_kl=6.0, # Target KL divergence
gamma=1.0,
lam=0.95, # GAE lambda
cliprange=0.2, # ε: clip range
vf_coef=0.1, # Value function coefficient
)
trainer = PPOTrainer(
config=ppo_config,
model=model,
ref_model=ref_model,
tokenizer=tokenizer,
)
# Helper to split prompts into batches
def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

# Training loop
prompts = ["Explain quantum computing", "Write a poem about AI", ...]
for epoch in range(3):
for batch_prompts in chunks(prompts, ppo_config.batch_size):
# 1. Tokenize prompts
query_tensors = [
tokenizer.encode(p, return_tensors="pt").squeeze()
for p in batch_prompts
]
# 2. Generate responses
response_tensors = trainer.generate(
query_tensors,
max_new_tokens=256,
temperature=0.9,
)
# 3. Score with LLM judge
rewards = []
for prompt, response_tensor in zip(batch_prompts, response_tensors):
response_text = tokenizer.decode(response_tensor)
score = llm_judge(prompt, response_text) # From previous code
rewards.append(torch.tensor(score["score"] / 10.0))
# 4. PPO update
stats = trainer.step(query_tensors, response_tensors, rewards)
print(f"Epoch {epoch} | Mean reward: {stats['ppo/mean_scores']:.3f} | KL: {stats['objective/kl']:.3f}")
🎯 DPO — Direct Preference Optimization
DPO eliminates the separate reward model and RL loop entirely. It directly optimizes the policy to prefer "chosen" responses over "rejected" ones using a simple cross-entropy loss.
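The loss itself is a logistic loss on log-probability margins. A sketch with made-up sequence log-probs (DPOTrainer computes these from the policy and reference models):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * (chosen margin over reference - rejected margin over reference))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy identical to reference: margin is 0, loss is log(2) ~ 0.693
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy now prefers the chosen response more than the reference does: loss drops
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

Because the reference log-probs act as the implicit KL anchor, no separate reward model or rollout generation is needed, which is the whole appeal of DPO.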
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
import torch
MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
# DPO requires preference pairs: (prompt, chosen, rejected)
# Generate these using LLM judge for pairwise comparison
dpo_data = [
{
"prompt": "What is the best way to learn Python?",
"chosen": "Start with official tutorials, then build projects. Practice daily with small scripts before attempting large projects.",
"rejected": "Just watch YouTube videos."
},
{
"prompt": "Explain recursion",
"chosen": "Recursion is when a function calls itself. Example: factorial(n) = n * factorial(n-1). Every recursive function needs a base case to stop.",
"rejected": "It's a programming thing where functions call themselves."
},
# ... thousands more preference pairs
]
dataset = Dataset.from_list(dpo_data)
training_args = DPOConfig(
output_dir="./llama3-dpo",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=5e-7, # Lower than SFT
beta=0.1, # KL penalty (higher = closer to reference)
max_length=1024,
max_prompt_length=512,
bf16=True,
logging_steps=10,
)
trainer = DPOTrainer(
model=model,
ref_model=None, # If None, uses a copy of model as reference
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./llama3-dpo-final")
🔄 GRPO — Group Relative Policy Optimization
GRPO (from DeepSeek-R1) improves on PPO for reasoning tasks. It samples a group of responses per prompt, computes relative rewards within the group, and uses them as baselines — eliminating the value function entirely.
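The group-relative advantage is per-group reward normalization. A sketch of what GRPO computes for one prompt's G sampled responses (TRL handles this internally):

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group: (r - mean) / std, the GRPO baseline trick."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0.0:
        return [0.0 for _ in rewards]  # all responses tied: no learning signal
    return [(r - mean) / std for r in rewards]

# Two of four sampled responses solved the problem (reward 2.0 from a correctness check)
print(group_advantages([2.0, 0.0, 2.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Because the baseline comes from the group itself, no value network is needed, which is why GRPO is cheaper per update than PPO.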
from trl import GRPOConfig, GRPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import re, torch
MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# ── Reward Functions ─────────────────────────────────────
# GRPO supports multiple composable reward functions
def correctness_reward(completions, ground_truth, **kwargs) -> list[float]:
"""Verify math answers against ground truth."""
rewards = []
for completion, gt in zip(completions, ground_truth):
# Extract answer from <answer>...</answer> tags
match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
if match and match.group(1).strip() == str(gt).strip():
rewards.append(2.0) # Correct answer
else:
rewards.append(0.0) # Wrong
return rewards
def format_reward(completions, **kwargs) -> list[float]:
"""Reward responses that use the correct format."""
rewards = []
for c in completions:
has_thinking = "<think>" in c and "</think>" in c
has_answer = "<answer>" in c and "</answer>" in c
rewards.append(0.5 if (has_thinking and has_answer) else 0.0)
return rewards
def length_penalty(completions, **kwargs) -> list[float]:
"""Penalize overly short or long responses."""
rewards = []
for c in completions:
tokens = len(c.split())
if 50 <= tokens <= 500:
rewards.append(0.1)
elif tokens < 20 or tokens > 1000:
rewards.append(-0.2)
else:
rewards.append(0.0)
return rewards
# ── GRPO Config ──────────────────────────────────────────
config = GRPOConfig(
output_dir="./qwen-grpo-math",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=2,
learning_rate=1e-6,
num_generations=8, # G: responses sampled per prompt
max_prompt_length=512,
max_completion_length=1024,
beta=0.04, # KL coefficient
bf16=True,
logging_steps=10,
reward_weights=[1.0, 0.5, 0.2], # Weights for reward functions
)
# Load math dataset (e.g., GSM8K)
dataset = load_dataset("openai/gsm8k", "main", split="train")
# GSM8K answers end with "#### <number>"; keep only that final number as the target
dataset = dataset.map(lambda x: {"ground_truth": x["answer"].split("####")[-1].strip()})
dataset = dataset.rename_column("question", "prompt")  # GRPOTrainer expects a "prompt" column
trainer = GRPOTrainer(
model=model,
processing_class=tokenizer,
args=config,
train_dataset=dataset,
reward_funcs=[correctness_reward, format_reward, length_penalty],
)
trainer.train()
| Algorithm | Requires | Best For | Complexity |
|---|---|---|---|
| PPO | Reward model + value head | General alignment, chat | High |
| DPO | Preference pairs | Style/safety alignment | Low |
| GRPO | Verifiable reward function | Math, code, reasoning | Medium |
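The "Low" complexity entry for DPO reflects that its entire objective is one logistic loss over log-probability margins. A minimal sketch with illustrative numbers (inputs are per-sequence log-probs under the policy and the frozen reference):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO: -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))

# Policy favors the chosen response more than the reference does, so loss < log(2)
loss = dpo_loss(-10.0, -14.0, -11.0, -12.0)
```

When the policy matches the reference exactly, the margin is zero and the loss sits at log(2); training pushes it below that by widening the chosen-vs-rejected gap.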
Retrieval-Augmented Generation (RAG)
RAG grounds LLM responses in your own data by retrieving relevant documents at query time. This reduces hallucination and enables up-to-date, source-cited answers.
Learning Objectives
- Build an end-to-end RAG pipeline
- Choose chunking & embedding strategies
- Use vector databases
- Implement Graph RAG
- Evaluate retrieval quality
🏗️ Regular RAG Pipeline
⚡ Complete RAG Implementation
# pip install openai chromadb langchain-text-splitters pypdf
from openai import OpenAI
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path
client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("knowledge_base", metadata={"hnsw:space": "cosine"})  # cosine distance, so 1 - distance is a similarity
# ── STEP 1: Load & Chunk Documents ──────────────────────
def load_and_chunk(file_path: str, chunk_size=512, overlap=64) -> list[str]:
text = Path(file_path).read_text(encoding="utf-8")
splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=overlap,
separators=["\n\n", "\n", ". ", " ", ""],
)
return splitter.split_text(text)
# ── STEP 2: Embed & Store ────────────────────────────────
def embed_texts(texts: list[str]) -> list[list[float]]:
resp = client.embeddings.create(
model="text-embedding-3-small",
input=texts,
)
return [d.embedding for d in resp.data]
def index_document(file_path: str):
chunks = load_and_chunk(file_path)
embeddings = embed_texts(chunks)
ids = [f"{file_path}_{i}" for i in range(len(chunks))]
collection.add(
ids=ids,
embeddings=embeddings,
documents=chunks,
metadatas=[{"source": file_path, "chunk": i} for i in range(len(chunks))]
)
print(f"Indexed {len(chunks)} chunks from {file_path}")
# ── STEP 3: Retrieve ─────────────────────────────────────
def retrieve(query: str, k: int = 5) -> list[dict]:
query_embedding = embed_texts([query])[0]
results = collection.query(
query_embeddings=[query_embedding],
n_results=k,
include=["documents", "metadatas", "distances"]
)
chunks = []
for doc, meta, dist in zip(
results["documents"][0],
results["metadatas"][0],
results["distances"][0]
):
chunks.append({
"text": doc,
"source": meta["source"],
"score": 1 - dist # Convert distance to similarity
})
return chunks
# ── STEP 4: Generate Answer ──────────────────────────────
def rag_query(question: str, k: int = 5) -> dict:
# Retrieve relevant chunks
chunks = retrieve(question, k=k)
# Build context
context = "\n\n---\n\n".join([
f"[Source: {c['source']}, Score: {c['score']:.2f}]\n{c['text']}"
for c in chunks
])
# Generate with context
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": """You are a helpful assistant.
Answer the user's question based ONLY on the provided context.
If the answer is not in the context, say "I don't have enough information to answer this."
Always cite your sources."""},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
],
temperature=0,
)
return {
"answer": response.choices[0].message.content,
"sources": [c["source"] for c in chunks],
"chunks_used": len(chunks)
}
# ── Usage ─────────────────────────────────────────────────
index_document("company_docs.txt")
index_document("product_manual.txt")  # load_and_chunk reads plain text; extract PDF text with pypdf first
result = rag_query("What are the system requirements?")
print(result["answer"])
print(f"\nSources: {result['sources']}")
✂️ Chunking Strategies
| Strategy | Description | Best For |
|---|---|---|
| Fixed Size | Split every N tokens/chars with overlap | General text, simple documents |
| Recursive | Try paragraph → sentence → word splits | Prose, books, articles |
| Semantic | Split on topic/semantic boundaries using embeddings | Multi-topic docs, high accuracy needs |
| Document-aware | Markdown headers, HTML tags, code blocks | Structured docs, code files |
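The Fixed Size row can be sketched in a few lines (character-based for simplicity; production chunkers usually count tokens instead):

```python
def fixed_size_chunks(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size windows, each sharing `overlap` chars with the previous."""
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # last window already covers the tail
            break
    return chunks

chunks = fixed_size_chunks("x" * 500, size=200, overlap=40)
```

The overlap means each chunk repeats the tail of its predecessor, so a sentence cut at a boundary still appears whole in at least one chunk.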
🕸️ Graph RAG
Graph RAG builds a knowledge graph from documents — extracting entities and relationships — then traverses the graph during retrieval to find non-obvious connections that pure vector search misses.
Documents → Entity/Relation Extraction → Knowledge Graph (Neo4j) → Graph Traversal + Vector Search
# pip install openai neo4j
from openai import OpenAI
from neo4j import GraphDatabase
import json
client = OpenAI()
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# ── Extract Knowledge Graph from Text ────────────────────
def extract_knowledge(text: str) -> dict:
"""Extract entities and relationships using LLM."""
resp = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": """Extract a knowledge graph from the text.
Return JSON with:
- "entities": [{"name": str, "type": str, "description": str}]
- "relations": [{"from": str, "relation": str, "to": str}]"""},
{"role": "user", "content": text}
],
response_format={"type": "json_object"},
temperature=0,
)
return json.loads(resp.choices[0].message.content)
# ── Store in Neo4j ────────────────────────────────────────
def store_graph(kg: dict):
with driver.session() as session:
# Create entity nodes
for entity in kg["entities"]:
session.run(
"MERGE (e:Entity {name: $name}) SET e.type=$type, e.description=$desc",
name=entity["name"], type=entity["type"], desc=entity["description"]
)
# Create relationship edges
for rel in kg["relations"]:
session.run(
"""MATCH (a:Entity {name: $from}), (b:Entity {name: $to})
MERGE (a)-[r:RELATION {type: $rel}]->(b)""",
**{"from": rel["from"], "to": rel["to"], "rel": rel["relation"]}
)
# ── Graph Traversal Query ────────────────────────────────
def graph_retrieve(query: str, hops: int = 2) -> str:
"""Retrieve a subgraph relevant to the query via entity matching + traversal."""
# Extract query entities
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f'List the main entity names in this query as JSON: {{"entities": ["name1", ...]}}. Query: {query}'
}],
response_format={"type": "json_object"},
temperature=0,
)
entities = json.loads(resp.choices[0].message.content).get("entities", [])
results = []
with driver.session() as session:
for entity in entities[:3]: # Limit to top 3
records = session.run(f"""
MATCH path = (start:Entity)-[*1..{hops}]-(connected)
WHERE start.name CONTAINS $name
RETURN [node in nodes(path) | node.name + ': ' + coalesce(node.description, '')] as chain,
[rel in relationships(path) | type(rel)] as rels
LIMIT 10
""", name=entity)
for r in records:
results.append(" → ".join(r["chain"]))
return "\n".join(results)
def graph_rag_query(question: str) -> str:
subgraph_context = graph_retrieve(question)
resp = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Answer using the knowledge graph context provided."},
{"role": "user", "content": f"Knowledge Graph:\n{subgraph_context}\n\nQuestion: {question}"}
],
)
return resp.choices[0].message.content
# Example
text = "Apple was founded by Steve Jobs and Steve Wozniak in 1976. Jobs later launched the iPhone in 2007."
kg = extract_knowledge(text)
store_graph(kg)
answer = graph_rag_query("What did Steve Jobs create after founding Apple?")
print(answer)
📊 Evaluating RAG Quality
# pip install ragas datasets
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset
# Prepare evaluation dataset
eval_data = {
"question": ["What year was the company founded?"],
"answer": ["The company was founded in 1995."],
"contexts": [["Company History: Founded in 1995 by John Smith..."]],
"ground_truth": ["1995"],
}
dataset = Dataset.from_dict(eval_data)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall, context_precision])
print(result)
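Alongside RAGAS, a simple string-match hit rate is cheap to track during development. A hand-rolled helper (not part of the ragas library):

```python
def hit_rate(retrieved_per_query: list[list[str]], gold_answers: list[str]) -> float:
    """Fraction of queries where any retrieved chunk contains the gold answer string."""
    hits = sum(
        1 for chunks, gold in zip(retrieved_per_query, gold_answers)
        if any(gold in chunk for chunk in chunks)
    )
    return hits / len(gold_answers)

score = hit_rate(
    [["Company History: Founded in 1995 by John Smith", "Offices in Berlin"]],
    ["1995"],
)
```

Exact substring matching is crude but gives a fast regression signal on retrieval before running the LLM-based metrics.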
Agent Systems
AI agents are LLM-powered systems that can reason, plan, use tools, and take actions in an environment. Agents can work alone or as part of collaborative multi-agent systems.
Learning Objectives
- Implement the ReAct reasoning loop
- Define and use tools / function calling
- Add memory to agents
- Build multi-agent workflows
- Use LangGraph for stateful agents
🤖 What is an Agent?
An agent is an LLM in an action loop: it perceives state, reasons about what to do, calls a tool, observes the result, and repeats until the task is complete.
Observe state → Reason ("What should I do?") → Call tool / API → Tool result → context → Answer or loop again
🛠️ Single Agent with Tool Use
OpenAI's function calling API lets you define tools as JSON schemas. The model decides when to call a tool and what arguments to pass.
from openai import OpenAI
import json, math, datetime
client = OpenAI()
# ── Define Tools ─────────────────────────────────────────
TOOLS = [
{
"type": "function",
"function": {
"name": "calculator",
"description": "Evaluate a mathematical expression. Returns the numeric result.",
"parameters": {
"type": "object",
"properties": {
"expression": {"type": "string", "description": "Math expression, e.g. '2 ** 10 + sqrt(16)'"}
},
"required": ["expression"]
}
}
},
{
"type": "function",
"function": {
"name": "get_current_time",
"description": "Get the current date and time in ISO format.",
"parameters": {"type": "object", "properties": {}}
}
},
{
"type": "function",
"function": {
"name": "web_search",
"description": "Search the web for information. Returns snippets.",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"}
},
"required": ["query"]
}
}
}
]
# ── Tool Implementations ──────────────────────────────────
def calculator(expression: str) -> str:
try:
# Safe eval with math functions only
safe_env = {k: getattr(math, k) for k in dir(math) if not k.startswith('_')}
result = eval(expression, {"__builtins__": {}}, safe_env)
return str(result)
except Exception as e:
return f"Error: {e}"
def get_current_time() -> str:
return datetime.datetime.now().isoformat()
def web_search(query: str) -> str:
# Stub — replace with real search API (Tavily, SerpAPI, etc.)
return f"Search results for '{query}': [This is a demo stub. Integrate Tavily API for real results.]"
TOOL_MAP = {
"calculator": calculator,
"get_current_time": get_current_time,
"web_search": web_search,
}
# ── Agent Loop ────────────────────────────────────────────
def run_agent(user_goal: str, max_steps: int = 10) -> str:
messages = [
{"role": "system", "content": "You are a helpful AI agent. Use tools when needed to answer accurately."},
{"role": "user", "content": user_goal}
]
for step in range(max_steps):
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=TOOLS,
tool_choice="auto",
)
msg = response.choices[0].message
messages.append(msg) # Add assistant message to history
# Check if done (no tool calls)
if not msg.tool_calls:
print(f"Completed in {step + 1} steps.")
return msg.content
# Execute each tool call
for tool_call in msg.tool_calls:
fn_name = tool_call.function.name
fn_args = json.loads(tool_call.function.arguments)
print(f" [Step {step+1}] Calling {fn_name}({fn_args})")
fn_result = TOOL_MAP[fn_name](**fn_args)
# Add tool result to messages
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"name": fn_name,
"content": fn_result,
})
return "Max steps reached without completing the task."
# ── Run the Agent ─────────────────────────────────────────
result = run_agent("What is 2^32, and what time is it right now?")
print(result)
🧠 Adding Memory to Agents
from openai import OpenAI
import chromadb
from datetime import datetime
client = OpenAI()
chroma = chromadb.Client()
memory_store = chroma.create_collection("agent_memory")
class AgentWithMemory:
def __init__(self, agent_id: str):
self.agent_id = agent_id
self.short_term = [] # Recent conversation turns
self.max_short_term = 20
def remember(self, text: str, metadata: dict = None):
"""Store a memory in long-term vector store."""
embedding = client.embeddings.create(
model="text-embedding-3-small", input=text
).data[0].embedding
memory_store.add(
ids=[f"{self.agent_id}_{datetime.now().timestamp()}"],
embeddings=[embedding],
documents=[text],
metadatas=[{"agent": self.agent_id, "timestamp": str(datetime.now()), **(metadata or {})}]
)
def recall(self, query: str, k: int = 3) -> list[str]:
"""Retrieve relevant memories."""
q_emb = client.embeddings.create(
model="text-embedding-3-small", input=query
).data[0].embedding
results = memory_store.query(
query_embeddings=[q_emb],
n_results=k,
where={"agent": self.agent_id}
)
return results["documents"][0] if results["documents"] else []
def chat(self, user_input: str) -> str:
# Retrieve relevant memories
memories = self.recall(user_input)
memory_context = "\n".join([f"- {m}" for m in memories])
# Build messages with memory
system = f"""You are a helpful assistant with a persistent memory.
Relevant memories from past interactions:
{memory_context if memories else "No relevant memories yet."}"""
self.short_term.append({"role": "user", "content": user_input})
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "system", "content": system}] + self.short_term[-self.max_short_term:],
)
reply = response.choices[0].message.content
self.short_term.append({"role": "assistant", "content": reply})
# Store important things in long-term memory
self.remember(f"User said: {user_input}")
self.remember(f"I responded: {reply[:200]}")
return reply
agent = AgentWithMemory("assistant_1")
print(agent.chat("My name is Alice and I'm working on a Python RAG project."))
print(agent.chat("What was I working on?")) # Should recall from memory
🕸️ Multi-Agent Systems
Multiple specialized agents collaborate, each focusing on what it does best. Common patterns: Orchestrator-Worker, Pipeline, and Debate.
Orchestrator (plans & delegates) → Workers: web search, draft output → Reviewer (review & verify)
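Stripped of any framework, Orchestrator-Worker is just plan-then-delegate. A toy sketch with stub workers (the LangGraph example that follows shows the Pipeline pattern in full; a real orchestrator would ask an LLM to produce the plan):

```python
def orchestrator(task: str, workers: dict) -> str:
    """Plan the steps, delegate each to a specialist worker, pass results along."""
    plan = ["research", "draft", "review"]  # hard-coded here; normally LLM-generated
    context = task
    for step in plan:
        context = workers[step](context)  # delegate to the specialist for this step
    return context

workers = {
    "research": lambda ctx: ctx + "\n[facts gathered]",
    "draft":    lambda ctx: ctx + "\n[draft written]",
    "review":   lambda ctx: ctx + "\n[reviewed & verified]",
}
result = orchestrator("Explain transformers", workers)
```

Each worker only sees the accumulated context, which is exactly the shared-state idea the LangGraph version formalizes with a typed `ResearchState`.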
# pip install langgraph langchain-openai
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
# ── Define Shared State ───────────────────────────────────
class ResearchState(TypedDict):
messages: Annotated[list, add_messages]
research_notes: str
draft: str
critique: str
final_output: str
task: str
# ── Define Agents (Nodes) ─────────────────────────────────
def researcher_agent(state: ResearchState) -> dict:
"""Gathers information relevant to the task."""
response = llm.invoke([
SystemMessage(content="You are a research expert. Gather key facts and insights."),
HumanMessage(content=f"Research this topic thoroughly: {state['task']}")
])
return {"research_notes": response.content}
def writer_agent(state: ResearchState) -> dict:
"""Drafts content based on research."""
response = llm.invoke([
SystemMessage(content="You are an expert technical writer. Write clearly and accurately."),
HumanMessage(content=f"""
Task: {state['task']}
Research Notes: {state['research_notes']}
Write a comprehensive, well-structured response.""")
])
return {"draft": response.content}
def critic_agent(state: ResearchState) -> dict:
"""Reviews and critiques the draft."""
response = llm.invoke([
SystemMessage(content="You are a critical reviewer. Find factual errors, gaps, and improvements."),
HumanMessage(content=f"""
Original Task: {state['task']}
Draft to Review: {state['draft']}
Provide specific, actionable critique. Rate quality 1-10.""")
])
return {"critique": response.content}
def reviser_agent(state: ResearchState) -> dict:
"""Revises based on critique."""
response = llm.invoke([
SystemMessage(content="You are a skilled editor. Improve the draft based on critique."),
HumanMessage(content=f"""
Task: {state['task']}
Original Draft: {state['draft']}
Critique: {state['critique']}
Produce the final, polished version.""")
])
return {"final_output": response.content}
# ── Build Graph ───────────────────────────────────────────
def build_research_pipeline() -> StateGraph:
graph = StateGraph(ResearchState)
# Add nodes
graph.add_node("researcher", researcher_agent)
graph.add_node("writer", writer_agent)
graph.add_node("critic", critic_agent)
graph.add_node("reviser", reviser_agent)
# Define edges (pipeline flow)
graph.set_entry_point("researcher")
graph.add_edge("researcher", "writer")
graph.add_edge("writer", "critic")
graph.add_edge("critic", "reviser")
graph.add_edge("reviser", END)
return graph.compile()
# ── Run Pipeline ──────────────────────────────────────────
pipeline = build_research_pipeline()
result = pipeline.invoke({
"task": "Explain how transformers work in modern LLMs",
"messages": [],
"research_notes": "",
"draft": "",
"critique": "",
"final_output": "",
})
print("=== FINAL OUTPUT ===")
print(result["final_output"])
print("\n=== CRITIQUE ===")
print(result["critique"])
🔧 Agent Framework Comparison
| Framework | Best For | Key Feature | Complexity |
|---|---|---|---|
| LangGraph | Complex stateful workflows, DAGs | Graph-based state machines, cycles | Medium |
| CrewAI | Role-based multi-agent teams | Crew + Role abstractions, easy setup | Low |
| AutoGen | Conversational multi-agent | Agent conversations, code execution | Medium |
| Anthropic SDK | Production agents with Claude | Native tool use, streaming, vision | Low |
| Custom | Maximum control and performance | Build exactly what you need | High |
✅ Agent Design Best Practices
- Design for failure: Agents will sometimes call wrong tools or loop. Add max step limits, error handling, and fallbacks.
- Minimal tool surface: Give agents only the tools they need. Fewer tools = less confusion = more reliable behavior.
- Structured tool outputs: Return consistent JSON from tools. Unstructured output confuses agents.
- Observability: Log every tool call, reasoning step, and state transition. You need visibility to debug agents.
- Human-in-the-loop: For high-stakes actions (deleting data, sending emails), require human approval before execution.
- Idempotent tools: Design tools that can be safely retried without side effects (or track completed actions in state).
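The last practice (idempotent tools) can be sketched as a small wrapper that records completed action IDs, so a retried agent loop never repeats a side effect (names are illustrative):

```python
completed_actions: set[str] = set()

def run_once(action_id: str, fn, *args):
    """Make a side-effecting tool safe to retry by tracking completed action IDs."""
    if action_id in completed_actions:
        return f"{action_id}: already completed, skipping"
    result = fn(*args)
    completed_actions.add(action_id)  # mark done only after success
    return result

first = run_once("send-email-42", lambda to: f"email sent to {to}", "user@example.com")
retry = run_once("send-email-42", lambda to: f"email sent to {to}", "user@example.com")
```

In production the completed-action set would live in durable storage (a database row or state field), not process memory, so restarts stay safe too.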