🤖
Learn AI
Student's Complete Guide
🎓 Practical AI Education

Master AI Engineering
From APIs to Agents

7 structured modules with real Python code, 350 interview Q&As, and an AI-powered mock interview engine. Everything you need to go from zero to job-ready.

7
Modules
350
Q&As
40+
Code Examples
Free
To Join
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain RAG"}],
)
print(response.choices[0].message.content)
✅ No subscription required to read · ✅ Real Python code in every module · ✅ Up-to-date with latest models · ✅ Practice with mock interviews

A Complete AI Engineering Curriculum

Sign in to unlock all 7 modules and track your progress

🤖
🔒
Module 01
Using LLMs
Connect to LLM APIs, stream tokens, and control generation parameters.
OpenAI · Anthropic · Streaming
✍️
🔒
Module 02
Prompt Engineering
Master chain-of-thought, self-consistency, and structured output.
CoT · Zero-shot · JSON Output
🎯
🔒
Module 03
Few-Shot Learning
Teach models new tasks at inference time with curated examples.
In-Context · Example Selection
🔧
🔒
Module 04
Supervised Fine-Tuning
Adapt LLMs with LoRA, QLoRA, and HuggingFace Transformers.
LoRA · QLoRA · PEFT
🏆
🔒
Module 05
RL with LLM-as-Judge
Align LLMs using PPO, DPO, and GRPO with LLM-based rewards.
PPO · DPO · GRPO
🔍
🔒
Module 06
RAG Systems
Build retrieval-augmented pipelines with vector search and Graph RAG.
Vector DB · Graph RAG · LangChain
🕸️
🔒
Module 07
Agent Systems
Design single and multi-agent systems with tools, memory, and planning.
ReAct · Multi-Agent · LangGraph
🚀
Unlock All Modules
Create a free account to access everything

Learn. Practice. Get Hired.

01

Sign Up Free

Create your account in seconds. No credit card required. Instant access to all 7 modules.

02

Learn with Code

Work through structured modules with real Python examples you can run immediately.

03

Practice Interviews

Test your skills with 350 AI interview questions and get instant feedback on your answers.

Simulate Real AI Interviews

Our mock interview engine samples 10 random questions from our 350-question bank, lets you write answers, then shows model answers side by side. Rate yourself, review your history, and track improvement.

  • ✅ 350 questions across all 7 topics
  • ✅ Easy / Medium / Hard difficulty filters
  • ✅ Timer mode for realistic pressure
  • ✅ Self-rating (1–5 stars) with history tracking
  • ✅ Browse full Q&A bank anytime
Medium · Module 06 · Q 3/10
What is the difference between dense and sparse retrieval in RAG systems?

Simple, Transparent Pricing

All course content is free. Pay only for mock interview sessions.

📚
Course Content
Free
Forever
  • ✓ All 7 modules
  • ✓ 40+ Python code examples
  • ✓ Architecture diagrams
  • ✓ Progress tracking
  • ✓ Browse Q&A bank
Pro Monthly
$99/mo
200 credits · 30 days
  • ✓ 200 credits monthly
  • ✓ 4 mock interviews/month
  • ✓ All free features
  • ✓ Priority support
  • ✓ Best value for job prep

Ready to master AI engineering?

Join today — it's completely free to start.

No credit card · No spam · Cancel anytime
🎓 Practical AI Education

Master Modern AI,
From APIs to Agents

A hands-on curriculum covering LLM usage, prompt engineering, fine-tuning, reinforcement learning, RAG, and autonomous agent systems — with real Python code examples.

7
Modules
40+
Code Examples
100%
Free

Course Modules

🤖
Module 01
Using LLMs
Connect to LLM APIs, send prompts, handle responses, and understand key generation parameters.
OpenAI · Anthropic · Streaming · Parameters
✍️
Module 02
Prompt Engineering
Master techniques like chain-of-thought, self-consistency, and structured output to elicit better responses.
CoT · Zero-shot · Role Prompting · JSON Output
🎯
Module 03
Few-Shot Learning
Teach the model new tasks at inference time by providing curated examples in the prompt context.
In-Context · Example Selection · Format
🔧
Module 04
Supervised Fine-Tuning
Adapt pre-trained LLMs to specific tasks using LoRA, QLoRA, and HuggingFace Transformers.
LoRA · QLoRA · PEFT · HuggingFace
🏆
Module 05
RL with LLM-as-Judge
Align LLMs using PPO, DPO, and GRPO with LLM-based reward signals instead of human labelers.
PPO · DPO · GRPO · RLHF
🔍
Module 06
RAG Systems
Build retrieval-augmented generation pipelines including vector search and knowledge graph approaches.
Vector DB · Embeddings · Graph RAG · LangChain
🕸️
Module 07
Agent Systems
Design single and multi-agent systems with tool use, memory, planning, and inter-agent communication.
ReAct · Tool Use · Multi-Agent · LangGraph
💡
How to use this course: Work through modules in order — each builds on concepts from the previous one. All code examples use Python and popular open-source libraries. Click a module card above or use the sidebar to navigate.
🤖 Module 01

How to Use Large Language Models

Learn to interact with LLM APIs programmatically — sending prompts, handling responses, streaming tokens, and controlling generation behavior with key parameters.

Learning Objectives

  • Call the OpenAI / Anthropic API
  • Understand chat message formats
  • Stream tokens in real-time
  • Tune temperature, top-p, max_tokens
  • Count tokens & estimate cost

🧠 What is an LLM?

A Large Language Model (LLM) is a neural network trained on vast amounts of text data to predict the next token in a sequence. Models like GPT-4, Claude, and Llama have billions of parameters and can understand and generate human-like text.

Modern LLMs are accessed via an API — you send a structured request with your conversation history, and the model returns a completion.

Context Window
Maximum number of tokens (input + output) the model can process at once. GPT-4o supports up to 128K tokens.
Token
A chunk of text (~4 characters on average in English). Common words are often a single token; rarer words are split into several.
Temperature
Controls randomness. 0 = deterministic, 1 = creative. Most tasks work well at 0.0–0.7.
System Prompt
A special instruction message that sets the model's role, persona, or constraints before the conversation.
ℹ️
Popular providers: OpenAI (GPT-4o, o1), Anthropic (Claude 3.5), Google (Gemini), Meta (Llama 3), Mistral AI. They all follow a similar chat-based API pattern.
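One of this module's objectives is estimating cost, and providers price per token, so a back-of-the-envelope calculation is worth having on hand. A minimal sketch using the ~4 characters/token heuristic; the prices below are illustrative placeholders, not current rates, so check your provider's pricing page (for exact token counts, use the provider's tokenizer, e.g. tiktoken for OpenAI):

```python
def estimate_tokens(text: str) -> int:
    """Heuristic token count: roughly 4 characters per token in English."""
    return max(1, len(text) // 4)

def estimate_cost_usd(prompt: str, expected_output_tokens: int,
                      input_price_per_1m: float = 2.50,
                      output_price_per_1m: float = 10.00) -> float:
    """Estimated request cost in USD, given per-million-token prices (placeholders)."""
    input_tokens = estimate_tokens(prompt)
    return (input_tokens * input_price_per_1m
            + expected_output_tokens * output_price_per_1m) / 1_000_000

prompt = "Explain what a neural network is in simple terms."
print(estimate_tokens(prompt))                   # ~12 tokens for this 49-char prompt
print(f"${estimate_cost_usd(prompt, 512):.6f}")  # $0.005150
```

Note that output tokens typically cost several times more than input tokens, which is why `max_tokens` matters for cost control as well as latency.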

⚡ Basic API Call

Start by installing the SDK and making your first call. The messages list contains the conversation history in order.

Python — OpenAI
# pip install openai
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # or set OPENAI_API_KEY env var

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful AI tutor."},
        {"role": "user",   "content": "Explain what a neural network is in simple terms."}
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)
print(f"\nTokens used: {response.usage.total_tokens}")
Python — Anthropic (Claude)
# pip install anthropic
import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system="You are a helpful AI tutor.",
    messages=[
        {"role": "user", "content": "Explain what a neural network is in simple terms."}
    ]
)

print(message.content[0].text)
print(f"\nInput tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")

💬 Multi-turn Conversations

LLMs are stateless — each request must include the full conversation history. Maintain a messages list and append each turn manually.

Python — Conversation Loop
from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You are a friendly tutor."}]

def chat(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.7,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

# Simulate a conversation
print(chat("What is gradient descent?"))
print(chat("Can you give me a real-world analogy?"))
print(chat("How does it relate to backpropagation?"))

# messages list now contains the full history
print(f"\nConversation length: {len(messages)} messages")
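Because the full history is resent on every turn, long conversations eventually exceed the context window. A common fix is trimming the oldest turns while always keeping the system message. A minimal sketch using the ~4 chars/token heuristic (production code should count with the model's real tokenizer):

Python — History Trimming

```python
def trim_history(messages: list[dict], max_tokens: int = 4000) -> list[dict]:
    """Keep the system message plus the newest turns that fit a token budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, budget = [], max_tokens
    for msg in reversed(rest):               # walk from newest to oldest
        cost = len(msg["content"]) // 4 + 4  # heuristic tokens + per-message overhead
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))

history = [{"role": "system", "content": "You are a friendly tutor."}]
history += [{"role": "user", "content": "x" * 400},       # ~104 tokens each
            {"role": "assistant", "content": "y" * 400},
            {"role": "user", "content": "z" * 400}]
trimmed = trim_history(history, max_tokens=250)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```

More sophisticated variants summarize the dropped turns into a single message instead of discarding them outright.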

🌊 Streaming Responses

Streaming sends tokens to your app as they're generated, instead of waiting for the full response. This dramatically improves perceived latency for users.

Python — Streaming
from openai import OpenAI

client = OpenAI()

with client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about machine learning."}],
    stream=True,
) as stream:
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # Print token by token
    print()  # Newline at end

🎛️ Key Generation Parameters

| Parameter | Range | Effect | Recommended |
|---|---|---|---|
| temperature | 0.0–2.0 | Higher = more random/creative output | 0.0 for factual, 0.7 for creative |
| top_p | 0.0–1.0 | Nucleus sampling; limits vocabulary to the top-p probability mass | 0.9 (use either temperature or top_p, not both) |
| max_tokens | 1–model limit | Maximum output length in tokens | 256–2048 for most tasks |
| frequency_penalty | −2.0–2.0 | Penalizes tokens proportionally to how often they have appeared | 0.3–0.5 to reduce repetition |
| presence_penalty | −2.0–2.0 | Penalizes tokens that have appeared at all (encourages new topics) | 0.5–1.0 for diverse outputs |
| stop | list of strings | Stops generation when any listed sequence is produced | ["###", "\n\n"] |
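To build intuition for what top_p actually does, here is a toy implementation of nucleus filtering over a hand-made distribution. This is illustrative only; real decoding operates on the model's full vocabulary and then samples from the kept set:

Python — Nucleus (top-p) Filtering, Toy Example

```python
def nucleus_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability reaches
    top_p, then renormalize so the kept probabilities sum to 1."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in items:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(nucleus_filter(probs, top_p=0.9))
# keeps "the", "a", "cat" (cumulative 0.95 >= 0.9); "zebra" is cut
```

Lowering top_p shrinks the kept set, so low-probability (often incoherent) tokens can never be sampled, while relative probabilities among plausible tokens are preserved.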

📋 Structured JSON Output

Use response_format to guarantee valid JSON output, useful for building applications that parse model responses.

Python — JSON Mode
from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Respond only with valid JSON."},
        {"role": "user", "content": (
            "Extract entities from: 'Sam Altman founded OpenAI in 2015 in San Francisco.' "
            "Return JSON with keys: people, organizations, locations, years."
        )},
    ]
)

data = json.loads(response.choices[0].message.content)
print(data)
# {"people": ["Sam Altman"], "organizations": ["OpenAI"],
#  "locations": ["San Francisco"], "years": [2015]}
✍️ Module 02

Prompt Engineering

Prompt engineering is the art of crafting inputs that reliably elicit the best possible outputs from an LLM. Small changes in phrasing can dramatically affect quality.

Learning Objectives

  • Apply zero-shot & role prompting
  • Use Chain-of-Thought reasoning
  • Implement self-consistency sampling
  • Build reusable prompt templates
  • Extract structured data reliably

🗺️ Prompting Techniques Overview

Zero-shot
Few-shot
Chain-of-Thought
Self-Consistency
Tree of Thought
Role Prompting

Complexity & effectiveness generally increase left → right

🎯 Zero-Shot Prompting

Zero-shot prompting relies on the model's pre-trained knowledge with no examples. Works well for common tasks. The key is a clear, specific instruction.

Python — Zero-Shot Classification
from openai import OpenAI

client = OpenAI()

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Classify the sentiment of the following text.
Respond with exactly one word: POSITIVE, NEGATIVE, or NEUTRAL.

Text: "{text}"

Sentiment:"""
        }],
        temperature=0,  # Deterministic for classification
        max_tokens=10,
    )
    return response.choices[0].message.content.strip()

texts = [
    "The model training finished 10x faster than expected!",
    "This API keeps returning errors and I can't figure out why.",
    "The paper proposes a new attention mechanism."
]

for t in texts:
    print(f"'{t[:40]}...' → {classify_sentiment(t)}")

🔗 Chain-of-Thought (CoT)

CoT prompts the model to show its reasoning step-by-step before giving an answer. This dramatically improves performance on math, logic, and multi-step reasoning tasks.

❌ Without CoT

Prompt
prompt = """
Q: A train travels 150 km in 2 hours,
then 200 km in 3 hours. What is its
average speed for the whole trip?

A:"""
# Without reasoning, the model may average the two speeds:
# (75 + 66.7) / 2 ≈ 70.8 km/h — wrong

✅ With CoT

Prompt
prompt = """
Q: A train travels 150 km in 2 hours,
then 200 km in 3 hours. What is its
average speed for the whole trip?

A: Let me think step by step.
"""
# Gets correct answer: 350/5 = 70 km/h
# (total distance / total time)
Python — Zero-Shot CoT ("Let's think step by step")
from openai import OpenAI

client = OpenAI()

def solve_with_cot(problem: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a careful, logical problem solver."},
            {"role": "user", "content": f"{problem}\n\nLet's think step by step:"}
        ],
        temperature=0,
    )
    return response.choices[0].message.content

problem = """
If I have 5 apples and give away 2/5 of them, then receive 3 more,
and finally share equally with one friend, how many do I end up with?
"""
print(solve_with_cot(problem))

♻️ Self-Consistency

Generate multiple CoT responses with high temperature, then take the majority vote over the final answers. This ensembling can substantially reduce errors on arithmetic and commonsense reasoning tasks.

Python — Self-Consistency Voting
from openai import OpenAI
from collections import Counter
import re

client = OpenAI()

def self_consistency(question: str, num_samples: int = 5) -> str:
    """Sample multiple CoT paths and majority-vote the final answer."""
    system = (
        "Solve the problem step by step. "
        "End your response with 'Final answer: '"
    )
    answers = []
    for _ in range(num_samples):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system},
                {"role": "user",   "content": question}
            ],
            temperature=0.8,  # High temp for diverse reasoning paths
        )
        text = resp.choices[0].message.content
        # Extract the final answer
        match = re.search(r"Final answer:\s*(.+)", text, re.IGNORECASE)
        if match:
            answers.append(match.group(1).strip().lower())

    if not answers:
        return "Could not extract answers"

    # Majority vote
    most_common, count = Counter(answers).most_common(1)[0]
    print(f"Votes: {dict(Counter(answers))}")
    print(f"Majority ({count}/{num_samples}): {most_common}")
    return most_common

result = self_consistency(
    "A bat and ball cost $1.10 total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

🎭 Role Prompting

Assigning a specific role or persona in the system prompt activates relevant knowledge patterns and adjusts the model's communication style.

Python — Role-Based System Prompts
from openai import OpenAI

client = OpenAI()

ROLES = {
    "code_reviewer": """You are a senior software engineer with 15 years of experience.
Review code for: correctness, performance, security vulnerabilities, and maintainability.
Be specific and actionable in your feedback.""",

    "socratic_tutor": """You are a Socratic tutor. Never give direct answers.
Instead, guide students to discover answers themselves through carefully crafted questions.
Ask one question at a time.""",

    "ux_critic": """You are a UX researcher with expertise in cognitive load theory.
Evaluate designs from the user's perspective. Cite specific usability heuristics.""",
}

def ask_expert(role_key: str, question: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": ROLES[role_key]},
            {"role": "user",   "content": question}
        ],
        temperature=0.7,
    ).choices[0].message.content

# Example usage
review = ask_expert("code_reviewer", """
Review this Python function:
def get_user(id):
    return db.execute(f"SELECT * FROM users WHERE id = {id}")
""")
⚠️
The code above has a SQL injection vulnerability — that's intentional so the code reviewer can catch it! Always use parameterized queries: db.execute("SELECT * FROM users WHERE id = ?", (id,))

📝 Reusable Prompt Templates

Build parameterized templates to standardize prompts across your application and make them easier to iterate on.

Python — Template System
from string import Template
from openai import OpenAI

client = OpenAI()

# Define reusable templates
SUMMARIZE_TEMPLATE = Template("""
You are an expert at summarizing $domain content.
Summarize the following text in exactly $num_points bullet points.
Focus on: $focus_areas.
Each bullet should be one clear, concise sentence.

TEXT:
$text

SUMMARY:
""")

EXTRACT_TEMPLATE = Template("""
Extract all $entity_type from the following text.
Return as a JSON array of strings.
If none found, return an empty array [].

TEXT: $text
""")

def summarize(text: str, domain="technical", points=3, focus="key findings"):
    prompt = SUMMARIZE_TEMPLATE.substitute(
        domain=domain,
        num_points=points,
        focus_areas=focus,
        text=text
    )
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    ).choices[0].message.content

paper_abstract = """
We present GPT-4, a large multimodal model capable of processing image and text inputs
and producing text outputs. GPT-4 exhibits human-level performance on various professional
and academic benchmarks...
"""
print(summarize(paper_abstract, domain="AI research", points=4, focus="contributions, methods, results"))
🎯 Module 03

Few-Shot Learning

Few-shot learning uses a small number of examples (shots) within the prompt to teach the model a new task without any gradient updates — this is called in-context learning.

Learning Objectives

  • Understand in-context learning
  • Format examples effectively
  • Select high-quality shots
  • Build dynamic few-shot retrieval
  • Know when few-shot beats zero-shot

🧠 How In-Context Learning Works

Large language models develop the ability to learn new tasks by observing demonstrations in their context window. No weight updates occur — the model uses pattern matching and analogy from its pre-training.

Prompt Structure
[System: You are a sentiment classifier]
Text: "Great product!" → Label: POSITIVE
Text: "Terrible experience." → Label: NEGATIVE
Text: "It works fine." → Label: NEUTRAL
Text: "Loved it!" → Label: ???
3 examples (shots) teach the format → model predicts POSITIVE
🔮
Why it works: During pre-training on internet text, the model sees countless input→output patterns. Few-shot examples activate the right "circuit" for the task by providing clear format and semantics cues.

📋 Basic Few-Shot Format

Python — Few-Shot Classification
from openai import OpenAI

client = OpenAI()

def few_shot_classify(examples: list[dict], query: str) -> str:
    """
    examples: list of {"input": ..., "label": ...}
    query: the text to classify
    """
    # Build the few-shot prompt
    shots = "\n".join([
        f'Text: "{ex["input"]}"\nLabel: {ex["label"]}'
        for ex in examples
    ])

    prompt = f"""Classify the sentiment of text as POSITIVE, NEGATIVE, or NEUTRAL.

{shots}
Text: "{query}"
Label:"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=10,
    )
    return response.choices[0].message.content.strip()

# Examples covering all three classes
examples = [
    {"input": "The delivery was incredibly fast!", "label": "POSITIVE"},
    {"input": "Completely broken on arrival.",      "label": "NEGATIVE"},
    {"input": "It does what it says.",              "label": "NEUTRAL"},
    {"input": "Best purchase I've made this year!", "label": "POSITIVE"},
    {"input": "Won't be buying from them again.",   "label": "NEGATIVE"},
]

queries = [
    "Works as expected, nothing special.",
    "Absolutely love this product!",
    "Stopped working after one week."
]

for q in queries:
    label = few_shot_classify(examples, q)
    print(f"'{q}' → {label}")

🔄 Dynamic Example Selection

Instead of using fixed examples, retrieve the most semantically similar examples to the query. This improves performance especially for diverse or edge-case inputs.

Python — Semantic Example Retrieval
# pip install openai numpy
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return resp.data[0].embedding

def cosine_similarity(a, b) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class DynamicFewShot:
    def __init__(self, examples: list[dict]):
        """examples: list of {"input": ..., "output": ..., "embedding": None}"""
        self.examples = examples
        # Pre-compute embeddings for all examples
        for ex in self.examples:
            ex["embedding"] = get_embedding(ex["input"])

    def get_top_k(self, query: str, k: int = 3) -> list[dict]:
        q_emb = get_embedding(query)
        scored = [
            (cosine_similarity(q_emb, ex["embedding"]), ex)
            for ex in self.examples
        ]
        scored.sort(key=lambda x: x[0], reverse=True)
        return [ex for _, ex in scored[:k]]

    def predict(self, query: str, k: int = 3) -> str:
        top_k = self.get_top_k(query, k)
        shots = "\n".join([
            f'Input: {ex["input"]}\nOutput: {ex["output"]}'
            for ex in top_k
        ])
        prompt = f"Transform the input as shown:\n\n{shots}\nInput: {query}\nOutput:"
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip()

# Example: date format normalization
examples = [
    {"input": "January 5th, 2024",     "output": "2024-01-05"},
    {"input": "March 22, 2023",        "output": "2023-03-22"},
    {"input": "Dec 31st 2022",         "output": "2022-12-31"},
    {"input": "15 August 2024",        "output": "2024-08-15"},
    {"input": "July 4, 2025",          "output": "2025-07-04"},
    {"input": "February 14th, 2024",   "output": "2024-02-14"},
]

dfs = DynamicFewShot(examples)
print(dfs.predict("October 3rd, 2024"))   # → 2024-10-03
print(dfs.predict("11 November 2025"))    # → 2025-11-11

✅ Best Practices

| Principle | Do | Avoid |
|---|---|---|
| Diversity | Cover all output classes/formats in examples | Examples that are all similar to each other |
| Format | Use identical I/O format for all shots | Inconsistent spacing, punctuation, or structure |
| Quality | Use high-quality, verified example pairs | Incorrect labels, which anchor the model to wrong patterns |
| Order | Put the most relevant example last (recency bias) | Random ordering for critical tasks |
| Count | 3–8 shots for most tasks | Filling the entire context window with shots |
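The diversity and count principles are easy to enforce mechanically before a prompt ever reaches the model. A hypothetical helper (check_shot_coverage is a sketch, not a library function) that flags common problems in a shot set:

Python — Shot-Set Sanity Check

```python
def check_shot_coverage(examples: list[dict], labels: set[str]) -> list[str]:
    """Return warnings about a few-shot example set: missing label
    coverage and shot counts outside the typical 3-8 range."""
    warnings = []
    seen = {ex["label"] for ex in examples}
    missing = labels - seen
    if missing:
        warnings.append(f"No examples for labels: {sorted(missing)}")
    if not 3 <= len(examples) <= 8:
        warnings.append(f"{len(examples)} shots; 3-8 is typical")
    return warnings

examples = [
    {"input": "Great!", "label": "POSITIVE"},
    {"input": "Awful.", "label": "NEGATIVE"},
]
print(check_shot_coverage(examples, {"POSITIVE", "NEGATIVE", "NEUTRAL"}))
# ["No examples for labels: ['NEUTRAL']", '2 shots; 3-8 is typical']
```

Running a check like this in CI keeps prompt edits from silently dropping a class.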
🔧 Module 04

Supervised Fine-Tuning (SFT)

SFT adapts a pre-trained LLM to a specific domain or task by training on labeled instruction-response pairs, updating model weights using gradient descent.

Learning Objectives

  • Prepare instruction-tuning datasets
  • Understand LoRA / QLoRA
  • Run training with HuggingFace + TRL
  • Choose hyperparameters
  • Evaluate fine-tuned models

🤔 Why Fine-Tune?

Prompt engineering has limits. When you need consistent format/style, domain-specific knowledge, lower latency, or lower API cost, fine-tuning is the answer.

| Approach | When to Use | Trade-offs |
|---|---|---|
| Prompting | General tasks, quick iteration | High token cost, inconsistent format |
| Few-Shot | Format adaptation | Consumes context window; every request must carry the shots |
| SFT | Domain expertise, consistent behavior | Needs data & compute; risk of overfitting |

📦 Data Preparation

SFT requires a dataset of (instruction, response) pairs in a chat format. Quality matters far more than quantity — 1,000 excellent examples beat 100,000 mediocre ones.

Python — Dataset Preparation (ShareGPT format)
import json
from datasets import Dataset

# ShareGPT format (widely supported)
data = [
    {
        "conversations": [
            {"from": "system",    "value": "You are an expert Python tutor."},
            {"from": "human",     "value": "How do I read a CSV file in Python?"},
            {"from": "gpt",       "value": "Use the `csv` module or pandas:\n\n```python\nimport pandas as pd\ndf = pd.read_csv('file.csv')\nprint(df.head())\n```\n\nFor large files, use `chunksize=1000` to process in batches."}
        ]
    },
    {
        "conversations": [
            {"from": "system",    "value": "You are an expert Python tutor."},
            {"from": "human",     "value": "What is a list comprehension?"},
            {"from": "gpt",       "value": "A list comprehension is a concise way to create lists:\n\n```python\n# Traditional\nresult = []\nfor x in range(10):\n    if x % 2 == 0:\n        result.append(x**2)\n\n# List comprehension\nresult = [x**2 for x in range(10) if x % 2 == 0]\n```\n\nThey're faster and more Pythonic than equivalent loops."}
        ]
    },
    # ... add thousands more examples
]

# Save as JSONL
with open("train.jsonl", "w") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")

# Load with HuggingFace datasets
dataset = Dataset.from_list(data)
dataset = dataset.train_test_split(test_size=0.05)
print(f"Train: {len(dataset['train'])}, Val: {len(dataset['test'])}")

⚡ LoRA: Low-Rank Adaptation

Full fine-tuning updates all ~7B+ parameters — expensive and prone to catastrophic forgetting. LoRA freezes the original weights and injects small trainable matrices that capture task-specific updates.

Effective weight: W₀ + B·A
  • W₀: original weight, FROZEN ❄️
  • B·A: LoRA adapter, TRAINABLE 🔥 (~1% of parameters)

A is rank×d, B is d×rank. Typical rank = 8–64. Only A and B are trained.

LoRA rank (r)
Higher rank = more capacity but more parameters. r=8 for simple tasks, r=64 for complex ones.
Alpha (α)
Scaling factor = α/r. Keep α = 2×r for stable training (e.g., r=16, α=32).
Target modules
Which attention layers to apply LoRA to: q_proj, v_proj (at minimum), or all projection layers.
QLoRA
Quantize base model to 4-bit (NF4), then apply LoRA. Fits 70B models on a single consumer GPU.
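The parameter savings are easy to compute directly: for a single d_out×d_in projection, the adapter adds only d_out·r + r·d_in weights. A quick sketch, not tied to any particular model:

Python — LoRA Parameter Savings

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Parameters in the full weight matrix vs. its LoRA adapter
    (B: d_out x r, A: r x d_in)."""
    full = d_out * d_in
    adapter = d_out * r + r * d_in
    return full, adapter

# A 4096x4096 attention projection with rank r=16:
full, adapter = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
# full: 16,777,216  adapter: 131,072  ratio: 0.78%
```

Doubling the rank doubles the adapter size but leaves it tiny relative to the frozen base, which is why LoRA checkpoints are megabytes instead of gigabytes.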

🚀 Training with TRL + PEFT

Python — SFT Training (TRL SFTTrainer)
# pip install transformers trl peft accelerate bitsandbytes datasets
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B"

# ── 4-bit Quantization (QLoRA) ──────────────────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# ── LoRA Configuration ───────────────────────────────────
lora_config = LoraConfig(
    r=16,                           # Rank
    lora_alpha=32,                  # Alpha = 2x rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 83,886,080 || all params: 8,114,278,400 || trainable%: 1.03%

# ── Training Arguments ───────────────────────────────────
training_args = SFTConfig(
    output_dir="./llama3-sft",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,    # Effective batch = 8
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    bf16=True,
    max_seq_length=2048,
    dataset_text_field="text",        # Column containing formatted text
    report_to="wandb",                # Optional: experiment tracking
)

# ── Load Dataset ─────────────────────────────────────────
dataset = load_dataset("json", data_files={"train": "train.jsonl"}, split="train")

def format_conversation(example):
    """Convert ShareGPT format to training text."""
    messages = example["conversations"]
    text = ""
    for msg in messages:
        if msg["from"] == "system":
            text += f"<|system|>\n{msg['value']}\n"
        elif msg["from"] == "human":
            text += f"<|user|>\n{msg['value']}\n"
        elif msg["from"] == "gpt":
            text += f"<|assistant|>\n{msg['value']}\n"
    return {"text": text}

dataset = dataset.map(format_conversation)

# ── Start Training ───────────────────────────────────────
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("./llama3-sft-final")
print("Training complete!")

🔗 Merging LoRA Weights & Inference

Python — Merge & Run Inference
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
LORA_ADAPTER = "./llama3-sft-final"

# Load base model in full precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Load and merge LoRA weights into base model
model = PeftModel.from_pretrained(base_model, LORA_ADAPTER)
model = model.merge_and_unload()  # Merge weights, remove LoRA modules

# Save merged model
model.save_pretrained("./llama3-merged")
tokenizer.save_pretrained("./llama3-merged")

# Run inference
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "<|system|>\nYou are an expert Python tutor.\n<|user|>\nExplain decorators.\n<|assistant|>\n"
output = pipe(prompt, max_new_tokens=256, temperature=0.7)
print(output[0]["generated_text"][len(prompt):])
🏆 Module 05

RL Training with LLM-as-Judge

Reinforcement Learning from Human Feedback (RLHF) aligns LLMs with human preferences. Using another LLM as a judge automates the reward signal at scale — no human labelers needed.

Learning Objectives

  • Understand RLHF pipeline
  • Implement PPO for LLMs
  • Apply DPO (simpler alternative)
  • Use GRPO for reasoning tasks
  • Build an LLM-as-judge reward model

🗺️ RLHF Overview

Classic RLHF has three stages. Using an LLM-as-judge replaces the expensive human preference collection step.

Stage 1 — Supervised Fine-Tuning
Instruction Data
SFT on Base LLM
SFT Model (Policy)
Stage 2 — Reward Modeling
Prompt
LLM Judge
Reward Score
Stage 3 — RL Optimization (PPO / DPO / GRPO)
SFT Policy
+
Reward
RL Update
Aligned Model

⚖️ LLM-as-Judge Reward Model

Instead of a trained reward model, use a capable LLM (e.g., GPT-4o) to score responses. This scales instantly and can evaluate nuanced qualities like helpfulness and harmlessness.

Python — LLM Judge Implementation
from openai import OpenAI
import json

client = OpenAI()

JUDGE_SYSTEM = """You are an expert AI judge evaluating the quality of AI assistant responses.
Score the response on a scale of 1-10 based on:
- Accuracy (is the information correct?)
- Helpfulness (does it fully address the question?)
- Clarity (is it easy to understand?)
- Safety (no harmful content?)

Respond with valid JSON only: {"score": <1-10>, "reasoning": "<one-sentence explanation>"}"""

def llm_judge(prompt: str, response: str) -> dict:
    """Score a response using GPT-4o as judge. Returns score and reasoning."""
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": f"Prompt: {prompt}\n\nResponse: {response}"}
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)

# Pairwise comparison (preferred for DPO data collection)
def pairwise_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Returns 'A', 'B', or 'tie'."""
    pairwise_prompt = f"""Which response is better?
Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}

Answer with JSON: {{"winner": "A" | "B" | "tie", "reasoning": "..."}}"""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an impartial AI judge."},
            {"role": "user",   "content": pairwise_prompt}
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    data = json.loads(result.choices[0].message.content)
    return data["winner"]

# Example usage
score = llm_judge(
    prompt="What is the capital of France?",
    response="Paris is the capital of France, known for the Eiffel Tower."
)
print(f"Score: {score['score']}/10 — {score['reasoning']}")

📐 PPO — Proximal Policy Optimization

PPO is the canonical RL algorithm for RLHF. It updates the policy (LLM) to maximize reward while staying close to the reference policy via a KL divergence penalty.

📚
Loss function: L = E[min(r·A, clip(r, 1-ε, 1+ε)·A)] − β·KL(π_θ ∥ π_ref)
where r = π_θ(a|s)/π_ref(a|s) is the probability ratio, A is advantage, β controls KL penalty.
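To make the clipping concrete, here is a toy per-token version of the surrogate term (a hypothetical helper for intuition, not part of any library):

```python
def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# With A > 0, gains from pushing the ratio beyond 1+eps are clipped away
print(ppo_clip_term(2.0, 1.0))   # 1.2, capped at (1+eps)*A
# With A < 0, min() takes the more pessimistic (lower) value
print(ppo_clip_term(0.5, -1.0))  # -0.8
```

The clip keeps any single update from moving the policy too far from where the responses were sampled; the KL term additionally anchors it to the reference model.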
Python — PPO with TRL
# pip install trl transformers peft
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer
from datasets import Dataset
import torch

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Load model with value head (needed for PPO)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)  # Reference model (frozen)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# PPO Config
ppo_config = PPOConfig(
    model_name=MODEL_ID,
    learning_rate=1.41e-5,
    batch_size=16,
    mini_batch_size=4,
    gradient_accumulation_steps=1,
    optimize_cuda_cache=True,
    kl_penalty="kl",        # Or "full" for full KL
    init_kl_coef=0.2,       # β: initial KL coefficient
    target_kl=6.0,          # Target KL divergence
    gamma=1.0,
    lam=0.95,               # GAE lambda
    cliprange=0.2,          # ε: clip range
    vf_coef=0.1,            # Value function coefficient
)

trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop
def chunks(seq, n):
    """Yield successive n-sized batches from seq."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

prompts = ["Explain quantum computing", "Write a poem about AI", ...]

for epoch in range(3):
    for batch_prompts in chunks(prompts, ppo_config.batch_size):
        # 1. Tokenize prompts
        query_tensors = [
            tokenizer.encode(p, return_tensors="pt").squeeze()
            for p in batch_prompts
        ]

        # 2. Generate responses
        response_tensors = trainer.generate(
            query_tensors,
            max_new_tokens=256,
            temperature=0.9,
        )

        # 3. Score with LLM judge
        rewards = []
        for prompt, response_tensor in zip(batch_prompts, response_tensors):
            response_text = tokenizer.decode(response_tensor)
            score = llm_judge(prompt, response_text)  # From previous code
            rewards.append(torch.tensor(score["score"] / 10.0))

        # 4. PPO update
        stats = trainer.step(query_tensors, response_tensors, rewards)
        print(f"Epoch {epoch} | Mean reward: {stats['ppo/mean_scores']:.3f} | KL: {stats['objective/kl']:.3f}")

🎯 DPO — Direct Preference Optimization

DPO eliminates the separate reward model and RL loop entirely. It directly optimizes the policy to prefer "chosen" responses over "rejected" ones using a simple cross-entropy loss.

DPO advantage: Much simpler than PPO — no value head, no separate reward model, no on-policy generation during training. Just a supervised loss on preference pairs.
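The loss itself is one line: for a pair with chosen response y_w and rejected y_l, DPO minimizes -log σ(β·[(log π_θ(y_w) − log π_ref(y_w)) − (log π_θ(y_l) − log π_ref(y_l))]). A minimal numeric sketch (hypothetical helper operating on summed log-probs, for intuition only):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given the summed log-probs of each
    response under the policy (pi_*) and the frozen reference (ref_*)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Loss drops as the policy shifts probability mass toward the chosen response
print(dpo_loss(-10.0, -20.0, -12.0, -18.0) < dpo_loss(-20.0, -10.0, -12.0, -18.0))  # True
```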
Python — DPO Dataset + Training
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
import torch

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# DPO requires preference pairs: (prompt, chosen, rejected)
# Generate these using LLM judge for pairwise comparison
dpo_data = [
    {
        "prompt":   "What is the best way to learn Python?",
        "chosen":   "Start with official tutorials, then build projects. Practice daily with small scripts before attempting large projects.",
        "rejected": "Just watch YouTube videos."
    },
    {
        "prompt":   "Explain recursion",
        "chosen":   "Recursion is when a function calls itself. Example: factorial(n) = n * factorial(n-1). Every recursive function needs a base case to stop.",
        "rejected": "It's a programming thing where functions call themselves."
    },
    # ... thousands more preference pairs
]

dataset = Dataset.from_list(dpo_data)

training_args = DPOConfig(
    output_dir="./llama3-dpo",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,             # Lower than SFT
    beta=0.1,                       # KL penalty (higher = closer to reference)
    max_length=1024,
    max_prompt_length=512,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,         # If None, uses a copy of model as reference
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("./llama3-dpo-final")

🔄 GRPO — Group Relative Policy Optimization

GRPO (from DeepSeek-R1) improves on PPO for reasoning tasks. It samples a group of responses per prompt, computes relative rewards within the group, and uses them as baselines — eliminating the value function entirely.

💡
Key insight: Instead of learning a value function V(s), GRPO estimates the baseline by averaging rewards across G sampled outputs for the same prompt. This is simpler and works extremely well for verifiable tasks (math, code).
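The group-relative advantage is easy to sketch: score G samples for the same prompt, then normalize each reward against the group's own statistics (hypothetical helper for intuition):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style baseline: each sample's advantage is its reward relative
    to the group mean, normalized by the group std (no value function)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:                      # all rewards equal, so no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# G = 4 sampled answers for one prompt: two correct (2.0), two wrong (0.0)
print(group_advantages([2.0, 0.0, 2.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```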
Python — GRPO with TRL
from trl import GRPOConfig, GRPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import re, torch

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# ── Reward Functions ─────────────────────────────────────
# GRPO supports multiple composable reward functions

def correctness_reward(completions, ground_truth, **kwargs) -> list[float]:
    """Verify math answers against ground truth."""
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        # Extract answer from <answer>...</answer> tags
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        # GSM8K-style ground truths end with "#### <final answer>"
        gt_answer = str(gt).split("####")[-1].strip()
        if match and match.group(1).strip() == gt_answer:
            rewards.append(2.0)   # Correct answer
        else:
            rewards.append(0.0)   # Wrong
    return rewards

def format_reward(completions, **kwargs) -> list[float]:
    """Reward responses that use the correct <think>/<answer> format."""
    rewards = []
    for c in completions:
        has_thinking = "<think>" in c and "</think>" in c
        has_answer   = "<answer>" in c and "</answer>" in c
        rewards.append(0.5 if (has_thinking and has_answer) else 0.0)
    return rewards

def length_penalty(completions, **kwargs) -> list[float]:
    """Penalize overly short or long responses."""
    rewards = []
    for c in completions:
        tokens = len(c.split())
        if 50 <= tokens <= 500:
            rewards.append(0.1)
        elif tokens < 20 or tokens > 1000:
            rewards.append(-0.2)
        else:
            rewards.append(0.0)
    return rewards

# ── GRPO Config ──────────────────────────────────────────
config = GRPOConfig(
    output_dir="./qwen-grpo-math",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=1e-6,
    num_generations=8,              # G: responses sampled per prompt
    max_prompt_length=512,
    max_completion_length=1024,
    beta=0.04,                      # KL coefficient
    bf16=True,
    logging_steps=10,
    reward_weights=[1.0, 0.5, 0.2], # Weights for reward functions
)

# Load math dataset (e.g., GSM8K)
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("answer", "ground_truth")

trainer = GRPOTrainer(
    model=model,
    tokenizer=tokenizer,
    config=config,
    train_dataset=dataset,
    reward_funcs=[correctness_reward, format_reward, length_penalty],
)

trainer.train()
| Algorithm | Requires | Best For | Complexity |
|---|---|---|---|
| PPO | Reward model + value head | General alignment, chat | High |
| DPO | Preference pairs | Style/safety alignment | Low |
| GRPO | Verifiable reward function | Math, code, reasoning | Medium |
🔍 Module 06

Retrieval-Augmented Generation (RAG)

RAG grounds LLM responses in your own data by retrieving relevant documents at query time. This reduces hallucination and enables up-to-date, source-cited answers.

Learning Objectives

  • Build an end-to-end RAG pipeline
  • Choose chunking & embedding strategies
  • Use vector databases
  • Implement Graph RAG
  • Evaluate retrieval quality

🏗️ Regular RAG Pipeline

Indexing Phase (offline): Documents → Chunking → Embed Model → Vector DB
Query Phase (online): User Query → Embed Query → Top-K Retrieve → (retrieved chunks + query) → LLM → Answer

⚡ Complete RAG Implementation

Python — RAG Pipeline from Scratch
# pip install openai chromadb langchain-text-splitters pypdf
from openai import OpenAI
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("knowledge_base")

# ── STEP 1: Load & Chunk Documents ──────────────────────
def load_and_chunk(file_path: str, chunk_size=512, overlap=64) -> list[str]:
    text = Path(file_path).read_text(encoding="utf-8")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)

# ── STEP 2: Embed & Store ────────────────────────────────
def embed_texts(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [d.embedding for d in resp.data]

def index_document(file_path: str):
    chunks = load_and_chunk(file_path)
    embeddings = embed_texts(chunks)
    ids = [f"{file_path}_{i}" for i in range(len(chunks))]
    collection.add(
        ids=ids,
        embeddings=embeddings,
        documents=chunks,
        metadatas=[{"source": file_path, "chunk": i} for i in range(len(chunks))]
    )
    print(f"Indexed {len(chunks)} chunks from {file_path}")

# ── STEP 3: Retrieve ─────────────────────────────────────
def retrieve(query: str, k: int = 5) -> list[dict]:
    query_embedding = embed_texts([query])[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
        include=["documents", "metadatas", "distances"]
    )
    chunks = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        chunks.append({
            "text": doc,
            "source": meta["source"],
            "score": 1 - dist  # Convert distance to similarity
        })
    return chunks

# ── STEP 4: Generate Answer ──────────────────────────────
def rag_query(question: str, k: int = 5) -> dict:
    # Retrieve relevant chunks
    chunks = retrieve(question, k=k)

    # Build context
    context = "\n\n---\n\n".join([
        f"[Source: {c['source']}, Score: {c['score']:.2f}]\n{c['text']}"
        for c in chunks
    ])

    # Generate with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are a helpful assistant.
Answer the user's question based ONLY on the provided context.
If the answer is not in the context, say "I don't have enough information to answer this."
Always cite your sources."""},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0,
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": [c["source"] for c in chunks],
        "chunks_used": len(chunks)
    }

# ── Usage ─────────────────────────────────────────────────
index_document("company_docs.txt")
index_document("product_manual.pdf")

result = rag_query("What are the system requirements?")
print(result["answer"])
print(f"\nSources: {result['sources']}")

✂️ Chunking Strategies

| Strategy | Description | Best For |
|---|---|---|
| Fixed Size | Split every N tokens/chars with overlap | General text, simple documents |
| Recursive | Try paragraph → sentence → word splits | Prose, books, articles |
| Semantic | Split on topic/semantic boundaries using embeddings | Multi-topic docs, high accuracy needs |
| Document-aware | Markdown headers, HTML tags, code blocks | Structured docs, code files |
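The first strategy, fixed-size chunking with overlap, is simple enough to sketch directly (a character-based toy; real pipelines usually split on tokens):

```python
def fixed_size_chunks(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks, repeating the last `overlap`
    characters of each chunk at the start of the next to preserve context."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = fixed_size_chunks("x" * 1000)
print(len(chunks))                          # 3
print(chunks[0][-64:] == chunks[1][:64])    # True: overlap is preserved
```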

🕸️ Graph RAG

Graph RAG builds a knowledge graph from documents — extracting entities and relationships — then traverses the graph during retrieval to find non-obvious connections that pure vector search misses.

Graph RAG Architecture
Indexing: Documents → Entity & Relation Extraction → Knowledge Graph (Neo4j)
At query time: Query → Graph Traversal + Vector Search → Sub-graph → Answer
Python — Graph RAG with LLM Extraction
# pip install openai neo4j
from openai import OpenAI
from neo4j import GraphDatabase
import json

client = OpenAI()
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# ── Extract Knowledge Graph from Text ────────────────────
def extract_knowledge(text: str) -> dict:
    """Extract entities and relationships using LLM."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Extract a knowledge graph from the text.
Return JSON with:
- "entities": [{"name": str, "type": str, "description": str}]
- "relations": [{"from": str, "relation": str, "to": str}]"""},
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

# ── Store in Neo4j ────────────────────────────────────────
def store_graph(kg: dict):
    with driver.session() as session:
        # Create entity nodes
        for entity in kg["entities"]:
            session.run(
                "MERGE (e:Entity {name: $name}) SET e.type=$type, e.description=$desc",
                name=entity["name"], type=entity["type"], desc=entity["description"]
            )
        # Create relationship edges
        for rel in kg["relations"]:
            session.run(
                """MATCH (a:Entity {name: $from}), (b:Entity {name: $to})
                   MERGE (a)-[r:RELATION {type: $rel}]->(b)""",
                **{"from": rel["from"], "to": rel["to"], "rel": rel["relation"]}
            )

# ── Graph Traversal Query ────────────────────────────────
def graph_retrieve(query: str, hops: int = 2) -> str:
    """Retrieve a subgraph relevant to the query via entity matching + traversal."""
    # Extract query entities
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f'Extract the main entity names from this query as JSON: {{"entities": ["..."]}}. Query: {query}'
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    entities = json.loads(resp.choices[0].message.content).get("entities", [])

    results = []
    with driver.session() as session:
        for entity in entities[:3]:  # Limit to top 3
            records = session.run(f"""
                MATCH path = (start:Entity)-[*1..{hops}]-(connected)
                WHERE start.name CONTAINS $name
                RETURN [node in nodes(path) | node.name + ': ' + node.description] as chain,
                       [rel in relationships(path) | type(rel)] as rels
                LIMIT 10
            """, name=entity)
            for r in records:
                results.append(" → ".join(r["chain"]))
    return "\n".join(results)

def graph_rag_query(question: str) -> str:
    subgraph_context = graph_retrieve(question)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using the knowledge graph context provided."},
            {"role": "user", "content": f"Knowledge Graph:\n{subgraph_context}\n\nQuestion: {question}"}
        ],
    )
    return resp.choices[0].message.content

# Example
text = "Apple was founded by Steve Jobs and Steve Wozniak in 1976. Jobs later launched the iPhone in 2007."
kg = extract_knowledge(text)
store_graph(kg)
answer = graph_rag_query("What did Steve Jobs create after founding Apple?")
print(answer)

📊 Evaluating RAG Quality

Context Recall
Are all the facts needed to answer the question present in the retrieved chunks?
Context Precision
Are retrieved chunks relevant? Low precision = noisy context that confuses the LLM.
Faithfulness
Does the generated answer stay faithful to the retrieved context, or does it hallucinate?
Answer Relevance
Does the answer actually address the original question asked?
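Answer relevance is commonly approximated as the cosine similarity between the embedding of the original question and the embedding of the answer (or of a question re-generated from the answer). The similarity itself is a few lines:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```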
Python — RAG Evaluation with RAGAS
# pip install ragas datasets
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["What year was the company founded?"],
    "answer":   ["The company was founded in 1995."],
    "contexts": [["Company History: Founded in 1995 by John Smith..."]],
    "ground_truth": ["1995"],
}

dataset = Dataset.from_dict(eval_data)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall, context_precision])
print(result)
🕸️ Module 07

Agent Systems

AI agents are LLM-powered systems that can reason, plan, use tools, and take actions in an environment. Agents can work alone or as part of collaborative multi-agent systems.

Learning Objectives

  • Implement the ReAct reasoning loop
  • Define and use tools / function calling
  • Add memory to agents
  • Build multi-agent workflows
  • Use LangGraph for stateful agents

🤖 What is an Agent?

An agent is an LLM in an action loop: it perceives state, reasons about what to do, calls a tool, observes the result, and repeats until the task is complete.

Agent Loop (ReAct: Reason + Act)
User Goal → Reason (what should I do?) → Act (call tool / API) → Observe (tool result added to context) → Done? (answer, or loop again)

  • Tools / Functions: external capabilities such as web search, code execution, database queries, API calls, file I/O.
  • Memory: short-term (conversation history), long-term (vector DB), episodic (past interactions).
  • Planning: breaking complex goals into sub-tasks. ReAct, CoT, ToT, and plan-and-execute patterns.
  • State: what the agent knows about the world and its progress toward the goal.

🛠️ Single Agent with Tool Use

OpenAI's function calling API lets you define tools as JSON schemas. The model decides when to call a tool and what arguments to pass.

Python — Single Agent with Tools
from openai import OpenAI
import json, math, datetime

client = OpenAI()

# ── Define Tools ─────────────────────────────────────────
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate a mathematical expression. Returns the numeric result.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string", "description": "Math expression, e.g. '2 ** 10 + sqrt(16)'"}
                },
                "required": ["expression"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_current_time",
            "description": "Get the current date and time in ISO format.",
            "parameters": {"type": "object", "properties": {}}
        }
    },
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for information. Returns snippets.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"]
            }
        }
    }
]

# ── Tool Implementations ──────────────────────────────────
def calculator(expression: str) -> str:
    try:
        # Restricted eval with a math-only namespace (demo; eval is not a true sandbox)
        safe_env = {k: getattr(math, k) for k in dir(math) if not k.startswith('_')}
        result = eval(expression, {"__builtins__": {}}, safe_env)
        return str(result)
    except Exception as e:
        return f"Error: {e}"

def get_current_time() -> str:
    return datetime.datetime.now().isoformat()

def web_search(query: str) -> str:
    # Stub — replace with real search API (Tavily, SerpAPI, etc.)
    return f"Search results for '{query}': [This is a demo stub. Integrate Tavily API for real results.]"

TOOL_MAP = {
    "calculator": calculator,
    "get_current_time": get_current_time,
    "web_search": web_search,
}

# ── Agent Loop ────────────────────────────────────────────
def run_agent(user_goal: str, max_steps: int = 10) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful AI agent. Use tools when needed to answer accurately."},
        {"role": "user", "content": user_goal}
    ]

    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
        )

        msg = response.choices[0].message
        messages.append(msg)  # Add assistant message to history

        # Check if done (no tool calls)
        if not msg.tool_calls:
            print(f"Completed in {step + 1} steps.")
            return msg.content

        # Execute each tool call
        for tool_call in msg.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)
            print(f"  [Step {step+1}] Calling {fn_name}({fn_args})")

            fn_result = TOOL_MAP[fn_name](**fn_args)

            # Add tool result to messages
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "name": fn_name,
                "content": fn_result,
            })

    return "Max steps reached without completing the task."

# ── Run the Agent ─────────────────────────────────────────
result = run_agent("What is 2^32, and what time is it right now?")
print(result)

🧠 Adding Memory to Agents

Python — Agent with Vector Memory
from openai import OpenAI
import chromadb
from datetime import datetime

client = OpenAI()
chroma = chromadb.Client()
memory_store = chroma.create_collection("agent_memory")

class AgentWithMemory:
    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self.short_term = []  # Recent conversation turns
        self.max_short_term = 20

    def remember(self, text: str, metadata: dict = None):
        """Store a memory in long-term vector store."""
        embedding = client.embeddings.create(
            model="text-embedding-3-small", input=text
        ).data[0].embedding

        memory_store.add(
            ids=[f"{self.agent_id}_{datetime.now().timestamp()}"],
            embeddings=[embedding],
            documents=[text],
            metadatas=[{"agent": self.agent_id, "timestamp": str(datetime.now()), **(metadata or {})}]
        )

    def recall(self, query: str, k: int = 3) -> list[str]:
        """Retrieve relevant memories."""
        q_emb = client.embeddings.create(
            model="text-embedding-3-small", input=query
        ).data[0].embedding

        results = memory_store.query(
            query_embeddings=[q_emb],
            n_results=k,
            where={"agent": self.agent_id}
        )
        return results["documents"][0] if results["documents"] else []

    def chat(self, user_input: str) -> str:
        # Retrieve relevant memories
        memories = self.recall(user_input)
        memory_context = "\n".join([f"- {m}" for m in memories])

        # Build messages with memory
        system = f"""You are a helpful assistant with a persistent memory.
Relevant memories from past interactions:
{memory_context if memories else "No relevant memories yet."}"""

        self.short_term.append({"role": "user", "content": user_input})

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": system}] + self.short_term[-self.max_short_term:],
        )

        reply = response.choices[0].message.content
        self.short_term.append({"role": "assistant", "content": reply})

        # Store important things in long-term memory
        self.remember(f"User said: {user_input}")
        self.remember(f"I responded: {reply[:200]}")

        return reply

agent = AgentWithMemory("assistant_1")
print(agent.chat("My name is Alice and I'm working on a Python RAG project."))
print(agent.chat("What was I working on?"))  # Should recall from memory

🕸️ Multi-Agent Systems

Multiple specialized agents collaborate, each focusing on what it does best. Common patterns: Orchestrator-Worker, Pipeline, and Debate.

Orchestrator-Worker Pattern
User Goal → Orchestrator (plans & delegates) → workers: Researcher (web search), Coder (write code), Critic (review & verify), Writer (draft output)
Python — Multi-Agent with LangGraph
# pip install langgraph langchain-openai
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage

llm = ChatOpenAI(model="gpt-4o", temperature=0.3)

# ── Define Shared State ───────────────────────────────────
class ResearchState(TypedDict):
    messages: Annotated[list, add_messages]
    research_notes: str
    draft: str
    critique: str
    final_output: str
    task: str

# ── Define Agents (Nodes) ─────────────────────────────────
def researcher_agent(state: ResearchState) -> dict:
    """Gathers information relevant to the task."""
    response = llm.invoke([
        SystemMessage(content="You are a research expert. Gather key facts and insights."),
        HumanMessage(content=f"Research this topic thoroughly: {state['task']}")
    ])
    return {"research_notes": response.content}

def writer_agent(state: ResearchState) -> dict:
    """Drafts content based on research."""
    response = llm.invoke([
        SystemMessage(content="You are an expert technical writer. Write clearly and accurately."),
        HumanMessage(content=f"""
Task: {state['task']}
Research Notes: {state['research_notes']}

Write a comprehensive, well-structured response.""")
    ])
    return {"draft": response.content}

def critic_agent(state: ResearchState) -> dict:
    """Reviews and critiques the draft."""
    response = llm.invoke([
        SystemMessage(content="You are a critical reviewer. Find factual errors, gaps, and improvements."),
        HumanMessage(content=f"""
Original Task: {state['task']}
Draft to Review: {state['draft']}

Provide specific, actionable critique. Rate quality 1-10.""")
    ])
    return {"critique": response.content}

def reviser_agent(state: ResearchState) -> dict:
    """Revises based on critique."""
    response = llm.invoke([
        SystemMessage(content="You are a skilled editor. Improve the draft based on critique."),
        HumanMessage(content=f"""
Task: {state['task']}
Original Draft: {state['draft']}
Critique: {state['critique']}

Produce the final, polished version.""")
    ])
    return {"final_output": response.content}

# ── Build Graph ───────────────────────────────────────────
def build_research_pipeline() -> StateGraph:
    graph = StateGraph(ResearchState)

    # Add nodes
    graph.add_node("researcher", researcher_agent)
    graph.add_node("writer",     writer_agent)
    graph.add_node("critic",     critic_agent)
    graph.add_node("reviser",    reviser_agent)

    # Define edges (pipeline flow)
    graph.set_entry_point("researcher")
    graph.add_edge("researcher", "writer")
    graph.add_edge("writer",     "critic")
    graph.add_edge("critic",     "reviser")
    graph.add_edge("reviser",    END)

    return graph.compile()

# ── Run Pipeline ──────────────────────────────────────────
pipeline = build_research_pipeline()

result = pipeline.invoke({
    "task": "Explain how transformers work in modern LLMs",
    "messages": [],
    "research_notes": "",
    "draft": "",
    "critique": "",
    "final_output": "",
})

print("=== FINAL OUTPUT ===")
print(result["final_output"])
print("\n=== CRITIQUE ===")
print(result["critique"])

🔧 Agent Framework Comparison

| Framework | Best For | Key Feature | Complexity |
|---|---|---|---|
| LangGraph | Complex stateful workflows, DAGs | Graph-based state machines, cycles | Medium |
| CrewAI | Role-based multi-agent teams | Crew + Role abstractions, easy setup | Low |
| AutoGen | Conversational multi-agent | Agent conversations, code execution | Medium |
| Anthropic SDK | Production agents with Claude | Native tool use, streaming, vision | Low |
| Custom | Maximum control and performance | Build exactly what you need | High |

✅ Agent Design Best Practices

  • Design for failure: Agents will sometimes call wrong tools or loop. Add max step limits, error handling, and fallbacks.
  • Minimal tool surface: Give agents only the tools they need. Fewer tools = less confusion = more reliable behavior.
  • Structured tool outputs: Return consistent JSON from tools. Unstructured output confuses agents.
  • Observability: Log every tool call, reasoning step, and state transition. You need visibility to debug agents.
  • Human-in-the-loop: For high-stakes actions (deleting data, sending emails), require human approval before execution.
  • Idempotent tools: Design tools that can be safely retried without side effects (or track completed actions in state).
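The idempotency point can be made concrete with a small wrapper that caches results by tool-call id, so a retried call never repeats its side effect (a hypothetical sketch, not a framework API):

```python
class IdempotentToolRunner:
    """Cache tool results by call id so retries don't repeat side effects."""
    def __init__(self):
        self.completed: dict[str, str] = {}

    def run(self, call_id: str, fn, *args) -> str:
        if call_id in self.completed:       # already executed: return cached result
            return self.completed[call_id]
        result = fn(*args)
        self.completed[call_id] = result
        return result

sent = []
def send_email(to: str) -> str:
    sent.append(to)                         # side effect we must not repeat
    return f"emailed {to}"

runner = IdempotentToolRunner()
runner.run("call_1", send_email, "alice@example.com")
runner.run("call_1", send_email, "alice@example.com")  # retry: served from cache
print(len(sent))  # 1
```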
⚠️
Security warning: Never let agents execute arbitrary code from untrusted sources. Sandbox code execution with tools like E2B or Docker. Validate all tool inputs and limit permissions.