🤖
Learn AI
Student's Complete Guide
🎓 Practical AI Education

Master AI Engineering
From APIs to Agents

7 structured modules with real Python code, 350 interview Q&As, and an AI-powered mock interview engine. Everything you need to go from zero to job-ready.

7
Modules
350
Q&As
40+
Code Examples
Free
To Join
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain RAG"}],
)
print(response.choices[0].message.content)
✅ No subscription required to read · ✅ Real Python code in every module · ✅ Up-to-date with latest models · ✅ Practice with mock interviews

A Complete AI Engineering Curriculum

Sign in to unlock all 7 modules and track your progress

🤖
🔒
Module 01
Using LLMs
Connect to LLM APIs, stream tokens, and control generation parameters.
OpenAI · Anthropic · Streaming
✍️
🔒
Module 02
Prompt Engineering
Master chain-of-thought, self-consistency, and structured output.
CoT · Zero-shot · JSON Output
🎯
🔒
Module 03
Few-Shot Learning
Teach models new tasks at inference time with curated examples.
In-Context · Example Selection
🔧
🔒
Module 04
Supervised Fine-Tuning
Adapt LLMs with LoRA, QLoRA, and HuggingFace Transformers.
LoRA · QLoRA · PEFT
🏆
🔒
Module 05
RL with LLM-as-Judge
Align LLMs using PPO, DPO, and GRPO with LLM-based rewards.
PPO · DPO · GRPO
🔍
🔒
Module 06
RAG Systems
Build retrieval-augmented pipelines with vector search and Graph RAG.
Vector DB · Graph RAG · LangChain
🕸️
🔒
Module 07
Agent Systems
Design single and multi-agent systems with tools, memory, and planning.
ReAct · Multi-Agent · LangGraph
🚀
Unlock All Modules
Create a free account to access everything

Learn. Practice. Get Hired.

01

Sign Up Free

Create your account in seconds. No credit card required. Instant access to all 7 modules.

02

Learn with Code

Work through structured modules with real Python examples you can run immediately.

03

Practice Interviews

Test your skills with 350 AI interview questions and get instant feedback on your answers.

Simulate Real AI Interviews

Our mock interview engine samples 10 random questions from our 350-question bank, lets you write answers, then shows model answers side by side. Rate yourself, review your history, and track improvement.

  • ✅ 350 questions across all 7 topics
  • ✅ Easy / Medium / Hard difficulty filters
  • ✅ Timer mode for realistic pressure
  • ✅ Self-rating (1–5 stars) with history tracking
  • ✅ Browse full Q&A bank anytime
Medium · Module 06 · Q 3/10
What is the difference between dense and sparse retrieval in RAG systems?

Simple, Transparent Pricing

All course content is free. Pay only for mock interview sessions.

📚
Course Content
Free
Forever
  • ✓ All 7 modules
  • ✓ 40+ Python code examples
  • ✓ Architecture diagrams
  • ✓ Progress tracking
  • ✓ Browse Q&A bank
Pro Monthly
$99/mo
200 credits · 30 days
  • ✓ 200 credits monthly
  • ✓ 4 mock interviews/month
  • ✓ All free features
  • ✓ Priority support
  • ✓ Best value for job prep

Ready to master AI engineering?

Join today — it's completely free to start.

No credit card · No spam · Cancel anytime
🎓 Practical AI Education

Master Modern AI,
From APIs to Agents

A hands-on curriculum covering LLM usage, prompt engineering, fine-tuning, reinforcement learning, RAG, and autonomous agent systems — with real Python code examples.

7
Modules
40+
Code Examples
100%
Free

Course Modules

🤖
Module 01
Using LLMs
Connect to LLM APIs, send prompts, handle responses, and understand key generation parameters.
OpenAI · Anthropic · Streaming · Parameters
✍️
Module 02
Prompt Engineering
Master techniques like chain-of-thought, self-consistency, and structured output to elicit better responses.
CoT · Zero-shot · Role Prompting · JSON Output
🎯
Module 03
Few-Shot Learning
Teach the model new tasks at inference time by providing curated examples in the prompt context.
In-Context · Example Selection · Format
🔧
Module 04
Supervised Fine-Tuning
Adapt pre-trained LLMs to specific tasks using LoRA, QLoRA, and HuggingFace Transformers.
LoRA · QLoRA · PEFT · HuggingFace
🏆
Module 05
RL with LLM-as-Judge
Align LLMs using PPO, DPO, and GRPO with LLM-based reward signals instead of human labelers.
PPO · DPO · GRPO · RLHF
🔍
Module 06
RAG Systems
Build retrieval-augmented generation pipelines including vector search and knowledge graph approaches.
Vector DB · Embeddings · Graph RAG · LangChain
🕸️
Module 07
Agent Systems
Design single and multi-agent systems with tool use, memory, planning, and inter-agent communication.
ReAct · Tool Use · Multi-Agent · LangGraph
💡
How to use this course: Work through modules in order — each builds on concepts from the previous one. All code examples use Python and popular open-source libraries. Click a module card above or use the sidebar to navigate.
🤖 Module 01

How to Use Large Language Models

Learn to interact with LLM APIs programmatically — sending prompts, handling responses, streaming tokens, and controlling generation behavior with key parameters.

Learning Objectives

  • Call the OpenAI / Anthropic API
  • Understand chat message formats
  • Stream tokens in real-time
  • Tune temperature, top-p, max_tokens
  • Count tokens & estimate cost

🧠 What is an LLM?

A Large Language Model (LLM) is a neural network trained on vast amounts of text data to predict the next token in a sequence. Models like GPT-4, Claude, and Llama have billions of parameters and can understand and generate human-like text.

Modern LLMs are accessed via an API — you send a structured request with your conversation history, and the model returns a completion.

Context Window
Maximum number of tokens (input + output) the model can process at once. GPT-4o supports up to 128K tokens.
Token
A chunk of text (~4 characters on average in English). Common words are often a single token; rarer words are split into several.
Temperature
Controls randomness. 0 = deterministic, 1 = creative. Most tasks work well at 0.0–0.7.
System Prompt
A special instruction message that sets the model's role, persona, or constraints before the conversation.
ℹ️
Popular providers: OpenAI (GPT-4o, o1), Anthropic (Claude 3.5), Google (Gemini), Meta (Llama 3), Mistral AI. They all follow a similar chat-based API pattern.
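One of this module's objectives is estimating cost, and providers price per token, so a back-of-the-envelope calculation is worth having on hand. A minimal sketch using the ~4 characters/token heuristic; the prices below are illustrative placeholders, not current rates, so check your provider's pricing page (for exact token counts, use the provider's tokenizer, e.g. tiktoken for OpenAI):

```python
def estimate_tokens(text: str) -> int:
    """Heuristic token count: roughly 4 characters per token in English."""
    return max(1, len(text) // 4)

def estimate_cost_usd(prompt: str, expected_output_tokens: int,
                      input_price_per_1m: float = 2.50,
                      output_price_per_1m: float = 10.00) -> float:
    """Estimated request cost in USD, given per-million-token prices (placeholders)."""
    input_tokens = estimate_tokens(prompt)
    return (input_tokens * input_price_per_1m
            + expected_output_tokens * output_price_per_1m) / 1_000_000

prompt = "Explain what a neural network is in simple terms."
print(estimate_tokens(prompt))                   # ~12 tokens for this 49-char prompt
print(f"${estimate_cost_usd(prompt, 512):.6f}")  # $0.005150
```

Note that output tokens typically cost several times more than input tokens, which is why `max_tokens` matters for cost control as well as latency.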

⚡ Basic API Call

Start by installing the SDK and making your first call. The messages list contains the conversation history in order.

Python — OpenAI
# pip install openai
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # or set OPENAI_API_KEY env var

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful AI tutor."},
        {"role": "user",   "content": "Explain what a neural network is in simple terms."}
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)
print(f"\nTokens used: {response.usage.total_tokens}")
Python — Anthropic (Claude)
# pip install anthropic
import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system="You are a helpful AI tutor.",
    messages=[
        {"role": "user", "content": "Explain what a neural network is in simple terms."}
    ]
)

print(message.content[0].text)
print(f"\nInput tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")

💬 Multi-turn Conversations

LLMs are stateless — each request must include the full conversation history. Maintain a messages list and append each turn manually.

Python — Conversation Loop
from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You are a friendly tutor."}]

def chat(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.7,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

# Simulate a conversation
print(chat("What is gradient descent?"))
print(chat("Can you give me a real-world analogy?"))
print(chat("How does it relate to backpropagation?"))

# messages list now contains the full history
print(f"\nConversation length: {len(messages)} messages")
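Because the full history is resent on every turn, long conversations eventually exceed the context window. A common fix is trimming the oldest turns while always keeping the system message. A minimal sketch using the ~4 chars/token heuristic (production code should count with the model's real tokenizer):

Python — History Trimming

```python
def trim_history(messages: list[dict], max_tokens: int = 4000) -> list[dict]:
    """Keep the system message plus the newest turns that fit a token budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, budget = [], max_tokens
    for msg in reversed(rest):               # walk from newest to oldest
        cost = len(msg["content"]) // 4 + 4  # heuristic tokens + per-message overhead
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))

history = [{"role": "system", "content": "You are a friendly tutor."}]
history += [{"role": "user", "content": "x" * 400},       # ~104 tokens each
            {"role": "assistant", "content": "y" * 400},
            {"role": "user", "content": "z" * 400}]
trimmed = trim_history(history, max_tokens=250)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```

More sophisticated variants summarize the dropped turns into a single message instead of discarding them outright.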

🌊 Streaming Responses

Streaming sends tokens to your app as they're generated, instead of waiting for the full response. This dramatically improves perceived latency for users.

Python — Streaming
from openai import OpenAI

client = OpenAI()

with client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about machine learning."}],
    stream=True,
) as stream:
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # Print token by token
    print()  # Newline at end

🎛️ Key Generation Parameters

| Parameter | Range | Effect | Recommended |
|---|---|---|---|
| temperature | 0.0–2.0 | Higher = more random/creative output | 0.0 for factual, 0.7 for creative |
| top_p | 0.0–1.0 | Nucleus sampling; limits vocabulary to the top-p probability mass | 0.9 (use either temperature or top_p, not both) |
| max_tokens | 1–model limit | Maximum output length in tokens | 256–2048 for most tasks |
| frequency_penalty | −2.0–2.0 | Penalizes tokens proportionally to how often they have appeared | 0.3–0.5 to reduce repetition |
| presence_penalty | −2.0–2.0 | Penalizes tokens that have appeared at all (encourages new topics) | 0.5–1.0 for diverse outputs |
| stop | list of strings | Stops generation when any listed sequence is produced | ["###", "\n\n"] |
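To build intuition for what top_p actually does, here is a toy implementation of nucleus filtering over a hand-made distribution. This is illustrative only; real decoding operates on the model's full vocabulary and then samples from the kept set:

Python — Nucleus (top-p) Filtering, Toy Example

```python
def nucleus_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability reaches
    top_p, then renormalize so the kept probabilities sum to 1."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in items:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(nucleus_filter(probs, top_p=0.9))
# keeps "the", "a", "cat" (cumulative 0.95 >= 0.9); "zebra" is cut
```

Lowering top_p shrinks the kept set, so low-probability (often incoherent) tokens can never be sampled, while relative probabilities among plausible tokens are preserved.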

📋 Structured JSON Output

Use response_format to guarantee valid JSON output, useful for building applications that parse model responses.

Python — JSON Mode
from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Respond only with valid JSON."},
        {"role": "user", "content": (
            "Extract entities from: 'Sam Altman founded OpenAI in 2015 in San Francisco.' "
            "Return JSON with keys: people, organizations, locations, years."
        )},
    ]
)

data = json.loads(response.choices[0].message.content)
print(data)
# {"people": ["Sam Altman"], "organizations": ["OpenAI"],
#  "locations": ["San Francisco"], "years": [2015]}
✍️ Module 02

Prompt Engineering

Prompt engineering is the art of crafting inputs that reliably elicit the best possible outputs from an LLM. Small changes in phrasing can dramatically affect quality.

Learning Objectives

  • Apply zero-shot & role prompting
  • Use Chain-of-Thought reasoning
  • Implement self-consistency sampling
  • Build reusable prompt templates
  • Extract structured data reliably

🗺️ Prompting Techniques Overview

Zero-shot
Few-shot
Chain-of-Thought
Self-Consistency
Tree of Thought
Role Prompting

Complexity & effectiveness generally increase left → right

🎯 Zero-Shot Prompting

Zero-shot prompting relies on the model's pre-trained knowledge with no examples. Works well for common tasks. The key is a clear, specific instruction.

Python — Zero-Shot Classification
from openai import OpenAI

client = OpenAI()

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Classify the sentiment of the following text.
Respond with exactly one word: POSITIVE, NEGATIVE, or NEUTRAL.

Text: "{text}"

Sentiment:"""
        }],
        temperature=0,  # Deterministic for classification
        max_tokens=10,
    )
    return response.choices[0].message.content.strip()

texts = [
    "The model training finished 10x faster than expected!",
    "This API keeps returning errors and I can't figure out why.",
    "The paper proposes a new attention mechanism."
]

for t in texts:
    print(f"'{t[:40]}...' → {classify_sentiment(t)}")

🔗 Chain-of-Thought (CoT)

CoT prompts the model to show its reasoning step-by-step before giving an answer. This dramatically improves performance on math, logic, and multi-step reasoning tasks.

❌ Without CoT

Prompt
prompt = """
Q: A train travels 150 km in 2 hours,
then 200 km in 3 hours. What is its
average speed for the whole trip?

A:"""
# Without reasoning, the model may average the two speeds:
# (75 + 66.7) / 2 ≈ 70.8 km/h — wrong

✅ With CoT

Prompt
prompt = """
Q: A train travels 150 km in 2 hours,
then 200 km in 3 hours. What is its
average speed for the whole trip?

A: Let me think step by step.
"""
# Gets correct answer: 350/5 = 70 km/h
# (total distance / total time)
Python — Zero-Shot CoT ("Let's think step by step")
from openai import OpenAI

client = OpenAI()

def solve_with_cot(problem: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a careful, logical problem solver."},
            {"role": "user", "content": f"{problem}\n\nLet's think step by step:"}
        ],
        temperature=0,
    )
    return response.choices[0].message.content

problem = """
If I have 5 apples and give away 2/5 of them, then receive 3 more,
and finally share equally with one friend, how many do I end up with?
"""
print(solve_with_cot(problem))

♻️ Self-Consistency

Generate multiple CoT responses with high temperature, then take the majority vote over the final answers. This ensembling can substantially reduce errors on arithmetic and commonsense reasoning tasks.

Python — Self-Consistency Voting
from openai import OpenAI
from collections import Counter
import re

client = OpenAI()

def self_consistency(question: str, num_samples: int = 5) -> str:
    """Sample multiple CoT paths and majority-vote the final answer."""
    system = (
        "Solve the problem step by step. "
        "End your response with 'Final answer: '"
    )
    answers = []
    for _ in range(num_samples):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system},
                {"role": "user",   "content": question}
            ],
            temperature=0.8,  # High temp for diverse reasoning paths
        )
        text = resp.choices[0].message.content
        # Extract the final answer
        match = re.search(r"Final answer:\s*(.+)", text, re.IGNORECASE)
        if match:
            answers.append(match.group(1).strip().lower())

    if not answers:
        return "Could not extract answers"

    # Majority vote
    most_common, count = Counter(answers).most_common(1)[0]
    print(f"Votes: {dict(Counter(answers))}")
    print(f"Majority ({count}/{num_samples}): {most_common}")
    return most_common

result = self_consistency(
    "A bat and ball cost $1.10 total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

🎭 Role Prompting

Assigning a specific role or persona in the system prompt activates relevant knowledge patterns and adjusts the model's communication style.

Python — Role-Based System Prompts
from openai import OpenAI

client = OpenAI()

ROLES = {
    "code_reviewer": """You are a senior software engineer with 15 years of experience.
Review code for: correctness, performance, security vulnerabilities, and maintainability.
Be specific and actionable in your feedback.""",

    "socratic_tutor": """You are a Socratic tutor. Never give direct answers.
Instead, guide students to discover answers themselves through carefully crafted questions.
Ask one question at a time.""",

    "ux_critic": """You are a UX researcher with expertise in cognitive load theory.
Evaluate designs from the user's perspective. Cite specific usability heuristics.""",
}

def ask_expert(role_key: str, question: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": ROLES[role_key]},
            {"role": "user",   "content": question}
        ],
        temperature=0.7,
    ).choices[0].message.content

# Example usage
review = ask_expert("code_reviewer", """
Review this Python function:
def get_user(id):
    return db.execute(f"SELECT * FROM users WHERE id = {id}")
""")
⚠️
The code above has a SQL injection vulnerability — that's intentional so the code reviewer can catch it! Always use parameterized queries: db.execute("SELECT * FROM users WHERE id = ?", (id,))

📝 Reusable Prompt Templates

Build parameterized templates to standardize prompts across your application and make them easier to iterate on.

Python — Template System
from string import Template
from openai import OpenAI

client = OpenAI()

# Define reusable templates
SUMMARIZE_TEMPLATE = Template("""
You are an expert at summarizing $domain content.
Summarize the following text in exactly $num_points bullet points.
Focus on: $focus_areas.
Each bullet should be one clear, concise sentence.

TEXT:
$text

SUMMARY:
""")

EXTRACT_TEMPLATE = Template("""
Extract all $entity_type from the following text.
Return as a JSON array of strings.
If none found, return an empty array [].

TEXT: $text
""")

def summarize(text: str, domain="technical", points=3, focus="key findings"):
    prompt = SUMMARIZE_TEMPLATE.substitute(
        domain=domain,
        num_points=points,
        focus_areas=focus,
        text=text
    )
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    ).choices[0].message.content

paper_abstract = """
We present GPT-4, a large multimodal model capable of processing image and text inputs
and producing text outputs. GPT-4 exhibits human-level performance on various professional
and academic benchmarks...
"""
print(summarize(paper_abstract, domain="AI research", points=4, focus="contributions, methods, results"))
🎯 Module 03

Few-Shot Learning

Few-shot learning uses a small number of examples (shots) within the prompt to teach the model a new task without any gradient updates — this is called in-context learning.

Learning Objectives

  • Understand in-context learning
  • Format examples effectively
  • Select high-quality shots
  • Build dynamic few-shot retrieval
  • Know when few-shot beats zero-shot

🧠 How In-Context Learning Works

Large language models develop the ability to learn new tasks by observing demonstrations in their context window. No weight updates occur — the model uses pattern matching and analogy from its pre-training.

Prompt Structure
[System: You are a sentiment classifier]
Text: "Great product!" → Label: POSITIVE
Text: "Terrible experience." → Label: NEGATIVE
Text: "It works fine." → Label: NEUTRAL
Text: "Loved it!" → Label: ???
3 examples (shots) teach the format → model predicts POSITIVE
🔮
Why it works: During pre-training on internet text, the model sees countless input→output patterns. Few-shot examples activate the right "circuit" for the task by providing clear format and semantics cues.

📋 Basic Few-Shot Format

Python — Few-Shot Classification
from openai import OpenAI

client = OpenAI()

def few_shot_classify(examples: list[dict], query: str) -> str:
    """
    examples: list of {"input": ..., "label": ...}
    query: the text to classify
    """
    # Build the few-shot prompt
    shots = "\n".join([
        f'Text: "{ex["input"]}"\nLabel: {ex["label"]}'
        for ex in examples
    ])

    prompt = f"""Classify the sentiment of text as POSITIVE, NEGATIVE, or NEUTRAL.

{shots}
Text: "{query}"
Label:"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=10,
    )
    return response.choices[0].message.content.strip()

# Examples covering all three classes
examples = [
    {"input": "The delivery was incredibly fast!", "label": "POSITIVE"},
    {"input": "Completely broken on arrival.",      "label": "NEGATIVE"},
    {"input": "It does what it says.",              "label": "NEUTRAL"},
    {"input": "Best purchase I've made this year!", "label": "POSITIVE"},
    {"input": "Won't be buying from them again.",   "label": "NEGATIVE"},
]

queries = [
    "Works as expected, nothing special.",
    "Absolutely love this product!",
    "Stopped working after one week."
]

for q in queries:
    label = few_shot_classify(examples, q)
    print(f"'{q}' → {label}")

🔄 Dynamic Example Selection

Instead of using fixed examples, retrieve the most semantically similar examples to the query. This improves performance especially for diverse or edge-case inputs.

Python — Semantic Example Retrieval
# pip install openai numpy
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return resp.data[0].embedding

def cosine_similarity(a, b) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class DynamicFewShot:
    def __init__(self, examples: list[dict]):
        """examples: list of {"input": ..., "output": ..., "embedding": None}"""
        self.examples = examples
        # Pre-compute embeddings for all examples
        for ex in self.examples:
            ex["embedding"] = get_embedding(ex["input"])

    def get_top_k(self, query: str, k: int = 3) -> list[dict]:
        q_emb = get_embedding(query)
        scored = [
            (cosine_similarity(q_emb, ex["embedding"]), ex)
            for ex in self.examples
        ]
        scored.sort(key=lambda x: x[0], reverse=True)
        return [ex for _, ex in scored[:k]]

    def predict(self, query: str, k: int = 3) -> str:
        top_k = self.get_top_k(query, k)
        shots = "\n".join([
            f'Input: {ex["input"]}\nOutput: {ex["output"]}'
            for ex in top_k
        ])
        prompt = f"Transform the input as shown:\n\n{shots}\nInput: {query}\nOutput:"
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip()

# Example: date format normalization
examples = [
    {"input": "January 5th, 2024",     "output": "2024-01-05"},
    {"input": "March 22, 2023",        "output": "2023-03-22"},
    {"input": "Dec 31st 2022",         "output": "2022-12-31"},
    {"input": "15 August 2024",        "output": "2024-08-15"},
    {"input": "July 4, 2025",          "output": "2025-07-04"},
    {"input": "February 14th, 2024",   "output": "2024-02-14"},
]

dfs = DynamicFewShot(examples)
print(dfs.predict("October 3rd, 2024"))   # → 2024-10-03
print(dfs.predict("11 November 2025"))    # → 2025-11-11

✅ Best Practices

| Principle | Do | Avoid |
|---|---|---|
| Diversity | Cover all output classes/formats in examples | Examples that are all similar to each other |
| Format | Use identical I/O format for all shots | Inconsistent spacing, punctuation, or structure |
| Quality | Use high-quality, verified example pairs | Incorrect labels, which anchor the model to wrong patterns |
| Order | Put the most relevant example last (recency bias) | Random ordering for critical tasks |
| Count | 3–8 shots for most tasks | Filling the entire context window with shots |
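The diversity and count principles are easy to enforce mechanically before a prompt ever reaches the model. A hypothetical helper (check_shot_coverage is a sketch, not a library function) that flags common problems in a shot set:

Python — Shot-Set Sanity Check

```python
def check_shot_coverage(examples: list[dict], labels: set[str]) -> list[str]:
    """Return warnings about a few-shot example set: missing label
    coverage and shot counts outside the typical 3-8 range."""
    warnings = []
    seen = {ex["label"] for ex in examples}
    missing = labels - seen
    if missing:
        warnings.append(f"No examples for labels: {sorted(missing)}")
    if not 3 <= len(examples) <= 8:
        warnings.append(f"{len(examples)} shots; 3-8 is typical")
    return warnings

examples = [
    {"input": "Great!", "label": "POSITIVE"},
    {"input": "Awful.", "label": "NEGATIVE"},
]
print(check_shot_coverage(examples, {"POSITIVE", "NEGATIVE", "NEUTRAL"}))
# ["No examples for labels: ['NEUTRAL']", '2 shots; 3-8 is typical']
```

Running a check like this in CI keeps prompt edits from silently dropping a class.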
🔧 Module 04

Supervised Fine-Tuning (SFT)

SFT adapts a pre-trained LLM to a specific domain or task by training on labeled instruction-response pairs, updating model weights using gradient descent.

Learning Objectives

  • Prepare instruction-tuning datasets
  • Understand LoRA / QLoRA
  • Run training with HuggingFace + TRL
  • Choose hyperparameters
  • Evaluate fine-tuned models

🤔 Why Fine-Tune?

Prompt engineering has limits. When you need consistent format/style, domain-specific knowledge, lower latency, or lower API cost, fine-tuning is the answer.

| Approach | When to Use | Trade-offs |
|---|---|---|
| Prompting | General tasks, quick iteration | High token cost, inconsistent format |
| Few-Shot | Format adaptation | Consumes context window; every request must carry the shots |
| SFT | Domain expertise, consistent behavior | Needs data & compute; risk of overfitting |

📦 Data Preparation

SFT requires a dataset of (instruction, response) pairs in a chat format. Quality matters far more than quantity — 1,000 excellent examples beat 100,000 mediocre ones.

Python — Dataset Preparation (ShareGPT format)
import json
from datasets import Dataset

# ShareGPT format (widely supported)
data = [
    {
        "conversations": [
            {"from": "system",    "value": "You are an expert Python tutor."},
            {"from": "human",     "value": "How do I read a CSV file in Python?"},
            {"from": "gpt",       "value": "Use the `csv` module or pandas:\n\n```python\nimport pandas as pd\ndf = pd.read_csv('file.csv')\nprint(df.head())\n```\n\nFor large files, use `chunksize=1000` to process in batches."}
        ]
    },
    {
        "conversations": [
            {"from": "system",    "value": "You are an expert Python tutor."},
            {"from": "human",     "value": "What is a list comprehension?"},
            {"from": "gpt",       "value": "A list comprehension is a concise way to create lists:\n\n```python\n# Traditional\nresult = []\nfor x in range(10):\n    if x % 2 == 0:\n        result.append(x**2)\n\n# List comprehension\nresult = [x**2 for x in range(10) if x % 2 == 0]\n```\n\nThey're faster and more Pythonic than equivalent loops."}
        ]
    },
    # ... add thousands more examples
]

# Save as JSONL
with open("train.jsonl", "w") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")

# Load with HuggingFace datasets
dataset = Dataset.from_list(data)
dataset = dataset.train_test_split(test_size=0.05)
print(f"Train: {len(dataset['train'])}, Val: {len(dataset['test'])}")

⚡ LoRA: Low-Rank Adaptation

Full fine-tuning updates all ~7B+ parameters — expensive and prone to catastrophic forgetting. LoRA freezes the original weights and injects small trainable matrices that capture task-specific updates.

Effective weight: W₀ + B·A
  • W₀: original weight, FROZEN ❄️
  • B·A: LoRA adapter, TRAINABLE 🔥 (~1% of parameters)

A is rank×d, B is d×rank. Typical rank = 8–64. Only A and B are trained.

LoRA rank (r)
Higher rank = more capacity but more parameters. r=8 for simple tasks, r=64 for complex ones.
Alpha (α)
Scaling factor = α/r. Keep α = 2×r for stable training (e.g., r=16, α=32).
Target modules
Which attention layers to apply LoRA to: q_proj, v_proj (at minimum), or all projection layers.
QLoRA
Quantize base model to 4-bit (NF4), then apply LoRA. Fits 70B models on a single consumer GPU.
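The parameter savings are easy to compute directly: for a single d_out×d_in projection, the adapter adds only d_out·r + r·d_in weights. A quick sketch, not tied to any particular model:

Python — LoRA Parameter Savings

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Parameters in the full weight matrix vs. its LoRA adapter
    (B: d_out x r, A: r x d_in)."""
    full = d_out * d_in
    adapter = d_out * r + r * d_in
    return full, adapter

# A 4096x4096 attention projection with rank r=16:
full, adapter = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
# full: 16,777,216  adapter: 131,072  ratio: 0.78%
```

Doubling the rank doubles the adapter size but leaves it tiny relative to the frozen base, which is why LoRA checkpoints are megabytes instead of gigabytes.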

🚀 Training with TRL + PEFT

Python — SFT Training (TRL SFTTrainer)
# pip install transformers trl peft accelerate bitsandbytes datasets
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B"

# ── 4-bit Quantization (QLoRA) ──────────────────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# ── LoRA Configuration ───────────────────────────────────
lora_config = LoraConfig(
    r=16,                           # Rank
    lora_alpha=32,                  # Alpha = 2x rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 83,886,080 || all params: 8,114,278,400 || trainable%: 1.03%

# ── Training Arguments ───────────────────────────────────
training_args = SFTConfig(
    output_dir="./llama3-sft",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,    # Effective batch = 8
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    bf16=True,
    max_seq_length=2048,
    dataset_text_field="text",        # Column containing formatted text
    report_to="wandb",                # Optional: experiment tracking
)

# ── Load Dataset ─────────────────────────────────────────
dataset = load_dataset("json", data_files={"train": "train.jsonl"}, split="train")

def format_conversation(example):
    """Convert ShareGPT format to training text."""
    messages = example["conversations"]
    text = ""
    for msg in messages:
        if msg["from"] == "system":
            text += f"<|system|>\n{msg['value']}\n"
        elif msg["from"] == "human":
            text += f"<|user|>\n{msg['value']}\n"
        elif msg["from"] == "gpt":
            text += f"<|assistant|>\n{msg['value']}\n"
    return {"text": text}

dataset = dataset.map(format_conversation)

# ── Start Training ───────────────────────────────────────
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("./llama3-sft-final")
print("Training complete!")

🔗 Merging LoRA Weights & Inference

Python — Merge & Run Inference
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
LORA_ADAPTER = "./llama3-sft-final"

# Load base model in full precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Load and merge LoRA weights into base model
model = PeftModel.from_pretrained(base_model, LORA_ADAPTER)
model = model.merge_and_unload()  # Merge weights, remove LoRA modules

# Save merged model
model.save_pretrained("./llama3-merged")
tokenizer.save_pretrained("./llama3-merged")

# Run inference
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "<|system|>\nYou are an expert Python tutor.\n<|user|>\nExplain decorators.\n<|assistant|>\n"
output = pipe(prompt, max_new_tokens=256, temperature=0.7)
print(output[0]["generated_text"][len(prompt):])
🏆 Module 05

RL Training with LLM-as-Judge

Reinforcement Learning from Human Feedback (RLHF) aligns LLMs with human preferences. Using another LLM as a judge automates the reward signal at scale — no human labelers needed.

Learning Objectives

  • Understand RLHF pipeline
  • Implement PPO for LLMs
  • Apply DPO (simpler alternative)
  • Use GRPO for reasoning tasks
  • Build an LLM-as-judge reward model

🗺️ RLHF Overview

Classic RLHF has three stages. Using an LLM-as-judge replaces the expensive human preference collection step.

Stage 1 — Supervised Fine-Tuning
Instruction Data
SFT on Base LLM
SFT Model (Policy)
Stage 2 — Reward Modeling
Prompt
LLM Judge
Reward Score
Stage 3 — RL Optimization (PPO / DPO / GRPO)
SFT Policy
+
Reward
RL Update
Aligned Model

⚖️ LLM-as-Judge Reward Model

Instead of a trained reward model, use a capable LLM (e.g., GPT-4o) to score responses. This scales instantly and can evaluate nuanced qualities like helpfulness and harmlessness.

Python — LLM Judge Implementation
from openai import OpenAI
import json

client = OpenAI()

JUDGE_SYSTEM = """You are an expert AI judge evaluating the quality of AI assistant responses.
Score the response on a scale of 1-10 based on:
- Accuracy (is the information correct?)
- Helpfulness (does it fully address the question?)
- Clarity (is it easy to understand?)
- Safety (no harmful content?)

Respond with valid JSON only: {"score": <1-10>, "reasoning": "<one-sentence explanation>"}"""

def llm_judge(prompt: str, response: str) -> dict:
    """Score a response using GPT-4o as judge. Returns score and reasoning."""
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": f"Prompt: {prompt}\n\nResponse: {response}"}
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)

# Pairwise comparison (preferred for DPO data collection)
def pairwise_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Returns 'A', 'B', or 'tie'."""
    pairwise_prompt = f"""Which response is better?
Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}

Answer with JSON: {{"winner": "A" | "B" | "tie", "reasoning": "..."}}"""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an impartial AI judge."},
            {"role": "user",   "content": pairwise_prompt}
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    data = json.loads(result.choices[0].message.content)
    return data["winner"]

# Example usage
score = llm_judge(
    prompt="What is the capital of France?",
    response="Paris is the capital of France, known for the Eiffel Tower."
)
print(f"Score: {score['score']}/10 — {score['reasoning']}")

📐 PPO — Proximal Policy Optimization

PPO is the canonical RL algorithm for RLHF. It updates the policy (LLM) to maximize reward while staying close to the reference policy via a KL divergence penalty.

📚
Loss function: L = E[min(r·A, clip(r, 1-ε, 1+ε)·A)] − β·KL(π_θ ∥ π_ref)
where r = π_θ(a|s)/π_ref(a|s) is the probability ratio, A is advantage, β controls KL penalty.
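To make the clipping concrete, here is a toy per-token version of the surrogate term (a hypothetical helper for intuition, not part of any library):

```python
def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# With A > 0, gains from pushing the ratio beyond 1+eps are clipped away
print(ppo_clip_term(2.0, 1.0))   # 1.2, capped at (1+eps)*A
# With A < 0, min() takes the more pessimistic (lower) value
print(ppo_clip_term(0.5, -1.0))  # -0.8
```

The clip keeps any single update from moving the policy too far from where the responses were sampled; the KL term additionally anchors it to the reference model.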
Python — PPO with TRL
# pip install trl transformers peft
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer
from datasets import Dataset
import torch

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Load model with value head (needed for PPO)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)  # Reference model (frozen)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# PPO Config
ppo_config = PPOConfig(
    model_name=MODEL_ID,
    learning_rate=1.41e-5,
    batch_size=16,
    mini_batch_size=4,
    gradient_accumulation_steps=1,
    optimize_cuda_cache=True,
    kl_penalty="kl",        # Or "full" for full KL
    init_kl_coef=0.2,       # β: initial KL coefficient
    target_kl=6.0,          # Target KL divergence
    gamma=1.0,
    lam=0.95,               # GAE lambda
    cliprange=0.2,          # ε: clip range
    vf_coef=0.1,            # Value function coefficient
)

trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop
def chunks(seq, n):
    """Yield successive n-sized batches from seq."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

prompts = ["Explain quantum computing", "Write a poem about AI", ...]

for epoch in range(3):
    for batch_prompts in chunks(prompts, ppo_config.batch_size):
        # 1. Tokenize prompts
        query_tensors = [
            tokenizer.encode(p, return_tensors="pt").squeeze()
            for p in batch_prompts
        ]

        # 2. Generate responses
        response_tensors = trainer.generate(
            query_tensors,
            max_new_tokens=256,
            temperature=0.9,
        )

        # 3. Score with LLM judge
        rewards = []
        for prompt, response_tensor in zip(batch_prompts, response_tensors):
            response_text = tokenizer.decode(response_tensor)
            score = llm_judge(prompt, response_text)  # From previous code
            rewards.append(torch.tensor(score["score"] / 10.0))

        # 4. PPO update
        stats = trainer.step(query_tensors, response_tensors, rewards)
        print(f"Epoch {epoch} | Mean reward: {stats['ppo/mean_scores']:.3f} | KL: {stats['objective/kl']:.3f}")

🎯 DPO — Direct Preference Optimization

DPO eliminates the separate reward model and RL loop entirely. It directly optimizes the policy to prefer "chosen" responses over "rejected" ones using a simple cross-entropy loss.

DPO advantage: Much simpler than PPO — no value head, no separate reward model, no on-policy generation during training. Just a supervised loss on preference pairs.
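The loss itself is one line: for a pair with chosen response y_w and rejected y_l, DPO minimizes -log σ(β·[(log π_θ(y_w) − log π_ref(y_w)) − (log π_θ(y_l) − log π_ref(y_l))]). A minimal numeric sketch (hypothetical helper operating on summed log-probs, for intuition only):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given the summed log-probs of each
    response under the policy (pi_*) and the frozen reference (ref_*)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Loss drops as the policy shifts probability mass toward the chosen response
print(dpo_loss(-10.0, -20.0, -12.0, -18.0) < dpo_loss(-20.0, -10.0, -12.0, -18.0))  # True
```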
Python — DPO Dataset + Training
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
import torch

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# DPO requires preference pairs: (prompt, chosen, rejected)
# Generate these using LLM judge for pairwise comparison
dpo_data = [
    {
        "prompt":   "What is the best way to learn Python?",
        "chosen":   "Start with official tutorials, then build projects. Practice daily with small scripts before attempting large projects.",
        "rejected": "Just watch YouTube videos."
    },
    {
        "prompt":   "Explain recursion",
        "chosen":   "Recursion is when a function calls itself. Example: factorial(n) = n * factorial(n-1). Every recursive function needs a base case to stop.",
        "rejected": "It's a programming thing where functions call themselves."
    },
    # ... thousands more preference pairs
]

dataset = Dataset.from_list(dpo_data)

training_args = DPOConfig(
    output_dir="./llama3-dpo",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,             # Lower than SFT
    beta=0.1,                       # KL penalty (higher = closer to reference)
    max_length=1024,
    max_prompt_length=512,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,         # If None, uses a copy of model as reference
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("./llama3-dpo-final")

🔄 GRPO — Group Relative Policy Optimization

GRPO (from DeepSeek-R1) improves on PPO for reasoning tasks. It samples a group of responses per prompt, computes relative rewards within the group, and uses them as baselines — eliminating the value function entirely.

💡
Key insight: Instead of learning a value function V(s), GRPO estimates the baseline by averaging rewards across G sampled outputs for the same prompt. This is simpler and works extremely well for verifiable tasks (math, code).
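The group-relative advantage is easy to sketch: score G samples for the same prompt, then normalize each reward against the group's own statistics (hypothetical helper for intuition):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style baseline: each sample's advantage is its reward relative
    to the group mean, normalized by the group std (no value function)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:                      # all rewards equal, so no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# G = 4 sampled answers for one prompt: two correct (2.0), two wrong (0.0)
print(group_advantages([2.0, 0.0, 2.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```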
Python — GRPO with TRL
from trl import GRPOConfig, GRPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import re, torch

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# ── Reward Functions ─────────────────────────────────────
# GRPO supports multiple composable reward functions

def correctness_reward(completions, ground_truth, **kwargs) -> list[float]:
    """Verify math answers against ground truth."""
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        # Extract answer from <answer>...</answer> tags
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        # GSM8K-style ground truths end with "#### <final answer>"
        gt_answer = str(gt).split("####")[-1].strip()
        if match and match.group(1).strip() == gt_answer:
            rewards.append(2.0)   # Correct answer
        else:
            rewards.append(0.0)   # Wrong
    return rewards

def format_reward(completions, **kwargs) -> list[float]:
    """Reward responses that use the correct <think>/<answer> format."""
    rewards = []
    for c in completions:
        has_thinking = "<think>" in c and "</think>" in c
        has_answer   = "<answer>" in c and "</answer>" in c
        rewards.append(0.5 if (has_thinking and has_answer) else 0.0)
    return rewards

def length_penalty(completions, **kwargs) -> list[float]:
    """Penalize overly short or long responses."""
    rewards = []
    for c in completions:
        tokens = len(c.split())
        if 50 <= tokens <= 500:
            rewards.append(0.1)
        elif tokens < 20 or tokens > 1000:
            rewards.append(-0.2)
        else:
            rewards.append(0.0)
    return rewards

# ── GRPO Config ──────────────────────────────────────────
config = GRPOConfig(
    output_dir="./qwen-grpo-math",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=1e-6,
    num_generations=8,              # G: responses sampled per prompt
    max_prompt_length=512,
    max_completion_length=1024,
    beta=0.04,                      # KL coefficient
    bf16=True,
    logging_steps=10,
    reward_weights=[1.0, 0.5, 0.2], # Weights for reward functions
)

# Load math dataset (e.g., GSM8K)
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("answer", "ground_truth")

trainer = GRPOTrainer(
    model=model,
    tokenizer=tokenizer,
    config=config,
    train_dataset=dataset,
    reward_funcs=[correctness_reward, format_reward, length_penalty],
)

trainer.train()
| Algorithm | Requires | Best For | Complexity |
|---|---|---|---|
| PPO | Reward model + value head | General alignment, chat | High |
| DPO | Preference pairs | Style/safety alignment | Low |
| GRPO | Verifiable reward function | Math, code, reasoning | Medium |
🔍 Module 06

Retrieval-Augmented Generation (RAG)

RAG grounds LLM responses in your own data by retrieving relevant documents at query time. This reduces hallucination and enables up-to-date, source-cited answers.

Learning Objectives

  • Build an end-to-end RAG pipeline
  • Choose chunking & embedding strategies
  • Use vector databases
  • Implement Graph RAG
  • Evaluate retrieval quality

🏗️ Regular RAG Pipeline

Indexing Phase (offline): Documents → Chunking → Embed Model → Vector DB
Query Phase (online): User Query → Embed Query → Top-K Retrieve → (retrieved chunks + query) → LLM → Answer

⚡ Complete RAG Implementation

Python — RAG Pipeline from Scratch
# pip install openai chromadb langchain-text-splitters pypdf
from openai import OpenAI
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("knowledge_base")

# ── STEP 1: Load & Chunk Documents ──────────────────────
def load_and_chunk(file_path: str, chunk_size=512, overlap=64) -> list[str]:
    text = Path(file_path).read_text(encoding="utf-8")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)

# ── STEP 2: Embed & Store ────────────────────────────────
def embed_texts(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [d.embedding for d in resp.data]

def index_document(file_path: str):
    chunks = load_and_chunk(file_path)
    embeddings = embed_texts(chunks)
    ids = [f"{file_path}_{i}" for i in range(len(chunks))]
    collection.add(
        ids=ids,
        embeddings=embeddings,
        documents=chunks,
        metadatas=[{"source": file_path, "chunk": i} for i in range(len(chunks))]
    )
    print(f"Indexed {len(chunks)} chunks from {file_path}")

# ── STEP 3: Retrieve ─────────────────────────────────────
def retrieve(query: str, k: int = 5) -> list[dict]:
    query_embedding = embed_texts([query])[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
        include=["documents", "metadatas", "distances"]
    )
    chunks = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        chunks.append({
            "text": doc,
            "source": meta["source"],
            "score": 1 - dist  # Convert distance to similarity
        })
    return chunks

# ── STEP 4: Generate Answer ──────────────────────────────
def rag_query(question: str, k: int = 5) -> dict:
    # Retrieve relevant chunks
    chunks = retrieve(question, k=k)

    # Build context
    context = "\n\n---\n\n".join([
        f"[Source: {c['source']}, Score: {c['score']:.2f}]\n{c['text']}"
        for c in chunks
    ])

    # Generate with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are a helpful assistant.
Answer the user's question based ONLY on the provided context.
If the answer is not in the context, say "I don't have enough information to answer this."
Always cite your sources."""},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0,
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": [c["source"] for c in chunks],
        "chunks_used": len(chunks)
    }

# ── Usage ─────────────────────────────────────────────────
index_document("company_docs.txt")
index_document("product_manual.pdf")

result = rag_query("What are the system requirements?")
print(result["answer"])
print(f"\nSources: {result['sources']}")

✂️ Chunking Strategies

| Strategy | Description | Best For |
|---|---|---|
| Fixed Size | Split every N tokens/chars with overlap | General text, simple documents |
| Recursive | Try paragraph → sentence → word splits | Prose, books, articles |
| Semantic | Split on topic/semantic boundaries using embeddings | Multi-topic docs, high accuracy needs |
| Document-aware | Markdown headers, HTML tags, code blocks | Structured docs, code files |
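The first strategy, fixed-size chunking with overlap, is simple enough to sketch directly (a character-based toy; real pipelines usually split on tokens):

```python
def fixed_size_chunks(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks, repeating the last `overlap`
    characters of each chunk at the start of the next to preserve context."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = fixed_size_chunks("x" * 1000)
print(len(chunks))                          # 3
print(chunks[0][-64:] == chunks[1][:64])    # True: overlap is preserved
```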

🕸️ Graph RAG

Graph RAG builds a knowledge graph from documents — extracting entities and relationships — then traverses the graph during retrieval to find non-obvious connections that pure vector search misses.

Graph RAG Architecture
Indexing: Documents → Entity & Relation Extraction → Knowledge Graph (Neo4j)
At query time: Query → Graph Traversal + Vector Search → Sub-graph → Answer
Python — Graph RAG with LLM Extraction
# pip install openai neo4j
from openai import OpenAI
from neo4j import GraphDatabase
import json

client = OpenAI()
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# ── Extract Knowledge Graph from Text ────────────────────
def extract_knowledge(text: str) -> dict:
    """Extract entities and relationships using LLM."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Extract a knowledge graph from the text.
Return JSON with:
- "entities": [{"name": str, "type": str, "description": str}]
- "relations": [{"from": str, "relation": str, "to": str}]"""},
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

# ── Store in Neo4j ────────────────────────────────────────
def store_graph(kg: dict):
    with driver.session() as session:
        # Create entity nodes
        for entity in kg["entities"]:
            session.run(
                "MERGE (e:Entity {name: $name}) SET e.type=$type, e.description=$desc",
                name=entity["name"], type=entity["type"], desc=entity["description"]
            )
        # Create relationship edges
        for rel in kg["relations"]:
            session.run(
                """MATCH (a:Entity {name: $from}), (b:Entity {name: $to})
                   MERGE (a)-[r:RELATION {type: $rel}]->(b)""",
                **{"from": rel["from"], "to": rel["to"], "rel": rel["relation"]}
            )

# ── Graph Traversal Query ────────────────────────────────
def graph_retrieve(query: str, hops: int = 2) -> str:
    """Retrieve a subgraph relevant to the query via entity matching + traversal."""
    # Extract query entities
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f'Extract the main entity names from this query as JSON: {{"entities": ["..."]}}. Query: {query}'
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    entities = json.loads(resp.choices[0].message.content).get("entities", [])

    results = []
    with driver.session() as session:
        for entity in entities[:3]:  # Limit to top 3
            records = session.run(f"""
                MATCH path = (start:Entity)-[*1..{hops}]-(connected)
                WHERE start.name CONTAINS $name
                RETURN [node in nodes(path) | node.name + ': ' + node.description] as chain,
                       [rel in relationships(path) | type(rel)] as rels
                LIMIT 10
            """, name=entity)
            for r in records:
                results.append(" → ".join(r["chain"]))
    return "\n".join(results)

def graph_rag_query(question: str) -> str:
    subgraph_context = graph_retrieve(question)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using the knowledge graph context provided."},
            {"role": "user", "content": f"Knowledge Graph:\n{subgraph_context}\n\nQuestion: {question}"}
        ],
    )
    return resp.choices[0].message.content

# Example
text = "Apple was founded by Steve Jobs and Steve Wozniak in 1976. Jobs later launched the iPhone in 2007."
kg = extract_knowledge(text)
store_graph(kg)
answer = graph_rag_query("What did Steve Jobs create after founding Apple?")
print(answer)

📊 Evaluating RAG Quality

Context Recall
Are all the facts needed to answer the question present in the retrieved chunks?
Context Precision
Are retrieved chunks relevant? Low precision = noisy context that confuses the LLM.
Faithfulness
Does the generated answer stay faithful to the retrieved context, or does it hallucinate?
Answer Relevance
Does the answer actually address the original question asked?
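Answer relevance is commonly approximated as the cosine similarity between the embedding of the original question and the embedding of the answer (or of a question re-generated from the answer). The similarity itself is a few lines:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```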
Python — RAG Evaluation with RAGAS
# pip install ragas datasets
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["What year was the company founded?"],
    "answer":   ["The company was founded in 1995."],
    "contexts": [["Company History: Founded in 1995 by John Smith..."]],
    "ground_truth": ["1995"],
}

dataset = Dataset.from_dict(eval_data)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall, context_precision])
print(result)
🕸️ Module 07

Agent Systems

AI agents are LLM-powered systems that can reason, plan, use tools, and take actions in an environment. Agents can work alone or as part of collaborative multi-agent systems.

Learning Objectives

  • Implement the ReAct reasoning loop
  • Define and use tools / function calling
  • Add memory to agents
  • Build multi-agent workflows
  • Use LangGraph for stateful agents

🤖 What is an Agent?

An agent is an LLM in an action loop: it perceives state, reasons about what to do, calls a tool, observes the result, and repeats until the task is complete.

Agent Loop (ReAct: Reason + Act)
User Goal → Reason (what should I do?) → Act (call tool / API) → Observe (tool result added to context) → Done? (answer, or loop again)

  • Tools / Functions: external capabilities such as web search, code execution, database queries, API calls, file I/O.
  • Memory: short-term (conversation history), long-term (vector DB), episodic (past interactions).
  • Planning: breaking complex goals into sub-tasks. ReAct, CoT, ToT, and plan-and-execute patterns.
  • State: what the agent knows about the world and its progress toward the goal.

🛠️ Single Agent with Tool Use

OpenAI's function calling API lets you define tools as JSON schemas. The model decides when to call a tool and what arguments to pass.

Python — Single Agent with Tools
from openai import OpenAI
import json, math, datetime

client = OpenAI()

# ── Define Tools ─────────────────────────────────────────
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate a mathematical expression. Returns the numeric result.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string", "description": "Math expression, e.g. '2 ** 10 + sqrt(16)'"}
                },
                "required": ["expression"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_current_time",
            "description": "Get the current date and time in ISO format.",
            "parameters": {"type": "object", "properties": {}}
        }
    },
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for information. Returns snippets.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"]
            }
        }
    }
]

# ── Tool Implementations ──────────────────────────────────
def calculator(expression: str) -> str:
    try:
        # Restricted eval with a math-only namespace (demo; eval is not a true sandbox)
        safe_env = {k: getattr(math, k) for k in dir(math) if not k.startswith('_')}
        result = eval(expression, {"__builtins__": {}}, safe_env)
        return str(result)
    except Exception as e:
        return f"Error: {e}"

def get_current_time() -> str:
    return datetime.datetime.now().isoformat()

def web_search(query: str) -> str:
    # Stub — replace with real search API (Tavily, SerpAPI, etc.)
    return f"Search results for '{query}': [This is a demo stub. Integrate Tavily API for real results.]"

TOOL_MAP = {
    "calculator": calculator,
    "get_current_time": get_current_time,
    "web_search": web_search,
}

# ── Agent Loop ────────────────────────────────────────────
def run_agent(user_goal: str, max_steps: int = 10) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful AI agent. Use tools when needed to answer accurately."},
        {"role": "user", "content": user_goal}
    ]

    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
        )

        msg = response.choices[0].message
        messages.append(msg)  # Add assistant message to history

        # Check if done (no tool calls)
        if not msg.tool_calls:
            print(f"Completed in {step + 1} steps.")
            return msg.content

        # Execute each tool call
        for tool_call in msg.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)
            print(f"  [Step {step+1}] Calling {fn_name}({fn_args})")

            fn_result = TOOL_MAP[fn_name](**fn_args)

            # Add tool result to messages
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "name": fn_name,
                "content": fn_result,
            })

    return "Max steps reached without completing the task."

# ── Run the Agent ─────────────────────────────────────────
result = run_agent("What is 2^32, and what time is it right now?")
print(result)

🧠 Adding Memory to Agents

Python — Agent with Vector Memory
from openai import OpenAI
import chromadb
from datetime import datetime

client = OpenAI()
chroma = chromadb.Client()
memory_store = chroma.create_collection("agent_memory")

class AgentWithMemory:
    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self.short_term = []  # Recent conversation turns
        self.max_short_term = 20

    def remember(self, text: str, metadata: dict = None):
        """Store a memory in long-term vector store."""
        embedding = client.embeddings.create(
            model="text-embedding-3-small", input=text
        ).data[0].embedding

        memory_store.add(
            ids=[f"{self.agent_id}_{datetime.now().timestamp()}"],
            embeddings=[embedding],
            documents=[text],
            metadatas=[{"agent": self.agent_id, "timestamp": str(datetime.now()), **(metadata or {})}]
        )

    def recall(self, query: str, k: int = 3) -> list[str]:
        """Retrieve relevant memories."""
        q_emb = client.embeddings.create(
            model="text-embedding-3-small", input=query
        ).data[0].embedding

        results = memory_store.query(
            query_embeddings=[q_emb],
            n_results=k,
            where={"agent": self.agent_id}
        )
        return results["documents"][0] if results["documents"] else []

    def chat(self, user_input: str) -> str:
        # Retrieve relevant memories
        memories = self.recall(user_input)
        memory_context = "\n".join([f"- {m}" for m in memories])

        # Build messages with memory
        system = f"""You are a helpful assistant with a persistent memory.
Relevant memories from past interactions:
{memory_context if memories else "No relevant memories yet."}"""

        self.short_term.append({"role": "user", "content": user_input})

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": system}] + self.short_term[-self.max_short_term:],
        )

        reply = response.choices[0].message.content
        self.short_term.append({"role": "assistant", "content": reply})

        # Store important things in long-term memory
        self.remember(f"User said: {user_input}")
        self.remember(f"I responded: {reply[:200]}")

        return reply

agent = AgentWithMemory("assistant_1")
print(agent.chat("My name is Alice and I'm working on a Python RAG project."))
print(agent.chat("What was I working on?"))  # Should recall from memory

🕸️ Multi-Agent Systems

Multiple specialized agents collaborate, each focusing on what it does best. Common patterns: Orchestrator-Worker, Pipeline, and Debate.

Orchestrator-Worker Pattern
User Goal → Orchestrator (plans & delegates) → workers: Researcher (web search), Coder (write code), Critic (review & verify), Writer (draft output)
Python — Multi-Agent with LangGraph
# pip install langgraph langchain-openai
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage

llm = ChatOpenAI(model="gpt-4o", temperature=0.3)

# ── Define Shared State ───────────────────────────────────
class ResearchState(TypedDict):
    messages: Annotated[list, add_messages]
    research_notes: str
    draft: str
    critique: str
    final_output: str
    task: str

# ── Define Agents (Nodes) ─────────────────────────────────
def researcher_agent(state: ResearchState) -> dict:
    """Gathers information relevant to the task."""
    response = llm.invoke([
        SystemMessage(content="You are a research expert. Gather key facts and insights."),
        HumanMessage(content=f"Research this topic thoroughly: {state['task']}")
    ])
    return {"research_notes": response.content}

def writer_agent(state: ResearchState) -> dict:
    """Drafts content based on research."""
    response = llm.invoke([
        SystemMessage(content="You are an expert technical writer. Write clearly and accurately."),
        HumanMessage(content=f"""
Task: {state['task']}
Research Notes: {state['research_notes']}

Write a comprehensive, well-structured response.""")
    ])
    return {"draft": response.content}

def critic_agent(state: ResearchState) -> dict:
    """Reviews and critiques the draft."""
    response = llm.invoke([
        SystemMessage(content="You are a critical reviewer. Find factual errors, gaps, and improvements."),
        HumanMessage(content=f"""
Original Task: {state['task']}
Draft to Review: {state['draft']}

Provide specific, actionable critique. Rate quality 1-10.""")
    ])
    return {"critique": response.content}

def reviser_agent(state: ResearchState) -> dict:
    """Revises based on critique."""
    response = llm.invoke([
        SystemMessage(content="You are a skilled editor. Improve the draft based on critique."),
        HumanMessage(content=f"""
Task: {state['task']}
Original Draft: {state['draft']}
Critique: {state['critique']}

Produce the final, polished version.""")
    ])
    return {"final_output": response.content}

# ── Build Graph ───────────────────────────────────────────
def build_research_pipeline() -> StateGraph:
    graph = StateGraph(ResearchState)

    # Add nodes
    graph.add_node("researcher", researcher_agent)
    graph.add_node("writer",     writer_agent)
    graph.add_node("critic",     critic_agent)
    graph.add_node("reviser",    reviser_agent)

    # Define edges (pipeline flow)
    graph.set_entry_point("researcher")
    graph.add_edge("researcher", "writer")
    graph.add_edge("writer",     "critic")
    graph.add_edge("critic",     "reviser")
    graph.add_edge("reviser",    END)

    return graph.compile()

# ── Run Pipeline ──────────────────────────────────────────
pipeline = build_research_pipeline()

result = pipeline.invoke({
    "task": "Explain how transformers work in modern LLMs",
    "messages": [],
    "research_notes": "",
    "draft": "",
    "critique": "",
    "final_output": "",
})

print("=== FINAL OUTPUT ===")
print(result["final_output"])
print("\n=== CRITIQUE ===")
print(result["critique"])

🔧 Agent Framework Comparison

| Framework | Best For | Key Feature | Complexity |
|---|---|---|---|
| LangGraph | Complex stateful workflows, DAGs | Graph-based state machines, cycles | Medium |
| CrewAI | Role-based multi-agent teams | Crew + Role abstractions, easy setup | Low |
| AutoGen | Conversational multi-agent | Agent conversations, code execution | Medium |
| Anthropic SDK | Production agents with Claude | Native tool use, streaming, vision | Low |
| Custom | Maximum control and performance | Build exactly what you need | High |

✅ Agent Design Best Practices

  • Design for failure: Agents will sometimes call wrong tools or loop. Add max step limits, error handling, and fallbacks.
  • Minimal tool surface: Give agents only the tools they need. Fewer tools = less confusion = more reliable behavior.
  • Structured tool outputs: Return consistent JSON from tools. Unstructured output confuses agents.
  • Observability: Log every tool call, reasoning step, and state transition. You need visibility to debug agents.
  • Human-in-the-loop: For high-stakes actions (deleting data, sending emails), require human approval before execution.
  • Idempotent tools: Design tools that can be safely retried without side effects (or track completed actions in state).
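The idempotency point can be made concrete with a small wrapper that caches results by tool-call id, so a retried call never repeats its side effect (a hypothetical sketch, not a framework API):

```python
class IdempotentToolRunner:
    """Cache tool results by call id so retries don't repeat side effects."""
    def __init__(self):
        self.completed: dict[str, str] = {}

    def run(self, call_id: str, fn, *args) -> str:
        if call_id in self.completed:       # already executed: return cached result
            return self.completed[call_id]
        result = fn(*args)
        self.completed[call_id] = result
        return result

sent = []
def send_email(to: str) -> str:
    sent.append(to)                         # side effect we must not repeat
    return f"emailed {to}"

runner = IdempotentToolRunner()
runner.run("call_1", send_email, "alice@example.com")
runner.run("call_1", send_email, "alice@example.com")  # retry: served from cache
print(len(sent))  # 1
```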
⚠️
Security warning: Never let agents execute arbitrary code from untrusted sources. Sandbox code execution with tools like E2B or Docker. Validate all tool inputs and limit permissions.