🧠 Lesson 1-5: Model APIs – Brains of Your Agent

Learning Objectives

By the end of this lesson, you will be able to:

  • Describe the major commercial and open-source LLM APIs (OpenAI, Anthropic, Hugging Face, Cohere)
  • Compare model capabilities: context window, multi-modality, safety, and instruction tuning
  • Evaluate trade-offs among cost, latency, and output quality for different providers
  • Programmatically switch between model APIs in your agent code
  • Implement A/B benchmarking to select the optimal model for each agent task

🌐 1. The LLM API Ecosystem

1.1 OpenAI GPT Family

GPT-4 / GPT-4o

  • Up to 128K tokens of context; multimodal (vision, audio) support
  • Function-calling API for structured tool invocation (see the sketch below)
  • Pricing: roughly $0.03/1K prompt tokens and $0.06/1K completion tokens for GPT-4; GPT-4o is substantially cheaper (rates change often, so check the provider's pricing page)
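
Function-Calling Sketch

A minimal sketch of a function-call round-trip (the get_weather tool is a made-up example; assumes the openai Python SDK ≥ 1.0 with OPENAI_API_KEY set in the environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Describe one tool the model may invoke
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model chose to call the tool, inspect the structured arguments
if resp.choices[0].message.tool_calls:
    call = resp.choices[0].message.tool_calls[0]
    print(call.function.name, call.function.arguments)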

1.2 Anthropic Claude

Claude 4

  • Strong safety mitigations and high-quality long-form reasoning
  • Context up to 200K tokens; chat-style Messages API (see the sketch below)
  • Pricing: approx. $0.003/1K input and $0.015/1K output for Sonnet-class models; Opus tiers cost more (check Anthropic's pricing page)
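
Claude Messages API Sketch

A minimal call through the official anthropic SDK (the model ID is illustrative; check Anthropic's docs for current IDs):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
resp = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative; use a current model ID
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain context windows in one sentence."}],
)
print(resp.content[0].text)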

1.3 Hugging Face Inference API

Vicuna, LLaMA-3, Mistral

  • Wide variety of hosted open models, with free/community tiers and paid options (a basic call is sketched below)
  • Context windows vary (8K to 128K) depending on the model
  • Pricing: model-specific; can be as low as $0.001/1K tokens
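
Hugging Face Inference Sketch

A minimal text-generation call via huggingface_hub (the model ID is illustrative; assumes an HF_TOKEN environment variable):

import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # illustrative model ID
    token=os.getenv("HF_TOKEN"),
)
output = client.text_generation(
    "Explain LLM context windows in one sentence.",
    max_new_tokens=64,
)
print(output)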

1.4 Cohere Command Models

Command R, Command R+, Command Light

  • Optimized for retrieval-augmented generation, classification, and summarization tasks
  • Context windows vary by model (up to 128K tokens for Command R)
  • Pricing: competitive, per-request bundles

⚖️ 2. Model Selection Criteria

Model Comparison Criteria

  • Context Window: longer windows allow larger prompts and document-grounded tasks
  • Instruction Tuning: models fine-tuned to follow instructions yield more reliable tool planning
  • Multi-Modality: vision/audio support enables multimodal agent use cases
  • Safety & Alignment: built-in guardrails reduce the risk of harmful outputs
  • Latency & Throughput: response time per token matters most for real-time agents
  • Pricing Model: per-token vs. fixed-price plans, streaming discounts
  • Function Calling: native support for structured tool calls

🔄 3. Programmatic Model Switching

Abstract your agent's "LLM client" behind a common interface:

LLM Client Abstraction

class LLMClient:
    def __init__(self, provider: str, model: str, api_key: str):
        self.provider = provider
        self.model = model
        if provider == "openai":
            from openai import OpenAI
            self.client = OpenAI(api_key=api_key)
        elif provider == "anthropic":
            import anthropic
            self.client = anthropic.Anthropic(api_key=api_key)
        # Add Hugging Face, Cohere similarly
        else:
            raise ValueError(f"Unsupported provider: {provider}")

    def generate(self, prompt: str, **kwargs) -> str:
        if self.provider == "openai":
            resp = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                **kwargs,
            )
            return resp.choices[0].message.content
        elif self.provider == "anthropic":
            # The Messages API requires max_tokens; default it if not supplied
            resp = self.client.messages.create(
                model=self.model,
                max_tokens=kwargs.pop("max_tokens", 1024),
                messages=[{"role": "user", "content": prompt}],
                **kwargs,
            )
            return resp.content[0].text
        # Add other providers here
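
A quick usage sketch (the model ID is illustrative):

import os

llm = LLMClient("openai", "gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
print(llm.generate("Name three uses for an LLM agent.", max_tokens=100))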

Best Practices

  • Authentication: Store API keys in environment variables or a secrets manager
  • Error Handling: Catch rate-limit and authentication errors and implement exponential backoff (a minimal sketch follows)
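
Retry with Exponential Backoff

A hand-rolled backoff sketch (libraries such as tenacity provide the same pattern with less code; the exception type shown is from the openai SDK ≥ 1.0, and other providers raise their own equivalents):

import random
import time

import openai

def generate_with_retry(llm, prompt, max_retries=5):
    # Retry on rate limits, doubling the wait each attempt plus jitter
    for attempt in range(max_retries):
        try:
            return llm.generate(prompt)
        except openai.RateLimitError:
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError("Exceeded retry budget")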

⚑ 4. Streaming vs. Non-Streaming Calls

4.1 Non-Streaming

Non-Streaming Characteristics

  • Simplest: entire completion returned at once
  • Use for batch or background tasks

4.2 Streaming

Streaming Characteristics

  • Yields tokens as they are generated
  • Improves perceived responsiveness in UIs or chat agents
  • OpenAI example (Python, openai SDK ≥ 1.0):

    from openai import OpenAI

    client = OpenAI()
    stream = client.chat.completions.create(
        model="gpt-4o", messages=[...], stream=True
    )
    for chunk in stream:
        # Each chunk carries an incremental delta; content may be None
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")


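Streaming Client Method

To expose streaming through the LLMClient abstraction from Section 3, add a generator method (a sketch of the OpenAI branch only; other providers offer analogous streaming APIs):

    def generate_stream(self, prompt: str, **kwargs):
        # Yield text deltas as the model produces them (OpenAI branch only)
        stream = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            **kwargs,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content
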
5. A/B Benchmarking & Cost Analysis

5.1 Designing a Benchmark Harness

Benchmark Harness Details

  • Fixed Prompt Set: 20–50 representative prompts from your agent's domain
  • Metrics:
      • Quality: human-rated accuracy, relevance, coherence
      • Latency: average ms per token or per response
      • Cost: tokens used × per-token price
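
Cost Helper Sketch

The script in the next subsection calls two helpers, estimate_tokens and cost_per_token, that this lesson does not define. A minimal sketch (the ~4-characters-per-token heuristic is rough and the prices are illustrative placeholders; for real accounting, use the token counts the APIs return and current published rates):

# Rough token estimate: ~4 characters per token for English text
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

# Illustrative blended per-token prices (USD); real rates differ by input/output
PRICE_PER_TOKEN = {
    ("openai", "gpt-4o"): 0.03 / 1000,
    ("anthropic", "claude-4"): 0.015 / 1000,
}

def cost_per_token(provider: str, model: str) -> float:
    return PRICE_PER_TOKEN.get((provider, model), 0.0)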

5.2 Automating Measurements

Benchmarking Script

import csv
import os
import time

# Model IDs are illustrative; substitute your providers' current ones
models = [("openai", "gpt-4o"), ("anthropic", "claude-4")]
results = []

for provider, model in models:
    # Expects per-provider keys, e.g. OPENAI_API_KEY / ANTHROPIC_API_KEY
    client = LLMClient(provider, model, api_key=os.getenv(f"{provider.upper()}_API_KEY"))
    for prompt in test_prompts:  # test_prompts: your fixed prompt set from 5.1
        start = time.time()
        resp = client.generate(prompt)
        latency = time.time() - start
        tokens = estimate_tokens(prompt + resp)
        cost = cost_per_token(provider, model) * tokens
        results.append([provider, model, prompt, resp, latency, cost])

with open("benchmark.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Provider", "Model", "Prompt", "Response", "Latency", "Cost"])
    writer.writerows(results)
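
Summary Aggregation Sketch

To turn the raw CSV into per-model averages, a short aggregation pass suffices (standard library only; the column names match the header written above):

import csv
from collections import defaultdict

# Accumulate (latency_sum, cost_sum, count) per provider/model pair
totals = defaultdict(lambda: [0.0, 0.0, 0])
with open("benchmark.csv") as f:
    for row in csv.DictReader(f):
        key = (row["Provider"], row["Model"])
        totals[key][0] += float(row["Latency"])
        totals[key][1] += float(row["Cost"])
        totals[key][2] += 1

for (provider, model), (lat, cost, n) in totals.items():
    print(f"{provider}/{model}: avg latency {lat/n:.2f}s, avg cost ${cost/n:.4f}")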

💻 6. Mini-Project: Model Comparison Script

Model Comparison Challenge

Build a script that:

  1. Reads a set of prompts from prompts.txt
  2. Sends each prompt to two different models (e.g., GPT-4o vs. Claude 4)
  3. Records response, latency, and token usage
  4. Outputs a summary table of average latency and cost per prompt


❓ 7. Self-Check Questions

Knowledge Check

  1. What factors would lead you to choose a cheaper open-source model over a premium API?
  2. How does streaming output change your agent's architecture?
  3. In your LLMClient abstraction, how would you add support for a new provider?
  4. Which metrics are most critical when benchmarking summarization versus conversational tasks?

Phase Complete!

Phase 2: Agentic Workflows & Reliability →

Phase 2 begins: diving into Agentic Workflows & Reliability, building robust ReAct loops, advanced RAG pipelines, and systematic evaluation for production-grade agents.