Stop Optimizing Prompts. Optimize Context Instead.
Teams spend days tweaking system prompts while accuracy stays stuck. The real gains come from structuring context: feed the model the right data instead of better adjectives.
Last month I was debugging a support chatbot that was giving wrong answers about refund policies. The team had spent days tweaking the system prompt—trying "You are a helpful assistant" vs "You are a world-class expert", adding "Chain of Thought" triggers, testing magic phrases.
The problem wasn't the prompt. The problem was that the model never saw the actual refund policy document. It was guessing.
They were optimizing the wrong end of the pipe.
In 2025, the model is rarely the bottleneck. The bottleneck is what you feed it.
If you give a frontier model garbage context, no amount of "prompt engineering" will save you. If you give a mediocre model perfect, structured context, it will outperform the state-of-the-art.
This is the shift from Prompt Engineering (optimizing the instruction) to Context Engineering (optimizing the state).
Inject precise, typed state, and the model has no choice but to be correct.
Defining the Terms (Because Words Matter)
Before diving into implementation, let's define our terms. The industry is messy—everyone uses different labels. I prefer the mental model used by labs like Anthropic and OpenAI in their technical documentation, even if they don't always use these exact labels.
Why does this matter? Because if you're optimizing prompts when you should be optimizing context, you're wasting time. Understanding the distinction helps you focus on what actually moves the needle.
1. Prompt Engineering
The art of instruction.
This is the static logic. It's the function definition. It includes the tone, the output format instructions (XML/JSON), and the few-shot examples that teach the behavior.
Example: "You are a helpful assistant. Always respond in JSON format. Here are three examples of good responses: [examples]"
- Goal: Compliance and format.
- Tooling: String templates, Jinja2.
- When it matters: Getting consistent output structure, enforcing tone, teaching patterns through examples.
- When it doesn't: When the model lacks the facts it needs. No amount of instruction will help if the model doesn't know the refund policy.
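To make the boundary concrete, here is a minimal sketch of the prompt-engineering layer on its own: a static template that encodes tone, output format, and few-shot examples, and nothing that depends on runtime state. The function name and example content are hypothetical.

```typescript
// A static prompt template: tone, output format, few-shot examples.
// Nothing here depends on runtime state -- that's the point, and the limitation.
const FEW_SHOT_EXAMPLES = [
  { question: "Where is my order?", answer: '{"intent":"order_status","reply":"..."}' },
  { question: "Cancel my plan", answer: '{"intent":"cancel_subscription","reply":"..."}' },
];

function buildSystemPrompt(): string {
  const examples = FEW_SHOT_EXAMPLES
    .map((ex, i) => `Example ${i + 1}:\nQ: ${ex.question}\nA: ${ex.answer}`)
    .join("\n\n");

  return [
    "You are a helpful support assistant.",
    'Always respond with a single JSON object: { "intent": string, "reply": string }.',
    "",
    examples,
  ].join("\n");
}
```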
2. Context Engineering
The engineering of state.
This is the dynamic data. It's the function arguments. It includes the user profile, the retrieved documents (RAG), the conversation history, and the current state of the world.
Example: Before calling the LLM, you fetch the user's order history, their active support ticket, and the relevant refund policy document. You structure this into a typed object and inject it into the prompt.
- Goal: Accuracy and grounding.
- Tooling: Vector DBs, SQL, Redis, ETL pipelines.
- When it matters: When the model needs facts it doesn't have in its training data. When answers depend on current state (user's orders, system errors, live data).
- When it doesn't: For simple queries that don't need external data ("What's the weather?"). Sometimes a good prompt is enough.
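And the context-engineering counterpart: before the model is ever called, fetch and shape the state. A minimal sketch of the refund example above; the three fetchers and the RefundContext shape are hypothetical placeholders for your own data access layer.

```typescript
// The context layer: gather state, shape it into a typed object, then inject it.
interface RefundContext {
  orders: { id: string; item: string; status: string }[];
  activeTicket: { id: string; topic: string } | null;
  refundPolicy: string; // the relevant policy excerpt, not the whole ToS
}

// Placeholder signatures for your own data access layer.
declare function getOrderHistory(userId: string): Promise<{ id: string; item: string; status: string }[]>;
declare function getActiveTicket(userId: string): Promise<{ id: string; topic: string } | null>;
declare function getPolicyExcerpt(topic: string): Promise<string>;

async function buildRefundContext(userId: string): Promise<RefundContext> {
  const [orders, activeTicket, refundPolicy] = await Promise.all([
    getOrderHistory(userId),
    getActiveTicket(userId),
    getPolicyExcerpt("refunds"),
  ]);
  return { orders, activeTicket, refundPolicy };
}
```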
Who Coined "Context Engineering"?
While the shift has been happening organically, Tobi Lütke (CEO of Shopify) crystallized it perfectly in a tweet that captured the industry's mood:
"I prefer the term context construction (or engineering). It's the art of supplying exactly the context so that the task is plausibly solvable by the LLM." — Tobi Lütke (@tobi)
He hit the nail on the head. It's not about asking nicely. It's about supplying the state that makes the solution inevitable.
This aligns with what I see in production: the hard part isn't the prompt template, it's the pipeline that fills it. Most teams spend days tweaking prompts when they should spend that time building a proper context hydrator.
The Architecture of Context
Now that we've defined the terms, let's see what a "Context Pipeline" actually looks like in production. This is a high-level overview, but it captures the core components.
Figure 1: Production Context Engineering Pipeline
The pipeline has four distinct stages:
- The Query: The raw intent from the user. Often ambiguous ("fix it", "what's wrong?", "help me"). This is where most systems start, but it's not enough.
Example: User types "I want a refund" — ambiguous. Is it for an order? A subscription? Which order? The query alone doesn't tell you.
- The Hydrator: This is the engine. It's not just a database query. It's a logic layer that decides what knowledge is needed based on the query and user state.
The hydrator asks:
- Does it need the User Profile? (Postgres) — Yes, if query is about "my orders" or "my account"
- Does it need documentation? (Vector Store) — Yes, if query is about policies or procedures
- Does it need the last 5 errors? (Observability API) — Yes, if query is about "why did X fail?"
- Does it need order history? (Postgres) — Only if query mentions orders
- Does it need feature flags? (Redis) — Only if query depends on enabled features
The key insight: The hydrator is where you encode your domain knowledge. It's not a dumb data fetcher—it's a decision engine.
- Structured Context: The output of the hydrator isn't text. It's a strict JSON schema or a set of typed objects. We don't feed the LLM raw database rows; we feed it a view.
Why structured? Because models parse JSON better than prose. Because you can validate it. Because you can test it. Because you can version it.
- Intelligence: Only then do we invoke the model. The model is just the runtime that executes the logic over the context.
The shift: The model isn't "thinking" anymore—it's processing structured data. It's more like a template engine than a reasoning engine.
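Put together, the four stages reduce to a short orchestration function. This is a minimal sketch under obvious assumptions: all four helpers are placeholders for your own classifier, hydrator, prompt template, and LLM client.

```typescript
// Placeholder signatures: swap in your own classifier, hydrator, prompt template, and LLM client.
declare function classifyQuery(query: string): "billing" | "support" | "technical";
declare function hydrateContext(userId: string, query: string, intent: string): Promise<object>;
declare function renderPrompt(intent: string, context: object): string;
declare function callModel(prompt: string): Promise<string>;

// The whole pipeline: query -> hydrator -> structured context -> model.
async function answer(userId: string, query: string): Promise<string> {
  const intent = classifyQuery(query);                          // 1. The Query (disambiguated)
  const context = await hydrateContext(userId, query, intent);  // 2. The Hydrator
  const prompt = renderPrompt(intent, context);                 // 3. Structured Context
  return callModel(prompt);                                     // 4. Intelligence (the only LLM call)
}
```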
The Three Laws of Context
With the architecture in mind, here are the three fundamental principles that guide effective context engineering:
1. Structure beats Prose
Models love JSON. They love schemas. They tolerate prose.
When you force your context into strict structures (using Pydantic or Zod), you force yourself to decide what matters. You aren't just dumping database rows into the prompt; you are designing an interface for the intelligence.
Bad Context (What I Actually Saw):
```text
User ID 12345, created account 2023-03-15, last login 2025-11-20, subscription active, plan Pro, billing cycle monthly, payment method card ending 4242, last invoice paid 2025-10-15, support tickets: ticket-789 (resolved, shipping delay), ticket-456 (open, refund request), ticket-123 (closed, feature request), order history: order-001 (Nike Air, delivered 2025-11-10), order-002 (Adidas Ultraboost, processing), order-003 (Puma RS-X, cancelled), preferences: email notifications enabled, SMS disabled, newsletter subscribed, marketing emails opted out, language English, timezone UTC+1, address: Hauptstraße 45, Berlin, 10115, Germany, phone +49 30 12345678...
```
This is what happens when you dump a database row into the prompt. I've seen this exact pattern—someone concatenates all user fields into a string and sends it to the model. The model drowns in noise. In one case, accuracy dropped from 85% to 62% because the signal was buried.
Good Context:
```json
{
  "user": {
    "age": 30,
    "location": "Berlin",
    "segments": ["churn_risk", "high_value"]
  },
  "last_order": {
    "item": "Nike Air",
    "status": "delivered",
    "ticket": {
      "sentiment": "negative",
      "topic": "shipping_delay"
    }
  }
}
```
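Enforcing that shape is where libraries like Zod come in. Here is a minimal sketch of a schema for the good context above; the exact fields and enums are assumptions mirroring the example, not a canonical schema.

```typescript
import { z } from "zod";

// The "view" we feed the model. Everything else from the raw database row is dropped.
const SupportContext = z.object({
  user: z.object({
    age: z.number().int(),
    location: z.string(),
    segments: z.array(z.string()),
  }),
  last_order: z.object({
    item: z.string(),
    status: z.enum(["processing", "delivered", "cancelled"]),
    ticket: z.object({
      sentiment: z.enum(["negative", "neutral", "positive"]),
      topic: z.string(),
    }),
  }),
});

type SupportContext = z.infer<typeof SupportContext>;

// Hypothetical hydrator output; parse() throws if the shape is wrong.
declare const rawContextFromHydrator: unknown;
const context: SupportContext = SupportContext.parse(rawContextFromHydrator);
```

If the hydrator produces the wrong shape, the parse fails loudly before a single token is spent.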
2. Dynamic Injection (The JIT Prompt)
Static system prompts are a smell. Your system prompt should be a template that gets hydrated at runtime.
Most engineers treat the system prompt as a static constant. In reality, it should be the final output of a complex data pipeline.
Why dynamic? Because different queries need different context. A billing question needs different data than a technical support question. A premium user might get different instructions than a free user.
The Pattern:
```typescript
// src/ai/context.ts
async function buildContext(userId: string, query: string) {
  // Classify the query first
  const intent = classifyQuery(query); // "billing" | "support" | "technical"

  // Fetch only what this intent needs
  const [profile, history, relevantDocs] = await Promise.all([
    getUserProfile(userId),
    intent === "billing" ? getRecentHistory(userId) : null,
    intent === "support" ? searchDocs(query) : [] // RAG, but selective
  ]);

  // Build context based on intent
  return `
Role: ${intent === "billing" ? "Billing Support Agent" : "Technical Support Agent"}
User Profile: ${JSON.stringify(profile)}
${history ? `History Summary: ${summarize(history)}` : ''}
${relevantDocs.length > 0 ? `Reference Material: ${formatDocs(relevantDocs)}` : ''}
`;
}
```
This looks like standard software engineering. Because it is. The "AI" part is just the final function call. The engineering is in data fetching, classification, and aggregation.
The benefit: Your prompt adapts to the query. You're not sending irrelevant context. You're not wasting tokens. You're not confusing the model with noise.
3. The "Information Gain" Metric
Every token costs money and latency. Every token dilutes the attention mechanism.
I measure context by Information Gain per Token.
If you inject a 5000-word terms of service agreement just to answer "what is the refund policy?", your information gain is near zero.
Context Pruning Strategy (a code sketch follows this list):
- Summarize first: Don't pass raw chat logs. Pass a summary of the last session.
Example: Instead of 50 messages (2000 tokens), pass "User asked about refunds twice in the last session. First query was about order #12345, second was about subscription cancellation."
- Filter fields: Don't pass the whole `User` object. Pass only what the query needs.
Example: For "What are my recent orders?", you need `User.id` and `User.email`. You don't need `User.preferences`, `User.marketingOptIn`, or `User.timezone`. That's 200 tokens saved.
- Rank relevance: If you have 10 error logs, pass the most recent unique 3.
Example: Instead of all 10 errors (some duplicates, some irrelevant), pass the 3 most recent unique errors that match the query pattern. If the query is about "payment failed", filter to payment-related errors only.
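Here is the sketch mentioned above: what the filtering and ranking rules can look like in code. The ErrorLog shape and helper names are hypothetical; rule 1 (summarization) usually lives in a background job or a cheap model call, so it's omitted here.

```typescript
interface ErrorLog { code: string; message: string; timestamp: number; }

// Rule 3: rank relevance. Keep only recent, unique errors that actually match the query.
function pruneErrors(errors: ErrorLog[], query: string, limit = 3): ErrorLog[] {
  const terms = query.toLowerCase().split(/\s+/);
  const seen = new Set<string>();
  return errors
    .filter(e => terms.some(t => e.message.toLowerCase().includes(t))) // relevance
    .sort((a, b) => b.timestamp - a.timestamp)                          // recency
    .filter(e => {
      if (seen.has(e.code)) return false;                               // dedupe by error code
      seen.add(e.code);
      return true;
    })
    .slice(0, limit);
}

// Rule 2: filter fields. Pass a projection of User, not the whole object.
function projectUser(user: { id: string; email: string; [key: string]: unknown }) {
  return { id: user.id, email: user.email };
}
```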
Why Prune? The "Needle in the Haystack" Fallacy
"But wait," you say. "Gemini 1.5 Pro has a 2 million token context window. Why do I need to prune?"
Because attention is not infinite.
Even if the model can fit 10 books in its context, its ability to reason across that context degrades. This is the "Lost in the Middle" phenomenon documented in research papers like Lost in the Middle: How Language Models Use Long Contexts. The model performs best on information at the beginning and end of the context window, and worst in the middle.
More importantly, latency.
- Sending 10k tokens: ~500ms processing.
- Sending 1M tokens: ~10-60 seconds processing.
If you are building a real-time application, you cannot afford lazy context engineering. You must curate.
But there is a third reason: cost. Every token you send costs money, and every token dilutes the signal-to-noise ratio. A well-pruned 2k token context beats a lazy 50k token dump every time.
The Numbers: Before and After
Theory is nice, but what about the actual impact? I don't have a perfect A/B test to share (production systems are messy), but here is what I've measured when teams switched from prompt-tuning to context engineering:
Before vs After: typical metrics I've measured when teams switched from prompt-tuning to context engineering (production systems vary). The trade-off: increased latency and cost, but a significantly reduced hallucination rate and improved accuracy.
The latency increase is the hydrator overhead. The cost increase is from:
- More tokens sent to the model (structured context)
- Database queries (Postgres, Redis, Vector DB)
- Caching infrastructure
For most production systems, this trade-off is worth it. But not always.
The Cost of Context
Let's break down the cost of a typical context-engineered query using GPT-5.1 as an illustrative example (pricing as of November 2025: $1.25 per million input tokens, $10 per million output tokens):
Cost breakdown for a representative query, "Why did my payment fail?", using GPT-5.1 pricing ($1.25/M input, $10/M output). Costs vary by model and usage.
Example calculation (GPT-5.1):
- Typical query: 2,000 input tokens (structured context) + 500 output tokens (response)
- Input cost: 2,000 × $0.00000125 = $0.0025
- Output cost: 500 × $0.00001 = $0.005
- LLM total: $0.0075 per query
- Infrastructure overhead (DB queries, vector search): ~$0.0003-0.0007
- Total: ~$0.0078-0.0082 per query
ROI Calculation (illustrative):
- Cost increase vs. minimal context: +$0.006-0.007 per query
- Accuracy increase: significant (typically 20-30 percentage points in my experience)
- For a system with 10k queries/day: +$60-70/day, but significantly fewer incorrect answers
Important: These numbers are illustrative examples using GPT-5.1 pricing. Actual costs vary significantly by model choice (GPT-4o, Claude, etc.), token usage, and infrastructure setup. Always measure your own metrics.
For most production systems, this trade-off is worth it. But you need to measure your own metrics. I've seen teams where the cost increase wasn't worth it—usually when they're doing millions of queries per day and the accuracy gain was marginal.
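If you want to rerun this arithmetic for your own workload, it fits in a few lines. The prices baked in below are the illustrative GPT-5.1 figures from above; substitute your own model's pricing and measured token counts.

```typescript
// Per-query cost estimate in USD. Prices are the illustrative figures above.
function costPerQuery(inputTokens: number, outputTokens: number, infraOverhead = 0.0005): number {
  const INPUT_PRICE = 1.25 / 1_000_000;  // $1.25 per million input tokens
  const OUTPUT_PRICE = 10 / 1_000_000;   // $10.00 per million output tokens
  return inputTokens * INPUT_PRICE + outputTokens * OUTPUT_PRICE + infraOverhead;
}

costPerQuery(2_000, 500); // ~$0.008 per query, matching the breakdown above
```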
Deep Dive: The Context Object Pattern
Now that we understand the principles and the impact, let's see how to implement them. In production, we don't just concatenate strings. We build a Context Object. This is a typed interface that represents everything the model needs to know.
Figure: The Context Object Pattern. The interface groups user fields (id, role, technicalLevel), environment fields (time, featureFlags), and knowledge fields (documents[], activeTicket?).
By defining this interface, you decouple the Hydration Logic from the Prompt Rendering.
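A plausible shape for that interface, reconstructed from the fields in the figure and the hydrator below (treat the exact types as assumptions rather than a fixed contract):

```typescript
interface AIContext {
  user: {
    id: string;
    role: "admin" | "user";
    technicalLevel: "novice" | "expert";
  };
  environment: {
    time: string;                          // ISO timestamp
    featureFlags: Record<string, boolean>;
  };
  knowledge: {
    documents: RetrievedDoc[];             // RAG results, already ranked and pruned
    activeTicket?: SupportTicket | null;   // only populated for support-related queries
    recentErrors?: ErrorLog[];             // only populated for technical queries
  };
}

interface RetrievedDoc { title: string; content: string; score: number; }
interface SupportTicket { id: string; topic: string; sentiment: string; }
interface ErrorLog { code: string; message: string; timestamp: number; }
```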
The Hydrator
The Hydrator is responsible for populating this object. It should be robust, parallel, fail-safe, cached, and observable.
```typescript
// src/ai/hydrator.ts
import { cache } from '@/lib/cache';
import { logger } from '@/lib/logger';
import { metrics } from '@/lib/metrics';

async function hydrateContext(req: Request): Promise<AIContext> {
  const startTime = Date.now();
  const user = await getCurrentUser(req);

  // 1. Identity (Fast, cached)
  // User data rarely changes, so cache aggressively
  const cachedUser = await cache.getOrSet(
    `user:${user.id}`,
    () => Promise.resolve(user),
    { ttl: 300 } // 5 min cache - balance between freshness and performance
  );

  // 2. Query Classification
  // Decide what context this query actually needs
  const intent = classifyQueryIntent(req.query);
  const needsOrderHistory = intent === 'billing' || intent === 'support';
  const needsDocs = intent === 'support' || intent === 'technical';
  const needsErrors = intent === 'technical' && req.query.includes('fail');

  // 3. Parallel Data Fetching (Slower, with timeouts)
  // Fetch only what's needed, in parallel, with fail-safes
  const [flags, ticket, docs, errors] = await Promise.allSettled([
    // Feature flags: timeout after 1s (non-critical, don't block)
    Promise.race([
      getFeatureFlags(cachedUser.id),
      new Promise<never>((_, reject) => setTimeout(() => reject(new Error('timeout')), 1000))
    ]),
    // Active ticket: only if support-related query
    needsOrderHistory ? getActiveTicket(cachedUser.id).catch(() => null) : Promise.resolve(null),
    // Documentation: only if query needs it
    needsDocs ? searchVectorDB(req.query).catch(() => []) : Promise.resolve([]),
    // Error logs: only if technical query mentions failures
    needsErrors ? getRecentErrors(cachedUser.id, 3).catch(() => []) : Promise.resolve([])
  ]).then(results => [
    results[0].status === 'fulfilled' ? results[0].value : {},
    results[1].status === 'fulfilled' ? results[1].value : null,
    results[2].status === 'fulfilled' ? results[2].value : [],
    results[3].status === 'fulfilled' ? results[3].value : []
  ]);

  const latency = Date.now() - startTime;
  metrics.histogram('context_hydration_ms', latency);
  logger.info('Context hydrated', {
    userId: cachedUser.id,
    intent,
    latency,
    docsCount: docs.length,
    fetchedSources: { flags: !!flags, ticket: !!ticket, docs: docs.length, errors: errors.length }
  });

  return {
    user: cachedUser,
    environment: {
      time: new Date().toISOString(),
      featureFlags: flags
    },
    knowledge: {
      documents: docs,
      activeTicket: ticket,
      recentErrors: errors
    }
  };
}
```
Notice the patterns:
- Query Classification: We classify the intent first, then fetch only what's needed. A billing query doesn't fetch error logs. A technical query doesn't fetch order history. This saves latency and tokens.
- Caching: User data cached for 5 minutes (rarely changes). Feature flags cached longer (they change infrequently). Vector DB results aren't cached (they're query-specific).
- Timeouts: Feature flags timeout after 1s (don't block on non-critical data). If feature flags are slow, we continue without them. The LLM can still answer.
- Promise.allSettled: All fetches run in parallel, failures don't cascade. If the vector DB is down, we still return user data and feature flags. The LLM gets partial context, which is better than no context.
- Observability: Latency metrics and structured logging. We log what we fetched, how long it took, and what failed. This helps debug production issues.
- Graceful degradation: If vector DB is down, return empty array. If feature flags timeout, return empty object. The LLM should still try to answer with whatever context we have.
Context hydration should never crash the request. This is graceful degradation applied to AI. Your system should degrade gracefully, not fail catastrophically.
Testing Your Context
One of the biggest benefits of Context Engineering is that it makes your AI system testable. Prompt tweaks can only be judged through the model's probabilistic outputs, which are hard to verify; context engineering gives you deterministic inputs you can assert on.
Why this matters: When your AI system fails in production, you need to know why. Was it the prompt? Was it the model? Or was it the context? With context engineering, you can test the context independently.
You can't easily unit test "does the model write a good poem?". But you can unit test "does the hydrator retrieve the correct refund policy when the user asks about refunds?".
The shift: Instead of testing the model's output (probabilistic, flaky), you test the model's input (deterministic, reliable). If the input is correct, the output is usually correct.
Unit Tests (Hydrator Logic)
```typescript
// tests/hydrator.test.ts
describe('hydrateContext', () => {
  test('includes refund policy for billing queries', async () => {
    const query = "I want my money back";
    const context = await hydrateContext(mockUser, query);
    expect(context.knowledge.documents).toContainEqual(
      expect.objectContaining({ title: "Refund Policy" })
    );
  });

  test('includes user profile for personalized queries', async () => {
    const query = "What are my recent orders?";
    const context = await hydrateContext(mockUser, query);
    expect(context.user.id).toBe(mockUser.id);
    expect(context.user.role).toBeDefined();
  });

  test('gracefully degrades when vector DB fails', async () => {
    mockVectorDB.mockRejectedValue(new Error('DB down'));
    const context = await hydrateContext(mockUser, "test query");
    expect(context.knowledge.documents).toEqual([]);
    expect(context.user).toBeDefined(); // Other data still works
  });

  test('respects timeout for feature flags', async () => {
    mockFeatureFlags.mockImplementation(() =>
      new Promise(resolve => setTimeout(resolve, 2000))
    );
    const start = Date.now();
    const context = await hydrateContext(mockUser, "test");
    const elapsed = Date.now() - start;
    expect(elapsed).toBeLessThan(1500); // Should timeout before 2s
    expect(context.environment.featureFlags).toEqual({});
  });
});
```
Integration Tests (Full Pipeline)
```typescript
// tests/integration/context-pipeline.test.ts
describe('Context Pipeline Integration', () => {
  test('end-to-end: refund query retrieves correct context', async () => {
    const query = "I want a refund for order #12345";
    const context = await hydrateContext(mockRequest(query));

    // Verify context structure
    expect(context.knowledge.documents.length).toBeGreaterThan(0);
    expect(context.knowledge.documents[0].score).toBeGreaterThan(0.7);

    // Verify context contains order info
    const orderDoc = context.knowledge.documents.find(
      doc => doc.title.includes('Order')
    );
    expect(orderDoc).toBeDefined();

    // Verify user context
    expect(context.user.id).toBe('user-123');
  });
});
```
Regression Tests (Context Changes)
```typescript
// tests/regression/context-schema.test.ts
import { z } from 'zod';

test('context schema remains stable', () => {
  const context = createMockContext();
  const schema = z.object({
    user: z.object({
      id: z.string(),
      role: z.enum(['admin', 'user']),
      technicalLevel: z.enum(['novice', 'expert'])
    }),
    environment: z.object({
      time: z.string(),
      featureFlags: z.record(z.boolean())
    }),
    knowledge: z.object({
      documents: z.array(z.any()),
      activeTicket: z.any().optional()
    })
  });
  expect(() => schema.parse(context)).not.toThrow();
});
```
This is deterministic testing for a probabilistic system. It ensures that the input to the model is correct, which solves most "the AI is hallucinating" problems. In my experience, when the model gives wrong answers, it's usually not hallucinating—it just wasn't given the info it needed.
When Context Engineering Fails
Before you rush to implement this everywhere, remember: Context Engineering is not a silver bullet. Here are the failure modes I've actually hit in production:
1. Over-Engineering the Context
You can build a perfect context hydrator that fetches 15 different data sources, but if your queries are simple ("What's the weather?"), you're wasting latency and money.
What I've seen: A fintech chatbot was fetching user profile, order history, support tickets, feature flags, and documentation for every single query—even "What time do you close?". Average latency was 2.3 seconds. After adding query classification (simple queries skip hydration entirely), latency dropped to 800ms for about 60% of requests. The other 40% still needed full context, but at least they weren't blocking simple queries.
Rule: Only hydrate what the query needs. Use query classification to decide what to fetch.
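The classifier doesn't have to be fancy. A keyword-based first pass catches most traffic, with a cheap model call as a fallback for ambiguous queries. A minimal sketch (the intents and keyword lists are assumptions):

```typescript
type Intent = "billing" | "support" | "technical" | "smalltalk";

// First pass: cheap keyword routing. Queries classified as "smalltalk" skip hydration entirely.
function classifyQueryIntent(query: string): Intent {
  const q = query.toLowerCase();
  if (/\b(refund|invoice|charge|payment|billing)\b/.test(q)) return "billing";
  if (/\b(error|fail|bug|crash|timeout)\b/.test(q)) return "technical";
  if (/\b(order|ticket|policy|cancel|account)\b/.test(q)) return "support";
  return "smalltalk"; // ambiguous queries could fall back to an LLM classifier instead
}
```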
2. Context Too Specific
If your context is hyper-specific to one use case, it won't generalize. You end up with a brittle system that breaks on edge cases.
Example:
```typescript
// ❌ BAD: Too specific
context.knowledge = {
  refundPolicy: "Section 4.2: Refunds processed within 7-10 business days..."
};

// ✅ GOOD: Structured but flexible
context.knowledge = {
  documents: [
    { type: "policy", section: "refunds", content: "..." },
    { type: "faq", topic: "refunds", content: "..." }
  ]
};
```
3. Latency Budget Exceeded
If your hydrator takes 3 seconds and your SLA is 2 seconds, context engineering won't help. You need to optimize the hydrator or reduce context scope.
What happened: A support chatbot had a 2s SLA but the hydrator was taking 2.8s. The bottleneck was Postgres queries for user profiles (400ms) and feature flags (300ms). They moved user profiles to Redis (now 15ms), cached feature flags aggressively (5min TTL → 1hr), and reduced vector search from top-10 to top-3 chunks. Latency dropped to 1.4s. Still over SLA, but close enough that they could negotiate with product.
Solutions:
- Aggressive caching (Redis, in-memory)
- Parallel fetching (already covered)
- Lazy loading (fetch only what's needed)
- Pre-computation (background jobs)
4. Cost Prohibitive
If you're making 10 database queries per request and your traffic is high, context engineering can break your budget.
What I've seen: A SaaS product was doing 5M queries per day. Each query made 8 database calls (user, orders, tickets, docs, flags, errors, logs, preferences). At $0.0001 per query, that's $500/day just for context hydration. They reduced it to 2-3 calls per query (with aggressive caching) and cut costs to $150/day.
Solutions:
- Cache everything possible (user data, feature flags, documentation summaries)
- Batch queries (fetch user + orders in one query instead of two)
- Use cheaper data sources (Redis instead of Postgres for hot data)
- Reduce context scope for high-traffic endpoints (simple queries skip hydration)
- Pre-compute expensive operations (background jobs that summarize chat logs)
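As an example of the last point, a background job can pre-compute the expensive pieces (like session summaries) so the hydrator only pays a cache read at request time. A sketch reusing the same hypothetical cache helper as the hydrator above; getRecentMessages, summarizeSession, and the cache.set/get methods are assumptions.

```typescript
import { cache } from '@/lib/cache'; // same hypothetical cache helper as the hydrator

declare function getRecentMessages(userId: string): Promise<string[]>;  // chat-log fetch (assumed)
declare function summarizeSession(messages: string[]): Promise<string>; // cheap offline LLM call (assumed)

// Runs on a schedule (cron / queue worker), never in the request path.
async function precomputeSessionSummary(userId: string): Promise<void> {
  const messages = await getRecentMessages(userId);
  const summary = await summarizeSession(messages);
  await cache.set(`session-summary:${userId}`, summary, { ttl: 3600 }); // 1 hour
}

// At request time the hydrator does a single cache read instead of a summarization call.
async function getSessionSummary(userId: string): Promise<string | null> {
  return cache.get(`session-summary:${userId}`);
}
```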
5. The Context is Wrong
If your hydrator retrieves the wrong documents, the model will confidently hallucinate based on bad context. This is worse than no context.
What I've seen: A support chatbot was retrieving documentation about "API rate limits" when users asked about "payment limits". The embeddings were similar (both mention "limits"), but the content was wrong. The model confidently answered about API rate limits, confusing users. After adding re-ranking and better query classification, retrieval accuracy improved from 65% to 89%.
Mitigation:
- Test your hydrator (covered above) — verify it retrieves the right docs for common queries
- Monitor retrieval quality (log document scores, track user feedback)
- Add human-in-the-loop for critical queries (escalate when confidence is low)
- Use re-ranking for RAG (not just cosine similarity) — semantic similarity ≠ logical relevance (see the sketch after this list)
- Add query expansion (synonyms, related terms) to improve retrieval
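Re-ranking doesn't have to mean a heavyweight cross-encoder on day one. Even a crude lexical re-score on top of the vector results catches many "similar embedding, wrong topic" cases like the rate-limit example above. A minimal sketch (the RetrievedDoc shape and the 50/50 weighting are assumptions; swap in a real cross-encoder when you can):

```typescript
interface RetrievedDoc { title: string; content: string; score: number; }

// Crude lexical re-rank: boost documents whose title/content actually share terms
// with the query, so "payment limits" doesn't lose to "API rate limits".
function rerank(query: string, docs: RetrievedDoc[], topK = 3): RetrievedDoc[] {
  const terms = query.toLowerCase().split(/\s+/).filter(t => t.length > 2);
  return docs
    .map(doc => {
      const text = (doc.title + " " + doc.content).toLowerCase();
      const overlap = terms.filter(t => text.includes(t)).length / Math.max(terms.length, 1);
      return { ...doc, score: 0.5 * doc.score + 0.5 * overlap }; // blend vector + lexical signal
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```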
Conclusion
Stop treating LLMs like magic boxes that need the right spell. Treat them like functions.
Functions need valid arguments. Context is your argument. Structure it. Validate it. Prune it. Test it.
The best prompt is usually the shortest one, followed by the best data.
Context Engineering is not free. It costs latency, money, and complexity. Use it when accuracy matters more than speed, when hallucinations are costly, and when you have the infrastructure to support it.
For simple queries ("What's the weather?"), a well-crafted prompt might be enough. For production systems where accuracy matters, context engineering is usually the right choice.
Start small: Don't build a perfect hydrator on day one. Start with one data source (user profile). Add more as you need them. Measure the impact. Iterate.
The goal isn't perfect context engineering. The goal is better answers. If context engineering improves accuracy significantly, that's a win. Even if it adds latency.