tags: agent, agent-loop, architecture, optimization, deepseek, inference, cost-control, context-management
type: concept  created: 2026-05-25  updated: 2026-05-25

Cache-First Agent Loop

A context-management strategy for LLM agent loops that maximizes prefix-cache hit rates by structuring the prompt into immutable, append-only, and volatile regions.

Problem

DeepSeek bills cached input tokens at ~10% of the uncached input price. Automatic prefix caching applies only to the leading part of a request that matches an earlier request byte for byte; everything after the first differing byte is billed as uncached. Most agent frameworks achieve <20% cache hit rates in practice because they:

  • Reorder or rewrite history each turn
  • Inject fresh timestamps or session IDs into the system prompt
  • Inline full tool results into the conversation, shifting byte offsets
  • Rewrite the entire context on compaction
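
To make the failure mode concrete, here is a minimal sketch that measures how much of one request survives as a prefix of the next. The prompt strings and the helper `commonPrefixLength` are hypothetical, and characters stand in for the tokens or bytes a real cache would match on.

```typescript
// Length of the shared leading run of two strings, in UTF-16 code units.
// A real cache matches on tokens or bytes; characters are a stand-in here.
function commonPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  return i;
}

// Hypothetical prompt builders: one embeds a per-turn timestamp, one does not.
const withTimestamp = (ts: string, turns: string): string =>
  `Current time: ${ts}\nYou are a coding agent.\n${turns}`;
const stable = (turns: string): string => `You are a coding agent.\n${turns}`;

const turns1 = "user: list files\n";
const turns2 = turns1 + "assistant: ls\n";

// With a timestamp, the shared prefix ends inside the timestamp itself.
const churned = commonPrefixLength(
  withTimestamp("2026-05-25T10:00:00Z", turns1),
  withTimestamp("2026-05-25T10:00:07Z", turns2),
);
// Without it, the whole first request is a prefix of the second.
const kept = commonPrefixLength(stable(turns1), stable(turns2));

console.log({ churned, kept, firstRequestLength: stable(turns1).length });
```

In the timestamped variant the match stops before the system prompt even begins, so nothing after it can be served from cache.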

A coding agent that burns $200/month on background projects is one nobody uses. The cache-first approach makes the agent “cheap enough to leave on.”

Three-Region Partition

┌─────────────────────────────────────────┐
│ IMMUTABLE PREFIX                        │ ← fixed for session
│   system + tool_specs + few_shots       │   cache hit candidate
├─────────────────────────────────────────┤
│ APPEND-ONLY LOG                         │ ← grows monotonically
│   [assistant₁][tool₁][assistant₂]...    │   preserves prefix of prior turns
├─────────────────────────────────────────┤
│ VOLATILE SCRATCH                        │ ← reset each turn
│   R1 thought, transient plan state      │   never sent upstream
└─────────────────────────────────────────┘

Immutable Prefix

  • Computed once per session, hashed, and pinned
  • Contains: system prompt, tool specifications, few-shot examples
  • Any change (add/remove tool, edit system prompt) changes the fingerprint and is a deliberate, accepted cache miss
  • In Reasonix: ImmutablePrefix class with SHA-1 fingerprint verification
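
A minimal sketch of hashing a prefix, assuming a sorted-key JSON form as the canonical serialization. The `Json` type, `canonicalize`, and `fingerprint` are illustrative names, not the Reasonix API; only the use of SHA-1 comes from the note above.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted, so logically equal specs yield equal bytes.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const entries = Object.entries(value).sort(([x], [y]) => (x < y ? -1 : x > y ? 1 : 0));
  return `{${entries.map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`).join(",")}}`;
}

// SHA-1 over the canonical bytes, hex-encoded (40 characters).
function fingerprint(prefix: Json): string {
  return createHash("sha1").update(canonicalize(prefix), "utf8").digest("hex");
}

// Hypothetical prefixes: a and b differ only in key order; c drops the tool.
const a: Json = { system: "You are a coding agent.", tools: [{ name: "read_file", parallelSafe: true }] };
const b: Json = { tools: [{ parallelSafe: true, name: "read_file" }], system: "You are a coding agent." };
const c: Json = { system: "You are a coding agent.", tools: [] };

console.log(fingerprint(a) === fingerprint(b)); // same content, different key order
console.log(fingerprint(a) === fingerprint(c)); // tool removed
```

Sorting keys matters because two serializations of the same spec with different key order would otherwise produce different bytes and miss the cache.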

Append-Only Log

  • Grows monotonically; entries serialized in strict order
  • No rewrites, no reordering, no in-place edits
  • Each new turn appends assistant message + tool results to the log
  • Each request therefore extends the previous one: the previous turn’s serialized bytes are an exact prefix of the next turn’s → cache hit
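
The append-only property can be sketched as follows. The `AppendOnlyLog` name matches the mechanism named in this note, but the `LogEntry` shape and the line-per-entry serialization are assumptions made for illustration.

```typescript
interface LogEntry {
  role: "user" | "assistant" | "tool";
  content: string;
}

class AppendOnlyLog {
  private readonly entries: LogEntry[] = [];

  // The only mutation offered: push a frozen copy onto the end.
  append(entry: LogEntry): void {
    this.entries.push(Object.freeze({ ...entry }));
  }

  // Serialize in insertion order, one JSON line per entry, so earlier
  // entries always produce the same leading bytes.
  serialize(): string {
    return this.entries.map((e) => JSON.stringify([e.role, e.content]) + "\n").join("");
  }
}

const log = new AppendOnlyLog();
log.append({ role: "user", content: "list files" });
const turn1 = log.serialize();

log.append({ role: "assistant", content: "calling ls" });
log.append({ role: "tool", content: "README.md\nsrc/" });
const turn2 = log.serialize();

console.log(turn2.startsWith(turn1)); // true
```

Because the class exposes no edit or delete method, the prefix relationship between consecutive serializations holds by construction.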

Volatile Scratch

  • Chain-of-thought, reasoning content, transient plan state
  • Reset each turn; never folded into the upstream request unless explicitly distilled
  • Prevents per-turn reasoning from poisoning the cacheable prefix
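
One way to keep reasoning out of the upstream request is sketched below. The `Session` class, the `ModelReply` shape, and the `distill` method are hypothetical names chosen for this example.

```typescript
interface ModelReply {
  content: string;
  reasoning: string; // hypothetical field holding the chain-of-thought
}

class Session {
  private readonly prefix: string;
  private readonly log: string[] = [];
  private scratch: string[] = [];

  constructor(prefix: string) {
    this.prefix = prefix;
  }

  // Content goes to the log; reasoning goes only to the scratch area.
  record(reply: ModelReply): void {
    this.log.push(`assistant: ${reply.content}\n`);
    this.scratch.push(reply.reasoning);
  }

  // Explicit distillation: the caller decides what survives as a log entry.
  distill(summary: string): void {
    this.log.push(`note: ${summary}\n`);
  }

  get scratchSize(): number {
    return this.scratch.length;
  }

  // Build the upstream request from prefix + log, then reset the scratch.
  nextRequest(): string {
    this.scratch = [];
    return this.prefix + this.log.join("");
  }
}

const session = new Session("system: You are a coding agent.\n");
session.record({ content: "I will run the tests.", reasoning: "Maybe lint first? No, tests." });
const firstRequest = session.nextRequest();

session.distill("tests before lint");
const secondRequest = session.nextRequest();

console.log(firstRequest.includes("Maybe lint first")); // false
console.log(secondRequest.startsWith(firstRequest)); // true
```

The distilled note is appended like any other log entry, so it does not disturb bytes that were already sent.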

Four Mechanisms

Mechanism        | What it does                                        | Cache impact
-----------------+-----------------------------------------------------+----------------------------------------------------------
ImmutablePrefix  | Freezes system + tools + few-shots at session start | Eliminates prefix churn from spec changes
AppendOnlyLog    | Strict append-only history                          | Ensures each turn’s prefix is a superset of the last
VolatileScratch  | Isolates per-turn reasoning                         | Prevents reasoning bytes from entering the cached prefix
Auto-compact     | Folds old turns into summary appended to prefix     | Survives context-window pressure without rewriting prefix

Real-World Results

Reasonix benchmark (2026-05-01):

  • 435M input tokens, 99.82% cache hit
  • Cost: ~$12 (with cache) vs ~$61 (without cache) on v4-flash
  • Savings: reported as 97.7%, although the two costs above imply roughly 80% ((61 − 12) / 61 ≈ 0.80)
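
The relationship between hit rate and input cost can be sketched as below. This assumes the ~10% cached price from the Problem section and looks at input tokens only; output tokens, which the cache does not discount, are left out. The function names are illustrative.

```typescript
// Input cost relative to a fully uncached run, given a cache hit rate and the
// price of a cached token as a fraction of an uncached one.
function inputCostRatio(hitRate: number, cachedPriceRatio: number): number {
  return (1 - hitRate) + hitRate * cachedPriceRatio;
}

// Fractional savings implied by two absolute costs.
function savingsFromCosts(withCache: number, withoutCache: number): number {
  return 1 - withCache / withoutCache;
}

// 99.82% hit rate at a 10% cached price: about 0.1016 of the uncached input cost.
console.log(inputCostRatio(0.9982, 0.1).toFixed(4));

// The dollar figures quoted above: about 0.803.
console.log(savingsFromCosts(12, 61).toFixed(3));
```

At a 10% cached price, input-side savings cannot exceed 90% however high the hit rate climbs, which is a useful sanity bound when reading benchmark figures.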

Comparison with other clients:

  • DeepSeek web chat: 60-80% within conversation, 0% on new session
  • Generic OpenAI-shape SDKs: 30-60% — history reordered, tool specs re-serialized
  • XML-tool-call clients (Cline/Continue): lower still — tool results inline into conversation

Implementation Notes

Parallel Tool Dispatch

Tools declare parallelSafe?: boolean. Consecutive parallel-safe calls are grouped into a chunk and run concurrently via Promise.allSettled. The first non-parallel-safe call ends the chunk and acts as a serial barrier. Tool results are still yielded in declared order regardless of settlement order, so the model sees the same shape as with fully serial dispatch.
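
A minimal sketch of this chunking, assuming a hypothetical `ToolCall` shape with a `run` thunk. Only `parallelSafe` and `Promise.allSettled` come from the description above.

```typescript
interface ToolCall {
  toolName: string;
  parallelSafe?: boolean;
  run: () => Promise<string>;
}

// Runs of consecutive parallel-safe calls share a chunk; each other call
// gets a chunk of its own and so acts as a serial barrier.
function chunkCalls(calls: ToolCall[]): ToolCall[][] {
  const chunks: ToolCall[][] = [];
  let current: ToolCall[] = [];
  for (const call of calls) {
    if (call.parallelSafe) {
      current.push(call);
    } else {
      if (current.length > 0) chunks.push(current);
      chunks.push([call]);
      current = [];
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

async function dispatch(calls: ToolCall[]): Promise<string[]> {
  const results: string[] = [];
  for (const chunk of chunkCalls(calls)) {
    // allSettled reports outcomes in input order, whatever the settlement order.
    const settled = await Promise.allSettled(chunk.map((c) => c.run()));
    for (const outcome of settled) {
      results.push(outcome.status === "fulfilled" ? outcome.value : `error: ${String(outcome.reason)}`);
    }
  }
  return results;
}

const delay = (ms: number, value: string): Promise<string> =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

const calls: ToolCall[] = [
  { toolName: "read_a", parallelSafe: true, run: () => delay(30, "a") },
  { toolName: "read_b", parallelSafe: true, run: () => delay(5, "b") }, // settles first
  { toolName: "write_c", run: () => delay(1, "c") },                    // serial barrier
];

dispatch(calls).then((results) => console.log(results)); // [ 'a', 'b', 'c' ]
```

Using `allSettled` rather than `all` means one failing tool becomes an error result in its slot instead of discarding the results of its siblings.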

Turn-End Auto-Compaction

Every tool result exceeding a token cap (e.g., 3000) is shrunk to that cap when the turn ends. The model had the full text for the turn that read it; subsequent turns see a compact summary. A proactive threshold (e.g., 40% of the context window) triggers compaction before the emergency threshold (e.g., 80%) is reached.
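
The cap and the two thresholds can be sketched as follows. Truncation stands in for the summarization described above, and the word-based `estimateTokens` is a placeholder for a real tokenizer; all names here are hypothetical.

```typescript
interface ToolResult {
  toolName: string;
  text: string;
}

// Placeholder token estimate: whitespace-separated words. A real agent would
// use the provider's tokenizer instead.
function estimateTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Shrink one result to at most `cap` estimated tokens, marker included.
// This truncates; the note above describes a summary, which would go here.
function shrink(result: ToolResult, cap: number): ToolResult {
  if (estimateTokens(result.text) <= cap) return result;
  const words = result.text.trim().split(/\s+/);
  const kept = words.slice(0, Math.max(cap - 1, 0));
  return { ...result, text: kept.concat("[truncated]").join(" ") };
}

type CompactionAction = "none" | "proactive" | "emergency";

// Decide which compaction path applies for the current context usage.
function compactionAction(
  usedTokens: number,
  windowTokens: number,
  proactive = 0.4,
  emergency = 0.8,
): CompactionAction {
  const ratio = usedTokens / windowTokens;
  if (ratio >= emergency) return "emergency";
  if (ratio >= proactive) return "proactive";
  return "none";
}

const big: ToolResult = { toolName: "cat", text: "w ".repeat(10) };
console.log(shrink(big, 4).text); // w w w [truncated]
console.log(compactionAction(50_000, 100_000)); // proactive
```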

Fingerprint Verification

The immutable prefix maintains a cached fingerprint (SHA-1 hash of its canonical serialization). Any mutation path that bypasses addTool()/removeTool()/replaceSystem() causes fingerprint drift. In dev/test mode, verifyFingerprint() detects the drift and throws, which surfaces cache-invalidating mutations early.
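
A minimal sketch of drift detection. The method names follow the ones mentioned above, but the class body, the `ToolSpec` shape, and the use of plain `JSON.stringify` as the canonical form are assumptions and not the Reasonix implementation.

```typescript
import { createHash } from "node:crypto";

interface ToolSpec {
  toolName: string;
  description: string;
}

class ImmutablePrefix {
  private system: string;
  private tools: ToolSpec[];
  private cachedFingerprint: string;

  constructor(system: string, tools: ToolSpec[]) {
    this.system = system;
    this.tools = [...tools]; // copies the array, but tool objects stay shared
    this.cachedFingerprint = this.compute();
  }

  // Fixed field order stands in for a proper canonical serialization.
  private compute(): string {
    const canonical = JSON.stringify({ system: this.system, tools: this.tools });
    return createHash("sha1").update(canonical, "utf8").digest("hex");
  }

  // Sanctioned mutation paths refresh the cached fingerprint.
  addTool(tool: ToolSpec): void {
    this.tools.push(tool);
    this.cachedFingerprint = this.compute();
  }

  replaceSystem(system: string): void {
    this.system = system;
    this.cachedFingerprint = this.compute();
  }

  get fingerprint(): string {
    return this.cachedFingerprint;
  }

  // Recompute and compare; throws if something changed behind the API.
  verifyFingerprint(): void {
    if (this.compute() !== this.cachedFingerprint) {
      throw new Error("fingerprint drift: prefix mutated outside the sanctioned API");
    }
  }
}

const readFile: ToolSpec = { toolName: "read_file", description: "Read a file" };
const pinned = new ImmutablePrefix("You are a coding agent.", [readFile]);
pinned.verifyFingerprint(); // passes: nothing has changed

readFile.description = "Read a file from disk"; // bypasses the API via a shared reference

let drift = false;
try {
  pinned.verifyFingerprint();
} catch {
  drift = true;
}
console.log(drift); // true
```

Recomputing the hash on every check costs a full serialization, which is one reason to restrict the check to dev/test mode.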

Trade-offs

Pros                                       | Cons
-------------------------------------------+-------------------------------------------------
Dramatic cost reduction (10-50× cheaper)   | Coupled to a specific provider’s cache mechanic
Simpler reasoning about context stability  | Cannot reorder history for relevance
Append-only log is easy to debug/replay    | Tool spec changes are expensive (cache miss)
                                           | Volatile scratch requires explicit distillation

References