Configuration¶

Headroom can be configured via the SDK, proxy command line, or per-request overrides.

SDK Configuration¶

from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),

    # Mode: "audit" (observe only) or "optimize" (apply transforms)
    default_mode="optimize",

    # Enable provider-specific cache optimization
    enable_cache_optimizer=True,

    # Enable query-level semantic caching
    enable_semantic_cache=False,

    # Override default context limits per model
    model_context_limits={
        "gpt-4o": 128000,
        "gpt-4o-mini": 128000,
    },

    # Database location (defaults to temp directory)
    # store_url="sqlite:////absolute/path/to/headroom.db",
)

Proxy Configuration¶

Command Line Options¶

headroom proxy \
  --port 8787 \              # Port to listen on
  --host 0.0.0.0 \           # Host to bind to
  --budget 10.00 \           # Daily budget limit in USD
  --log-file headroom.jsonl  # Log file path

Feature Flags¶

# Disable optimization (passthrough mode)
headroom proxy --no-optimize

# Disable semantic caching
headroom proxy --no-cache

# Disable CCR response handling
headroom proxy --no-ccr-responses

# Disable proactive expansion
headroom proxy --no-ccr-expansion

# (The earlier --llmlingua flag was retired in 0.9.x and replaced by
# Kompress (ModernBERT). See `wiki/transforms.md` for the current
# opt-in path via the `[ml]` extra.)

All Options¶

headroom proxy --help

Kompress backend selection¶

Kompress (the model-based compressor) can run on two engines:

ONNX Runtime — lightweight, CPU-first. Installed with pip install headroom-ai[proxy]. Optionally uses the CoreML execution provider on macOS.
PyTorch — heavier, supports CUDA and Apple-Silicon MPS acceleration. Installed with pip install headroom-ai[ml]. With device=auto it selects cuda, then mps, then cpu.

Select the backend via the HEADROOM_KOMPRESS_BACKEND environment variable:

Value	Behavior
`auto`	Default. ONNX CPU first (stable, lightweight), PyTorch as fallback.
`onnx` / `onnx_cpu`	Force ONNX Runtime on CPU.
`onnx_coreml`	Force ONNX Runtime with the CoreML provider (CPU fallback).
`pytorch`	Force PyTorch with automatic device selection (CUDA → MPS → CPU).
`pytorch_mps`	Force PyTorch on Apple-Silicon MPS; falls back to ONNX CPU on failure.

Values are case-insensitive and hyphens are accepted (onnx-cpu == onnx_cpu). Shorthand aliases: cpu → onnx_cpu, coreml → onnx_coreml, mps / torch_mps → pytorch_mps, torch → pytorch. Unrecognized values log a warning and fall back to auto.

Example — opt in to MPS on an Apple-Silicon machine:

export HEADROOM_KOMPRESS_BACKEND=mps
headroom proxy ...

The default deliberately stays on ONNX CPU so existing installs keep their compression quality and performance characteristics; accelerator backends are opt-in.

Per-Request Overrides¶

Override configuration for specific requests:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],

    # Override mode for this request
    headroom_mode="audit",

    # Reserve more tokens for output
    headroom_output_buffer_tokens=8000,

    # Keep last N turns (don't compress)
    headroom_keep_turns=5,

    # Skip compression for specific tools
    headroom_tool_profiles={
        "important_tool": {"skip_compression": True}
    }
)

Modes¶

Mode	Behavior	Use Case
`audit`	Observes and logs, no modifications	Production monitoring, baseline measurement
`optimize`	Applies safe, deterministic transforms	Production optimization
`simulate`	Returns plan without API call	Testing, cost estimation

Simulate Mode¶

Preview what would happen without making an API call:

plan = client.chat.completions.simulate(
    model="gpt-4o",
    messages=large_conversation,
)

print(f"Would save {plan.tokens_saved} tokens")
print(f"Transforms: {plan.transforms}")
print(f"Estimated savings: {plan.estimated_savings}")

SmartCrusher Configuration¶

Fine-tune JSON compression behavior:

from headroom.transforms import SmartCrusherConfig

config = SmartCrusherConfig(
    # Maximum items to keep after compression
    max_items_after_crush=15,

    # Minimum tokens before applying compression
    min_tokens_to_crush=200,

    # Relevance scoring tier: "bm25" (fast) or "embedding" (accurate)
    relevance_tier="bm25",

    # Always keep items with these field values
    preserve_fields=["error", "warning", "failure"],
)

Cache Aligner Configuration¶

Control prefix stabilization:

from headroom.transforms import CacheAlignerConfig

config = CacheAlignerConfig(
    # Enable/disable cache alignment
    enabled=True,

    # Patterns to extract from system prompt
    dynamic_patterns=[
        r"Today is \w+ \d+, \d{4}",
        r"Current time: .*",
    ],
)

Rolling Window Configuration¶

Control context window management:

from headroom.transforms import RollingWindowConfig

config = RollingWindowConfig(
    # Minimum turns to always keep
    min_keep_turns=3,

    # Reserve tokens for output
    output_buffer_tokens=4000,

    # Drop oldest tool outputs first
    prefer_drop_tool_outputs=True,
)

Intelligent Context Manager Configuration¶

For semantic-aware context management with importance scoring:

from headroom.config import IntelligentContextConfig, ScoringWeights

# Customize scoring weights (must sum to 1.0, or will be normalized)
weights = ScoringWeights(
    recency=0.20,              # Newer messages score higher
    semantic_similarity=0.20,  # Similarity to recent context
    toin_importance=0.25,      # TOIN-learned retrieval patterns
    error_indicator=0.15,      # TOIN-learned error field types
    forward_reference=0.15,    # Messages referenced by later messages
    token_density=0.05,        # Information density
)

config = IntelligentContextConfig(
    # Enable/disable the manager
    enabled=True,

    # Protection settings
    keep_system=True,           # Never drop system messages
    keep_last_turns=2,          # Protect last N user turns

    # Token budget
    output_buffer_tokens=4000,  # Reserve for model output

    # Scoring settings
    use_importance_scoring=True,    # Use semantic scoring (vs position-only)
    scoring_weights=weights,        # Custom weights
    toin_integration=True,          # Use TOIN patterns if available
    recency_decay_rate=0.1,         # Exponential decay lambda

    # Strategy thresholds
    compress_threshold=0.1,     # Try compression first if <10% over budget
)

CCR Integration¶

When IntelligentContext drops messages, they're stored in CCR for potential retrieval:

from headroom.telemetry import get_toin

# Pass TOIN for bidirectional integration
toin = get_toin()
manager = IntelligentContextManager(config=config, toin=toin)

# Dropped messages are:
# 1. Stored in CCR (so LLM can retrieve if needed)
# 2. Recorded to TOIN (so it learns which patterns matter)
# 3. Marked with CCR reference in the inserted message

The marker inserted when messages are dropped includes the CCR reference:

[Earlier context compressed: 14 message(s) dropped by importance scoring.
Full content available via ccr_retrieve tool with reference 'abc123def456'.]

Scoring Weights¶

The ScoringWeights class controls how messages are scored:

Weight	Default	Description
`recency`	0.20	Exponential decay from conversation end
`semantic_similarity`	0.20	Embedding cosine similarity to recent context
`toin_importance`	0.25	TOIN retrieval_rate (high retrieval = important)
`error_indicator`	0.15	TOIN field_semantics error detection
`forward_reference`	0.15	Count of later messages referencing this one
`token_density`	0.05	Unique tokens / total tokens

Weights are automatically normalized to sum to 1.0:

weights = ScoringWeights(recency=1.0, toin_importance=1.0)
normalized = weights.normalized()
# recency=0.5, toin_importance=0.5, others=0.0

Environment Variables¶

Some settings can be configured via environment variables:

Variable	Description	Default
`HEADROOM_MODEL_LIMITS`	Custom model config (JSON string or file path)	-
`HEADROOM_CONFIG_DIR`	Canonical config (read-mostly) root. Derives `models.json` and per-plugin config paths when set.	`~/.headroom/config`
`HEADROOM_WORKSPACE_DIR`	Canonical workspace (read-write state) root. Derives savings ledger, memory DB, logs, TOIN, subscription state, and more when set.	`~/.headroom`
`HEADROOM_SAVINGS_PATH`	Full path to the proxy savings JSON ledger. Always wins when set.	derived from `${HEADROOM_WORKSPACE_DIR}`
`HEADROOM_TOIN_PATH`	Full path to the TOIN telemetry JSON file. Always wins when set.	derived from `${HEADROOM_WORKSPACE_DIR}`
`HEADROOM_SUBSCRIPTION_STATE_PATH`	Full path to the subscription tracker state. Always wins when set.	derived from `${HEADROOM_WORKSPACE_DIR}`
`HEADROOM_EMBEDDER_RUNTIME`	Set to `pytorch_mps` to run the memory embedder via the torch sentence-transformers backend on the Apple GPU (MPS). Only engages when Apple MPS is actually available; otherwise it logs a warning and uses the existing default embedder selection path. `pytorch_mps` is the only accepted value. Requires the `[pytorch-mps]` extra. See Memory.	default embedder selection
`HEADROOM_BETA_HEADER_STICKY`	Controls per-session `anthropic-beta` / `OpenAI-Beta` re-echo. `enabled` (default): the proxy unions beta tokens across turns within a session — if the client sends a token in turn N and omits it in turn N+1, the proxy re-injects it to preserve prefix-cache stability. `disabled`: the client's value is forwarded verbatim with no accumulation. Any other value raises at request time. See Session Beta Header Tracking.	`enabled`
`HEADROOM_BETA_TRACKER_MAX_SESSIONS`	LRU capacity of the in-memory session beta tracker. Once full, the oldest session entry is evicted.	`1000`

Session Beta Header Tracking¶

When running as a proxy, Headroom maintains a per-session union of anthropic-beta (and OpenAI-Beta) tokens via SessionBetaTracker. The session key is derived from the x-headroom-session-id header if present, otherwise from md5(model + system_prompt[:500])[:16] — stable across turns of the same conversation.

Why: clients such as Claude Code and Codex CLI may drop a beta token between consecutive turns. Because anthropic-beta is part of the request bytes that determine the upstream prefix-cache key, a dropped token would bust the cache mid-conversation. The tracker re-injects any token seen earlier in the session so the cache key stays stable.

Trade-off: once the proxy has seen a beta token in a session it will continue re-sending it for the rest of that session, even if the client stops including it. Stopping the token on the client side alone is not sufficient — the proxy re-injects it. Set HEADROOM_BETA_HEADER_STICKY=disabled to pass the client's anthropic-beta value verbatim and bypass this accumulation.

# Disable sticky beta re-echo
export HEADROOM_BETA_HEADER_STICKY=disabled
headroom proxy ...

Note: disabling sticky mode may reduce prefix-cache hit rates for clients that legitimately drop-and-re-add beta tokens across turns.

Filesystem Contract¶

Headroom resolves every on-disk resource through a two-root model:

HEADROOM_CONFIG_DIR (default ~/.headroom/config) — read-mostly configuration
HEADROOM_WORKSPACE_DIR (default ~/.headroom) — read-write state

Precedence for each resource is: explicit argument > per-resource env var > derived from canonical root > default. Every legacy env var continues to work unchanged.

See Filesystem Contract for the full bucket table, plugin-author guidance, and the Docker naming overlap note (HEADROOM_WORKSPACE is not the same as HEADROOM_WORKSPACE_DIR).

Custom Model Configuration¶

Configure context limits and pricing for new or custom models. Useful when: - A new model is released before Headroom is updated - You're using fine-tuned or custom models - You want to override built-in limits

Configuration Methods¶

Settings are resolved in this order (later overrides earlier): 1. Built-in defaults 2. ${HEADROOM_CONFIG_DIR}/models.json (defaults to ~/.headroom/config/models.json); falls back to the legacy location ~/.headroom/models.json when the canonical file is absent 3. HEADROOM_MODEL_LIMITS environment variable 4. SDK constructor arguments

Config File Format¶

Create ~/.headroom/models.json:

{
  "anthropic": {
    "context_limits": {
      "claude-4-opus-20250301": 200000,
      "claude-custom-finetune": 128000
    },
    "pricing": {
      "claude-4-opus-20250301": {
        "input": 15.00,
        "output": 75.00,
        "cached_input": 1.50
      }
    }
  },
  "openai": {
    "context_limits": {
      "gpt-5": 256000,
      "ft:gpt-4o:my-org": 128000
    },
    "pricing": {
      "gpt-5": [5.00, 15.00]
    }
  }
}

Environment Variable¶

Set HEADROOM_MODEL_LIMITS as a JSON string or file path:

# JSON string
export HEADROOM_MODEL_LIMITS='{"anthropic":{"context_limits":{"claude-new":200000}}}'

# File path
export HEADROOM_MODEL_LIMITS=/path/to/models.json

Pattern-Based Inference¶

Unknown models are automatically inferred from naming patterns:

Pattern	Inferred Settings
`opus`	200K context, Opus-tier pricing
`sonnet`	200K context, Sonnet-tier pricing
`haiku`	200K context, Haiku-tier pricing
`gpt-4o*`	128K context, GPT-4o pricing
`o1`, `o3`	200K context, reasoning model pricing

This means new models like claude-4-sonnet-20251201 will work automatically with Sonnet-tier defaults.

SDK Override¶

Override in code for specific models:

from headroom import HeadroomClient, AnthropicProvider

client = HeadroomClient(
    original_client=Anthropic(),
    provider=AnthropicProvider(
        context_limits={
            "claude-new-model": 300000,
        }
    ),
)

Provider-Specific Settings¶

OpenAI¶

from headroom import OpenAIProvider

provider = OpenAIProvider(
    # Enable automatic prefix caching
    enable_prefix_caching=True,
)

Anthropic¶

from headroom import AnthropicProvider

provider = AnthropicProvider(
    # Enable cache_control blocks
    enable_cache_control=True,
)

Google¶

from headroom import GoogleProvider

provider = GoogleProvider(
    # Enable context caching
    enable_context_caching=True,
)

Configuration Precedence¶

Settings are applied in this order (later overrides earlier):

Default values
Environment variables
SDK constructor arguments
Per-request overrides

Validation¶

Validate your configuration:

result = client.validate_setup()

if not result["valid"]:
    print("Configuration issues:")
    for issue in result["issues"]:
        print(f"  - {issue}")

TypeScript SDK Configuration¶

The TypeScript SDK is configured via environment variables or constructor options.

Environment Variables¶

Variable	Description	Default
`HEADROOM_BASE_URL`	Base URL of the Headroom proxy or cloud API	`http://localhost:8787`
`HEADROOM_API_KEY`	API key for Headroom Cloud authentication	-

Usage¶

export HEADROOM_BASE_URL=http://localhost:8787
export HEADROOM_API_KEY=your-api-key

import { HeadroomClient } from 'headroom-ai';

// Reads from HEADROOM_BASE_URL and HEADROOM_API_KEY automatically
const client = new HeadroomClient();

// Or configure explicitly
const client = new HeadroomClient({
  baseUrl: 'http://localhost:8787',
  apiKey: 'your-api-key',
});

See the TypeScript SDK Guide for full configuration options.