OpenClaw Cost Optimization: How to Cut Your Monthly Bill by 90%
A complete guide to reducing your OpenClaw AI agent costs from $600 to under $20/month. Covers model routing, context management, caching, and more — with step-by-step config examples.

OpenClaw is one of the hottest open-source AI agent projects of 2026. It lives on your server and does real work for you — browses the web, manages files, sends messages, checks your calendar, and more.
But there’s a problem most people don’t talk about: it can get expensive fast.
Many users run OpenClaw with default settings and end up spending $100 to $600+ per month on API tokens alone. The worst part? Most of that money is wasted on tasks that don’t need expensive models.
This guide will show you exactly how to cut your OpenClaw costs by up to 90%. We’ll go step by step, from quick 5-minute fixes to advanced optimizations. By the end, you should be able to run a fully working OpenClaw agent for under $20-60 per month.
I run my OpenClaw on xCloud — a managed hosting platform that handles all the server stuff for you. Most of the tips in this guide work no matter where you host, but I’ll point out xCloud-specific things where they help.
Let’s dive in.
Understanding Where Your Money Goes
Before you can save money, you need to know where it’s going. Here’s how a typical OpenClaw instance spends its tokens:
| Category | % of Total Cost |
|---|---|
| Conversation history (context) | 40-50% |
| Tool call outputs | 20-30% |
| System prompts | 10-15% |
| Model responses (output tokens) | 8-12% |
| Retries and errors | 3-5% |
The biggest surprise for most people? Context history eats almost half your budget. Every time you send a new message, OpenClaw sends the entire conversation history back to the AI model. So by your 10th message, all 9 previous messages get sent again. This adds up fast.
The second surprise: output tokens cost 3 to 5 times more than input tokens across most major providers. A chatty model that writes 2,000 words when 400 would do is burning your money.

If you use OpenRouter for your API keys, you can see a similar breakdown in your OpenRouter Activity dashboard. For direct API keys, check the Anthropic Console or OpenAI Dashboard.
Understanding this breakdown is key. The strategies below target each of these cost areas directly.
Quick Wins — Save 50% in 5 Minutes
These three changes take less than 5 minutes each. Together, they can cut your bill in half immediately.
1. Switch Your Default Model
This is the single fastest way to save money. Most people set up OpenClaw with a powerful (and expensive) model like Claude Opus or GPT-4o as their default. But here’s the truth: 80-90% of your daily tasks don’t need a premium model.
Summarizing a document? Formatting text? Looking up a file? Setting a reminder? A cheaper model handles all of these just fine.
Here’s a quick price comparison:
| Model | Input Cost (per 1M tokens) | Best For |
|---|---|---|
| Claude Opus 4.6 | ~$15 | Complex reasoning only |
| Claude Sonnet 4.6 | ~$3 | Good balance of quality and cost |
| Claude Haiku 4.5 | ~$1 | Everyday tasks, 15x cheaper than Opus |
| GPT-4o-mini | ~$0.15 | Simple tasks, 10-20x cheaper than GPT-4o |
| Gemini 2.0 Flash | ~$0.10 | Budget tasks |
What to do: Change your default model to Claude Haiku or GPT-4o-mini. Use Sonnet or Opus only when you actually need deep reasoning.
In your OpenClaw config (~/.openclaw/openclaw.json):
```json
{
  "agent": {
    "model": "anthropic/claude-haiku-4-5-20251001"
  }
}
```

2. Cap Output Token Length
By default, OpenClaw lets the model generate as many tokens as it wants. This means a simple “yes or no” question might get a 500-word response. Since output tokens are the most expensive kind, this adds up.
Add a max output token limit to your config:
```json
{
  "agent": {
    "model": "anthropic/claude-haiku-4-5-20251001",
    "max_output_tokens": 2048
  }
}
```

For most tasks, 2,048 tokens (about 1,500 words) is more than enough. You can always increase it for specific tasks that need longer responses.
3. Enable Prompt Caching
Every time OpenClaw sends a request, it includes your system prompt, tool definitions, and other static content. If this content is the same across requests (and it usually is), you’re paying full price for the same text over and over.
Prompt caching stores this repeated content so you only pay a fraction of the price on subsequent calls:
- Claude: Cached tokens cost only 10% of the normal price (90% discount)
- OpenAI: Cached tokens cost 50% of the normal price
For Claude, caching has a 1,024-token minimum and requires cache breakpoints to be set in the request (most agent frameworks, OpenClaw included, handle this for you when caching is enabled). For the best results, make sure your static content (system prompt, tool definitions) comes at the beginning of each request so the cache can kick in.
If you’re using a proxy service like OpenRouter, make sure you’re using the Anthropic native format (anthropic-messages API) rather than the OpenAI compatibility mode. The OpenAI compatibility mode often can’t access Claude’s prompt caching, which means you miss out on 90% savings on system prompts.
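To see why caching matters, here’s a rough back-of-the-envelope model in Python. The prices are the illustrative Sonnet-class rates from earlier, and the sketch ignores cache expiry (Anthropic’s cached prefixes expire after a few minutes of inactivity, so real savings depend on request frequency):

```python
# Rough cost model for prompt caching, assuming Claude-style pricing:
# cache reads cost ~10% of the normal input rate, and cache writes cost
# ~25% more than normal input (paid once, on the first request).

INPUT_RATE = 3.00        # $ per 1M input tokens (Sonnet-class, illustrative)
CACHE_READ_RATE = 0.30   # 10% of the normal input rate
CACHE_WRITE_RATE = 3.75  # 125% of the normal rate, paid on the first request

def monthly_prompt_cost(static_tokens, requests, cached=True):
    """Cost of resending the same static prefix (system prompt + tools)."""
    if not cached:
        return static_tokens * requests * INPUT_RATE / 1e6
    first = static_tokens * CACHE_WRITE_RATE / 1e6           # write once
    rest = static_tokens * (requests - 1) * CACHE_READ_RATE / 1e6
    return first + rest

# An 8,000-token static prefix at 3,000 requests/month:
uncached = monthly_prompt_cost(8_000, 3_000, cached=False)   # $72.00
cached = monthly_prompt_cost(8_000, 3_000, cached=True)      # about $7.23
```

Even in this simplified model, the cached version costs roughly a tenth of the uncached one, which is where the “90% savings on system prompts” figure comes from.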
These three changes alone can drop your bill by 50% or more. But we’re just getting started.
Model Routing — The Biggest Money Saver
Model routing is the idea of using different models for different types of tasks automatically. Instead of one expensive model doing everything, you set up a chain:
- Default (cheap model) handles 70-80% of tasks — simple Q&A, formatting, reminders, lookups
- Mid-tier model handles 15-20% — moderate reasoning, code review, summarization
- Premium model handles 5-10% — complex analysis, multi-step reasoning, critical decisions
This matches how most teams actually work. Not every email needs a senior engineer to write it.
Setting Up a Failover Chain
Here’s an example configuration with a smart failover chain:
```json
{
  "agent": {
    "model": "anthropic/claude-haiku-4-5-20251001",
    "fallback": [
      {
        "model": "anthropic/claude-sonnet-4-6",
        "condition": "complexity > 0.7"
      },
      {
        "model": "anthropic/claude-opus-4-6",
        "condition": "complexity > 0.9"
      }
    ]
  }
}
```

This tells OpenClaw: “Use Haiku by default. If the task is moderately complex, switch to Sonnet. Only use Opus for the really hard stuff.”
Assign Cheap Models to Background Work
Your cron jobs, heartbeat checks, and background agents don’t need premium models. A heartbeat check that runs every 30 minutes on Claude Opus can cost $30-100 per month by itself. Switch it to Haiku or Gemini Flash and that drops to under $3.
```json
{
  "cron": {
    "heartbeat": {
      "model": "google/gemini-2.0-flash",
      "every": "30m",
      "session": "isolated"
    }
  }
}
```

The same goes for sub-agents. Each sub-agent you spawn starts a new session with its own context overhead. If you have a search agent, a file processing agent, and a summarizer — run them all on cheaper models:
| Agent Type | Recommended Model | Why |
|---|---|---|
| Search agent | GPT-4o-mini | Simple lookups don’t need reasoning power |
| File processor | Claude Haiku | Fast, handles formatting well |
| Summarizer | GPT-4o-mini | Summarization is a strength of smaller models |
| Main conversation | Claude Sonnet | Good balance for interactive work |
| Complex coding | Claude Opus (on demand) | Only when you actually need it |
Real-world savings: 60-80% cost reduction versus using a single premium model for everything.
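As a sanity check on that number, here’s the blended input-token price under an 80/15/5 routing split, using the illustrative prices from the comparison table. It assumes requests are similar in size; savings land a little above the conservative 60-80% range because of that simplification:

```python
# Blended input-token cost under the routing split described above,
# using the illustrative per-1M-token prices from the model table.

PRICES = {"haiku": 1.00, "sonnet": 3.00, "opus": 15.00}  # $ per 1M input tokens
SPLIT = {"haiku": 0.80, "sonnet": 0.15, "opus": 0.05}    # share of requests

blended = sum(PRICES[m] * SPLIT[m] for m in PRICES)      # $2.00 per 1M tokens
all_opus = PRICES["opus"]                                # $15.00 per 1M tokens
savings = 1 - blended / all_opus                         # ~87% vs all-Opus
```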
Context Management — Stop Paying for Old Messages
Remember that context history eats 40-50% of your budget? Here’s how to fix that.
The Problem: Context Compounding
OpenClaw sends the full conversation history with every new message. Here’s what that looks like in practice:
- Message 1: Sends 1x your system prompt + 1 message
- Message 5: Sends 1x system prompt + 5 messages
- Message 20: Sends 1x system prompt + 20 messages
- Message 40: Sends 1x system prompt + 40 messages (your early messages have been sent 40 times!)
This is why long conversations get so much more expensive as they grow: the total tokens sent climbs roughly with the square of the conversation length.
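The compounding is easy to quantify. In this sketch, the message and system-prompt sizes are made-up round numbers; it totals the input tokens sent across every request in a conversation:

```python
# Total input tokens sent over a conversation, assuming every request
# resends the full history (system prompt + all prior messages).

def total_input_tokens(n_messages, msg_tokens=300, system_tokens=2_000):
    total = 0
    for i in range(1, n_messages + 1):
        total += system_tokens + i * msg_tokens  # request i resends i messages
    return total

total_input_tokens(10)   # 36,500 tokens
total_input_tokens(40)   # 326,000 tokens: ~9x the tokens for 4x the messages
```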
Fix 1: Set Context Token Limits
Put a ceiling on how much context gets sent with each request:
```json
{
  "conversation": {
    "max_context_tokens": 50000,
    "auto_summarize": true,
    "summary_threshold": 30000
  }
}
```

This tells OpenClaw: “When the conversation history reaches 30,000 tokens, summarize the older parts. Never send more than 50,000 tokens of context.”
For most use cases, 50,000-100,000 tokens is the sweet spot. Beyond that, costs go up fast and the model’s response quality actually gets worse (too much context can confuse it).
Fix 2: Keep Workspace Files Lean
Your MEMORY.md, AGENTS.md, SOUL.md, and other workspace files get included in every single request. If they’re bloated, you’re paying for that bloat on every message.
Target: Keep all workspace files combined under 3,000 tokens.
Tips:
- Remove old or outdated entries from MEMORY.md
- Keep agent descriptions short and focused
- Don’t dump entire project docs into workspace files — reference them instead
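A quick way to audit this is the common ~4-characters-per-token heuristic. This sketch totals your workspace files (the file names come from the list above; the heuristic is approximate, not a real tokenizer):

```python
# Estimate whether workspace files fit the ~3,000-token budget,
# using the rough 4-characters-per-token heuristic.

import os

def estimate_tokens(text):
    return len(text) // 4  # rough heuristic, good enough for budgeting

def workspace_token_total(paths):
    total = 0
    for path in paths:
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                total += estimate_tokens(f.read())
    return total

files = ["MEMORY.md", "AGENTS.md", "SOUL.md"]
# If workspace_token_total(files) > 3_000, trim the largest file first.
```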
Fix 3: Enable QMD (Quick Memory Database)
QMD is a newer feature (OpenClaw v2026.2.2+) that builds a local search index of your conversations and documents. Instead of sending everything to the model, it searches for the most relevant pieces and sends only those.
```json
{
  "qmd": {
    "enabled": true,
    "index_path": "./qmd_index",
    "embedding_model": "local",
    "search_top_k": 5,
    "auto_index": true
  }
}
```

QMD uses a local embedding model, so there’s zero API cost for the search itself. It can reduce your context token usage by 60-97% depending on how much history you have.
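Conceptually, this is plain retrieval: embed chunks locally, rank by cosine similarity, send only the top k to the model. A minimal sketch of the ranking step (in a real setup, the embeddings would come from the local embedding model, not hand-written vectors):

```python
# Top-k retrieval by cosine similarity: the core of index-based context.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, chunks, k=5):
    """chunks: list of (text, embedding) pairs from the local index."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

Only the k returned chunks go into the request, which is why context usage drops so sharply on long histories.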
Fix 4: Use Isolated Sessions for Cron Jobs
Every cron job should use "session": "isolated". This starts a clean session for each run and closes it when done. Without this, your cron jobs accumulate context over time, getting more expensive with each run.
Use the Batch API for Non-Urgent Work
If you have tasks that don’t need an instant response, the Batch API gives you a 50% discount.
You submit your requests in a batch, and the API processes them within 24 hours. This is perfect for:
- Content generation — writing blog drafts, emails, social posts
- Data processing — classifying documents, extracting data
- Bulk summarization — processing many files at once
- Evaluation pipelines — testing prompt quality across many inputs
The trade-off is simple: you wait longer, you pay half. For anything that’s not a live conversation, this is free money.
Most API providers support batching:
- OpenAI Batch API: 50% discount on both input and output tokens
- Anthropic: Available through their Message Batches API
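For OpenAI, a batch is a JSONL file with one request per line, uploaded and run as a single job with a 24-hour completion window. A minimal sketch of building that file (the model name and prompts are placeholders):

```python
# Build OpenAI-style batch file contents: one JSON request per line.

import json

def batch_line(custom_id, prompt, model="gpt-4o-mini"):
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })

docs = ["Summarize report A", "Summarize report B"]
jsonl = "\n".join(batch_line(f"doc-{i}", p) for i, p in enumerate(docs))
# Write jsonl to a .jsonl file, upload it, and create the batch job with
# completion_window="24h" -- every token in it is billed at the 50% rate.
```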
Disable Expensive Features You Don’t Need
Some OpenClaw features burn tokens without you realizing. Here’s what to check:
Thinking/Reasoning Mode
Extended thinking mode makes the model “think out loud” before answering. This is great for complex problems, but thinking tokens are billed at output-token rates, the most expensive kind (typically 3 to 5 times the input price).
If your OpenClaw bill is surprisingly high, check your reasoning mode usage first. Disable it for simple tasks:
```json
{
  "thinking": {
    "type": "disabled"
  }
}
```

Enable it only when you specifically need deep reasoning (complex code, multi-step analysis).
Heartbeat Frequency
OpenClaw’s heartbeat is a background check that runs by default every 30 minutes. Each check consumes 8,000-15,000 input tokens. On a premium model, that’s $30-100/month just for heartbeats.
Fix: Either disable heartbeats if you don’t need them, reduce the frequency, or (best option) route them through a cheap model like Gemini Flash.
Browser Automation
A single web scraping session can cost $0.10 to $0.50 in tokens. If your agent is browsing the web frequently, consider:
- Using direct API calls instead of browser automation where possible
- Caching web results so the same page isn’t fetched twice
- Setting result limits (e.g., maxResults: 3 for web searches)
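The caching idea is little more than a wrapper around your fetcher. A sketch with an in-memory dict, where fetch_page stands in for whatever fetching function your agent actually uses:

```python
# In-memory page cache: the same URL is fetched (and re-tokenized) only once
# per session. `fetch_page` is a placeholder for your real fetcher.

_page_cache = {}

def cached_fetch(url, fetch_page):
    if url not in _page_cache:
        _page_cache[url] = fetch_page(url)
    return _page_cache[url]

calls = []
fake_fetch = lambda url: calls.append(url) or f"<html for {url}>"
cached_fetch("https://example.com", fake_fetch)
cached_fetch("https://example.com", fake_fetch)  # served from cache, no fetch
```

A production version would add a TTL so stale pages eventually refresh, but even this naive form stops duplicate fetches inside a single task.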
Unnecessary Sub-Agents
Each sub-agent spawns a new session with its own context overhead. Before spawning a sub-agent, ask: “Could the main agent handle this inline?” If the task is simple, skip the sub-agent.
Monitor Your Spending
You can’t optimize what you don’t measure. Here’s how to keep track of your costs:
Set Budget Caps
Every API provider lets you set spending limits. Do this on day one. A runaway loop in development can generate thousands of requests before you notice.
- Anthropic Console (console.anthropic.com): Set monthly spend caps under Usage
- OpenRouter (openrouter.ai): Set limits under Activity
- OpenAI Dashboard: Configure usage limits per API key
Set Alert Thresholds
Don’t wait until you hit your limit. Set alerts at 50%, 75%, and 90% of your budget:
```json
{
  "budget": {
    "monthly_limit": 30,
    "alert_threshold": 0.8,
    "action_on_limit": "downgrade"
  }
}
```

The "action_on_limit": "downgrade" option automatically switches to a cheaper model when you’re close to your limit, instead of cutting off your agent completely.
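The downgrade logic itself is simple. A sketch (the model names and the 80% threshold are illustrative, mirroring the config above):

```python
# Budget-aware model selection: downgrade instead of hard-stopping
# once spend crosses the alert threshold.

def pick_model(spent, monthly_limit, alert_threshold=0.8,
               normal="claude-sonnet-4-6", cheap="claude-haiku-4-5"):
    """Return the cheap model once spend reaches the threshold."""
    if monthly_limit and spent >= monthly_limit * alert_threshold:
        return cheap
    return normal

pick_model(spent=10.0, monthly_limit=30)  # well under budget: normal model
pick_model(spent=24.5, monthly_limit=30)  # past 80% of budget: cheap model
```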
Track What Matters
Key metrics to review monthly:
- Tokens per conversation — Are conversations getting longer over time?
- Cost per request — Which tasks are the most expensive?
- Model distribution — What percentage of requests go to each model?
- Cache hit rate — Is caching actually working?
Tools like Helicone (one-line integration) or LiteLLM (self-hosted proxy) can give you detailed dashboards for all of these.
Real Cost Examples
Here’s what real setups look like after optimization:
Budget Setup: $20-30/month
- Default model: Claude Haiku or GPT-4o-mini
- Complex tasks: Claude Sonnet (on demand only)
- Cron jobs: Gemini Flash
- Context limit: 50,000 tokens
- QMD: Enabled
- Best for: Personal assistant, light daily use
Balanced Setup: $50-80/month
- Default model: Claude Sonnet
- Light tasks: Claude Haiku
- Background: Gemini Flash
- Sub-agents: GPT-4o-mini
- Context limit: 100,000 tokens
- Best for: Active daily use, coding assistance, content creation
Power User: $100-150/month
- Default model: Claude Sonnet
- Complex work: Claude Opus (when needed)
- Multiple specialized sub-agents
- Browser automation enabled
- Best for: Teams, heavy automation, multi-agent workflows
Before vs After
| | Before Optimization | After Optimization |
|---|---|---|
| Monthly cost | $200-600 | $20-60 |
| Default model | Opus/GPT-4 | Haiku/GPT-4o-mini |
| Context management | None | QMD + auto-summarize |
| Heartbeat model | Same as default | Gemini Flash |
| Budget alerts | None | Set at 80% |
Server-Side Optimization
Your server setup also affects costs. Here’s how to optimize the hardware side.
Right-Size Your Server
Don’t over-provision. Most OpenClaw instances are memory-limited, not CPU-limited:
| Use Case | Recommended Specs | Monthly Cost |
|---|---|---|
| Personal (1-2 channels) | 1 vCPU, 2 GB RAM | ~$5-10 |
| Small team (3-5 channels) | 2 vCPU, 4 GB RAM | ~$15-25 |
| Production (10+ channels) | 4 vCPU, 8 GB RAM | ~$30-50 |
Unload Unused Skills
Each skill adds to your agent’s memory usage and context window. Most users find that 8-12 well-chosen skills is the sweet spot. If you have 30+ skills loaded but only use 10, remove the rest.
Set Up Log Rotation
OpenClaw generates logs that can fill up your disk. Set up automatic rotation:
- Rotate daily
- Compress old logs
- Keep only 7 days of logs
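Those three rules map directly onto a standard logrotate entry. A sketch, assuming logs live under /var/log/openclaw (adjust the path to wherever your install actually writes logs):

```
/var/log/openclaw/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```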
Use a LiteLLM Proxy (Advanced)
For power users, running a LiteLLM proxy in front of your API providers gives you:
- Response caching: 20-50% cost reduction on repeated queries
- Automatic fallback: If one provider is down, it switches to another
- Unified dashboard: Track costs across all providers in one place
- Rate limiting: Prevents accidental cost spikes from retry loops
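For reference, a minimal LiteLLM proxy config might look like the sketch below. Keys and model names are illustrative; verify the exact syntax against the LiteLLM docs for your version before copying:

```yaml
# config.yaml for the LiteLLM proxy (illustrative sketch)
model_list:
  - model_name: default              # the name your agent calls
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: fallback
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

litellm_settings:
  cache: true                        # response caching for repeated queries
```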
Why xCloud for OpenClaw Hosting
I host my OpenClaw on xCloud, and here’s why it works well for cost optimization:
- One-click setup in 5 minutes — No Docker, no terminal commands, no DevOps. You sign up, pick a plan, and your OpenClaw is running.
- Managed infrastructure — xCloud handles server security, SSL certificates, automatic backups, firewall rules, and updates. You focus on using your agent, not maintaining it.
- BYOK (Bring Your Own Key) — You use your own API keys from Anthropic, OpenAI, or any provider. This means you have full control over your API spending and can apply all the optimization tips in this guide directly.
- Starting at $24/month — The hosting cost is predictable and separate from your API costs. No surprise charges for bandwidth or storage.
- Right-sized servers — xCloud gives you a dedicated, isolated environment. You’re not sharing resources with other users, which means consistent performance.
The combination of cheap managed hosting ($24/month) plus the optimization tips above (API costs under $20-60/month) means you can run a fully capable AI agent for under $50-85/month total.
Quick Optimization Checklist
Here’s everything from this guide in one checklist. Start from the top and work your way down:
- Switch default model to Claude Haiku or GPT-4o-mini
- Set max_output_tokens to 2,048
- Enable prompt caching (use Anthropic native format if using a proxy)
- Set up model routing — cheap default, mid-tier fallback, premium only when needed
- Assign cheap models to cron jobs, heartbeats, and sub-agents
- Set context token limit to 50,000-100,000
- Enable auto-summarize for long conversations
- Keep workspace files under 3,000 tokens total
- Enable QMD for smart context retrieval
- Use isolated sessions for all cron jobs
- Disable thinking mode for non-complex tasks
- Reduce heartbeat frequency or route through cheap model
- Set monthly budget cap and alert at 80%
- Review spending monthly — check model distribution and cost per request
- Right-size your server — don’t over-provision
Most people see the biggest drop from just the first 5 items. You don’t have to do everything at once. Start with the quick wins and add more as you go.
Ready to get started? If you don’t have an OpenClaw server yet, xCloud’s managed hosting gets you up and running in 5 minutes — no server management required. Pair it with the tips above and you’ll have a powerful AI agent that doesn’t break the bank.
💡 Pro Tips From the Trenches
These are things I discovered after weeks of running OpenClaw in production — not in any docs, just from tinkering.
Pro Tip #1 — Use openrouter:free for Heartbeat (Literally $0)
OpenRouter has a :free suffix on many models that gives you access to genuinely free-tier API calls. This is perfect for your heartbeat, which runs in the background every 30 minutes and doesn’t need any reasoning power.
Set your heartbeat to a free OpenRouter model and you pay zero for it — every single day, indefinitely:
```json
{
  "cron": {
    "heartbeat": {
      "model": "openrouter/google/gemini-2.0-flash:free",
      "every": "30m",
      "session": "isolated"
    }
  }
}
```

The :free models on OpenRouter have rate limits, but for a simple heartbeat ping, they’re more than sufficient. This alone can save you $10-30/month depending on your current setup.
Good free models to try for heartbeat:
- google/gemini-2.0-flash:free
- meta-llama/llama-3.1-8b-instruct:free
- mistralai/mistral-7b-instruct:free
Pro Tip #2 — Scan for Free Models During Onboarding
When you first set up OpenClaw and go through the onboarding flow, there is an option to scan available OpenRouter models — including the free tier. Most people skip past this screen without realizing what it does.
If you select “Scan OpenRouter Free” during onboarding, OpenClaw will:
- Pull the current list of :free models from OpenRouter
- Let you assign them to specific roles (heartbeat, sub-agents, background tasks)
- Save those assignments to your config automatically
This is the fastest way to start with a zero-cost background setup from day one — before you’ve even written your first prompt. If you already skipped this step during setup, you can trigger the scan manually from Settings → Providers → OpenRouter → Scan Free Models.
Bottom line: The best API cost is $0. OpenRouter’s free tier makes that possible for everything that runs in the background. Use it.