Flat-Fee SaaS with Claude API: When the Margin Math Actually Works
Build a flat-fee SaaS with Claude API without bleeding margin. 4 levers: tier design, prompt caching, soft caps, model routing. Real $29/mo example.
Why Founders Want Flat Fee (and Why Retail Rates Push Them to Metered)
Every AI-powered SaaS founder eventually has the same debate with themselves: charge a simple flat monthly fee, or meter usage and pass cost through.
Flat fee almost always wins on conversion. Metered almost always wins on margin protection. At Claude API retail rates, most founders cannot afford to take the conversion win because the margin floor is too thin.
This post is the engineering and pricing playbook for keeping flat fee as a real option: tier design, prompt caching, soft caps, and model routing.
The Margin Floor That Decides Everything
For a flat-fee SaaS plan to be defensible, you need a margin floor below which even your worst-cost customer is profitable. A margin floor of 30% means even your 95th-percentile heavy user is contributing 30 cents on every dollar of subscription revenue.
The rough formula:
Margin floor = 1 − (95th-percentile-user monthly AI cost / subscription price)
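As a sanity check, the formula is one line of code (the figures in the example call are illustrative):

```python
def margin_floor(p95_monthly_ai_cost: float, subscription_price: float) -> float:
    # Share of each subscription dollar left after serving the worst-cost user
    return 1 - (p95_monthly_ai_cost / subscription_price)

# e.g. a $29 plan whose heaviest users cost ~$15.66/month in AI spend
floor = margin_floor(15.66, 29.0)  # ≈ 0.46
```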
If you cannot calculate this number for your own product, you cannot honestly price a flat-fee plan. Most early-stage SaaS founders cannot, because they do not yet track AI cost per user and have no percentile view of usage.
Which is why most founders launch at \$29/month, get one power user, and watch their margin go negative on that customer six weeks in.
The Four Levers That Change the Math
Four engineering levers move the cost-per-user line down. Used together, they shift the margin floor by an order of magnitude.
Lever 1: Prompt Caching
Claude's prompt cache costs ~1.25× input on write and ~0.1× input on read. A 50KB system prompt that gets reused 1,000 times costs ~10% of nominal input cost over the lifetime.
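A quick sketch of why the ~10% figure holds: one cache write at ~1.25× plus the remaining reads at ~0.1×, averaged over the reuse count (multipliers as stated above):

```python
def effective_input_multiplier(n_uses: int, write_mult: float = 1.25,
                               read_mult: float = 0.10) -> float:
    # One cache write, then (n_uses - 1) cache reads, averaged per use
    return (write_mult + (n_uses - 1) * read_mult) / n_uses

effective_input_multiplier(1000)  # ≈ 0.101, i.e. ~10% of nominal input cost
```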
The operational target: 80%+ cache hit ratio on system prompts and stable context segments. Cache hit ratios are commonly low because dynamic content (timestamps, per-user data) sits ahead of the stable prefix, because system prompts vary slightly between requests, or because traffic is too sparse and the cache TTL expires between calls.
Instrument cache hit ratio per endpoint. Track it as a first-class metric.
Lever 2: Model Routing
Defaulting to Sonnet 4.6 for every task is the most common over-spend pattern. A meaningful share of agent tasks — classification, summarization, extraction, simple Q&A — produce equivalent results on Haiku 4.5 at ~1/3 the cost.
The routing pattern:
```python
def pick_model(task_type, complexity):
    if task_type in ("classify", "extract", "simple_qa", "summarize_short"):
        return "claude-haiku-4-5"
    elif complexity == "hard" or task_type in ("code_complex", "reasoning_deep"):
        return "claude-opus-4-7"
    else:
        return "claude-sonnet-4-6"
```

A SaaS routing 70% of traffic to Haiku, 25% to Sonnet, and 5% to Opus typically lands at 35-40% of a pure-Sonnet bill — for equivalent or better task results when the routing rule is sound.
Lever 3: Soft Caps Per Plan Tier
"Unlimited" is a marketing claim, not an SLA. For flat-fee plans to survive, set a 95th-percentile soft cap on usage with a clear path to upgrade if a customer hits it.
This is not about restricting normal users — it is about preventing the 5% of customers who would otherwise consume 80% of the cost. A reasonable design:
When a user hits 80% of the cap, send a friendly email with the upgrade link. When they hit 100%, throttle to a slower lane (not zero — keep the product working) and offer the upgrade.
The customers who hit the cap are typically the ones happy to upgrade. The customers who do not hit it never feel restricted.
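A minimal sketch of that escalation path. The function name and thresholds are illustrative; wire the return value into your own email and throttling machinery:

```python
def cap_action(tasks_used: int, soft_cap: int) -> str:
    """Decide what to do for a user's next request under a soft cap."""
    if tasks_used >= soft_cap:
        return "throttle"  # slower lane plus upgrade offer, never a hard stop
    if tasks_used >= int(soft_cap * 0.8):
        return "warn"      # friendly email with the upgrade link
    return "ok"
```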
Lever 4: Per-Call Cost Reduction (70-80% Off)
The fourth lever is the API rate itself. Moving from full retail to 70-80% off changes every other piece of the math.
The migration is one environment variable — see our Anthropic base URL setup guide for the env-var pattern. No code rewrite, no SDK swap.
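In practice the switch looks like this (the URL below is a placeholder; use the base URL from your provider's dashboard):

```shell
# Point the Anthropic SDK at the proxy
export ANTHROPIC_BASE_URL="https://api.example-proxy.com"
export ANTHROPIC_API_KEY="sk-your-proxy-key"

# Rolling back is the same one-liner:
# unset ANTHROPIC_BASE_URL    # SDK falls back to the default endpoint
```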
A Worked Example: \$29/mo Plan That Clears 70% Margin
Let's design a flat-fee plan for a code-generation SaaS that uses Claude Sonnet 4.6 with reasonable engineering hygiene.
Assumptions
A \$29/month flat fee; tasks split 80/20 between Sonnet 4.6 and Opus 4.7; a 70% cache hit ratio on input; a 95th-percentile user running 180 tasks/month.
Cost at Retail Rates (Pre-Discount)
Per-task cost with a 70% cache hit ratio on input, weighted for the 80/20 Sonnet/Opus split: ~\$0.087/task.
95th-percentile user (180 tasks): \$15.66/month at retail. That is 54% of the \$29 subscription. Margin floor is at 46% — below the 70% target. You either raise the price, lower the cap, or change the cost.
Cost at Pro Rates (80% Off)
Same engineering hygiene, same usage, claudeapi.cheap Pro rates:
Weighted (80/20 Sonnet/Opus): ~\$0.018/task
95th-percentile user (180 tasks): \$3.24/month at Pro rates. That is 11% of the \$29 subscription. Margin floor is at 89%. You can raise the soft cap to 400 tasks/month and still clear 75%.
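Rechecking the arithmetic with the weighted per-task figures from this example:

```python
PRICE = 29.0     # monthly subscription
P95_TASKS = 180  # 95th-percentile user's tasks per month

def margin_floor(per_task_cost: float) -> float:
    return 1 - (P95_TASKS * per_task_cost) / PRICE

margin_floor(0.087)  # retail weighted per-task -> ≈ 0.46
margin_floor(0.018)  # Pro weighted per-task    -> ≈ 0.89
```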
The rate change does not just save money — it changes which pricing designs are viable. A flat-fee plan that dies at retail rates survives comfortably at Pro rates, even with the same product and the same engineering choices.
Code Pattern: Per-User Cost Tracking
If you do not already track cost per user per month, the simplest implementation is to log every API response with the user ID and the cost:
```python
from anthropic import Anthropic
import logging

client = Anthropic()  # uses ANTHROPIC_BASE_URL env var

# Pro rates per 1M tokens: (input, output)
PRICES = {
    "claude-sonnet-4-6": (0.60, 3.00),
    "claude-opus-4-7": (1.00, 5.00),
    "claude-haiku-4-5": (0.20, 1.00),
}

def call_claude(user_id, messages, model="claude-sonnet-4-6"):
    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=messages,
    )
    input_price, output_price = PRICES[model]
    cost = (
        response.usage.input_tokens * input_price / 1_000_000
        + response.usage.output_tokens * output_price / 1_000_000
    )
    logging.info(f"user={user_id} model={model} cost={cost:.4f}")
    return response
```

From there: aggregate per user per month, sort, look at the 95th percentile, and calibrate your plan tier accordingly.
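Continuing the sketch: a hypothetical aggregation from logged `(user_id, cost)` records to the 95th-percentile monthly cost (swap in whatever store your logs actually land in):

```python
from collections import defaultdict

def p95_monthly_cost(records):
    """records: iterable of (user_id, cost) pairs for one month."""
    per_user = defaultdict(float)
    for user_id, cost in records:
        per_user[user_id] += cost
    totals = sorted(per_user.values())
    # nearest-rank 95th percentile over the ascending per-user totals
    idx = max(0, int(len(totals) * 0.95) - 1)
    return totals[idx]
```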
What Stays the Same When You Switch Rates
The shift from retail rates to 70-80% off is a single env var change. What you keep: your existing SDK code and the same model names (claude-sonnet-4-6, claude-opus-4-7, claude-haiku-4-5). What changes: your bill, your gross margin, and the design space of pricing plans you can credibly offer.
For more on cost analysis at scale, see our unit economics deep-dive. For the cost-reduction-at-bill-scale angle, see cutting a \$24K bill.
Frequently Asked Questions
Does this work with our existing SDK code?
Yes. Two environment variables (ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL) point at the proxy. The Python and Node Anthropic SDKs pick them up automatically. No refactor.
Can we A/B test the rate change before fully switching?
Yes. Route 10-20% of production traffic through the proxy for a week. Compare cost-per-call, latency, and any error rate. If clean, move the rest.
What about prompt caching at the proxy?
Claude prompt caching works identically through the proxy. Cache writes at ~1.25× input, cache reads at ~0.1× input. Same mechanics end-to-end.
Is there a meaningful latency overhead?
Single-digit milliseconds in most regions. We can share specific monitoring numbers on request for high-volume operators.
Can we set a hard ceiling on per-customer cost?
Not directly per customer in our system — that is your application-layer responsibility. The proxy gives you a hard ceiling at the wallet level (prepaid balance, no overdraft) and exposes usage per call so you can build per-customer caps in your code.
What if our SaaS hits five-figure MRR — does this still scale?
Yes. We have operators running tens of thousands of calls per day. Rate limits on Pro plan: 500 req/min, 2M tokens/min. Beyond that, contact support for higher limits.
How does this compare to OpenRouter for SaaS use cases?
Different shape. We are Claude-focused with one wallet for Claude/GPT/Gemini, prepaid hard ceiling (protection against retry loops), no credit-expiration math. See OpenRouter comparison.
Does this affect rate limits Anthropic has on our account?
The proxy has its own rate limits separate from your direct Anthropic account. Your direct account is untouched.
Can we still use Claude Code in our agent tooling internally?
Yes. Claude Code reads the same env vars. Many SaaS teams use Claude Code internally for ops and code-gen while serving customers through their main app — one wallet works for both.
Make Flat Fee Viable Again
The entire reason your pricing instincts push you toward flat fee is the same reason your CFO instincts push you back to metered: customers love flat fee, but the unit math punishes it at retail AI rates.
The four levers above — caching, routing, soft caps, and per-call rate reduction — push the math back to where flat fee is defensible. The fourth lever is the most leveraged: one environment variable, 60-second reversibility, 70-80% per-call savings.
Run the margin floor calculation for your 95th-percentile user. If it does not clear your target margin at retail rates, run it again at Pro rates. The difference is usually the difference between "this pricing model is not viable" and "this is the obvious plan to ship."