Performance & Cost Optimization
Calling an LLM directly is simple — until you need it to be fast, reliable, and affordable at scale. SentientOne handles the hardest parts automatically, so you get production-grade performance without building any of it yourself.
Direct LLM call vs SentientOne
| Capability | Direct LLM call | SentientOne |
|---|---|---|
| Prompt caching | Build per provider | Automatic |
| Retry with backoff | Build & maintain | Built-in |
| Error classification | Parse each provider | Standardised codes |
| Context window management | Manual truncation | Automatic |
| Token cost tracking | DIY logging | Per-request, per-agent |
| Streaming resilience | Handle stalls yourself | Monitored & reported |
| Multi-provider support | Separate SDK per provider | One API, any model |
Prompt caching
Every agent has a system prompt, tool definitions, and often a base set of instructions that are identical across every request. When you call an LLM directly, these tokens are re-processed and re-billed on every single call.
- Cache hits use provider-native cachingAnthropic
cache_controland OpenAIprompt_cacheare wired in automatically. You get the cache discount without writing a single config line. - Stable prefixes onlySystem prompts and tool schemas — the parts that don't change between requests — are what gets cached. User messages always count as new tokens.
- Up to 90% reductionOn cached prefixes you pay only the provider's cache-hit rate, typically ~10% of the standard input rate.
- Works out of the boxZero configuration. Every agent benefits as soon as it sees a second request with the same system prompt.
Without SentientOne
Automatic retries & failover
LLM providers have transient failures — rate limits, overload errors, network timeouts. When you call them directly, you build retry logic, backoff strategies, and error classification yourself. SentientOne handles all of this transparently.
- Provider-level retriesIf Anthropic returns a 529 (overloaded) or OpenAI returns a 503, SentientOne automatically retries with exponential backoff — up to 3 attempts. Your application never sees the transient failure.
- Intelligent error classificationNot all errors should be retried. Auth failures return immediately; rate limits wait and retry; server errors use backoff. You get the right behaviour without writing error-handling code.
- Timeout protectionLong-running LLM calls are bounded with configurable timeouts. If a provider hangs, the request is cleanly terminated and reported — your application doesn't block indefinitely.
- Streaming resilienceStreaming responses are monitored for stalls. If a stream stops producing chunks, it's detected and surfaced as an error event rather than leaving your client waiting forever.
Token optimization
Token usage directly impacts your LLM costs. SentientOne applies several techniques to keep consumption as low as possible without sacrificing response quality.
- Smart conversation truncationLong histories are automatically truncated to fit within the model's context window while preserving the most recent and relevant messages. You don't manage context windows yourself.
- Efficient tool definitionsMCP tool schemas are optimised before being sent to the LLM. Redundant descriptions and unnecessary metadata are stripped to reduce prompt token usage on every request.
- Response cost trackingEvery request logs prompt tokens, completion tokens, and USD cost. You can identify expensive agents or conversations and optimise system prompts to reduce spend — data most direct-call setups never capture. See Observability.
- Model-aware encodingToken counting and context management use the correct tokeniser for each model (cl100k for GPT-4, Claude's tokeniser for Anthropic). Avoids silent truncation or unexpected overflows from generic counters.
Why this matters
Building all of this yourself is possible — but it takes significant engineering effort, ongoing maintenance, and deep familiarity with each LLM provider's quirks.
Engineering saved
Weeks → days
Cache discount
Up to 90%
Providers supported
4+
Bottom line