Semantic caching for LLM APIs — early access open

Stop paying for
the same LLM call twice.

Pearlite sits between your app and your LLM API. One line of
code. Up to 40% cost reduction. Works with every major provider.

Get Early Access → See the Product ↓
pearlite — live demo
Send a query
Similarity threshold
0.92
Cache stream
Send a query to see results →
<12ms  cached response
40%  avg cost reduction
OpenAI · Anthropic · Gemini
Groq · Mistral · Any OAI-compatible
Integrate

Drop in. No rewrites.

A direct replacement for your LLM SDK across any language — no new paradigms, no config files. Just wrap and go.

import { Pearlite } from 'pearlite'

const client = new Pearlite({
  apiKey: 'pk_live_••••••••',
  similarity: 0.92, // tune aggressiveness
  provider: 'openai'
})

// Drop-in for OpenAI SDK
const { data, cached, saved, latency } = await client.chat({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: userMessage }]
})

// cached  → true | false
// saved   → '$0.008'
// latency → '11ms'
Live Cache Stream
The Problem

Your AI bill is full of questions
you’ve already answered.

Every semantically identical query costs you full price. Pearlite ends that.

Without Pearlite
10:42:01  "What are your pricing plans?"  $0.008
10:42:03  "how much does it cost?"        $0.008
10:42:07  "pricing?"                      $0.008
10:42:09  "What are your pricing plans?"  $0.008
10:42:12  "do you have a free tier?"      $0.008
10:42:14  "how much does this cost?"      $0.008
With Pearlite
10:42:01  "What are your pricing plans?"       $0.008
10:42:03  "how much does it cost?"        HIT  $0.000
10:42:07  "pricing?"                      HIT  $0.000
10:42:09  "What are your pricing plans?"  HIT  $0.000
10:42:12  "do you have a free tier?"      HIT  $0.000
10:42:14  "how much does this cost?"      HIT  $0.000
How It Works

The full infrastructure, explained.

Pearlite is a transparent proxy that intercepts LLM calls, computes semantic similarity, and returns cached responses in milliseconds — without ever touching your application logic.

Request Flow Architecture
Your App
API call
Pearlite
Proxy
embed query
Vector Store
(per workspace)
sim ≥ threshold
Cache Hit
<12ms
sim < threshold
LLM Provider
600–1200ms
STEP 01
Query Interception
Your request hits the Pearlite proxy endpoint instead of the provider directly. We parse the message payload and extract the semantic content to embed. Zero changes to your existing code structure.
STEP 02
Vector Embedding
The query is converted to a high-dimensional embedding using a fast embedding model (<3ms). This vector captures semantic meaning — not just keywords — enabling fuzzy matching across rephrasings, typos, and synonyms.
STEP 03
Similarity Search
We perform ANN (approximate nearest-neighbor) search across your workspace's vector store. If cosine similarity ≥ your configured threshold, the cached response is returned immediately without touching the LLM API.
STEP 04
Miss Handling + Store
On a cache miss, the request is forwarded transparently to your LLM provider. The response streams back to your app while simultaneously being stored in your isolated vector store for future hits. TTL and eviction policies are fully configurable.
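The four steps above can be sketched in a few lines. This is an illustrative TypeScript sketch, not Pearlite's actual internals: embed(), callProvider(), and the linear scan stand in for the real embedding model, provider SDK, and ANN index.

```typescript
type CacheEntry = { vector: number[]; response: string };

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function handleQuery(
  query: string,
  embed: (q: string) => Promise<number[]>,      // Step 02: embedding model
  store: CacheEntry[],                           // per-workspace vector store
  callProvider: (q: string) => Promise<string>,  // Step 04: forward to LLM
  threshold = 0.92
): Promise<{ response: string; cached: boolean }> {
  const vector = await embed(query);
  // Step 03: nearest-neighbor scan (real systems use an ANN index).
  let best: CacheEntry | null = null;
  let bestSim = -1;
  for (const entry of store) {
    const sim = cosine(vector, entry.vector);
    if (sim > bestSim) { bestSim = sim; best = entry; }
  }
  if (best && bestSim >= threshold) {
    return { response: best.response, cached: true }; // cache hit
  }
  // Miss: forward to the provider and store the result for future hits.
  const response = await callProvider(query);
  store.push({ vector, response });
  return { response, cached: false };
}
```

The same miss-then-store path is what makes the cache warm itself: the first phrasing of a question pays full price, and every semantically close rephrasing after it is free.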
Analytics

Everything visible. Always.

A real-time dashboard that shows where every dollar goes and how much Pearlite is saving you — per day, per model, per use case.

Total Calls Today
24,847
+12% vs yesterday
Cache Hit Rate
41.2%
10,234 hits today
Saved This Month
$960
vs $2,400 without Pearlite
Daily API spend — before vs after Pearlite
Last 14 days
Cache hit rate by category
Avg across beta users
Response latency — hit vs miss
Distribution in ms
The Math

See what you’d save.

Move the slider. Numbers update in real time.

Monthly LLM API calls: 100,000
Current Spend
$800
at $0.008 per call
With Pearlite
$480
40% avg cache hit rate
You Save
$320
every month

Based on observed 40% avg cache hit rate. Customer support bots typically see 55–70%.
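The calculator reduces to one line of arithmetic, assuming a flat per-call price:

```typescript
// Sketch of the slider math above, assuming a flat per-call price.
function monthlySavings(calls: number, pricePerCall: number, hitRate: number) {
  const currentSpend = calls * pricePerCall;        // what you pay today
  const withCache = currentSpend * (1 - hitRate);   // only misses hit the API
  return { currentSpend, withCache, saved: currentSpend - withCache };
}

const { currentSpend, withCache, saved } = monthlySavings(100_000, 0.008, 0.4);
// ≈ $800 spend today; at a 40% hit rate, ≈ $480 spend and ≈ $320 saved
```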

Use Cases

Built for how AI is actually used.

Anywhere users ask questions, Pearlite finds patterns and saves you money.

Customer Support Bots
Users ask the same 50 questions in 500 different ways. Pearlite catches all of them — pricing, refunds, account issues, all cached.
avg 62% cache rate
RAG Applications
Document Q&A systems get hammered with repeated queries. Cache the answers, not just the chunks. Cut retrieval costs too.
avg 44% cache rate
Internal Copilots
Your team asks the same codebase questions every day. Stop paying for the same answer twice. Pearlite gets smarter with every query.
avg 38% cache rate
Public AI Features
Search bars, content generators, recommendation engines — semantic deduplication at scale. Ship cheaper, faster.
avg 41% cache rate
Vertical AI Apps
Legal, medical, finance — domain-specific apps where the same queries repeat constantly. High stakes, high volume, high savings.
avg 51% cache rate
Analytics Queries
Natural language to SQL loops hit identical queries constantly. Pearlite eliminates redundancy at the source before the LLM even sees it.
avg 48% cache rate
Under the Hood

Semantic, not string matching.

Exact-match caching is table stakes. Pearlite understands meaning — so "pricing?" and "how much does this cost?" are the same query.

Similarity Engine
Vector-based understanding
Every query converts to an embedding. Compared against your cache via cosine similarity. You control the threshold from conservative to aggressive.
"What's the price?"0.97HIT
"How much does it cost?"0.94HIT
"pricing?"0.91HIT
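The threshold knob is the whole tuning surface. A hypothetical sketch using the example scores above (illustrative numbers, not live output):

```typescript
// Illustrative similarity scores from the panel above.
const scores = [
  { query: "What's the price?", sim: 0.97 },
  { query: "How much does it cost?", sim: 0.94 },
  { query: "pricing?", sim: 0.91 },
];

// Lower threshold = more aggressive matching, higher hit rate.
const threshold = 0.90;
const hits = scores.filter(s => s.sim >= threshold);
// All three example queries clear a 0.90 threshold; raising it toward
// 0.95 trades hit rate for stricter matching.
```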
Data Privacy
Your data stays yours
Embeddings and responses stored in an isolated vector store per workspace. Never trains our models. Never shared. Fully deletable on request.
Isolated per workspace
Zero model training on your data
Full deletion on request
SOC 2 in progress
Latency
Speed your users will notice
Cache hits return under 12ms. Compared to 600–1200ms for a live LLM call. Speed and savings in the same package.
LLM call
840ms
Pearlite
11ms
Provider Support

Works with every major LLM provider.

If it takes a prompt, Pearlite can cache it.

OpenAI
Anthropic
Google Gemini
Groq
Mistral
Any OpenAI-compatible API
Documentation

Everything you need, documented.

From quickstart to advanced configuration, the docs cover everything. Most teams are in production within an hour.

Quickstart
Get from zero to your first cache hit in under 5 minutes. Works with any existing LLM integration.
→ START HERE
Configuration
Similarity thresholds, TTL settings, per-model policies, namespace isolation, and streaming support.
→ ADVANCED
API Reference
Full SDK reference for Node.js, Python, and the REST API. Response types, error codes, headers.
→ REFERENCE
Dashboard
Explore cache analytics, hit rate by model, cost savings over time, and workspace management.
→ DASHBOARD
Security & Privacy
Workspace isolation, data handling, retention policies, deletion APIs, and our SOC 2 roadmap.
→ SECURITY
Migration Guide
Already using Redis or exact-match caching? Migrate to semantic caching without losing your existing cache.
→ MIGRATE
Open Source

Built in public. Contributions welcome.

The Pearlite SDK is open source. The core proxy logic, embedding pipeline, and SDK clients are all on GitHub. Contributions welcome.

View on GitHub → npm install pearlite
Pricing

Simple, usage-based pricing.

Pay only for what you use. No seats, no contracts, no hidden fees. Scale up or down any time.

Hobby
$0
For side projects and exploration. No credit card required.
  • 10,000 API calls / month
  • 1 workspace
  • All LLM providers
  • Basic analytics dashboard
  • Community support
Get Started Free
Enterprise
Custom
For high-volume teams with compliance and SLA requirements.
  • Unlimited API calls
  • Unlimited workspaces
  • SOC 2 (in progress)
  • VPC deployment option
  • SLA + dedicated support
  • Custom data retention
Talk to Us

Developers are screaming about this.

Across Reddit, Hacker News, and Twitter — the same frustration keeps surfacing. We built Pearlite because we felt it too.

r/MachineLearning
↑ 1.4k
"Our LLM API bill went from $400 to $3,200 in one month after we launched our chatbot. Looking at the logs, at least 60% of questions are variations of the same 20 things. There has to be a better way."
u/devops_tired · 847 comments
Hacker News
↑ 312
"We're paying $0.008 per call and our support bot answers 'what are your pricing plans?' about 4,000 times a day. That's $32/day for the exact same answer. Caching by string match doesn't work because everyone phrases it differently."
thrwaway_swe · 6 hours ago
↗ 2.1k
"my llm api costs are insane. same question asked 50 different ways = 50 full price calls. 'what's the refund policy' 'can i get a refund' 'do you have refunds' — why is there no semantic caching layer for this lol"
@indie_hacker_io
r/SideProject
↑ 891
"Just got my first OpenAI bill for a full month — $1,800. I nearly choked. I built a customer support bot and everyone keeps asking the same FAQ questions. I need to cache LLM responses but exact match caching barely helps because natural language is messy."
u/bootstrapped_dev · 234 comments
Hacker News
↑ 178
"Every time I look at our LLM spend analytics, it's depressing. Tons of near-duplicate queries hitting the API fresh every single time. We tried building our own semantic cache but maintaining embeddings infrastructure is a whole project in itself."
silentcoder_sf · 3 hours ago
↗ 987
"spending $2k/month on gpt-4 for a FAQ bot. the answer to 'how do i cancel' is identical whether someone types 'how do i cancel', 'cancellation process', or 'i want to cancel my subscription'. paying for all three separately is painful"
@yc_founder_23
r/startups
↑ 654
"LLM API costs are the #1 thing killing our margins. We're a small team, our product is AI-first, and every user query costs money. I've tried prompt caching, exact-match Redis caching... but semantic deduplication? Nobody has a clean solution."
u/ramen_profitable · 119 comments
Hacker News
↑ 445
"Ask HN: How do you handle LLM cost optimization at scale? We're at $15k/month and climbing. Semantic caching seems promising but building it in-house feels like a distraction from our actual product."
building_in_public · 78 comments

LLM caching.
Available today.

We're personally onboarding the first 100 teams. If you're spending $500+/month on LLM APIs and want to cut that bill — get in line.

No credit card · Free tier available · Setup in 5 minutes
63 teams already on the waitlist