Scaling the Intelligence Engine: How SINSA Optimized Lunar's Production-Grade RAG Product
Lunar is a fast-growing, developer-first platform for building, deploying, and optimizing production-grade Retrieval-Augmented Generation (RAG) applications.

The Strategic Challenge: The 'Good Problem' of Enterprise Scale
For the Founder: Lunar was experiencing the best kind of problem: their product was so effective that it was attracting customers far larger than their initial architecture was designed for. This success created a critical inflection point. The very features that had won them early adopters were becoming liabilities at enterprise scale. They faced a classic startup challenge: re-architecting their core product mid-flight. The risk was not just technical; it was existential. Failure to scale would mean losing hard-won enterprise clients and jeopardizing their next funding round.

For the Technical Audience: The core strategic challenge was that their monolithic RAG architecture did not scale efficiently with the size of a client's knowledge base. This manifested in three ways:

- SLA Breach Risk: P95 and P99 query latencies were approaching levels unacceptable for enterprise contracts.
- COGS (Cost of Goods Sold) Escalation: direct LLM API costs per query scaled linearly with context size, creating an unsustainable margin-erosion problem.
- Trust Degradation: declining retrieval accuracy on large, complex document sets was driving up support tickets and eroding user trust.
How We Did It: The Architectural Blueprint for a Scalable RAG System
For the Technical Audience: Our solution was to evolve Lunar's architecture from a single-stage process into a cascaded, multi-stage pipeline that progressively refines context and optimizes resource allocation at each step. Illustrative code sketches for each stage follow the description below.

A. The Retrieval Funnel: From Broad Search to Precise Context

Stage 1: Candidate Retrieval with Hybrid Search

We replaced the single vector search with a parallelized hybrid search system. An incoming query simultaneously triggers two retrieval methods:

- Keyword Search (BM25): essential for capturing specific, literal matches such as model numbers, error codes, or exact phrases where semantic meaning matters less.
- Optimized Vector Search: filtering and pre-selection strategies on the vector database narrow the search space before the k-NN query executes.

This parallel approach allowed us to retrieve a broad but manageable set of ~100 candidate documents quickly and cost-effectively.

Stage 2: Precision with Cross-Encoder Re-Ranking

This was the most critical architectural change. The ~100 candidates are passed to a dedicated cross-encoder re-ranking model. Unlike the bi-encoders used for the initial vector search, cross-encoders process the query and each candidate document together, allowing a much deeper and more accurate assessment of relevance. We prototyped several models and selected the one offering the best trade-off between accuracy and latency. This stage acts as an aggressive filter, funneling the 100 candidates down to the 3-5 most relevant chunks.

Stage 3: Grounded Generation with Dynamic Model Selection

Only the hyper-relevant, re-ranked context is passed to the final generative LLM. We also implemented a dynamic model selection layer: simpler queries are routed to a faster, cheaper model such as Claude 3 Haiku, while more complex, high-value queries go to a premium model such as Claude 3 Opus or GPT-4. The routing logic is based on query complexity analysis and client tier.

B. Evaluation Framework: Ensuring Continuous Improvement

We established a robust MLOps evaluation framework built on a "gold standard" dataset of question-answer pairs curated from Lunar's own logs. This enables continuous, automated regression testing of the entire pipeline, measuring:

- Context Precision & Recall: the effectiveness of the retrieval and re-ranking stages.
- Answer Faithfulness: how closely the generated answer adheres to the provided source context, minimizing hallucination.
- End-to-End Latency and Cost Metrics.
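To make Stage 1 concrete, here is a minimal Python sketch of parallel hybrid retrieval. The fusion step (reciprocal rank fusion), the rank_bm25 library, and all function names are illustrative assumptions; the case study does not specify Lunar's vector store or merging strategy.

```python
# Sketch of Stage 1 hybrid retrieval: BM25 and vector search run in parallel,
# then merge via reciprocal rank fusion (RRF) -- an assumed fusion method.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25


def bm25_search(bm25: BM25Okapi, query: str, k: int) -> list[int]:
    """Return document indices ranked by BM25 score (literal/keyword matches)."""
    scores = bm25.get_scores(query.lower().split())
    return list(np.argsort(scores)[::-1][:k])


def vector_search(doc_embeddings: np.ndarray, query_emb: np.ndarray, k: int) -> list[int]:
    """Cosine-similarity k-NN over a pre-filtered embedding matrix.

    Stands in for the real vector database; in Lunar's pipeline, metadata
    filtering shrinks the search space before the k-NN query runs."""
    sims = doc_embeddings @ query_emb / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_emb)
    )
    return list(np.argsort(sims)[::-1][:k])


def hybrid_retrieve(bm25, doc_embeddings, query, query_emb, k=100) -> list[int]:
    """Run both retrievers in parallel, then merge rankings with RRF."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        kw_future = pool.submit(bm25_search, bm25, query, k)
        vec_future = pool.submit(vector_search, doc_embeddings, query_emb, k)
    fused: dict[int, float] = {}
    for ranking in (kw_future.result(), vec_future.result()):
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (60 + rank)  # standard RRF constant
    return sorted(fused, key=fused.get, reverse=True)[:k]  # ~100 candidates
```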
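Stage 2 in miniature, assuming the sentence-transformers library; the specific checkpoint below is a placeholder, since the case study says several models were prototyped without naming the one Lunar shipped.

```python
# Sketch of Stage 2: cross-encoder re-ranking of retrieved candidates.
from sentence_transformers import CrossEncoder


def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score (query, document) pairs jointly and keep only the best chunks."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder checkpoint
    # Unlike a bi-encoder, the cross-encoder sees query and document together,
    # so relevance is judged on the full interaction between the two texts.
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]  # ~100 candidates -> top 3-5
```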
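The next sketch shows one way the Stage 3 routing layer could look. The complexity heuristic and thresholds are invented for illustration; the case study states only that routing was driven by query complexity analysis and client tier.

```python
# Hypothetical Stage 3 router: cheap/fast model by default, premium models
# for complex or enterprise-tier queries.
def select_model(query: str, client_tier: str) -> str:
    """Route simple queries to a low-cost model, complex ones to a premium one."""
    # Crude complexity proxy: token count plus multi-part question structure.
    complexity = len(query.split()) + 10 * query.count("?")
    if client_tier == "enterprise" and complexity > 40:
        return "claude-3-opus-20240229"   # premium model for high-value queries
    if complexity > 40:
        return "gpt-4"
    return "claude-3-haiku-20240307"      # fast, low-cost default
```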
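Finally, a minimal sketch of the regression harness from Section B, assuming each gold item records the question, the IDs of the chunks that should be retrieved, and a reference answer. The `pipeline` callable and its return signature are hypothetical, and answer-faithfulness scoring (which typically needs an LLM judge) is omitted here.

```python
# Sketch of automated regression testing over the gold-standard dataset.
import time
from dataclasses import dataclass


@dataclass
class GoldItem:
    question: str
    relevant_chunk_ids: set[str]
    reference_answer: str  # reserved for the faithfulness judge (not shown)


def evaluate(pipeline, gold_set: list[GoldItem]) -> dict[str, float]:
    """Run the full pipeline against the gold set and aggregate metrics."""
    precisions, recalls, latencies = [], [], []
    for item in gold_set:
        start = time.perf_counter()
        retrieved_ids, _answer = pipeline(item.question)  # hypothetical signature
        latencies.append(time.perf_counter() - start)
        hits = len(set(retrieved_ids) & item.relevant_chunk_ids)
        precisions.append(hits / len(retrieved_ids) if retrieved_ids else 0.0)
        recalls.append(hits / len(item.relevant_chunk_ids))
    n = len(gold_set)
    return {
        "context_precision": sum(precisions) / n,
        "context_recall": sum(recalls) / n,
        "p95_latency_s": sorted(latencies)[int(0.95 * (n - 1))],
    }
```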
The Strategic Impact: Unlocking New Growth Levers
For the Founder: The technical re-architecture was not just an engineering exercise; it was a fundamental business-model upgrade. The results translated directly into a stronger, more defensible market position for Lunar.

The Ultimate Outcome: Product-Led Growth Unlocked

The most significant business impact was that the flexible, multi-stage pipeline enabled a sophisticated, tiered product strategy. Lunar could now go to market with:

- A Free/Developer Tier using the fast initial retrieval stage.
- A Pro Tier that activates the high-accuracy re-ranker.
- An Enterprise Tier that adds dynamic model selection and the most powerful generative LLMs.

This allowed them to align cost and value perfectly with their customer segments, creating a powerful engine for upselling and product-led growth. Our partnership didn't just fix a bottleneck; it helped them build their next-generation product line.
The Results
- 40% reduction in average query latency
- 30% lower LLM operational costs
- 25% increase in retrieval accuracy
- New product tiers enabled by the flexible pipeline