
Why Your RAG Pipeline Lies: The Hidden Architecture Decisions That Determine Whether AI Actually Retrieves the Right Information

Staff Writer | Contributing Writer | Jul 5, 2026 | 9 min read ✓ Reviewed

You've built a Retrieval-Augmented Generation (RAG) system. You've loaded your documents into a vector database, wired it up to a language model, and asked it a question. Sometimes it gives a brilliant, accurate answer. Other times it confidently makes something up — a hallucination so polished it's hard to spot. The frustrating truth is that both outcomes flow directly from architectural decisions most people make without thinking twice.

RAG itself is a straightforward idea: instead of relying on a language model's frozen training knowledge, you retrieve relevant text from your own documents at query time and hand it to the model as context. The model answers from what it was just given, not from memory. Done well, this is powerful. Done carelessly, it introduces a chain of subtle failure points — and the most dangerous ones happen long before the language model sees a single word.

What a RAG Pipeline Actually Is

Before diagnosing what goes wrong, it helps to see the full pipeline clearly. A typical RAG system has two phases:

Indexing phase (done once, upfront): Your source documents are split into smaller pieces called chunks. Each chunk is converted into a numerical vector — called an embedding — by an embedding model. These vectors are stored in a vector database alongside the original text.

Query phase (done at runtime): When a user asks a question, that question is also converted into an embedding using the same model. The system searches the vector database for chunks whose embeddings are mathematically closest to the query's embedding — this is called similarity search. The top matching chunks are retrieved, handed to the language model as context, and the model generates an answer.
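The two phases can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` here is a toy bag-of-words counter standing in for a real embedding model, the "vector database" is just a Python list, and the names `build_index` and `retrieve` are made up for this example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    # Indexing phase: embed each chunk once and store it next to its text.
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    # Query phase: embed the query with the same function, rank by similarity.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

chunks = [
    "The warranty covers battery defects for two years.",
    "Shipping takes five business days within Europe.",
    "Returns are accepted within thirty days of delivery.",
]
index = build_index(chunks)
top = retrieve(index, "How long does the warranty cover the battery?", k=1)
print(top)  # the warranty chunk is the closest match
```

In a real system the retrieved chunks would then be placed in the language model's prompt as context; that step is omitted here.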

Simple on paper. But each step in this pipeline has its own way of going wrong.

The Chunking Problem: Garbage In, Garbage Out

Chunking is the least glamorous part of a RAG pipeline and arguably the most consequential. The way you split your documents determines the shape of every piece of information the system will ever retrieve. Get it wrong here and no amount of clever retrieval logic downstream can fix it.

What Is Fixed-Size Chunking — and Why It Fails

The simplest approach is to divide every document into chunks of a fixed size — say, every 512 tokens (a token is roughly a word or word-fragment). This is fast and easy to implement. It's also frequently harmful.

Naive fixed-size chunking (for example, 512-token chunks) can split sentences mid-thought, which tends to make the resulting embeddings, and the similarity scores computed from them, less accurate than those from semantic or recursive chunking. Think about what that means concretely: a sentence that begins at the end of one chunk and finishes at the start of the next is now spread across two separate embeddings, neither of which captures the full thought. A question about that idea may fail to retrieve either chunk with a high score, because neither one looks like a complete, coherent answer.
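A tiny example makes the split visible. This sketch assumes whitespace splitting as a crude tokenizer and a chunk size of 4 instead of 512, purely so the effect fits on screen; `fixed_size_chunks` is a hypothetical helper, not a library function.

```python
def fixed_size_chunks(tokens: list[str], size: int) -> list[list[str]]:
    """Cut the token list every `size` tokens, ignoring sentence boundaries."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

text = "Refunds are issued within 30 days. Store credit never expires."
tokens = text.split()  # whitespace split as a crude tokenizer (assumption)
chunks = [" ".join(c) for c in fixed_size_chunks(tokens, size=4)]

for chunk in chunks:
    print(repr(chunk))
# 'Refunds are issued within'
# '30 days. Store credit'
# 'never expires.'
```

Neither of the first two chunks contains the full refund rule, and the store-credit sentence is cut in half as well.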

Better Alternatives: Semantic and Recursive Chunking

Recursive chunking works by trying to split at natural boundaries first — paragraphs, then sentences, then words — only resorting to hard character limits when necessary. This preserves far more semantic integrity in each chunk.
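One way to sketch that fallback order is below. It is a simplified illustration written for this article, not the implementation used by any particular library: the separator patterns, the character-based limit, and the function name `recursive_split` are all assumptions.

```python
import re

# Tried in order: paragraph breaks, sentence ends, then any whitespace.
SEPARATORS = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]

def recursive_split(text: str, max_chars: int, level: int = 0) -> list[str]:
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if level >= len(SEPARATORS):
        # No natural boundary left: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    chunks: list[str] = []
    current = ""
    for piece in re.split(SEPARATORS[level], text):
        piece = piece.strip()
        if not piece:
            continue
        # Greedily merge neighbouring pieces while they still fit.
        # (Merged pieces are joined with a single space for simplicity.)
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, max_chars, level + 1))
    if current:
        chunks.append(current)
    return chunks

text = ("Refunds are issued within 30 days. Store credit never expires.\n\n"
        "Shipping takes five days.")
pieces = recursive_split(text, max_chars=40)
print(pieces)
```

With a 40-character limit, the first paragraph is too long to keep whole, so it is split at the sentence boundary rather than mid-sentence; the second paragraph fits and is kept intact.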

Semantic chunking goes further: it embeds sentences as it goes and watches for drops in similarity between consecutive sentences. When similarity drops sharply, it treats that as a natural topic boundary and starts a new chunk there. The result is chunks that each contain one coherent idea rather than an arbitrary slice of text.
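A rough sketch of the idea follows. It assumes the text is already split into sentences, uses a toy bag-of-words `embed` in place of a real embedding model, and uses a fixed similarity threshold of 0.3 chosen only for this example; real implementations may pick the breakpoint differently.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.3) -> list[str]:
    """Start a new chunk whenever consecutive sentences stop resembling each other."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        vector = embed(sentence)
        if cosine(previous, vector) < threshold:
            groups.append([sentence])   # similarity dropped: topic boundary
        else:
            groups[-1].append(sentence)  # same topic: extend current chunk
        previous = vector
    return [" ".join(group) for group in groups]

sentences = [
    "The battery lasts ten hours.",
    "The battery charges in two hours.",
    "Shipping is free within Europe.",
    "Shipping to Asia is not free.",
]
chunks = semantic_chunks(sentences)
print(chunks)
```

Here the two battery sentences end up in one chunk and the two shipping sentences in another, because similarity falls to zero between the second and third sentence.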

A useful complementary technique is to include overlap between chunks: the last few sentences (or tokens) of chunk N are repeated at the start of chunk N+1. As long as the overlap is at least as long as the straddling sentence, that sentence exists in full in at least one chunk, which reduces the chance that a mid-context split causes a retrieval failure.
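Overlap is essentially a sliding window whose step is smaller than its size. The sketch below works on a list of sentences; the helper name `chunk_with_overlap` and the `size` and `overlap` parameters are illustrative assumptions, and the same pattern applies to token lists.

```python
def chunk_with_overlap(items: list[str], size: int, overlap: int) -> list[list[str]]:
    """Sliding window: each chunk repeats the last `overlap` items of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(items[start:start + size])
        if start + size >= len(items):
            break  # the last window already reached the end of the input
    return chunks

sentences = ["S1", "S2", "S3", "S4", "S5"]
chunks = chunk_with_overlap(sentences, size=3, overlap=1)
print(chunks)  # [['S1', 'S2', 'S3'], ['S3', 'S4', 'S5']]
```

The cost of overlap is a larger index, since repeated text is embedded and stored more than once.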

The Embedding Model Problem: Domain Mismatch

Once your chunks exist, embedding models convert them into vectors. The idea is that two pieces of text with similar meaning end up close together in the vector space, so similarity search can find relevant matches. But embedding models are not universal translators of meaning — they are trained on specific data, in specific domains, using specific vocabulary.

The BEIR benchmark (Benchmarking IR), published in 2021, demonstrated that embedding models fine-tuned on one domain often degrade substantially in retrieval accuracy when applied to out-of-domain corpora. In plain terms: an embedding model trained on web pages or Wikipedia might be excellent at finding relevant results in those contexts, but perform poorly when you point it at legal contracts, medical literature, or internal engineering documentation. The words and relationships that matter in those domains weren't well-represented in training.

This is why choosing an embedding model isn't just a performance checkbox. If your documents use specialized vocabulary — and most enterprise or technical documents do — you need to evaluate whether your embedding model actually understands how terms in that domain relate to each other. Options include fine-tuning a model on your domain, or selecting a model specifically trained for it.

The Query Problem: Users Don't Write Like Documents

Even with good chunking and a well-matched embedding model, there's a structural tension at the heart of similarity search: user queries and the documents you're searching look very different from each other. A query is typically a short question. A document chunk is a dense passage of exposition. They may share the same underlying topic without sharing much surface vocabulary, and their embeddings may not end up very close in vector space as a result.

HyDE: Searching with Hypothetical Answers

One elegant solution to this problem is Hypothetical Document Embeddings, or HyDE. Introduced in a 2022 paper by Gao et al., HyDE improves retrieval by first generating a hypothetical answer to the query, then embedding that answer for similarity search instead of the raw query.

The intuition is straightforward: a hypothetical answer looks more like the documents in your index than the raw question does. When you embed that hypothetical answer and search for similar chunks, you're searching in the right neighborhood of the vector space — even if the hypothetical answer itself is imperfect or partially wrong. You don't use the hypothetical answer as the final response; you use it as a better search probe. The actual retrieved chunks still come from your trusted documents.
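The control flow can be sketched as follows. Everything model-related is a stand-in: `embed` is a toy bag-of-words counter, and `fake_llm` is a hard-coded function playing the role of the language model call that would write the hypothetical answer. The function names are invented for this illustration.

```python
import math
import re
from collections import Counter
from typing import Callable

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(probe_text: str, index: list[tuple[str, Counter]], k: int) -> list[str]:
    probe = embed(probe_text)
    ranked = sorted(index, key=lambda item: cosine(probe, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def hyde_search(query: str, index: list[tuple[str, Counter]],
                generate_hypothetical: Callable[[str], str], k: int = 1) -> list[str]:
    # Embed a generated answer instead of the raw question.
    hypothetical = generate_hypothetical(query)
    return search(hypothetical, index, k)

def fake_llm(query: str) -> str:
    # Hypothetical stand-in for an LLM call; the answer need not be correct.
    return "Yes, refunds are issued within a few days of purchase."

docs = [
    "Our office is closed on public holidays.",
    "Refunds are issued within thirty days of purchase.",
]
index = [(d, embed(d)) for d in docs]
query = "Can I get my money back?"

raw_scores = [cosine(embed(query), vec) for _, vec in index]
hyde_result = hyde_search(query, index, fake_llm, k=1)
print(raw_scores)   # the raw question shares no vocabulary with either document
print(hyde_result)  # the hypothetical answer lands next to the refund policy
```

Note that the hypothetical answer says "a few days" while the real policy says thirty: the probe is wrong on the detail but still finds the right chunk, and only the retrieved chunk is passed on to the model.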

The Reranking Problem: Top-K Isn't Always Good-K

Vector similarity search returns a ranked list of chunks: the ones whose embeddings are closest to the query embedding. This is efficient and scalable, but it has a meaningful weakness: the query and each chunk are embedded independently of one another. A chunk's vector is computed at indexing time, with no awareness of what any particular query is asking, so its score reflects only whether two separately computed vectors happen to be numerically close.

This is where reranking enters the pipeline.

How Cross-Encoder Rerankers Work

A reranker is a second-pass model that takes the top results from your initial similarity search and re-scores them with a more sophisticated lens. Instead of comparing vectors independently, a cross-encoder reranker reads the query and a candidate chunk together as a pair, in full, and predicts a relevance score directly.

Cross-encoder rerankers, such as those based on BERT-style models, evaluate query-document pairs jointly rather than independently, and have been shown to significantly improve top-k retrieval precision over bi-encoder similarity alone. Because the model sees both the query and the document at the same time, it can reason about their relationship in context — picking up on nuances that raw vector distance misses entirely.

The practical architecture looks like this: vector search retrieves the top 50 or 100 candidate chunks quickly. The reranker then scores each of those 50–100 pairs against the query and returns only the top 3–5 genuinely relevant ones. The language model then answers from those. This two-stage design gives you the speed of vector search and the precision of a deeper model — without needing to run the expensive reranker across your entire corpus.
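The two-stage flow can be sketched like this. The scoring functions are deliberately crude stand-ins: `embed` is a bag-of-words counter that ignores word order, and `toy_pair_score` counts shared word pairs as a placeholder for a real cross-encoder that would read the query and chunk together. All names and the tiny candidate counts are assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text: str) -> Counter:
    """Toy bi-encoder stand-in: bag of words, blind to word order."""
    return Counter(tokens(text))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def toy_pair_score(query: str, chunk: str) -> float:
    """Placeholder for a cross-encoder: scores the pair jointly via shared bigrams."""
    q, c = tokens(query), tokens(chunk)
    return float(len(set(zip(q, q[1:])) & set(zip(c, c[1:]))))

def retrieve_then_rerank(query: str, index: list[tuple[str, Counter]],
                         score_pair: Callable[[str, str], float],
                         n_candidates: int = 50, k: int = 3) -> list[str]:
    # Stage 1: cheap vector search over the whole index.
    q = embed(query)
    candidates = sorted(index, key=lambda item: cosine(q, item[1]),
                        reverse=True)[:n_candidates]
    # Stage 2: expensive pairwise scoring, only on the shortlist.
    reranked = sorted(candidates, key=lambda item: score_pair(query, item[0]),
                      reverse=True)
    return [text for text, _ in reranked[:k]]

docs = ["Man bites dog.", "A dog bites man daily.", "Cats sleep all day."]
index = [(d, embed(d)) for d in docs]
result = retrieve_then_rerank("dog bites man", index, toy_pair_score,
                              n_candidates=2, k=1)
print(result)  # ['A dog bites man daily.']
```

Vector search alone would put "Man bites dog." first because it contains exactly the query's words; the pairwise scorer sees that the word order, and therefore the meaning, matches the second document better.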

How the Pieces Fit Together

It's worth stepping back to see how these components interact, because a failure in any one of them cascades through the rest.

  • Bad chunking means your embeddings represent incoherent fragments. Even a great embedding model can't save a split sentence.
  • Domain-mismatched embeddings mean chunks that are semantically relevant don't appear close in vector space — so they never even make it into the candidate set.
  • Poor query formulation (the mismatch between question and document style) means similarity search probes the wrong region of the vector space — HyDE is one way to address this.
  • No reranking means you hand the language model your top results by vector distance, which may include loosely related chunks and exclude the truly relevant ones that ranked slightly lower.

When all these things go wrong together, the language model receives context that is technically from your documents but is not actually relevant to the question. It then does what language models do: it generates a coherent-sounding answer from whatever it was given. That's not the model hallucinating in the traditional sense — it's the retrieval pipeline hallucinating on the model's behalf.

Practical Takeaways for Building Better RAG

If you're building or debugging a RAG system with an LLM, here's where to focus your attention:

Audit your chunks first. Before worrying about any other component, look at your actual chunks. Are they coherent units of meaning? Do they contain complete sentences? If you're using fixed-size chunking, experiment with recursive or semantic chunking and spot-check the results manually.

Test your embedding model on your domain. Run a small evaluation: take questions you know the answers to, retrieve the top chunks, and check whether the right chunk is actually in the results. If it's consistently missing, your embedding model may not speak your domain's language.
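A small hit-rate check along these lines might look like the sketch below. The evaluation set, the toy bag-of-words `embed`, and the function name `hit_rate_at_k` are all assumptions for illustration; in practice you would plug in your real retriever and a larger set of known question and chunk pairs.

```python
import math
import re
from collections import Counter
from typing import Callable

def embed(text: str) -> Counter:
    """Toy stand-in for the embedding model under test."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

chunks = [
    "The warranty covers battery defects for two years.",
    "Shipping takes five business days within Europe.",
    "Returns are accepted within thirty days of delivery.",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(question: str, k: int) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def hit_rate_at_k(eval_set: list[tuple[str, str]],
                  retriever: Callable[[str, int], list[str]], k: int) -> float:
    """Fraction of questions whose known-correct chunk appears in the top k."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = sum(1 for question, expected in eval_set
               if expected in retriever(question, k))
    return hits / len(eval_set)

# Hand-labelled pairs: (question, chunk that actually answers it).
eval_set = [
    ("How long does the warranty cover the battery?", chunks[0]),
    ("Can I send an item back?", chunks[2]),
]
score = hit_rate_at_k(eval_set, retrieve, k=1)
print(score)  # 0.5
```

The second question misses because it shares no vocabulary with the returns chunk, which is exactly the kind of gap this check is meant to expose.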

Add a reranker before you do anything else to your prompt. A cross-encoder reranker is often the single highest-leverage addition to a RAG pipeline because it dramatically improves the quality of what the language model actually reads. The improvement in final answer quality tends to be immediate and measurable.

Consider HyDE for question-heavy use cases. If users are asking genuine questions (rather than issuing keyword-style searches), generating a hypothetical answer as a search probe can significantly improve retrieval relevance, especially when documents are written in a declarative style that doesn't naturally mirror question phrasing.

RAG is one of the most powerful patterns in applied NLP today, but its power depends entirely on whether the retrieval half actually works. The language model at the end of the pipeline can only be as good as what you put in front of it. Getting chunking, embeddings, and reranking right isn't optional infrastructure — it's the whole job.

Staff Writer

Contributing Writer at UMI Groups

Related Articles