Once your documents are chunked and stored in Qdrant, retrieval is the step that decides what the LLM actually sees. Most tutorials show a single similaritySearch(query, 5) call and move on. That works for a demo, but basic nearest-neighbor search fails in predictable ways in production: it misses relevant documents when the query is phrased differently from the stored text, returns exactly k results even when none of them are relevant, struggles when exact keywords matter as much as meaning, and delivers small precise chunks that strip away the surrounding context the LLM needs for a complete answer.
This tutorial covers four retrieval strategies that each solve a distinct version of these failures. You will run real code against a shared Qdrant collection and see how each technique works, when it outperforms basic search, and when it is not the right fit.
This article follows on from the document chunking techniques guide, which covers splitting documents before ingestion. For setting up Qdrant itself, see the Getting Started with Qdrant guide.
Why Retrieval Technique Matters
When you embed a query and search for the nearest vectors, you are betting that the embedding model captures the user’s full intent in a single fixed-size vector. That assumption breaks in predictable ways.
A query like “why won’t my service start” is semantically different from “service startup failure” even though they describe the same problem. The user’s phrasing produces a different point in embedding space than the documentation’s phrasing. A single vector search may miss the most relevant document entirely.
Keyword searches have the opposite problem. They find exact term matches but miss documents that explain the same concept using different words.
Add in the fact that most RAG chunk sizes are tuned for retrieval precision rather than for delivering enough context for a full answer, and you end up with several distinct failure modes, each needing a different fix. Each technique in this tutorial targets one of them directly.
Prerequisites
- Ubuntu 20.04, 22.04, or 24.04
- Docker running for Qdrant
- Node.js 18 or newer (node --version to check)
- Ollama installed with nomic-embed-text and llama3.2 pulled
- Qdrant running on port 6333
Start Qdrant if it is not already running:
docker run -d --name qdrant -p 6333:6333 qdrant/qdrant
Pull the LLM used for multi-query retrieval:
ollama pull llama3.2
Project Setup
Create a project directory and install the dependencies used across all four techniques:
mkdir ~/retrieval-demo && cd ~/retrieval-demo
npm init -y
npm pkg set type=module
npm install @langchain/qdrant @langchain/ollama @langchain/core langchain @langchain/textsplitters @xenova/transformers
- @langchain/qdrant connects LangChain to Qdrant as a vector store
- @langchain/ollama provides OllamaEmbeddings and ChatOllama
- langchain includes the MultiQueryRetriever and other higher-level abstractions
- @langchain/textsplitters is used for chunking in the parent document technique
- @xenova/transformers runs cross-encoder models locally via ONNX without a Python runtime
Shared Sample Data
All four techniques query the same Qdrant collection. Create setup.js and run it once before working through any of the techniques:
import { QdrantVectorStore } from "@langchain/qdrant";
import { OllamaEmbeddings } from "@langchain/ollama";
import { Document } from "@langchain/core/documents";
const embeddings = new OllamaEmbeddings({
model: "nomic-embed-text",
baseUrl: "http://localhost:11434",
});
const docs = [
new Document({
pageContent:
"Nginx processes connections using an event loop rather than spawning a thread per request. This allows a single worker to serve thousands of concurrent connections without exhausting system memory.",
metadata: { topic: "nginx", type: "concept" },
}),
new Document({
pageContent:
"To configure Nginx as a reverse proxy, define an upstream block with your backend address, then reference it inside a location block using proxy_pass. Add proxy_set_header lines to forward the client IP and host header.",
metadata: { topic: "nginx", type: "howto" },
}),
new Document({
pageContent:
"If Nginx fails to start, run nginx -t to validate the configuration file before restarting. Common startup failures include a syntax error in nginx.conf, a port already in use by another process, or a missing SSL certificate referenced in the config.",
metadata: { topic: "nginx", type: "troubleshooting" },
}),
new Document({
pageContent:
"Redis Sentinel provides automatic failover by monitoring master and replica nodes. When the Sentinel quorum agrees the master is unreachable, Sentinel promotes a replica and notifies clients of the new master address.",
metadata: { topic: "redis", type: "concept" },
}),
new Document({
pageContent:
"To enable password authentication in Redis, add the requirepass directive to redis.conf and restart the service. Clients must issue the AUTH command immediately after connecting before sending any other commands.",
metadata: { topic: "redis", type: "howto" },
}),
new Document({
pageContent:
"If Redis cannot write data to disk, verify that the directory set in the dir configuration option exists and is writable by the Redis process user. Also check available disk space and confirm that AOF or RDB persistence is not disabled.",
metadata: { topic: "redis", type: "troubleshooting" },
}),
new Document({
pageContent:
"A Kubernetes Deployment manages a set of identical pods through a ReplicaSet. When a pod crashes, the ReplicaSet controller creates a replacement automatically without manual intervention.",
metadata: { topic: "kubernetes", type: "concept" },
}),
new Document({
pageContent:
"To expose a Kubernetes Deployment outside the cluster, create a Service with type LoadBalancer or NodePort. LoadBalancer provisions a cloud load balancer. NodePort opens a static port on every cluster node.",
metadata: { topic: "kubernetes", type: "howto" },
}),
new Document({
pageContent:
"If a Kubernetes pod stays in CrashLoopBackOff, inspect the container output with kubectl logs. Common causes are a missing environment variable, a failing liveness probe, or an application that exits immediately due to a misconfigured startup command.",
metadata: { topic: "kubernetes", type: "troubleshooting" },
}),
];
await QdrantVectorStore.fromDocuments(docs, embeddings, {
url: "http://localhost:6333",
collectionName: "infra_docs",
});
console.log("Setup complete.", docs.length, "documents indexed into infra_docs.");
node setup.js
Technique 1: Multi-Query Retrieval
When a user’s query is short or ambiguous, a single embedding vector may not capture the full intent. A question like “why won’t it start” might embed close to concept documents rather than troubleshooting ones, simply because the phrasing does not include keywords like “crash”, “failure”, or “startup error”.
Multi-query retrieval fixes this by asking an LLM to generate several rephrasings of the original query, running a separate vector search for each variant, and returning the deduplicated union of all results. A document that only matches one phrasing of the question still gets retrieved.
LangChain’s MultiQueryRetriever handles all of this. Create retrieve-multiquery.js:
import { QdrantVectorStore } from "@langchain/qdrant";
import { OllamaEmbeddings, ChatOllama } from "@langchain/ollama";
import { MultiQueryRetriever } from "langchain/retrievers/multi_query";
const embeddings = new OllamaEmbeddings({
model: "nomic-embed-text",
baseUrl: "http://localhost:11434",
});
const vectorStore = await QdrantVectorStore.fromExistingCollection(embeddings, {
url: "http://localhost:6333",
collectionName: "infra_docs",
});
const llm = new ChatOllama({
model: "llama3.2",
baseUrl: "http://localhost:11434",
});
const retriever = MultiQueryRetriever.fromLLM({
llm,
retriever: vectorStore.asRetriever(3),
verbose: true,
});
const query = "why won't my service start";
const docs = await retriever.invoke(query);
console.log(`\nRetrieved ${docs.length} unique documents:\n`);
docs.forEach((doc, i) => {
console.log(`[${i + 1}] ${doc.metadata.topic} / ${doc.metadata.type}`);
console.log(" " + doc.pageContent.slice(0, 100) + "...");
});
node retrieve-multiquery.js
With verbose: true you can see the generated variants printed to the console. The LLM typically produces phrasings like “service startup failure diagnosis”, “application crashes immediately on start”, and “how to debug a service that won’t launch”. Each variation covers a different angle, so troubleshooting documents that would score low against the vague original query still get retrieved.
When to use it: Short, conversational, or ambiguous queries from end users. Knowledge bases where users describe problems informally and documentation uses technical terminology.
Weakness: Adds one LLM call per retrieval request. For applications with tight latency budgets, this overhead is significant. It also requires a running LLM, not just an embedding model.
Technique 2: Hybrid Search (Dense + BM25)
Semantic search finds documents with related meaning, but it can miss results when exact terms matter. A query for “nginx worker_processes auto” is looking for documentation that uses that exact configuration directive. The dense embedding for the query may not rank the matching document at the top if semantically similar content outweighs exact-term content in the vector space.
Hybrid search runs two rankings in parallel: dense vector similarity for semantic relevance and BM25 keyword scoring for term frequency. It then merges both ranked lists using Reciprocal Rank Fusion (RRF), which scores each document by summing 1 / (k + rank) across the lists, so only rank positions matter and the raw scores never need to be on the same scale. A document that ranks well in both lists ends up near the top of the final result even if it was not first in either.
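To make the fusion step concrete, here is a minimal sketch of RRF over two ranked lists of document IDs. The `fuseRanks` helper and the IDs are made up for illustration, and k = 60 is simply the constant commonly used with RRF, not a tuned value.

```javascript
// Reciprocal Rank Fusion over any number of ranked lists of IDs.
// Each list is ordered best-first; ranks are 1-based.
function fuseRanks(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  // Highest fused score first
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

// Hypothetical rankings: "b" is second in both lists and first in neither
const denseOrder = ["a", "b", "c", "d"];
const bm25Order = ["d", "b", "c", "a"];
const fused = fuseRanks([denseOrder, bm25Order]);
console.log(fused.map(({ id, score }) => `${id}: ${score.toFixed(5)}`));
```

In this toy input, "b" comes out on top because two second places (2/62) beat a first place combined with a fourth place (1/61 + 1/64), while "c", third in both lists, finishes last.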
The BM25 scoring in retrieve-hybrid.js runs over the candidates already returned by Qdrant rather than over the full corpus. Qdrant handles the coarse semantic filter; BM25 re-weights the results by keyword relevance.
Create retrieve-hybrid.js:
import { QdrantVectorStore } from "@langchain/qdrant";
import { OllamaEmbeddings } from "@langchain/ollama";
const embeddings = new OllamaEmbeddings({
model: "nomic-embed-text",
baseUrl: "http://localhost:11434",
});
const vectorStore = await QdrantVectorStore.fromExistingCollection(embeddings, {
url: "http://localhost:6333",
collectionName: "infra_docs",
});
// Split text into lowercase word tokens; underscores stay inside tokens, so proxy_pass is one term
const tokenize = (text) => text.toLowerCase().split(/\W+/).filter(Boolean);
function bm25Score(query, docText, corpus, k1 = 1.5, b = 0.75) {
  const terms = tokenize(query);
  const words = tokenize(docText);
  const corpusTokens = corpus.map(tokenize);
  // Average document length in tokens, used for length normalization
  const avgLen = corpusTokens.reduce((s, d) => s + d.length, 0) / corpusTokens.length;
  return terms.reduce((total, term) => {
    const tf = words.filter((w) => w === term).length;
    if (tf === 0) return total;
    // Document frequency counts whole-token matches, consistent with tf
    const df = corpusTokens.filter((d) => d.includes(term)).length;
    const idf = Math.log((corpusTokens.length - df + 0.5) / (df + 0.5) + 1);
    const norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * words.length) / avgLen));
    return total + idf * norm;
  }, 0);
}
function rrfScore(rank, k = 60) {
return 1 / (k + rank);
}
async function hybridSearch(query, topK = 5) {
// Step 1: Retrieve broad candidates from Qdrant by dense similarity
const denseResults = await vectorStore.similaritySearchWithScore(query, 15);
const corpus = denseResults.map(([doc]) => doc.pageContent);
// Step 2: Score the same candidates by BM25 and sort
const bm25Ranked = denseResults
.map(([doc]) => ({ doc, bm25: bm25Score(query, doc.pageContent, corpus) }))
.sort((a, b) => b.bm25 - a.bm25);
// Step 3: Build rank maps (1-based)
const denseRank = new Map(denseResults.map(([doc], i) => [doc.pageContent, i + 1]));
const bm25Rank = new Map(bm25Ranked.map(({ doc }, i) => [doc.pageContent, i + 1]));
// Step 4: Merge with RRF and return top results
return denseResults
.map(([doc]) => ({
doc,
score: rrfScore(denseRank.get(doc.pageContent)) + rrfScore(bm25Rank.get(doc.pageContent)),
}))
.sort((a, b) => b.score - a.score)
.slice(0, topK);
}
const query = "nginx proxy_pass configuration";
const results = await hybridSearch(query);
console.log(`\nHybrid search results for: "${query}"\n`);
results.forEach(({ doc, score }, i) => {
console.log(`[${i + 1}] RRF: ${score.toFixed(4)} | ${doc.metadata.topic} / ${doc.metadata.type}`);
console.log(" " + doc.pageContent.slice(0, 100) + "...");
});
node retrieve-hybrid.js
Try switching the query to something vague like “configuring a web server” and then to something specific like “proxy_pass upstream block”. The first favors semantic ranking; the second favors BM25 because it contains exact directive names.
When to use it: Queries that mix semantic intent with exact technical terms. Search over documentation that contains command names, configuration directives, error codes, or other identifiers where exact matches matter as much as meaning.
Weakness: BM25 here runs only over the 15 dense candidates, not the full collection. If the dense search misses a relevant document entirely, BM25 cannot rescue it. For very large collections, consider using Qdrant’s native sparse vector support for a true server-side hybrid search.
Technique 3: Cross-Encoder Re-ranking
Dense vector search encodes the query and each document separately, then compares the resulting vectors. The query and the document never interact during scoring. A cross-encoder takes a different approach: it reads the query and a candidate document together as a single input and outputs one relevance score for that pair. Because both sides are processed jointly, the model can match on phrasing, intent, and specificity in ways that separate encoders cannot.
The standard pattern is a two-stage pipeline. Stage one retrieves a broad candidate set from Qdrant quickly using dense search. Stage two passes each candidate through the cross-encoder and re-ranks by the resulting score. You get the speed of approximate nearest-neighbor retrieval and the precision of joint query-document scoring.
The first time this script runs, @xenova/transformers downloads the Xenova/ms-marco-MiniLM-L-6-v2 model from HuggingFace (about 80 MB). Subsequent runs use the local cache.
Create retrieve-rerank.js:
import { QdrantVectorStore } from "@langchain/qdrant";
import { OllamaEmbeddings } from "@langchain/ollama";
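import { AutoTokenizer, AutoModelForSequenceClassification } from "@xenova/transformers";
const embeddings = new OllamaEmbeddings({
  model: "nomic-embed-text",
  baseUrl: "http://localhost:11434",
});
const vectorStore = await QdrantVectorStore.fromExistingCollection(embeddings, {
  url: "http://localhost:6333",
  collectionName: "infra_docs",
});
console.log("Loading cross-encoder (first run downloads ~80 MB)...");
// Load the tokenizer and model directly: the text-classification pipeline
// tokenizes single texts, so (query, passage) pairs go through text_pair instead
const modelId = "Xenova/ms-marco-MiniLM-L-6-v2";
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const reranker = await AutoModelForSequenceClassification.from_pretrained(modelId);
async function retrieveWithReranking(query, candidateK = 9, returnK = 4) {
  // Stage 1: Broad candidate retrieval from Qdrant
  const candidates = await vectorStore.similaritySearch(query, candidateK);
  if (candidates.length === 0) return [];
  // Stage 2: Score each (query, candidate) pair with the cross-encoder
  const inputs = tokenizer(new Array(candidates.length).fill(query), {
    text_pair: candidates.map((doc) => doc.pageContent),
    padding: true,
    truncation: true,
  });
  const { logits } = await reranker(inputs);
  // One raw relevance logit per pair; higher means more relevant
  const scores = Array.from(logits.data);
  // Sort by cross-encoder score, highest first
  return candidates
    .map((doc, i) => ({ doc, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, returnK);
}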
const query = "how do I fix a container that crashes immediately after starting";
const results = await retrieveWithReranking(query);
console.log(`\nRe-ranked results for: "${query}"\n`);
results.forEach(({ doc, score }, i) => {
console.log(`[${i + 1}] Score: ${score.toFixed(4)} | ${doc.metadata.topic} / ${doc.metadata.type}`);
console.log(" " + doc.pageContent.slice(0, 110) + "...");
});
node retrieve-rerank.js
Compare this output to running a plain similaritySearch with the same query. The cross-encoder should promote the troubleshooting type documents over concept type ones because the query is asking about a failure, not a definition. Dense search does not understand that distinction from the vector alone.
When to use it: Production RAG where incorrect retrievals are costly, such as customer-facing chatbots or document QA systems. The two-stage approach gives the best retrieval quality available without a custom-trained model.
Weakness: Adds latency proportional to the number of candidates re-ranked. Re-ranking 9 candidates adds roughly 200-600 ms depending on hardware. Not suitable for sub-100 ms response time requirements.
Technique 4: Parent Document Retrieval
Small chunks produce precise embeddings that match specific queries well. Large chunks give the LLM enough surrounding context to produce a complete, accurate answer. These two goals pull in opposite directions when you maintain a single chunk size.
Parent document retrieval resolves this by maintaining two levels. Small child chunks (150-200 characters) are stored in Qdrant and are used during retrieval because their tight scope produces sharp embeddings. Each child carries a parentId field pointing to its full-size parent. Large parent chunks (600-800 characters) are stored in a separate document store. When retrieval returns a child, the system fetches the parent and passes that to the LLM instead. The child drives the matching. The parent drives the answer.
In this example the parent store is a JSON file, which keeps the code self-contained. In production you would use a database table or an object store keyed by document ID.
First, create setup-parent.js and run it to build the two-level index. This creates a separate Qdrant collection called child_chunks:
import { QdrantVectorStore } from "@langchain/qdrant";
import { OllamaEmbeddings } from "@langchain/ollama";
import { Document } from "@langchain/core/documents";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { writeFile } from "fs/promises";
const embeddings = new OllamaEmbeddings({
model: "nomic-embed-text",
baseUrl: "http://localhost:11434",
});
const parentDocs = [
"Nginx uses an event-driven, non-blocking architecture. The main process reads configuration and manages worker processes. Worker processes handle all client connections. Each worker serves thousands of concurrent connections using the OS event notification mechanism, without requiring one thread per connection. Setting worker_processes to auto creates one worker per available CPU core, which is the recommended default for most deployments.",
"Configuring Nginx as a reverse proxy requires an upstream block defining the backend servers and a location block with proxy_pass referencing that upstream. The upstream block supports multiple servers for load balancing. Add proxy_set_header Host and proxy_set_header X-Real-IP directives to forward the original host and client IP address, which most backend applications need to log requests and enforce access control correctly.",
"Redis Sentinel provides automatic failover without manual operator involvement. You deploy at least three Sentinel processes that each monitor the master and replicas. When the Sentinel quorum agrees the master is unreachable, one Sentinel coordinates the promotion of a replica to master, rewrites the configuration on all Sentinel nodes, and notifies connected clients through a Pub/Sub channel so client libraries reconnect transparently.",
"Kubernetes pods are the smallest deployable units and are ephemeral by design. You do not create pods directly in production. A Deployment defines the desired state, a ReplicaSet tracks the actual state, and the ReplicaSet controller creates or deletes pods to close the gap. When a pod crashes, the ReplicaSet controller immediately creates a replacement. Rolling updates replace pods gradually to avoid downtime.",
];
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 200,
chunkOverlap: 20,
});
const parentStore = {};
const childDocs = [];
for (let parentId = 0; parentId < parentDocs.length; parentId++) {
parentStore[String(parentId)] = parentDocs[parentId];
const children = await splitter.splitText(parentDocs[parentId]);
for (const child of children) {
childDocs.push(
new Document({
pageContent: child,
metadata: { parentId: String(parentId) },
})
);
}
}
await writeFile("parent_store.json", JSON.stringify(parentStore, null, 2));
await QdrantVectorStore.fromDocuments(childDocs, embeddings, {
url: "http://localhost:6333",
collectionName: "child_chunks",
});
console.log(`Indexed ${childDocs.length} child chunks from ${parentDocs.length} parent documents.`);
console.log("Parent store written to parent_store.json");
node setup-parent.js
Now create retrieve-parent.js:
import { QdrantVectorStore } from "@langchain/qdrant";
import { OllamaEmbeddings } from "@langchain/ollama";
import { readFile } from "fs/promises";
const embeddings = new OllamaEmbeddings({
model: "nomic-embed-text",
baseUrl: "http://localhost:11434",
});
const vectorStore = await QdrantVectorStore.fromExistingCollection(embeddings, {
url: "http://localhost:6333",
collectionName: "child_chunks",
});
const parentStore = JSON.parse(await readFile("parent_store.json", "utf8"));
async function retrieveParents(query, returnK = 2) {
// Retrieve matching child chunks (fetch more than needed to ensure enough unique parents)
const children = await vectorStore.similaritySearch(query, returnK * 3);
const seenParentIds = new Set();
const results = [];
for (const child of children) {
const parentId = child.metadata.parentId;
if (!seenParentIds.has(parentId)) {
seenParentIds.add(parentId);
results.push({
parentContent: parentStore[parentId],
matchedChild: child.pageContent,
parentId,
});
}
if (results.length >= returnK) break;
}
return results;
}
const query = "how does Nginx handle concurrent connections";
const results = await retrieveParents(query, 2);
console.log(`\nParent document results for: "${query}"\n`);
results.forEach(({ parentContent, matchedChild, parentId }, i) => {
console.log(`[${i + 1}] Matched via child chunk (parent ${parentId}):`);
console.log(` Child: "${matchedChild.slice(0, 70)}..."`);
console.log(` Parent (${parentContent.length} chars):`);
console.log(" " + parentContent);
console.log();
});
node retrieve-parent.js
Notice the difference in length between the matched child chunk and the returned parent. The child is short enough to produce a focused embedding that matches the query precisely. The parent contains the full paragraph with enough context for the LLM to answer completely without having to make assumptions.
When to use it: Long technical documents, runbooks, and API references where individual sentences or short passages are good search targets but poor answer sources. Any situation where small chunks improve retrieval precision but the LLM produces incomplete answers.
Weakness: Requires maintaining two synchronized data stores. If a parent document is updated, you must delete and re-index both the parent store entry and all of its child chunks in Qdrant. Partial updates create inconsistency silently.
Choosing the Right Technique
Conversational or underspecified queries from end users: Use multi-query retrieval. When you cannot predict how users will phrase their questions and the knowledge base uses technical language, generating query variants improves recall without any changes to the index.
Queries containing specific identifiers, commands, directive names, or error messages: Use hybrid search. When exact term matching matters as much as semantic meaning, BM25 combined with dense retrieval prevents the exact-match failures that pure vector search produces.
High-precision production search where incorrect retrievals are costly: Use cross-encoder re-ranking. The two-stage pipeline, broad dense retrieval followed by joint query-document scoring, gives the best quality available from off-the-shelf retrieval without training a custom model.
Long documents where small chunks lose the context needed for a complete answer: Use parent document retrieval. When you observe that your LLM gives incomplete or slightly-off answers because the retrieved chunk is too short, the two-tier index solves the problem without increasing chunk size for everyone.
These techniques are also composable. A production pipeline might use multi-query retrieval to generate candidates, apply hybrid BM25 weighting across the union, and then re-rank the top 15 with a cross-encoder. Each additional layer adds precision at the cost of latency and complexity.
Common Mistakes
Not deduplicating in multi-query retrieval
MultiQueryRetriever handles deduplication internally. If you implement multi-query manually by running three separate searches and concatenating the results, you must deduplicate before passing to the LLM. Sending the same chunk three times wastes context window tokens without adding any information.
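If you do go the manual route, the deduplication step can be sketched as below. This is a minimal illustration, not how MultiQueryRetriever works internally: the `dedupeResults` helper is hypothetical, it keys on the chunk text, and the sample documents are made up.

```javascript
// Merge result lists from several query variants, keeping the first
// occurrence of each document. Documents are plain objects shaped like
// LangChain's Document ({ pageContent, metadata }).
function dedupeResults(resultLists) {
  const seen = new Set();
  const unique = [];
  for (const results of resultLists) {
    for (const doc of results) {
      // Key on the chunk text; swap in a stable ID if your chunks have one
      const key = doc.pageContent;
      if (seen.has(key)) continue;
      seen.add(key);
      unique.push(doc);
    }
  }
  return unique;
}

// Hypothetical results from three query variants
const variantResults = [
  [{ pageContent: "nginx -t validates the config", metadata: {} }],
  [
    { pageContent: "nginx -t validates the config", metadata: {} },
    { pageContent: "kubectl logs shows container output", metadata: {} },
  ],
  [{ pageContent: "kubectl logs shows container output", metadata: {} }],
];
const uniqueDocs = dedupeResults(variantResults);
console.log(uniqueDocs.length); // 2
```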
Setting the BM25 candidate pool too small
In the hybrid search example, BM25 runs over the candidates Qdrant already returned. If your dense retrieval passes a k of 5 and all 5 results are semantically similar but miss the keyword-heavy document you need, BM25 has nothing to work with. Increase the dense candidate count to 15 or 20 before applying hybrid re-weighting.
Re-ranking the entire collection instead of a bounded candidate set
Running a cross-encoder over 100 candidates is 6-7 times slower than running it over 15. Always retrieve a fixed candidate pool from Qdrant and re-rank only that pool. The candidate count is a latency knob you can tune independently from the final return count.
Updating parent documents without re-indexing children
Parent document retrieval breaks silently when parents and children go out of sync. If you edit a parent document without re-indexing, the child chunks in Qdrant still hold the old text under the same parentId. Retrieval keeps working, but matching is driven by outdated chunks, so the LLM can receive a parent that no longer contains the passage that matched. Treat parent and child re-indexing as a single atomic operation tied to your document update workflow.
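One way to keep the two stores in step is to route every update through a single function. The sketch below assumes a hypothetical `childIndex` adapter with `deleteByParentId()` and `add()` methods that you would back with Qdrant yourself; neither method is a LangChain or Qdrant API, and the in-memory version exists only to show the flow. It is also not a real transaction: a crash halfway through still leaves the stores inconsistent.

```javascript
// Re-index one parent and all of its children as a single step.
// `parentStore` is a plain object; `childIndex` is a hypothetical adapter
// exposing deleteByParentId() and add().
async function reindexParent(parentId, newText, { parentStore, childIndex, splitText }) {
  const children = await splitText(newText);
  // Remove stale children first so no old chunk can match a query
  await childIndex.deleteByParentId(parentId);
  await childIndex.add(
    children.map((pageContent) => ({ pageContent, metadata: { parentId } }))
  );
  // Update the parent last, once the new children are in place
  parentStore[parentId] = newText;
  return children.length;
}

// In-memory stand-in for the child index, for illustration only
function createMemoryChildIndex() {
  let chunks = [];
  return {
    async deleteByParentId(parentId) {
      chunks = chunks.filter((c) => c.metadata.parentId !== parentId);
    },
    async add(newChunks) {
      chunks.push(...newChunks);
    },
    all: () => chunks,
  };
}
```

In the parent document example, `splitText` would be `(text) => splitter.splitText(text)` and `parentStore` the object loaded from parent_store.json, which you would write back to disk after the call.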
Best Practices
Measure before adding complexity. Basic similarity search is fast, cheap, and often good enough. Run a representative set of test queries against your collection, check whether the right documents are being retrieved, and only add a more complex strategy when you can measure a real recall or precision problem.
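A measurement like this does not need a framework. The sketch below computes recall at k for a small hand-written test set, assuming each test case names the topic and type metadata of the document that should come back, as in the sample collection. The `recallAtK` helper, the test cases, and the stub search are all illustrative.

```javascript
// Fraction of test queries whose expected document appears in the top k.
// `search` is any async function (query, k) => array of documents.
async function recallAtK(testCases, search, k = 5) {
  let hits = 0;
  for (const { query, expectedTopic, expectedType } of testCases) {
    const docs = await search(query, k);
    const found = docs.some(
      (d) => d.metadata.topic === expectedTopic && d.metadata.type === expectedType
    );
    if (found) hits += 1;
  }
  return testCases.length === 0 ? 0 : hits / testCases.length;
}

// Hypothetical test set
const testCases = [
  { query: "why won't nginx start", expectedTopic: "nginx", expectedType: "troubleshooting" },
  { query: "pod keeps restarting", expectedTopic: "kubernetes", expectedType: "troubleshooting" },
];
// Stub search that always returns the same document, for illustration only
const stubSearch = async () => [
  { pageContent: "...", metadata: { topic: "nginx", type: "troubleshooting" } },
];
recallAtK(testCases, stubSearch, 3).then((recall) => console.log("recall@3:", recall));
```

Against the real collection you would pass `(q, k) => vectorStore.similaritySearch(q, k)` as the search function, then run the same test set through each retrieval technique and compare the numbers.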
Log the retrieval method alongside results. Add a field like retrieval_method: "rerank" to the documents you pass to the LLM. Over time this lets you correlate which technique produced correct answers versus which ones produced hallucinations, and tune accordingly.
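A minimal sketch of that tagging step follows. The `tagRetrievalMethod` helper is hypothetical, and it returns plain-object copies rather than LangChain Document instances, which is usually fine for logging and prompt assembly.

```javascript
// Return copies of the documents with the retrieval method recorded in metadata.
// The originals are left untouched.
function tagRetrievalMethod(docs, method) {
  return docs.map((doc) => ({
    ...doc,
    metadata: { ...doc.metadata, retrieval_method: method },
  }));
}

// Hypothetical retrieved document
const retrieved = [
  { pageContent: "If Nginx fails to start, run nginx -t", metadata: { topic: "nginx" } },
];
const tagged = tagRetrievalMethod(retrieved, "rerank");
console.log(tagged[0].metadata);
```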
Tune the candidate pool size, not just the final return count. For re-ranking and hybrid search, the ceiling on quality is the quality of the initial candidate set. If you retrieve 10 candidates and re-rank, you can only return the best of those 10. Increasing the candidate count from 10 to 20 is often more impactful than switching to a better re-ranker.
Cache cross-encoder scores for frequent queries. Cross-encoder inference is the most expensive part of the re-ranking pipeline. A simple in-memory LRU cache keyed on the (query, document hash) pair eliminates redundant inference for repeated queries in chatbot or search applications where the same questions come up often.
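Such a cache can be sketched in a few dozen lines. Everything here is illustrative: `LruCache`, `hashText`, and `withScoreCache` are hypothetical names, and the FNV-1a hash is a simple non-cryptographic stand-in that can collide, so a stronger hash such as SHA-256 from node:crypto is a safer choice for real workloads.

```javascript
// Small LRU cache built on Map, which iterates in insertion order.
class LruCache {
  constructor(maxEntries = 1000) {
    this.maxEntries = maxEntries;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert to mark the entry as most recently used
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // The first key in iteration order is the least recently used
      this.map.delete(this.map.keys().next().value);
    }
  }
}

// FNV-1a string hash: a simple stand-in for a document hash
function hashText(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Wrap any async scorer (query, text) => number with the cache
function withScoreCache(scoreFn, cache = new LruCache()) {
  return async (query, text) => {
    const key = `${query}\u0000${hashText(text)}`;
    const cached = cache.get(key);
    if (cached !== undefined) return cached;
    const score = await scoreFn(query, text);
    cache.set(key, score);
    return score;
  };
}
```

The wrapper works with any scorer that takes one query and one document text, so a batched cross-encoder call would need to check the cache per pair and send only the misses to the model.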
Conclusion
Basic vector similarity search is a starting point, not a complete retrieval strategy. Multi-query retrieval improves recall for ambiguous queries by widening the search. Hybrid BM25 search recovers exact-keyword matches that dense encoding misses. Cross-encoder re-ranking raises precision by scoring the query and each document together rather than separately. Parent document retrieval delivers the context the LLM needs without sacrificing the precision that small chunks provide.
Most production RAG systems pick one or two of these techniques based on where the retrieval actually fails. Measure your current pipeline on representative queries first, identify the failure pattern, then apply the technique that targets it.
For the document preparation step that produces chunks before indexing, see the document chunking techniques guide. For Qdrant’s collection management and filter syntax, see the Getting Started with Qdrant guide.