Building an End-to-End RAG Pipeline with Node.js and Ollama on Ubuntu

Most RAG tutorials explain one piece in isolation: how to chunk text, how to search a vector database, how to call an embedding model. What they rarely show is how to connect all of it into something that actually reads your documents and answers questions from them.

This tutorial does exactly that. You will build a working RAG chatbot from scratch in Node.js on Ubuntu. It reads plain text files from disk, splits them into chunks, generates vector embeddings, stores them in Qdrant, retrieves relevant chunks at query time, and feeds them to a local LLM via Ollama to produce a grounded answer. By the end you have an interactive terminal chatbot that you can point at any folder of documents.

The document chunking techniques guide covers splitting strategies in detail. The document retrieval techniques guide covers advanced retrieval patterns such as multi-query and cross-encoder re-ranking. The focus here is on the full pipeline running end to end, so you understand how the stages connect before adding complexity.

How the Pipeline Works

A RAG pipeline runs in two phases.

The ingestion phase runs once when you first set up the system, and again whenever your documents change. It reads source files, splits each document into chunks, converts every chunk into a vector embedding using an embedding model, and stores the vectors alongside the original text in Qdrant. After ingestion, Qdrant is your searchable knowledge base.

The query phase runs on every user question. It converts the question into a vector using the same embedding model, searches Qdrant for the chunks whose vectors are closest to the question vector, assembles the retrieved text into a context block, and sends that block with the question to a local LLM. The LLM reads the context and generates an answer grounded in your documents rather than its training data.

This design scales to large document collections because vector search is fast. Passing thousands of document pages directly into an LLM prompt is not practical. Passing the four most relevant paragraphs is.
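To make "closest vectors" concrete, here is a minimal in-memory sketch of what the search step does conceptually. The three-dimensional vectors and the topK helper are made up for illustration. Real nomic-embed-text embeddings have hundreds of dimensions, and Qdrant uses an index rather than this brute-force scan.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest-neighbour search: score every stored chunk, keep the k best.
function topK(queryVector, points, k) {
  return points
    .map((p) => ({ text: p.text, score: cosineSimilarity(queryVector, p.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy data: hypothetical 3-dimensional "embeddings", not real model output.
const points = [
  { text: "nginx rate limiting", vector: [0.9, 0.1, 0.0] },
  { text: "redis sentinel failover", vector: [0.1, 0.9, 0.1] },
  { text: "postgres connection pooling", vector: [0.0, 0.2, 0.9] },
];

const hits = topK([0.8, 0.2, 0.1], points, 2);
console.log(hits.map((h) => h.text));
```

The query vector points mostly along the first axis, so the Nginx chunk ranks first. The pipeline in this tutorial does the same thing at scale, with the embedding model producing the vectors and Qdrant doing the ranking.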

Prerequisites

Ubuntu 20.04, 22.04, or 24.04
Node.js 18 or newer (node --version to check)
Docker installed and running
Ollama installed

If you need Node.js:

curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs

Start Qdrant with Docker:

docker run -d --name qdrant -p 6333:6333 qdrant/qdrant

Pull the two Ollama models used in this tutorial:

ollama pull nomic-embed-text
ollama pull llama3.2

nomic-embed-text produces the vector embeddings. llama3.2 is the LLM that generates answers. Both run comfortably on a machine with 8 GB RAM.

Project Setup

Create the project directory and install dependencies:

mkdir ~/rag-pipeline && cd ~/rag-pipeline
npm init -y
npm pkg set type=module
npm install @langchain/qdrant @langchain/ollama @langchain/core @langchain/textsplitters

Create a docs/ directory for your source documents:

mkdir docs

Sample Documents

You need real documents to test against. The following three files cover different infrastructure topics, which makes it easy to verify the chatbot retrieves the correct source for each question.

cat > ~/rag-pipeline/docs/nginx.txt << 'EOF'
Nginx is a high-performance web server and reverse proxy. It uses an event-driven, non-blocking architecture that allows a single worker process to handle thousands of concurrent connections without spawning one thread per connection. The worker count is set in nginx.conf using the worker_processes directive, typically auto to match the number of CPU cores.

To configure Nginx as a reverse proxy, define an upstream block pointing to your backend servers and reference it from a location block using proxy_pass. Add proxy_set_header Host and proxy_set_header X-Real-IP directives to forward the original hostname and client IP to the backend application.

Rate limiting in Nginx is configured with limit_req_zone in the http block. Define a shared memory zone with a key, memory size, and allowed request rate, then apply it inside a server or location block using limit_req. A burst value allows short spikes before requests are rejected.
EOF

cat > ~/rag-pipeline/docs/redis.txt << 'EOF'
Redis is an in-memory key-value store used for caching, session storage, pub/sub messaging, and lightweight queuing. Data lives in RAM by default for maximum read and write speed. Persistence can be added through RDB point-in-time snapshots or AOF command logging.

Redis Sentinel provides automatic failover for high availability without Redis Cluster. You deploy at least three Sentinel processes that each monitor the primary and replica nodes. When the Sentinel quorum agrees the primary is unreachable, one Sentinel coordinates promoting a replica to primary and notifies connected clients of the new address.

Redis keyspace expiry deletes keys automatically after a TTL expires. Use the EXPIRE command to set a time-to-live on any key. Keys without a TTL persist indefinitely. Expiry is the standard way to implement cache invalidation in Redis without manual cleanup.
EOF

cat > ~/rag-pipeline/docs/postgresql.txt << 'EOF'
PostgreSQL is a relational database with strong ACID guarantees and support for advanced data types including JSONB, arrays, range types, and user-defined types. JSONB stores JSON in a binary format that supports GIN indexing on nested fields, making it practical for semi-structured data without a separate document store.

Connection pooling is essential for PostgreSQL under high concurrency. PostgreSQL spawns one OS process per connection, which is expensive. PgBouncer sits between the application and PostgreSQL, accepting many application connections and multiplexing them over a smaller pool of real server connections. Transaction-mode pooling is the most efficient option for stateless applications.

PostgreSQL has built-in full-text search using the tsvector and tsquery types. A GIN index on a tsvector column makes text search queries fast. This is a reasonable alternative to Elasticsearch for applications that already rely on PostgreSQL and do not need distributed search or analytics.
EOF

Step 1: Build the Ingestion Pipeline

Create ingest.js. This script reads all .txt and .md files from docs/, splits each document into overlapping chunks, generates embeddings, and stores everything in Qdrant:

import { readdir, readFile } from "fs/promises";
import { join } from "path";
import { QdrantVectorStore } from "@langchain/qdrant";
import { OllamaEmbeddings } from "@langchain/ollama";
import { Document } from "@langchain/core/documents";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const COLLECTION = "my_docs";
const DOCS_DIR = "./docs";

const embeddings = new OllamaEmbeddings({
  model: "nomic-embed-text",
  baseUrl: "http://localhost:11434",
});

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 400,
  chunkOverlap: 50,
});

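// Read every .txt and .md file in `dir` and wrap it in a LangChain Document.
async function loadDocs(dir) {
  const files = await readdir(dir);
  const docs = [];
  for (const file of files) {
    // Skip anything that is not plain text or Markdown.
    if (!file.endsWith(".txt") && !file.endsWith(".md")) continue;
    const content = await readFile(join(dir, file), "utf8");
    // Keep the filename in metadata so each chunk can be traced to its source.
    docs.push(new Document({ pageContent: content, metadata: { source: file } }));
  }
  return docs;
}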

const raw = await loadDocs(DOCS_DIR);
const chunks = await splitter.splitDocuments(raw);

console.log(`Loaded ${raw.length} files, produced ${chunks.length} chunks.`);

await QdrantVectorStore.fromDocuments(chunks, embeddings, {
  url: "http://localhost:6333",
  collectionName: COLLECTION,
});

console.log("Indexing complete. Collection:", COLLECTION);

Run the ingestion:

node ingest.js

Expected output:

Loaded 3 files, produced 10 chunks.
Indexing complete. Collection: my_docs

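Your chunk count may differ if you edit the sample files or change the splitter settings. splitDocuments preserves the source metadata on every chunk it produces, so each Qdrant point knows which file it came from. fromDocuments creates the collection if it does not exist yet and indexes all chunks in a single call. chunkSize and chunkOverlap are measured in characters. chunkOverlap: 50 lets up to 50 characters from the end of one chunk repeat at the start of the next, so a phrase that straddles a chunk boundary is more likely to appear intact in at least one chunk.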
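The effect of overlap is easier to see on a tiny example. The sketch below is a simplified fixed-size chunker written only for illustration. chunkWithOverlap is a hypothetical helper, not the algorithm RecursiveCharacterTextSplitter uses, which prefers to split on paragraph and sentence separators before falling back to raw character counts.

```javascript
// Simplified fixed-size chunker with overlap (illustrative only, not the LangChain algorithm).
function chunkWithOverlap(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap; // how far the window advances each time
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk already reached the end
  }
  return chunks;
}

const sample = "abcdefghijklmnopqrstuvwxyz";
const parts = chunkWithOverlap(sample, 10, 3);
console.log(parts);
// [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Each chunk starts with the last three characters of the previous one. In the real pipeline the same idea applies with 400-character chunks and a 50-character overlap.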

Step 2: Build the Query Pipeline

Create query.js. This module handles the retrieval and generation logic for a single question:

import { QdrantVectorStore } from "@langchain/qdrant";
import { OllamaEmbeddings, ChatOllama } from "@langchain/ollama";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";

const COLLECTION = "my_docs";
const TOP_K = 4;
const MIN_SCORE = 0.3;

const embeddings = new OllamaEmbeddings({
  model: "nomic-embed-text",
  baseUrl: "http://localhost:11434",
});

const vectorStore = await QdrantVectorStore.fromExistingCollection(embeddings, {
  url: "http://localhost:6333",
  collectionName: COLLECTION,
});

const llm = new ChatOllama({
  model: "llama3.2",
  baseUrl: "http://localhost:11434",
  temperature: 0.1,
});

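export async function ask(question) {
  // Retrieve the TOP_K nearest chunks as [document, score] pairs.
  const results = await vectorStore.similaritySearchWithScore(question, TOP_K);
  // Drop weak matches so the LLM is not handed loosely related text.
  const relevant = results.filter(([, score]) => score >= MIN_SCORE);

  // Nothing passed the threshold: answer directly without calling the LLM.
  if (relevant.length === 0) {
    return "No relevant information found in the knowledge base for that question.";
  }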

  const context = relevant
    .map(([doc], i) => `[${i + 1}] Source: ${doc.metadata.source}\n${doc.pageContent}`)
    .join("\n\n---\n\n");

  const systemPrompt = `You are a helpful assistant. Answer the user's question based only on the context below.
If the answer is not in the context, say you do not have that information. Do not guess or use outside knowledge.

Context:
${context}`;

  const response = await llm.invoke([
    new SystemMessage(systemPrompt),
    new HumanMessage(question),
  ]);

  return response.content;
}

Several design choices here are worth understanding.

similaritySearchWithScore returns an array of [document, score] pairs. In this setup the score is cosine similarity, where higher means more similar. Cosine similarity can in principle range from -1 to 1, but for text embeddings it usually falls between 0 and 1. Filtering with MIN_SCORE = 0.3 discards chunks that have low relevance to the question. Without this filter, the LLM always receives TOP_K chunks even when none of them are meaningful matches, which leads to answers based on loosely related content.

temperature: 0.1 keeps the model close to the facts in the retrieved context. A high temperature encourages creative variation in language, which in a factual QA system means the model is more likely to paraphrase beyond what the context says. For question answering, keep the temperature low.

The system prompt instruction to answer only from the context is the most important part of the whole pipeline. Without it, the LLM mixes retrieved facts with its training data in ways that are hard to detect. With it, questions that fall outside your documents are far more likely to produce a clear “I do not have that information” than a confident hallucination. Small local models do not follow the instruction perfectly, so treat it as a strong guardrail rather than a guarantee.

Step 3: Add an Interactive CLI

Create chat.js. This reads user input from the terminal in a loop and calls the query pipeline on each question:

import { createInterface } from "readline";
import { ask } from "./query.js";

const rl = createInterface({ input: process.stdin, output: process.stdout });

function prompt(text) {
  return new Promise((resolve) => rl.question(text, resolve));
}

console.log("RAG chatbot ready. Type 'exit' to quit.\n");

while (true) {
  const question = (await prompt("You: ")).trim();
  if (!question) continue;
  if (question.toLowerCase() === "exit") break;

  const answer = await ask(question);
  console.log(`\nAssistant: ${answer}\n`);
}

rl.close();

Start the chatbot:

node chat.js

Try these questions to verify each document is reachable:

You: How does Nginx handle many concurrent connections?
You: How do I set up Redis for high availability?
You: What is the downside of one process per connection in PostgreSQL?
You: How do I configure rate limiting in Nginx?
You: What is GIN indexing used for?

For each question the chatbot retrieves chunks from the relevant document and produces an answer that matches the text you wrote. Ask something outside the three documents and it should reply with a clear statement that the information is not available.

Re-indexing When Documents Change

The ingestion script keeps no record of what it has already indexed. Every run of node ingest.js re-embeds every file in docs/ and appends the resulting chunks to the collection. Re-running it after adding or changing a file therefore leaves the earlier chunks in Qdrant alongside the new ones, which creates duplicate and stale content. Clean re-indexing requires deleting the collection first:

curl -X DELETE http://localhost:6333/collections/my_docs
node ingest.js

For production systems that update frequently, build this delete-and-reindex step into your document update workflow rather than running it manually.
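One way to fold the delete step into a Node.js workflow is to issue the same DELETE request from code before ingestion runs. This is a minimal sketch: resetCollection is a hypothetical helper, the fetchImpl parameter exists only so the function can be exercised without a running server, and treating HTTP 404 as "already gone" is an assumption about how you want a missing collection handled. It relies on the global fetch available in Node.js 18 and newer.

```javascript
// Delete a Qdrant collection over its REST API, mirroring the curl command above.
// fetchImpl defaults to the global fetch (Node.js 18+); pass a stub to test offline.
async function resetCollection(baseUrl, collectionName, fetchImpl = fetch) {
  const url = `${baseUrl}/collections/${encodeURIComponent(collectionName)}`;
  const res = await fetchImpl(url, { method: "DELETE" });
  // Assumption: a 404 means the collection is already absent, which is fine here.
  if (!res.ok && res.status !== 404) {
    throw new Error(`Failed to delete ${collectionName}: HTTP ${res.status}`);
  }
  return url;
}

// Usage inside an ES module such as ingest.js, before calling fromDocuments:
// await resetCollection("http://localhost:6333", "my_docs");
```

Deleting and rebuilding is the simplest approach for a small corpus. For large collections a full rebuild on every change becomes slow, and a more selective update strategy is worth considering.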

Common Mistakes

Not filtering by similarity score. Dropping the MIN_SCORE filter means the system always passes TOP_K chunks to the LLM, even when none of them are relevant. The LLM generates an answer anyway, drawing from loosely related content and presenting it confidently. The result looks correct but is unreliable. Always set a score threshold and return a clear “not found” message when nothing passes it.

Using a high temperature for factual QA. A low value such as temperature: 0.1 is a sensible setting for document-grounded answers. The higher the temperature, the more the model tends to diverge from the context it was given. For a creative writing assistant built on RAG, higher temperature is appropriate. For an internal knowledge base chatbot, it is not.

Passing too many chunks to the LLM. Retrieving TOP_K = 20 and injecting all 20 chunks into every prompt sounds thorough but causes two problems. First, local models run with a limited context window. Ollama applies a default context length (the num_ctx option) that is often only a few thousand tokens unless you raise it, and input that exceeds the window is typically truncated, so part of the context can be lost without an obvious error. Second, and more importantly, a large context forces the model to blend information from many sources, which increases the chance of conflation. Four to six targeted chunks usually outperform twenty loosely targeted ones.

Re-indexing with a different chunk size without deleting the old collection. If you run ingest.js after changing chunkSize, the new chunks are appended to the existing collection. The collection now contains chunks at two different sizes from the same source files. Retrieval becomes inconsistent. Delete the collection and start fresh whenever you change ingestion parameters.
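For the chunk-count mistake above, one possible guard is to cap the retrieved text at a rough token budget before building the prompt. The sketch below is illustrative: fitToBudget is a hypothetical helper, and the figure of about four characters per token is a common rule of thumb for English text, not an exact count for any particular model.

```javascript
// Keep the highest-ranked chunks that fit within a rough token budget.
// Assumption: ~4 characters per token, an approximation rather than a real tokenizer.
function fitToBudget(chunkTexts, maxTokens, charsPerToken = 4) {
  const kept = [];
  let used = 0;
  for (const text of chunkTexts) {
    const cost = Math.ceil(text.length / charsPerToken);
    if (used + cost > maxTokens) break; // stop at the first chunk that would overflow
    kept.push(text);
    used += cost;
  }
  return kept;
}

// Three 400-character chunks cost about 100 tokens each under this estimate.
const ranked = ["a".repeat(400), "b".repeat(400), "c".repeat(400)];
const selected = fitToBudget(ranked, 250);
console.log(selected.length); // 2
```

Because the input is already ordered by relevance, stopping at the first overflow keeps the best matches and drops the weakest ones.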

Best Practices

Store the chunk index alongside the source filename. The current ingestion script saves source in metadata. Add a chunkIndex field too, so when the same document produces multiple retrieved chunks you can see which paragraphs matched. This is essential when debugging why a question returned the wrong answer.
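A minimal sketch of adding that field, assuming the chunks come back from splitDocuments in document order so that a simple per-file counter reflects each chunk's position. addChunkIndex is a hypothetical helper. It works on any objects shaped like LangChain Documents and modifies them in place.

```javascript
// Add a per-source chunkIndex to each chunk's metadata (mutates the chunks in place).
function addChunkIndex(chunks) {
  const counters = new Map(); // source filename -> next index to assign
  for (const chunk of chunks) {
    const source = chunk.metadata.source;
    const chunkIndex = counters.get(source) ?? 0;
    chunk.metadata.chunkIndex = chunkIndex;
    counters.set(source, chunkIndex + 1);
  }
  return chunks;
}

// Stand-in chunks for illustration; in ingest.js you would pass the real `chunks` array.
const demoChunks = addChunkIndex([
  { pageContent: "first nginx chunk", metadata: { source: "nginx.txt" } },
  { pageContent: "second nginx chunk", metadata: { source: "nginx.txt" } },
  { pageContent: "first redis chunk", metadata: { source: "redis.txt" } },
]);
console.log(demoChunks.map((c) => `${c.metadata.source}#${c.metadata.chunkIndex}`));
// [ 'nginx.txt#0', 'nginx.txt#1', 'redis.txt#0' ]
```

In ingest.js this would run between splitDocuments and fromDocuments, so the index is stored in Qdrant with the rest of the metadata.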

Test retrieval and generation separately. Before worrying about LLM answer quality, confirm that the right chunks are actually being retrieved. Temporarily add a console.log in query.js that prints relevant.map(([doc]) => doc.metadata.source) before the LLM call. If the wrong document is retrieved, no prompt engineering fixes that.
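A slightly richer version of that debug output shows the score and whether each chunk passed the threshold. This is a sketch: describeRetrieval is a hypothetical helper that expects the same [document, score] pairs that similaritySearchWithScore returns, and the sample data is invented.

```javascript
// Summarise retrieval results without calling the LLM.
function describeRetrieval(results, minScore) {
  return results.map(([doc, score]) => {
    const status = score >= minScore ? "kept" : "dropped";
    return `${score.toFixed(3)}  ${status.padEnd(7)}  ${doc.metadata.source}`;
  });
}

// Invented stand-in data; in query.js you would pass the real `results` array.
const fakeResults = [
  [{ pageContent: "sample chunk text", metadata: { source: "nginx.txt" } }, 0.712],
  [{ pageContent: "sample chunk text", metadata: { source: "redis.txt" } }, 0.254],
];

const report = describeRetrieval(fakeResults, 0.3);
console.log(report.join("\n"));
```

Printing dropped chunks as well as kept ones makes it obvious when the right document was found but filtered out by a threshold that is too strict.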

Version your ingestion configuration. Record the chunkSize, chunkOverlap, and embedding model name somewhere, either in a config file or as part of the Qdrant collection name, for example my_docs_v2. When you tune these parameters, re-index under a new collection name instead of replacing the existing one. This lets you compare retrieval quality between versions before switching.
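As an alternative to a hand-maintained suffix such as v2, the collection name can be derived from the settings themselves. The naming scheme and the collectionNameFor helper below are hypothetical, offered as one possible convention.

```javascript
// Build a collection name that encodes the ingestion settings.
function collectionNameFor(base, { embeddingModel, chunkSize, chunkOverlap }) {
  // Replace characters such as "-" or ":" so the name stays simple and URL-safe.
  const model = embeddingModel.replace(/[^a-zA-Z0-9]+/g, "_");
  return `${base}_${model}_cs${chunkSize}_co${chunkOverlap}`;
}

const ingestConfig = { embeddingModel: "nomic-embed-text", chunkSize: 400, chunkOverlap: 50 };
console.log(collectionNameFor("my_docs", ingestConfig));
// my_docs_nomic_embed_text_cs400_co50
```

If ingest.js and query.js both compute the name from one shared config object, they cannot silently drift onto different collections.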

Calibrate MIN_SCORE for your embedding model. As a rough starting point, strong matches with nomic-embed-text tend to score between 0.5 and 0.9 and irrelevant matches below 0.3, but the actual ranges depend on your documents and questions. Other embedding models use different score distributions. After indexing your documents, run a set of representative questions, including some your documents cannot answer, and inspect the scores returned by similaritySearchWithScore to set a threshold that reflects the real score distribution for your data.
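Once you have collected scores by hand, a simple heuristic is to place the threshold halfway between the weakest good match and the strongest bad one. The sketch below shows only that heuristic. suggestMinScore is a hypothetical helper and the numbers are invented, not measurements from nomic-embed-text.

```javascript
// Suggest a MIN_SCORE from scores gathered manually.
// relevantScores: top scores for questions your documents can answer.
// irrelevantScores: top scores for questions they cannot answer.
function suggestMinScore(relevantScores, irrelevantScores) {
  const lowestRelevant = Math.min(...relevantScores);
  const highestIrrelevant = Math.max(...irrelevantScores);
  if (highestIrrelevant >= lowestRelevant) {
    return null; // the groups overlap, so no single threshold separates them
  }
  return (lowestRelevant + highestIrrelevant) / 2;
}

// Invented example scores for illustration only.
const suggestion = suggestMinScore([0.62, 0.71, 0.58], [0.21, 0.33]);
console.log(suggestion); // roughly 0.455
```

A null result is itself useful information: it means score filtering alone cannot separate answerable from unanswerable questions for your data, and the chunking or retrieval strategy may need attention first.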

Conclusion

You have built a complete RAG pipeline: documents are loaded from disk, split into chunks with overlap, embedded and indexed into Qdrant, retrieved by vector similarity at query time, and assembled into a grounded context that the LLM uses to answer questions accurately.

The three-file structure, ingest.js, query.js, and chat.js, keeps ingestion and querying separate, which is the right separation for production systems where documents update independently of user sessions.

From here, the retrieval step can be upgraded with the advanced techniques covered in the document retrieval techniques guide: cross-encoder re-ranking for higher answer precision, parent document retrieval to give the LLM richer context, or hybrid BM25 search for queries that include exact technical terms. The ingestion side can be extended with PDF parsers, database connectors, or web scrapers. The core pattern stays the same: ingest, retrieve, augment, generate.