Document Chunking Techniques for RAG Pipelines with Node.js on Ubuntu

When people talk about why their RAG chatbot gives bad answers, they usually blame the LLM or the embedding model. The real culprit is almost always chunking. If your chunks are too large, the embedding averages over too many ideas and becomes a fuzzy match for anything. If they are too small, each chunk loses the surrounding context that makes it useful as an answer. If you split at the wrong boundaries, you cut sentences in half and produce chunks that are semantically broken.

Chunking is the step where you take a raw document and split it into the pieces that will be stored in your vector database. Every chunk becomes one searchable unit. The quality of that split determines what the retrieval step can actually find.

This tutorial covers five chunking techniques used in real RAG pipelines, each suited to different content types. You will see how each technique works, run the code on the same sample document, and compare the output so you can make an informed choice for your own use case.

Why Chunking Quality Matters

To understand why this matters, think about how retrieval works. When a user asks a question, you convert the question into an embedding vector and search for the most similar vectors in your database. Each stored vector represents one chunk. The chunk whose embedding is closest to the question vector gets retrieved.

If your chunk contains three different topics, say, authentication, rate limiting, and error codes all crammed together, the resulting embedding is a blend of all three. It will match queries about authentication reasonably well, rate limiting reasonably well, and error codes reasonably well, but none of them as precisely as a chunk that covers just one topic would. The more focused each chunk is, the sharper the retrieval.
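The blending effect can be illustrated with a toy calculation. This is only a sketch under a strong simplifying assumption: each topic is a made-up three-dimensional vector, and the mixed chunk's embedding is treated as the plain average of its topics. Real embedding models do not work this simply, but the direction of the effect is the same.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical topic vectors, one axis per topic (illustration only).
const auth = [1, 0, 0];
const rateLimit = [0, 1, 0];
const errors = [0, 0, 1];

// Assumption: a chunk mixing all three topics embeds as their average.
const blended = auth.map((_, i) => (auth[i] + rateLimit[i] + errors[i]) / 3);

// A query that is purely about authentication.
const query = [1, 0, 0];

console.log(cosine(query, auth).toFixed(3));    // 1.000 (focused chunk)
console.log(cosine(query, blended).toFixed(3)); // 0.577 (blended chunk)
```

The focused chunk matches the query perfectly, while the blended chunk scores noticeably lower even though it contains the same authentication material.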

At the same time, a chunk needs enough context to be useful when it is retrieved and injected into the LLM prompt. A single sentence like “The default value is 30 seconds.” is meaningless without the surrounding text that explains what the 30 seconds refers to.

The goal of good chunking is to produce units that are focused enough to match specific queries and complete enough to be useful as answers.

Prerequisites

Ubuntu 20.04, 22.04, or 24.04
Node.js 18 or newer (check with node --version)
npm, bundled with Node.js
Basic familiarity with running Node.js scripts

If you need to install Node.js:

curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs

Project Setup

Create a project directory and install the libraries:

mkdir ~/chunking-demo && cd ~/chunking-demo
npm init -y
npm pkg set type=module
npm install @langchain/textsplitters compromise js-tiktoken

@langchain/textsplitters: the standalone text splitter package extracted from LangChain.js. It covers most of the techniques below without pulling in the full LangChain dependency tree.
compromise: a lightweight NLP library, used here to detect sentence boundaries for sentence-based chunking.
js-tiktoken: the JavaScript port of OpenAI’s tokenizer, needed for token-count-aware splitting.

The Sample Document

All five techniques run against the same input. Create this file so you can reproduce the results:

cat > ~/chunking-demo/sample.md << 'EOF'
# Qdrant REST API Overview

Qdrant exposes a REST API on port 6333 by default. All operations (creating collections, inserting points, and searching) are performed through standard HTTP requests. The API follows RESTful conventions and returns JSON responses.

## Collections

A collection is the top-level container in Qdrant. It holds all your vectors and their associated payloads. When you create a collection, you define the vector size and the distance metric. Cosine distance is recommended for text embeddings. Euclidean distance works better for image embeddings or other numerical features where magnitude matters.

You cannot change the vector size of a collection after it is created. If you need a different size, you must delete the collection and recreate it.

## Points and Payloads

A point is the basic unit of data in Qdrant. Each point has three components: a unique ID, a vector, and an optional payload. The ID must be either an unsigned integer or a UUID string. The vector is an array of floats matching the collection's configured size. The payload is a JSON object containing any metadata you want to store alongside the vector such as document title, source URL, creation date, or any other field you want to filter on later.

Payloads are not indexed by default. If you plan to filter search results by a payload field, you should create a payload index for that field. Without an index, Qdrant scans all points during filtering, which is slow at scale.

## Search and Filtering

The search endpoint accepts a query vector and returns the N closest points by distance. You can combine vector similarity search with payload filters in a single request. For example, you can find the 10 most similar vectors but only return those where the source field equals a specific value.

Score thresholds let you exclude results below a minimum similarity score. This is important in RAG pipelines because you do not want to inject irrelevant context into the LLM prompt just because it happens to be the closest match in an otherwise irrelevant result set.
EOF

This document has Markdown headers, multiple paragraphs per section, and both short and long sentences: a realistic mix that will show meaningful differences between techniques.

Technique 1: Character Splitting

The simplest approach. Split the text on one fixed separator and pack the resulting pieces into chunks of up to N characters. Beyond that one separator, it has no awareness of sentences, paragraphs, or document structure.

Create chunk-character.js:

import { readFile } from "fs/promises";
import { CharacterTextSplitter } from "@langchain/textsplitters";

const text = await readFile("sample.md", "utf8");

const splitter = new CharacterTextSplitter({
  separator: "\n\n",
  chunkSize: 300,
  chunkOverlap: 0,
});

const chunks = await splitter.splitText(text);

chunks.forEach((chunk, i) => {
  console.log(`\n--- Chunk ${i + 1} (${chunk.length} chars) ---`);
  console.log(chunk);
});

node chunk-character.js

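CharacterTextSplitter splits on the separator first (\n\n, meaning paragraph breaks), then merges adjacent pieces back together as long as the result stays within chunkSize. It has no fallback separator: a single paragraph longer than chunkSize is kept whole, so that chunk exceeds the limit (the library logs a warning when this happens). The sample document contains paragraphs well over 300 characters, so expect some oversized chunks in the output. chunkOverlap: 0 means no text is shared between adjacent chunks.

When to use it: When your document is already organized into short, self-contained paragraphs and you want the simplest possible setup. Not recommended as a general-purpose choice.

Weakness: It only knows one separator. A paragraph longer than chunkSize cannot be broken down any further, so it comes out as an oversized chunk that may be larger than your embedding model handles well.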
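The split-then-pack idea behind this technique can be sketched in a few lines of plain JavaScript. This is a simplified illustration, not the library's source: the function name splitAndPack is made up for this example, overlap is ignored, and in this sketch a piece longer than chunkSize is simply kept whole.

```javascript
// Split on a single separator, then greedily pack pieces into chunks
// of at most chunkSize characters. Hypothetical helper for illustration.
function splitAndPack(text, separator, chunkSize) {
  const chunks = [];
  let current = "";

  for (const piece of text.split(separator)) {
    if (piece.length === 0) continue;

    const candidate = current ? current + separator + piece : piece;
    if (candidate.length <= chunkSize) {
      // The piece still fits: keep growing the current chunk.
      current = candidate;
    } else {
      // The piece does not fit: close the current chunk and start a new one.
      // The new chunk may exceed chunkSize, because there is no second
      // separator to fall back on.
      if (current) chunks.push(current);
      current = piece;
    }
  }

  if (current) chunks.push(current);
  return chunks;
}

const sample = ["Short one.", "Short two.", "x".repeat(50), "Short three."].join("\n\n");
const packed = splitAndPack(sample, "\n\n", 30);
console.log(packed.map((c) => c.length)); // [ 22, 50, 12 ]
```

The two short paragraphs are merged into one chunk, while the 50-character paragraph comes out larger than the 30-character limit.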

Technique 2: Recursive Character Splitting

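This is the recommended default for most RAG pipelines. Instead of one fixed separator, it works through a priority list; the default in LangChain.js is ["\n\n", "\n", " ", ""]. It splits on the first separator, and any piece that is still too large is split again with the next separator in the list. If double newlines leave a piece that is too large, it falls back to single newlines, then to spaces, and only resorts to raw character splitting as a last resort. The default list has no sentence separator; if you want the splitter to try sentence boundaries before spaces, pass your own list, for example separators: ["\n\n", "\n", ". ", " ", ""].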
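The fallback logic can be sketched as a small recursive function. This is a simplified illustration and not the library's implementation: the name recursiveSplit is hypothetical, overlap is left out, and the separator list is just a parameter you can change.

```javascript
// Try the first separator; any piece that is still too long is split again
// with the remaining separators. Hypothetical helper, no overlap handling.
function recursiveSplit(text, chunkSize, separators = ["\n\n", "\n", " ", ""]) {
  if (text.length === 0) return [];
  if (text.length <= chunkSize) return [text];

  const [separator, ...rest] = separators;

  // Last resort: no usable separator left, cut at fixed character positions.
  if (separator === undefined || separator === "") {
    const slices = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      slices.push(text.slice(i, i + chunkSize));
    }
    return slices;
  }

  const chunks = [];
  let current = "";
  for (const piece of text.split(separator)) {
    if (piece.length === 0) continue;

    const candidate = current ? current + separator + piece : piece;
    if (candidate.length <= chunkSize) {
      current = candidate;
      continue;
    }

    if (current) chunks.push(current);
    current = "";

    if (piece.length <= chunkSize) {
      current = piece;
    } else {
      // Still too long: fall back to the next separator in the list.
      chunks.push(...recursiveSplit(piece, chunkSize, rest));
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const demo = "alpha beta gamma\n\ndelta epsilon zeta eta theta";
console.log(recursiveSplit(demo, 20));
// [ 'alpha beta gamma', 'delta epsilon zeta', 'eta theta' ]
```

The first paragraph fits as a whole, while the second is too long for a 20-character chunk and is broken at spaces rather than mid-word.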

Create chunk-recursive.js:

import { readFile } from "fs/promises";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const text = await readFile("sample.md", "utf8");

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 400,
  chunkOverlap: 60,
});

const chunks = await splitter.splitText(text);

chunks.forEach((chunk, i) => {
  console.log(`\n--- Chunk ${i + 1} (${chunk.length} chars) ---`);
  console.log(chunk);
});

node chunk-recursive.js

chunkOverlap: 60 means that up to 60 characters from the end of each chunk are repeated at the start of the next chunk. The overlap is assembled from whole pieces (words or lines), so it can be shorter than 60 characters. This keeps a key phrase that sits right at a chunk boundary from being split between two separate retrieval units.
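To see what overlap does in isolation, here is a minimal sketch using fixed-width windows over a string. It is an illustration only: the slidingWindows helper is hypothetical and cuts at raw character positions, whereas the splitter above builds its overlap from whole pieces.

```javascript
// Fixed-width windows where each window repeats the last `overlap`
// characters of the previous one. Hypothetical helper for illustration.
function slidingWindows(text, size, overlap) {
  if (overlap >= size) {
    throw new RangeError("overlap must be smaller than size");
  }
  const step = size - overlap;
  const windows = [];
  for (let start = 0; start < text.length; start += step) {
    windows.push(text.slice(start, start + size));
    // Stop once a window has reached the end of the text.
    if (start + size >= text.length) break;
  }
  return windows;
}

console.log(slidingWindows("abcdefghijklmnopqrstuvwxyz", 10, 3));
// [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Each window starts with the final three characters of the one before it, so any three-character phrase appears intact in at least one window.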

When to use it: General-purpose text, plain documentation, exported PDFs, database text fields. If you do not have a strong reason to choose another technique, start here.

Why it beats plain character splitting: Because it respects natural language boundaries as much as possible within the size constraint. You will almost never see a chunk end mid-word.

Technique 3: Markdown-Aware Splitting

For documents that have Markdown headers, MarkdownTextSplitter uses the header hierarchy as natural split points. It treats ## and ### as high-priority separators, so each section stays together rather than being cut mid-paragraph.

Create chunk-markdown.js:

import { readFile } from "fs/promises";
import { MarkdownTextSplitter } from "@langchain/textsplitters";

const text = await readFile("sample.md", "utf8");

const splitter = new MarkdownTextSplitter({
  chunkSize: 600,
  chunkOverlap: 40,
});

const chunks = await splitter.splitText(text);

chunks.forEach((chunk, i) => {
  console.log(`\n--- Chunk ${i + 1} (${chunk.length} chars) ---`);
  console.log(chunk);
});

node chunk-markdown.js

MarkdownTextSplitter is a subclass of RecursiveCharacterTextSplitter with a separator list that prioritises Markdown heading patterns. Sections under a ## heading tend to stay together as a unit rather than being fragmented into multiple chunks. That is exactly what you want when a user asks “how do payload indexes work?”: ideally the whole Points and Payloads section comes back as one retrieval result.

When to use it: Technical documentation, wikis, READMEs, runbooks, any Markdown content where headers define meaningful sections.

Weakness: A section longer than chunkSize still gets split mid-section, and only the first of the resulting chunks carries the heading. Either raise chunkSize or accept the extra splits for very long sections.
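If you want every chunk to remember which section it came from, one option is to cut the document at headings yourself and keep the heading as a separate field. The sketch below is a hand-rolled illustration, not part of @langchain/textsplitters: splitByHeadings is a hypothetical helper that assumes ATX-style headings (# to ######) and does not handle # characters inside fenced code blocks.

```javascript
// Group lines under the most recent Markdown heading.
// Hypothetical helper; assumes ATX headings and no fenced code blocks.
function splitByHeadings(markdown) {
  const sections = [];
  let current = { heading: "", body: [] };

  const hasContent = (section) =>
    section.heading !== "" || section.body.some((line) => line.trim() !== "");

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      // A new heading closes the previous section.
      if (hasContent(current)) sections.push(current);
      current = { heading: match[2].trim(), body: [] };
    } else {
      current.body.push(line);
    }
  }
  if (hasContent(current)) sections.push(current);

  return sections.map((section) => ({
    heading: section.heading,
    text: section.body.join("\n").trim(),
  }));
}

const doc = "# Title\n\nIntro.\n\n## A\n\nBody A.\n\n## B\n\nBody B1.\n\nBody B2.";
console.log(splitByHeadings(doc));
// [
//   { heading: 'Title', text: 'Intro.' },
//   { heading: 'A', text: 'Body A.' },
//   { heading: 'B', text: 'Body B1.\n\nBody B2.' }
// ]
```

Each section's text can then be passed through a size-based splitter, with the heading stored as metadata on every resulting chunk.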

Technique 4: Sentence-Based Chunking

Character count is a rough proxy for meaning. A better boundary for many document types is the sentence. It is the smallest unit in English that carries a complete thought. This technique detects sentence boundaries using compromise, then groups sentences together until a target character length is reached.

Create chunk-sentence.js:

import { readFile } from "fs/promises";
import nlp from "compromise";

const text = await readFile("sample.md", "utf8");

function splitIntoSentences(input) {
  return nlp(input).sentences().out("array").filter((s) => s.trim().length > 0);
}

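function groupSentencesIntoChunks(sentences, targetSize = 400, overlap = 1) {
  const chunks = [];
  let i = 0;

  while (i < sentences.length) {
    let chunk = "";
    let j = i;

    // Add whole sentences while the chunk (including the joining space)
    // stays within targetSize.
    while (
      j < sentences.length &&
      chunk.length + (chunk ? 1 : 0) + sentences[j].length <= targetSize
    ) {
      chunk += (chunk ? " " : "") + sentences[j];
      j++;
    }

    // A single sentence longer than targetSize becomes its own chunk.
    if (chunk.length === 0) {
      chunk = sentences[j];
      j++;
    }

    chunks.push(chunk.trim());

    // Stop once the last sentence has been emitted. Without this check, the
    // overlap step below would add a trailing chunk that only repeats the
    // end of the previous one.
    if (j >= sentences.length) break;

    // Step back `overlap` sentences, but always advance by at least one.
    i = Math.max(i + 1, j - overlap);
  }

  return chunks;
}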

const sentences = splitIntoSentences(text);
const chunks = groupSentencesIntoChunks(sentences, 400, 1);

chunks.forEach((chunk, i) => {
  console.log(`\n--- Chunk ${i + 1} (${chunk.length} chars) ---`);
  console.log(chunk);
});

node chunk-sentence.js

The overlap = 1 parameter means each chunk starts one sentence back from where the previous chunk ended. This gives sentence-level overlap rather than character-level overlap, so adjacent chunks always share complete sentences, not partial strings.

When to use it: Narrative text, news articles, blog posts, support tickets, chat logs, anything where sentence boundaries carry strong semantic meaning and character counts vary significantly between paragraphs.

Weakness: compromise is a heuristic sentence detector. It handles English well but can misfire on abbreviations, code snippets embedded in text, or technical jargon written in unusual ways. It also knows nothing about Markdown: header lines are not English sentences, so do not expect them to survive as clean section markers in the output.

Technique 5: Token-Based Splitting

LLMs think in tokens, not characters. A token is roughly 3–4 characters for English text, but the exact count varies. When you set a context window limit for your LLM, the limit is in tokens. If you size your chunks in characters, you are guessing at token counts. TokenTextSplitter uses the actual tokenizer to count tokens and split precisely.

Create chunk-token.js:

import { readFile } from "fs/promises";
import { TokenTextSplitter } from "@langchain/textsplitters";
import { getEncoding } from "js-tiktoken";

const text = await readFile("sample.md", "utf8");

const splitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 100,
  chunkOverlap: 10,
});

// Same encoding as the splitter, used only to report the token count of each chunk.
const encoder = getEncoding("cl100k_base");

const chunks = await splitter.splitText(text);

chunks.forEach((chunk, i) => {
  const tokenCount = encoder.encode(chunk).length;
  console.log(`\n--- Chunk ${i + 1} (${tokenCount} tokens, ${chunk.length} chars) ---`);
  console.log(chunk);
});

node chunk-token.js

cl100k_base is the encoding used by OpenAI’s GPT-4 and GPT-3.5-turbo models, and it is a reasonable general-purpose tokenizer even when you are not using OpenAI. chunkSize: 100 means each chunk contains at most 100 tokens. With chunkOverlap: 10, the last 10 tokens of each chunk appear at the start of the next.

When to use it: When your retrieval system feeds chunks into a model with a strict token budget, or when you are building something where the downstream LLM charges per token and you need predictable chunk sizes. Also useful when you are mixing content types and character count gives inconsistent results (e.g., code-heavy files where tokens are denser than in prose).
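Once chunk sizes are expressed in tokens, budgeting becomes simple arithmetic. The sketch below shows the calculation; the function name, the parameter names, and the numbers are all hypothetical and should be replaced with the limits of the model you actually use.

```javascript
// How many retrieved chunks fit into the prompt once the fixed parts
// (instructions, question, room for the answer) are subtracted.
// All names and numbers here are illustrative assumptions.
function chunksThatFit({ contextWindow, promptTokens, answerTokens, chunkTokens }) {
  const available = contextWindow - promptTokens - answerTokens;
  if (available <= 0) return 0;
  return Math.floor(available / chunkTokens);
}

const budget = chunksThatFit({
  contextWindow: 4096, // total tokens the model accepts (assumed)
  promptTokens: 300,   // system prompt plus user question (assumed)
  answerTokens: 500,   // space reserved for the generated answer (assumed)
  chunkTokens: 100,    // matches chunkSize: 100 in the splitter
});

console.log(budget); // 32
```

With character-sized chunks the same calculation would only be an estimate, because the number of tokens per chunk would vary.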

Weakness: The cl100k_base tokenizer is optimized for OpenAI models. If you are using a different model (Mistral, LLaMA, Gemma), its tokenizer may count tokens differently, so your chunk token counts will be approximate for those models. Still better than character counting, but not exact.

Choosing the Right Technique

No single technique is universally best. Here is a practical decision guide:

Plain text, mixed content, or exported PDFs: Use RecursiveCharacterTextSplitter. It handles messy input well and respects language structure without requiring special handling.

Markdown documentation, wikis, or runbooks: Use MarkdownTextSplitter. Your sections stay intact and retrieval aligns with how humans think about the document structure.

News articles, blog posts, or narrative prose: Use sentence-based chunking with compromise. Clean sentence boundaries produce more focused embeddings than character splits on this kind of content.

Applications with strict token budgets or OpenAI-compatible models: Use TokenTextSplitter. You get predictable sizes that match what the LLM actually sees.

Quick prototype or baseline test: Use CharacterTextSplitter with paragraph separators. It is the simplest to configure and good enough to verify that the rest of your pipeline works before you invest in tuning the splitter.

Common Mistakes

Using chunk sizes that are too large: A 2000-character chunk might cover three different topics. Its embedding will be a blend of all three and will retrieve as a mediocre match for any of them. Smaller chunks with tighter focus nearly always improve retrieval precision. Start at 400–600 characters and increase only if you are losing important context.

Using chunk sizes that are too small: A one-sentence chunk like “The default is 30 seconds.” is meaningless in isolation. The retrieved chunk gets injected into the LLM prompt and the model cannot answer from it. Chunks should be long enough to carry a complete thought.

Setting overlap to zero: A sentence or key phrase that straddles a chunk boundary gets cut in two, so neither neighboring chunk contains it in full. Some overlap (even just 10–15% of the chunk size) significantly reduces this risk.

Treating text differently at indexing and query time: Your ingestion script and your query script must preprocess text consistently. If your ingestion lowercases text before chunking, your query text should be lowercased too. If you strip Markdown syntax during ingestion, the chunk embeddings will reflect plain text, and queries using Markdown syntax may match poorly.

Not testing retrieval after changing chunking: Chunking changes affect retrieval quality in ways that are not obvious from reading the code. After changing your chunk size or strategy, run a set of representative questions against the re-indexed collection and verify that the retrieved chunks actually contain the expected information.
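A check like this can be automated with a small harness. The sketch below is one possible shape, under clear assumptions: keywordRetrieve is a crude stand-in that ranks chunks by shared words, so that the example runs without a vector database, and the test cases are invented. In a real pipeline you would pass a function that queries your Qdrant collection instead.

```javascript
// Hypothetical stand-in for vector search: ranks chunks by how many
// words of the question they contain (substring match, illustration only).
function keywordRetrieve(chunks, question, topK = 1) {
  const words = question.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return chunks
    .map((text) => {
      const lower = text.toLowerCase();
      return { text, score: words.filter((w) => lower.includes(w)).length };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((result) => result.text);
}

// Fraction of questions for which a retrieved chunk contains the expected text.
function hitRate(chunks, testCases, retrieve, topK = 1) {
  let hits = 0;
  for (const { question, mustContain } of testCases) {
    const retrieved = retrieve(chunks, question, topK);
    if (retrieved.some((text) => text.includes(mustContain))) hits++;
  }
  return hits / testCases.length;
}

const chunks = [
  "Qdrant exposes a REST API on port 6333 by default.",
  "You cannot change the vector size of a collection after it is created.",
  "Payloads are not indexed by default. Create a payload index for fields you filter on.",
];

const testCases = [
  { question: "Which port does the REST API use?", mustContain: "6333" },
  { question: "Can I change the vector size later?", mustContain: "cannot change" },
  { question: "Do I need a payload index for filtering?", mustContain: "payload index" },
];

console.log(hitRate(chunks, testCases, keywordRetrieve)); // 1
```

Running the same test cases before and after a chunking change gives you a number to compare instead of an impression.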

Best Practices

Store chunk metadata alongside the text. When you insert chunks into Qdrant, include at minimum the source filename, chunk index within the document, and the chunking strategy used. This lets you trace which technique produced a given retrieval result and compare strategies objectively.
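A minimal sketch of attaching that metadata is shown below. The helper name and the payload field names (source, chunkIndex, strategy) are assumptions chosen for this example; Qdrant accepts any JSON object as a payload. Vectors and point IDs are left out, since they depend on your embedding step.

```javascript
// Wrap raw chunk strings in records that carry traceability metadata.
// Hypothetical helper; field names are illustrative.
function toChunkRecords(chunks, { source, strategy }) {
  return chunks.map((text, chunkIndex) => ({
    payload: {
      text,
      source,      // file the chunk came from
      chunkIndex,  // position of the chunk within that file
      strategy,    // which splitter and settings produced it
    },
  }));
}

const records = toChunkRecords(
  ["First chunk of text.", "Second chunk of text."],
  { source: "sample.md", strategy: "recursive-400-60" }
);

console.log(records[1].payload);
// {
//   text: 'Second chunk of text.',
//   source: 'sample.md',
//   chunkIndex: 1,
//   strategy: 'recursive-400-60'
// }
```

Encoding the splitter settings in the strategy field makes it easy to filter search results by strategy when comparing two chunking runs side by side.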

Tune chunkSize and chunkOverlap one at a time, but remember that they interact. A large overlap on a small chunk creates significant redundancy. A zero overlap on a large chunk creates boundary gaps. Start with chunkOverlap at roughly 10–15% of chunkSize.
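The interaction between size and overlap is easy to quantify. The following sketch estimates how many chunks a document produces and how much of each chunk is repeated text. It assumes idealized fixed-width windows, so real splitters that respect paragraph and word boundaries will give somewhat different counts; the function name is hypothetical.

```javascript
// Estimate chunk count and redundancy for fixed-width windows.
// Hypothetical helper; real splitters will deviate from this estimate.
function overlapStats(docLength, chunkSize, chunkOverlap) {
  const step = chunkSize - chunkOverlap;
  if (step <= 0) {
    throw new RangeError("chunkOverlap must be smaller than chunkSize");
  }
  const chunkCount =
    docLength <= chunkSize ? 1 : Math.ceil((docLength - chunkOverlap) / step);
  return { chunkCount, overlapRatio: chunkOverlap / chunkSize };
}

// A 10,000-character document under three settings.
console.log(overlapStats(10000, 400, 0));  // { chunkCount: 25, overlapRatio: 0 }
console.log(overlapStats(10000, 400, 60)); // { chunkCount: 30, overlapRatio: 0.15 }
console.log(overlapStats(10000, 100, 60)); // { chunkCount: 249, overlapRatio: 0.6 }
```

The last line shows the redundancy problem: with 60 characters of overlap on 100-character chunks, the document needs roughly two and a half times as many chunks as it would with no overlap.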

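Consider your embedding model’s input length. Every embedding model has a maximum input length measured in tokens: 512 tokens is a common limit for smaller models, while models such as nomic-embed-text accept considerably longer inputs. Text beyond the limit is typically truncated or rejected, and even within the limit a very long chunk dilutes the embedding. Check your embedding model’s documentation for its recommended input range.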

Re-index when you change chunking strategy. Changing chunk size or technique means your stored vectors no longer reflect the same boundaries. Delete the Qdrant collection and re-run ingestion every time you change chunking parameters.

Conclusion

Chunking is the part of a RAG pipeline that most tutorials skip over, yet it has a bigger impact on retrieval quality than almost anything else. The technique you choose should match your content type: recursive splitting for general text, Markdown-aware splitting for structured docs, sentence-based for prose, and token-based when you need precise control over model input size.

The next step is wiring these chunks into a complete RAG pipeline: embedding them into Qdrant, querying by similarity, and feeding the results to a local LLM.