Effective Chunking Strategies for RAG Applications (Part 1)

When building a RAG (Retrieval-Augmented Generation) system, chunking text into manageable segments is a crucial step. Chunking not only keeps content well-organized but also improves the relevance and efficiency of search results. While many chunking techniques exist, this post, the first in a series, focuses on basic strategies implemented with Langchain and Llama-Index.

Chunking Considerations

Before diving into the methods, it's essential to consider:

Chunk Size: Each chunk should strike a balance between retaining enough context for meaningful analysis and avoiding excessively large chunks that dilute focus. Smaller chunks (e.g., 256 to 512 tokens) suit detailed, granular tasks, whereas larger chunks may be better for understanding broader themes.

Chunk Overlap: An overlap of 100-200 tokens is generally effective. Overlap maintains continuity and context between chunks, ensuring that segmentation does not disrupt the flow and coherence of the text.

Model Compatibility: The chunk size should align with the processing capabilities of the underlying language models. Some models handle larger chunks effectively, while others are optimized for shorter, sentence-level inputs. Ensure your chunk size is compatible with the model's requirements to optimize performance.

Task Specificity: The nature of your task significantly impacts the optimal chunking strategy. For tasks involving precise information retrieval, smaller, more focused chunks can enhance retrieval accuracy. Conversely, tasks requiring complex reasoning or broader context might benefit from larger chunks that capture more comprehensive information.

System Constraints: If the chunked content needs to be processed by another system with token limitations or other constraints, you must adjust chunk sizes to fit within those boundaries. Ensure that chunks never exceed the maximum token limits of any integrated systems or APIs; the sketch below shows one quick way to check.
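
For example, if a downstream model caps input at a fixed token count, you can verify each chunk directly before indexing. A minimal sketch, assuming the tiktoken library and its cl100k_base encoding (swap in whatever tokenizer matches your model):

# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by many recent OpenAI models (an assumption;
# use the tokenizer that matches your own model)
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(chunk: str) -> int:
    # Number of tokens this chunk will consume in the model's context window
    return len(encoding.encode(chunk))

chunk = "Chunking text into manageable segments is a crucial step."
print(count_tokens(chunk))  # confirm this stays below your system's token limit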

The Problem We Want to Solve

In a RAG application, retrieving relevant information from a vast amount of data efficiently is paramount. Incorrect chunking can result in either losing important context or including too much noise, leading to poor search results. The goal is to find a chunking strategy that balances precision with context retention, optimizing both the embedding process and retrieval quality.

Chunking Methods

Character Splitting

This method involves splitting text at fixed character intervals, possibly with some overlap, to ensure context is maintained across chunks. This is a simple yet effective approach for uniformly structured content.

Using Langchain

text = """
Character splitting is the most basic form of splitting up your text.
It is the process of simply dividing your text into N-character sized chunks regardless of their content or form.
"""

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=10, separator=' ', strip_whitespace=False)
documents = text_splitter.create_documents([text])
print(documents)
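
Using Llama-index

Llama-Index does not expose this exact splitter directly, but it can wrap any Langchain splitter through LangchainNodeParser, the same pattern used for recursive splitting below. A minimal sketch, assuming your documents live in a data directory:

from langchain.text_splitter import CharacterTextSplitter
from llama_index.core.node_parser import LangchainNodeParser
from llama_index.core import SimpleDirectoryReader

# Load every document in the "data" directory (assumed to exist)
reader = SimpleDirectoryReader("data")
documents = reader.load_data()

# Wrap the Langchain splitter so it produces Llama-Index nodes
parser = LangchainNodeParser(CharacterTextSplitter(chunk_size=35, chunk_overlap=10, separator=' '))
nodes = parser.get_nodes_from_documents(documents)
print(nodes)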

Recursive Character Text Splitting

Recursive chunking breaks down text hierarchically, using different separators, to create contextually relevant chunks.

Using Langchain

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Raw string so the separator examples ['\n\n', '\n', ' ', ''] print literally
text = r"""
This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ['\n\n', '\n', ' ', '']. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.
"""

# Try paragraph breaks first, then sentences, then words
text_splitter = RecursiveCharacterTextSplitter(chunk_size=450,
                                               chunk_overlap=50)
documents = text_splitter.create_documents([text])
print(documents)

Using Llama-index

from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index.core.node_parser import LangchainNodeParser
from llama_index.core import SimpleDirectoryReader

# Load every document in the "data" directory
reader = SimpleDirectoryReader("data")
documents = reader.load_data()

# Wrap the Langchain splitter so it produces Llama-Index nodes
parser = LangchainNodeParser(RecursiveCharacterTextSplitter())
nodes = parser.get_nodes_from_documents(documents)
print(nodes)

Sentence Splitting

A sentence splitter segments text at sentence boundaries, keeping complete sentences intact within each chunk, which facilitates more precise text analysis and processing.

Using Llama-index


# text = """
# This tool enhances tasks like information retrieval and text generation by treating each sentence as a distinct unit, ensuring context is maintained and understood correctly.
# """
# with open('data/text.txt', 'w') as f:
#     f.write(text)

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader("data")
documents = reader.load_data()

# Chunk up to 1024 tokens at a time, breaking at sentence boundaries,
# with a 20-token overlap between consecutive chunks
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)
print(nodes)
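
Using Langchain

Langchain offers sentence-aware splitting through wrappers around NLP libraries such as NLTK. A minimal sketch, assuming the nltk package and its punkt sentence tokenizer are installed:

# pip install nltk
import nltk
nltk.download("punkt")  # sentence tokenizer data used by the splitter

from langchain.text_splitter import NLTKTextSplitter

# Reuse the sample file written above
with open("data/text.txt") as f:
    text = f.read()

splitter = NLTKTextSplitter(chunk_size=1024, chunk_overlap=20)
documents = splitter.create_documents([text])
print(documents)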

Document Specific Splitting

A "structure-aware" chunker divides text based on its inherent structure, such as headings, lists, or sections, to preserve the content’s logical organization. This method ensures that chunks retain their meaningful context and coherence, which is particularly useful for structured documents like reports or manuals.

Using Langchain

with open("README.md") as f:
    markdown_text = f.read()

from langchain.text_splitter import MarkdownTextSplitter

splitter = MarkdownTextSplitter(chunk_size = 40, 
                                chunk_overlap=0)

documents = splitter.create_documents([markdown_text])

print(documents)

Python Code Splitter

python_text = """
import numpy as np
def mean_squared_error(y_true, y_pred):
    # Convert inputs to numpy arrays
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    # Compute the squared differences
    squared_differences = (y_true - y_pred) ** 2

    # Compute the mean of the squared differences
    mse = np.mean(squared_differences)

    return mse
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mse = mean_squared_error(y_true, y_pred)
print(f"Mean Squared Error: {mse}")
"""


from langchain.text_splitter import PythonCodeTextSplitter

# Split along Python syntax boundaries (classes and functions first, then lines)
python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)
documents = python_splitter.create_documents([python_text])

print(documents)

JavaScript Code Splitter

from langchain.text_splitter import RecursiveCharacterTextSplitter, Language
javascript_text = """
function dotProduct(vectorA, vectorB) {
    if (vectorA.length !== vectorB.length) {
        throw new Error('Vectors must be of the same length');
    }

    return vectorA.reduce((sum, currentValue, index) => {
        return sum + currentValue * vectorB[index];
    }, 0);
}

// Example usage:
const vectorA = [1, 2, 3];
const vectorB = [4, 5, 6];

const result = dotProduct(vectorA, vectorB);
console.log(`Dot Product: ${result}`);
"""
# Use the JavaScript-aware separators (functions, blocks) built into the splitter
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=65, chunk_overlap=0
)
documents = js_splitter.create_documents([javascript_text])

print(documents)
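
The Language enum is not limited to JavaScript; you can list every language the splitter ships separators for:

from langchain.text_splitter import Language

# Print all languages with built-in splitting rules
print([e.value for e in Language])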

Using Llama-index

MarkdownNodeParser

from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core import SimpleDirectoryReader

# Load only the Markdown files from the "data" directory
reader = SimpleDirectoryReader(input_dir="data",
                               required_exts=[".md"])
markdown_docs = reader.load_data()

# Split into nodes along Markdown header boundaries
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(markdown_docs)
print(nodes)

JSONNodeParser

from llama_index.core.node_parser import JSONNodeParser
from llama_index.core import SimpleDirectoryReader

# Load only the JSON files from the "data" directory
reader = SimpleDirectoryReader(input_dir="data",
                               required_exts=[".json"])
json_docs = reader.load_data()

# Split into nodes following the JSON structure
parser = JSONNodeParser()
nodes = parser.get_nodes_from_documents(json_docs)
print(nodes)

HTMLNodeParser

from llama_index.core.node_parser import HTMLNodeParser
from llama_index.core import SimpleDirectoryReader

# Load the HTML files from the "data" directory
reader = SimpleDirectoryReader(input_dir="data",
                               required_exts=[".html"])
html_docs = reader.load_data()

# Extract one node per occurrence of the listed HTML tags
tags = ["p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "b", "i", "u", "section"]
parser = HTMLNodeParser(tags=tags)
nodes = parser.get_nodes_from_documents(html_docs)
print(nodes)

Semantic Chunking

Semantic chunking aims to group text into chunks based on semantic meaning rather than fixed size or structure. This method uses embeddings to assess the similarity between chunks, ensuring that semantically similar content remains together.

Using Llama-index

In LlamaIndex, the SemanticSplitterNodeParser class implements this by adaptively selecting breakpoints based on embedding similarity, with configurable parameters such as buffer_size (the number of sentences grouped together when evaluating similarity), breakpoint_percentile_threshold (the dissimilarity percentile above which a split is made), and embed_model (the embedding model used).


# pip install llama-index-embeddings-huggingface llama-index-embeddings-instructor

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Split adaptively wherever similarity between adjacent sentence groups
# falls below the 95th-percentile breakpoint
splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model
)

# Fixed-size sentence splitter, kept as a baseline for comparison
base_splitter = SentenceSplitter(chunk_size=512)

documents = SimpleDirectoryReader(input_files=["data/text.txt"]).load_data()
nodes = splitter.get_nodes_from_documents(documents)
base_nodes = base_splitter.get_nodes_from_documents(documents)

print(nodes)

Using Langchain

Similarly, Langchain's SemanticChunker compares the embeddings of adjacent sentences and starts a new chunk wherever their difference exceeds a specified threshold, maintaining semantic coherence within chunks.


# pip install langchain_experimental fastembed langchain_community

from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain.text_splitter import RecursiveCharacterTextSplitter

embed_model = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")

# Reuse the sample file written earlier
with open("data/text.txt") as f:
    text = f.read()

# First pass: coarse fixed-size chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False
)
documents = text_splitter.create_documents([text])

# Second pass: re-chunk along semantic breakpoints (percentile threshold)
semantic_chunker = SemanticChunker(embed_model, breakpoint_threshold_type="percentile")
semantic_chunks = semantic_chunker.create_documents([d.page_content for d in documents])

print(semantic_chunks)

Choosing the Best Chunking Method

Selecting the right chunking strategy depends on your application's requirements and constraints. For simple, structured content, character splitting or recursive chunking may suffice. For more complex documents, document-specific or semantic chunking might be necessary to preserve context and meaning. Consider model compatibility, task specificity, and system constraints to ensure the optimal chunking method for your needs.
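
As a rough starting point, these defaults can even be encoded in code. The helper below is a hypothetical sketch, mapping content types to the splitters covered in this post; the chunk sizes are placeholders to tune for your own models and constraints:

from langchain.text_splitter import (
    CharacterTextSplitter,
    MarkdownTextSplitter,
    RecursiveCharacterTextSplitter,
)

# Hypothetical helper: pick a reasonable default splitter per content type
def default_splitter(content_type: str):
    if content_type == "markdown":
        # Structure-aware splitting for Markdown documents
        return MarkdownTextSplitter(chunk_size=500, chunk_overlap=50)
    if content_type == "uniform":
        # Simple character splitting for uniformly structured plain text
        return CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    # Recursive splitting as the generic default
    return RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

splitter = default_splitter("markdown")
documents = splitter.create_documents([open("README.md").read()])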