Chunking

We provide sample implementations of some commonly used text chunking approaches.


Example Python Code

1. Naive (Fixed-Size) Chunking

Splits text into chunks of a fixed number of characters (not words), ignoring sentence or semantic boundaries.

def fixed_size_chunk(text: str, chunk_size: int = 100) -> list[str]:
    """Splits text into fixed-size chunks based on character count."""
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i : i + chunk_size])
    return chunks

# Example:
# text = "This is a sample text to demonstrate fixed-size chunking."
# chunks = fixed_size_chunk(text, 20)
# print(chunks)
# Output: ['This is a sample tex', 't to demonstrate fix', 'ed-size chunking.']

However, as the output above shows, sentences (and even words) are often cut awkwardly, losing coherence. A common mitigation is to overlap adjacent chunks so that text cut at one boundary reappears intact in the next; a sketch follows.
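
A minimal sketch of fixed-size chunking with overlap; the function name and the overlap parameter are our own additions, not from any library:

def fixed_size_chunk_with_overlap(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Splits text into fixed-size chunks where each chunk repeats the last
    `overlap` characters of the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance less than chunk_size so adjacent chunks share a margin
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]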

2. Sentence-Based Chunking

Groups a fixed number of sentences together. Requires a sentence tokenizer library.

import nltk # Requires: pip install nltk

# Ensure NLTK data is downloaded (run once).
# Note: nltk.data.find raises LookupError when the resource is missing.
# try:
#     nltk.data.find('tokenizers/punkt')
# except LookupError:
#     nltk.download('punkt')

def sentence_based_chunk(text: str, max_sentences: int = 2) -> list[str]:
    """Chunks text by grouping a maximum number of sentences."""
    try:
        sentences = nltk.sent_tokenize(text)
    except LookupError:
        print("NLTK 'punkt' tokenizer not found. Please run nltk.download('punkt')")
        return [] # Or handle error appropriately

    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunks.append(" ".join(sentences[i : i + max_sentences]))
    return chunks

# Example:
# text = "Mr. Smith went to Washington. He visited the White House. Then he went home."
# chunks = sentence_based_chunk(text, 2)
# print(chunks)
# Output: ['Mr. Smith went to Washington. He visited the White House.', 'Then he went home.']

However, this approach might not handle very long sentences or paragraphs well, and a fixed sentence count gives no direct control over chunk size; one way to bound size is sketched below.
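
A minimal sketch that bounds chunks by characters instead of sentence count, reusing nltk.sent_tokenize from above; the function name and the max_chars parameter are our own:

import nltk  # Requires: pip install nltk (plus the 'punkt' data, as above)

def sentence_chunk_by_chars(text: str, max_chars: int = 200) -> list[str]:
    """Greedily packs whole sentences into chunks of at most max_chars characters.
    A single sentence longer than max_chars still becomes its own (oversized) chunk."""
    sentences = nltk.sent_tokenize(text)
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the current chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks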

3. Other Chunking Approaches

Minimal sketches of each approach follow the list.

  • Paragraph-Based: Split text at paragraph boundaries (e.g., blank lines). Large paragraphs can produce oversized chunks.

  • Semantic: Use embeddings or topic modeling to chunk at semantic boundaries.

  • Agentic: Use an LLM to decide chunk boundaries based on context or meaning.
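
A minimal paragraph-based sketch, assuming paragraphs are separated by blank lines; the function name is our own:

def paragraph_chunk(text: str) -> list[str]:
    """Splits text on blank lines; each non-empty paragraph becomes one chunk."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]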
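A minimal semantic sketch that starts a new chunk when the cosine similarity between adjacent sentence embeddings drops below a threshold. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model; the function name and the 0.5 threshold are arbitrary choices of ours:

import nltk  # Requires: pip install nltk (plus the 'punkt' data, as above)
import numpy as np  # Requires: pip install numpy
from sentence_transformers import SentenceTransformer  # Requires: pip install sentence-transformers

def semantic_chunk(text: str, threshold: float = 0.5) -> list[str]:
    """Starts a new chunk wherever adjacent sentences are less similar than `threshold`."""
    sentences = nltk.sent_tokenize(text)
    if len(sentences) < 2:
        return sentences
    model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = embeddings[i - 1], embeddings[i]
        similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        if similarity < threshold:  # semantic boundary detected
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks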
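A minimal agentic sketch. Here llm_complete is a hypothetical prompt-in, text-out callable standing in for whatever LLM client you use, and the <<<CHUNK>>> delimiter is our own convention:

def agentic_chunk(text: str, llm_complete) -> list[str]:
    """Asks an LLM to mark topic boundaries with a delimiter, then splits on it.
    `llm_complete` is a hypothetical callable: prompt string in, completion string out."""
    prompt = (
        "Insert the delimiter <<<CHUNK>>> between passages of the following text "
        "wherever the topic changes. Return the text otherwise unchanged.\n\n"
        + text
    )
    marked = llm_complete(prompt)
    return [chunk.strip() for chunk in marked.split("<<<CHUNK>>>") if chunk.strip()]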
