Chunking

Below are example implementations of commonly used text chunking approaches.


Python Code Examples

1. Naive (Fixed-Size) Chunking

Splits text into chunks of a fixed number of characters, ignoring word, sentence, and semantic boundaries.

def fixed_size_chunk(text: str, chunk_size: int = 100) -> list[str]:
    """Splits text into fixed-size chunks based on character count."""
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i : i + chunk_size])
    return chunks

# Example:
# text = "This is a sample text to demonstrate fixed-size chunking."
# chunks = fixed_size_chunk(text, 20)
# print(chunks)
# Output: ['This is a sample tex', 't to demonstrate fix', 'ed-size chunking.']

As the output shows, this approach often cuts words and sentences mid-stream, losing coherence.

2. Sentence-Based Chunking

Groups a fixed number of sentences together. Requires a sentence tokenizer library.
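A minimal sketch of this approach, using a simple regex splitter as a stand-in for a proper sentence tokenizer (in practice you would use a library such as NLTK's `sent_tokenize`; the regex here is an assumption for illustration):

```python
import re

def sentence_chunk(text: str, sentences_per_chunk: int = 3) -> list[str]:
    """Groups consecutive sentences into chunks of `sentences_per_chunk` each."""
    # Naive splitter: break after '.', '!', or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunks.append(" ".join(sentences[i : i + sentences_per_chunk]))
    return chunks

# Example:
# sentence_chunk("One. Two! Three? Four. Five.", 2)
# -> ['One. Two!', 'Three? Four.', 'Five.']
```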

However, it may not handle very long sentences or paragraphs well, since a single chunk can still greatly exceed the desired size.

3. Other Chunking

  • Paragraph-Based: Split text at paragraph boundaries (e.g., blank lines). Long paragraphs can produce oversized chunks.

  • Semantic: Use embeddings or topic modeling to chunk by semantic boundaries.

  • Agentic: Use an LLM to decide chunk boundaries based on context or meaning.
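A paragraph-based splitter from the list above can be sketched by splitting on blank lines (a common paragraph delimiter; adjust the pattern for your corpus):

```python
import re

def paragraph_chunk(text: str) -> list[str]:
    """Splits text into chunks at blank lines (one chunk per paragraph)."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in paragraphs if p.strip()]

# Example:
# paragraph_chunk("Para one.\n\nPara two.\n\n\nPara three.")
# -> ['Para one.', 'Para two.', 'Para three.']
```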
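Semantic chunking can be illustrated with a toy sketch that uses word-overlap (Jaccard) similarity between adjacent sentences in place of real embeddings; the similarity function and threshold are illustrative assumptions, and in practice you would compare embedding vectors from a model:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity between two word sets (toy stand-in for embeddings)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def semantic_chunk(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Starts a new chunk whenever similarity to the previous sentence drops below threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if jaccard(set(prev.lower().split()), set(sent.lower().split())) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

# Example:
# semantic_chunk(["cats are pets", "cats are fun", "python is code", "python is fun"])
# -> ['cats are pets cats are fun', 'python is code python is fun']
```

An embedding-based version has the same shape: replace `jaccard` with cosine similarity over sentence embeddings.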
