Welcome to Chunkipy’s documentation!

GitHub Repository: https://github.com/gioelecrispo/chunkipy

Supported Python versions: 3.10, 3.11, 3.12, 3.13

chunkipy is a modular and extensible text chunking library for Python — built to help you split large texts into smaller, meaningful segments for NLP, LLMs, and text processing pipelines.

It provides both ready-to-use chunkers and plug-and-play components, enabling developers to use it out of the box or fully customize its behavior according to their own needs.

Why Chunkipy?

Traditional text-splitting libraries often limit flexibility to simple fixed-size splitting or token-based segmentation, ignoring linguistic or semantic structure. chunkipy bridges that gap with a flexible, language-aware architecture that adapts to your use case.

  • Lightweight core — install only what you need

  • Configurable overlapping via overlap_ratio to preserve context across chunks

  • Plug-and-play defaults for immediate use without configuration

  • Optional language detection for multilingual text processing

  • Highly modular design — every component (chunker, splitter, estimator, detector) can be replaced or extended

The result is a library that’s both pragmatic for production and powerful for research, enabling data scientists and developers to easily find the best chunking strategy for a given use case.

Available Chunkers

chunkipy provides several built-in chunking strategies, each designed for different use cases. All of them implement a common interface, so you can easily switch between them or even define your own custom chunker.

They are flexible: you can combine them with custom text splitters, size estimators, or language detectors.

FixedSizeTextChunker

✅ Available now: `pip install chunkipy`

Splits text into fixed-size chunks based on the number of words or characters. This is the simplest and most predictable method, suitable when you just need evenly sized chunks for processing.

Fixed-size chunking
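To make the mechanics concrete, here is a minimal, library-independent sketch of fixed-size chunking by word count with overlap. The function name and signature are illustrative only, not chunkipy's actual API; they simply mirror the concepts (chunk size, overlap_ratio) described above.

```python
def fixed_size_chunks(text, chunk_size=8, overlap_ratio=0.25):
    """Split `text` into chunks of at most `chunk_size` words, where each
    chunk repeats the last `chunk_size * overlap_ratio` words of the
    previous chunk to preserve context. Illustrative sketch only."""
    words = text.split()
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # advance window by size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the end of the text
    return chunks
```

With `chunk_size=8` and `overlap_ratio=0.25`, consecutive chunks share 2 words, so the window advances 6 words at a time.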

RecursiveTextChunker

✅ Available now: `pip install chunkipy`

Uses a hierarchy of rules to split text at logical boundaries (e.g., paragraphs, sentences) while respecting the desired chunk size and overlap.

Recursive chunking
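The rule hierarchy can be sketched as follows. This is a simplified stand-in, not chunkipy's implementation: it tries the coarsest separator first (paragraphs), then falls back to finer ones (sentences, words), and it omits the merging of small adjacent pieces that a production recursive chunker would also perform.

```python
def recursive_chunks(text, max_size=80, separators=("\n\n", ". ", " ")):
    """Recursively split `text` at the coarsest boundary that yields
    pieces no longer than `max_size` characters. Illustrative sketch."""
    if len(text) <= max_size or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur: fall back to a finer boundary.
        return recursive_chunks(text, max_size, finer)
    chunks = []
    for piece in pieces:
        if len(piece) <= max_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_chunks(piece, max_size, finer))
    return chunks
```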

DocumentBasedTextChunker

🗺️ Roadmap — not yet implemented

Splits text based on document structure, such as paragraphs or sections. For example, it can split at double newlines or specific headings, depending on the document’s type (markdown, HTML, etc.).

Document-based chunking

SemanticTextChunker

🗺️ Roadmap — not yet implemented

Splits text based on semantic similarity between consecutive sentences or paragraphs. This method ensures that each chunk preserves contextual coherence — ideal for embeddings, RAG pipelines, and LLM-driven applications.

Semantic chunking

LLMBasedTextChunker

🗺️ Roadmap — not yet implemented

Uses a large language model (LLM) to intelligently segment text based on its meaning and context. This approach is highly flexible and can adapt to various text types and structures, making it suitable for complex chunking tasks.

LLM-based chunking

CustomTextChunker

You can create your own chunker by extending the base BaseTextChunker class and implementing the chunk method.
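A subclass might look like the sketch below. Since the real `BaseTextChunker` lives in the library, a local stand-in is defined here for self-containment; the assumption (consistent with the text above) is simply that the base class requires a `chunk` method that maps a text to a list of chunks.

```python
from abc import ABC, abstractmethod

# Local stand-in for chunkipy's BaseTextChunker; the real class is
# imported from the library, and its exact interface may differ.
class BaseTextChunker(ABC):
    @abstractmethod
    def chunk(self, text: str) -> list[str]:
        ...

class LineTextChunker(BaseTextChunker):
    """Hypothetical custom chunker: emits one chunk per non-empty line."""
    def chunk(self, text: str) -> list[str]:
        return [line.strip() for line in text.splitlines() if line.strip()]
```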

| Type                     | Status       | Overlap | Language-Aware               |
|--------------------------|--------------|---------|------------------------------|
| FixedSizeTextChunker     | ✅ Available | ✅      | ❌                           |
| RecursiveTextChunker     | ✅ Available | ✅      | ✅ (depends on the splitter) |
| DocumentBasedTextChunker | 🗺️ Roadmap   | —       | —                            |
| SemanticTextChunker      | 🗺️ Roadmap   | —       | —                            |
| LLMBasedTextChunker      | 🗺️ Roadmap   | —       | —                            |
| CustomTextChunker        | ✅ Available | ✅/❌   | ✅/❌                        |

Configurable Overlapping

Preserve context across chunks with a configurable overlap_ratio (a value between 0.0 and 1.0, expressed as a fraction of the chunk size). Overlapping is supported by all chunkers except the LLM-based one; among the chunkers available today, both FixedSizeTextChunker and RecursiveTextChunker implement it.

Pre-built and Customizable Splitters

Ready-to-use text splitters (e.g. SpacySentenceTextSplitter) are available, but you can easily implement your own custom splitter by extending BaseTextSplitter.
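For example, a splitter that breaks text on sentence-ending punctuation could look like this. The base class below is a local stand-in for `BaseTextSplitter` (the real one is imported from the library), and the assumed contract is a `split` method returning a list of text units.

```python
import re
from abc import ABC, abstractmethod

# Local stand-in for chunkipy's BaseTextSplitter (assumed interface).
class BaseTextSplitter(ABC):
    @abstractmethod
    def split(self, text: str) -> list[str]:
        ...

class RegexSentenceTextSplitter(BaseTextSplitter):
    """Hypothetical splitter: breaks on whitespace that follows
    sentence-ending punctuation (., !, ?)."""
    def split(self, text: str) -> list[str]:
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return [p for p in parts if p]
```

Unlike a model-backed splitter such as SpacySentenceTextSplitter, a regex splitter is language-agnostic and needs no model download, at the cost of accuracy on abbreviations and unusual punctuation.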

Language Detection (Optional)

chunkipy includes plug-and-play language detectors for cases where splitting logic depends on language (e.g. spaCy or Stanza models). It is completely optional — you can use built-in detectors or implement your own by extending BaseLanguageDetector.

Highly Modular Architecture

Every component of the pipeline is replaceable and independently configurable:

  • Chunkers define how chunks are formed.

  • Splitters define how text is divided into linguistic units.

  • Size estimators define how chunk size is measured (characters, words, tokens).

  • Detectors identify the language for language-dependent splitters.

This modularity allows users to combine, extend, or replace any part of the system without touching the rest of the library.
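As an illustration of the size-estimator concept, the functions below measure the same text in three different units. The names and the characters-per-token heuristic are assumptions for this sketch, not chunkipy's built-in estimators.

```python
# Illustrative size estimators: each maps a text to a size in some unit.
def char_count(text: str) -> int:
    """Size in characters."""
    return len(text)

def word_count(text: str) -> int:
    """Size in whitespace-separated words."""
    return len(text.split())

def approx_token_count(text: str) -> int:
    """Rough token estimate using the common ~4-characters-per-token
    heuristic for English; a real estimator would call a tokenizer."""
    return max(1, len(text) // 4)
```

Swapping the estimator changes what "chunk size" means to a chunker without touching the chunking logic itself.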

Quick Summary

  • 🧩 2 built-in chunkers available now — 3 more on the roadmap.

  • 🔁 Overlapping support — configurable context preservation via overlap_ratio.

  • 🧠 Smart splitters — pre-built (spaCy, Stanza) and fully customizable.

  • 🌍 Optional language detection — lightweight and extendable.

  • ⚙️ Highly modular architecture — all components are replaceable.

  • 🚀 Ready-to-use defaults — works out of the box, no configuration required.