chunkipy
Public package exports for chunkipy.
This module exposes the main chunker classes and data models that are intended
to be imported directly from chunkipy.
- class chunkipy.BaseLanguageDetector
Bases: ABC
Base class for strategies that detect the language of a text.
- class chunkipy.BaseTextChunker(chunk_size=None, size_estimator=None)
Bases: ABC
Base class for all chunker implementations.
- Parameters:
chunk_size (int) – Maximum size allowed for a single chunk, in the units defined by size_estimator.
size_estimator (BaseSizeEstimator) – Strategy used to measure text size. Defaults to WordSizeEstimator.
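The effect of the size estimator on chunk_size can be illustrated with a minimal, self-contained sketch. It does not import chunkipy: SizeCounter, WordCounter, CharCounter, and the estimate_size method name are hypothetical stand-ins chosen for illustration, not the library's API.

```python
from abc import ABC, abstractmethod


class SizeCounter(ABC):
    """Hypothetical stand-in for a size-estimator strategy (not chunkipy's class)."""

    @abstractmethod
    def estimate_size(self, text: str) -> int:
        """Return the size of `text` in this strategy's units."""


class WordCounter(SizeCounter):
    def estimate_size(self, text: str) -> int:
        # Units are whitespace-separated words.
        return len(text.split())


class CharCounter(SizeCounter):
    def estimate_size(self, text: str) -> int:
        # Units are characters.
        return len(text)


def fits(text: str, chunk_size: int, counter: SizeCounter) -> bool:
    """True when `text` does not exceed `chunk_size` in the counter's units."""
    return counter.estimate_size(text) <= chunk_size


sentence = "chunk size depends on the unit"
words_ok = fits(sentence, 10, WordCounter())  # 6 words fit in a budget of 10
chars_ok = fits(sentence, 10, CharCounter())  # 30 characters do not
```

The same chunk_size value of 10 accepts the sentence when sizes are counted in words and rejects it when they are counted in characters, which is why the limit only has meaning together with the estimator.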
- class chunkipy.Chunk(overlap=<factory>, content=<factory>)
Bases: object
Single chunk returned by a text chunker.
A chunk is composed of two ordered collections:
overlap: text parts repeated from the previous chunk to preserve context
content: text parts that are unique to the current chunk
The text and size properties are computed over the combined text_parts view.
- property size: int
Calculates and returns the total size of all TextPart objects within text_parts.
- Returns:
The total size of all TextPart objects.
- Return type:
int
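The relationship between overlap, content, and the combined text_parts view can be sketched as follows. This is a minimal illustration that does not import chunkipy: Part and SketchChunk are hypothetical stand-ins for TextPart and Chunk, and joining parts with a single space is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Part:
    """Hypothetical stand-in for TextPart: a text fragment and its size."""
    size: int
    text: str


@dataclass
class SketchChunk:
    """Hypothetical stand-in for Chunk, not chunkipy's implementation."""
    overlap: List[Part] = field(default_factory=list)
    content: List[Part] = field(default_factory=list)

    @property
    def text_parts(self) -> List[Part]:
        # Combined view: overlap first, then the chunk's own content.
        return self.overlap + self.content

    @property
    def text(self) -> str:
        # Joining with a space is an assumption made for this sketch.
        return " ".join(part.text for part in self.text_parts)

    @property
    def size(self) -> int:
        # Total size over both collections.
        return sum(part.size for part in self.text_parts)


chunk = SketchChunk(
    overlap=[Part(2, "previous tail")],
    content=[Part(3, "new chunk body")],
)
```

Here chunk.size counts the overlap as well as the content, so a chunk that carries overlap is larger than its unique content alone.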
- class chunkipy.Chunks(iterable=(), /)
List-like collection of Chunk objects returned by chunkers.
- class chunkipy.FastTextLanguageDetector(model_path, label_prefix='__label__')
Bases: BaseLanguageDetector
Detect language codes using a FastText language identification model.
The detector expects a path to a model compatible with the FastText Python bindings, such as Facebook’s lid.176.bin.
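FastText language identification models return labels such as __label__en, which is presumably why the detector takes a label_prefix argument. The sketch below shows one way such a prefix could be stripped to obtain a bare language code. The label_to_language helper is hypothetical, and how the detector actually uses label_prefix is an assumption.

```python
def label_to_language(label: str, label_prefix: str = "__label__") -> str:
    """Strip the label prefix from a raw model label, if present."""
    if label.startswith(label_prefix):
        return label[len(label_prefix):]
    # Labels without the prefix are returned unchanged.
    return label


code = label_to_language("__label__en")
```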
- class chunkipy.FixedSizeTextChunker(chunk_size=None, size_estimator=None, overlap_ratio=0.0)
Bases: BaseOverlapTextChunker
Chunk text into fixed-size slices using the configured size estimator.
Each segment emitted by size_estimator.segment is treated as a unit of size 1 during chunk assembly.
- Parameters:
chunk_size (int)
size_estimator (BaseSizeEstimator)
overlap_ratio (float)
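The idea of fixed-size slicing with an overlap ratio can be sketched without the library. The fixed_size_chunks function below is a hypothetical illustration, not chunkipy's implementation. It assumes whitespace-separated words as segments, and assumes the overlap is computed as chunk_size multiplied by overlap_ratio, rounded down.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int, overlap_ratio: float = 0.0) -> List[str]:
    """Slice whitespace-separated segments into windows of `chunk_size` segments."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0, 1)")

    segments = text.split()  # each segment counts as one unit of size
    overlap = int(chunk_size * overlap_ratio)  # segments repeated from the previous chunk
    step = chunk_size - overlap  # always >= 1 because overlap_ratio < 1

    chunks = []
    for start in range(0, len(segments), step):
        chunks.append(" ".join(segments[start:start + chunk_size]))
        if start + chunk_size >= len(segments):
            break  # the last window already reached the end of the text
    return chunks


chunks = fixed_size_chunks("a b c d e f g h", chunk_size=4, overlap_ratio=0.5)
```

With a chunk size of 4 and an overlap ratio of 0.5, each chunk in this sketch repeats the last two segments of the previous one.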
- class chunkipy.LangdetectLanguageDetector
Bases: BaseLanguageDetector
Detect language codes using the optional langdetect dependency.
- class chunkipy.Overlap
Bases: TextPartsMixin, deque[TextPart]
Deque-like collection used to carry overlap between consecutive chunks.
- class chunkipy.RecursiveTextChunker(chunk_size=None, size_estimator=None, overlap_ratio=0.0, text_splitters=None)
Bases: BaseOverlapTextChunker
Chunk text by recursively applying increasingly fine-grained splitters.
The chunker tries each splitter in order until a text part fits within the configured chunk_size. Custom splitters are attempted before the default fallback splitters.
- Parameters:
chunk_size (int)
size_estimator (BaseSizeEstimator)
overlap_ratio (float)
text_splitters (List[BaseTextSplitter])
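The recursive strategy described above can be sketched in a few lines. This is a hypothetical illustration rather than chunkipy's implementation: splitters are plain functions instead of BaseTextSplitter objects, size is counted in words, and the step that merges parts back into chunks with overlap is omitted.

```python
from typing import Callable, List

Splitter = Callable[[str], List[str]]


def by_sentence(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def by_comma(text: str) -> List[str]:
    return [s.strip() for s in text.split(",") if s.strip()]


def by_word(text: str) -> List[str]:
    return text.split()


def split_recursively(text: str, chunk_size: int, splitters: List[Splitter]) -> List[str]:
    """Return parts of `text` that each fit within `chunk_size` words where possible."""
    if len(text.split()) <= chunk_size or not splitters:
        return [text]  # fits, or nothing finer-grained is left to try

    first, rest = splitters[0], splitters[1:]
    parts: List[str] = []
    for piece in first(text):
        # Pieces that are still too large fall through to the next splitter.
        parts.extend(split_recursively(piece, chunk_size, rest))
    return parts


text = "Short one. This sentence is long, so it is split again."
parts = split_recursively(text, chunk_size=4, splitters=[by_sentence, by_comma, by_word])
```

Coarse splitters come first so that a part is only broken down further when it still exceeds the limit. In this run the first sentence survives intact, while the second is split at the comma and its oversized remainder is split into words.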
- class chunkipy.TextPart(size, text)
Bases: object
Represents a fragment or segment of a complete text, along with its character size.
- Parameters:
Modules
- Public language detector classes exposed by
- Public size estimator classes exposed by
- Public chunker classes and data models exposed by
- Public text splitter classes exposed by
- Shared utility helpers used across chunkipy internals and optional extras.