Welcome to chunkipy’s documentation!
chunkipy is a tool for segmenting long texts into smaller chunks based on either a character or a token count. With customizable chunk sizes and splitting strategies, chunkipy provides flexibility and control for a wide range of text processing tasks.
Motivation and Features
chunkipy was created to address a common need in Natural Language Processing (NLP): chunking text so that it does not exceed the input size of neural networks such as BERT. It can, however, be used in many other scenarios as well.
The library offers some useful features:
- Size estimation: unlike other text chunking libraries, chunkipy lets you provide a size estimator function, so that chunks are built according to the same counting function (e.g. a tokenizer) that will later consume them.
- Split text into meaningful sentences: as an optional configuration, chunkipy avoids cutting sentences when creating chunks and always tries to keep each sentence complete and syntactically correct. This is achieved through sentence segmenter libraries, which use semantic models to split text into meaningful sentences.
- Smart overlapping: chunkipy lets you define an overlap_percentage to create overlapping chunks, preserving context across chunk boundaries.
- Flexibility for text splitters: chunkipy offers complete flexibility in choosing how to split, allowing users to define their own text splitting function or choose from a list of pre-defined text splitters.