spaCy Sentence Text Splitter

Description

SpacySentenceTextSplitter detects the text language, selects the configured spaCy model, and splits content into sentence units for recursive chunking pipelines. It is a semantic splitter suited for multilingual workflows where sentence boundaries are more meaningful than punctuation-only splitting.

Note

You need to install the spaCy dependency and download at least one model (e.g. the English one) before using this splitter:

pip install chunkipy[spacy]
python -m spacy download en_core_web_sm
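
The example below also routes German and Italian text to dedicated models; if you want to reproduce it, download those models too:

python -m spacy download de_core_news_sm
python -m spacy download it_core_news_sm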

API / Documentation

class chunkipy.text_splitters.semantic.sentences.SpacySentenceTextSplitter(models_map=None, text_limit=None, language_detector=None)[source]

Bases: BaseSemanticTextSplitter

Sentence splitter that uses spaCy for semantic text splitting. The language of the input text is detected and the corresponding spaCy model is loaded to split it into sentences, so multiple languages can be handled by a single splitter. If the detected language is not supported, it defaults to English.

Parameters:

models_map (Dict[str, str]) – A dictionary mapping language codes to spaCy model names.

text_limit (int) – The maximum length of text to process at once. If None, DEFAULT_LIMIT from the base class is applied.

language_detector – The detector used to identify the language of the input text.

DEFAULT_LANG = 'en'
DEFAULT_MODELS_MAP = {'en': 'en_core_web_sm'}
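
As a minimal sketch, a splitter covering the three languages used in the example below can be built by passing a custom models_map; the text_limit value here is purely illustrative (None applies the base class DEFAULT_LIMIT):

from chunkipy.text_splitters.semantic.sentences import SpacySentenceTextSplitter

# Texts in languages not listed here fall back to DEFAULT_LANG ("en").
splitter = SpacySentenceTextSplitter(
    models_map={
        "en": "en_core_web_sm",
        "de": "de_core_news_sm",
        "it": "it_core_news_sm",
    },
    text_limit=100_000,  # illustrative value; None uses the base class DEFAULT_LIMIT
)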

Example

This example is included in examples/chunkers/recursive/prebuilt_spacy_text_splitter.py.

from chunkipy.size_estimators import WordSizeEstimator
from chunkipy import RecursiveTextChunker
from chunkipy.utils import MissingDependencyError


if __name__ == "__main__":

    with open("examples/texts/napoleon.txt", "r") as file:
        text = file.read()

    try:
        from chunkipy.text_splitters.semantic.sentences import SpacySentenceTextSplitter
        from chunkipy.size_estimators.openai_size_estimator import OpenAISizeEstimator

        word_size_estimator = WordSizeEstimator()
        openai_size_estimator = OpenAISizeEstimator()

        print(f"Num of chars: {len(text)}")
        print(f"Num of tokens (using WordSizeEstimator): {word_size_estimator.estimate_size(text)}")
        print(f"Num of tokens (using OpenAISizeEstimator): {openai_size_estimator.estimate_size(text)}")

        # Map language codes to the spaCy model that should handle them,
        # and pass the map to the splitter so non-English texts are routed
        # to their own models.
        models_map = {
            "en": "en_core_web_sm",
            "de": "de_core_news_sm",
            "it": "it_core_news_sm",
        }
        spacy_text_splitter = SpacySentenceTextSplitter(models_map=models_map)

        text_chunker = RecursiveTextChunker(
            chunk_size=200,
            overlap_ratio=0.25,
            size_estimator=openai_size_estimator,
            text_splitters=[spacy_text_splitter],
        )

        chunks = text_chunker.chunk(text)
        print(f"Got: {len(chunks)} chunks")
        print(f"Here are the text_parts: {chunks.get_all_text_parts()}")
    except MissingDependencyError as e:
        print(f"Error: {e}")

More examples are available under examples/chunkers/recursive/.