Spacy Sentence Text Splitter
Description
SpacySentenceTextSplitter detects the text language, selects the configured
spaCy model, and splits content into sentence units for recursive chunking
pipelines. It is a semantic splitter suited for multilingual workflows where
sentence boundaries are more meaningful than punctuation-only splitting.
Note
You need to install the spacy library and download at least one model (e.g. the English one) before using this splitter:
pip install chunkipy[spacy]
python -m spacy download en_core_web_sm
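Because `python -m spacy download` installs each model as a regular Python package, you can verify both prerequisites without loading anything. The helper below is a stdlib-only sketch (the name `check_spacy_ready` is illustrative and not part of chunkipy):

```python
import importlib.util


def check_spacy_ready(model: str = "en_core_web_sm") -> str:
    """Report whether spaCy and the given model package appear installed."""
    if importlib.util.find_spec("spacy") is None:
        return "spacy missing: run `pip install chunkipy[spacy]`"
    # Downloaded spaCy models are installed as importable packages,
    # so their presence can be checked the same way as spacy itself.
    if importlib.util.find_spec(model) is None:
        return f"model missing: run `python -m spacy download {model}`"
    return "ready"


print(check_spacy_ready())
```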
API / Documentation
- class chunkipy.text_splitters.semantic.sentences.SpacySentenceTextSplitter(models_map=None, text_limit=None, language_detector=None)[source]
Bases: BaseSemanticTextSplitter
Sentence splitter using spaCy for semantic text splitting. This class uses spaCy to split text into sentences based on the language detected in the text. It supports multiple languages by loading different spaCy models according to the detected language. If the detected language is not supported, it defaults to English.
- Parameters:
models_map (dict[str, str] | None)
text_limit (int | None)
language_detector (BaseLanguageDetector | None)
- text_limit
The maximum length of text to process at once. If None, DEFAULT_LIMIT from the base class is applied.
- Type:
int
- DEFAULT_LANG = 'en'
- DEFAULT_MODELS_MAP = {'en': 'en_core_web_sm'}
Example
This example is included in examples/chunkers/recursive/prebuilt_spacy_text_splitter.py.
from chunkipy.size_estimators import WordSizeEstimator
from chunkipy import RecursiveTextChunker
from chunkipy.utils import MissingDependencyError


if __name__ == "__main__":

    with open("examples/texts/napoleon.txt", "r") as file:
        text = file.read()

    try:
        from chunkipy.text_splitters.semantic.sentences import SpacySentenceTextSplitter
        from chunkipy.size_estimators.openai_size_estimator import OpenAISizeEstimator

        word_size_estimator = WordSizeEstimator()
        openai_size_estimator = OpenAISizeEstimator()

        print(f"Num of chars: {len(text)}")
        print(f"Num of tokens (using WordSizeEstimator): {word_size_estimator.estimate_size(text)}")
        print(f"Num of tokens (using OpenAISizeEstimator): {openai_size_estimator.estimate_size(text)}")

        # Map detected language codes to spaCy model names.
        models_map = {
            "en": "en_core_web_sm",
            "de": "de_core_news_sm",
            "it": "it_core_news_sm",
        }

        spacy_text_splitter = SpacySentenceTextSplitter(models_map=models_map)

        text_chunker = RecursiveTextChunker(
            chunk_size=200,
            overlap_ratio=0.25,
            size_estimator=openai_size_estimator,
            text_splitters=[spacy_text_splitter],
        )

        chunks = text_chunker.chunk(text)
        print(f"Got: {len(chunks)} chunks")
        print(f"Here are the text_parts: {chunks.get_all_text_parts()}")
    except MissingDependencyError as e:
        print(f"Error: {e}")
More examples are available under examples/chunkers/recursive/.