chunkipy.text_splitters.semantic.sentences
- class chunkipy.text_splitters.semantic.sentences.SpacySentenceTextSplitter(models_map={'en': 'en_core_web_sm'}, text_limit=None)[source]
Bases: BaseSemanticTextSplitter

Sentence splitter using spaCy for semantic text splitting. This class uses spaCy to split text into sentences based on the language detected in the text. It supports multiple languages by loading different spaCy models based on the detected language. If the detected language is not supported, it defaults to English.
- text_limit
The maximum length of text to process at once. If None, DEFAULT_LIMIT from base class is applied.
- Type:
int
- DEFAULT_LANG = 'en'
- DEFAULT_MODELS_MAP = {'en': 'en_core_web_sm'}
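The fallback behavior described above can be sketched in plain Python. The helper below is a hypothetical illustration (`resolve_model` is not part of chunkipy's API), using only the documented `DEFAULT_LANG` and `DEFAULT_MODELS_MAP` values:

```python
# Hypothetical sketch of the documented fallback: pick a spaCy model name
# for a detected language, defaulting to English when the language is not
# present in the models map.
DEFAULT_LANG = "en"
DEFAULT_MODELS_MAP = {"en": "en_core_web_sm"}


def resolve_model(detected_lang, models_map=None):
    """Return the spaCy model name for detected_lang, falling back to English."""
    models_map = models_map or DEFAULT_MODELS_MAP
    # Unsupported language -> fall back to the default English model.
    return models_map.get(detected_lang, models_map[DEFAULT_LANG])


print(resolve_model("en"))  # en_core_web_sm
print(resolve_model("xx"))  # en_core_web_sm (unsupported language, English fallback)
print(resolve_model("it", {"it": "it_core_news_sm", "en": "en_core_web_sm"}))  # it_core_news_sm
```

Passing a custom `models_map` (as in the last call) mirrors the constructor's `models_map` parameter, which lets callers register additional per-language spaCy models.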
- class chunkipy.text_splitters.semantic.sentences.StanzaSentenceTextSplitter(text_limit=None)[source]
Bases: BaseSemanticTextSplitter

Sentence splitter using Stanza for semantic text splitting. This class uses Stanza to split text into sentences based on the language detected in the text. It supports multiple languages by loading different Stanza models based on the detected language.
- Parameters:
text_limit (int)
- text_limit
The maximum length of text to process at once. If None, DEFAULT_LIMIT from base class is applied.
- Type:
int
- langdetect_stanza_mapping = {'af': 'af', 'ar': 'ar', 'bg': 'bg', 'bn': None, 'ca': 'ca', 'cs': 'cs', 'cy': None, 'da': 'da', 'de': 'de', 'el': 'el', 'en': 'en', 'es': 'es', 'et': 'et', 'fa': 'fa', 'fi': 'fi', 'fr': 'fr', 'gu': None, 'he': 'he', 'hi': 'hi', 'hr': 'hr', 'hu': 'hu', 'id': 'id', 'it': 'it', 'ja': 'ja', 'kn': None, 'ko': 'ko', 'lt': 'lt', 'lv': 'lv', 'mk': None, 'ml': None, 'mr': 'mr', 'ne': None, 'nl': 'nl', 'no': 'no', 'pa': None, 'pl': 'pl', 'pt': 'pt', 'ro': 'ro', 'ru': 'ru', 'sk': 'sk', 'sl': 'sl', 'so': None, 'sq': None, 'sv': 'sv', 'sw': None, 'ta': 'ta', 'te': 'te', 'th': None, 'tl': None, 'tr': 'tr', 'uk': 'uk', 'ur': 'ur', 'vi': 'vi', 'zh-cn': 'zh-hans', 'zh-tw': 'zh-hant'}
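The mapping above translates langdetect language codes into Stanza model codes; a value of None marks languages that langdetect can identify but for which no Stanza model is available. A minimal, hypothetical lookup helper (not part of chunkipy's API, shown here with an abbreviated copy of the mapping) could look like:

```python
# Hypothetical helper around the documented langdetect -> Stanza mapping.
# None means langdetect recognizes the language but Stanza has no model
# for it, so sentence splitting for that language cannot use Stanza.
LANGDETECT_STANZA_MAPPING = {
    "en": "en", "de": "de", "fr": "fr",
    "zh-cn": "zh-hans", "zh-tw": "zh-hant",  # langdetect and Stanza use different Chinese codes
    "bn": None, "sw": None,                  # abbreviated; see the full mapping above
}


def stanza_lang_for(detected_code):
    """Return the Stanza language code for a langdetect code, or None if unsupported."""
    return LANGDETECT_STANZA_MAPPING.get(detected_code)


print(stanza_lang_for("zh-cn"))  # zh-hans
print(stanza_lang_for("bn"))     # None (no Stanza model available)
```

Note that the two Chinese variants are the only entries where the langdetect code and the Stanza code differ; every other supported language maps to an identical code.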