Custom language detectors

Create a custom detector when the built-in language detection strategies do not meet your pipeline's requirements or compliance constraints.

Base class

Extend BaseLanguageDetector and implement detect(self, text: str) -> str.

Minimal example

from chunkipy.language_detectors import BaseLanguageDetector
from chunkipy.text_splitters.semantic.sentences import SpacySentenceTextSplitter


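class PrefixLanguageDetector(BaseLanguageDetector):
    """Toy detector: text starting with "IT:" is Italian, anything else is English."""

    def detect(self, text: str) -> str:
        # Validate the input with the inherited helper before detecting.
        self._validate_text(text)
        return "it" if text.strip().startswith("IT:") else "en"


# Pass the custom detector to a splitter that accepts a language_detector.
detector = PrefixLanguageDetector()
splitter = SpacySentenceTextSplitter(language_detector=detector)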

Guidelines

  • Return a normalized language code expected by downstream splitters.

  • Validate empty input before detection.

  • Keep behavior deterministic and test language edge cases explicitly.
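
The guidelines above can be sketched as a small standalone example. It does not import chunkipy: normalize_language_code and detect_by_prefix are hypothetical helpers introduced here for illustration, the two-letter lowercase output format is an assumption about what downstream splitters expect, and raising ValueError on empty input is an assumption rather than the documented behavior of _validate_text.

```python
def normalize_language_code(raw: str, default: str = "en") -> str:
    """Reduce tags such as 'IT', 'en-US', or 'fr_CA' to a lowercase primary subtag.

    Hypothetical helper: falls back to `default` when the tag is unusable.
    """
    if not isinstance(raw, str) or not raw.strip():
        return default
    primary = raw.strip().replace("_", "-").split("-", 1)[0].lower()
    # Accept only 2- or 3-letter alphabetic subtags; anything else is not a code.
    if primary.isalpha() and 2 <= len(primary) <= 3:
        return primary
    return default


def detect_by_prefix(text: str) -> str:
    """Deterministic stand-in for a detect() body: read a 'XX:' prefix if present."""
    # Validate empty input before detection (assumed to raise ValueError).
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    head, sep, _ = text.strip().partition(":")
    # No colon at all means no prefix, so fall back to English.
    return normalize_language_code(head) if sep else "en"


# Explicit edge cases, written as plain assertions.
assert detect_by_prefix("IT: Ciao a tutti.") == "it"
assert detect_by_prefix("  fr_CA: Bonjour") == "fr"
assert detect_by_prefix("Plain English text.") == "en"
assert detect_by_prefix("Note: not a language prefix") == "en"
```

In a real detector, the body of detect_by_prefix would live inside detect(self, text), and the same assertions would become unit tests so that a given input always maps to the same code.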

See also