Custom size estimators
Create a custom estimator when chunk size must follow domain-specific logic.
Base class
Extend BaseSizeEstimator and implement:
estimate_size(self, text: str) -> int
segment(self, text: str) -> Generator[str, None, None]
The segment method is especially important for FixedSizeTextChunker.
Instead of depending on a separate splitter implementation, the chunker uses
segment as its atomic splitting strategy. This keeps sizing and splitting
logic in one place and avoids duplicating domain rules across estimator and
splitter components. Pairing a size estimator with an unrelated splitter would
not make sense for FixedSizeTextChunker, because the units being counted and
the units being split could then disagree.
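To illustrate why sizing and splitting belong together, here is a minimal sketch of how atomic segments could be packed greedily under a size budget. It is not chunkipy's actual implementation: pack_segments, comma_segments, and comma_size are hypothetical helpers written for this illustration, and the sketch assumes that segment sizes are additive.

```python
from typing import Callable, Iterable, Iterator, List


def pack_segments(
    segments: Iterable[str],
    estimate_size: Callable[[str], int],
    chunk_size: int,
) -> Iterator[List[str]]:
    """Greedily group atomic segments so each group stays within chunk_size.

    Hypothetical helper, not part of chunkipy. A single segment larger than
    chunk_size is emitted as its own group.
    """
    current: List[str] = []
    current_size = 0
    for seg in segments:
        seg_size = estimate_size(seg)
        # Close the current group when this segment would exceed the budget.
        if current and current_size + seg_size > chunk_size:
            yield current
            current, current_size = [], 0
        current.append(seg)
        current_size += seg_size
    if current:
        yield current


def comma_segments(text: str) -> Iterator[str]:
    # Atomic unit: one comma-separated part.
    yield from text.split(",")


def comma_size(text: str) -> int:
    # Size: number of comma-separated parts, the same unit as comma_segments.
    return len(text.split(","))


groups = list(
    pack_segments(comma_segments("alpha,beta,gamma,delta"), comma_size, chunk_size=3)
)
```

Because both functions use the comma-separated part as their unit, a budget of 3 means "three parts per group". If the size function counted characters while the segmenter split on commas, the same budget would have no clear meaning.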
Minimal example
from chunkipy import FixedSizeTextChunker
from chunkipy.size_estimators.base_size_estimator import BaseSizeEstimator


class CommaSizeEstimator(BaseSizeEstimator):
    def estimate_size(self, text: str) -> int:
        # Size is the number of comma-separated parts.
        return len(text.split(","))

    def segment(self, text: str):
        # Each comma-separated part is one atomic segment.
        for part in text.split(","):
            yield part


estimator = CommaSizeEstimator()
chunker = FixedSizeTextChunker(chunk_size=3, size_estimator=estimator)
text = "alpha,beta,gamma,delta"
chunks = chunker.chunk(text)
Guidelines
Keep estimate_size and segment logically consistent.
For FixedSizeTextChunker, implement segment carefully because it controls atomic parts and replaces the need for a dedicated splitter.
Use tests with representative long and short texts.
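One possible way to test consistency between the two methods is sketched below. CommaRules is a stand-in class that mirrors the two methods without importing chunkipy, and check_consistency is a hypothetical helper. The specific invariants (segments rejoin to the original text, each segment measures as one unit) are assumptions that suit this comma-based rule and may not hold for other estimators.

```python
from typing import Iterator


class CommaRules:
    """Stand-in exposing estimate_size and segment; not a chunkipy class."""

    def estimate_size(self, text: str) -> int:
        return len(text.split(","))

    def segment(self, text: str) -> Iterator[str]:
        yield from text.split(",")


def check_consistency(rules, text: str, separator: str) -> None:
    """Raise AssertionError if sizing and splitting disagree on `text`."""
    parts = list(rules.segment(text))
    # Segmenting must not lose or invent content.
    assert separator.join(parts) == text
    # Each atomic part should measure as exactly one unit ...
    assert all(rules.estimate_size(p) == 1 for p in parts)
    # ... so the whole text measures as the number of parts.
    assert rules.estimate_size(text) == len(parts)


# Representative short and long inputs.
samples = ["alpha", "alpha,beta,gamma,delta", ",".join(["x"] * 500)]
for sample in samples:
    check_consistency(CommaRules(), sample, ",")
```

In a real project the same checks could run against the actual BaseSizeEstimator subclass inside the test suite.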