## Overview
Size estimators define how chunk size is measured. The selected estimator directly affects chunk boundaries, overlap behavior, and final chunk count.
Choose the estimator according to your downstream workload:

- use character-based sizing for deterministic, low-level control;
- use word-based sizing for general NLP tasks;
- use token-based sizing when your target model enforces token limits.
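To see how the choice of unit changes the measured size of the same text, here is a quick plain-Python comparison of character and word counts (token counts would additionally require a tokenizer):

```python
text = "Chunkipy splits long documents into smaller, size-bounded chunks."

char_size = len(text)          # character-based sizing
word_size = len(text.split())  # word-based sizing (whitespace split)

print(char_size, word_size)  # 65 8
```

The same text yields very different sizes per unit, which is why the chunk-size limit you configure only makes sense relative to the estimator you pick.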
If the built-in strategies are not enough, you can implement your own estimator by extending `BaseSizeEstimator` to adapt chunk sizing to any domain-specific requirement.
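The base-class interface is not shown on this page, so the sketch below assumes a single abstract `estimate(text) -> int` method and uses a local stand-in for `BaseSizeEstimator`; check the real class before subclassing. It measures chunk size in sentences, a plausible domain-specific unit:

```python
import re
from abc import ABC, abstractmethod


class BaseSizeEstimator(ABC):
    """Local stand-in for chunkipy's BaseSizeEstimator (assumed interface)."""

    @abstractmethod
    def estimate(self, text: str) -> int:
        """Return the size of `text` in this estimator's unit."""


class SentenceSizeEstimator(BaseSizeEstimator):
    """Measures chunk size in sentences instead of characters, words, or tokens."""

    def estimate(self, text: str) -> int:
        # Naive sentence split on ., ! and ? — good enough for a sketch.
        return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


est = SentenceSizeEstimator()
print(est.estimate("Hello world. How are you? Fine!"))  # 3
```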
## Built-in estimators
Chunkipy currently provides three estimators:
### Quick comparison

| Estimator | Unit | Extra dependency | Recommended when |
|---|---|---|---|
| `CharSizeEstimator` | Characters | None | You need deterministic and very fast sizing |
| `WordSizeEstimator` | Words | None | You want human-readable chunk lengths for generic NLP |
| `OpenAISizeEstimator` | Tokens | `tiktoken` | You need chunk sizes aligned with LLM token budgets |
## Choosing an estimator

- Start with `WordSizeEstimator` for most projects.
- Prefer `OpenAISizeEstimator` for prompt budgeting and LLM/RAG pipelines.
- Switch to `CharSizeEstimator` when you need strict, tokenizer-independent reproducibility.
- Build a custom estimator by subclassing `BaseSizeEstimator` when your sizing logic is domain-specific.
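At their core, these decision rules come down to picking one sizing function per unit. The helper below is purely illustrative (not part of chunkipy's API) and shows the character- and word-based logic side by side:

```python
from typing import Callable

# Map units to sizing functions; chunkipy's estimator classes wrap logic
# like this. The dict keys and helper name here are illustrative stand-ins.
SIZERS: dict[str, Callable[[str], int]] = {
    "chars": len,                       # CharSizeEstimator-style: deterministic, fast
    "words": lambda t: len(t.split()),  # WordSizeEstimator-style: human-readable
}


def chunk_size(text: str, unit: str = "words") -> int:
    """Return the size of `text` measured in the requested unit."""
    return SIZERS[unit](text)


print(chunk_size("strict reproducible sizing", "chars"))  # 26
print(chunk_size("strict reproducible sizing", "words"))  # 3
```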
For implementation details and usage examples, see the dedicated pages: