## Overview
Size estimators define how chunk size is measured. The selected estimator directly affects chunk boundaries, overlap behavior, and final chunk count.
Choose the estimator according to your downstream workload:

- use character-based sizing for deterministic, low-level control;
- use word-based sizing for general NLP tasks;
- use token-based sizing when your target model enforces token limits.
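To see how the choice of unit changes the measured size of the same text, here is a quick plain-Python comparison of character and word counts (token counts would additionally require a tokenizer):

```python
text = "Chunkipy splits long documents into smaller, size-bounded chunks."

char_size = len(text)          # character-based sizing
word_size = len(text.split())  # word-based sizing (whitespace split)

print(char_size, word_size)  # 65 8
```

The same text yields very different sizes per unit, which is why the chunk-size limit you configure only makes sense relative to the estimator you pick.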
If the built-in strategies are not enough, you can implement your own estimator by extending `BaseSizeEstimator` to adapt chunk sizing to any domain-specific requirement.
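The base-class interface is not shown on this page, so the sketch below assumes a single abstract `estimate(text) -> int` method and uses a local stand-in for `BaseSizeEstimator`; check the real class before subclassing. It measures chunk size in sentences, a plausible domain-specific unit:

```python
import re
from abc import ABC, abstractmethod


class BaseSizeEstimator(ABC):
    """Local stand-in for chunkipy's BaseSizeEstimator (assumed interface)."""

    @abstractmethod
    def estimate(self, text: str) -> int:
        """Return the size of `text` in this estimator's unit."""


class SentenceSizeEstimator(BaseSizeEstimator):
    """Measures chunk size in sentences instead of characters, words, or tokens."""

    def estimate(self, text: str) -> int:
        # Naive sentence split on ., ! and ? — good enough for a sketch.
        return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


est = SentenceSizeEstimator()
print(est.estimate("Hello world. How are you? Fine!"))  # 3
```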
## Built-in estimators
Chunkipy currently provides three estimators:
### Quick comparison

| Estimator | Unit | Extra dependency | Recommended when |
|---|---|---|---|
| `CharSizeEstimator` | Characters | None | You need deterministic and very fast sizing |
| `WordSizeEstimator` | Words | None | You want human-readable chunk lengths for generic NLP |
| `OpenAISizeEstimator` | Tokens | `tiktoken` | You need chunk sizes aligned with LLM token budgets |
## Choosing an estimator

- Start with `WordSizeEstimator` for most projects.
- Prefer `OpenAISizeEstimator` for prompt budgeting and LLM/RAG pipelines.
- Switch to `CharSizeEstimator` when you need strict, tokenizer-independent reproducibility.
- Build a custom estimator by subclassing `BaseSizeEstimator` when your sizing logic is domain-specific.
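At their core, these decision rules come down to picking one sizing function per unit. The helper below is purely illustrative (not part of chunkipy's API) and shows the character- and word-based logic side by side:

```python
from typing import Callable

# Map units to sizing functions; chunkipy's estimator classes wrap logic
# like this. The dict keys and helper name here are illustrative stand-ins.
SIZERS: dict[str, Callable[[str], int]] = {
    "chars": len,                       # CharSizeEstimator-style: deterministic, fast
    "words": lambda t: len(t.split()),  # WordSizeEstimator-style: human-readable
}


def chunk_size(text: str, unit: str = "words") -> int:
    """Return the size of `text` measured in the requested unit."""
    return SIZERS[unit](text)


print(chunk_size("strict reproducible sizing", "chars"))  # 26
print(chunk_size("strict reproducible sizing", "words"))  # 3
```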
For implementation details and usage examples, see the dedicated pages: