Overview

Size estimators define how chunk size is measured. The selected estimator directly affects chunk boundaries, overlap behavior, and final chunk count.

Choose the estimator according to your downstream workload:

  • use character-based sizing for deterministic low-level control;

  • use word-based sizing for general NLP tasks;

  • use token-based sizing when your target model enforces token limits.
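To make the trade-offs concrete, here is a small self-contained sketch (it does not use chunkipy itself) showing how the same text measures differently under character-, word-, and token-style sizing. The token count below is a rough word-plus-punctuation approximation for illustration only; real token counts require a tokenizer such as tiktoken.

```python
import re

text = "Size estimators define how chunk size is measured."

char_size = len(text)          # character-based: exact and deterministic
word_size = len(text.split())  # word-based: human-readable granularity

# Crude token proxy: count words and punctuation marks separately.
# (An actual LLM tokenizer will produce different numbers.)
approx_tokens = len(re.findall(r"\w+|[^\w\s]", text))

print(char_size, word_size, approx_tokens)  # 50 8 9
```

The same text yields three different sizes, which is why the estimator you pick changes where chunk boundaries fall.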

If the built-in estimators are not enough, you can implement your own by extending BaseSizeEstimator, adapting chunk sizing to any domain-specific requirement.
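As a sketch of what a custom estimator could look like, the snippet below sizes text by sentence count. The exact interface of BaseSizeEstimator is not shown on this page, so a single estimate(text) -> int method is assumed here, and a stand-in base class is defined so the example runs on its own; consult the BaseSizeEstimator page for the real contract.

```python
import re
from abc import ABC, abstractmethod


# Stand-in for chunkipy's BaseSizeEstimator — the real class may differ;
# a single abstract estimate(text) -> int method is assumed.
class BaseSizeEstimator(ABC):
    @abstractmethod
    def estimate(self, text: str) -> int: ...


# Hypothetical domain-specific estimator: measure size in sentences,
# so chunks hold at most N sentences regardless of character length.
class SentenceSizeEstimator(BaseSizeEstimator):
    def estimate(self, text: str) -> int:
        # Split on sentence-ending punctuation, ignore empty fragments.
        return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


est = SentenceSizeEstimator()
print(est.estimate("First sentence. Second one! A third?"))  # 3
```

SentenceSizeEstimator is illustrative, not part of chunkipy; the point is that any countable unit can drive chunk sizing once it is wrapped in the estimator interface.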

Built-in estimators

Chunkipy currently provides three estimators, compared in the table below.

Quick comparison

Estimator            Unit                Extra dependency      Recommended when
CharSizeEstimator    Characters          None                  You need deterministic and very fast sizing
WordSizeEstimator    Words               None                  You want human-readable chunk lengths for generic NLP
OpenAISizeEstimator  Tokens (tiktoken)   chunkipy[tiktoken]    You need chunk sizes aligned with LLM token budgets

Choosing an estimator

For implementation details and usage examples, see the dedicated pages: