Feature request
Expose a tokenizer API on the inference server: POST /tokenize and POST /decode, aligned with the shape used by Text Embeddings Inference (TEI) — and equivalently by Cohere (/tokenize + /detokenize) and vLLM.
POST /tokenize — {model, inputs, add_special_tokens?} → per input, a list of {id, text, special, start, stop} (with start/stop as character offsets) plus a token count.
POST /decode — {model, ids, skip_special_tokens?} → text.
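To make the proposed /tokenize response concrete, here is a minimal sketch of the per-input shape. It is illustrative only: `toy_tokenize` is a hypothetical whitespace tokenizer standing in for the served model's tokenizer, the field names `tokens` and `count` are my assumptions, and `stop` is assumed to be an exclusive character offset.

```python
import re
from dataclasses import asdict, dataclass


@dataclass
class Token:
    id: int        # token id (positional placeholder in this sketch)
    text: str      # surface text of the token
    special: bool  # True for special tokens such as [CLS] / [SEP]
    start: int     # character offset where the token begins
    stop: int      # character offset where the token ends (assumed exclusive)


def toy_tokenize(text: str) -> list[Token]:
    """Hypothetical stand-in tokenizer: one token per whitespace-delimited word."""
    return [
        Token(id=i, text=m.group(), special=False, start=m.start(), stop=m.end())
        for i, m in enumerate(re.finditer(r"\S+", text))
    ]


def tokenize_response(inputs: list[str]) -> list[dict]:
    """Build one entry per input: the token list plus a token count."""
    response = []
    for text in inputs:
        tokens = toy_tokenize(text)
        response.append({"tokens": [asdict(t) for t in tokens], "count": len(tokens)})
    return response
```

A real implementation would take ids and offsets from the model's own tokenizer; only the response shape is the point here.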
These primitives cover three client scenarios. In priority order for me:
1. Truncate an input to N tokens before sending it.
2. Chunk one input into K segments of at most N tokens each.
3. Tokenize an input to get token ids and character offsets.
(1) and (2) are my main interest; (3) is the primitive they build on. No popular tokenizer API standardises chunking, so I'd suggest one of two options: ship the primitives only and let clients chunk from the offsets, or add a thin, clearly Infinity-specific convenience layer (e.g. max_tokens to head-truncate, chunk_size/chunk_overlap to split) that reuses the truncation logic already present for reranking. Happy to follow your preference here.
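As a rough sketch of the "primitives only" option, this is how a client could truncate and chunk using nothing but character offsets. The function names and the `(start, stop)` tuple format are hypothetical, `stop` is assumed to be exclusive, and special tokens (which have no character span) are assumed to be filtered out before calling.

```python
def truncate_to_tokens(text: str, offsets: list[tuple[int, int]], max_tokens: int) -> str:
    """Head-truncate text so that it covers at most max_tokens tokens."""
    if max_tokens <= 0 or not offsets:
        return ""
    if max_tokens >= len(offsets):
        return text  # already within budget
    # Cut at the end offset of the last token that is kept.
    return text[: offsets[max_tokens - 1][1]]


def chunk_by_tokens(
    text: str,
    offsets: list[tuple[int, int]],
    chunk_size: int,
    chunk_overlap: int = 0,
) -> list[str]:
    """Split text into segments of at most chunk_size tokens each.

    Consecutive segments share chunk_overlap tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for i in range(0, len(offsets), step):
        window = offsets[i : i + chunk_size]
        # Slice from the first token's start to the last token's end.
        chunks.append(text[window[0][0] : window[-1][1]])
        if i + chunk_size >= len(offsets):
            break  # this window already reached the final token
    return chunks
```

The convenience layer would run the same logic server-side, with the caveat that a client doing this itself has to recount tokens once special tokens are added back to each chunk.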
Motivation
Clients that send long inputs have to guess how the served model's tokenizer treats them. Without a server-side tokenizer endpoint, every client has to ship and pin the matching tokenizers/transformers version per model, keep it in sync with the model Infinity actually serves, and reimplement chunking. That logic differs per model (BERT vs. XLM-R vs. a custom merge), so it drifts easily and is a common source of "works locally, mismatched in prod" bugs.
Infinity already holds the right tokenizer in process (_infinity_tokenizer in the SentenceTransformer and CrossEncoder backends) and already truncates by tokens for reranking (truncate_texts_to_tokens, plus the max_query_tokens / max_tokens_per_doc / max_pair_tokens budgets on /rerank). Exposing that capability lets clients reuse the exact tokenizer of the served model instead of duplicating it.
The intent is to match an existing popular API rather than invent a custom one. TEI is the closest precedent, since Infinity already positions itself as a TEI/OpenAI alternative; /tokenize + /decode carries the character offsets needed for chunking and lets existing TEI clients work unchanged. Cohere's /tokenize + /detokenize naming is an alternative that matches Infinity's existing Cohere-aligned /rerank.
Two open questions for you:
- TEI-aligned /tokenize + /decode, or Cohere-aligned /tokenize + /detokenize (matching the existing /rerank)?
- Primitives only, or include the truncate/chunk convenience layer?
Your contribution
Yes — I'm happy to open a PR once the endpoint shape and the two questions above are agreed.
References: TEI API, Cohere /tokenize, vLLM tokenizer endpoints.