chessbench

Testing local LLMs by their chess-playing proficiency through tool use.

Important

ILLEGAL_MOVE: The model returned a move, but it was not legal according to the current board state.
INVALID_FORMAT: The model returned a response that could not be parsed as a move (e.g., conversational text or incorrect notation).
LLM_ERROR: The model failed to provide a result due to technical limits, such as getting stuck in an infinite loop or running out of context.
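The three failure categories above can be sketched as a small classifier. This is only an illustration, not the code ChessBench actually uses: the function name `classify_response`, the `MoveResult` enum, the choice of UCI notation (e.g. `e2e4`), and the convention that `None` stands for a technical failure are all assumptions. The set of legal moves is assumed to be supplied by the caller, for example from a chess library.

```python
import re
from enum import Enum
from typing import Optional, Set


class MoveResult(Enum):
    OK = "OK"
    ILLEGAL_MOVE = "ILLEGAL_MOVE"
    INVALID_FORMAT = "INVALID_FORMAT"
    LLM_ERROR = "LLM_ERROR"


# Assumed notation: UCI, i.e. from-square, to-square, optional promotion
# piece (for example "e2e4" or "e7e8q").
UCI_MOVE = re.compile(r"[a-h][1-8][a-h][1-8][qrbn]?")


def classify_response(response: Optional[str], legal_moves: Set[str]) -> MoveResult:
    """Hypothetical helper: map one model response to a result category."""
    # Assumption: the caller passes None when the model produced no usable
    # output (stuck in a loop, ran out of context, and so on).
    if response is None:
        return MoveResult.LLM_ERROR

    text = response.strip().lower()

    # Anything that is not exactly one move in the expected notation,
    # such as conversational text, counts as a format error.
    if not UCI_MOVE.fullmatch(text):
        return MoveResult.INVALID_FORMAT

    # Well-formed, but not playable in the current position.
    if text not in legal_moves:
        return MoveResult.ILLEGAL_MOVE

    return MoveResult.OK
```

The format check runs before the legality check so that a chatty or malformed reply is never reported as an illegal move.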

Puzzles

Model Name	Total Puzzles	Accuracy	Average Rating
Qwen3.5-4B-Q4_K_M-no-reasoning	1794	6.1%	1104 Elo
Qwen3-0.6B-Q4_K_M-no-reasoning	16004	6.0%	1073 Elo
Llama-3.2-3B-Instruct-Q4_K_M	11521	5.7%	1057 Elo
Llama-3.2-1B-Instruct-Q4_K_M	6162	3.1%	1046 Elo
nanbeige4.1-3b-q4_k_m	93	12.9%	1044 Elo
gemma-4-E4B-it-Q4_K_M	1286	9.9%	1044 Elo
LFM2.5-1.2B-Instruct-Q4_K_M	3040	2.9%	1024 Elo
gemma-4-E4B-it-Q4_K_M-no-reasoning	256	11.7%	945 Elo

Self-play

Model Name	Total Games	Game Completion Rate	Illegal Move Rate	Avg. Tokens/Move	W / D / L	Note
gemma-4-E4B-it-Q4_K_M	69	2.9%	0.9%	1422.5	0 / 68 / 1	I'm very impressed, even though play is honestly very bad.
gemma-4-E2B-it-Q4_K_M	11	0.0%	N/A	N/A	0 / 11 / 0	Values marked N/A are unavailable because this run is from an older version of ChessBench.
Qwen3.5-4B-Q4_K_M	10	0.0%	0.0%	1227.8	0 / 10 / 0	This model tends to get stuck in infinitely repeating sequences until it runs out of context.
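For readers who want to reproduce columns like these from their own runs, here is a minimal sketch of how the three rate columns might be aggregated. The definitions are assumptions, not taken from the ChessBench source: completion rate is taken as completed games over total games, illegal move rate as illegal moves over all move attempts, and tokens per move as total tokens over all move attempts. The `GameRecord` fields and the `summarize` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GameRecord:
    """Hypothetical per-game log entry; field names are illustrative."""
    completed: bool      # the game reached a natural end
    move_attempts: int   # how many times the model was asked for a move
    illegal_moves: int   # attempts rejected as illegal
    tokens_used: int     # tokens generated across all attempts


def summarize(games: List[GameRecord]) -> Dict[str, Optional[float]]:
    """Aggregate per-game records into table-style metrics.

    Rates are percentages. A metric is None when its denominator is zero.
    """
    total_games = len(games)
    attempts = sum(g.move_attempts for g in games)
    completed = sum(1 for g in games if g.completed)
    illegal = sum(g.illegal_moves for g in games)
    tokens = sum(g.tokens_used for g in games)

    return {
        "completion_rate": 100 * completed / total_games if total_games else None,
        "illegal_move_rate": 100 * illegal / attempts if attempts else None,
        "avg_tokens_per_move": tokens / attempts if attempts else None,
    }
```

Under this definition, 2 completed games out of 69 rounds to 2.9%, which is consistent with the gemma-4-E4B-it-Q4_K_M row.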
