Skip to content

ChefZander/chessbench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

chessbench

Testing Local LLMs by their Chess-playing proficiency through tool use.

Important

ChessBench Flags

  • ILLEGAL_MOVE: The model returned a move, but it was not legal according to the current board state.
  • INVALID_FORMAT: The model returned a response that could not be parsed as a move (e.g., conversational text or incorrect notation).
  • LLM_ERROR: The model failed to provide a result due to technical limits, such as getting stuck in an infinite loop or running out of context.

Puzzles

Model Name Total Puzzles Accuracy Average rating
Qwen3.5-4B-Q4_K_M-no-reasoning 1794 6.1% 1104 Elo
Qwen3-0.6B-Q4_K_M-no-reasoning 16004 6.0% 1073 Elo
Llama-3.2-3B-Instruct-Q4_K_M 11521 5.7% 1057 Elo
Llama-3.2-1B-Instruct-Q4_K_M 6162 3.1% 1046 Elo
nanbeige4.1-3b-q4_k_m 93 12.9% 1044 Elo
gemma-4-E4B-it-Q4_K_M 1286 9.9% 1044 Elo
LFM2.5-1.2B-Instruct-Q4_K_M 3040 2.9% 1024 Elo
gemma-4-E4B-it-Q4_K_M-no-reasoning 256 11.7% 945 Elo

Self-Play

Model Name Total Games Game Completion Rate Illegal Move Rate Avg. Tokens/Move W / D / L Note
gemma-4-E4B-it-Q4_K_M 69 2.9% 0.9% 1422.5 0 / 68 / 1 I'm very impressed, even though play is honestly very bad.
gemma-4-E2B-it-Q4_K_M 11 0.0% N/A N/A 0 / 11 / 0 The N/A values are missing because this benchmark is from an older version of ChessBench.
Qwen3.5-4B-Q4_K_M 10 0.0% 0.0% 1227.8 0 / 10 / 0 This model loves to get stuck in infinitely repeating sequences, making itself run out of context.

About

Testing Local LLMs by their Chess-playing proficiency through tool use.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages