Can AI Play Codenames?
We pit large language models against each other in the word association game Codenames, then rate them like chess players. Here's how the benchmark works.
How Codenames Works
Codenames is a team word game with two roles. The spymaster sees which words belong to their team and gives a one-word clue plus a number (e.g. "OCEAN 2") indicating how many words relate to that clue. The operative sees only the words and must guess which ones the clue refers to.
The board has 25 cards: 9 for the starting team, 8 for the other, 7 neutral, and 1 assassin. Guess the assassin and you lose instantly. First team to reveal all their cards wins.
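The card split above can be sketched in a few lines of Python. This is only an illustration of the 9/8/7/1 layout, not the benchmark's own code; the names `CARD_COUNTS`, `deal_board`, and `operative_view` are hypothetical.

```python
import random

# Card counts from the rules above: 9 + 8 + 7 + 1 = 25.
CARD_COUNTS = {"first": 9, "second": 8, "neutral": 7, "assassin": 1}

def deal_board(words, seed=None):
    """Assign a hidden role to each of 25 distinct words; returns {word: role}."""
    if len(words) != 25 or len(set(words)) != 25:
        raise ValueError("a board needs exactly 25 distinct words")
    roles = [role for role, count in CARD_COUNTS.items() for _ in range(count)]
    rng = random.Random(seed)  # seeded so the same board can be replayed
    rng.shuffle(roles)
    return dict(zip(words, roles))

def operative_view(board, revealed):
    """What the operative sees: roles only for cards already revealed."""
    return {word: (role if word in revealed else None)
            for word, role in board.items()}
```

The spymaster would work from the full `board` mapping, while the operative only ever gets `operative_view(board, revealed)`.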
Spymaster's View
The spymaster sees all card colors and must find connections between their team's words.
Teaching AI the Rules
Each LLM receives the current game state as a structured prompt and responds with a JSON action: a clue (word + number) for spymasters, or a guess for operatives. When a clue is invalid, the model gets feedback and may retry, up to 3 times. Structured output is enforced via the instructor library.
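To make the retry loop concrete, here is a minimal sketch using only the standard library. It does not use the instructor library, and several things are assumptions made for illustration: the JSON field names `word` and `number`, the specific validity rules, the reading of "up to 3 times" as one initial attempt plus three retries, and the hypothetical `ask_model` callable that stands in for the LLM call.

```python
import json

MAX_RETRIES = 3  # from the text: an invalid clue gets feedback up to 3 times

def validate_clue(raw, board_words):
    """Return (clue, None) if the reply is usable, else (None, error message)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"not valid JSON: {exc}"
    if not isinstance(data, dict):
        return None, "reply must be a JSON object"
    word, number = data.get("word"), data.get("number")
    # Assumed rule: a clue is a single alphabetic word.
    if not isinstance(word, str) or not word.isalpha():
        return None, "clue must be a single word"
    # Assumed rule: the clue may not be one of the words on the board.
    if word.upper() in {w.upper() for w in board_words}:
        return None, "clue must not be a word on the board"
    if isinstance(number, bool) or not isinstance(number, int) or number < 0:
        return None, "number must be a non-negative integer"
    return {"word": word.upper(), "number": number}, None

def get_clue(ask_model, board_words):
    """Ask for a clue, feeding each validation error back into the next try.

    ask_model(feedback) is a hypothetical callable returning the raw reply;
    feedback is None on the first attempt. Returns None if every try fails.
    """
    feedback = None
    for _ in range(1 + MAX_RETRIES):
        clue, error = validate_clue(ask_model(feedback), board_words)
        if clue is not None:
            return clue
        feedback = error
    return None
```

What happens when every attempt fails is not stated above; returning `None` here simply leaves that decision to the caller.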
Ensuring Fairness
Every matchup is played twice on the same board with swapped team colors, eliminating first-move and color advantage. These two games form a pair — the unit of competition. A model must win both games for a "sweep"; split results count as a tie.
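The pairing rule can be written down as a short sketch. `play_game` is a hypothetical callable that runs one full game and returns the name of the winning model; the assumption here is that swapping colors also swaps which model moves first.

```python
def pair_result(winner_game1, winner_game2):
    """Collapse two game winners into one pair outcome.

    Returns the model name on a sweep, or None when the games are split.
    """
    return winner_game1 if winner_game1 == winner_game2 else None

def play_pair(play_game, board, model_a, model_b):
    """Play the same board twice with the sides swapped.

    play_game(board, starting, other) is a hypothetical function that
    returns the winner's name.
    """
    first = play_game(board, starting=model_a, other=model_b)
    second = play_game(board, starting=model_b, other=model_a)
    return pair_result(first, second)
```

If moving first were the only thing that mattered, every pair would split and come out as a tie, which is the point of the design: only a model that wins from both seats gets credit.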
From Wins to Rankings
Ratings use the Bradley-Terry model, a pairwise comparison system in which each model has a hidden strength parameter. The probability that one model beats another is its strength divided by the sum of the two strengths. The Davidson extension adds a tie parameter θ because 1-1 pair results are common. Ratings are expressed on the familiar Elo scale (center = 1500).
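As a rough sketch of what the model predicts for a single pair: in the Davidson formulation, with strengths s_a and s_b, a tie gets weight θ·sqrt(s_a·s_b) alongside the two win weights s_a and s_b. The mapping from Elo rating to strength used below, s = 10^((rating − 1500) / 400), is the conventional one and is an assumption here; the function name `davidson_probs` is hypothetical.

```python
import math

def davidson_probs(rating_a, rating_b, theta):
    """Return (P(A sweeps), P(tie), P(B sweeps)) under the Davidson model.

    Ratings are on an Elo-like scale centered at 1500 (assumed mapping:
    a 400-point gap means a 10x strength ratio).
    """
    s_a = 10 ** ((rating_a - 1500) / 400)
    s_b = 10 ** ((rating_b - 1500) / 400)
    tie_weight = theta * math.sqrt(s_a * s_b)
    total = s_a + s_b + tie_weight
    return s_a / total, tie_weight / total, s_b / total

# Two equal models with theta = 1: win, tie, and loss are equally likely.
p_win, p_tie, p_loss = davidson_probs(1500, 1500, theta=1.0)
```

With θ = 0 the tie weight vanishes and the formula reduces to plain Bradley-Terry; raising θ shifts probability from both win outcomes toward the tie.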
How Confident Are We?
A single rating number isn't the whole story. We resample game results 1,000 times with replacement (bootstrap), refit the rating model each time, and take the middle 95% of the refitted ratings as the confidence interval. Wider bars mean less certainty, usually from fewer games or more volatile results.
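The bootstrap procedure can be sketched as follows. To keep the example self-contained, the "refit" step is replaced by a toy stand-in, `elo_gap`, which estimates the rating gap between two models from a list of pair outcomes; it is not the Bradley-Terry/Davidson fit described above, and both function names are hypothetical. Resampling pair outcomes rather than individual games is also an assumption.

```python
import math
import random

def elo_gap(results):
    """Toy stand-in for the refit: rating gap of A over B, in Elo points.

    results is a list of "A", "B", or "tie"; a tie counts as half a win.
    """
    score = sum(1.0 if r == "A" else 0.5 if r == "tie" else 0.0 for r in results)
    # Add-half smoothing avoids log(0) when a resample is completely one-sided.
    p = (score + 0.5) / (len(results) + 1.0)
    return 400 * math.log10(p / (1 - p))

def bootstrap_ci(results, fit=elo_gap, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap: resample with replacement, refit, keep the middle."""
    rng = random.Random(seed)
    estimates = sorted(
        fit(rng.choices(results, k=len(results))) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    low = estimates[round(tail * n_resamples)]
    high = estimates[round((1 - tail) * n_resamples) - 1]
    return low, high

# Example: 30 sweeps for A, 10 for B, 20 split pairs.
pairs = ["A"] * 30 + ["B"] * 10 + ["tie"] * 20
low, high = bootstrap_ci(pairs)
```

If the interval for a gap between two models includes zero, the data cannot separate them yet, however different their point ratings look.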