
Can AI Play Codenames?

We pit large language models against each other in the word association game Codenames, then rate them like chess players. Here's how the benchmark works.

OCEAN    CASTLE   DIAMOND  SHADOW   ROCKET
FOREST   BRIDGE   CRYSTAL  PHOENIX  HARBOR
VELVET   THUNDER  COMPASS  BEACON   GLACIER
MARBLE   FALCON   LANTERN  SUMMIT   BREEZE
CORAL    WRAITH   PRISM    SPARK    EMBER

How Codenames Works

Codenames is a team word game with two roles. The spymaster sees which words belong to their team and gives a one-word clue plus a number (e.g. "OCEAN 2") indicating how many words relate to that clue. The operative sees only the words and must guess which ones the clue refers to.

The board has 25 cards: 9 for the starting team, 8 for the other, 7 neutral, and 1 assassin. Guess the assassin and you lose instantly. First team to reveal all their cards wins.
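
The card split described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the function name `deal_board`, the team labels "red" and "blue", and the `seed` parameter are all assumptions made for the example.

```python
import random
from collections import Counter


def deal_board(words, starting_team="red", seed=None):
    """Assign a hidden role to each of 25 words: 9 / 8 / 7 / 1.

    Hypothetical helper for illustration; names and labels are assumptions.
    """
    if len(set(words)) != 25:
        raise ValueError("a Codenames board needs exactly 25 distinct words")
    other_team = "blue" if starting_team == "red" else "red"
    roles = (
        [starting_team] * 9   # starting team gets one extra card
        + [other_team] * 8
        + ["neutral"] * 7
        + ["assassin"]        # guessing this card loses the game instantly
    )
    rng = random.Random(seed)  # seeded so the same board can be replayed
    rng.shuffle(roles)
    return dict(zip(words, roles))


if __name__ == "__main__":
    demo_words = [f"WORD{i}" for i in range(25)]  # placeholder words
    board = deal_board(demo_words, starting_team="red", seed=42)
    print(Counter(board.values()))
```

Seeding the shuffle is one way to make a board reproducible, which matters when the same board has to be played more than once.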

Spymaster's View

Example cards: BEACH, WAVE, CASTLE, MOON, BOAT, SAND, NIGHT, CROWN, CORAL

The spymaster sees all card colors and must find connections between their team's words.

Teaching AI the Rules

Each LLM receives the current game state as a structured prompt and responds with a JSON action — a clue (word + number) for spymasters, or a guess for operatives. Invalid clues get retry feedback up to 3 times. Structured output is enforced via the instructor library.

Game State → Prompt → LLM → JSON Output → Game Action

3 retry attempts · Clue validation · Structured output via instructor
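
The validate-and-retry loop can be sketched as follows. This is a hedged illustration that uses the standard `json` module rather than the instructor library, so it does not show how the benchmark actually enforces structured output. The validation rules (a single alphabetic word, not a board word, a non-negative count), the JSON field names `word` and `number`, the `ask_model` callable, and the reading of "3 retries" as one initial attempt plus three retries are all assumptions.

```python
import json
from typing import Callable, List, Optional

MAX_RETRIES = 3  # from the article; treated here as retries after a first attempt


def validate_clue(action: dict, board_words: List[str]) -> Optional[str]:
    """Return an error message for an invalid clue, or None if it is acceptable."""
    word = action.get("word")
    number = action.get("number")
    if not isinstance(word, str) or not word.isalpha():
        return "clue must be a single alphabetic word"
    if word.upper() in {w.upper() for w in board_words}:
        return "clue must not be a word on the board"
    if isinstance(number, bool) or not isinstance(number, int) or number < 0:
        return "number must be a non-negative integer"
    return None


def get_clue(ask_model: Callable[[str], str], prompt: str,
             board_words: List[str]) -> Optional[dict]:
    """Ask for a clue, feeding the rejection reason back on each retry."""
    feedback = ""
    for _ in range(1 + MAX_RETRIES):
        raw = ask_model(prompt + feedback)
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            action, error = None, "response was not valid JSON"
        else:
            if isinstance(action, dict):
                error = validate_clue(action, board_words)
            else:
                error = "expected a JSON object"
        if error is None:
            return action
        feedback = f"\nPrevious answer rejected: {error}. Try again."
    return None  # the caller decides what a failed turn means
```

Passing the model in as a plain callable keeps the sketch independent of any particular client library and makes the loop easy to exercise with a scripted stand-in.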

Ensuring Fairness

Every matchup is played twice on the same board with swapped team colors, eliminating first-move and color advantage. These two games form a pair — the unit of competition. A model must win both games for a "sweep"; split results count as a tie.

Game 1: Model A (Red) vs Model B (Blue)
Game 2 (mirrored): Model A (Blue) vs Model B (Red)
Same board, swapped sides

2-0: Sweep (A wins both)
1-1: Tie (split pair)
0-2: Swept (B wins both)
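
The pair scoring rule is small enough to state as code. The sketch below is illustrative only: the function name `score_pair` and the outcome labels `"sweep_a"`, `"tie"`, and `"sweep_b"` are made up for this example.

```python
def score_pair(game1_winner: str, game2_winner: str,
               model_a: str, model_b: str) -> str:
    """Collapse two mirrored games into one pair outcome.

    Hypothetical helper; labels are assumptions, not the benchmark's own.
    """
    winners = [game1_winner, game2_winner]
    wins_a = winners.count(model_a)
    wins_b = winners.count(model_b)
    if wins_a + wins_b != 2:
        raise ValueError("each game must be won by one of the two models")
    if wins_a == 2:
        return "sweep_a"   # 2-0: A wins both games
    if wins_b == 2:
        return "sweep_b"   # 0-2: B wins both games
    return "tie"           # 1-1: split pair


if __name__ == "__main__":
    print(score_pair("model-a", "model-b", "model-a", "model-b"))
```

Because each model plays both sides of the same board, a split pair says nothing about which model is stronger, which is why it is recorded as a tie rather than as one win each.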

From Wins to Rankings

Ratings use the Bradley-Terry model — a pairwise comparison system where each model has a hidden strength parameter. Win probability is proportional to relative strength. The Davidson extension adds a tie parameter θ because 1-1 pair results are common. Ratings are expressed on the familiar Elo scale (center = 1500).

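For example, with A rated 1600 Elo and B rated 1400 Elo (a +200 Elo gap for A) and θ = 1.00, the model predicts: A wins 53.2%, tie 29.9%, B wins 16.8%. The rating gap moves the prediction between "B dominates", "Equal", and "A dominates"; θ moves it between "No ties" and "Many ties".

$$P(A \text{ beats } B) = \frac{\gamma_A}{\gamma_A + \gamma_B + \theta \sqrt{\gamma_A \gamma_B}}$$

Davidson extension of the Bradley-Terry model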
Elo Scale Conversion
Log-strength values are centered at 1500 and scaled so that a 400-point gap corresponds to a tenfold ratio in strength, just like chess ratings.
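
To make the formula and the Elo conversion concrete, here is a short Python sketch that turns two Elo ratings and a tie parameter θ into win, tie, and loss probabilities. The function names are assumptions for illustration. The tie probability is taken to be the θ · √(γA · γB) term over the same denominator, which follows from the three outcomes summing to one.

```python
import math


def elo_to_strength(elo: float, center: float = 1500.0) -> float:
    """Convert an Elo rating to a strength γ; 400 points is a factor of 10."""
    return 10.0 ** ((elo - center) / 400.0)


def davidson_probs(elo_a: float, elo_b: float, theta: float):
    """Return (P(A wins), P(tie), P(B wins)) under the Davidson model."""
    gamma_a = elo_to_strength(elo_a)
    gamma_b = elo_to_strength(elo_b)
    tie_term = theta * math.sqrt(gamma_a * gamma_b)
    denom = gamma_a + gamma_b + tie_term
    return gamma_a / denom, tie_term / denom, gamma_b / denom


if __name__ == "__main__":
    p_a, p_tie, p_b = davidson_probs(1600, 1400, theta=1.0)
    print(f"A wins {p_a:.1%}, tie {p_tie:.1%}, B wins {p_b:.1%}")
```

With ratings of 1600 and 1400 and θ = 1.00, this yields roughly 53.2%, 29.9%, and 16.8%. Only the rating difference matters, so the choice of center cancels out.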

How Confident Are We?

A single rating number isn't the whole story. We resample game results 1,000 times with replacement (bootstrap), refit the rating model each time, and take the middle 95% as the confidence interval. Wider bars mean less certainty — usually from fewer games or more volatile results.
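
The resampling procedure can be sketched as a generic percentile bootstrap. To keep the example short and self-contained, the statistic being resampled here is a simple stand-in (A's average pair score, with a sweep worth 1, a tie 0.5, and a loss 0) rather than a refit of the rating model; in the benchmark, the `statistic` callable would be the rating fit. The function names, outcome labels, and demo data are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple


def bootstrap_ci(results: Sequence, statistic: Callable[[List], float],
                 n_resamples: int = 1000, level: float = 0.95,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap: resample with replacement, keep the middle `level`."""
    if not results:
        raise ValueError("need at least one result to resample")
    rng = random.Random(seed)
    n = len(results)
    estimates = []
    for _ in range(n_resamples):
        sample = [results[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(sample))
    estimates.sort()
    k = int(round((1.0 - level) / 2.0 * n_resamples))  # points cut per tail
    return estimates[k], estimates[n_resamples - 1 - k]


def mean_pair_score(pairs: List[str]) -> float:
    """Stand-in statistic (an assumption): A's average score per pair."""
    points = {"sweep_a": 1.0, "tie": 0.5, "sweep_b": 0.0}
    return sum(points[p] for p in pairs) / len(pairs)


if __name__ == "__main__":
    # Made-up pair outcomes, for demonstration only.
    demo = ["sweep_a"] * 12 + ["tie"] * 20 + ["sweep_b"] * 8
    low, high = bootstrap_ci(demo, mean_pair_score)
    print(f"point estimate {mean_pair_score(demo):.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Fewer pairs in `demo` would widen the interval, which mirrors the point that wider bars usually come from fewer games or more volatile results.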

Here are the current top models and their Elo ratings (chart axis: 1650 to 2300):

Gemini 3.1 Pro Preview: 2128
Gpt 5.2: 1926
Gpt 5 Mini: 1784
Grok 4.1 Fast: 1779

Predicted outcomes for Gemini 3.1 Pro Preview vs Gpt 5.2: Gemini 3.1 Pro Preview wins 53%, tie 30%, Gpt 5.2 wins 17%.
Overlapping confidence intervals. These models' CIs overlap — with more games, the ranking between them could shift. The 202-point gap isn't yet definitive.