Can AI Play Codenames?

We pit large language models against each other in the word association game Codenames, then rate them like chess players. Here's how the benchmark works.

OCEAN    CASTLE   DIAMOND  SHADOW   ROCKET
FOREST   BRIDGE   CRYSTAL  PHOENIX  HARBOR
VELVET   THUNDER  COMPASS  BEACON   GLACIER
MARBLE   FALCON   LANTERN  SUMMIT   BREEZE
CORAL    WRAITH   PRISM    SPARK    EMBER

How Codenames Works

Codenames is a team word game with two roles. The spymaster sees which words belong to their team and gives a one-word clue plus a number (e.g. "OCEAN 2") indicating how many words relate to that clue. The operative sees only the words and must guess which ones the clue refers to.

The board has 25 cards: 9 for the starting team, 8 for the other, 7 neutral, and 1 assassin. Guess the assassin and you lose instantly. First team to reveal all their cards wins.
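The card split above can be sketched as a small dealing routine. This is a minimal illustration, not the benchmark's own code: the role labels, the `deal_board` name, and the seeding scheme are assumptions made for the example.

```python
import random

# Card counts from the text: 9 + 8 + 7 + 1 = 25.
# The role labels themselves are hypothetical names for this sketch.
ROLE_COUNTS = {"first_team": 9, "second_team": 8, "neutral": 7, "assassin": 1}


def deal_board(words, seed=None):
    """Assign a hidden role to each of 25 distinct words; returns {word: role}."""
    if len(words) != sum(ROLE_COUNTS.values()):
        raise ValueError("a board needs exactly 25 words")
    if len(set(words)) != len(words):
        raise ValueError("board words must be distinct")
    rng = random.Random(seed)  # a fixed seed makes the board reproducible
    roles = [role for role, count in ROLE_COUNTS.items() for _ in range(count)]
    rng.shuffle(roles)
    return dict(zip(words, roles))
```

Passing the same seed twice yields the same layout, which is one simple way to replay a board.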

BEACH   WAVE    CASTLE
MOON    BOAT    SAND
NIGHT   CROWN   CORAL

Spymaster's View

The spymaster sees all card colors and must find connections between their team's words.

Teaching AI the Rules

Each LLM receives the current game state as a structured prompt and responds with a JSON action — a clue (word + number) for spymasters, or a guess for operatives. Invalid clues get retry feedback up to 3 times. Structured output is enforced via the instructor library.

🎮 Game State → 📝 Prompt → 🤖 LLM → 📦 JSON Output → ⚡ Game Action

3 retry attempts · Clue validation · Structured output via instructor
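The validate-and-retry loop described above might look roughly like the following. It is a standard-library sketch, not the benchmark's instructor-based code. The `ask_model` callable, the `Clue` fields, and the specific validity rules (a single alphabetic word that is not on the board, number of at least 1) are assumptions. It also reads "3 retries" as one initial attempt plus up to three more.

```python
import json
from dataclasses import dataclass

MAX_RETRIES = 3  # from the text; read here as retries after the first attempt


@dataclass
class Clue:
    word: str
    number: int


def validate_clue(raw, board_words):
    """Parse a JSON reply into a Clue; raise ValueError carrying feedback text."""
    try:
        data = json.loads(raw)
        word = str(data["word"]).strip().upper()
        number = int(data["number"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"reply is not a valid clue object: {exc}") from exc
    # Assumed rules for this sketch; the benchmark's exact checks are not given.
    if not word.isalpha():
        raise ValueError("the clue must be a single alphabetic word")
    if word in board_words:
        raise ValueError(f"{word} is on the board and cannot be used as a clue")
    if number < 1:
        raise ValueError("the number must be at least 1")
    return Clue(word, number)


def ask_for_clue(ask_model, prompt, board_words):
    """Call ask_model(prompt, feedback) until the reply validates."""
    feedback = None
    for _ in range(1 + MAX_RETRIES):
        raw = ask_model(prompt, feedback)
        try:
            return validate_clue(raw, board_words)
        except ValueError as err:
            feedback = str(err)  # sent back to the model on the next attempt
    raise RuntimeError(f"no valid clue after {MAX_RETRIES} retries: {feedback}")
```

Here `ask_model` stands in for whatever sends the prompt (plus any feedback) to the LLM and returns its raw text reply.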

Ensuring Fairness

Every matchup is played twice on the same board with swapped team colors, eliminating first-move and color advantage. These two games form a pair — the unit of competition. A model must win both games for a "sweep"; split results count as a tie.

Game 1: Model A (Red) vs Model B (Blue)
Game 2 (mirrored): Model A (Blue) vs Model B (Red)
Same board, swapped sides.

2-0: Sweep (A wins both)
1-1: Tie (split pair)
0-2: Swept (B wins both)
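The pair scoring above reduces to counting one model's wins across the two mirrored games. A minimal sketch, with the function name and outcome labels chosen for illustration:

```python
def pair_outcome(winners, model_a, model_b):
    """Score a mirrored pair from model A's point of view.

    winners: the winning model's name for each of the two games.
    Returns "sweep" (A wins both), "tie" (split), or "swept" (B wins both).
    """
    if len(winners) != 2:
        raise ValueError("a pair consists of exactly two games")
    if any(w not in (model_a, model_b) for w in winners):
        raise ValueError("each winner must be one of the two models")
    wins_a = sum(w == model_a for w in winners)
    return {2: "sweep", 1: "tie", 0: "swept"}[wins_a]
```

Because the pair, not the single game, is the unit of competition, each pair yields one three-way outcome for the rating model.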

From Wins to Rankings

Ratings use the Bradley-Terry model — a pairwise comparison system where each model has a hidden strength parameter. Win probability is proportional to relative strength. The Davidson extension adds a tie parameter θ because 1-1 pair results are common. Ratings are expressed on the familiar Elo scale (center = 1500).

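An example of how ratings and ties interact: with A rated 1600 Elo, B rated 1400 Elo (a rating difference of +200), and tie propensity θ = 1.00, the predicted outcomes are:

A wins: 53.2%
Tie: 29.9%
B wins: 16.8%

The rating difference runs from "B dominates" through "Equal" to "A dominates"; θ runs from "No ties" to "Many ties".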

$$P(A \text{ beats } B) = \frac{\gamma_A}{\gamma_A + \gamma_B + \theta \sqrt{\gamma_A \gamma_B}}$$

Davidson extension of the Bradley-Terry model
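The formula above can be evaluated directly from two Elo ratings. This sketch assumes strengths relate to Elo as γ = 10^((Elo − 1500) / 400), consistent with the scale described in this article. The tie and B-win probabilities use the same denominator, following the standard Davidson form. Function names are illustrative.

```python
import math


def elo_to_strength(elo, center=1500.0, scale=400.0):
    """Convert an Elo rating to a Bradley-Terry strength (assumed mapping)."""
    return 10 ** ((elo - center) / scale)


def davidson_probs(elo_a, elo_b, theta):
    """Return (P(A wins), P(tie), P(B wins)) under the Davidson model."""
    gamma_a = elo_to_strength(elo_a)
    gamma_b = elo_to_strength(elo_b)
    tie_term = theta * math.sqrt(gamma_a * gamma_b)
    denom = gamma_a + gamma_b + tie_term
    return gamma_a / denom, tie_term / denom, gamma_b / denom


# 1600 vs 1400 with theta = 1.0 gives roughly (0.532, 0.299, 0.168).
p_win, p_tie, p_loss = davidson_probs(1600, 1400, theta=1.0)
```

Setting θ to 0 recovers the plain Bradley-Terry win probability with no ties.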

Elo Scale Conversion

Log-strength values are centered at 1500 and scaled so that a 400-point gap corresponds to a tenfold difference in strength, just like chess ratings.
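As a rough illustration of that conversion, the sketch below maps fitted strengths to Elo-style numbers. It assumes "centered" means the mean log-strength maps to 1500; the benchmark may center differently, and the function name is hypothetical.

```python
import math


def strengths_to_elo(strengths, center=1500.0, scale=400.0):
    """Map {model: strength} to {model: Elo}.

    Assumption: the mean log10-strength across models is placed at `center`.
    A tenfold strength ratio then corresponds to a `scale`-point gap.
    """
    logs = {name: math.log10(gamma) for name, gamma in strengths.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {name: center + scale * (value - mean_log) for name, value in logs.items()}


# Two models whose strengths differ by a factor of 10 end up 400 points apart.
ratings = strengths_to_elo({"A": 10.0, "B": 1.0})
```

Only rating differences affect predicted outcomes, so the choice of center is a presentation convention.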

How Confident Are We?

A single rating number isn't the whole story. We resample game results 1,000 times with replacement (bootstrap), refit the rating model each time, and take the middle 95% as the confidence interval. Wider bars mean less certainty — usually from fewer games or more volatile results.
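The resampling procedure above can be sketched as a percentile bootstrap. To keep the example short, the `fit` argument is a stand-in: the benchmark refits the full rating model on each resample, whereas the usage note here plugs in a simple score share. All names are assumptions.

```python
import random


def bootstrap_ci(results, fit, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap: resample with replacement, refit, take the middle.

    results: list of observed outcomes (for example, one entry per pair).
    fit: callable mapping a resampled list to a single number.
    Returns (low, high) bounds of the central `level` interval.
    """
    rng = random.Random(seed)
    estimates = sorted(
        fit(rng.choices(results, k=len(results))) for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    low_index = max(0, round(tail * n_resamples))
    high_index = min(n_resamples - 1, round((1.0 - tail) * n_resamples) - 1)
    return estimates[low_index], estimates[high_index]


def score_share(outcomes):
    """Stand-in statistic: sweep = 1, tie = 0.5, swept = 0, averaged."""
    points = {"sweep": 1.0, "tie": 0.5, "swept": 0.0}
    return sum(points[o] for o in outcomes) / len(outcomes)
```

For instance, `bootstrap_ci(["sweep"] * 30 + ["tie"] * 50 + ["swept"] * 20, score_share)` returns an interval around the observed share of 0.55; fewer pairs would widen it.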

Here are the current top models, followed by one example matchup showing what the ratings predict and whether the gap is meaningful:

(Chart: Elo ratings with 95% confidence intervals, axis from 1650 to 2300.)

Gemini 3.1 Pro Preview: 2128
Gpt 5.2: 1926
Gpt 5 Mini: 1784
Grok 4.1 Fast: 1779

Predicted outcomes for Gemini 3.1 Pro Preview vs Gpt 5.2:

Gemini 3.1 Pro Preview wins: 53%
Tie: 30%
Gpt 5.2 wins: 17%

Overlapping confidence intervals. These models' CIs overlap — with more games, the ranking between them could shift. The 202-point gap isn't yet definitive.