Investigations into Wordle

Introduction

Back when the Wordle word game first started to get attention (and before it was assimilated by the New York Times), I dug into the source code, extracted the word lists, and did some analysis. NYT has changed the word lists at least slightly since then, so this information might be a little stale.

There is ample opportunity for spoilers here, to varying degrees. Consider the consequences before you read further, or click any word lists. My thinking is that, since all my analysis scripts were run against the combined word list (including allowed non-answer guess words), I'm not exploiting too much — there is no penalty for a guess not in the word list, after all. But people have different standards of gaming "purity".

Game Play

The overt rules of Wordle are:

Wordle wants you to guess a 5-letter word; it gives you six tries.

Each guess must be a word recognized and allowed by Wordle. If your guess isn't recognized, Wordle shows "Not in word list" and lets you edit your guess and try again. This doesn't count against your six guesses.

After each accepted guess, Wordle highlights each letter: Green means you have the right letter in the right position. Yellow means the letter occurs in the answer word, but not that position. Grey means the letter does not occur in the answer word.

Every day presents a different answer word. Everyone plays the same word, every day.
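
The highlighting rule can be sketched in a few lines of Python. This is a minimal illustration, not Wordle's actual code: the function name score_guess and the G/Y/- output encoding are made up here, and the handling of repeated letters (each letter in the answer can light up at most one letter of the guess, greens first) is an assumption about how duplicates are resolved, since the rules above don't spell it out.

```python
from collections import Counter

def score_guess(guess: str, answer: str) -> str:
    """Return one mark per position: G (green), Y (yellow), - (grey)."""
    guess, answer = guess.lower(), answer.lower()
    marks = ["-"] * len(guess)

    # Answer letters not matched exactly; only these can produce a yellow.
    unmatched = Counter(a for g, a in zip(guess, answer) if g != a)

    # First pass: right letter, right position.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            marks[i] = "G"

    # Second pass: right letter, wrong position, while unmatched copies remain.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and unmatched[g] > 0:
            marks[i] = "Y"
            unmatched[g] -= 1

    return "".join(marks)

print(score_guess("champ", "chomp"))
print(score_guess("until", "title"))
```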

Behind the Scenes

Wordle was originally created by Josh Wardle. He later sold all rights to the game to the New York Times Company.

All of Wordle is contained in a single JavaScript file. That includes the word lists. The HTML page is just a wrapper that invokes the JavaScript. (The NYT has since added quite a bit to the wrapper, but as of 2022 September the basic architecture looks the same: the game itself is still in the one file, with internal word lists.)

The answer list is pre-determined and hard coded. Wordle does simple math on the current date to index into the list. The first word was played on 2021 June 19 (variable Ha in the code; JavaScript numbers months from zero, so the month value 5 there means June). The browser's local clock is used, so the word rolls over at midnight local time. If your computer's clock is wrong, you'll get different words at different times vs everyone else.
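
The date math amounts to counting whole days since the first play date. Here is a rough Python sketch of that idea. The name answer_index is invented for illustration, the start date is passed in rather than hard-coded, and wrapping around with a modulo once the list runs out is an assumption rather than something confirmed above.

```python
from datetime import date

def answer_index(today: date, first_day: date, n_answers: int) -> int:
    """Index into the answer list for a given local calendar date."""
    days_elapsed = (today - first_day).days
    # Assumption: wrap around when the list is exhausted.
    return days_elapsed % n_answers

# Illustrative dates only: ten days after a start date gives index 9.
print(answer_index(date(2022, 1, 10), date(2022, 1, 1), 2300))

# date.today() reads the local clock, which is why a wrong clock
# (or a different time zone) yields a different word.
print(answer_index(date.today(), date(2022, 1, 1), 2300))
```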

A larger list of allowable words is also hard-coded. The answer list is disjoint from it; the code checks both lists before rejecting a word. Answers do not repeat.

Analysis and Strategy

Word Counts

Depending on who you ask, there are anywhere from 25,000 to 160,000 5-letter words in the English language.

Using the "american-english-insane" list shipped with Debian (file dated 2020 February 29), I get a count of 26,567 with this simple search:

egrep -c '^[a-zA-Z]{5}$' /usr/share/dict/american-english-insane

Wordle's combined list of words (answers and allowed) is just under 13,000 words long, considerably smaller than any of the above.

The list of answer words was purportedly chosen by Wardle's girlfriend, by selecting more recognizable words. So guessing more common words is a good general strategy. Some answers are more obscure than others, but very esoteric words rarely appear.

There are just over 2300 answer words. At one word a day, that's over six years' worth of answers, so it should last until mid-to-late 2027.
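
The arithmetic behind that estimate is simple enough to check by machine. In this sketch the function name run_length is hypothetical, 2300 is the rounded-down answer count from above, and the start date is only a placeholder for "some day in mid-2021"; substitute the real first-play date for a precise end date.

```python
from datetime import date, timedelta

def run_length(n_answers: int, first_day: date) -> tuple[float, date]:
    """Years of play at one word a day, and the date of the last word."""
    years = n_answers / 365.25
    last_day = first_day + timedelta(days=n_answers - 1)
    return years, last_day

first_day = date(2021, 6, 1)  # placeholder mid-2021 start date
years, last_day = run_length(2300, first_day)
print(f"{years:.1f} years, last word on {last_day}")
```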

The list of allowable guesses (excluding answers) is about 10,700 words long. The source of this list isn't publicly known, from what I can find.

Letter Frequency Counts

The frequency of occurrences of letters in English generally is fairly well established, but Wordle is different. We're limiting the corpus to just 5-letter words. Since there are rules (or at least, patterns) in English word construction, that changes things.

I wrote a Perl script (ltrposhz.pl) to count letter frequencies in each position, and ran it on the combined word list. Results in wordle.counts. Script and results are linked below.
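
For readers who would rather not read Perl, the per-position count can be approximated in Python along these lines. This is not a port of ltrposhz.pl; position_counts is a made-up name, and the four sample words stand in for the combined word list.

```python
from collections import Counter

def position_counts(words, length=5):
    """One Counter per slot, tallying which letters appear in that slot."""
    counts = [Counter() for _ in range(length)]
    for word in words:
        word = word.strip().lower()
        if len(word) != length or not word.isalpha():
            continue  # skip blank lines and anything not a 5-letter word
        for pos, letter in enumerate(word):
            counts[pos][letter] += 1
    return counts

# Tiny stand-in list; in practice, pass the lines of the combined word list.
sample = ["arose", "until", "champ", "chomp"]
counts = position_counts(sample)
for pos, counter in enumerate(counts, start=1):
    print(pos, counter.most_common(3))
```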

Starter Words

The big Wordle strategy question is: What is the best word to use for one's first guess?

A proper analysis would take into account individual letter frequencies for each position, as well as likelihood of the guess chosen to lead to better revelations, plus consideration of vowels vs consonants.

I did not do that. Developing that algorithm was more than I was willing to attempt (and possibly beyond my ken; I'm more of a systems engineer type, not a pure algorithmist).

I did, however, come up with some quick-and-dirty scripts that take a brute-force-and-ignorance approach to the question. That was good enough for my purposes.

I wrote pick.pl in an attempt to find combinations of words that make optimum use of guesses and use the most common letters. It proved to be a little too brutish and ignorant, but did produce pick-2.out.
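
To give a flavor of the brute-force idea (this is not pick.pl, just a guess at the general shape), the Python sketch below scores every pair of words that share no letters by how common their letters are across the list. The name best_pairs and the scoring rule (number of words containing each letter, summed over the ten letters) are assumptions made for illustration. On the full combined list this loops over tens of millions of pairs, so expect it to be slow.

```python
import heapq
from collections import Counter
from itertools import combinations

def best_pairs(words, top=5):
    """Highest-scoring pairs of words that have no letters in common."""
    # Score of a letter: how many words in the list contain it.
    freq = Counter(ch for w in words for ch in set(w))
    # Only words without repeated letters can cover five distinct letters.
    candidates = [w for w in words if len(set(w)) == len(w)]

    def scored_pairs():
        for a, b in combinations(candidates, 2):
            if set(a).isdisjoint(b):
                yield (sum(freq[ch] for ch in a + b), a, b)

    return heapq.nlargest(top, scored_pairs())

sample = ["arose", "until", "champ", "gawky", "geese"]
for score, first, second in best_pairs(sample, top=3):
    print(score, first, second)
```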

I fell back to variations on simple grep commands to find words made of the first X most common characters, minus characters already chosen for prior words. I then ran the output through scripts that eliminate words with repeated letters (norep.pl) and words that use more than one vowel (onevowel.pl). For example:

egrep '^[aeioudycpmhgbkfwvzjxq]{5}$' combined.dict | ./norep.pl | ./onevowel.pl
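
The same pipeline can be written as one Python filter. This is a sketch of what the grep and the two scripts appear to do from their descriptions, not a translation of norep.pl or onevowel.pl. The function names are invented, and treating only A, E, I, O, U as vowels (so Y does not count) is an assumption.

```python
VOWELS = set("aeiou")  # assumption: Y is not counted as a vowel

def no_repeats(word: str) -> bool:
    """True if no letter occurs more than once."""
    return len(set(word)) == len(word)

def at_most_one_vowel(word: str) -> bool:
    """True if the word uses zero or one vowel."""
    return sum(ch in VOWELS for ch in word) <= 1

def candidates(words, allowed_letters: str):
    """Five-letter words built only from allowed_letters that pass both filters."""
    allowed = set(allowed_letters)
    return [
        w for w in words
        if len(w) == 5 and set(w) <= allowed
        and no_repeats(w) and at_most_one_vowel(w)
    ]

sample = ["champ", "gawky", "fudgy", "queue", "arose", "daddy"]
print(candidates(sample, "aeioudycpmhgbkfwvzjxq"))
```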

Conclusion — My Strategy

I settled on starting with two words: AROSE, then UNTIL. That gets all five vowels and five of the most common consonants (R, S, N, T, L). AROSE also puts letters in their most likely positions for each slot.

Unless those two hit extraordinarily well, I almost always follow up with CHAMP/CHOMP/CHUMP/CHIMP, depending on which vowels have hit and/or if their position(s) are known. That eliminates some more of the most common letters, and usually helps narrow the position of a vowel.

My other common probe is GAWKY/FUDGY. I'll use that for a fourth sacrifice if I'm not feeling inspired. If my vowel hits are poor after the first two, I might try this before CH?MP, to test for Y as a vowel.

I am bad at anagrams generally (if I try to think of a word with some particular subset of letters, my brain just makes fart noises) so a process-of-elimination strategy suits me in Wordle.

Results / Output

Word lists

Analysis Results

Scripts