A spelling checker to rule them all
Judging from my readers' comments, it looks like my blog posts are plagued with typos - probably a side effect of writing at night. Fear not, however, as Thomas Bayes and Peter Norvig come to our rescue with a deceptively simple - if functional - spelling checker.
Recipe difficulty:
- Statistics: 🔥 - easy peasy
- Technical: 🔥🔥 - nothing crazy, just needs some familiarity with functional composition and list comprehensions.
- Time required: 🕜 - around 1 hour from start to finish.
What it is
Well, a spelling checker.
Why it matters
Writing a spelling checker in a Bayesian framework is a fun and lightweight dive into rule composition; and it is pretty damn efficient!
The framework basically works like this: for each word, we generate its "genetic" evolution using the following basic operators, as outlined in Peter Norvig's seminal étude:
- delete: we remove one letter
- swap: we transpose two adjacent letters
- replace: we substitute one letter with any other letter of the alphabet
- insert: we insert a letter at any position.
By composing these basic operators we can move further and further away from the original word, at an increasing computational cost: any word can be degenerated step by step - for instance, 'home'→'hoem'→'hoeme' in two steps (a swap followed by an insert).
We can of course run this the other way around: each operator has an inverse (an insert undoes a delete, a swap undoes a swap, and so on), so starting from the misspelled 'hoeme' we can reach 'home' in two steps. By intersecting the set of these results with a dictionary of the English language, we can find candidate replacements for any word.
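As a tiny illustration (a throwaway sketch, not the implementation we'll build below), composing a delete with a swap is enough to recover 'home' from the misspelling 'hoeme':

# Throwaway sketch: one delete followed by one swap recovers 'home'
# from 'hoeme'. The real implementation below generates all four
# operators at once and intersects the result with a dictionary.
def deletes(word):
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def swaps(word):
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(len(word) - 1)}

two_steps = {v2 for v1 in deletes('hoeme') for v2 in swaps(v1)}
print('home' in two_steps)  # True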
Once we have generated a list of candidates for a certain (potential) misspelling, we need to rank them and find the most likely one. To do so, we can consider Bayes' rule, which says:
\[ P(word \vert misspelling) = \frac{P(misspelling \vert word)\times P(word)}{P(misspelling)}\]
Writing $W$ for the candidate word and $M$ for the misspelling: the denominator $P(M)$ is the same for every candidate, so it doesn't affect the ranking and we can drop it, leaving us with:
\[ P(W \vert M) \propto P(M \vert W)\times P(W) \]
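Concretely, the correction we will eventually return is just the candidate word that maximizes this score:
\[ \hat{W} = \underset{W \in \text{candidates}}{\arg\max}\; P(M \vert W)\times P(W) \]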
We can calculate $P(W)$ by simply looking at the relative frequency of words in the English language:
import string
import numpy as np

ALPHABET = string.ascii_lowercase

def dictionary_from_file(path, min_freq):
    """Read a word-frequency file (one 'word count' pair per line),
    keeping only words above a minimum frequency."""
    dictionary = {}
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            word, freq = line.split()
            if int(freq) > min_freq:
                dictionary[word] = int(freq)
    return dictionary

def normalized_dictionary(dictionary, total_frequency=None):
    """Turn raw counts into log relative frequencies, i.e. log P(word)."""
    if total_frequency is None:
        total_frequency = sum(dictionary.values())
    dictionary = dictionary.copy()
    for key in dictionary.keys():
        dictionary[key] = np.log(dictionary[key] / total_frequency)
    return dictionary

dictionary = dictionary_from_file('wordsfreq.txt', 1000)
dictionary_normalized = normalized_dictionary(dictionary)
dictionary_set = set(dictionary.keys())
We get a dictionary like {"the": 0.0332, "a": 0.0122, ...}, with the relative frequency of each word (stored as a log-probability in dictionary_normalized). We could improve the model by working with bi- or tri-grams, but for simplicity's sake, let's stick to the assumption that context is not relevant for our spelling checker.
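As a quick sanity check (the exact value depends on the frequency file you feed it), we can exponentiate a log-probability to get a relative frequency back:

# dictionary_normalized stores log-probabilities, so np.exp gives back
# the relative frequency of a word; the number depends on your word list.
print(np.exp(dictionary_normalized['the']))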
To generate the variations, as mentioned, we just compose the basic operators.
def get_variations(word, dictionary_set=None):
    """All variations of `word` within one edit (delete, swap, replace, insert)."""
    # len(word) + 1 splits, so that letters can also be inserted at the end
    split = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    delete = [left + right[1:] for left, right in split if right]
    swap = [left + right[1] + right[0] + right[2:] for left, right in split if len(right) > 1]
    replace = [left + letter + right[1:] for left, right in split if right for letter in ALPHABET]
    insert = [left + letter + right for left, right in split for letter in ALPHABET]
    variations = set(delete + swap + replace + insert)
    if dictionary_set is not None:
        return dictionary_set.intersection(variations)
    return variations
def get_variations_E2(word, dictionary_set=None):
    """All variations of `word` within two edits: the variations of the variations."""
    distance1 = get_variations(word)
    all_words = set()
    for variation in distance1:
        all_words.update(get_variations(variation))
    if dictionary_set is not None:
        return dictionary_set.intersection(all_words)
    return all_words
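Going back to our toy example, a quick spot check (assuming 'home' made it into your dictionary) that the misspelled 'hoeme' can indeed be walked back to 'home':

# 'hoeme' is within two edits of 'home', so 'home' should show up among
# the dictionary candidates (assuming 'home' is in dictionary_set).
print('home' in get_variations_E2('hoeme', dictionary_set))  # True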
Now we need to get $P(M|W)$, our error model. We assign relative weights to the candidates at edit distance one and two: a higher value for the e1 weight means that words one step away from the misspelling will score higher than those that require two mistakes in the same word (as it should be).
Without further ado:
def get_normalized_variations(word, dictionary_normalized=None, e1=2, e2=1):
    """Error model P(M|W): weight dictionary words at edit distance one
    and two, then normalize and take the log."""
    dictionary_set = set(dictionary_normalized.keys())
    distance1 = get_variations(word, dictionary_set)
    distance2 = get_variations_E2(word, dictionary_set)
    # distance-1 words also appear in distance2, so update() gives them the higher weight
    dist_dict = dict.fromkeys(distance2, e2)
    dist_dict.update(dict.fromkeys(distance1, e1))
    total_frequency = sum(dist_dict.values())
    for key in dist_dict.keys():
        dist_dict[key] = np.log(dist_dict[key] / total_frequency)
    return dist_dict
def score_candidates(word, dictionary_normalized, e1=2, e2=1):
    """Score every candidate with P(M|W) * P(W), working in log space."""
    p_m_given_w = get_normalized_variations(word, dictionary_normalized, e1, e2)
    p_w = {w: dictionary_normalized[w] for w in p_m_given_w.keys()}
    scores = {w: np.exp(p_m_given_w[w] + p_w[w]) for w in p_m_given_w.keys()}
    if len(scores) > 0:
        return scores
    # no candidate found within two edits: return a placeholder
    return {'UNK': 1}
Note the log trick (np.log): since we'll be multiplying very small numbers, we risk numerical underflow. By converting our probabilities to logs, summing them instead of multiplying, and converting back with np.exp at the end, we get the same result without our computer freaking out...
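A toy illustration of why this matters (any sufficiently long product of small probabilities will do):

import numpy as np

# Multiplying many tiny probabilities underflows to exactly 0.0,
# while the sum of their logs stays perfectly usable for ranking.
probs = np.full(200, 1e-5)
print(np.prod(probs))         # 0.0  -> numerical underflow
print(np.sum(np.log(probs)))  # about -2302.6, no underflow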
Finally, we get the best candidate correction using a couple of helper functions: we rank all candidates by their score $ P(W \vert M) \propto P(M \vert W)\times P(W) $, pick the highest-scoring one (if there is any), and return it.
import operator

def get_best_candidate(word, dictionary_normalized, e1=2, e2=1):
    """Return the candidate with the highest score."""
    scored_suggestions = score_candidates(word, dictionary_normalized, e1, e2)
    return max(scored_suggestions.items(), key=operator.itemgetter(1))[0]

def correct(word, dictionary_normalized, e1=2, e2=1):
    """Leave known words alone; otherwise return the best-scoring correction."""
    if word.lower() not in dictionary_normalized:
        return get_best_candidate(word, dictionary_normalized, e1, e2)
    return word
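A quick spot check on single words (assuming 'home' is in the word list and clears the frequency threshold):

# 'home' is a dictionary word and is left alone, while 'hoem' should be
# corrected to it - most likely, since the exact winner depends on the
# frequencies in your word list.
print(correct('home', dictionary_normalized))  # 'home'
print(correct('hoem', dictionary_normalized))  # most likely 'home'

To check an entire text, we just tokenize it and run correct on every word: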
def spell_check_text(text, dictionary_normalized, e1=2, e2=1, min_word_length=1):
    """Tokenize the text, check every word, and return a {misspelling: correction} dict."""
    # strip punctuation, lowercase, and keep only alphabetic characters
    replace_chars = "\\`*_{}[]()>#+-.,!$"
    for c in replace_chars:
        text = text.replace(c, " ")
    text = " ".join(text.lower().split())
    words = ''.join([x for x in text if x in ALPHABET + ' ']).split(' ')
    words = [w for w in words if w != '' and len(w) > min_word_length]
    corrections = {}
    for word in words:
        corr_word = correct(word, dictionary_normalized, e1, e2)
        if word != corr_word:
            corrections[word] = corr_word
    return corrections
So, does it work?
text = '''Jueedging fraom my raders' commments,
it loeks like my blog posts are plaegued with typographic mispelled
- probably a side effect of writing at night.'''
spell_check_text(text, dictionary_normalized, e1=.95, e2=.05)
Misspelling | Correction
---|---
'jueedging' | 'judging'
'fraom' | 'from'
'raders' | 'readers'
'commments' | 'comments'
'loeks' | 'looks'
'plaegued' | 'plagued'
'mispelled' | 'misspelled'
I'd say it does! That's no guarantee my next blog posts will be typo-free, but it sure is a step forward! :)