Build a Markov Chain Sentence Generator in 20 lines of Python
A bot who can write a long letter with ease, cannot write ill.
—Jane Austen, Pride and Prejudice
This post walks you through how to write a Markov Chain from scratch with Python in order to generate completely new sentences that resemble English.
The text we’ll be using to build the Markov Chain is Pride and Prejudice by Jane Austen. You can follow along here or grab a runnable notebook version of this post on Colab.
Setup
First download the full text of Pride and Prejudice.
# Download Pride and Prejudice and cut off header.
!curl https://www.gutenberg.org/files/1342/1342-0.txt | tail -n+32 > /content/pride-and-prejudice.txt
# Preview file.
!head -n 10 /content/pride-and-prejudice.txt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 707k 100 707k 0 0 1132k 0 --:--:-- --:--:-- --:--:-- 1130k
PRIDE AND PREJUDICE
By Jane Austen
Chapter 1
It is a truth universally acknowledged, that a single man in possession
Add some necessary imports.
import numpy as np
import random
import re
from collections import defaultdict
Building the Markov Chain
Read the file into a string and split the words into a list. Then, we can use Python’s handy defaultdict to create the Markov Chain.
To build the chain, walk through the text one word at a time: use the previous word as the outer key and increment the counter for the current word in the inner dictionary. This produces a dictionary in which each key points to every word that has ever followed it, along with how many times it did.
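To make that structure concrete, here is a quick sketch on a toy sentence (illustration only, not part of the main script; it uses the defaultdict import from above):
# Toy example: build the nested-dict chain from a six-word sentence.
toy_words = 'the cat sat on the mat'.split()
toy_graph = defaultdict(lambda: defaultdict(int))
for prev, curr in zip(toy_words, toy_words[1:]):
    toy_graph[prev][curr] += 1
print({k: dict(v) for k, v in toy_graph.items()})
# {'the': {'cat': 1, 'mat': 1}, 'cat': {'sat': 1}, 'sat': {'on': 1}, 'on': {'the': 1}}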
# Read text from file and tokenize.
path = '/content/pride-and-prejudice.txt'
with open(path) as f:
    text = f.read()
tokenized_text = [
    word
    for word in re.split(r'\W+', text)
    if word != ''
]
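As a quick sanity check (the sentence below is just an illustration), \W+ splits on any run of non-word characters, so punctuation disappears; the filter in the list comprehension drops the empty string that appears when the text ends with punctuation:
# Illustration only: punctuation is discarded, and the trailing '.'
# leaves an empty string that the filter above removes.
print(re.split(r'\W+', 'Mr. Bennet replied that he had not.'))
# ['Mr', 'Bennet', 'replied', 'that', 'he', 'had', 'not', '']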
# Create graph.
markov_graph = defaultdict(lambda: defaultdict(int))

last_word = tokenized_text[0].lower()
for word in tokenized_text[1:]:
    word = word.lower()
    markov_graph[last_word][word] += 1
    last_word = word
# Preview graph.
limit = 3
for first_word in ('the', 'by', 'who'):
    next_words = list(markov_graph[first_word].keys())[:limit]
    for next_word in next_words:
        print(first_word, next_word)
the feelings
the minds
the surrounding
by jane
by a
by the
who has
who waited
who came
Generating Sentences
Now for the fun part. Define a function to help us walk the chain. It starts at a given (or random) word, looks up the possible next words, and picks one with a weighted random choice via np.random.choice, so that more frequent followers are chosen more often.
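The weighting step is just raw counts normalized into probabilities. A tiny illustration with made-up numbers (not taken from the book):
# Suppose 'cat' followed 'the' three times and 'mat' once.
counts = np.array([3, 1], dtype=np.float64)
probs = counts / counts.sum()                     # array([0.75, 0.25])
print(np.random.choice(['cat', 'mat'], p=probs))  # 'cat' about 75% of the time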
def walk_graph(graph, distance=5, start_node=None):
    """Returns a list of words from a randomly weighted walk."""
    if distance <= 0:
        return []

    # If not given, pick a start node at random.
    if not start_node:
        start_node = random.choice(list(graph.keys()))

    weights = np.array(
        list(graph[start_node].values()),
        dtype=np.float64)
    # Normalize word counts to sum to 1.
    weights /= weights.sum()

    # Pick a destination using weighted distribution.
    choices = list(graph[start_node].keys())
    chosen_word = np.random.choice(choices, None, p=weights)

    return [chosen_word] + walk_graph(
        graph, distance=distance-1,
        start_node=chosen_word)
for i in range(10):
    print(' '.join(walk_graph(
        markov_graph, distance=12)), '\n')
was with each other of communication it kitty and such a doubt
when the country ensued made for she cried miss elizabeth that had
it would have taken a valuable neighbour lady s steady friendship replied
on these recollections that he considered as well is but her companions
and laugh that i only headstrong and what lydia s mr darcy
till supper his it a part us yesterday se nnight elizabeth had
on that he that whatever she thus addressed them that he might
countenance of both joy jane when it which mr darcy was suddenly
woods to me you know him at five years longer be adapted
unless charlotte s letter though she did before they must give her
And that’s it for a basic Markov Chain! There are a lot of enhancements that could be made from here, but hopefully this demonstrates that you can implement a Markov Chain text generator with just a couple dozen lines of Python.
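For example, here is a minimal sketch of one such enhancement: a second-order chain that keys on the previous two words instead of one. It reuses tokenized_text and the imports from above, and it tends to produce more coherent phrases at the cost of variety.
# Sketch: second-order chain keyed on the previous two words.
bigram_graph = defaultdict(lambda: defaultdict(int))

words = [w.lower() for w in tokenized_text]
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    bigram_graph[(w1, w2)][w3] += 1

def walk_bigram_graph(graph, distance=12):
    """Returns a list of words from a weighted walk over two-word states."""
    state = random.choice(list(graph.keys()))
    result = list(state)
    for _ in range(distance):
        if state not in graph:  # Only happens at the very end of the text.
            break
        choices = list(graph[state].keys())
        weights = np.array(list(graph[state].values()), dtype=np.float64)
        weights /= weights.sum()
        next_word = np.random.choice(choices, p=weights)
        result.append(next_word)
        state = (state[1], next_word)
    return result

print(' '.join(walk_bigram_graph(bigram_graph)))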