Building a Markov Chain Sentence Generator in 20 Lines of Python
A bot who can write a long letter with ease, cannot write ill.
—Jane Austen, Pride and Prejudice
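This post walks you step by step through writing a Markov chain from scratch in Python, in order to generate brand-new English sentences that read as if a real person wrote them. Pride and Prejudice by Jane Austen is the text we use to build the Markov chain. A runnable notebook version is available on Colab.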
Setup
First, download the full text of Pride and Prejudice.
# Download Pride and Prejudice and cut off the header.
!curl https://www.gutenberg.org/files/1342/1342-0.txt | tail -n+32 > /content/pride-and-prejudice.txt
# Preview the file.
!head -n 10 /content/pride-and-prejudice.txt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 707k 100 707k 0 0 1132k 0 --:--:-- --:--:-- --:--:-- 1130k
PRIDE AND PREJUDICE
By Jane Austen
Chapter 1
It is a truth universally acknowledged, that a single man in possession
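Outside a notebook, the `!curl ... | tail -n+32` shell line is not available. The following is a rough Python equivalent, offered as a sketch rather than the notebook's own method. The helper names `drop_header` and `download_book` and the local output path are hypothetical, and the assumption that the header is exactly the first 31 lines comes only from the `tail -n+32` call above.

```python
import urllib.request

URL = 'https://www.gutenberg.org/files/1342/1342-0.txt'


def drop_header(text, first_kept_line=32):
    """Keep everything from `first_kept_line` (1-indexed) on, like `tail -n+32`."""
    lines = text.splitlines(keepends=True)
    return ''.join(lines[first_kept_line - 1:])


def download_book(url=URL, path='pride-and-prejudice.txt'):
    """Download the book, strip the header lines, and save the rest to `path`."""
    with urllib.request.urlopen(url) as response:
        raw = response.read().decode('utf-8')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(drop_header(raw))
    return path
```

Calling `download_book()` performs the network request. If you save the file somewhere other than `/content/pride-and-prejudice.txt`, adjust the `path` variable used later to match.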
Add a few necessary imports.
import collections
import random
import re
import numpy as np
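Building the Markov chain
Read the file into a string, then split it into a list of words. We can then use Python's handy defaultdict to create the Markov chain.
To build the chain, take each word in the text and insert it into a dictionary keyed by the previous word, incrementing that word's counter in the inner dictionary each time. The result is a dictionary in which each key points to every word that followed it, along with the number of times it did.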
# Read the text from the file and tokenize it.
path = '/content/pride-and-prejudice.txt'
with open(path, encoding='utf-8') as f:
    text = f.read()
tokenized_text = [
    word
    for word in re.split(r'\W+', text)
    if word != ''
]
# Create the graph: previous word -> Counter of the words that follow it.
markov_graph = collections.defaultdict(collections.Counter)
last_word = tokenized_text[0].lower()
for word in tokenized_text[1:]:
    word = word.lower()
    markov_graph[last_word].update([word])
    last_word = word
# Preview the graph.
limit = 3
for first_word in ('the', 'by', 'who'):
    next_words = list(markov_graph[first_word].keys())[:limit]
    for next_word in next_words:
        print(first_word, next_word)
the feelings
the minds
the surrounding
by jane
by a
by the
who has
who waited
who came
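To make the shape of the graph concrete, here is the same construction on a tiny made-up sentence. The names `toy_text`, `toy_tokens`, and `toy_graph` are illustrative only, and the loop uses `zip` over adjacent pairs, which should be equivalent to tracking `last_word` by hand.

```python
import collections
import re

toy_text = 'the cat sat on the mat and the cat slept'

# Same tokenization idea: split on runs of non-word characters.
toy_tokens = [w for w in re.split(r'\W+', toy_text) if w != '']

# Outer key: previous word. Inner Counter: how often each word followed it.
toy_graph = collections.defaultdict(collections.Counter)
for previous_word, word in zip(toy_tokens, toy_tokens[1:]):
    toy_graph[previous_word.lower()][word.lower()] += 1

print(dict(toy_graph['the']))  # {'cat': 2, 'mat': 1}
print(dict(toy_graph['cat']))  # {'sat': 1, 'slept': 1}
```

Here "the" was followed by "cat" twice and by "mat" once, so a weighted walk starting from "the" would pick "cat" about two times out of three.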
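Generating sentences
Now for the fun part. Define a function to help us walk the chain. It starts at a random word, looks up the possible choices for the next word, and uses np.random.choice to make a weighted random choice among them.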
def walk_graph(graph, distance=5, start_node=None):
    """Returns a list of words from a weighted random walk."""
    if distance <= 0:
        return []
    # If not given, pick a start node at random.
    if not start_node:
        start_node = random.choice(list(graph.keys()))
    weights = np.array(
        list(graph[start_node].values()),
        dtype=np.float64)
    # Normalize the word counts to sum to 1.
    weights /= weights.sum()
    # Pick a destination using the weighted distribution.
    choices = list(graph[start_node].keys())
    chosen_word = np.random.choice(choices, None, p=weights)
    return [chosen_word] + walk_graph(
        graph, distance=distance-1,
        start_node=chosen_word)
for i in range(10):
    print(' '.join(walk_graph(
        markov_graph, distance=12)), '\n')
was with each other of communication it kitty and such a doubt
when the country ensued made for she cried miss elizabeth that had
it would have taken a valuable neighbour lady s steady friendship replied
on these recollections that he considered as well is but her companions
and laugh that i only headstrong and what lydia s mr darcy
till supper his it a part us yesterday se nnight elizabeth had
on that he that whatever she thus addressed them that he might
countenance of both joy jane when it which mr darcy was suddenly
woods to me you know him at five years longer be adapted
unless charlotte s letter though she did before they must give her
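The weighted choice is the heart of the walk, so it may help to see it in isolation. The sketch below uses made-up successor counts (not taken from the book) and NumPy's `default_rng` generator with a fixed seed so the run is repeatable. The post itself calls `np.random.choice`, which accepts the same `p` argument.

```python
import collections

import numpy as np

# Hypothetical successor counts for a single word.
successors = collections.Counter({'darcy': 6, 'bingley': 3, 'collins': 1})

choices = list(successors.keys())
weights = np.array(list(successors.values()), dtype=np.float64)
weights /= weights.sum()  # counts become probabilities: 0.6, 0.3, 0.1

# Draw many samples and compare observed frequencies with the weights.
rng = np.random.default_rng(seed=0)
samples = rng.choice(choices, size=10_000, p=weights)
observed = collections.Counter(samples.tolist())
for word in choices:
    print(word, observed[word] / len(samples))
```

The printed frequencies should land close to 0.6, 0.3, and 0.1, which is why frequent word pairs in the book show up more often in the generated sentences.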
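And that's a basic Markov chain! There are many enhancements you could make from here, but hopefully this shows that you can implement a Markov chain text generator in just a few dozen lines of Python.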
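One small enhancement, sketched here as an assumption rather than as part of the original notebook: the sample output contains fragments such as "lady s" and "se nnight" because splitting on `\W+` also splits on apostrophes. A tokenizer that keeps apostrophes inside words might look like the following. The function name `tokenize_keep_apostrophes` is hypothetical, and the pattern accepts both the straight and the curly apostrophe since it is not certain which one the file uses.

```python
import re


def tokenize_keep_apostrophes(text):
    """Lowercase `text` and return its words, keeping internal apostrophes."""
    # One or more letters, optionally followed by apostrophe-plus-letters groups.
    # Both the straight (') and the curly (’) apostrophe are accepted.
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


print(tokenize_keep_apostrophes("Lady Catherine's se'nnight, replied Elizabeth."))
# ['lady', "catherine's", "se'nnight", 'replied', 'elizabeth']
```

Unlike the `\W+` split, this pattern also drops digits and any non-ASCII letters, which may or may not matter for a given text.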