The Most Commonly Used Chinese Words
I think the most effective way to learn a language is to start with the words people use most often day to day. Studying words in order of frequency means each new word you learn adds as much understanding of the language as possible.
To that end, I took a dataset of Weibo (China’s Twitter) posts and ranked every word that appeared in the dataset in order of frequency.
The CSV contains a “back” column so you can also load it into Anki and use it as flashcards. The dataset is thanks to Dr. Minlie Huang at Tsinghua University. You can find the dataset here.
Let me know if this is useful!
Code to Generate Top Word List
If you’re interested in how this was generated and want to remix the code, it’s here on Colab, as well as written out below.
import pandas as pd
!pip install chinese
!pip install pinyin
Collecting chinese
Downloading https://files.pythonhosted.org/packages/15/fe/35c1cd7792f0c899fbeae66d35491721cae6be6d8a128d4f77e6e3479b3a/chinese-0.2.1-py3-none-any.whl (12.6MB)
Collecting pynlpir
Downloading https://files.pythonhosted.org/packages/7c/66/79d353119143f92fdf80aea0e8b5b8289baf60708a3202fc7a4d3a530d0e/PyNLPIR-0.6.0-py2.py3-none-any.whl (13.1MB)
[?25hRequirement already satisfied: jieba in /usr/local/lib/python3.6/dist-packages (from chinese) (0.42.1)
Requirement already satisfied: click in /usr/local/lib/python3.6/dist-packages (from pynlpir->chinese) (7.0)
Installing collected packages: pynlpir, chinese
Successfully installed chinese-0.2.1 pynlpir-0.6.0
Requirement already satisfied: pinyin in /usr/local/lib/python3.6/dist-packages (0.4.0)
from chinese import ChineseAnalyzer
analyzer = ChineseAnalyzer()
result = analyzer.parse('就')
print(result.tokens())
print(result.pinyin())
import pinyin
import pinyin.cedict
['就']
jiù
!curl -O http://coai.cs.tsinghua.edu.cn/media/files/ecm_train_data.zip
!unzip ecm_train_data.zip
!du -csh *
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 44.7M  100 44.7M    0     0   294k      0  0:02:35  0:02:35 --:--:--  218k
Archive: ecm_train_data.zip
inflating: train_data.json
45M ecm_train_data.zip
55M sample_data
126M train_data.json
225M total
import json
with open('train_data.json') as f:
    weibo_posts = json.load(f)
print('Num posts:', len(weibo_posts))
Num posts: 1119207
# Preview Weibo posts.
for post in weibo_posts[:10]:
    print(post)
[['希望 九 哥 日日 开心 同 我 打 羽毛 波\n', 1], ['加 埋 哦 ! 句 需要 运动 !\n', 0]]
[['哈哈 、 生日 快乐 。 我 地 居然 同一 日 生日\n', 5], ['哈哈 … 生日 快乐 。\n', 5]]
[['有人 问 , 何时 去 北京 演出 呢 ? 正好 你 回答 一下 。\n', 0], ['北京 ? 争取 2011 年底 。 回答 完毕\n', 0]]
[['这 美 ?\n', 1], ['哈哈 ~ ~\n', 5]]
[['一定 要 支持 ~\n', 1], ['谢谢 支持 祝 你 好运 哦\n', 1]]
[['肿 么 呢 ? 青奈 滴 ? 吃 点 退烧 药\n', 3], ['这 都 不是 重点 , 马 春然 完美 演出 才 是 正经\n', 1]]
[['粗粗 吧 , 我 觉得 还 好 啊 。 你 回来 没 啊 上报 哥 !\n', 2], ['怎么 叫 上报 哥 啊 看 我 八 月 的 排班 怎样 有 时间 就 回去\n', 4]]
[['我 听 在 块\n', 0], ['今晚 你们 全都 要 来 老 公司 么 ?\n', 4]]
[['我 想 我 会 一直 孤单 , 过 着 孤单 的 生活\n', 2], ['原来 都 是 或 曾经 孤单 , 或 正在 孤单 的 主儿 。\n', 2]]
[['混 得 好 我 就 不 回来 了 , 你们 在 天朝 要 坚强 。 到 时候 一起 接 你们 过去 。\n', 0], ['要 奸 墙 , 不要 被 强奸 。\n', 4]]
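Judging from the preview above, each entry appears to be a pair of [text, label] items (a post and its reply), with the words already separated by spaces. The following is a minimal sketch of pulling the Chinese-only tokens out of one entry. The helper names `is_chinese_word` and `tokens_from_entry` are hypothetical, the integer labels are simply ignored, and the sample entry is copied from the preview.

```python
import re

# Tokens made up entirely of characters in the CJK Unified Ideographs block.
CHINESE_WORD = re.compile(r'[\u4e00-\u9fff]+')

def is_chinese_word(token):
    # fullmatch requires the whole token to match, not just its beginning.
    return CHINESE_WORD.fullmatch(token) is not None

def tokens_from_entry(entry):
    # entry is assumed to look like [[post_text, label], [reply_text, label]].
    tokens = []
    for text, _label in entry:
        # split() with no arguments also drops the trailing newline.
        tokens.extend(t for t in text.split() if is_chinese_word(t))
    return tokens

sample_entry = [['这 美 ?\n', 1], ['哈哈 ~ ~\n', 5]]
print(tokens_from_entry(sample_entry))
```

Punctuation, numbers, and mixed tokens are dropped, so only words written purely in Chinese characters are kept.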
import re
weibo_word_frequency = {}
for question, response in weibo_posts:
    words = question[0].split(' ') + response[0].split(' ')
    for word in words:
        word = word.strip()
        # Only include words that consist entirely of Chinese characters.
        # fullmatch checks the whole token; re.match would only check its start.
        if re.fullmatch(r'[\u4e00-\u9fff]+', word):
            if word not in weibo_word_frequency:
                weibo_word_frequency[word] = 0
            weibo_word_frequency[word] += 1
weibo_word_frequency['北京']
8044
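The counting step can also be written with `collections.Counter` from the standard library. This is only a sketch of that alternative on a toy token list (the tokens below are made up for illustration and are not real counts from the dataset).

```python
from collections import Counter

# Toy stand-in for the tokenized posts; the real dataset has over a million entries.
toy_tokens = ['哈哈', '生日', '快乐', '哈哈', '生日', '哈哈']

counts = Counter(toy_tokens)

# Look up a single word, as with the plain dict.
print(counts['哈哈'])

# most_common(n) returns the n highest counts as (word, count) pairs.
print(counts.most_common(1))
```

A `Counter` is a dict subclass, so `counts.items()` could be passed to `pd.DataFrame` in the same way as a plain dict.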
df = pd.DataFrame(data=weibo_word_frequency.items(),
                  columns=['word', 'occurrences'])
df.head(10)
# Add frequency percentiles
df['percentile'] = df['occurrences'].rank(pct=True)
# What are the most frequently used words?
df.sort_values('percentile', ascending=False, inplace=True)
df.head(10)
# What are the least frequent words?
df.tail(5)
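To make the percentile column concrete, here is a small sketch of how `rank(pct=True)` behaves on a toy table. The words and counts are invented for illustration only.

```python
import pandas as pd

# Toy frequency table; these counts are made up, not taken from the dataset.
toy = pd.DataFrame({'word': ['的', '我', '北京', '羽毛'],
                    'occurrences': [400, 300, 20, 1]})

# rank(pct=True) divides each row's rank by the number of rows,
# so the most frequent word gets 1.0.
toy['percentile'] = toy['occurrences'].rank(pct=True)
toy = toy.sort_values('percentile', ascending=False)
print(toy)
```

With four distinct counts the percentiles come out as 1.0, 0.75, 0.5, and 0.25, which is why sorting by percentile in descending order puts the most common words first.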
# Limit list to 2000 words
# .copy() avoids pandas' SettingWithCopyWarning when columns are added later.
df = df.head(2000).copy()
print(df.describe())
df.head()
         occurrences   percentile
count    2000.000000  2000.000000
mean     7750.468500     0.989464
std     33741.542328     0.006088
min       880.000000     0.978922
25%      1297.000000     0.984193
50%      2100.000000     0.989464
75%      4473.250000     0.994732
max    772887.000000     1.000000
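The summary above shows how skewed the counts are: the maximum is far larger than the median. One way to quantify what that means for a learner is cumulative coverage, the share of all word occurrences accounted for by the top N words. This is not part of the original notebook; it is a sketch on invented counts, and the `coverage` column name is hypothetical.

```python
import pandas as pd

# Toy frequency table; counts are invented so the arithmetic is easy to follow.
toy = pd.DataFrame({'word': ['的', '我', '你', '北京', '羽毛'],
                    'occurrences': [50, 25, 15, 8, 2]})

toy = toy.sort_values('occurrences', ascending=False)

# Running total of occurrences divided by the grand total.
toy['coverage'] = toy['occurrences'].cumsum() / toy['occurrences'].sum()
print(toy)
```

In this toy table the top three words cover 90% of all occurrences. Running the same two lines on the real frequency table, before truncating it to 2000 rows, would show how much of the corpus the top 2000 words account for.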
def word_to_pinyin(word):
    if not word:
        return ''
    pin = pinyin.get(word)
    return pin
word_to_pinyin('你')
'nǐ'
def word_to_definition(word):
    if not word:
        return ''
    definition = pinyin.cedict.translate_word(word)
    if not definition:
        return ''
    return '<br/>'.join(list(definition))
word_to_definition('是')
'variant of 是[shi4]<br/>(used in given names)'
df['pinyin'] = df['word'].apply(word_to_pinyin)
df['definition'] = df['word'].apply(word_to_definition)
df['back'] = df['word'].apply(lambda word: '<b>'+word_to_pinyin(word)+'</b><br/>'+word_to_definition(word))
df[['word', 'pinyin', 'percentile', 'definition']].head(50)
df.to_csv('most_common_5k_chinese_words_v6.csv')
!du -sh /content/most_common_5k_chinese_words_v6.csv
372K /content/most_common_5k_chinese_words_v6.csv
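For Anki specifically, a two-column, tab-separated file with no header row and no index column is usually the simplest thing to import, with the first column mapped to the front of the card and the second to the back. The sketch below shows that export on a one-row toy table. The definition text is a placeholder, and writing to an in-memory buffer instead of a file is just for illustration.

```python
import io
import pandas as pd

# One-row stand-in for the real table; the definition here is a placeholder.
toy = pd.DataFrame({
    'word': ['你'],
    'back': ['<b>nǐ</b><br/>you'],
})

# Tab-separated, no header and no index, so each line is "front<TAB>back".
buffer = io.StringIO()
toy[['word', 'back']].to_csv(buffer, sep='\t', header=False, index=False)
print(buffer.getvalue())
```

Because the back field contains HTML tags, Anki's import option for allowing HTML in fields needs to be enabled for the bold pinyin and line breaks to render.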