Considering File Formats for ChineseDict
Recently I’ve been hacking on ChineseDict, a fast web frontend to look up Chinese words. I started the project originally to fulfill a specific use case: when practicing chatting with friends or watching a video, I need search results in milliseconds, not seconds.
The website uses the CC-CEDICT dictionary to power the results. I now want to expand each term with examples and extra data in a way that’s easy to store on disk and update via git commit.
In total for each entry I want to store:
- Traditional and simplified characters
- Pinyin reading
- Relative word frequency
- HSK level
- Definitions, including:
  - Parts of speech
  - Meanings
  - Example sentences with translations
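As a rough sketch of the shape this list describes, here is one way the entry could be modeled with Python dataclasses. The class names and the `pinyin` field name are assumptions for illustration; `percentile` and `hsk` follow the field names used in the format samples in this post.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    example: str      # sentence in Chinese
    translation: str  # English translation


@dataclass
class Definition:
    meaning: str
    part_of_speech: str
    examples: list[Example] = field(default_factory=list)


@dataclass
class Entry:
    traditional: str
    simplified: str
    pinyin: str        # hypothetical field name; numbered-tone pinyin assumed
    percentile: float  # relative word frequency
    hsk: int           # HSK level
    definitions: list[Definition] = field(default_factory=list)


entry = Entry(
    traditional="飯",
    simplified="饭",
    pinyin="fan4",
    percentile=0.99,
    hsk=1,
    definitions=[
        Definition(
            meaning="cooked rice",
            part_of_speech="noun",
            examples=[Example("两碗饭", "Two bowls of rice")],
        )
    ],
)

print(entry.definitions[0].examples[0].translation)
```

Whatever file format ends up on disk, it has to express this same nesting: an entry holds a list of definitions, and each definition holds a list of examples.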
Here are the options I considered:
YAML
I initially considered YAML since it’s extremely easy to read. However, YAML has well-known downsides: the syntax is not always obvious, it lacks a schema, and it has some sharp edge cases.
traditional: 飯
simplified: 饭
percentile: 0.99
hsk: 1
definitions:
  - meaning: food, meal
    part_of_speech: noun
    examples:
      - example: 煮飯吧
        translation: Let's cook
      - example: 吃饭了
        translation: Time to eat!
  - meaning: cooked rice
    part_of_speech: noun
    examples:
      - example: 两碗饭
        translation: Two bowls of rice
Writing out this example makes it clear that each layer of nesting (the examples are a list inside a dict inside a list inside a dict) adds mental overhead to keep everything lined up correctly. It’s still quite readable; the indentation just gets a little precarious the further down you go.
TOML
I have enjoyed the times I’ve used TOML (like for this site), and it seems like a pretty sane configuration language.
traditional = "飯"
simplified = "饭"
percentile = 0.99
hsk = 1

[[definitions]]
meaning = "food, meal"
part_of_speech = "noun"

  [[definitions.examples]]
  example = "煮飯吧"
  translation = "Let's cook"

  [[definitions.examples]]
  example = "吃饭了"
  translation = "Time to eat!"

[[definitions]]
meaning = "cooked rice"
part_of_speech = "noun"

  [[definitions.examples]]
  example = "两碗饭"
  translation = "Two bowls of rice"
For this use case, I really like that the list items are clearly separated into their own sections. However, the extra syntax (double quotes, double brackets) makes it a little more effort to edit. It also still runs into the same indentation and formatting complexity that YAML exhibits with nested lists of dicts.
TextProto
Protobufs bring something to the table that the above two options do not: a schema. It would be nice to be able to enforce that every file adheres to a predefined proto, although effectively this is about the same as getting a runtime error when the server starts up for malformed YAML/TOML in the cases above. The benefit appears when working cross-language, for example if I decide to load these objects in JavaScript in the browser.
The definition:
message Entry {
  optional string traditional = 1;
  optional string simplified = 2;
  optional float percentile = 3;
  optional int32 hsk = 4;
  repeated Definition definitions = 5;
}

message Definition {
  optional string meaning = 1;
  optional string part_of_speech = 2;
  repeated Example examples = 3;
}

message Example {
  optional string example = 1;
  optional string translation = 2;
}
The file:
# Text format for a single Entry message.
traditional: "飯"
simplified: "饭"
percentile: 0.99
hsk: 1
definitions {
  meaning: "food, meal"
  part_of_speech: "noun"
  examples {
    example: "煮飯吧"
    translation: "Let's cook"
  }
  examples {
    example: "吃饭了"
    translation: "Time to eat!"
  }
}
definitions {
  meaning: "cooked rice"
  part_of_speech: "noun"
  examples {
    example: "两碗饭"
    translation: "Two bowls of rice"
  }
}
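For comparison, the startup-time check that YAML or TOML would rely on in place of a proto schema could be sketched as below. This is a hand-rolled illustration, not part of ChineseDict: `validate_entry` and the sample dicts are hypothetical, and the entry is assumed to be already parsed into plain Python dicts and lists.

```python
def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one parsed entry (empty if valid)."""
    problems = []

    # Top-level scalar fields and the types they are expected to have.
    required = {"traditional": str, "simplified": str, "percentile": float, "hsk": int}
    for key, expected in required.items():
        if not isinstance(entry.get(key), expected):
            problems.append(f"{key}: expected {expected.__name__}")

    # Walk the nested definitions and their examples.
    for i, definition in enumerate(entry.get("definitions", [])):
        for key in ("meaning", "part_of_speech"):
            if not isinstance(definition.get(key), str):
                problems.append(f"definitions[{i}].{key}: expected str")
        for j, example in enumerate(definition.get("examples", [])):
            for key in ("example", "translation"):
                if not isinstance(example.get(key), str):
                    problems.append(f"definitions[{i}].examples[{j}].{key}: expected str")

    return problems


good = {
    "traditional": "飯",
    "simplified": "饭",
    "percentile": 0.99,
    "hsk": 1,
    "definitions": [
        {
            "meaning": "cooked rice",
            "part_of_speech": "noun",
            "examples": [{"example": "两碗饭", "translation": "Two bowls of rice"}],
        }
    ],
}

# Same entry, but hsk is a string and the example is missing its translation.
bad = {
    **good,
    "hsk": "1",
    "definitions": [
        {
            "meaning": "cooked rice",
            "part_of_speech": "noun",
            "examples": [{"example": "两碗饭"}],
        }
    ],
}

print(validate_entry(good))  # []
print(validate_entry(bad))
```

A check like this has to be rewritten for every language that loads the files, which is the cross-language cost that a shared proto definition avoids.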
Alternatives not considered
JSON: too cumbersome for humans to write, and doesn’t produce nice diffs.
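To make the comparison concrete, this sketch prints one definition of the same entry as JSON using Python's standard `json` module. The dict literal is illustrative sample data, trimmed to a single definition.

```python
import json

entry = {
    "traditional": "飯",
    "simplified": "饭",
    "percentile": 0.99,
    "hsk": 1,
    "definitions": [
        {
            "meaning": "cooked rice",
            "part_of_speech": "noun",
            "examples": [
                {"example": "两碗饭", "translation": "Two bowls of rice"}
            ],
        }
    ],
}

# ensure_ascii=False keeps the Chinese characters readable instead of
# emitting \uXXXX escapes; indent=2 puts each field on its own line.
text = json.dumps(entry, ensure_ascii=False, indent=2)
print(text)
```

Every key and string is quoted, and each level of nesting brings its own pair of braces or brackets, all of which has to be typed and kept balanced by hand.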
Conclusion
Overall, this exercise gave me a better idea of the format I want to choose for serializing thousands of dictionary entries. My #1 priority is ease of use for updates, and in that respect YAML is the winner.