Considering File Formats for ChineseDict

Published Dec 5, 2023 in Chinese, Software Engineering

Recently I’ve been hacking on ChineseDict, a fast web frontend to look up Chinese words. I started the project originally to fulfill a specific use case: when practicing chatting with friends or watching a video, I need search results in milliseconds, not seconds.

The website uses the CC-EDICT dictionary to power the results. I now want to expand each term with examples and extra data in a way that’s easy to store on disk and update via git commit.

In total for each entry I want to store:

- Traditional and simplified characters
- Pinyin reading
- Relative word frequency
- HSK level
- Definitions, including:
  - Parts of speech
  - Meanings
  - Example sentences with translations

Here are the options I considered:

YAML

I initially considered YAML since it’s extremely easy to read. However, YAML has well-known downsides: the syntax is not always obvious, it has no built-in schema, and it has some sharp edge cases.

traditional: 飯
simplified: 饭
percentile: 0.99
hsk: 1
definitions:
  - meaning: food, meal
    part_of_speech: noun
    examples:
      - example: 煮飯吧
        translation: Let's cook
      - example: 吃饭了
        translation: Time to eat!
  - meaning: cooked rice
    part_of_speech: noun
    examples:
      - example: 两碗饭
        translation: Two bowls of rice

Writing out this example makes it clear that each added layer of nesting (the examples are a list inside a dict inside a list inside a dict) adds mental overhead to keep everything lined up. It’s still quite readable; the indentation just gets a little precarious the further down you go.

TOML

I have enjoyed the times I’ve used TOML (like for this site), and it seems like a pretty sane configuration language.

traditional = "飯"
simplified = "饭"
percentile = 0.99
hsk = 1

[[definitions]]
meaning = "food, meal"
part_of_speech = "noun"

  [[definitions.examples]]
  example = "煮飯吧"
  translation = "Let's cook"

  [[definitions.examples]]
  example = "吃饭了"
  translation = "Time to eat!"

[[definitions]]
meaning = "cooked rice"
part_of_speech = "noun"

  [[definitions.examples]]
  example = "两碗饭"
  translation = "Two bowls of rice"

For this use case, I really like that the list items are clearly separated into their own sections. However, the extra syntax (double quotes, double brackets) makes it a little more effort to edit. It also runs into the same indentation and formatting complexity that YAML exhibits with nested lists of dicts.

TextProto

Protobufs bring something to the table that the two options above do not: a schema. It would be nice to enforce that every file adheres to a predefined proto, although in practice this is about the same as getting a runtime error at server startup for malformed YAML/TOML. The benefit appears when working across languages, for example if I decide to load these objects in JavaScript in the browser.

The definition:

message Entry {
  optional string traditional = 1;
  optional string simplified = 2;
  optional float percentile = 3;
  optional int32 hsk = 4;
  repeated Definition definitions = 5;
}

message Definition {
  optional string meaning = 1;
  optional string part_of_speech = 2;
  repeated Example examples = 3;
}

message Example {
  optional string example = 1;
  optional string translation = 2;
}

The file:

traditional: "飯"
simplified: "饭"
percentile: 0.99
hsk: 1
definitions {
  meaning: "food, meal"
  part_of_speech: "noun"
  examples {
    example: "煮飯吧"
    translation: "Let's cook"
  }
  examples {
    example: "吃饭了"
    translation: "Time to eat!"
  }
}
definitions {
  meaning: "cooked rice"
  part_of_speech: "noun"
  examples {
    example: "两碗饭"
    translation: "Two bowls of rice"
  }
}
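For comparison, the "runtime error at startup" approach mentioned earlier for YAML/TOML could look roughly like the sketch below. It mirrors the three messages as Python dataclasses and checks a decoded mapping against them. The names `parse_entry` and `RAW_ENTRY` are hypothetical, and treating every field as required is an assumption made for brevity (the proto marks them `optional`).

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    example: str
    translation: str


@dataclass
class Definition:
    meaning: str
    part_of_speech: str
    examples: list[Example] = field(default_factory=list)


@dataclass
class Entry:
    traditional: str
    simplified: str
    percentile: float
    hsk: int
    definitions: list[Definition] = field(default_factory=list)


def parse_entry(raw: dict) -> Entry:
    """Build an Entry from a decoded YAML/TOML mapping (hypothetical helper).

    Raises KeyError for missing keys and TypeError for unknown keys or
    values of the wrong type.
    """
    definitions = []
    for d in raw.get("definitions", []):
        # Example(**e) rejects unexpected or missing keys with TypeError.
        examples = [Example(**e) for e in d.get("examples", [])]
        definitions.append(Definition(d["meaning"], d["part_of_speech"], examples))
    if not isinstance(raw["hsk"], int):
        raise TypeError(f"hsk must be an int, got {raw['hsk']!r}")
    if not isinstance(raw["percentile"], (int, float)):
        raise TypeError(f"percentile must be a number, got {raw['percentile']!r}")
    return Entry(
        raw["traditional"],
        raw["simplified"],
        float(raw["percentile"]),
        raw["hsk"],
        definitions,
    )


# Sample input in the shape a YAML or TOML loader would return.
RAW_ENTRY = {
    "traditional": "飯",
    "simplified": "饭",
    "percentile": 0.99,
    "hsk": 1,
    "definitions": [
        {
            "meaning": "cooked rice",
            "part_of_speech": "noun",
            "examples": [{"example": "两碗饭", "translation": "Two bowls of rice"}],
        }
    ],
}

entry = parse_entry(RAW_ENTRY)
print(entry.definitions[0].examples[0].translation)
```

This catches the same classes of mistakes a proto would, but only in the one language the checks are written in, which is the trade-off described above.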

Alternatives not considered

JSON: too cumbersome for humans to write, and doesn’t produce nice diffs.
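One common reason JSON diffs get noisy, assuming that is the complaint here, is that JSON forbids trailing commas: appending an item to a list also changes the line before it. The sketch below uses only Python's standard library (`json` and `difflib`) to show this; the `dump` helper and the tiny sample data are made up for illustration.

```python
import difflib
import json

# Hypothetical before/after states: one example sentence, then two.
before = {"examples": ["煮飯吧"]}
after = {"examples": ["煮飯吧", "吃饭了"]}


def dump(obj):
    """Pretty-print as JSON and split into lines for diffing."""
    return json.dumps(obj, ensure_ascii=False, indent=2).splitlines()


# n=0 drops context lines so only the changed lines remain.
diff = list(difflib.unified_diff(dump(before), dump(after), lineterm="", n=0))

for line in diff:
    print(line)
```

The existing `"煮飯吧"` line shows up as removed and re-added because it gains a comma, even though its content did not change. The same edit in the YAML layout above only adds new lines.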

Conclusion

Overall, this exercise gave me a better idea of which format to choose for serializing thousands of dictionary entries. My #1 priority is ease of use for updates, and in that respect YAML is the winner.

jeffcarp