Step-by-Step Guide
You can run each stage by hand. This makes it easy to tweak parameters and understand what each step does.
Prerequisites
- Python 3 and pip
- ffmpeg on your PATH
- Several GB of disk for datasets and artifacts
- A few minutes to tens of minutes depending on your connection and CPU
On macOS you can install ffmpeg with:
brew install ffmpeg
Stage 1: Extract English Lemmas from Wiktionary
Input: data/wiktionary/enwiktionary-latest-pages-articles.xml.bz2 (downloaded for you if you ran the quickstart).
Output: artifacts/lemmas/lemmas.tsv.
python -m src.ingest.wiktionary_extract \
data/wiktionary/enwiktionary-latest-pages-articles.xml.bz2 \
artifacts/lemmas/lemmas.tsv
What this does: Stream the XML dump, keep main-namespace pages that contain an English section and have titles that look like plain lowercase words ([a-z]+). Each line in the output is a lemma.
Tips:
- The parser streams and clears elements to keep memory stable.
- Lemma titles are lowercased to reduce case duplication later.
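For orientation, a minimal sketch of this stage is shown below. The export namespace URI, the ==English== heading check, and the exact filtering are assumptions for illustration, not the module's literal code.

import bz2
import re
import xml.etree.ElementTree as ET

# Sketch of Stage 1: stream the dump, keep main-namespace pages with an
# English section and a plain lowercase title, write one lemma per line.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # assumed export namespace
TITLE_RE = re.compile(r"^[a-z]+$")

def extract_lemmas(dump_path, out_path):
    with bz2.open(dump_path, "rb") as f, open(out_path, "w", encoding="utf-8") as out:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title") or ""
                ns = elem.findtext(NS + "ns")
                text = elem.findtext(f"{NS}revision/{NS}text") or ""
                if ns == "0" and TITLE_RE.match(title) and "==English==" in text:
                    out.write(title + "\n")
                elem.clear()  # drop processed pages so memory stays flat

The pattern to note is iterparse plus elem.clear(), which is what keeps memory stable on a multi-gigabyte dump.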
Stage 2: Infer a Robust First Year from Google 1-grams
Inputs: artifacts/lemmas/lemmas.tsv and all .gz shards under data/ngrams/.
Output: artifacts/years/first_years.tsv with two columns: word<TAB>first_year (empty if unknown).
python -m src.ingest.ngram_first_year \
artifacts/lemmas/lemmas.tsv \
data/ngrams \
artifacts/years/first_years.tsv \
--tau 1e-9 --window 3 --guard 3 --start-year 1800 --end-year 2019
What this does: For each lemma, read matching 1-gram rows, sum match and volume counts per year, compute the relative frequency match/volume, smooth it with a 3-year centered moving average, then pick the earliest year ≥ 1800 where the smoothed frequency crosses tau and at least guard of the following five years have non-zero frequency. This gives a first year that is less sensitive to noise.
Tuning knobs:
- --tau raises or lowers the minimum frequency. Higher means stricter.
- --guard requires more non-zero support after the candidate year.
- --window controls smoothing; keep it small to avoid blurring.
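If it helps to see the rule concretely, here is a compact sketch of the per-lemma decision, assuming the yearly match and volume totals have already been aggregated; the real module's edge handling may differ in detail.

def first_year(match, volume, tau=1e-9, window=3, guard=3, start=1800, end=2019):
    """Sketch of the Stage 2 rule: relative frequency, centered moving average,
    then the earliest year that crosses tau with enough non-zero support after it.
    `match` and `volume` map year -> summed counts for one lemma."""
    years = range(start, end + 1)
    freq = {y: (match.get(y, 0) / volume[y]) if volume.get(y) else 0.0 for y in years}

    half = window // 2
    def smoothed(y):
        span = [freq.get(y + d, 0.0) for d in range(-half, half + 1)]
        return sum(span) / len(span)

    for y in years:
        if smoothed(y) >= tau:
            support = sum(1 for d in range(1, 6) if freq.get(y + d, 0.0) > 0)
            if support >= guard:
                return y
    return None  # unknown: written as an empty first_year column

With the defaults above, a word is dated to the first year whose smoothed frequency reaches 1e-9 and which is followed by at least three non-zero years out of the next five.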
Stage 3: Build Cumulative Prefix Counts (depth 6)
Input: artifacts/years/first_years.tsv.
Output: artifacts/trie/prefix_counts.jsonl with newline-delimited JSON records like {prefix, depth, year, cumulative_count}.
python -m src.build.build_prefix_trie \
artifacts/years/first_years.tsv \
artifacts/trie/prefix_counts.jsonl \
--depth 6 --start 1800 --end 2019
What this does: For each word with a first year, add its 1–6 letter prefixes starting at that year; then sweep years from 1800 to 2019 and accumulate to produce a non-decreasing series per prefix. Depth 6 is a sweet spot: detailed enough to see families, small enough to render quickly.
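Conceptually this is a two-pass counting job: tally new words per (prefix, year), then sweep years and accumulate. A rough sketch, assuming the first years have already been loaded into a dict and using the record fields named above; the real module may stream its input instead of holding it in memory.

import json
from collections import defaultdict

def build_prefix_counts(first_years, out_path, depth=6, start=1800, end=2019):
    """Sketch of Stage 3: first_years maps word -> first year (int).
    Emits one JSON record per (prefix, year) with a non-decreasing count."""
    new_per_year = defaultdict(lambda: defaultdict(int))  # prefix -> year -> new words
    for word, year in first_years.items():
        for d in range(1, min(depth, len(word)) + 1):
            new_per_year[word[:d]][year] += 1

    with open(out_path, "w", encoding="utf-8") as out:
        for prefix, per_year in new_per_year.items():
            total = 0
            for year in range(start, end + 1):  # sweep years, accumulate
                total += per_year.get(year, 0)
                out.write(json.dumps({"prefix": prefix, "depth": len(prefix),
                                      "year": year, "cumulative_count": total}) + "\n")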
Stage 4: Render the Animation (Radial Layout)
Input: artifacts/trie/prefix_counts.jsonl.
Output: PNG frames under outputs/frames/.
python -m src.viz.render_frames \
artifacts/trie/prefix_counts.jsonl \
outputs/frames \
--width 1920 --height 1080 \
--min-radius 10 --max-radius 120 \
--title-font-size 112 --detail-font-size 42 \
--base-edge-alpha 25 --edge-depth 6 \
--start-progress 0.25 --end-progress 1.0
What this does: Place prefixes on a polar layout that allocates angular sectors by ancestry; ease the global scale in so early frames are readable; draw edges and nodes by depth with a warm-to-cool palette; overlay the year and simple counters.
Reading the frame: Larger circles mean more cumulative words under that prefix. Edges connect a prefix to its parent. The base letters sit roughly around the circle; descendants branch outward.
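To make the sector idea concrete, here is a simplified placement sketch. The proportional split by cumulative count, the fixed ring spacing, and the function names are illustrative assumptions; the renderer's easing and palette logic are omitted.

import math

def place(prefix_children, counts, center=(960, 540), ring=80):
    """Sketch of radial placement: the root owns the full circle, each child
    gets a slice of its parent's sector proportional to its cumulative count,
    and nodes sit at a radius proportional to their depth. prefix_children maps
    a prefix ('' for the root) to its child prefixes."""
    positions = {}

    def walk(prefix, a0, a1, depth):
        mid = (a0 + a1) / 2
        r = depth * ring
        positions[prefix] = (center[0] + r * math.cos(mid),
                             center[1] + r * math.sin(mid))
        children = prefix_children.get(prefix, [])
        total = sum(counts.get(c, 0) for c in children) or 1
        angle = a0
        for child in children:
            span = (a1 - a0) * counts.get(child, 0) / total
            walk(child, angle, angle + span, depth + 1)
            angle += span

    walk("", 0.0, 2 * math.pi, 0)
    return positions

Because each child's sector is carved out of its parent's, whole letter families stay together as the animation progresses.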
Alternative: Rectangular Layout with Labels
If you prefer a stable grid where labels can sit near nodes, compute positions once and reuse them for every year:
python -m src.viz.layout \
artifacts/trie/prefix_counts.jsonl \
artifacts/layout/prefix_positions.json
python -m src.viz.render_frames_rectangular \
artifacts/trie/prefix_counts.jsonl \
artifacts/layout/prefix_positions.json \
outputs/frames \
--width 1920 --height 1080 --padding 120 \
--min-radius 5 --max-radius 64 \
--label-limit 8 --label-depth 4 --label-spacing 20
The rectangular view stacks depths vertically and spreads siblings horizontally by a deterministic rule. Labels indicate prominent prefixes for the current year if there is space.
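The precise rule lives inside src.viz.layout, but a layout in this spirit fits in a few lines. In the hypothetical sketch below, depth picks the row and each prefix inherits an even 26-way split of its parent's horizontal span, which is one way to get positions that never move between frames.

def grid_positions(prefixes, width=1920, height=1080, padding=120, max_depth=6):
    """Sketch of a deterministic rectangular layout: depth selects the row,
    and each prefix is centered within the slice of its parent's span that
    corresponds to its last letter."""
    rows = {d: padding + (height - 2 * padding) * (d - 1) / (max_depth - 1)
            for d in range(1, max_depth + 1)}
    positions = {}
    for prefix in prefixes:
        x0, x1 = float(padding), float(width - padding)
        for ch in prefix:                       # narrow the span letter by letter
            step = (x1 - x0) / 26
            i = ord(ch) - ord("a")
            x0, x1 = x0 + i * step, x0 + (i + 1) * step
        positions[prefix] = ((x0 + x1) / 2, rows[len(prefix)])
    return positions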
Stage 5: Encode MP4 and GIF
Input: frames in outputs/frames/.
Outputs: outputs/english_trie_timelapse.mp4 and outputs/english_trie_timelapse.gif.
python -m src.viz.encode \
outputs/frames \
outputs/english_trie_timelapse.mp4 \
outputs/english_trie_timelapse.gif \
--fps 7.333 --gif-fps 12 --gif-width 1280
The MP4 is ideal for high-quality playback on most platforms. The GIF is palette-optimized for sharing where video is awkward.
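The encoder is essentially a wrapper around ffmpeg. If you prefer to drive ffmpeg directly, the sketch below is roughly equivalent; the frame filename pattern and the codec settings here are assumptions, not necessarily what src.viz.encode uses.

import subprocess

def encode(frames_dir, mp4_path, gif_path, fps=7.333, gif_fps=12, gif_width=1280):
    """Sketch of Stage 5 with plain ffmpeg: an H.264 MP4 from numbered PNG
    frames, then a palette-optimized GIF. frame_%06d.png is an assumed name."""
    pattern = f"{frames_dir}/frame_%06d.png"
    subprocess.run(["ffmpeg", "-y", "-framerate", str(fps), "-i", pattern,
                    "-c:v", "libx264", "-pix_fmt", "yuv420p", "-crf", "18",
                    mp4_path], check=True)
    # Two-pass palette generation keeps GIF colors sharp at a smaller size.
    filters = (f"fps={gif_fps},scale={gif_width}:-1:flags=lanczos,"
               "split[a][b];[a]palettegen[p];[b][p]paletteuse")
    subprocess.run(["ffmpeg", "-y", "-framerate", str(fps), "-i", pattern,
                    "-filter_complex", filters, gif_path], check=True)

The split/palettegen/paletteuse filter chain is the standard way to build a GIF whose palette is computed from the frames themselves rather than a generic 256-color table.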
Next Steps
- Learn about advanced tuning options for customizing the visualization
- Understand how to interpret the results
- Explore troubleshooting tips if you encounter issues