Step-by-Step Guide
You can run each stage by hand. This makes it easy to tweak parameters and understand what each step does.
Prerequisites
- Python 3 and pip
- ffmpeg on your PATH
- Several GB of disk for datasets and artifacts
- A few minutes to tens of minutes depending on your connection and CPU
On macOS you can install ffmpeg with:
brew install ffmpeg
Stage 1: Extract English Lemmas from Wiktionary
Input: data/wiktionary/enwiktionary-latest-pages-articles.xml.bz2 (downloaded for you if you ran the quickstart).
Output: artifacts/lemmas/lemmas.tsv.
python -m src.ingest.wiktionary_extract \
data/wiktionary/enwiktionary-latest-pages-articles.xml.bz2 \
artifacts/lemmas/lemmas.tsv
What this does: Stream the XML dump, keep main-namespace pages that contain an English section and have titles that look like plain lowercase words ([a-z]+). Each line in the output is a lemma.
Tips:
- The parser streams and clears elements to keep memory stable.
- Lemma titles are lowercased to reduce case duplication later.
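For orientation, a minimal sketch of this stage is shown below. The export namespace URI, the ==English== heading check, and the exact filtering are assumptions for illustration, not the module's literal code.

import bz2
import re
import xml.etree.ElementTree as ET

# Sketch of Stage 1: stream the dump, keep main-namespace pages with an
# English section and a plain lowercase title, write one lemma per line.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # assumed export namespace
TITLE_RE = re.compile(r"^[a-z]+$")

def extract_lemmas(dump_path, out_path):
    with bz2.open(dump_path, "rb") as f, open(out_path, "w", encoding="utf-8") as out:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title") or ""
                ns = elem.findtext(NS + "ns")
                text = elem.findtext(f"{NS}revision/{NS}text") or ""
                if ns == "0" and TITLE_RE.match(title) and "==English==" in text:
                    out.write(title + "\n")
                elem.clear()  # drop processed pages so memory stays flat

The pattern to note is iterparse plus elem.clear(), which is what keeps memory stable on a multi-gigabyte dump.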
Stage 2: Infer a Robust First Year from Google 1-grams
Inputs: artifacts/lemmas/lemmas.tsv and all .gz shards under data/ngrams/.
Output: artifacts/years/first_years.tsv with two columns: word<TAB>first_year (empty if unknown).
python -m src.ingest.ngram_first_year \
artifacts/lemmas/lemmas.tsv \
data/ngrams \
artifacts/years/first_years.tsv \
--tau 1e-9 --window 3 --guard 3 --start-year 1800 --end-year 2019
What this does: For each lemma, read matching 1-gram rows, sum match and volume counts per year, compute the relative frequency match/volume, smooth it with a 3-year centered moving average, then pick the earliest year ≥ 1800 where the smoothed frequency crosses tau and at least guard of the following five years have non-zero frequency. This gives a first year that is less sensitive to noise.
Tuning knobs:
- --tau raises or lowers the minimum frequency. Higher means stricter.
- --guard requires more non-zero support after the candidate year.
- --window controls smoothing; keep it small to avoid blurring.
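If it helps to see the rule concretely, here is a compact sketch of the per-lemma decision, assuming the yearly match and volume totals have already been aggregated; the real module's edge handling may differ in detail.

def first_year(match, volume, tau=1e-9, window=3, guard=3, start=1800, end=2019):
    """Sketch of the Stage 2 rule: relative frequency, centered moving average,
    then the earliest year that crosses tau with enough non-zero support after it.
    `match` and `volume` map year -> summed counts for one lemma."""
    years = range(start, end + 1)
    freq = {y: (match.get(y, 0) / volume[y]) if volume.get(y) else 0.0 for y in years}

    half = window // 2
    def smoothed(y):
        span = [freq.get(y + d, 0.0) for d in range(-half, half + 1)]
        return sum(span) / len(span)

    for y in years:
        if smoothed(y) >= tau:
            support = sum(1 for d in range(1, 6) if freq.get(y + d, 0.0) > 0)
            if support >= guard:
                return y
    return None  # unknown: written as an empty first_year column

With the defaults above, a word is dated to the first year whose smoothed frequency reaches 1e-9 and which is followed by at least three non-zero years out of the next five.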
Stage 3: Build Cumulative Prefix Counts (depth 6)
Input: artifacts/years/first_years.tsv.
Output: artifacts/trie/prefix_counts.jsonl with newline-delimited JSON records like {prefix, depth, year, cumulative_count}.
python -m src.build.build_prefix_trie \
artifacts/years/first_years.tsv \
artifacts/trie/prefix_counts.jsonl \
--depth 6 --start 1800 --end 2019
What this does: For each word with a first year, add its 1–6 letter prefixes starting at that year; then sweep years from 1800 to 2019 and accumulate to produce a non-decreasing series per prefix. Depth 6 is a sweet spot: detailed enough to see families, small enough to render quickly.
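Conceptually this is a two-pass counting job: tally new words per (prefix, year), then sweep years and accumulate. A rough sketch, assuming the first years have already been loaded into a dict and using the record fields named above; the real module may stream its input instead of holding it in memory.

import json
from collections import defaultdict

def build_prefix_counts(first_years, out_path, depth=6, start=1800, end=2019):
    """Sketch of Stage 3: first_years maps word -> first year (int).
    Emits one JSON record per (prefix, year) with a non-decreasing count."""
    new_per_year = defaultdict(lambda: defaultdict(int))  # prefix -> year -> new words
    for word, year in first_years.items():
        for d in range(1, min(depth, len(word)) + 1):
            new_per_year[word[:d]][year] += 1

    with open(out_path, "w", encoding="utf-8") as out:
        for prefix, per_year in new_per_year.items():
            total = 0
            for year in range(start, end + 1):  # sweep years, accumulate
                total += per_year.get(year, 0)
                out.write(json.dumps({"prefix": prefix, "depth": len(prefix),
                                      "year": year, "cumulative_count": total}) + "\n")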
Stage 4: Render the Animation (Radial Layout)
Input: artifacts/trie/prefix_counts.jsonl.
Output: PNG frames under outputs/frames/.
python -m src.viz.render_frames \
artifacts/trie/prefix_counts.jsonl \
outputs/frames \
--width 1920 --height 1080 \
--min-radius 10 --max-radius 120 \
--title-font-size 112 --detail-font-size 42 \
--base-edge-alpha 25 --edge-depth 6 \
--start-progress 0.25 --end-progress 1.0
What this does: Place prefixes on a polar layout that allocates angular sectors by ancestry; ease the global scale in so early frames are readable; draw edges and nodes by depth with a warm-to-cool palette; overlay the year and simple counters.
Reading the frame: Larger circles mean more cumulative words under that prefix. Edges connect a prefix to its parent. The base letters sit roughly around the circle; descendants branch outward.
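To make the sector idea concrete, here is a simplified placement sketch. The proportional split by cumulative count, the fixed ring spacing, and the function names are illustrative assumptions; the renderer's easing and palette logic are omitted.

import math

def place(prefix_children, counts, center=(960, 540), ring=80):
    """Sketch of radial placement: the root owns the full circle, each child
    gets a slice of its parent's sector proportional to its cumulative count,
    and nodes sit at a radius proportional to their depth. prefix_children maps
    a prefix ('' for the root) to its child prefixes."""
    positions = {}

    def walk(prefix, a0, a1, depth):
        mid = (a0 + a1) / 2
        r = depth * ring
        positions[prefix] = (center[0] + r * math.cos(mid),
                             center[1] + r * math.sin(mid))
        children = prefix_children.get(prefix, [])
        total = sum(counts.get(c, 0) for c in children) or 1
        angle = a0
        for child in children:
            span = (a1 - a0) * counts.get(child, 0) / total
            walk(child, angle, angle + span, depth + 1)
            angle += span

    walk("", 0.0, 2 * math.pi, 0)
    return positions

Because each child's sector is carved out of its parent's, whole letter families stay together as the animation progresses.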
Alternative: Rectangular Layout with Labels
If you prefer a stable grid where labels can sit near nodes, compute positions once and reuse them for every year:
python -m src.viz.layout \
artifacts/trie/prefix_counts.jsonl \
artifacts/layout/prefix_positions.json
python -m src.viz.render_frames_rectangular \
artifacts/trie/prefix_counts.jsonl \
artifacts/layout/prefix_positions.json \
outputs/frames \
--width 1920 --height 1080 --padding 120 \
--min-radius 5 --max-radius 64 \
--label-limit 8 --label-depth 4 --label-spacing 20
The rectangular view stacks depths vertically and spreads siblings horizontally by a deterministic rule. Labels indicate prominent prefixes for the current year if there is space.
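The precise rule lives inside src.viz.layout, but a layout in this spirit fits in a few lines. In the hypothetical sketch below, depth picks the row and each prefix inherits an even 26-way split of its parent's horizontal span, which is one way to get positions that never move between frames.

def grid_positions(prefixes, width=1920, height=1080, padding=120, max_depth=6):
    """Sketch of a deterministic rectangular layout: depth selects the row,
    and each prefix is centered within the slice of its parent's span that
    corresponds to its last letter."""
    rows = {d: padding + (height - 2 * padding) * (d - 1) / (max_depth - 1)
            for d in range(1, max_depth + 1)}
    positions = {}
    for prefix in prefixes:
        x0, x1 = float(padding), float(width - padding)
        for ch in prefix:                       # narrow the span letter by letter
            step = (x1 - x0) / 26
            i = ord(ch) - ord("a")
            x0, x1 = x0 + i * step, x0 + (i + 1) * step
        positions[prefix] = ((x0 + x1) / 2, rows[len(prefix)])
    return positions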
Stage 5: Encode MP4 and GIF
Input: frames in outputs/frames/.
Outputs: outputs/english_trie_timelapse.mp4 and outputs/english_trie_timelapse.gif.
python -m src.viz.encode \
outputs/frames \
outputs/english_trie_timelapse.mp4 \
outputs/english_trie_timelapse.gif \
--fps 7.333 --gif-fps 12 --gif-width 1280
The MP4 is ideal for high-quality playback on most platforms. The GIF is palette-optimized for sharing where video is awkward.
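The encoder is essentially a wrapper around ffmpeg. If you prefer to drive ffmpeg directly, the sketch below is roughly equivalent; the frame filename pattern and the codec settings here are assumptions, not necessarily what src.viz.encode uses.

import subprocess

def encode(frames_dir, mp4_path, gif_path, fps=7.333, gif_fps=12, gif_width=1280):
    """Sketch of Stage 5 with plain ffmpeg: an H.264 MP4 from numbered PNG
    frames, then a palette-optimized GIF. frame_%06d.png is an assumed name."""
    pattern = f"{frames_dir}/frame_%06d.png"
    subprocess.run(["ffmpeg", "-y", "-framerate", str(fps), "-i", pattern,
                    "-c:v", "libx264", "-pix_fmt", "yuv420p", "-crf", "18",
                    mp4_path], check=True)
    # Two-pass palette generation keeps GIF colors sharp at a smaller size.
    filters = (f"fps={gif_fps},scale={gif_width}:-1:flags=lanczos,"
               "split[a][b];[a]palettegen[p];[b][p]paletteuse")
    subprocess.run(["ffmpeg", "-y", "-framerate", str(fps), "-i", pattern,
                    "-filter_complex", filters, gif_path], check=True)

The split/palettegen/paletteuse filter chain is the standard way to build a GIF whose palette is computed from the frames themselves rather than a generic 256-color table.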
Next Steps
- Learn about advanced tuning options for customizing the visualization
- Understand how to interpret the results
- Explore troubleshooting tips if you encounter issues