Step-by-Step Guide

You can run each stage by hand. This makes it easy to tweak parameters and understand what each step does.

Prerequisites

On macOS you can install ffmpeg with:

brew install ffmpeg

Stage 1: Extract English Lemmas from Wiktionary

Input: data/wiktionary/enwiktionary-latest-pages-articles.xml.bz2 (downloaded automatically when you run the quickstart).
Output: artifacts/lemmas/lemmas.tsv.

python -m src.ingest.wiktionary_extract \
  data/wiktionary/enwiktionary-latest-pages-articles.xml.bz2 \
  artifacts/lemmas/lemmas.tsv

What this does: Stream the XML dump and keep main-namespace pages that contain an English section and whose titles look like plain lowercase words ([a-z]+). Each line of the output is one lemma.
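
The page filter described above can be sketched as a single predicate. This is a minimal illustration, not the code in src.ingest.wiktionary_extract: the function name is_english_lemma is hypothetical, and it assumes that an English section is marked by a level-2 ==English== heading and that the main namespace is number 0.

```python
import re

# Assumption: a "plain lowercase word" is one or more ASCII letters a-z.
PLAIN_WORD = re.compile(r"[a-z]+")
# Assumption: an English section starts with a level-2 "==English==" heading.
ENGLISH_HEADING = re.compile(r"^==\s*English\s*==\s*$", re.MULTILINE)


def is_english_lemma(title: str, namespace: int, wikitext: str) -> bool:
    """Return True if a page passes the three filters described above."""
    if namespace != 0:  # keep main-namespace pages only
        return False
    if PLAIN_WORD.fullmatch(title) is None:  # rejects "Cat", "ice-cream", "naïve"
        return False
    return ENGLISH_HEADING.search(wikitext) is not None
```

Using fullmatch rather than match matters here: it rejects titles that merely start with lowercase letters, such as hyphenated compounds.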

Tips:

Stage 2: Infer a Robust First Year from Google 1-grams

Inputs: artifacts/lemmas/lemmas.tsv and all .gz shards under data/ngrams/.
Output: artifacts/years/first_years.tsv with two columns: word<TAB>first_year (empty if unknown).

python -m src.ingest.ngram_first_year \
  artifacts/lemmas/lemmas.tsv \
  data/ngrams \
  artifacts/years/first_years.tsv \
  --tau 1e-9 --window 3 --guard 3 --start-year 1800 --end-year 2019

What this does: For each lemma, read the matching 1-gram rows, sum match and volume counts per year, and compute the relative frequency match/volume. Smooth that series with a 3-year centered moving average (--window), then pick the earliest year ≥ 1800 where the smoothed frequency crosses tau and at least guard of the next five years have non-zero frequency. The guard check makes the first year less sensitive to isolated noisy hits.
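
To make the selection rule concrete, here is a rough sketch of it for a single lemma. It is not the actual src.ingest.ngram_first_year code. The function name robust_first_year and the lookahead parameter are hypothetical, and several details are assumptions: "crosses tau" is read as smoothed frequency >= tau, the guard check looks at raw (unsmoothed) frequencies, and the moving average is truncated at the ends of the year range.

```python
from typing import Dict, Optional


def robust_first_year(
    match: Dict[int, int],
    volume: Dict[int, int],
    tau: float = 1e-9,
    window: int = 3,
    guard: int = 3,
    start_year: int = 1800,
    end_year: int = 2019,
    lookahead: int = 5,  # hypothetical name for "the next five" years
) -> Optional[int]:
    """Earliest year whose smoothed frequency reaches tau and is confirmed by later years."""
    years = range(start_year, end_year + 1)
    # Relative frequency per year; years without volume data count as zero.
    freq = {y: (match.get(y, 0) / volume[y]) if volume.get(y) else 0.0 for y in years}

    half = window // 2

    def smoothed(y: int) -> float:
        # Centered moving average, truncated at the edges of the range (assumption).
        vals = [freq[k] for k in range(y - half, y + half + 1) if k in freq]
        return sum(vals) / len(vals)

    for y in years:
        if smoothed(y) < tau:
            continue
        # Guard: count non-zero raw frequencies in the following `lookahead` years.
        following = [freq.get(k, 0.0) for k in range(y + 1, y + 1 + lookahead)]
        if sum(1 for f in following if f > 0) >= guard:
            return y
    return None
```

One consequence of the centered window in this sketch: the returned year can precede the first raw occurrence by up to window // 2 years, because the average at year y already includes year y + 1.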

Tuning knobs:

Stage 3: Build Cumulative Prefix Counts (depth 6)

Input: artifacts/years/first_years.tsv.
Output: artifacts/trie/prefix_counts.jsonl with newline-delimited JSON records like {prefix, depth, year, cumulative_count}.

python -m src.build.build_prefix_trie \
  artifacts/years/first_years.tsv \
  artifacts/trie/prefix_counts.jsonl \
  --depth 6 --start 1800 --end 2019

What this does: For each word with a first year, add its 1–6 letter prefixes starting at that year; then sweep years from 1800 to 2019 and accumulate to produce a non-decreasing series per prefix. Depth 6 is a sweet spot: detailed enough to see families, small enough to render quickly.
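
The add-then-sweep idea can be sketched in a few lines. This is an illustration under assumptions rather than the real src.build.build_prefix_trie: the function name is hypothetical, it skips words whose first year falls outside the start/end range, and it emits one record per known prefix per year, which may not match the actual file layout.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterator, Optional


def cumulative_prefix_counts(
    first_years: Dict[str, Optional[int]],
    depth: int = 6,
    start: int = 1800,
    end: int = 2019,
) -> Iterator[dict]:
    """Yield {prefix, depth, year, cumulative_count} records, year by year."""
    # additions[year][prefix] = number of words that first appear in `year`
    additions = defaultdict(Counter)
    for word, year in first_years.items():
        if year is None or not (start <= year <= end):
            continue  # assumption: unknown or out-of-range years are skipped
        for d in range(1, min(depth, len(word)) + 1):
            additions[year][word[:d]] += 1

    running = Counter()
    for year in range(start, end + 1):
        # Counts only ever increase, so each prefix series is non-decreasing.
        running.update(additions.get(year, {}))
        for prefix in sorted(running):
            yield {
                "prefix": prefix,
                "depth": len(prefix),
                "year": year,
                "cumulative_count": running[prefix],
            }
```

Writing each yielded record with json.dumps on its own line would produce newline-delimited JSON in the shape described above.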

Stage 4: Render the Animation (Radial Layout)

Input: artifacts/trie/prefix_counts.jsonl.
Output: PNG frames under outputs/frames/.

python -m src.viz.render_frames \
  artifacts/trie/prefix_counts.jsonl \
  outputs/frames \
  --width 1920 --height 1080 \
  --min-radius 10 --max-radius 120 \
  --title-font-size 112 --detail-font-size 42 \
  --base-edge-alpha 25 --edge-depth 6 \
  --start-progress 0.25 --end-progress 1.0

What this does: Place prefixes on a polar layout that allocates angular sectors by ancestry; ease the global scale in so early frames are readable; draw edges and nodes by depth with a warm-to-cool palette; overlay the year and simple counters.

Reading the frame: Larger circles mean more cumulative words under that prefix. Edges connect a prefix to its parent. The base letters sit roughly around the circle; descendants branch outward.
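
One simple way to allocate angular sectors by ancestry is to let each prefix split its own sector among its children. The sketch below does that with equal shares and alphabetical order. It is only an assumption about how such a layout could work, not the algorithm in src.viz.render_frames; radial_positions and ring_spacing are hypothetical names, and it assumes every prefix's parent is also present in the input.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def radial_positions(
    prefixes: Iterable[str], ring_spacing: float = 100.0
) -> Dict[str, Tuple[float, float]]:
    """Map each prefix to (angle in radians, radius) on a polar layout."""
    children = defaultdict(list)
    for p in sorted(set(prefixes)):
        children[p[:-1]].append(p)  # the parent is the prefix minus its last letter

    positions: Dict[str, Tuple[float, float]] = {}

    def place(parent: str, lo: float, hi: float) -> None:
        kids = children.get(parent, [])
        if not kids:
            return
        step = (hi - lo) / len(kids)  # assumption: equal share per child
        for i, kid in enumerate(kids):
            k_lo = lo + i * step
            k_hi = k_lo + step
            # Node sits at the middle of its sector, one ring per depth level.
            positions[kid] = ((k_lo + k_hi) / 2, len(kid) * ring_spacing)
            place(kid, k_lo, k_hi)  # descendants stay inside the parent's sector

    place("", 0.0, 2 * math.pi)
    return positions
```

Because a child never leaves its parent's sector, word families stay visually grouped as they branch outward.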

Alternative: Rectangular Layout with Labels

If you prefer a stable grid where labels can sit near nodes, compute positions once and reuse them for every year:

python -m src.viz.layout \
  artifacts/trie/prefix_counts.jsonl \
  artifacts/layout/prefix_positions.json

python -m src.viz.render_frames_rectangular \
  artifacts/trie/prefix_counts.jsonl \
  artifacts/layout/prefix_positions.json \
  outputs/frames \
  --width 1920 --height 1080 --padding 120 \
  --min-radius 5 --max-radius 64 \
  --label-limit 8 --label-depth 4 --label-spacing 20

The rectangular view stacks depths vertically and spreads siblings horizontally by a deterministic rule. Labels indicate prominent prefixes for the current year if there is space.
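
As an illustration of a deterministic rule, the sketch below puts each depth on its own row and spaces the prefixes of a row evenly in alphabetical order, which keeps siblings next to each other. This is an assumed rule, not necessarily the one src.viz.layout uses, and rectangular_positions is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def rectangular_positions(
    prefixes: Iterable[str],
    width: int = 1920,
    height: int = 1080,
    padding: int = 120,
) -> Dict[str, Tuple[float, float]]:
    """Map each prefix to fixed (x, y) pixel coordinates, one row per depth."""
    rows = defaultdict(list)
    for p in sorted(set(prefixes)):  # sorting makes the layout reproducible
        rows[len(p)].append(p)
    if not rows:
        return {}

    usable_w = width - 2 * padding
    usable_h = height - 2 * padding
    max_depth = max(rows)

    positions: Dict[str, Tuple[float, float]] = {}
    for depth, row in rows.items():
        # Depth 1 sits on the top row, the deepest level on the bottom row.
        y = padding + (depth - 1) * usable_h / (max_depth - 1) if max_depth > 1 else height / 2
        for i, prefix in enumerate(row):
            # Each prefix is centered in an equal-width slot of its row.
            positions[prefix] = (padding + (i + 0.5) * usable_w / len(row), y)
    return positions
```

Since the positions depend only on the set of prefixes and not on any year's counts, they can be computed once and reused for every frame.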

Stage 5: Encode MP4 and GIF

Input: frames in outputs/frames/.
Outputs: outputs/english_trie_timelapse.mp4 and outputs/english_trie_timelapse.gif.

python -m src.viz.encode \
  outputs/frames \
  outputs/english_trie_timelapse.mp4 \
  outputs/english_trie_timelapse.gif \
  --fps 7.333 --gif-fps 12 --gif-width 1280

The MP4 is ideal for high-quality playback on most platforms. The GIF is palette-optimized for sharing where video is awkward.
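
If you want to encode by hand, the sketch below builds two ffmpeg command lines of the kind this stage could run. It is not the actual src.viz.encode implementation. The frame file pattern frame_%04d.png, the function names, and the choice to feed frames to the GIF at the GIF frame rate are all assumptions; the ffmpeg options themselves (libx264, yuv420p, and the palettegen/paletteuse filters) are standard.

```python
from typing import List


def mp4_command(frames_dir: str, out_path: str, fps: float = 7.333) -> List[str]:
    """Build an ffmpeg command that encodes numbered PNG frames to H.264."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", f"{frames_dir}/frame_%04d.png",  # hypothetical frame naming
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",  # widely compatible pixel format
        out_path,
    ]


def gif_command(frames_dir: str, out_path: str, fps: float = 12, width: int = 1280) -> List[str]:
    """Build an ffmpeg command that writes a GIF using a generated palette."""
    filters = (
        f"scale={width}:-1:flags=lanczos,"  # resize, keep aspect ratio
        "split[s0][s1];[s0]palettegen[p];[s1][p]paletteuse"  # build and apply a palette
    )
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),  # assumption: one source frame per GIF frame
        "-i", f"{frames_dir}/frame_%04d.png",  # hypothetical frame naming
        "-vf", filters,
        out_path,
    ]
```

Each list can be passed to subprocess.run(cmd, check=True). Generating a palette from the frames themselves, instead of relying on the default GIF palette, is what usually keeps gradients from banding.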

Next Steps