Methodology
Brief Overview
We collect lowercase English lemmas from Wiktionary, compute each lemma’s yearly relative frequency in Google 1-grams, smooth it, and take the earliest year that clears a small threshold and persists for several years as its first appearance. We then aggregate 1–6-letter prefixes and compute a cumulative sum from 1800 to 2019 so values only grow, yielding a compact signal per prefix.
Finally, we render a fixed-layout animation. Depth stacks vertically, horizontal space is deterministic (no jitter), node size scales with the square root of cumulative count, colors encode depth, the background stays dark for contrast, and an eased scale keeps growth natural.
Why Prefixes, and Why a Trie?
Words collect in families. A prefix captures the first letters those families share, and a trie lets us watch that growth at multiple levels at once. One-letter nodes show broad shifts; deeper nodes show fine structure without losing the big picture. The fixed layout helps your eye compare the same family across years.
The Pipeline
1. Extract English Lemmas from Wiktionary
We start with lemmas. The Wiktionary dump is a large XML file. We stream it page by page, keep main-namespace entries tagged as English, and keep titles that look like words: lowercase letters a–z. We write one lemma per line.
What this does: Stream the XML dump, keep main-namespace pages that contain an English section and have titles that look like plain lowercase words ([a-z]+). Each line in the output is a lemma.
Tips:
- The parser streams and clears elements to keep memory stable.
- Lemma titles are lowercased to reduce case duplication later.
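For concreteness, here is a minimal Python sketch of this streaming pass. The function name `extract_lemmas`, the `{*}` namespace wildcards (Python 3.8+), and the assumption of an uncompressed pages-articles dump are illustrative, not the project's exact script.

```python
import re
import xml.etree.ElementTree as ET

WORD_RE = re.compile(r"^[a-z]+$")      # plain lowercase titles only
ENGLISH_HEADER = "==English=="         # language section marker in wikitext

def extract_lemmas(dump_path, out_path):
    """Stream a Wiktionary pages-articles XML dump and write one lemma per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        # iterparse processes one element at a time instead of loading the whole file
        for _, elem in ET.iterparse(dump_path, events=("end",)):
            if not elem.tag.endswith("}page"):
                continue
            ns = elem.find("./{*}ns")
            title = elem.find("./{*}title")
            text = elem.find("./{*}revision/{*}text")
            if (
                ns is not None and ns.text == "0"        # main namespace only
                and title is not None and text is not None
                and ENGLISH_HEADER in (text.text or "")  # page has an English section
            ):
                t = title.text.lower()
                if WORD_RE.match(t):
                    out.write(t + "\n")
            elem.clear()                                 # free the page subtree to keep memory flat
```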
2. Infer a Robust First Year from Google 1-grams
We estimate a first year for each lemma. We read Google Books 1-gram shards and collect two numbers per year: match count and volume count. We compute a relative frequency per year by dividing matches by volumes. We smooth with a short moving average so single spikes do not dominate the output. We take the earliest year at or after 1800 where the smoothed value crosses a small threshold and several of the following years are non-zero. That gives us a robust first appearance.
What this does: For each lemma, read its matching 1-gram rows, sum match and volume counts per year, compute the relative frequency match/volume, smooth with a 3-year centered moving average, then pick the earliest year ≥ 1800 where the smoothed frequency crosses the threshold tau and at least guard of the next five years have non-zero frequency. That gives a first year that is less sensitive to noise.
Heuristic rationale: Raw n-gram counts have OCR and sampling noise. A short moving average kills flicker. The guard looks for persistence, which real adoption tends to have. The cost is missing rare or short-lived words; the benefit is a cleaner animation.
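A minimal sketch of this selection rule, assuming the per-lemma counts have already been summed; the numeric defaults for `tau` and `guard` are placeholders, not the project's actual settings.

```python
def first_year(years_to_counts, tau=1e-8, guard=3, start=1800):
    """Pick a robust first year from per-year (match_count, volume_count) sums.

    years_to_counts: dict year -> (matches, volumes), summed over 1-gram rows.
    """
    if not years_to_counts:
        return None
    years = list(range(min(years_to_counts), max(years_to_counts) + 1))

    # relative frequency = matches / volumes, zero where no volumes exist
    freq = []
    for y in years:
        matches, volumes = years_to_counts.get(y, (0, 0))
        freq.append(matches / volumes if volumes else 0.0)

    # 3-year centered moving average to damp single-year spikes
    smooth = []
    for i in range(len(freq)):
        window = freq[max(0, i - 1): i + 2]
        smooth.append(sum(window) / len(window))

    # earliest year >= start that clears tau and persists afterwards
    for i, y in enumerate(years):
        if y < start or smooth[i] < tau:
            continue
        following = freq[i + 1: i + 6]
        if sum(1 for v in following if v > 0) >= guard:
            return y
    return None
```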
3. Build Cumulative Prefix Counts
We then build prefix signals. For every word with a first year, we take its first 1 to 6 letters. We add those prefixes to a per-year counter starting at that year. Then we run a cumulative sum from 1800 to 2019. The result is a compact time series for each prefix where values only increase.
What this does: For each word with a first year, add its 1–6 letter prefixes starting at that year; then sweep years from 1800 to 2019 and accumulate to produce a non-decreasing series per prefix. Depth 6 is a sweet spot: detailed enough to see families, small enough to render quickly.
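A sketch of the aggregation, assuming `first_years` maps each lemma to the year chosen in the previous step; the function name and defaults are illustrative.

```python
from collections import defaultdict

def cumulative_prefix_counts(first_years, start=1800, end=2019, max_depth=6):
    """Return dict prefix -> list of cumulative counts indexed by (year - start),
    non-decreasing by construction."""
    # number of new words per prefix per year
    additions = defaultdict(lambda: defaultdict(int))
    for word, year in first_years.items():
        if year is None or year > end:
            continue
        y = max(year, start)
        for depth in range(1, min(max_depth, len(word)) + 1):
            additions[word[:depth]][y] += 1

    # sweep years and accumulate so each prefix series only grows
    series = {}
    for prefix, per_year in additions.items():
        total, values = 0, []
        for y in range(start, end + 1):
            total += per_year.get(y, 0)
            values.append(total)
        series[prefix] = values
    return series
```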
4. Render the Animation
We keep the layout fixed. We place nodes by depth vertically. We assign horizontal space deterministically so denser subtrees get more room. Positions do not change over time. That makes the animation easy to read.
We render one PNG per year. Node radius follows the square root of the cumulative count so large values do not overwhelm the frame. Edges fade when a node and its parent have no value that year. We draw short labels for base letters where there is space. We overlay the year and simple totals.
Reading the frame: Larger circles mean more cumulative words under that prefix. Edges connect a prefix to its parent. The base letters sit roughly around the circle; descendants branch outward.
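The sketch below shows one way to realize the fixed layout and square-root sizing described above: depth maps to a row, and each prefix gets a horizontal span proportional to its final cumulative count, nested inside its parent's span, so positions are computed once and never move between frames. The dimensions, names, and scale factor are assumptions; the actual renderer adds edges, labels, colors, and eased scaling on top of this.

```python
import math

def layout_positions(series, width=1920.0, height=1080.0, max_depth=6):
    """Fixed layout: depth -> row, horizontal span proportional to final count."""
    positions = {}

    def place(prefix, x0, x1, depth):
        positions[prefix] = ((x0 + x1) / 2, depth * height / (max_depth + 1))
        # linear scan over all prefixes kept simple for clarity
        children = sorted(p for p in series
                          if len(p) == depth + 1 and p.startswith(prefix))
        total = sum(series[c][-1] for c in children) or 1
        x = x0
        for c in children:
            w = (x1 - x0) * series[c][-1] / total   # denser subtrees get more room
            place(c, x, x + w, depth + 1)
            x += w

    roots = sorted(p for p in series if len(p) == 1)
    total = sum(series[r][-1] for r in roots) or 1
    x = 0.0
    for r in roots:
        w = width * series[r][-1] / total
        place(r, x, x + w, 1)
        x += w
    return positions

def radius(cumulative_count, scale=2.0):
    """Square-root scaling keeps large prefixes from swamping the frame."""
    return scale * math.sqrt(cumulative_count)
```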
5. Encode MP4 and GIF
We encode an MP4 and a GIF. The MP4 uses H.264 and plays back smoothly. The GIF uses a palette for good color at a smaller size. The whole pipeline is one command. It downloads the data, writes artifacts, renders frames, and writes media.
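Illustratively, the encoding step can be driven from Python with standard ffmpeg options: H.264 with yuv420p for a widely playable MP4, and a two-pass palette for the GIF. The file names, frame pattern, and frame rates below are assumptions; in the project they are wrapped inside the single pipeline command.

```python
import subprocess

def encode(frames_pattern="frames/%04d.png", fps=30):
    """Encode the rendered frames into an MP4 and a palette-based GIF."""
    # H.264 MP4; yuv420p keeps the file playable in most players
    subprocess.run([
        "ffmpeg", "-y", "-framerate", str(fps), "-i", frames_pattern,
        "-c:v", "libx264", "-pix_fmt", "yuv420p", "trie.mp4",
    ], check=True)
    # Two-pass GIF: build a palette first, then map frames onto it
    subprocess.run([
        "ffmpeg", "-y", "-i", frames_pattern,
        "-vf", "palettegen", "palette.png",
    ], check=True)
    subprocess.run([
        "ffmpeg", "-y", "-framerate", str(fps // 2), "-i", frames_pattern,
        "-i", "palette.png", "-lavfi", "paletteuse", "trie.gif",
    ], check=True)
```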
Limitations
Google Books reflects its corpus, not the whole world. OCR and metadata errors add noise. Frequency is a proxy, not an authoritative attestation of first use. Smoothing and thresholds trade sensitivity for stability. The method aims for robustness, not perfection.
Learn More
- Step-by-Step Guide – Detailed instructions for running each stage
- Advanced Tuning – Parameter tuning and customization options
- Interpreting Results – How to read and understand the visualization