English Lexicon Time Machine

Watch the entire English language blossom from Wiktionary + Google Books N-grams, rendered as a living, breathing prefix galaxy.

Overview

What you’re seeing is a timelapse of English vocabulary growth from 1800 to 2019. Each node represents a letter prefix, and its size reflects how many words with that prefix have appeared up to that year. The layout is stable over time so your eye can track change.

We built this from two public datasets. Wiktionary provides a list of English lemmas. Google Books 1-grams provide, for each word and year, an occurrence count and the number of volumes the word appears in. We combine them to estimate a robust first year for each word, then accumulate counts by prefix, for prefixes up to six letters long.
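The two steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the `min_count` and `min_run` thresholds, and the "first run of consecutive well-attested years" rule are all assumptions chosen to show one plausible meaning of "robust". Only the six-letter prefix cap comes from the description above.

```python
from collections import Counter

MAX_PREFIX_LEN = 6  # prefixes are accumulated up to six letters


def robust_first_year(yearly_counts, min_count=5, min_run=3):
    """Return the first year that starts `min_run` consecutive years with at
    least `min_count` occurrences each, or None if no such run exists.

    Both thresholds are hypothetical; they only illustrate how isolated
    early hits (OCR noise, misdated volumes) could be ignored.
    """
    years = sorted(yearly_counts)
    for i, start in enumerate(years):
        run = years[i:i + min_run]
        consecutive = len(run) == min_run and run[-1] - run[0] == min_run - 1
        if consecutive and all(yearly_counts[y] >= min_count for y in run):
            return start
    return None


def cumulative_prefix_counts(first_years, year):
    """Count words first seen in or before `year` under each 1..6 letter prefix."""
    counts = Counter()
    for word, first in first_years.items():
        if first is None or first > year:
            continue
        for n in range(1, min(len(word), MAX_PREFIX_LEN) + 1):
            counts[word[:n]] += 1
    return counts


# Toy counts invented for illustration only; they are not real N-gram data.
toy_counts = {
    "cat": {1800: 10, 1801: 12, 1802: 11},
    "catalog": {1850: 1, 1900: 8, 1901: 9, 1902: 7},
}
toy_first_years = {w: robust_first_year(c) for w, c in toy_counts.items()}
toy_1900 = cumulative_prefix_counts(toy_first_years, 1900)
```

In the toy data, the single 1850 hit for "catalog" is skipped because it is not followed by consecutive well-attested years, so the word only starts contributing to the `c`, `ca`, `cat`, ... prefixes from 1900.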

A lemma is the canonical, or dictionary, form of a set of related word forms. For example, "run" is the lemma of "runs", "ran", and "running".

English Lexicon Time Machine visualizes the evolution of the English language through time. By combining Wiktionary lemmas with Google Books N-gram data, we trace when words first appeared and render their growth as a radial prefix trie that expands across decades.
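One way to picture a radial prefix trie with a stable layout is to derive each node's position from its prefix alone, so that it never moves between frames. The sketch below is an assumption about how such a layout could work, not a description of `src/viz/layout.py`: the function name, the equal 26-way angular split, and `ring_spacing` are all hypothetical, and it handles only lowercase a-z.

```python
import math
import string

ALPHABET = string.ascii_lowercase


def prefix_position(prefix, ring_spacing=1.0):
    """Map a lowercase a-z prefix to (x, y) on a radial layout.

    The root owns the full circle; each further letter narrows the parent's
    angular sector to one of 26 equal slices. Depth sets the ring radius.
    """
    lo, hi = 0.0, 2.0 * math.pi
    for ch in prefix:
        idx = ALPHABET.index(ch)  # raises ValueError for non a-z characters
        width = (hi - lo) / len(ALPHABET)
        lo, hi = lo + idx * width, lo + (idx + 1) * width
    angle = (lo + hi) / 2.0
    radius = ring_spacing * len(prefix)
    return radius * math.cos(angle), radius * math.sin(angle)
```

Because the position depends only on the prefix string, a node such as `tele` sits at the same coordinates in every frame, and only its size needs to change over time.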

Key Features

Quickstart

bash setup.sh

The script will:

  1. Create/upgrade venv/ with Python 3
  2. Download Wiktionary + Google Books 1-gram shards (a-z)
  3. Extract English lemmas, infer first-use years, aggregate prefix counts
  4. Render 220 radial frames (outputs/frames/frame-0000.png to frame-0219.png)
  5. Encode outputs/english_trie_timelapse.mp4 and a share-ready GIF
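The frame count in step 4 lines up with one frame per year from 1800 to 2019 (220 years inclusive). Assuming that mapping holds, which the steps above suggest but do not state outright, a year can be converted to its frame file name like this. The helper name is hypothetical.

```python
START_YEAR, END_YEAR = 1800, 2019


def frame_name(year):
    """Return the frame file name for a year, assuming one frame per year."""
    if not START_YEAR <= year <= END_YEAR:
        raise ValueError(f"year {year} is outside {START_YEAR}-{END_YEAR}")
    return f"frame-{year - START_YEAR:04d}.png"


first = frame_name(1800)  # "frame-0000.png"
last = frame_name(2019)   # "frame-0219.png"
```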

Rerun the script anytime; thanks to artifact caching, later runs jump straight to rendering.
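Artifact caching of this kind is often as simple as skipping a stage when its output file already exists. The following sketch shows that pattern under that assumption; `run_if_missing` and the `build` callback are hypothetical names, and the real script may use a different check (timestamps, checksums, and so on).

```python
from pathlib import Path


def run_if_missing(output, build):
    """Run `build(output_path)` only if `output` is absent or empty.

    Returns True when the stage ran and False when the cached file was reused.
    """
    output = Path(output)
    if output.exists() and output.stat().st_size > 0:
        return False  # cached artifact found, skip the stage
    output.parent.mkdir(parents=True, exist_ok=True)
    build(output)
    return True
```

With this pattern, deleting a single artifact forces only the stage that produces it to run again on the next pass.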

Pipeline Architecture

| Stage | Script | Output |
| --- | --- | --- |
| Lemma extraction | src/ingest/wiktionary_extract.py | artifacts/lemmas/lemmas.tsv |
| First-year inference | src/ingest/ngram_first_year.py | artifacts/years/first_years.tsv |
| Prefix aggregation | src/build/build_prefix_trie.py | artifacts/trie/prefix_counts.jsonl |
| Layout generation | src/viz/layout.py | artifacts/layout/prefix_positions.json |
| Frame rendering | src/viz/render_frames.py | outputs/frames/ |
| Encoding | src/viz/encode.py | outputs/english_trie_timelapse.mp4 + .gif |

Render Only (after initial run)

source venv/bin/activate
python -m src.viz.render_frames artifacts/trie/prefix_counts.jsonl outputs/frames
python -m src.viz.encode outputs/frames outputs/english_trie_timelapse.mp4 outputs/english_trie_timelapse.gif

Use flags such as --min-radius, --max-radius, --base-edge-alpha, or --start-progress to tune the visualization.
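To give a feel for what a minimum and maximum radius typically control, here is a sketch that maps a prefix's cumulative word count to a node radius. It is an assumption about the general technique, not the renderer's actual formula: the function name, the square-root scaling (so node area grows roughly linearly with count), and the default values are all hypothetical.

```python
import math


def node_radius(count, max_count, min_radius=2.0, max_radius=40.0):
    """Scale a word count to a radius between min_radius and max_radius.

    Square-root scaling keeps node *area* roughly proportional to the count.
    The defaults are illustrative and are not the project's defaults.
    """
    if count <= 0 or max_count <= 0:
        return min_radius
    t = math.sqrt(min(count, max_count) / max_count)
    return min_radius + t * (max_radius - min_radius)
```

Under this scheme, raising the minimum radius makes rare prefixes easier to see, while lowering the maximum radius keeps the largest prefixes from crowding out their neighbors.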

Neo4j Integration

Load artifacts/years/first_years.tsv to explore the word data in Neo4j (compatible with both Community and Enterprise editions):

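// $rows must be a list of {word, first_year} maps, set with :param rows => [...] in Neo4j Browser or passed in by a driver.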
UNWIND $rows AS row
WITH row WHERE row.word IS NOT NULL AND row.word <> ""
MERGE (w:Word {text: row.word})
SET w.first_year = CASE
  WHEN row.first_year = "" THEN NULL
  ELSE toInteger(row.first_year)
END;
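The query expects `$rows` to be a list of maps with `word` and `first_year` keys. A minimal sketch for building that list from the TSV is shown below. It assumes the file has a header row with columns named `word` and `first_year`, which is inferred from the query rather than confirmed, and the helper names and batch size are hypothetical.

```python
import csv


def read_rows(path):
    """Read the TSV into a list of {"word", "first_year"} string maps."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters inside words as plain text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [
            {"word": r.get("word") or "", "first_year": r.get("first_year") or ""}
            for r in reader
        ]


def batches(rows, size=1000):
    """Yield consecutive slices of `rows`, each at most `size` long."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]
```

Each batch can then be supplied as the `rows` parameter when running the query through a Neo4j driver. Keeping `first_year` as a string matches the query, which converts it with `toInteger` and maps the empty string to NULL.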

Documentation

Community