English Lexicon Time Machine

Watch the entire English language blossom from Wiktionary + Google Books N-grams, rendered as a living, breathing prefix galaxy.

Overview

What you’re seeing is a timelapse of English vocabulary growth from 1800 to 2019. Each node represents a letter prefix, and its size reflects how many words with that prefix have appeared up to that year. The layout is stable over time so your eye can track change.
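The node sizes described above come down to cumulative counting. Here is a minimal sketch, assuming each word already has a first-use year. The six-letter cap comes from the pipeline description; the function names and the choice to fold pre-1800 words into the first frame are illustrative assumptions, not the project's actual code.

```python
from collections import Counter

MAX_PREFIX_LEN = 6  # prefixes are aggregated up to six letters


def prefixes(word, max_len=MAX_PREFIX_LEN):
    """Return every leading substring of `word`, up to `max_len` letters."""
    return [word[:n] for n in range(1, min(len(word), max_len) + 1)]


def cumulative_prefix_counts(first_years, start=1800, end=2019):
    """Map each year to a Counter of prefix -> number of words seen so far.

    `first_years` maps word -> first-use year.
    """
    new_by_year = {}
    for word, year in first_years.items():
        # Assumption: words first seen before `start` count from the first frame.
        new_by_year.setdefault(max(year, start), []).append(word)

    running = Counter()
    snapshots = {}
    for year in range(start, end + 1):
        for word in new_by_year.get(year, []):
            running.update(prefixes(word))
        snapshots[year] = Counter(running)  # copy, so later years don't mutate it
    return snapshots
```

Because the counts only ever grow, a node can never shrink between frames, which matches the "appeared up to that year" wording.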

We built this from two public datasets. Wiktionary provides a list of English lemmas. Google Books 1-grams provides yearly match counts and volume counts for each word. We combine them to estimate a robust first year of use for each lemma, then accumulate counts over prefixes of up to six letters.
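This README does not spell out what makes a first year "robust". One plausible reading is that a single stray occurrence (an OCR error or a misdated volume) should not count as first use. The sketch below illustrates that idea only; the `min_volumes` and `run_length` thresholds and the function name are hypothetical and are not taken from `src/ingest/ngram_first_year.py`.

```python
def robust_first_year(yearly, min_volumes=2, run_length=3):
    """Estimate a first-use year from yearly 1-gram statistics.

    `yearly` maps year -> (match_count, volume_count).
    Returns the earliest year that starts `run_length` consecutive years,
    each appearing in at least `min_volumes` volumes, or None if no such
    run exists. Both thresholds are illustrative assumptions.
    """
    qualifying = {
        year for year, (_matches, volumes) in yearly.items()
        if volumes >= min_volumes
    }
    for year in sorted(qualifying):
        if all(year + offset in qualifying for offset in range(run_length)):
            return year
    return None
```

Requiring a sustained run across several volumes trades a slightly later date for fewer false early sightings.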

A lemma is the canonical, or dictionary, form of a set of related word forms. For example, "run" is the lemma of "runs", "ran", and "running".

English Lexicon Time Machine visualizes the evolution of the English language through time. By combining Wiktionary lemmas with Google Books N-gram data, we trace when words first appeared and render their growth as a radial prefix trie that expands across decades.
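To make "radial prefix trie" concrete, here is a minimal layout sketch. It assumes each prefix sits on a ring set by its length and that a parent's angular wedge is split evenly among its children in alphabetical order. The even split, the `ring_spacing` parameter, and the function name are assumptions for illustration; the real `src/viz/layout.py` may weight wedges differently.

```python
import math


def radial_layout(prefix_list, ring_spacing=1.0):
    """Return {prefix: (x, y)} for a prefix-closed collection of prefixes.

    Prefix-closed means that if "ab" is present, "a" is present too;
    prefixes whose parent is missing are never reached and get no position.
    """
    children = {}
    for prefix in set(prefix_list):
        children.setdefault(prefix[:-1], []).append(prefix)  # parent of "a" is ""

    positions = {}

    def place(node, lo, hi):
        kids = sorted(children.get(node, []))  # sorted order keeps the layout stable
        if not kids:
            return
        step = (hi - lo) / len(kids)
        for index, kid in enumerate(kids):
            kid_lo = lo + index * step
            kid_hi = kid_lo + step
            angle = (kid_lo + kid_hi) / 2      # centre of the child's wedge
            radius = len(kid) * ring_spacing   # depth decides the ring
            positions[kid] = (radius * math.cos(angle), radius * math.sin(angle))
            place(kid, kid_lo, kid_hi)

    place("", 0.0, 2 * math.pi)
    return positions
```

Positions depend only on the full set of prefixes, not on the year, so computing the layout once and reusing it for every frame keeps nodes from jumping around as counts grow.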

Key Features

Zero-config setup – ./setup.sh spins up the virtualenv, fetches every dataset, caches the heavy lifts, and ships final MP4/GIF output
Radial growth cinematics – the trie erupts from the core alphabet, framing decades of linguistic evolution as a neon fractal
Repeatable science – every artifact (lemmata, first-year inference, trie counts, layouts) checkpoints to disk and into a reusable tarball for instant re-renders
Battle-tested – streams all 26 1-gram shards (a–z), handles 1.4 GB Wiktionary dumps, and renders 220 frames in glorious 1080p

Quickstart

bash setup.sh

The script will:

Create/upgrade venv/ with Python 3
Download Wiktionary + Google Books 1-gram shards (a–z)
Extract English lemmas, infer first-use years, aggregate prefix counts
Render 220 radial frames (outputs/frames/frame-0000.png → frame-0219.png)
Encode outputs/english_trie_timelapse.mp4 and a share-ready GIF

Rerun the script at any time. Cached artifacts let later runs skip straight to rendering.
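The frame numbering above (0000 to 0219) matches the 220 years from 1800 to 2019, which suggests, but does not confirm, one frame per year. Under that assumption, a small helper can locate the frame for a given year; `frame_path` is a hypothetical name, not part of the project.

```python
START_YEAR, END_YEAR = 1800, 2019


def frame_path(year, frames_dir="outputs/frames"):
    """Return the PNG path for `year`, assuming one frame per year from 1800."""
    if not START_YEAR <= year <= END_YEAR:
        raise ValueError(f"year must be between {START_YEAR} and {END_YEAR}")
    return f"{frames_dir}/frame-{year - START_YEAR:04d}.png"
```

If the renderer interpolates between years or honours a flag such as --start-progress, this mapping will not hold.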

Pipeline Architecture

Stage	Script	Output
Lemma extraction	`src/ingest/wiktionary_extract.py`	`artifacts/lemmas/lemmas.tsv`
First-year inference	`src/ingest/ngram_first_year.py`	`artifacts/years/first_years.tsv`
Prefix aggregation	`src/build/build_prefix_trie.py`	`artifacts/trie/prefix_counts.jsonl`
Layout generation	`src/viz/layout.py`	`artifacts/layout/prefix_positions.json`
Frame rendering	`src/viz/render_frames.py`	`outputs/frames/`
Encoding	`src/viz/encode.py`	`outputs/english_trie_timelapse.mp4` + `.gif`

Render Only (after initial run)

source venv/bin/activate
python -m src.viz.render_frames artifacts/trie/prefix_counts.jsonl outputs/frames
python -m src.viz.encode outputs/frames outputs/english_trie_timelapse.mp4 outputs/english_trie_timelapse.gif

Use flags such as --min-radius, --max-radius, --base-edge-alpha, or --start-progress to tune the visualization.

Neo4j Integration

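Load artifacts/years/first_years.tsv to explore the word data in Neo4j (compatible with both Community and Enterprise editions). Pass the parsed TSV rows as the $rows query parameter, a list of maps with word and first_year keys:

// $rows: list of maps, each with "word" and "first_year" string values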
UNWIND $rows AS row
WITH row WHERE row.word IS NOT NULL AND row.word <> ""
MERGE (w:Word {text: row.word})
SET w.first_year = CASE
  WHEN row.first_year = "" THEN NULL
  ELSE toInteger(row.first_year)
END;
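The query reads from a `$rows` parameter, so the TSV has to be parsed on the client side first. A minimal sketch using only the standard library, assuming the file has a header row with `word` and `first_year` columns (the column names are inferred from the query; the header row and the helper names are assumptions):

```python
import csv


def read_rows(tsv_path):
    """Parse the TSV into a list of dicts shaped like the query's `row` maps."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [
            {
                "word": record.get("word") or "",
                # Keep the year as a string; the query converts it with toInteger().
                "first_year": record.get("first_year") or "",
            }
            for record in reader
        ]


def batches(rows, size=10_000):
    """Yield successive slices of `rows` so each transaction stays small."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]
```

Each batch can then be sent as the `rows` parameter through whichever Neo4j client you use, for example `session.run(query, rows=batch)` with the official Python driver.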

Documentation

Getting Started – Quick setup guide
Methodology – How the visualization works
Step-by-Step Guide – Detailed instructions for each stage
Advanced Tuning – Parameter customization options
Interpreting Results – Understanding the visualization
Troubleshooting – Common issues and solutions

Community

GitHub Organization – More helpful resources
GitHub Repository – Source code and issues
X Community – Join discussions on Knowledge Graphs, GNNs, and Graph Databases