# Troubleshooting
Common issues and solutions when working with the English Lexicon Time Machine.
## Setup Issues

### ffmpeg Not Found

**Problem:** The encode step fails with “ffmpeg not found”.

**Solution:** Install ffmpeg and ensure it’s on your `PATH`:
```bash
# macOS
brew install ffmpeg

# Linux (Ubuntu/Debian)
sudo apt-get install ffmpeg

# Verify installation
ffmpeg -version
```
Then re-run the encode step.
### Fonts Missing

**Problem:** Text rendering looks poor or uses bitmap fonts.

**Solution:** The renderer tries `DejaVuSans.ttf`, then `Arial.ttf`, then a default bitmap font. Install DejaVu Sans or Arial for better text rendering:
```bash
# macOS
brew install font-dejavu

# Linux
sudo apt-get install fonts-dejavu
```
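
For reference, the fallback order described above maps to a pattern like the following. This is a minimal sketch assuming the renderer uses Pillow; the function name is illustrative, not the project's actual code:

```python
from PIL import ImageFont

def load_font(size):
    """Try DejaVu Sans, then Arial, then Pillow's built-in bitmap font."""
    for name in ("DejaVuSans.ttf", "Arial.ttf"):
        try:
            # truetype() searches system font paths and raises OSError on a miss
            return ImageFont.truetype(name, size)
        except OSError:
            continue
    # Last resort: the bitmap font, which ignores size and looks poor when scaled
    return ImageFont.load_default()
```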
### Virtual Environment Fails

**Problem:** Python virtual environment creation fails.

**Solution:**
- Ensure Python 3.8+ is installed and accessible
- Check that `python3` and `pip3` are in your PATH
- Try creating the venv manually:
```bash
python3 -m venv venv
```
### Download Errors

**Problem:** Dataset downloads fail or are corrupted.

**Solution:**

- Check your internet connection and retry; transient failures are common
- The setup script validates gzip headers; if a shard fails validation, re-download it (a manual check is sketched below)
- For slow networks, run downloads overnight
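
If you want to check shards by hand, every valid gzip file starts with the magic bytes `0x1f 0x8b`. A minimal sketch; the shard directory is a placeholder, so point it at wherever your shards were downloaded:

```python
import pathlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any valid gzip stream

def has_gzip_header(path):
    with path.open("rb") as f:
        return f.read(2) == GZIP_MAGIC

# "data/ngrams" is a placeholder; use your actual shard directory
for shard in sorted(pathlib.Path("data/ngrams").glob("*.gz")):
    if not has_gzip_header(shard):
        print(f"bad header, re-download: {shard}")
```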
## Processing Issues

### Memory Errors

**Problem:** Out of memory errors during processing.

**Solution:**
- The dataset processing can be memory-intensive; ensure sufficient RAM
- The pipeline streams data where possible, but some stages load data into memory (see the streaming sketch after this list)
- Try processing smaller time ranges or reducing prefix depth
- Close other applications to free up memory
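
If you write your own processing on top of the shards, iterate gzip files line by line rather than reading them whole; this keeps memory flat regardless of shard size. A sketch, with a placeholder filename:

```python
import gzip

count = 0
# "shard.gz" is a placeholder for any downloaded n-gram shard
with gzip.open("shard.gz", "rt", encoding="utf-8") as f:
    for line in f:   # decompresses incrementally; memory stays flat
        count += 1   # replace with your per-line processing
# Avoid f.read() or f.readlines(): both inflate the entire shard into RAM.
print(count, "lines")
```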
### Shards Corrupted

**Problem:** N-gram shard files are corrupted.

**Solution:**

- The setup script validates gzip headers; a deeper check that catches truncated downloads is sketched below
- If a shard fails validation, delete it and re-download
- Check your connection stability
- Try downloading shards individually if needed
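
A header check only inspects the first two bytes, so it misses downloads truncated mid-file. To be thorough, decompress each suspect shard all the way to EOF; a truncated gzip stream raises `EOFError`. A sketch:

```python
import gzip

def shard_is_intact(path):
    """Decompress to EOF, discarding output; truncation surfaces as EOFError."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1 << 20):  # read 1 MiB at a time
                pass
        return True
    except (EOFError, OSError):     # OSError also covers "not a gzipped file"
        return False

print(shard_is_intact("shard.gz"))  # "shard.gz" is a placeholder path
```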
### Slow Network

**Problem:** Downloads take a very long time.

**Solution:**
- Run the steps overnight
- The compute stages are streaming and modest on memory; download time dominates on fresh runs
- Consider using a faster connection or downloading during off-peak hours
## Rendering Issues

### Missing Frames

**Problem:** Some frames are missing from the output.

**Solution:**
- Re-run the rendering step after ensuring artifacts exist
- Check that `artifacts/trie/prefix_counts.jsonl` is complete (a quick validity check is sketched below)
- Verify the time range parameters match your data
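
A quick way to check that the JSONL file is complete is to confirm every line parses; a truncated write usually leaves a malformed final line. A sketch:

```python
import json

path = "artifacts/trie/prefix_counts.jsonl"
bad = None
with open(path, encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad = i
            break
if bad:
    print(f"line {bad} is malformed; the file is likely truncated")
else:
    print("all lines parse as JSON")
```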
### Low Quality Output

**Problem:** Rendered frames look pixelated or low quality.

**Solution:**
- Adjust resolution parameters in the rendering script
- Increase `--width` and `--height` values
- Check that your display settings match the output resolution
### Performance Issues

**Problem:** Rendering takes too long.

**Solution:**

- Use artifact caching to skip re-processing on subsequent runs; the pipeline is designed to cache intermediate results
- Reduce prefix depth (e.g., `--depth 4` instead of `6`)
- Lower output resolution
- Process smaller time ranges
## Encoding Issues

### MP4 Encoding Fails

**Problem:** MP4 encoding produces errors or fails.

**Solution:**
- Verify ffmpeg is installed and working: `ffmpeg -version`
- Check that input frames exist and are valid PNG files
- Ensure sufficient disk space for output files
- Try encoding with different codec options; a known-good baseline is sketched below
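
If the pipeline's encode step keeps failing, try a known-good invocation by hand and compare. This is not the project's exact command, just a widely compatible baseline; the frame pattern and output name are placeholders:

```python
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-framerate", "30",       # match your render frame rate
        "-i", "frames/%05d.png",  # placeholder: zero-padded PNG frames
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",    # required by many players for H.264
        "-crf", "18",             # lower = higher quality, bigger file
        "out.mp4",
    ],
    check=True,                   # raise if ffmpeg exits non-zero
)
```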
### GIF Too Large

**Problem:** GIF file size is too large for sharing.

**Solution:**

- Reduce GIF width: `--gif-width 1280` or lower
- Lower GIF frame rate: `--gif-fps 8` or lower
- Use MP4 instead for high-quality sharing (most platforms support it); if a GIF is required, the palette trick sketched below also helps
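
Beyond width and frame rate, ffmpeg's two-pass palette filters often cut GIF size substantially while improving color. A sketch with placeholder filenames; the fps and scale values mirror the flags above:

```python
import subprocess

# Generate one shared 256-color palette, then encode with it, in a single
# command via a labeled filter graph.
subprocess.run(
    [
        "ffmpeg", "-y", "-i", "out.mp4",
        "-filter_complex",
        "fps=8,scale=1280:-1:flags=lanczos,split[a][b];"
        "[a]palettegen[p];[b][p]paletteuse",
        "out.gif",
    ],
    check=True,
)
```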
### Encoding Quality Issues

**Problem:** Encoded video/GIF quality is poor.

**Solution:**
- Increase input frame resolution
- Adjust encoding parameters (bitrate, quality settings)
- Use MP4 format for better quality (GIF has inherent limitations)
## Data Issues

### No Words Found

**Problem:** First-year inference finds very few words.

**Solution:**
- Check that lemma extraction completed successfully
- Verify n-gram data is downloaded and accessible
- Lower the `--tau` threshold (e.g., `1e-10`)
- Reduce the `--guard` requirement
- Check that your time range (`--start-year`, `--end-year`) is appropriate; see the sketch below for how `--tau` and `--guard` interact
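
To see why lowering these helps, here is a minimal sketch of a plausible thresholding rule. This is an illustration of how `--tau` and `--guard` typically interact, not this pipeline's actual code: take a word's first year as the start of the first run of `guard` consecutive data points whose relative frequency is at least `tau`.

```python
def first_year(freq_by_year, tau=1e-9, guard=3):
    """Illustrative only: first year opening a run of `guard` years with freq >= tau.

    Lowering tau admits rarer words; lowering guard accepts shorter runs,
    so both loosen the rule and yield more (and earlier) first years.
    """
    run_start, run_len = None, 0
    for year in sorted(freq_by_year):
        if freq_by_year[year] >= tau:
            if run_len == 0:
                run_start = year
            run_len += 1
            if run_len >= guard:
                return run_start
        else:
            run_len = 0
    return None

# Example: with tau=2e-9 nothing qualifies; with the default 1e-9 the run starts in 1905.
print(first_year({1903: 5e-10, 1904: 8e-10, 1905: 1e-9, 1906: 2e-9, 1907: 3e-9}))
```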
### Unexpected First Years

**Problem:** First years seem incorrect or inconsistent.

**Solution:**
- Remember: frequency is a proxy metric, not a legal attestation
- OCR and metadata add noise to the underlying data
- Smoothing and thresholds trade sensitivity for stability
- The method aims for robustness, not perfection
- Adjust `--tau`, `--guard`, and `--window` parameters if needed
## General Tips

### Performance and Caching
The pipeline streams data where possible and writes intermediate artifacts. If you work iteratively, you can skip stages that are already complete:
- Lemma extraction: skip if `artifacts/lemmas/lemmas.tsv` exists
- First-year inference: skip if `artifacts/years/first_years.tsv` exists
- Prefix aggregation: skip if `artifacts/trie/prefix_counts.jsonl` exists
The provided setup script also preserves a simple artifact cache under `artifacts/` so repeats are fast. The heavy hitters are downloading n-gram shards and scanning them once.
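
If you script your own runs, the same existence checks are easy to automate. A sketch; the stage commands are placeholders for however you invoke each step, not the pipeline's actual entry points:

```python
import pathlib
import subprocess

# (artifact that marks a stage as done, command that produces it)
stages = [
    ("artifacts/lemmas/lemmas.tsv",        ["python3", "extract_lemmas.py"]),
    ("artifacts/years/first_years.tsv",    ["python3", "infer_first_years.py"]),
    ("artifacts/trie/prefix_counts.jsonl", ["python3", "aggregate_prefixes.py"]),
]
for artifact, cmd in stages:
    if pathlib.Path(artifact).exists():
        print(f"skip: {artifact} already exists")
    else:
        subprocess.run(cmd, check=True)  # stop the run if a stage fails
```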
## Getting Help
If you encounter issues not covered here:
- Open an issue on GitHub
- Include error messages and relevant log output
- Specify your operating system and Python version
- Describe what you were trying to do when the error occurred
## See Also
- Getting Started – Initial setup guide
- Step-by-Step Guide – Detailed stage-by-stage instructions
- Advanced Tuning – Parameter customization options