# Troubleshooting
Common issues and solutions when working with the English Lexicon Time Machine.
## Setup Issues

### ffmpeg Not Found

**Problem:** The encode step fails with “ffmpeg not found”.

**Solution:** Install ffmpeg and ensure it’s on your `PATH`:
```bash
# macOS
brew install ffmpeg

# Linux (Ubuntu/Debian)
sudo apt-get install ffmpeg

# Verify installation
ffmpeg -version
```
Then re-run the encode step.
### Fonts Missing

**Problem:** Text rendering looks poor or uses bitmap fonts.

**Solution:** The renderer tries `DejaVuSans.ttf`, then `Arial.ttf`, then a default bitmap font. Install DejaVu Sans or Arial for better text rendering:
```bash
# macOS
brew install font-dejavu

# Linux
sudo apt-get install fonts-dejavu
```
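
For reference, the fallback order described above maps to a pattern like the following. This is a minimal sketch assuming the renderer uses Pillow; the function name is illustrative, not the project's actual code:

```python
from PIL import ImageFont

def load_font(size):
    """Try DejaVu Sans, then Arial, then Pillow's built-in bitmap font."""
    for name in ("DejaVuSans.ttf", "Arial.ttf"):
        try:
            # truetype() searches system font paths and raises OSError on a miss
            return ImageFont.truetype(name, size)
        except OSError:
            continue
    # Last resort: the bitmap font, which ignores size and looks poor when scaled
    return ImageFont.load_default()
```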
### Virtual Environment Fails

**Problem:** Python virtual environment creation fails.

**Solution:**
- Ensure Python 3.8+ is installed and accessible
- Check that `python3` and `pip3` are in your PATH
- Try creating the venv manually:
```bash
python3 -m venv venv
```
### Download Errors

**Problem:** Dataset downloads fail or are corrupted.

**Solution:**

- Check your internet connection and retry; transient failures are common
- The setup script validates gzip headers; if a shard fails validation, re-download it (a manual check is sketched below)
- For slow networks, run downloads overnight
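
If you want to check shards by hand, every valid gzip file starts with the magic bytes `0x1f 0x8b`. A minimal sketch; the shard directory is a placeholder, so point it at wherever your shards were downloaded:

```python
import pathlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any valid gzip stream

def has_gzip_header(path):
    with path.open("rb") as f:
        return f.read(2) == GZIP_MAGIC

# "data/ngrams" is a placeholder; use your actual shard directory
for shard in sorted(pathlib.Path("data/ngrams").glob("*.gz")):
    if not has_gzip_header(shard):
        print(f"bad header, re-download: {shard}")
```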
## Processing Issues

### Memory Errors

**Problem:** Out of memory errors during processing.

**Solution:**
- The dataset processing can be memory-intensive; ensure sufficient RAM
- The pipeline streams data where possible, but some stages load data into memory (see the streaming sketch after this list)
- Try processing smaller time ranges or reducing prefix depth
- Close other applications to free up memory
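
If you write your own processing on top of the shards, iterate gzip files line by line rather than reading them whole; this keeps memory flat regardless of shard size. A sketch, with a placeholder filename:

```python
import gzip

count = 0
# "shard.gz" is a placeholder for any downloaded n-gram shard
with gzip.open("shard.gz", "rt", encoding="utf-8") as f:
    for line in f:   # decompresses incrementally; memory stays flat
        count += 1   # replace with your per-line processing
# Avoid f.read() or f.readlines(): both inflate the entire shard into RAM.
print(count, "lines")
```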
### Shards Corrupted

**Problem:** N-gram shard files are corrupted.

**Solution:**

- The setup script validates gzip headers; a deeper check that catches truncated downloads is sketched below
- If a shard fails validation, delete it and re-download
- Check your connection stability
- Try downloading shards individually if needed
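
A header check only inspects the first two bytes, so it misses downloads truncated mid-file. To be thorough, decompress each suspect shard all the way to EOF; a truncated gzip stream raises `EOFError`. A sketch:

```python
import gzip

def shard_is_intact(path):
    """Decompress to EOF, discarding output; truncation surfaces as EOFError."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1 << 20):  # read 1 MiB at a time
                pass
        return True
    except (EOFError, OSError):     # OSError also covers "not a gzipped file"
        return False

print(shard_is_intact("shard.gz"))  # "shard.gz" is a placeholder path
```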
### Slow Network

**Problem:** Downloads take a very long time.

**Solution:**
- Run the steps overnight
- The compute stages are streaming and modest on memory; download time dominates on fresh runs
- Consider using a faster connection or downloading during off-peak hours
## Rendering Issues

### Missing Frames

**Problem:** Some frames are missing from the output.

**Solution:**
- Re-run the rendering step after ensuring artifacts exist
- Check that `artifacts/trie/prefix_counts.jsonl` is complete (a quick validity check is sketched below)
- Verify the time range parameters match your data
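
A quick way to check that the JSONL file is complete is to confirm every line parses; a truncated write usually leaves a malformed final line. A sketch:

```python
import json

path = "artifacts/trie/prefix_counts.jsonl"
bad = None
with open(path, encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad = i
            break
if bad:
    print(f"line {bad} is malformed; the file is likely truncated")
else:
    print("all lines parse as JSON")
```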
### Low Quality Output

**Problem:** Rendered frames look pixelated or low quality.

**Solution:**
- Adjust resolution parameters in the rendering script
- Increase `--width` and `--height` values
- Check that your display settings match the output resolution
### Performance Issues

**Problem:** Rendering takes too long.

**Solution:**

- Use artifact caching to skip re-processing on subsequent runs; the pipeline is designed to cache intermediate results
- Reduce prefix depth (e.g., `--depth 4` instead of `6`)
- Lower output resolution
- Process smaller time ranges
## Encoding Issues

### MP4 Encoding Fails

**Problem:** MP4 encoding produces errors or fails.

**Solution:**
- Verify ffmpeg is installed and working: `ffmpeg -version`
- Check that input frames exist and are valid PNG files
- Ensure sufficient disk space for output files
- Try encoding with different codec options; a known-good baseline is sketched below
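
If the pipeline's encode step keeps failing, try a known-good invocation by hand and compare. This is not the project's exact command, just a widely compatible baseline; the frame pattern and output name are placeholders:

```python
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-framerate", "30",       # match your render frame rate
        "-i", "frames/%05d.png",  # placeholder: zero-padded PNG frames
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",    # required by many players for H.264
        "-crf", "18",             # lower = higher quality, bigger file
        "out.mp4",
    ],
    check=True,                   # raise if ffmpeg exits non-zero
)
```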
### GIF Too Large

**Problem:** GIF file size is too large for sharing.

**Solution:**

- Reduce GIF width: `--gif-width 1280` or lower
- Lower GIF frame rate: `--gif-fps 8` or lower
- Use MP4 instead for high-quality sharing (most platforms support it); if a GIF is required, the palette trick sketched below also helps
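
Beyond width and frame rate, ffmpeg's two-pass palette filters often cut GIF size substantially while improving color. A sketch with placeholder filenames; the fps and scale values mirror the flags above:

```python
import subprocess

# Generate one shared 256-color palette, then encode with it, in a single
# command via a labeled filter graph.
subprocess.run(
    [
        "ffmpeg", "-y", "-i", "out.mp4",
        "-filter_complex",
        "fps=8,scale=1280:-1:flags=lanczos,split[a][b];"
        "[a]palettegen[p];[b][p]paletteuse",
        "out.gif",
    ],
    check=True,
)
```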
### Encoding Quality Issues

**Problem:** Encoded video/GIF quality is poor.

**Solution:**
- Increase input frame resolution
- Adjust encoding parameters (bitrate, quality settings)
- Use MP4 format for better quality (GIF has inherent limitations)
## Data Issues

### No Words Found

**Problem:** First-year inference finds very few words.

**Solution:**
- Check that lemma extraction completed successfully
- Verify n-gram data is downloaded and accessible
- Lower the `--tau` threshold (e.g., `1e-10`)
- Reduce the `--guard` requirement
- Check that your time range (`--start-year`, `--end-year`) is appropriate; see the sketch below for how `--tau` and `--guard` interact
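
To see why lowering these helps, here is a minimal sketch of a plausible thresholding rule. This is an illustration of how `--tau` and `--guard` typically interact, not this pipeline's actual code: take a word's first year as the start of the first run of `guard` consecutive data points whose relative frequency is at least `tau`.

```python
def first_year(freq_by_year, tau=1e-9, guard=3):
    """Illustrative only: first year opening a run of `guard` years with freq >= tau.

    Lowering tau admits rarer words; lowering guard accepts shorter runs,
    so both loosen the rule and yield more (and earlier) first years.
    """
    run_start, run_len = None, 0
    for year in sorted(freq_by_year):
        if freq_by_year[year] >= tau:
            if run_len == 0:
                run_start = year
            run_len += 1
            if run_len >= guard:
                return run_start
        else:
            run_len = 0
    return None

# Example: with tau=2e-9 nothing qualifies; with the default 1e-9 the run starts in 1905.
print(first_year({1903: 5e-10, 1904: 8e-10, 1905: 1e-9, 1906: 2e-9, 1907: 3e-9}))
```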
### Unexpected First Years

**Problem:** First years seem incorrect or inconsistent.

**Solution:**
- Remember: frequency is a proxy metric, not a legal attestation
- OCR and metadata add noise to the underlying data
- Smoothing and thresholds trade sensitivity for stability
- The method aims for robustness, not perfection
- Adjust `--tau`, `--guard`, and `--window` parameters if needed
## General Tips

### Performance and Caching
The pipeline streams data where possible and writes intermediate artifacts. If you work iteratively, you can skip stages that are already complete:
- Lemma extraction: skip if `artifacts/lemmas/lemmas.tsv` exists
- First-year inference: skip if `artifacts/years/first_years.tsv` exists
- Prefix aggregation: skip if `artifacts/trie/prefix_counts.jsonl` exists
The provided setup script also preserves a simple artifact cache under `artifacts/` so repeats are fast. The heavy hitters are downloading n-gram shards and scanning them once.
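
If you script your own runs, the same existence checks are easy to automate. A sketch; the stage commands are placeholders for however you invoke each step, not the pipeline's actual entry points:

```python
import pathlib
import subprocess

# (artifact that marks a stage as done, command that produces it)
stages = [
    ("artifacts/lemmas/lemmas.tsv",        ["python3", "extract_lemmas.py"]),
    ("artifacts/years/first_years.tsv",    ["python3", "infer_first_years.py"]),
    ("artifacts/trie/prefix_counts.jsonl", ["python3", "aggregate_prefixes.py"]),
]
for artifact, cmd in stages:
    if pathlib.Path(artifact).exists():
        print(f"skip: {artifact} already exists")
    else:
        subprocess.run(cmd, check=True)  # stop the run if a stage fails
```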
## Getting Help
If you encounter issues not covered here:
- Open an issue on GitHub
- Include error messages and relevant log output
- Specify your operating system and Python version
- Describe what you were trying to do when the error occurred
## See Also
- Getting Started – Initial setup guide
- Step-by-Step Guide – Detailed stage-by-stage instructions
- Advanced Tuning – Parameter customization options