Zingarelli 1922 Dictionary – OCR Digitization & JSONL
OCR • NLP • Data Parsing • Automation
This project focused on transforming the Vocabolario della Lingua Italiana Zingarelli 1922, available only as TIFF scans on Internet Archive, into a fully digital and structured dataset.
A custom OCR and parsing pipeline restored degraded pages, extracted high-quality text, and converted every lemma, meaning, derivative and example into modern JSONL files organized by letter.
The final Python program was delivered with complete documentation and later published by the client as an open-source project.
The Zingarelli 1922 Italian Dictionary exists only as TIFF scans on Internet Archive.
The pages are over 100 years old, containing degraded text, irregular formatting, damaged sections and non-standard symbols.
The client needed a way to transform this material into a clean, structured digital dataset covering thousands of lemmas and meanings, a task impractical to complete by hand.
A custom OCR and parsing pipeline was developed to restore, extract and structure the entire dictionary.
After enhancing the scans, an advanced OCR engine recovered the text, while a dedicated parser reconstructed the lexicographic hierarchy: lemmas, meanings, derivatives, examples and the original OCR line.
The final dataset was exported as JSONL files organized per letter, fully ready for NLP, linguistic analysis or integration into digital tools.
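To make the JSONL format concrete, the snippet below builds and round-trips one record. The field names and the sample entry are illustrative assumptions, not the published dataset's exact schema:

```python
import json

# Hypothetical shape of one JSONL record (one JSON object per line).
# Field names mirror the hierarchy described above: lemma, meanings,
# derivatives, examples and the original OCR line.
record = {
    "lemma": "casa",
    "meanings": ["abitazione", "famiglia"],
    "derivatives": ["casetta", "casone"],
    "examples": ["tornare a casa"],
    "ocr_line": "Casa, 1. abitazione. 2. famiglia.",
}

# JSONL = one json.dumps(...) per line; ensure_ascii=False keeps
# Italian accented characters readable in the output files.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
print(parsed["lemma"])
```

Because each line is an independent JSON object, the files can be streamed line by line by NLP tooling without loading a whole letter into memory.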
The TIFF pages were enhanced for clarity, corrected for distortions and processed through a high-accuracy OCR system.
Damaged scans were handled manually when needed, ensuring complete coverage of the entire dictionary.
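The enhancement step can be illustrated with a global-threshold binarization (Otsu's method), shown here on a toy grayscale grid. This is a sketch of the principle only; the actual pipeline worked on full TIFF pages, presumably through an imaging library, and its exact preprocessing steps are not specified here:

```python
def otsu_threshold(pixels):
    """Pick the threshold maximizing between-class variance (Otsu)."""
    hist = [0] * 256
    for row in pixels:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = sum(hist[:t])          # weight of the "ink" class
        w1 = total - w0             # weight of the "paper" class
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum(i * hist[i] for i in range(t)) / w0
        mu1 = sum(i * hist[i] for i in range(t, 256)) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(pixels, t):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[255 if p >= t else 0 for p in row] for row in pixels]

# Toy 2x4 "page": dark ink on the left, light paper on the right.
page = [
    [30, 40, 200, 210],
    [35, 45, 205, 215],
]
t = otsu_threshold(page)
clean = binarize(page, t)
print(t, clean)
```

Binarizing aged, yellowed pages this way sharpens the ink/paper contrast before the OCR pass, which is one common reason OCR accuracy improves markedly on century-old scans.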
A custom text parser separated lemmas, meanings, derived forms and examples following the dictionary’s original structure.
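The splitting idea can be sketched with a small regex-based parser. The sense-numbering convention and the sample line below are assumptions for illustration; the real parser has to handle far more (abbreviations, typography, damaged characters, cross-references):

```python
import re

# Hypothetical entry shape: "Lemma, 1. first sense. 2. second sense."
ENTRY_RE = re.compile(r"^(?P<lemma>\S+),\s*(?P<body>.+)$")

def parse_entry(ocr_line):
    """Split one OCR'd line into lemma + numbered meanings."""
    m = ENTRY_RE.match(ocr_line)
    if not m:
        return None  # line did not look like an entry head
    # Split the body on numbered sense markers such as "1.", "2." ...
    senses = re.split(r"\s*\d+\.\s*", m.group("body"))
    meanings = [s.strip() for s in senses if s.strip()]
    return {
        "lemma": m.group("lemma"),
        "meanings": meanings,
        "ocr_line": ocr_line,   # keep the raw line for traceability
    }

entry = parse_entry("Casa, 1. abitazione. 2. famiglia.")
print(entry["lemma"], entry["meanings"])
```

Keeping the raw OCR line alongside the parsed fields, as the pipeline does, makes every structured record auditable against its source text.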
All entries were exported into JSONL files, one per letter of the alphabet, ready for open-source publication.
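The per-letter export step can be sketched as follows; the output file naming and the minimal record fields are assumptions, not the published layout:

```python
import json
from collections import defaultdict
from pathlib import Path

def export_by_letter(entries, out_dir):
    """Write one <LETTER>.jsonl file per initial letter of the lemma."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    buckets = defaultdict(list)
    for e in entries:
        buckets[e["lemma"][0].upper()].append(e)
    for letter, items in sorted(buckets.items()):
        with (out / f"{letter}.jsonl").open("w", encoding="utf-8") as f:
            for e in items:
                f.write(json.dumps(e, ensure_ascii=False) + "\n")
    return sorted(buckets)

letters = export_by_letter(
    [{"lemma": "casa", "meanings": ["abitazione"]},
     {"lemma": "cane", "meanings": ["animale"]},
     {"lemma": "albero", "meanings": ["pianta"]}],
    "zingarelli_jsonl",
)
print(letters)
```

One file per letter keeps individual files small enough to diff and review, which suits open-source publication of the dataset.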
This project delivers a reliable digitization workflow that converts the 1922 Zingarelli dictionary from raw TIFF scans into clean JSONL data. The system processes historical pages with enhanced preprocessing, applies high-accuracy OCR and organizes each entry into a structured, searchable format.
The parsing logic reconstructs lemmas, meanings and derivatives consistently even on degraded pages, producing a stable dataset suitable for research and digital publishing. The final pipeline also creates a solid base for future extensions such as richer linguistic metadata or integration into interactive dictionary tools.
• Fully digitized a historical dictionary previously available only as raw scans.
• Generated structured JSONL suitable for NLP applications and linguistic research.
• Automated every step, eliminating manual transcription work.
• Delivered a Python pipeline with instructions and ongoing support.
• The client published the project on GitLab with public attribution and excellent feedback.