Historical scripts are usually written in cursive and feature irregular characters and
capitalization, abbreviations, archaic spelling, and linked words.
Preprocessing techniques are applied to clean the images without affecting the written
content.
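As one illustration of this kind of cleaning, the sketch below binarizes a grayscale page with Otsu's global threshold. The text above does not say which preprocessing techniques are applied, so the choice of Otsu's method is an assumption, as are the hypothetical helper names `otsu_threshold` and `binarize` and the representation of the image as nested lists of 0..255 grey levels.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the grey level t (0..255) that maximizes between-class variance.

    Pixels <= t are treated as ink (dark), pixels > t as background.
    Illustrative helper; not taken from the source text.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below t
    sum0 = 0  # sum of grey levels at or below t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Map each pixel to 0 (ink) or 255 (background) using Otsu's threshold."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[0 if p <= t else 255 for p in row] for row in image]


# Tiny synthetic "page": dark strokes (30..40) on a light background (200..220).
page = [[30, 40, 200],
        [210, 35, 220]]
clean = binarize(page)
```

A global threshold keeps stroke shapes intact while discarding background shading. Pages with uneven illumination or bleed-through would likely need a local (adaptive) threshold instead.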
Paleography experts actively engage in the process of information extraction to obtain
accurate information from the images.
Optical character recognition (OCR) is used to automatically convert printed or handwritten
text into machine-readable, editable, and searchable text. In order to enable OCR tasks,
researchers apply different methods. In recent years, deep learning has achieved remarkable
success for image understanding and classification, image segmentation, speech recognition,
and natural language processing.
Acknowledgments
We thank the National Endowment for the Humanities (NEH,
Grant No. HAA-271747-20 and
Grant No. HAA-287903-22), the Missouri Institute for Defense and Energy (UMKC MIDE), the
UMKC Funding for Excellence Program, the UMKC/IDEAS Collaborative Data Science Grant, and the
University of Missouri System Tier 3 Strategic Investment Grant for supporting this project.
This is an ongoing collaboration between the University of
Missouri-Kansas City, the University of Missouri-Columbia, and the National Archives of
Argentina.