1. Silence in OCR: What Could Handwritten Documents Tell Us?
- Author
- Zhang, Theo
- Subjects
- OCR, Archives as Data, Health Science, Medical Humanities
- Abstract
This report, produced as part of the UCSF Archives and Special Collections Summer Fellowship program, explores the efficacy of Optical Character Recognition (OCR) technology in processing archival documents. OCR technology, which automates the extraction of text from images, has advanced significantly in recent years, providing substantial benefits for archival organizations by making vast amounts of previously “hidden” data more accessible. This study specifically examines the disparities in OCR quality between handwritten and typewritten documents, highlighting that OCR’s effectiveness is considerably lower for handwritten texts. This discrepancy results in biases and underrepresentation in datasets, particularly affecting the accessibility and utility of handwritten documents from historical archives. Using a dataset of documents related to AIDS/HIV activism from the 1980s and 1990s, this project evaluates the performance of three OCR tools (Tesseract, Google Cloud Document AI, and Amazon Textract) across different document types. The objective is to identify the most effective OCR solution for enhancing the accessibility of handwritten documents within the UCSF Archives and Special Collections. The findings aim to contribute to the broader archival field by addressing the challenges of digitizing and utilizing handwritten archival materials, thus supporting more inclusive and comprehensive historical research.
- Published
- 2024