1. 2024 Industry Documents Undergraduate Summer Fellowship - JUUL Labs Collection Final Report
- Author
Lichtstein, Gordon
- Subjects
Data science, Natural Language Processing, Optical Character Recognition (OCR), Embedding Search Algorithms, Large Language Model (LLM)
- Abstract
This report, developed as part of the 2024 UCSF Industry Documents Library Undergraduate Summer Fellowship, examines four distinct projects that leverage natural language processing and data science within the context of the JUUL Labs Collection and the broader IDL. Project One investigates the optical character recognition (OCR) accuracy of low-quality and handwritten documents in the absence of ground truth data. Project Two explores the implementation of embedding search algorithms and visualizations aimed at enhancing the relevance of document recommendations for users. Project Three employs txt-ferret to conduct a thorough scan of a substantial corpus of industry documents to identify sensitive information, including credit card numbers. Finally, Project Four assesses the biases present in large language model (LLM) summarization through the lens of sentiment analysis.
- Published
2024