Extraction of Tabular Data from PDF to CSV Files

Authors :
Shefali Athavale
Sanket Gokhale
Naman Varyomalani
Abha Tewari
Rishil Kirtikar
Grishma Gurbani
Yogita Bhatia
Gresha Bhatia
Source :
Data Management, Analytics and Innovation ISBN: 9789811556159
Publication Year :
2020
Publisher :
Springer Singapore, 2020.

Abstract

Companies commonly publish their reports as PDF files. Before the statistics and other quantitative data in these reports can be analysed further, they have to be converted to CSV (.csv) or Excel (.xlsx) files, a conversion that companies perform by hand. This manual work is time-consuming, and automating it would free resources for better use. Forecomp is a web application that automatically converts the tables in a PDF to CSV files. The tables may be present as text or as images. The application is built with flexibility in mind: the user can select the process used to convert the PDF into CSV files according to the kind of tables the PDF contains. The technologies used include a YOLO model for machine learning, Tesseract OCR, Tabula, and an inbuilt snipping tool. This paper introduces the concepts behind Forecomp, focussing on the methodology employed and the results obtained.
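
The user-selected conversion flow described in the abstract can be illustrated with a minimal Python sketch. It is not Forecomp's implementation: the names `split_columns`, `table_to_csv`, `convert`, and `EXTRACTORS` are hypothetical, and both extraction routes are stand-ins that work on lines of text already recovered from a page. In a real system the "text" route might wrap Tabula and the "image" route might wrap table detection followed by Tesseract OCR; only the method selection and the CSV writing are shown here, using the standard library.

```python
import csv
import io
import re
from typing import Callable, Dict, List

Table = List[List[str]]  # one table = list of rows, each row a list of cell strings


def split_columns(lines: List[str]) -> Table:
    """Stand-in extractor (assumption): treat runs of 2+ spaces as column gaps."""
    return [re.split(r"\s{2,}", line.strip()) for line in lines if line.strip()]


def table_to_csv(table: Table) -> str:
    """Serialise a table to CSV text; cells containing commas are quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


# Hypothetical registry: both routes use the same stand-in here. A real
# "text" route could call Tabula; a real "image" route could call OCR.
EXTRACTORS: Dict[str, Callable[[List[str]], Table]] = {
    "text": split_columns,
    "image": split_columns,
}


def convert(lines: List[str], method: str,
            extractors: Dict[str, Callable[[List[str]], Table]]) -> str:
    """Run the extraction process the user selected and return CSV text."""
    if method not in extractors:
        raise ValueError(f"unknown conversion method: {method!r}")
    return table_to_csv(extractors[method](lines))


if __name__ == "__main__":
    sample = ["Year  Revenue  Profit", "2019  1,200    300", "2020  1,500    450"]
    print(convert(sample, "text", EXTRACTORS))
```

Keeping the extractors in a registry keyed by method name mirrors the idea that the user chooses the process according to the kind of tables in the PDF; further routes can be added without changing `convert`.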

Details

Database :
OpenAIRE
Journal :
Data Management, Analytics and Innovation ISBN: 9789811556159
Accession number :
edsair.doi...........bffd653c7b2465e7b2f1b8aceee6e204