Start Over

Generic Data Harvester

Authors :: Asp, William
Valck, Johannes
Asp, William
Valck, Johannes
Publication Year :: 2022
Abstract: This report goes through the process of developing a generic article scraper which shall extract relevant information from an arbitrary web article. The extraction is implemented by searching and examining the HTML of the article, by using Python and XPath. The data that shall be extracted is the title, summary, publishing date and body text of the article. As there is no standard way that websites, and in particular news articles, is built, the extraction needs to be adapted for every different structure and language of articles. The resulting program should provide a proof of concept method of extracting the data showing that future development is possible. The thesis host company Acuminor is working with financial crime intelligence and are collecting information through articles and reports. To scale up the data collection and minimize the maintenance of the scraping programs, a general article scraper is needed. There exist an open source alternative called Newspaper, but since this is no longer being maintained and it can be argued is not properly designed, an internal implementation for the company could be beneficial. The program consists of a main class that imports extractor classes that have an API for extracting the data. Each extractor are decoupled from the rest in order to keep the program as modular as possible. The extraction for title, summary and date are similar, with the extractors looking for specific HTML tags that contain some common attribute that most websites implement. The text extraction is implemented using a tree that is built up from the existing text on the page and then searching the tree for the most likely node containing only the body text, using attributes such as amount of text, depth and number of text nodes. The resulting program does not match the performance of Newspaper, but shows promising results on every part of the extraction. The text extraction is very slow and often takes too much text of the article but provides a<br />Den här rapporten går igenom processen av att utveckla en generisk artikelskrapare som ska extrahera reöevamt information från en godtycklig artikelhemsida. Extraheringen kommer bli implementerad genom att söka igenom och undersöka HTML-en i artikeln, genom att använda Python och XPath. Datan som skall extraheras är titeln, summering, publiceringsdatum och brödtexten i artikeln. Eftersom det inte finns något standard sätt som hemsidor, och mer specifikt nyhetsartiklar är uppbyggda, extraheringen måste anpassas för varje olika struktur och språk av artiklar. Det resulterande programmed skall visa på ett bevis för ett koncept sätt att extrahera datan som visar på att framtida utveckling är möjlig. Projektets värdföretag Acuminor jobbar inom finansiell brottsintelligens och samlar ihop information genom artiklar och rapporter. För att skala upp insamlingen av data och minimera underhåll av skrapningsprogrammen, behövs en generell artikelskrapare. Det existerar ett öppen källkodsalternativ kallad Newspaper, men eftersom denna inte länge är underhållen och det kan argumenteras att den inte är så bra designad, är en intern implementation för företaget fördelaktigt. Programmet består av en huvudklass som importerar extraheringsklasser som har ett API för att extrahera datan. Varje extraherare är bortkopplad från resten av programmet för att hålla programmet så moodulärt som möjligt. Extraheringen för titel, summering och datum är liknande, där extragherarna tittar efter specifika HTML taggar som innehåller något gemensamt attribut som de flesta hemsidor implementerar. Textextraheringen är implementerad med ett träd som byggs upp från grunden från den existerande texten på sidan och sen söks igenom för att hitta den mest troliga noden som innehåller brödtexten, där den använder attribut såsom text, djup och antal textnoder. Det resulterande programmet matchar inte prestandan av Newspaper, men visar på lovande resultat vid varje del av extraheringen. Textextraheringen är väl

Details

Database :: OAIster
Notes :: application/pdf, English
Publication Type :: Electronic Resource
Accession number :: edsoai.on1372267896
Document Type :: Electronic Resource

Tools

Email
Cite

Printer

Authors Abstract Subjects Details

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Generic Data Harvester

Abstract

Details

Tools

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Generic Data Harvester

Abstract

Details

Tools

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources