Knjižnica za tekstovno analitiko v programskem okolju Orange (A text analytics library for the Orange environment)
- Publication Year :
- 2016
Abstract
- We developed a text mining system as an add-on for Orange, a data mining platform. Orange combines a rich set of supervised and unsupervised machine learning methods, which makes it an excellent foundation for such a system. We reviewed the literature and several open-source toolkits to identify the core methods used in the field, and based the functionality of our library on them. We added widgets for retrieving data from online sources such as PubMed and the New York Times. We implemented pre-processing methods that include transforming documents into vectors, stop word removal, lemmatization, and stemming, and supported the workflow with visualizations such as the word cloud. Our goal was to develop widgets that work well together in a visual programming environment, integrate well with the other Orange widgets, and can easily be extended with new widgets.
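The pre-processing steps named in the abstract (stop word removal, stemming, and transformation of documents into vectors) can be illustrated with a minimal, standard-library sketch. This is not the add-on's code or the Orange API: the function names, the tiny stop word list, and the suffix-stripping rule are all assumptions made for illustration, and lemmatization is left out because it needs a dictionary or a trained model.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def bag_of_words(documents):
    """Return a sorted vocabulary and one term-count vector per document."""
    token_lists = [preprocess(d) for d in documents]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["The widgets are mining texts", "Mining text, mining data"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Each document becomes a row of term counts over a shared vocabulary, which is the form that standard supervised and unsupervised learners expect as input. A word cloud can be drawn from the same counts.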
Details
- Language :
- Slovenian
- Database :
- OpenAIRE
- Accession number :
- edsair.od......3505..f24d6bf3787bc10489db3838dddca2e8