71 results for "Webscraping"
Search Results
2. Using Web-Data to Estimate Spatial Regression Models.
- Author
-
Arbia, Giuseppe and Nardelli, Vincenzo
- Subjects
JOB applications, REGRESSION analysis, CONVENIENCE sampling (Statistics), RESEARCH personnel, CROWDSOURCING, BIG data
- Abstract
Macroeconometrics has recently been affected by so-called 'Google Econometrics'. Comparatively less attention has been paid to the subject by the regional and spatial sciences, where the Big Data revolution is challenging conventional econometric techniques with the availability of a variety of non-traditionally collected data (e.g., crowdsourcing, web scraping) which are almost invariably geo-coded. However, these unconventionally collected data represent only what statistics calls a "convenience sample", which does not allow sound probabilistic inference. This paper aims to make researchers aware of the consequences of the unwise use of such data in applied work and to propose a technique that minimizes the negative effects in the estimation of spatial regressions. The method consists of manipulating the data prior to their use in an inferential context. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
3. Robotic colorectal surgery: quality assessment of patient information available on the internet using webscraping.
- Author
-
Taha, Anas, Taha-Mehlitz, Stephanie, Bach, Laura, Ochs, Vincent, Bardakcioglu, Ovunc, Honaker, Michael D., and Cattin, Philippe C.
- Subjects
PROCTOLOGY, SURGICAL robots, WEB portals, PYTHON programming language, HEALTH facilities, INTERNET
- Abstract
The primary goal of this study is to assess current patient information available on the internet concerning robotic colorectal surgery. Acquiring this information will aid patients' understanding of robotic colorectal surgery. Data were acquired through a web-scraping algorithm. The algorithm used two Python packages: Beautiful Soup and Selenium. The long-chain keywords entered into the Google, Bing and Yahoo search engines were 'Da Vinci Colon-Rectal Surgery', 'Colorectal Robotic Surgery' and 'Robotic Bowel Surgery'. The search returned 207 websites, which were sorted and evaluated according to the Ensuring Quality Information for Patients (EQIP) score. Of the 207 websites visited, 49 belonged to the subgroup of hospital websites (23.6%), 46 to medical centers (22.2%), 45 to practitioners (21.7%), 42 to health care systems (20.2%), 11 to news services (5.3%), 7 to web portals (3.3%), 5 to industry (2.4%), and 2 to patient groups (0.9%). Only 52 of the 207 websites received a high rating. The quality of available information on the internet concerning robotic colorectal surgery is low. The majority of information was inaccurate. Medical facilities involved in robotic colorectal surgery, robotic bowel surgery and related robotic procedures should develop websites with credible information to guide patient decisions. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
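The record above describes collecting search-engine results with Beautiful Soup and Selenium. A minimal sketch of that kind of collection step follows; the Bing URL pattern, the CSS selector and the headless-browser setup are illustrative assumptions, not the authors' published code.

```python
# Hedged sketch: collect result links for long-chain keywords from a search engine.
# The Bing URL pattern and the selector "li.b_algo h2 a" are assumptions for
# illustration; the study's actual queries, engines and selectors may differ.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

KEYWORDS = ["Da Vinci Colon-Rectal Surgery",
            "Colorectal Robotic Surgery",
            "Robotic Bowel Surgery"]

options = Options()
options.add_argument("--headless=new")          # run without opening a browser window
driver = webdriver.Chrome(options=options)

links = set()
for query in KEYWORDS:
    driver.get("https://www.bing.com/search?q=" + query.replace(" ", "+"))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for anchor in soup.select("li.b_algo h2 a"):   # organic result titles (assumed selector)
        href = anchor.get("href")
        if href:
            links.add(href)

driver.quit()
print(f"collected {len(links)} candidate websites")
```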
4. Automatisierte Datenerhebung
- Author
-
Jünger, Jakob and Gärtner, Chantal
- Published
- 2023
- Full Text
- View/download PDF
5. Robotic colorectal surgery: quality assessment of patient information available on the internet using webscraping
- Author
-
Anas Taha, Stephanie Taha-Mehlitz, Laura Bach, Vincent Ochs, Ovunc Bardakcioglu, Michael D. Honaker, and Philippe C. Cattin
- Subjects
Robotic colorectal surgery, patient information, webscraping, EQIP, Computer applications to medicine. Medical informatics, R858-859.7, Surgery, RD1-811
- Abstract
The primary goal of this study is to assess current patient information available on the internet concerning robotic colorectal surgery. Acquiring this information will aid patients' understanding of robotic colorectal surgery. Data were acquired through a web-scraping algorithm. The algorithm used two Python packages: Beautiful Soup and Selenium. The long-chain keywords entered into the Google, Bing and Yahoo search engines were 'Da Vinci Colon-Rectal Surgery', 'Colorectal Robotic Surgery' and 'Robotic Bowel Surgery'. The search returned 207 websites, which were sorted and evaluated according to the Ensuring Quality Information for Patients (EQIP) score. Of the 207 websites visited, 49 belonged to the subgroup of hospital websites (23.6%), 46 to medical centers (22.2%), 45 to practitioners (21.7%), 42 to health care systems (20.2%), 11 to news services (5.3%), 7 to web portals (3.3%), 5 to industry (2.4%), and 2 to patient groups (0.9%). Only 52 of the 207 websites received a high rating. The quality of available information on the internet concerning robotic colorectal surgery is low. The majority of information was inaccurate. Medical facilities involved in robotic colorectal surgery, robotic bowel surgery and related robotic procedures should develop websites with credible information to guide patient decisions.
- Published
- 2023
- Full Text
- View/download PDF
6. Government websites as data: a methodological pipeline with application to the websites of municipalities in the United States.
- Author
-
Neumann, Markus, Linder, Fridolin, and Desmarais, Bruce
- Subjects
GOVERNMENT websites, WEBSITES, CITIES & towns, MUNICIPAL government, INFORMATION policy
- Abstract
The content of a government's website is an important source of information about policy priorities, procedures, and services. Existing research on government websites has relied on manual methods of website content collection and processing, which imposes cost limitations on the scale of website data collection. In this research note, we propose that the automated collection of website content from large samples of government websites can offer relief from the costs of manual collection, and enable contributions through large-scale comparative analyses. We also provide software to ease the use of this data collection method. In an illustrative application, we collect textual content from the websites of over two hundred municipal governments in the United States, and study how website content is associated with mayoral partisanship. Using statistical topic modeling, we find that the partisanship of the mayor predicts differences in the contents of city websites that align with differences in the platforms of Democrats and Republicans. The application illustrates the utility of website content data extracted via our methodological pipeline. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
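The pipeline described in the record above automates the collection of website text from a large sample of government sites. Below is a minimal sketch of such a collection step, assuming a plain requests + Beautiful Soup fetch; the URLs are placeholders, and the authors' own software will differ in detail.

```python
# Hedged sketch: fetch municipal homepages and extract visible text for later topic modeling.
# The example URLs are placeholders, not the sample used in the study.
import requests
from bs4 import BeautifulSoup

municipal_urls = [
    "https://www.example-city.gov",   # placeholder
    "https://www.example-town.gov",   # placeholder
]

def visible_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):   # drop non-visible elements
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

corpus = {}
for url in municipal_urls:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        corpus[url] = visible_text(response.text)
    except requests.RequestException as err:
        print(f"skipping {url}: {err}")
```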
7. A 'Europe of Lawyers'? The Making of a Database on Cases and Lawyers of the CJEU
- Author
-
Lola Avril and Constantin Brissaud
- Subjects
quantitative research, database, court of justice, lawyers, webscraping, transversal analysis, Law, Law of Europe, KJ-KKZ
- Abstract
(Series Information) European Papers - A Journal on Law and Integration, 2021 6(2), 913-921 | Article | (Table of Contents) I. Introduction: why a database of lawyers at the Court of Justice is an important new tool for socio-legal inquiry. - II. The construction of the database. - III. The Court as a place of confluence. - IV. How to complement the analysis? - V. Concluding remarks. | (Abstract) This Article presents a database of lawyers being built within the Court of Justice in the Archives project. Recent studies, relying on actor-centred approaches, have fostered a renewed interest in European lawyers. While the visits of these lawyers to Luxembourg have fostered the development of transnational legal networks and contributed to the acculturation of the Court's informal and formal rules, they remain largely under-studied. We therefore suggest analysing the Court as a "place of confluence", where different professional groups meet during the course of the proceedings. The database aims precisely at mapping the networks of lawyers that take shape in Luxembourg. By providing statistical analysis of the structure and evolution of the Europe of lawyers (agents of the European institutions or Member States, law professors or private practitioners), we suggest that the database could contribute to a better understanding of transformations in the European legal field.
- Published
- 2021
- Full Text
- View/download PDF
8. Marketing attributes in yogurt weekly pricing in Argentina.
- Author
-
Larrosa, Juan M. C., Giordano, Victoria, Ramírez Muñoz de Toro, Gonzalo R., and Uriarte, Juan I.
- Subjects
YOGURT, MARKETING effectiveness, PRODUCT attributes, SALES promotion
- Abstract
Pricing under inflation proves to affect marketing effectiveness. We analyze the weekly price evolution of yogurt segments offered in supermarkets in Argentina for the period 2015–2020. Taking into account the strong macroeconomic instability of the period, we try to determine what role marketing variables such as product attributes and promotions have played in pricing. We find that specific attributes such as flavor and texture, as well as time effects, are significant. As expected, much of the pricing was affected by the general and sectoral inflation of the period. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
9. Mapping Emerging and Legacy Outlets Online by Their Democratic Functions—Agonistic, Deliberative, or Corrosive?
- Author
-
Freudenthaler, Rainer and Wessler, Hartmut
- Subjects
REFUGEES, HUMAN rights workers, PUBLIC sphere, MEDIA rights, PARTISANSHIP, CONTENT analysis, MASS media
- Abstract
In this study, we offer a novel approach to research on migration reporting by focusing on the argumentative substance prevalent in different online outlets. Taking German refugee policy as our case in point, we map the role that moral, ethical–cultural, legal, and pragmatic argumentations play within journalistic, partisan, and activist outlets, and how these coincide with incivility and impoliteness. Using dictionary-based content analysis on a data set of 34,819 articles from thirty online news outlets published between April 10, 2017, and April 10, 2018, we find that legacy mainstream media, partisan media, and activist media perform vastly different functions for the larger public sphere. We observe that human rights activist media perform an advocatory function by making the moral case for refugees, whereas corrosive partisan media at the fringe—particularly within the contra-refugee camp—often present opponents as inherently illegitimate enemies. Implications for public sphere theory and directions for future research on emerging and legacy media are discussed. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
10. Retail coffee pricing dynamics in Argentina
- Author
-
Larrosa, Juan M. C., Meller, Leandro, Uriarte, Juan I., and Ramírez Muñoz de Toro, Gonzalo R.
- Published
- 2023
- Full Text
- View/download PDF
11. Dashboard of Sentiment in Austrian Social Media During COVID-19
- Author
-
Max Pellert, Jana Lasser, Hannah Metzler, and David Garcia
- Subjects
COVID-19, collective emotions, real-time monitoring, social media, digital traces, webscraping, Information technology, T58.5-58.64
- Abstract
To track online emotional expressions on social media platforms close to real time during the COVID-19 pandemic, we built a self-updating monitor of emotion dynamics using digital traces from three different data sources in Austria. This allows decision makers and the interested public to assess the dynamics of online sentiment during the pandemic. We used web scraping and API access to retrieve data from the news platform derstandard.at, Twitter, and a chat platform for students. We documented the technical details of our workflow to provide materials for other researchers interested in building a similar tool for different contexts. Automated text analysis allowed us to highlight changes in language use during COVID-19 in comparison to a neutral baseline. We used special word clouds to visualize that overall difference. Longitudinally, our time series showed spikes in anxiety that can be linked to several events and media reporting. Additionally, we found a marked decrease in anger. The changes lasted for remarkably long periods of time (up to 12 weeks). We also discuss these and other patterns and connect them to the emergence of collective emotions. The interactive dashboard showcasing our data is available online at http://www.mpellert.at/covid19_monitor_austria/. Our work is part of a web archive of resources on COVID-19 collected by the Austrian National Library.
- Published
- 2020
- Full Text
- View/download PDF
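The monitor described above combines scraping and API retrieval with automated text analysis. The sketch below illustrates one dictionary-based step, computing the daily share of posts containing anxiety-related terms; the tiny word list and sample posts are purely illustrative, not the dictionaries or data used by the authors.

```python
# Hedged sketch: daily share of posts containing anxiety-related terms.
# The word list and input data are toy placeholders; the dashboard uses its own
# dictionaries and real data from derstandard.at, Twitter and a student chat platform.
from collections import defaultdict
from datetime import date

ANXIETY_TERMS = {"angst", "sorge", "panik"}   # toy dictionary (assumption)

posts = [
    (date(2020, 3, 15), "große Sorge wegen der Ausgangsbeschränkungen"),
    (date(2020, 3, 15), "heute endlich wieder Sonne"),
    (date(2020, 3, 16), "Panik beim Einkaufen, leere Regale"),
]

daily_total = defaultdict(int)
daily_hits = defaultdict(int)
for day, text in posts:
    daily_total[day] += 1
    tokens = set(text.lower().split())
    if tokens & ANXIETY_TERMS:
        daily_hits[day] += 1

for day in sorted(daily_total):
    share = daily_hits[day] / daily_total[day]
    print(day.isoformat(), f"{share:.2f}")
```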
12. Detecting innovative companies via their website.
- Author
-
Daas, Piet J. H. and van der Doef, Suzanne
- Abstract
Producing an overview of innovative companies in a country is a challenging task. Traditionally, this is done by sending a questionnaire to a sample of companies. This approach, however, usually only focuses on large companies. We therefore investigated an alternative approach: determining whether a company is innovative by studying the text on its website. For this task a model was developed based on the texts of the websites of companies included in the Community Innovation Survey of the Netherlands. The latter is a survey carried out every two years that focuses on detecting innovative companies with 10 or more working persons. We found that the text-based model developed was able to reproduce the results from the Community Innovation Survey and was also able to detect innovative companies with fewer than 10 employees, such as startups. Model stability, model bias, the minimal number of words extracted from a website, and companies without a website were found to be important issues in producing high-quality results. How these issues were dealt with, and the findings on the number of innovative companies with large and small numbers of employees, are discussed in the paper. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
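The record above describes a text-based model that predicts whether a company is innovative from its website text. A hedged sketch of such a classifier follows; the TF-IDF plus logistic-regression pipeline and the toy training texts are assumptions for illustration, not the model developed in the paper.

```python
# Hedged sketch: classify company website texts as innovative / not innovative.
# Training texts and labels are toy placeholders; the study trains on websites of
# companies included in the Dutch Community Innovation Survey.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "we develop patented sensor technology and new prototypes",   # placeholder
    "family restaurant serving traditional dishes since 1950",    # placeholder
]
train_labels = [1, 0]   # 1 = innovative, 0 = not innovative (toy labels)

model = make_pipeline(TfidfVectorizer(min_df=1), LogisticRegression())
model.fit(train_texts, train_labels)

new_site_text = "startup building machine learning tools for logistics"
print(model.predict([new_site_text])[0])
```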
13. Mining real estate ads and property transactions for building and amenity data acquisition
- Author
-
Chen, Xinyu and Biljecki, Filip
- Published
- 2022
- Full Text
- View/download PDF
14. Computational Methods für die Sozial- und Geisteswissenschaften
- Author
-
Jünger, Jakob and Gärtner, Chantal
- Subjects
Computer methods, text analysis, network analysis, simulation methods, Webscraping, Digital Humanities, Computational Social Science, bic Book Industry Communication::J Society & social sciences::JF Society & culture: general::JFD Media studies, bic Book Industry Communication::D Literature & literary studies::DS Literature: history & criticism, bic Book Industry Communication::J Society & social sciences
- Abstract
Computational methods make it possible to study and shape digital worlds scientifically. This open-access textbook first teaches fundamental skills for the automated collection and preparation of data and for working with databases. An introduction to the programming languages R and Python, as well as to version control and cloud computing, opens up avenues for creative analytical approaches to large and small data sets. Finally, scenarios from application fields in the social sciences and humanities are worked through, including automated data collection via programming interfaces and webscraping, automated text analysis, network analysis, machine learning, and simulation methods. Beyond a conceptual introduction to each topic, the focus is on gaining first practical experience in short tutorials and on learning about the further possibilities, but also the limitations, of computational methods.
- Published
- 2023
- Full Text
- View/download PDF
15. Análisis de la depreciación en el mercado de cabezas tractoras
- Author
-
Costan Macareño, Manuel Alejandro
- Subjects
Secondary market, Regression models, Tractor heads, Grado en Administración y Dirección de Empresas-Grau en Administració i Direcció d'Empreses, Used machinery, WebScraping, Depreciation, OLS regressions, ECONOMIA, SOCIOLOGIA Y POLITICA AGRARIA, Second-hand market
- Abstract
In this final degree project we analyze the behavior of values and depreciation in the used machinery market, specifically the market for tractor heads. For this purpose, the R programming language is used to build an interactive technique for collecting the data and constructing OLS regression models. The calculation of machinery depreciation for valuation purposes is usually made assuming straight-line depreciation, in imitation of the straight-line amortization commonly used in accounting and taxation. However, this depreciation pattern may not be appropriate in all cases. Where a secondary market exists, it is possible to test whether other types of depreciation models better reflect the behavior of the value. Currently, a large amount of information is available on the Internet, mainly on assets in secondary markets, which makes it possible to obtain the data needed to study how the price of these assets behaves as a function of several variables. In this work we use a data-collection technique known as WebScraping, which can obtain specific, accurate, reliable data tailored to the needs of each case in an automated way. The raw database is then processed to remove values that deviate strongly from the rest of the observations and would generate incorrect valuations. In order to obtain a representative overview of the depreciation these assets undergo, several least-squares regression models (linear, exponential and power) are developed, relating the age of these assets to their value, together with the effect that the make and model of the tractor head or the machinery's country of origin may have.
- Published
- 2023
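The thesis summarized above fits linear, exponential and power depreciation models to scraped listing data using R. For illustration only, the following Python sketch fits the same three functional forms by ordinary least squares on toy numbers; it is not the author's implementation.

```python
# Hedged sketch: compare linear, exponential and power depreciation fits by OLS.
# Ages (years) and prices (EUR) are toy numbers, not the scraped tractor-head data.
import numpy as np

age = np.array([1, 2, 3, 5, 7, 10], dtype=float)
price = np.array([90_000, 78_000, 70_000, 55_000, 43_000, 30_000], dtype=float)

# Linear model: price = a + b * age
b_lin, a_lin = np.polyfit(age, price, 1)

# Exponential model: price = A * exp(r * age)  <=>  log(price) = log(A) + r * age
r_exp, log_A = np.polyfit(age, np.log(price), 1)

# Power model: price = A * age**k  <=>  log(price) = log(A) + k * log(age)
k_pow, log_A_pow = np.polyfit(np.log(age), np.log(price), 1)

print(f"linear:      price = {a_lin:,.0f} + ({b_lin:,.0f}) * age")
print(f"exponential: price = {np.exp(log_A):,.0f} * exp({r_exp:.3f} * age)")
print(f"power:       price = {np.exp(log_A_pow):,.0f} * age^({k_pow:.3f})")
```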
16. Big Text Data Collection Software
- Author
-
Олійник, Юрій Олександрович
- Subjects
webscraping, big data, data collection, data structuring, 004.912
- Abstract
Explanatory note size – 107 pages, containing 20 illustrations, 28 tables, 3 appendices, and 21 references. Topicality: every year the amount of data increases, and it can be useful in any area of our lives provided it is properly processed. The topic of the work is relevant because there is currently no universal tool for collecting extremely large arrays of text data from various sources. The goal of the work is to unify the structure and format of extremely large arrays of text data through architectural solutions that allow users to extend the system for their own purposes with minimal effort. To achieve this goal, the following tasks must be solved: a comparative analysis of available solutions for collecting extremely large arrays of text data; formulation of the technical features of collecting such data; development of a unified structure for extremely large text data collected from various sources; development of software for collecting extremely large arrays of text data; implementation of a modular architecture in the software solution; and evaluation of the effectiveness of the proposed solution. The object of research is the mathematical, informational and software support for collecting extremely large arrays of text data. The subject of research is methods of collecting extremely large arrays of text data. The scientific novelty of the work is the creation of a unified data structure for sources of large text data of various kinds, which includes storage of the timestamp and data source as well as the declaration of a strict structure. The practical significance of the results lies in the possibility of using the proposed unified structure for integration between different systems for collecting extremely large arrays of text data. Relationship with scientific programs, plans and topics: the work was performed at the Department of Informatics and Software Engineering of the National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" within the topic "Methods and technologies of high-performance computing and big data processing". State registration number 0117U000924. Approbation: the main provisions of the work were reported and discussed at the III All-Ukrainian scientific and practical conference of young scientists and students "Software engineering and advanced information technologies (Soft-Tech-2022)". Publications: the scientific provisions of the dissertation were published in: 1) Kuvichka M.Y. Unification of the structure of super-large arrays of text data collected from various sources / M.Y. Kuvichka, Yu.O. Oliinyk // Materials of the III All-Ukrainian scientific and practical conference of young scientists and students "Software engineering and advanced information technologies" (SoftTech-2022 autumn) – Kyiv: NTUU "Igor Sikorsky KPI", November 23-25, 2022.
- Published
- 2022
17. Plataforma web per l'scouting de jugadors de basquetbol
- Author
-
Vindel Quintana, Aleix, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Llorente Viejo, Silvia
- Subjects
python, Big data, Informàtica::Sistemes d'informació [Àrees temàtiques de la UPC], webscraping, Basketball, scouting
- Abstract
In the 21st century, statistics and big data have become a trend, not only in advertising and marketing companies but also in team sports such as soccer and basketball, with the objective of increasing winning probabilities. However, not all teams have a large budget to invest in such an effort. This project, with the intention of bringing this approach to clubs with limited resources, shows the development of a web platform for the scouting of basketball players at the semi-professional level. We define scouting as the analysis, observation and gathering of information on players for two purposes: signing them for your team, or studying them before playing against them. For this development, we accessed the website of the Spanish Basketball Federation and, using web scraping techniques and databases, collected the information on all the games played over the last thirty years. With these data, a website was created that presents them in an accessible and orderly way, in order to offer this information to clubs, sporting directors or coaching staff in general in a convenient form. Before building the website, several basketball experts were asked which types of data they considered relevant to show, and a UML specification of the resulting requirements was then produced and used to create the final design. The web scraping algorithms were implemented with the Selenium and BeautifulSoup libraries in Python. On the backend, Postgres was used as the SQL database and Node.JS as the API to access it. Finally, React was used for the frontend.
- Published
- 2022
18. Report on Amazon's Project: Statistical evaluation on socio-economic variables across Germany
- Author
-
De la Serna, Sebastian
- Subjects
Georeferenced, Normalization, Webscraping, Statistics
- Abstract
[EN] In this report we define a proxy that can explain Germany's precariousness at the district level by relating socio-economic variables to the distribution of parcel centers for the years 2011 and 2019. This precariousness indicator is an aggregated indicator composed of 5 socio-economic variables and their subsequent normalization. These 5 socio-economic variables, which are mainly related to unemployment, form the normalized indicator "precariousness" on a scale from 0 (least) to 8 (most), with equal weighting. The challenge in this project is to webscrape all relevant logistics centres of different competitors in the courier industry and map them at the district level in order to later, when matching the socio-economic indicators with this dataset, highlight the regions where Amazon operates. We can therefore ask whether Amazon operates systematically in regions with relatively high precariousness. To answer this question, we examine the chosen socio-economic variables with descriptive statistics, looking at how the standard deviations behave. We then examine the collinearity of the 5 variables by means of a correlation matrix and PCA. Finally, we compare the two resulting maps for 2011 and 2019 and assess their precariousness.
- Published
- 2022
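The report above aggregates five unemployment-related variables into a normalized "precariousness" indicator on a 0-8 scale with equal weights. The sketch below shows one plausible way to perform that aggregation, assuming min-max scaling of each variable so the five equal-weight contributions sum to at most 8; the district names and values are placeholders, and the report's exact normalization may differ.

```python
# Hedged sketch: aggregate five socio-economic variables into a 0-8 precariousness index.
# District values and variable names are placeholders; min-max scaling each variable to
# [0, 8/5] so that equal weights sum to at most 8 is an assumption about the method.
import numpy as np

districts = ["District A", "District B", "District C"]
# rows = districts, columns = 5 variables (e.g. unemployment-related rates), toy values
X = np.array([
    [5.0, 3.2, 10.1, 2.0, 7.5],
    [9.0, 6.1, 14.3, 4.5, 9.9],
    [2.0, 1.0,  6.0, 1.0, 4.2],
])

col_min = X.min(axis=0)
col_max = X.max(axis=0)
scaled = (X - col_min) / (col_max - col_min)       # each variable now in [0, 1]
precariousness = scaled.sum(axis=1) * (8 / 5)      # equal weights, total in [0, 8]

for name, score in zip(districts, precariousness):
    print(f"{name}: {score:.1f}")
```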
19. DATA MINING Y ANÁLISIS MATEMÁTICO DE LAS CUOTAS DE LAS CASAS DE APUESTAS DEPORTIVAS ONLINE.
- Author
-
TORRES-CABRERA, GONZALO PÉREZ-SEOANE and QUESADA GONZÁLEZ, CARLOS
- Published
- 2018
- Full Text
- View/download PDF
20. Plataforma web per l'scouting de jugadors de basquetbol
- Author
-
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Llorente Viejo, Silvia, and Vindel Quintana, Aleix
- Abstract
In the 21st century, statistics and big data have become a trend, not only in advertising and marketing companies but also in team sports such as soccer and basketball, with the objective of increasing winning probabilities. However, not all teams have a large budget to invest in such an effort. This project, with the intention of bringing this approach to clubs with limited resources, shows the development of a web platform for the scouting of basketball players at the semi-professional level. We define scouting as the analysis, observation and gathering of information on players for two purposes: signing them for your team, or studying them before playing against them. For this development, we accessed the website of the Spanish Basketball Federation and, using web scraping techniques and databases, collected the information on all the games played over the last thirty years. With these data, a website was created that presents them in an accessible and orderly way, in order to offer this information to clubs, sporting directors or coaching staff in general in a convenient form. Before building the website, several basketball experts were asked which types of data they considered relevant to show, and a UML specification of the resulting requirements was then produced and used to create the final design. The web scraping algorithms were implemented with the Selenium and BeautifulSoup libraries in Python. On the backend, Postgres was used as the SQL database and Node.JS as the API to access it. Finally, React was used for the frontend.
- Published
- 2022
21. Recreational Psychedelic Users Frequently Encounter Complete Mystical Experiences: Trip Content and Implications for Wellbeing
- Author
-
Qiu, Tianhong and Minda, John
- Subjects
sentiment, lda, topic modeling, ComputingMilieux_PERSONALCOMPUTING, webscraping, psychedelics, psychology, naturalistic, drugs, mystical, recreational, neuroscience, erowid, experience, wellbeing, pharmacology, textmining
- Abstract
A growing proportion of the population is engaging in recreational psychedelic use. Psychedelics are uniquely capable of reliably occasioning mystical experiences in ordinary humans without contemplative or religious backgrounds. While clinical research has made efforts to characterize psychedelic experiences, comparably little is understood about how humans naturalistically engage with psychedelics. The present study employs a mixed-methods approach to examine the content and implications of psychedelic and mystical experiences, occurring outside of laboratory settings. We use text mining analyses to arrive at a qualitative description of psychedelic experiential content by abstracting from over two-thousand written reports of first-person psychedelic experiences. Following up, we conducted quantitative analyses on psychometric data from a large survey (N = 1424) to reveal associations between psychedelic use practices, complete mystical experiences, and psychological wellbeing. Topic-modelling and sentiment analyses present a bottom-up description of human interactions with psychedelic compounds and the content of such experiences. Psychometric results suggest psychedelic users encounter complete mystical experiences in high proportions, dependent on factors such as drug type and dose-response effects. Furthermore, a salient association was established between diverse metrics of wellbeing and those with complete mystical experiences. Our results paint a new picture of the growing relationships between humans and psychedelic experiences in the real-world use context. Ordinary humans appear to encounter complete mystical experiences via recreational psychedelic use, and such experiences are strongly associated with improved psychological wellbeing.
- Published
- 2022
- Full Text
- View/download PDF
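The study above mines written experience reports with topic modeling and sentiment analysis. A minimal sketch of an LDA topic-modeling step is given below; the toy documents and the number of topics are illustrative assumptions, not the corpus or settings used in the study.

```python
# Hedged sketch: LDA topic modeling over first-person experience reports.
# The documents are toy placeholders, not the >2,000 reports analyzed in the study.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "felt a deep sense of unity and peace with everything",     # placeholder
    "strong visuals, geometric patterns and bright colors",     # placeholder
    "ego dissolution and a feeling of timelessness",            # placeholder
    "nausea at first, then waves of colorful imagery",          # placeholder
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]   # five highest-weight terms
    print(f"topic {k}: {', '.join(top)}")
```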
22. Vernetzung von Unternehmen und Forschungseinrichtungen in regionalen Innovationssystemen durch Webscraping
- Author
-
Meub, Lukas, Proeger, Till, Bizer, Kilian, and Lahner, Jörg
- Subjects
Webscraping, economic development promotion, regional innovation systems
- Abstract
Using the example of southern Lower Saxony (Südniedersachsen), this study shows how the methodology of webscraping can be used to efficiently obtain information on regional actors such as research institutions and companies. Webscraping refers to the systematic reading of website content and its subsequent statistical analysis, for example with regard to specific technologies or structural characteristics of companies. In this way, shared technological focus areas can be identified, allowing network activities and project consortia between the actors to be built up efficiently. For actors in economic and innovation development, this offers an innovative tool to expand regional innovation networks in a structured way and to establish new collaborations. The study outlines various possible applications using the example of southern Lower Saxony. First, regional specializations of companies and research institutions in the fields of hydrogen and laser technology are identified. In addition, companies and research institutions are assigned to different technology focus areas. Overall, webscraping is an innovative and efficiently deployable instrument for regional innovation actors that facilitates regional coordination and the structured, topic-related expansion of innovation networks. Göttinger Beiträge zur Handwerksforschung ; 62
- Published
- 2022
- Full Text
- View/download PDF
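The study above identifies regional technology focus areas (e.g. hydrogen and laser technology) by analysing the scraped websites of companies and research institutions. A hedged sketch of such a keyword-matching step follows; the keyword lists and page texts are placeholders, not the study's actual search terms.

```python
# Hedged sketch: tag organizations by technology focus via keyword matching on
# their scraped website text. Keywords and texts are illustrative placeholders.
TECH_KEYWORDS = {
    "hydrogen": ["wasserstoff", "brennstoffzelle", "elektrolyse"],
    "laser":    ["laser", "photonik", "optik"],
}

site_texts = {
    "firma-a.example": "Wir entwickeln Elektrolyse-Anlagen für grünen Wasserstoff.",
    "institut-b.example": "Forschung zu Photonik und Lasermesstechnik.",
}

for site, text in site_texts.items():
    lowered = text.lower()
    matches = [field for field, words in TECH_KEYWORDS.items()
               if any(word in lowered for word in words)]
    print(site, "->", matches or ["no match"])
```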
23. El framing como estrategia metodológica: entornos digitales y campaña electoral en La Plata (2019)
- Author
-
Lanusse, Nazareno
- Subjects
Webscraping, Election campaign, Political communication, Methodology, Framing, Communication
- Abstract
The work is part of a research project on political communication in digital environments, which aims to analyze the processes of discourse construction by digital media outlets in La Plata and by politicians' campaigns on Twitter, and, in parallel, to carry out daily monitoring during the electoral campaign period for the 2019 general elections in La Plata. Facultad de Periodismo y Comunicación Social
- Published
- 2022
24. Development of an R package to analyze corporate websites
- Author
-
Serrano Moliner, Arturo
- Subjects
ECONOMIA APLICADA, Webscraping, Web analysis, Programming libraries, Business, Grado en Ciencia de Datos-Grau en Ciència de Dades
- Abstract
Using libraries in high-level programming languages such as R is a practice that saves time and resources when carrying out projects of any scope or scale, to the point that developing a project without using them throughout is hard to conceive. In this work we aim to make our own small contribution by creating one of these packages and focusing it on the web analysis of companies, with the goal of using the information found on their websites for web analytics, digital marketing, business decision-making, or any other task related to the business world.
- Published
- 2022
25. A 'Europe of Lawyers'? The Making of a Database on Cases and Lawyers of the CJEU
- Author
-
Avril, Lola, Brissaud, Constantin, and Administrateur, Paris Dauphine-PSL
- Subjects
transversal analysis, Law of Europe, Court of Justice, lawyers, webscraping, [SHS] Humanities and Social Sciences, Law, KJ-KKZ, database, quantitative research
- Abstract
European Papers - A Journal on Law and Integration, 2021 6(2), 913-921, I. Introduction: why a database of lawyers at the Court of Justice is an important new tool for socio-legal inquiry. - II. The construction of the database. - III. The Court as a place of confluence. - IV. How to complement the analysis? - V. Concluding remarks., This Article presents a database of lawyers being built within the Court of Justice in the Archives project. Recent studies, relying on actor-centred approaches, have fostered a renewed interest in European lawyers. While the visits of these lawyers to Luxembourg have fostered the development of transnational legal networks and contributed to the acculturation of the Court's informal and formal rules, they remain largely under-studied. We therefore suggest analysing the Court as a "place of confluence", where different professional groups meet during the course of the proceedings. The database aims precisely at mapping the networks of lawyers that take shape in Luxembourg. By providing statistical analysis of the structure and evolution of the Europe of lawyers (agents of the European institutions or Member States, law professors or private practitioners), we suggest that the database could contribute to a better understanding of transformations in the European legal field.
- Published
- 2021
- Full Text
- View/download PDF
26. The World Wide Web as Complex Data Set: Expanding the Digital Humanities into the Twentieth Century and Beyond through Internet Research.
- Author
-
Black, Michael L.
- Subjects
INTELLECTUAL property, WORLD Wide Web, RESEARCH, TEXT mining, DATA mining, INFORMATION retrieval
- Abstract
While intellectual property protections effectively frame digital humanities text mining as a field primarily for the study of the nineteenth century, the Internet offers an intriguing object of study for humanists working in later periods. As a complex data source, the World Wide Web presents its own methodological challenges for digital humanists, but lessons learned from projects studying large nineteenth-century corpora offer helpful starting points. Complicating matters further, legal and ethical questions surrounding web scraping, or the practice of large-scale data retrieval over the Internet, will require humanists to frame their research to distinguish it from commercial and malicious activities. This essay reviews relevant research in the digital humanities and new media studies in order to show how web scraping might contribute to humanities research questions. In addition to recommendations for addressing the complex concerns surrounding web scraping, this essay also provides a basic overview of the process and some recommendations for resources. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
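The essay above recommends framing scholarly web scraping so that it is clearly distinct from commercial or malicious activity. In that spirit, the sketch below checks robots.txt and rate-limits requests; the base URL and paths are placeholders, and the essay itself does not prescribe a specific implementation.

```python
# Hedged sketch: polite scraping -- honor robots.txt and pause between requests.
# The base URL and paths are placeholders for illustration.
import time
import urllib.robotparser
import requests

BASE = "https://example.org"                       # placeholder site
USER_AGENT = "research-crawler/0.1 (contact: you@example.org)"

robots = urllib.robotparser.RobotFileParser()
robots.set_url(BASE + "/robots.txt")
robots.read()

paths = ["/page1.html", "/page2.html"]             # placeholder paths
for path in paths:
    url = BASE + path
    if not robots.can_fetch(USER_AGENT, url):
        print(f"disallowed by robots.txt: {url}")
        continue
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    print(url, response.status_code, len(response.text))
    time.sleep(2)                                   # rate limit between requests
```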
27. Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives.
- Author
-
Milligan, Ian
- Subjects
WORLD Wide Web, RESEARCH, HISTORIANS, WEB archives, DIGITAL resources for archives, WEB archiving
- Abstract
Contemporary and future historians need to grapple with and confront the challenges posed by web archives. These large collections of material, accessed either through the Internet Archive's Wayback Machine or through other computational methods, represent both a challenge and an opportunity to historians. Through these collections, we have the potential to access the voices of millions of non-elite individuals (recognizing of course the cleavages in both Web access as well as method of access). To put this in perspective, the Old Bailey Online currently describes its monumental holdings of 197,745 trials between 1674 and 1913 as the 'largest body of texts detailing the lives of non-elite people ever published.' GeoCities.com, a platform for everyday web publishing in the mid-to-late 1990s and early 2000s, amounted to over thirty-eight million individual webpages. Historians will have access, in some form, to millions of pages: written by everyday people of various classes, genders, ethnicities, and ages. While the Web was not a perfect democracy by any means - it was and is unevenly accessed across each of those categories - this still represents a massive collection of non-elite speech. Yet a figure like thirty-eight million webpages is both a blessing and a curse. We cannot read every website, and must instead rely upon discovery tools to find the information that we need. Yet these tools largely do not exist for web archives, or are in a very early state of development: what will they look like? What information do historians want to access? We cannot simply map over web tools optimized for discovering current information through online searches or metadata analysis. We need to find information that mattered at the time, to diverse and very large communities. Furthermore, web pages cannot be viewed in isolation, outside of the networks that they inhabited. In theory, amongst corpuses of millions of pages, researchers can find whatever they want to confirm. The trick is situating it into a larger social and cultural context: is it representative? Unique? In this paper, 'Lost in the Infinite Archive,' I explore what the future of digital methods for historians will be when they need to explore web archives. Historical research of periods beginning in the mid-1990s will need to use web archives, and right now we are not ready. This article draws on first-hand research with the Internet Archive and Archive-It web archiving teams. It draws upon three exhaustive datasets: the large Web ARChive (WARC) files that make up Wide Web Scrapes of the Web; the metadata-intensive WAT files that provide networked contextual information; and the lifted-straight-from-the-web guerilla archives generated by groups like Archive Team. Through these case studies, we can see - hands-on - what richness and potentials lie in these new cultural records, and what approaches we may need to adopt. It helps underscore the need to have humanists involved at this early, crucial stage. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
28. En jämförelse av prestanda mellan centraliserad och decentraliserad datainsamling
- Author
-
Hidén, Filip and Qvarnström, Magnus
- Abstract
In the modern world, data and information are used on a larger scale than ever before. Much of this information is stored on the internet in many different forms, such as articles, files and webpages. Anyone starting a new project or company that depends on this data needs a way to efficiently search, sort and gather what they need to process. A common method for achieving this is called web scraping, which can be implemented in several different ways to search for and gather data. This can be an expensive investment for smaller companies, as web scraping is an intensive process that typically requires paying for a sufficiently powerful server to handle everything. The purpose of this report is to investigate whether there are other, cheaper alternatives for implementing web scraping that do not require access to expensive servers. To answer this, the subject of web scraping was researched further, along with the different system architectures used in industry to implement it. This research was then used to develop a web scraping application, implemented both on a centralised server and as a decentralised implementation on an Android device. Finally, the summarized research and the results of performance tests of the two applications were used to reach a conclusion. The conclusion drawn from these results was that decentralised Android implementations are a valid and functional solution for web scraping today; however, the difference in performance means they are not useful in every situation. Instead, the choice must be made based on the specifications and requirements of the particular company. There is also very little research on this topic, so further investigation is needed to keep developing implementations and knowledge in this particular area.
- Published
- 2021
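The core scraping step this abstract describes, extracting ingredients from recipe pages, can be illustrated with a minimal sketch. This is not the thesis code but a hedged assumption of how such a collector might look in Python; the URL and the CSS class name "ingredient" are chosen purely for illustration, and the same function could in principle run on a central server or be ported to a device-side client.

    import requests
    from bs4 import BeautifulSoup

    def scrape_ingredients(url: str) -> list[str]:
        """Fetch a recipe page and return the text of its ingredient items.

        Hypothetical selector: many recipe sites mark ingredients with a
        dedicated class; 'ingredient' is an illustrative guess, not a standard.
        """
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        return [li.get_text(strip=True) for li in soup.select("li.ingredient")]

    if __name__ == "__main__":
        # Placeholder URL; replace with a real recipe article.
        for item in scrape_ingredients("https://example.com/recipes/pancakes"):
            print(item)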
29. Model degradation in web derived text-based models
- Author
-
Daas, Piet, Jansen, Jelmer, Daas, Piet, and Jansen, Jelmer
- Abstract
[EN] Getting an overview of the innovative companies in a country is a challenging task. Traditionally, this is done by sending a questionnaire to a sample of large companies. For this an alternative approach has been developed: determining if a company is innovative by studying the text on the main page of its website. The text-based model created is able to reproduce the results from the survey and is also able to detect small innovative companies, such as startups. However, model stability was found to be a serious problem. It suffered from model degradation which resulted in a gradual decrease in the detection of innovative companies. The accuracy of the model dropped from 93% to 63% over a period of one year. In this paper this phenomenon is described and the data underlying it is studied in great detail. It was found that the combination of the inactivity of a subset of websites and changes in the composition of the words on company websites over time produced this effect. A solution for dealing with this phenomenon is presented and future research is discussed.
- Published
- 2020
30. Analyse des Digitalisierungsgrads von Bildungseinrichtungen auf Basis von Webscraping - eine methodische Vorstudie
- Author
-
Proeger, Till, Meub, Lukas, and Pölert, Hauke
- Subjects
Digitalisierung ,Webscraping ,Bildungseinrichtungen ,ddc:330 - Abstract
This preliminary study demonstrates, by way of example, how web scraping, i.e. the systematic retrieval and statistical analysis of websites, can be used to gain insights into educational institutions. In two sub-studies covering 200 educational institutions in Lower Saxony and 173 in North Rhine-Westphalia, a structural and content-related analysis of their websites is carried out. The focus is on the degree of digitalisation and the use of digital technologies by the educational institutions. Overall, web scraping proves to be an innovative and efficiently deployable instrument for the systematic analysis of how educational institutions are developing in the field of digitalisation.
- Published
- 2021
31. Webscraping als Instrument zur tagesaktuellen und umfassenden digitalen Analyse des Handwerks
- Author
-
Proeger, Till, Meub, Lukas, and Bizer, Kilian
- Subjects
Digitalisierung ,Webscraping ,Regionalanalyse ,Innovation - Abstract
Based on the register of craft businesses of the Hildesheim-Southern Lower Saxony Chamber of Crafts, this study analyses the digital presence of the chamber's member businesses. Of the 7,422 businesses in the chamber district (as of spring 2020), around 90 % can be found on the internet; 3,661 businesses maintain their own website. These websites were used for a web scraping analysis. The study illustrates the great potential of web scraping for gathering information about the businesses of a chamber district. By efficiently searching all content published by the businesses, general statements can be made about the chamber district, about trade groups or about individual trades, which can support decision-making at the strategic level. At the operational level, the search for businesses with specific characteristics, goals or technologies can likewise be carried out far more efficiently and comprehensively, which can be of great advantage when targeting the chamber's services. To illustrate the broad applicability of web scraping, the following analyses are carried out as examples: a structural analysis of the digital presence of the crafts sector in the chamber district, enabling a comprehensive strengths-and-weaknesses analysis of its degree of digitalisation in marketing; an analysis of the links on all websites, providing insight into networking structures and the businesses' integration into value chains; a structural analysis of the use of social media as a complement or substitute to websites; a detailed search for the GDPR compliance of the examined websites; a structural analysis of digital recruitment of skilled workers and the identification of digitally particularly active businesses that can serve as best-practice examples; an analysis of the businesses' innovation activity through a keyword search for current, innovative technologies; a detailed search for the use of 3D printing and drones and a cartographic representation of the businesses using these technologies; an analysis of which businesses maintain research networks and which operate online shops; and a structural analysis of the businesses' reaction to the coronavirus situation. For all of these exemplary areas, both structural statements can be made and individual businesses can be identified., Göttinger Beiträge zur Handwerksforschung ; 55
- Published
- 2021
- Full Text
- View/download PDF
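One of the analyses described above, the keyword search for innovative technologies such as 3D printing or drones across the scraped company websites, can be sketched as follows. This is a hedged illustration rather than the study's actual pipeline: the keyword list, the in-memory page store and the helper name are assumptions made for the example.

    from collections import defaultdict

    # Illustrative keyword list; the study's actual search terms are not reproduced here.
    KEYWORDS = ["3d-druck", "drohne", "online-shop", "dsgvo"]

    def keyword_hits(pages: dict[str, str]) -> dict[str, list[str]]:
        """Map each keyword to the list of company URLs whose page text contains it.

        `pages` is assumed to hold already-scraped text, keyed by company URL.
        """
        hits = defaultdict(list)
        for url, text in pages.items():
            lowered = text.lower()
            for kw in KEYWORDS:
                if kw in lowered:
                    hits[kw].append(url)
        return dict(hits)

    # Toy data standing in for scraped website texts.
    sample = {
        "https://tischlerei-beispiel.de": "Wir fertigen Prototypen im 3D-Druck.",
        "https://dachdecker-beispiel.de": "Dachinspektion per Drohne und Online-Shop.",
    }
    print(keyword_hits(sample))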
32. Insolvenzstatistik in der Corona-Pandemie – aktuellere Ergebnisse durch Webscraping
- Author
-
Alter, Hannah, Feuerhake, Jörg, and Jacob, Simon
- Subjects
ddc:519 ,Webscraping ,Corona-Pandemie ,experimental data ,insolvencies ,Insolvenzen ,web-scraping ,timeliness ,Experimentelle Daten ,Aktualität ,coronavirus pandemic - Abstract
Since the beginning of the coronavirus pandemic, insolvency statistics have increasingly become a focus of public attention. Since May 2020, the Federal Statistical Office has published a flash indicator to supplement the results of the official insolvency statistics. This enables trends in the total insolvency figures for Germany to be identified two months earlier than before. The Federal Statistical Office uses a new data source, namely a public website with insolvency announcements, for compiling the flash indicator. This article describes the development and first results of the flash indicator, and it analyses the current insolvency situation in Germany.
- Published
- 2021
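The core of such a flash indicator, turning individually scraped announcement dates into a monthly count that can later be compared with the official statistics, can be sketched in a few lines. The dates below are invented, and this is not the Federal Statistical Office's actual processing, only an assumed minimal aggregation step.

    from collections import Counter
    from datetime import date

    def monthly_counts(announcement_dates: list[date]) -> dict[str, int]:
        """Aggregate scraped insolvency announcement dates into counts per month."""
        counts = Counter(d.strftime("%Y-%m") for d in announcement_dates)
        return dict(sorted(counts.items()))

    # Invented example dates standing in for scraped announcements.
    scraped = [date(2020, 5, 4), date(2020, 5, 28), date(2020, 6, 1), date(2020, 6, 17)]
    print(monthly_counts(scraped))   # {'2020-05': 2, '2020-06': 2}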
33. We can Help Gather Data to Power your Research Project or News Story
- Author
-
Hir Infotech
- Subjects
Data ,Webscraping ,UAE ,newsstory ,news ,newswebsite ,France ,bigdata ,USA ,webscrapingapi ,Crawling - Abstract
Researchers and #Journalists require large amounts of data to provide accurate #analysis or reports. We can help #gather data about weather, third world development, crime, local and global #trends data to power your next research project or news story. Visit: https://hirinfotech.com/website-scraping/
- Published
- 2020
- Full Text
- View/download PDF
34. Make the Right Decision, at the Right Time Employ our State of the Art Data Capturing Services
- Author
-
Hir Infotech
- Subjects
BusinessGrowth ,UAE ,DataAnalysis ,WebScraping ,BigData ,France ,DataScience ,DataCapturing ,DataStorage ,USA ,DataExtraction - Abstract
Data has the power to introduce clarity to empower your business to make the right decisions, at the right time. Presenting cutting edge #datascraping services to empower you with real knowledge. Interested to know more? Read here -https://hirinfotech.com/enterprise-web-crawling-service/
- Published
- 2020
- Full Text
- View/download PDF
35. Model degradation in web derived text-based models
- Author
-
Piet J. H. Daas and Jelmer Jansen
- Subjects
Qca ,Webscraping ,Computer science ,business.industry ,Big data ,Web data ,Conference ,Pls ,Text analysis ,computer.software_genre ,Text mining ,Sem ,Data mining ,business ,Innovation ,computer ,Degradation (telecommunications) ,Internet data - Abstract
[EN] Getting an overview of the innovative companies in a country is a challenging task. Traditionally, this is done by sending a questionnaire to a sample of large companies. For this an alternative approach has been developed: determining if a company is innovative by studying the text on the main page of its website. The text-based model created is able to reproduce the results from the survey and is also able to detect small innovative companies, such as startups. However, model stability was found to be a serious problem. It suffered from model degradation which resulted in a gradual decrease in the detection of innovative companies. The accuracy of the model dropped from 93% to 63% over a period of one year. In this paper this phenomenon is described and the data underlying it is studied in great detail. It was found that the combination of the inactivity of a subset of websites and changes in the composition of the words on company websites over time produced this effect. A solution for dealing with this phenomenon is presented and future research is discussed.
- Published
- 2020
- Full Text
- View/download PDF
36. Detecting innovative companies via their website
- Author
-
Suzanne van der Doef, Piet J. H. Daas, Statistics, and Stochastic Operations Research
- Subjects
Economics and Econometrics ,Knowledge management ,Concept drift ,Computer science ,media_common.quotation_subject ,Big data ,Stability (learning theory) ,Sample (statistics) ,concept drift ,01 natural sciences ,Management Information Systems ,Task (project management) ,010104 statistics & probability ,Quality (business) ,0101 mathematics ,Innovation ,media_common ,Model bias ,business.industry ,010401 analytical chemistry ,webscraping ,text analysis ,0104 chemical sciences ,Statistics, Probability and Uncertainty ,business - Abstract
Producing an overview of innovative companies in a country is a challenging task. Traditionally, this is done by sending a questionnaire to a sample of companies. This approach, however, usually only focuses on large companies. We therefore investigated an alternative approach: determining if a company is innovative by studying the text on its website. For this task a model was developed based on the texts of the websites of companies included in the Community Innovation Survey of the Netherlands. The latter is a survey carried out every two years that focusses on the detection of innovative companies with 10 or more working persons. We found that the text-based model developed was able to reproduce the result from the Community Innovation Survey and was also able to detect innovative companies with less than 10 employees, such as startups. Model stability, model bias, the minimal number of words extracted from a website and companies without a website were found to be important issues in producing high quality results. How these issues were dealt with and the findings on the number of innovative companies with large and small numbers of employees are discussed in the paper.
- Published
- 2020
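The general technique behind such text-based models, training a classifier on bag-of-words features extracted from company homepage texts, can be sketched with standard tooling. The snippet below is a generic illustration using scikit-learn, not the model described in the paper, and the tiny training set is invented for the example.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy homepage texts with survey-style labels (1 = innovative, 0 = not).
    texts = [
        "We develop machine learning products and patented sensor platforms.",
        "Family bakery offering bread, cakes and catering since 1950.",
        "Startup building a novel battery prototype for electric vehicles.",
        "Local hairdresser, walk-ins welcome, gift vouchers available.",
    ]
    labels = [1, 0, 1, 0]

    # TF-IDF features plus a linear classifier: a common baseline for website text.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(texts, labels)

    print(model.predict(["We prototype innovative IoT hardware for agriculture."]))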
37. Headless Chrome Automation with the crrri package
- Author
-
Lesur, Romain
- Subjects
browsers ,webscraping ,automation - Abstract
Headless Chrome is a headless web browser that became extremely popular thanks to the node.js chrome-remote-interface and puppeteer libraries. A headless web browser can be used for different goals: web scraping, testing web applications (e.g. Shiny apps), screenshots or PDF generation of web pages. Several R packages such as RSelenium and webdriver offer high-level interfaces to headless web browsers. In this communication, we present an R package named crrri that provides a low-level interface to headless Chrome using the Chrome DevTools Protocol. It offers access to the most advanced features of headless Chrome from R. The crrri package has small system dependencies: the only dependency is Chromium/Chrome. Its API is close to the node.js libraries: node.js scripts for headless Chrome can easily be transcribed into R. Through several examples, we show different applications of headless Chrome.
- Published
- 2019
- Full Text
- View/download PDF
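As a point of comparison for readers who do not work in R, the kinds of headless-Chrome tasks the abstract mentions (scraping rendered pages, taking screenshots) can also be driven from Python with Selenium. The sketch below uses Selenium rather than crrri; the target URL and output filename are placeholders.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    # Run Chrome without a visible window; requires a local Chrome/Chromium install.
    options = Options()
    options.add_argument("--headless=new")  # recent Chrome; older versions use "--headless"

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com")          # placeholder URL
        print(driver.title)                        # title of the rendered page
        print(len(driver.page_source))             # full DOM after JavaScript ran
        driver.save_screenshot("example.png")      # screenshot use case from the abstract
    finally:
        driver.quit()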
38. Data mining y análisis matemático de las cuotas de las casas de apuestas deportivas online
- Author
-
Pérez – Seoane Torres – Cabrera, Gonzalo, Quesada González, Carlos, Pérez – Seoane Torres – Cabrera, Gonzalo, and Quesada González, Carlos
- Abstract
This article presents an empirical study of the behaviour of 1X2 betting odds at online bookmakers in Spain. The study sets out a theoretical framework in which the bettor can obtain a positive return and, by means of mathematical modelling, examines possible strategies through which this theoretical argument could be put into practice. For the empirical study, a database covering 115 days of odds on 175 football matches of La Liga Santander, quoted on the online betting platform Sportium, was obtained using programming and automation techniques known as web scraping. Throughout the study, questions such as fair games, the margins obtained by the bookmakers, and the variables involved in the process of pricing the odds were analysed.
- Published
- 2018
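The bookmaker margin mentioned in this abstract can be illustrated with a standard calculation on decimal 1X2 odds: the implied probabilities 1/o sum to more than 1, and the excess is the margin. The odds in the example are invented, not taken from the study's dataset.

    def implied_margin(odds_home: float, odds_draw: float, odds_away: float) -> float:
        """Return the bookmaker margin implied by decimal 1X2 odds.

        A 'fair game' would have implied probabilities summing to exactly 1,
        so the margin is the amount by which the sum exceeds 1.
        """
        overround = 1 / odds_home + 1 / odds_draw + 1 / odds_away
        return overround - 1

    # Invented example odds for a single match: roughly a 4 % margin.
    print(f"margin = {implied_margin(1.85, 3.60, 4.50):.3%}")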
39. Web-based methodology for monitoring new jobs. Updating the occupations observatory
- Author
-
Beblavy, Miroslav, Welter-Médée, Cécile, Lenaerts, Karolien, Akgüç, Mehtap, Kilhoffer, Zachary, and Silva, Ana
- Subjects
online vacancies ,webscraping ,occupations observatory ,new jobs and skills - Abstract
The identification of new and emerging occupations has proven to be a challenging task, in which real-time information on labour market developments is key. At present, the most commonly used data sources do not provide up-to-date information, are narrow in scope or limited in size. In this light, online job portals have been suggested as an interesting data source for real-time labour market analysis. This report aims to contribute to the identification of new and emerging occupations by presenting an updated version of the methodology underpinning the Occupations Observatory developed by Beblavý et al. (2016). We use data extracted from online job boards using web scraping techniques, compare newly identified occupations with existing occupational classifications, and present examples of the tasks and skills required. With this update, we set out to further fine-tune the data collection, processing and analysis steps, but also to make the methodology and outputs more user-friendly, while providing more information at the same time. The proposed revised methodology consists of seven stages, and has been tested for the case of Ireland.
- Published
- 2018
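A central step in this methodology, checking scraped vacancy titles against an existing occupational classification so that unmatched titles surface as candidates for new occupations, can be sketched as follows. The titles and the mini classification list are invented for the example; real pipelines normalise and deduplicate far more carefully.

    def candidate_new_occupations(vacancy_titles: list[str],
                                  classification: set[str]) -> list[str]:
        """Return scraped job titles that do not appear in the known classification."""
        known = {title.lower() for title in classification}
        seen: set[str] = set()
        candidates = []
        for title in vacancy_titles:
            key = title.strip().lower()
            if key not in known and key not in seen:
                seen.add(key)
                candidates.append(title.strip())
        return candidates

    # Invented vacancy titles and a toy slice of a classification.
    titles = ["Data Engineer", "Prompt Engineer", "Carpenter", "Drone Pilot"]
    known_occupations = {"data engineer", "carpenter", "nurse"}
    print(candidate_new_occupations(titles, known_occupations))   # ['Prompt Engineer', 'Drone Pilot']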
40. Web data extraction systems versus research collaboration in sustainable planning for housing: smart governance takes it all
- Author
-
Dewaelheyns, Valerie, Loris, Isabelle, Steenberghen, Thérèse, Schrenk, Manfred, Popovich, Vasily, Zeile, Peter, Elisei, Pietro, and Beyer, Clemens
- Subjects
co-production ,Technology and Engineering ,transdisciplinary research ,webscraping
To date, there are no clear insights into the spatial patterns and micro-dynamics of the housing market. The objective of this study is to collect real estate micro-data for the development of policy-support indicators on housing market dynamics at the local scale. These indicators can provide the requested insights into spatial patterns and micro-dynamics of the housing market. Because the required real estate data are not systematically published as statistical data or open data, innovative forms of data collection are needed. This paper is based on a case study of the greater Leuven area (Belgium). The research question is which methods or strategies are suitable for collecting data on the micro-dynamics of the housing market. The methodology combines a technical approach to data collection, Web data extraction, with a governance approach, explorative interviews. A Web data extraction system collects and extracts unstructured or semi-structured data that are stored or published on Web sources. Most of the required data are publicly and readily available as Web data on real estate portal websites. Web data extraction at the scale of the case study succeeded in collecting the required micro-data, but a trial run at the regional scale encountered a number of practical and legal issues. Simultaneously with the Web data extraction, a dialogue with two real estate portal websites was initiated, using purposive sampling and explorative semi-structured interviews. The interviews were considered the start of a transdisciplinary research collaboration process. Both companies indicated that the development of indicators about housing market dynamics was a good and relevant idea, yet a challenging task. The companies were familiar with Web data extraction systems, but considered them a suboptimal technique for collecting real estate data for the development of housing dynamics indicators. They preferred an active collaboration instead of passive Web scraping. Under a user agreement, we received one company's dataset and calculated the indicators for the case study based on it. The unique micro-data provided by the company proved to be the start of a collaborative planning approach between private partners, the academic world and the Flemish government. All three win from this collaboration in the long run. Smart governance can gain from smart technologies, but should not lose sight of active collaborations.
- Published
- 2016
41. Van webscraping tot collaboratieve planning: hoe een gedeelde ambitie leidt tot trandisciplinaire samenwerking
- Author
-
Dewaelheyns, Valerie, Loris, Isabelle, van der Lecq, René, and Vanempten, Elke
- Subjects
collaboratieve planning ,webscraping ,coproductie ,Science General ,transdisciplinair - Abstract
Within the Steunpunt Ruimte, first steps were taken towards 'broadened professionalism', with mutual understanding between the parties involved and the integration of the spatial planner into a different logic. The White Paper of the Beleidsplan Ruimte Vlaanderen puts forward spatial returns (ruimtelijk rendement) as a policy ambition. Reuse of buildings is considered one of the strategies to realise those returns. The lack of insight into the spatial micro-dynamics of the housing market in Flanders makes it difficult to test policy options on reuse against reality. Local dynamics of the housing market could be mapped with policy-supporting indicators. Essential information for calculating such indicators can be found on the internet, via portal sites for real estate listings. It is possible to 'scrape' this wealth of data from the web and store it in a database. This seems a simple and independent way of collecting data, but technical and legal obstacles point to the added value of collaboration. Within a co-production approach, it was explored how policy makers, researchers and private companies can work together on the development of indicators on micro-dynamics of the housing market. Interviews with representatives of real estate portal sites clarified the context and the win-wins of collaboration. A small-scale but effective collaboration highlighted points of attention concerning the alignment of objectives and interests and the organisation of the collaboration. For each party, collaboration turned out to offer clear opportunities that cannot be achieved by each professional separately. This was an incentive to explore further collaboration between policy makers and private companies and thus take a step towards broadened professionalism.
- Published
- 2016
42. Dashboard of Sentiment in Austrian Social Media During COVID-19.
- Author
-
Pellert M, Lasser J, Metzler H, and Garcia D
- Abstract
To track online emotional expressions on social media platforms close to real-time during the COVID-19 pandemic, we built a self-updating monitor of emotion dynamics using digital traces from three different data sources in Austria. This allows decision makers and the interested public to assess dynamics of sentiment online during the pandemic. We used web scraping and API access to retrieve data from the news platform derstandard.at, Twitter, and a chat platform for students. We documented the technical details of our workflow to provide materials for other researchers interested in building a similar tool for different contexts. Automated text analysis allowed us to highlight changes of language use during COVID-19 in comparison to a neutral baseline. We used special word clouds to visualize that overall difference. Longitudinally, our time series showed spikes in anxiety that can be linked to several events and media reporting. Additionally, we found a marked decrease in anger. The changes lasted for remarkably long periods of time (up to 12 weeks). We have also discussed these and more patterns and connect them to the emergence of collective emotions. The interactive dashboard showcasing our data is available online at http://www.mpellert.at/covid19_monitor_austria/. Our work is part of a web archive of resources on COVID-19 collected by the Austrian National Library., (Copyright © 2020 Pellert, Lasser, Metzler and Garcia.)
- Published
- 2020
- Full Text
- View/download PDF
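The kind of automated text analysis described here, comparing word use during the pandemic against a neutral baseline, can be sketched at its simplest as a relative-frequency comparison. The word list and posts below are invented placeholders; the actual monitor relies on established, validated sentiment dictionaries rather than this toy lexicon.

    from collections import Counter
    import re

    # Tiny illustrative lexicon; real monitors use validated emotion dictionaries.
    ANXIETY_TERMS = {"worried", "afraid", "uncertain", "risk"}

    def term_share(posts: list[str], terms: set[str]) -> float:
        """Fraction of all tokens in `posts` that belong to `terms`."""
        tokens = [t for p in posts for t in re.findall(r"[a-zäöüß]+", p.lower())]
        if not tokens:
            return 0.0
        counts = Counter(tokens)
        return sum(counts[t] for t in terms) / len(tokens)

    baseline_posts = ["Nice weather today", "Looking forward to the weekend"]
    pandemic_posts = ["I am worried about the risk", "Everything feels so uncertain"]

    print("baseline:", term_share(baseline_posts, ANXIETY_TERMS))
    print("pandemic:", term_share(pandemic_posts, ANXIETY_TERMS))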
43. Tehnike spletnega luščenja podatkov
- Author
-
Grlica, Peter and Lavbič, Dejan
- Subjects
odprta koda ,computer and information science ,avtomatizacija ,računalništvo ,visokošolski strokovni študij ,webscraping ,computer science ,PHP ,legality ,anonymization ,automatization ,anonimizacija ,zakonodaja ,računalništvo in informatika ,open source ,diploma ,Mink ,diplomske naloge ,AJAX ,udc:004.774.2(043.2) ,luščenje podatkov - Published
- 2014
44. Gender, Leadership and Prominence in Entrepreneurial Teams: A framework for web content analysis.
- Author
-
Schillo, R. Sandra and Aidis, Ruta
- Abstract
There is growing recognition of the need to fine tune quantitative methodologies to capture the gendering of entrepreneurship. Web-based content analysis offers the possibilities of utilizing open data to understand the gendered characteristics of entrepreneurial teams. In this paper we present analyses of three VC funded entrepreneurial companies using keyword frequency analysis and discuss the insights, which suggest that more complex analyses are required. In conclusion, we present a framework that draws on more advanced natural language processing methods and on gender content analysis to address some of the limitations and identify avenues for future research. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
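Keyword frequency analysis of the kind applied to these company websites can be illustrated with a short sketch. The keyword groups below are invented for the example and do not reproduce the paper's coding scheme or its gender-content categories.

    import re
    from collections import Counter

    # Hypothetical keyword groups; the paper's actual categories are more refined.
    GROUPS = {
        "leadership": {"founder", "ceo", "lead", "director"},
        "team": {"team", "together", "we", "our"},
    }

    def group_frequencies(text: str) -> dict[str, int]:
        """Count occurrences of each keyword group in a page's text."""
        tokens = Counter(re.findall(r"[a-z]+", text.lower()))
        return {name: sum(tokens[w] for w in words) for name, words in GROUPS.items()}

    page_text = "Our founder and CEO leads a team that builds together."
    print(group_frequencies(page_text))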
45. Analýza dějových linií na základě shrnutí obsahu knih a uživatelských recenzí
- Author
-
Smrž, Pavel, Dočekal, Martin, Smrž, Pavel, and Dočekal, Martin
- Abstract
The aim of this work is to create a system for the analysis and classification of plot keywords from summarized storylines and user reviews in English. The chosen problem is solved using a transformer-based machine learning technique. The created solution also implements data downloading, and a dataset of user reviews and book information was created, containing more than 23 million reviews and almost 900 thousand book records. The system can predict which plot keywords the data contains.
46. Analýza dějových linií na základě shrnutí obsahu knih a uživatelských recenzí
- Author
-
Smrž, Pavel, Dočekal, Martin, Smrž, Pavel, and Dočekal, Martin
- Abstract
The aim of this work is to create a system for the analysis and classification of plot keywords from summarized storylines and user reviews in English. The chosen problem is solved using a transformer-based machine learning technique. The created solution also implements data downloading, and a dataset of user reviews and book information was created, containing more than 23 million reviews and almost 900 thousand book records. The system can predict which plot keywords the data contains.
47. Analýza aplikačních firewallů sociálních sítí
- Author
-
Januš, Filip, Malinka, Kamil, Januš, Filip, and Malinka, Kamil
- Abstract
The thesis describes ways of accessing social networks using automated robots, the purpose of this approach, and the reasons that lead social networks to use protections against automated robots. The aim is to analyse the protections against automated robots currently used by the best-known social networks (Facebook, Twitter, LinkedIn and YouTube) and to share this information with other developers, who can use it to improve the protection of their own websites. The output of this bachelor thesis is a description of the protections currently used by social networks and a proposal for a protection that helps to reveal automated robots based on suspicious behaviour.
48. Analýza aplikačních firewallů sociálních sítí
- Author
-
Januš, Filip, Malinka, Kamil, Januš, Filip, and Malinka, Kamil
- Abstract
The thesis describes ways of accessing social networks using automated robots, the purpose of this approach, and the reasons that lead social networks to use protections against automated robots. The aim is to analyse the protections against automated robots currently used by the best-known social networks (Facebook, Twitter, LinkedIn and YouTube) and to share this information with other developers, who can use it to improve the protection of their own websites. The output of this bachelor thesis is a description of the protections currently used by social networks and a proposal for a protection that helps to reveal automated robots based on suspicious behaviour.
49. Analýza dějových linií na základě shrnutí obsahu knih a uživatelských recenzí
- Author
-
Smrž, Pavel, Dočekal, Martin, Smrž, Pavel, and Dočekal, Martin
- Abstract
The aim of this work is to create a system for the analysis and classification of plot keywords from summarized storylines and user reviews in English. The chosen problem is solved using a transformer-based machine learning technique. The created solution also implements data downloading, and a dataset of user reviews and book information was created, containing more than 23 million reviews and almost 900 thousand book records. The system can predict which plot keywords the data contains.
50. Analýza aplikačních firewallů sociálních sítí
- Author
-
Januš, Filip, Malinka, Kamil, Zítka, Radim, Januš, Filip, Malinka, Kamil, and Zítka, Radim
- Abstract
The thesis describes ways of accessing social networks using automated robots, the purpose of this approach, and the reasons that lead social networks to use protections against automated robots. The aim is to analyse the protections against automated robots currently used by the best-known social networks (Facebook, Twitter, LinkedIn and YouTube) and to share this information with other developers, who can use it to improve the protection of their own websites. The output of this bachelor thesis is a description of the protections currently used by social networks and a proposal for a protection that helps to reveal automated robots based on suspicious behaviour.