94,083 results
Search Results
2. A Year of Papers Using Biomedical Texts.
- Author
-
Grouin C and Grabar N
- Subjects
- Information Storage and Retrieval methods, Data Mining methods, Electronic Health Records, Natural Language Processing
- Abstract
Objectives: Analyze papers published in 2019 within the medical natural language processing (NLP) domain in order to select the best works of the field. Methods: We performed an automatic and manual pre-selection of papers to be reviewed and finally selected the best NLP papers of the year. We also propose an analysis of the content of NLP publications in 2019. Results: Three best papers have been selected this year, including the generation of synthetic record texts in Chinese, a method to identify contradictions in the literature, and the BioBERT word representation. Conclusions: The year 2019 was very rich, and various NLP issues and topics were addressed by research teams. This shows the will and capacity of researchers to move towards robust and reproducible results. Researchers also prove to be creative in addressing original issues with relevant approaches. Competing Interests: The authors report no conflicts of interest in this work. (Georg Thieme Verlag KG Stuttgart.)
- Published
- 2020
- Full Text
- View/download PDF
3. The plan to mine the world's research papers.
- Author
-
Pulla P
- Subjects
- Big Data economics, Data Mining trends, Datasets as Topic economics, Datasets as Topic legislation & jurisprudence, India, Open Access Publishing economics, Research Report, Unsupervised Machine Learning legislation & jurisprudence, Unsupervised Machine Learning trends, Big Data supply & distribution, Data Mining methods, Datasets as Topic supply & distribution, Information Dissemination legislation & jurisprudence, Information Dissemination methods, Open Access Publishing legislation & jurisprudence, Research
- Published
- 2019
- Full Text
- View/download PDF
4. Incidences of problematic cell lines are lower in papers that use RRIDs to identify cell lines.
- Author
-
Babic Z, Capes-Davis A, Martone ME, Bairoch A, Ozyurt IB, Gillespie TH, and Bandrowski AE
- Subjects
- Cell Line, Humans, Periodicals as Topic, PubMed, Bibliometrics, Biomedical Research standards, Cell Line Authentication statistics & numerical data, Data Mining methods
- Abstract
The use of misidentified and contaminated cell lines continues to be a problem in biomedical research. Research Resource Identifiers (RRIDs) should reduce the prevalence of misidentified and contaminated cell lines in the literature by alerting researchers to cell lines that are on the list of problematic cell lines, which is maintained by the International Cell Line Authentication Committee (ICLAC) and the Cellosaurus database. To test this assertion, we text-mined the methods sections of about two million papers in PubMed Central, identifying 305,161 unique cell-line names in 150,459 articles. We estimate that 8.6% of these cell lines were on the list of problematic cell lines, whereas only 3.3% of the cell lines in the 634 papers that included RRIDs were on the problematic list. This suggests that the use of RRIDs is associated with a lower reported use of problematic cell lines. Competing Interests: ZB, TG: no competing interests declared. AC runs the cell bank in Australia and heads the ICLAC consortium. MM, AB heads the RRID project, and founded SciCrunch, a company that supports the RRID project. AB develops the Cellosaurus database. IO works as a consultant for SciCrunch. (© 2019, Babic et al.)
- Published
- 2019
- Full Text
- View/download PDF
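The RRID-based scan described in this entry can be illustrated with a small sketch. This is not the authors' pipeline, just a minimal stand-in: a regular expression that pulls cell-line RRIDs (Cellosaurus accessions use the CVCL_ prefix) out of a methods section. The example methods text and accession numbers are illustrative.

```python
import re

# Cell-line RRIDs cite Cellosaurus accessions, which use the CVCL_ prefix.
RRID_PATTERN = re.compile(r"RRID:\s*(CVCL_[0-9A-Za-z]+)")

def find_cell_line_rrids(methods_text):
    """Return the unique cell-line RRIDs mentioned in a methods section."""
    return sorted(set(RRID_PATTERN.findall(methods_text)))

methods = (
    "HeLa cells (RRID:CVCL_0030) and HEK293 cells (RRID: CVCL_0045) "
    "were cultured as described."
)
print(find_cell_line_rrids(methods))  # ['CVCL_0030', 'CVCL_0045']
```

Applied over a corpus of methods sections, a scan like this would give the denominator and numerator counts behind the percentages reported above.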
5. Construction of an assisted reading system for scientific papers based on machine reading comprehension [基于机器阅读理解的论文辅助阅读系统构建].
- Author
-
秘蓉新, 姚文文, and 阮宏坤
- Subjects
- Language models, Scientific literature, Literature reviews, Data mining, Reading comprehension
- Abstract
In the era of informatization and digitization, the rapid increase in the number of scientific papers has given rise to various challenges, such as lengthy articles, difficulty in information extraction, and the high time cost of reading. Reading the literature has become increasingly tedious and time-consuming for researchers. Utilizing language models, an assisted reading system for scientific papers was designed to address these challenges. With machine reading comprehension technology at its core, the system parses scientific texts and offers some common questions to achieve automated response capabilities. By fully utilizing the pre-trained language model PERT, the system enhances its capabilities in semantic understanding and information extraction, effectively resolving various challenges in reading scientific papers and helping readers improve the efficiency of scientific literature review. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
6. Artificial intelligence for knowledge management : second IFIP WG 12.6 International Workshop, AI4KM 2014, Warsaw, Poland, September 7-10, 2014, revised selected papers.
- Author
-
Boulanger, Danielle, Mercier-Laurent, Eunika, and Owoc, Mieczysław Lech
- Subjects
- Artificial intelligence, Data mining, Database management, Knowledge management
- Abstract
Summary: This book features a selection of papers presented at the Second IFIP WG 12.6 International Workshop on Artificial Intelligence for Knowledge Management, AI4KM 2014, held in Warsaw, Poland, in September 2014, in the framework of the Federated Conferences on Computer Science and Information Systems, FedCSIS 2014. The 9 revised and extended papers and one invited paper were carefully reviewed and selected for inclusion in this volume. They present new research and innovative aspects in the field of knowledge management and are organized in the following topical sections: tools and methods for knowledge acquisition; models and functioning of knowledge management; techniques of artificial intelligence supporting knowledge management; and components of knowledge flow.
- Published
- 2016
7. Identification of data mining research frontier based on conference papers
- Author
-
Huang, Yue, Liu, Hu, and Pan, Jing
- Published
- 2021
- Full Text
- View/download PDF
8. Identification of data mining research frontier based on conference papers
- Author
-
Yue Huang, Hu Liu, and Jing Pan
- Subjects
- data mining, bibliometrics, CiteSpace, conference papers, research frontier, Technology, Engineering (General). Civil engineering (General), TA1-2040
- Abstract
Purpose: Identifying the frontiers of a specific research field is one of the most basic tasks in bibliometrics, and research published in leading conferences is crucial to the data mining research community, yet few studies have focused on it. The purpose of this study is to detect the intellectual structure of data mining based on conference papers. Design/methodology/approach: This study takes papers from the top 9 conferences in the data mining field, as ranked by Google Scholar Metrics, as its sample. Based on paper counts, it first examines the annual number of published documents and their distribution across conferences. Furthermore, from the perspective of keywords, CiteSpace was used to mine the conference papers and identify the frontiers of data mining, focusing on keyword term frequency, keyword betweenness centrality, keyword clustering, and burst keywords. Findings: The research showed that interest in data mining followed a linear upward trend between 2007 and 2016. The frontier identification based on the conference papers revealed five research hotspots in data mining: clustering, classification, recommendation, social network analysis, and community detection. The research content embodied in the conference papers was also very rich. Originality/value: This study detected the research frontier from leading data mining conference papers. Based on the keyword co-occurrence network, along four dimensions (keyword term frequency, betweenness centrality, clustering analysis, and burst analysis), it identified and analyzed the research frontiers of the data mining discipline from 2007 to 2016.
- Published
- 2021
- Full Text
- View/download PDF
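A minimal, stdlib-only sketch of the keyword co-occurrence counting that underlies the network analysis described in this entry. CiteSpace computes much more (betweenness centrality, clustering, bursts); this shows only the pair-counting step, with made-up keyword lists:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(papers_keywords):
    """Count how often each unordered keyword pair appears in the same paper."""
    pair_counts = Counter()
    for keywords in papers_keywords:
        # Sort and dedupe so each unordered pair has one canonical key per paper.
        for a, b in combinations(sorted(set(keywords)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

papers = [
    ["clustering", "classification", "recommendation"],
    ["clustering", "community detection"],
    ["clustering", "classification"],
]
counts = cooccurrence_counts(papers)
print(counts[("classification", "clustering")])  # 2
```

The resulting pair counts define the weighted edges of the keyword co-occurrence network on which centrality and clustering measures are then computed.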
9. Incidences of problematic cell lines are lower in papers that use RRIDs to identify cell lines.
- Author
-
Babic, Zeljana, Capes-Davis, Amanda, Martone, Maryann E, Bairoch, Amos, Ozyurt, I Burak, Gillespie, Thomas H, and Bandrowski, Anita E
- Subjects
- Cell Line, Humans, Biomedical Research, Bibliometrics, PubMed, Periodicals as Topic, Data Mining, Cell Line Authentication, authentication, cell line, computational biology, reproducibility, rigor, software, systems biology, text mining, Biochemistry and Cell Biology
- Abstract
The use of misidentified and contaminated cell lines continues to be a problem in biomedical research. Research Resource Identifiers (RRIDs) should reduce the prevalence of misidentified and contaminated cell lines in the literature by alerting researchers to cell lines that are on the list of problematic cell lines, which is maintained by the International Cell Line Authentication Committee (ICLAC) and the Cellosaurus database. To test this assertion, we text-mined the methods sections of about two million papers in PubMed Central, identifying 305,161 unique cell-line names in 150,459 articles. We estimate that 8.6% of these cell lines were on the list of problematic cell lines, whereas only 3.3% of the cell lines in the 634 papers that included RRIDs were on the problematic list. This suggests that the use of RRIDs is associated with a lower reported use of problematic cell lines.
- Published
- 2019
10. Personalized paper recommendation for postgraduates using multi-semantic path fusion.
- Author
-
Xiao, Xia, Jin, Bo, and Zhang, Chengde
- Subjects
- Intergenerational mobility, Educational mobility, Graduate education, Data mining, Electronic data processing, Shift registers
- Abstract
During graduate education, postgraduates have to spend considerable time finding papers to explore the development branches of their field. However, existing paper recommendation methods focus on only a few attributes (title, author, keyword, venue, etc.). The network schema constructed from these attributes is extremely sparse, which easily causes the loss of important semantic paths between attributes. This results in a lack of correlations among relevant papers, which affects paper recommendation efficiency. Moreover, the relationships between multiple semantic paths can be found through common homogeneous and heterogeneous attributes. These relationships can establish many correlations among relevant papers. To address the above problems, this paper proposes a new approach that fuses multi-semantic paths into a heterogeneous educational network (HEN) for personalized paper recommendation. After data processing, a new HEN schema is built by enriching nodes and edges in heterogeneous networks. Then, different semantic meta-paths are generated by projection sub-nets. Next, a new HEN embedding method is proposed that fuses multiple semantic paths to generate rich HEN node sequences. Finally, personalized paper recommendations for postgraduates are generated by targeted path similarity. The proposed method was evaluated on two paper datasets, in the fields of educational intergenerational mobility from 1987 to 2021 and data mining and intelligent media from 1997 to 2021. Substantial experiments demonstrate that the proposed approach is effective. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
11. Development of an Embedding Framework for Clustering Scientific Papers
- Author
-
Songhee Kim, Suyeong Lee, and Byungun Yoon
- Subjects
- Clustering method, data mining, text mining, text analysis, scientific publishing, fuel cells, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
In this era, research and development have become a continuous and accelerating process because technology changes rapidly and has a short lifecycle. As a result, various methodologies are being developed to monitor these rapidly changing research trends; in particular, studies on clustering methods for science and technology documents are being developed with a variety of approaches. However, previous studies on document clustering focus on a specific field or language and do not take into consideration certain important pieces of information in science and technology documents. Therefore, this study proposes an embedding methodology that uses important content from scientific and technical documents. We took into consideration the importance of information contained in the core structures of science and technology documents and proposed a clustering methodology that analyzes structured and unstructured data, such as textual information, author information, and citation information. The proposed method combines both textual and structural data from the paper, using an approach that screens important information by section in science and technology documents. Then, Girvan-Newman clustering and Louvain clustering models are applied to the generated embedding vectors, and evaluation results are reported through clustering indices. As a practical example, we applied the proposed methodology to paper data from the field of hydrogen fuel cell vehicles. The results of this study will be effective in identifying gaps in technology for new technological development, identifying technology trends, and presenting directional information for future technology development.
- Published
- 2022
- Full Text
- View/download PDF
12. Extracting laboratory test information from paper-based reports.
- Author
-
Ma, Ming-Wei, Gao, Xian-Shu, Zhang, Ze-Yu, Shang, Shi-Yu, Jin, Ling, Liu, Pei-Lin, Lv, Feng, Ni, Wei, Han, Yu-Chen, and Zong, Hui
- Subjects
- Optical character recognition, Natural language processing, Health information systems, Text recognition, Random fields, Data mining
- Abstract
Background: In the healthcare domain today, despite the substantial adoption of electronic health information systems, a significant proportion of medical reports still exist in paper-based formats. As a result, there is significant demand for digitizing the information in these paper-based reports. However, digitizing paper-based laboratory reports into a structured data format can be challenging due to their non-standard layouts, which include various data types such as text, numeric values, reference ranges, and units. Therefore, it is crucial to develop a highly scalable and lightweight technique that can effectively identify and extract information from laboratory test reports and convert them into a structured data format for downstream tasks. Methods: We developed an end-to-end Natural Language Processing (NLP)-based pipeline for extracting information from paper-based laboratory test reports. Our pipeline consists of two main modules: an optical character recognition (OCR) module and an information extraction (IE) module. The OCR module locates and identifies text in scanned laboratory test reports using state-of-the-art OCR algorithms. The IE module then extracts meaningful information from the OCR results to form digitized tables of the test reports. The IE module consists of five sub-modules: time detection, headline position, line normalization, Named Entity Recognition (NER) with a Conditional Random Fields (CRF)-based method, and step detection for multi-column layouts. Finally, we evaluated the performance of the proposed pipeline on 153 laboratory test reports collected from Peking University First Hospital (PKU1). Results: In the OCR module, we evaluated the accuracy of text detection and recognition at three different levels and achieved an average accuracy of 0.93. In the IE module, we extracted four laboratory test entities: test item name, test result, test unit, and reference value range. The overall F1 score is 0.86 on the 153 laboratory test reports collected from PKU1. With a single CPU, the average inference time per report is only 0.78 s. Conclusion: In this study, we developed a practical, lightweight pipeline to digitize and extract information from paper-based laboratory test reports of diverse types and layouts that can be adopted in real clinical environments with minimal computing resource requirements. The high evaluation performance on the real-world hospital dataset validated the feasibility of the proposed pipeline. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
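The NER step in this entry uses a CRF model; as a much-simplified, rule-based stand-in, a single regular expression can split a cleanly OCR'd report line into the four entities the paper extracts. The line format and test values here are assumptions for illustration, not from the paper:

```python
import re

# Rule-based stand-in for the CRF-based NER step: parse one OCR'd
# report line into the four entities the paper extracts.
LINE_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z ]+?)\s+"       # test item name
    r"(?P<result>[\d.]+)\s+"           # test result
    r"(?P<unit>[^\s]+)\s+"             # test unit
    r"(?P<ref>[\d.]+\s*-\s*[\d.]+)$"   # reference value range
)

def parse_lab_line(line):
    """Return the four extracted entities, or None if the line doesn't parse."""
    m = LINE_PATTERN.match(line.strip())
    return m.groupdict() if m else None

row = parse_lab_line("Hemoglobin 13.5 g/dL 12.0-16.0")
print(row)
# {'name': 'Hemoglobin', 'result': '13.5', 'unit': 'g/dL', 'ref': '12.0-16.0'}
```

A real pipeline needs the CRF precisely because OCR output is noisier and layouts vary; this sketch only shows the target structure of the extraction.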
13. Analyzing the Accuracy of Answer Sheet Data in Paper-based Test Using Decision Tree
- Author
-
Edy Suharto, Aris Puji Widodo, and Suryono Suryono
- Subjects
- data mining, decision tree, paper-based test, education, Electronic computers. Computer science, QA75.5-76.95, Economic growth, development, planning, HD72-88
- Abstract
In education quality assurance, the accuracy of test data is crucial. However, there is still a problem regarding the possibility of incorrect data being filled in by test takers during a paper-based test. By contrast, this problem does not appear in computer-based tests. In this study, a method was proposed to analyze the accuracy of answer sheet filling in paper-based tests using data mining techniques. A single layer of data comprehension was added within the method instead of using raw data. The results of the study were a web-based program for data pre-processing and decision tree models. There were 374 instances analyzed. The accuracy of answer sheet filling reached 95.19%, while the accuracy of classification varied from 99.47% to 100% depending on the evaluation method chosen. This study could motivate administrators to improve testing, since the findings favor computer-based over paper-based tests.
- Published
- 2019
- Full Text
- View/download PDF
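Decision-tree induction, as used in this entry, rests on choosing splits by information gain. A minimal sketch of that split-selection step on toy answer-sheet records; the attribute names and labels are invented for illustration, not the paper's actual features:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(rows, labels, attributes):
    """Pick the attribute whose split yields the largest information gain."""
    base = entropy(labels)
    best_attr, best_gain = None, -1.0
    for attr in attributes:
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        gain = base - remainder
        if gain > best_gain:
            best_attr, best_gain = attr, gain
    return best_attr

# Toy answer-sheet records (attribute names are illustrative, not from the paper).
rows = [
    {"stray_marks": "yes", "double_answers": "no"},
    {"stray_marks": "yes", "double_answers": "yes"},
    {"stray_marks": "no",  "double_answers": "no"},
    {"stray_marks": "no",  "double_answers": "no"},
]
labels = ["inaccurate", "inaccurate", "accurate", "accurate"]
print(best_split(rows, labels, ["stray_marks", "double_answers"]))  # stray_marks
```

A full tree learner applies this choice recursively to each split's subsets until the leaves are pure or a stopping criterion is met.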
14. Elsevier opens its papers to text-mining.
- Author
-
Van Noorden R
- Subjects
- Copyright ethics, Copyright legislation & jurisprudence, Humans, Research Personnel, Access to Information, Data Mining trends, Periodicals as Topic, Publishing, Research
- Published
- 2014
- Full Text
- View/download PDF
15. Automatic extraction of significant terms from the title and abstract of scientific papers using the machine learning algorithm: A multiple module approach.
- Author
-
Mukherjee, Bhaskar and Majhi, Debasis
- Abstract
Keyword extraction is the task of identifying important terms or phrases that are most representative of the source document. Although automatic extraction of keywords from titles is an old method, it was mainly applied to single web documents. Our approach differs from previous research on keyword extraction in several respects. For non-experts of a scientific field, understanding research trends is difficult. The purpose of this study is to develop an automatic method for obtaining overviews of a scientific field for non-experts by capturing research trends. This empirical study investigates significant-term extraction using Natural Language Processing (NLP) tools. Our dataset comprised more than 15,000 titles saved in a .csv file, and scripts written in Python were used to compare how far the significant terms of the scientific title corpus are similar or different to the terms available in the abstracts of the same articles. A lightweight unsupervised extractor, Yet Another Keyword Extractor (YAKE), was used to extract the results. Based on our analysis, we conclude that these algorithms can also be used by non-experts in other subject fields to extract significant words automatically and understand trends. Our algorithm could help reduce the labour-intensive manual indexing process. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
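This entry uses YAKE for unsupervised term extraction. As a much cruder stdlib-only stand-in, plain term-frequency ranking over titles shows the general shape of the task; the stopword list and example titles are illustrative:

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real pipelines use much larger ones.
STOPWORDS = {"a", "an", "and", "for", "from", "in", "of", "on", "the", "to", "using", "with"}

def significant_terms(titles, top_n=5):
    """Rank title terms by frequency -- a crude stand-in for YAKE's scoring."""
    counts = Counter()
    for title in titles:
        for token in re.findall(r"[a-z]+", title.lower()):
            if token not in STOPWORDS and len(token) > 2:
                counts[token] += 1
    return [term for term, _ in counts.most_common(top_n)]

titles = [
    "Text mining for genomics",
    "Data mining of conference papers",
    "Mining scientific papers with machine learning",
]
print(significant_terms(titles, top_n=3))
```

YAKE itself scores candidate phrases with statistical features (position, casing, dispersion) rather than raw frequency, but the input/output shape is the same: titles in, ranked terms out.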
16. References.
- Author
-
Schmidt, Julia, Pilgrim, Graham, and Mourougane, Annabelle
- Subjects
- Working papers, Labor market, Data mining
- Published
- 2023
- Full Text
- View/download PDF
17. Emotion Mining: from Unimodal to Multimodal Approaches
- Author
-
Zucco, Chiara, Calabrese, Barbara, Cannataro, Mario, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Amunts, Katrin, editor, Grandinetti, Lucio, editor, Lippert, Thomas, editor, and Petkov, Nicolai, editor
- Published
- 2021
- Full Text
- View/download PDF
18. Data-based robust multiobjective optimization of interconnected processes: energy efficiency case study in papermaking.
- Author
-
Afshar P, Brown M, Maciejowski J, and Wang H
- Subjects
- Energy Transfer, Artificial Intelligence, Data Mining methods, Databases, Factual, Feedback, Models, Theoretical, Paper
- Abstract
Reducing energy consumption is a major challenge for "energy-intensive" industries such as papermaking. A commercially viable energy saving solution is to employ data-based optimization techniques to obtain a set of "optimized" operational settings that satisfy certain performance indices. The difficulties of this are: 1) the problems of this type are inherently multicriteria in the sense that improving one performance index might result in compromising the other important measures; 2) practical systems often exhibit unknown complex dynamics and several interconnections which make the modeling task difficult; and 3) as the models are acquired from the existing historical data, they are valid only locally and extrapolations incorporate risk of increasing process variability. To overcome these difficulties, this paper presents a new decision support system for robust multiobjective optimization of interconnected processes. The plant is first divided into serially connected units to model the process, product quality, energy consumption, and corresponding uncertainty measures. Then multiobjective gradient descent algorithm is used to solve the problem in line with user's preference information. Finally, the optimization results are visualized for analysis and decision making. In practice, if further iterations of the optimization algorithm are considered, validity of the local models must be checked prior to proceeding to further iterations. The method is implemented by a MATLAB-based interactive tool DataExplorer supporting a range of data analysis, modeling, and multiobjective optimization techniques. The proposed approach was tested in two U.K.-based commercial paper mills where the aim was reducing steam consumption and increasing productivity while maintaining the product quality by optimization of vacuum pressures in forming and press sections. The experimental results demonstrate the effectiveness of the method.
- Published
- 2011
- Full Text
- View/download PDF
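The multiobjective gradient descent described in this entry can be sketched in its simplest scalarized form: combine the objective gradients with the user's preference weights and descend. The two quadratic objectives below are toy stand-ins, not the paper's process models:

```python
def weighted_sum_descent(grad_fns, weights, x0, lr=0.1, steps=200):
    """Scalarize several objectives with preference weights, then run gradient descent."""
    x = x0
    for _ in range(steps):
        # Gradient of the preference-weighted sum of objectives.
        g = sum(w * grad(x) for w, grad in zip(weights, grad_fns))
        x -= lr * g
    return x

# Two illustrative objectives of one decision variable x (e.g. a vacuum pressure):
#   f1(x) = (x - 2)^2, grad 2(x - 2)   (say, steam consumption)
#   f2(x) = (x - 6)^2, grad 2(x - 6)   (say, quality deviation)
grads = [lambda x: 2 * (x - 2), lambda x: 2 * (x - 6)]

# Equal preference: the optimum of 0.5*f1 + 0.5*f2 is the midpoint x = 4.
x_star = weighted_sum_descent(grads, [0.5, 0.5], x0=0.0)
print(round(x_star, 3))  # 4.0
```

Varying the weights traces out different trade-off points; the paper's method additionally handles model uncertainty and interconnected process units, which this sketch omits.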
19. What the papers say: text mining for genomics and systems biology.
- Author
-
Harmston N, Filsell W, and Stumpf MP
- Subjects
- Publication Bias, Terminology as Topic, Data Mining methods, Genomics methods, Periodicals as Topic, Systems Biology methods
- Abstract
Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining - the automated extraction of information from (electronically) published sources - could potentially fulfil an important role - but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward.
- Published
- 2010
- Full Text
- View/download PDF
20. FutureCite: Predicting Research Articles' Impact Using Machine Learning and Text and Graph Mining Techniques.
- Author
-
Thafar, Maha A., Alsulami, Mashael M., and Albaradei, Somayah
- Subjects
- Text mining, Data mining, Feature extraction, Citation networks, Research personnel
- Abstract
The number of academic and scientific publications has grown very rapidly. Researchers must choose representative and significant literature for their research, which has become challenging worldwide. Usually, a paper's citation count indicates its potential influence and importance. However, this standard metric is not suitable for assessing the popularity and significance of recently published papers. To address this challenge, this study presents an effective prediction method called FutureCite to predict the future citation level of research articles. FutureCite integrates machine learning with text and graph mining techniques, leveraging their abilities in classification, in-depth dataset analysis, and feature extraction. FutureCite predicts the future citation levels of research articles using a multilabel classification approach. During feature extraction, it can extract significant semantic features and capture the interconnection relationships found in scientific articles, using textual content, citation networks, and metadata as feature resources. The objective of this study is to contribute to effective approaches for estimating citation counts in scientific publications by enhancing the precision of future citation prediction. We conducted several experiments using a comprehensive publication dataset to evaluate our method and determine the impact of using a variety of machine learning algorithms. FutureCite demonstrated its robustness and efficiency and showed promising results on different evaluation metrics. The FutureCite model has significant implications for improving researchers' ability to identify targeted literature for their research and better understand the potential impact of research publications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
21. A pipeline for the retrieval and extraction of domain-specific information with application to COVID-19 immune signatures.
- Author
-
Newton, Adam J. H., Chartash, David, Kleinstein, Steven H., and McDougal, Robert A.
- Subjects
- Data mining, COVID-19, Gene expression, SARS-CoV-2
- Abstract
Background: The accelerating pace of biomedical publication has made it impractical to manually, systematically identify papers containing specific information and extract this information. This is especially challenging when the information itself resides beyond titles or abstracts. For emerging science, with a limited set of known papers of interest and an incomplete information model, this is of pressing concern. A timely example in retrospect is the identification of immune signatures (coherent sets of biomarkers) driving differential SARS-CoV-2 infection outcomes. Implementation: We built a classifier to identify papers containing domain-specific information from the document embeddings of the title and abstract. To train this classifier with limited data, we developed an iterative process leveraging pre-trained SPECTER document embeddings, SVM classifiers and web-enabled expert review to iteratively augment the training set. This training set was then used to create a classifier to identify papers containing domain-specific information. Finally, information was extracted from these papers through a semi-automated system that directly solicited the paper authors to respond via a web-based form. Results: We demonstrate a classifier that retrieves papers with human COVID-19 immune signatures with a positive predictive value of 86%. The type of immune signature (e.g., gene expression vs. other types of profiling) was also identified with a positive predictive value of 74%. Semi-automated queries to the corresponding authors of these publications requesting signature information achieved a 31% response rate. Conclusions: Our results demonstrate the efficacy of using a SVM classifier with document embeddings of the title and abstract, to retrieve papers with domain-specific information, even when that information is rarely present in the abstract. 
Targeted author engagement based on classifier predictions offers a promising pathway to build a semi-structured representation of such information. Through this approach, partially automated literature mining can help rapidly create semi-structured knowledge repositories for automatic analysis of emerging health threats. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
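The classifier in this entry is an SVM over 768-dimensional SPECTER document embeddings. A nearest-centroid classifier over toy 2-d vectors is a much simpler stand-in that still shows the retrieval-by-classifier idea; all vectors here are invented:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def classify(embedding, pos_centroid, neg_centroid):
    """Label a paper 'relevant' if its embedding lies closer to the positive centroid."""
    d_pos = sum((a - b) ** 2 for a, b in zip(embedding, pos_centroid))
    d_neg = sum((a - b) ** 2 for a, b in zip(embedding, neg_centroid))
    return "relevant" if d_pos < d_neg else "irrelevant"

# Toy 2-d "document embeddings" (real SPECTER vectors are 768-dimensional).
relevant_papers = [[0.9, 0.1], [0.8, 0.2]]
irrelevant_papers = [[0.1, 0.9], [0.2, 0.8]]
pos_c = centroid(relevant_papers)
neg_c = centroid(irrelevant_papers)
print(classify([0.7, 0.3], pos_c, neg_c))  # relevant
```

The iterative loop described above would run a classifier like this over candidate papers, send borderline predictions to expert review, and fold the reviewed labels back into the training set.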
22. Comparison of three filter paper-based devices for safety and stability of viral sample collection in poultry
- Author
-
Suwarak Wannaratana, Aunyaratana Thontiravong, and Somsak Pakpinyo
- Subjects
- General Immunology and Microbiology, Food Animals, Filter paper, DNA stability, viruses, Stability (learning theory), Animal Science and Zoology, Sample collection, Data mining, Biology
- Abstract
General diagnosis of poultry viruses primarily relies on detection of viruses in samples, but many farms are located in remote areas requiring logistic transportation. Filter paper cards are a usef...
- Published
- 2020
23. [Tukey's Paper after 40 Years]: Discussion
- Author
-
Brillinger, David R.
- Published
- 2006
- Full Text
- View/download PDF
24. [Tukey's Paper after 40 Years]: Discussion
- Author
-
Huber, Peter J.
- Published
- 2006
- Full Text
- View/download PDF
25. A Text Mining Approach to Covid-19 Literature.
- Author
-
Liu, Fangyao, Ergu, Daji, Li, Biao, Deng, Wei, Chen, Zhengxin, Lu, Guoqing, and Shi, Yong
- Subjects
- Text mining, SARS-CoV-2, COVID-19, Medical research personnel, Data mining
- Abstract
The novel coronavirus disease, COVID-19, is a historic catastrophe that has had many devastating impacts on human life and wellness. Researchers in academia and industry strive to understand the causes of this pandemic disease and find new therapeutics to combat it. Consequently, the number of COVID-19 related publications is increasing rapidly, and it is difficult for medical researchers and practitioners to keep up with the latest research and development. Text mining is a powerful tool for literature filtering, categorization, and knowledge discovery. In this paper, we propose a text mining method to explore the categories of COVID-19 related themes and identify the standard methodologies that have been used. We discuss the potential limitations of this preliminary study and present future perspectives related to COVID-19 research. This paper provides a mixed quantitative and qualitative analysis example of applying data mining methods to research papers to uncover hidden information, and lays a foundation for data scientists to develop more effective algorithms for COVID-19 related problems. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
26. A Review Paper on Big Data and Data Mining Concepts and Techniques
- Author
-
Prasdika Prasdika and Bambang Sugiantoro
- Subjects
data ,big data ,data mining ,Electronic computers. Computer science ,QA75.5-76.95 ,Economic growth, development, planning ,HD72-88 - Abstract
In the digital era, data in databases is growing very rapidly; everything related to technology contributes substantially to this growth, including social media, financial technology, and scientific data. Therefore, topics such as big data and data mining are frequently discussed. Data mining is a method of extracting information from big data to produce information patterns or detect data anomalies.
- Published
- 2018
- Full Text
- View/download PDF
27. Analyzing the Accuracy of Answer Sheet Data in Paper-based Test Using Decision Tree
- Author
-
Aris Puji Widodo, Edy Suharto, and Suryono Suryono
- Subjects
education ,Computer science ,Decision tree ,Paper based ,data mining ,computer.software_genre ,lcsh:QA75.5-76.95 ,lcsh:HD72-88 ,Test (assessment) ,lcsh:Economic growth, development, planning ,Comprehension ,paper-based test ,Order (business) ,decision tree ,Data mining ,lcsh:Electronic computers. Computer science ,Raw data ,computer ,Single layer ,Test data - Abstract
In education quality assurance, the accuracy of test data is crucial. However, there remains a risk of incorrect data being filled in by test takers during a paper-based test; this problem does not arise in a computer-based test. In this study, a method was proposed to analyze the accuracy of answer sheet filling in paper-based tests using data mining techniques. A single layer of data comprehension was added within the method instead of working on raw data. The results of the study were a web-based program for data pre-processing and decision tree models. In total, 374 instances were analyzed. The accuracy of answer sheet filling reached 95.19%, while the accuracy of classification varied from 99.47% to 100% depending on the evaluation method chosen. This study could motivate administrators to improve testing, since it found computer-based tests preferable to paper-based ones.
- Published
- 2019
28. Predicting rank for scientific research papers using supervised learning.
- Author
-
El Mohadab, Mohamed, Bouikhalene, Belaid, and Safi, Said
- Subjects
ELECTRONIC data processing ,SUPERVISED learning ,MACHINE learning ,INFORMATION & communication technologies ,INFORMATION technology ,ELECTRONIC services - Abstract
Automatic data processing represents the future for the development of any system, especially in scientific research. In this paper, we describe an automatic classification method applied to scientific research as a supervised learning task. Throughout the process, we identify the main features that play a significant role in predicting the new rank under the supervised learning setup. First, we give an overview of prior work on ranking scientific research papers. Second, we evaluate and compare state-of-the-art approaches to classification by supervised, semi-supervised, and unsupervised learning. In preliminary tests, we obtained good performance on a realistic corpus; we then compared performance metrics such as NDCG, MAP, GMAP, F-Measure, Precision, and Recall in order to identify the influential features in our work. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
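The ranking metrics named in the abstract of entry 28 (NDCG, Precision, Recall) are standard and easy to compute. As a minimal illustration only (the relevance grades below are made up, not from the paper), NDCG compares the discounted gain of a predicted ranking against the ideal ordering:

```python
from math import log2

def dcg(rels):
    """Discounted cumulative gain of a list of graded relevances,
    in rank order (position 0 is the top of the ranking)."""
    return sum(rel / log2(i + 2) for i, rel in enumerate(rels))

def ndcg(ranked_rels):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of papers in predicted rank order.
predicted = [3, 2, 3, 0, 1]
print(ndcg(predicted))  # 1.0 only if the ranking is already ideal
```

A perfectly ordered list scores 1.0; swapping a highly relevant paper below a less relevant one lowers the score, which is why NDCG is preferred over plain Precision when relevance is graded rather than binary.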
29. Using a Data-Mining Approach to Unveil Greenhouse Gas Emission Intensities of Different Pulp and Paper Products.
- Author
-
Nabinger, Alec, Tomberlin, Kristen, Venditti, Richard, and Yao, Yuan
- Abstract
Life Cycle Assessment (LCA) has been used to evaluate the life-cycle Greenhouse Gas (GHG) emissions of pulp and paper production, and most previous studies rely on process-based models for specific product types (e.g., printing paper), industry-average data, or information from a few mills. In this work, a data-mining approach is used to quantify the GHG emission intensities of different paper products manufactured by U.S. mills. Facility-level emission data collected from publicly available governmental databases and mill-level production data collected from the private sector were integrated to track the GHG emissions of different product lines and paper products in mills (in total, 165 mills were matched and analyzed). The results highlight the ranges of GHG emission intensities across product groups and categories, and can be used as a transparent data source for LCA practitioners, policymakers, and the pulp and paper industry to perform further analysis on carbon accounting and strategic planning for GHG mitigation. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
30. The Arquive of Tatuoca Magnetic Observatory Brazil: from paper to intelligent bytes
- Author
-
Cristian Berrio-Zapata, Ester Ferreira da Silva, Mayara Costa Pinheiro, Vinicius Augusto Carvalho de Abreu, Cristiano Mendel Martins, Mario Augusto Gongora, and Kelso Dunman
- Subjects
Big Data ,records management ,Observatories ,Deep learning ,Geomagnetism ,geophysics computing ,information retrieval systems ,Collaboration ,Data mining - Abstract
The Magnetic Observatory of Tatuoca (TTB) was installed by the Observatório Nacional (ON) in 1957, near the city of Belém in the state of Pará, in the Brazilian Amazon. Its history goes back to 1933, when a Danish mission collected data at this location because of its privileged position near the terrestrial equator. Between 1957 and 2007, TTB produced 18,000 magnetograms on paper using photographic variometers, along with other associated documents such as absolute value forms and yearbooks. Data was obtained manually from these graphs with rulers and grids, taking 24 average readings per day, that is, one per hour. In 2017, the Federal University of Pará (UFPA in the Portuguese acronym) and ON collaborated to rescue this physical archive. In 2022, UFPA took a step further and proposed not only digitizing the documents but also developing an intelligent agent capable of reading and extracting the information from the curves at a resolution better than one hour, this being the central goal of the project. If the project succeeds, it will rescue 50 years of data imprisoned on paper, increasing measurement sensitivity far beyond what these sources used to give. This will also open the possibility of applying the same AI to similar documents from other observatories or from disciplines like seismography. This article recaps the project and the complex challenges faced in articulating Archival Science principles with AI and Geoscience.
- Published
- 2022
- Full Text
- View/download PDF
31. Auto-generated Test Paper Based on Knowledge Embedding
- Author
-
Guo-Sheng Hao, Fang Luo, Yi-Yang He, Xiao-Dan He, Zeng-Hui Duan, and Xing-Liu Hu
- Subjects
Computer science ,Embedding ,Data mining ,Paper based ,computer.software_genre ,computer ,Computer Science Applications ,Education ,Test (assessment) - Published
- 2019
32. Experimental Comparison in Sensing Breast Cancer Mutations by Signal ON and Signal OFF Paper-Based Electroanalytical Strips
- Author
-
Emily P. Nguyen, Fabiana Arduini, Claudio Parolo, Giulia Cinotti, Danila Moscone, Stefano Cinti, Arben Merkoçi, Veronica Caratelli, Cinti, S., Cinotti, G., Parolo, C., Nguyen, E. P., Caratelli, V., Moscone, D., Arduini, F., and Merkoci, A.
- Subjects
Paper ,DNA, Single-Stranded ,Breast Neoplasms ,STRIPS ,Biosensing Techniques ,010402 general chemistry ,computer.software_genre ,01 natural sciences ,Signal ,Field (computer science) ,Analytical Chemistry ,law.invention ,Biosensing Technique ,DNA-based biosensors ,Breast cancer ,Settore CHIM/01 ,Design and Development ,law ,Experimental comparison ,Detection methods ,medicine ,Humans ,Liquid biopsy ,Protocol (science) ,Electrochemical Technique ,Chemistry ,010401 analytical chemistry ,Analytical performance ,Electrochemical Techniques ,medicine.disease ,Signal on ,0104 chemical sciences ,Emerging technologies ,Mutation ,Single strand DNA ,Female ,Data mining ,Detection protocols ,Biosensor ,computer ,Breast Neoplasm ,Human - Abstract
Other grants: the ICN2 is funded by the CERCA Programme / Generalitat de Catalunya. The development of paper-based electroanalytical strips as powerful diagnostic tools has gained a lot of attention within the sensor community. In particular, the detection of nucleic acids in complex matrices is a trending topic, especially in the development of emerging technologies such as liquid biopsy. DNA-based biosensors have been widely applied in this direction, and currently there are two main approaches based on target/probe hybridization reported in the literature, namely Signal ON and Signal OFF. In this technical note, the two approaches are evaluated in combination with paper-based electrodes, using a single-strand DNA corresponding to the H1047R (A3140G) missense mutation in exon 20 in breast cancer as the model target. A detailed comparison of the analytical performance, detection protocols, and costs associated with the two systems is provided, highlighting the advantages and drawbacks of each depending on the application. The present work is aimed at a wide audience, particularly those in the field of point-of-care testing, and is intended to provide the know-how to manage the design and development stages and to optimize the platform for sensing nucleic acids using a paper-based detection method.
- Published
- 2019
33. Low-Rank and Sparse Matrix Factorization for Scientific Paper Recommendation in Heterogeneous Network
- Author
-
Li Zhu, Xiaoyan Cai, Tianyu Gao, Shirui Pan, and Tao Dai
- Subjects
General Computer Science ,Rank (linear algebra) ,Computer science ,heterogeneous network ,General Engineering ,02 engineering and technology ,low rank and sparse matrix factorization ,Recommender system ,computer.software_genre ,Matrix decomposition ,Matrix (mathematics) ,Paper recommendation ,Cold start ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Collaborative filtering ,020201 artificial intelligence & image processing ,General Materials Science ,Data mining ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,computer ,lcsh:TK1-9971 ,Heterogeneous network ,Sparse matrix - Abstract
© 2013 IEEE. With the rapid growth of scientific publications, it is hard for researchers to acquire appropriate papers that meet their expectations. Recommendation system for scientific articles is an essential technology to overcome this problem. In this paper, we propose a novel low-rank and sparse matrix factorization-based paper recommendation (LSMFPRec) method for authors. The proposed method seamlessly combines low-rank and sparse matrix factorization method with fine-grained paper and author affinity matrixes that are extracted from heterogeneous scientific network. Thus, it can effectively alleviate the sparsity and cold start problems that exist in traditional matrix factorization based collaborative filtering methods. Moreover, LSMFPRec can significantly reduce the error propagated from intermediate outputs. In addition, the proposed method essentially captures the low-rank and sparse characteristics that exist in scientific rating activities; therefore, it can generate more reasonable predicted ratings for influential and uninfluential papers. The effectiveness of the proposed LSMFPRec is demonstrated by the recommendation evaluation conducted on the AAN and CiteULike data sets.
- Published
- 2018
34. Data science fundamentals for Python and MongoDB.
- Author
-
Paper, David
- Subjects
MongoDB ,Data mining ,Python (Computer program language) ,COMPUTERS -- General ,Programming & scripting languages - Abstract
Summary: Build the foundational data science skills necessary to work with and better understand complex data science algorithms. This example-driven book provides complete Python coding examples to complement and clarify data science concepts, and enrich the learning experience. Coding examples include visualizations whenever appropriate. The book is a necessary precursor to applying and implementing machine learning algorithms. The book is self-contained. All of the math, statistics, stochastic, and programming skills required to master the content are covered. In-depth knowledge of object-oriented programming isn't required because complete examples are provided and explained. Data Science Fundamentals with Python and MongoDB is an excellent starting point for those interested in pursuing a career in data science. Like any science, the fundamentals of data science are a prerequisite to competency. Without proficiency in mathematics, statistics, data manipulation, and coding, the path to success is "rocky" at best. The coding examples in this book are concise, accurate, and complete, and perfectly complement the data science concepts introduced. What You'll Learn: Prepare for a career in data science Work with complex data structures in Python Simulate with Monte Carlo and Stochastic algorithms Apply linear algebra using vectors and matrices Utilize complex algorithms such as gradient descent and principal component analysis Wrangle, cleanse, visualize, and problem solve with data Use MongoDB and JSON to work with data.
- Published
- 2018
35. Energy assessment of Paper Machines.
- Author
-
Bhutani, Naveen, Lindberg, Carl Fredrik, Starr, Kevin, and Horton, Robert
- Subjects
PAPERMAKING machinery ,PAPER mills ,ENERGY consumption ,ENERGY development ,SENSITIVITY analysis ,PERFORMANCE evaluation ,RATE of return ,DATA mining - Abstract
Abstract: There is large value in making pulp and paper mills more energy efficient. ABB has developed an energy assessment service in which opportunities to save energy in the paper machine are identified. The assessment is done by quantifying energy flows, benchmarking energy users, data mining and steam sensitivity analysis, and by experiments and additional measurements at the paper machine. Energy quantification helped identify the main energy consumers; benchmarking was useful for assessing the gap between operating performance and best performance; and data mining and steam sensitivity analysis helped in studying the impact of key operating variables on the performance of paper machines. After the assessment, an action plan for energy efficiency improvement was presented to the mill, together with a return on investment. [Copyright Elsevier]
- Published
- 2012
- Full Text
- View/download PDF
36. Identification of data mining research frontier based on conference papers
- Author
-
Jing Pan, Yue Huang, and Hu Liu
- Subjects
Computer science ,020206 networking & telecommunications ,Sample (statistics) ,02 engineering and technology ,Bibliometrics ,computer.software_genre ,Field (computer science) ,Ranking (information retrieval) ,Identification (information) ,Betweenness centrality ,0202 electrical engineering, electronic engineering, information engineering ,Computer Science (miscellaneous) ,Business, Management and Accounting (miscellaneous) ,020201 artificial intelligence & image processing ,Decision Sciences (miscellaneous) ,Data mining ,Cluster analysis ,Social network analysis ,computer - Abstract
Purpose Identifying the frontiers of a specific research field is one of the most basic tasks in bibliometrics, and research published in leading conferences is crucial to the data mining research community, yet few studies have focused on it. The purpose of this study is to detect the intellectual structure of data mining based on conference papers. Design/methodology/approach This study takes the papers of the top nine conferences in the data mining field, as ranked by Google Scholar Metrics, as its sample. Based on paper counts, it first examines the annual volume of published documents and their distribution across conferences. Furthermore, from the perspective of keywords, CiteSpace was used to mine the conference papers and identify the frontiers of data mining, focusing on keyword term frequency, keyword betweenness centrality, keyword clustering, and burst keywords. Findings The research heat of data mining followed a linear upward trend between 2007 and 2016. The frontier identification based on the conference papers showed five research hotspots in data mining: clustering, classification, recommendation, social network analysis, and community detection. The research content embodied in the conference papers was also very rich. Originality/value This study detected the research frontier from leading data mining conference papers. Based on the keyword co-occurrence network, along four dimensions (keyword term frequency, betweenness centrality, clustering analysis, and burst analysis), it identified and analyzed the research frontiers of the data mining discipline from 2007 to 2016.
- Published
- 2021
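The keyword analysis described in entry 36 starts from two simple counts: keyword term frequency and keyword co-occurrence (the edge weights of the network on which betweenness centrality, clustering, and burst detection are later computed). A minimal sketch, using hypothetical keyword sets rather than the study's corpus:

```python
from collections import Counter
from itertools import combinations

# Hypothetical author keyword sets, one per conference paper.
papers = [
    {"clustering", "classification"},
    {"clustering", "social network analysis"},
    {"recommendation", "classification"},
    {"community detection", "social network analysis"},
    {"clustering", "community detection"},
]

# Keyword term frequency: how often each keyword appears across papers.
term_freq = Counter(kw for kws in papers for kw in kws)

# Keyword co-occurrence: how often two keywords appear on the same paper.
cooc = Counter(frozenset(pair)
               for kws in papers
               for pair in combinations(sorted(kws), 2))

print(term_freq.most_common(3))
print(cooc[frozenset({"clustering", "classification"})])
```

Tools like CiteSpace build exactly this kind of co-occurrence network before computing centrality and burst statistics on it.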
37. ANALYSIS METHOD OF RESEARCH PAPERS PUBLISHED FOR AUDIT
- Author
-
GREAVU-ȘERBAN VALERICĂ
- Subjects
audit ,Google Scholar ,web scraping ,data mining ,DataMiner ,Commercial geography. Economic geography ,HF1021-1027 ,Economics as a science ,HB71-74 - Abstract
Representing a strong instrument of control and feedback used by top management executives, regulatory institutions, and independent bodies, the audit and its methods and techniques attract the interest of specialists, professionals, professors, and researchers across all socio-economic activities. The way domain experts write about audit itself is often reflected in the keywords they choose for the title and the article. This study is a detailed analysis of the assignment of articles published in the "Financial Audit" journal to specific thematic areas, covering all issues published in electronic format between 2003 and 2015. The study differs from other similar research in its methodology and in the type of information extracted. The main purpose is to identify the keywords most used in the titles and content of the articles published over time and to trace future research directions from them. The conclusions of the analysis give a comprehensive picture of the multidisciplinary nature of audit, thus providing researchers across several economic fields with an image of the publication's content, and quality information for readers, authors, and future authors.
- Published
- 2015
38. CORE: A Global Aggregation Service for Open Access Papers.
- Author
-
Knoth, Petr, Herrmannova, Drahomira, Cancellieri, Matteo, Anastasiou, Lucas, Pontika, Nancy, Pearce, Samuel, Gyawali, Bikash, and Pride, David
- Subjects
ELECTRONIC journals ,OPEN access publishing ,TEXT mining ,SCIENTIFIC knowledge ,SCIENTIFIC literature ,DATA mining - Abstract
This paper introduces CORE, a widely used scholarly service, which provides access to the world's largest collection of open access research publications, acquired from a global network of repositories and journals. CORE was created with the goal of enabling text and data mining of scientific literature and thus supporting scientific discovery, but it is now used in a wide range of use cases within higher education, industry, not-for-profit organisations, as well as by the general public. Through the provided services, CORE powers innovative use cases, such as plagiarism detection, in market-leading third-party organisations. CORE has played a pivotal role in the global move towards universal open access by making scientific knowledge more easily and freely discoverable. In this paper, we describe CORE's continuously growing dataset and the motivation behind its creation, present the challenges associated with systematically gathering research papers from thousands of data providers worldwide at scale, and introduce the novel solutions that were developed to overcome these challenges. The paper then provides an in-depth discussion of the services and tools built on top of the aggregated data and finally examines several use cases that have leveraged the CORE dataset and services. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
39. Classification models for likelihood prediction of diabetes at early stage using feature selection
- Author
-
Oladimeji, Oladosu Oyebisi, Oladimeji, Abimbola, and Oladimeji, Olayanju
- Published
- 2024
- Full Text
- View/download PDF
40. Application of COReS to Compute Research Papers Similarity
- Author
-
Muhammad Abdul Qadir, Muhammad Afzal, and Qamar Mahmood
- Subjects
General Computer Science ,Process (engineering) ,Computer science ,content based similarity ,02 engineering and technology ,Ontology (information science) ,computer.software_genre ,Semantics ,ranking ,Similarity (network science) ,Comprehensive similarity computation ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,research paper similarity ,General Materials Science ,ontology ,Cluster analysis ,Measure (data warehouse) ,Information retrieval ,General Engineering ,Encyclopedia ,Ontology ,020201 artificial intelligence & image processing ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,Data mining ,lcsh:TK1-9971 ,computer - Abstract
Over the decades, immense growth has been reported in research publications due to continuous developments in science. To date, various approaches have been proposed that find similarity between research papers by applying different content-based similarity measures, collectively or individually. However, contemporary schemes are not conceptualized enough to find related research papers in a coherent manner. This paper aims to find related research papers by proposing a comprehensive and conceptualized model built on an ontology named COReS: Content-based Ontology for Research Paper Similarity. The ontology is built by finding the explicit relationships (i.e., super-type/sub-type, disjointedness, and overlapping) between state-of-the-art similarity techniques. This paper presents applications of the COReS model in the form of a case study followed by an experiment. The case study uses in-text citation-based and vector space-based similarity measures and the relationships between these measures as defined in COReS. The experiment focuses on computing comprehensive similarity and other content-based similarity measures, and on ranking research papers according to these measures. The Spearman correlation coefficients obtained between the ranks of research papers under different similarity measures and a user study-based measure justify the application of COReS for computing document similarity. COReS is currently being evaluated for ontological errors. In the future, COReS will be enriched with more knowledge to improve the process of comprehensive research paper similarity computation.
- Published
- 2017
41. Process Mining Workshops. ICPM 2022 International Workshops, Bozen-Bolzano, Italy, October 23-28, 2022, Revised Selected Papers.
- Author
-
Montali, Marco, Montali, Marco, Senderovich, Arik, and Weidlich, Matthias
- Subjects
Business mathematics & systems ,Data mining ,Health & safety aspects of IT ,Information technology: general issues ,Machine learning ,business process management ,conformance checking ,data science ,deep learning ,event data ,health informatics ,knowledge graphs ,machine learning ,predictive process monitoring ,process analytics ,process discovery ,process mining ,process querying ,streaming analytics - Abstract
Summary: This open access book constitutes revised selected papers from the International Workshops held at the 4th International Conference on Process Mining, ICPM 2022, which took place in Bozen-Bolzano, Italy, during October 23-28, 2022. The conference focuses on the area of process mining research and practice, including theory, algorithmic challenges, and applications. The co-located workshops provided a forum for novel research ideas. The 42 papers included in this volume were carefully reviewed and selected from 89 submissions. They stem from the following workshops: - 3rd International Workshop on Event Data and Behavioral Analytics (EDBA) - 3rd International Workshop on Leveraging Machine Learning in Process Mining (ML4PM) - 3rd International Workshop on Responsible Process Mining (RPM) (previously known as Trust, Privacy and Security Aspects in Process Analytics) - 5th International Workshop on Process-Oriented Data Science for Healthcare (PODS4H) - 3rd International Workshop on Streaming Analytics for Process Mining (SA4PM) - 7th International Workshop on Process Querying, Manipulation, and Intelligence (PQMI) - 1st International Workshop on Education meets Process Mining (EduPM) - 1st International Workshop on Data Quality and Transformation in Process Mining (DQT-PM)
42. Learning Embeddings for Academic Papers
- Author
-
Zhang, Yi
- Subjects
Skip-gram ,Academic papers ,Networks ,Data mining ,Embeddings - Abstract
Academic papers contain both text and citation links. Representing such data is crucial for many downstream tasks, such as classification, disambiguation, duplicate detection, recommendation, and influence prediction. The success of the Skip-gram with Negative Sampling model (hereafter SGNS) has inspired many algorithms for learning embeddings of words, documents, and networks. However, there is limited research on learning representations of linked documents such as academic papers. This dissertation first studies the norm convergence issue in SGNS and proposes using L2 regularization to fix the problem. Our experiments show that our method improves SGNS and its variants on different types of data. We observe improvements of up to 17.47% for word embeddings, 1.85% for document embeddings, and 46.41% for network embeddings. To learn embeddings for academic papers, we propose several neural network based algorithms that can learn high-quality embeddings from different types of data. The algorithms we propose are N2V (network2vector) for networks, D2V (document2vector) for documents, and P2V (paper2vector) for academic papers. Experiments show that our models outperform traditional algorithms and state-of-the-art neural network methods on various datasets under different machine learning tasks. Using these high-quality embeddings, we design and present four applications on real-world datasets: academic paper and author search engines, author name disambiguation, and paper influence prediction.
- Published
- 2019
43. A Hybrid Model Based on LFM and BiGRU Toward Research Paper Recommendation
- Author
-
Ziqing Nie, Xu Zhao, Chenkun Meng, Tie Feng, and Hui Kang
- Subjects
Word embedding ,General Computer Science ,Computer science ,Feature vector ,Feature extraction ,02 engineering and technology ,Semantics ,computer.software_genre ,LFM ,Matrix decomposition ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Recommender systems ,General Materials Science ,BiGRU ,user attention ,Artificial neural network ,business.industry ,Deep learning ,General Engineering ,deep learning ,TK1-9971 ,020201 artificial intelligence & image processing ,Artificial intelligence ,Data mining ,Electrical engineering. Electronics. Nuclear engineering ,business ,computer ,Word (computer architecture) - Abstract
To improve the accuracy of user implicit rating prediction, we combine the traditional latent factor model (LFM) with a bidirectional gated recurrent unit (BiGRU) neural network model to propose a hybrid model that deeply mines the latent semantics in unstructured text content and generates a more accurate rating matrix. First, we use the user's historical behavior (favorites records) to build a user rating matrix and decompose it to obtain the latent factor vectors of users and literature. We also apply the BERT model for word embedding of the research papers to obtain sequences of word vectors. Then, we apply the BiGRU with a user attention mechanism to mine the textual content of the research papers and to generate new literature latent feature vectors, which replace the original literature latent factor vectors decomposed from the rating matrix. Finally, a new rating matrix is generated to obtain users' ratings of research papers they have not yet interacted with, and the recommendation list is generated according to the user latent factor vectors. We design experiments on real datasets and verify that the research paper recommendation model is superior to traditional recommendation models in terms of precision, recall, F1-value, coverage, popularity, and diversity.
- Published
- 2020
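The latent factor step of the hybrid model in entry 43 (before the BiGRU-derived vectors replace the item factors) is standard matrix factorization trained by stochastic gradient descent. A minimal sketch with a made-up 4x4 rating matrix, not the authors' implementation:

```python
import random

random.seed(0)

# Hypothetical observed ratings: (user, paper) -> rating.
# Pairs that are absent are what the model must predict.
ratings = {(0, 0): 5, (0, 1): 3, (0, 3): 1,
           (1, 0): 4, (1, 3): 1,
           (2, 0): 1, (2, 1): 1, (2, 3): 5,
           (3, 1): 1, (3, 2): 5, (3, 3): 4}
n_users, n_items, k = 4, 4, 2          # k = number of latent factors
P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
lr, reg = 0.01, 0.02                   # learning rate, L2 regularization

def predict(u, i):
    """Dot product of user and item latent factor vectors."""
    return sum(P[u][f] * Q[i][f] for f in range(k))

# SGD over the observed entries only.
for _ in range(3000):
    for (u, i), r in ratings.items():
        err = r - predict(u, i)
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)

# Predict a rating for a paper that user 2 has never interacted with.
print(round(predict(2, 2), 2))
```

In the paper's hybrid model, the item vectors `Q` would then be swapped for BiGRU-derived content vectors before the final rating matrix is rebuilt.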
44. Findings Seminal Papers Using Data Mining Techniques
- Author
-
Debrayan Bravo Hidalgo and Alexander Báez Hernández
- Subjects
Entrepreneurship ,business.industry ,Computer science ,Scopus ,Space (commercial competition) ,computer.software_genre ,Publish or perish ,Software ,Index (publishing) ,Similarity (psychology) ,Anomaly detection ,Data mining ,business ,computer - Abstract
The aim of this contribution is to show the detection of seminal papers using data mining techniques. To achieve this objective, the RapidMiner Studio software and its data mining tools are used on datasets built from information extracted from Google Scholar and Scopus in three different areas of knowledge. Other software, such as Microsoft Excel and Publish or Perish, is also used in this process. Comparing the results obtained for the searches in Knowledge Management, Entrepreneurship, and Marketing showed no marked similarity between the sets of articles retrieved from Google Scholar and Scopus. The values of the Similarity Index remained below 0.52%, similar for Knowledge Management and Entrepreneurship but lower for Marketing. Outlier detection using data mining techniques, in particular RapidMiner, made it possible to determine the seminal papers for the three search terms analyzed and to characterize them in Google Scholar and Scopus. It was shown that the seminal articles can differ depending on whether Google Scholar or Scopus is used. The results suggest checking, for other search terms, whether the trend found holds.
- Published
- 2020
45. (Short Paper) Effectiveness of Entropy-Based Features in High- and Low-Intensity DDoS Attacks Detection
- Author
-
Abigail Koay, Winston K. G. Seah, and Ian Welch
- Subjects
Rényi entropy ,Computer science ,ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS ,Short paper ,0202 electrical engineering, electronic engineering, information engineering ,020206 networking & telecommunications ,020201 artificial intelligence & image processing ,Denial-of-service attack ,02 engineering and technology ,Data mining ,computer.software_genre ,computer - Abstract
DDoS attack detection using entropy-based features of network traffic has become a popular approach among researchers in the last five years. Traffic distribution features constructed using entropy measures have been proposed as a better approach for detecting Distributed Denial of Service (DDoS) attacks than conventional volumetric methods, but they still lack generality in accurately detecting DDoS attacks of various intensities. In this paper, we focus on identifying effective entropy-based features for detecting both high- and low-intensity DDoS attacks by exploring how well entropy-based features distinguish attack traffic from normal traffic patterns. We hypothesize that the choice of entropy measure, window size, and entropy-based feature may affect the accuracy of detecting DDoS attacks; that is, certain entropy measures, window sizes, and entropy-based features may reveal attack traffic amongst normal traffic better than others. Our experimental results show that the Shannon, Tsallis, and Zhou entropy measures achieve a clearer distinction between DDoS attack traffic and normal traffic than Rényi entropy. In addition, the window size used in entropy construction has minimal influence on differentiating DDoS attack traffic from normal traffic. The effectiveness ranking shows that the commonly used features are less effective than other features extracted from traffic headers.
- Published
- 2019
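The entropy-of-a-header-field idea the abstract describes can be sketched with the standard library alone. This is a minimal illustration, not the authors' implementation: the choice of source IP as the feature, the window size, and the toy traffic are all assumptions for demonstration.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def renyi_entropy(values, alpha=2.0):
    """Rényi entropy of order alpha (alpha != 1) of the same distribution."""
    counts = Counter(values)
    n = len(values)
    return math.log2(sum((c / n) ** alpha for c in counts.values())) / (1 - alpha)

def windowed_entropy(feature_stream, window=100, fn=shannon_entropy):
    """Entropy of each non-overlapping window of a packet-header field stream."""
    return [fn(feature_stream[i:i + window])
            for i in range(0, len(feature_stream) - window + 1, window)]

# Dispersed normal traffic keeps source-IP entropy high; a flooding window in
# which most packets share one source collapses the distribution.
normal = [f"10.0.0.{i % 50}" for i in range(100)]   # 50 distinct sources
attack = ["10.0.0.1"] * 95 + ["10.0.0.2"] * 5       # near-degenerate
```

A detector would then threshold the per-window entropy series; the paper's contribution is ranking which header fields and which entropy measures separate the two regimes best.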
46. [Paper] Multimodal Stress Estimation Using Multibiological Information: Towards More Accurate and Detailed Stress Estimation
- Author
-
Takumi Nagasawa, Norimichi Tsumura, Ryo Takahashi, and Keiko Ogawa-Ochiai
- Subjects
Computer science, Signal Processing, Stress estimation, Media Technology, Data mining, Computer Graphics and Computer-Aided Design - Published
- 2021
47. Reproducibility Companion Paper
- Author
-
Zhenzhong Kuang, Xinke Li, Zekun Tong, Cise Midoglu, Yabang Zhao, Yuqing Liao, and Andrew Lim
- Subjects
Source code, Computer science, Deep learning, Point cloud, File format, Replication (computing), Photogrammetry, Benchmark (surveying), Segmentation, Artificial intelligence, Data mining - Abstract
This companion paper supports the replication of the paper "Campus3D: A Photogrammetry Point Cloud Benchmark for Outdoor Scene Hierarchical Understanding", which was presented at ACM Multimedia 2020. The supported paper's main purpose was to provide a photogrammetry point-cloud dataset with hierarchical multi-labels to facilitate 3D deep learning. Based on the provided dataset and source code, in this work we build a complete package to reimplement the proposed methods and experiments (i.e., the hierarchical learning framework and the benchmarks of the hierarchical semantic segmentation task). Specifically, this paper contains the technical details of the package, including the file structure, dataset preparation, installation, and the conduct of the experiments. We also present the replicated experimental results and indicate our contributions to the original implementation.
- Published
- 2021
48. ANALYSIS OF ENERGY SAVING AND EMISSION REDUCTION OF SECONDARY FIBER MILL BASED ON DATA MINING.
- Author
-
Song HU, Jigeng LI, Mengna HONG, and Yi MAN
- Subjects
PAPER recycling, WASTE paper, DATA mining, PAPER pulp, ENVIRONMENTAL protection, COMMERCIAL buildings, BLEACHING (Chemistry) - Abstract
Waste paper recycling is an important way to realize the environmentally sound development of the papermaking industry. The quality of the pulp affects pulp sales at secondary fiber paper mills. Waste paper pulp can be adjusted by controlling the working conditions of the pulping process, but this process has many parameters, and because the parameters are coupled with each other it is difficult to control. In order to find the best working conditions and improve pulp quality, this study uses an association rules algorithm to optimize the parameters of the waste paper pulping process. These parameters are refiner power, refiner waste paper concentration, the volume of slurry entering the deinking process, deinking agent amount, deinking time, deinking temperature, bleaching agent amount, bleaching time, and bleaching temperature. The test results show that the qualified rate of pulp produced under the improved working conditions is 92.56%, an increase of 6.93%, and the average electricity consumption per ton of pulp is reduced by 5.76 kWh/t. In addition to the potential economic benefits, this method can reduce carbon emissions. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
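The association-rule step the abstract relies on can be sketched in plain Python: enumerate frequent itemsets over discretized working-condition "transactions", then keep rules whose confidence clears a threshold. The parameter names, discretization levels, and batch data below are hypothetical illustrations, not the mill's actual data or the study's algorithm settings.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.5):
    """Apriori-style enumeration of frequent itemsets (fine for small data;
    a real run would prune candidates level by level)."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    freq = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= t) / n
            if support >= min_support:
                freq[cand] = support
                found = True
        if not found:
            break
    return freq

def rules(freq, min_conf=0.8):
    """Association rules X -> Y with confidence support(X ∪ Y) / support(X)."""
    out = []
    for itemset, sup in freq.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                if lhs in freq and sup / freq[lhs] >= min_conf:
                    rhs = tuple(i for i in itemset if i not in lhs)
                    out.append((lhs, rhs, sup / freq[lhs]))
    return out

# Hypothetical discretized pulping batches: each transaction lists working-
# condition levels plus the batch outcome.
batches = [
    {"refiner_power=high", "bleach_time=long", "pulp=qualified"},
    {"refiner_power=high", "bleach_time=long", "pulp=qualified"},
    {"refiner_power=low", "bleach_time=long", "pulp=rejected"},
    {"refiner_power=high", "bleach_time=short", "pulp=qualified"},
]
```

Rules whose right-hand side is the "qualified pulp" outcome point to working-condition combinations worth adopting, which is the spirit of the parameter optimization the study describes.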
49. Image Matching Across Wide Baselines: From Paper to Practice
- Author
-
Yuhe Jin, Kwang Moo Yi, Pascal Fua, Eduard Trulls, Jiri Matas, Dmytro Mishkin, and Anastasiia Mishchuk
- Subjects
Computer Vision and Pattern Recognition (cs.CV), Computer science, benchmark, dataset, local features, 3D reconstruction, structure from motion, stereo, Benchmarking, Pipeline (software), Pattern recognition (psychology), Metric (mathematics), Benchmark (computing), Embedding, Artificial Intelligence, Computer Vision and Pattern Recognition, Data mining, Heuristics, performance, Software - Abstract
We introduce a comprehensive benchmark for local features and robust estimation algorithms, focusing on the downstream task -- the accuracy of the reconstructed camera pose -- as our primary metric. Our pipeline's modular structure allows easy integration, configuration, and combination of different methods and heuristics. This is demonstrated by embedding dozens of popular algorithms and evaluating them, from seminal works to the cutting edge of machine learning research. We show that with proper settings, classical solutions may still outperform the perceived state of the art. Besides establishing the actual state of the art, the conducted experiments reveal unexpected properties of Structure from Motion (SfM) pipelines that can help improve their performance, for both algorithmic and learned methods. Data and code are online at https://github.com/vcg-uvic/image-matching-benchmark, providing an easy-to-use and flexible framework for benchmarking local features and robust estimation methods, both alongside and against top-performing methods. This work provides a basis for the Image Matching Challenge: https://vision.uvic.ca/image-matching-challenge. Comment: Added KeyNet-SOSNet, AffNet-HardNet, TFeat, MKD from kornia.
- Published
- 2020
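The abstract's primary metric is downstream pose accuracy. A common way to score an estimated pose is the geodesic angle of the rotation error, aggregated into a mean average accuracy over a range of thresholds. The sketch below shows that flavour of metric only; the function names, the threshold range, and the restriction to rotation (the benchmark also considers translation) are my simplifying assumptions, not the benchmark's exact code.

```python
import math

def rotation_angle_deg(R_err):
    """Geodesic angle (degrees) of a 3x3 rotation matrix, from its trace:
    angle = acos((tr(R) - 1) / 2)."""
    tr = R_err[0][0] + R_err[1][1] + R_err[2][2]
    # Clamp for numerical safety before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, (tr - 1.0) / 2.0))))

def mean_average_accuracy(errors_deg, max_threshold=10, step=1):
    """Fraction of poses whose error falls under each threshold, averaged
    over the thresholds (a simplified mAA-style pose score)."""
    thresholds = range(step, max_threshold + 1, step)
    accs = [sum(e <= t for e in errors_deg) / len(errors_deg)
            for t in thresholds]
    return sum(accs) / len(accs)

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]     # perfect estimate: 0 deg
rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]    # 90 deg error about z
```

In practice `R_err` would be `R_gt^T @ R_est` for each image pair, and the benchmark averages this accuracy across scenes and methods.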
50. Keywords-Driven and Weight-aware Paper Recommendation via Paper Correlation Pattern Mining
- Author
-
Jun Hou, Qianmu Li, Jian Jiang, and Hanwen Liu
- Subjects
Correlation, Computer science, Data mining - Abstract
Currently, readers often prefer to search for papers of interest using a set of typed query keywords. As the keywords of a paper are often limited, paper recommender systems often need to recommend a set of papers that collectively satisfy the reader's keyword query. However, the topics of the recommended papers are probably not correlated with each other, failing to meet readers' requirements for in-depth and continuous academic research. Furthermore, although existing paper citation graphs can model correlations between papers, they often face a data-sparsity problem that blocks accurate paper recommendation. To address these issues, we propose a keywords-driven and weight-aware paper recommendation approach, named LP-PRk+w (link prediction-paper recommendation), based on a weighted paper correlation graph. Concretely, we first optimize the existing paper citation graph model by introducing a weighted similarity, obtaining a weighted paper correlation graph. Then we recommend a set of correlated papers based on the weighted paper correlation graph and the reader's query keywords. Finally, we conduct large-scale experiments on a real-world Hep-Th dataset. Experimental results demonstrate that our proposal improves paper recommendation performance considerably compared to other related solutions.
- Published
- 2021
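The two ingredients the abstract names -- a citation graph re-weighted by paper similarity, and keyword-seeded recommendation over it -- can be sketched as follows. This is a toy illustration of the general idea, not LP-PRk+w itself: the similarity scores are injected as a plain dict (the paper derives them from content), and the scoring is a simple weighted common-neighbour heuristic standing in for the paper's link-prediction step.

```python
from collections import defaultdict

def build_weighted_graph(citations, similarity):
    """Weighted paper correlation graph: each citation edge (a, b) carries a
    similarity weight instead of being unweighted."""
    graph = defaultdict(dict)
    for a, b in citations:
        w = similarity.get((a, b), similarity.get((b, a), 0.0))
        graph[a][b] = w
        graph[b][a] = w
    return graph

def recommend(graph, keyword_papers, top_k=3):
    """Score candidate papers by summed edge weights to the papers matching
    the query keywords, so recommendations stay correlated with the query."""
    scores = defaultdict(float)
    for seed in keyword_papers:
        for neigh, w in graph.get(seed, {}).items():
            if neigh not in keyword_papers:
                scores[neigh] += w
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical mini-corpus: p1 cites p2 and p3; p2 cites p3.
graph = build_weighted_graph(
    [("p1", "p2"), ("p1", "p3"), ("p2", "p3")],
    {("p1", "p2"): 0.9, ("p1", "p3"): 0.2, ("p2", "p3"): 0.8},
)
```

With `{"p1"}` as the keyword-matched seed set, `p2` outranks `p3` because its edge weight to the seed is higher, which is the behaviour the weighting is meant to buy over a plain citation graph.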