316 results
Search Results
2. On Some Scientific Results of the IMTA-VIII-2022: 8th International Workshop "Image Mining: Theory and Applications".
- Author
-
Gurevich, Igor B., Moroni, Davide, Pascali, Maria Antonietta, and Yashina, Vera V.
- Abstract
The publication presents an introductory paper to the Special Issue of the international journal Pattern Recognition and Image Analysis: Advances in Mathematical Theory and Applications of the Russian Academy of Sciences. The main scientific results of the 8th International Workshop "Image Mining: Theory and Applications," held on August 21, 2022, in Montreal, Canada, are presented. Historical information is given on this series of international workshops, and their significant role in the development of the theory and practice of automated image analysis, pattern recognition, and artificial intelligence is emphasized. The list of papers of the Special Issue of PRIA, prepared based on the invited and regular papers selected and recommended for publication by the Program Committee of the IMTA-VIII-2022, is presented. [ABSTRACT FROM AUTHOR]
- Published
- 2022
3. Preprocessing and Artificial Intelligence for Increasing Explainability in Mental Health.
- Author
-
Angerri, X. and Gibert, Karina
- Subjects
ARTIFICIAL intelligence, MENTAL health, DATA mining, DATA analysis
- Abstract
This paper shows the added value of using existing domain knowledge to generate new derived variables that complement a target dataset, and the benefits of including these new variables in further data analysis methods. The main contribution of the paper is a methodology for generating these new variables as part of preprocessing, under a double approach: creating 2nd-generation knowledge-driven variables, which capture the experts' criteria used for reasoning in the field, or 3rd-generation data-driven indicators, created by clustering original variables. Data mining and artificial intelligence techniques such as clustering or traffic-light panels help to obtain successful results. Some results of the INSESS-COVID19 project are presented: basic descriptive analysis gives simple results that are useful to support basic policy-making, especially in health, but a much richer global perspective is acquired after including the derived variables. When 2nd-generation variables are available and can be introduced in the method for creating 3rd-generation data, added value is obtained both from basic analysis and from building new data-driven indicators. [ABSTRACT FROM AUTHOR]
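The 2nd/3rd-generation split described in this abstract can be illustrated with a toy sketch (all data and variable names are invented for illustration; this is not the INSESS-COVID19 methodology):

```python
# Toy illustration of knowledge-driven (2nd gen) vs data-driven (3rd gen)
# derived variables (invented data and names; not the paper's method).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 6))  # rows = respondents, columns = original variables
cols = ["sleep", "appetite", "mood", "income", "savings", "debt"]

# 2nd generation (knowledge-driven): an expert rule on raw variables,
# e.g. a "distress" flag when both sleep and mood scores are low.
distress = (X[:, cols.index("sleep")] < 0.3) & (X[:, cols.index("mood")] < 0.3)

# 3rd generation (data-driven): cluster the original variables (columns),
# then summarise each variable group into one derived indicator.
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X.T)
indicators = np.column_stack([X[:, groups == g].mean(axis=1) for g in (0, 1)])
print(indicators.shape)  # one derived indicator per variable group
```

Both the flag and the indicators would then be appended to the dataset before any downstream analysis.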
- Published
- 2023
4. Machine learning and big data analytics in bipolar disorder
- Author
-
Luciano Minuzzi, Erkki Isometsä, Elisa Brietzke, Diego Librenza-Garcia, Anne Duffy, Martin Alda, Benson Mwangi, Flávio Kapczinski, Rodrigo B. Mansur, Boris Birmaher, Bartholomeus C M Haarman, Roger S. McIntyre, Lars Vedel Kessing, Raymond W. Lam, Lakshmi N. Yatham, Pedro Ballester, Tomas Hajek, Ives Cavalcante Passos, Carlos López Jaramillo, and Rodrigo C. Barros
- Subjects
SYMPTOMS, Computer science, Big data, Scientific literature, Terminology, risk prediction, SCHIZOPHRENIA, NEUROPROGRESSION, bipolar disorder, RISK, ASSOCIATION, Prognosis, DEPRESSION, Psychiatry and Mental health, Phenotype, machine learning, LITHIUM RESPONSE, MOOD DISORDERS, Advisory Committees, Clinical Decision-Making, Risk Assessment, CLASSIFICATION, Suicidal Ideation, Humans, Biological Psychiatry, PREDICTING SUICIDALITY, Deep learning, predictive psychiatry, Data Science, data mining, personalized psychiatry, Position paper, Artificial intelligence
- Abstract
OBJECTIVES: The International Society for Bipolar Disorders Big Data Task Force assembled leading researchers in the fields of bipolar disorder (BD), machine learning, and big data with extensive experience to evaluate the rationale of machine learning and big data analytics strategies for BD. METHOD: A task force was convened to examine and integrate findings from the scientific literature related to machine learning and big data-based studies, to clarify terminology, and to describe challenges and potential applications in the field of BD. We also systematically searched PubMed, Embase, and Web of Science for articles published up to January 2019 that used machine learning in BD. RESULTS: The results suggested that big data analytics has the potential to provide risk calculators to aid in treatment decisions and predict clinical prognosis, including suicidality, for individual patients. This approach can advance diagnosis by enabling discovery of more relevant data-driven phenotypes, as well as by predicting transition to the disorder in high-risk unaffected subjects. We also discuss the most frequent challenges that big data analytics applications can face, such as heterogeneity, lack of external validation and replication of some studies, cost, non-stationary distribution of the data, and lack of appropriate funding. CONCLUSION: Machine learning-based studies, including atheoretical data-driven big data approaches, provide an opportunity to more accurately detect those who are at risk and parse relevant phenotypes, as well as to inform treatment selection and prognosis. However, several methodological challenges need to be addressed in order to translate research findings to clinical settings.
- Published
- 2019
5. Towards an ELSA Curriculum for Data Scientists.
- Author
-
Christoforaki, Maria and Beyan, Oya Deniz
- Subjects
CONSCIOUSNESS raising, DATA mining, DATA science, SCIENCE projects, CURRICULUM
- Abstract
The use of artificial intelligence (AI) applications in a growing number of domains in recent years has put into focus the ethical, legal, and societal aspects (ELSA) of these technologies and the relevant challenges they pose. In this paper, we propose an ELSA curriculum for data scientists aiming to raise awareness about ELSA challenges in their work, provide them with a common language with the relevant domain experts in order to cooperate to find appropriate solutions, and finally, incorporate ELSA in the data science workflow. ELSA should not be seen as an impediment or a superfluous artefact but rather as an integral part of the Data Science Project Lifecycle. The proposed curriculum uses the CRISP-DM (CRoss-Industry Standard Process for Data Mining) model as a backbone to define a vertical partition expressed in modules corresponding to the CRISP-DM phases. The horizontal partition includes knowledge units (KUs) belonging to three strands that run through the phases: ethical and societal, legal, and technical. In addition to a detailed description of the aforementioned KUs, we also discuss implementation issues such as duration, form, and evaluation of participants, as well as the variance in the knowledge level and needs of the target audience. [ABSTRACT FROM AUTHOR]
- Published
- 2024
6. Deep learning for healthcare: review, opportunities and challenges
- Author
-
Riccardo Miotto, Xiaoqian Jiang, Shuang Wang, Fei Wang, and Joel T. Dudley
- Subjects
Diagnostic Imaging, Feature engineering, Computer science, Data type, Deep Learning, Health care, Data Mining, Electronic Health Records, Humans, Cluster analysis, Molecular Biology, Interpretability, Computational Biology, Genomics, Data science, Telemedicine, Domain knowledge, Artificial intelligence, Delivery of Health Care, Information Systems
- Abstract
Gaining knowledge and actionable insights from complex, high-dimensional and heterogeneous biomedical data remains a key challenge in transforming health care. Various types of data have been emerging in modern biomedical research, including electronic health records, imaging, -omics, sensor data and text, which are complex, heterogeneous, poorly annotated and generally unstructured. Traditional data mining and statistical learning approaches typically need to first perform feature engineering to obtain effective and more robust features from those data, and then build prediction or clustering models on top of them. Both steps pose many challenges when the data are complicated and sufficient domain knowledge is lacking. The latest advances in deep learning technologies provide new effective paradigms to obtain end-to-end learning models from complex data. In this article, we review the recent literature on applying deep learning technologies to advance the health care domain. Based on the analyzed work, we suggest that deep learning approaches could be the vehicle for translating big biomedical data into improved human health. However, we also note limitations and the need for improved method development and applications, especially in terms of ease of understanding for domain experts and citizen scientists. We discuss such challenges and suggest developing holistic and meaningful interpretable architectures to bridge deep learning models and human interpretability.
- Published
- 2017
7. Discovering and visualizing indirect associations between biomedical concepts
- Author
-
Jun'ichi Tsujii, Makoto Miwa, Yoshimasa Tsuruoka, Kaisei Hamamoto, and Sophia Ananiadou
- Subjects
Statistics and Probability, PubMed, Computer science, Text Mining, MEDLINE, Biochemistry, Business process discovery, Artificial Intelligence, Web application, Data Mining, ISMB/ECCB 2011 Proceedings (July 17 to July 19, 2011, Vienna, Austria), Medical Informatics Applications, Molecular Biology, Interpretability, Internet, Information retrieval, Full text search, Data science, Original Papers, United States, Computer Science Applications, Computational Mathematics, Computational Theory and Mathematics
- Abstract
Motivation: Discovering useful associations between biomedical concepts has been one of the main goals in biomedical text-mining, and understanding their biomedical contexts is crucial in the discovery process. Hence, we need a text-mining system that helps users explore various types of (possibly hidden) associations in an easy and comprehensible manner. Results: This article describes FACTA+, a real-time text-mining system for finding and visualizing indirect associations between biomedical concepts from MEDLINE abstracts. The system can be used as a text search engine like PubMed with additional features to help users discover and visualize indirect associations between important biomedical concepts such as genes, diseases and chemical compounds. FACTA+ inherits all functionality from its predecessor, FACTA, and extends it by incorporating three new features: (i) detecting biomolecular events in text using a machine learning model, (ii) discovering hidden associations using co-occurrence statistics between concepts, and (iii) visualizing associations to improve the interpretability of the output. To the best of our knowledge, FACTA+ is the first real-time web application that offers the functionality of finding concepts involving biomolecular events and visualizing indirect associations of concepts with both their categories and importance. Availability: FACTA+ is available as a web application at http://refine1-nactem.mc.man.ac.uk/facta/, and its visualizer is available at http://refine1-nactem.mc.man.ac.uk/facta-visualizer/. Contact: tsuruoka@jaist.ac.jp
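The co-occurrence-based indirect association idea can be sketched on toy data (invented documents and concept names; this is not FACTA+'s actual event detection or scoring function):

```python
# Minimal sketch of indirect (A-B-C) association discovery from
# co-occurrence counts, in the spirit of the abstract above (toy data).
from collections import Counter
from itertools import combinations

docs = [  # each "abstract" reduced to the concepts it mentions
    {"geneX", "diseaseY"},
    {"diseaseY", "drugZ"},
    {"geneX", "diseaseY"},
    {"drugZ", "pathwayW"},
]

cooc = Counter()
for d in docs:
    for a, b in combinations(sorted(d), 2):
        cooc[(a, b)] += 1

def direct(a, b):
    return cooc[tuple(sorted((a, b)))]

def indirect(a, c):
    """Score a hidden A-C link by summing evidence through shared neighbours."""
    mids = {x for pair in cooc for x in pair} - {a, c}
    return sum(min(direct(a, m), direct(m, c)) for m in mids)

print(direct("geneX", "drugZ"))    # 0: never co-mentioned directly
print(indirect("geneX", "drugZ"))  # 1: linked indirectly via diseaseY
```

A real system would additionally weight the counts (e.g. by pointwise mutual information) and attach concept categories, as FACTA+ does for genes, diseases and chemical compounds.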
- Published
- 2011
8. Large-scale extraction of brain connectivity from the neuroscientific literature
- Author
-
Renaud Richardet, Sean Hill, Jean-Cédric Chappelier, and Martin Telefont
- Subjects
Statistics and Probability, Normalization (statistics), Databases, Factual, Computer science, Biochemistry, Brain Mapping, Mice, Atlases as Topic, Artificial Intelligence, In vivo, Terminology as Topic, Animals, Data Mining, Molecular Biology, Information retrieval, Brain atlas, Brain, Neuroinformatics, Original Papers, Data science, Computer Science Applications, Neuroanatomy, Computational Mathematics, Brain region, Computational Theory and Mathematics, Data and Text Mining, Periodicals as Topic, Software
- Abstract
Motivation: In neuroscience, as in many other scientific domains, the primary form of knowledge dissemination is through published articles. One challenge for modern neuroinformatics is finding methods to make the knowledge from the tremendous backlog of publications accessible for search, analysis and the integration of such data into computational models. A key example of this is metascale brain connectivity, where results are not reported in a normalized repository. Instead, these experimental results are published in natural language, scattered among individual scientific publications. This lack of normalization and centralization hinders the large-scale integration of brain connectivity results. In this article, we present text-mining models to extract and aggregate brain connectivity results from 13.2 million PubMed abstracts and 630 216 full-text publications related to neuroscience. The brain regions are identified with three different named entity recognizers (NERs) and then normalized against two atlases: the Allen Brain Atlas (ABA) and the atlas from the Brain Architecture Management System (BAMS). We then use three different extractors to assess inter-region connectivity. Results: NERs and connectivity extractors are evaluated against a manually annotated corpus. The complete in litero extraction models are also evaluated against in vivo connectivity data from ABA with an estimated precision of 78%. The resulting database contains over 4 million brain region mentions and over 100 000 (ABA) and 122 000 (BAMS) potential brain region connections. This database drastically accelerates connectivity literature review, by providing a centralized repository of connectivity data to neuroscientists. Availability and implementation: The resulting models are publicly available at github.com/BlueBrain/bluima. Contact: renaud.richardet@epfl.ch Supplementary information: Supplementary data are available at Bioinformatics online.
- Published
- 2017
9. Medical artificial intelligence: driven by the fusion of knowledge-guided and data-mining methodologies.
- Author
-
WU Jia-ling and HAN Jian-da
- Subjects
MEDICINE, DATA science, THERAPEUTICS, COMPUTERS in medicine, ARTIFICIAL intelligence, INTELLECT, MEDICAL informatics, COMPUTER-aided diagnosis, DATA mining, DIFFUSION of innovations
- Abstract
Medical artificial intelligence (AI) applies AI technology to the medical field to meet clinical needs; it combines with the experience of doctors and relies on the computing, analysis and decision-making ability of AI to provide accurate intelligent assistance for clinical diagnosis and treatment. At present, medical AI is based on knowledge-guided and data-driven approaches, but each has its own advantages and disadvantages. Combining knowledge-guided AI with data-driven AI and utilizing their respective advantages is expected to break through the application bottleneck of medical AI and promote its development and innovation. This paper summarizes the application of medical AI driven by knowledge guidance and data mining and looks forward to future development directions, in order to promote the innovation and application of AI technology in the medical field. [ABSTRACT FROM AUTHOR]
- Published
- 2023
10. Letter to the editor: on the term 'interaction' and related phrases in the literature on random forests
- Author
-
Carolin Strobl, Alexander Hapfelmeier, Silke Janitza, Anne-Laure Boulesteix, Kristel Van Steen, University of Zurich, and Boulesteix, Anne-Laure
- Subjects
Letter to the editor, Computer science, Biological Science Disciplines, Data Mining, Humans, Molecular Biology, Conditional dependence, Random Forest, proximity, Data science, conditional relationships, Term (time), variable importance, Artificial intelligence, local importance, variable interaction, Natural language processing, Algorithms, Information Systems
- Abstract
In the Life Sciences, ‘omics’ data are increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high prediction accuracy and provide information on the importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example, within a class of cancer patients certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not important for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF, as they are implicitly taken into account by the algorithm during the creation of the classification model. This review details some RF properties that, to the best of our knowledge, are rarely or never used, but that allow maximizing the biological insights extracted from complex omics data sets using RF.
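As a generic illustration of RF variable importance (synthetic data, scikit-learn's permutation importance; not an example from the letter itself):

```python
# Sketch: fit a Random Forest and read permutation variable importance,
# one of the RF properties the letter discusses (synthetic toy data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # only features 0 and 1 matter

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=1)
print(imp.importances_mean.argsort()[::-1][:2])  # the two informative features
```

Local importance and proximity measures, mentioned in the subjects above, refine this global picture to subsets of samples.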
- Published
- 2015
11. Patent Data for Engineering Design: A Review.
- Author
-
Jiang, S., Sarica, S., Song, B., Hu, J., and Luo, J.
- Subjects
ENGINEERING design, PATENTS, ARTIFICIAL intelligence, DATA science, BIG data
- Abstract
Patent data have long been utilized in engineering design research because they contain a massive amount of design information. Recent advances in artificial intelligence and data science present unprecedented opportunities to mine, analyse and make sense of patent data to develop design theory and methodology. Herein, we survey the patent-for-design literature by its contributions to design theories, methods, tools, and strategies, as well as the different forms of patent data and the various methods applied to them. Our review sheds light on promising future research directions for the field. [ABSTRACT FROM AUTHOR]
- Published
- 2022
12. Production processes of official statistics and analytics processes augmented by trusted smart statistics: Friends or foes?
- Author
-
Kuonen, Diego and Loison, Bertrand
- Subjects
MANUFACTURING processes, BIG data, STATISTICS, ACQUISITION of data, SECONDARY analysis
- Abstract
National statistical institutes use frameworks such as the GSBPM to organise and set up their official statistical production. As a sequential approach to statistical production, the GSBPM has become a well-established standard that uses deductive reasoning as its analytics paradigm. For example, the first GSBPM steps are entirely focused on deductive reasoning based on primary data collection and are not suited for inductive reasoning applied to (already existing) secondary data (e.g. big data resulting from smart ecosystems). Taking into account the apparent potential of big data in official statistical production, the GSBPM process needs to be adapted to incorporate both complementary approaches to analytics (i.e. inductive and deductive reasoning), for example through data-informed continuous evaluation at any GSBPM step. This paper discusses the limitations of the GSBPM with respect to the usage of big data (using inductive reasoning as the analytics paradigm) and with respect to trusted smart statistics. The authors give insights on how to augment and empower current statistical production processes by analytics and by (trusted) smart statistics. In addition, the paper highlights challenges and opportunities that should be addressed to embrace this major paradigm shift. [ABSTRACT FROM AUTHOR]
- Published
- 2019
13. Analysis of a marine scrubber operation with a combined analytical/AI-based method.
- Author
-
Di Bonito, Luigi Piero, Campanile, Lelio, Napolitano, Erasmo, Iacono, Mauro, Portolano, Alberto, and Di Natale, Francesco
- Subjects
ARTIFICIAL intelligence, CARGO ships, CHEMICAL engineering, DATA mining, CHEMICAL engineers
- Abstract
This paper describes the performance of a marine SO2 absorption scrubber installed onboard a large Ro-Ro cargo ship. The study is based on the reconstruction of an extensive dataset from one year of continuous monitoring of the scrubber's performance and operating conditions. The dataset has been interpreted with a conventional analytical (physical-mathematical) model for absorber rating and with its combination with an Artificial Intelligence (AI) model. First, the analytical model has been used to provide a deterministic mathematical framework for the interpretation and prediction of the scrubber's performance in terms of absorbed SO2 molar flow and SO2 concentration at the scrubber exit. Then, data mining and AI techniques have been applied to develop an Artificial Neural Network able to predict the error between the actual SO2 concentration at the scrubber exit and the corresponding analytical model predictions. The final result is a combined model providing superior robustness and accuracy in the prediction of the scrubber performance while preserving a rationale for process design and operation. This outcome suggests that the development of combined, or hybrid, Analytical/AI models can be a reliable and cost-effective way to improve chemical engineers' ability to design and control marine scrubbers, as well as other chemical equipment. • The performance and operation of a marine scrubber have been collected in a dataset. • An Analytical (A) model is used to describe the scrubber performance. • An Artificial Intelligence (AI) model is used to interpret the error of the A model. • The combined A/AI model greatly improves the prediction of scrubber performance. [ABSTRACT FROM AUTHOR]
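The analytical-plus-residual-network pattern this abstract describes can be sketched generically: a physical model gives a baseline prediction and a small neural network learns its residual error. The "physics" below is an invented stand-in, not the paper's scrubber model.

```python
# Hedged sketch of the hybrid Analytical/AI idea: physics baseline + NN residual.
# The analytical formula and data are made up for illustration.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0.5, 2.0, size=(400, 2))              # e.g. two operating conditions
truth = 3.0 * X[:, 0] / X[:, 1] + 0.5 * np.sin(5 * X[:, 0])

analytical = 3.0 * X[:, 0] / X[:, 1]                  # physics captures the main trend
residual = truth - analytical                         # what the physics misses

nn = MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000,
                  random_state=2).fit(X, residual)
combined = analytical + nn.predict(X)

# The combined model should beat the purely analytical one on mean error.
print(np.abs(truth - analytical).mean(), np.abs(truth - combined).mean())
```

The appeal of this design, as the abstract notes, is that the physics remains interpretable while the network only corrects its systematic error.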
- Published
- 2023
14. Reliable Physical Unclonable Functions Using Data Retention Voltage of SRAM Cells.
- Author
-
Xu, Xiaolin, Rahmati, Amir, Holcomb, Daniel E., Fu, Kevin, and Burleson, Wayne
- Subjects
RANDOM access memory, MACHINE learning, ARTIFICIAL intelligence, DATA mining, DATA science
- Abstract
Physical unclonable functions (PUFs) are circuits that produce outputs determined by random physical variations from fabrication. The PUF studied in this paper utilizes the variation sensitivity of static random access memory (SRAM) data retention voltage (DRV), the minimum voltage at which each cell can retain state. Prior work shows that DRV can uniquely identify circuit instances with 28% greater success than the SRAM power-up states used in PUFs [1]. However, DRV is highly sensitive to temperature, which until now has made it unreliable and unsuitable for use in a PUF. In this paper, we enable DRV PUFs by proposing a DRV-based hash function that is insensitive to temperature. The new hash function, denoted DRV-based hashing (DH), is reliable across temperatures because it utilizes the temperature-insensitive ordering of DRVs across cells, instead of using the DRVs in absolute terms. To evaluate the security and performance of the DRV PUF, we use DRV measurements from commercially available SRAM chips, and data from a novel DRV prediction algorithm. The prediction algorithm uses machine learning for fast and accurate simulation-free estimation of any cell's DRV, and the prediction error in comparison to circuit simulation has a standard deviation of 0.35 mV. We demonstrate the DRV PUF using two applications: secret key generation and identification. In secret key generation, we introduce a new circuit-level reliability knob as an alternative to error correcting codes. In the identification application, our approach is compared to prior work and shown to result in a smaller false-positive identification rate for any desired true-positive identification rate. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
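The ordering idea behind DH in the abstract above can be sketched in a few lines: hash the rank order of per-cell DRVs, which survives a temperature shift that moves all DRVs together (the values below are invented; real DRVs come from SRAM measurements).

```python
# Toy sketch of ordering-based hashing (DH): hash the rank ordering of DRVs,
# not their absolute values. DRV values are invented for illustration.
import hashlib

drv_cold = [312.1, 305.4, 330.8, 299.7]   # per-cell DRVs (mV) at one temperature
drv_hot  = [318.9, 311.2, 337.5, 306.0]   # same cells, shifted by temperature

def dh(drvs):
    order = sorted(range(len(drvs)), key=drvs.__getitem__)  # rank ordering of cells
    return hashlib.sha256(bytes(order)).hexdigest()

print(dh(drv_cold) == dh(drv_hot))  # True: the ordering is temperature-stable
```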
15. A systematic review of Machine learning techniques for Heart disease prediction.
- Author
-
Udhan, Shivganga and Patil, Bankat
- Subjects
MACHINE learning, HEART diseases, DATA mining, ARTIFICIAL intelligence, DATA science
- Abstract
One of the most common diseases today is heart disease, and its early diagnosis is very challenging. Machine learning, a branch of artificial intelligence, is implemented to solve a number of data science problems. The prediction of outcomes based on existing data is a common machine learning application. Different data mining strategies for the prediction of heart disease have been proposed with varying degrees of effectiveness and accuracy. In this paper, the authors provide an in-depth literature survey of systems for predicting the risk of heart disease. [ABSTRACT FROM AUTHOR]
- Published
- 2021
16. Box search for the data mining of the key parameters of an industrial process.
- Author
-
Louveaux, Q., Mathei, A., and Mathieu, S.
- Subjects
DATA mining, BIG data, DATA science, MACHINE learning, ARTIFICIAL intelligence, DATABASES
- Abstract
To increase their competitiveness, many industrial companies monitor their production process, collecting large amounts of measurements. This paper describes a technique that uses these data to improve the performance of a monitored process. In particular, we wish to find a set of rules, i.e. intervals on a reduced number of parameters, for which an output value is maximized. The model-free optimization problem to solve is to find a box, restricted to a limited number of dimensions, with the maximum mean value of the included points. This article compares a machine learning-based heuristic to the solution computed by a mixed-integer linear program on real-life databases from steel and glass manufacturing. Computational results show that the heuristic obtains solutions comparable to the mixed-integer linear approach; however, the exact approach is computationally too expensive to tackle real-life databases. Results show that restricting five process parameters, on these databases, may improve the quality of the process by 50%. [ABSTRACT FROM AUTHOR]
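The box objective described in this abstract can be illustrated with a brute-force one-dimensional toy (invented data; the paper's heuristic and MILP handle several parameters at once):

```python
# Toy version of the box-search objective: find the interval on ONE parameter
# whose included points have the maximal mean output (data invented).
import itertools

points = [(0.1, 2.0), (0.2, 8.0), (0.35, 9.0), (0.6, 3.0), (0.9, 1.0)]  # (param, output)

best = None
vals = sorted(p for p, _ in points)
for lo, hi in itertools.combinations(vals, 2):       # candidate interval bounds
    inside = [out for p, out in points if lo <= p <= hi]
    if len(inside) >= 2:                             # require some support
        mean = sum(inside) / len(inside)
        if best is None or mean > best[0]:
            best = (mean, lo, hi)

print(best)  # the interval covering the two high-output points wins
```

The exact MILP formulation in the paper optimizes the same mean-inside-the-box objective while limiting how many dimensions may be restricted.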
- Published
- 2016
17. AN EFFICIENT HIDING METHOD FOR PRIVACY PRESERVING UTILITY MINING.
- Author
-
Ashraf, Mohamed, Rady, Sherine, Abdelkader, Tamer, and Gharib, Tarek F.
- Subjects
DATA mining, ARTIFICIAL intelligence, MACHINE learning, INTERNET of things, ALGORITHMS
- Abstract
Due to the rapid evolution of data saved in electronic form, data mining technologies have become critical and indispensable in looking for nontrivial, implicit, hidden, and possibly beneficial information in enormous volumes of data. High Utility Pattern Mining (HUPM), among the most intriguing data mining techniques, is broadly leveraged to analyze business interactions in market data based on the notion of economic utilities. These economic utilities can be used to examine the factors influencing a customer's purchasing behavior or to come up with new tailored selling and promotion tactics. This in turn has made utility-driven techniques an essential operation and vital activity for many data analysts since they can lead to proper decision-making processes. Nevertheless, such techniques can also lead to major threats regarding privacy and information security if they were misused. Privacy-Preserving Utility Mining (PPUM), also known as High Utility Pattern Hiding (HUPH), has recently emerged to mitigate the security and privacy issues that could happen in the utility framework. In this paper, we propose a heuristic PPUM method, named HUP-Hiding, to protect the results when mining sensitive data using a utility mining algorithm. The proposed method employs a dataset projection mechanism and a new victim item selection technique to efficiently perform the sanitization process. Experiments were performed to verify the reliability of the suggested algorithm. Our experimental results on different datasets confirm that HUP-Hiding has reasonable performance and fewer side effects compared to existing approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2023
18. Determining the Number of Clusters using Neural Network and Max Stable Set Problem.
- Author
-
Karim, Awatif, Loqman, Chakir, and Boumhidi, Jaouad
- Subjects
ARTIFICIAL neural networks, ARTIFICIAL intelligence, CLUSTER analysis (Statistics), DATA science, K-means clustering, TEXT mining, DATA mining
- Abstract
One of the most difficult problems in cluster analysis is the determination of the number of clusters in a data set. Solving this problem consists in detecting and finding the best number of clusters, which is an input parameter for clustering algorithms. In this paper, we propose a new approach using the Maximum Stable Set Problem (MSSP) combined with a Continuous Hopfield Network (CHN) to determine the number of clusters, a basic input parameter of the K-Means method. The theoretical results were validated on a real text-mining application. Some numerical examples and computational experiments assess the effectiveness of this approach, as demonstrated in this paper. [ABSTRACT FROM AUTHOR]
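The MSSP/CHN construction cannot be reproduced from the abstract alone; as a stand-in, this sketch answers the same "how many clusters?" question by picking the K-Means input k with the silhouette score, a common baseline (synthetic data):

```python
# Baseline sketch for choosing the number of clusters k for K-Means via
# silhouette score (NOT the paper's MSSP/CHN method; toy data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.2, size=(40, 2)) for c in (0.0, 3.0, 6.0)])  # 3 blobs

scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=3).fit_predict(X))
          for k in range(2, 7)}
print(max(scores, key=scores.get))  # the k with the best silhouette
```

Whatever the selection method, its output plays the same role as here: it becomes the n_clusters parameter handed to K-Means.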
- Published
- 2018
19. Intelligent mining of large-scale bio-data: Bioinformatics applications.
- Author
-
Golestan Hashemi, Farahnaz Sadat, Razi Ismail, Mohd, Rafii Yusop, Mohd, Golestan Hashemi, Mahboobe Sadat, Nadimi Shahraki, Mohammad Hossein, Rastegari, Hamid, Miah, Gous, and Aslani, Farzad
- Subjects
DATA mining, BIOINFORMATICS, ARTIFICIAL intelligence, HEURISTIC algorithms, DATA extraction, DATA science
- Abstract
Today, a tremendous amount of bio-data is being collected because of computerized applications worldwide. Therefore, scholars have been encouraged to develop effective methods to extract the hidden knowledge in these data, creating a challenging and valuable area for research in artificial intelligence. Bioinformatics creates heuristic approaches and complex algorithms using artificial intelligence and information technology in order to solve biological problems. Intelligent analysis of the data can accelerate biological knowledge discovery. Data mining, as biology intelligence, attempts to find reliable, new, useful and meaningful patterns in huge amounts of data. Hence, there is a high potential to raise the interaction between artificial intelligence and bio-data mining. The present paper argues how artificial intelligence can assist bio-data analysis and gives an up-to-date review of different applications of bio-data mining. It also highlights some future perspectives of data mining in bioinformatics that can inspire further development of data mining instruments. Important and new techniques are critically discussed for intelligent knowledge discovery from different types of raw datasets, with applicable examples in human, plant and animal sciences. Finally, a broad perception of this hot topic in data science is given. [ABSTRACT FROM PUBLISHER]
- Published
- 2018
- Full Text
- View/download PDF
20. Reviewing the differences between learning analytics and educational data mining: Towards educational data science.
- Author
-
Cerezo, R., Lara, J.-A., Azevedo, R., and Romero, C.
- Subjects
- *
DATA science , *EXPERIMENTAL design , *PUBLISHING , *TEACHING methods , *SYSTEMATIC reviews , *SERIAL publications , *ATTITUDES of medical personnel , *CONFERENCES & conventions , *ARTIFICIAL intelligence , *LEARNING , *INTERPROFESSIONAL relations , *DATA analytics , *DATA mining , *AUTHORSHIP , *MEDICAL research - Abstract
Over the last decade, Educational Data Mining (EDM) and Learning Analytics (LA) have evolved enormously as interrelated research areas and disciplines. Many researchers interested in these areas may wonder why there are two different communities, whether they are the same concept or not, and what the differences between them are, which is key information for designing their research and publication strategies. To address this, we conducted a systematic review of academic papers about the differences between LA and EDM following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We selected 10 research works and identified 11 differences. Our conclusions are that, although both use the same data and share similar goals and interests, EDM and LA are different research communities with different origins and focuses, with their respective conferences and journals. However, there is active collaboration between the two communities, and their members often tend to publish in both fields' conferences and journals. Additionally, none of the differences is apparently large enough to conclude that LA and EDM follow different paths for improving the teaching-learning process, but rather the opposite. Following a common future line, it seems that the two "sister" communities are working together with the same perspective, along with some "cousin" communities such as AIED (Artificial Intelligence in Education), L@S (Learning at Scale), Learning Science (LS), etc., in an area that could be called Educational Data Science (EDS). We propose using the term EDS to integrate both LA and EDM with all these related communities. • Educational Data Mining (EDM) and Learning Analytics (LA) are interrelated and sometimes interchanged terms. • Five original differences between EDM and LA were identified. • Six further differences have emerged since those initial five.
• The term Educational Data Science (EDS) is proposed to integrate EDM and LA. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
21. Data-Driven Design-By-Analogy: State-of-the-Art and Future Directions.
- Author
-
Shuo Jiang, Jie Hu, Wood, Kristin L., and Jianxi Luo
- Subjects
- *
DATABASE design , *ARTIFICIAL intelligence , *DATA science , *ENGINEERING design , *COMPUTER-aided design - Abstract
Design-by-analogy (DbA) is a design methodology wherein new solutions, opportunities, or designs are generated in a target domain based on inspiration drawn from a source domain; it can benefit designers in mitigating design fixation and improving design ideation outcomes. Recently, the increasingly available design databases and rapidly advancing data science and artificial intelligence (AI) technologies have presented new opportunities for developing data-driven methods and tools for DbA support. In this study, we survey existing data-driven DbA studies and categorize individual studies according to the data, methods, and applications into four categories, namely, analogy encoding, retrieval, mapping, and evaluation. Based on both nuanced organic review and structured analysis, this paper elucidates the state-of-the-art of data-driven DbA research to date and benchmarks it with the frontier of data science and AI research to identify promising research opportunities and directions for the field. Finally, we propose a future conceptual data-driven DbA system that integrates all propositions. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
22. Applying search algorithms to obtain the optimal configuration of nDT torus nodes.
- Author
-
Andújar‐Muñoz, Francisco‐José, Villar-Ortiz, Juan‐Antonio, Sánchez‐García, José‐Luis, and Alfaro‐Cortés, Francisco‐José
- Subjects
SEARCH algorithms ,TORUS ,ARTIFICIAL intelligence ,DATA mining ,DATA science - Abstract
An nDT torus is a topology in which each node comprises 2 identical (n+1)-port communication cards interconnected by 1 port. Using current switches or communication cards, this node architecture allows building torus networks with a greater number of dimensions than networks using only 1 card per node. There are multiple ways to use the ports of the 2 cards to connect a node to other nodes in the nDT torus, so checking all the configurations is affordable only for small values of n. In this paper, we use artificial intelligence and data mining techniques to obtain the optimal port configuration of all the nodes in the network. We include a performance evaluation showing that the nDT torus effectively increases performance compared with the equivalent torus in resources, under both synthetic and application trace-based workloads. We also apply these techniques to 3DT and 5DT tori to confirm that increasing the number of dimensions does not affect the performance of the nDT torus. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
23. Meta-Learning in Data Classification: An Analysis.
- Author
-
Sen, Sanjay Kumar, Pani, Subhendu Kumar, Ojha, Ananta Charan, and Dash, Sujata
- Subjects
DATA mining ,MACHINE learning ,ALGORITHMS ,DATABASES ,DATA science ,ARTIFICIAL intelligence - Abstract
Several data mining tasks and techniques are available to extract hidden knowledge from large databases. Classification is one such task widely used to categorize different input data into target classes. A myriad of machine learning algorithms have been proposed for the purpose. Meta-learning combines multiple base classifiers whose individual performances in some way contribute to the overall classification. There has been significant interest in combining machine learning algorithms to solve many challenging real-world problems. The premise is that meta-learning enhances the data mining task with the ability to learn and adapt from previous experience. In this paper, we investigate the performance of two popular meta-learning approaches, considering several well-known learning algorithms on a variety of public datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2015
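The combining step the abstract describes can be as simple as majority voting over base-classifier outputs. A minimal sketch (our illustration only; the meta-learning approaches the paper investigates learn from base-classifier behavior rather than just counting votes):

```python
import numpy as np

def majority_vote(predictions):
    """Combine base-classifier label predictions of shape
    (n_models, n_samples) by per-sample majority vote."""
    P = np.asarray(predictions)
    # bincount over each column (one column = all models' votes for a sample)
    return np.array([np.bincount(col).argmax() for col in P.T])
```

With three base classifiers disagreeing on individual samples, the combiner returns the label most models agree on for each sample.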
24. Monitoring Big Data During Mechanical Ventilation in the ICU.
- Author
-
Smallwood, Craig D.
- Subjects
MORTALITY risk factors ,ARTIFICIAL intelligence ,ARTIFICIAL respiration ,INTENSIVE care units ,LEARNING strategies ,MACHINE learning ,ARTIFICIAL neural networks ,REGRESSION analysis ,SEPSIS ,TERMS & phrases ,TRACHEOTOMY ,DATA mining ,MECHANICAL ventilators ,WORKFLOW ,ACQUISITION of data ,EXTUBATION ,TREATMENT duration ,ELECTRONIC health records ,ACCURACY ,DATA analytics ,DATA science - Abstract
The electronic health record allows the assimilation of large amounts of clinical and laboratory data. Big data describes the analysis of large data sets using computational modeling to reveal patterns, trends, and associations. How can big data be used to predict ventilator discontinuation or impending compromise, and how can it be incorporated into the clinical workflow? This article will serve 2 purposes. First, a general overview is provided for the layperson and introduces key concepts, definitions, best practices, and things to watch out for when reading a paper that incorporates machine learning. Second, recent publications at the intersection of big data, machine learning, and mechanical ventilation are presented. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
25. Pattern Prediction on Uncertain Big Datasets using Combined Light GBM and LSTM Model.
- Author
-
Zion, G. Divya and Tripathy, B. K.
- Subjects
MACHINE learning ,BIG data ,DATABASES ,EARLY detection of cancer ,ARTIFICIAL intelligence ,DIAGNOSIS - Abstract
Mining frequent patterns from voluminous datasets, termed 'Big data,' with inherent uncertainties poses a significant challenge. Minor changes carried out on the databases, such as addition, deletion, or modification of items, should not require scanning the whole database. Besides, a number of algorithms proposed to handle these issues are effective, but their mathematical basis and installation are complex. Keeping the above points in mind, we have proposed an approach that innovatively combines the Light Gradient Boosting Machine (LightGBM) and Long Short-Term Memory (LSTM) models serially to improve prediction accuracy. Here, LightGBM brings its tree-based learning algorithms optimized for speed and performance, while LSTM contributes its advanced sequence modeling capabilities, effectively resolving the vanishing gradient dilemma that often plagues recurrent networks. Our approach is applied to the healthcare sector in general, and particularly to the early detection of breast cancer from a dataset obtained from Kaggle, yielding outstanding results as evident from the scores: precision of 0.92 for predicted negatives and 0.93 for predicted positives, recall of 0.96 for negatives and 0.88 for positives, alongside F1-scores of 0.94 and 0.90, respectively. With an overall accuracy of 0.93 across 188 samples, our model demonstrates remarkable potential for early medical diagnosis, outperforming existing single-model solutions. The robustness of our approach is further validated by the consistency of performance across various metrics, highlighting its suitability for deployment in high-stakes domains where predictive accuracy is paramount. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
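The serial combination described above, where one model's output feeds the next, can be sketched generically. The toy below is entirely hypothetical: tiny logistic regressions stand in for LightGBM and LSTM, and `fit_logreg`/`serial_stack` are our own names. It shows only the wiring, with the stage-1 prediction appended as a feature for stage 2:

```python
import numpy as np

def fit_logreg(X, y, lr=0.1, epochs=2000):
    """Tiny logistic-regression stand-in for a stage model."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted probabilities
        g = p - y                            # gradient of log-loss wrt logits
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def predict(model, X):
    w, b = model
    return 1 / (1 + np.exp(-(X @ w + b)))

def serial_stack(X, y):
    """Stage 1 learns from raw features; its predicted probability is
    appended as an extra feature for stage 2 (the serial-combination idea)."""
    m1 = fit_logreg(X, y)
    X2 = np.column_stack([X, predict(m1, X)])
    m2 = fit_logreg(X2, y)
    return m1, m2

def stack_predict(m1, m2, X):
    X2 = np.column_stack([X, predict(m1, X)])
    return (predict(m2, X2) > 0.5).astype(int)
```

In the paper's setting, stage 1 would be a trained LightGBM model and stage 2 an LSTM consuming its outputs; the wiring, not the models, is what this sketch illustrates.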
26. Relation extraction in Chinese using attention-based bidirectional long short-term memory networks.
- Author
-
Yanzi Zhang
- Subjects
CHINESE language ,KNOWLEDGE graphs ,RECEIVER operating characteristic curves ,DEEP learning ,DATA mining ,NATURAL languages - Abstract
Relation extraction is an important topic in information extraction, as it is used to create large-scale knowledge graphs for a variety of downstream applications. Its goal is to find and extract semantic links between entity pairs in natural language sentences. Deep learning has substantially advanced neural relation extraction, allowing semantic features to be learned autonomously. In this study, we offer an effective Chinese relation extraction model that uses a bidirectional LSTM (Bi-LSTM) and an attention mechanism to extract crucial semantic information from phrases without relying on domain knowledge from lexical resources or language systems. The attention mechanism incorporated into the Bi-LSTM network allows the model to focus automatically on key words. Two benchmark datasets were used to build and test our models: Chinese SanWen and FinRE. The experimental results show that the SanWen model outperforms the FinRE model, with areas under the receiver operating characteristic curve of 0.70 and 0.50, respectively. The models trained on the SanWen and FinRE datasets achieve areas under the precision-recall curve of 0.44 and 0.19, respectively. In addition, repeated modeling experiments indicated that our proposed method is robust and reproducible. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
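A minimal NumPy sketch of the attention pooling step such models typically place on top of the Bi-LSTM (our simplification of the common formulation u_t = w·tanh(h_t), α = softmax(u), s = Σ α_t h_t; not the paper's exact architecture):

```python
import numpy as np

def attention_pool(H, w):
    """Attention over hidden states H of shape (T, d): score each
    timestep, softmax the scores, return the weighted sum of states."""
    u = np.tanh(H) @ w                   # (T,) unnormalized scores
    u = u - u.max()                      # numerical stability for softmax
    alpha = np.exp(u) / np.exp(u).sum()  # attention weights, sum to 1
    return alpha, H.T @ alpha            # weights (T,), sentence vector (d,)
```

The learned vector w determines which timesteps (words) dominate the pooled sentence representation, which is what lets the network "focus automatically on key words."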
27. Deep learning-based automatic action extraction from structured chemical synthesis procedures.
- Author
-
Vaškevičius, Mantas, Kapočiūtė-Dzikienė, Jurgita, Vaškevičius, Arnas, and Šlepikas, Liudas
- Subjects
NATURAL language processing ,MACHINE learning ,CHEMICAL synthesis ,ARTIFICIAL neural networks ,ARTIFICIAL intelligence ,DEEP learning ,SYNTHETIC biology - Abstract
This article proposes a methodology that uses machine learning algorithms to extract actions from structured chemical synthesis procedures, thereby bridging the gap between chemistry and natural language processing. The proposed pipeline combines ML algorithms and scripts to extract relevant data from USPTO and EPO patents, which helps transform experimental procedures into structured actions. This pipeline includes two primary tasks: classifying patent paragraphs to select chemical procedures and converting chemical procedure sentences into a structured, simplified format. We employ artificial neural networks such as long short-term memory networks (LSTMs), bidirectional LSTMs, transformers, and fine-tuned T5. Our results show that the bidirectional LSTM classifier achieved the highest accuracy of 0.939 in the first task, while the Transformer model attained the highest BLEU score of 0.951 in the second task. The developed pipeline enables the creation of a structured dataset of chemical reactions and their procedures, facilitating the application of AI-based approaches to streamline synthetic pathways, predict reaction outcomes, and optimize experimental conditions, and making it easier for researchers to access and utilize the valuable information in synthesis procedures. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
28. Evolution of data science and its education in iSchools: An impressionistic study using curriculum analysis.
- Author
-
Urs, Shalini R. and Minhaj, Mohamed
- Subjects
DATA science ,BIOLOGICAL evolution ,DIGITAL technology ,HEALTH occupations students ,NATURAL language processing ,CURRICULUM ,MACHINE learning ,ARTIFICIAL intelligence ,PARADIGMS (Social sciences) ,ABILITY ,TRAINING ,DATABASE management ,GRADUATE education ,CLUSTER analysis (Statistics) ,ONTOLOGIES (Information retrieval) ,MEDICAL informatics ,DATA mining - Abstract
Data Science (DS) has emerged from the shadows of its parents, statistics and computer science, into an independent field since its origin nearly six decades ago. Its evolution and education have taken many sharp turns. We present an impressionistic study of the evolution of DS anchored to Kuhn's four stages of paradigm shifts. First, we construct the landscape of DS based on curriculum analysis of the 32 iSchools across the world offering graduate-level DS programs. Second, we paint the "field" as it emerges from the word frequency patterns, ranking, and clustering of course titles based on text mining. Third, we map the curriculum to the landscape of DS and project the same onto the EDISON Data Science Framework (2017) and ACM Data Science Knowledge Areas (2021). Our study shows that the DS programs of iSchools align well with the field and correspond to the Knowledge Areas and skillsets. iSchools' DS curricula exhibit a bias toward "data visualization," along with machine learning, data mining, natural language processing, and artificial intelligence; go light on statistics; are slanted toward ontologies and health informatics; and give surprisingly minimal thrust to eScience/research data management, which we believe would add a distinctive iSchool flavor to DS. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
29. A New K-Nearest Neighbors Classifier for Big Data Based on Efficient Data Pruning.
- Author
-
Saadatfar, Hamid, Khosravi, Samiyeh, Joloudari, Javad Hassannataj, Mosavi, Amir, and Shamshirband, Shahaboddin
- Subjects
BIG data ,DATABASES ,PRUNING ,NEAREST neighbor analysis (Statistics) ,K-means clustering ,DATA mining ,MACHINE learning ,REINFORCEMENT learning - Abstract
The K-nearest neighbors (KNN) machine learning algorithm is a well-known non-parametric classification method. However, like other traditional data mining methods, applying it to big data comes with computational challenges. Indeed, KNN determines the class of a new sample based on the classes of its nearest neighbors; however, identifying the neighbors in a large amount of data imposes a computational cost so large that the method is no longer feasible on a single computing machine. One of the proposed techniques for making classification methods applicable to large datasets is pruning. LC-KNN is an improved KNN method that first clusters the data into smaller partitions using the K-means clustering method and then, for each new sample, applies KNN on the partition whose center is nearest. However, because the clusters have different shapes and densities, selecting the appropriate cluster is a challenge. In this paper, an approach is proposed to improve the pruning phase of the LC-KNN method by taking these factors into account. The proposed approach helps to choose a more appropriate cluster of data in which to look for the neighbors, thus increasing the classification accuracy. The performance of the proposed approach is evaluated on different real datasets. The experimental results show the effectiveness of the proposed approach and its higher classification accuracy and lower time cost in comparison to other recent relevant methods. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
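The base LC-KNN lookup the abstract builds on can be sketched in a few lines (a simplified illustration; the paper's contribution is a better cluster-selection rule that also accounts for cluster shape and density, which this sketch omits):

```python
import numpy as np

def lc_knn_predict(partitions, x, k=3):
    """LC-KNN-style lookup: pick the partition whose center is nearest
    to x, then run plain KNN inside that partition only.
    `partitions` is a list of (X_part, y_part) arrays."""
    centers = [P.mean(axis=0) for P, _ in partitions]
    j = int(np.argmin([np.linalg.norm(x - c) for c in centers]))
    Xp, yp = partitions[j]
    d = np.linalg.norm(Xp - x, axis=1)
    nearest = yp[np.argsort(d)[:k]]          # labels of the k nearest points
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[counts.argmax()]             # majority label
```

The computational saving comes from searching only one partition instead of the whole dataset; the risk, which the paper addresses, is choosing the wrong partition near cluster boundaries.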
30. Advanced methods for missing values imputation based on similarity learning
- Author
-
Ahmad Taher Azar, Mona M. Arafa, Mahmoud M. Ismail, and Khaled M. Fouad
- Subjects
General Computer Science, Mean squared error, Missing data, Data mining and machine learning, Data preprocessing, Large-scale data, Similarity learning, Artificial intelligence, Imputation (statistics), Cluster analysis, Data science - Abstract
Real-world data analysis and processing with data mining techniques often face observations that contain missing values, and the existence of missing values is the main challenge of mining datasets. The missing values in a dataset should be imputed using an imputation method to improve the accuracy and performance of data mining methods. Existing techniques use the k-nearest neighbors algorithm for imputing missing values, but determining the appropriate k value can be challenging. Other existing imputation techniques are based on hard clustering algorithms; when records are not well separated, as in the case of missing data, hard clustering often provides a poor description tool. In general, imputation based on similar records is more accurate than imputation based on the entire dataset's records, so improving the similarity among records can improve imputation performance. This paper proposes two numerical missing-data imputation methods. A hybrid method, called KI, is proposed first; it incorporates the k-nearest neighbors and iterative imputation algorithms. The best set of nearest neighbors for each missing record is discovered through record similarity using the k-nearest neighbors algorithm (kNN); to improve the similarity, a suitable k value is estimated automatically for the kNN. The iterative imputation method is then used to impute the missing values of the incomplete records using the global correlation structure among the selected records. An enhanced hybrid method, called FCKI, is then proposed as an extension of KI; it integrates fuzzy c-means, k-nearest neighbors, and iterative imputation to impute the missing data in a dataset. Fuzzy c-means is selected because records can belong to multiple clusters at the same time, which can further improve similarity.
FCKI searches a cluster, instead of the whole dataset, to find the best k-nearest neighbors, applying two levels of similarity to achieve higher imputation accuracy. The performance of the proposed imputation techniques is assessed using fifteen datasets with varying missing ratios for three types of missing data, MCAR, MAR, and MNAR, which are generated in this work; datasets of different sizes are used to validate the model. The proposed imputation techniques are compared with other missing-data imputation methods by means of three measures: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than other missing-data imputation methods.
- Published
- 2021
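The nearest-neighbor core of the KI method can be sketched as follows (our simplified illustration only: it omits KI's automatic estimation of k, the iterative-imputation refinement, and FCKI's fuzzy clustering):

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in each incomplete row with the mean of that feature
    over the k complete rows most similar on the observed features."""
    X = X.astype(float).copy()
    # neighbor pool: only rows with no missing values
    complete = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        # similarity measured on the features this row actually has
        d = np.linalg.norm(complete[:, obs] - row[obs], axis=1)
        neighbors = complete[np.argsort(d)[:k]]
        X[i, miss] = neighbors[:, miss].mean(axis=0)
    return X
```

This illustrates the abstract's central claim: imputing from similar records (the k nearest) rather than from all records ties the filled-in value to the local structure of the data.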
31. DataSifterText: Partially Synthetic Text Generation for Sensitive Clinical Notes.
- Author
-
Zhou, Nina, Wu, Qiucheng, Wu, Zewen, Marino, Simeone, and Dinov, Ivo D.
- Subjects
DATABASES ,DATA science ,PRIVACY ,MEDICAL information storage & retrieval systems ,ELECTRONIC data interchange ,SAMPLE size (Statistics) ,ARTIFICIAL intelligence ,MACHINE learning ,DATABASE management ,AUTOMATIC data collection systems ,DATA security ,MEDICAL ethics ,QUALITY assurance ,DESCRIPTIVE statistics ,RESEARCH funding ,DATA analytics ,DATA mining ,ALGORITHMS ,PROBABILITY theory - Abstract
Petabytes of health data are collected annually across the globe in electronic health records (EHR), including significant information stored as unstructured free text. However, the lack of effective mechanisms to securely share clinical text has inhibited its full utilization. We propose a new method, DataSifterText, to generate partially synthetic clinical free-text that can be safely shared between stakeholders (e.g., clinicians, STEM researchers, engineers, analysts, and healthcare providers), limiting the re-identification risk while providing significantly better utility preservation than suppressing or generalizing sensitive tokens. The method creates partially synthetic free-text data, which inherits the joint population distribution of the original data, and disguises the location of true and obfuscated words. Under certain obfuscation levels, the resulting synthetic text was sufficiently altered with different choices, orders, and frequencies of words compared to the original records. The differences were comparable to machine-generated (fully synthetic) text reported in previous studies. We applied DataSifterText to two medical case studies. In the CDC work injury application, using privacy protection, 60.9-86.5% of the synthetic descriptions belong to the same cluster as the original descriptions, demonstrating better utility preservation than the naïve content suppressing method (45.8-85.7%). In the MIMIC III application, the generated synthetic data maintained over 80% of the original information regarding patients' overall health conditions. The reported DataSifterText statistical obfuscation results indicate that the technique provides sufficient privacy protection (low identification risk) while preserving population-level information (high utility). [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
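A crude token-level sketch of partial text obfuscation (our hypothetical illustration; DataSifterText itself preserves the joint population distribution and controls the obfuscation level far more carefully than this):

```python
import random

def obfuscate(tokens, vocab, rate, seed=0):
    """Replace a fraction `rate` of tokens with draws from a corpus
    word distribution (`vocab`: word -> frequency weight), without
    marking which positions were changed."""
    rng = random.Random(seed)
    words, weights = zip(*vocab.items())
    out = list(tokens)
    for i in range(len(out)):
        if rng.random() < rate:  # rate=0 keeps all, rate=1 replaces all
            out[i] = rng.choices(words, weights=weights)[0]
    return out
```

Because replacements are drawn from the population word distribution and their positions are hidden, a reader cannot tell true tokens from obfuscated ones, which is the privacy property the abstract describes.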
32. Towards a Continuous Process Model for Data Science Projects
- Author
-
Kutzias, Damian, Dukino, Claudia, Kett, Holger, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Leitner, Christine, editor, Ganz, Walter, editor, Satterfield, Debra, editor, and Bassano, Clara, editor
- Published
- 2021
- Full Text
- View/download PDF
33. Analyzing and Visualizing Knowledge Structures of Health Informatics from 1974 to 2018: A Bibliometric and Social Network Analysis
- Author
-
Mohammad Saheb and Tahereh Saheb
- Subjects
Decision support systems, Computer science, Deep learning, Computer applications to medicine, Medical informatics, Biomedical engineering, Health informatics, Data mining, Algorithms, Data science, Patient safety, Machine learning, Health information management, Health care, Artificial intelligence, Social network analysis
Objectives This paper aims to provide a theoretical clarification of the health informatics field by conducting a quantitative review of the health informatics literature. It also aims to map scientific networks; to uncover the explicit and hidden patterns, knowledge structures, and sub-structures in those networks; to track the flow and burst of scientific topics; and to discover what effects they have on the scientific growth of health informatics. Methods This study was a quantitative literature review of the health informatics field, employing text mining and bibliometric research methods. The paper reviews 30,115 articles with health informatics as their topic, indexed in the Web of Science Core Collection database from 1974 to 2018. The study analyzed and mapped four networks: the author co-citation network, co-occurring author keywords and keywords plus, co-occurring subject categories, and the country co-citation network. We used CiteSpace 5.3 and VOSviewer to analyze the data, and Gephi 0.9.2 and VOSviewer to visualize the networks. Results This study found that the three major themes of the literature from 1974 to 2018 were the utilization of computer science in healthcare, the impact of health informatics on patient safety and the quality of healthcare, and decision support systems. It also found that, since 2016, health informatics has entered a new era of providing predictive, preventative, personalized, and participatory healthcare systems. Conclusions This study found that future strands of research may be patient-generated health data, deep learning algorithms, quantified-self and self-tracking tools, and Internet of Things based decision support systems.
- Published
- 2019
34. Translating cancer genomics into precision medicine with artificial intelligence: applications, challenges and future perspectives
- Author
-
Dehan Elinor, Jia Xu, Fang Wang, Baiju Parikh, Kirk A. Beaty, Shang Xue, Marta Sanchez-Martin, Pengwei Yang, and Bhuvan Sharma
- Subjects
Big data, MEDLINE, Neoplasms, Health care, Genetics, Data mining, Computer-assisted diagnosis, Precision medicine, Genetics (clinical), Natural language processing, Electronic data processing, Deep learning, Genomics, Data science, Pattern recognition, Workflow, Artificial intelligence, Applications of artificial intelligence
In the field of cancer genomics, the broad availability of genetic information offered by next-generation sequencing technologies and the rapid growth in biomedical publications have led to the advent of the big-data era. Integration of artificial intelligence (AI) approaches such as machine learning, deep learning, and natural language processing (NLP), to tackle the challenges of scalability and high dimensionality of data and to transform big data into clinically actionable knowledge, is expanding and becoming the foundation of precision medicine. In this paper, we review the current status and future directions of AI application in cancer genomics within the context of workflows that integrate genomic analysis for precision cancer care. The existing AI solutions and their limitations in cancer genetic testing and diagnostics, such as variant calling and interpretation, are critically analyzed. Publicly available tools and algorithms for key NLP technologies in literature mining for evidence-based clinical recommendations are reviewed and compared. In addition, the present paper highlights the challenges to AI adoption in digital healthcare with regard to data requirements, algorithmic transparency, reproducibility, and real-world assessment, and discusses the importance of preparing patients and physicians for modern digitized healthcare. We believe that AI will remain the main driver of healthcare transformation toward precision medicine, yet the unprecedented challenges posed should be addressed to ensure safety and beneficial impact on healthcare.
- Published
- 2019
35. Hands-on training about overfitting
- Author
-
Janez Demšar and Blaž Zupan
- Subjects
Overfitting, Machine learning, Training, Education, Gene expression, Learning and memory, Data mining, Data management, Orange (software), Software engineering, Data visualization, Models (biological and statistical), Artificial intelligence, Computational biology, Preprocessing, Workflow, Data science, Cognitive science, Software, Neuroscience - Abstract
Overfitting is one of the critical problems in developing models by machine learning. With machine learning becoming an essential technology in computational biology, we must include training about overfitting in all courses that introduce this technology to students and practitioners. We here propose a hands-on training for overfitting that is suitable for introductory level courses and can be carried out on its own or embedded within any data science course. We use workflow-based design of machine learning pipelines, experimentation-based teaching, and hands-on approach that focuses on concepts rather than underlying mathematics. We here detail the data analysis workflows we use in training and motivate them from the viewpoint of teaching goals. Our proposed approach relies on Orange, an open-source data science toolbox that combines data visualization and machine learning, and that is tailored for education in machine learning and explorative data analysis., Author summary Every teacher strives for an a-ha moment, a sudden revelation by the student who gained a fundamental insight she will always remember. In the past years, authors of this paper have been tailoring their courses in machine learning to include material that could lead students to such discoveries. We aim to expose machine learning to practitioners–not only computer scientists but also molecular biologists and students of biomedicine, that is, the end-users of bioinformatics’ computational approaches. In this article, we lay out a course that aims to teach about overfitting, one of the key concepts in machine learning that needs to be understood, mastered, and avoided in data science applications. We propose a hands-on approach that uses an open-source workflow-based data science toolbox that combines data visualization and machine learning. In the proposed training about overfitting, we first deceive the students, then expose the problem, and finally challenge them to find the solution. 
In the paper, we present three lessons on overfitting and the associated data analysis workflows, and motivate the use of the introduced computational methods by relating them to concepts conveyed by the instructors.
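The deceive-expose-resolve arc described above can also be reproduced outside Orange's visual workflows. The following stand-alone sketch (not the authors' material; data and model are invented for illustration) makes the overfitting effect concrete: a 1-nearest-neighbour model fitted to purely random labels scores perfectly on its training data yet only at chance level on fresh data, because there was never any signal to learn.

```python
import random

random.seed(0)

# Random features with purely random class labels: there is no signal
# to learn, so any apparent fit is overfitting by construction.
def make_data(n):
    return [([random.random() for _ in range(5)], random.randint(0, 1))
            for _ in range(n)]

train, test = make_data(100), make_data(100)

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_1nn(x):
    # 1-nearest-neighbour: copy the label of the closest training point.
    return min(train, key=lambda p: dist(p[0], x))[1]

def accuracy(data):
    return sum(predict_1nn(x) == y for x, y in data) / len(data)

train_acc, test_acc = accuracy(train), accuracy(test)
print(f"train accuracy: {train_acc:.2f}")  # 1.00: the model memorised the noise
print(f"test accuracy:  {test_acc:.2f}")   # near 0.50: chance level
```

Each training point is its own nearest neighbour, which is exactly the memorisation the lessons set out to expose.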
- Published
- 2021
36. Clinlabomics: leveraging clinical laboratory data by data mining strategies.
- Author
-
Wen, Xiaoxia, Leng, Ping, Wang, Jiasi, Yang, Guishu, Zu, Ruiling, Jia, Xiaojiong, Zhang, Kaijiong, Mengesha, Birga Anteneh, Huang, Jian, Wang, Dongsheng, and Luo, Huaichao
- Subjects
PATHOLOGICAL laboratories ,DATA mining ,MEDICAL screening ,COMPUTER engineering ,MACHINE learning ,DIAGNOSIS - Abstract
The recent global focus on big data in medicine has been accompanied by the rise of artificial intelligence (AI) in diagnosis and decision-making, following recent advances in computer technology. To date, AI has been applied to various aspects of medicine, including disease diagnosis, surveillance, treatment, prediction of future risk, targeted interventions, and understanding of disease. There are many successful examples of using big data in medicine, for instance in radiology, pathology, ophthalmology, cardiology, and surgery. Combining medicine and AI has become a powerful tool for changing health care, and even the nature of disease screening in clinical diagnosis. Clinical laboratories produce large amounts of testing data every day, and the idea that clinical laboratory data combined with AI may establish new approaches to diagnosis and treatment has attracted wide attention. The concept of radiomics was created for imaging data combined with AI, but a corresponding definition for clinical laboratory data combined with AI has been lacking, so many studies in this field cannot be accurately classified. We therefore propose a new concept, clinical laboratory omics (Clinlabomics), combining clinical laboratory medicine and AI. Clinlabomics uses high-throughput methods to extract large amounts of feature data from blood, body fluids, secretions, excreta, and casts in clinical laboratory test data, and then applies statistics, machine learning, and other methods to uncover information that would otherwise remain undiscovered. In this review, we summarize applications of clinical laboratory data combined with AI across medical fields. Undeniably, Clinlabomics is an approach that can assist many fields of medicine, but it still requires further validation in multi-center and laboratory settings. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
37. Data Science Techniques for COVID-19 in Intensive Care Units
- Author
-
Fernando López Hernández, Alberto Corbi Bellot, and Sergio Muñoz Lezcano
- Subjects
Statistics and Probability ,0209 industrial biotechnology ,2019-20 coronavirus outbreak ,Emergency rooms ,Coronavirus disease 2019 (COVID-19) ,Computer Networks and Communications ,Process (engineering) ,Computer science ,Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) ,02 engineering and technology ,lcsh:Technology ,Data-driven ,X-ray ,020901 industrial engineering & automation ,Artificial Intelligence ,Intensive care ,Health care ,0202 electrical engineering, electronic engineering, information engineering ,coronavirus covid-19 ,coronavirus COVID-19 ,business.industry ,lcsh:T ,biomarkers ,IJIMAI ,data mining ,Data science ,Computer Science Applications ,image processing ,machine learning ,x-ray ,Signal Processing ,020201 artificial intelligence & image processing ,Computer Vision and Pattern Recognition ,business - Abstract
Data scientists aim to provide clinicians with techniques and tools to manage the new coronavirus disease. New emerging tools based on Artificial Intelligence (AI), Image Processing (IP), and Machine Learning (ML) are contributing to the improvement of healthcare and the treatment of different diseases. This paper reviews the most recent research efforts and approaches related to these new data-driven techniques and tools, in combination with the exploitation of already available COVID-19 datasets. The tools can assist clinicians and nurses in efficient decision-making with complex and heavily heterogeneous data, even in hectic and overburdened Intensive Care Unit (ICU) scenarios. The datasets and techniques underlying these tools can help reach a more accurate diagnosis. The paper also describes how these innovative AI+IP+ML-based methods (e.g., conventional X-ray imaging, clinical laboratory data, respiratory monitoring and automatic adjustments, etc.) can help ease both the care of infected patients in ICUs and Emergency Rooms and the discovery of new treatments (drugs).
- Published
- 2020
38. Vovel metrics-novel coupling metrics for improved software fault prediction
- Author
-
Muddassar Azam Sindhu, Rizwan Muhammad, and Aamer Nadeem
- Subjects
General Computer Science ,Computer science ,02 engineering and technology ,computer.software_genre ,Software ,Artificial Intelligence ,0202 electrical engineering, electronic engineering, information engineering ,Information flow (information theory) ,Software system ,Measure (data warehouse) ,Software coupling ,business.industry ,Data Science ,Univariate ,Software Engineering ,020207 software engineering ,Software faults ,QA75.5-76.95 ,Software metric ,Coupling (computer programming) ,Expert opinion ,Electronic computers. Computer science ,020201 artificial intelligence & image processing ,Data mining ,business ,Software metrics ,Model building ,computer - Abstract
Software is a complex entity, and its development needs careful planning and a large amount of time and cost. Software measures are very helpful for assessing the quality of a program. Among the existing measures, coupling is an important design measure that computes the degree of interdependence among the entities of a software system. Higher coupling leads to cognitive complexity and thus a higher probability of faults. Timely prediction of fault-prone modules helps save testing time and cost. This paper aims to capture important aspects of coupling and then assess the effectiveness of these aspects in determining fault-prone entities in a software system. We propose two coupling metrics, Vovel-in and Vovel-out, that capture both the level of coupling and the volume of information flow. We empirically evaluate the effectiveness of the Vovel metrics in determining fault-prone classes using five projects: Eclipse JDT, Equinox framework, Apache Lucene, Mylyn, and Eclipse PDE UI. Models are built using univariate logistic regression, and the Spearman correlation coefficient is then computed against the existing coupling metrics to assess the coverage of unique information. Finally, the least correlated metrics are used to build multivariate logistic regression models with and without the Vovel metrics, to assess the effectiveness of the Vovel metrics. The results show that the proposed metrics significantly improve the prediction of fault-prone classes. Moreover, the proposed metrics cover a significant amount of unique information not covered by the existing well-known coupling metrics, i.e., CBO, RFC, Fan-in, and Fan-out. This paper empirically evaluates the impact of coupling metrics, and more specifically the importance of the level and volume of coupling, in software fault prediction. The results advocate the prudent addition of the proposed metrics due to their unique information coverage and significant predictive ability.
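The abstract does not state the Vovel formulas, so the following is only a hypothetical sketch of how a level-and-volume coupling measure of this flavour could be computed from a call graph; the class names and edge weights are invented, and the real metrics may be defined differently.

```python
# Hypothetical call graph: (caller, callee) -> number of data items
# exchanged across all calls (the "volume" of information flow).
calls = {
    ("Parser", "Lexer"): 12,
    ("Parser", "SymbolTable"): 4,
    ("CodeGen", "SymbolTable"): 9,
}

def vovel_out(cls):
    # Level: how many distinct classes `cls` depends on (fan-out).
    # Volume: how much data flows out along those dependencies.
    edges = [vol for (caller, callee), vol in calls.items() if caller == cls]
    return len(edges), sum(edges)

def vovel_in(cls):
    # Symmetric measure for incoming coupling (fan-in plus inflow volume).
    edges = [vol for (caller, callee), vol in calls.items() if callee == cls]
    return len(edges), sum(edges)

print(vovel_out("Parser"))      # (2, 16): two dependencies, 16 items flowing out
print(vovel_in("SymbolTable"))  # (2, 13): used by two classes, 13 items flowing in
```

Pairing the count with the flow volume is what distinguishes such a measure from plain Fan-in/Fan-out, which only capture the count.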
- Published
- 2020
39. Healthcare Applications of Artificial Intelligence and Analytics: A Review and Proposed Framework
- Author
-
Alex Ramirez, Stephane Gagnon, Gregory Richards, and S. Azzi
- Subjects
Computer science ,Big data ,02 engineering and technology ,lcsh:Technology ,Field (computer science) ,lcsh:Chemistry ,03 medical and health sciences ,0302 clinical medicine ,big data ,Health care ,0202 electrical engineering, electronic engineering, information engineering ,General Materials Science ,ontologies ,030212 general & internal medicine ,Instrumentation ,lcsh:QH301-705.5 ,Fluid Flow and Transfer Processes ,Chronic care ,business.industry ,lcsh:T ,Process Chemistry and Technology ,General Engineering ,Information technology ,Medical practice ,healthcare ,data mining ,artificial intelligence ,Data science ,lcsh:QC1-999 ,Computer Science Applications ,smart homes ,lcsh:Biology (General) ,lcsh:QD1-999 ,Analytics ,lcsh:TA1-2040 ,020201 artificial intelligence & image processing ,Applications of artificial intelligence ,business ,lcsh:Engineering (General). Civil engineering (General) ,lcsh:Physics - Abstract
Healthcare has been considered one of the most promising application areas for artificial intelligence and analytics (AIA) since the latter's emergence. AI combined with analytics technologies is increasingly changing medical practice and healthcare in impressive ways, using efficient algorithms from various branches of information technology (IT). Numerous works are published every year by universities and innovation centers worldwide, but there are concerns about how effectively this progress translates into practice. There are growing examples of AIA being implemented in healthcare with promising results. This review summarizes the past five years of healthcare applications of AIA, across different techniques and medical specialties, and discusses the current issues and challenges related to this transformative technology. A total of 24,782 articles were identified. The aim of this paper is to provide the research community with the background needed to push this field further and to propose a framework that will help integrate diverse AIA technologies around patient needs in various healthcare contexts, especially for chronic care patients, who present the most complex comorbidities and care needs.
- Published
- 2020
40. Playtime Measurement With Survival Analysis
- Author
-
Markus Viljanen, Jukka Heikkonen, Antti Airola, and Tapio Pahikkala
- Subjects
FOS: Computer and information sciences ,Computer Science - Artificial Intelligence ,Computer science ,Machine Learning (stat.ML) ,02 engineering and technology ,computer.software_genre ,Statistics - Applications ,Session (web analytics) ,Statistics - Machine Learning ,Artificial Intelligence ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Applications (stat.AP) ,Product (category theory) ,Electrical and Electronic Engineering ,Duration (project management) ,Survival analysis ,ta113 ,Video game development ,Monetization ,business.industry ,05 social sciences ,050301 education ,Data science ,Game analytics ,Artificial Intelligence (cs.AI) ,Control and Systems Engineering ,Analytics ,Data mining ,business ,0503 education ,computer ,Software - Abstract
Maximizing product use is a central goal of many businesses, which makes retention and monetization two central analytics metrics in games. Player retention may refer to various duration variables quantifying product use: total playtime and session playtime are popular research targets, and active playtime is well suited to subscription games. Such research often aims to increase player retention or, conversely, to decrease player churn. Survival analysis is a framework of powerful tools well suited to retention-type data. This paper contributes new methods to game analytics showing how playtime can be analyzed using survival analysis without covariates. Survival and hazard estimates provide both a visual and an analytic interpretation of playtime phenomena as a funnel-type nonparametric estimate. Metrics based on the survival curve can aggregate this playtime information into a single statistic, and comparison of survival curves between cohorts provides a scientific A/B test. All these methods work on censored data and enable computation of confidence intervals, which is especially important for the time- and sample-limited data that arises during game development. Throughout this paper, we illustrate the application of these methods to real-world game development problems in the Hipster Sheep mobile game.
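The survival estimates described above are conventionally obtained with the Kaplan-Meier product-limit estimator, which handles censored playtimes (players still active when data collection stopped). Below is a minimal illustrative implementation with invented playtime data; it is not taken from the paper, which would use full-featured survival tooling.

```python
def kaplan_meier(times, observed):
    """Product-limit survival estimate for right-censored playtimes.

    times:    playtime per player
    observed: True if the player actually quit (event observed),
              False if still playing at data collection (censored).
    Returns [(t, S(t))] at each event time.
    """
    data = sorted(zip(times, observed))
    survival, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, obs in data if tt == t and obs)
        at_risk = sum(1 for tt, _ in data if tt >= t)
        if deaths:
            # Multiply in the conditional survival past time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        while i < len(data) and data[i][0] == t:  # skip ties at t
            i += 1
    return curve

# Four players: quit at 1h, quit at 2h, censored at 2h, quit at 3h.
curve = kaplan_meier([1, 2, 2, 3], [True, True, False, True])
print([(t, round(s, 4)) for t, s in curve])  # [(1, 0.75), (2, 0.5), (3, 0.0)]
```

The censored player at 2 hours is counted in the risk set at t = 2 but never as a churn event, which is exactly what naive averaging of playtimes gets wrong.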
- Published
- 2018
41. Clarifying Data Analytics Concepts for Industrial Engineering
- Author
-
Marco Macchi, Laura Cattaneo, Luca Fumagalli, and Elisa Negri
- Subjects
Big Data ,0209 industrial biotechnology ,Material requirements planning ,Industry 4.0 ,business.industry ,Computer science ,Big data ,Information technology ,Process design ,data mining ,02 engineering and technology ,artificial intelligence ,Data science ,020901 industrial engineering & automation ,Product lifecycle ,Control and Systems Engineering ,Analytics ,0202 electrical engineering, electronic engineering, information engineering ,Data analysis ,Information system ,020201 artificial intelligence & image processing ,data analytics ,business - Abstract
In the last decade manufacturing experienced a shift towards digitalization. Cost decrease of sensors, wireless connectivity, and the opportunity to store big amounts of data pushed a process towards a next generation of IT industry. Manufacturing now has the opportunity to gather large quantities of data, coming from different areas, such as product and process design, assembly, material planning, quality control, scheduling, maintenance, fault detection and cover all the product life cycle phases. The extraction of value from data is a new challenge that companies are now experiencing. Therefore, the need for analytical information system is growing, in order to explore datasets and discover useful and often hidden information. Data analytics became a keyword in this context, but sometimes it is not clear how different methods or tools are defined and could be effectively used to analyze data in manufacturing. The paper aims to present and clarify the meaning of terms that are currently and frequently used in the context of analytics. The paper also provides an overview of the data analysis techniques that could be used to extract knowledge from data along the manufacturing process.
- Published
- 2018
42. A Review on Business Intelligence and Big Data
- Author
-
Hacer Karacan and Erkan Sirin
- Subjects
Big Data ,Decision support system ,business.industry ,Computer science ,Big data ,Context (language use) ,02 engineering and technology ,USable ,Computer Graphics and Computer-Aided Design ,Data science ,Variety (cybernetics) ,Machine Learning ,Business Intelligence ,Artificial Intelligence ,Control and Systems Engineering ,Analytics ,020204 information systems ,Business intelligence ,0202 electrical engineering, electronic engineering, information engineering ,Data Mining ,020201 artificial intelligence & image processing ,business ,Raw data ,Information Systems - Abstract
Improvement of data generating, processing, storing and networking technologies has made storing, capturing and sharing of data easier and cheaper than before and has enabled organizations to handle huge volume of data at high velocity and variety, named as big data. Big data offers many opportunities when the associated difficulties are addressed properly. Business Intelligence (BI) basically focuses on transforming raw data into usable, valuable and actionable information for decision-making. It can be classified as a kind of data-driven decision support system. Although big data related papers have increased for last fifteen years, there are not sufficient papers that directly overviews big data impact on BI. As data is growing exponentially, storage, process and analytics tools and technologies become more important for BI solutions. With the advent of big data, BI’s concept, architecture and capabilities are meant to be changed. Unlike a decades before, BI now is to be extract value from huge data ocean by using big data tools as well as classical ones. So, an interclusion has emerged between big data and BI. This paper overviews the current state of the art of BI and big data, and discuss how big data era affects BI solutions in general context.
- Published
- 2017
43. Cardiovascular informatics: building a bridge to data harmony.
- Author
-
Caufield, John Harry, Sigdel, Dibakar, Fu, John, Choi, Howard, Guevara-Gonzalez, Vladimir, Wang, Ding, and Ping, Peipei
- Subjects
DATA mining ,ARTIFICIAL intelligence ,MACHINE learning ,ELECTRONIC data processing ,MEDICAL research - Abstract
The search for new strategies for better understanding cardiovascular (CV) disease is a constant one, spanning multitudinous types of observations and studies. A comprehensive characterization of each disease state and its biomolecular underpinnings relies upon insights gleaned from extensive information collection of various types of data. Researchers and clinicians in CV biomedicine repeatedly face questions regarding which types of data may best answer their questions, how to integrate information from multiple datasets of various types, and how to adapt emerging advances in machine learning and/or artificial intelligence to their needs in data processing. Frequently lauded as a field with great practical and translational potential, the interface between biomedical informatics and CV medicine is challenged with staggeringly massive datasets. Successful application of computational approaches to decode these complex and gigantic amounts of information becomes an essential step toward realizing the desired benefits. In this review, we examine recent efforts to adapt informatics strategies to CV biomedical research: automated information extraction and unification of multifaceted -omics data. We discuss how and why this interdisciplinary space of CV Informatics is particularly relevant to and supportive of current experimental and clinical research. We describe in detail how open data sources and methods can drive discovery while demanding few initial resources, an advantage afforded by widespread availability of cloud computing-driven platforms. Subsequently, we provide examples of how interoperable computational systems facilitate exploration of data from multiple sources, including both consistently formatted structured data and unstructured data. Taken together, these approaches for achieving data harmony enable molecular phenotyping of CV diseases and unification of CV knowledge. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
44. Top 10 data mining techniques in business applications: a brief survey
- Author
-
Chih-Fong Tsai, Wei Chao Lin, and Shih Wen Ke
- Subjects
Association rule learning ,Computer science ,Decision tree ,02 engineering and technology ,Customer relationship management ,Recommender system ,Machine learning ,computer.software_genre ,Theoretical Computer Science ,Business process discovery ,Naive Bayes classifier ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Computer Science (miscellaneous) ,Engineering (miscellaneous) ,business.industry ,Supervised learning ,Data science ,Control and Systems Engineering ,Unsupervised learning ,020201 artificial intelligence & image processing ,Data mining ,Artificial intelligence ,business ,computer ,Social Sciences (miscellaneous) - Abstract
Purpose Data mining is widely considered necessary in many business applications for effective decision-making. The importance of business data mining is reflected in the numerous surveys in the literature investigating works that use data mining techniques to solve specific business problems. The purpose of this paper is to answer the following question: what are the most widely used data mining techniques in business applications? Design/methodology/approach The aim of this paper is to examine related surveys in the literature and thus to identify the frequently applied data mining techniques. To ensure the recency and quality of the conclusions, the criterion for selecting related studies is that the works were published in reputed journals within the past 10 years. Findings There are 33 different data mining techniques employed in eight different application areas. Most of them are supervised learning techniques, and the application area where such techniques are most often seen is bankruptcy prediction, followed by customer relationship management, fraud detection, intrusion detection, and recommender systems. Furthermore, the ten most widely used data mining techniques for business applications are the decision tree (including the C4.5 decision tree and the classification and regression tree), genetic algorithm, k-nearest neighbor, multilayer perceptron neural network, naïve Bayes, and support vector machine as supervised learning techniques, and association rules, expectation maximization, and k-means as unsupervised learning techniques. Originality/value The originality of this paper lies in surveying the past 10 years of related survey and review articles about data mining in business applications to identify the most popular techniques.
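To give a flavour of the listed techniques, here is a toy implementation of one of the unsupervised methods in the top ten, k-means, on one-dimensional data. It is illustrative only, with invented numbers, and is unrelated to the surveyed works.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy 1-D k-means: alternate assignment and centre-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        # Update step: each centre moves to its cluster's mean.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two obvious groups of customer-spend values (invented data).
centers = kmeans([1.0, 1.2, 0.8, 10.0, 10.5, 9.5], k=2)
print(centers)  # close to [1.0, 10.0]
```

The same alternate-and-converge structure underlies expectation maximization, another of the listed unsupervised techniques, with soft assignments in place of hard ones.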
- Published
- 2017
45. An overview and comparison of free Python libraries for data mining and big data analysis
- Author
-
Alan Jovic, Igor Stancin, and Skala, Karolj
- Subjects
010302 applied physics ,business.industry ,Computer science ,Deep learning ,Big data ,data science ,python ,data mining ,machine learning library ,big data analysis ,framework ,02 engineering and technology ,Python (programming language) ,computer.software_genre ,01 natural sciences ,Popularity ,Data preparation ,Data visualization ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Data mining ,Artificial intelligence ,business ,computer ,computer.programming_language - Abstract
The popularity of Python is growing, especially in the field of data science, and consequently an increasing number of free libraries is available. The aim of this review paper is to describe and compare the characteristics of different data mining and big data analysis libraries in Python. No existing paper covers the subject and describes the pros and cons of all these libraries. Here we consider more than 20 libraries and separate them into six groups: core libraries, data preparation, data visualization, machine learning, deep learning, and big data. Besides the functionality of a given library, important factors for comparison are the number of contributors developing and maintaining the library and the size of its community; bigger communities mean better chances of easily finding a solution to a given problem. We currently recommend: pandas for data preparation; Matplotlib, seaborn, or Plotly for data visualization; scikit-learn for machine learning; TensorFlow, Keras, and PyTorch for deep learning; and Hadoop Streaming and PySpark for big data.
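As a minimal illustration of the data-preparation role the review assigns to pandas (the data and column names below are invented): load records, drop incomplete rows, and aggregate.

```python
import pandas as pd

# A small data-preparation pass: clean missing values, then aggregate.
df = pd.DataFrame({
    "sensor": ["a", "a", "b", "b", "b"],
    "reading": [1.0, None, 3.0, 5.0, 4.0],
})
clean = df.dropna(subset=["reading"])           # drop rows with missing readings
means = clean.groupby("sensor")["reading"].mean()
print(means.to_dict())  # {'a': 1.0, 'b': 4.0}
```

From here the cleaned frame would typically be handed to scikit-learn for modelling, which is exactly the library pairing the review recommends.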
- Published
- 2019
46. Clinical Natural Language Processing in 2015: Leveraging the Variety of Texts of Clinical Interest
- Author
-
Pierre Zweigenbaum and Aurélie Névéol
- Subjects
020205 medical informatics ,Computer science ,Section (typography) ,02 engineering and technology ,computer.software_genre ,Semantics ,Field (computer science) ,03 medical and health sciences ,0302 clinical medicine ,Health care ,0202 electrical engineering, electronic engineering, information engineering ,Selection (linguistics) ,Data Mining ,Electronic Health Records ,Humans ,Leverage (statistics) ,030212 general & internal medicine ,Natural Language Processing ,business.industry ,Patient Selection ,General Medicine ,Unified Medical Language System ,Data science ,Variety (cybernetics) ,Synopsis ,Yearbook ,Artificial intelligence ,business ,computer ,Algorithms ,Natural language processing - Abstract
Summary Objective: To summarize recent research and present a selection of the best papers published in 2015 in the field of clinical Natural Language Processing (NLP). Method: A systematic review of the literature was performed by the two section editors of the IMIA Yearbook NLP section by searching bibliographic databases with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. Section editors first selected a shortlist of candidate best papers that were then peer-reviewed by independent external reviewers. Results: The clinical NLP best paper selection shows that clinical NLP is making use of a variety of texts of clinical interest to contribute to the analysis of clinical information and the building of a body of clinical knowledge. The full review process highlighted five papers analyzing patient-authored texts or seeking to connect and aggregate multiple sources of information. They provide a contribution to the development of methods, resources, applications, and sometimes a combination of these aspects. Conclusions: The field of clinical NLP continues to thrive through the contributions of both NLP researchers and healthcare professionals interested in applying NLP techniques to impact clinical practice. Foundational progress in the field makes it possible to leverage a larger variety of texts of clinical interest for healthcare purposes.
- Published
- 2016
47. Combined data mining techniques based patient data outlier detection for healthcare safety
- Author
-
Chai Yi, Gebeyehu Belay Gebremeskel, Zhongshi He, and Dawit Haile
- Subjects
Biological data ,General Computer Science ,Computer science ,business.industry ,02 engineering and technology ,Machine learning ,computer.software_genre ,Data science ,Field (computer science) ,Patient safety ,020204 information systems ,Outlier ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Anomaly detection ,Data mining ,Artificial intelligence ,Haystack ,business ,Cluster analysis ,computer ,Situation analysis - Abstract
Purpose– Among the growing number of data mining (DM) techniques, outlier detection has gained importance in many applications and has attracted much attention in recent times. Past outlier-detection research in safety care can be viewed as searching for needles in a haystack; however, outliers are not always erroneous. The purpose of this paper is therefore to investigate the role of outliers in healthcare services in general and in patient safety care in particular.Design/methodology/approach– The paper uses a combined DM technique (clustering and nearest neighbors) for outlier detection, which provides a clear understanding of, and meaningful insight into, data behavior relevant to healthcare safety. The implicit knowledge obtained is vitally important to a proper clinical decision-making process, and the approach shows that patients' events and situations play a significant role in patient safety care and medication.Findings– The paper presents a novel, integrated methodology that can be applied to different kinds of biological data analysis. Integrated DM techniques are discussed as a way to optimize performance in health and medical science. The integrated outlier-detection method can be extended to search for valuable information and implicit knowledge based on selected patient factors. On this basis, outliers are detected as clusters and point events, and novel ideas are proposed to strengthen clinical services with customer satisfaction in mind. The work can also serve as a baseline for further healthcare strategy development and research.Research limitations/implications– This paper focuses mainly on outlier detection. Outlier isolation, which is essential for investigating how an outlier arose and for communicating how to mitigate it, is not addressed; the research can therefore be extended to the hierarchy of patient problems.Originality/value– DM is a dynamic and successful gateway to discovering useful knowledge for enhancing healthcare performance and patient safety, and clinical-data-based outlier detection is a basic task in achieving a healthcare strategy. This paper therefore focuses on combined DM techniques for deep analysis of clinical data, supporting an optimal clinical decision-making process. Proper clinical decisions depend on attribute selection, which identifies the influential factors or parameters of healthcare services. Integrated clustering and nearest-neighbor techniques are thus better suited to finding outliers in such complex data, and these outliers can be fundamental to further analysis of healthcare and patient safety situations.
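The abstract does not give the algorithmic details of the combined approach, but the nearest-neighbour half of such a detector can be sketched as follows: score each record by the distance to its k-th nearest neighbour and flag records whose score exceeds a threshold. The lab values and threshold below are invented for illustration.

```python
def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour.

    Points far from all others (large k-NN distance) are outlier
    candidates: the "needles in the haystack" among patient records.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical lab values: one reading sits far from the dense cluster.
values = [4.1, 4.3, 4.0, 4.2, 9.8]
scores = knn_outlier_scores(values)
outliers = [v for v, s in zip(values, scores) if s > 1.0]
print(outliers)  # [9.8]
```

In the combined scheme described above, a clustering pass would group the bulk of the records first, and the k-NN score would then separate genuine point events from mere cluster fringes.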
- Published
- 2016
48. Multi-dimension reviewer credibility quantification across diverse travel communities
- Author
-
Norman Au, Hong Va Leong, Grace Ngai, Stephen Chi Chan, and Yuanyuan Wang
- Subjects
Index (economics) ,Multi dimension ,Computer science ,02 engineering and technology ,Destinations ,computer.software_genre ,Data science ,Human-Computer Interaction ,Work (electrical) ,Artificial Intelligence ,Hardware and Architecture ,020204 information systems ,Credibility ,0202 electrical engineering, electronic engineering, information engineering ,Information source ,020201 artificial intelligence & image processing ,Social media ,Data mining ,computer ,Software ,Tourism ,Information Systems - Abstract
The rapid development of social media technologies enables travellers to share travel experiences and opinions online by posting reviews, which then serve as an information source for other travellers. However, the explosive growth of reviews and the proliferation of uninformative, biased, or even false information make it very challenging for travellers to find credible information. To help travellers find credible information, most current work applies mainly qualitative approaches to investigate the credibility of reviews or reviewers. This paper adopts an Impact Index to quantify the credibility of reviewers by simultaneously evaluating their expertise and trustworthiness, based on the number of reviews they post and the number of helpful votes those reviews receive. Furthermore, the Impact Index is extended into the Exposure-Impact Index by additionally considering reviewers' breadth of expertise, in the form of the number of destinations on which they have posted reviews. To examine the effectiveness and applicability of the Impact Index and the Exposure-Impact Index, this paper evaluates them on several data sets collected from two rather different online travel communities: TripAdvisor, the world's largest travel community, and Qunar, one of the most popular travel communities in China. Experimental results show that both indexes lead to results more consistent with human judgments than the state-of-the-art method in measuring the credibility of reviewers from diverse communities, demonstrating their effectiveness and applicability.
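The abstract gives no formula for the Impact Index, but its description (credibility from review counts and helpful votes) suggests a computation in the h-index family. The following is a hypothetical sketch of that style of index, not the paper's actual definition, with invented vote counts.

```python
def impact_index(helpful_votes):
    """h-index-style score: a reviewer has impact h if h of their
    reviews each received at least h helpful votes.

    Illustrative only; the paper's Impact Index may differ.
    """
    votes = sorted(helpful_votes, reverse=True)
    h = 0
    for i, v in enumerate(votes, start=1):
        if v >= i:
            h = i
        else:
            break
    return h

# Reviewer with five reviews and these helpful-vote counts (invented):
print(impact_index([10, 8, 5, 4, 3]))  # 4
print(impact_index([1, 1, 1]))         # 1
```

A score of this shape rises only when a reviewer has both many reviews (expertise) and many helpful votes per review (trustworthiness), matching the two dimensions the paper says the index evaluates simultaneously.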
- Published
- 2016
49. Digitalisation and Big Data Mining in Banking
- Author
-
Hossein Hassani, Xu Huang, and Emmanuel Sirimal Silva
- Subjects
lcsh:T ,business.industry ,Big data ,banking ,big data analytics ,02 engineering and technology ,data mining ,lcsh:Technology ,Data science ,Data resources ,Computer Science Applications ,Management Information Systems ,Artificial Intelligence ,Order (exchange) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Key (cryptography) ,020201 artificial intelligence & image processing ,Strategic management ,Customer satisfaction ,survey ,Business ,Implementation ,Big data mining ,Information Systems - Abstract
Banking, as a data-intensive subject, has been progressing continuously under the promoting influence of the big data era. Exploring advanced big data analytic tools such as Data Mining (DM) techniques is key for the banking sector, which aims to reveal valuable information from an overwhelming volume of data and to achieve better strategic management and customer satisfaction. To provide sound direction for future research and development, a comprehensive and up-to-date review of the current state of DM research in banking is extremely beneficial. Since existing reviews only cover applications up to 2013, this paper fills that gap and presents the significant progress and most recent DM implementations in banking after 2013. By collecting and analyzing trends in research focus, data resources, technological aids, and data analytical tools, this paper contributes valuable insights into the future development of both DM and the banking sector, along with a comprehensive one-stop reference table. Moreover, we identify the key obstacles and present a summary for all interested parties facing the challenges of big data.
- Published
- 2018
50. Capturing the Patient's Perspective: a Review of Advances in Natural Language Processing of Health-Related Text
- Author
-
Guergana Savova, Graciela Gonzalez-Hernandez, Abeed Sarker, and Karen O'Connor
- Subjects
Medical informatics, computer science, Information Storage and Retrieval, noisy text, social media mining, Data Mining, Humans, social media, Natural Language Processing, General Medicine, digital library, data science, Consumer Health Informatics, systematic review, artificial intelligence
- Abstract
Background: Natural Language Processing (NLP) methods are increasingly being utilized to mine knowledge from unstructured health-related texts. Recent advances in noisy text processing techniques are enabling researchers and medical domain experts to go beyond the information encapsulated in published texts (e.g., clinical trials and systematic reviews) and structured questionnaires, and obtain perspectives from other unstructured sources such as Electronic Health Records (EHRs) and social media posts. Objectives: To review the recently published literature discussing the application of NLP techniques for mining health-related information from EHRs and social media posts. Methods: The literature review included research published over the last five years, based on searches of PubMed, conference proceedings, and the ACM Digital Library, as well as relevant publications referenced in papers. We particularly focused on the techniques employed on EHRs and social media data. Results: A set of 62 studies involving EHRs and 87 studies involving social media matched our criteria and were included in this paper. We present the purposes of these studies, outline the key NLP contributions, and discuss the general trends observed in the field, the current state of research, and important outstanding problems. Conclusions: Over recent years, there has been a continuing transition from lexical and rule-based systems to learning-based approaches, driven by the growth of annotated data sets and advances in data science. For EHRs, publicly available annotated data is still scarce, and this acts as an obstacle to research progress. In contrast, research on social media mining has seen rapid growth, particularly because the large amount of unlabeled data available via this resource compensates for the uncertainty inherent in the data. Effective mechanisms to filter out noise and to map social media expressions to standard medical concepts remain crucial open research problems. Shared tasks and other competitive challenges have been driving factors behind the implementation of open systems, and they are likely to play an imperative role in the development of future systems.
- Published
- 2017