3,805 results
Search Results
2. Data science as a language: challenges for computer science—a position paper
- Author
- Siebes, Arno
- Published
- 2018
- Full Text
- View/download PDF
3. Financial Data Mining: Appropriate Selection of Tools, Techniques and Algorithms
- Author
- Saxena, Akash, Sharma, Navneet, Saxena, Khushoo, Parikh, Satyen M., Barbosa, Simone Diniz Junqueira, Series Editor, Filipe, Joaquim, Series Editor, Kotenko, Igor, Series Editor, Sivalingam, Krishna M., Series Editor, Washio, Takashi, Series Editor, Yuan, Junsong, Series Editor, Zhou, Lizhu, Series Editor, Deshpande, A.V., editor, Unal, Aynur, editor, Passi, Kalpdrum, editor, Singh, Dharm, editor, Nayak, Malaya, editor, Patel, Bharat, editor, and Pathan, Shafi, editor
- Published
- 2018
- Full Text
- View/download PDF
4. Call for papers: Semantics-enabled biomedical literature analytics
- Author
- Halil Kilicoglu, Faezeh Ensan, Bridget McInnes, and Lucy Lu Wang
- Subjects
Data Science, Publications, Data Mining, Health Informatics, Semantics, Computer Science Applications - Published
- 2022
5. Data science as a language: challenges for computer science-a position paper
- Author
- Arno Siebes
- Subjects
Computer science, Big data, Field (computer science), Data science, Regular Paper, Pseudocode, Function (engineering), Data mining, Kolmogorov complexity, Applied Mathematics, Inductive inference, Inductive reasoning, Computer Science Applications, Management information systems, Computational Theory and Mathematics, Modeling and Simulation, Position paper, Information Systems - Abstract
In this paper, I posit that from a research point of view, Data Science is a language. More precisely, Data Science is doing science using computer science as a language for datafied sciences, much as mathematics is the language of, e.g., physics. From this viewpoint, three (classes of) challenges for computer science are identified, complementing the challenges the closely related Big Data problem already poses to computer science. I discuss the challenges with references to, in my opinion, related, interesting directions in computer science research; note, I claim neither that these directions are the most appropriate to solve the challenges nor that the cited references represent the best work in their field, they are inspirational to me. So, what are these challenges? Firstly, if computer science is to be a language, what should that language look like? While our traditional specifications such as pseudocode are an excellent way to convey what has been done, they fail for more mathematics-like reasoning about computations. Secondly, if computer science is to function as a foundation of other, datafied, sciences, its own foundations should be in order. While we have excellent foundations for supervised learning-e.g., by having loss functions to optimize and, more generally, by PAC learning (Valiant in Commun ACM 27(11):1134-1142, 1984)-this is far less true for unsupervised learning. Kolmogorov complexity-or, more generally, Algorithmic Information Theory-provides a solid base (Li and Vitanyi in An introduction to Kolmogorov complexity and its applications, Springer, Berlin, 1993). It provides an objective criterion to choose between competing hypotheses, but it lacks, e.g., an objective measure of the uncertainty of a discovery that datafied sciences need. Thirdly, datafied sciences come with new conceptual challenges. Data-driven scientists come up with data analysis questions that sometimes do, and sometimes don't, fit our conceptual toolkit.
Clearly, computer science does not suffer from a lack of interesting, deep, research problems. However, the challenges posed by data science point to a large reservoir of untapped problems. Interesting, stimulating problems, not least because they are posed by our colleagues in datafied sciences. It is an exciting time to be a computer scientist.
- Published
- 2017
6. Data Mining-based Coefficient of Influence Factors Optimization of Test Paper Reliability.
- Author
- Peiyao Xu, Huiping Jiang, and Jieyao Wei
- Subjects
DATA mining, TEACHING, DATA science, GRADING of students, SUPPORT vector machines, CLASSIFICATION algorithms - Abstract
Testing is a significant part of the teaching process: it demonstrates the final outcome of school teaching through teachers' teaching level and students' scores. Test paper analysis is a complex operation, with non-linear relations among paper length, time duration, and degree of difficulty. General methods therefore struggle to optimize the coefficients of the influence factors under different conditions so as to obtain test papers with clearly higher reliability [1]. With data mining techniques such as Support Vector Regression (SVR) and the Genetic Algorithm (GA), we can model test paper analysis and optimize the coefficients of the influence factors for higher reliability. The test results show that combining SVR and GA yields an effective improvement in reliability. The optimized coefficients of the influence factors are practicable in real applications, and the whole optimization procedure can serve as a model basis for test paper analysis.
- Published
- 2018
- Full Text
- View/download PDF
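The SVR-plus-GA approach in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the "reliability" function stands in for a trained SVR model, and the coefficient ranges, population size, and mutation rate are all invented for the example.

```python
import random

# Stand-in for a trained Support Vector Regression model: in the paper, an
# SVR model would predict reliability from the influence-factor coefficients.
# Here we use a made-up quadratic with a known optimum so the sketch runs
# on its own.
def predicted_reliability(coeffs):
    targets = [0.4, 0.6, 0.5]  # invented "ideal" coefficients
    return 1.0 - sum((c - t) ** 2 for c, t in zip(coeffs, targets))

def genetic_optimize(fitness, dim=3, pop_size=40, generations=60, seed=0):
    """Maximize `fitness` over [0, 1]^dim with a simple genetic algorithm."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]          # keep the fitter half
        children = []
        while len(children) < pop_size - len(elite):
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, dim)
            child = a[:cut] + b[cut:]          # one-point crossover
            if rng.random() < 0.2:             # occasional mutation
                i = rng.randrange(dim)
                child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

best = genetic_optimize(predicted_reliability)
```

The GA searches the coefficient space that the abstract describes as hard to optimize with general methods; swapping the stand-in function for a real SVR fit to historical test data would recover the paper's pipeline in outline.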
7. Paper review on data mining, components, and big data
- Author
- Ali Muwafaq Shaban, Muhmmad Shihab Muayad Shihab, Hussein Ali Alhalboosi, Ahmed Shamil Mustafa, Ahmed Muhi Shantaf, and Sufyan Sabah Al-Rifai
- Subjects
Data stream, Big Data, Computer science, Data stream mining, Big data, Data science, Field (computer science), Variety (cybernetics), Knowledge-based systems, Software, Scalability, Data Mining, Components - Abstract
2nd International Congress on Human-Computer Interaction, Optimization and Robotic Applications, HORA 2020, 26 June 2020 through 27 June 2020. Recent progress in software and hardware has allowed different data measurements in a variety of fields to be captured. These measurements are produced continuously at highly fluctuating data rates, for example by network sensors, web logs, and computer network traffic. The collection, querying, and removal of these data sets are computer-intensive activities. Mining data streams involves extracting knowledge structures, represented as models and patterns, from non-stop streams of information, and data stream research has grown because of the importance of its applications and the increasing generation of streaming information. Analysis applications for data streams range from critical scientific and astronomical applications to major business and financial applications. Algorithms, processes, and structures addressing streaming challenges have been developed over the last three years. In this review paper we present the newest work in this increasingly important field. © 2020 IEEE.
- Published
- 2020
8. How to cheat the page limit.
- Author
- Duivesteijn, Wouter, Hess, Sibylle, and Du, Xin
- Subjects
DATA mining, DATA science, MACHINE learning, SWINDLERS & swindling - Abstract
Every conference imposing a limit on the length of submissions must deal with the problem of page limit cheating: authors tweaking the parameters of the game so that they can squeeze more content into their paper. We claim that this problem is endemic, although we lack the data to formally prove this. Instead, this paper provides a far from exhaustive summary of ways to cheat the page limit, a case study involving the papers accepted for the Research and Applied Data Science tracks at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD) 2019, and a discussion of ways for program chairs to tackle this problem. Of the 130 accepted papers in these two ECMLPKDD 2019 tracks, 68 satisfied the page limit; 62 (47.7%) turned out to spill over the page limit, by as much as 50%. To misappropriate a phrase from Darrell Huff's "How to Lie with Statistics," we intend for this paper not to be a manual for swindlers; instead, nefarious paper authors already know these tricks, and honest program chairs must learn them in self-defense. This article is categorized under: Commercial, Legal, and Ethical Issues > Fairness in Data Mining
- Published
- 2020
- Full Text
- View/download PDF
9. Applications of educational data mining and learning analytics on data from cybersecurity training.
- Author
- Švábenský, Valdemar, Vykopal, Jan, Čeleda, Pavel, and Kraus, Lydia
- Subjects
DATA mining, INTERNET security, DATA science, LITERATURE reviews, COMPUTER security - Abstract
Cybersecurity professionals need hands-on training to prepare for managing the current advanced cyber threats. To practice cybersecurity skills, training participants use numerous software tools in computer-supported interactive learning environments to perform offensive or defensive actions. The interaction involves typing commands, communicating over the network, and engaging with the training environment. The training artifacts (data resulting from this interaction) can be highly beneficial in educational research. For example, in cybersecurity education, they provide insights into the trainees' learning processes and support effective learning interventions. However, this research area is not yet well-understood. Therefore, this paper surveys publications that enhance cybersecurity education by leveraging trainee-generated data from interactive learning environments. We identified and examined 3021 papers, ultimately selecting 35 articles for a detailed review. First, we investigated which data are employed in which areas of cybersecurity training, how, and why. Second, we examined the applications and impact of research in this area, and third, we explored the community of researchers. Our contribution is a systematic literature review of relevant papers and their categorization according to the collected data, analysis methods, and application contexts. These results provide researchers, developers, and educators with an original perspective on this emerging topic. To motivate further research, we identify trends and gaps, propose ideas for future work, and present practical recommendations. Overall, this paper provides in-depth insight into the recently growing research on collecting and analyzing data from hands-on training in security contexts.
- Published
- 2022
- Full Text
- View/download PDF
10. Quasi-experimental study designs series—paper 8: identifying quasi-experimental studies to inform systematic reviews
- Author
- Grace Wang, Marit Johansen, Andrew M. Jones, Hannah R. Rothstein, Ian Shemilt, Julie Glanville, Michelle Fiander, and John Eyers
- Subjects
Non-Randomized Controlled Trials as Topic, Epidemiology, Computer science, Search engine indexing, Guidelines as Topic, Social Welfare, Data science, Review Literature as Topic, Systematic review, Research Design, Quasi experimental study, Health care, Humans, Justice (ethics), Data mining, International development, Research evidence - Abstract
Objective This article reviews the available evidence and guidance on methods to identify reports of quasi-experimental (QE) studies to inform systematic reviews of health care, public health, international development, education, crime and justice, and social welfare. Study Design and Setting Research, guidance, and examples of search strategies were identified by searching a range of databases, key guidance documents, selected reviews, conference proceedings, and personal communication. Current practice and research evidence were summarized. Results Four thousand nine hundred twenty-four records were retrieved by database searches, and additional documents were obtained by other searches. QE studies are challenging to identify efficiently because they have no standardized nomenclature and may be indexed in various ways. Reliable search filters are not available. There is a lack of specific resources devoted to collecting QE studies and little evidence on where best to search. Conclusion Searches to identify QE studies should search a range of resources and, until indexing improves, use strategies that focus on the topic rather than the study design. Better definitions, better indexing in databases, prospective registers, and reporting guidance are required to improve the retrieval of QE studies and promote systematic reviews of what works based on the evidence from such studies.
- Published
- 2017
11. Challenges and Issues in Multisensor Fusion Approach for Fall Detection: Review Paper
- Author
- Amy Loutfi, Maria Lindén, and Gregory Koshmak
- Subjects
Engineering, Body movement, Aging society, Emergency situations, Data science, Control and Systems Engineering, Information and Communications Technology, Technology (General), Key (cryptography), Fall detection, Data mining, Electrical and Electronic Engineering, Instrumentation - Abstract
Emergency situations associated with falls are a serious concern for an aging society. Following recent developments within ICT, a significant number of solutions have been proposed to track body movement and detect falls using various sensor technologies, thereby facilitating fall detection and, in some cases, prevention. A number of recent reviews on fall detection methods using ICT technologies have emerged in the literature, and an increasingly popular approach considers combining information from several sensor sources to assess falls. The aim of this paper is to review in detail the subfield of fall detection techniques that explicitly considers the use of multisensor-fusion-based methods to assess and determine falls. The paper highlights key differences between the single-sensor-based approach and a multifusion one. It also describes and categorizes the various systems used, provides information on the challenges of a multifusion approach, and finally discusses trends for future work.
- Published
- 2016
12. Research Paper On Big data and Hadoop.
- Author
- Srivastava, Abhinav Krishna, Sharma, Rakshit, and Prince
- Subjects
BIG data, DATA science, ENGINEERS, DATA analysis, DATA mining - Abstract
In today's era, the term Big Data is meaningful in many respects: Big Data refers to the use and analysis of data that is large in volume. Big Data [1] is becoming very popular, and a great deal of research and analysis is going on in this field. Researchers use big data to analyze practical data and to find out its uses and applications based on the analysis. Data analysis is performed on various kinds of unstructured data, such as the social media posts [2] and images that can easily be found on Google. Our research paper gives an idea of Big Data and its applications in daily life. We also include the challenges faced by every data science engineer while analyzing the data. Our paper also discusses Hadoop and gives a basic understanding of it.
- Published
- 2021
13. Detection method of emerging leading papers using time transition
- Author
- Shino Iwami, Ichiro Sakata, Junichiro Mori, and Yuya Kajikawa
- Subjects
Citation network, Emerging technologies, Computer science, Transition (fiction), General Social Sciences, Library and Information Sciences, Data science, Computer Science Applications, Data mining, Time series, Citation, Centrality - Abstract
To survive the worldwide competition in research and development amid the current rapid increase of information, decision-makers and researchers need support in finding promising research fields and papers. But finding those fields in the heavy flood of available information is becoming difficult. We aim to develop a methodology that supports finding emerging leading papers with a bibliometric approach. This work analyzes four academic domains using our time transition analysis: after citation networks are constructed, the centralities of each paper are calculated and their changes are tracked; the centralities are then plotted, and the features of the leading papers are extracted. Based on these features, we propose ways to detect the leading papers by focusing on in-degree centrality and its transition. This work contributes to finding leading papers, and it helps decision-makers and researchers decide which research topics are worth investing their resources in.
- Published
- 2014
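The core mechanism of the abstract above, tracking how a paper's in-degree (incoming citations) changes over time, can be sketched in a few lines. The citation records and paper IDs below are invented for illustration; a real analysis would build the network from bibliographic data and use richer centrality measures.

```python
from collections import defaultdict

# Toy citation records: (citing_paper, cited_paper, year_of_citation).
citations = [
    ("p2", "p1", 2010), ("p3", "p1", 2011), ("p4", "p1", 2011),
    ("p4", "p2", 2011), ("p5", "p1", 2012), ("p5", "p3", 2012),
]

def indegree_by_year(records):
    """Cumulative in-degree of each cited paper at the end of each year."""
    per_year = defaultdict(lambda: defaultdict(int))
    for _, cited, year in records:
        per_year[cited][year] += 1
    series = {}
    for paper, counts in per_year.items():
        total, track = 0, {}
        for year in sorted(counts):
            total += counts[year]
            track[year] = total
        series[paper] = track
    return series

series = indegree_by_year(citations)
# A paper whose cumulative in-degree keeps rising year over year is a
# candidate "emerging leading paper" under the abstract's criterion.
```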
14. Predicting highly cited papers: A Method for Early Detection of Candidate Breakthroughs
- Author
- Laurel L. Haak, Charles J. Hackett, Ilya Ponomarev, Duane E. Williams, and Joshua D. Schnell
- Subjects
Computer science, Early detection, Scientometrics, Bibliometrics, Data science, Management of Technology and Innovation, Rare events, Science policy, Data mining, Business and International Management, Project portfolio management, Citation, Applied Psychology, Technology forecasting - Abstract
Scientific breakthroughs are rare events, and are usually recognized retrospectively. We developed methods for early detection of candidate breakthroughs based on the dynamics of publication citations, and used a quantitative approach to identify typical citation patterns of known breakthrough papers and of a larger group of highly cited papers. Based on these analyses, we proposed two forecasting models, which were validated using statistical methods to derive confidence levels. These findings can be used to inform research portfolio management practices.
- Published
- 2014
15. Layered evaluation of multi-criteria collaborative filtering for scientific paper recommendation
- Author
- Manouselis, N., Verbert, K., Alexandrov, V., Lees, M., Krzhizhanovskaya, V., Dongarra, J., and Sloot, P.M.A.
- Subjects
Computer science, multi-criteria decision making, Scale (chemistry), Intelligent decision support system, Context (language use), Plan (drawing), Recommender system, Data science, Multi criteria, Collaborative filtering, Recommender systems, General Earth and Planetary Sciences, Multi-Criteria Decision Making (MCDM), Relevance (information retrieval), Data mining, Evaluation, General Environmental Science - Abstract
Recommendation algorithms have been researched extensively to help people deal with the abundance of information. In recent years, the incorporation of multiple relevance criteria has attracted increased interest. Such multi-criteria recommendation approaches are researched as a paradigm for building intelligent systems that can be tailored to multiple interest indicators of end-users – such as combinations of implicit and explicit interest indicators in the form of ratings or ratings on multiple relevance dimensions. Nevertheless, evaluation of these recommendation techniques in the context of real-life applications still remains rather limited. Previous studies dealing with the evaluation of recommender systems have outlined that the performance of such algorithms is often dependent on the dataset – and indicate the importance of carrying out careful testing and parameterization. Especially when looking at large scale datasets, it becomes very difficult to deploy evaluation methods that may help in assessing the effect that different system components have on the overall design. In this paper, we study how layered evaluation can be applied for the case of a multi-criteria recommendation service that we plan to deploy for paper recommendation using the Mendeley dataset. The paper introduces layered evaluation and suggests two experiments that may help assess the components of the envisaged system separately. Published in Procedia Computer Science, vol. 18, pp. 1189–1197 (2013 International Conference on Computational Science, Barcelona, Spain, 5–7 June 2013).
- Published
- 2013
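The multi-criteria rating idea behind the abstract above can be illustrated with a minimal sketch: each paper is rated on several relevance criteria, and an overall score is formed as a weighted average before ranking. The criteria names, weights, and ratings are all invented; a real multi-criteria recommender would learn these weights per user from rating data.

```python
# Invented multi-criteria ratings (1-5 scale) for two hypothetical papers.
ratings = {
    "paper-a": {"novelty": 4, "clarity": 5, "relevance": 3},
    "paper-b": {"novelty": 2, "clarity": 4, "relevance": 5},
}
# Invented per-criterion weights; in a learned system these would be fitted.
weights = {"novelty": 0.5, "clarity": 0.2, "relevance": 0.3}

def overall_score(criteria):
    """Collapse per-criterion ratings into one score by weighted average."""
    return sum(weights[c] * r for c, r in criteria.items())

ranked = sorted(ratings, key=lambda p: overall_score(ratings[p]), reverse=True)
```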
16. Template for preparation of papers for IEEE sponsored conferences/symposia
- Author
- Lucia Sacchi, Valentina Tibollo, P. De Cata, Riccardo Bellazzi, Paola Leporati, Carlo Cerra, Arianna Dagliati, and Luca Chiovato
- Subjects
Engineering drawing, Engineering, Process (engineering), Dashboard (business), MEDLINE, Behavioral pattern, Pharmacy, Data science, Open data, Diabetes Mellitus, Type 2, Italy, Informatics, Agency (sociology), Health care, Data Mining, Electronic Health Records, Humans, Delivery of Health Care - Abstract
To improve access to medical information, it is necessary to design and implement integrated informatics techniques aimed at gathering data from different and heterogeneous sources. This paper describes the technologies used to integrate data coming from the electronic medical record of the IRCCS Fondazione Maugeri (FSM) hospital of Pavia, Italy, and combines them with administrative and pharmacy drug-purchase data coming from the local healthcare agency (ASL) of the Pavia area and with environmental open data of the same region. The integration process is focused on data from a cohort of one thousand patients diagnosed with Type 2 Diabetes Mellitus (T2DM). Data analysis and temporal data mining techniques have been integrated to enrich the initial dataset, making it possible to stratify patients using further information coming from the mined data, such as behavioral patterns of prescription-related drug purchases and other frequent clinical temporal patterns, through an intuitive dashboard-controlled system.
- Published
- 2016
17. Data in Business Process Models, A Preliminary Empirical Study (Short Paper)
- Author
- Alessandro Russo, Kevin Andrews, Manfred Reichert, Sebastian Steinau, Massimo Mecella, and Andrea Marrella
- Subjects
Business Process Model and Notation, Process modeling, Business process, Modeling language, Artifact-centric business process model, Computer science, Data mining, Business process modeling, Data science, Service-oriented modeling, Data modeling - Abstract
Traditional activity-centric process modeling languages treat data as simple black boxes acting as input or output for activities. Many alternative and emerging process modeling paradigms, such as case handling and artifact-centric process modeling, give data a more central role. This is achieved by introducing lifecycles and states for data objects, which is beneficial when modeling data- or knowledge-intensive processes. We assume that traditional activity-centric process modeling languages lack the capabilities to adequately capture the complexity of such processes. To verify this assumption we conducted an online interview among BPM experts. The results not only allow us to identify various profiles of persons modeling business processes, but also the problems that exist in contemporary modeling languages w.r.t. the modeling of business data. Overall, this preliminary empirical study confirms the necessity of data-awareness in process modeling notations in general.
- Published
- 2015
18. Review Paper: Data Mining of Fungal Secondary Metabolites Using Genomics and Proteomics
- Author
- Kaushal Kishore Sharma, Ruchi Sethi Gutch, and Aditi Tiwari
- Subjects
Exploit, Genomics, Data mining, Biology, Proteomics, Data science, Positive direction - Abstract
Fungi are versatile organisms; they exist on earth in all extremes of conditions. Fungi are sources of important chemical entities, which may be both beneficial and deleterious. Biotechnology has helped to harness this potential of fungi in a positive direction. The advancements in genomics and proteomics have opened up new horizons in research. Improved, advanced molecular biological technologies have boosted our understanding of genes and helped us to exploit the full potential of fungi. Bioinformatics and statistical sciences are indispensable in this regard. Databases are available, providing fast, efficient, and meaningful interpretation and analysis of the vast amounts of data generated in scientific laboratories.
- Published
- 2015
19. Optimized summarization of research papers as an aid for research scholars using data mining techniques
- Author
- Sunita R. Patil and Sunita M. Mahajan
- Subjects
Information retrieval, Scope (project management), Computer science, Automatic summarization, Data science, Field (computer science), Domain (software engineering), Reading (process), Similarity (psychology), Relevance (information retrieval), Data mining, Cluster analysis
Whenever research scholars start working on an innovative idea, they search for domain-specific technical research articles published in international journals, conferences, or workshops. The problems with these papers are similarity of content and repeated relevant information. Reading all these related papers completely, one by one, to get the latest research developments in the domain of interest is time-consuming, cumbersome, and often impractical. These problems are addressed by an innovative solution that optimizes and summarizes research papers as an aid for research scholars using various data mining strategies. This helps research scholars obtain short, condensed, accurate, and highly relevant summarized information, i.e., an overview of domain-specific, topic-based content from research papers. Data mining strategies such as extraction and clustering are used to identify 'Research Relevant Novel' (RRN) terms. Relevant sentences containing RRN terms are extracted from multiple papers using the 'Maximal Marginal Relevance' (MMR) criterion. In addition to optimized and summarized content, the paper also reports earlier and latest research developments, progress, challenges, and future scope in a particular field of study through various identified research categories, such as the research methods, techniques, and approaches used, comparisons with existing research, and starting material for further innovation.
- Published
- 2012
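The Maximal Marginal Relevance criterion named in the abstract above balances relevance to a query against redundancy with sentences already selected. A minimal sketch, using a simple word-overlap (Jaccard) similarity in place of the paper's actual similarity measure; the sentences, query, and trade-off parameter are invented for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (0.0 to 1.0)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def mmr_select(sentences, query, k=2, lam=0.3):
    """Greedy MMR: score = lam * relevance - (1 - lam) * redundancy."""
    selected, candidates = [], list(sentences)
    while candidates and len(selected) < k:
        def score(s):
            redundancy = max((jaccard(s, t) for t in selected), default=0.0)
            return lam * jaccard(s, query) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

docs = [
    "data mining techniques summarize research papers",
    "data mining techniques summarize research papers quickly",
    "clustering groups similar sentences together",
]
summary = mmr_select(docs, "summarize research papers with data mining")
```

With the low `lam` chosen here the redundancy penalty dominates, so the near-duplicate second sentence is skipped in favor of the novel third one, which is exactly the de-duplication behavior the abstract relies on.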
20. JRS’2012 Data Mining Competition: Topical Classification of Biomedical Research Papers
- Author
- Hung Son Nguyen, Sebastian Stawicki, Adam Krasuski, Andrzej Janusz, and Dominik Ślęzak
- Subjects
Multi-label classification, Information retrieval, Scope (project management), Test data generation, Computer science, CONTEST, Data science, Task (project management), Competition (economics), Explicit semantic analysis, Scalability, Data mining
We summarize the JRS'2012 Data Mining Competition on "Topical Classification of Biomedical Research Papers", held between January 2, 2012 and March 30, 2012 as an interactive on-line contest hosted on the TunedIT platform (http://tunedit.org). We present the scope and background of the challenge task, the evaluation procedure, the progress, and the results. We also present a scalable method for generating the contest data from biomedical research papers.
- Published
- 2012
21. Special Issue: Selection of Best Papers of the VLDB Data Management in Grids Workshop (VLDB DMG 2007)
- Author
- Harald Kosch and Jean-Marc Pierson
- Subjects
Computer Networks and Communications, Computer science, Data management, Data science, Computer Science Applications, Theoretical Computer Science, Very large database, Computational Theory and Mathematics, Data mining, Software, Selection (genetic algorithm) - Published
- 2008
22. Data mining model for classifying and prediction Nigeria paper currency notes
- Author
- OS Akinola and TO Adigun
- Subjects
Engineering, Currency, Data mining, Data science - Abstract
No Abstract.
- Published
- 2013
23. Short Paper: Troubleshooting Distributed Systems via Data Mining
- Author
- Nitesh V. Chawla, David A. Cieslak, and Douglas Thain
- Subjects
Computer science, Scale (chemistry), Reliability (computer networking), Distributed computing, Troubleshooting, Data science, Electronic mail, Statistical classification, Debugging, Data mining, Adaptation (computer science), Throughput (business)
Through massive parallelism, distributed systems enable the multiplication of productivity. Unfortunately, increasing the scale of machines available to users will also multiply debugging when failure occurs. Data mining allows the extraction of patterns within large amounts of data and therefore forms the foundation for a useful method of debugging, particularly within such distributed systems. This paper outlines a successful application of data mining in troubleshooting distributed systems, proposes a framework for further study, and speculates on other future work. We propose that data mining techniques can be applied to the problem of large scale troubleshooting. If both jobs and the resources that they consume are annotated with structured information relevant to success or failure, then classification algorithms can be used to find properties of each that correlate with success or failure. In the one-million jobs example above, an ideal troubleshooter would report to the user something like: Your jobs always fail on Linux 2.8 machines, always fail on cluster X between midnight and 6 A.M., and fail with 50% probability on machines owned by user Y. Further, these discoveries may be used to automatically avoid making bad placement decisions that waste time and resources. We hasten to note that this form of data mining is not a panacea. It does not explain why failures happen, or make any attempt to diagnose problems in fine detail. It only proposes to the user properties correlated with success or failure. Other tools and techniques may be applied to extract causes. Rather, data mining allows the user of a large system to rapidly make generalizations to improve the throughput and reliability of a system without engaging in low level debugging. These generalizations may be used later at leisure to locate and repair problems.
In addition, the problem of distributed debugging, with its unique idiosyncrasies and dynamics, lends itself as a compelling application for data mining research. Standard off-the-shelf methodologies might not be directly applicable for large, dynamic, and evolving systems. It is desired to implement techniques that are capable of incremental self-revision and adaptation. The goal of our paper is to serve as a proof-of-concept and identify venues for compelling future research.
- Published
- 2006
24. Invited Paper: Intelligent Data Mining Assistance via CBR and Ontologies
- Author
-
Sylvain Delisle, O. Cervantes, M. Charest, and Y. Shen
- Subjects
Computer science ,business.industry ,Process (engineering) ,Data stream mining ,Realization (linguistics) ,InformationSystems_DATABASEMANAGEMENT ,Concept mining ,Ontology (information science) ,computer.software_genre ,Data science ,Text mining ,Web mining ,Ontology ,Case-based reasoning ,Data mining ,business ,computer - Abstract
Most commercial data mining products provide a large number of models and tools for performing various data mining tasks, but few provide intelligent assistance for addressing many important decisions that must be considered during the mining process. In this paper, we propose the realization of a hybrid data mining assistant, based on the CBR paradigm and the use of an ontology, in order to empower the user during the various phases of the data mining process.
- Published
- 2006
25. Keynote Paper: Data Mining Researcher, Who is Your Customer? Some Issues Inspired by the Information Systems Field
- Author
-
S. Puuronen, Mykola Pechenizkiy, and Alexey Tsymbal
- Subjects
Work (electrical) ,Computer science ,Information system ,Identity (social science) ,Applied research ,Data mining ,computer.software_genre ,Data science ,computer ,Data warehouse ,Field (computer science) - Abstract
Data mining as an applied research field is still raising great expectations among organizations that want to increase the utility they get from their huge databases and data warehouses. Too few success stories exist of organizations that have managed to satisfy even some of those expectations. This situation is very similar to the one inside the information systems (IS) field, especially earlier but even currently. The recent lively debate about the identity of the IS discipline also included analysis of the customers of IS research. Inspired by IS researchers' insights related to the topic, we ask the question "who is our customer?" as data mining researchers. With this we want to open for discussion the border that limits which topics are 'acceptable' for a data mining researcher to work on. We suggest in this paper that the border should be shifted so that, besides technical concerns, at least some user- and organization-related research questions are also included.
- Published
- 2006
26. Human behavior analysis in the production and consumption of scientific knowledge across regions : A case study on publications in Scopus
- Author
-
Qasim, Muhammad Awais, Ul Hassan, Saeed, Aljohani, Naif Radi, and Lytras, Miltiadis D.
- Published
- 2017
- Full Text
- View/download PDF
27. Towards a Unified Approach to Information Integration - A review paper on data/information fusion
- Author
-
Paul D. Whitney, Xingye C. Lei, and Christian Posse
- Subjects
business.industry ,Process (engineering) ,Computer science ,computer.software_genre ,Sensor fusion ,Data science ,Expert system ,Field (computer science) ,Software ,Software design ,Data mining ,business ,computer ,Data integration ,Information integration - Abstract
Information or data fusion from different sources is ubiquitous in many applications, from epidemiology, medical, biological, political, and intelligence to military applications. Data fusion involves integration of spectral, imaging, text, and many other sensor data. For example, in epidemiology, information is often obtained from many studies conducted by different researchers in different regions with different protocols. In the medical field, the diagnosis of a disease is often based on imaging (MRI, X-ray, CT), clinical examination, and lab results. In the biological field, information is obtained from studies conducted on many different species. In the military field, information is obtained from radar sensors, text messages, chemical and biological sensors, acoustic sensors, optical warning, and many other sources. Many methodologies are used in the data integration process, from classical and Bayesian to evidence-based expert systems. The implementation of data integration ranges from pure software design to a mixture of software and hardware. In this review we summarize the methodologies and implementations of the data fusion process, and illustrate in more detail the methodologies involved in three examples. We propose a unified multi-stage and multi-path mapping approach to the data fusion process, and point out future prospects and challenges.
- Published
- 2005
28. Best papers from the Fifth International Conference on Advanced Data Mining and Applications (ADMA 2009)
- Author
-
Xue Li, Qiang Yang, Jian Pei, João Gama, and Ronghuai Huang
- Subjects
Human-Computer Interaction ,Information retrieval ,Artificial Intelligence ,Hardware and Architecture ,Computer science ,Data mining ,computer.software_genre ,computer ,Data science ,Software ,Information Systems - Published
- 2011
29. The most influential papers from the ISSTA research community (panel)
- Author
-
Richard A. Kemmerer, Debra J. Richardson, Edward F. Miller, and Richard G. Hamlet
- Subjects
Computer science ,Research community ,Data mining ,computer.software_genre ,Data science ,computer - Published
- 1998
30. Data mining and data visualization: Position paper for the second IEEE workshop on database issues for data visualization
- Author
-
Bhavani Thuraisingham and Georges Grinstein
- Subjects
Information retrieval ,Database ,business.industry ,Computer science ,computer.software_genre ,Data science ,Data warehouse ,Metadata ,Identification (information) ,Information visualization ,Data visualization ,Data access ,Knowledge extraction ,Data mining ,business ,Cluster analysis ,computer - Abstract
The government, corporate, and industrial communities are faced with an ever increasing number of databases. These databases need not only to be managed, but also explored. The first requires secure access to distributed heterogeneous multimedia databases with rich metadata while meeting timing constraints. The second requires exploratory tools supporting the identification of domain and mission critical elements such as patterns in data access (e.g., security breach determinations), patterns in data (e.g., marketing and clustering), or patterns in transactions (e.g., data compression), to cite a few. Knowledge Discovery in Databases is a relatively new research area that employs a variety of tools to explore and identify structure and patterns in these large databases. Often the data is preprocessed to facilitate such computations (data warehousing). The data is then mined for specific rules that are built incrementally and often steered by users with a specific set of goals in mind.
- Published
- 1996
31. Computational techniques to counter terrorism: a systematic survey.
- Author
-
Saini, Jaspal Kaur and Bansal, Divya
- Abstract
Terrorist Network Analysis (TNA) is the field of analyzing and defining the scope of terrorism and researching countermeasures in order to handle the exponentially increasing threats due to ever-growing terrorist activities. This field constitutes several sub-domains such as crawling data about terrorist attacks/groups, classification, behavioral analysis, and predictive analysis. In this paper we present a systematic review of TNA which includes the study of different terrorist groups and attack characteristics, and the use of online social networks, machine learning techniques, and data mining tools to counter terrorism. Our survey is divided into three sections of TNA: Data Collection, Analysis Approaches, and Future Directions. Each section highlights the major research achievements in order to present active use of research methodology to counter terrorism. Furthermore, the metrics used for TNA have been thoroughly studied and identified. The paper has been written with the intent of providing all the necessary background to researchers who plan to carry out similar studies in this emerging field of TNA. Our contributions to the TNA field concern the effective utilization of computational techniques from data mining, machine learning, and online social networks, and highlighting the research gaps and challenges in various sub-domains. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
32. What the papers say: Text mining for genomics and systems biology
- Author
-
Wendy Filsell, Nathan Harmston, and Michael P. H. Stumpf
- Subjects
lcsh:QH426-470 ,Systems biology ,lcsh:Medicine ,Review ,Biology ,Bioinformatics ,Reinventing the wheel ,Terminology as Topic ,Drug Discovery ,Genetics ,systems medicine ,Molecular Biology ,Disadvantage ,Information explosion ,Systems Biology ,lcsh:R ,hypothesis generation ,Genomics ,data mining ,literature processing ,Biomedical text mining ,Data science ,Systems medicine ,lcsh:Genetics ,If and only if ,Molecular Medicine ,Periodicals as Topic ,Publication Bias ,Know-how - Abstract
Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining -- the automated extraction of information from (electronically) published sources -- could potentially fulfil an important role -- but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward.
- Published
- 2010
33. An exploratory text analysis of the autophagy research field
- Author
-
Yoshitaka Kurikawa, Noboru Mizushima, and Willa Wen-You Yim
- Subjects
Topic model ,RNA, Untranslated ,Impact factor ,HUGO Gene Nomenclature Committee ,Subject (documents) ,Cell Biology ,Biology ,Data science ,Latent Dirichlet allocation ,Field (geography) ,Variety (cybernetics) ,Gene nomenclature ,symbols.namesake ,Autophagy ,symbols ,Data Mining ,Molecular Biology ,Research Paper - Abstract
After its discovery in the 1950s, the autophagy research field has seen its annual number of publications climb from tens to thousands. The ever-growing number of autophagy publications is a wealth of information but presents a challenge to researchers, especially those new to the field, who are looking for a general overview to, for example, determine current topics or formulate new hypotheses. Here, we employed text mining tools to extract research trends in the autophagy field, including those of genes, terms, and topics. The publication trend of the field can be separated into three phases. The exponential rise in publication number began in the last phase and is most likely spurred by a series of highly cited research papers published in previous phases. The exponential increase in papers has resulted in a larger variety of research topics, with the majority involving those that are directly physiologically relevant, such as disease and modulating autophagy. Our findings provide researchers a summary of the history of the autophagy research field and perhaps hints of what is to come. Abbreviations: 5Y-IF: 5-year impact factor; AIS: article influence score; EM: electron microscopy; HGNC: HUGO gene nomenclature committee; LDA: latent Dirichlet allocation; MeSH: medical subject headings; ncRNA: non-coding RNA
- Published
- 2021
34. On Some Scientific Results of the IMTA-VIII-2022: 8th International Workshop "Image Mining: Theory and Applications".
- Author
-
Gurevich, Igor B., Moroni, Davide, Pascali, Maria Antonietta, and Yashina, Vera V.
- Abstract
The publication presents an introductory paper to the Special issue of the international journal Pattern Recognition and Image Analysis: Advances in Mathematical Theory and Applications of the Russian Academy of Sciences. The main scientific results of the 8th International Workshop "Image Mining: Theory and Applications," held on August 21, 2022, Montreal, Canada, are presented. Historical information is given on this series of international workshops, and their significant role in the development of the theory and practice of automation of image analysis, pattern recognition, and artificial intelligence is emphasized. The list of papers of the Special issue of PRIA, prepared based on the invited and regular papers selected and recommended for publication by the Program Committee of the IMTA-VIII-2022, is presented. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
35. Big Data: Opportunities and Challenges in Libraries, a Systematic Literature Review.
- Author
-
Garoufallou, Emmanouel and Gaitanou, Panorea
- Subjects
BIG data ,DATA mining ,DATABASES ,DATA science ,DATA modeling - Abstract
Currently "Big Data" is an emerging field that presents several Information Technology challenges regarding the capture, storage, search, structure, and visualization of this data. The real challenge for organizations is to find ways to extract value from it and provide better services to their clients. The data generated in academic and other institutions is vast and complex. Libraries face new challenges as they seek to determine their role in the handling of Big Data within their organization and use it to develop services. Thus, in most organizations, libraries will not have the knowledge to build new services unaided. Furthermore, libraries have always been information handlers and technology adopters; therefore, Big Data technologies will certainly affect their context. The purpose of this paper is to explore all these issues through a systematic literature review, unveiling the theories that underpin the paper's argument. It attempts to answer several research questions, such as: how are librarians involved in the Big Data era? And what are the future research developments of Big Data within the library context? The study considered only papers published between 2012 and 2018 in English and presents the collected literature by grouping them according to the type of library each paper refers to. Thus, it identifies new and evolving roles in the context of all types of libraries. In addition, the study presents several interesting tables, which aim to help librarians locate relevant articles that will inform their practice and guide service development for users of large and complex datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
36. Variable Selection for Meaningful Clustering of Multitopic Territorial Data.
- Author
-
Angerri, Xavier and Gibert, Karina
- Subjects
DATABASES ,FEATURE selection ,KNOWLEDGE acquisition (Expert systems) ,ELECTRONIC data processing ,DATA mining ,CLUSTER analysis (Statistics) - Abstract
This paper proposes a new methodology to improve territorial cohesion in clustering processes where many variables from different topics are considered. Clustering techniques provide added value to identify typologies, but there are still unsolved challenges when data contain an unbalanced number of variables from different topics. The territorial feature selection method (TFSM) is presented as a method to select the representative variable of each topic such that the interpretability of resulting clusters is preserved and the geographical cohesion is improved with respect to classical approaches. This paper also introduces the thermometer as a new knowledge acquisition tool that allows experts to transfer semantics to the data mining process. TFSM proposes the index of potential explainability ( E k ) as the criteria to select the most promising variables for clustering. E k is based on the combination of inferential testing and metrics such as support. The proposal is applied with the INSESS-COVID19 database, where territorial groups of vulnerable populations were found. A set of 195 variables with 21 unbalanced thematic blocks is used to compare the results with a traditional multiview clustering analysis with promising results from both the geographical and the thematic point of view and the capacity to support further decision making. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
37. Injury Patterns and Impact on Performance in the NBA League Using Sports Analytics.
- Author
-
Sarlis, Vangelis, Papageorgiou, George, and Tjortjis, Christos
- Subjects
ATHLETIC leagues ,BASKETBALL ,PROFESSIONAL sports ,SPORTS medicine ,SPORTS nutrition - Abstract
This research paper examines Sports Analytics, focusing on injury patterns in the National Basketball Association (NBA) and their impact on players' performance. It employs a unique dataset to identify common NBA injuries, determine the most affected anatomical areas, and analyze how these injuries influence players' post-recovery performance. This study's novelty lies in its integrative approach that combines injury data with performance metrics and salary data, providing new insights into the relationship between injuries and economic and on-court performance. It investigates the periodicity and seasonality of injuries, seeking patterns related to time and external factors. Additionally, it examines the effect of specific injuries on players' per-match analytics and performance, offering perspectives on the implications of injury rehabilitation for player performance. This paper contributes significantly to sports analytics, assisting coaches, sports medicine professionals, and team management in developing injury prevention strategies, optimizing player rotations, and creating targeted rehabilitation plans. Its findings illuminate the interplay between injuries, salaries, and performance in the NBA, aiming to enhance player welfare and the league's overall competitiveness. With a comprehensive and sophisticated analysis, this research offers unprecedented insights into the dynamics of injuries and their long-term effects on athletes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
38. A hybrid data-mining framework for train rescheduling strategy pattern discovery.
- Author
-
Chen, Ruirui, Ge, Xuekai, Huang, Ping, and Wen, Chao
- Subjects
DATA mining ,AUTOMATIC extracting (Information science) ,DATABASE searching ,ALGORITHM research ,DATA science - Abstract
This study presents a hybrid data-mining framework based on feature selection algorithms and clustering methods to perform the pattern discovery of high-speed railway train rescheduling strategies (RSs). The proposed model is composed of two stages. In the first stage, decision tree, random forest, gradient boosting decision tree (GBDT) and extreme gradient boosting (XGBoost) models are used to investigate the importance of features. The features that have a high influence on RSs are first selected. In the second stage, a K-means clustering method is used to uncover the interdependences between RSs and the influencing features, based on the results of the first stage. The proposed method can determine the quantitative relationships between RSs and influencing factors. The results clearly show the influences of the factors on RSs, the possibilities of different train operation RSs under different situations, as well as some key time periods and key trains that the controllers should pay more attention to. The research in this paper can help train traffic controllers better understand train operation patterns and provides direction for optimizing rail traffic RSs. [ABSTRACT FROM AUTHOR]
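The two-stage shape of this framework can be sketched generically. This is a hedged illustration, not the paper's method: the synthetic data, the use of a single GBDT model for ranking (the paper combines several tree ensembles), and the keep-half cut-off are all assumptions made for the sketch.

```python
# Two-stage sketch: tree-based feature ranking, then K-means on the retained
# features. Synthetic data and the cut-off are invented for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))              # six candidate influencing factors
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # strategy label driven by two of them

# Stage 1: rank features by importance and keep the strongest three.
gbdt = GradientBoostingClassifier(random_state=0).fit(X, y)
keep = np.argsort(gbdt.feature_importances_)[-3:]

# Stage 2: cluster on the selected features to uncover strategy patterns.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[:, keep])
print("kept features:", sorted(keep.tolist()),
      "cluster sizes:", np.bincount(km.labels_).tolist())
```

The filtering step matters here: clustering only on the influential factors keeps the resulting groups interpretable in terms of the rescheduling strategies.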
- Published
- 2024
- Full Text
- View/download PDF
39. The Case Study Method in Philosophy of Science: An Empirical Study.
- Author
-
Mizrahi, Moti
- Subjects
PHILOSOPHY of science ,CASE studies ,PHILOSOPHY methodology ,DATA science ,DATA mining - Abstract
There is an ongoing methodological debate in philosophy of science concerning the use of case studies as evidence for and/or against theories about science. In this paper, I aim to make a contribution to this debate by taking an empirical approach. I present the results of a systematic survey of the PhilSci-Archive, which suggest that a sizeable proportion of papers in philosophy of science contain appeals to case studies, as indicated by the occurrence of the indicator words "case study" and/or "case studies." These results are confirmed by data mined from the JSTOR database on research articles published in leading journals in the field: Philosophy of Science, the British Journal for the Philosophy of Science (BJPS), and the Journal for General Philosophy of Science (JGPS), as well as the Proceedings of the Biennial Meeting of the Philosophy of Science Association (PSA). The data also show upward trends in appeals to case studies in articles published in Philosophy of Science, the BJPS, and the JGPS. The empirical work I have done for this paper provides philosophers of science who are wary of the use of case studies as evidence for and/or against theories about science with a way to do philosophy of science that is informed by data rather than case studies. [ABSTRACT FROM AUTHOR]
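The indicator-word survey described in this abstract boils down to a corpus count, which can be sketched as follows. The sample texts are invented; a real replication would run this over the PhilSci-Archive or JSTOR full texts.

```python
# Minimal sketch of an indicator-word survey: count papers that contain
# "case study" / "case studies". The sample texts are invented.
import re

papers = [
    "We defend the theory with a detailed case study of the Copernican revolution.",
    "A formal model of confirmation, with no historical examples.",
    "Two case studies from nineteenth-century physics support the claim.",
]

pattern = re.compile(r"\bcase stud(?:y|ies)\b", re.IGNORECASE)
with_cases = sum(1 for text in papers if pattern.search(text))
print(f"{with_cases}/{len(papers)} papers appeal to case studies")
```

Grouping such counts by publication year would reproduce the kind of upward trend lines the paper reports for Philosophy of Science, the BJPS, and the JGPS.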
- Published
- 2020
- Full Text
- View/download PDF
40. Special issue on advances in data, information and knowledge engineering in data science era.
- Author
-
Bellatreche, Ladjel and Tjoa, A Min
- Subjects
DATA science ,LINKED data (Semantic Web) ,DATA mining ,COMPUTER science ,SOFTWARE engineering - Abstract
In recent years, Data Science emerged as a new and important discipline. SOFSEM provides an interesting forum for establishing and reinforcing the prerequisites of Data Science, covering topics related to Fundamental Computer Science, Data and Knowledge Management, and Software Engineering. [Extracted from the article]
- Published
- 2022
- Full Text
- View/download PDF
41. Preprocessing and Artificial Intelligence for Increasing Explainability in Mental Health.
- Author
-
Angerri, X. and Gibert, Karina
- Subjects
ARTIFICIAL intelligence ,MENTAL health ,DATA mining ,DATA analysis - Abstract
This paper shows the added value of using existing domain-specific knowledge to generate new derived variables that complement a target dataset, and the benefits of including these new variables in further data analysis methods. The main contribution of the paper is a methodology to generate these new variables as part of preprocessing, under a double approach: creating 2nd-generation knowledge-driven variables that capture the expert criteria used for reasoning in the field, or 3rd-generation data-driven indicators created by clustering original variables. Data Mining and Artificial Intelligence techniques like Clustering or Traffic Light Panels help to obtain successful results. Some results of the INSESS-COVID19 project are presented: basic descriptive analysis gives simple results that, even though useful to support basic policy-making, especially in health, yield a much richer global perspective after derived variables are included. When 2nd-generation variables are available and can be introduced into the method for creating 3rd-generation data, added value is obtained from both basic analysis and building new data-driven indicators. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
42. DATA SCIENCE EDUCATION - A SCOPING REVIEW.
- Author
-
Msweli, Nkosikhona Theoren, Mawela, Tendani, and Twinomurinzi, Hossana
- Subjects
DATA science ,SCIENCE education ,DATA mining ,COMPUTER science ,RESEARCH personnel ,ELECTRONIC journals - Abstract
Aim/Purpose This study aimed to evaluate the extant research on data science education (DSE) to identify the existing gaps, opportunities, and challenges, and make recommendations for current and future DSE. Background There has been an increase in the number of data science programs especially because of the increased appreciation of data as a multidisciplinary strategic resource. This has resulted in a greater need for skills in data science to extract meaningful insights from data. However, the data science programs are not enough to meet the demand for data science skills. While there is growth in data science programs, they appear more as a rebranding of existing engineering, computer science, mathematics, and statistics programs. Methodology A scoping review was adopted for the period 2010-2021 using six scholarly multidisciplinary databases: Google Scholar, IEEE Xplore, ACM Digital Library, ScienceDirect, Scopus, and the AIS Basket of eight journals. The study was narrowed down to 91 research articles and adopted a classification coding framework and correlation analysis for analysis. Contribution We theoretically contribute to the growing body of knowledge about the need to scale up data science through multidisciplinary pedagogies and disciplines as the demand grows. This paves the way for future research to understand which programs can provide current and future data scientists the skills and competencies relevant to societal needs. Findings The key results revealed the limited emphasis on DSE, especially in non-STEM (Science, Technology, Engineering, and Mathematics) disciplines. In addition, the results identified the need to find a suitable pedagogy or a set of pedagogies because of the multidisciplinary nature of DSE. Further, there is currently no existing framework to guide the design and development of DSE at various education levels, leading to sometimes inadequate programs. 
The study also noted the importance of various stakeholders who can contribute towards DSE and thus create opportunities in the DSE ecosystem. Most of the research studies reviewed were case studies that presented more STEM programs as compared to non-STEM. Recommendations for Practitioners We recommend the CRoss Industry Standard Process for Data Mining (CRISP-DM) as a framework for adopting collaborative pedagogies to teach data science. This research implies that it is important for academia, policymakers, and data science content developers to work closely with organizations to understand their needs. Recommendations for Researchers We recommend future research into programs that can provide current and future data scientists the skills and competencies relevant to societal needs and how interdisciplinarity within these programs can be integrated. Impact on Society Data science expertise is essential for tackling societal issues and generating beneficial effects. The main problem is that data is diverse and always changing, necessitating ongoing (up)skilling. Academic institutions must therefore stay current with new advances, changing data, and organizational requirements. Industry experts might share views based on their practical knowledge. The DSE ecosystem can be shaped by collaborating with numerous stakeholders and being aware of each stakeholder's function in order to advance data science internationally. Future Research The study found that there are a number of research opportunities that can be explored to improve the implementation of DSE, for instance: how can CRISP-DM be integrated into collaborative pedagogies to provide a fully comprehensive data science curriculum? [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
43. Machine learning and big data analytics in bipolar disorder
- Author
-
Luciano Minuzzi, Erkki Isometsä, Elisa Brietzke, Diego Librenza-Garcia, Anne Duffy, Martin Alda, Benson Mwangi, Flávio Kapczinski, Rodrigo B. Mansur, Boris Birmaher, Bartholomeus C M Haarman, Roger S. McIntyre, Lars Vedel Kessing, Raymond W. Lam, Lakshmi N. Yatham, Pedro Ballester, Tomas Hajek, Ives Cavalcante Passos, Carlos López Jaramillo, and Rodrigo C. Barros
- Subjects
SYMPTOMS ,Computer science ,Big data ,Scientific literature ,computer.software_genre ,Field (computer science) ,Terminology ,risk prediction ,0302 clinical medicine ,big data ,SCHIZOPHRENIA ,NEUROPROGRESSION ,bipolar disorder ,RISK ,ASSOCIATION ,Prognosis ,DEPRESSION ,3. Good health ,Psychiatry and Mental health ,Phenotype ,machine learning ,LITHIUM RESPONSE ,MOOD DISORDERS ,Schizophrenia (object-oriented programming) ,Advisory Committees ,Clinical Decision-Making ,education ,Machine learning ,Risk Assessment ,CLASSIFICATION ,Suicidal Ideation ,03 medical and health sciences ,medicine ,Humans ,Bipolar disorder ,Biological Psychiatry ,PREDICTING SUICIDALITY ,business.industry ,Deep learning ,predictive psychiatry ,Data Science ,deep learning ,data mining ,medicine.disease ,personalized psychiatry ,030227 psychiatry ,Position paper ,Artificial intelligence ,business ,computer ,030217 neurology & neurosurgery - Abstract
OBJECTIVES: The International Society for Bipolar Disorders Big Data Task Force assembled leading researchers in the fields of bipolar disorder (BD), machine learning, and big data with extensive experience to evaluate the rationale of machine learning and big data analytics strategies for BD. METHOD: A task force was convened to examine and integrate findings from the scientific literature related to machine learning and big data-based studies, to clarify terminology, and to describe challenges and potential applications in the field of BD. We also systematically searched PubMed, Embase, and Web of Science for articles published up to January 2019 that used machine learning in BD. RESULTS: The results suggested that big data analytics has the potential to provide risk calculators to aid in treatment decisions and predict clinical prognosis, including suicidality, for individual patients. This approach can advance diagnosis by enabling discovery of more relevant data-driven phenotypes, as well as by predicting transition to the disorder in high-risk unaffected subjects. We also discuss the most frequent challenges that big data analytics applications can face, such as heterogeneity, lack of external validation and replication of some studies, cost and non-stationary distribution of the data, and lack of appropriate funding. CONCLUSION: Machine learning-based studies, including atheoretical data-driven big data approaches, provide an opportunity to more accurately detect those who are at risk, parse relevant phenotypes, and inform treatment selection and prognosis. However, several methodological challenges need to be addressed in order to translate research findings to clinical settings.
- Published
- 2019
44. Towards an ELSA Curriculum for Data Scientists.
- Author
-
Christoforaki, Maria and Beyan, Oya Deniz
- Subjects
CONSCIOUSNESS raising ,DATA mining ,DATA science ,SCIENCE projects ,CURRICULUM - Abstract
The use of artificial intelligence (AI) applications in a growing number of domains in recent years has put into focus the ethical, legal, and societal aspects (ELSA) of these technologies and the relevant challenges they pose. In this paper, we propose an ELSA curriculum for data scientists aiming to raise awareness about ELSA challenges in their work, provide them with a common language with the relevant domain experts in order to cooperate to find appropriate solutions, and finally, incorporate ELSA in the data science workflow. ELSA should not be seen as an impediment or a superfluous artefact but rather as an integral part of the Data Science Project Lifecycle. The proposed curriculum uses the CRISP-DM (CRoss-Industry Standard Process for Data Mining) model as a backbone to define a vertical partition expressed in modules corresponding to the CRISP-DM phases. The horizontal partition includes knowledge units belonging to three strands that run through the phases, namely ethical and societal, legal and technical rendering knowledge units (KUs). In addition to the detailed description of the aforementioned KUs, we also discuss their implementation, issues such as duration, form, and evaluation of participants, as well as the variance of the knowledge level and needs of the target audience. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
45. DAPS diagrams for defining Data Science projects.
- Author
-
de Mast, Jeroen and Lokkerbol, Joran
- Subjects
SCIENCE projects ,DATA science ,BIG data ,BUSINESS analytics ,OPERATIONS research ,DATA mining - Abstract
Background: Models for structuring big-data and data-analytics projects typically start with a definition of the project's goals and the business value they are expected to create. The literature identifies proper project definition as crucial for a project's success, and also recognizes that the translation of business objectives into data-analytic problems is a difficult task. Unfortunately, common project structures, such as CRISP-DM, provide little guidance for this crucial stage when compared to subsequent project stages such as data preparation and modeling. Contribution: This paper contributes structure to the project-definition stage of data-analytic projects by proposing the Data-Analytic Problem Structure (DAPS). The diagrammatic technique facilitates the collaborative development of a consistent and precise definition of a data-analytic problem, and the articulation of how it contributes to the organization's goals. In addition, the technique helps to identify important assumptions, and to break down large ambitions in manageable subprojects. Methods: The semi-formal specification technique took other models for problem structuring — common in fields such as operations research and business analytics — as a point of departure. The proposed technique was applied in 47 real data-analytic projects and refined based on the results, following a design-science approach. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
46. Fueling Clinical and Translational Research in Appalachia: Informatics Platform Approach
- Author
-
Uma Sundaram, Gouthami Kothakapu, Usha Murughiyan, Niharika Bhardwaj, and Alfred A. Cecchetti
- Subjects
Computer science ,Population ,Computer applications to medicine. Medical informatics ,R858-859.7 ,health care disparities ,Health Informatics ,Translational research ,data warehousing ,Health informatics ,03 medical and health sciences ,0302 clinical medicine ,Health Information Management ,Health care ,medical informatics ,data visualization ,030212 general & internal medicine ,Appalachian region ,education ,education.field_of_study ,Original Paper ,business.industry ,data mining ,Data science ,Data warehouse ,electronic health records ,machine learning ,Analytics ,Informatics ,data science ,Translational science ,business ,030217 neurology & neurosurgery - Abstract
Background The Appalachian population is distinct, not just culturally and geographically but also in its health care needs, facing the most health care disparities in the United States. To meet these unique demands, Appalachian medical centers need an arsenal of analytics and data science tools with the foundation of a centralized data warehouse to transform health care data into actionable clinical interventions. However, this is an especially challenging task given the fragmented state of medical data within Appalachia and the need for integration of other types of data such as environmental, social, and economic with medical data. Objective This paper aims to present the structure and process of the development of an integrated platform at a midlevel Appalachian academic medical center along with its initial uses. Methods The Appalachian Informatics Platform was developed by the Appalachian Clinical and Translational Science Institute’s Division of Clinical Informatics and consists of 4 major components: a centralized clinical data warehouse, modeling (statistical and machine learning), visualization, and model evaluation. Data from different clinical systems, billing systems, and state- or national-level data sets were integrated into a centralized data warehouse. The platform supports research efforts by enabling curation and analysis of data using the different components, as appropriate. Results The Appalachian Informatics Platform is functional and has supported several research efforts since its implementation for a variety of purposes, such as increasing knowledge of the pathophysiology of diseases, risk identification, risk prediction, and health care resource utilization research and estimation of the economic impact of diseases. 
Conclusions The platform provides an inexpensive yet seamless way to translate clinical and translational research ideas into clinical applications for regions similar to Appalachia that have limited resources and a largely rural population.
- Published
- 2020
47. Open Agile text mining for bioinformatics: the PubAnnotation ecosystem
- Author
-
Toyofumi Fujiwara, Shujiro Okuda, Yue Wang, K. Bretonnel Cohen, Tiffany J. Callahan, and Jin-Dong Kim
- Subjects
Statistics and Probability ,PubMed ,Computer science ,computer.software_genre ,Biochemistry ,Personalization ,Task (project management) ,03 medical and health sciences ,Annotation ,0302 clinical medicine ,Text mining ,Pregnancy ,Data Mining ,Humans ,Use case ,Molecular Biology ,Ecosystem ,Natural Language Processing ,030304 developmental biology ,0303 health sciences ,business.industry ,Computational Biology ,Original Papers ,Data science ,Computer Science Applications ,Computational Mathematics ,Computational Theory and Mathematics ,030220 oncology & carcinogenesis ,Female ,Data and Text Mining ,Web service ,business ,Precision and recall ,computer ,Agile software development - Abstract
Motivation Most currently available text mining tools share two characteristics that make them less than optimal for use by biomedical researchers: they require extensive specialist skills in natural language processing and they were built on the assumption that they should optimize global performance metrics on representative datasets. This is a problem because most end-users are not natural language processing specialists and because biomedical researchers often care less about global metrics like F-measure or representative datasets than they do about more granular metrics such as precision and recall on their own specialized datasets. Thus, there are fundamental mismatches between the assumptions of much text mining work and the preferences of potential end-users. Results This article introduces the concept of Agile text mining, and presents the PubAnnotation ecosystem as an example implementation. The system approaches the problems from two perspectives: it allows the reformulation of text mining by biomedical researchers from the task of assembling a complete system to the task of retrieving warehoused annotations, and it makes it possible to do very targeted customization of the pre-existing system to address specific end-user requirements. Two use cases are presented: assisted curation of the GlycoEpitope database, and assessing coverage in the literature of pre-eclampsia-associated genes. Availability and implementation The three tools that make up the ecosystem, PubAnnotation, PubDictionaries and TextAE are publicly available as web services, and also as open source projects. The dictionaries and the annotation datasets associated with the use cases are all publicly available through PubDictionaries and PubAnnotation, respectively.
- Published
- 2019
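PubAnnotation (entry 47) reframes text mining as retrieving warehoused annotations, which it serves as JSON documents. The sketch below shows how such a payload might be consumed. The document shape (`text` plus `denotations` carrying `span.begin`/`span.end` offsets and an `obj` label) follows the publicly documented PubAnnotation annotation format, but the sample content and the helper name `extract_denotations` are illustrative assumptions, not taken from the paper.

```python
def extract_denotations(doc):
    """Pull (surface string, label) pairs out of a PubAnnotation-style
    annotation document: {"text": ..., "denotations": [...]}."""
    text = doc["text"]
    pairs = []
    for d in doc.get("denotations", []):
        span = d["span"]
        # Each denotation labels a character span of the source text.
        pairs.append((text[span["begin"]:span["end"]], d["obj"]))
    return pairs

# Hypothetical sample payload in the PubAnnotation JSON shape.
sample = {
    "text": "BRCA1 is associated with pre-eclampsia.",
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Gene"},
        {"id": "T2", "span": {"begin": 25, "end": 38}, "obj": "Disease"},
    ],
}
```

In the ecosystem described by the paper, such documents would be fetched from the PubAnnotation web service rather than constructed locally; the parsing step is the same either way.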
48. Forward feature selection: empirical analysis.
- Author
-
Kamalov, Firuz, Elnaffar, Said, Cherukuri, Aswani, and Jonnalagadda, Annapurna
- Subjects
FEATURE selection ,TIME complexity ,BIG data ,DATA science ,ALGORITHMS - Abstract
Feature selection is an important preprocessing step in many data science and machine learning applications. Although several sophisticated feature selection algorithms exist, their benefits are sometimes overshadowed by their complexity and slow execution. Therefore, in many cases, a simpler algorithm may be better suited. In this paper, we demonstrate that a rudimentary forward selection algorithm can achieve optimal performance with low time complexity. Our study is based on an extensive empirical evaluation of the forward feature selection algorithm in the context of linear regression. Concretely, we compare the forward selection algorithm against the gold-standard exhaustive search algorithm on several datasets. The results show that the forward selection algorithm achieves high performance with relatively fast execution. Given its simplicity, accuracy, and speed, we recommend the forward feature selection algorithm as a primary feature selection method for most regression applications. Our results are particularly pertinent in the case of big data and real-time analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
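The greedy procedure evaluated in entry 48 can be sketched for linear regression roughly as follows. This is a minimal illustration, assuming ordinary least squares with an intercept and residual sum of squares as the selection criterion; the paper's exact experimental setup may differ.

```python
import numpy as np

def forward_selection(X, y, k):
    """Greedy forward feature selection for linear regression.

    At each step, add the feature that most reduces the residual
    sum of squares (RSS) when fit jointly with those already chosen.
    """
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(k):
        best_rss, best_j = None, None
        for j in remaining:
            # Fit OLS with an intercept on the candidate feature set.
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ coef) ** 2))
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

The speed advantage the paper quantifies comes from fitting only O(p·k) models here, versus the C(p, k) subsets an exhaustive search must evaluate.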
49. Data science curriculum in the iField.
- Author
-
Zhang, Yin, Wu, Dan, Hagen, Loni, Song, Il‐Yeol, Mostafa, Javed, Oh, Sam, Anderson, Theresa, Shah, Chirag, Bishop, Bradley Wade, Hopfgartner, Frank, Eckert, Kai, Federer, Lisa, and Saltz, Jeffrey S.
- Subjects
DATA science ,COMMITTEES ,LEADERSHIP ,JOB descriptions ,CURRICULUM ,UNDERGRADUATE programs ,GRADUATE education ,DATA analytics ,INFORMATION technology ,DELPHI method ,DATA mining - Abstract
Many disciplines, including the broad Field of Information (iField), offer Data Science (DS) programs. There have been significant efforts exploring an individual discipline's identity and unique contributions to the broader DS education landscape. To advance DS education in the iField, the iSchool Data Science Curriculum Committee (iDSCC) was formed and charged with building and recommending a DS education framework for iSchools. This paper reports on the research process and findings of a series of studies to address important questions: What is the iField identity in the multidisciplinary DS education landscape? What is the status of DS education in iField schools? What knowledge and skills should be included in the core curriculum for iField DS education? What are the jobs available for DS graduates from the iField? What are the differences between graduate‐level and undergraduate‐level DS education? Answers to these questions will not only distinguish an iField approach to DS education but also define critical components of DS curriculum. The results will inform individual DS programs in the iField to develop curriculum to support undergraduate and graduate DS education in their local context. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
50. Social Media Text Mining Framework for Drug Abuse: Development and Validation Study With an Opioid Crisis Case Analysis
- Author
-
Yong Wang, Tareq Nasralah, and Omar F. El-Gayar
- Subjects
020205 medical informatics ,Computer science ,Substance-Related Disorders ,media_common.quotation_subject ,social media ,Health Informatics ,02 engineering and technology ,text mining ,Ontology (information science) ,lcsh:Computer applications to medicine. Medical informatics ,infodemiology ,03 medical and health sciences ,infoveillance ,0302 clinical medicine ,0202 electrical engineering, electronic engineering, information engineering ,medicine ,Data Mining ,Humans ,Social media ,Quality (business) ,030212 general & internal medicine ,Opioid Epidemic ,media_common ,drug abuse ,Original Paper ,Web search query ,Data Collection ,lcsh:Public aspects of medicine ,lcsh:RA1-1270 ,medicine.disease ,Data science ,Substance abuse ,Identification (information) ,Data quality ,Infoveillance ,lcsh:R858-859.7 ,opioid crisis - Abstract
Background Social media are considered promising and viable sources of data for gaining insights into various disease conditions and patients’ attitudes, behaviors, and medications. They can be used to recognize communication and behavioral themes of problematic use of prescription drugs. However, mining and analyzing social media data have challenges and limitations related to topic deduction and data quality. As a result, we need a structured approach to analyze social media content related to drug abuse in a manner that can mitigate the challenges and limitations surrounding the use of such data. Objective This study aimed to develop and evaluate a framework for mining and analyzing social media content related to drug abuse. The framework is designed to mitigate challenges and limitations related to topic deduction and data quality in social media data analytics for drug abuse. Methods The proposed framework started with defining different terms related to the keywords, categories, and characteristics of the topic of interest. We then used the Crimson Hexagon platform to collect data based on a search query informed by a drug abuse ontology developed using the identified terms. We subsequently preprocessed the data and examined the quality using an evaluation matrix. Finally, a suitable data analysis approach could be used to analyze the collected data. Results The framework was evaluated using the opioid epidemic as a drug abuse case analysis. We demonstrated the applicability of the proposed framework to identify public concerns toward the opioid epidemic and the most discussed topics on social media related to opioids. The results from the case analysis showed that the framework could improve the discovery and identification of topics in social media domains characterized by a plethora of highly diverse terms and lack of a commonly available dictionary or language by the community, such as in the case of opioid and drug abuse. 
Conclusions The proposed framework addressed the challenges related to topic detection and data quality. We demonstrated the applicability of the proposed framework to identify the common concerns toward the opioid epidemic and the most discussed topics on social media related to opioids.
- Published
- 2020