14,145 results on '"Synthetic data"'
Search Results
2. Meta-TadGAN: Time Series Anomaly Detection Using TadGAN with Meta-features
- Author
Silva, Inês Oliveira e, Soares, Carlos, Cerqueira, Vitor, Rodrigues, Arlete, Bastardo, Pedro, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Santos, Manuel Filipe, editor, Machado, José, editor, Novais, Paulo, editor, Cortez, Paulo, editor, and Moreira, Pedro Miguel, editor
- Published
- 2025
3. Synthetic Data for Robust Identification of Typical and Atypical Serotonergic Neurons Using Convolutional Neural Networks
- Author
Corradetti, Daniele, Bernardi, Alessandro, Corradetti, Renato, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Santos, Manuel Filipe, editor, Machado, José, editor, Novais, Paulo, editor, Cortez, Paulo, editor, and Moreira, Pedro Miguel, editor
- Published
- 2025
4. Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images
- Author
Zhang, David Junhao, Xu, Mutian, Wu, Jay Zhangjie, Xue, Chuhui, Zhang, Wenqing, Han, Xiaoguang, Bai, Song, Shou, Mike Zheng, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
- Published
- 2025
5. PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation
- Author
Li, Zhenyu, Bhat, Shariq Farooq, Wonka, Peter, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
- Published
- 2025
6. Practical and Ethical Considerations for Generative AI in Medical Imaging
- Author
Jha, Debesh, Rauniyar, Ashish, Hagos, Desta Haileselassie, Sharma, Vanshali, Tomar, Nikhil Kumar, Zhang, Zheyuan, Isler, Ilkin, Durak, Gorkem, Wallace, Michael, Yazici, Cemal, Berzin, Tyler, Biswas, Koushik, Bagci, Ulas, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Puyol-Antón, Esther, editor, Zamzmi, Ghada, editor, Feragen, Aasa, editor, King, Andrew P., editor, Cheplygina, Veronika, editor, Ganz-Benjaminsen, Melanie, editor, Ferrante, Enzo, editor, Glocker, Ben, editor, Petersen, Eike, editor, Baxter, John S. H., editor, Rekik, Islem, editor, and Eagleson, Roy, editor
- Published
- 2025
7. On Differentially Private 3D Medical Image Synthesis with Controllable Latent Diffusion Models
- Author
Daum, Deniz, Osuala, Richard, Riess, Anneliese, Kaissis, Georgios, Schnabel, Julia A., Di Folco, Maxime, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Mukhopadhyay, Anirban, editor, Oksuz, Ilkay, editor, Engelhardt, Sandy, editor, Mehrof, Dorit, editor, and Yuan, Yixuan, editor
- Published
- 2025
8. FISHing in Uncertainty: Synthetic Contrastive Learning for Genetic Aberration Detection
- Author
Gutwein, Simon, Kampel, Martin, Taschner-Mandl, Sabine, Licandro, Roxane, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Sudre, Carole H., editor, Mehta, Raghav, editor, Ouyang, Cheng, editor, Qin, Chen, editor, Rakic, Marianne, editor, and Wells, William M., editor
- Published
- 2025
9. Learning Domain-Invariant Spatio-Temporal Visual Cues for Video-Based Crowd Panic Detection
- Author
Calle, Javier, Unzueta, Luis, Leskovsky, Peter, García, Jorge, Akhgar, Babak, Series Editor, Gkotsis, Ilias, editor, Kavallieros, Dimitrios, editor, Stoianov, Nikolai, editor, Vrochidis, Stefanos, editor, and Diagourtas, Dimitrios, editor
- Published
- 2025
10. AFreeCA: Annotation-Free Counting for All
- Author
D’Alessandro, Adriano, Mahdavi-Amiri, Ali, Hamarneh, Ghassan, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
- Published
- 2025
11. New Metrics to Benchmark and Improve BIM Visibility Within a Synthetic Image Generation Process for Computer Vision Progress Tracking
- Author
Nunez-Morales, Juan D., Hsu, Shun-Hsiang, Ibrahim, Amir, Golparvar-Fard, Mani, di Prisco, Marco, Series Editor, Chen, Sheng-Hong, Series Editor, Vayas, Ioannis, Series Editor, Kumar Shukla, Sanjay, Series Editor, Sharma, Anuj, Series Editor, Kumar, Nagesh, Series Editor, Wang, Chien Ming, Series Editor, Cui, Zhen-Dong, Series Editor, Lu, Xinzheng, Series Editor, Desjardins, Serge, editor, Poitras, Gérard J., editor, and Nik-Bakht, Mazdak, editor
- Published
- 2025
12. Private measures, random walks, and synthetic data.
- Author
Boedihardjo, March, Vershynin, Roman, and Strohmer, Thomas
- Subjects
Differential privacy, Random walks, Synthetic data
- Abstract
Differential privacy is a mathematical concept that provides an information-theoretic security guarantee. While differential privacy has emerged as a de facto standard for guaranteeing privacy in data sharing, the known mechanisms to achieve it come with some serious limitations. Utility guarantees are usually provided only for a fixed, a priori specified set of queries. Moreover, there are no utility guarantees for more complex (but very common) machine learning tasks such as clustering or classification. In this paper we overcome some of these limitations. Working with metric privacy, a powerful generalization of differential privacy, we develop a polynomial-time algorithm that creates a private measure from a data set. This private measure allows us to efficiently construct private synthetic data that are accurate for a wide range of statistical analysis tools. Moreover, we prove an asymptotically sharp min-max result for private measures and synthetic data in general compact metric spaces, for any fixed privacy budget ε bounded away from zero. A key ingredient in our construction is a new superregular random walk, whose joint distribution of steps is as regular as that of independent random variables, yet which deviates from the origin logarithmically slowly.
- Published
- 2024
13. Synthetic Data: Methods, Use Cases, and Risks
- Author
De Cristofaro, Emiliano
- Subjects
Information and Computing Sciences, Human-Centred Computing, Synthetic data, Data privacy, Data models, Privacy, Training, Training data, Security, Computation Theory and Mathematics, Computer Software, Data Format, Strategic, Defence & Security Studies, Cybersecurity and privacy
- Published
- 2024
14. Development of a cerebellar ataxia diagnosis model using conditional GAN-based synthetic data generation for visuomotor adaptation task.
- Author
Kim, Jinah, Woo, Sung-Ho, Kim, Taekyung, Yoon, Won Tae, Shin, Jung Hwan, Lee, Jee-Young, and Ryu, Jeh-Kwang
- Subjects
*GENERATIVE adversarial networks, *CEREBELLAR ataxia, *DEEP learning, *DIGITAL health, *EARLY diagnosis
- Abstract
This study proposes a synthetic data generation model to create a classification framework for cerebellar ataxia patients using trajectory data from the visuomotor adaptation task. The classification objectives include patients with cerebellar ataxia, age-matched normal individuals, and young healthy subjects. Synthetic data for the three classes is generated based on class conditions and random noise by leveraging a combination of conditional adversarial generative neural networks and reconstruction networks. This synthetic data, alongside real data, is utilized as training data for the patient classification model to enhance classification accuracy. The fidelity of the synthetic data is assessed visually to measure the validity and diversity of the generated data qualitatively while quantitatively evaluating distribution similarity to real data. Furthermore, the clinical efficacy of the patient classification model employing synthetic data is demonstrated by showcasing improved classification accuracy through a comparative analysis between results obtained using solely real data and those obtained when both real and synthetic data are utilized. This methodological approach holds promise in addressing data insufficiency in the digital healthcare domain, employing deep learning methodologies, and developing early disease diagnosis tools. [ABSTRACT FROM AUTHOR]
- Published
- 2024
15. Yet Another Discriminant Analysis (YADA): A Probabilistic Model for Machine Learning Applications.
- Author
Field Jr., Richard V., Smith, Michael R., Wuest, Ellery J., and Ingram, Joe B.
- Subjects
*DISTRIBUTION (Probability theory), *MARGINAL distributions, *DEEP learning, *DISCRIMINANT analysis, *DECISION making
- Abstract
This paper presents a probabilistic model for various machine learning (ML) applications. While deep learning (DL) has produced state-of-the-art results in many domains, DL models are complex and over-parameterized, which leads to high uncertainty about what the model has learned, as well as its decision process. Further, DL models are not probabilistic, making reasoning about their output challenging. In contrast, the proposed model, referred to as Yet Another Discriminant Analysis (YADA), is less complex than other methods, is based on a mathematically rigorous foundation, and can be utilized for a wide variety of ML tasks including classification, explainability, and uncertainty quantification. YADA is thus competitive in most cases with many state-of-the-art DL models. Ideally, a probabilistic model would represent the full joint probability distribution of its features, but doing so is often computationally expensive and intractable. Hence, many probabilistic models assume that the features are either normally distributed, mutually independent, or both, which can severely limit their performance. YADA is an intermediate model that (1) captures the marginal distributions of each variable and the pairwise correlations between variables and (2) explicitly maps features to the space of multivariate Gaussian variables. Numerous mathematical properties of the YADA model can be derived, thereby improving the theoretic underpinnings of ML. Validation of the model can be statistically verified on new or held-out data using native properties of YADA. However, there are some engineering and practical challenges that we enumerate to make YADA more useful. [ABSTRACT FROM AUTHOR]
- Published
- 2024
16. Synthetic Data for Deep Learning in Computer Vision & Medical Imaging: A Means to Reduce Data Bias.
- Author
Paproki, Anthony, Salvado, Olivier, and Fookes, Clinton
- Published
- 2024
17. Noninvasive Deep Learning Analysis for Smith–Magenis Syndrome Classification.
- Author
Núñez-Vidal, Esther, Fernández-Ruiz, Raúl, Álvarez-Marquina, Agustín, Hidalgo-delaGuía, Irene, Garayzábal-Heinze, Elena, Hristov-Kalamov, Nikola, Domínguez-Mateos, Francisco, Conde, Cristina, and Martínez-Olalla, Rafael
- Subjects
ARTIFICIAL neural networks, VOICE analysis, SPEECH synthesis, RARE diseases, GENETIC testing
- Abstract
Smith–Magenis syndrome (SMS) is a rare, underdiagnosed condition due to limited public awareness of genetic testing and a lengthy diagnostic process. Voice analysis can be a noninvasive tool for monitoring and detecting SMS. In this paper, the cepstral peak prominence and mel-frequency cepstral coefficients are used as disease monitoring and detection metrics. In addition, an efficient neural network, incorporating synthetic data processes, was used to detect SMS in a cohort of individuals with the disease. Three study cases were conducted with a set of 19 SMS patients and 292 controls. The three study cases employed various oversampling and undersampling techniques, including SMOTE, random oversampling, NearMiss, random undersampling, and 16 additional methods, resulting in balanced accuracies ranging from 69% to 92%. This is the first study using a neural network model to focus on a rare genetic syndrome using phonation analysis data. By using synthetic data (oversampling and undersampling) and a CNN, it was possible to detect SMS with high levels of accuracy. Voice analysis and deep learning techniques have proven to be a useful and noninvasive method. This is a finding that may help in the complex identification of this syndrome as well as other rare diseases. [ABSTRACT FROM AUTHOR]
- Published
- 2024
18. Completing 3D point clouds of individual trees using deep learning.
- Author
Bornand, Aline, Abegg, Meinrad, Morsdorf, Felix, and Rehush, Nataliia
- Subjects
POINT cloud, DECIDUOUS plants, ARTIFICIAL intelligence, TREE height, CLOUD forests
- Abstract
In close‐range remote sensing data collected in a forest, occlusion often causes incomplete or sparse point cloud representations of individual trees, impeding accurate 3D reconstruction of tree architecture and estimation of tree height and volume. Recent developments in deep learning (DL) for 3D data have produced approaches for point cloud completion, which could potentially be applied to trees. We explored the potential of a DL approach to fill gaps in dense point clouds representing the main structures of deciduous trees by applying an existing transformer‐based completion model (PoinTr). Complete point clouds are required as training data, but even dense terrestrial laser scanning (TLS) data sets contain gaps caused by occlusion, making it nearly impossible to acquire such data. We therefore investigated the ability of point cloud completion models trained on a range of synthetic data sets to handle occlusion patterns in real‐world point clouds. Despite the limited data set, we successfully fine‐tuned a general pre‐trained completion model to fill gaps within 1 m³ segments of tree point clouds. Fine‐tuning on synthetic tree data improved the model's ability to complete tree objects compared with training on diverse artificial objects. However, the quality of the predictions was influenced by the level of sophistication of the synthetic data. Our results demonstrate that incorporating even limited real‐world TLS data during training can considerably improve completion results but may introduce additional noise in the predictions. 3D point cloud completion with DL has the potential to improve and fill gaps in point clouds of individual trees, facilitating further steps in the processing and analysis of 3D forest data. [ABSTRACT FROM AUTHOR]
- Published
- 2024
19. Towards objective and systematic evaluation of bias in artificial intelligence for medical imaging.
- Author
Stanley, Emma A M, Souza, Raissa, Winder, Anthony J, Gulve, Vedant, Amador, Kimberly, Wilms, Matthias, and Forkert, Nils D
- Abstract
Objective Artificial intelligence (AI) models trained using medical images for clinical tasks often exhibit bias in the form of subgroup performance disparities. However, since not all sources of bias in real-world medical imaging data are easily identifiable, it is challenging to comprehensively assess their impacts. In this article, we introduce an analysis framework for systematically and objectively investigating the impact of biases in medical images on AI models. Materials and Methods Our framework utilizes synthetic neuroimages with known disease effects and sources of bias. We evaluated the impact of bias effects and the efficacy of 3 bias mitigation strategies in counterfactual data scenarios on a convolutional neural network (CNN) classifier. Results The analysis revealed that training a CNN model on the datasets containing bias effects resulted in expected subgroup performance disparities. Moreover, reweighing was the most successful bias mitigation strategy for this setup. Finally, we demonstrated that explainable AI methods can aid in investigating the manifestation of bias in the model using this framework. Discussion The value of this framework is showcased in our findings on the impact of bias scenarios and efficacy of bias mitigation in a deep learning model pipeline. This systematic analysis can be easily expanded to conduct further controlled in silico trials in other investigations of bias in medical imaging AI. Conclusion Our novel methodology for objectively studying bias in medical imaging AI can help support the development of clinical decision-support tools that are robust and responsible. [ABSTRACT FROM AUTHOR]
- Published
- 2024
20. Synthetic data for privacy-preserving clinical risk prediction.
- Author
Qian, Zhaozhi, Callender, Thomas, Cebere, Bogdan, Janes, Sam M., Navani, Neal, and van der Schaar, Mihaela
- Subjects
*DATA release, *PROGNOSTIC models, *MACHINE learning, *LUNG cancer, *INFORMATION sharing
- Abstract
Synthetic data promise privacy-preserving data sharing for healthcare research and development. Compared with other privacy-enhancing approaches—such as federated learning—analyses performed on synthetic data can be applied downstream without modification, such that synthetic data can act in place of real data for a wide range of use cases. However, the role that synthetic data might play in all aspects of clinical model development remains unknown. In this work, we used state-of-the-art generators explicitly designed for privacy preservation to create a synthetic version of ever-smokers in the UK Biobank before building prognostic models for lung cancer under several data release assumptions. We demonstrate that synthetic data can be effectively used throughout the medical prognostic modeling pipeline even without eventual access to the real data. Furthermore, we show the implications of different data release approaches on how synthetic biobank data could be deployed within the healthcare system. [ABSTRACT FROM AUTHOR]
- Published
- 2024
21. Method for Enhancing AI Accuracy in Pressure Injury Detection Using Real and Synthetic Datasets.
- Author
Kim, Jaeseung, Kim, Mujung, Youn, Heejun, Lee, Seunghyun, Kwon, Soonchul, and Park, Kyung Hee
- Subjects
PRESSURE ulcers, STABLE Diffusion, ARTIFICIAL intelligence, IMAGE recognition (Computer vision), NOSOLOGY, CHATBOTS
- Abstract
Pressure injuries pose significant health risks, especially for the elderly, immobile individuals, and those with sensory impairments. These injuries can rapidly become chronic, making initial diagnosis important. Due to the difficulty of transporting patients from local health facilities to higher-level general hospitals for treatment, it is essential to utilize telemedicine tools, such as chatbots, to ensure rapid initial diagnosis. Recent advances in artificial intelligence have demonstrated potential for medical imaging and disease classification. Ongoing research in the field of dermatological diseases focuses on disease classification. However, the assessment accuracy of artificial intelligence is often limited by unequal class distributions and insufficient dataset quantities. In this study, we aim to enhance the accuracy of artificial intelligence models by generating synthetic datasets. Specifically, we focused on training models for Pressure Injury assessment using both real and synthetic datasets. We used PI data at a domestic medical university. As part of our supplementary research, we established a chatbot system to facilitate the assessment of pressure injuries. Using both constructed and synthetic data, we achieved a top-1 accuracy of 92.03%. The experimental results demonstrate that combining real and synthetic data significantly improves model accuracy. These findings suggest that synthetic datasets can be effectively utilized to address the limitations of small-scale datasets in medical applications. Future research should explore the use of diverse synthetic data generation methods and validate model performance on a variety of datasets to enhance the generalization and robustness of AI models for Pressure Injury assessment. [ABSTRACT FROM AUTHOR]
- Published
- 2024
22. Insulin Resistance and Impaired Insulin Secretion Predict Incident Diabetes: A Statistical Matching Application to the Two Korean Nationwide, Population-Representative Cohorts.
- Author
Hyemin Jo, Soyeon Ahn, Jung Hun Ohn, Cheol Min Shin, Eunjeong Ji, Donggil Kim, Sung Jae Jung, and Joongyub Lee
- Subjects
*DATA privacy, *STATISTICAL matching, *INSULIN resistance, *NATIONAL health insurance, *MEDICAL screening
- Abstract
Background: To evaluate whether insulin resistance and impaired insulin secretion are useful predictors of incident diabetes in Koreans using nationwide population-representative data to enhance data privacy. Methods: This study analyzed the data of individuals without diabetes aged >40 years from the Korea National Health and Nutrition Examination Survey (KNHANES) 2007–2010 and 2015 and the National Health Insurance Service-National Health Screening Cohort (NHIS-HEALS). Owing to privacy concerns, these databases cannot be linked using direct identifiers. Therefore, we generated 10 synthetic datasets, followed by statistical matching with the NHIS-HEALS. Homeostasis model assessment of insulin resistance (HOMA-IR) and homeostasis model assessment of β-cell function (HOMA-β) were used as indicators of insulin resistance and insulin secretory function, respectively, and diabetes onset was captured in NHIS-HEALS. Results: A median of 4,580 (range, 4,463 to 4,761) adults were included in the analyses after statistical matching of 10 synthetic KNHANES and NHIS-HEALS datasets. During a mean follow-up duration of 5.8 years, a median of 4.7% (range, 4.3% to 5.0%) of the participants developed diabetes. Compared to the reference low–HOMA-IR/high–HOMA-β group, the high–HOMA-IR/low–HOMA-β group had the highest risk of diabetes, followed by high–HOMA-IR/high–HOMA-β group and low–HOMA-IR/low–HOMA-β group (median adjusted hazard ratio [ranges]: 3.36 [1.86 to 6.05], 1.81 [1.01 to 3.22], and 1.68 [0.93 to 3.04], respectively). Conclusion: Insulin resistance and impaired insulin secretion are robust predictors of diabetes in the Korean population. A retrospective cohort constructed by combining cross-sectional synthetic and longitudinal claims-based cohort data through statistical matching may be a reliable resource for studying the natural history of diabetes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
23. An Image-Based Sensor System for Low-Cost Airborne Particle Detection in Citizen Science Air Quality Monitoring.
- Author
Ali Shah, Syed Mohsin, Casado-Mansilla, Diego, and López-de-Ipiña, Diego
- Subjects
*IMAGE processing, *ENVIRONMENTAL monitoring, *AIR pollution, *COMMUNITY involvement, *RANDOM noise theory, *AIR quality monitoring
- Abstract
Air pollution poses significant public health risks, necessitating accurate and efficient monitoring of particulate matter (PM). These organic compounds may be released from natural sources like trees and vegetation, as well as from anthropogenic, or human-made sources including industrial activities and motor vehicle emissions. Therefore, measuring PM concentrations is paramount to understanding people's exposure levels to pollutants. This paper introduces a novel image processing technique utilizing photographs/pictures of Do-it-Yourself (DiY) sensors for the detection and quantification of PM10 particles, enhancing community involvement and data collection accuracy in Citizen Science (CS) projects. A synthetic data generation algorithm was developed to overcome the challenge of data scarcity commonly associated with citizen-based data collection to validate the image processing technique. This algorithm generates images by precisely defining parameters such as image resolution, image dimension, and PM airborne particle density. To ensure these synthetic images mimic real-world conditions, variations like Gaussian noise, focus blur, and white balance adjustments and combinations were introduced, simulating the environmental and technical factors affecting image quality in typical smartphone digital cameras. The detection algorithm for PM10 particles demonstrates robust performance across varying levels of noise, maintaining effectiveness in realistic mobile imaging conditions. Therefore, the methodology retains sufficient accuracy, suggesting its practical applicability for environmental monitoring in diverse real-world conditions using mobile devices. [ABSTRACT FROM AUTHOR]
- Published
- 2024
24. Synthetic Data for Video Surveillance Applications of Computer Vision: A Review.
- Author
Delussu, Rita, Putzu, Lorenzo, and Fumera, Giorgio
- Subjects
*OBJECT recognition (Computer vision), *COMPUTER vision, *IMAGE analysis, *BEHAVIORAL assessment, *APPLICATION software, *VIDEO surveillance, *DEEP learning
- Abstract
In recent years, there has been a growing interest in synthetic data for several computer vision applications, such as automotive, detection and tracking, surveillance, medical image analysis and robotics. Early use of synthetic data was aimed at performing controlled experiments under the analysis by synthesis approach. Currently, synthetic data are mainly used for training computer vision models, especially deep learning ones, to address well-known issues of real data, such as manual annotation effort, data imbalance and bias, and privacy-related restrictions. In this work, we survey the use of synthetic training data focusing on applications related to video surveillance, whose relevance has rapidly increased in the past few years due to their connection to security: crowd counting, object and pedestrian detection and tracking, behaviour analysis, person re-identification and face recognition. Synthetic training data are even more interesting in this kind of application, to address further, specific issues arising, e.g., from typically unconstrained image or video acquisition conditions and cross-scene application scenarios. We categorise and discuss the existing methods for creating synthetic data, analyse the synthetic data sets proposed in the literature for each of the considered applications, and provide an overview of their effectiveness as training data. We finally discuss whether and to what extent the existing synthetic data sets mitigate the issues of real data, highlight existing open issues, and suggest future research directions in this field. [ABSTRACT FROM AUTHOR]
- Published
- 2024
25. Complete blood count as a biomarker for preeclampsia with severe features diagnosis: a machine learning approach.
- Author
Araújo, Daniella Castro, de Macedo, Alexandre Afonso, Veloso, Adriano Alonso, Alpoim, Patricia Nessralla, Gomes, Karina Braga, Carvalho, Maria das Graças, and Dusse, Luci Maria SantAna
- Subjects
*MACHINE learning, *BLOOD cell count, *DATA augmentation, *ARTIFICIAL intelligence, *STATISTICAL smoothing
- Abstract
Objective: This study introduces the complete blood count (CBC), a standard prenatal screening test, as a biomarker for diagnosing preeclampsia with severe features (sPE), employing machine learning models. Methods: We used a boosting machine learning model fed with synthetic data generated through a new methodology called DAS (Data Augmentation and Smoothing). Using data from a Brazilian study including 132 pregnant women, we generated 3,552 synthetic samples for model training. To improve interpretability, we also provided a ridge regression model. Results: Our boosting model obtained an AUROC of 0.90±0.10, sensitivity of 0.95, and specificity of 0.79 to differentiate sPE and non-PE pregnant women, using CBC parameters of neutrophils count, mean corpuscular hemoglobin (MCH), and the aggregate index of systemic inflammation (AISI). In addition, we provided a ridge regression equation using the same three CBC parameters, which is fully interpretable and achieved an AUROC of 0.79±0.10 to differentiate the both groups. Moreover, we also showed that a monocyte count lower than 490/mm³ yielded a sensitivity of 0.71 and specificity of 0.72. Conclusion: Our study showed that ML-powered CBC could be used as a biomarker for sPE diagnosis support. In addition, we showed that a low monocyte count alone could be an indicator of sPE. Significance: Although preeclampsia has been extensively studied, no laboratory biomarker with favorable cost-effectiveness has been proposed. Using artificial intelligence, we proposed to use the CBC, a low-cost, fast, and well-spread blood test, as a biomarker for sPE. [ABSTRACT FROM AUTHOR]
- Published
- 2024
26. Bias Mitigation via Synthetic Data Generation: A Review.
- Author
Shahul Hameed, Mohamed Ashik, Qureshi, Asifa Mehmood, and Kaushik, Abhishek
- Subjects
ARTIFICIAL intelligence, HEALTH equity, DATA quality, FAIRNESS, FORECASTING
- Abstract
Artificial intelligence (AI) is widely used in healthcare applications to perform various tasks. Although these models have great potential to improve the healthcare system, they have also raised significant ethical concerns, including biases that increase the risk of health disparities in medical applications. The under-representation of a specific group can lead to bias in the datasets that are being replicated in the AI models. These disadvantaged groups are disproportionately affected by bias because they may have less accurate algorithmic forecasts or underestimate the need for treatment. One solution to eliminate bias is to use synthetic samples or artificially generated data to balance datasets. Therefore, the purpose of this study is to review and evaluate how synthetic data can be generated and used to mitigate biases, specifically focusing on the medical domain. We explored high-quality peer-reviewed articles that were focused on synthetic data generation to eliminate bias. These studies were selected based on our defined inclusion criteria and exclusion criteria and the quality of the content. The findings reveal that generated synthetic data can help improve accuracy, precision, and fairness. However, the effectiveness of synthetic data is closely dependent on the quality of the data generation process and the initial datasets used. The study also highlights the need for continuous improvement in synthetic data generation techniques and the importance of evaluation metrics for fairness in AI models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
27. Efficient Generation of Pretraining Samples for Developing a Deep Learning Brain Injury Model via Transfer Learning.
- Author
-
Lin, Nan, Wu, Shaoju, Wu, Zheyang, and Ji, Songbai
- Abstract
The large number of training samples required to develop a deep learning brain injury model demands enormous computational resources. Here, we study how a transformer neural network (TNN) of high accuracy can be used to efficiently generate pretraining samples for a convolutional neural network (CNN) brain injury model to reduce computational cost. The samples use synthetic impacts emulating real-world events or augmented impacts generated from limited measured impacts. First, we verify that the TNN remains highly accurate for the two impact types (N = 100 each; R² of 0.948–0.967 with root mean squared error, RMSE, ~0.01, for voxelized peak strains). The TNN-estimated samples (1000–5000 for each data type) are then used to pretrain a CNN, which is further finetuned using directly simulated training samples (250–5000). An independent measured impact dataset, considered to capture each impact event completely, is used to assess estimation accuracy (N = 191). We find that pretraining can significantly improve CNN accuracy via transfer learning compared to a baseline CNN without pretraining. It is most effective when the finetuning dataset is relatively small (e.g., 2000–4000 pretraining synthetic or augmented samples improve success rate from 0.72 to 0.81 with 500 finetuning samples). When finetuning samples reach 3000 or more, no obvious improvement occurs from pretraining. These results support using the TNN to rapidly generate pretraining samples to facilitate a more efficient training strategy for future deep learning brain models, by limiting the number of costly direct simulations from an alternative baseline model. This study could contribute to a wider adoption of deep learning brain injury models for large-scale predictive modeling and, ultimately, to enhancing safety protocols and protective equipment. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
28. Synthetic data at scale: a development model to efficiently leverage machine learning in agriculture.
- Author
-
Klein, Jonathan, Waller, Rebekah, Pirk, Sören, Pałubicki, Wojtek, Tester, Mark, and Michels, Dominik L.
- Subjects
ARTIFICIAL intelligence ,MACHINE learning ,AGRICULTURE ,TOMATOES ,MODELS & modelmaking - Abstract
The rise of artificial intelligence (AI) and in particular modern machine learning (ML) algorithms during the last decade has been met with great interest in the agricultural industry. While undisputedly powerful, their main drawback remains the need for sufficient and diverse training data. The collection of real datasets and their annotation are the main cost drivers of ML developments, and while promising results on synthetically generated training data have been shown, their generation is not without difficulties of its own. In this paper, we present a development model for the iterative, cost-efficient generation of synthetic training data. Its application is demonstrated by developing a low-cost early disease detector for tomato plants (Solanum lycopersicum) using synthetic training data. A neural classifier is trained by exclusively using synthetic images, whose generation process is iteratively refined to obtain optimal performance. In contrast to other approaches that rely on a human assessment of similarity between real and synthetic data, we instead introduce a structured, quantitative approach. Our evaluation shows superior generalization results when compared to using nontask-specific real training data and a higher cost efficiency of development compared to traditional synthetic training data. We believe that our approach will help to reduce the cost of synthetic data generation in future applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
29. Investigating the Sim-to-Real Generalizability of Deep Learning Object Detection Models.
- Author
-
Rüter, Joachim, Durak, Umut, and Dauer, Johann C.
- Subjects
OBJECT recognition (Computer vision) ,COMPUTER vision ,AIRPLANE air refueling ,DEEP learning ,SIMULATION methods & models - Abstract
State-of-the-art object detection models need large and diverse datasets for training. As these are hard to acquire for many practical applications, training images from simulation environments gain more and more attention. A problem arises as deep learning models trained on simulation images usually have problems generalizing to real-world images shown by a sharp performance drop. Definite reasons and influences for this performance drop are not yet found. While previous work mostly investigated the influence of the data as well as the use of domain adaptation, this work provides a novel perspective by investigating the influence of the object detection model itself. Against this background, first, a corresponding measure called sim-to-real generalizability is defined, comprising the capability of an object detection model to generalize from simulation training images to real-world evaluation images. Second, 12 different deep learning-based object detection models are trained and their sim-to-real generalizability is evaluated. The models are trained with a variation of hyperparameters resulting in a total of 144 trained and evaluated versions. The results show a clear influence of the feature extractor and offer further insights and correlations. They open up future research on investigating influences on the sim-to-real generalizability of deep learning-based object detection models as well as on developing feature extractors that have better sim-to-real generalizability capabilities. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
30. Accelerating User Profiling in E-Commerce Using Conditional GAN Networks for Synthetic Data Generation.
- Author
-
Gabryel, Marcin, Kocić, Eliza, Kocić, Milan, Patora-Wysocka, Zofia, Xiao, Min, and Pawlak, Mirosław
- Subjects
GENERATIVE adversarial networks ,INTERNET traffic ,PRICE sensitivity ,PURCHASING ,INTERNET stores - Abstract
This paper presents the findings of a study on the profiling of online store users in terms of their likelihood of making a purchase. It also considers the possibility of implementing this solution in the short term. The paper describes the process of developing a profiling model based on data derived from monitoring user behaviour on a website. During the customer's subsequent visits, information is collected to identify the user, record their behaviour on the page and the fact that they made a purchase. The model requires a substantial amount of training data, primarily related to the purchase of products. This represents a small percentage of total website traffic and requires a considerable amount of time to monitor user behaviour. Therefore, we investigated the possibility of using the Conditional Generative Adversarial Network (CGAN) to generate synthetic data for training the profiling model. The application of GAN would facilitate a more expedient implementation of this model on an online store website. The findings of this study may also prove beneficial to webshop owners and managers, enabling them to gain a deeper insight into their customers and align their price offers or discounts with the profile of a particular user. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
31. Validation Assessment of Privacy‐Preserving Synthetic Electronic Health Record Data: Comparison of Original Versus Synthetic Data on Real‐World COVID‐19 Vaccine Effectiveness.
- Author
-
Wang, Echo, Mott, Katrina, Zhang, Hongtao, Gazit, Sivan, Chodick, Gabriel, and Burcu, Mehmet
- Abstract
Purpose: To assess the validity of privacy‐preserving synthetic data by comparing results from synthetic versus original EHR data analysis. Methods: A published retrospective cohort study on real‐world effectiveness of COVID‐19 vaccines by Maccabi Healthcare Services in Israel was replicated using synthetic data generated from the same source, and the results were compared between synthetic versus original datasets. The endpoints included COVID‐19 infection, symptomatic COVID‐19 infection and hospitalization due to infection and were also assessed in several demographic and clinical subgroups. In comparing synthetic versus original data estimates, several metrics were utilized: standardized mean differences (SMD), decision agreement, estimate agreement, confidence interval overlap, and Wald test. Synthetic data were generated five times to assess the stability of results. Results: The distribution of demographic and clinical characteristics demonstrated very small differences (< 0.01 SMD). In the comparison of vaccine effectiveness assessed in relative risk reduction between synthetic versus original data, there was a 100% decision agreement, 100% estimate agreement, and a high level of confidence interval overlap (88.7%–99.7%) in all five replicates across all subgroups. Similar findings were achieved in the assessment of vaccine effectiveness against symptomatic COVID‐19 infection. In the comparison of hazard ratios for COVID‐19‐related hospitalization and odds ratio for symptomatic COVID‐19 infection, the Wald tests suggested no significant difference between respective effect estimates in all five replicates for all patient subgroups but there were disagreements in estimate and decision metrics in some subgroups and replicates. Conclusions: Overall, comparison of synthetic versus original real‐world data demonstrated good validity and reliability. Transparency on the process to generate high fidelity synthetic data and assurances of patient privacy are warranted. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
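Note on the preceding record: two of the agreement metrics named in the abstract, the standardized mean difference and confidence interval overlap, have widely used textbook forms. The Python sketch below uses those forms, which is an assumption: the abstract does not state the exact definitions the authors applied. The function names and example numbers are illustrative only.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(a) + variance(b)) / 2)
    return (mean(a) - mean(b)) / pooled_sd


def ci_overlap(original, synthetic):
    """Average share of each interval covered by their intersection.

    Each argument is a (lower, upper) pair. The result is 1.0 for identical
    intervals and negative when the intervals are disjoint.
    """
    lower = max(original[0], synthetic[0])
    upper = min(original[1], synthetic[1])
    shared = upper - lower
    return 0.5 * (shared / (original[1] - original[0])
                  + shared / (synthetic[1] - synthetic[0]))


# Illustrative values only, not taken from the study.
print(standardized_mean_difference([1, 2, 3], [2, 3, 4]))
print(ci_overlap((0.0, 1.0), (0.5, 1.5)))
```

An SMD near zero indicates that a covariate is distributed similarly in the two datasets, and an overlap near 1.0 indicates that an analysis on the synthetic data would yield nearly the same interval estimate as the original.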
32. RADIAN – A tool for generating synthetic spatial data for use in teaching and learning.
- Author
-
Gorry, Paddy and Mooney, Peter
- Subjects
SOFTWARE development tools ,GEOGRAPHIC information systems ,PYTHONS ,MOTIVATION (Psychology) ,ALGORITHMS - Abstract
We describe a Python-based software tool called RADIAN (RAnDom spatIal dAta geNerator) developed with the purpose of generating simple synthetic spatial datasets. These datasets can be used in many contexts such as teaching and learning of GIS, testing of spatial algorithms, testing and visualization approaches. This paper provides a motivation for the need for a tool such as RADIAN along with a survey of other similar approaches. The methodological component of how RADIAN generates synthetic spatial datasets is described. We describe experimental results from the comparison of RADIAN with QGIS, which is the most closely comparable tool available at the time of writing. Finally, we provide some conclusions on the impact and potential of RADIAN with some interesting avenues for future work and development. RADIAN is available as open-source software. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
33. Combining propensity score methods with variational autoencoders for generating synthetic data in presence of latent sub-groups.
- Author
-
Farhadyar, Kiana, Bonofiglio, Federico, Hackenberg, Maren, Behrens, Max, Zöller, Daniela, and Binder, Harald
- Subjects
MARGINAL distributions ,DEEP learning ,DATA protection ,HETEROGENEITY ,NUISANCES ,AUTOENCODER - Abstract
In settings requiring synthetic data generation based on a clinical cohort, e.g., due to data protection regulations, heterogeneity across individuals might be a nuisance that we need to control or faithfully preserve. The sources of such heterogeneity might be known, e.g., as indicated by sub-groups labels, or might be unknown and thus reflected only in properties of distributions, such as bimodality or skewness. We investigate how such heterogeneity can be preserved and controlled when obtaining synthetic data from variational autoencoders (VAEs), i.e., a generative deep learning technique that utilizes a low-dimensional latent representation. To faithfully reproduce unknown heterogeneity reflected in marginal distributions, we propose to combine VAEs with pre-transformations. For dealing with known heterogeneity due to sub-groups, we complement VAEs with models for group membership, specifically from propensity score regression. The evaluation is performed with a realistic simulation design that features sub-groups and challenging marginal distributions. The proposed approach faithfully recovers the latter, compared to synthetic data approaches that focus purely on marginal distributions. Propensity scores add complementary information, e.g., when visualized in the latent space, and enable sampling of synthetic data with or without sub-group specific characteristics. We also illustrate the proposed approach with real data from an international stroke trial that exhibits considerable distribution differences between study sites, in addition to bimodality. These results indicate that describing heterogeneity by statistical approaches, such as propensity score regression, might be more generally useful for complementing generative deep learning for obtaining synthetic data that faithfully reflects structure from clinical cohorts. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
34. Synthetic Health Data: Real Ethical Promise and Peril.
- Author
-
Susser, Daniel, Schiff, Daniel S., Gerke, Sara, Cabrera, Laura Y., Cohen, I. Glenn, Doerr, Megan, Harrod, Jordan, Kostick‐Quenet, Kristin, McNealy, Jasmine, Meyer, Michelle N., Price, W. Nicholson, and Wagner, Jennifer K.
- Subjects
DATA security ,PRIVACY ,MEDICAL care ,INTERNET ,COMMUNICATION ,ELECTRONIC health records ,TECHNOLOGY ,MEDICAL ethics ,ETHICS - Abstract
Researchers and practitioners are increasingly using machine‐generated synthetic data as a tool for advancing health science and practice, by expanding access to health data while—potentially—mitigating privacy and related ethical concerns around data sharing. While using synthetic data in this way holds promise, we argue that it also raises significant ethical, legal, and policy concerns, including persistent privacy and security problems, accuracy and reliability issues, worries about fairness and bias, and new regulatory challenges. The virtue of synthetic data is often understood to be its detachment from the data subjects whose measurement data is used to generate it. However, we argue that addressing the ethical issues synthetic data raises might require bringing data subjects back into the picture, finding ways that researchers and data subjects can be more meaningfully engaged in the construction and evaluation of datasets and in the creation of institutional safeguards that promote responsible use. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
35. Potential of synthetic images in landslide segmentation in data-poor scenario: a framework combining GAN and transformer models.
- Author
-
Feng, Xiao, Du, Juan, Wu, Minghua, Chai, Bo, Miao, Fasheng, and Wang, Yang
- Subjects
CONVOLUTIONAL neural networks ,TRANSFORMER models ,GENERATIVE adversarial networks ,REMOTE sensing ,IMAGE segmentation ,SCARCITY ,DEEP learning - Abstract
Accurate landslide segmentation from remote sensing data is pivotal for efficient emergency response and risk management. In recent years, data-driven deep learning approaches have emerged as a significant area of focus in this domain. However, the limited availability of landslide data often restricts the effectiveness of these approaches. This study introduces the StyleGAN2-transformer framework for landslide segmentation, utilizing generative adversarial networks (GANs) for the first time to create synthetic, high-quality landslide images to address the data scarcity issue that undermines landslide segmentation model performance. Two datasets were developed: one containing a limited set of real landslide images and the other supplemented with synthetic landslide images generated by StyleGAN2. These datasets facilitated comparative experiments to quantitatively assess the impact of synthetic data on the performance of both convolutional neural network (CNN) and transformer series models, employing a suite of metrics for thorough evaluation. The findings indicate that adding synthetic landslide images from StyleGAN2 improves the overall accuracy of most landslide segmentation models significantly, achieving more than a 10% increase. Moreover, integrating StyleGAN2 with transformer models presents an optimized approach, as transformer models surpass CNN models in accuracy when adequate training data are available. Finally, the results also confirm that the StyleGAN2-transformer framework exhibits strong generalizability in a variety of scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
36. Data augmentation for generating synthetic electrogastrogram time series.
- Author
-
Miljković, Nadica, Milenić, Nikola, Popović, Nenad B., and Sodnik, Jaka
- Abstract
To address an emerging need for a large number of diverse datasets for rigorous evaluation of signal processing techniques, we developed and evaluated a new method for generating synthetic electrogastrogram time series. We used electrogastrography (EGG) data from an open database to set model parameters and statistical tests to evaluate synthesized data. Additionally, we illustrated method customization for generating artificial EGG time series alterations caused by simulator sickness. The proposed data augmentation method generates synthetic EGG data with specified duration, sampling frequency, recording state (postprandial or fasting state), overall noise and breathing artifact injection, and pauses in the gastric rhythm (arrhythmia occurrence), with a statistically significant difference between postprandial and fasting states in > 70% of cases while not accounting for individual differences. Features obtained from the synthetic EGG signal resembling simulator sickness occurrence displayed expected trends. The code for generation of synthetic EGG time series is freely available; it can be further customized to assess signal processing algorithms and may also be used to increase data diversity for training artificial intelligence (AI) algorithms. The proposed approach is customized for EGG data synthesis but can be easily utilized for other biosignals of a similar nature such as the electroencephalogram. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
37. SPINNED: Simulation‐based physics‐informed neural network for deconvolution of dynamic susceptibility contrast MRI perfusion data.
- Author
-
Asaduddin, Muhammad, Kim, Eung Yeop, and Park, Sung‐Hong
- Subjects
SINGULAR value decomposition ,MAGNETIC resonance imaging ,SUPERVISED learning ,PERFUSION ,TIME series analysis - Abstract
Purpose: To propose the simulation‐based physics‐informed neural network for deconvolution of dynamic susceptibility contrast (DSC) MRI (SPINNED) as an alternative for more robust and accurate deconvolution compared to existing methods. Methods: The SPINNED method was developed by generating synthetic tissue residue functions and arterial input functions through mathematical simulations and by using them to create synthetic DSC MRI time series. The SPINNED model was trained using these simulated data to learn the underlying physical relation (deconvolution) between the DSC‐MRI time series and the arterial input functions. The accuracy and robustness of the proposed SPINNED method were assessed by comparing it with two common deconvolution methods in DSC MRI data analysis, circulant singular value decomposition, and Volterra singular value decomposition, using both simulation data and real patient data. Results: The proposed SPINNED method was more accurate than the conventional methods across all SNR levels and showed better robustness against noise in both simulation and real patient data. The SPINNED method also showed much faster processing speed than the conventional methods. Conclusion: These results support that the proposed SPINNED method can be a good alternative to the existing methods for resolving the deconvolution problem in DSC MRI. The proposed method does not require any separate ground‐truth measurement for training and offers additional benefits of quick processing time and coverage of diverse clinical scenarios. Consequently, it will contribute to more reliable, accurate, and rapid diagnoses in clinical applications compared with the previous methods including those based on supervised learning. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
38. Sensor-based characterization of construction and demolition waste at high occupancy densities using synthetic training data and deep learning.
- Author
-
Kronenwett, Felix, Maier, Georg, Leiss, Norbert, Gruna, Robin, Thome, Volker, and Längle, Thomas
- Subjects
CONSTRUCTION & demolition debris ,OBJECT recognition (Computer vision) ,CONVOLUTIONAL neural networks ,CIRCULAR economy ,MACHINE learning ,DEEP learning - Abstract
Sensor-based monitoring of construction and demolition waste (CDW) streams plays an important role in recycling (RC). Extracted knowledge about the composition of a material stream helps identify RC paths and optimize processing plants, and forms the basis for sorting. To enable economical use, it is necessary to ensure robust detection of individual objects even with high material throughput. Conventional algorithms struggle with the resulting high occupancy densities and object overlap, making deep learning object detection methods more promising. In this study, different deep learning architectures for object detection (Faster Region-based Convolutional Neural Network (Faster R-CNN), You Only Look Once (YOLOv3), Single Shot MultiBox Detector (SSD)) are investigated with respect to their suitability for CDW characterization. A mixture of brick and sand-lime brick is considered as an exemplary waste stream. Particular attention is paid to detection performance with increasing occupancy density and particle overlap. A method for the generation of synthetic training images is presented, which avoids time-consuming manual labelling. By testing the models trained on synthetic data on real images, the success of the method is demonstrated. Requirements for synthetic training data composition, potential improvements and simplifications of different architecture approaches are discussed based on the characteristics of the detection task. In addition, the required inference time of the presented models is investigated to ensure their suitability for use under real-time conditions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
39. A Systematic Review of Synthetic Data Generation Techniques Using Generative AI.
- Author
-
Goyal, Mandeep and Mahmoud, Qusay H.
- Subjects
GENERATIVE artificial intelligence ,LANGUAGE models ,ALGORITHMIC bias ,GENERATIVE adversarial networks ,MACHINE learning - Abstract
Synthetic data are increasingly being recognized for their potential to address serious real-world challenges in various domains. They provide innovative solutions to combat the data scarcity, privacy concerns, and algorithmic biases commonly encountered in machine learning applications. Synthetic data preserve all underlying patterns and behaviors of the original dataset while altering the actual content. The methods proposed in the literature to generate synthetic data vary from large language models (LLMs), which are pre-trained on gigantic datasets, to generative adversarial networks (GANs) and variational autoencoders (VAEs). This study provides a systematic review of the various techniques proposed in the literature that can be used to generate synthetic data to identify their limitations and suggest potential future research areas. The findings indicate that while these technologies generate synthetic data of specific data types, they still have some drawbacks, such as computational requirements, training stability, and privacy-preserving measures, which limit their real-world usability. Addressing these issues will facilitate the broader adoption of synthetic data generation techniques across various disciplines, thereby advancing machine learning and data-driven solutions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
40. Harnessing synthetic data to address fraud in cross-border payments.
- Author
-
Bryssinck, Johan, Jacobs, Tom, Simini, Filippo, Doddasomayajula, Ravi, Koder, Martin, Curbera, Francisco, Vishwanath, Venkatram, and Neti, Chalapathy
- Subjects
FRAUD ,ARTIFICIAL intelligence ,ALGORITHMS ,FRAUD investigation ,INFORMATION sharing - Abstract
The sharing of data between financial institutions is widely recognised as a key component in the industry's efforts to combat fraud. Broader access to multiple sources of financial data is also critical to the development of high-quality fraud detection mechanisms based on artificial intelligence (AI). Given the challenges relating to sharing real financial data across countries and institutions, the use of synthetic data has recently become critical to enabling the exploration of broader data sharing and supporting open collaboration in AI model development. To generate synthetic data that can substitute for real data, computer algorithms closely mimic the key statistical properties of genuine data, while strictly preserving the privacy and sovereignty of the source data. This paper presents the results of an ongoing exploration into the generation of high-utility synthetic datasets of cross-border payment transactions using transformer models and discusses its application to the development of AI-based fraud prevention solutions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
41. Enhancing Building Services in Higher Education Campuses through Participatory Science.
- Author
-
Itair, Mohammed, Shahrour, Isam, El Meouche, Rani, and Hattab, Nizar
- Subjects
BUILDING information modeling ,ARTIFICIAL intelligence ,COLLEGE buildings ,QUALITY of service ,EMPLOYEE participation in management - Abstract
This paper explores how participatory science can enhance building services on a higher education campus. The use of participatory science aims to involve students, faculty members, and technical teams in improving the management of the campus through their participation in data collection and evaluation of the building services. It represents a valuable alternative for campuses needing more building monitoring. The paper also shows how the performance of participatory science could be improved by combining digital technologies such as Building Information Modeling (BIM) and artificial intelligence (AI). The framework is applied to the Faculty of Engineering at An-Najah National University to improve the building services of the campus. A combination of users' feedback and AI-generated synthetic data is used to explore the performance of the proposed method. Results confirm the high potential of participatory science for improving the services and quality of life on higher education campuses. This is achieved through students' active participation and involvement in data collection and reporting on their individual experiences. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
42. Generative Model-Driven Synthetic Training Image Generation: An Approach to Cognition in Railway Defect Detection.
- Author
-
Ferdousi, Rahatara, Yang, Chunsheng, Hossain, M. Anwar, Laamarti, Fedwa, Hossain, M. Shamim, and Saddik, Abdulmotaleb El
- Abstract
Recent advancements in cognitive computing, through the integration of artificial intelligence (AI) techniques, have facilitated the development of intelligent cognitive systems (ICS). This benefits railway defect detection by enabling ICS to emulate human-like analysis of defect patterns in image data. Although visual defect classification based on convolutional neural networks (CNN) has achieved decent performance, the scarcity of large datasets for railway defect detection remains a challenge. This scarcity stems from the infrequent nature of accidents that result in defective railway parts. Existing research efforts have addressed the challenge of data scarcity by exploring rule-based and generative data augmentation approaches. Among these approaches, variational autoencoder (VAE) models can generate realistic data without the need for extensive baseline datasets for noise modeling. This study proposes a VAE-based synthetic image generation technique for training railway defect classifiers. Our approach introduces a modified regularization strategy that combines weight decay with reconstruction loss. Using this method, we created a synthetic dataset for the Canadian Pacific Railway (CPR), consisting of 50 real samples across five classes. Remarkably, our method generated 500 synthetic samples, achieving a minimal reconstruction loss of 0.021. A visual transformer (ViT) model, fine-tuned using this synthetic CPR dataset, achieved high accuracy rates (98–99%) in classifying the five railway defect classes. This research presents an approach that addresses the data scarcity issue in railway defect detection, indicating a path toward enhancing the development of ICS in this field. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
43. A Deep Learning Model for Detecting Fake Medical Images to Mitigate Financial Insurance Fraud.
- Author
-
Arshed, Muhammad Asad, Mumtaz, Shahzad, Gherghina, Ștefan Cristian, Urooj, Neelam, Ahmed, Saeed, and Dewi, Christine
- Subjects
GENERATIVE adversarial networks ,STABLE Diffusion ,CONVOLUTIONAL neural networks ,ARTIFICIAL intelligence ,INSURANCE crimes ,DEEP learning - Abstract
Artificial Intelligence and Deepfake Technologies have brought a new dimension to the generation of fake data, making it easier and faster than ever before—this fake data could include text, images, sounds, videos, etc. This has brought new challenges that require the faster development of tools and techniques to avoid fraudulent activities at pace and scale. Our focus in this research study is to empirically evaluate the use and effectiveness of deep learning models such as Convolutional Neural Networks (CNNs) and Patch-based Neural Networks in the context of successful identification of real and fake images. We chose the healthcare domain as a potential case study where the fake medical data generation approach could be used to make false insurance claims. For this purpose, we obtained publicly available skin cancer data and used recently introduced stable diffusion approaches—a more effective technique than prior approaches such as Generative Adversarial Network (GAN)—to generate fake skin cancer images. To the best of our knowledge, and based on the literature review, this is one of the few research studies that uses images generated using stable diffusion along with real image data. As part of the exploratory analysis, we analyzed histograms of fake and real images using individual color channels and averaged across training and testing datasets. The histogram analysis demonstrated a clear change by shifting the mean and overall distribution of both real and fake images (more prominent in blue and green) in the training data whereas, in the test data, both means were different from the training data, so it appears to be non-trivial to set a threshold which could give better predictive capability. We also conducted a user study to observe whether the naked eye could identify any patterns for classifying real and fake images, and the accuracy of the test data was observed to be 68%. The adoption of deep learning predictive approaches (i.e., patch-based and CNN-based) has demonstrated similar accuracy (~100%) in training and validation subsets of the data, and the same was observed for the test subset with and without StratifiedKFold (k = 3). Our analysis has demonstrated that state-of-the-art exploratory and deep-learning approaches are effective enough to detect images generated from stable diffusion vs. real images. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
44. Development of a cerebellar ataxia diagnosis model using conditional GAN-based synthetic data generation for visuomotor adaptation task
- Author
-
Jinah Kim, Sung-Ho Woo, Taekyung Kim, Won Tae Yoon, Jung Hwan Shin, Jee-Young Lee, and Jeh-Kwang Ryu
- Subjects
Cerebellar ataxia diagnosis, Visuomotor adaptation task, Conditional generative adversarial network, Synthetic data, Digital healthcare, Computer applications to medicine. Medical informatics, R858-859.7
- Abstract
This study proposes a synthetic data generation model to create a classification framework for cerebellar ataxia patients using trajectory data from the visuomotor adaptation task. The classification objectives include patients with cerebellar ataxia, age-matched normal individuals, and young healthy subjects. Synthetic data for the three classes is generated based on class conditions and random noise by leveraging a combination of conditional adversarial generative neural networks and reconstruction networks. This synthetic data, alongside real data, is utilized as training data for the patient classification model to enhance classification accuracy. The fidelity of the synthetic data is assessed visually to measure the validity and diversity of the generated data qualitatively while quantitatively evaluating distribution similarity to real data. Furthermore, the clinical efficacy of the patient classification model employing synthetic data is demonstrated by showcasing improved classification accuracy through a comparative analysis between results obtained using solely real data and those obtained when both real and synthetic data are utilized. This methodological approach holds promise in addressing data insufficiency in the digital healthcare domain, employing deep learning methodologies, and developing early disease diagnosis tools.
- Published
- 2024
- Full Text
- View/download PDF
45. Completing 3D point clouds of individual trees using deep learning
- Author
-
Aline Bornand, Meinrad Abegg, Felix Morsdorf, and Nataliia Rehush
- Subjects
artificial intelligence (AI), deep learning (DL), forest, LiDAR, point cloud completion, synthetic data, Ecology, QH540-549.5, Evolution, QH359-425
- Abstract
In close-range remote sensing data collected in a forest, occlusion often causes incomplete or sparse point cloud representations of individual trees, impeding accurate 3D reconstruction of tree architecture and estimation of tree height and volume. Recent developments in deep learning (DL) for 3D data have produced approaches for point cloud completion, which could potentially be applied to trees. We explored the potential of a DL approach to fill gaps in dense point clouds representing the main structures of deciduous trees by applying an existing transformer-based completion model (PoinTr). Complete point clouds are required as training data, but even dense terrestrial laser scanning (TLS) data sets contain gaps caused by occlusion, making it nearly impossible to acquire such data. We therefore investigated the ability of point cloud completion models trained on a range of synthetic data sets to handle occlusion patterns in real-world point clouds. Despite the limited data set, we successfully fine-tuned a general pre-trained completion model to fill gaps within 1 m³ segments of tree point clouds. Fine-tuning on synthetic tree data improved the model's ability to complete tree objects compared with training on diverse artificial objects. However, the quality of the predictions was influenced by the level of sophistication of the synthetic data. Our results demonstrate that incorporating even limited real-world TLS data during training can considerably improve completion results but may introduce additional noise in the predictions. 3D point cloud completion with DL has the potential to improve and fill gaps in point clouds of individual trees, facilitating further steps in the processing and analysis of 3D forest data.
- Published
- 2024
- Full Text
- View/download PDF
46. The Problems of LLM-generated Data in Social Science Research
- Author
-
Luca Rossi, Katherine Harrison, and Irina Shklovski
- Subjects
llm, synthetic data, social science, research methods, Social Sciences, Sociology (General), HM401-1281
- Abstract
Beyond being used as fast and cheap annotators for otherwise complex classification tasks, LLMs have seen a growing adoption for generating synthetic data for social science and design research. Researchers have used LLM-generated data for data augmentation and prototyping, as well as for direct analysis where LLMs acted as proxies for real human subjects. LLM-based synthetic data build on fundamentally different epistemological assumptions than previous synthetically generated data and are justified by a different set of considerations. In this essay, we explore the various ways in which LLMs have been used to generate research data and consider the underlying epistemological (and accompanying methodological) assumptions. We challenge some of the assumptions made about LLM-generated data, and we highlight the main challenges that social sciences and humanities need to address if they want to adopt LLMs as synthetic data generators.
- Published
- 2024
- Full Text
- View/download PDF
47. Synthetic data for privacy-preserving clinical risk prediction
- Author
-
Zhaozhi Qian, Thomas Callender, Bogdan Cebere, Sam M. Janes, Neal Navani, and Mihaela van der Schaar
- Subjects
Synthetic data, Machine learning, Risk-prediction, Medicine, Science
- Abstract
Synthetic data promise privacy-preserving data sharing for healthcare research and development. Compared with other privacy-enhancing approaches, such as federated learning, analyses performed on synthetic data can be applied downstream without modification, such that synthetic data can act in place of real data for a wide range of use cases. However, the role that synthetic data might play in all aspects of clinical model development remains unknown. In this work, we used state-of-the-art generators explicitly designed for privacy preservation to create a synthetic version of ever-smokers in the UK Biobank before building prognostic models for lung cancer under several data release assumptions. We demonstrate that synthetic data can be effectively used throughout the medical prognostic modeling pipeline even without eventual access to the real data. Furthermore, we show the implications of different data release approaches on how synthetic biobank data could be deployed within the healthcare system.
- Published
- 2024
- Full Text
- View/download PDF
48. Forecasting Population Migration in Small Settlements Using Generative Models under Conditions of Data Scarcity
- Author
-
Kirill Zakharov, Albert Aghajanyan, Anton Kovantsev, and Alexander Boukhanovsky
- Subjects
migration forecasting, small settlements, synthetic data, collecting data, machine learning for migration, Engineering (General). Civil engineering (General), TA1-2040
- Abstract
Today, the problem of predicting population migration is essential in the concept of smart cities for the proper development planning of certain regions of the country, as well as their financing and landscaping. In dealing with population migration in small settlements whose population is below 100,000, data collection is challenging. In countries where data collection is not well developed, most of the available data in open access are presented as part of textual reports issued by authorities in municipal districts. Therefore, the creation of a more or less adequate dataset requires significant efforts, and despite these efforts, the outcome is far from ideal. However, for large cities, there are typically aggregated databases maintained by authorities. We used them to find out what factors had an impact on the number of people who arrived in or departed from the city. Then, we reviewed several dozen documents to mine the data of small settlements. These data were not sufficient to solve machine learning tasks, but they were used as the basis for creating a synthetic sample for model fitting. We found that a combination of two models, each trained on synthetic data, performed better. A binary classifier predicted the migration direction and a regressor estimated the number of migrants. Lastly, the model fitted on synthetic data was applied to the other set of real data, and we obtained good results, which are presented in this paper.
- Published
- 2024
- Full Text
- View/download PDF
49. Combining propensity score methods with variational autoencoders for generating synthetic data in presence of latent sub-groups
- Author
-
Kiana Farhadyar, Federico Bonofiglio, Maren Hackenberg, Max Behrens, Daniela Zöller, and Harald Binder
- Subjects
Synthetic data, Complex distribution, Propensity score, Deep generative model, Variational autoencoder, Medicine (General), R5-920
- Abstract
In settings requiring synthetic data generation based on a clinical cohort, e.g., due to data protection regulations, heterogeneity across individuals might be a nuisance that we need to control or faithfully preserve. The sources of such heterogeneity might be known, e.g., as indicated by sub-group labels, or might be unknown and thus reflected only in properties of distributions, such as bimodality or skewness. We investigate how such heterogeneity can be preserved and controlled when obtaining synthetic data from variational autoencoders (VAEs), i.e., a generative deep learning technique that utilizes a low-dimensional latent representation. To faithfully reproduce unknown heterogeneity reflected in marginal distributions, we propose to combine VAEs with pre-transformations. For dealing with known heterogeneity due to sub-groups, we complement VAEs with models for group membership, specifically from propensity score regression. The evaluation is performed with a realistic simulation design that features sub-groups and challenging marginal distributions. The proposed approach faithfully recovers the latter, compared to synthetic data approaches that focus purely on marginal distributions. Propensity scores add complementary information, e.g., when visualized in the latent space, and enable sampling of synthetic data with or without sub-group specific characteristics. We also illustrate the proposed approach with real data from an international stroke trial that exhibits considerable distribution differences between study sites, in addition to bimodality. These results indicate that describing heterogeneity by statistical approaches, such as propensity score regression, might be more generally useful for complementing generative deep learning for obtaining synthetic data that faithfully reflects structure from clinical cohorts.
- Published
- 2024
- Full Text
- View/download PDF
50. Review on synergizing the Metaverse and AI-driven synthetic data: enhancing virtual realms and activity recognition in computer vision
- Author
-
Megani Rajendran, Chek Tien Tan, Indriyati Atmosukarto, Aik Beng Ng, and Simon See
- Subjects
Synthetic data, Virtual reality, Datasets, Human action, Metaverse, Electronic computers. Computer science, QA75.5-76.95, Neurophysiology and neuropsychology, QP351-495
- Abstract
The Metaverse’s emergence is redefining digital interaction, enabling seamless engagement in immersive virtual realms. This trend’s integration with AI and virtual reality (VR) is gaining momentum, albeit with challenges in acquiring extensive human action datasets. Real-world activities involve complex, intricate behaviors, making accurate capture and annotation difficult. VR compounds this difficulty by requiring meticulous simulation of natural movements and interactions. As the Metaverse bridges the physical and digital realms, the demand for diverse human action data escalates, requiring innovative solutions to enrich AI and VR capabilities. This need is underscored by state-of-the-art models that excel but are hampered by limited real-world data. The overshadowing of synthetic data benefits further complicates the issue. This paper systematically examines both real-world and synthetic datasets for activity detection and recognition in computer vision. Introducing Metaverse-enabled advancements, we unveil SynDa’s novel streamlined pipeline using photorealistic rendering and AI pose estimation. By fusing real-life video datasets, large-scale synthetic datasets are generated to augment training and mitigate real data scarcity and costs. Our preliminary experiments reveal promising results in terms of mean average precision (mAP): training models on a combination of real data and synthetic video data generated using this pipeline yields a higher mAP (32.35%) than the same model achieves when trained on real data (29.95%). This demonstrates the transformative synergy between Metaverse and AI-driven synthetic data augmentation.
- Published
- 2024
- Full Text
- View/download PDF
Discovery Service for Jio Institute Digital Library