569 results for "Synthetic data generation"
Search Results
2. Electricity GANs: Generative Adversarial Networks for Electricity Price Scenario Generation.
- Author
-
Yilmaz, Bilgi, Laudagé, Christian, Korn, Ralf, and Desmettre, Sascha
- Subjects
GENERATIVE adversarial networks, DEEP learning, ELECTRICITY markets, ELECTRICITY pricing, ENERGY industries
- Abstract
The dynamic structure of electricity markets, where uncertainties abound due to, e.g., demand variations and renewable energy intermittency, poses challenges for market participants. We propose generative adversarial networks (GANs) to generate synthetic electricity price data. This approach aims to provide comprehensive data that accurately reflect the complexities of the actual electricity market by capturing its distribution. Consequently, we would like to equip market participants with a versatile tool for successfully dealing with strategy testing, risk model validation, and decision-making enhancement. Access to high-quality synthetic electricity price data is instrumental in cultivating a resilient and adaptive marketplace, ultimately contributing to a more knowledgeable and prepared electricity market community. In order to assess the performance of various types of GANs, we performed a numerical study on Turkey's intraday electricity market weighted average price (IDM-WAP). As a key finding, we show that GANs can effectively generate realistic synthetic electricity prices. Furthermore, we reveal that the use of complex variants of GAN algorithms does not lead to a significant improvement in synthetic data quality. However, it requires a notable increase in computational costs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
3. Learning by Demonstration of a Robot Using One-Shot Learning and Cross-Validation Regression with Z-Score.
- Author
-
Duque-Domingo, Jaime, García-Gómez, Miguel, Zalama, Eduardo, and Gómez-García-Bermejo, Jaime
- Subjects
REGRESSION analysis, ROBOTS, INSTRUCTIONAL systems, ROBOTICS, DETECTORS
- Abstract
We introduce a One-Shot Learning system where a robot effectively learns how to manipulate objects by relying solely on the object's name, a single image, and a visual example of a person picking it up. Once the robot has mastered picking up a new object, an audio command is all that is needed to prompt it to perform the action. Our approach heavily depends on synthetic data generation, which is crucial for training various detection and regression models. Additionally, we introduce a novel combined regression model called Cross-Validation Regression with Z-Score (CVR-ZS), which improves the robot's grasp accuracy. The system also features a classifier that uses a cutting-edge text-encoding technique, allowing for flexible user prompts for object retrieval. The complete system includes a text encoder and classifier, an object detector, and the CVR-ZS regressor. This setup has been validated with a Niryo Ned robot. [ABSTRACT FROM AUTHOR]
- Published
- 2024
4. SyntDiaNet: Integrating feature extraction, transfer learning and classifier-embedded generative adversarial network for advanced pneumonia diagnosis.
- Author
-
Poola, Rahul Gowtham, P.L, Lahari, and Yellampalli, Siva Sankar
- Subjects
GENERATIVE adversarial networks, TEXTURE analysis (Image processing), FEATURE extraction, DATA augmentation, PNEUMONIA, EXTRACTION techniques
- Abstract
Pneumonia, a significant contributor to global disease and mortality, imposes a substantial burden on healthcare systems. Conventional approaches to pneumonia detection rely on constrained datasets and traditional feature extraction techniques, potentially resulting in suboptimal accuracy. To overcome these limitations, we introduce the "SyntDiaNet" framework, an innovative solution that amalgamates synthetic data generation and advanced diagnostic capabilities. This research presents a comprehensive methodology aimed at enhancing pneumonia detection precision. The central innovation revolves around the integration of a Classifier-Embedded Generative Adversarial Network (CE-GAN) for data augmentation, followed by a multifaceted feature extraction process. This process encompasses Contour Detection, Gabor Texture Analysis, and Blob Detection, to capture nuanced patterns that may indicate pneumonia in X-ray images. Leveraging broad knowledge, an unsupervised feature extraction step is integrated using the SyntDiaNet Transfer Learning model. Across diverse feature extraction strategies, combined with the CE-GAN dataset and SyntDiaNet model, accuracy rates demonstrate substantial improvements, with potential ranges of 90.5% to 97.6%. The precision ranged between 0.932 and 0.981; recall scores spanned from 0.896 to 0.987; F1 scores varied between 0.90 and 0.980, and the AUC-ROC values ranged from 0.90 to 0.97, showcasing the ability of the models to discriminate between positive and negative classes, emphasizing differences in the classifiers' overall performance in ranking instances. By utilizing synthetic data and advanced feature extraction techniques, it adeptly addresses challenges related to limited datasets and imbalanced class distributions. Furthermore, its adaptability to various medical imaging tasks underscores its potential impact across a range of healthcare applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
5. Towards digital representations for brownfield factories using synthetic data generation and 3D object detection.
- Author
-
Toro, Javier Villena, Bolin, Lars, Eriksson, Jacob, and Wiberg, Anton
- Subjects
DIGITAL twins, ARTIFICIAL intelligence, POINT cloud, DATA modeling, ENGINEERING
- Abstract
This study emphasizes the importance of automatic synthetic data generation in data-driven applications, especially in the development of a 3D computer vision system for engineering contexts such as brownfield factory projects, where no data is readily available. Key points: (1) A successful integration of a synthetic data generator with the S3DIS dataset, leading to a significant enhancement in object detection of previous classes and enabling recognition of new ones; (2) A proposal for a CAD-based configurator for efficient and customizable scene reconstruction from LiDAR scanner point clouds. [ABSTRACT FROM AUTHOR]
- Published
- 2024
6. Electricity GANs: Generative Adversarial Networks for Electricity Price Scenario Generation
- Author
-
Bilgi Yilmaz, Christian Laudagé, Ralf Korn, and Sascha Desmettre
- Subjects
generative adversarial networks, complex GAN variants, deep learning in energy markets, synthetic data generation, intraday electricity prices, Nutrition. Foods and food supply, TX341-641
- Abstract
The dynamic structure of electricity markets, where uncertainties abound due to, e.g., demand variations and renewable energy intermittency, poses challenges for market participants. We propose generative adversarial networks (GANs) to generate synthetic electricity price data. This approach aims to provide comprehensive data that accurately reflect the complexities of the actual electricity market by capturing its distribution. Consequently, we would like to equip market participants with a versatile tool for successfully dealing with strategy testing, risk model validation, and decision-making enhancement. Access to high-quality synthetic electricity price data is instrumental in cultivating a resilient and adaptive marketplace, ultimately contributing to a more knowledgeable and prepared electricity market community. In order to assess the performance of various types of GANs, we performed a numerical study on Turkey’s intraday electricity market weighted average price (IDM-WAP). As a key finding, we show that GANs can effectively generate realistic synthetic electricity prices. Furthermore, we reveal that the use of complex variants of GAN algorithms does not lead to a significant improvement in synthetic data quality. However, it requires a notable increase in computational costs.
- Published
- 2024
7. Learning debiased graph representations from the OMOP common data model for synthetic data generation
- Author
-
Nicolas Alexander Schulz, Jasmin Carus, Alexander Johannes Wiederhold, Ole Johanns, Frederik Peters, Natalie Rath, Katharina Rausch, Bernd Holleczek, Alexander Katalinic, the AI-CARE Working Group, and Christopher Gundler
- Subjects
Synthetic Data Generation, Standardized Electronic Health Records, Causal Discovery, Discrete Time Series, Structural Equation Models, Graphical Models, Medicine (General), R5-920
- Abstract
Background: Generating synthetic patient data is crucial for medical research, but common approaches build up on black-box models which do not allow for expert verification or intervention. We propose a highly available method which enables synthetic data generation from real patient records in a privacy preserving and compliant fashion, is interpretable and allows for expert intervention. Methods: Our approach ties together two established tools in medical informatics, namely OMOP as a data standard for electronic health records and Synthea as a data synthetization method. For this study, data pipelines were built which extract data from OMOP, convert them into time series format, learn temporal rules by 2 statistical algorithms (Markov chain, TARM) and 3 algorithms of causal discovery (DYNOTEARS, J-PCMCI+, LiNGAM) and map the outputs into Synthea graphs. The graphs are evaluated quantitatively by their individual and relative complexity and qualitatively by medical experts. Results: The algorithms were found to learn qualitatively and quantitatively different graph representations. Whereas the Markov chain results in extremely large graphs, TARM, DYNOTEARS, and J-PCMCI+ were found to reduce the data dimension during learning. The MultiGroupDirect LiNGAM algorithm was found to not be applicable to the problem statement at hand. Conclusion: Only TARM and DYNOTEARS are practical algorithms for real-world data in this use case. As causal discovery is a method to debias purely statistical relationships, the gradient-based causal discovery algorithm DYNOTEARS was found to be most suitable.
- Published
- 2024
8. Synthetic data generation techniques for training deep acoustic siren identification networks.
- Author
-
Damiano, Stefano, Cramer, Benjamin, Guntoro, Andre, and van Waterschoot, Toon
- Subjects
ARTIFICIAL neural networks, CONVOLUTIONAL neural networks, DOPPLER effect, STIMULUS generalization, SOUND recordings, DATA augmentation
- Abstract
Acoustic sensing has been widely exploited for the early detection of harmful situations in urban environments: in particular, several siren identification algorithms based on deep neural networks have been developed and have proven robust to the noisy and non-stationary urban acoustic scene. Although high classification accuracy can be achieved when training and evaluating on the same dataset, the cross-dataset performance of such models remains unexplored. To build robust models that generalize well to unseen data, large datasets that capture the diversity of the target sounds are needed, whose collection is generally expensive and time consuming. To overcome this limitation, in this work we investigate synthetic data generation techniques for training siren identification models. To obtain siren source signals, we either collect from public sources a small set of stationary, recorded siren sounds, or generate them synthetically. We then simulate source motion, acoustic propagation and Doppler effect, and finally combine the resulting signal with background noise. This way, we build two synthetic datasets used to train three different convolutional neural networks, then tested on real-world datasets unseen during training. We show that the proposed training strategy based on the use of recorded source signals and synthetic acoustic propagation performs best. In particular, this method leads to models that exhibit a better generalization ability, as compared to training and evaluating in a cross-dataset setting. Moreover, the proposed method loosens the data collection requirement and is entirely built using publicly available resources. [ABSTRACT FROM AUTHOR]
- Published
- 2024
9. Privacy Preserving Human Mobility Generation Using Grid-Based Data and Graph Autoencoders.
- Author
-
Netzler, Fabian and Lienkamp, Markus
- Subjects
*TRAFFIC flow, *MACHINE learning, *AUTONOMOUS vehicles, *CLUSTER analysis (Statistics), *PRIVACY
- Abstract
This paper proposes a one-to-one trajectory synthetization method with stable long-term individual mobility behavior based on a generalizable area embedding. Previous methods concentrate on producing highly detailed data on short-term and restricted areas for, e.g., autonomous driving scenarios. Another possibility consists of city-wide and beyond scales that can be used to predict general traffic flows. The now-presented approach takes the tracked mobility behavior of individuals and creates coherent synthetic mobility data. These generated data reflect the person's long-term mobility behavior, guaranteeing location persistency and sound embedding within the point-of-interest structure of the observed area. After an analysis and clustering step of the original data, the area is distributed into a geospatial grid structure (H3 is used here). The neighborhood relationships between the grids are interpreted as a graph. A feed-forward autoencoder and a graph encoding–decoding network generate a latent space representation of the area. The original clustered data are associated with their respective H3 grids. With a greedy algorithm approach and concerning privacy strategies, new combinations of grids are generated as top-level patterns for individual mobility behavior. Based on the original data, concrete locations within the new grids are found and connected to ways. The goal is to generate a dataset that shows equivalence in aggregated characteristics and distances in comparison with the original data. The described method is applied to a sample of 120 from a study with 1000 participants whose mobility data were generated in the city of Munich in Germany. The results show the applicability of the approach in generating synthetic data, enabling further research on individual mobility behavior and patterns. 
The result comprises a sharable dataset on the same abstraction level as the input data, which can be beneficial for different applications, particularly for machine learning. [ABSTRACT FROM AUTHOR]
- Published
- 2024
10. Housing GANs: Deep Generation of Housing Market Data.
- Author
-
Yilmaz, Bilgi
- Subjects
GENERATIVE adversarial networks, MACHINE learning, HOUSING market, DATA distribution, RESEARCH personnel
- Abstract
Modeling housing markets is a challenging and central research area since they are highly related to the economy. However, the limited available data prevents researchers from improving models. As an alternative, this study introduces Housing GANs, a data-driven modeling approach inspired by the recent success of generative adversarial networks (GANs). The Housing GANs include a generator and discriminator function utilizing Wasserstein GAN with gradient penalty and mitigate original housing datasets, including continuous and discrete data. The generator function predicts the real data distribution and generates realistic housing data. The empirical analysis highlights that the Housing GANs successfully learns the distribution and generate realistic housing data in high fidelity. [ABSTRACT FROM AUTHOR]
- Published
- 2024
11. Learning debiased graph representations from the OMOP common data model for synthetic data generation.
- Author
-
Schulz, Nicolas Alexander, Carus, Jasmin, Wiederhold, Alexander Johannes, Johanns, Ole, Peters, Frederik, Rath, Natalie, Rausch, Katharina, Holleczek, Bernd, Katalinic, Alexander, Nennecke, Alice, Kusche, Henrik, Heinrichs, Vera, Eberle, Andrea, Luttmann, Sabine, Abnaof, Khalid, Kim-Wanner, Soo-Zin, Handels, Heinz, Germer, Sebastian, Halber, Marco, and Richter, Martin
- Subjects
*REPRESENTATIONS of graphs, *MEDICAL informatics, *DATA modeling, *NURSING informatics, *MARKOV processes, ELECTRONIC health record standards
- Abstract
Background: Generating synthetic patient data is crucial for medical research, but common approaches build up on black-box models which do not allow for expert verification or intervention. We propose a highly available method which enables synthetic data generation from real patient records in a privacy preserving and compliant fashion, is interpretable and allows for expert intervention. Methods: Our approach ties together two established tools in medical informatics, namely OMOP as a data standard for electronic health records and Synthea as a data synthetization method. For this study, data pipelines were built which extract data from OMOP, convert them into time series format, learn temporal rules by 2 statistical algorithms (Markov chain, TARM) and 3 algorithms of causal discovery (DYNOTEARS, J-PCMCI+, LiNGAM) and map the outputs into Synthea graphs. The graphs are evaluated quantitatively by their individual and relative complexity and qualitatively by medical experts. Results: The algorithms were found to learn qualitatively and quantitatively different graph representations. Whereas the Markov chain results in extremely large graphs, TARM, DYNOTEARS, and J-PCMCI+ were found to reduce the data dimension during learning. The MultiGroupDirect LiNGAM algorithm was found to not be applicable to the problem statement at hand. Conclusion: Only TARM and DYNOTEARS are practical algorithms for real-world data in this use case. As causal discovery is a method to debias purely statistical relationships, the gradient-based causal discovery algorithm DYNOTEARS was found to be most suitable. [ABSTRACT FROM AUTHOR]
- Published
- 2024
12. An MLOps Framework for GAN-based Fault Detection in Bonfiglioli’s EVO Plant.
- Author
-
Dahdal, Simon, Colombi, Lorenzo, Brina, Matteo, Gilli, Alessandro, Tortonesi, Mauro, Vignoli, Massimiliano, and Stefanelli, Cesare
- Subjects
*MACHINE learning, *GENERATIVE adversarial networks, *AUTOMATION, *SCARCITY, *ENGINEERING
- Abstract
In Industry 5.0, the scarcity of data on defective components in smart manufacturing leads to imbalanced data sets. This imbalance poses a significant challenge to the development of robust Machine Learning (ML) models, which typically require a rich variety of data for effective training. The imbalance not only restricts the models’ accuracy but also their applicability in diverse industrial scenarios. To tackle this issue, our research delves into the capabilities of Deep Generative Models, with a special focus on Generative Adversarial Networks, for the generation of synthetic data. This approach is aimed at rectifying dataset imbalances, thereby enhancing the training process of ML models. We demonstrate how synthetic data can substantially bolster the performance and reliability of ML models in industrial settings. Furthermore, the paper presents an innovative MLOps pipeline and architecture, meticulously designed to incorporate Deep Generative Models (DGMs) into the entire ML development cycle. This solution is automated and goes beyond mere automation; it is self-optimizing and capable of making necessary corrections, specifically engineered to address the dual challenges of data imbalance and scarcity, thus enabling more precise and dependable ML applications in smart manufacturing. [ABSTRACT FROM AUTHOR]
- Published
- 2024
13. HydraGAN: A Cooperative Agent Model for Multi-Objective Data Generation.
- Author
-
Desmet, Chance and Cook, Diane
- Subjects
*GENERATIVE adversarial networks, *NASH equilibrium, *DATA modeling
- Abstract
Generative adversarial networks have become a de facto approach to generate synthetic data points that resemble their real counterparts. We tackle the situation where the realism of individual samples is not the sole criterion for synthetic data generation. Additional constraints such as privacy preservation, distribution realism, and diversity promotion may also be essential to optimize. To address this challenge, we introduce HydraGAN, a multi-agent network that performs multi-objective synthetic data generation. We theoretically verify that training the HydraGAN system, containing a single generator and an arbitrary number of discriminators, leads to a Nash equilibrium. Experimental results for six datasets indicate that HydraGAN consistently outperforms prior methods in maximizing the Area under the Radar Chart, balancing a combination of cooperative or competitive data generation goals. [ABSTRACT FROM AUTHOR]
- Published
- 2024
14. Bayesian Knowledge Tracing Implemented in a Telecommunications Serious Game.
- Author
-
Nedombeloni, Halatedzi, Heymann, Reolyn, and Greeff, Japie
- Subjects
INTELLIGENT tutoring systems, EDUCATIONAL games, STUDENT-centered learning, TELECOMMUNICATION, RESEARCH questions, KNOWLEDGE gap theory
- Abstract
The University of Johannesburg has integrated serious games into its teaching, exemplified by Codebreakers, a 2D game teaching information theory. While successful, Codebreakers lacked personalisation and used a criticised assessment method based on answer streaks. Knowledge tracing algorithms, known for their effectiveness in intelligent tutoring systems, were considered to address these limitations. This led to the research question: "Can a new serious game be designed, incorporating knowledge tracing algorithms to deliver personalised learning experiences in telecommunications education?" In response, an escape-themed serious game was developed, integrating Bayesian Knowledge Tracing as a statistical student model for personalised learning. This innovative approach combines free-roam gameplay with tailored educational content, significantly advancing serious game design. While primarily aimed at enhancing Codebreakers, this new game contributes substantially to serious game theory by successfully implementing personalised learning within an engaging format. The project showcases the potential of knowledge tracing algorithms in creating adaptive, student-centered learning experiences within the context of educational games. [ABSTRACT FROM AUTHOR]
- Published
- 2024
15. Enhancing Financial Time Series Prediction with Quantum-Enhanced Synthetic Data Generation: A Case Study on the S&P 500 Using a Quantum Wasserstein Generative Adversarial Network Approach with a Gradient Penalty.
- Author
-
Orlandi, Filippo, Barbierato, Enrico, and Gatti, Alice
- Subjects
GENERATIVE adversarial networks, STANDARD & Poor's 500 Index, TIME series analysis, FORECASTING
- Abstract
This study introduces a novel Quantum Wasserstein Generative Adversarial Network approach with a Gradient Penalty (QWGAN-GP) model that leverages a quantum generator alongside a classical discriminator to synthetically generate time series data. This approach aims to accurately replicate the statistical properties of the S&P 500 index. The synthetic data generated by this model were compared to the original series using various metrics, including Wasserstein distance, Dynamic Time Warping (DTW) distance, and entropy measures, among others. The outcomes demonstrate the model's robustness, with the generated data exhibiting a high degree of fidelity to the statistical characteristics of the original data. Additionally, this study explores the applicability of the synthetic time series in enhancing prediction models. An LSTM (Long-Short Term Memory)-based model was developed to evaluate the impact of incorporating synthetic data on forecasting accuracy, particularly focusing on general trends and extreme market events. The findings reveal that models trained on a mix of synthetic and real data significantly outperform those trained solely on historical data, improving predictive performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
16. Synthesis of Hybrid Data Consisting of Chest Radiographs and Tabular Clinical Records Using Dual Generative Models for COVID-19 Positive Cases.
- Author
-
Kikuchi, Tomohiro, Hanaoka, Shouhei, Nakao, Takahiro, Takenaga, Tomomi, Nomura, Yukihiro, Mori, Harushi, and Yoshikawa, Takeharu
- Subjects
DATABASE management, MEDICAL quality control, RECEIVER operating characteristic curves, RESEARCH funding, UNIVERSITIES & colleges, CHEST X rays, DESCRIPTIVE statistics, COMMUNICATION, X-rays, COMPARATIVE studies, COVID-19
- Abstract
To generate synthetic medical data incorporating image-tabular hybrid data by merging an image encoding/decoding model with a table-compatible generative model and assess their utility. We used 1342 cases from the Stony Brook University Covid-19-positive cases, comprising chest X-ray radiographs (CXRs) and tabular clinical data as a private dataset (pDS). We generated a synthetic dataset (sDS) through the following steps: (I) dimensionally reducing CXRs in the pDS using a pretrained encoder of the auto-encoding generative adversarial networks (αGAN) and integrating them with the correspondent tabular clinical data; (II) training the conditional tabular GAN (CTGAN) on this combined data to generate synthetic records, encompassing encoded image features and clinical data; and (III) reconstructing synthetic images from these encoded image features in the sDS using a pretrained decoder of the αGAN. The utility of sDS was assessed by the performance of the prediction models for patient outcomes (deceased or discharged). For the pDS test set, the area under the receiver operating characteristic (AUC) curve was calculated to compare the performance of prediction models trained separately with pDS, sDS, or a combination of both. We created an sDS comprising CXRs with a resolution of 256 × 256 pixels and tabular data containing 13 variables. The AUC for the outcome was 0.83 when the model was trained with the pDS, 0.74 with the sDS, and 0.87 when combining pDS and sDS for training. Our method is effective for generating synthetic records consisting of both images and tabular clinical data. [ABSTRACT FROM AUTHOR]
- Published
- 2024
17. Simulated Learners in Educational Technology: A Systematic Literature Review and a Turing-like Test.
- Author
-
Käser, Tanja and Alexandron, Giora
- Abstract
Simulation is a powerful approach that plays a significant role in science and technology. Computational models that simulate learner interactions and data hold great promise for educational technology as well. Amongst others, simulated learners can be used for teacher training, for generating and evaluating hypotheses on human learning, for developing adaptive learning algorithms, for building virtual worlds in which students can practice collaboration skills with simulated pals, and for testing learning environments. This paper provides the first systematic literature review on simulated learners in the broad area of artificial intelligence in education and related fields, focusing on the decade 2010-19. We analyze the trends regarding the use of simulated learners in educational technology within this decade, the purposes for which simulated learners are being used, and how the validity of the simulated learners is assessed. We find that simulated learner models tend to represent only narrow aspects of student learning. And, surprisingly, we also find that almost half of the studies using simulated learners do not provide any evidence that their modeling addresses the most fundamental question in simulation design – is the model valid? This poses a threat to the reliability of results that are based on these models. Based on our findings, we propose that future research should focus on developing more complete simulated learner models. To validate these models, we suggest a standard and universal criterion, which is based on the lasting idea of Turing's Test. We discuss the properties of this test and its potential to move the field of simulated learners forward. [ABSTRACT FROM AUTHOR]
- Published
- 2024
18. Data-Driven ICS Network Simulation for Synthetic Data Generation.
- Author
-
Kim, Minseo, Jeon, Seungho, Cho, Jake, and Gong, Seonghyeon
- Subjects
INDUSTRIAL controls manufacturing, SYNTHETIC biology, SUPERVISORY control & data acquisition systems, RESEARCH personnel
- Abstract
Industrial control systems (ICSs) are integral to managing and optimizing processes in various industries, including manufacturing, power generation, and more. However, the scarcity of widely adopted ICS datasets hampers research efforts in areas like optimization and security. This scarcity arises due to the substantial cost and technical expertise required to create physical ICS environments. In response to these challenges, this paper presents a groundbreaking approach to generating synthetic ICS data through a data-driven ICS network simulation. We circumvent the need for expensive hardware by recreating the entire ICS environment in software. Moreover, rather than manually replicating the control logic of ICS components, we leverage existing data to autonomously generate control logic. The core of our method involves the stochastic setting of setpoints, which introduces randomness into the generated data. Setpoints serve as target values for controlling the operation of the ICS process. This approach enables us to augment existing ICS datasets and cater to the data requirements of machine learning-based ICS intrusion detection systems and other data-driven applications. Our simulated ICS environment employs virtualized containers to mimic the behavior of real-world PLCs and SCADA systems, while control logic is deduced from publicly available ICS datasets. Setpoints are generated probabilistically to ensure data diversity. Experimental results validate the fidelity of our synthetic data, emphasizing their ability to closely replicate temporal and statistical characteristics of real-world ICS networks. In conclusion, this innovative data-driven ICS network simulation offers a cost-effective and scalable solution for generating synthetic ICS data. It empowers researchers in the field of ICS optimization and security with diverse, realistic datasets, furthering advancements in this critical domain. 
Future work may involve refining the simulation model and exploring additional applications for synthetic ICS data. [ABSTRACT FROM AUTHOR]
- Published
- 2024
19. Addressing Data Scarcity in the Medical Domain: A GPT-Based Approach for Synthetic Data Generation and Feature Extraction.
- Author
-
Sufi, Fahim
- Subjects
*GENERATIVE pre-trained transformers, *LANGUAGE models, *MACHINE learning, *SCARCITY, *HOSPITAL admission & discharge, *FEATURE extraction, *SOFTWARE refactoring
- Abstract
This research confronts the persistent challenge of data scarcity in medical machine learning by introducing a pioneering methodology that harnesses the capabilities of Generative Pre-trained Transformers (GPT). In response to the limitations posed by a dearth of labeled medical data, our approach involves the synthetic generation of comprehensive patient discharge messages, setting a new standard in the field with GPT autonomously generating 20 fields. Through a meticulous review of the existing literature, we systematically explore GPT's aptitude for synthetic data generation and feature extraction, providing a robust foundation for subsequent phases of the research. The empirical demonstration showcases the transformative potential of our proposed solution, presenting over 70 patient discharge messages with synthetically generated fields, including severity and chances of hospital re-admission with justification. Moreover, the data had been deployed in a mobile solution where regression algorithms autonomously identified the correlated factors for ascertaining the severity of patients' conditions. This study not only establishes a novel and comprehensive methodology but also contributes significantly to medical machine learning, presenting the most extensive patient discharge summaries reported in the literature. The results underscore the efficacy of GPT in overcoming data scarcity challenges and pave the way for future research to refine and expand the application of GPT in diverse medical contexts. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
20. Leveraging on Synthetic Data Generation Techniques to Train Machine Learning Models for Tenaga Nasional Berhad Stock Price Movement Prediction.
- Author
-
Syahmina Mohd Nazarudin, Nur Aliah, Mohd Ariffin, Nor Hapiza, and Maskat, Ruhaila
- Published
- 2024
- Full Text
- View/download PDF
21. Computational Tool for Aircraft Fuel System Analysis.
- Author
-
Di Marzo, Marcela A. D., Calil, Pedro G., Najafabadi, Hossein Nadali, Takase, Viviam Lawrence, Mourão, Carlos H. B., and Bidinotto, Jorge H.
- Subjects
FUEL systems ,SYSTEM analysis ,SYNTHETIC fuels ,SENSOR placement ,ANALYTIC geometry ,CAPACITIVE sensors ,ARTIFICIAL satellite attitude control systems ,AIRCRAFT fuels - Abstract
Fuel level gauging in aircraft presents a significant flight mechanics challenge due to the influence of aircraft movements on measurements. Moreover, it constitutes a multidimensional problem where various sensors distributed within the tank must converge to yield a precise and single measurement, independent of the aircraft's attitude. Furthermore, fuel distribution across multiple tanks of irregular geometries complicates the readings even further. These issues critically impact safety and economy, as gauging errors may compromise flight security and lead to carrying excess weight. In response to these challenges, this research introduces a multi-stage project in aircraft fuel gauging systems, as a continuum of studies, where this first article presents a computational tool designed to simulate aircraft fuel sensor data readings as a function of fuel level, fuel tank geometry, sensor location, and aircraft attitude. Developed in an open-source environment, the tool aims to support the statistical inference required for accurate modeling in which synthetic data generation becomes a crucial component. A discretization procedure accurately maps fuel tank geometries and their mass properties. The tool, then, intersects these geometries with fuel-level planes and calculates each new volume. It integrates descriptive geometry to intersect these fuel planes with representative capacitive level-sensing probes and computes the sensor readings for the simulated flight conditions. The method is validated against geometries with analytical solutions. This process yields detailed fuel measurement responses for each sensor inside the tank, and for different analyzed fuel levels, providing insights into the sensors' signals' non-linear behavior at each analyzed aircraft attitude. The non-linear behavior is also influenced by the sensor saturation readings at 0 when above the fuel level and at 1 when submerged. 
The synthetic fuel sensor readings lay the baseline for a better understanding of how to compute the true fuel level from multiple sensor readings, and ultimately for optimizing the number of sensors used and their placement. The tool's design offers significant improvements in aircraft fuel gauging accuracy, directly impacting aerostructures and instrumentation, and it is a key aspect of flight safety, fuel management, and navigation in aerospace technology. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
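The saturation behavior described in the abstract of entry 21 (a probe reads 0 when it sits above the fuel surface and 1 when it is fully submerged) can be sketched as follows. This is a minimal illustration and not the authors' tool: it assumes a simplified 2D tank, a vertical probe, and a flat fuel surface tilted by the pitch angle. The function name, its parameters, and the sign convention for pitch are all hypothetical.

```python
import math

def probe_reading(fuel_height_at_origin, probe_x, probe_bottom, probe_top, pitch_deg=0.0):
    """Normalized reading of a vertical level probe (hypothetical model).

    The fuel surface is treated as a plane whose height at horizontal
    position x is fuel_height_at_origin + x * tan(pitch). The reading is
    the wetted fraction of the probe, clamped to the range [0, 1].
    """
    surface_z = fuel_height_at_origin + probe_x * math.tan(math.radians(pitch_deg))
    wetted = (surface_z - probe_bottom) / (probe_top - probe_bottom)
    # Saturation: 0 when the probe is entirely above the surface, 1 when submerged.
    return max(0.0, min(1.0, wetted))

half_wet = probe_reading(0.5, 0.0, 0.0, 1.0)           # surface at mid-probe
dry = probe_reading(-0.2, 0.0, 0.0, 1.0)               # surface below the probe
submerged = probe_reading(0.2, 1.0, 0.0, 1.0, 45.0)    # pitch raises the surface at x = 1
```

The clamping step is what produces the non-linear response: between the two saturation limits the reading is linear in the local surface height, and outside them it carries no level information.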
22. Large Language Models for Synthetic Tabular Health Data: A Benchmark Study.
- Author
-
MILETIC, Marko and SARIYAR, Murat
- Abstract
Synthetic tabular health data plays a crucial role in healthcare research, addressing privacy regulations and the scarcity of publicly available datasets. This is essential for diagnostic and treatment advancements. Among the most promising models are transformer-based Large Language Models (LLMs) and Generative Adversarial Networks (GANs). In this paper, we compare LLM models of the Pythia LLM Scaling Suite with varying model sizes ranging from 14M to 1B, against a reference GAN model (CTGAN). The generated synthetic data are used to train random forest estimators for classification tasks to make predictions on the real-world data. Our findings indicate that as the number of parameters increases, LLM models outperform the reference GAN model. Even the smallest 14M parameter models perform comparably to GANs. Moreover, we observe a positive correlation between the size of the training dataset and model performance. We discuss implications, challenges, and considerations for the real-world usage of LLM models for synthetic tabular data generation. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
23. Synthetic Generation of Patient Service Utilization Data: A Scalability Study.
- Author
-
HOWIE, Joseph, BALASUBRAMANIAN, Sowmya, BAMBI, Jonas, MOSELLE, Kenneth, SRINIVASAN, Venkatesh, and THOMO, Alex
- Abstract
To address privacy and ethical issues in using health data for machine learning, we evaluate the scalability of advanced synthetic data generation methods like GANs, VAEs, copulaGAN, and transformer models specifically for patient service utilization data. Our study examines five models on data from a Canadian health authority, focusing on training and generation efficiency, data resemblance, and practical utility. Our findings indicate that statistical models excel in efficiency, while most models produce synthetic data that closely mirrors real data, and is also useful for real-world applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
24. MAGAN: Mode Information and Attention-Based GAN for Realistic Time Series Data Synthesis
- Author
-
Wang, Yi, Luo, Yi, Ren, Peng, Wang, Weifan, Liu, Xianbo, Hu, Yuhang, Li, Zeming, Li, Xiangkuan, Li, Wenyao, Xing, Chunxiao, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Jin, Cheqing, editor, Yang, Shiyu, editor, Shang, Xuequn, editor, Wang, Haofen, editor, and Zhang, Yong, editor
- Published
- 2024
- Full Text
- View/download PDF
25. Explainable Generative Attention Mechanisms for Chest X-Ray Medical Image Synthesis and Diagnosis of Pediatric Pneumonia
- Author
-
Kaganzi, Francesca, Kakooza, Williams, Jjingo, Daudi, Marvin, Ggaliwango, Rocha, Álvaro, Series Editor, Hameurlain, Abdelkader, Editorial Board Member, Idri, Ali, Editorial Board Member, Vaseashta, Ashok, Editorial Board Member, Dubey, Ashwani Kumar, Editorial Board Member, Montenegro, Carlos, Editorial Board Member, Laporte, Claude, Editorial Board Member, Moreira, Fernando, Editorial Board Member, Peñalvo, Francisco, Editorial Board Member, Dzemyda, Gintautas, Editorial Board Member, Mejia-Miranda, Jezreel, Editorial Board Member, Hall, Jon, Editorial Board Member, Piattini, Mário, Editorial Board Member, Holanda, Maristela, Editorial Board Member, Tang, Mincong, Editorial Board Member, Ivanovíc, Mirjana, Editorial Board Member, Muñoz, Mirna, Editorial Board Member, Kanth, Rajeev, Editorial Board Member, Anwar, Sajid, Editorial Board Member, Herawan, Tutut, Editorial Board Member, Colla, Valentina, Editorial Board Member, Devedzic, Vladan, Editorial Board Member, Ragavendiran, S. D. Prabu, editor, Pavaloaia, Vasile Daniel, editor, Mekala, M. S., editor, and Cabezuelo, Antonio Sarasa, editor
- Published
- 2024
- Full Text
- View/download PDF
26. SDGnE: A Synthetic Data Generation and Evaluation System for Rare Event Prediction
- Author
-
Bae, Wan D., Alkobaisi, Shayma, Bhuvaji, Sartaj, Bankar, Siddheshwari, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Onizuka, Makoto, editor, Lee, Jae-Gil, editor, Tong, Yongxin, editor, Xiao, Chuan, editor, Ishikawa, Yoshiharu, editor, Amer-Yahia, Sihem, editor, Jagadish, H. V., editor, and Lu, Kejing, editor
- Published
- 2024
- Full Text
- View/download PDF
27. Image Processing and Analysis
- Author
-
Seeram, Euclid and Kanade, Vijay
- Published
- 2024
- Full Text
- View/download PDF
28. Evaluation of a Fintech Sales Synthetic Data Generation Model Using a Generative Adversarial Network
- Author
-
Lopez, Felipe A., Duran-Riveros, Marcia, Maldonado-Duran, Sebastian, Ruete, David, Costa, Giannina, Coronado-Hernandez, Jairo R., Gatica, Gustavo, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Gervasi, Osvaldo, editor, Murgante, Beniamino, editor, Garau, Chiara, editor, Taniar, David, editor, C. Rocha, Ana Maria A., editor, and Faginas Lago, Maria Noelia, editor
- Published
- 2024
- Full Text
- View/download PDF
29. Investigation of an Integrated Synthetic Dataset Generation Workflow for Computer Vision Applications
- Author
-
Rolf, Julian, Wolf, Mario, Gerhard, Detlef, Rannenberg, Kai, Editor-in-Chief, Soares Barbosa, Luís, Editorial Board Member, Carette, Jacques, Editorial Board Member, Tatnall, Arthur, Editorial Board Member, Neuhold, Erich J., Editorial Board Member, Stiller, Burkhard, Editorial Board Member, Stettner, Lukasz, Editorial Board Member, Pries-Heje, Jan, Editorial Board Member, Kreps, David, Editorial Board Member, Rettberg, Achim, Editorial Board Member, Furnell, Steven, Editorial Board Member, Mercier-Laurent, Eunika, Editorial Board Member, Winckler, Marco, Editorial Board Member, Malaka, Rainer, Editorial Board Member, Danjou, Christophe, editor, Harik, Ramy, editor, Nyffenegger, Felix, editor, Rivest, Louis, editor, and Bouras, Abdelaziz, editor
- Published
- 2024
- Full Text
- View/download PDF
30. Enhancing Object Detection Performance for Small Objects Through Synthetic Data Generation and Proportional Class-Balancing Technique: A Comparative Study in Industrial Scenarios
- Author
-
Antony, Jibinraj, Hegiste, Vinit, Nazeri, Ali, Tavakoli, Hooman, Walunj, Snehal, Plociennik, Christiane, Ruskowski, Martin, Chaari, Fakher, Series Editor, Gherardini, Francesco, Series Editor, Ivanov, Vitalii, Series Editor, Haddar, Mohamed, Series Editor, Cavas-Martínez, Francisco, Editorial Board Member, di Mare, Francesca, Editorial Board Member, Kwon, Young W., Editorial Board Member, Tolio, Tullio A. M., Editorial Board Member, Trojanowska, Justyna, Editorial Board Member, Schmitt, Robert, Editorial Board Member, Xu, Jinyang, Editorial Board Member, Wagner, Achim, editor, Alexopoulos, Kosmas, editor, and Makris, Sotiris, editor
- Published
- 2024
- Full Text
- View/download PDF
31. Insights in Data Generation: A Synthetic Data Approach for Enabling Small Datasets in Atrial Fibrillation Research
- Author
-
Salman, Ali, Goretti, Francesco, Cartocci, Alessandra, Iadanza, Ernesto, Magjarević, Ratko, Series Editor, Ładyżyński, Piotr, Associate Editor, Ibrahim, Fatimah, Associate Editor, Lackovic, Igor, Associate Editor, Rock, Emilio Sacristan, Associate Editor, Jarm, Tomaž, editor, Šmerc, Rok, editor, and Mahnič-Kalamiza, Samo, editor
- Published
- 2024
- Full Text
- View/download PDF
32. Detecting Abnormal Vehicle Behavior: A Clustering-Based Approach
- Author
-
Verma, Shrey, Parkinson, Simon, Khan, Saad, Shen, Xuemin Sherman, Series Editor, Parkinson, Simon, editor, Nikitas, Alexandros, editor, and Vallati, Mauro, editor
- Published
- 2024
- Full Text
- View/download PDF
33. Inducing a Realistic Surface Roughness onto 3D Mesh Data Using Conditional Generative Adversarial Network (cGAN)
- Author
-
Mutiargo, Bisma, Lou, Shan, Wong, Zheng Zheng, Chaari, Fakher, Series Editor, Gherardini, Francesco, Series Editor, Ivanov, Vitalii, Series Editor, Haddar, Mohamed, Series Editor, Cavas-Martínez, Francisco, Editorial Board Member, di Mare, Francesca, Editorial Board Member, Kwon, Young W., Editorial Board Member, Tolio, Tullio A.M., Editorial Board Member, Trojanowska, Justyna, Editorial Board Member, Schmitt, Robert, Editorial Board Member, Xu, Jinyang, Editorial Board Member, Maharjan, Niroj, editor, and He, Wei, editor
- Published
- 2024
- Full Text
- View/download PDF
34. Securing Smart Vehicles Through Federated Learning
- Author
-
Halim, Sadaf MD, Hossain, Md Delwar, Khan, Latifur, Singhal, Anoop, Inoue, Hiroyuki, Ochiai, Hideya, Hamlen, Kevin W., Kadobayashi, Youki, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Mosbah, Mohamed, editor, Sèdes, Florence, editor, Tawbi, Nadia, editor, Ahmed, Toufik, editor, Boulahia-Cuppens, Nora, editor, and Garcia-Alfaro, Joaquin, editor
- Published
- 2024
- Full Text
- View/download PDF
35. Generating Synthetic Brain Tumor Data Using StyleGAN3 for Lower Class Enhancement
- Author
-
Abdalaziz, Ahmed, Schwenker, Friedhelm, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Debelee, Taye Girma, editor, Ibenthal, Achim, editor, Schwenker, Friedhelm, editor, and Megersa Ayano, Yehualashet, editor
- Published
- 2024
- Full Text
- View/download PDF
36. Generating 3D Reconstructions Using Generative Models
- Author
-
Malah, Mehdi, Agaba, Ramzi, Abbas, Fayçal, and Lyu, Zhihan, editor
- Published
- 2024
- Full Text
- View/download PDF
37. Exploring Synthetic Noise Algorithms for Real-World Similar Data Generation: A Case Study on Digitally Twining Hybrid Turbo-Shaft Engines in UAV/UAS Applications
- Author
-
Aghazadeh Ardebili, Ali, Longo, Antonella, Ficarella, Antonio, Khalil, Adem, Khalil, Sabri, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Mosbah, Mohamed, editor, Kechadi, Tahar, editor, Bellatreche, Ladjel, editor, and Gargouri, Faiez, editor
- Published
- 2024
- Full Text
- View/download PDF
38. Analysis of Synthetic Data Generation Techniques in Diabetes Prediction
- Author
-
Das, Sujit Kumar, Roy, Pinki, Kumar Mishra, Arnab, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Zhang, Junjie James, Series Editor, Tan, Kay Chen, Series Editor, Borah, Malaya Dutta, editor, Laiphrakpam, Dolendro Singh, editor, Auluck, Nitin, editor, and Balas, Valentina Emilia, editor
- Published
- 2024
- Full Text
- View/download PDF
39. Analyzing the Effects of Different 3D-Model Acquisition Methods for Synthetic AI Training Data Generation and the Domain Gap
- Author
-
Albayrak, Özge Beyza, Schoepflin, Daniel, Holst, Dirk, Möller, Lars, Schüppstuhl, Thorsten, Chaari, Fakher, Series Editor, Gherardini, Francesco, Series Editor, Ivanov, Vitalii, Series Editor, Haddar, Mohamed, Series Editor, Cavas-Martínez, Francisco, Editorial Board Member, di Mare, Francesca, Editorial Board Member, Kwon, Young W., Editorial Board Member, Trojanowska, Justyna, Editorial Board Member, Xu, Jinyang, Editorial Board Member, Silva, Francisco J. G., editor, Pereira, António B., editor, and Campilho, Raul D. S. G., editor
- Published
- 2024
- Full Text
- View/download PDF
40. GovSynBayes: release of synthetic government microdata from multisources via Bayesian networks
- Author
-
Lu, Xiaotian, Piao, Chunhui, and Yang, Xingyu
- Published
- 2024
- Full Text
- View/download PDF
41. Generalising weighted model counting
- Author
-
Dilkas, Paulius, Belle, Vaishak, and Petrick, Ronald
- Subjects
Weighted Model Counting ,WMC ,artificial intelligence ,AI ,machine learning ,neural-symbolic AI ,probabilistic programming ,statistical relational AI ,synthetic data generation ,primal treewidth - Abstract
Given a formula in propositional or (finite-domain) first-order logic and some non-negative weights, weighted model counting (WMC) is a function problem that asks to compute the sum of the weights of the models of the formula. Originally used as a flexible way of performing probabilistic inference on graphical models, WMC has found many applications across artificial intelligence (AI), machine learning, and other domains. Areas of AI that rely on WMC include explainable AI, neural-symbolic AI, probabilistic programming, and statistical relational AI. WMC also has applications in bioinformatics, data mining, natural language processing, prognostics, and robotics. In this work, we are interested in revisiting the foundations of WMC and considering generalisations of some of the key definitions in the interest of conceptual clarity and practical efficiency. We begin by developing a measure-theoretic perspective on WMC, which suggests a new and more general way of defining the weights of an instance. This new representation can be as succinct as standard WMC but can also expand as needed to represent less-structured probability distributions. We demonstrate the performance benefits of the new format by developing a novel WMC encoding for Bayesian networks. We then show how existing WMC encodings for Bayesian networks can be transformed into this more general format and what conditions ensure that the transformation is correct (i.e., preserves the answer). Combining the strengths of the more flexible representation with the tricks used in existing encodings yields further efficiency improvements in Bayesian network probabilistic inference. Next, we turn our attention to the first-order setting. Here, we argue that the capabilities of practical model counting algorithms are severely limited by their inability to perform arbitrary recursive computations. 
To enable arbitrary recursion, we relax the restrictions that typically accompany domain recursion and generalise circuits (used to express a solution to a model counting problem) to graphs that are allowed to have cycles. These improvements enable us to find efficient solutions to counting fundamental structures such as injections and bijections that were previously unsolvable by any available algorithm. The second strand of this work is concerned with synthetic data generation. Testing algorithms across a wide range of problem instances is crucial to ensure the validity of any claim about one algorithm's superiority over another. However, benchmarks are often limited and fail to reveal differences among the algorithms. First, we show how random instances of probabilistic logic programs (that typically use WMC algorithms for inference) can be generated using constraint programming. We also introduce a new constraint to control the independence structure of the underlying probability distribution and provide a combinatorial argument for the correctness of the constraint model. This model allows us to, for the first time, experimentally investigate inference algorithms on more than just a handful of instances. Second, we introduce a random model for WMC instances with a parameter that influences primal treewidth, the parameter most commonly used to characterise the difficulty of an instance. We show that the easy-hard-easy pattern with respect to clause density is different for algorithms based on dynamic programming and algebraic decision diagrams than for all other solvers. We also demonstrate that all WMC algorithms scale exponentially with respect to primal treewidth, although at differing rates.
- Published
- 2023
- Full Text
- View/download PDF
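The definition given in the abstract of entry 41 (WMC computes the sum of the weights of the models of a formula) can be illustrated by brute-force enumeration. The sketch below is not taken from the thesis. It assumes the common convention that a model's weight is the product of its literal weights, and the function name and the example instance are hypothetical. It runs in time exponential in the number of variables, so it serves only to make the definition concrete.

```python
from itertools import product

def weighted_model_count(variables, formula, weight):
    """Sum the weights of all assignments that satisfy `formula`.

    variables: list of variable names
    formula:   callable taking a dict {name: bool} and returning bool
    weight:    dict mapping (name, value) to a non-negative literal weight
    """
    total = 0.0
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        if formula(model):
            # Weight of a model = product of the weights of its literals.
            w = 1.0
            for name, value in model.items():
                w *= weight[(name, value)]
            total += w
    return total

# Hypothetical instance: the formula (a or b) with probability-style weights.
variables = ["a", "b"]
weight = {("a", True): 0.3, ("a", False): 0.7,
          ("b", True): 0.6, ("b", False): 0.4}
wmc = weighted_model_count(variables, lambda m: m["a"] or m["b"], weight)
```

With these weights the three satisfying assignments contribute 0.42, 0.12 and 0.18, so the count equals the probability that at least one of two independent events occurs, 1 - 0.7 * 0.4 = 0.72.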
42. Generative Models for Synthetic Urban Mobility Data: A Systematic Literature Review.
- Author
-
KAPP, ALEXANDRA, HANSMEYER, JULIA, and MIHALJEVIĆ, HELENA
- Published
- 2024
- Full Text
- View/download PDF
43. Editing outdoor scenes with a large annotated synthetic dataset.
- Author
-
Xie, Mingye, Liu, Zongwei, Xiang, Suncheng, Liu, Ting, and Fu, Yuzhuo
- Abstract
With the continuous popularization of smartphones and their ever-evolving photographic capabilities, individuals can easily take a large number of photos in their daily lives, creating a natural impetus for image editing. With the ability of style-based GAN, images can be reasonably edited on specific semantics by manipulating the latent space of the generator, particularly for human facial photographs. However, such methods rely heavily on datasets that offer both diverse data and rich semantic annotations. Unfortunately, there is no such dataset for outdoor scenes with diverse and complex structural content, which makes current editing methods almost ineffective. To overcome these challenges, we first construct an extensive synthetic outdoor scene dataset with fine-grained semantic annotations based on an automated process. Based on it, we propose an editing network dedicated to multi-class annotations that can efficiently edit specific attributes while preserving others as much as possible. Extensive experiments evince that our method achieves better performance in outdoor scene editing, especially with regard to distance and viewpoint across several outdoor scene datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
44. Enhancing Spam Detection with GANs and BERT Embeddings: A Novel Approach to Imbalanced Datasets.
- Author
-
Filali, Adnane, Alaoui, El Arbi Abdellaoui, and Merras, Mostafa
- Subjects
SPAM email ,MACHINE learning ,GENERATIVE adversarial networks ,DATA augmentation - Abstract
In recent years, the prevalence of imbalanced datasets has posed significant challenges to traditional machine learning models. This imbalance is especially pronounced in fields such as spam detection, where malicious or unwanted messages are typically outnumbered by legitimate ones. Although various techniques have been developed to address this disparity, most conventional methods either undersample the majority class or oversample the minority class, potentially leading to information loss or overfitting. In this study, we propose a novel approach using Generative Adversarial Networks (GANs) to generate synthetic samples, thus enhancing the representation of the minority class. By leveraging the powerful BERT embeddings to capture the intricate textual nuances, our model strives to produce synthetic spam messages that are not only realistic but also diverse. Initial results indicate that our GAN-augmented model offers a noticeable improvement in detecting spam messages compared to traditional techniques. This advancement not only holds potential for spam detection but also suggests broader applicability in addressing dataset imbalance across various domains. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
45. Minimal data requirement for realistic endoscopic image generation with Stable Diffusion.
- Author
-
Kaleta, Joanna, Dall'Alba, Diego, Płotka, Szymon, and Korzeniowski, Przemysław
- Abstract
Purpose: Computer-assisted surgical systems provide support information to the surgeon, which can improve the execution and overall outcome of the procedure. These systems are based on deep learning models that are trained on complex and challenging-to-annotate data. Generating synthetic data can overcome these limitations, but it is necessary to reduce the domain gap between real and synthetic data. Methods: We propose a method for image-to-image translation based on a Stable Diffusion model, which generates realistic images starting from synthetic data. Compared to previous works, the proposed method is better suited for clinical application as it requires a much smaller amount of input data and allows finer control over the generation of details by introducing different variants of supporting control networks. Results: The proposed method is applied in the context of laparoscopic cholecystectomy, using synthetic and real data from public datasets. It achieves a mean Intersection over Union of 69.76%, significantly improving the baseline results (69.76 vs. 42.21%). Conclusions: The proposed method for translating synthetic images into images with realistic characteristics will enable the training of deep learning methods that can generalize optimally to real-world contexts, thereby improving computer-assisted intervention guidance systems. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
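The abstract of entry 45 reports its result as a mean Intersection over Union (mIoU). For readers unfamiliar with the metric, the sketch below computes it on flat lists of per-pixel class labels. It is a generic illustration and not the authors' evaluation code: the function names, the toy labels, and the choice to skip classes absent from both prediction and ground truth are assumptions.

```python
def iou(pred, target, cls):
    """Intersection over Union for one class; None if the class never occurs."""
    inter = union = 0
    for p, t in zip(pred, target):
        in_pred, in_target = (p == cls), (t == cls)
        inter += in_pred and in_target   # pixel labeled cls in both
        union += in_pred or in_target    # pixel labeled cls in at least one
    return inter / union if union else None

def mean_iou(pred, target, classes):
    """Average the per-class IoU over classes that occur at least once."""
    scores = [s for s in (iou(pred, target, c) for c in classes) if s is not None]
    if not scores:
        raise ValueError("none of the classes occur in pred or target")
    return sum(scores) / len(scores)

# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, classes=[0, 1, 2])
```

Here the per-class values are 1/3, 2/3 and 1/2, so the mean is 0.5. A reported mIoU such as 69.76% is this quantity expressed as a percentage and averaged over the evaluation set.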
46. SYNTHETIC DATA GENERATION FOR ANN MODELING OF THE HYDRODYNAMIC PROCESSES OF IN-SITU LEACHING.
- Author
-
Aizhulov, Daniar, Kurmanseiit, Maksat, Shayakhmetov, Nurlan, Tungatarova, Madina, and Suleimenova, Ainur
- Subjects
COMPUTATIONAL fluid dynamics ,LEACHING ,DATA modeling ,ARTIFICIAL neural networks ,ACQUISITION of data - Abstract
The work presents an approach to enhance the forecasting capabilities of In-Situ Leaching processes during both the production stage and early prognosis. ISL, a crucial method for resource extraction, demands rapid on-site forecasting to guide the deployment of new technological blocks. Traditional modeling techniques, though effective, are hindered by their computational demands and network throughput requirements, particularly when dealing with substantial datasets or remote computing needs. The integration of AI technologies, specifically neural networks, offers a promising opportunity for expedited calculations by leveraging the power of forward propagation through pretrained neural models. However, a critical challenge lies in transforming conventional numerical datasets into a format suitable for neural modeling. Furthermore, the scarcity of training data during the production phase, where vital parameters are concealed underground, poses an additional challenge in training AI models for In-Situ Leaching processes. This research addresses these challenges by proposing a methodology for generating training data tailored to the most resource-intensive Computational Fluid Dynamics problems encountered during modeling. Traditional numerical modeling techniques are harnessed to construct training datasets comprising input and corresponding expected output data, with a particular focus on varying well network patterns. Subsequent efforts are directed at the conversion of the acquired data into a format compatible with neural networks. The data is normalized to align with the data ranges stipulated by the activation functions employed within the neural network architecture. This preprocessing step ensures that the neural model can effectively learn from the generated data, facilitating accurate forecasting of In-Situ Leaching processes. 
An advantage of the proposed technique lies in the provision of large, reliable datasets for training a neural network to predict hydrodynamic properties based on the technological regimes currently active or expected on an ISL site. A major implication of this approach is that pre-trained AI technologies can be applied to forecast the future hydrodynamic regime in the stratum, or to determine the current one, circumventing the costly deterministic simulations currently deployed at mining sites. Hence, the innovative approach outlined in this paper holds promise for optimizing forecasting, allowing for quicker and more efficient decision-making in resource extraction operations while getting around the computational barriers associated with traditional methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
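The abstract of entry 46 states that the simulated data are normalized to the ranges stipulated by the network's activation functions but does not say which scheme is used. One common choice is min-max scaling, sketched below as an assumption: the function name and the example values are hypothetical. The target range would be [0, 1] for a sigmoid output and [-1, 1] for tanh.

```python
def scale_to_range(values, low=0.0, high=1.0):
    """Min-max scale a list of numbers linearly into [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant input: map everything to the midpoint of the target range.
        return [(low + high) / 2.0 for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]

# Hypothetical simulated values, scaled for a tanh-activated network.
raw = [10.0, 15.0, 20.0]
scaled = scale_to_range(raw, low=-1.0, high=1.0)
```

In practice the minimum and maximum would be computed on the training set and stored, so that the same transform (and its inverse, for network outputs) can be applied at prediction time.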
47. Enhancing manufacturing operations with synthetic data: a systematic framework for data generation, accuracy, and utility.
- Author
-
Buggineni, Vishnupriya, Cheng Chen, and Camelio, Jaime
- Subjects
DATA privacy ,RESEARCH personnel ,MANUFACTURING processes ,CONTINUOUS processing ,MACHINE learning ,ELECTRONIC data processing - Abstract
Addressing the challenges of data scarcity and privacy, synthetic data generation offers an innovative solution that advances manufacturing assembly operations and data analytics. Serving as a viable alternative, it enables manufacturers to leverage a broader and more diverse range of machine learning models by incorporating the creation of artificial data points for training and evaluation. Current methods lack a generalizable framework for researchers to follow and solve these issues. The development of synthetic data sets, however, can make up for missing samples and enable researchers to understand existing issues within the manufacturing process and create data-driven tools for reducing manufacturing costs. This paper systematically reviews both discrete and continuous manufacturing process data types with their applicable synthetic generation techniques. The proposed framework entails four main stages: data collection, pre-processing, synthetic data generation, and evaluation. To validate the framework's efficacy, a case study leveraging synthetic data enabled an exploration of complex defect classification challenges in the packaging process. The results show enhanced prediction accuracy and provide a detailed comparative analysis of various synthetic data strategies. This paper concludes by highlighting our framework's transformative potential for researchers, educators, and practitioners and provides scalable guidance to solve the data challenges in the current manufacturing sector. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
48. Variable Selection in Data Analysis: A Synthetic Data Toolkit.
- Author
-
Mitra, Rohan, Ali, Eyad, Varam, Dara, Sulieman, Hana, and Kamalov, Firuz
- Subjects
DATA analysis ,MATHEMATICAL analysis ,MATHEMATICAL models ,ALGORITHMS - Abstract
Variable (feature) selection plays an important role in data analysis and mathematical modeling. This paper aims to address the significant lack of formal evaluation benchmarks for feature selection algorithms (FSAs). To evaluate FSAs effectively, controlled environments are required, and the use of synthetic datasets offers significant advantages. We introduce a set of ten synthetically generated datasets with known relevance, redundancy, and irrelevance of features, derived from various mathematical, logical, and geometric sources. Additionally, eight FSAs are evaluated on these datasets based on their relevance and novelty. The paper first introduces the datasets and then provides a comprehensive experimental analysis of the performance of the selected FSAs on these datasets, including testing the FSAs' resilience on two types of induced data noise. The analysis has guided the grouping of the generated datasets into four groups of data complexity. Lastly, we provide public access to the generated datasets to facilitate benchmarking of new feature selection algorithms in the field via our GitHub repository. The contributions of this paper aim to foster the development of novel feature selection algorithms and advance their study. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
49. Generative Pre-Trained Transformer (GPT) in Research: A Systematic Review on Data Augmentation.
- Author
-
Sufi, Fahim
- Subjects
GENERATIVE pre-trained transformers ,DATA augmentation ,LANGUAGE models ,RESEARCH questions - Abstract
GPT (Generative Pre-trained Transformer) represents advanced language models that have significantly reshaped the academic writing landscape. These sophisticated language models offer invaluable support throughout all phases of research work, facilitating idea generation, enhancing drafting processes, and overcoming challenges like writer's block. Their capabilities extend beyond conventional applications, contributing to critical analysis, data augmentation, and research design, thereby elevating the efficiency and quality of scholarly endeavors. Strategically narrowing its focus, this review explores alternative dimensions of GPT and LLM applications, specifically data augmentation and the generation of synthetic data for research. Employing a meticulous examination of 412 scholarly works, it distills a selection of 77 contributions addressing three critical research questions: (1) GPT on Generating Research data, (2) GPT on Data Analysis, and (3) GPT on Research Design. The systematic literature review adeptly highlights the central focus on data augmentation, encapsulating 48 pertinent scholarly contributions, and extends to the proactive role of GPT in critical analysis of research data and shaping research design. Pioneering a comprehensive classification framework for "GPT's use on Research Data", the study classifies existing literature into six categories and 14 sub-categories, providing profound insights into the multifaceted applications of GPT in research data. This study meticulously compares 54 pieces of literature, evaluating research domains, methodologies, and advantages and disadvantages, providing scholars with profound insights crucial for the seamless integration of GPT across diverse phases of their scholarly pursuits. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
50. Synthetic data generation techniques for training deep acoustic siren identification networks
- Author
-
Stefano Damiano, Benjamin Cramer, Andre Guntoro, and Toon van Waterschoot
- Subjects
synthetic data generation ,acoustic simulation ,moving sound sources ,data augmentation ,siren identification ,convolutional neural networks ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Acoustic sensing has been widely exploited for the early detection of harmful situations in urban environments: in particular, several siren identification algorithms based on deep neural networks have been developed and have proven robust to the noisy and non-stationary urban acoustic scene. Although high classification accuracy can be achieved when training and evaluating on the same dataset, the cross-dataset performance of such models remains unexplored. To build robust models that generalize well to unseen data, large datasets that capture the diversity of the target sounds are needed, whose collection is generally expensive and time consuming. To overcome this limitation, in this work we investigate synthetic data generation techniques for training siren identification models. To obtain siren source signals, we either collect from public sources a small set of stationary, recorded siren sounds, or generate them synthetically. We then simulate source motion, acoustic propagation and Doppler effect, and finally combine the resulting signal with background noise. This way, we build two synthetic datasets used to train three different convolutional neural networks, then tested on real-world datasets unseen during training. We show that the proposed training strategy based on the use of recorded source signals and synthetic acoustic propagation performs best. In particular, this method leads to models that exhibit a better generalization ability, as compared to training and evaluating in a cross-dataset setting. Moreover, the proposed method loosens the data collection requirement and is entirely built using publicly available resources.
- Published
- 2024
- Full Text
- View/download PDF