223 results for "Kesheng Wu"
Search Results
2. Design and implementation of I/O performance prediction scheme on HPC systems through large-scale log analysis
- Author
-
Sunggon Kim, Alex Sim, Kesheng Wu, Suren Byna, and Yongseok Son
- Subjects
Information Systems and Management, Computer Networks and Communications, Hardware and Architecture, Information Systems - Abstract
Large-scale high performance computing (HPC) systems typically consist of many thousands of CPUs and storage units used by hundreds to thousands of users simultaneously. Applications from large numbers of users have diverse characteristics, such as varying computation, communication, memory, and I/O intensity. A good understanding of the performance characteristics of each user application is important for job scheduling and resource provisioning. Among these performance characteristics, I/O performance is becoming increasingly important as data sizes rapidly increase and large-scale applications, such as simulation and model training, are widely adopted. However, predicting I/O performance is difficult because I/O systems are shared among all users and involve many layers of software and hardware stack, including the application, network interconnect, operating system, file system, and storage devices. Furthermore, updates to these layers and changes in system management policy can significantly alter the I/O behavior of applications and the entire system. To improve the prediction of the I/O performance on HPC systems, we propose integrating information from several different system logs and developing a regression-based approach to predict the I/O performance. Our proposed scheme can dynamically select the most relevant features from the log entries using various feature selection algorithms and scoring functions, and can automatically select the regression algorithm with the best accuracy for the prediction task. The evaluation results show that our proposed scheme can predict the write performance with up to 90% prediction accuracy and the read performance with up to 99% prediction accuracy using the real logs from the Cori supercomputer system at NERSC.
- Published
- 2023
- Full Text
- View/download PDF
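The two-step scheme described in the abstract above (select the most relevant log features, then automatically choose the regressor with the best accuracy) can be sketched with scikit-learn. Everything below is illustrative: the synthetic feature matrix, the two candidate models, and the scoring function are stand-ins, not the authors' pipeline or the Cori logs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in for features merged from several system logs
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=200)  # stand-in I/O bandwidth

# Step 1: keep only the log features most relevant to the target,
# using a scoring function (here F-regression).
selector = SelectKBest(f_regression, k=4).fit(X, y)
X_sel = selector.transform(X)

# Step 2: automatically pick the regression algorithm with the best
# cross-validated score on the selected features.
candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
best_name = max(
    candidates,
    key=lambda name: cross_val_score(candidates[name], X_sel, y, cv=5).mean(),
)
print(best_name)
```

On this linear synthetic target the plain linear model wins; on real log data the automatic selection step is what matters, since the best regressor changes with the workload.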
3. The Imperial Valley Dark Fiber Project: Toward Seismic Studies Using DAS and Telecom Infrastructure for Geothermal Applications
- Author
-
Jonathan Ajo-Franklin, Verónica Rodríguez Tribaldos, Avinash Nayak, Feng Cheng, Robert Mellors, Benxin Chi, Todd Wood, Michelle Robertson, Cody Rotermund, Eric Matzel, Dennise C. Templeton, Christina Morency, Kesheng Wu, Bin Dong, and Patrick Dobson
- Subjects
Geophysics - Abstract
The Imperial Valley is a seismically active basin occupying the southern end of the Salton trough, an area of rapid extension, high heat flow, and abundant geothermal resources. This report describes an ongoing large-scale distributed acoustic sensing (DAS) recording study acquiring high-density seismic data on an array between Calipatria and Imperial, California. This 27 km array, operating on dark fiber since 9 November 2020, has recorded a wealth of local seismic events as well as ambient noise. The goal of the broader Imperial Valley Dark Fiber project is to evaluate passive DAS as a tool for geothermal exploration and monitoring. This report is intended to provide installation information, noise characteristics, and metadata for future studies utilizing the data set. Because of the relatively small number of basin-scale DAS studies that have been conducted to date, we also provide a range of lessons learned during the deployment to assist future researchers exploring this acquisition strategy.
- Published
- 2022
- Full Text
- View/download PDF
4. Improving nonnegative matrix factorization with advanced graph regularization
- Author
-
Xiaoxia Zhang, Degang Chen, Hong Yu, Guoyin Wang, Houjun Tang, and Kesheng Wu
- Subjects
Information Systems and Management, Artificial Intelligence, Control and Systems Engineering, Software, Computer Science Applications, Theoretical Computer Science - Published
- 2022
- Full Text
- View/download PDF
5. Identification and detoxification of AFB1 transformation product in the peanut oil refining process
- Author
-
Tianying Lu, Yuqian Guo, Zheling Zeng, Kesheng Wu, Xiaoyang Li, and Yonghua Xiong
- Subjects
Food Science, Biotechnology - Published
- 2023
- Full Text
- View/download PDF
6. Distributed Acoustic Sensing Using Dark Fiber for Array Detection of Regional Earthquakes
- Author
-
Eric Matzel, Kesheng Wu, Jonathan Ajo-Franklin, Indermohan Monga, and Dennise Templeton
- Subjects
Geophysics, Acoustics, Fiber, Distributed acoustic sensing, Geology - Abstract
The intrinsic array nature of distributed acoustic sensing (DAS) makes it suitable for applying beamforming techniques commonly used in traditional seismometer arrays for enhancing weak and coherent seismic phases from distant seismic events. We test the capacity of a dark-fiber DAS array in the Sacramento basin, northern California, to detect small earthquakes at The Geysers geothermal field, at a distance of ∼100 km from the DAS array, using beamforming. We use a slowness range appropriate for ∼0.5–1.0 Hz surface waves that are well recorded by the DAS array. To take advantage of the large aperture, we divide the ∼20 km DAS cable into eight subarrays of aperture ∼1.5–2.0 km each, and apply beamforming independently to each subarray using phase-weighted stacking. The presence of subarrays of different orientations provides some sensitivity to back azimuth. We apply a short-term average/long-term average detector to the beam at each subarray. Simultaneous detections over multiple subarrays, evaluated using a voting scheme, are inferred to be caused by the same earthquake, whereas false detections caused by anthropogenic noise are expected to be localized to one or two subarrays. Analyzing 45 days of continuous DAS data, we were able to detect all earthquakes with M≥2.4, while missing most of the smaller magnitude earthquakes, with no false detections due to seismic noise. In comparison, a single broadband seismometer co-located with the DAS array was unable to detect any earthquake of M
- Published
- 2021
- Full Text
- View/download PDF
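The per-subarray detection step described above rests on a standard short-term average/long-term average (STA/LTA) trigger. A minimal NumPy version on a synthetic trace, with invented window lengths, amplitudes, and threshold (not those used in the study), looks like this:

```python
import numpy as np

def sta_lta(trace, n_sta, n_lta):
    """Ratio of short-term to long-term moving averages of signal energy."""
    energy = trace ** 2
    sta = np.convolve(energy, np.ones(n_sta) / n_sta, mode="same")
    lta = np.convolve(energy, np.ones(n_lta) / n_lta, mode="same")
    return sta / np.maximum(lta, 1e-12)

rng = np.random.default_rng(1)
trace = rng.normal(scale=0.1, size=2000)               # background noise
trace[1200:1260] += 2.0 * np.sin(np.linspace(0.0, 30.0, 60))  # synthetic "event"

ratio = sta_lta(trace, n_sta=20, n_lta=500)
triggers = np.flatnonzero(ratio > 4.0)  # samples where the detector fires
print(triggers.min(), triggers.max())
```

In the study, coincident triggers across multiple subarrays are then combined with a voting scheme, which rejects anthropogenic noise that only excites one or two subarrays.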
7. An empirical study of I/O separation for burst buffers in HPC systems
- Author
-
Katie Antypas, Eun-Kyu Byun, Jialin Liu, Donghun Koo, Soonwook Hwang, Jae-Hyuck Kwak, Kesheng Wu, Jaehwan Lee, Glenn K. Lockwood, and Hyeonsang Eom
- Subjects
Computer Networks and Communications, Computer science, Stream-aware, Theoretical Computer Science, Scheduling (computing), Computer Software, Artificial Intelligence, Leverage (statistics), I/O separation, Evaluation, Input/output, Write amplification, Fragmentation (computing), Solid-state storage, Byte, Burst buffer, Supercomputer, Hardware and Architecture, Operating system, Multi-streamed SSD, Distributed Computing, Software, Garbage collection - Abstract
To meet the exascale I/O requirements of high-performance computing (HPC), a new I/O subsystem, the Burst Buffer, based on solid-state drives (SSDs), has been developed. However, diverse HPC workloads and bursty I/O patterns cause severe data fragmentation that requires costly garbage collection (GC) and increases the number of bytes written to the SSD. To address this data fragmentation challenge, a new multi-stream feature has been developed for SSDs. In this work, we develop an I/O separation scheme called BIOS that leverages this multi-stream feature to group I/O streams based on user IDs. We propose a stream-aware scheduling policy based on burst buffer pools in the workload manager, and integrate BIOS with the workload manager to optimize the I/O separation scheme in the burst buffer. We evaluate the proposed framework with burst buffer I/O traces from the Cori supercomputer, including a diverse set of applications. Experimental results show that BIOS improves performance by 1.44x on average and reduces the Write Amplification Factor (WAF) by up to 1.20x. These results demonstrate the potential benefits of the I/O separation scheme for solid-state storage systems.
- Published
- 2021
- Full Text
- View/download PDF
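The core idea of grouping I/O streams by user ID can be shown in a toy form. The stream count, the modulo assignment, and the request format below are all invented for illustration; the actual BIOS scheme operates inside the burst-buffer workload manager, not at this level.

```python
from collections import defaultdict

NUM_STREAMS = 4  # assumed number of streams the multi-stream SSD exposes

def stream_for_user(uid: int) -> int:
    # Writes from the same user go to the same stream, so data with similar
    # lifetime is placed together and garbage collection touches fewer blocks.
    return uid % NUM_STREAMS

requests = [(1001, "a.dat"), (1002, "b.dat"), (1001, "c.dat"), (1005, "d.dat")]
streams = defaultdict(list)
for uid, fname in requests:
    streams[stream_for_user(uid)].append(fname)

print(dict(streams))
```

The payoff is that blocks within one stream tend to be invalidated together, which is what reduces the write amplification the abstract reports.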
8. Exploring Large All-Flash Storage System with Scientific Simulation
- Author
-
Junmin Gu, Greg Eisenhauer, Scott Klasky, Norbert Podhorszki, Ruonan Wang, and Kesheng Wu
- Published
- 2022
- Full Text
- View/download PDF
9. Predicting Slow Network Transfers in Scientific Computing
- Author
-
Robin Shao, Jinoh Kim, Alex Sim, and Kesheng Wu
- Published
- 2022
- Full Text
- View/download PDF
10. Access Trends of In-network Cache for Scientific Data
- Author
-
Ruize Han, Alex Sim, Kesheng Wu, Inder Monga, Chin Guok, Frank Würthwein, Diego Davila, Justas Balcas, and Harvey Newman
- Subjects
Networking and Internet Architecture (cs.NI), Performance (cs.PF), FOS: Computer and information sciences, Computer Science - Networking and Internet Architecture, Computer Science - Machine Learning, Computer Science - Performance, Computer Science - Distributed, Parallel, and Cluster Computing, Distributed, Parallel, and Cluster Computing (cs.DC), Machine Learning (cs.LG) - Abstract
Scientific collaborations are increasingly relying on large volumes of data for their work, and many of them employ tiered systems to replicate the data to their worldwide user communities. Each user in the community often selects a different subset of data for their analysis tasks; however, members of a research group are often working on related research topics that require similar data objects. Thus, a significant amount of data sharing is possible. In this work, we study the access traces of a federated storage cache known as the Southern California Petabyte Scale Cache. By studying the access patterns and potential for network traffic reduction by this caching system, we aim to explore the predictability of cache usage and the potential for more general in-network data caching. Our study shows that this distributed storage cache is able to reduce the network traffic volume by a factor of 2.35 during a part of the study period. We further show that machine learning models could predict cache utilization with an accuracy of 0.88. This demonstrates that such cache usage is predictable, which could be useful for managing complex networking resources such as in-network caching.
- Published
- 2022
- Full Text
- View/download PDF
11. Adaptive Optimization for Sparse Data on Heterogeneous GPUs
- Author
-
Yujing Ma, Florin Rusu, Kesheng Wu, and Alexander Sim
- Published
- 2022
- Full Text
- View/download PDF
12. Using Multi-Resolution Data to Accelerate Neural Network Training in Scientific Applications
- Author
-
Kewei Wang, Sunwoo Lee, Jan Balewski, Alex Sim, Peter Nugent, Ankit Agrawal, Alok Choudhary, Kesheng Wu, and Wei-Keng Liao
- Published
- 2022
- Full Text
- View/download PDF
13. Identification of Deoxynivalenol and Degradation Products during Maize Germ Oil Refining Process
- Author
-
Yuqian Guo, Tianying Lu, Jiacheng Shi, Xiaoyang Li, Kesheng Wu, and Yonghua Xiong
- Subjects
Health (social science), deoxynivalenol, maize germ oil, refining, degradation products, Plant Science, Health Professions (miscellaneous), Microbiology, Food Science - Abstract
Deoxynivalenol (DON) contamination in germs and germ oil poses a serious threat to food and feed security. However, the transformation pathway and the distribution of DON and its degradation products during edible oil refining have not yet been reported in detail. In this work, we systematically explored the variation of DON in maize germ oil during refining and demonstrated that the DON in germ oil can be effectively removed by refining, during which part of the DON was transferred to the waste streams and another portion was degraded during degumming and alkali refining. Moreover, the DON degradation product was identified as norDON B using ultraviolet absorption spectroscopy, high-performance liquid chromatography (HPLC), ultra-high-performance liquid chromatography-quadrupole time-of-flight mass spectrometry (UPLC-Q-TOF MS), and nuclear magnetic resonance (NMR) methods, and this degradation product was found to be distributed in the waste products of oil refining. This study provides a scientific basis and useful reference for the production of mycotoxin-free edible oil by traditional refining.
- Published
- 2022
14. Improving I/O Performance for Exascale Applications Through Online Data Layout Reorganization
- Author
-
Ruonan Wang, Lipeng Wan, Jean-Luc Vay, Scott Klasky, Jieyang Chen, Ian Foster, Todd Munson, Dmitry Ganyushin, Axel Huebl, Ana Gainaru, Xin Liang, Kesheng Wu, Junmin Gu, Norbert Podhorszki, and Franz Poeschel
- Subjects
Large class, FOS: Computer and information sciences, Optimization, Distributed databases, Computer science, Layout, Fidelity, IO performance, data access optimization, Computer Software, Heuristic algorithms, Arrays, Auxiliary memory, File system, data layout, Communications Technologies, WarpX, Distributed database, Data layout, Dynamic data, Computational modeling, Parallel IO, data layout IO, Exascale computing, Computational Theory and Mathematics, Computer architecture, Computer Science - Distributed, Parallel, and Cluster Computing, Hardware and Architecture, Signal Processing, Performance evaluation, Distributed, Parallel, and Cluster Computing (cs.DC), Distributed Computing - Abstract
The applications being developed within the U.S. Exascale Computing Project (ECP) to run on imminent Exascale computers will generate scientific results with unprecedented fidelity and record turn-around time. Many of these codes are based on particle-mesh methods and use advanced algorithms, especially dynamic load-balancing and mesh-refinement, to achieve high performance on Exascale machines. Yet, as such algorithms improve parallel application efficiency, they raise new challenges for I/O logic due to their irregular and dynamic data distributions. Thus, while the enormous data rates of Exascale simulations already challenge existing file system write strategies, the need for efficient read and processing of generated data introduces additional constraints on the data layout strategies that can be used when writing data to secondary storage. We review these I/O challenges and introduce two online data layout reorganization approaches for achieving good tradeoffs between read and write performance. We demonstrate the benefits of using these two approaches for the ECP particle-in-cell simulation WarpX, which serves as a motif for a large class of important Exascale applications. We show that by understanding application I/O patterns and carefully designing data layouts we can increase read performance by more than 80%. (12 pages, 15 figures; accepted by IEEE Transactions on Parallel and Distributed Systems.)
- Published
- 2022
15. Locating Partial Discharges in Power Transformers with Convolutional Iterative Filtering
- Author
-
Jonathan Wang, Kesheng Wu, Alex Sim, and Seongwook Hwangbo
- Subjects
source location, waveform analysis, nonlinear wave propagation, UHF measurements, Electrical and Electronic Engineering, FDTD methods, time of arrival estimation, Biochemistry, Instrumentation, partial discharges, Atomic and Molecular Physics, and Optics, Analytical Chemistry - Abstract
The most common source of transformer failure is in the insulation, and the most prevalent warning signal of insulation weakness is partial discharge (PD). Locating the positions of these partial discharges would help in repairing the transformer to prevent failures. This work investigates algorithms that could be deployed to locate the position of a PD event using data from ultra-high frequency (UHF) sensors inside the transformer. These algorithms typically proceed in two steps: first determining the signal arrival time, and then locating the position based on time differences. This paper reviews available methods for each task and then proposes new algorithms: a convolutional iterative filter with thresholding (CIFT) to determine the signal arrival time, and a reference table of travel times to resolve the source location. The effectiveness of these algorithms is tested with a set of laboratory-triggered PD events and two sets of simulated PD events inside transformers in production use. Tests show the new approach provides more accurate locations than the best-known data analysis algorithms, and the difference is particularly large (3.7x) when the signal sources are far from the sensors.
- Published
- 2023
- Full Text
- View/download PDF
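The two-step procedure above (pick arrival times, then resolve the location from time differences via a reference table) can be sketched as follows. The simple threshold picker stands in for the paper's CIFT, and the table entries are fabricated numbers, not actual transformer geometry or travel times.

```python
import numpy as np

def pick_arrival(signal, threshold):
    """Index of the first sample whose absolute amplitude exceeds the threshold."""
    idx = np.flatnonzero(np.abs(signal) > threshold)
    return int(idx[0]) if idx.size else None

# Reference table: candidate source positions -> expected arrival-time offsets
# at each of three sensors (fabricated values for illustration).
table = {
    "pos_A": np.array([0.0, 3.0, 5.0]),
    "pos_B": np.array([4.0, 0.0, 2.0]),
}

def locate(measured):
    """Pick the table entry whose relative delays best match the measurement."""
    measured = measured - measured.min()  # only time *differences* matter
    return min(table, key=lambda p: np.sum((table[p] - measured) ** 2))

print(locate(np.array([10.0, 13.2, 14.8])))  # delays closest to pos_A's pattern
```

Using relative delays sidesteps the unknown absolute emission time of the PD event, which is why the localization step works from time differences rather than raw arrival times.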
16. The SENSEI Generic In Situ Interface: Tool and Processing Portability at Scale
- Author
-
E. Wes Bethel, Burlen Loring, Utkarsh Ayachit, David Camp, Earl P. N. Duque, Nicola Ferrier, Joseph Insley, Junmin Gu, James Kress, Patrick O’Leary, David Pugmire, Silvio Rizzi, David Thompson, Gunther H. Weber, Brad Whitlock, Matthew Wolf, and Kesheng Wu
- Published
- 2022
- Full Text
- View/download PDF
17. Proximity Portability and in Transit, M-to-N Data Partitioning and Movement in SENSEI
- Author
-
E. Wes Bethel, Burlen Loring, Utkarsh Ayachit, Earl P. N. Duque, Nicola Ferrier, Joseph Insley, Junmin Gu, James Kress, Patrick O’Leary, Dave Pugmire, Silvio Rizzi, David Thompson, Will Usher, Gunther H. Weber, Brad Whitlock, Matthew Wolf, and Kesheng Wu
- Published
- 2022
- Full Text
- View/download PDF
18. Performance of the Gold Standard and Machine Learning in Predicting Vehicle Transactions
- Author
-
Alina Lazar, Ling Jin, Caitlin Brown, C. Anna Spurlock, Alexander Sim, and Kesheng Wu
- Published
- 2021
- Full Text
- View/download PDF
19. An In-Depth I/O Pattern Analysis in HPC Systems
- Author
-
Jiwoo Bang, Chungyong Kim, Kesheng Wu, Alex Sim, Suren Byna, Hanul Sung, and Hyeonsang Eom
- Published
- 2021
- Full Text
- View/download PDF
20. Asynchronous I/O Strategy for Large-Scale Deep Learning Applications
- Author
-
Sunwoo Lee, Qiao Kang, Kewei Wang, Jan Balewski, Alex Sim, Ankit Agrawal, Alok Choudhary, Peter Nugent, Kesheng Wu, and Wei-keng Liao
- Published
- 2021
- Full Text
- View/download PDF
21. Multiplexed lateral flow immunoassay based on inner filter effect for mycotoxin detection in maize
- Author
-
Hu Jiang, Hu Su, Kesheng Wu, Zemin Dong, Xiangmin Li, Lijuan Nie, Yuankui Leng, and Yonghua Xiong
- Subjects
Materials Chemistry, Metals and Alloys, Electrical and Electronic Engineering, Condensed Matter Physics, Instrumentation, Surfaces, Coatings and Films, Electronic, Optical and Magnetic Materials - Published
- 2023
- Full Text
- View/download PDF
22. Extracting Signals from High-Frequency Trading with Digital Signal Processing Tools
- Author
-
Horst D. Simon, Marcos Lopez de Prado, Kesheng Wu, and Jung Heon Song
- Subjects
Flash crash, Information Systems and Management, Computer science, Strategy and Management, Big data, Computational Theory and Mathematics, Artificial Intelligence, Dominance (economics), Frequency domain, Econometrics, Business, Management and Accounting (miscellaneous), Business and International Management, Algorithmic trading, High-frequency trading, Futures contract, Finance, Digital signal processing, Information Systems - Abstract
As algorithms replace a growing number of tasks performed by humans in the markets, there have been growing concerns about an increased likelihood of cascading events, similar to the Flash Crash of May 6, 2010. To address these concerns, researchers have employed a number of scientific data analysis tools to monitor the risk of such cascading events. As an example, the authors of this article investigate the natural gas (NG) futures market in the frequency domain and the interaction between weather forecasts and NG price data. They observe that Fourier components with high frequencies have become more prominent in recent years and are much stronger than could be expected from an analytical model of the market. Additionally, a significant amount of trading activity occurs in the first few seconds of every minute, which is a tell-tale sign of time-based algorithmic trading. To illustrate the potential of cascading events, the authors further study how weather forecasts drive NG prices and show that, after separating the time series by season to account for the different mechanisms that relate temperature to NG price, the temperature forecast is indeed cointegrated with NG price. They also show that the variations in temperature forecasts contribute to a significant percentage of the average daily price fluctuations, which confirms the possibility that a forecast error could significantly affect the price of NG futures.
TOPICS: Statistical methods, simulations, big data/machine learning
Key Findings:
- High-frequency components in the trading data are stronger than expected from a model assuming uniform trading during market hours.
- The dominance of the high-frequency components has been increasing over the years.
- Relatively small changes in temperature could create a large price fluctuation in natural gas futures contracts.
- Published
- 2019
- Full Text
- View/download PDF
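The frequency-domain signature described above, minute-aligned bursts of algorithmic trading showing up as strong high-frequency Fourier components, can be reproduced on synthetic data. The series below is invented (an impulse every 60 seconds plus noise); it is not the NG futures data used in the article.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 600  # e.g. ten minutes of one-second samples
t = np.arange(n)
# A burst of activity at the start of every minute (period 60 s), the
# "time-based algorithmic trading" signature, buried in background noise.
series = 1.0 * (t % 60 == 0) + rng.normal(scale=0.05, size=n)

spectrum = np.abs(np.fft.rfft(series - series.mean()))
peak_bin = 1 + int(np.argmax(spectrum[1:]))  # skip the DC bin
# A periodic impulse train with period 60 puts all its energy at harmonics of
# 1/60 Hz, i.e. at bins that are multiples of n/60 = 10.
print(peak_bin)
```

On real trading data the same spectral lines at minute harmonics are what distinguish clock-driven algorithmic activity from a model of uniform trading during market hours.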
23. Real-time and post-hoc compression for data from Distributed Acoustic Sensing
- Author
-
Bin Dong, Alex Popescu, Verónica Rodríguez Tribaldos, Suren Byna, Jonathan Ajo-Franklin, and Kesheng Wu
- Subjects
Computers in Earth Sciences, Information Systems - Published
- 2022
- Full Text
- View/download PDF
24. Analyzing Scientific Data Sharing Patterns for In-network Data Caching
- Author
-
Diego Davila, Elizabeth Copps, Inder Monga, Chin Guok, Alex Sim, F. Würthwein, Edgar Fajardo, Huiyi Zhang, Kesheng Wu, Massimo Cafaro, and Jinoh Kim
- Subjects
Networking and Internet Architecture (cs.NI), FOS: Computer and information sciences, Computer science, Volume (computing), Content delivery network, Information repository, Data sharing, Computer Science - Networking and Internet Architecture, Data access, Computer Science - Distributed, Parallel, and Cluster Computing, Bandwidth (computing), Network performance, Cache, Distributed, Parallel, and Cluster Computing (cs.DC), Computer network - Abstract
The volume of data moving through a network increases with new scientific experiments and simulations. Network bandwidth requirements also increase proportionally to deliver data within a certain time frame. We observe that a significant portion of the popular dataset is transferred multiple times to different users as well as to the same user for various reasons. In-network caching of shared data has been shown to reduce redundant data transfers and consequently save network traffic volume. In addition, overall application performance is expected to improve with in-network caching because access to locally cached data results in lower latency. This paper shows how much data was shared over the study period, how much network traffic volume was consequently saved, and how much the temporary in-network caching increased the scientific application performance. It also analyzes data access patterns in applications and the impacts of caching nodes on the regional data repository. From the results, we observed that the network bandwidth demand was reduced by nearly a factor of 3 over the study period.
- Published
- 2021
25. Adaptive Stochastic Gradient Descent for Deep Learning on Heterogeneous CPU+GPU Architectures
- Author
-
Florin Rusu, Kesheng Wu, Yujing Ma, and Alex Sim
- Subjects
SGD, Memory hierarchy, Computer science, fully-connected MLP, Deep learning, Message passing, Parallel computing, Scheduling (computing), Stochastic gradient descent, Rate of convergence, adaptive batch size, Asynchronous communication, Server, Artificial intelligence - Abstract
The widely adopted practice is to train deep learning models with specialized hardware accelerators, e.g., GPUs or TPUs, due to their superior performance on linear algebra operations. However, this strategy does not effectively employ the extensive CPU and memory resources available by default on the accelerated servers, which are used only for preprocessing, data transfer, and scheduling. In this paper, we study training algorithms for deep learning on heterogeneous CPU+GPU architectures. Our two-fold objective, maximizing convergence rate and resource utilization simultaneously, makes the problem challenging. In order to allow for a principled exploration of the design space, we first introduce a generic deep learning framework that exploits the difference in computational power and memory hierarchy between CPU and GPU through asynchronous message passing. Based on insights gained through experimentation with the framework, we design two heterogeneous asynchronous stochastic gradient descent (SGD) algorithms. The first algorithm, CPU+GPU Hogbatch, combines small batches on the CPU with large batches on the GPU in order to maximize the utilization of both resources. However, this generates an unbalanced model update distribution which hinders statistical convergence. The second algorithm, Adaptive Hogbatch, assigns batches with continuously evolving size based on the relative speed of the CPU and GPU. This balances the model update ratio at the expense of a customizable decrease in utilization. We show that the implementation of these algorithms in the proposed CPU+GPU framework achieves both faster convergence and higher resource utilization than TensorFlow on several real datasets.
- Published
- 2021
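The Adaptive Hogbatch idea, resizing each device's batch in proportion to its measured throughput so CPU and GPU finish updates at a similar rate, reduces to a small proportional-allocation step. The numbers, device names, and function below are invented for this sketch; they are not the paper's implementation.

```python
def rebalance(batches, throughput, total):
    """Split `total` samples across devices proportionally to throughput."""
    s = sum(throughput.values())
    return {d: max(1, round(total * throughput[d] / s)) for d in batches}

batches = {"cpu": 512, "gpu": 512}          # initial equal split
throughput = {"cpu": 1_000, "gpu": 15_000}  # samples/sec, measured online
sizes = rebalance(batches, throughput, total=1024)
print(sizes)
```

Repeating this step as measured throughput drifts is what keeps the model-update ratio balanced between the slow and fast devices.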
26. FasTensor Programming Model
- Author
-
Bin Dong, Suren Byna, and Kesheng Wu
- Subjects
Set (abstract data type), Workflow, Theoretical computer science, Data model, Computer science, Data management, Big data, Programming paradigm, Abstract data type, Data structure - Abstract
In the previous chapter, we introduced the motivation for a big data analysis system and its essential components: the data model and the programming model. We also explained why FasTensor chooses the multi-dimensional array as its data model. In this chapter, we provide more details on FasTensor's programming model built on the multi-dimensional array data model. The crucial constructs of a programming model for data analysis are an abstract data type and a set of generic operators. The abstract data type, defined on top of the array data model, allows users to define the input and output data structures that their data analysis functions use. The set of generic operators should allow users to formulate a workflow with a wide range of data analysis functions. Through the abstract data type and these generic operators, a standard protocol between users and a data analysis system is established. On the one hand, users can format their data with the abstract data type and express their data analysis with the generic operators. On the other hand, the data analysis system can build on them to provide generic data management, parallelization, and other tasks.
- Published
- 2021
- Full Text
- View/download PDF
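A conceptual stand-in for the chapter's constructs, a multi-dimensional array as the abstract data type plus one generic operator that applies a user-defined function over local neighborhoods, can be written in a few lines of NumPy. FasTensor itself is a C++ library; this stencil-style `transform` only mirrors the shape of the idea and is not its API.

```python
import numpy as np

def transform(array, func, radius=1):
    """Apply `func` to each cell's (2*radius+1)^2 neighborhood (zero-padded)."""
    padded = np.pad(array, radius)
    out = np.empty_like(array, dtype=float)
    for i in range(array.shape[0]):
        for j in range(array.shape[1]):
            out[i, j] = func(padded[i:i + 2 * radius + 1,
                                    j:j + 2 * radius + 1])
    return out

a = np.arange(9.0).reshape(3, 3)
smoothed = transform(a, np.mean)  # user supplies only the per-neighborhood function
print(smoothed[1, 1])
```

The point of the division of labor is visible even in this sketch: the user writes only the per-neighborhood function, while the operator owns traversal, boundary handling, and (in the real system) parallelization.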
27. FasTensor User Interface
- Author
-
Kesheng Wu, Suren Byna, and Bin Dong
- Subjects
Java, Programming language, Computer science, Programming paradigm, User interface, Python (programming language), Term (time) - Abstract
In this chapter, we describe the user interface of the FasTensor library, a C++ implementation of the FasTensor programming model presented in the previous chapters. Hence, the term FasTensor will mostly refer to the FasTensor library that implements the FasTensor programming model. Currently, the FasTensor programming model is available in C++. Support for other languages, such as Java, Python, and Julia, is planned.
- Published
- 2021
- Full Text
- View/download PDF
28. FasTensor in Real Scientific Applications
- Author
-
Suren Byna, Bin Dong, and Kesheng Wu
- Subjects
Computer science, Computation, Electronic engineering, Distributed acoustic sensing, Signal - Abstract
In this chapter, we describe two scientific applications that demonstrate the capability of FasTensor. The first application is from earth science, where we show how FasTensor implements a self-similarity computation to detect useful signals in data collected with distributed acoustic sensing (DAS). The second application is from plasma physics, where we apply FasTensor to analyze the hydro data from a VPIC simulation.
- Published
- 2021
- Full Text
- View/download PDF
29. User-Defined Tensor Data Analysis
- Author
-
Bin Dong, Kesheng Wu, and Suren Byna
- Published
- 2021
- Full Text
- View/download PDF
30. Introduction
- Author
-
Bin Dong, Kesheng Wu, and Suren Byna
- Published
- 2021
- Full Text
- View/download PDF
31. Deep Learning for Surface Wave Identification in Distributed Acoustic Sensing Data
- Author
-
Jonathan B. Ajo-Franklin, Verónica Rodríguez Tribaldos, Vincent Dumont, and Kesheng Wu
- Subjects
Signal Processing (eess.SP), FOS: Computer and information sciences, Computer Science - Machine Learning, Geophysical imaging, Computer science, Ambient noise level, Big data, FOS: Physical sciences, Seismic wave, Machine Learning (cs.LG), Physics - Geophysics, Seismic velocity, FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Signal Processing, Remote sensing, Distributed acoustic sensing, Geophysics (physics.geo-ph), Seismic hazard, Surface wave, Temporal resolution, Groundwater - Abstract
Moving loads such as cars and trains are very useful sources of seismic waves, which can be analyzed to retrieve information on the seismic velocity of subsurface materials using the techniques of ambient noise seismology. This information is valuable for a variety of applications such as geotechnical characterization of the near-surface, seismic hazard evaluation, and groundwater monitoring. However, for such processes to converge quickly, data segments with appropriate noise energy should be selected. Distributed Acoustic Sensing (DAS) is a novel sensing technique that enables acquisition of these data at very high spatial and temporal resolution for tens of kilometers. One major challenge when utilizing the DAS technology is the large volume of data that is produced, thereby presenting a significant Big Data challenge to find regions of useful energy. In this work, we present a highly scalable and efficient approach to process real, complex DAS data by integrating physics knowledge acquired during a data exploration phase followed by deep supervised learning to identify "useful" coherent surface waves generated by anthropogenic activity, a class of seismic waves that is abundant on these recordings and is useful for geophysical imaging. Data exploration and training were done on 130 gigabytes (GB) of DAS measurements. Using parallel computing, we were able to do inference on an additional 170 GB of data (or the equivalent of 10 days' worth of recordings) in less than 30 minutes. Our method provides interpretable patterns describing the interaction of ground-based human activities with the buried sensors. (Accepted at the IEEE BigData 2020 conference.)
- Published
- 2020
32. SMART Mobility. Mobility Decision Science Capstone Report
- Author
-
Ling Jin, Joshua Auld, Victor Walker, Eleftheria Koutou, Alejandro Henao, Taha Hossein Rashidi, Sydny Fujita, Hung-Chia Yang, Clement Rames, Tom Wenzel, Alina Lazar, Jacob W. Ward, Alex Sim, Andrew Duvall, Gabrielle Wong-Parodi, Monique Stinson, James W. Sears, Zachary A. Needell, Anand Gopal, Amika Todd-Blink, Paul Leiby, C. Spurlock, Margaret R. Taylor, Omer Verbas, Annesa Enam, Colin Sheppard, Kesheng Wu, and Saika Belal
- Subjects
Engineering management ,Decision theory ,Capstone ,Business - Published
- 2020
33. HPC Workload Characterization Using Feature Selection and Clustering
- Author
-
Hyeonsang Eom, Alex Sim, Kesheng Wu, Chungyong Kim, Jiwoo Bang, Suren Byna, Sunggon Kim, Cafaro, Massimo, Kim, Jinoh, and Sim, Alex
- Subjects
Set (abstract data type) ,Computer science ,k-means clustering ,Feature selection ,Mutual information ,Data mining ,Mixture model ,computer.software_genre ,Cluster analysis ,Supercomputer ,computer ,Silhouette - Abstract
Large high-performance computing (HPC) systems are expensive tools responsible for supporting thousands of scientific applications. However, it is not easy to determine the best set of configurations for workloads to best utilize the storage and I/O systems. Users typically rely on the default configurations provided by the system administrators, which often results in poor performance. In an effort to identify the application characteristics most important to I/O performance, we applied several machine learning techniques to characterize these applications. To identify the features most relevant to I/O performance, we evaluate a number of different feature selection methods, e.g., mutual information regression and F regression, and develop a novel feature selection method based on min-max mutual information. These feature selection methods allow us to sift through a large set of real-world workloads collected from NERSC's Cori supercomputer system and identify the most important features. We employ a number of different clustering algorithms, including KMeans, Gaussian Mixture Model (GMM), and Ward linkage, and measure cluster quality with the Davies-Bouldin Index (DBI), Silhouette, and a new Combined Score developed for this work. The cluster evaluation shows that the test dataset is best divided into three clusters: cluster 1 contains mostly small jobs with operations on standard I/O units, cluster 2 consists of mid-size parallel jobs dominated by read operations, and cluster 3 includes large parallel jobs with heavy write operations. The cluster characteristics suggest that using the parallel I/O library MPI-IO and a large number of parallel cores is important for achieving high I/O throughput.
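The mutual-information-based feature ranking described above can be sketched in a minimal form; the empirical estimator below works on discretized features, and the toy feature names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats for two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # c/n is the joint probability; px[x]/n and py[y]/n are the marginals.
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

# Toy workload log: "io_lib" perfectly tracks the target, "node_id" is mostly noise.
target = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    "io_lib":  [0, 0, 1, 1, 0, 1, 0, 1],  # identical to target -> high MI
    "node_id": [0, 1, 0, 1, 0, 1, 0, 1],  # weakly related -> low MI
}
ranked = sorted(features, key=lambda f: mutual_information(features[f], target),
                reverse=True)
print(ranked)  # -> ['io_lib', 'node_id']
```

Ranking features this way, then keeping only the top-scoring ones, is the basic mechanism behind the selection methods the abstract compares.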
- Published
- 2020
34. Feature Selection Improves Tree-based Classification for Wireless Intrusion Detection
- Author
-
Shilpa Bhandari, Avinash K. Kukreja, Kesheng Wu, Alex Sim, and Alina Lazar
- Subjects
Computer science ,Network security ,business.industry ,Class (philosophy) ,Feature selection ,Intrusion detection system ,Machine learning ,computer.software_genre ,Tree (data structure) ,Statistical classification ,Key (cryptography) ,Feature (machine learning) ,Artificial intelligence ,business ,computer - Abstract
With the growth of 5G wireless technologies and IoT, it becomes urgent to develop robust network security systems, such as intrusion detection systems (IDS), to keep networks secure. These IDS need to detect unauthorized access and attacks in real time. However, most modern IDS are built on complex machine learning models that are time-consuming to train. In this work, we propose a methodology that uses SHapley Additive exPlanations (SHAP) in combination with tree-based classifiers. SHAP can be used to select consistent and small feature subsets to reduce the execution time and improve classification accuracy. We demonstrate the proposed approach on the Aegean Wi-Fi Intrusion Dataset (AWID) in a series of multi-class classification experiments. Among the four classes ("normal", "injection", "flooding", and "impersonation"), the impersonation class is well known to be hard to classify accurately. Tests show that we can use about 10% of the initial feature set without reducing the overall prediction accuracy. With this reduced set of features, the training time could be reduced by as much as a factor of four, while slightly improving the ability to identify impersonation instances. This study suggests that by reducing the number of features, the classification algorithms are able to focus on the key trends that differentiate the "attack" classes from the "normal" class. Using a reduced subset of features improves the IDS's accuracy and performance. In addition, SHAP dependence plots capture the relationship between individual features and the classification decision.
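A minimal sketch of the feature-subset selection step, assuming the per-sample SHAP values have already been computed (e.g., with a tree explainer); the 10% keep fraction mirrors the abstract, while the function name and toy data are hypothetical.

```python
def select_top_features(shap_values, feature_names, keep_frac=0.10):
    """Rank features by mean |SHAP| across samples; keep the top fraction."""
    n_keep = max(1, int(len(feature_names) * keep_frac))
    scores = []
    for j, name in enumerate(feature_names):
        score = sum(abs(row[j]) for row in shap_values) / len(shap_values)
        scores.append((score, name))
    return [name for _, name in sorted(scores, reverse=True)[:n_keep]]

# Toy SHAP matrix: 3 samples x 10 features, where feature f3 dominates.
feature_names = [f"f{i}" for i in range(10)]
shap_values = []
for sign in (1, -1, 1):
    row = [0.01] * 10
    row[3] = 2.0 * sign   # large attribution (either direction) for f3
    shap_values.append(row)
print(select_top_features(shap_values, feature_names))  # -> ['f3']
```

The classifier is then retrained on only the selected columns, which is where the reported factor-of-four speedup comes from.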
- Published
- 2020
35. Transfer Learning Approach for Botnet Detection Based on Recurrent Variational Autoencoder
- Author
-
Alex Sim, Jeeyung Kim, Jinoh Kim, Jaegyoon Hahm, and Kesheng Wu
- Subjects
Source data ,Artificial neural network ,Computer science ,business.industry ,Botnet ,Intrusion detection system ,Machine learning ,computer.software_genre ,Autoencoder ,Domain (software engineering) ,Problem domain ,Artificial intelligence ,Transfer of learning ,business ,computer - Abstract
Machine learning (ML) methods have been widely used in intrusion detection systems (IDS), and many botnet detection methods in particular are based on ML. However, due to the fast-evolving nature of network security threats, it is necessary to frequently retrain ML tools with up-to-date data; because data labeling takes a long time and requires substantial effort, generating training data is difficult. We propose transfer learning as a more effective approach for botnet detection, since it can learn from well-curated source data and transfer that knowledge to a target problem domain not seen before. We devise an approach that is effective regardless of whether the data from the target domain is labeled. More specifically, we train a neural network with the Recurrent Variational Autoencoder (RVAE) structure on the source data and use the RVAE to compute anomaly scores for data records from the target domain. In an evaluation of this transfer learning framework, we use the CTU-13 dataset as the source domain and a fresh set of network monitoring data as the target domain. Tests show that the proposed transfer learning method detects botnets better than a semi-supervised learning method trained on the target domain data: the area under the Receiver Operating Characteristic curve is 0.810 for transfer learning versus 0.779 for directly using the RVAE on the target domain data.
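The reported evaluation metric, the area under the ROC curve, can be computed directly from anomaly scores and labels via the Mann-Whitney formulation; the scores below are hypothetical stand-ins for RVAE reconstruction-based anomaly scores.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the Mann-Whitney U statistic:
    the probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Anomaly scores (e.g., reconstruction error) vs. ground-truth botnet labels.
scores = [0.9, 0.8, 0.3, 0.2, 0.7, 0.1]
labels = [1,   1,   0,   0,   1,   0]
print(roc_auc(scores, labels))  # -> 1.0 (perfect separation on this toy data)
```

The O(|pos|·|neg|) pairwise form is fine for a sketch; rank-based implementations are used for large datasets.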
- Published
- 2020
36. Towards HPC I/O Performance Prediction through Large-scale Log Analysis
- Author
-
Hyeonsang Eom, Kesheng Wu, Yongseok Son, Alex Sim, Suren Byna, Sunggon Kim, Parashar, Manish, Vlassov, Vladimir, Irwin, David E, and Mohror, Kathryn
- Subjects
Job scheduler ,Scheme (programming language) ,050101 languages & linguistics ,Computer science ,Distributed computing ,05 social sciences ,Provisioning ,02 engineering and technology ,computer.software_genre ,Supercomputer ,Task (computing) ,0202 electrical engineering, electronic engineering, information engineering ,Performance prediction ,020201 artificial intelligence & image processing ,0501 psychology and cognitive sciences ,Distributed File System ,computer ,System software ,computer.programming_language - Abstract
Large-scale high performance computing (HPC) systems typically consist of many thousands of CPUs and storage units and are used by hundreds to thousands of users at the same time. Applications from these large numbers of users have diverse characteristics, such as varying compute, communication, memory, and I/O intensiveness. A good understanding of the performance characteristics of each user application is important for job scheduling and resource provisioning. Among these performance characteristics, the I/O performance is difficult to predict because the I/O system software is complex, the I/O system is shared among all users, and the I/O operations also heavily rely on networking systems. To improve the prediction of the I/O performance on HPC systems, we propose to integrate information from a number of different system logs and develop a regression-based approach that dynamically selects the most relevant features from the most recent log entries and automatically selects the best regression algorithm for the prediction task. Evaluation results show that our proposed scheme can predict the I/O performance with up to 84% prediction accuracy for I/O-intensive applications using the logs from the Cori supercomputer at NERSC.
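As a minimal sketch of the regression-based prediction idea (the paper selects among several regression algorithms and many log-derived features), a one-feature ordinary-least-squares fit might look like this; the feature and timings are illustrative, not from the logs.

```python
def fit_ols(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form, single feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

# Hypothetical log-derived feature: bytes written (GB) vs. observed write time (s).
gb   = [1, 2, 4, 8]
secs = [2, 4, 8, 16]   # exactly 2 s per GB in this toy data
a, b = fit_ols(gb, secs)
predict = lambda x: a * x + b
print(predict(16))  # -> 32.0
```

A production scheme would fit many such candidate models over selected feature subsets and keep the one with the best validation accuracy.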
- Published
- 2020
37. GPU-based Classification for Wireless Intrusion Detection
- Author
-
Kesheng Wu, Alina Lazar, and Alex Sim
- Subjects
Acceleration ,business.industry ,Computer science ,Computation ,Real-time computing ,Process (computing) ,Wireless ,Intrusion detection system ,Graphics ,Scale (map) ,business ,Pipeline (software) - Abstract
Automated network intrusion detection systems (NIDS) continuously monitor network traffic to detect attacks and/or anomalies. These systems need to detect attacks and alert network engineers in real time. Therefore, modern NIDS are built using complex machine learning algorithms that require large training datasets and are time-consuming to train. The proposed work shows that machine learning algorithms from the RAPIDS cuML library running on Graphics Processing Units (GPUs) can speed up the training process on large-scale datasets. This approach reduces the training time while providing high accuracy and performance. We demonstrate the proposed approach on a large subset of data extracted from the Aegean Wi-Fi Intrusion Dataset (AWID). Multiple classification experiments were performed on both CPU and GPU. We achieve up to 65x acceleration in training several machine learning methods by moving most of the pipeline computations to the GPU and leveraging the new cuML library as well as the GPU version of the CatBoost library.
- Published
- 2020
38. Access Patterns to Disk Cache for Large Scientific Archive
- Author
-
Kesheng Wu, Yumeng Wang, Shigeki Misawa, Shinjae Yoo, and Alex Sim
- Subjects
Data access ,Magnetic tape data storage ,Software deployment ,business.industry ,Data management ,Operating system ,Cache ,computer.software_genre ,Disk buffer ,business ,computer - Abstract
Large scientific projects are increasingly relying on data analyses for their new discoveries, and a number of different data management systems have been developed to serve these scientific projects. In this work-in-progress paper, we describe an effort to understand the data access patterns of one of these data management systems, dCache. This particular deployment of dCache acts as a disk cache in front of a large tape storage system primarily containing high-energy physics data. Based on 15 months of dCache logs, the cache accesses the tape system only once for every 50+ file requests, which indicates that it is effective as a disk cache. The on-disk files are repeatedly used, more than three times a day. We have also identified a number of unusual access patterns that are worth further investigation.
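The cache-effectiveness figure quoted above (one tape access per 50+ file requests) corresponds to a simple hit-ratio computation over the access log; the log format below is a hypothetical simplification of a dCache-style log.

```python
def disk_hit_ratio(log):
    """Fraction of file requests served from the disk cache rather than tape.
    `log` is a list of (filename, served_from) pairs, served_from in {'disk','tape'}."""
    hits = sum(1 for _, src in log if src == "disk")
    return hits / len(log)

# Hypothetical excerpt: one tape staging followed by repeated disk hits.
log = [("a.root", "tape")] + [("a.root", "disk")] * 99
print(disk_hit_ratio(log))  # -> 0.99, i.e. one tape access per 100 requests
```

In the study, the same kind of per-file counting also yields the reuse rate (how many times a day each on-disk file is read).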
- Published
- 2020
39. A Deep Deterministic Policy Gradient Based Network Scheduler for Deadline-Driven Data Transfers
- Author
-
Ghosal, G. R., Ghosal, D., Sim, A., Thakur, A. V., and Kesheng Wu
- Subjects
DDPG ,Value maximization ,EDF ,Software-defined Networking ,Scheduling heuristics ,TCP ,Reinforcement Learning ,Deadline-driven data transfers - Abstract
We consider data sources connected to a software defined network (SDN) with heterogeneous link access rates. Deadline-driven data transfer requests are made to a centralized network controller that schedules the pacing rates of the sources; meeting a request's deadline has a pre-assigned value. The goal of the scheduler is to maximize the aggregate value. We design a scheduler (RL-Agent) based on Deep Deterministic Policy Gradient (DDPG). We compare our approach with three heuristics: (i) PFAIR, which shares the bottleneck capacity in proportion to the access rates; (ii) VDRatio, which prioritizes flows with a high value-to-demand ratio; and (iii) VBEDF, which prioritizes flows with a high value-to-deadline ratio. For equally valued requests and homogeneous access rates, PFAIR is the same as an idealized TCP algorithm, while VBEDF and VDRatio reduce to the Earliest Deadline First (EDF) and Shortest Job First (SJF) algorithms, respectively. In this scenario, we show that RL-Agent performs significantly better than PFAIR and VDRatio, and matches VBEDF, out-performing it in over-loaded scenarios. When access rates are heterogeneous, we show that the RL-Agent performs as well as VBEDF even though the RL-Agent has no prior knowledge of the heterogeneity. For the value maximization problems, we show that the RL-Agent out-performs the heuristics for both homogeneous and heterogeneous access networks. For the general case of heterogeneity with different values, the RL-Agent performs best despite having no prior knowledge of the heterogeneity or the values, whereas the heuristics have full knowledge of the heterogeneity, and VDRatio and VBEDF have partial knowledge of the values through the ratios of value to demand and value to deadline, respectively.
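The VDRatio and VBEDF heuristics can be sketched as priority orderings; the request field names are hypothetical. With equal values, VDRatio reduces to an SJF ordering and VBEDF to an EDF ordering, as the abstract states.

```python
def vdratio_order(requests):
    """Prioritize transfers by value-to-demand ratio (VDRatio heuristic)."""
    return sorted(requests, key=lambda r: r["value"] / r["demand"], reverse=True)

def vbedf_order(requests):
    """Prioritize transfers by value-to-deadline ratio (VBEDF heuristic)."""
    return sorted(requests, key=lambda r: r["value"] / r["deadline"], reverse=True)

requests = [
    {"id": "A", "value": 10, "demand": 5,  "deadline": 100},
    {"id": "B", "value": 10, "demand": 1,  "deadline": 10},
    {"id": "C", "value": 50, "demand": 25, "deadline": 50},
]
print([r["id"] for r in vdratio_order(requests)])  # -> ['B', 'A', 'C']
print([r["id"] for r in vbedf_order(requests)])    # -> ['B', 'C', 'A']
```

The RL-Agent replaces these fixed priority rules with pacing rates learned from the observed reward (aggregate value of deadlines met).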
- Published
- 2020
40. Predicting Resource Requirement in Intermediate Palomar Transient Factory Workflow
- Author
-
Alok Choudhary, Kesheng Wu, Qiao Kang, Alex Sim, Sunwoo Lee, Ankit Agrawal, Peter Nugent, and Wei-keng Liao
- Subjects
iPTF ,Data processing ,Computer science ,Bayesian network ,Response time ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,Spatiotemporal features ,Set (abstract data type) ,Workflow ,Resource (project management) ,Workflow Scheduling ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Factory (object-oriented programming) ,020201 artificial intelligence & image processing ,Transient (computer programming) ,Data mining ,010303 astronomy & astrophysics ,computer - Abstract
Quickly identifying astronomical transients from synoptic surveys is critical to many recent astrophysical discoveries. However, each of the data processing pipelines in these surveys contains dozens of stages with highly varying time and space requirements. Properly predicting the resources required to run these pipelines is critical for the allocation of computing resources and reducing the discovery response time. We propose a machine learning strategy for this prediction task and demonstrate its effectiveness using a set of timing measurements from the intermediate Palomar Transient Factory (iPTF) workflow. The proposed model utilizes the spatiotemporal correlation of astronomical images, where nearby patches of the sky (space) are likely to have a similar number of objects of interest and workflows executed in the recent past (time) are likely to use a similar amount of time because the machines and data storage systems are likely to be in similar states. We capture the relationship among these spatial and temporal features in a Bayesian network and study how they impact the prediction accuracy. This Bayesian network helps us to identify the most influential features for predictions. With proper features, our models achieve errors close to the random variance boundary within batches of images taken at the same time, which can be regarded as the intrinsic limit of prediction accuracy.
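The temporal-locality intuition (workflows executed in the recent past are likely to take a similar amount of time) can be illustrated with an exponentially weighted average of past runtimes; the actual paper builds a Bayesian network over spatial and temporal features, so this is only a sketch of one ingredient, and the smoothing factor is an assumed value.

```python
def ema_predict(history, alpha=0.5):
    """Exponentially weighted average of past runtimes as a next-run estimate.
    Recent runs get the most weight, reflecting current machine/storage state."""
    est = history[0]
    for t in history[1:]:
        est = alpha * t + (1 - alpha) * est
    return est

# Hypothetical per-stage runtimes (seconds) from recent pipeline executions.
print(ema_predict([100, 100, 100]))  # -> 100.0
print(ema_predict([10, 20]))         # -> 15.0
```

A spatial analogue would average over runtimes of nearby sky patches instead of recent runs.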
- Published
- 2020
41. DASSA: Parallel DAS Data Storage and Analysis for Subsurface Event Detection
- Author
-
Verónica Rodríguez Tribaldos, Suren Byna, Jonathan B. Ajo-Franklin, Bin Dong, Xin Xing, and Kesheng Wu
- Subjects
Multi-core processor ,010504 meteorology & atmospheric sciences ,business.industry ,Computer science ,Event (computing) ,Interface (computing) ,Distributed computing ,010502 geochemistry & geophysics ,Supercomputer ,FLOPS ,01 natural sciences ,Computer data storage ,business ,0105 earth and related environmental sciences - Abstract
Recently developed distributed acoustic sensing (DAS) technologies convert fiber-optic cables into large arrays of subsurface sensors, enabling a variety of applications including earthquake detection and environmental characterization. However, DAS systems produce voluminous datasets sampled at high spatial-temporal resolution, and consequently, discovering useful geophysical knowledge within these large-scale data becomes a nearly impossible task for geophysicists. It is appealing to use supercomputers for DAS data analysis, as modern supercomputers are capable of performing over a hundred quadrillion floating-point operations per second and have access to exabytes of storage space. Unfortunately, the majority of geophysical data processing libraries are not geared towards these supercomputer environments. This paper introduces a parallel DAS Data Storage and Analysis (DASSA) framework to enable easy-to-use and parallel DAS data analysis on modern supercomputers. DASSA uses a hybrid (i.e., MPI and OpenMP) data analysis execution engine that supports a user-defined function (UDF) interface for various operations and automatically parallelizes them for supercomputer execution. DASSA also provides novel data storage and access strategies, such as communication-avoiding parallel I/O, to reduce the cost of retrieving large DAS data for analysis. Compared with existing data analysis pipelines used by the geophysical community, DASSA is 16× faster and can efficiently scale up to 1456 computing nodes with 11648 CPU cores.
- Published
- 2020
42. Organizing Large Data Sets for Efficient Analyses on HPC Systems
- Author
-
Junmin Gu, Philip Davis, Greg Eisenhauer, William Godoy, Axel Huebl, Scott Klasky, Manish Parashar, Norbert Podhorszki, Franz Poeschel, JeanLuc Vay, Lipeng Wan, Ruonan Wang, and Kesheng Wu
- Subjects
History ,Computer Science Applications ,Education - Abstract
Upcoming exascale applications could introduce significant data management challenges due to their large sizes, dynamic work distribution, and involvement of accelerators such as graphics processing units (GPUs). In this work, we explore the performance of reading and writing operations involving one such scientific application on two different supercomputers. Our tests showed that the Adaptable Input and Output System (ADIOS) was able to achieve speeds over 1 TB/s, a significant fraction of the peak I/O performance on Summit. We also demonstrated that the querying functionality in ADIOS could effectively support common selective data analysis operations, such as conditional histograms. In tests, this query mechanism was able to reduce the execution time by a factor of five. More importantly, the ADIOS data management framework allows us to achieve these performance improvements with only a minimal amount of coding effort.
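A conditional histogram, the selective analysis operation mentioned above, can be sketched in plain Python; ADIOS evaluates such queries through its own API and indexes, so this only illustrates the access pattern that the query mechanism accelerates. The bin width and cut are assumed values.

```python
def conditional_histogram(values, predicate, bin_width):
    """Histogram of only the values satisfying `predicate` (a selective query)."""
    counts = {}
    for v in values:
        if predicate(v):
            b = int(v // bin_width)           # bin index for this value
            counts[b] = counts.get(b, 0) + 1
    return counts

# Hypothetical particle energies; histogram only those above a cut of 10,
# in bins of width 10.
energies = [3, 12, 15, 27, 41, 44, 8]
print(conditional_histogram(energies, lambda e: e > 10, 10))
# -> {1: 2, 2: 1, 4: 2}
```

The speedup in the paper comes from evaluating the predicate through the query index instead of scanning every value as this sketch does.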
- Published
- 2022
43. Inner-filter effect based fluorescence-quenching immunochromatographic assay for sensitive detection of aflatoxin B1 in soybean sauce
- Author
-
Wenjing Zhang, Kesheng Wu, Juan Li, Hu Jiang, Yonghua Xiong, Hong Duan, and Nie Lijuan
- Subjects
Detection limit ,Aflatoxin ,Materials science ,Chromatography ,Coefficient of variation ,010401 analytical chemistry ,02 engineering and technology ,021001 nanoscience & nanotechnology ,01 natural sciences ,Fluorescence ,Plasma resonance ,0104 chemical sciences ,Colloidal gold ,Quantum dot ,Filter effect ,0210 nano-technology ,Food Science ,Biotechnology - Abstract
A fluorescence-quenching immunochromatographic assay (ICA) was developed for the sensitive detection of aflatoxin B1 (AFB1) in soybean sauce based on the inner filter effect (IFE) between flower-like gold nanoparticles (AuNFs) and quantum dots (QDs). QDs were sprayed on the test and control line zones as background fluorescence signals, whereas AuNFs were designed as the fluorescence absorber of the QDs, because the surface plasmon resonance peak of the AuNFs matches the maximum emission peak of the QDs. Under the optimal conditions, the fluorescence-quenching ICA strip showed good linear detection of AFB1 in standard AFB1 solution from 0.008 μg/L to 1 μg/L, with a low detection limit of 0.004 μg/L. The average recoveries for different concentrations of AFB1-spiked soybean sauce samples ranged from 84.69% to 120.44%, with a coefficient of variation ranging from 2.73% to 10.41%. In addition, the reliability of the proposed method was further confirmed by ultra-performance liquid chromatography with fluorescence detection. In brief, this novel IFE-based strip offers a simple, rapid, sensitive, and accurate strategy for the quantitative detection of AFB1 in soybean sauce.
- Published
- 2018
44. Incremental nonnegative matrix factorization based on correlation and graph regularization for matrix completion
- Author
-
Xiaoxia Zhang, Kesheng Wu, and Degang Chen
- Subjects
Matrix completion ,Computer science ,020208 electrical & electronic engineering ,Computational intelligence ,02 engineering and technology ,Latent variable ,Recommender system ,Facial recognition system ,Non-negative matrix factorization ,Matrix decomposition ,Artificial Intelligence ,0202 electrical engineering, electronic engineering, information engineering ,Graph (abstract data type) ,020201 artificial intelligence & image processing ,Computer Vision and Pattern Recognition ,Algorithm ,Software - Abstract
Matrix factorization is widely used in recommendation systems, text mining, face recognition, and computer vision. As one of the most popular methods, nonnegative matrix factorization and its incremental variants have attracted much attention. Existing incremental algorithms are built on the assumption that samples are independent, and they update only the new latent variable of the weighting coefficient matrix when a new sample arrives, which may lead to inferior solutions. To address this issue, we investigate a novel incremental nonnegative matrix factorization algorithm based on correlation and a graph regularizer (ICGNMF). The correlation is mainly used for finding the correlated rows to be updated; that is, we assume that samples are dependent on each other. We derive the updating rules for ICGNMF by considering this correlation. We also present tests on widely used image datasets, showing that ICGNMF reduces the error compared with other methods.
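For context, the standard Lee-Seung multiplicative updates that incremental NMF variants such as ICGNMF extend can be sketched as follows; the correlation and graph-regularization terms from the paper are not included here, and the toy matrix is an assumed rank-1 example.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf_step(V, W, H, eps=1e-9):
    """One round of Lee-Seung multiplicative updates for V ~ W @ H,
    which keeps W and H nonnegative if they start nonnegative."""
    WT = transpose(W)
    num, den = matmul(WT, V), matmul(matmul(WT, W), H)
    H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(H[0]))]
         for i in range(len(H))]
    HT = transpose(H)
    num, den = matmul(V, HT), matmul(W, matmul(H, HT))
    W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(W[0]))]
         for i in range(len(W))]
    return W, H

def frobenius_error(V, W, H):
    R = matmul(W, H)
    return sum((V[i][j] - R[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))

V = [[1.0, 2.0], [2.0, 4.0]]   # rank-1 matrix, exactly factorable
W = [[0.5], [0.7]]             # positive rank-1 initialization
H = [[0.9, 1.1]]
for _ in range(50):
    W, H = nmf_step(V, W, H)
print(frobenius_error(V, W, H) < 1e-4)  # -> True: the updates drive the error down
```

Incremental variants avoid rerunning these full updates by touching only the rows affected by a new sample; ICGNMF's contribution is choosing those rows via correlation and regularizing with a graph term.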
- Published
- 2018
45. Special issue on scientific and statistical data management
- Author
-
Florin Rusu and Kesheng Wu
- Subjects
Information Systems and Management ,Hardware and Architecture ,business.industry ,Computer science ,Data management ,business ,Data structure ,Data science ,Software ,Information Systems - Published
- 2019
46. Fluorescence immunoassay based on the enzyme cleaving ss-DNA to regulate the synthesis of histone-ds-poly(AT) templated copper nanoparticles
- Author
-
Bao Gao, Yinjiao Chai, Ying Xiong, Kesheng Wu, Xiaolin Huang, Yonghua Xiong, and Yunqing Wu
- Subjects
Aflatoxin B1 ,Poly T ,Iron ,DNA, Single-Stranded ,Metal Nanoparticles ,02 engineering and technology ,01 natural sciences ,Fluorescence spectroscopy ,Histones ,Glucose Oxidase ,chemistry.chemical_compound ,medicine ,General Materials Science ,Glucose oxidase ,Hydrogen peroxide ,Immunoassay ,Detection limit ,Chromatography ,medicine.diagnostic_test ,biology ,010401 analytical chemistry ,Reproducibility of Results ,Hydrogen Peroxide ,021001 nanoscience & nanotechnology ,Fluorescence ,0104 chemical sciences ,Spectrometry, Fluorescence ,chemistry ,Reagent ,biology.protein ,Hydroxyl radical ,Poly A ,0210 nano-technology ,Copper - Abstract
Herein, we report for the first time a novel competitive fluorescence immunoassay for the ultrasensitive detection of aflatoxin B1 (AFB1) using histone-ds-poly(AT) templated copper nanoparticles (His-pAT CuNPs) as the fluorescent indicator. In this immunoassay, glucose oxidase (Gox) was used as the carrier of the competing antigen to catalyze the formation of hydrogen peroxide (H2O2) from glucose. H2O2 was converted to hydroxyl radicals using Fenton's reagent, which further regulated the fluorescence signals of the His-pAT CuNPs. Owing to the ultrahigh sensitivity of the ss-DNA to the hydroxyl radical, the proposed fluorescence immunoassay exhibited favorable dynamic linear detection of AFB1 ranging from 0.46 pg mL-1 to 400 pg mL-1, with a half maximal inhibitory concentration of 6.13 pg mL-1 and a limit of detection of 0.15 pg mL-1. Intra- and inter-assay tests showed that the average recoveries for AFB1-spiked corn samples ranged from 96.87% to 100.73% and from 96.67% to 114.92%, respectively. The reliability of this method was further confirmed by ultra-performance liquid chromatography coupled with a fluorescence detector. In summary, this work offers a novel screening strategy with high sensitivity and robustness for the quantitative detection of mycotoxins and other pollutants for food safety and clinical diagnosis.
- Published
- 2018
47. Clustering life course to understand the heterogeneous effects of life events, gender, and generation on habitual travel modes
- Author
-
Alina Lazar, James W. Sears, Alex Sim, Ling Jin, C. Anna Spurlock, Annika Todd-Blick, Hung-Chia Yang, and Kesheng Wu
- Subjects
Technology ,General Computer Science ,Life cycle ,0211 other engineering and technologies ,Psychological intervention ,02 engineering and technology ,joint social sequence clustering ,Engineering ,generation ,Information and Computing Sciences ,0502 economics and business ,Situated ,gender ,General Materials Science ,Mode choice ,Pediatric ,050210 logistics & transportation ,Event (computing) ,05 social sciences ,Perspective (graphical) ,General Engineering ,Mode (statistics) ,021107 urban & regional planning ,Quality Education ,Sustainable transport ,machine learning ,Life course approach ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,mode use ,Psychology ,lcsh:TK1-9971 ,Cognitive psychology - Abstract
Daily transportation mode choice is largely habitual, but transitions between life events may disrupt travel habits and can shift choices between alternative transportation modes. Although much is known about general mode switches following life event transitions, less is understood about differences that may exist between subpopulations, especially from a long-term perspective. Understanding these differences will help planners and policymakers introduce more targeted policy interventions to promote sustainable transportation modes and inform longer-term predictions. Extending beyond the existing literature, we use data collected from a retrospective survey to investigate the effects of life course events on mode use situated within different long-term life trajectory contexts. We apply a machine-learning method called joint social sequence clustering to define five distinct and interpretable cohorts based on trajectory patterns in family and career domains over their life courses. We use these patterns as an innovative contextual system to investigate (1) the heterogeneous effects of life events on travel mode use and (2) further differentiation between gender and generation groups in these life event effects. We find that events occurring relatively early in life are more strongly associated with changes in mode-use behavior, and that mode use can also be affected by the relative order of events. This timing and order effect can have lasting impacts on mode use aggregated over entire life cycles: members of our "Have-it-alls" cohort, who finish their education, start working, partner up, and have children early in life, ramp up car use at each event, resulting in the highest rate of car use occurring the earliest among all the cohorts. Women drive more when having children primarily when their family formation and career formation are intertwined early in life, and younger generations rely relatively more on car use during familial events when their careers have a later start.
- Published
- 2020
48. Federated Wireless Network Intrusion Detection
- Author
-
Alina Lazar, Kesheng Wu, Burak Cetin, Jinoh Kim, and Alex Sim
- Subjects
Edge device ,Computer science ,business.industry ,Network security ,Process (engineering) ,Wireless network ,Deep learning ,020206 networking & telecommunications ,02 engineering and technology ,Intrusion detection system ,Set (abstract data type) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,The Internet ,Artificial intelligence ,business ,Computer network - Abstract
Wi-Fi has become the wireless networking standard that allows short- to medium-range devices to connect without wires. For the last 20 years, Wi-Fi technology has been so pervasive that most devices in use today are mobile and connect to the internet through Wi-Fi. Unlike a wired network, a wireless network lacks a clear boundary, which leads to significant Wi-Fi network security concerns, especially because the current security measures are prone to several types of intrusion. To address this problem, machine learning and deep learning methods have been successfully developed to identify network attacks. However, collecting data to develop models is expensive and raises privacy concerns. The goal of this paper is to evaluate a federated learning approach that would alleviate such privacy concerns. This initial work on intrusion detection is performed in a simulated environment. Once proven feasible, this process would allow edge devices to collaboratively update global anomaly detection models without sharing sensitive training data. In a set of tests with the AWID intrusion detection dataset, we show that our federated approach is effective in terms of classification accuracy, computation cost, and communication cost.
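The collaborative model-update step can be sketched with FedAvg-style weighted averaging of client weights; the abstract does not specify its aggregation rule, so this is an assumed, common choice, and the weight vectors are toy values.

```python
def fed_avg(client_weights, client_sizes):
    """Federated averaging: combine client model weights, weighted by the
    number of local training samples, without sharing any raw data."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[j] * s for w, s in zip(client_weights, client_sizes)) / total
        for j in range(n_params)
    ]

# Two edge devices with different amounts of local Wi-Fi traffic data.
updated = fed_avg(client_weights=[[1.0, 0.0], [3.0, 2.0]],
                  client_sizes=[100, 300])
print(updated)  # -> [2.5, 1.5]
```

Each round, the server broadcasts the averaged model back to the clients, which continue training locally; only these parameter vectors cross the network.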
- Published
- 2019
49. Understanding Data Similarity in Large-Scale Scientific Datasets
- Author
-
Alina Lazar, Deb Agarwal, Lavanya Ramakrishnan, Ludovico Bianchi, Gilberto Pastorello, Payton Linton, Devarshi Ghoshal, William Melodia, Kesheng Wu, Baru, Chaitanya, Huan, Jun, Khan, Latifur, Hu, Xiaohua, Ak, Ronay, Tian, Yuanyuan, Barga, Roger S, Zaniolo, Carlo, Lee, Kisung, and Ye, Yanfang Fanny
- Subjects
010504 meteorology & atmospheric sciences ,Computer science ,Dimensionality reduction ,Context (language use) ,02 engineering and technology ,Similarity measure ,computer.software_genre ,01 natural sciences ,Euclidean distance ,similarity measure ,Similarity (network science) ,Outlier ,Metric (mathematics) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Data mining ,Time series ,Cluster analysis ,computer ,0105 earth and related environmental sciences ,dimensionality reduction ,clustering - Abstract
Today, scientific experiments and simulations produce massive amounts of heterogeneous data that need to be stored and analyzed. Given that these large datasets are stored in many files, formats, and locations, how can scientists find relevant data, duplicates, or similarities? In this context, we concentrate on developing algorithms to compare the similarity of time series for the purposes of search, classification, and clustering. For example, generating accurate patterns from climate-related time series is important not only for building models for weather forecasting and climate prediction, but also for modeling and predicting the cycles of carbon, water, and energy. We developed the methodology and ran an exploratory analysis of climatic and ecosystem variables from the FLUXNET2015 dataset. The proposed combination of similarity metrics, nonlinear dimension reduction, clustering methods, and validity measures for time series data has not been applied to unlabeled datasets before, and it provides a process that can be easily extended to other scientific time series data. The dimensionality reduction step provides a good way to identify the optimum number of clusters, detect outliers, and assign initial labels to the time series data. We evaluated multiple similarity metrics in terms of internal cluster validity for driver as well as response variables. While the best metric often depends on a number of factors, the Euclidean distance seems to perform well for most variables and also in terms of computational expense.
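The Euclidean similarity search discussed above can be sketched as follows; the site names and series values are hypothetical, and real comparisons would normalize and align the series first.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length time series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, candidates):
    """Name of the candidate series closest to `query` under Euclidean distance."""
    return min(candidates, key=lambda name: euclidean(query, candidates[name]))

# Hypothetical per-site measurement series (e.g., daily-averaged fluxes).
series = {
    "site_A": [0.1, 0.2, 0.3, 0.4],
    "site_B": [5.0, 5.1, 4.9, 5.2],
}
print(nearest([0.0, 0.2, 0.35, 0.4], series))  # -> site_A
```

The same pairwise-distance matrix feeds both the clustering step and the internal validity scores used to compare metrics.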
- Published
- 2019
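The pipeline summarized in the abstract above (similarity between normalized time series, dimensionality reduction to pick a cluster count, then clustering to assign initial labels) can be sketched roughly as follows. This is a minimal illustration on synthetic series, not the authors' implementation: the function names are invented, linear PCA stands in for the paper's nonlinear dimension reduction, and k-means is used as a generic clustering step.

```python
import numpy as np

def znorm(ts):
    """Z-normalize a series so Euclidean distance compares shape, not scale."""
    return (ts - ts.mean()) / (ts.std() + 1e-12)

def euclidean_matrix(series):
    """Stacked normalized series plus their pairwise Euclidean distances."""
    X = np.array([znorm(np.asarray(s, dtype=float)) for s in series])
    return X, np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def pca_embed(X, dims=2):
    """Low-dimensional projection via SVD; a linear stand-in for the
    nonlinear dimension reduction used in the paper."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dims].T

def kmeans(X, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Synthetic stand-in for FLUXNET-style variables: two families of series.
t = np.linspace(0, 2 * np.pi, 100)
series = [np.sin(t + 0.1 * i) for i in range(5)] + \
         [np.cos(3 * t + 0.1 * i) for i in range(5)]
X, D = euclidean_matrix(series)
emb = pca_embed(X)          # low-dimensional view for inspecting cluster count
labels = kmeans(emb, k=2)   # initial labels for the unlabeled series
```

In the same spirit as the abstract, the embedding `emb` is where one would eyeball the number of clusters and spot outliers before committing to `k`.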
50. Analysis and Prediction of Data Transfer Throughput for Data-Intensive Workloads
- Author
-
Erich Strohmaier, Devarshi Ghoshal, Eric Pouyoul, and Kesheng Wu
- Subjects
Job scheduler ,020203 distributed computing ,Computer science ,business.industry ,Distributed computing ,Big data ,020206 networking & telecommunications ,02 engineering and technology ,Network monitoring ,computer.software_genre ,Supercomputer ,0202 electrical engineering, electronic engineering, information engineering ,Performance prediction ,Resource management ,Heuristics ,business ,Throughput (business) ,computer ,Host (network) - Abstract
Scientific workflows are increasingly transferring large amounts of data between high performance computing (HPC) systems. Even though these HPC systems are connected via high-speed dedicated networks and use dedicated data transfer nodes (DTNs), it is still difficult to predict data transfer throughput because of variations in data transfer protocols, host configurations, file system performance, and overlapping workloads. To provide reliable performance prediction for better resource management and job scheduling, we need models that predict data transfer throughput under real-world conditions. In this paper, we explore different machine learning approaches for building data-driven models that improve the prediction of large-scale data transfer throughput. In addition to the variables already collected by the network monitoring system, we develop heuristics to derive additional metrics that improve prediction accuracy. We use the prediction results to identify the importance of different network parameters in predicting throughput for large-scale data transfers. Through extensive tests, we identify key network parameters, discover interesting variations among different HPC sites, and show that we can predict throughput with high accuracy. We also analyze our models and results to provide recommendations for improving the performance of big data transfers.
- Published
- 2019
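The regression workflow in the abstract above (score monitoring-log features for relevance, select the most relevant ones, fit a regression model, and measure held-out prediction accuracy) can be sketched as follows. The feature semantics, the correlation-based scoring, and plain least squares here are illustrative assumptions on synthetic data, not the paper's actual monitored variables or chosen learner.

```python
import numpy as np

rng = np.random.default_rng(42)
n_transfers = 500

# Hypothetical log-derived features per transfer (names illustrative only):
# e.g., bytes moved, concurrent streams, DTN load, file count, ...
X = rng.normal(size=(n_transfers, 6))
true_w = np.array([3.0, 0.0, -2.0, 0.0, 1.5, 0.0])    # only 3 features matter
y = X @ true_w + 0.1 * rng.normal(size=n_transfers)   # "throughput" target

def relevance_scores(X, y):
    """Score each feature by absolute Pearson correlation with the target,
    a simple stand-in for the paper's feature-selection scoring functions."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    return np.abs(Xc.T @ yc) / len(y)

scores = relevance_scores(X, y)
selected = np.argsort(scores)[::-1][:3]   # keep the 3 most relevant features

# Train/test split, then ordinary least squares on the selected features.
train, test = slice(0, 400), slice(400, None)
A_train = np.c_[X[train][:, selected], np.ones(400)]
w, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
A_test = np.c_[X[test][:, selected], np.ones(100)]
pred = A_test @ w

# R^2 on held-out transfers as the prediction-accuracy measure.
ss_res = ((y[test] - pred) ** 2).sum()
ss_tot = ((y[test] - y[test].mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot
```

The relevance scores double as a crude feature-importance ranking, mirroring how the paper uses prediction results to identify which network parameters matter most for throughput.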