Author: "Xianzhang Chen" / Topic: computer science - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Xianzhang Chen"' showing total 70 results

1. Self-Adapting Channel Allocation for Multiple Tenants Sharing SSD Devices

Author: Duo Liu, Yujuan Tan, Renping Liu, Runyu Zhang, Xianzhang Chen, and Liang Liang
Subjects: Channel allocation schemes, business.industry, Computer science, Electrical and Electronic Engineering, business, Computer Graphics and Computer-Aided Design, Software, Computer network
Published: 2022
Full Text: View/download PDF

2. Bridging Mismatched Granularity Between Embedded File Systems and Flash Memory

Author: Runyu Zhang, Duo Liu, Chengliang Wang, Chaoshu Yang, Xiongxiong She, Xianzhang Chen, Zhaoyan Shen, and Yujuan Tan
Subjects: File system, Bridging (networking), Write amplification, business.industry, Computer science, computer.software_genre, Computer Graphics and Computer-Aided Design, Flash memory, Metadata, Logical conjunction, Embedded system, Granularity, Electrical and Electronic Engineering, Performance improvement, business, computer, Software
Abstract: The mismatch between logical and physical I/O granularity inhibits the deployment of embedded file systems. Most existing embedded file systems manage logical space with a small unit, which no longer matches the flash operation granularity. Manually enlarging the logical I/O granularity of file systems requires enormous transplanting efforts. Moreover, large logical pages bring the write amplification problem, which leads to severe space consumption and performance collapse. This article designs a novel storage middleware, NV-middle, for legacy embedded file systems with large-capacity flash memories. Legacy embedded storage schemes can be smoothly transplanted into new platforms with different hardware read/write granularity. Moreover, the legacy optimization schemes can be maximally preserved, without inducing write amplification problems. We implement NV-middle with the state-of-the-art embedded file system, YAFFS2. Comprehensive evaluations show that NV-middle can achieve times of performance improvement over manually transplanted YAFFS2 with various workloads.
Published: 2021
Full Text: View/download PDF

3. Contour: A Process Variation Aware Wear-Leveling Mechanism for Inodes of Persistent Memory File Systems

Author: Wang Xinxin, Chaoshu Yang, Qingfeng Zhuge, Weiwen Jiang, Xianzhang Chen, and Edwin H.-M. Sha
Subjects: File system, Computer science, Linux kernel, 02 engineering and technology, inode, Parallel computing, computer.software_genre, 020202 computer hardware & architecture, Theoretical Computer Science, Process variation, Memory management, Computational Theory and Mathematics, Hardware and Architecture, 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), Table (database), computer, Software, Wear leveling
Abstract: Existing persistent memory file systems exploit the fast, byte-addressable persistent memory (PM) to boost storage performance but ignore the limited endurance of PM. In particular, the PM storing the inode section is extremely vulnerable because inodes are the most frequently updated, are fixed at one location throughout their lifetime, and require immediate persistency. The huge endurance variation of persistent memory domains caused by process variation makes things even worse. In this article, we propose a process variation aware wear-leveling mechanism called Contour for the inode section of persistent memory file systems. Contour first enables the movement of inodes by virtualizing the inodes with a deflection table. Then, Contour adopts a cross-domain migration algorithm and an intra-domain migration algorithm to balance the writes across and within the memory domains. We implement the proposed Contour mechanism in Linux kernel 4.4.30 based on a real persistent memory file system, SIMFS. We use standard benchmarks, including Filebench, MySQL, and FIO, to evaluate Contour. Extensive experimental results show Contour can improve the wear ratios of pages by 417.8× and 4.5× over the original SIMFS and PCV, the state-of-the-art inode wear-leveling algorithm, respectively. Meanwhile, the average performance overhead and wear overhead of Contour are 0.87 and 0.034 percent in application-level workloads, respectively.
Published: 2021
Full Text: View/download PDF

4. On the Design of Minimal-Cost Pipeline Systems Satisfying Hard/Soft Real-Time Constraints

Author: Qingfeng Zhuge, Edwin H.-M. Sha, Lei Yang, Weiwen Jiang, Xianzhang Chen, and Hailiang Dong
Subjects: 020203 distributed computing, Mathematical optimization, Computer science, Pipeline (computing), Probabilistic logic, Approximation algorithm, 02 engineering and technology, 020202 computer hardware & architecture, Computer Science Applications, Human-Computer Interaction, Pipeline transport, 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Time complexity, Throughput (business), Random variable, Integer programming, Information Systems
Abstract: Pipeline systems provide high throughput for applications by overlapping the executions of tasks. In the architectures with heterogeneity, two basic issues in the design of application-specific pipelines need to be studied: what type of functional unit to execute each task, and where to place buffers. Due to the increasing complexity of applications, pipeline designs face a bundle of problems. One of the most challenging problems is the uncertainty on the execution times, which makes the deterministic techniques inapplicable. In this paper, the execution times are modeled as random variables. Given an application, our objective is to construct the optimal pipeline, such that the total cost of the resultant pipeline can be minimized while satisfying the required timing constraints with the given guaranteed probability. We first prove the NP-hardness of the problem. Then, we present Mixed Integer Linear Programming (MILP) formulations to obtain the optimal solution. Due to the high time complexity of MILP, we devise an efficient $(1+\varepsilon)$-approximation algorithm, where the value of $\varepsilon$ is less than 5 percent in practice. Experimental results show that our algorithms can achieve significant reductions in cost over the existing techniques, reaching up to 31.93 percent on average.
Published: 2021
Full Text: View/download PDF

5. Improving the Performance of Deduplication-Based Storage Cache via Content-Driven Cache Management Methods

Author: Hong Jiang, Xianzhang Chen, Zhichao Yan, Witawas Srisa-an, Duo Liu, Jing Xie, Yujuan Tan, and Congcong Xu
Subjects: Hardware_MEMORYSTRUCTURES, Computational Theory and Mathematics, Distributed database, Hardware and Architecture, Computer science, CPU cache, Backup, Distributed computing, Signal Processing, Redundancy (engineering), Data deduplication, Cache, Cache algorithms
Abstract: Data deduplication, as a proven technology for effective data reduction in backup and archiving storage systems, is also showing promise in increasing the logical space capacity for storage caches by removing redundant data. However, our in-depth evaluation of the existing deduplication-aware caching algorithms reveals that they only work well when the cached block size is set to 4 KB. Unfortunately, modern storage systems often set the block size to be much larger than 4 KB, and in this scenario, the overall performance of these caching schemes drops below that of the conventional replacement algorithms without any deduplication. There are several reasons for this performance degradation. The first reason is the deduplication overhead, which is the time spent on generating the data fingerprints and using them to identify duplicate data. Such overhead offsets the benefits of deduplication. The second reason is the extremely low cache space utilization caused by read and write alignment. The third reason is that existing algorithms only exploit access locality to select blocks for replacement. There is a lost opportunity to effectively leverage the content usage patterns such as intensity of content redundancy and sharing in deduplication-based storage caches to further improve performance. We propose CDAC, a Content-driven Deduplication-Aware Cache, to address this problem. CDAC focuses on exploiting the content redundancy in blocks and intensity of content sharing among source addresses in cache management strategies. We have implemented CDAC based on LRU and ARC algorithms, called CDAC-LRU and CDAC-ARC respectively. Our extensive experimental results show that CDAC-LRU and CDAC-ARC outperform the state-of-the-art deduplication-aware caching algorithms, D-LRU and D-ARC, by up to 23.83X in read cache hit ratio, with an average of 3.23X, and up to 53.3 percent in IOPS, with an average of 49.8 percent, under a real-world mixed workload when the cache size ranges from 20 to 50 percent of the workload size and the block size ranges from 4 KB to 32 KB.
Published: 2021
Full Text: View/download PDF
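
Illustrative sketch (not part of the catalog record): the abstract above describes caches that fingerprint block contents so that several source addresses can share one cached copy. The Python below is a minimal, generic deduplicating LRU cache under stated assumptions. It is not CDAC, D-LRU, or D-ARC as published; the class name `DedupLRUCache`, the use of SHA-1 as the fingerprint, and the eviction policy are all hypothetical choices made for illustration.

```python
import hashlib
from collections import OrderedDict


class DedupLRUCache:
    """Generic content-addressed LRU cache (illustrative only, capacity >= 1)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.addr_to_fp = {}        # source address -> content fingerprint
        self.store = OrderedDict()  # fingerprint -> (data, set of addresses), in LRU order

    def _unlink(self, addr, fp):
        # Detach an address from a block; drop the block once nothing refers to it.
        data, addrs = self.store[fp]
        addrs.discard(addr)
        if not addrs:
            del self.store[fp]

    def put(self, addr, data):
        fp = hashlib.sha1(data).hexdigest()  # fingerprinting is the deduplication overhead
        old_fp = self.addr_to_fp.get(addr)
        if old_fp is not None and old_fp != fp:
            self._unlink(addr, old_fp)
        self.addr_to_fp[addr] = fp
        if fp in self.store:
            # Duplicate content: share the existing copy instead of storing it again.
            self.store[fp][1].add(addr)
            self.store.move_to_end(fp)
            return
        self.store[fp] = (data, {addr})
        while len(self.store) > self.capacity:
            # Evict the least recently used unique block and every address mapped to it.
            _, (_, addrs) = self.store.popitem(last=False)
            for a in addrs:
                del self.addr_to_fp[a]

    def get(self, addr):
        fp = self.addr_to_fp.get(addr)
        if fp is None:
            return None  # cache miss
        self.store.move_to_end(fp)
        return self.store[fp][0]
```

In this sketch, capacity is counted in unique blocks, so addresses holding identical content consume a single slot. Replacement here considers recency only, which is the limitation the abstract attributes to existing algorithms.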

6. Self-Balancing Federated Learning With Global Imbalanced Data in Mobile Systems

Author: Duo Liu, Yujuan Tan, Liang Liang, Renping Liu, Xianzhang Chen, and Moming Duan
Subjects: Training set, Distributed database, Artificial neural network, Computer science, business.industry, Deep learning, Machine learning, computer.software_genre, Federated learning, Data modeling, Computational Theory and Mathematics, Hardware and Architecture, Server, Signal Processing, Artificial intelligence, Divergence (statistics), business, computer
Abstract: Federated learning (FL) is a distributed deep learning method that enables multiple participants, such as mobile and IoT devices, to contribute to a neural network while their private training data remains on local devices. This distributed approach is promising in mobile systems, which have a large corpus of decentralized data and require high privacy. However, unlike common datasets, the data distribution of mobile systems is imbalanced, which increases the bias of the model. In this article, we demonstrate that imbalanced distributed training data causes an accuracy degradation of FL applications. To counter this problem, we build a self-balancing FL framework named Astraea, which alleviates the imbalances by 1) Z-score-based data augmentation, and 2) Mediator-based multi-client rescheduling. The proposed framework relieves global imbalance by adaptive data augmentation and downsampling, and to average out the local imbalance, it creates the mediator to reschedule the training of clients based on the Kullback–Leibler divergence (KLD) of their data distribution. Compared with FedAvg, the vanilla FL algorithm, Astraea shows +4.39 and +6.51 percent improvement of top-1 accuracy on the imbalanced EMNIST and imbalanced CINIC-10 datasets, respectively. Meanwhile, the communication traffic of Astraea is reduced by 75 percent compared to FedAvg.
Published: 2021
Full Text: View/download PDF
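
Illustrative sketch (not part of the catalog record): the abstract above mentions rescheduling clients based on the Kullback–Leibler divergence (KLD) of their data distribution. The Python below only shows how the KLD between a client's label distribution and the uniform distribution could be computed. The helper names and the choice of the uniform distribution as the reference are assumptions for illustration; Astraea's actual mediator logic is not reproduced here.

```python
import math
from collections import Counter


def class_distribution(labels, num_classes):
    """Empirical label distribution over classes 0..num_classes-1."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def kl_divergence(p, q):
    """D_KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kld_to_uniform(labels, num_classes):
    """How far a label set is from perfectly balanced (0.0 means balanced)."""
    p = class_distribution(labels, num_classes)
    uniform = [1.0 / num_classes] * num_classes
    return kl_divergence(p, uniform)


# Two hypothetical clients with opposite skews over two classes.
client_a = [0, 0, 0, 1]
client_b = [1, 1, 1, 0]
print(kld_to_uniform(client_a, 2))             # skewed client: positive divergence
print(kld_to_uniform(client_a + client_b, 2))  # combined data is balanced
```

Grouping clients whose skews cancel lowers the divergence of the combined data, which gives an intuition for why a KLD-based grouping criterion can average out local imbalance.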

7. Optimizing synchronization mechanism for block-based file systems using persistent memory

Author: Xianzhang Chen, Qingfeng Zhuge, Duo Liu, Edwin H.-M. Sha, Runyu Zhang, and Chaoshu Yang
Subjects: File system, Hardware_MEMORYSTRUCTURES, Data consistency, Computer Networks and Communications, business.industry, Computer science, ext4, 020206 networking & telecommunications, Linux kernel, 02 engineering and technology, Data loss, computer.software_genre, Synchronization, Persistence (computer science), Hardware and Architecture, Embedded system, Synchronization (computer science), 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), 020201 artificial intelligence & image processing, business, computer, Software, Block (data storage)
Abstract: Existing block-based file systems employ a buffer caching mechanism to improve performance, which may result in data loss in the case of power failure or system crash. To avoid data loss, the file systems provide synchronization operations for applications to synchronously write the dirty data in the DRAM cache back to the slow block devices. However, the synchronization operations can severely degrade the performance of the file system since they violate the intent of the buffer caching mechanism. In this paper, we propose to relieve the overhead of synchronization operations while ensuring data reliability by utilizing a small Persistent Memory (PM). The proposed Persistent Memory assisted Write-back (PMW) mechanism includes a dedicated Copy-on-Write mechanism to guarantee data consistency and a write-back mechanism across PM and the block devices. We implement the proposed PMW in the Linux kernel based on Ext4. The experimental results show that PMW can achieve about 2.2× and 1.6× performance improvement over the original Ext4 and AFCM, the state-of-the-art PM-based synchronization mechanism, on the TPCC workload, respectively.
Published: 2020
Full Text: View/download PDF

8. Separable Binary Convolutional Neural Network on Embedded Systems

Author: Yujuan Tan, Chaoshu Yang, Liang Liang, Renping Liu, Yingjian Ling, Duo Liu, Runyu Zhang, Weilue Wang, Chunhua Xiao, and Xianzhang Chen
Subjects: business.industry, Computer science, Binary number, 02 engineering and technology, Convolutional neural network, 020202 computer hardware & architecture, Theoretical Computer Science, Separable space, Computational Theory and Mathematics, Kernel (image processing), Hardware and Architecture, Embedded system, Principal component analysis, 0202 electrical engineering, electronic engineering, information engineering, Network performance, business, Software
Abstract: We have witnessed the tremendous success of deep neural networks. However, this success comes with considerable memory and computational costs, which make it difficult to deploy these networks directly on resource-constrained embedded systems. To address this problem, we propose TaijiNet, a separable binary network, to reduce the storage and computational overhead while maintaining a comparable accuracy. Furthermore, we also introduce a strategy called partial binarized convolution, which binarizes only unimportant kernels to efficiently balance network performance and accuracy. Our approach is evaluated on the CIFAR-10 and ImageNet datasets. The experimental results show that with the proposed TaijiNet, the separable binary versions of AlexNet and ResNet-18 can achieve 26× and 6.4× compression rates, respectively, with comparable accuracy when compared with the full-precision versions. In addition, by adjusting the PCA threshold, the xnor version of Taiji-AlexNet improves accuracy by 4-8 percent compared with other state-of-the-art methods.
Published: 2020
Full Text: View/download PDF

9. APMigration: Improving Performance of Hybrid Memory Performance via An Adaptive Page Migration Method

Author: Duo Liu, Yujuan Tan, Witawas Srisa-an, Xianzhang Chen, Zhichao Yan, and Baiping Wang
Subjects: 020203 distributed computing, Random access memory, Hardware_MEMORYSTRUCTURES, Computer science, Frame (networking), 02 engineering and technology, computer.software_genre, Flash memory, Non-volatile memory, Memory management, Computational Theory and Mathematics, Hardware and Architecture, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Operating system, Non-volatile random-access memory, computer, Dram, Data migration
Abstract: Byte-addressable, non-volatile memory (NVRAM) combines the benefits of DRAM and flash memory. However, due to its slower speed than DRAM, it is best to deploy it in combination with typical DRAM. In such hybrid NVRAM systems, frequently accessed, hot pages can be stored in DRAM while other cold pages can reside in NVRAM, providing the benefits of both high performance (from DRAM) and lower power consumption and cost/performance (from NVRAM). While the idea seems beneficial, realizing an efficient hybrid NVRAM system requires careful page migration and accurate data temperature measurement. Existing solutions, however, often cause invalid migrations due to inaccurate data temperature accounting, because hot and cold pages are separately identified in DRAM and NVRAM regions. Moreover, since a new NVRAM frame is always allocated for each page swapped back to NVRAM, a large number of unnecessary NVRAM writes are generated during each page migration. Based on these observations, we propose APMigrate, an adaptive data migration approach for hybrid NVRAM systems. APMigrate consists of two parts, UIMigrate and LazyWriteback. UIMigrate focuses on eliminating invalid page migrations by considering data temperature in the entire DRAM-NVRAM space, while LazyWriteback focuses on rewriting only dirty data back when the page is swapped back to NVRAM. Our experiments using SPEC 2006 show that APMigrate can reduce the number of migrations and improve performance by up to 90 percent compared to existing state-of-the-art approaches. For some workloads, LazyWriteback can reduce unnecessary NVRAM writes for existing page migrations by up to 75 percent.
Published: 2020
Full Text: View/download PDF

10. Downsizing Without Downgrading: Approximated Dynamic Time Warping on Nonvolatile Memories

Author: Yingjian Ling, Xianzhang Chen, Duo Liu, Po-Chun Huang, Renping Liu, Yi Gu, Liang Liang, Kan Zhong, and Xingni Li
Subjects: Dynamic time warping, Similarity (geometry), Computer science, 02 engineering and technology, computer.software_genre, Computer Graphics and Computer-Aided Design, 020202 computer hardware & architecture, Euclidean distance, Upsampling, 0202 electrical engineering, electronic engineering, information engineering, Data mining, Electrical and Electronic Engineering, Time series, Wireless sensor network, computer, Software
Abstract: In recent years, time-series data have emerged in a variety of application domains, such as wireless sensor networks and surveillance systems. To identify the similarity between time-series data, the Euclidean distance and its variations are common metrics that quantify the differences between time-series data. However, the Euclidean distance is limited by its inability to elastically shift with the time axis, which motivates the development of dynamic time warping (DTW) algorithms. While DTW algorithms have been proven very useful in diversified applications like speech recognition, their efficacy might be seriously affected by the resolution of the time-series data. However, high-resolution time-series data might take up a gigantic amount of main memory and storage space, which will slow down the DTW analysis procedure. This makes the upscaling of DTW analysis more challenging, especially for in-memory data analytics platforms with limited nonvolatile memory space. In this paper, we propose a strategy to downsample time-series data to significantly reduce their size without seriously affecting the precision of the results obtained by DTW algorithms (downsizing without downgrading). In other words, this paper proposes a technique to remove the unimportant details that are largely ignored by DTW algorithms. The efficacy of the proposed technique is verified by a series of experimental studies, where the results are quite encouraging.
Published: 2020
Full Text: View/download PDF
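
Illustrative sketch (not part of the catalog record): the abstract above contrasts the Euclidean distance with dynamic time warping (DTW) and proposes downsampling series before DTW analysis. The Python below shows the textbook quadratic-time DTW recurrence together with a simple block-mean downsampler. Both function names are hypothetical, and the block-mean reduction is only a stand-in: the abstract does not specify the paper's downsampling technique.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW with absolute-difference cost, O(len(a) * len(b)) time and space."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats a match with b[j-1]
                                 cost[i][j - 1],      # b[j-1] repeats a match with a[i-1]
                                 cost[i - 1][j - 1])  # both series advance
    return cost[n][m]


def downsample_mean(series, factor):
    """Replace each block of `factor` samples by its mean (assumed stand-in method)."""
    blocks = (series[i:i + factor] for i in range(0, len(series), factor))
    return [sum(block) / len(block) for block in blocks]


# A time-stretched copy aligns perfectly under DTW despite the length mismatch.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
print(downsample_mean([1, 3, 5, 7], 2))       # [2.0, 6.0]
```

Because the cost table grows with the product of the two series lengths, halving both lengths cuts the table to roughly a quarter, which is the memory pressure the abstract refers to.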

11. ChordMap: Automated Mapping of Streaming Applications onto CGRA

Author: Zhaoying Li, Tulika Mitra, Anuj Pathania, Dhananjaya Wijerathne, Xianzhang Chen, and Parallel Computing Systems (IvI, FNWI)
Subjects: Computer science, Compiler, Parallel computing, Electrical and Electronic Engineering, computer.software_genre, Computer Graphics and Computer-Aided Design, Throughput (business), computer, Software
Abstract: Streaming applications, consisting of several communicating kernels, are ubiquitous in the embedded computing systems. The synchronous data flow (SDF) is commonly used to capture the complex communication patterns among the kernels. The general-purpose processors cannot meet the throughput requirement of the compute-intensive kernels in the current and emerging applications. The coarse-grained reconfigurable arrays (CGRAs) are well-suited to accelerate the individual kernel and the compiler technology is well-developed to support the mapping of a kernel onto a CGRA accelerator. However, the system-level mapping of the entire streaming application onto a resource-constrained CGRA to maximize throughput remains unexplored. We introduce a novel CGRA mapper, called ChordMap, to automatically generate a high-quality mapping of streaming applications represented as SDF onto CGRAs. We propose an optimized spatio-temporal mapping with modulo-scheduling that judiciously employs concurrent execution of multiple kernels to improve parallelism and thereby maximize throughput. ChordMap achieves, on average, 1.74× higher throughput across eight streaming applications compared to the state-of-the-art.
Published: 2022

12. CSAFL: A Clustered Semi-Asynchronous Federated Learning Framework

Author: Xianzhang Chen, Ao Ren, Moming Duan, Duo Liu, Chengliang Wang, Yujuan Tan, Li Li, and Yu Zhang
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Information privacy, Artificial neural network, Distributed database, Computer science, Distributed computing, Federated learning, Machine Learning (cs.LG), Data modeling, Computer Science - Distributed, Parallel, and Cluster Computing, Asynchronous communication, Convergence (routing), Distributed, Parallel, and Cluster Computing (cs.DC), Baseline (configuration management)
Abstract: Federated learning (FL) is an emerging distributed machine learning paradigm that protects privacy and tackles the problem of isolated data islands. At present, there are two main communication strategies of FL: synchronous FL and asynchronous FL. The advantages of synchronous FL are that the model has high precision and fast convergence speed. However, this synchronous communication strategy has the risk that the central server waits too long for the devices, namely, the straggler effect, which has a negative impact on some time-critical applications. Asynchronous FL has a natural advantage in mitigating the straggler effect, but there are threats of model quality degradation and server crash. Therefore, we combine the advantages of these two strategies to propose a clustered semi-asynchronous federated learning (CSAFL) framework. We evaluate CSAFL based on four imbalanced federated datasets in a non-IID setting and compare CSAFL to the baseline methods. The experimental results show that CSAFL significantly improves test accuracy by more than +5% on the four datasets compared to TA-FedAvg. In particular, CSAFL improves absolute test accuracy by +34.4% on non-IID FEMNIST compared to TA-FedAvg. This paper will be presented at IJCNN 2021.
Published: 2021
Full Text: View/download PDF

13. On the Design of Time-Constrained and Buffer-Optimal Self-Timed Pipelines

Author: Edwin H.-M. Sha, Lei Yang, Weiwen Jiang, Jingtong Hu, Qingfeng Zhuge, and Xianzhang Chen
Subjects: Marked graph, Matching (graph theory), Computer science, Pipeline (computing), 02 engineering and technology, Parallel computing, Computer Graphics and Computer-Aided Design, Synchronization, 020202 computer hardware & architecture, Reduction (complexity), Asynchronous communication, 0202 electrical engineering, electronic engineering, information engineering, Electrical and Electronic Engineering, Field-programmable gate array, Integer programming, Software
Abstract: Pipelining is a powerful technique to achieve high performance in computing systems. However, as computing platforms become large-scale and integrate with heterogeneous processing elements (PEs) (CPUs, GPUs, field-programmable gate arrays, etc.), it is difficult to employ a global clock to achieve synchronous pipelines. Therefore, self-timed (or asynchronous) pipelines are usually adopted. Nevertheless, due to their complex running behavior, the performance modeling and systematic optimizations for self-timed pipeline (STP) systems are more complicated than those for synchronous ones. This paper employs marked graph theory to model STPs and presents algorithms to detect performance bottlenecks. Based on the proposed model, we observe that the system performance can be improved by inserting buffers. Due to the limited memory resources on the PEs, it is critical to minimize the number of buffers for STPs while satisfying the required timing constraints. In this paper, we propose integer linear programming formulations to obtain the optimal solutions and devise efficient algorithms to obtain the near-optimal solutions. Experimental results show that the proposed algorithms can achieve 53.10% improvement in the maximum performance and 54.04% reduction in the number of buffers, compared with the technique for the slack matching problem.
Published: 2019
Full Text: View/download PDF

14. HydraFS: an efficient NUMA-aware in-memory file system

Author: Kai Liu, Edwin H.-M. Sha, Xianzhang Chen, Zhixiang Liu, Ting Wu, Qingfeng Zhuge, and Chunhua Xiao
Subjects: File system, Hardware_MEMORYSTRUCTURES, Computer Networks and Communications, Computer science, 020206 networking & telecommunications, Linux kernel, 02 engineering and technology, Thread (computing), computer.software_genre, Scalability, 0202 electrical engineering, electronic engineering, information engineering, Operating system, 020201 artificial intelligence & image processing, computer, Software
Abstract: Emerging persistent file systems are designed to achieve high-performance data processing by effectively exploiting the advanced features of Non-volatile Memory (NVM). Non-uniform memory access (NUMA) architectures are universally used in high-performance computing and data centers due to their scalability. However, existing NVM-based in-memory file systems are all designed for uniform memory access systems. Their performance is not satisfactory on NUMA machines as they do not consider the architecture of multiple nodes and the asymmetric memory access speed. In this paper, we design an efficient NUMA-aware in-memory file system which distributes file data on all nodes to effectively balance the loads of file requests. Three approaches for improving the performance of the file system on NUMA machines are proposed, including a Node-oriented File Creation algorithm to dispatch files over multiple nodes, a File-oriented Thread Binding algorithm to bind threads to the gainful nodes, and a buffer assignment technique to allocate the user buffer from the proper node. Further, based on the new design, we implement a functional NUMA-aware in-memory file system, HydraFS, in the Linux kernel. Extensive experiments show that HydraFS significantly outperforms existing representative in-memory file systems on NUMA machines. The average performance of HydraFS is 76.6%, 91.9%, and 26.7% higher than EXT4-DAX, PMFS, and SIMFS, respectively.
Published: 2019
Full Text: View/download PDF

15. HiNextApp: A context-aware and adaptive framework for app prediction in mobile systems

Author: Renping Liu, Shiming Li, Duo Liu, Liang Liang, Yong Guan, Chaoneng Xiang, Xianzhang Chen, and Jinting Ren
Subjects: General Computer Science, business.industry, Computer science, 020209 energy, Response time, 020206 networking & telecommunications, Context (language use), 02 engineering and technology, Machine learning, computer.software_genre, Variety (cybernetics), Bayes' theorem, Memory management, Systems management, mental disorders, 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), Contextual information, Artificial intelligence, Electrical and Electronic Engineering, business, computer
Abstract: The variety of applications (Apps) installed on mobile systems such as smartphones enriches our lives but makes system management more difficult. For example, finding a specific App becomes more inconvenient as more Apps are installed on smartphones, and App response time can become longer because of the gap between more, larger Apps and limited memory capacity. Recent work has proposed several methods of predicting the Apps that will be used next in the immediate future (hereinafter app-prediction) to solve these issues, but these methods suffer from low prediction accuracy and high training costs. In particular, applying app-prediction to memory management (such as LMK) and App prelaunching places high requirements on prediction accuracy and training costs. In this paper, we propose an app-prediction framework, named HiNextApp, to improve the app-prediction accuracy and reduce training costs in mobile systems. HiNextApp is based on contextual information, and can adjust the size of prediction periods adaptively. The framework mainly consists of two parts: a non-uniform Bayes model and an elastic algorithm. The experimental results show that HiNextApp can effectively improve the prediction accuracy and reduce training times. Besides, compared with the traditional Bayes model, the overhead of our framework is relatively low.
Published: 2019
Full Text: View/download PDF

16. FitCNN: A cloud-assisted and low-cost framework for updating CNNs on IoT devices

Author: Yujuan Tan, Chaoshu Yang, Liang Liang, Jinting Ren, Duo Liu, Xianzhang Chen, Moming Duan, Renping Liu, and Shiming Li
Subjects: Artificial neural network, Contextual image classification, Computer Networks and Communications, business.industry, Computer science, Real-time computing, 020206 networking & telecommunications, Cloud computing, 02 engineering and technology, Convolutional neural network, Upload, User experience design, Hardware and Architecture, 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), 020201 artificial intelligence & image processing, business, Mobile device, Software
Abstract: Recently, convolutional neural networks (CNNs) have achieved state-of-the-art accuracies in image classification and recognition tasks. CNNs are usually deployed in the cloud to handle data collected from IoT devices, such as smartphones and unmanned systems. However, significant data transmission overhead and privacy issues have made it necessary to use CNNs directly on the device side. Nevertheless, the trained model deployed on mobile devices cannot effectively handle the unknown data and objects in new environments, which could lead to low accuracy and poor user experience. Hence, it would be crucial to re-train a better model via future unknown data. However, given the tremendous computing cost and memory usage, training a CNN on IoT devices with limited hardware resources is intolerable in practice. To solve this issue, using the power of the cloud to assist mobile devices in training a deep neural network becomes a promising solution. Therefore, this paper proposes a cloud-assisted CNN framework, named FitCNN, with incremental learning and low data transmission, to reduce the overhead of updating CNNs deployed on devices. To reduce the data transmission during incremental learning, we propose a strategy, called Distiller, to selectively upload the data that is worth learning, and develop an extracting strategy, called Juicer, to choose a small number of weights from the new CNN model generated on the cloud to update the corresponding old ones on devices. Experimental results show that the Distiller strategy can reduce the data transmission of uploading by 39.4% on a certain dataset, and the Juicer strategy reduces the data transmission of updating by more than 60% with multiple CNNs and datasets.
Published: 2019
Full Text: View/download PDF

17. LPE: Locality-Based Dead Prediction in Exclusive TLB for Large Coverage

Author: Yujuan Tan, Jing Yan, Jingcheng Liu, Chengliang Wang, Xianzhang Chen, and Zhulin Ma
Subjects: Hardware and Architecture, Computer science, Locality, Translation lookaside buffer, General Medicine, Parallel computing, Electrical and Electronic Engineering, Memory systems
Abstract: The translation lookaside buffer (TLB) is critical to the performance of modern multi-level memory systems. However, due to the limited size of the TLB itself, its address coverage is limited. Adopting a two-level exclusive TLB hierarchy can increase the coverage [M. Swanson, L. Stoller and J. Carter, Increasing TLB reach using superpages backed by shadow memory, 25th Annual Int. Symp. Computer Architecture (1998); H.P. Chang, T. Heo, J. Jeong and J. Huh, Hybrid TLB coalescing: Improving TLB translation coverage under diverse fragmented memory allocations, ACM SIGARCH Comput. Arch. News 45 (2017) 444–456] to improve memory performance. However, after analyzing the existing two-level exclusive TLBs, we find that a large number of “dead” entries (entries that will have no further use) remain in the last-level TLB (LLT) for a long time, occupying much cache space and resulting in a low TLB hit rate. Based on this observation, we propose exploiting temporal and spatial locality to predict and identify dead entries in the exclusive LLT and remove them as soon as possible, leaving room for more valid data and increasing the TLB hit rate. Extensive experiments show that our method increases the average hit rate by 8.67%, to a maximum of 19.95%, and reduces total latency by an average of 9.82%, up to 24.41%.
Published: 2021
Full Text: View/download PDF

18. DFShards

Author: Duo Liu, Congcong Xu, Ailing Yu, Xianzhang Chen, Zhulin Ma, and Yujuan Tan
Subjects: 020203 distributed computing, Hardware_MEMORYSTRUCTURES, Computer science, CPU cache, 02 engineering and technology, Construct (python library), Function (mathematics), Set (abstract data type), Data access, Shard, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Key (cryptography), Cache, Algorithm
Abstract: The Miss Ratio Curve (MRC) describes the cache miss ratio as a function of the cache size. It has various shapes that represent the data access behaviors of workloads in the cache. The MRC is an effective tool to guide cache partitioning, but its real-time construction is challenging. Miniature Simulation is a novel approach that constructs MRCs for non-stack algorithms in real time by feeding a small number of sample references to multiple mini-caches simultaneously to get the miss ratios. However, when using Miniature Simulation, the size and number of mini-caches are difficult to set before the program runs. First, it may set too many mini-caches and cause repeated simulations. Second, it may miss some important cache sizes, consequently constructing a less precise MRC shape and causing incorrect cache partitioning. To address this problem, we propose DFShards, an adaptive cache shards (mini-caches) configuration approach based on program access patterns. The key idea is to dynamically adjust the configuration of the cache shards, including the total number of cache shards and the size of each cache shard, based on the access behaviors, so that changes in the workload are reflected and a precise MRC is built, thereby achieving better cache partitioning and overall performance. Our extensive experiments show that DFShards can construct precise MRCs in real time while the program runs. Compared to the state-of-the-art approaches, it can save up to 47% of the cache space for MRC construction while increasing the cache hit ratio by up to 17%.
Published: 2021
Full Text: View/download PDF
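
Illustrative note (not part of the catalog record): the abstract above defines the MRC as the miss ratio as a function of cache size. The sketch below only illustrates that definition by replaying a full trace against an LRU cache at each candidate size. It is not Miniature Simulation and not DFShards; the function names and the toy trace are assumptions made for this example.

```python
from collections import OrderedDict


def lru_miss_ratio(trace, cache_size):
    """Replay `trace` through an LRU cache holding `cache_size` keys; return the miss ratio."""
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)          # hit: mark as most recently used
        else:
            misses += 1
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict the least recently used key
    return misses / len(trace) if trace else 0.0


def miss_ratio_curve(trace, cache_sizes):
    """Map each candidate cache size to its miss ratio (one full replay per size)."""
    return {size: lru_miss_ratio(trace, size) for size in cache_sizes}


# Toy cyclic trace (hypothetical): three keys accessed round-robin.
trace = ["a", "b", "c", "a", "b", "c", "a", "b"]
mrc = miss_ratio_curve(trace, [1, 2, 3])
print(mrc)
```

On this cyclic trace the curve stays flat until the cache can hold all three keys, which is the kind of shape information a partitioning policy would read off an MRC. Replaying the whole trace once per size is exactly the cost that sampling-based approaches aim to avoid.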

19. Forseti: An Efficient Basic-block-level Sensitivity Analysis Framework Towards Multi-bit Faults

Author: Moming Duan, Jinting Ren, Xianzhang Chen, Duo Liu, Chengliang Wang, and Renping Liu
Subjects: Speedup, Artificial neural network, Computer engineering, Computer science, Basic block, Overhead (computing), Sensitivity (control systems)
Abstract: The per-instruction sensitivity analysis framework is developed to evaluate the resiliency of a program and identify the segments of the program needing protection. However, for multi-bit hardware faults, the per-instruction sensitivity analysis frameworks can cause large overhead for redundant analyses. In this paper, we propose a basic-block-level sensitivity analysis framework, Forseti, to reduce the analysis overhead in analyzing impacts of modern microprocessors' multi-bit faults on programs. We implement Forseti in LLVM and evaluate it with five typical workloads. Extensive experimental results show that Forseti can achieve more than 90% sensitivity classification accuracy and 6.16× speedup over instruction-level analysis.
Published: 2021
Full Text: View/download PDF

20. FedSAE: A Novel Self-Adaptive Federated Learning Framework in Heterogeneous Systems

Author: Yujuan Tan, Yu Zhang, Chengliang Wang, Duo Liu, Moming Duan, Ao Ren, Li Li, and Xianzhang Chen
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Edge device, Artificial neural network, Computer science, Active learning (machine learning), Reliability (computer networking), Distributed computing, Machine Learning (cs.LG), Upload, Computer Science - Distributed, Parallel, and Cluster Computing, Complete information, Server, Overhead (computing), Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Federated Learning (FL) is a novel distributed machine learning approach which allows thousands of edge devices to train models locally without uploading data centrally to the server. But since real federated settings are resource-constrained, FL encounters systems heterogeneity, which directly causes many stragglers and then indirectly leads to a significant accuracy reduction. To solve the problems caused by systems heterogeneity, we introduce a novel self-adaptive federated framework, FedSAE, which adjusts the training task of devices automatically and selects participants actively to alleviate the performance degradation. In this work, we 1) propose FedSAE, which leverages the complete information of devices' historical training tasks to predict the affordable training workloads for each device. In this way, FedSAE can estimate the reliability of each device and self-adaptively adjust the amount of training load per client in each round. 2) combine our framework with Active Learning to self-adaptively select participants, which accelerates the convergence of the global model. In our framework, the server evaluates devices' training value based on their training loss. Then the server selects those clients with greater value for the global model to reduce communication overhead. The experimental results indicate that in a highly heterogeneous system, FedSAE converges faster than FedAvg, the vanilla FL framework. Furthermore, FedSAE outperforms FedAvg on several federated datasets: FedSAE improves test accuracy by 26.7% and reduces stragglers by 90.3% on average.
Comment: This paper will be presented at IJCNN 2021
Published: 2021
Full Text: View/download PDF

21. AIR Cache: A Variable-Size Block Cache Based on Fine-Grained Management Method

Author: Yuxiong Li, Xianzhang Chen, Congcong Xu, Duo Liu, Chengliang Wang, Leong Hou U, Yujuan Tan, and Mingliang Zhou
Subjects: Hardware_MEMORYSTRUCTURES, Computer science, Overhead (business), Variable size, Management methods, Parallel computing, Cache, Throughput (business), Market fragmentation, Block (data storage)
Abstract: Recently, adopting large cache blocks has received widespread attention in server-side storage caching. Besides reducing the management overheads of cache blocks, it can significantly boost the I/O throughput. However, although using large blocks has advantages in management overhead and I/O performance, existing fixed-size block management schemes in storage cache cannot effectively handle them under the complicated real-world workloads. We find that existing fixed-size block management methods will suffer from the fragmentation within the cache block and fail to identify hot/cold cache blocks correctly when adopting large blocks for caching.
Published: 2021
Full Text: View/download PDF

22. WMAlloc: A Wear-Leveling-Aware Multi-Grained Allocator for Persistent Memory File Systems

Author: Wenbin Wang, Shun Nie, Chaoshu Yang, Xianzhang Chen, Duo Liu, and Runyu Zhang
Subjects: File system, Computer science, business.industry, 020206 networking & telecommunications, Linux kernel, Memory bus, 02 engineering and technology, computer.software_genre, 020202 computer hardware & architecture, Persistence (computer science), Allocator, Memory management, Embedded system, 0202 electrical engineering, electronic engineering, information engineering, Binary heap, Persistent data structure, business, computer, Wear leveling, Heap (data structure)
Abstract: Emerging Persistent Memories (PMs) promise to revolutionize storage systems by providing fast, persistent data access on the memory bus. Therefore, persistent memory file systems are developed to achieve high performance by exploiting the advanced features of PMs. Unfortunately, PMs have limited write endurance. Furthermore, the existing space management strategies of persistent memory file systems usually ignore this problem, which can cause write operations to concentrate on a few cells of PM. The unbalanced writes can then damage the underlying PMs quickly, which seriously harms the data reliability of the file systems. However, existing wear-leveling-aware space management techniques mainly focus on improving the wear-leveling accuracy of PMs rather than reducing the overhead, which can seriously reduce the performance of persistent memory file systems. In this paper, we propose a Wear-Leveling-Aware Multi-Grained Allocator, called WMAlloc, to achieve wear-leveling of PM while improving the performance of persistent memory file systems. WMAlloc adopts multiple heap trees to manage the unused space of PM, and each heap tree represents an allocation granularity. Then, WMAlloc allocates the required less-worn blocks from the heap tree for each allocation. We implement the proposed WMAlloc in the Linux kernel based on NOVA, a typical persistent memory file system. Compared with DWARM, the state-of-the-art wear-leveling-aware space management technique, experimental results show that WMAlloc can achieve 1.52× lifetime of PM and 1.44× performance improvement on average.
Published: 2020
Full Text: View/download PDF

23. Themis: Malicious Wear Detection and Defense for Persistent Memory File Systems

Author: Xianzhang Chen, Wenbin Wang, Shun Nie, Duo Liu, Chaoshu Yang, and Runyu Zhang
Subjects: Scheme (programming language), File system, Random access memory, Hardware_MEMORYSTRUCTURES, business.industry, Computer science, Reliability (computer networking), 020206 networking & telecommunications, Linux kernel, 02 engineering and technology, computer.software_genre, 020202 computer hardware & architecture, Persistence (computer science), Memory management, 0202 electrical engineering, electronic engineering, information engineering, Set (psychology), business, computer, Dram, Computer network, computer.programming_language
Abstract: Persistent memory file systems can significantly improve performance by utilizing the advanced features of emerging Persistent Memories (PMs). Unfortunately, PMs have limited write endurance, and the design of persistent memory file systems usually ignores this problem. Accordingly, write-intensive applications, especially malicious wear-attack viruses, can quickly damage the underlying PM by calling the common interfaces of persistent memory file systems to write a few cells of PM continuously, which seriously threatens the data reliability of file systems. However, existing solutions to this problem for persistent memory file systems are not systematic and ignore the unlimited write endurance of DRAM. In this paper, we propose a malicious wear detection and defense mechanism for persistent memory file systems, called Themis, to solve this problem. Themis identifies a malicious wear attack according to the write traffic and the set lifespan of PM. Then, we design a wear-leveling scheme and migrate the writes of malicious wear attackers into DRAM to improve the lifespan of PMs. We implement Themis in the Linux kernel based on NOVA, a state-of-the-art persistent memory file system. Compared with DWARM, the state-of-the-art wear-aware memory management technique, experimental results show that Themis improves the lifetime of PM by 5774× and performance by 1.13×.
Published: 2020
Full Text: View/download PDF

24. MobileRE: A Hybrid Fault Tolerance Strategy Combining Erasure Codes and Replicas for Mobile Distributed Cluster

Author: Duo Liu, Zilin Zhang, Yujuan Tan, Xianzhang Chen, Yu Wu, Renping Liu, and Jinting Ren
Subjects: Distributed Computing Environment, Dynamic network analysis, business.industry, Computer science, Reliability (computer networking), Distributed computing, Bandwidth (signal processing), 020206 networking & telecommunications, Fault tolerance, 02 engineering and technology, Supercomputer, Computer data storage, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, business, Erasure code
Abstract: Fault tolerance techniques are of vital importance for ensuring data storage reliability in mobile distributed file systems. Traditional fault tolerance techniques, namely erasure codes and replicas, are suitable for wired data centers. However, they face challenges in mobile distributed environments, where nodes suffer from high failure probability and fluctuating bandwidth. In this paper, we present a hybrid fault tolerance strategy combining erasure codes and replicas for the mobile distributed cluster (MobileRE), to improve data reliability under dynamic network status. In MobileRE, we first formulate a reliability cost rate to indicate the cost of ensuring data reliability of the mobile cluster. MobileRE then adaptively applies the erasure code and replica algorithms based on real-time network bandwidth status to minimize the system reliability cost rate. Simulation results show that, compared with traditional designs that adopt only erasure codes or only replicas, MobileRE can significantly reduce the system reliability cost rate.
Published: 2020
Full Text: View/download PDF

25. Unified-TP: A Unified TLB and Page Table Cache Structure for Efficient Address Translation

Author: Qingfeng Zhuge, Hong Jiang, Zhichao Yan, Yujuan Tan, Duo Liu, Edwin H.-M. Sha, Xianzhang Chen, Zhulin Ma, and Chengliang Wang
Subjects: 010302 applied physics, Structure (mathematical logic), Scheme (programming language), Miss rate, Hardware_MEMORYSTRUCTURES, Computer science, Translation lookaside buffer, 02 engineering and technology, Parallel computing, 01 natural sciences, 020202 computer hardware & architecture, Memory management, 0103 physical sciences, Virtual memory, 0202 electrical engineering, electronic engineering, information engineering, Cache, Latency (engineering), Page table, Cache algorithms, computer, computer.programming_language
Abstract: To improve the performance of address translation in applications with large memory footprints, techniques such as hugepages and HW coalescing are proposed to increase the coverage of limited hardware translation entries by exploiting contiguous memory allocation to lower the Translation Lookaside Buffer (TLB) miss rate. Furthermore, Page Table Caches (PTCs) are proposed to store the upper-level page table entries to reduce the TLB miss handling latency. Both increasing TLB coverage and reducing TLB miss handling latency have proved to be effective in speeding up address translation, to a certain extent. Nevertheless, our preliminary studies suggest that the structural separation between TLBs and PTCs in existing computer systems makes these two methods less effective because they are exclusively used in TLBs and PTCs, respectively. In particular, the separate structures cannot dynamically adjust their sizes according to the workloads, resulting in low resource utilization and inefficient address translation. To address these issues, we propose a unified structure, called Unified-TP, which stores PTC and TLB entries together. In addition, our modified LRU algorithm helps identify the cold TLB and PTC entries and dynamically adjusts the numbers of TLB and PTC entries to adapt to different workloads. Furthermore, we introduce a parallel search scheme for incoming memory access requests. Our experimental results show that Unified-TP can reduce the number of TLB misses by an average of 35.69% and improve performance by an average of 11.12% compared with separately structured TLBs and PTCs.
Published: 2020
Full Text: View/download PDF

26. LOFFS: A Low-Overhead File System for Large Flash Memory on Embedded Devices

Author: Zhaoyan Shen, Duo Liu, Yujuan Tan, Chaoshu Yang, Xiongxiong She, Zili Shao, Runyu Zhang, and Xianzhang Chen
Subjects: File system, 050210 logistics & transportation, Computer science, YAFFS, business.industry, 05 social sciences, 02 engineering and technology, Construct (python library), computer.software_genre, Flash memory, 020202 computer hardware & architecture, Flash (photography), Embedded system, 0502 economics and business, 0202 electrical engineering, electronic engineering, information engineering, Memory footprint, business, computer, Booting
Abstract: Emerging applications like machine learning in embedded devices (e.g., satellite and vehicles) require huge storage space, which recently stimulates the widespread deployment of large-capacity flash memory in IoT devices. However, existing embedded file systems fall short in managing large-capacity storage efficiently for excessive memory consumption and poor booting performance. In this paper, we propose a novel embedded file system, LOFFS, to tackle the above issues and manage large-capacity NAND flash on resource-limited embedded devices. We redesign the space management mechanisms and construct hybrid file structures to achieve high performance with minimum resource occupation. We have implemented LOFFS in Linux, and the experimental results show that LOFFS outperforms YAFFS by 55.8% on average with orders of magnitude reductions on memory footprint.
Published: 2020
Full Text: View/download PDF

27. Efficient Multi-Grained Wear Leveling for Inodes of Persistent Memory File Systems

Author: Qingfeng Zhuge, Shun Nie, Chaoshu Yang, Xianzhang Chen, Fengshun Wang, Duo Liu, Edwin H.-M. Sha, and Runyu Zhang
Subjects: File system, Computer science, 0211 other engineering and technologies, Linux kernel, 02 engineering and technology, inode, computer.software_genre, 020202 computer hardware & architecture, Persistence (computer science), 0202 electrical engineering, electronic engineering, information engineering, Operating system, Table (database), computer, Wear leveling, 021106 design practice & management
Abstract: Existing persistent memory file systems usually store inodes in fixed locations, which ignores the external and internal imbalanced wear of inodes on the persistent memory (PM). Therefore, the PM for storing inodes can be easily damaged. Existing solutions achieve low wear-leveling accuracy with high-overhead data migrations. In this paper, we propose a Lightweight and Multi-grained Wear-leveling Mechanism, called LMWM, to solve these problems. We implement the proposed LMWM in the Linux kernel based on NOVA, a typical persistent memory file system. Compared with MARCH, the state-of-the-art wear-leveling mechanism for the inode table, experimental results show that LMWM improves the lifetime of PM by 2.5× and performance by 1.12×.
Published: 2020
Full Text: View/download PDF

28. SSDKeeper: Self-Adapting Channel Allocation to Improve the Performance of SSD Devices

Author: Runyu Zhang, Duo Liu, Xianzhang Chen, Liang Liang, Yujuan Tan, and Renping Liu
Subjects: 020203 distributed computing, Hardware_MEMORYSTRUCTURES, Channel allocation schemes, business.industry, Computer science, Distributed computing, 0202 electrical engineering, electronic engineering, information engineering, Temporal isolation among virtual machines, Data center, 02 engineering and technology, business, 020202 computer hardware & architecture
Abstract: Solid state drives (SSDs) have been widely deployed in high-performance data center environments, where multiple tenants usually share the same hardware. However, traditional SSDs distribute the users’ incoming data uniformly across all SSD channels, which leads to numerous access conflicts. Meanwhile, SSDs that statically allocate one or several channels to one tenant sacrifice device parallelism and capacity. When SSDs are shared by tenants with different access patterns, inappropriate channel allocation degrades SSD performance. In this paper, we propose a self-adapting channel allocation mechanism, named SSDKeeper, for multiple tenants to share one SSD. SSDKeeper employs a machine-learning-assisted algorithm to take full advantage of SSD parallelism while providing performance isolation. By collecting multi-tenant access patterns and training a model, SSDKeeper selects the channel allocation strategy with the lowest overall response latency for multiple tenants. Experimental results show that SSDKeeper improves the overall performance by 24% with negligible overhead.
Published: 2020
Full Text: View/download PDF

29. Optimizing Performance of Persistent Memory File Systems using Virtual Superpages

Author: Duo Liu, Shun Nie, Edwin H.-M. Sha, Runyu Zhang, Chaoshu Yang, Qingfeng Zhuge, and Xianzhang Chen
Subjects: 010302 applied physics, File system, Hardware_MEMORYSTRUCTURES, Data consistency, Write amplification, Computer science, Translation lookaside buffer, Linux kernel, 02 engineering and technology, computer.software_genre, 01 natural sciences, 020202 computer hardware & architecture, Persistence (computer science), Non-volatile memory, Virtual address space, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Operating system, Overhead (computing), computer, Data migration
Abstract: Existing persistent memory file systems can significantly improve the performance by utilizing the advantages of emerging Persistent Memories (PMs). Especially, they can employ superpages (e.g., 2MB a page) of PMs to alleviate the overhead of locating file data and reduce TLB misses. Unfortunately, superpage also induces two critical problems. First, the data consistency of file systems using superpages causes severe write amplification during overwrite of file data. Second, existing management of superpages may lead to large waste of PM space. In this paper, we propose a Virtual Superpage Mechanism (VSM) to solve the problems by taking advantages of virtual address space. On one hand, VSM adopts multi-grained copy-on-write mechanism to reduce the write amplification while ensuring data consistency. On the other hand, VSM presents zero-copy file data migration mechanism to eliminate the loss of space utilization efficiency caused by superpages. We implement the proposed VSM mechanism in Linux kernel based on PMFS. Compared with the original PMFS and NOVA, the experimental results show that VSM improves 36% and 14% on average for write and read performance, respectively. Meanwhile, VSM can achieve the same space utilization efficiency of file system that uses the normal 4KB pages to organize files.
Published: 2020
Full Text: View/download PDF

30. DWARM: A wear-aware memory management scheme for in-memory file systems

Author: Lin Wu, Linfeng Cheng, Qingfeng Zhuge, Edwin H.-M. Sha, and Xianzhang Chen
Subjects: 010302 applied physics, Scheme (programming language), Hardware_MEMORYSTRUCTURES, Computer Networks and Communications, business.industry, Path (computing), Computer science, Memory bus, 02 engineering and technology, Data structure, 01 natural sciences, 020202 computer hardware & architecture, Memory management, Hardware and Architecture, Embedded system, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), Persistent data structure, business, computer, Software, Dram, computer.programming_language
Abstract: Emerging non-volatile memories (NVMs) promise to revolutionize storage systems by providing fast, persistent data accesses on the memory bus. A hybrid NVM/DRAM architecture that combines faster, volatile DRAM with slightly slower, denser NVM can harness the characteristics of both technologies. In order to take full advantage of NVM, state-of-the-art in-memory file systems are designed to provide high performance and strong consistency guarantees. However, the free space management schemes of existing in-memory file systems can easily cause “hot spots” when updating data structures on NVM, leading to significant skew in the number of writes to each data page. In this paper, we propose the dynamic wear-aware range management (DWARM) scheme, a novel free space management technique for in-memory file systems. This scheme achieves wear-leveling with high performance for allocation/deallocation. The essential idea is to allocate less-written pages for each allocation request. Specifically, this scheme works by associating a write counter with each data page and updating the counters in the file write path. We build an “index” structure to quickly locate the pages that have received fewer writes. The index divides the NVM pages into different subranges according to the write counters. Allocation always starts from the minimal subrange. Also, we propose the Adaptive Wear Range Determination Algorithm to adjust the wear ranges dynamically. To accelerate lookup, we keep the index in DRAM and avoid the overhead of strong consistency by rebuilding the index in case of system failure. Experimental results show that this scheme can provide 4.9× to 158.1× wear-leveling improvement compared to the state-of-the-art memory management schemes. For application workloads, the DWARM strategy can improve the lifetime of NVM by up to 125×, 39×, and 25×, compared with the standard memory management schemes of PMFS, NOVA, and SIMFS, respectively.
Published: 2018
Full Text: View/download PDF
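
Illustrative note (not part of the catalog record): the abstract above describes per-page write counters, an index that groups free pages into subranges by write count, and allocation from the minimal subrange. A minimal sketch of that idea follows, assuming a fixed subrange width; the paper adjusts ranges adaptively, which is not modeled here. The class and method names are hypothetical and do not come from DWARM.

```python
from collections import defaultdict


class WearAwareAllocator:
    """Toy free-page allocator that prefers pages with the fewest recorded writes."""

    def __init__(self, num_pages, range_width=4):
        self.range_width = range_width           # assumed fixed width of each wear subrange
        self.writes = [0] * num_pages            # per-page write counter
        self.free = defaultdict(set)             # subrange index -> free page numbers
        self.free[0] = set(range(num_pages))     # all pages start unwritten and free

    def _subrange(self, page):
        return self.writes[page] // self.range_width

    def allocate(self):
        """Return a free page from the lowest non-empty wear subrange."""
        candidates = [idx for idx, pages in self.free.items() if pages]
        if not candidates:
            raise MemoryError("no free pages")
        pages = self.free[min(candidates)]
        page = min(pages)                        # deterministic pick within the subrange
        pages.remove(page)
        return page

    def record_write(self, page, count=1):
        """Update the write counter, as would happen in the file write path."""
        self.writes[page] += count

    def release(self, page):
        """Return a page to the free index under its current wear subrange."""
        self.free[self._subrange(page)].add(page)


allocator = WearAwareAllocator(num_pages=3, range_width=2)
first = allocator.allocate()        # page 0: every page is equally unworn
allocator.record_write(first, 5)    # page 0 becomes the most-worn page
allocator.release(first)
second = allocator.allocate()       # page 1: a less-written page is preferred over page 0
```

Keeping the index as an in-memory dictionary mirrors the abstract's point that the index can live in DRAM and be rebuilt from the counters after a failure; the rebuild step itself is omitted from this sketch.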

31. Heterogeneous FPGA-Based Cost-Optimal Design for Timing-Constrained CNNs

Author: Lei Yang, Qingfeng Zhuge, Jingtong Hu, Edwin H.-M. Sha, Weiwen Jiang, and Xianzhang Chen
Subjects: 010302 applied physics, Optimization problem, Speedup, Cost efficiency, Data parallelism, Computer science, Pipeline (computing), Task parallelism, 02 engineering and technology, 01 natural sciences, Computer Graphics and Computer-Aided Design, 020202 computer hardware & architecture, Dynamic programming, Reduction (complexity), Memory management, Computer engineering, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Electrical and Electronic Engineering, Software
Abstract: Field programmable gate array (FPGA) has been one of the most popular platforms to implement convolutional neural networks (CNNs) due to its high performance and cost efficiency; however, limited by the on-chip resources, the existing single-FPGA architectures cannot fully exploit the parallelism in CNNs. In this paper, we explore heterogeneous FPGA-based designs to effectively leverage both task and data parallelism, such that the resultant system can achieve the minimum cost while satisfying timing constraints. In order to maximize the task parallelism, we investigate two critical problems: 1) buffer placement, where to place buffers to partition CNNs into pipeline stages and 2) task assignment, what type of FPGA to implement different CNN layers. We first formulate the system-level optimization problem with a mixed integer linear programming model. Then, we propose an efficient dynamic programming algorithm to obtain the optimal solutions. On top of that, we devise an efficient algorithm that exploits data parallelism within CNN layers to further improve cost efficiency. Evaluations on well-known CNNs demonstrate that the proposed techniques can obtain an average of 30.82% reduction in system cost under the same timing constraint, and an average of 1.5 times speedup in performance under the same cost budget, compared with the state-of-the-art techniques.
Published: 2018
Full Text: View/download PDF

32. UMFS: An efficient user-space file system for non-volatile memory

Author: Xianzhang Chen, Ting Wu, Lin Wu, Edwin H.-M. Sha, Weiwen Jiang, Zeng Xiaoping, and Qingfeng Zhuge
Subjects: File system, Data consistency, Computer science, 020206 networking & telecommunications, 02 engineering and technology, computer.software_genre, 020202 computer hardware & architecture, Non-volatile memory, File size, Kernel (image processing), Virtual address space, Hardware and Architecture, Journaling file system, 0202 electrical engineering, electronic engineering, information engineering, Operating system, User space, computer, Software
Abstract: Emerging non-volatile memory (NVM) is expected to be a mainstream storage medium in embedded systems for its low power consumption, near-DRAM speed, high density, and byte-addressability. In-memory file systems are proposed to achieve high-performance file accesses by storing files in NVM. Existing in-memory file systems, such as NOVA and EXT4-DAX, operate in kernel space and incur additional overhead caused by kernel layers and mode changes. In this paper, we propose a new design, the User-space in-Memory File System (UMFS), to improve file access speed by minimizing kernel overhead. We implement UMFS in a Linux system to verify the proposed design. In the open operation, UMFS exposes a file to user space in constant time, independent of the file size. Then, UMFS can achieve high-performance file accesses by taking advantage of the user virtual address space and existing address translation hardware in processors. We also propose an efficient user-space journaling scheme to ensure data consistency while minimizing kernel cost. Extensive experiments are conducted on standard benchmarks to compare UMFS with NOVA, EXT4-DAX, and SIMFS, the state-of-the-art in-memory file systems. The experimental results show that UMFS outperforms all of these existing in-memory file systems.
Published: 2018
Full Text: View/download PDF

33. Towards the Design of Efficient and Consistent Index Structure with Minimal Write Activities for Non-Volatile Memory

Author: Runyu Zhang, Xianzhang Chen, Zhulin Ma, Weiwen Jiang, Edwin H.-M. Sha, Hailiang Dong, and Qingfeng Zhuge
Subjects: 010302 applied physics, Speedup, CPU cache, Computer science, Search engine indexing, 02 engineering and technology, Linked list, Parallel computing, Data structure, 01 natural sciences, 020202 computer hardware & architecture, Theoretical Computer Science, Database index, Tree (data structure), Tree structure, Computational Theory and Mathematics, Data retrieval, Hardware and Architecture, Search algorithm, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Software
Abstract: Index structures can significantly accelerate the data retrieval operations in data intensive systems, such as databases. Tree structures, such as the B+-tree and the like, are commonly employed as index structures; however, we found that the tree structure may not be appropriate for Non-Volatile Memory (NVM) in terms of the requirements for high performance and high endurance. This paper studies what the best index structure for NVM-based systems is and how to design such index structures. The design of an NVM-friendly index structure faces a lot of challenges. First, in order to prolong the lifetime of NVM, the write activities on NVM should be minimized. To this end, the index structure should be as simple as possible. The index proposed in this paper is based on the simplest data structure, i.e., the linked list. Second, the simple structure makes it challenging to achieve high-performance data retrieval operations. To overcome this challenge, we design a novel technique that explicitly builds up a contiguous virtual address space on the linked list, such that efficient search algorithms can be performed. Third, we need to carefully consider data consistency issues in NVM-based systems, because the order of memory writes may be changed and the data content in NVM may be inconsistent due to write-back effects of the CPU cache. This paper devises a novel indexing scheme, called “Virtual Linear Addressable Buckets” (VLAB). We implement VLAB in a storage engine and plug it into MySQL. Evaluations are conducted on an NVDIMM workstation using YCSB workloads and real-world traces. Results show that write activities of the state-of-the-art indexes are 6.98 times more than ours; meanwhile, VLAB achieves 2.53 times speedup.
Published: 2018
Full Text: View/download PDF

34. A machine learning assisted data placement mechanism for hybrid storage systems

Author: Duo Liu, Jinting Ren, Xianzhang Chen, Moming Duan, Liang Liang, Yujuan Tan, and Ruolan Li
Subjects: Hybrid storage system, business.industry, Computer science, Machine learning, computer.software_genre, Mechanism (engineering), File size, Data access, Hardware and Architecture, Data_FILES, Key (cryptography), Hybrid storage, Artificial intelligence, business, computer, Software, Data placement
Abstract: Emerging applications produce massive files that show different properties in file size, lifetime, and read/write frequency. Existing hybrid storage systems place these files onto different storage mediums assuming that the access patterns of files are fixed. However, we find that the access patterns of files change during their lifetime. The key to improving file access performance is to adaptively place the files on the hybrid storage system using the run-time status and the properties of both the files and the storage systems. In this paper, we propose a machine learning assisted data placement mechanism that adaptively places files onto the proper storage medium by predicting the access patterns of files. We design a PMFS-based tracer to collect file access features for prediction and show how this approach adapts to changing access patterns. Based on the data access prediction results, we present a linear data placement algorithm to optimize the data access performance on the hybrid storage mediums. Extensive experimental results show that the proposed learning algorithm can achieve over 90% accuracy for predicting file access patterns. Meanwhile, the proposed mechanism can achieve over 17% improvement of system performance for file accesses compared with the state-of-the-art linear-time data placement methods.
Published: 2021
Full Text: View/download PDF

35. MobileRE: A replicas prioritized hybrid fault tolerance strategy for mobile distributed system

Author: Yujuan Tan, Jinting Ren, Yu Wu, Duo Liu, Ziling Zhang, Renping Liu, and Xianzhang Chen
Subjects: Dynamic network analysis, Hardware and Architecture, Computer science, Replica, Distributed computing, Failure probability, Bandwidth (computing), Data reliability, Fault tolerance, Erasure code, Software, Reliability (statistics)
Abstract: Fault tolerance techniques are of vital importance for ensuring data reliability in mobile distributed systems. In mobile environments, nodes suffer from high failure probability and fluctuating bandwidth. Thus, traditional fault tolerance techniques are no longer suitable. In this paper, we present a replica prioritized hybrid fault tolerance strategy combining erasure codes and replicas for mobile distributed systems (MobileRE), to guarantee data reliability under dynamic network status. In MobileRE, we first formulate a reliability cost rate to indicate the cost of ensuring data reliability of the mobile system. MobileRE further adaptively applies the erasure code and replica algorithms based on real-time network bandwidth status to minimize the system reliability cost rate. MobileRE also obtains the optimal reliability cost rate by customizing redundancy configuration parameters. The numerical and simulation results verify the effectiveness of the proposed schemes, and show that, compared with traditional designs that only adopt erasure codes or replicas, MobileRE can significantly reduce the system reliability cost rate.
Published: 2021
Full Text: View/download PDF

36. Refinery swap: An efficient swap mechanism for hybrid DRAM–NVM systems

Author: Qingfeng Zhuge, Ting Wu, Weiwen Jiang, Xianzhang Chen, Edwin H.-M. Sha, and Chaoshu Yang
Subjects: Hardware_MEMORYSTRUCTURES, Computer Networks and Communications, Computer science, 020206 networking & telecommunications, 02 engineering and technology, computer.software_genre, Refinery, 020202 computer hardware & architecture, Non-volatile memory, Hardware and Architecture, 0202 electrical engineering, electronic engineering, information engineering, Operating system, computer, Swap (computer programming), Software, Dram
Abstract: Emerging Non-Volatile Memory (NVM) technologies have shown great promise for enabling high-performance swapping mechanisms in embedded systems. Most existing swap mechanisms have limited performance because they lack knowledge of memory accesses, and they cause large overhead of swap operations by entirely avoiding direct writes to NVM. In this paper, we identify the feature of “write count disparity”, i.e., most pages are rarely written and most writes are concentrated on a few pages. With these observations in mind, this paper rethinks and re-designs the swap mechanism to reduce the number of swap operations and writes to NVM in hybrid DRAM–NVM systems by tolerating small writes to NVM. A new swap mechanism, Refinery Swap, is proposed with a (1 + e)-competitive algorithm for swap-in operations, a multilevel priority algorithm for selecting the victim pages of swap-out operations, and a swap-based wear-leveling algorithm for NVM. Extensive experiments are conducted with standard benchmarks. Compared with Dr.Swap, the state-of-the-art swap mechanism, Refinery Swap reduces more than 90% of swap operations and writes to NVM. Refinery Swap achieves encouraging improvements over existing swap mechanisms in the aspects of performance, energy consumption, and the lifetime of NVM.
Published: 2017
Full Text: View/download PDF

37. Optimal Functional-Unit Assignment for Heterogeneous Systems Under Timing Constraint

Author: Edwin H.-M. Sha, Qingfeng Zhuge, Xianzhang Chen, Lei Zhou, Weiwen Jiang, and Lei Yang
Subjects: 020203 distributed computing, Mathematical optimization, Computer science, 02 engineering and technology, Directed acyclic graph, Graph, 020202 computer hardware & architecture, Computational Theory and Mathematics, Hardware and Architecture, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Graph (abstract data type), Algorithm design, Retiming, Algorithm, Time complexity, Integer programming
Abstract: High-level synthesis for real-time systems typically employs heterogeneous functional-unit types to achieve high-performance and low-cost designs. In the design phase, it is critical to determine which functional-unit type each operation in a given application should be mapped to, such that the total cost is minimized while the deadline can be met. For a path or tree structured application, existing approaches can obtain the minimum-cost assignment, called “optimal assignment”, under which the resultant system satisfies a given timing constraint. However, it is still an open question whether there exist efficient algorithms to obtain the optimal assignment for the directed acyclic graph (DAG), or more generally, the data-flow graph with cycles (cyclic DFG). For DAGs, by analyzing the property of the problem, this paper designs an efficient algorithm to obtain the optimal assignments. For cyclic DFGs, we approach this problem in combination with the retiming technique to thoroughly explore the design space. We formulate a Mixed Integer Linear Programming (MILP) model to give the optimal solution. Because of its high time complexity, however, we also devise a practical algorithm to obtain near-optimal solutions within a minute. Experimental results show the effectiveness of our algorithms. Specifically, compared with existing techniques, we can achieve 25.70 and 30.23 percent reductions in total cost on DAGs and cyclic DFGs, respectively.
Published: 2017
Full Text: View/download PDF
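
The path-structured case mentioned in the abstract above can be illustrated with a small dynamic program. This is only a sketch under stated assumptions, not the paper's algorithm: the function name `min_cost_path_assignment`, the `(time, cost)` option tuples, and integer execution times are all hypothetical choices made for illustration.

```python
from math import inf

def min_cost_path_assignment(options, deadline):
    """Pick one functional-unit type per operation on a path.

    options[i] is a list of (time, cost) pairs, one per candidate type for
    operation i (hypothetical input format; times are non-negative integers).
    Returns (min_cost, chosen_type_indices), or (inf, None) if no assignment
    finishes within the deadline.
    """
    n = len(options)
    # cost[i][t]: cheapest total cost of the first i operations within t time units
    cost = [[inf] * (deadline + 1) for _ in range(n + 1)]
    pick = [[None] * (deadline + 1) for _ in range(n + 1)]
    cost[0] = [0] * (deadline + 1)
    for i, opts in enumerate(options, start=1):
        for t in range(deadline + 1):
            for k, (time, c) in enumerate(opts):
                if time <= t and cost[i - 1][t - time] + c < cost[i][t]:
                    cost[i][t] = cost[i - 1][t - time] + c
                    pick[i][t] = k
    if cost[n][deadline] == inf:
        return inf, None
    # Walk back through the table to recover the chosen type per operation.
    chosen, t = [], deadline
    for i in range(n, 0, -1):
        k = pick[i][t]
        chosen.append(k)
        t -= options[i - 1][k][0]
    return cost[n][deadline], chosen[::-1]

# Three operations, each with a fast/expensive and a slow/cheap unit type.
example = [[(1, 10), (3, 2)], [(2, 8), (4, 3)], [(1, 6), (2, 1)]]
best_cost, best_choice = min_cost_path_assignment(example, deadline=6)
```

The table has one row per operation and one column per time budget, so the running time is proportional to the number of operations times the deadline times the number of unit types. DAGs and cyclic DFGs, which the paper targets, need more than this simple recurrence.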

38. Optimal functional unit assignment and voltage selection for pipelined MPSoC with guaranteed probability on time performance

Author: Edwin H.-M. Sha, Weiwen Jiang, Hailiang Dong, Xianzhang Chen, and Qingfeng Zhuge
Subjects: 020203 distributed computing, Computer science, Multiprocessing, Parallel computing, 02 engineering and technology, MPSoC, Computer Graphics and Computer-Aided Design, 020202 computer hardware & architecture, Set (abstract data type), 0202 electrical engineering, electronic engineering, information engineering, On-time performance, Throughput (business), Software, Efficient energy use
Abstract: Pipelined heterogeneous multiprocessor system-on-chip (MPSoC) can provide high throughput for streaming applications. In the design of such systems, time performance and system cost are the most concerning issues. By analyzing runtime behaviors of benchmarks in real-world platforms, we find that execution times of tasks are not fixed but spread with probabilities. In terms of this feature, we model execution times of tasks as random variables. In this paper, we study how to design high-performance and low-cost MPSoC systems to execute a set of such tasks with data dependencies in a pipelined fashion. Our objective is to obtain the optimal functional unit assignment and voltage selection for the pipelined MPSoC systems, such that the system cost is minimized while timing constraints can be met with a given guaranteed probability. For each required probability, our proposed algorithm can efficiently obtain the optimal solution. Experiments show that other existing algorithms cannot find feasible solutions in most cases, but ours can. Even for those solutions that other algorithms can obtain, ours can reach 30% reductions in total cost compared with others.
Published: 2017
Full Text: View/download PDF

39. BOSS: An Efficient Data Distribution Strategy for Object Storage Systems With Hybrid Devices

Author: Edwin H.-M. Sha, Linfeng Cheng, Qingfeng Zhuge, Xianzhang Chen, and Lin Wu
Subjects: General Computer Science, Computer science, Reliability (computer networking), Distributed computing, 02 engineering and technology, computer.software_genre, object storage, 0202 electrical engineering, electronic engineering, information engineering, hybrid storage systems, General Materials Science, 020203 distributed computing, Hardware_MEMORYSTRUCTURES, Database, Enterprise storage, General Engineering, 020206 networking & telecommunications, Enterprise data management, Object storage, Metadata, Data access, Ceph, Boss, lcsh:Electrical engineering. Electronics. Nuclear engineering, data distribution, computer, lcsh:TK1-9971
Abstract: Hybrid object storage systems provide opportunities to achieve high performance and energy efficiency with low cost for enterprise data centers. Existing object storage systems, however, distribute data objects in the system without considering the heterogeneity of the underlying devices and the asymmetric data access patterns. Therefore, the system performance and energy efficiency may degrade as data are placed on improper storage devices. For example, energy-efficient high-density archive hard disk drives (archive HDDs) are significantly slower than normal HDDs and solid state disks (SSDs), which means that archive HDDs are not appropriate for storing frequently accessed objects. Besides, flash-based SSDs have limited write endurance, which makes SSDs ill-suited to storing write-intensive objects. In this paper, we analyze various real enterprise workloads and find that read and write requests are not uniformly distributed across data objects. Based on these observations, we propose a novel strategy, the biased object storage strategy (BOSS), to reduce writes to SSDs and improve system performance for hybrid object storage systems. Different from conventional uniform and fixed data distribution strategies, BOSS can distribute and migrate data objects to various types of devices dynamically, according to the data access patterns collected online. The experimental results show that BOSS can reduce writes on SSDs by 64% and improve system performance by 29.51% on average, while maintaining a high level of load balance.
Published: 2017

40. Astraea: Self-Balancing Federated Learning for Improving Classification Accuracy of Mobile Deep Learning Applications

Author: Xianzhang Chen, Moming Duan, Jinting Ren, Duo Liu, Yujuan Tan, Liang Liang, and Lei Qiao
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer science, Machine Learning (stat.ML), 02 engineering and technology, 010501 environmental sciences, Machine learning, computer.software_genre, 01 natural sciences, Federated learning, Machine Learning (cs.LG), Statistics - Machine Learning, 0202 electrical engineering, electronic engineering, information engineering, Divergence (statistics), Edge computing, 0105 earth and related environmental sciences, Training set, Artificial neural network, biology, business.industry, Deep learning, biology.organism_classification, Astraea, Computer Science - Distributed, Parallel, and Cluster Computing, 020201 artificial intelligence & image processing, Distributed, Parallel, and Cluster Computing (cs.DC), Artificial intelligence, Internet of Things, business, computer
Abstract: Federated learning (FL) is a distributed deep learning method which enables multiple participants, such as mobile phones and IoT devices, to contribute to a neural network model while their private training data remains on local devices. This distributed approach is promising in edge computing systems, which have a large corpus of decentralized data and require high privacy. However, unlike common training datasets, the data distribution of the edge computing system is imbalanced, which introduces biases in model training and decreases the accuracy of federated learning applications. In this paper, we demonstrate that imbalanced distributed training data causes accuracy degradation in FL. To counter this problem, we build a self-balancing federated learning framework called Astraea, which alleviates the imbalances by 1) global data distribution based data augmentation, and 2) mediator based multi-client rescheduling. The proposed framework relieves global imbalance by runtime data augmentation, and for averaging the local imbalance, it creates the mediator to reschedule the training of clients based on the Kullback-Leibler divergence (KLD) of their data distribution. Compared with FedAvg, the state-of-the-art FL algorithm, Astraea shows +5.59% and +5.89% improvement of top-1 accuracy on the imbalanced EMNIST and imbalanced CINIC-10 datasets, respectively. Meanwhile, the communication traffic of Astraea can be 82% lower than that of FedAvg. Comment: Published as a conference paper at IEEE 37th International Conference on Computer Design (ICCD) 2019
Published: 2019
Full Text: View/download PDF
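
The KLD-based rescheduling described in the abstract above can be pictured with a short sketch. The greedy rule used here (add the client whose labels bring the mediator's merged distribution closest to uniform) is an assumption made for illustration, as are the names `kl_divergence`, `label_distribution`, and `pick_next_client`; the abstract itself only states that rescheduling is based on the KLD of the clients' data distributions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def label_distribution(counts):
    """Turn per-class sample counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def pick_next_client(mediator_counts, client_counts_list):
    """Return (index, kld) of the client that best balances the mediator.

    Assumed greedy rule: merge each candidate's class counts with the
    mediator's and keep the candidate whose merged distribution has the
    smallest KL divergence from the uniform distribution.
    """
    n_classes = len(mediator_counts)
    uniform = [1.0 / n_classes] * n_classes
    best, best_kld = None, math.inf
    for idx, counts in enumerate(client_counts_list):
        merged = [m + c for m, c in zip(mediator_counts, counts)]
        kld = kl_divergence(label_distribution(merged), uniform)
        if kld < best_kld:
            best, best_kld = idx, kld
    return best, best_kld

# A mediator skewed toward class 0 and three candidate clients (toy counts).
chosen, chosen_kld = pick_next_client([90, 10], [[50, 0], [0, 80], [10, 10]])
```

In this toy case the second client holds only class-1 samples, so merging it with the skewed mediator yields a perfectly balanced 90/90 split.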

41. Archivist: A Machine Learning Assisted Data Placement Mechanism for Hybrid Storage Systems

Author: Yujuan Tan, Moming Duan, Jinting Ren, Xianzhang Chen, Lei Qiao, Duo Liu, and Liang Liang
Subjects: business.industry, Group method of data handling, Computer science, 020206 networking & telecommunications, Cloud computing, 010103 numerical & computational mathematics, 02 engineering and technology, Supercomputer, Machine learning, computer.software_genre, 01 natural sciences, Archivist, Computer data storage, 0202 electrical engineering, electronic engineering, information engineering, Hybrid storage, Artificial intelligence, 0101 mathematics, business, computer, Data placement
Abstract: With the rapid growth of edge-cloud computing, emerging applications place higher performance demands on the storage system for storing massive data generated from various sources. The multi-sourced data show different properties in size, retention time, and read/write frequency. Hybrid storage systems promise to efficiently handle the data in edge-cloud computing environments while satisfying different data demands. The key problem is how to place the data on the hybrid storage system according to the run-time status and the properties of both the data and the storage systems. In this paper, we propose Archivist, a machine learning assisted data placement mechanism for hybrid storage systems that reduces file access latency. We first design a machine learning based approach for predicting the access patterns of the incoming data. Then, we present a data placement algorithm to optimize the placement of data on the hybrid storage mediums by matching the properties of the data with the features of the storage mediums. Extensive experimental results show that Archivist can achieve up to 49% improvement of system performance for file accesses compared with the baseline.
Published: 2019
Full Text: View/download PDF

42. Power-Aware Virtual Machine Placement for Mobile Edge Computing

Author: Duo Liu, Yuxin Sun, Yujuan Tan, and Xianzhang Chen
Subjects: 020203 distributed computing, Data processing, Mobile edge computing, Edge device, Computer science, business.industry, Cloud computing, 02 engineering and technology, Energy consumption, computer.software_genre, 020202 computer hardware & architecture, Virtual machine, Server, 0202 electrical engineering, electronic engineering, information engineering, Enhanced Data Rates for GSM Evolution, business, computer, Computer network
Abstract: Mobile Edge Computing (MEC) provides an attractive platform to bring data processing closer to its source in a networked environment. The responsibility of the MEC layer is to effectively handle the jobs offloaded by edge devices located in the edge layer, and the jobs are served by virtual machines on the MEC servers. In this paper, we propose a power-efficient job placement approach for Mobile Edge Computing, which aims to minimize the number of required active MEC servers and reduce power consumption. The experimental results show that the proposed algorithm can significantly reduce power consumption.
Published: 2019
Full Text: View/download PDF

43. A Wear-Leveling-Aware Fine-Grained Allocator for Non-Volatile Memory

Author: Chun Jason Xue, Chaoshu Yang, Edwin H.-M. Sha, Shouzhen Gu, Xianzhang Chen, Zhuge Qingfeng, and Qiang Sun
Subjects: 010302 applied physics, Computer science, Linux kernel, 02 engineering and technology, computer.software_genre, 01 natural sciences, 020202 computer hardware & architecture, Non-volatile memory, Allocator, Memory management, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Operating system, computer, Wear leveling
Abstract: Emerging non-volatile memories (NVMs) are promising candidates for main memory because of their advanced characteristics. However, the low endurance of NVM cells makes them vulnerable to frequent fine-grained updates. This paper proposes a Wear-leveling Aware Fine-grained Allocator (WAFA) for NVM. WAFA divides pages into basic memory units to support fine-grained updates. WAFA allocates the basic memory units of a page in a rotational manner to distribute fine-grained updates evenly across memory cells. The fragmented basic memory units of each page caused by memory allocation and deallocation operations are reorganized by a reform operation. We implement WAFA in Linux kernel 4.4.4. Experimental results show that WAFA can reduce the total writes of pages by 81.1% and 40.1% over NVMalloc and nvm_alloc, the state-of-the-art wear-conscious allocators for NVM, respectively. Meanwhile, WAFA shows 48.6% and 42.3% performance improvement over NVMalloc and nvm_alloc, respectively.
Published: 2019
Full Text: View/download PDF
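
The idea of allocating a page's basic memory units "in a rotational manner", as the abstract above puts it, can be sketched as a cursor that keeps advancing instead of restarting from the first free unit. This is a toy model under assumptions, not WAFA's kernel implementation: the class `RotatingPageAllocator`, its methods, and the per-unit counters are hypothetical names introduced for illustration.

```python
class RotatingPageAllocator:
    """Toy model of one page split into equally sized basic units."""

    def __init__(self, units_per_page):
        self.free = [True] * units_per_page
        self.cursor = 0  # next unit to try; never reset to 0 on free
        self.alloc_counts = [0] * units_per_page  # rough proxy for wear

    def alloc(self):
        """Return the index of the next free unit after the cursor, or None."""
        n = len(self.free)
        for step in range(n):
            unit = (self.cursor + step) % n
            if self.free[unit]:
                self.free[unit] = False
                self.alloc_counts[unit] += 1
                self.cursor = (unit + 1) % n
                return unit
        return None  # page is full

    def free_unit(self, unit):
        """Release a unit without moving the cursor back to it."""
        self.free[unit] = True

# Repeated allocate/free cycles: a first-fit allocator would hit unit 0
# every time, while the rotating cursor spreads the allocations out.
page = RotatingPageAllocator(units_per_page=4)
for _ in range(8):
    page.free_unit(page.alloc())
```

After the eight cycles each of the four units has been handed out twice, which is the even spread that rotation is meant to produce in this simplified setting.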

44. Tumbler

Author: Yujuan Tan, Hyung Gyu Lee, Liang Liang, Yu Wu, Yue Xu, Xianzhang Chen, Lei Qiao, and Duo Liu
Subjects: Computer science, business.industry, Quality of service, Real-time computing, Reinforcement learning, Solar energy, business, Energy source, Energy harvesting, Efficient energy use, Scheduling (computing)
Abstract: Energy harvesting technology has been widely adopted in embedded systems. However, an unstable energy source results in unsteady operation. In this paper, we devise a long-term energy-efficient task scheduling method targeting solar-powered sensor nodes. The proposed method combines reinforcement learning with a solar energy prediction method to maximize energy efficiency, which in turn enhances the long-term quality of service (QoS) of the sensor nodes. Experimental results show that the proposed scheduling improves the energy efficiency by 6.0% on average and improves the QoS level by 54.0%, compared with a state-of-the-art task scheduling algorithm.
Published: 2019
Full Text: View/download PDF

45. Optimizing the Data Transmission Scheme for Edge-Based Automatic Driving

Author: Duo Liu, Yujuan Tan, Chenwei Wang, and Xianzhang Chen
Subjects: 0209 industrial biotechnology, Mobile edge computing, Linear programming, Computer science, Real-time computing, 02 engineering and technology, Scheduling (computing), Upload, 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Greedy algorithm, 5G, Edge computing, Data transmission
Abstract: With the development of 5G communication and mobile edge computing, edge-based automatic driving is becoming a promising solution for relieving the computing loads and improving the scheduling of autonomous vehicles. In the edge-based automatic driving scenario, the vehicles need to upload many types of sensor data to the edge node, which may cause large latency and endanger the vehicles. In this paper, we define a Vehicle-to-edge Data Transmission (VDT) problem for sensor data transmission between the vehicle and the edge node in edge-based automatic driving, considering the requirements on data accuracy and real-time transmission. To solve the VDT problem optimally, we construct a Mixed-Integer Linear Programming (MILP) formulation. Furthermore, we present the Deviation-Detection (DD) algorithm and a Greedy algorithm to efficiently obtain near-optimal solutions of the VDT problem. We evaluate the proposed algorithms on a set of simulated automatic driving data. The experimental results show that the proposed Greedy algorithm can reduce communication cost by 5%∼13% over the instinct DD algorithm.
Published: 2019
Full Text: View/download PDF

46. CDAC: Content-Driven Deduplication-Aware Storage Cache

Author: Min Fu, Zhichao Yan, Wen Xia, Hong Jiang, Duo Liu, Yajun Zhao, Congcong Xu, Xianzhang Chen, Yujuan Tan, and Jing Xie
Subjects: Hardware_MEMORYSTRUCTURES, CPU cache, Computer science, Working set size, 020206 networking & telecommunications, 02 engineering and technology, Parallel computing, 020202 computer hardware & architecture, Backup, Data_FILES, 0202 electrical engineering, electronic engineering, information engineering, Redundancy (engineering), Data deduplication, Cache, Cache algorithms, Block size
Abstract: Data deduplication, as a proven technology for effective data reduction in backup and archive storage systems, also demonstrates the promise in increasing the logical space capacity of storage caches by removing redundant data. However, our in-depth evaluation of the existing deduplication-aware caching algorithms reveals that they do improve the hit ratios compared to the caching algorithms without deduplication, especially when the cache block size is set to 4 KB. But when the block size is larger than 4 KB, a clear trend for modern storage systems, their hit ratios are significantly reduced. A slight increase in hit ratios due to deduplication may not be able to improve the overall storage performance because of the high overhead created by deduplication. To address this problem, in this paper we propose CDAC, a Content-driven Deduplication-Aware Cache, which focuses on exploiting the blocks' content redundancy and their intensity of content sharing among source addresses in cache management strategies. We have implemented CDAC based on LRU and ARC algorithms, called CDAC-LRU and CDAC-ARC respectively. Our extensive experimental results show that CDAC-LRU and CDAC-ARC outperform the state-of-the-art deduplication-aware caching algorithms, D-LRU and DARC, by up to 19.49X in read cache hit ratio, with an average of 1.95X under real-world traces when the cache size ranges from 20% to 80% of the working set size and the block size ranges from 4 KB to 64 KB.
Published: 2019
Full Text: View/download PDF

47. UIMigrate: Adaptive Data Migration for Hybrid Non-Volatile Memory Systems

Author: Baiping Wang, Duo Liu, Zhichao Yan, Xianzhang Chen, Yujuan Tan, and Qiuwei Deng
Subjects: 010302 applied physics, Random access memory, Hardware_MEMORYSTRUCTURES, Computer science, business.industry, Memory bus, 02 engineering and technology, 01 natural sciences, Flash memory, 020202 computer hardware & architecture, Non-volatile memory, Embedded system, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Key (cryptography), Non-volatile random-access memory, business, Dram, Data migration
Abstract: Byte-addressable, non-volatile memory (NVRAM) combines the benefits of DRAM and flash memory. Its slower speed compared to DRAM, however, makes it hard to entirely replace DRAM with NVRAM. Hybrid NVRAM systems that place both DRAM and NVRAM on the memory bus become a better solution: frequently accessed, hot pages can be stored in DRAM while other cold pages can reside in NVRAM. This way, the system gets the benefits of both high performance (from DRAM) and lower power consumption and cost/performance (from NVRAM). Realizing an efficient hybrid NVRAM system requires careful page migration and accurate data temperature measurement. Existing solutions, however, often cause invalid migrations due to inaccurate data temperature accounting, because hot and cold pages are separately identified in DRAM and NVRAM regions. Based on this observation, we propose UIMigrate, an adaptive data migration approach for hybrid NVRAM systems. The key idea is to consider data temperature across the whole DRAM-NVRAM space when determining whether a page should be migrated between DRAM and NVRAM. In addition, UIMigrate adapts to workload changes by dynamically adjusting migration decisions. Our experiments using SPEC 2006 show that UIMigrate can reduce the number of migrations and improve performance by up to 90.4% compared to existing state-of-the-art approaches.
Published: 2019
Full Text: View/download PDF

48. Reducing Write Amplification for Inodes of Journaling File System using Persistent Memory

Author: Yujuan Tan, Moming Duan, Duo Liu, Chaoshu Yang, Wenbin Wang, Runyu Zhang, and Xianzhang Chen
Subjects: Phase-change memory, Data consistency, Write amplification, Computer science, Journaling file system, ext4, Device file, Linux kernel, Parallel computing, inode
Abstract: Conventional journaling file systems, such as Ext4, guarantee data consistency by writing in-memory dirty inodes to block devices twice. The write-back of inodes may contain up to 80% clean inodes that need not be written back, which causes a severe write amplification problem and largely reduces performance, since the size of an inode is several times smaller than the basic unit for updating the block device. Emerging persistent memories (PMs), such as phase change memory, make it possible to store the offsets of inodes in memory persistently. In this paper, we propose an efficient scheme, Updating Frequency based Inode Aggregation (UFIA), to reduce the write amplification of dirty inodes using PM. The main idea of UFIA is to identify the frequently updated inodes and reorganize them in adjacent physical locations on the block device. First, UFIA adopts PM as an inode mapping table for remapping logical inodes to any physical inodes. Second, we design an efficient algorithm for UFIA to identify and reorganize the frequently updated inodes. We implement UFIA and integrate it into Ext4 (denoted by UFIA-Ext4) in Linux kernel 4.4.4. The experiments are conducted with the widely used benchmark Filebench. Compared with the original Ext4, the experimental results show that UFIA significantly reduces the write amplification of inodes and improves performance by 54% on average.
Published: 2019
Full Text: View/download PDF

49. Efficient Data Placement for Improving Data Access Performance on Domain-Wall Memory

Author: Edwin H.-M. Sha, Qingfeng Zhuge, Chun Jason Xue, Xianzhang Chen, Weiwen Jiang, and Wang Yuangang
Subjects: 010302 applied physics, Computer science, Locality, 02 engineering and technology, Parallel computing, 01 natural sciences, 020202 computer hardware & architecture, Data access, Hardware and Architecture, 0103 physical sciences, Hardware_INTEGRATEDCIRCUITS, 0202 electrical engineering, electronic engineering, information engineering, Algorithm design, Electrical and Electronic Engineering, Integer programming, Software
Abstract: Domain-wall memory (DWM) is becoming an attractive candidate to replace traditional memories for its high density, low leakage power, and low access latency. Accessing data on DWM is accomplished by shift operations that move data located on nanowires to read/write ports. Due to this kind of construction, data accesses on DWM exhibit varying access latencies. Therefore, the data placement (DP) strategy has a significant impact on the performance of data accesses on DWM. In this paper, we prove the nondeterministic polynomial time (NP)-completeness of the DP problem on DWM. For DWMs organized in a single DWM block cluster (DBC), we present integer linear programming formulations to solve the problem optimally. We also propose an efficient single DBC placement (S-DBC-P) algorithm to exploit the benefits of multiple read/write ports and data locality. Compared with the sequential DP strategy, S-DBC-P reduces shift operations by 76.9% on average for eight-port DWMs. Furthermore, for the DP problem on DWMs organized in multiple DBCs, we develop an efficient multiple DBC placement (M-DBC-P) algorithm to utilize the parallelism of DBCs. The experimental results show that M-DBC-P achieves 90% performance improvement over the sequential DP strategy.
Published: 2016
Full Text: View/download PDF
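
To see why data placement matters for the shift operations described in the abstract above, the following sketch counts shifts for a given placement and access sequence. It relies on a deliberately simplified model that is an assumption here, not the paper's: a single nanowire with one read/write port, where the wire stays wherever the previous access left it. The function name `shift_count` and its parameters are hypothetical.

```python
def shift_count(placement, accesses, port=0):
    """Count shift operations under a simplified single-port model.

    placement: data names in their order along the nanowire.
    accesses:  the sequence of names being read or written.
    port:      index of the domain initially aligned with the port.
    """
    position = {name: i for i, name in enumerate(placement)}
    aligned = port  # domain currently aligned with the port
    shifts = 0
    for name in accesses:
        # Shifting the wire by |distance| domains brings the item to the port.
        shifts += abs(position[name] - aligned)
        aligned = position[name]
    return shifts

trace = ["a", "c", "a", "c", "b"]
sequential = shift_count(["a", "b", "c"], trace)  # items stored in name order
locality = shift_count(["a", "c", "b"], trace)    # frequently paired items adjacent
```

Placing `a` and `c` next to each other lowers the count for this trace, which is the kind of locality a placement algorithm can exploit. Multi-port and multi-DBC configurations, which the paper addresses, need a richer cost model than this one.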

50. A New Design of In-Memory File System Based on File Virtual Address Framework

Author: Xianzhang Chen, Edwin H.-M. Sha, Liang Shi, Weiwen Jiang, and Qingfeng Zhuge
Subjects: Computer science, Stub file, 02 engineering and technology, computer.software_genre, Theoretical Computer Science, Persistence (computer science), Design rule for Camera File system, Data_FILES, 0202 electrical engineering, electronic engineering, information engineering, Versioning file system, SSH File Transfer Protocol, File system fragmentation, Flash file system, File system, Random access memory, Address space, Computer file, Device file, 020206 networking & telecommunications, computer.file_format, Everything is a file, Unix file types, Virtual file system, 020202 computer hardware & architecture, Torrent file, Memory-mapped file, File Control Block, Self-certifying File System, Computational Theory and Mathematics, Virtual address space, Hardware and Architecture, Journaling file system, Operating system, computer, Software
Abstract: The emerging technologies of persistent memory, such as PCM and MRAM, provide opportunities for preserving files in memory. Traditional file system structures may need to be re-studied. Even though several file systems have been proposed for memory, most of them have limited performance because they do not fully utilize the hardware at the processor side. This paper presents a framework based on a new concept, “File Virtual Address Space”. A file system, Sustainable In-Memory File System (SIMFS), is designed and implemented, which fully utilizes the memory mapping hardware on the file access path. First, SIMFS embeds the address space of an open file into the process’ address space. Then, file accesses are handled by the memory mapping hardware. Several optimization approaches are also presented for the proposed SIMFS. Extensive experiments are conducted. The experimental results show that the throughput of SIMFS achieves significant performance improvement over the state-of-the-art in-memory file systems.
Published: 2016
Full Text: View/download PDF
