233 results for "Xianzhang Chen"
Search Results
102. 面向内存文件系统的数据一致性更新机制研究 (Research on Data Consistency for In-memory File Systems).
- Author
-
Zhilong Sun, Edwin H.-M. Sha, Qingfeng Zhuge, Xianzhang Chen, and Kaijie Wu 0001
- Published
- 2017
- Full Text
- View/download PDF
103. Optimal Functional-Unit Assignment for Heterogeneous Systems Under Timing Constraint.
- Author
-
Weiwen Jiang, Edwin Hsing-Mean Sha, Xianzhang Chen, Lei Yang 0018, Lei Zhou, and Qingfeng Zhuge
- Published
- 2017
- Full Text
- View/download PDF
104. Refinery swap: An efficient swap mechanism for hybrid DRAM-NVM systems.
- Author
-
Xianzhang Chen, Edwin Hsing-Mean Sha, Weiwen Jiang, Chaoshu Yang, Ting Wu 0012, and Qingfeng Zhuge
- Published
- 2017
- Full Text
- View/download PDF
105. On the Design of High-Performance and Energy-Efficient Probabilistic Self-Timed Systems.
- Author
-
Edwin Hsing-Mean Sha, Weiwen Jiang, Qingfeng Zhuge, Lei Yang 0018, and Xianzhang Chen
- Published
- 2015
- Full Text
- View/download PDF
106. Designing an efficient persistent in-memory file system.
- Author
-
Edwin Hsing-Mean Sha, Xianzhang Chen, Qingfeng Zhuge, Liang Shi, and Weiwen Jiang
- Published
- 2015
- Full Text
- View/download PDF
107. Prevent Deadlock and Remove Blocking for Self-Timed Systems.
- Author
-
Edwin Hsing-Mean Sha, Weiwen Jiang, Qingfeng Zhuge, Xianzhang Chen, and Lei Yang 0018
- Published
- 2015
- Full Text
- View/download PDF
108. Optimizing data placement for reducing shift operations on domain wall memories.
- Author
-
Xianzhang Chen, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Penglin Dai, and Weiwen Jiang
- Published
- 2015
- Full Text
- View/download PDF
109. MobileRE: A replicas prioritized hybrid fault tolerance strategy for mobile distributed system.
- Author
-
Yu Wu 0016, Duo Liu, Xianzhang Chen, Jinting Ren, Renping Liu, Yujuan Tan, and Ziling Zhang
- Published
- 2021
- Full Text
- View/download PDF
110. A machine learning assisted data placement mechanism for hybrid storage systems.
- Author
-
Jinting Ren, Xianzhang Chen, Duo Liu, Yujuan Tan, Moming Duan, Ruolan Li, and Liang Liang 0002
- Published
- 2021
- Full Text
- View/download PDF
111. 连接操作在SIMFS和EXT4上的性能比较 (Performance Comparison of Join Operations on SIMFS and EXT4).
- Author
-
Liwei Zhao, Xianzhang Chen, and Qingfeng Zhuge
- Published
- 2016
- Full Text
- View/download PDF
112. A unified framework for designing high performance in-memory and hybrid memory file systems.
- Author
-
Xianzhang Chen, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Weiwen Jiang, Junxi Chen, Jun Chen, and Jun Xu
- Published
- 2016
- Full Text
- View/download PDF
113. Efficient Data Placement for Improving Data Access Performance on Domain-Wall Memory.
- Author
-
Xianzhang Chen, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Chun Jason Xue, Weiwen Jiang, and Yuangang Wang
- Published
- 2016
- Full Text
- View/download PDF
114. A New Design of In-Memory File System Based on File Virtual Address Framework.
- Author
-
Edwin Hsing-Mean Sha, Xianzhang Chen, Qingfeng Zhuge, Liang Shi, and Weiwen Jiang
- Published
- 2016
- Full Text
- View/download PDF
115. Properties of Self-Timed Ring Architectures for Deadlock-Free and Consistent Configuration Reaching Maximum Throughput.
- Author
-
Weiwen Jiang, Qingfeng Zhuge, Xianzhang Chen, Lei Yang 0018, Juan Yi, and Edwin Hsing-Mean Sha
- Published
- 2016
- Full Text
- View/download PDF
116. Tumbler: Energy Efficient Task Scheduling for Dual-Channel Solar-Powered Sensor Nodes.
- Author
-
Yue Xu, Hyung Gyu Lee, Yujuan Tan, Yu Wu 0016, Xianzhang Chen, Liang Liang 0002, Lei Qiao, and Duo Liu
- Published
- 2019
- Full Text
- View/download PDF
117. A Wear-Leveling-Aware Fine-Grained Allocator for Non-Volatile Memory.
- Author
-
Xianzhang Chen, Qingfeng Zhuge, Qiang Sun, Edwin Hsing-Mean Sha, Shouzhen Gu, Chaoshu Yang, and Chun Jason Xue
- Published
- 2019
- Full Text
- View/download PDF
118. Self-Adapting Channel Allocation for Multiple Tenants Sharing SSD Devices
- Author
-
Duo Liu, Yujuan Tan, Renping Liu, Runyu Zhang, Xianzhang Chen, and Liang Liang
- Subjects
Channel allocation schemes, business.industry, Computer science, Electrical and Electronic Engineering, business, Computer Graphics and Computer-Aided Design, Software, Computer network - Published
- 2022
- Full Text
- View/download PDF
119. Scanning gate microscopy in graphene nanostructures
- Author
-
Xianzhang Chen, Guillaume Weick, Dietmar Weinmann, and Rodolfo A. Jalabert
- Subjects
Condensed Matter - Mesoscale and Nanoscale Physics, Mesoscale and Nanoscale Physics (cond-mat.mes-hall), FOS: Physical sciences - Abstract
The conductance of graphene nanoribbons and nanoconstrictions under the effect of a scanning gate microscopy tip is systematically studied. Using a scattering approach for noninvasive probes, the first- and second-order conductance corrections caused by the tip potential disturbance are expressed explicitly in terms of the scattering states of the unperturbed structure. Numerical calculations confirm the perturbative results, showing that the second-order term prevails in the conductance plateaus, exhibiting a universal scaling law for armchair graphene strips. For stronger tips, at specific probe potential widths and strengths beyond the perturbative regime, the conductance corrections reveal the appearance of resonances originating from states trapped below the tip. The zero-transverse-energy mode of an armchair metallic strip is shown to be insensitive to the long-range electrostatic potential of the probe. For nanoconstrictions defined on a strip, scanning gate microscopy allows one to gain insight into the breakdown of conductance quantization. The first-order correction generically dominates at low tip strength, while for Fermi energies associated with faint conductance plateaus, the second-order correction becomes dominant for relatively small potential tip strengths. In accordance with the spatial dependence of the partial local density of states, the largest tip effect occurs in the central part of the constriction, close to the edges. Nanoribbons and nanoconstrictions with zigzag edges exhibit a response similar to that of armchair nanostructures, except when the intervalley coupling induced by the tip potential destroys the chiral edge states. (21 pages, 16 figures)
- Published
- 2022
120. Contour: A Process Variation Aware Wear-Leveling Mechanism for Inodes of Persistent Memory File Systems
- Author
-
Wang Xinxin, Chaoshu Yang, Qingfeng Zhuge, Weiwen Jiang, Xianzhang Chen, and Edwin H.-M. Sha
- Subjects
File system, Computer science, Linux kernel, 02 engineering and technology, inode, Parallel computing, computer.software_genre, 020202 computer hardware & architecture, Theoretical Computer Science, Process variation, Memory management, Computational Theory and Mathematics, Hardware and Architecture, 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), Table (database), computer, Software, Wear leveling - Abstract
Existing persistent memory file systems exploit the fast, byte-addressable persistent memory (PM) to boost storage performance but ignore the limited endurance of PM. In particular, the PM storing the inode section is extremely vulnerable because inodes are updated most frequently, fixed at one location throughout their lifetime, and require immediate persistency. The huge endurance variation across persistent memory domains caused by process variation makes things even worse. In this article, we propose a process variation aware wear-leveling mechanism called Contour for the inode section of persistent memory file systems. Contour first enables the movement of inodes by virtualizing the inodes with a deflection table. Then, Contour adopts a cross-domain migration algorithm and an intra-domain migration algorithm to balance the writes across and within the memory domains. We implement the proposed Contour mechanism in Linux kernel 4.4.30 based on a real persistent memory file system, SIMFS. We use standard benchmarks, including Filebench, MySQL, and FIO, to evaluate Contour. Extensive experimental results show that Contour can improve the wear ratios of pages by 417.8× and 4.5× over the original SIMFS and PCV, the state-of-the-art inode wear-leveling algorithm, respectively. Meanwhile, the average performance overhead and wear overhead of Contour are 0.87 and 0.034 percent in application-level workloads, respectively.
- Published
- 2021
- Full Text
- View/download PDF
121. Effective file data-block placement for different types of page cache on hybrid main memory architectures.
- Author
-
Penglin Dai, Qingfeng Zhuge, Xianzhang Chen, Weiwen Jiang, and Edwin Hsing-Mean Sha
- Published
- 2013
- Full Text
- View/download PDF
122. On the Design of Minimal-Cost Pipeline Systems Satisfying Hard/Soft Real-Time Constraints
- Author
-
Qingfeng Zhuge, Edwin H.-M. Sha, Lei Yang, Weiwen Jiang, Xianzhang Chen, and Hailiang Dong
- Subjects
020203 distributed computing, Mathematical optimization, Computer science, Pipeline (computing), Probabilistic logic, Approximation algorithm, 02 engineering and technology, 020202 computer hardware & architecture, Computer Science Applications, Human-Computer Interaction, Pipeline transport, 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Time complexity, Throughput (business), Random variable, Integer programming, Information Systems - Abstract
Pipeline systems provide high throughput for applications by overlapping the executions of tasks. In architectures with heterogeneity, two basic issues in the design of application-specific pipelines need to be studied: what type of functional unit should execute each task, and where to place buffers. Due to the increasing complexity of applications, pipeline designs face a bundle of problems. One of the most challenging is the uncertainty of execution times, which makes deterministic techniques inapplicable. In this paper, the execution times are modeled as random variables. Given an application, our objective is to construct the optimal pipeline such that the total cost of the resultant pipeline is minimized while satisfying the required timing constraints with the given guaranteed probability. We first prove the NP-hardness of the problem. Then, we present Mixed Integer Linear Programming (MILP) formulations to obtain the optimal solution. Due to the high time complexity of MILP, we devise an efficient $(1+\varepsilon)$-approximation algorithm, where the value of $\varepsilon$ is less than 5 percent in practice. Experimental results show that our algorithms can achieve significant reductions in cost over the existing techniques, reaching up to 31.93 percent on average.
- Published
- 2021
- Full Text
- View/download PDF
123. Improving the Performance of Deduplication-Based Storage Cache via Content-Driven Cache Management Methods
- Author
-
Hong Jiang, Xianzhang Chen, Zhichao Yan, Witawas Srisa-an, Duo Liu, Jing Xie, Yujuan Tan, and Congcong Xu
- Subjects
Hardware_MEMORYSTRUCTURES, Computational Theory and Mathematics, Distributed database, Hardware and Architecture, Computer science, CPU cache, Backup, Distributed computing, Signal Processing, Redundancy (engineering), Data deduplication, Cache, Cache algorithms - Abstract
Data deduplication, a proven technology for effective data reduction in backup and archiving storage systems, also shows promise in increasing the logical space capacity of storage caches by removing redundant data. However, our in-depth evaluation of the existing deduplication-aware caching algorithms reveals that they only work well when the cached block size is set to 4 KB. Unfortunately, modern storage systems often set the block size to be much larger than 4 KB, and in this scenario, the overall performance of these caching schemes drops below that of the conventional replacement algorithms without any deduplication. There are several reasons for this performance degradation. The first is the deduplication overhead, the time spent generating data fingerprints and using them to identify duplicate data; such overhead offsets the benefits of deduplication. The second is the extremely low cache space utilization caused by read and write alignment. The third is that existing algorithms only exploit access locality to identify blocks for replacement, missing the opportunity to leverage content usage patterns, such as the intensity of content redundancy and sharing, to further improve the performance of deduplication-based storage caches. We propose CDAC, a Content-driven Deduplication-Aware Cache, to address this problem. CDAC focuses on exploiting the content redundancy in blocks and the intensity of content sharing among source addresses in cache management strategies. We have implemented CDAC based on the LRU and ARC algorithms, called CDAC-LRU and CDAC-ARC, respectively. Our extensive experimental results show that CDAC-LRU and CDAC-ARC outperform the state-of-the-art deduplication-aware caching algorithms, D-LRU and D-ARC, by up to 23.83× in read cache hit ratio, with an average of 3.23×, and up to 53.3 percent in IOPS, with an average of 49.8 percent, under a real-world mixed workload when the cache size ranges from 20 to 50 percent of the workload size and the block size ranges from 4 KB to 32 KB.
- Published
- 2021
- Full Text
- View/download PDF
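The content-driven caching idea summarized in the abstract above can be illustrated in a few lines: cached blocks are fingerprinted so that different source addresses holding identical content share a single stored copy. The following is a simplified sketch under our own toy model, not the CDAC implementation; the class and method names are hypothetical.

```python
import hashlib

class DedupCache:
    """Toy deduplicated cache: one stored copy per unique block content."""

    def __init__(self):
        self.addr_to_fp = {}   # source address -> content fingerprint
        self.fp_store = {}     # fingerprint -> (block data, reference count)

    def put(self, addr, block: bytes):
        # Fingerprint the content; identical blocks map to the same key.
        fp = hashlib.sha1(block).hexdigest()
        old = self.addr_to_fp.get(addr)
        if old == fp:
            return
        if old is not None:
            self._unref(old)
        self.addr_to_fp[addr] = fp
        data, refs = self.fp_store.get(fp, (block, 0))
        self.fp_store[fp] = (data, refs + 1)

    def get(self, addr):
        fp = self.addr_to_fp.get(addr)
        return self.fp_store[fp][0] if fp else None

    def _unref(self, fp):
        # Drop a reference; free the block once no address points at it.
        data, refs = self.fp_store[fp]
        if refs <= 1:
            del self.fp_store[fp]
        else:
            self.fp_store[fp] = (data, refs - 1)
```

Two addresses caching the same 4 KB block consume one slot, which is the source of the logical capacity gain the abstract describes; the fingerprinting cost is likewise the "deduplication overhead" it analyzes.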
124. Self-Balancing Federated Learning With Global Imbalanced Data in Mobile Systems
- Author
-
Duo Liu, Yujuan Tan, Liang Liang, Renping Liu, Xianzhang Chen, and Moming Duan
- Subjects
Training set, Distributed database, Artificial neural network, Computer science, business.industry, Deep learning, Machine learning, computer.software_genre, Federated learning, Data modeling, Computational Theory and Mathematics, Hardware and Architecture, Server, Signal Processing, Artificial intelligence, Divergence (statistics), business, computer - Abstract
Federated learning (FL) is a distributed deep learning method that enables multiple participants, such as mobile and IoT devices, to contribute to a neural network while their private training data remains on local devices. This distributed approach is promising for mobile systems, which have a large corpus of decentralized data and require high privacy. However, unlike common datasets, the data distribution of mobile systems is imbalanced, which increases the bias of the model. In this article, we demonstrate that imbalanced distributed training data causes an accuracy degradation of FL applications. To counter this problem, we build a self-balancing FL framework named Astraea, which alleviates the imbalances by 1) Z-score-based data augmentation and 2) mediator-based multi-client rescheduling. The proposed framework relieves global imbalance by adaptive data augmentation and downsampling, and to average out the local imbalance, it creates a mediator to reschedule the training of clients based on the Kullback–Leibler divergence (KLD) of their data distributions. Compared with FedAvg, the vanilla FL algorithm, Astraea shows +4.39 and +6.51 percent improvements in top-1 accuracy on the imbalanced EMNIST and imbalanced CINIC-10 datasets, respectively. Meanwhile, the communication traffic of Astraea is reduced by 75 percent compared to FedAvg.
- Published
- 2021
- Full Text
- View/download PDF
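Two building blocks named in the abstract above, FedAvg-style weighted averaging and the Kullback–Leibler divergence between client label distributions, can be sketched as follows. This is an illustrative simplification (the function names and signatures are our own), not Astraea's code.

```python
import math

def fedavg(client_weights, client_sizes):
    """FedAvg aggregation: average each parameter, weighting clients by data size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[k] * s for w, s in zip(client_weights, client_sizes)) / total
            for k in range(dim)]

def kl_divergence(p, q):
    """KLD between two discrete label distributions (q must be nonzero where p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

A mediator in the paper's sense could compare each client's label histogram against a uniform reference with `kl_divergence` and group clients so that the distributions of scheduled groups roughly cancel out before `fedavg` combines their updates.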
125. Optimizing synchronization mechanism for block-based file systems using persistent memory
- Author
-
Xianzhang Chen, Qingfeng Zhuge, Duo Liu, Edwin H.-M. Sha, Runyu Zhang, and Chaoshu Yang
- Subjects
File system, Hardware_MEMORYSTRUCTURES, Data consistency, Computer Networks and Communications, business.industry, Computer science, ext4, 020206 networking & telecommunications, Linux kernel, 02 engineering and technology, Data loss, computer.software_genre, Synchronization, Persistence (computer science), Hardware and Architecture, Embedded system, Synchronization (computer science), 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), 020201 artificial intelligence & image processing, business, computer, Software, Block (data storage) - Abstract
Existing block-based file systems employ a buffer caching mechanism to improve performance, which may result in data loss in the case of power failure or system crash. To avoid data loss, the file systems provide synchronization operations for applications to synchronously write the dirty data in the DRAM cache back to the slow block devices. However, the synchronization operations can severely degrade the performance of the file system since they violate the intent of the buffer caching mechanism. In this paper, we propose to relieve the overhead of synchronization operations while ensuring data reliability by utilizing a small persistent memory. The proposed Persistent Memory assisted Write-back (PMW) mechanism includes a dedicated copy-on-write mechanism to guarantee data consistency and a write-back mechanism across PM and the block devices. We implement the proposed PMW in the Linux kernel based on Ext4. The experimental results show that PMW can achieve about 2.2× and 1.6× performance improvement over the original Ext4 and AFCM, the state-of-the-art PM-based synchronization mechanism, on the TPCC workload, respectively.
- Published
- 2020
- Full Text
- View/download PDF
126. Separable Binary Convolutional Neural Network on Embedded Systems
- Author
-
Yujuan Tan, Chaoshu Yang, Liang Liang, Renping Liu, Yingjian Ling, Duo Liu, Runyu Zhang, Weilue Wang, Chunhua Xiao, and Xianzhang Chen
- Subjects
business.industry, Computer science, Binary number, 02 engineering and technology, Convolutional neural network, 020202 computer hardware & architecture, Theoretical Computer Science, Separable space, Computational Theory and Mathematics, Kernel (image processing), Hardware and Architecture, Embedded system, Principal component analysis, 0202 electrical engineering, electronic engineering, information engineering, Network performance, business, Software - Abstract
We have witnessed the tremendous success of deep neural networks. However, this success comes with considerable memory and computational costs, which make it difficult to deploy these networks directly on resource-constrained embedded systems. To address this problem, we propose TaijiNet, a separable binary network, to reduce the storage and computational overhead while maintaining comparable accuracy. Furthermore, we also introduce a strategy called partial binarized convolution, which binarizes only unimportant kernels to efficiently balance network performance and accuracy. Our approach is evaluated on the CIFAR-10 and ImageNet datasets. The experimental results show that with the proposed TaijiNet, the separable binary versions of AlexNet and ResNet-18 can achieve 26× and 6.4× compression rates with comparable accuracy compared with the full-precision versions, respectively. In addition, by adjusting the PCA threshold, the xnor version of Taiji-AlexNet improves accuracy by 4-8 percent compared with other state-of-the-art methods.
- Published
- 2020
- Full Text
- View/download PDF
127. APMigration: Improving Performance of Hybrid Memory Performance via An Adaptive Page Migration Method
- Author
-
Duo Liu, Yujuan Tan, Witawas Srisa-an, Xianzhang Chen, Zhichao Yan, and Baiping Wang
- Subjects
020203 distributed computing, Random access memory, Hardware_MEMORYSTRUCTURES, Computer science, Frame (networking), 02 engineering and technology, computer.software_genre, Flash memory, Non-volatile memory, Memory management, Computational Theory and Mathematics, Hardware and Architecture, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Operating system, Non-volatile random-access memory, computer, Dram, Data migration - Abstract
Byte-addressable, non-volatile memory (NVRAM) combines the benefits of DRAM and flash memory. However, because it is slower than DRAM, it is best deployed in combination with typical DRAM. In such hybrid NVRAM systems, frequently accessed, hot pages can be stored in DRAM while other, cold pages reside in NVRAM, providing both high performance (from DRAM) and lower power consumption and cost (from NVRAM). While the idea seems beneficial, realizing an efficient hybrid NVRAM system requires careful page migration and accurate data temperature measurement. Existing solutions, however, often cause invalid migrations due to inaccurate data temperature accounting, because hot and cold pages are identified separately in the DRAM and NVRAM regions. Moreover, since a new NVRAM frame is always allocated for each page swapped back to NVRAM, a large number of unnecessary NVRAM writes are generated during each page migration. Based on these observations, we propose APMigrate, an adaptive data migration approach for hybrid NVRAM systems. APMigrate consists of two parts, UIMigrate and LazyWriteback. UIMigrate focuses on eliminating invalid page migrations by considering data temperature in the entire DRAM-NVRAM space, while LazyWriteback focuses on rewriting only dirty data back when a page is swapped back to NVRAM. Our experiments using SPEC 2006 show that APMigrate can reduce the number of migrations and improve performance by up to 90 percent compared to existing state-of-the-art approaches. For some workloads, LazyWriteback can reduce the unnecessary NVRAM writes of existing page migration schemes by up to 75 percent.
- Published
- 2020
- Full Text
- View/download PDF
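The abstract's central observation, that page temperature should be ranked over the whole DRAM-NVRAM space rather than separately per region, can be illustrated with a toy tracker. This is a hypothetical sketch of the idea, not APMigrate's kernel code; all names are our own.

```python
class TemperatureTracker:
    """Toy global page-temperature accounting for a hybrid DRAM/NVM system."""

    def __init__(self):
        self.heat = {}       # page id -> access count (the "temperature")
        self.location = {}   # page id -> "dram" or "nvm"

    def access(self, page):
        self.heat[page] = self.heat.get(page, 0) + 1

    def migration_candidates(self, dram_capacity):
        """Rank ALL pages by heat; the globally hottest ones that are still
        in NVM are the only pages worth migrating into DRAM."""
        ranked = sorted(self.heat, key=self.heat.get, reverse=True)
        return [p for p in ranked[:dram_capacity]
                if self.location.get(p) != "dram"]
```

Ranking within a single global ordering avoids the invalid migrations the abstract attributes to per-region accounting, where a page merely "hot for NVM" could displace a DRAM page that is hotter in absolute terms.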
128. Downsizing Without Downgrading: Approximated Dynamic Time Warping on Nonvolatile Memories
- Author
-
Yingjian Ling, Xianzhang Chen, Duo Liu, Po-Chun Huang, Renping Liu, Yi Gu, Liang Liang, Kan Zhong, and Xingni Li
- Subjects
Dynamic time warping, Similarity (geometry), Computer science, 02 engineering and technology, computer.software_genre, Computer Graphics and Computer-Aided Design, 020202 computer hardware & architecture, Euclidean distance, Upsampling, 0202 electrical engineering, electronic engineering, information engineering, Data mining, Electrical and Electronic Engineering, Time series, Wireless sensor network, computer, Software - Abstract
In recent years, time-series data have emerged in a variety of application domains, such as wireless sensor networks and surveillance systems. To identify the similarity between time-series data, the Euclidean distance and its variations are common metrics that quantify the differences between time-series data. However, the Euclidean distance is limited by its inability to elastically shift with the time axis, which motivates the development of dynamic time warping (DTW) algorithms. While DTW algorithms have been proven very useful in diversified applications like speech recognition, their efficacy might be seriously affected by the resolution of the time-series data. However, high-resolution time-series data might take up a gigantic amount of main memory and storage space, which will slow down the DTW analysis procedure. This makes the upscaling of DTW analysis more challenging, especially for in-memory data analytics platforms with limited nonvolatile memory space. In this paper, we propose a strategy to downsample time-series data to significantly reduce their size without seriously affecting the precision of the results obtained by DTW algorithms (downsizing without downgrading). In other words, this paper proposes a technique to remove the unimportant details that are largely ignored by DTW algorithms. The efficacy of the proposed technique is verified by a series of experimental studies, where the results are quite encouraging.
- Published
- 2020
- Full Text
- View/download PDF
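The two ingredients the abstract above discusses, the DTW distance and downsampling of time series before analysis, can be sketched as follows. This is the textbook O(nm) DTW plus naive mean-pooling as a stand-in for the paper's technique, not its actual approximation scheme; the function names are our own.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # match step
    return cost[n][m]

def downsample(seq, factor):
    """Mean-pool every `factor` consecutive points to shrink the series."""
    return [sum(seq[i:i + factor]) / len(seq[i:i + factor])
            for i in range(0, len(seq), factor)]
```

Unlike the Euclidean distance, DTW can align sequences that are shifted along the time axis, and its quadratic cost in sequence length is exactly why downsampling high-resolution data before DTW, as the paper proposes, pays off.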
129. Federated learning with workload-aware client scheduling in heterogeneous systems
- Author
-
Li Li, Duo Liu, Moming Duan, Yu Zhang, Ao Ren, Xianzhang Chen, Yujuan Tan, and Chengliang Wang
- Subjects
Machine Learning ,Artificial Intelligence ,Cognitive Neuroscience ,Humans ,Workload - Abstract
Federated Learning (FL) is a novel distributed machine learning paradigm that allows thousands of edge devices to train models locally without uploading data to the central server. Since devices in real federated settings are resource-constrained, FL encounters systems heterogeneity, which causes considerable stragglers and incurs significant accuracy degradation. To tackle the challenges of systems heterogeneity and improve the robustness of the global model, we propose a novel adaptive federated framework in this paper. Specifically, we propose FedSAE, which leverages the workload completion history of clients to adaptively predict the affordable training workload for each device. Consequently, FedSAE can significantly reduce stragglers in highly heterogeneous systems. We incorporate Active Learning into FedSAE to dynamically schedule participants. The server evaluates the devices' training value based on their training loss in each round, and larger-value clients are selected with higher probability. As a result, model convergence is accelerated. Furthermore, we propose q-FedSAE, which combines FedSAE and q-FFL to improve global fairness in highly heterogeneous systems. The evaluations conducted in a highly heterogeneous system demonstrate that both FedSAE and q-FedSAE converge faster than FedAvg. In particular, FedSAE outperforms FedAvg across multiple federated datasets: FedSAE improves testing accuracy by 22.19% and reduces stragglers by 90.69% on average. Moreover, while holding the same accuracy as FedSAE, q-FedSAE allows for more robust convergence and fairer model performance than q-FedAvg and FedSAE.
- Published
- 2021
130. CSAFL: A Clustered Semi-Asynchronous Federated Learning Framework
- Author
-
Xianzhang Chen, Ao Ren, Moming Duan, Duo Liu, Chengliang Wang, Yujuan Tan, Li Li, and Yu Zhang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Information privacy, Artificial neural network, Distributed database, Computer science, Distributed computing, Federated learning, Machine Learning (cs.LG), Data modeling, Computer Science - Distributed, Parallel, and Cluster Computing, Asynchronous communication, Convergence (routing), Distributed, Parallel, and Cluster Computing (cs.DC), Baseline (configuration management) - Abstract
Federated learning (FL) is an emerging distributed machine learning paradigm that protects privacy and tackles the problem of isolated data islands. At present, there are two main communication strategies in FL: synchronous FL and asynchronous FL. The advantages of synchronous FL are high model precision and fast convergence. However, this synchronous communication strategy carries the risk that the central server waits too long for the devices, namely the straggler effect, which has a negative impact on some time-critical applications. Asynchronous FL has a natural advantage in mitigating the straggler effect, but it carries threats of model quality degradation and server crashes. Therefore, we combine the advantages of these two strategies to propose a clustered semi-asynchronous federated learning (CSAFL) framework. We evaluate CSAFL on four imbalanced federated datasets in a non-IID setting and compare CSAFL to the baseline methods. The experimental results show that CSAFL significantly improves test accuracy by more than +5% on the four datasets compared to TA-FedAvg. In particular, CSAFL improves absolute test accuracy by +34.4% on non-IID FEMNIST compared to TA-FedAvg. (This paper will be presented at IJCNN 2021.)
- Published
- 2021
- Full Text
- View/download PDF
131. On the Design of Time-Constrained and Buffer-Optimal Self-Timed Pipelines
- Author
-
Edwin H.-M. Sha, Lei Yang, Weiwen Jiang, Jingtong Hu, Qingfeng Zhuge, and Xianzhang Chen
- Subjects
Marked graph, Matching (graph theory), Computer science, Pipeline (computing), 02 engineering and technology, Parallel computing, Computer Graphics and Computer-Aided Design, Synchronization, 020202 computer hardware & architecture, Reduction (complexity), Asynchronous communication, 0202 electrical engineering, electronic engineering, information engineering, Electrical and Electronic Engineering, Field-programmable gate array, Integer programming, Software - Abstract
Pipelining is a powerful technique to achieve high performance in computing systems. However, as computing platforms become large-scale and integrate heterogeneous processing elements (PEs) (CPUs, GPUs, field-programmable gate arrays, etc.), it is difficult to employ a global clock to achieve synchronous pipelines. Therefore, self-timed (or asynchronous) pipelines are usually adopted. Nevertheless, due to their complex running behavior, performance modeling and systematic optimization for self-timed pipeline (STP) systems are more complicated than for synchronous ones. This paper employs marked graph theory to model STPs and presents algorithms to detect performance bottlenecks. Based on the proposed model, we observe that system performance can be improved by inserting buffers. Due to the limited memory resources on the PEs, it is critical to minimize the number of buffers for STPs while satisfying the required timing constraints. In this paper, we propose integer linear programming formulations to obtain the optimal solutions and devise efficient algorithms to obtain near-optimal solutions. Experimental results show that the proposed algorithms can achieve 53.10% improvement in maximum performance and 54.04% reduction in the number of buffers, compared with the technique for the slack matching problem.
- Published
- 2019
- Full Text
- View/download PDF
132. HydraFS: an efficient NUMA-aware in-memory file system
- Author
-
Kai Liu, Edwin H.-M. Sha, Xianzhang Chen, Zhixiang Liu, Ting Wu, Qingfeng Zhuge, and Chunhua Xiao
- Subjects
File system, Hardware_MEMORYSTRUCTURES, Computer Networks and Communications, Computer science, 020206 networking & telecommunications, Linux kernel, 02 engineering and technology, Thread (computing), computer.software_genre, Scalability, 0202 electrical engineering, electronic engineering, information engineering, Operating system, 020201 artificial intelligence & image processing, computer, Software - Abstract
Emerging persistent file systems are designed to achieve high-performance data processing by effectively exploiting the advanced features of Non-volatile Memory (NVM). Non-uniform memory access (NUMA) architectures are universally used in high-performance computing and data centers due to their scalability. However, existing NVM-based in-memory file systems are all designed for uniform memory access systems. Their performance is not satisfactory on NUMA machines as they do not consider the architecture of multiple nodes and the asymmetric memory access speeds. In this paper, we design an efficient NUMA-aware in-memory file system that distributes file data across all nodes to effectively balance the load of file requests. Three approaches for improving the performance of the file system on NUMA machines are proposed, including a Node-oriented File Creation algorithm to dispatch files over multiple nodes, a File-oriented Thread Binding algorithm to bind threads to the gainful nodes, and a buffer assignment technique to allocate the user buffer from the proper node. Further, based on the new design, we implement a functional NUMA-aware in-memory file system, HydraFS, in the Linux kernel. Extensive experiments show that HydraFS significantly outperforms existing representative in-memory file systems on NUMA machines. The average performance of HydraFS is 76.6%, 91.9%, and 26.7% higher than that of EXT4-DAX, PMFS, and SIMFS, respectively.
- Published
- 2019
- Full Text
- View/download PDF
133. HiNextApp: A context-aware and adaptive framework for app prediction in mobile systems
- Author
-
Renping Liu, Shiming Li, Duo Liu, Liang Liang, Yong Guan, Chaoneng Xiang, Xianzhang Chen, and Jinting Ren
- Subjects
General Computer Science ,business.industry ,Computer science ,020209 energy ,Response time ,020206 networking & telecommunications ,Context (language use) ,02 engineering and technology ,Machine learning ,computer.software_genre ,Variety (cybernetics) ,Bayes' theorem ,Memory management ,Systems management ,mental disorders ,0202 electrical engineering, electronic engineering, information engineering ,Overhead (computing) ,Contextual information ,Artificial intelligence ,Electrical and Electronic Engineering ,business ,computer - Abstract
The variety of applications (apps) installed on mobile systems such as smartphones enriches our lives but complicates system management. For example, finding a specific app becomes less convenient as more apps are installed, and app response times can grow because of the gap between ever more and larger apps and limited memory capacity. Recent work has proposed several methods of predicting the apps to be used in the immediate future (hereinafter, app prediction) to address these issues, but suffers from low prediction accuracy and high training costs. In particular, applying app prediction to memory management (such as LMK) and app prelaunching places high demands on prediction accuracy and training cost. In this paper, we propose an app-prediction framework, named HiNextApp, to improve app-prediction accuracy and reduce training costs in mobile systems. HiNextApp is based on contextual information and can adjust the size of prediction periods adaptively. The framework mainly consists of two parts: a non-uniform Bayes model and an elastic algorithm. The experimental results show that HiNextApp can effectively improve prediction accuracy and reduce training times. Besides, compared with the traditional Bayes model, the overhead of our framework is relatively low.
- Published
- 2019
- Full Text
- View/download PDF
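A context-aware Bayes predictor of the flavor described above can be sketched with simple counting and Laplace smoothing. This is a minimal stand-in, not HiNextApp's non-uniform model; the feature names are invented:

```python
from collections import defaultdict

class ContextBayesPredictor:
    """Toy next-app predictor: score each app by P(app) * prod P(feature|app),
    with Laplace smoothing. Context features might be time slot, location, etc."""

    def __init__(self):
        self.app_count = defaultdict(int)
        self.feat_count = defaultdict(lambda: defaultdict(int))  # feature -> app -> count

    def train(self, context, app):
        self.app_count[app] += 1
        for f in context:
            self.feat_count[f][app] += 1

    def predict(self, context):
        apps = list(self.app_count)
        total = sum(self.app_count.values())
        def score(app):
            p = self.app_count[app] / total
            for f in context:
                p *= (self.feat_count[f][app] + 1) / (self.app_count[app] + len(apps))
            return p
        return max(apps, key=score)

p = ContextBayesPredictor()
p.train(("morning", "home"), "news")
p.train(("morning", "home"), "news")
p.train(("evening", "commute"), "music")
```

The elastic part of the real framework adjusts how much history feeds these counts; here the training window is simply everything seen so far.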
134. FitCNN: A cloud-assisted and low-cost framework for updating CNNs on IoT devices
- Author
-
Yujuan Tan, Chaoshu Yang, Liang Liang, Jinting Ren, Duo Liu, Xianzhang Chen, Moming Duan, Renping Liu, and Shiming Li
- Subjects
Artificial neural network ,Contextual image classification ,Computer Networks and Communications ,business.industry ,Computer science ,Real-time computing ,020206 networking & telecommunications ,Cloud computing ,02 engineering and technology ,Convolutional neural network ,Upload ,User experience design ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,Overhead (computing) ,020201 artificial intelligence & image processing ,business ,Mobile device ,Software - Abstract
Recently, convolutional neural networks (CNNs) have achieved state-of-the-art accuracies in image classification and recognition tasks. CNNs are usually deployed in the cloud to handle data collected from IoT devices, such as smartphones and unmanned systems. However, significant data-transmission overhead and privacy issues have made it necessary to run CNNs directly on the device side. Nevertheless, a trained model deployed on mobile devices cannot effectively handle unknown data and objects in new environments, which can lead to low accuracy and poor user experience. Hence, it is crucial to re-train a better model on future unknown data. However, given the tremendous computing cost and memory usage, training a CNN on IoT devices with limited hardware resources is impractical. To solve this issue, using the power of the cloud to assist mobile devices in training a deep neural network becomes a promising solution. Therefore, this paper proposes a cloud-assisted CNN framework, named FitCNN, with incremental learning and low data transmission, to reduce the overhead of updating CNNs deployed on devices. To reduce the data transmission during incremental learning, we propose a strategy, called Distiller, to selectively upload the data that is worth learning, and develop an extracting strategy, called Juicer, to choose a small portion of the weights from the new CNN model generated on the cloud to update the corresponding old ones on devices. Experimental results show that the Distiller strategy can reduce the data transmission of uploading by 39.4% on a certain dataset, and the Juicer strategy reduces the data transmission of updating by more than 60% across multiple CNNs and datasets.
- Published
- 2019
- Full Text
- View/download PDF
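One plausible reading of a Distiller-style filter is to upload only the samples the on-device model is unsure about. The confidence-threshold criterion below is an assumption for illustration, not the paper's actual selection rule:

```python
import math

def distiller(logit_batches, threshold=0.6):
    """Keep only samples whose max softmax probability falls below `threshold`,
    i.e. samples the device model is uncertain about and thus 'worth learning'.
    (Hypothetical criterion; FitCNN's real Distiller may differ.)"""
    upload = []
    for logits in logit_batches:
        exps = [math.exp(z) for z in logits]
        probs = [e / sum(exps) for e in exps]
        if max(probs) < threshold:
            upload.append(logits)
    return upload

confident = [5.0, 0.0, 0.0]   # max prob ~0.99 -> model already handles it
unsure = [0.2, 0.1, 0.0]      # max prob ~0.37 -> candidate for upload
selected = distiller([confident, unsure])
```

A Juicer-style counterpart would apply an analogous filter in the other direction, downloading only the weights that changed enough to matter.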
135. LPE: Locality-Based Dead Prediction in Exclusive TLB for Large Coverage
- Author
-
Yujuan Tan, Jing Yan, Jingcheng Liu, Chengliang Wang, Xianzhang Chen, and Zhulin Ma
- Subjects
Hardware and Architecture ,Computer science ,Locality ,Translation lookaside buffer ,General Medicine ,Parallel computing ,Electrical and Electronic Engineering ,Memory systems - Abstract
Translation lookaside buffers (TLBs) are critical to the performance of modern multi-level memory systems. However, due to its limited size, a TLB's address coverage is limited. Adopting a two-level exclusive TLB hierarchy can increase coverage [M. Swanson, L. Stoller and J. Carter, Increasing TLB reach using superpages backed by shadow memory, 25th Annual Int. Symp. Computer Architecture (1998); H. P. Chang, T. Heo, J. Jeong and J. Huh, Hybrid TLB coalescing: Improving TLB translation coverage under diverse fragmented memory allocations, ACM SIGARCH Comput. Arch. News 45 (2017) 444–456] and thus improve memory performance. However, after analyzing existing two-level exclusive TLBs, we find that a large number of "dead" entries (entries that will have no further use) stay in the last-level TLB (LLT) for a long time, occupying much cache space and resulting in a low TLB hit rate. Based on this observation, we propose exploiting temporal and spatial locality to predict and identify dead entries in the exclusive LLT and remove them as soon as possible, leaving room for more valid data and increasing the TLB hit rate. Extensive experiments show that our method increases the average hit rate by 8.67%, up to a maximum of 19.95%, and reduces total latency by an average of 9.82%, up to 24.41%.
- Published
- 2021
- Full Text
- View/download PDF
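The locality-based dead prediction described above can be illustrated with a toy last-level TLB: an entry is predicted dead once neither it nor a spatial neighbor has been touched for a fixed window. The window heuristic and all names are invented, not LPE's actual predictor:

```python
class LastLevelTLB:
    """Toy exclusive LLT with a dead-entry heuristic: an entry whose page (and
    neighbouring pages) saw no activity for `window` insertions is evicted early."""

    def __init__(self, window=4):
        self.window = window
        self.entries = {}   # virtual page number -> tick of last activity
        self.tick = 0

    def insert(self, vpn):
        self.tick += 1
        self.entries[vpn] = self.tick
        # spatial locality: a touched neighbour keeps an entry's locality live
        for n in (vpn - 1, vpn + 1):
            if n in self.entries:
                self.entries[n] = self.tick
        # temporal locality: entries idle past the window are predicted dead
        dead = [v for v, t in self.entries.items() if self.tick - t > self.window]
        for v in dead:
            del self.entries[v]
        return dead

t = LastLevelTLB(window=4)
for vpn in (100, 200, 201, 300, 301):
    t.insert(vpn)
evicted = t.insert(302)   # page 100 has been idle since tick 1
```

The payoff is the same as in the abstract: removing predicted-dead entries early frees LLT room for translations that will actually be reused.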
136. DFShards
- Author
-
Duo Liu, Congcong Xu, Ailing Yu, Xianzhang Chen, Zhulin Ma, and Yujuan Tan
- Subjects
020203 distributed computing ,Hardware_MEMORYSTRUCTURES ,Computer science ,CPU cache ,02 engineering and technology ,Construct (python library) ,Function (mathematics) ,Set (abstract data type) ,Data access ,Shard ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Key (cryptography) ,Cache ,Algorithm - Abstract
The Miss Ratio Curve (MRC) describes the cache miss ratio as a function of cache size. Its shape reflects the data access behavior of a workload in the cache. The MRC is an effective tool for guiding cache partitioning, but constructing it in real time is challenging. Miniature Simulation is a novel approach that constructs MRCs for non-stack algorithms in real time by feeding a small number of sampled references to multiple mini caches simultaneously to obtain their miss ratios. However, with Miniature Simulation, the size and number of mini caches are difficult to set before the program runs. First, it may set too many mini caches and cause redundant simulations. Second, it may miss some important cache sizes, construct a less precise MRC shape, and consequently produce incorrect cache partitioning. To address this problem, we propose DFShards, an adaptive cache-shard (mini-cache) configuration approach based on program access patterns. The key idea is to dynamically adjust the configuration of the cache shards, including their total number and the size of each shard, based on access behaviors, so that the MRC precisely reflects changes in the workload, thereby achieving better cache partitioning and overall performance. Our extensive experiments show that DFShards can construct precise MRCs in real time while the program runs. Compared to state-of-the-art approaches, it can save up to 47% of the cache space used for MRC construction while increasing the cache hit ratio by up to 17%.
- Published
- 2021
- Full Text
- View/download PDF
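The Miniature Simulation baseline that DFShards builds on can be sketched directly: run one small LRU cache per candidate size over the (sampled) trace and record each miss ratio. DFShards' contribution is choosing the sizes adaptively; here they are fixed for illustration:

```python
from collections import OrderedDict

def lru_miss_ratio(trace, size):
    """Miss ratio of an LRU cache of `size` slots over a reference trace."""
    cache, misses = OrderedDict(), 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)          # refresh recency on a hit
        else:
            misses += 1
            if len(cache) >= size:
                cache.popitem(last=False)   # evict least recently used
            cache[key] = True
    return misses / len(trace)

def build_mrc(trace, sizes):
    """One mini cache per candidate size, simulated side by side."""
    return {s: lru_miss_ratio(trace, s) for s in sizes}

trace = [1, 2, 3, 1, 2, 3, 4, 1, 2, 3]
mrc = build_mrc(trace, [1, 2, 4])
```

On this cyclic trace the curve stays flat until the cache holds the whole working set, exactly the kind of shape a fixed size grid can miss.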
137. Forseti: An Efficient Basic-block-level Sensitivity Analysis Framework Towards Multi-bit Faults
- Author
-
Moming Duan, Jinting Ren, Xianzhang Chen, Duo Liu, Chengliang Wang, and Renping Liu
- Subjects
Speedup ,Artificial neural network ,Computer engineering ,Computer science ,Basic block ,Overhead (computing) ,Sensitivity (control systems) - Abstract
Per-instruction sensitivity analysis frameworks are developed to evaluate the resiliency of a program and identify the segments of the program that need protection. However, for multi-bit hardware faults, per-instruction sensitivity analysis incurs large overhead from redundant analyses. In this paper, we propose a basic-block-level sensitivity analysis framework, Forseti, to reduce the overhead of analyzing the impact of modern microprocessors' multi-bit faults on programs. We implement Forseti in LLVM and evaluate it with five typical workloads. Extensive experimental results show that Forseti achieves more than 90% sensitivity-classification accuracy and a 6.16× speedup over instruction-level analysis.
- Published
- 2021
- Full Text
- View/download PDF
138. FedSAE: A Novel Self-Adaptive Federated Learning Framework in Heterogeneous Systems
- Author
-
Yujuan Tan, Yu Zhang, Chengliang Wang, Duo Liu, Moming Duan, Ao Ren, Li Li, and Xianzhang Chen
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Edge device ,Artificial neural network ,Computer science ,Active learning (machine learning) ,Reliability (computer networking) ,Distributed computing ,Machine Learning (cs.LG) ,Upload ,Computer Science - Distributed, Parallel, and Cluster Computing ,Complete information ,Server ,Overhead (computing) ,Distributed, Parallel, and Cluster Computing (cs.DC) - Abstract
Federated Learning (FL) is a novel distributed machine learning approach that allows thousands of edge devices to train a model locally without uploading data to a central server. But since real federated settings are resource-constrained, FL suffers from systems heterogeneity, which directly causes many stragglers and indirectly leads to significant accuracy reduction. To solve the problems caused by systems heterogeneity, we introduce a novel self-adaptive federated framework, FedSAE, which automatically adjusts the training task of each device and actively selects participants to alleviate the performance degradation. In this work, we 1) propose FedSAE, which leverages the complete information of devices' historical training tasks to predict the affordable training workload for each device; in this way, FedSAE can estimate the reliability of each device and self-adaptively adjust the amount of training load per client in each round; and 2) combine our framework with Active Learning to self-adaptively select participants, accelerating the convergence of the global model. In our framework, the server evaluates each device's training value based on its training loss, then selects the clients of higher value to the global model to reduce communication overhead. The experimental results indicate that in a highly heterogeneous system, FedSAE converges faster than FedAvg, the vanilla FL framework. Furthermore, FedSAE outperforms FedAvg on several federated datasets: FedSAE improves test accuracy by 26.7% and reduces stragglers by 90.3% on average. Comment: This paper will be presented at IJCNN 2021
- Published
- 2021
- Full Text
- View/download PDF
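The two mechanisms named above (workload self-adaptation from history, and loss-based participant selection) can each be sketched in a few lines. Both rules below, including the grow/shrink factors, are invented placeholders for FedSAE's real prediction and valuation logic:

```python
def affordable_workload(history, shrink=0.5, grow=1.1):
    """Predict the next round's workload for a device from its history of
    (assigned_workload, completed) pairs: back off after a failure, grow
    gently after a success. (Hypothetical rule, not FedSAE's predictor.)"""
    last_assigned, completed = history[-1]
    return last_assigned * (grow if completed else shrink)

def select_participants(clients, k):
    """Pick the k clients with the highest current training loss: they carry
    the most value for the global model, as the abstract sketches."""
    return sorted(clients, key=lambda c: c["loss"], reverse=True)[:k]

clients = [{"id": 1, "loss": 0.2}, {"id": 2, "loss": 0.9}, {"id": 3, "loss": 0.5}]
chosen = select_participants(clients, 2)
```

In a full round the server would combine both: predict each client's affordable load, then hand the selected clients a task sized to that prediction.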
139. WMAlloc: A Wear-Leveling-Aware Multi-Grained Allocator for Persistent Memory File Systems
- Author
-
Wenbin Wang, Shun Nie, Chaoshu Yang, Xianzhang Chen, Duo Liu, and Runyu Zhang
- Subjects
File system ,Computer science ,business.industry ,020206 networking & telecommunications ,Linux kernel ,Memory bus ,02 engineering and technology ,computer.software_genre ,020202 computer hardware & architecture ,Persistence (computer science) ,Allocator ,Memory management ,Embedded system ,0202 electrical engineering, electronic engineering, information engineering ,Binary heap ,Persistent data structure ,business ,computer ,Wear leveling ,Heap (data structure) - Abstract
Emerging Persistent Memories (PMs) promise to revolutionize storage systems by providing fast, persistent data access on the memory bus. Persistent memory file systems have therefore been developed to achieve high performance by exploiting the advanced features of PMs. Unfortunately, PMs suffer from limited write endurance. Moreover, the existing space management strategies of persistent memory file systems usually ignore this problem, which can concentrate write operations on a few cells of the PM. Such unbalanced writes can quickly damage the underlying PM and seriously compromise the data reliability of the file system. Meanwhile, existing wear-leveling-aware space management techniques mainly focus on improving the wear-leveling accuracy of PMs rather than reducing its overhead, which can seriously degrade the performance of persistent memory file systems. In this paper, we propose a Wear-Leveling-Aware Multi-Grained Allocator, called WMAlloc, to achieve wear-leveling of the PM while improving the performance of persistent memory file systems. WMAlloc adopts multiple heap trees to manage the unused space of the PM, with each heap tree representing one allocation granularity. For each allocation, WMAlloc takes less-worn blocks of the required size from the corresponding heap tree. We implement WMAlloc in the Linux kernel based on NOVA, a typical persistent memory file system. Compared with DWARM, the state-of-the-art wear-leveling-aware space management technique, experimental results show that WMAlloc achieves 1.52× the PM lifetime and a 1.44× performance improvement on average.
- Published
- 2020
- Full Text
- View/download PDF
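The core allocation rule, take a least-worn free block, maps naturally onto a min-heap keyed by wear count. The sketch below covers a single granularity; WMAlloc itself keeps one heap tree per granularity. Names and the wear bookkeeping are illustrative:

```python
import heapq

class WearAwareAllocator:
    """Toy wear-leveling allocator: free blocks sit in a min-heap ordered by
    wear count, so each allocation returns a least-worn block."""

    def __init__(self, num_blocks):
        self.free = [(0, b) for b in range(num_blocks)]  # (wear count, block id)
        heapq.heapify(self.free)

    def alloc(self):
        wear, block = heapq.heappop(self.free)
        return block, wear

    def free_block(self, block, wear):
        # returning a block after use: one more write cycle consumed
        heapq.heappush(self.free, (wear + 1, block))

a = WearAwareAllocator(2)
first = a.alloc()        # block 0, never written
a.free_block(*reversed(first))   # give block 0 back with wear 1
second = a.alloc()       # now block 1 is the least-worn choice
```

Note the heap makes hot blocks naturally sink behind cold ones, which is the wear-leveling effect the allocator is after.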
140. Themis: Malicious Wear Detection and Defense for Persistent Memory File Systems
- Author
-
Xianzhang Chen, Wenbin Wang, Shun Nie, Duo Liu, Chaoshu Yang, and Runyu Zhang
- Subjects
Scheme (programming language) ,File system ,Random access memory ,Hardware_MEMORYSTRUCTURES ,business.industry ,Computer science ,Reliability (computer networking) ,020206 networking & telecommunications ,Linux kernel ,02 engineering and technology ,computer.software_genre ,020202 computer hardware & architecture ,Persistence (computer science) ,Memory management ,0202 electrical engineering, electronic engineering, information engineering ,Set (psychology) ,business ,computer ,Dram ,Computer network ,computer.programming_language - Abstract
Persistent memory file systems can significantly improve performance by utilizing the advanced features of emerging Persistent Memories (PMs). Unfortunately, PMs have limited write endurance, a problem the design of persistent memory file systems usually ignores. Accordingly, write-intensive applications, and especially malicious wear-attack viruses, can quickly damage the underlying PM by calling the common interfaces of the file system to write a few PM cells continuously, which seriously threatens the data reliability of the file system. Existing solutions to this problem within persistent memory file systems are not systematic and ignore the practically unlimited write endurance of DRAM. In this paper, we propose a malicious wear detection and defense mechanism for persistent memory file systems, called Themis, to solve this problem. Themis identifies malicious wear attacks according to the write traffic and the intended lifespan of the PM. We then design a wear-leveling scheme that migrates the writes of malicious wear attackers into DRAM to improve the lifespan of the PM. We implement Themis in the Linux kernel based on NOVA, a state-of-the-art persistent memory file system. Compared with DWARM, the state-of-the-art wear-aware memory management technique, experimental results show that Themis improves the lifetime of the PM by 5774× and performance by 1.13×.
- Published
- 2020
- Full Text
- View/download PDF
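The detection rule implied by the abstract, flag a writer whose traffic would exhaust the PM's endurance before its intended lifespan, can be sketched as a per-page rate check. The endurance figure, lifespan, and redirect policy below are invented for illustration:

```python
class WearGuard:
    """Toy Themis-style detector: if a page's observed write rate exceeds the
    rate the PM can sustain over its target lifespan, treat the writer as a
    wear attacker and redirect its writes to DRAM."""

    def __init__(self, endurance=10**7, lifespan_secs=5 * 365 * 86400):
        # sustainable writes per second per page over the device lifespan
        self.budget_per_sec = endurance / lifespan_secs
        self.writes = {}

    def record_write(self, page, elapsed_secs):
        self.writes[page] = self.writes.get(page, 0) + 1
        rate = self.writes[page] / elapsed_secs
        return "dram" if rate > self.budget_per_sec else "pm"

g = WearGuard()
slow = g.record_write(1, 10**6)  # one write in ~11 days: well within budget
fast = g.record_write(2, 1)      # one write per second on one cell: attack-like
```

A real defense would smooth the rate over windows and handle legitimate bursts; the point here is only the traffic-versus-lifespan comparison.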
141. MobileRE: A Hybrid Fault Tolerance Strategy Combining Erasure Codes and Replicas for Mobile Distributed Cluster
- Author
-
Duo Liu, Zilin Zhang, Yujuan Tan, Xianzhang Chen, Yu Wu, Renping Liu, and Jinting Ren
- Subjects
Distributed Computing Environment ,Dynamic network analysis ,business.industry ,Computer science ,Reliability (computer networking) ,Distributed computing ,Bandwidth (signal processing) ,020206 networking & telecommunications ,Fault tolerance ,02 engineering and technology ,Supercomputer ,Computer data storage ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,business ,Erasure code - Abstract
Fault tolerance techniques are vital for guaranteeing data storage reliability in mobile distributed file systems. Traditional fault tolerance techniques, namely erasure codes and replicas, are well suited to wired data centers. However, they face challenges in mobile distributed environments, where nodes suffer from high failure probability and fluctuating bandwidth. In this paper, we present a hybrid fault tolerance strategy combining erasure codes and replicas for mobile distributed clusters (MobileRE) to improve data reliability under dynamic network conditions. In MobileRE, we first formulate a reliability cost rate to capture the cost of ensuring data reliability in the mobile cluster. MobileRE then adaptively applies erasure coding or replication based on real-time network bandwidth to minimize the system reliability cost rate. Simulation results show that, compared with traditional designs that adopt only erasure codes or only replicas, MobileRE significantly reduces the system reliability cost rate.
- Published
- 2020
- Full Text
- View/download PDF
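The selection step, pick whichever of replication or erasure coding minimizes a reliability cost rate under the current bandwidth, can be shown with a deliberately simple cost model. Every constant below (copy counts, coding rate, encode cost) is invented; MobileRE's actual formulation is richer:

```python
def reliability_cost_rate(strategy, object_mb, bandwidth_mbps):
    """Toy cost model: replicas ship more data but need no encoding; erasure
    codes ship less redundancy but pay a CPU encoding cost."""
    if strategy == "replica":
        transfer = 2 * object_mb / bandwidth_mbps   # two extra full copies
        compute = 0.0
    else:  # erasure code, e.g. RS(4, 2): 0.5x redundancy shipped
        transfer = 0.5 * object_mb / bandwidth_mbps
        compute = 0.05 * object_mb                  # invented encode cost
    return transfer + compute

def pick_strategy(object_mb, bandwidth_mbps):
    return min(("replica", "erasure"),
               key=lambda s: reliability_cost_rate(s, object_mb, bandwidth_mbps))
```

Under this model, fast links favor replicas (transfer is cheap) while slow, fluctuating links favor erasure codes, the adaptivity the abstract describes.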
142. Unified-TP: A Unified TLB and Page Table Cache Structure for Efficient Address Translation
- Author
-
Qingfeng Zhuge, Hong Jiang, Zhichao Yan, Yujuan Tan, Duo Liu, Edwin H.-M. Sha, Xianzhang Chen, Zhulin Ma, and Chengliang Wang
- Subjects
010302 applied physics ,Structure (mathematical logic) ,Scheme (programming language) ,Miss rate ,Hardware_MEMORYSTRUCTURES ,Computer science ,Translation lookaside buffer ,02 engineering and technology ,Parallel computing ,01 natural sciences ,020202 computer hardware & architecture ,Memory management ,0103 physical sciences ,Virtual memory ,0202 electrical engineering, electronic engineering, information engineering ,Cache ,Latency (engineering) ,Page table ,Cache algorithms ,computer ,computer.programming_language - Abstract
To improve the performance of address translation in applications with large memory footprints, techniques such as hugepages and hardware coalescing increase the coverage of the limited hardware translation entries by exploiting contiguous memory allocation, lowering the Translation Lookaside Buffer (TLB) miss rate. Furthermore, Page Table Caches (PTCs) store upper-level page table entries to reduce the TLB miss-handling latency. Both increasing TLB coverage and reducing TLB miss-handling latency have proved effective in speeding up address translation, to a certain extent. Nevertheless, our preliminary studies suggest that the structural separation between TLBs and PTCs in existing computer systems makes these two methods less effective, because each is confined to its own structure. In particular, the separate structures cannot adjust their sizes dynamically according to the workload, resulting in low resource utilization and inefficient address translation. To address these issues, we propose a unified structure, called Unified-TP, which stores PTC and TLB entries together. A modified LRU algorithm identifies cold TLB and PTC entries and dynamically adjusts the numbers of TLB and PTC entries to adapt to different workloads. Furthermore, we introduce a parallel search scheme for handling memory access requests. Our experimental results show that Unified-TP reduces the number of TLB misses by an average of 35.69% and improves performance by an average of 11.12% compared with separately structured TLBs and PTCs.
- Published
- 2020
- Full Text
- View/download PDF
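The unification idea can be shown with a single LRU store holding both entry kinds: because they compete for the same slots, the TLB/PTC split adapts to the workload instead of being fixed. This is a toy model of the concept, not the paper's hardware design:

```python
from collections import OrderedDict

class UnifiedTP:
    """Toy unified store: TLB entries and page-table-cache (PTC) entries share
    one LRU-ordered structure, so the hotter kind naturally takes more slots."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # key: ("tlb", vpn) or ("ptc", upper-level index)

    def access(self, kind, key):
        k = (kind, key)
        hit = k in self.store
        if hit:
            self.store.move_to_end(k)            # refresh recency
        else:
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)   # evict coldest entry, either kind
            self.store[k] = True
        return hit

    def share(self, kind):
        return sum(1 for (t, _) in self.store if t == kind)

u = UnifiedTP(4)
for vpn in range(4):
    u.access("tlb", vpn)         # TLB entries fill the whole structure
u.access("ptc", 7)               # a PTC entry displaces the coldest TLB entry
```

A fixed split would have reserved PTC slots even while the workload was purely TLB-bound; the shared store avoids that waste.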
143. LOFFS: A Low-Overhead File System for Large Flash Memory on Embedded Devices
- Author
-
Zhaoyan Shen, Duo Liu, Yujuan Tan, Chaoshu Yang, Xiongxiong She, Zili Shao, Runyu Zhang, and Xianzhang Chen
- Subjects
File system ,050210 logistics & transportation ,Computer science ,YAFFS ,business.industry ,05 social sciences ,02 engineering and technology ,Construct (python library) ,computer.software_genre ,Flash memory ,020202 computer hardware & architecture ,Flash (photography) ,Embedded system ,0502 economics and business ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,business ,computer ,Booting - Abstract
Emerging applications such as machine learning on embedded devices (e.g., satellites and vehicles) require huge storage space, which has recently stimulated the widespread deployment of large-capacity flash memory in IoT devices. However, existing embedded file systems fall short in managing large-capacity storage efficiently, owing to excessive memory consumption and poor booting performance. In this paper, we propose a novel embedded file system, LOFFS, to tackle these issues and manage large-capacity NAND flash on resource-limited embedded devices. We redesign the space management mechanisms and construct hybrid file structures to achieve high performance with minimal resource occupation. We have implemented LOFFS in Linux, and the experimental results show that LOFFS outperforms YAFFS by 55.8% on average with orders-of-magnitude reductions in memory footprint.
- Published
- 2020
- Full Text
- View/download PDF
144. Efficient Multi-Grained Wear Leveling for Inodes of Persistent Memory File Systems
- Author
-
Qingfeng Zhuge, Shun Nie, Chaoshu Yang, Xianzhang Chen, Fengshun Wang, Duo Liu, Edwin H.-M. Sha, and Runyu Zhang
- Subjects
File system ,Computer science ,0211 other engineering and technologies ,Linux kernel ,02 engineering and technology ,inode ,computer.software_genre ,020202 computer hardware & architecture ,Persistence (computer science) ,0202 electrical engineering, electronic engineering, information engineering ,Operating system ,Table (database) ,computer ,Wear leveling ,021106 design practice & management - Abstract
Existing persistent memory file systems usually store inodes in fixed locations, which ignores both the external and internal imbalanced wear of inodes on the persistent memory (PM). As a result, the PM regions that store inodes can be damaged easily. Existing solutions achieve low wear-leveling accuracy with high-overhead data migrations. In this paper, we propose a Lightweight and Multi-grained Wear-leveling Mechanism, called LMWM, to solve these problems. We implement LMWM in the Linux kernel based on NOVA, a typical persistent memory file system. Compared with MARCH, the state-of-the-art wear-leveling mechanism for the inode table, experimental results show that LMWM improves the lifetime of the PM by 2.5× and performance by 1.12×.
- Published
- 2020
- Full Text
- View/download PDF
145. SSDKeeper: Self-Adapting Channel Allocation to Improve the Performance of SSD Devices
- Author
-
Runyu Zhang, Duo Liu, Xianzhang Chen, Liang Liang, Yujuan Tan, and Renping Liu
- Subjects
020203 distributed computing ,Hardware_MEMORYSTRUCTURES ,Channel allocation schemes ,business.industry ,Computer science ,Distributed computing ,0202 electrical engineering, electronic engineering, information engineering ,Temporal isolation among virtual machines ,Data center ,02 engineering and technology ,business ,020202 computer hardware & architecture - Abstract
Solid state drives (SSDs) are widely deployed in high-performance data center environments, where multiple tenants usually share the same hardware. However, traditional SSDs distribute tenants' incoming data uniformly across all SSD channels, which leads to numerous access conflicts. Meanwhile, SSDs that statically allocate one or several channels to each tenant sacrifice device parallelism and capacity. When SSDs are shared by tenants with different access patterns, inappropriate channel allocation degrades SSD performance. In this paper, we propose a self-adapting channel allocation mechanism, named SSDKeeper, for multiple tenants sharing one SSD. SSDKeeper employs a machine-learning-assisted algorithm to take full advantage of SSD parallelism while providing performance isolation. By collecting multi-tenant access patterns and training a model, SSDKeeper selects the channel allocation strategy with the lowest overall response latency for the tenants. Experimental results show that SSDKeeper improves overall performance by 24% with negligible overhead.
- Published
- 2020
- Full Text
- View/download PDF
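The selection step, score candidate channel allocations with a latency model and keep the best, can be sketched with an invented stand-in for the learned model. Here two tenants split the channels contiguously and latency grows with demand per channel; all of this is hypothetical:

```python
def predicted_latency(allocation, demand):
    """Stand-in for SSDKeeper's learned model: a tenant's latency scales with
    its demand divided by its channel count; overall figure is the worst tenant."""
    return max(d / len(ch) for d, ch in zip(demand, allocation))

def best_allocation(num_channels, demand):
    """Enumerate contiguous two-tenant splits and keep the one with the lowest
    predicted overall latency. (Real search spaces are larger, of course.)"""
    best, best_split = float("inf"), None
    for cut in range(1, num_channels):
        split = (list(range(cut)), list(range(cut, num_channels)))
        lat = predicted_latency(split, demand)
        if lat < best:
            best, best_split = lat, split
    return best_split

# tenant 0 issues 3x the traffic of tenant 1 on an 8-channel SSD
split = best_allocation(8, (6, 2))
```

The heavier tenant ends up with proportionally more channels, which is the isolation-with-parallelism trade-off the abstract describes.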
146. Optimizing Performance of Persistent Memory File Systems using Virtual Superpages
- Author
-
Duo Liu, Shun Nie, Edwin H.-M. Sha, Runyu Zhang, Chaoshu Yang, Qingfeng Zhuge, and Xianzhang Chen
- Subjects
010302 applied physics ,File system ,Hardware_MEMORYSTRUCTURES ,Data consistency ,Write amplification ,Computer science ,Translation lookaside buffer ,Linux kernel ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,020202 computer hardware & architecture ,Persistence (computer science) ,Non-volatile memory ,Virtual address space ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Operating system ,Overhead (computing) ,computer ,Data migration - Abstract
Existing persistent memory file systems can significantly improve performance by utilizing the advantages of emerging Persistent Memories (PMs). In particular, they can employ superpages (e.g., 2MB pages) of PM to alleviate the overhead of locating file data and to reduce TLB misses. Unfortunately, superpages also induce two critical problems. First, maintaining data consistency with superpages causes severe write amplification when file data is overwritten. Second, existing superpage management may waste large amounts of PM space. In this paper, we propose a Virtual Superpage Mechanism (VSM) to solve these problems by taking advantage of the virtual address space. On one hand, VSM adopts a multi-grained copy-on-write mechanism to reduce write amplification while ensuring data consistency. On the other hand, VSM provides a zero-copy file data migration mechanism to eliminate the loss of space utilization caused by superpages. We implement VSM in the Linux kernel based on PMFS. Compared with the original PMFS and NOVA, the experimental results show that VSM improves write and read performance by 36% and 14% on average, respectively. Meanwhile, VSM achieves the same space utilization as a file system that organizes files with normal 4KB pages.
- Published
- 2020
- Full Text
- View/download PDF
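The multi-grained copy-on-write idea, copy only the 4KB sub-pages an overwrite touches rather than the whole 2MB superpage, can be sketched over a list-of-pages representation. This is an illustration of the granularity trick, not VSM's kernel implementation:

```python
PAGE = 4096  # a 2MB superpage is modelled as a list of 512 such sub-pages

def cow_overwrite(superpage, offset, data):
    """Copy-on-write at sub-page granularity: clone and patch only the 4KB
    pages the write range [offset, offset+len(data)) touches, sharing the rest."""
    end = offset + len(data)
    first, last = offset // PAGE, (end - 1) // PAGE
    new = list(superpage)                    # shallow copy: untouched pages shared
    for idx in range(first, last + 1):
        page = bytearray(superpage[idx])     # copy only this touched sub-page
        lo = max(offset, idx * PAGE) - idx * PAGE
        hi = min(end, (idx + 1) * PAGE) - idx * PAGE
        src = max(offset, idx * PAGE) - offset
        page[lo:hi] = data[src:src + (hi - lo)]
        new[idx] = bytes(page)
    return new

sp = [bytes(PAGE) for _ in range(512)]           # zero-filled 2MB superpage
new = cow_overwrite(sp, PAGE - 2, b"ABCD")       # write straddles sub-pages 0 and 1
```

A whole-superpage CoW would have copied all 512 pages for this 4-byte write; here only the two touched sub-pages are duplicated, which is exactly the write amplification being avoided.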
147. Transport signatures of relativistic quantum scars in a graphene cavity
- Author
-
Hongqi Xu, Ning Kang, Zhongfan Liu, Li Lin, Xianzhang Chen, Guoming Zhang, Hailin Peng, and Liang Huang
- Subjects
Physics ,Condensed Matter - Mesoscale and Nanoscale Physics ,Condensed matter physics ,FOS: Physical sciences ,Fermi energy ,02 engineering and technology ,021001 nanoscience & nanotechnology ,01 natural sciences ,Quantum chaos ,Magnetic field ,Relativistic particle ,Vortex ,Contour line ,Mesoscale and Nanoscale Physics (cond-mat.mes-hall) ,0103 physical sciences ,Quantum system ,010306 general physics ,0210 nano-technology ,Wave function - Abstract
We study a relativistic quantum cavity system, realized by etching a cavity out of a graphene sheet, through quantum transport measurements and theoretical calculations. The conductance of the graphene cavity is measured as a function of the back-gate voltage (i.e., the Fermi energy) and the magnetic field applied perpendicular to the graphene sheet, and characteristic conductance contour patterns are observed. In particular, two types of high-conductance contour lines, straight and parabolic-like, are found in the measurements. The theoretical calculations, performed within the framework of the tight-binding approach and the Green's function formalism, reproduce similar characteristic high-conductance contour features. The wave functions calculated at points selected along a straight conductance contour line are dominated by a chain of scars of high probability density arranged as a necklace following the shape of the cavity, and the current density distributions calculated at these points are dominated by an overall vortex in the cavity; these characteristics are insensitive to increasing magnetic field. In contrast, the wave function probability distributions and current density distributions calculated at points along a parabolic-like contour line show a clear dependence on increasing magnetic field, and the current density distributions at these points are characterized by the complex formation of several localized vortices in the cavity. Our work brings new insight into quantum chaos in relativistic particle systems and should greatly stimulate experimental and theoretical efforts in this still-emerging field. Comment: 20 pages, 6 figures
- Published
- 2020
- Full Text
- View/download PDF
148. Heterogeneous FPGA-Based Cost-Optimal Design for Timing-Constrained CNNs
- Author
-
Lei Yang, Qingfeng Zhuge, Jingtong Hu, Edwin H.-M. Sha, Weiwen Jiang, and Xianzhang Chen
- Subjects
010302 applied physics ,Optimization problem ,Speedup ,Cost efficiency ,Data parallelism ,Computer science ,Pipeline (computing) ,Task parallelism ,02 engineering and technology ,01 natural sciences ,Computer Graphics and Computer-Aided Design ,020202 computer hardware & architecture ,Dynamic programming ,Reduction (complexity) ,Memory management ,Computer engineering ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,Software - Abstract
Field-programmable gate arrays (FPGAs) have been among the most popular platforms for implementing convolutional neural networks (CNNs) due to their high performance and cost efficiency; however, limited by on-chip resources, existing single-FPGA architectures cannot fully exploit the parallelism in CNNs. In this paper, we explore heterogeneous FPGA-based designs that effectively leverage both task and data parallelism, such that the resultant system achieves the minimum cost while satisfying timing constraints. To maximize task parallelism, we investigate two critical problems: 1) buffer placement , where to place buffers to partition CNNs into pipeline stages, and 2) task assignment , which type of FPGA should implement each CNN layer. We first formulate the system-level optimization problem as a mixed integer linear programming model. Then, we propose an efficient dynamic programming algorithm to obtain optimal solutions. On top of that, we devise an efficient algorithm that exploits data parallelism within CNN layers to further improve cost efficiency. Evaluations on well-known CNNs demonstrate that the proposed techniques obtain an average of 30.82% reduction in system cost under the same timing constraint, and an average 1.5× speedup in performance under the same cost budget, compared with the state-of-the-art techniques.
- Published
- 2018
- Full Text
- View/download PDF
149. Towards the Design of Efficient and Consistent Index Structure with Minimal Write Activities for Non-Volatile Memory
- Author
-
Runyu Zhang, Xianzhang Chen, Zhulin Ma, Weiwen Jiang, Edwin H.-M. Sha, Hailiang Dong, and Qingfeng Zhuge
- Subjects
Speedup, CPU cache, Computer science, Search engine indexing, Linked list, Parallel computing, Data structure, Theoretical Computer Science, Database index, Tree (data structure), Tree structure, Computational Theory and Mathematics, Data retrieval, Hardware and Architecture, Search algorithm, Software - Abstract
Index structures can significantly accelerate data retrieval operations in data-intensive systems, such as databases. Tree structures, such as the B$^+$-tree and its variants, are commonly employed as index structures; however, we find that tree structures may not be appropriate for Non-Volatile Memory (NVM) given the requirements for high performance and high endurance. This paper studies what the best index structure for NVM-based systems is and how to design such an index structure. The design of an NVM-friendly index structure faces several challenges. First, to prolong the lifetime of NVM, write activities on NVM should be minimized; to this end, the index structure should be as simple as possible. The index proposed in this paper is based on the simplest data structure, i.e., the linked list. Second, this simple structure makes high-performance data retrieval operations challenging to achieve. To overcome this challenge, we design a novel technique that explicitly builds a contiguous virtual address space over the linked list, such that efficient search algorithms can be performed. Third, we must carefully consider data consistency issues in NVM-based systems, because the order of memory writes may be changed and the data in NVM may become inconsistent due to the write-back effects of the CPU cache. This paper devises a novel indexing scheme, called "Virtual Linear Addressable Buckets" (VLAB). We implement VLAB in a storage engine and plug it into MySQL. Evaluations are conducted on an NVDIMM workstation using YCSB workloads and real-world traces. Results show that state-of-the-art indexes incur 6.98 times more write activity than VLAB; meanwhile, VLAB achieves a 2.53-times speedup.
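The core VLAB idea described above, overlaying a contiguous "virtual address" table on a sorted linked list so that binary search becomes possible, can be illustrated with a minimal sketch. The class and method names are hypothetical, and a real NVM design would patch the address table incrementally and order its writes for consistency rather than rebuild the table as done here.

```python
import bisect

class Bucket:
    """A node of the sorted linked list; holds a single key for simplicity."""
    def __init__(self, key, value):
        self.key, self.value, self.next = key, value, None

class VLABIndex:
    """Sketch of a linked-list index with a contiguous position->bucket
    table (the 'virtual address space') enabling binary search."""
    def __init__(self):
        self.head = None
        self.vtable = []  # position -> bucket, kept in key order

    def insert(self, key, value):
        node = Bucket(key, value)
        # Splice into the sorted linked list: few pointer writes,
        # keeping NVM write activity low in spirit.
        if self.head is None or key < self.head.key:
            node.next, self.head = self.head, node
        else:
            cur = self.head
            while cur.next and cur.next.key < key:
                cur = cur.next
            node.next, cur.next = cur.next, node
        # Rebuild the virtual address table (a real design would patch it).
        self.vtable, cur = [], self.head
        while cur:
            self.vtable.append(cur)
            cur = cur.next

    def search(self, key):
        # Binary search over the contiguous table instead of an O(n) walk.
        i = bisect.bisect_left([b.key for b in self.vtable], key)
        if i < len(self.vtable) and self.vtable[i].key == key:
            return self.vtable[i].value
        return None
```

The linked list keeps updates cheap; the table restores the logarithmic lookups that a tree would otherwise provide.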
- Published
- 2018
- Full Text
- View/download PDF
150. A machine learning assisted data placement mechanism for hybrid storage systems
- Author
-
Duo Liu, Jinting Ren, Xianzhang Chen, Moming Duan, Liang Liang, Yujuan Tan, and Ruolan Li
- Subjects
Hybrid storage system, Computer science, Machine learning, Mechanism (engineering), File size, Data access, Hardware and Architecture, Key (cryptography), Hybrid storage, Artificial intelligence, Software, Data placement - Abstract
Emerging applications produce massive numbers of files that differ in size, lifetime, and read/write frequency. Existing hybrid storage systems place these files onto different storage media assuming that the access patterns of files are fixed. However, we find that the access patterns of files change during their lifetime. The key to improving file access performance is to adaptively place files on the hybrid storage system using run-time status and the properties of both the files and the storage media. In this paper, we propose a machine learning assisted data placement mechanism that adaptively places files onto the proper storage medium by predicting their access patterns. We design a PMFS-based tracer to collect file access features for prediction and show how this approach adapts to changing access patterns. Based on the data access prediction results, we present a linear data placement algorithm to optimize data access performance on the hybrid storage media. Extensive experimental results show that the proposed learning algorithm achieves over 90% accuracy in predicting file access patterns. Meanwhile, the proposed mechanism achieves over 17% improvement in system performance for file accesses compared with state-of-the-art linear-time data placement methods.
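The predict-then-place flow summarized above can be sketched with a tiny k-NN classifier standing in for the paper's learned access-pattern predictor. The feature set, training samples, and medium names below are hypothetical assumptions, not the paper's actual model or data.

```python
import math

# Hypothetical training set:
# (file_size_kb, reads_per_day, writes_per_day) -> access pattern label
TRAIN = [
    ((4, 120, 30), "hot"),
    ((8, 200, 10), "hot"),
    ((512, 2, 1), "cold"),
    ((2048, 0, 0), "cold"),
]

def predict(features, k=3):
    """k-NN stand-in for the learned access-pattern predictor:
    majority vote among the k nearest training samples."""
    nearest = sorted(TRAIN, key=lambda t: math.dist(t[0], features))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def place(features):
    """Map the predicted pattern to a storage medium:
    frequently accessed files go to the fast tier."""
    return "NVM" if predict(features) == "hot" else "SSD"
```

A small, frequently read file lands on the fast tier (`place((4, 100, 20))` gives `"NVM"`), while a large, rarely touched file goes to the slow one (`place((1024, 1, 0))` gives `"SSD"`).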
- Published
- 2021
- Full Text
- View/download PDF