2,203 results for "Memory footprint"
Search Results
102. Optimizing Applications
- Author
-
Virkus, Robert
- Published
- 2005
- Full Text
- View/download PDF
103. Optimistically Compressed Hash Tables & Strings in the USSR
- Author
-
Peter Boncz, Viktor Leis, and Tim Gubner
- Subjects
Computer science ,Value (computer science) ,02 engineering and technology ,Parallel computing ,Hash table ,Prefix ,String operations ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Benchmark (computing) ,Memory footprint ,020201 artificial intelligence & image processing ,Tuple ,Software ,Information Systems ,Integer (computer science) - Abstract
Modern query engines rely heavily on hash tables for query processing. Overall query performance and memory footprint are often determined by how hash tables and the tuples within them are represented. In this work, we propose three complementary techniques to improve this representation: Domain-Guided Prefix Suppression bit-packs keys and values tightly to reduce hash table record width (the bit-packing idea is sketched after this entry). Optimistic Splitting decomposes values (and operations on them) into (operations on) frequently and infrequently accessed value slices; by removing the infrequently accessed value slices from the hash table record, it improves cache locality. The Unique Strings Self-aligned Region (USSR) accelerates the handling of frequently occurring strings, which are widespread in real-world data sets, by creating an on-the-fly dictionary of the most frequent strings. This allows executing many string operations with integer logic and reduces memory pressure. We integrated these techniques into Vectorwise. On the TPC-H benchmark, our approach reduces peak memory consumption by 2-4× and improves performance by up to 1.5×. On a real-world BI workload, we measured a 2× improvement in performance, and in micro-benchmarks we observed speedups of up to 25×.
- Published
- 2021
- Full Text
- View/download PDF
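The abstract names the technique but not its mechanics, so here is a minimal, hypothetical Python sketch of domain-guided bit-packing, the idea underlying prefix suppression: once each key column's min/max are known, every value needs only a few bits of offset, so several columns fit in one machine word. This is illustrative only, not Vectorwise's implementation; the function names and domain values are invented.

```python
# Sketch of domain-guided bit-packing: store each value as an offset from
# its column's minimum, using only the bits the domain actually needs.

def bits_needed(lo: int, hi: int) -> int:
    """Bits required to represent any value in the domain [lo, hi]."""
    return max(1, (hi - lo).bit_length())

def pack(values, domains):
    """Pack one tuple of column values into a single integer record."""
    word, shift = 0, 0
    for v, (lo, hi) in zip(values, domains):
        word |= (v - lo) << shift          # offset from the domain minimum
        shift += bits_needed(lo, hi)
    return word

def unpack(word, domains):
    out = []
    for lo, hi in domains:
        w = bits_needed(lo, hi)
        out.append((word & ((1 << w) - 1)) + lo)
        word >>= w
    return out

domains = [(1990, 2025), (0, 99), (100000, 165535)]  # per-column min/max
record = pack([2021, 42, 123456], domains)           # 29 bits, not 3 words
assert unpack(record, domains) == [2021, 42, 123456]
```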
104. A Memory-Efficient Implementation of Perfectly Matched Layer With Smoothly Varying Coefficients in Discontinuous Galerkin Time-Domain Method
- Author
-
Liang Chen, Mehmet Burak Ozakin, Shehab Ahmed, and Hakan Bagci
- Subjects
FOS: Computer and information sciences ,Physics ,Attenuation ,Computation ,Mathematical analysis ,FOS: Physical sciences ,020206 networking & telecommunications ,Numerical Analysis (math.NA) ,02 engineering and technology ,Computational Physics (physics.comp-ph) ,Method of moments (statistics) ,Mass matrix ,Mathematics::Numerical Analysis ,Computational Engineering, Finance, and Science (cs.CE) ,Perfectly matched layer ,Reflection (mathematics) ,Discontinuous Galerkin method ,FOS: Mathematics ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,Mathematics - Numerical Analysis ,Electrical and Electronic Engineering ,Computer Science - Computational Engineering, Finance, and Science ,Physics - Computational Physics - Abstract
Wrapping a computation domain with a perfectly matched layer (PML) is one of the most effective methods of imitating/approximating the radiation boundary condition in Maxwell and wave equation solvers. PML implementations often use a smoothly increasing attenuation coefficient to increase the absorption for a given layer thickness and, at the same time, to reduce the numerical reflection from the interface between the computation domain and the PML. In discontinuous Galerkin time-domain (DGTD) methods, using a PML coefficient that varies within a mesh element requires a different mass matrix to be stored for every element and therefore significantly increases the memory footprint. In this work, this bottleneck is addressed by applying a weight-adjusted approximation to these mass matrices (the standard form of this approximation is sketched after this entry). The resulting DGTD scheme has the same advantages as the scheme that stores individual mass matrices, namely, higher accuracy (due to reduced numerical reflection) and increased meshing flexibility (since the PML does not have to be defined layer by layer), but it requires significantly less memory.
- Published
- 2021
- Full Text
- View/download PDF
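For reference, the weight-adjusted approximation has a standard form in the DG literature; the abstract does not spell it out, so the following is a sketch of the known identity rather than the paper's exact notation. Here M is the element's unweighted mass matrix and w(x) is the smoothly varying PML coefficient acting as a weight:

```latex
% Weight-adjusted mass-matrix approximation (standard form; the paper's
% notation may differ). M is the unweighted element mass matrix.
(M_w)_{ij} = \int_K w(\mathbf{x})\,\phi_i\,\phi_j \,d\mathbf{x}
  \;\approx\; \bigl( M\, M_{1/w}^{-1}\, M \bigr)_{ij},
\qquad
(M_{1/w})_{ij} = \int_K \frac{\phi_i\,\phi_j}{w(\mathbf{x})}\,d\mathbf{x},
% so the inverse needed by an explicit time stepper becomes
M_w^{-1} \;\approx\; M^{-1}\, M_{1/w}\, M^{-1}.
```

The practical payoff is that M⁻¹ can be shared across elements via the reference element, and applying M_{1/w} only requires quadrature, so no per-element factorized mass matrix has to be stored.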
105. Opportunistic Caching in NoC: Exploring Ways to Reduce Miss Penalty
- Author
-
Abhijit Das, Maurizio Palesi, Abhishek Kumar, and John Jose
- Subjects
Router, Computer science, Buffer storage, System-on-chip, Hardware, Program processors, System performance, Production, Routing, 02 engineering and technology, Theoretical Computer Science, 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), Hardware_MEMORYSTRUCTURES, business.industry, Provisioning, 020202 computer hardware & architecture, Network congestion, Computational Theory and Mathematics, Hardware and Architecture, Dynamic demand, Memory footprint, Cache, business, Software, Computer network - Abstract
Due to limited on-chip caching, data-driven applications with large memory footprints encounter frequent cache misses. Such applications suffer a recurring miss penalty when they re-reference recently evicted cache blocks. To meet worst-case performance requirements, Network-on-Chip (NoC) routers are provisioned with input port buffers. However, recent studies reveal that these buffers remain underutilised except during network congestion. Trace buffers are Design-for-Debug (DfD) hardware employed in NoC routers for post-silicon debug and validation; nevertheless, they become non-functional once a design goes into production and remain unused in the routers. In this article, we exploit the underutilised NoC router buffers and the unused trace buffers to store recently evicted cache blocks. While these blocks are stored in the buffers, future re-references to them can be served from the NoC router. Such opportunistic caching of evicted blocks in NoC routers significantly reduces the miss penalty. Experimental analysis shows that the proposed architectures can achieve up to 21 percent (16 percent on average) reduction in miss penalty and 19 percent (14 percent on average) improvement in overall system performance. The design incurs negligible area and leakage power overheads of 2.58 and 3.94 percent, respectively, while dynamic power decreases by 6.12 percent due to the improvement in performance.
- Published
- 2021
- Full Text
- View/download PDF
106. More Accurate Streaming Cardinality Estimation With Vectorized Counters
- Author
-
Giuseppe Bianchi, Salvatore Pontarelli, Pedro Reviriego, Daniel Ting, and Valerio Bruschi
- Subjects
Network monitoring, Computer science, Hash function, Approximation algorithm, Set (abstract data type), high speed networks, Range (mathematics), Cardinality, Memory management, HyperLogLog, Approximation error, Memory footprint, Algorithm - Abstract
Cardinality estimation, also known as count-distinct, is the problem of finding the number of different elements in a set with repeated elements. Among the many approximate algorithms proposed for this task, HyperLogLog (HLL) has established itself as the state of the art due to its ability to accurately estimate cardinality over a large range of values using a small memory footprint (the baseline HLL update is sketched after this entry). When elements arrive in a stream, as in most networking applications, improved techniques are possible. We specifically propose a new algorithm that improves the accuracy of cardinality estimation by grouping counters, and by using their new organization to further track all updates within a given counter size range (compared with just the last update, as in the standard HLL). Results show that, when using the same number of counters, one configuration of the new scheme reduces the relative error to approximately 0.86× that of the streaming HLL using the same amount of memory, and another configuration achieves similar accuracy while reducing the memory needed to approximately 0.85×.
- Published
- 2021
- Full Text
- View/download PDF
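For context, the scheme the authors improve upon is the standard HyperLogLog register update: hash each element, let the first p bits pick a register, and keep the maximum leading-zero rank observed. A minimal sketch, with SHA-1 as a stand-in hash; the paper's grouped, vectorized counters replace the plain register array used here.

```python
# Baseline HyperLogLog register update (the standard scheme the paper's
# grouped, vectorized counters improve upon). SHA-1 is a stand-in hash.
import hashlib

P = 12                        # precision: 2^12 = 4096 registers
M = [0] * (1 << P)

def hll_add(item: bytes) -> None:
    h = int.from_bytes(hashlib.sha1(item).digest()[:8], "big")  # 64-bit hash
    idx = h >> (64 - P)                    # first P bits select a register
    rest = h & ((1 << (64 - P)) - 1)       # remaining 52 bits
    rank = (64 - P) - rest.bit_length() + 1  # leading-zero count + 1
    M[idx] = max(M[idx], rank)             # register keeps only the maximum

for i in range(100_000):
    hll_add(str(i).encode())

# The registers then feed the usual harmonic-mean estimator (omitted);
# memory stays at 4096 small counters regardless of the true cardinality.
```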
107. Implementation of Model Predictive Control in Programmable Logic Controllers
- Author
-
Pablo Krupa, Daniel Limon, and Teodoro Alamo
- Subjects
0209 industrial biotechnology ,021103 operations research ,Optimization problem ,Computer science ,Multivariable calculus ,0211 other engineering and technologies ,Programmable logic controller ,Control engineering ,02 engineering and technology ,Footprint ,Model predictive control ,020901 industrial engineering & automation ,Control and Systems Engineering ,Control theory ,Memory footprint ,Code generation ,Electrical and Electronic Engineering - Abstract
In this article, we present an implementation of a low-memory-footprint model predictive control (MPC) based controller in programmable logic controllers (PLCs). Automatic code generation of standardized IEC 61131-3 PLC programming languages is used to solve the MPC optimization problem online. The implementation is designed for application in a realistic industrial environment, including timing considerations and accounting for the possibility of the PLC not being exclusively dedicated to the MPC controller. We describe the controller architecture and algorithm, report how the controller's memory footprint scales with the problem dimensions, and present the results of its implementation to control a hardware-in-the-loop multivariable chemical plant.
- Published
- 2021
- Full Text
- View/download PDF
108. Binary Precision Neural Network Manycore Accelerator
- Author
-
Tinoosh Mohsenin and Morteza Hosseini
- Subjects
010302 applied physics ,business.industry ,Cycles per instruction ,Computer science ,Deep learning ,Clock rate ,02 engineering and technology ,01 natural sciences ,020202 computer hardware & architecture ,Instruction set ,Hardware and Architecture ,Multilayer perceptron ,0103 physical sciences ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,Artificial intelligence ,Electrical and Electronic Engineering ,business ,Bitwise operation ,Software ,Computer hardware - Abstract
This article presents a low-power, programmable, domain-specific manycore accelerator, the Binarized neural Network Manycore Accelerator (BiNMAC), which adopts and efficiently executes binary precision weight/activation neural network models. Such networks have compact models in which weights are constrained to only 1 bit, so several can be packed into one memory entry, minimizing the memory footprint. Packing weights also facilitates executing single-instruction, multiple-data operations with simple circuitry, maximizing performance and efficiency. The proposed BiNMAC has lightweight cores that support domain-specific instructions, and a router-based memory access architecture that helps with efficient implementation of layers in binary precision weight/activation neural networks of proper size. With only 3.73% and 1.98% area and average power overhead, respectively, novel instructions such as Combined Population-Count-XNOR, Patch-Select, and Bit-based Accumulation are added to the instruction set architecture of the BiNMAC, each of which replaces the execution of frequently used functions with 1 clock cycle where they would otherwise have taken 54, 4, and 3 clock cycles, respectively (the XNOR/population-count dot product is sketched after this entry). Additionally, customized logic is added to every core to transpose 16×16-bit blocks of memory on a bit-level basis, which expedites reshaping intermediate data to be well-aligned for bitwise operations. A 64-cluster architecture of the BiNMAC is fully placed and routed in 65-nm TSMC CMOS technology, where a single cluster occupies an area of 0.53 mm² with an average power of 232 mW at 1-GHz clock frequency and 1.1 V. The 64-cluster architecture takes 36.5 mm² of area and, if fully exploited, consumes a total power of 16.4 W and can perform 1,360 giga operations per second (GOPS) while providing full programmability. To demonstrate its scalability, four binarized case studies, including ResNet-20 and LeNet-5 for high-performance image classification as well as a ConvNet and a multilayer perceptron for low-power physiological applications, were implemented on BiNMAC. The implementation results indicate that the population-count instruction alone can expedite performance by approximately 5×. When the other new instructions are added to a RISC machine with an existing population-count instruction, performance increases by 58% on average. To compare the performance of the BiNMAC with commercial off-the-shelf platforms, the case studies with their double-precision floating-point models were also implemented on the NVIDIA Jetson TX2 SoC (CPU+GPU). The results indicate that, within a margin of approximately 2.1%-9.5% accuracy loss, BiNMAC on average outperforms the TX2 GPU by approximately 1.9× (or 7.5× with fabrication technology scaled) in energy consumption for image classification applications. In low-power settings, and within a margin of approximately 3.7%-5.5% accuracy loss compared to an ARM Cortex-A57 CPU implementation, BiNMAC is roughly 9.7×-17.2× (or 38.8×-68.8× with fabrication technology scaled) more energy efficient for physiological applications while meeting the application deadline.
- Published
- 2021
- Full Text
- View/download PDF
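The Combined Population-Count-XNOR instruction accelerates the core kernel of binary networks: a dot product of {-1,+1} vectors computed as XNOR followed by a popcount. A minimal sketch of that arithmetic identity in Python (the identity itself, not BiNMAC's ISA):

```python
# Dot product of two {-1,+1}^n vectors stored as n-bit integers:
# bit 1 encodes +1, bit 0 encodes -1. XNOR marks agreeing positions,
# popcount counts them, and the result is agreements minus disagreements.
import random

N = 64  # vector length = word width

def binary_dot(a_bits: int, b_bits: int, n: int = N) -> int:
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask       # 1 where the signs agree
    matches = bin(xnor).count("1")         # population count
    return 2 * matches - n

# Check against a naive per-element reference.
a, b = random.getrandbits(N), random.getrandbits(N)
ref = sum((1 if a >> i & 1 else -1) * (1 if b >> i & 1 else -1)
          for i in range(N))
assert binary_dot(a, b) == ref
```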
109. A Theory of Persistent Containers and Its Application to Ada
- Author
-
Alves, Mário Amado, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Dough, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Llamosí, Albert, editor, and Strohmeier, Alfred, editor
- Published
- 2004
- Full Text
- View/download PDF
110. Design of Energy Efficient Wireless Networks Using Dynamic Data Type Refinement Methodology
- Author
-
Mamagkakis, Stylianos, Mpartzas, Alexandros, Pouiklis, Georgios, Atienza, David, Catthoor, Francky, Soudris, Dimitrios, Mendias, Jose Manuel, Thanailakis, Antonios, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Langendoerfer, Peter, editor, Liu, Mingyan, editor, Matta, Ibrahim, editor, and Tsaoussidis, Vassilis, editor
- Published
- 2004
- Full Text
- View/download PDF
111. Optimizing Code with GCC
- Author
-
Wall, Kurt and Von Hagen, William
- Published
- 2004
- Full Text
- View/download PDF
112. Power Estimation Approach of Dynamic Data Storage on a Hardware Software Boundary Level
- Author
-
Leeman, Marc, Atienza, David, Catthoor, Francky, De Florio, V., Deconinck, G., Mendias, J. M., Lauwereins, R., Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Chico, Jorge Juan, editor, and Macii, Enrico, editor
- Published
- 2003
- Full Text
- View/download PDF
113. Optimization: Memory Footprint
- Author
-
Blunden, Bill
- Published
- 2003
- Full Text
- View/download PDF
114. Accurate parallel reconstruction of unstructured datasets on rectilinear grids
- Author
-
Xavier Tricoche, Raine Yeh, and Dana El-Rushaidat
- Subjects
Computer science ,Computation ,020207 software engineering ,02 engineering and technology ,Condensed Matter Physics ,Supercomputer ,Grid ,01 natural sciences ,Regularization (mathematics) ,010305 fluids & plasmas ,Visualization ,Computational science ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,Curve fitting ,Electrical and Electronic Engineering ,Block (data storage) - Abstract
High performance computing simulations often produce datasets defined over unstructured grids. Those grids allow for local refinement of the resolution and can accommodate arbitrary boundary geometry. From a visualization standpoint, however, such grids have a high storage cost, require special spatial data structures, and make the computation of high-quality derivatives challenging. Rectilinear grids, in contrast, have a negligible memory footprint and readily support smooth data reconstruction, though with reduced geometric flexibility. The present work is concerned with the creation of an accurate reconstruction of large unstructured datasets on rectilinear grids. We present an efficient method to automatically determine the geometry of a rectilinear grid upon which a low-error data reconstruction can be achieved with a given reconstruction kernel. Using this rectilinear grid, we address the potential ill-posedness of the data fitting problem, as well as the necessary balance between smoothness and accuracy, through a bi-level smoothness regularization (a toy regularized fit is sketched after this entry). To tackle the computational challenge posed by very large input datasets and high-resolution reconstructions, we propose a block-based approach that allows us to obtain a seamless global approximation from a set of independently computed sparse least-squares problems. Results for several 3D datasets demonstrate the quality of the visualizations our reconstruction enables, at greatly reduced computational and memory cost.
- Published
- 2021
- Full Text
- View/download PDF
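As a toy illustration of fitting scattered data onto a rectilinear grid with a smoothness regularizer, here is a single-level, 1D stand-in for the paper's bi-level 3D scheme; the basis, grid size, and regularization weight are all invented:

```python
# Smoothness-regularized least-squares fit of scattered samples onto a
# 1D rectilinear grid (hat-function reconstruction kernel).
import numpy as np

rng = np.random.default_rng(0)
xs = np.sort(rng.random(200))                  # scattered sample locations
ys = np.sin(6 * xs) + 0.05 * rng.normal(size=200)

grid = np.linspace(0, 1, 50)                   # rectilinear grid
h = grid[1] - grid[0]

# A @ grid_values ~ ys : linear (hat) interpolation matrix
A = np.maximum(0, 1 - np.abs(xs[:, None] - grid[None, :]) / h)

# Second-difference operator penalizes roughness and cures ill-posedness
# where grid cells contain no samples.
D = np.diff(np.eye(len(grid)), n=2, axis=0)
lam = 1e-2
coef = np.linalg.lstsq(
    np.vstack([A, lam * D]),
    np.concatenate([ys, np.zeros(D.shape[0])]),
    rcond=None,
)[0]                                           # fitted grid values
```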
115. A Comparison of Cache Aware and Cache Oblivious Static Search Trees Using Program Instrumentation
- Author
-
Ladner, Richard E., Fortna, Ray, Nguyen, Bao-Hoang, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Fleischer, Rudolf, editor, Moret, Bernard, editor, and Schmidt, Erik Meineche, editor
- Published
- 2002
- Full Text
- View/download PDF
116. Performance Improvement of a Blockchain Simulator by Reducing Memory Footprint
- Author
-
Yongdae Kim, Yonggon Kim, Suebok Moon, Dongsu Han, and Hyunjin Kim
- Subjects
Blockchain ,business.industry ,Computer science ,Embedded system ,Memory footprint ,Data deduplication ,Performance improvement ,business ,Network simulation - Published
- 2021
- Full Text
- View/download PDF
117. A 510-nW Wake-Up Keyword-Spotting Chip Using Serial-FFT-Based MFCC and Binarized Depthwise Separable CNN in 28-nm CMOS
- Author
-
Lixuan Zhu, Jun Yang, Jiaming Xu, Weiwei Shan, Hao Cai, Chengjun Wu, Longxing Shi, Tao Wang, Yicheng Lu, and Minhao Yang
- Subjects
Artificial neural network ,Computer science ,020208 electrical & electronic engineering ,Fast Fourier transform ,Feature extraction ,02 engineering and technology ,Chip ,Separable space ,CMOS ,Keyword spotting ,Cepstrum ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,Mel-frequency cepstrum ,Electrical and Electronic Engineering ,Algorithm - Abstract
We propose a sub-µW always-on keyword spotting (µKWS) chip for audio wake-up systems. It is mainly composed of a neural network (NN) and a feature extraction (FE) circuit. To significantly reduce the memory footprint and computational load, four techniques are used to achieve ultra-low-power consumption: 1) a serial-FFT-based Mel-frequency cepstrum coefficient circuit is designed for FE, instead of the common parallel FFT; 2) a small-sized binarized depthwise separable convolutional NN (DSCNN) is designed as the classifier; 3) a framewise incremental computation technique is devised, in contrast to conventional whole-word processing; and 4) the reduced computation allows a low system clock frequency, which enables near-threshold-voltage operation, and low-leakage memory blocks are designed to minimize leakage power. Implemented in 28-nm CMOS technology, this µKWS consumes 0.51 µW at a 40-kHz frequency and a 0.41-V supply, with an area of 0.23 mm². Using the Google speech command data set, 97.3% accuracy is reached for a one-word KWS task and 94.6% for a two-word task.
- Published
- 2021
- Full Text
- View/download PDF
118. New Features in the System Identification Toolbox - Rapprochements with Machine Learning
- Author
-
Lennart Ljung, Debraj Bhattacharjee, Rajiv Singh, and Khaled F. Aljanaideh
- Subjects
Nonlinear system identification ,business.industry ,Computer science ,System identification ,Kalman filter ,Machine learning ,computer.software_genre ,Nonlinear system ,Extended Kalman filter ,Control and Systems Engineering ,Memory footprint ,Code generation ,Artificial intelligence ,business ,MATLAB ,computer ,computer.programming_language - Abstract
The R2021b release of the System Identification Toolbox™ for MATLAB® contains new features that enable the use of machine learning techniques for nonlinear system identification. With this release it is possible to build nonlinear ARX models with regression tree ensemble and Gaussian process regression mapping functions. The release contains several other enhancements including, but not limited to, (a) online state estimation using the extended Kalman filter and the unscented Kalman filter with code generation capability; (b) improved handling of initial conditions for transfer functions and polynomial models; (c) a new architecture of nonlinear black-box models that streamlines regressor handling, reduces memory footprint and improves numerical accuracy; and (d) easy incorporation of identification apps in teaching tools and interactive examples by leveraging the Live Editor tasks of MATLAB.
- Published
- 2021
- Full Text
- View/download PDF
119. A Resource Efficient Integer-Arithmetic-Only FPGA-Based CNN Accelerator for Real-Time Facial Emotion Recognition
- Author
-
Jaemyung Kim, Jin-Ku Kang, and Yongwoo Kim
- Subjects
accelerator ,General Computer Science ,Computational complexity theory ,Computer science ,Feature extraction ,General Engineering ,convolutional neural network ,Frame rate ,Convolutional neural network ,TK1-9971 ,Computer engineering ,Memory footprint ,General Materials Science ,Multiplication ,Emotion recognition ,quantization ,Electrical engineering. Electronics. Nuclear engineering ,Field-programmable gate array ,FPGA ,Integer (computer science) - Abstract
Recently, many studies have been conducted on recognizing facial emotion using convolutional neural networks (CNNs), which show excellent performance in computer vision. To obtain high classification accuracy, a CNN architecture with many parameters and high computational complexity is required. However, this is not suitable for embedded systems, where hardware resources are limited. In this paper, we present a lightweight CNN architecture optimized for embedded systems. The proposed CNN architecture has a small memory footprint and low computational complexity. Furthermore, a novel hardware-friendly quantization method that uses only integer arithmetic is proposed. The method maps the scale factors to power-of-two terms and replaces multiplication and division by scale factors with shift operations (this idea is sketched after this entry). To improve the generalization and classification performance of the CNN, we create the FERPlus-A dataset, a new training dataset built using a variety of image processing algorithms. After training with FERPlus-A, quantization is performed. The total size of the quantized CNN parameters is about 0.39 MB, and the number of operations is about 28 M integer operations (IOPs). Evaluating the quantized, integer-arithmetic-only CNN on the FERPlus test dataset yields a classification accuracy of approximately 86.58%, higher than other lightweight CNNs in prior studies. The proposed CNN architecture is implemented on the Xilinx ZC706 SoC platform for real-time facial emotion recognition by applying parallelism strategies and efficient data caching strategies. The FPGA-based CNN accelerator achieves about 10 frames per second (FPS) at 250 MHz and consumes 2.3 W.
- Published
- 2021
- Full Text
- View/download PDF
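The power-of-two trick replaces a floating-point rescale with an integer shift. A minimal sketch under assumed calibration values (the constants below are hypothetical, and the paper's full pipeline also quantizes weights and activations):

```python
# If a requantization scale s is rounded to 2^-k, the multiply y = acc * s
# becomes a rounding right-shift: integer-only inference, no multiplier.
import math

def quantize_scale(s: float):
    """Round a real scale factor to the nearest power of two."""
    k = round(-math.log2(s))
    return k, 2.0 ** -k

acc = 23456                     # int32 accumulator of a conv output
s = 0.00196                     # hypothetical rescale factor from calibration
k, s_pow2 = quantize_scale(s)   # k = 9, s_pow2 = 2^-9

y_float = acc * s                        # reference float path: ~45.97
y_shift = (acc + (1 << (k - 1))) >> k    # round(acc * 2^-k) via shift: 46
print(y_float, y_shift)
```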
120. Rain-Free and Residue Hand-in-Hand: A Progressive Coupled Network for Real-Time Image Deraining
- Author
-
Chen Chen, Xiao Wang, Peng Yi, Kui Jiang, Chia-Wen Lin, Junjun Jiang, Zheng Wang, and Zhongyuan Wang
- Subjects
Source code ,Computer science ,business.industry ,Computation ,media_common.quotation_subject ,Feature extraction ,Pattern recognition ,Construct (python library) ,Computer Graphics and Computer-Aided Design ,Object detection ,Memory footprint ,Segmentation ,Artificial intelligence ,business ,Software ,Image restoration ,media_common - Abstract
Rainy weather is a challenge for many vision-oriented tasks (e.g., object detection and segmentation), as it causes performance degradation. Image deraining is an effective solution to avoid the performance drop of downstream vision tasks. However, most existing deraining methods either fail to produce satisfactory restoration results or cost too much computation. In this work, considering both the effectiveness and efficiency of image deraining, we propose a progressive coupled network (PCNet) to separate rain streaks while preserving rain-free details. To this end, we investigate the blending correlations between them and devise a novel coupled representation module (CRM) to learn the joint features and the blending correlations. By cascading multiple CRMs, PCNet extracts the hierarchical features of multi-scale rain streaks and separates the rain-free content and rain streaks progressively. To promote computation efficiency, we employ depth-wise separable convolutions and a U-shaped structure, and construct the CRM in an asymmetric architecture to reduce model parameters and memory footprint. Extensive experiments evaluate the efficacy of the proposed PCNet in two aspects: (1) image deraining on several synthetic and real-world rain datasets and (2) joint image deraining and downstream vision tasks (e.g., object detection and segmentation). Furthermore, we show that the proposed CRM can be easily adopted for similar image restoration tasks, including image dehazing and low-light enhancement, with competitive performance. The source code is available at https://github.com/kuijiang0802/PCNet.
- Published
- 2021
- Full Text
- View/download PDF
121. Anomaly Detection in Vehicular CAN Bus Using Message Identifier Sequences
- Author
-
Tahsin C. M. Donmez
- Subjects
Hyperparameter ,Vehicular ad hoc network ,General Computer Science ,Computer science ,Real-time computing ,General Engineering ,ComputerApplications_COMPUTERSINOTHERSYSTEMS ,Intrusion detection system ,CAN bus ,Identifier ,Attack model ,Memory footprint ,General Materials Science ,Anomaly detection - Abstract
As the automotive industry moves forward, the security of vehicular networks becomes increasingly important. The controller area network (CAN bus) remains one of the most widely used protocols for in-vehicle communication. In this work, we study an intrusion detection system (IDS) that detects anomalies in vehicular CAN bus traffic by analyzing message identifier sequences (a toy version of this idea is sketched after this entry). We collected CAN bus data from a heavy-duty truck over a period of several months. First, we identify the properties of CAN bus traffic that enable the described approach, and demonstrate that they hold in different datasets collected from different vehicles. Then, we perform an experimental study of the IDS, using the collected CAN bus data and procedurally generated attacks. We analyze the performance of the IDS, considering various attack types and hyperparameter values. The analysis yields promising sensitivity and specificity values, as well as very fast decision times and an acceptable memory footprint.
- Published
- 2021
- Full Text
- View/download PDF
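A deliberately simplified stand-in for a message-identifier-sequence detector: learn the set of ID bigrams present in attack-free traffic, then flag windows containing transitions never seen during training. The IDs, window size, and threshold below are invented for illustration and are not the paper's parameters.

```python
# Toy CAN-bus anomaly detector over message-identifier bigrams.

def train(id_stream):
    """Collect all adjacent-ID pairs observed in benign traffic."""
    return set(zip(id_stream, id_stream[1:]))

def detect(id_stream, model, window=64, threshold=2):
    """Alert when a sliding window holds too many unknown transitions."""
    unknown = [int(pair not in model)
               for pair in zip(id_stream, id_stream[1:])]
    return [i for i in range(len(unknown) - window)
            if sum(unknown[i:i + window]) >= threshold]

normal = [0x1A0, 0x2B4, 0x1A0, 0x3C1] * 1000            # benign ID pattern
model = train(normal)
attack = normal[:2000] + [0x7FF, 0x1A0] * 40 + normal[2000:]  # injected ID
alerts = detect(attack, model)
print(f"{len(alerts)} windows flagged")
```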
122. CIC-PIM: Trading spare computing power for memory space in graph processing
- Author
-
Yu Hua, Xiao Renzhi, Dan Feng, Yuchong Hu, Hong Jiang, Yongli Cheng, Yongxuan Zhang, and Fang Wang
- Subjects
Hardware_MEMORYSTRUCTURES ,Computer Networks and Communications ,Computer science ,020206 networking & telecommunications ,02 engineering and technology ,Parallel computing ,Graph ,Theoretical Computer Science ,Artificial Intelligence ,Hardware and Architecture ,Encoding (memory) ,Spare part ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,020201 artificial intelligence & image processing ,Cache ,Software - Abstract
Shared-memory graph processing is usually more efficient than cluster-based processing in terms of cost effectiveness, ease of programming, and runtime. However, the limited memory capacity of a single machine and the huge sizes of graphs restrict its applicability. Hence, it is imperative to reduce the memory footprint. We observe that index compression holds promise and propose CIC-PIM, a lightweight encoding with chunked index compression, to reduce the memory footprint and the runtime of graph algorithms. CIC-PIM aims for significant space saving, real random-access support, and high cache efficiency by exploiting the ubiquitous power-law and sparseness features of large-scale graphs. The basic idea is to divide index structures into chunks of appropriate size and compress the chunks with our lightweight fixed-length byte-aligned encoding (see the sketch after this entry). After CIC-PIM compression, graphs twice as large can be processed with all data fitting in memory, resulting in speedups or fast in-memory processing unattainable previously.
- Published
- 2021
- Full Text
- View/download PDF
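A minimal sketch of chunked, fixed-length, byte-aligned index compression in the spirit of CIC-PIM, though not the paper's exact format: each chunk stores offsets from its minimum using the fewest whole bytes, which preserves O(1) random access.

```python
# Chunked fixed-length byte-aligned compression of a sorted index array
# (e.g., CSR row offsets), with constant-time random lookup.

CHUNK = 256

def compress(index):
    chunks = []
    for s in range(0, len(index), CHUNK):
        part = index[s:s + CHUNK]
        base = min(part)
        width = max(1, (max(part) - base).bit_length() + 7 >> 3)  # bytes
        data = b"".join((v - base).to_bytes(width, "little") for v in part)
        chunks.append((base, width, data))
    return chunks

def lookup(chunks, i):
    base, width, data = chunks[i // CHUNK]     # O(1): no sequential decode
    off = (i % CHUNK) * width
    return base + int.from_bytes(data[off:off + width], "little")

idx = [10_000_000 + 3 * i for i in range(10_000)]
c = compress(idx)
assert all(lookup(c, i) == idx[i] for i in range(len(idx)))
```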
123. Kernel Matrix Approximation on Class-Imbalanced Data With an Application to Scientific Simulation
- Author
-
Parisa Hajibabaee, Farhad Pourkamali-Anaraki, and Mohammad Amin Hariri-Ardebili
- Subjects
General Computer Science ,Computational complexity theory ,Computer science ,02 engineering and technology ,010501 environmental sciences ,Similarity measure ,computer.software_genre ,unsupervised learning ,01 natural sciences ,Kernel (linear algebra) ,pattern analysis ,0202 electrical engineering, electronic engineering, information engineering ,General Materials Science ,Cluster analysis ,Selection (genetic algorithm) ,data compression ,0105 earth and related environmental sciences ,Classification algorithms ,computational complexity ,General Engineering ,020206 networking & telecommunications ,TK1-9971 ,Memory management ,Memory footprint ,Data mining ,Electrical engineering. Electronics. Nuclear engineering ,computer ,Importance sampling - Abstract
Generating low-rank approximations of the kernel matrices that arise in nonlinear machine learning techniques holds the potential to significantly alleviate memory and computational burdens. A compelling approach centers on finding a concise set of exemplars or landmarks to reduce the number of similarity measure evaluations from quadratic to linear in the data size (the landmark-based construction is sketched after this entry). However, a key challenge is to regulate the tradeoff between the quality of landmarks and resource consumption. Despite the volume of research in this area, current understanding is limited regarding the performance of landmark selection techniques in the presence of class-imbalanced data sets, which are becoming increasingly prevalent in many applications. Hence, this paper provides a comprehensive empirical investigation using several real-world imbalanced data sets, including scientific data, evaluating the quality of approximate low-rank decompositions and examining their influence on the accuracy of downstream tasks. Furthermore, we present a new landmark selection technique called Distance-based Importance Sampling and Clustering (DISC), in which relative importance scores are computed to improve accuracy-efficiency tradeoffs compared to existing approaches that range from probabilistic sampling to clustering methods. The proposed landmark selection method follows a coarse-to-fine strategy to capture the intrinsic structure of complex data sets, allowing us to substantially reduce the computational complexity and memory footprint with minimal loss in accuracy.
- Published
- 2021
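The landmark approach the paper builds on is the Nyström-style low-rank construction: evaluate the kernel only between all n points and m landmarks. Below is a sketch using plain uniform landmark sampling; DISC's importance-sampling-plus-clustering selection, which is the paper's contribution, is omitted.

```python
# Nystrom-style landmark approximation: K ~ C W^+ C^T, needing n*m
# similarity evaluations instead of n^2.
import numpy as np

def rbf(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                  # n = 500 points
L = X[rng.choice(len(X), 20, replace=False)]   # m = 20 uniform landmarks

C = rbf(X, L)                                  # n x m cross-similarities
W = rbf(L, L)                                  # m x m landmark kernel
K_approx = C @ np.linalg.pinv(W) @ C.T         # rank-m approximation

K_full = rbf(X, X)
err = np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full)
print(f"relative Frobenius error: {err:.3f}")
```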
124. An Elephant in the Room: Using Sampling for Detecting Heavy-Hitters in Programmable Switches
- Author
-
Pedro Rodrigues Torres, Alberto Garcia-Martinez, Marcelo Bagnulo, Eduardo Parente Ribeiro, and Ministerio de Ciencia e Innovación (España)
- Subjects
Telecommunications, Elephant flows, General Computer Science, Computer science, Controller (computing), Real-time computing, General Engineering, Sampling (statistics), Automatic summarization, TK1-9971, Space saving, Reduction (complexity), Identification (information), Memory management, Memory footprint, General Materials Science, Electrical engineering. Electronics. Nuclear engineering, Sketches
The ability to detect elephant flows in the forwarding device itself, i.e., a switch, facilitates the deployment of new advanced applications such as load balancing, per-flow QoS management, etc. Sketches and Space-Saving summarization techniques are used for elephant flow detection. However, their memory and computing requirements force the cooperation of an external controller device, due to the scarce resources of current programmable switches. To overcome this limitation, we adapt sketch- and Space-Saving-based elephant flow detection techniques to operate with instant notification and sampled traffic (the underlying Space-Saving summary is sketched after this entry). We evaluate the performance of the resulting techniques with three real traffic traces. The use of sampling allows the identification of a large share of the total traffic corresponding to the elephant flows with a low memory footprint and a reduction of the computing requirements by two orders of magnitude compared to unsampled versions. In turn, we observe a slight increase in the number of false positives and the number of flow notifications. The work of Alberto García-Martínez and Marcelo Bagnulo was supported by the TRUE5G Project ("Evolución hacia redes y servicios auto-gestionados para el 5G del futuro") of the Spanish National Research Agency under Grant PID2019-108713RB-C52/AEI/10.13039/501100011033.
- Published
- 2021
- Full Text
- View/download PDF
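For background, the classic Space-Saving summary the authors adapt keeps k counters and lets a new flow evict and inherit the minimum counter; with packet sampling, this update simply runs on one in every N packets. A minimal sketch:

```python
# Space-Saving heavy-hitter summary: k counters, eviction of the minimum,
# and the classic overestimate on inherited counts.
import random

def space_saving(stream, k=8):
    counters = {}                       # flow_id -> estimated count
    for flow in stream:
        if flow in counters:
            counters[flow] += 1
        elif len(counters) < k:
            counters[flow] = 1
        else:                           # replace the current minimum
            victim = min(counters, key=counters.get)
            counters[flow] = counters.pop(victim) + 1
    return counters

stream = ["A"] * 500 + ["B"] * 300 + list("CDEFGHIJ") * 20
random.shuffle(stream)
print(space_saving(stream))             # A and B dominate the table
```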
125. An Entropy-Based Approach: Compressing Names for NDN Lookup
- Author
-
Tianyuan Niu and Fan Yang
- Subjects
Network architecture ,General Computer Science ,lookup algorithms ,computer.internet_protocol ,Computer science ,business.industry ,General Engineering ,Bottleneck ,TK1-9971 ,Named Data Networking ,Variable (computer science) ,Memory management ,Internet protocol suite ,Encoding (memory) ,Code (cryptography) ,Memory footprint ,General Materials Science ,Electrical engineering. Electronics. Nuclear engineering ,business ,computer ,Computer network - Abstract
NDN (Named Data Networking) is one of the most popular future network architectures, a "clean slate" design for replacing the traditional TCP/IP network. However, FIB entry lookup is the bottleneck of current NDN: because content names are variable-length unique identifiers, the set of FIB entries is proliferating and the effectiveness of lookup algorithms is low. This paper proposes an entropy-oriented name processing mechanism that compresses content names effectively by introducing an encoding scheme. The mechanism has two parts: name compression and lookup. The first part compresses content names, converting them into codes of smaller size by exploiting the information redundancy of content names; the second part builds a compact structure that minimizes the memory footprint of FIB entries while keeping lookup performance high. The mechanism outperforms many traditional name lookup algorithms, offers better flexibility, and costs less memory.
- Published
- 2021
- Full Text
- View/download PDF
126. Validity Tracking Based Log Management for In-Memory Databases
- Author
-
Heon Y. Yeom, Hwajung Kim, and Kwangjin Lee
- Subjects
General Computer Science ,Database ,Computer science ,snapshot ,General Engineering ,Process (computing) ,Checkpointing ,persistence ,computer.software_genre ,logging ,in-memory database ,TK1-9971 ,Memory management ,Memory footprint ,Overhead (computing) ,Snapshot (computer storage) ,General Materials Science ,Electrical engineering. Electronics. Nuclear engineering ,Electrical and Electronic Engineering ,Latency (engineering) ,Log management ,Throughput (business) ,computer - Abstract
With in-memory databases (IMDBs), where all data sets reside in main memory for fast processing, logging and checkpointing are essential for achieving data persistence. IMDB logging has evolved to reduce run-time overhead, but this increases recovery time. Checkpointing compensates for these problems with logging, but existing schemes often incur high costs: reduced system throughput, increased latency, and increased memory usage. In this paper, we propose a checkpointing scheme using validity tracking-based compaction (VTC), a technique that tracks the validity of logs in a file and removes unnecessary logs (a toy version is sketched after this entry). The proposed scheme shows extremely low memory usage compared to existing checkpointing schemes, which rely on consistent snapshots. Our experiments demonstrate that checkpoints using a consistent snapshot increase the memory footprint by up to two times in update-intensive workloads. In contrast, our proposed VTC requires only 2% additional memory for checkpointing, which means the system can use most of its memory to store data and process transactions.
- Published
- 2021
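A toy reading of validity-tracking compaction: every key's most recent log record is valid, and compaction rewrites the log keeping only valid records instead of taking a full consistent snapshot. This is simplified relative to the paper, which tracks validity to decide when and what to compact; all names and thresholds here are invented.

```python
# Append-only log with validity tracking and snapshot-free compaction.

class Log:
    def __init__(self):
        self.records = []        # append-only list of (key, value)
        self.latest = {}         # key -> index of its valid (latest) record

    def put(self, key, value):
        self.latest[key] = len(self.records)
        self.records.append((key, value))

    def invalid_ratio(self):
        return 1 - len(self.latest) / max(1, len(self.records))

    def compact(self):
        """Drop superseded records; no full-database snapshot needed."""
        self.records = [self.records[i] for i in sorted(self.latest.values())]
        self.latest = {k: i for i, (k, _) in enumerate(self.records)}

log = Log()
for i in range(10_000):
    log.put(i % 100, i)          # update-intensive: 100 keys, 10k updates
if log.invalid_ratio() > 0.5:
    log.compact()
print(len(log.records))          # 100 valid records survive
```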
127. Accelerating Spike-by-Spike Neural Networks on FPGA With Hybrid Custom Floating-Point and Logarithmic Dot-Product Approximation
- Author
-
Klaus Pawelzik, David Rotermund, Yarib Nevarez, and Alberto Garcia-Ortiz
- Subjects
Spiking neural network ,Artificial intelligence ,Floating point ,General Computer Science ,Artificial neural network ,Computer science ,020208 electrical & electronic engineering ,General Engineering ,Dot product ,02 engineering and technology ,approximate computing ,020202 computer hardware & architecture ,TK1-9971 ,Computer engineering ,Robustness (computer science) ,spiking neural networks ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,General Materials Science ,Spike (software development) ,parameterisable floating-point ,Electrical engineering. Electronics. Nuclear engineering ,Field-programmable gate array ,logarithmic ,optimization - Abstract
Spiking neural networks (SNNs) represent a promising alternative to conventional neural networks. In particular, the so-called Spike-by-Spike (SbS) neural networks provide exceptional noise robustness and reduced complexity. However, deep SbS networks require a memory footprint and a computational cost unsuitable for embedded applications. To address this problem, this work exploits the intrinsic error resilience of neural networks to improve performance and to reduce hardware complexity. More precisely, we design a vector dot-product hardware unit based on approximate computing with configurable quality, using a hybrid of custom floating-point and logarithmic number representations. This approach reduces computational latency, memory footprint, and power dissipation while preserving inference accuracy. To demonstrate our approach, we present a design exploration flow using high-level synthesis and a Xilinx SoC-FPGA. The proposed design reduces computational latency by 20.5× and weight memory footprint by 8×, with less than 0.5% accuracy degradation on a handwritten digit recognition task.
- Published
- 2021
128. System Checkpointing Using Reflection and Program Analysis
- Author
-
Whaley, John, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Yonezawa, Akinori, editor, and Matsuoka, Satoshi, editor
- Published
- 2001
- Full Text
- View/download PDF
129. Java
- Author
-
Hansmann, Uwe, Merk, Lothar, Nicklous, Martin S., and Stober, Thomas
- Published
- 2001
- Full Text
- View/download PDF
130. The NEST Dry-Run Mode: Efficient Dynamic Analysis of Neuronal Network Simulation Code.
- Author
-
Kunkel, Susanne and Schenck, Wolfram
- Subjects
SIMULATION methods & models ,QUEUING theory ,HIGH performance computing ,SUPERCOMPUTERS ,BRAIN - Abstract
NEST is a simulator for spiking neuronal networks that commits to a general purpose approach: It allows for high flexibility in the design of network models, and its applications range from small-scale simulations on laptops to brain-scale simulations on supercomputers. Hence, developers need to test their code for various use cases and ensure that changes to code do not impair scalability. However, running a full set of benchmarks on a supercomputer takes up precious compute-time resources and can entail long queuing times. Here, we present the NEST dry-run mode, which enables comprehensive dynamic code analysis without requiring access to high-performance computing facilities. A dry-run simulation is carried out by a single process, which performs all simulation steps except communication as if it was part of a parallel environment with many processes. We show that measurements of memory usage and runtime of neuronal network simulations closely match the corresponding dry-run data. Furthermore, we demonstrate the successful application of the dry-run mode in the areas of profiling and performance modeling. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
131. A gang-scheduling system for ASCI blue-pacific
- Author
-
Moreira, José E., Franke, Hubertus, Chan, Waiman, Fong, Liana L., Jette, Morris A., Yoo, Andy, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Sloot, Peter, editor, Bubak, Marian, editor, Hoekstra, Alfons, editor, and Hertzberger, Bob, editor
- Published
- 1999
- Full Text
- View/download PDF
132. OpenPepXL: An Open-Source Tool for Sensitive Identification of Cross-Linked Peptides in XL-MS
- Author
-
Oliver Kohlbacher, Ralf Ficner, Mathias Walzer, Timo Sachsenberg, Tjeerd M. H. Dijkstra, Eugen Netz, Henning Urlaub, Thomas Monecke, Olexandr Dybkov, and Lukas Zimmermann
- Subjects
Models, Molecular ,Computer science ,computer.software_genre ,Biochemistry ,Mass Spectrometry ,Analytical Chemistry ,Set (abstract data type) ,Reduction (complexity) ,03 medical and health sciences ,Humans ,Database search engine ,Amino Acid Sequence ,Databases, Protein ,Molecular Biology ,030304 developmental biology ,0303 health sciences ,030302 biochemistry & molecular biology ,Technological Innovation and Resources ,Data structure ,Data set ,Identification (information) ,Cross-Linking Reagents ,HEK293 Cells ,Memory footprint ,Data mining ,Peptides ,Heuristics ,Ribosomes ,computer ,Algorithms ,Software - Abstract
Cross-linking MS (XL-MS) has been recognized as an effective source of information about protein structures and interactions. In contrast to regular peptide identification, XL-MS has to deal with a quadratic search space, where peptides from every protein could potentially be cross-linked to any other protein. To cope with this search space, most tools apply different heuristics for search space reduction. We introduce a new open-source XL-MS database search algorithm, OpenPepXL, which offers increased sensitivity compared with other tools. OpenPepXL searches the full search space of an XL-MS experiment without using heuristics to reduce it. Because of efficient data structures and built-in parallelization OpenPepXL achieves excellent runtimes and can also be deployed on large compute clusters and cloud services while maintaining a slim memory footprint. We compared OpenPepXL to several other commonly used tools for identification of noncleavable labeled and label-free cross-linkers on a diverse set of XL-MS experiments. In our first comparison, we used a data set from a fraction of a cell lysate with a protein database of 128 targets and 128 decoys. At 5% FDR, OpenPepXL finds from 7% to over 50% more unique residue pairs (URPs) than other tools. On data sets with available high-resolution structures for cross-link validation OpenPepXL reports from 7% to over 40% more structurally validated URPs than other tools. Additionally, we used a synthetic peptide data set that allows objective validation of cross-links without relying on structural information and found that OpenPepXL reports at least 12% more validated URPs than other tools. It has been built as part of the OpenMS suite of tools and supports Windows, macOS, and Linux operating systems. OpenPepXL also supports the MzIdentML 1.2 format for XL-MS identification results. It is freely available under a three-clause BSD license at https://openms.org/openpepxl.
- Published
- 2020
- Full Text
- View/download PDF
133. SDCN
- Author
-
Bo Chen, Huadong Ma, and Liang Liu
- Subjects
business.product_category ,Computer Networks and Communications ,business.industry ,Computer science ,020206 networking & telecommunications ,02 engineering and technology ,Energy consumption ,Database-centric architecture ,Tree (data structure) ,Information-centric networking ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,020201 artificial intelligence & image processing ,Network switch ,Performance improvement ,business ,Wireless sensor network ,Computer network - Abstract
Building an open global sensing layer is critical for the Internet of Things (IoT). In this article, we present a Sensory Data-Centric Networking (SDCN) architecture for inter-networking the two main networked sensing systems in IoT: wireless sensor networks and mobile sensing networks. Specifically, the proposed SDCN is a systematic solution including NDNs for sensor nodes in the Zigbee network, NDNm for mobile phones in the Wi-Fi network, and NDNg for gateways. Considering the sensing requirements of IoT, we first design a novel Spatio-Temporal 16 Tree (ST16T) naming scheme associated with a scope-matching method. Based on the naming scheme, we further propose the related discovery methods, a network switching mechanism, and forwarding and routing strategies suited to large-scale sensing in resource-constrained environments. A proof-of-concept prototype is implemented and further deployed on our campus (BUPT) and at the Great Wall (Shaanxi, China) for an environment monitoring project. Several experiments conducted on the deployed platform show that SDCN outperforms the state of the art, with substantial improvements in energy consumption, data collection efficiency, memory footprint, and time delay.
- Published
- 2020
- Full Text
- View/download PDF
134. SGX-MR: Regulating Dataflows for Protecting Access Patterns of Data-Intensive SGX Applications
- Author
-
A K M Mubashwir Alam, Keke Chen, and Sagar Sharma
- Subjects
FOS: Computer and information sciences ,Computer Science - Cryptography and Security ,Cover (telecommunications) ,Computer science ,Dataflow ,sgx-based data analytics ,Cloud computing ,Systems and Control (eess.SY) ,0102 computer and information sciences ,computer.software_genre ,Electrical Engineering and Systems Science - Systems and Control ,01 natural sciences ,access patterns ,03 medical and health sciences ,FOS: Electrical engineering, electronic engineering, information engineering ,Oblivious ram ,Implementation ,data flow regularization ,oram ,030304 developmental biology ,General Environmental Science ,Block (data storage) ,Ethics ,0303 health sciences ,business.industry ,Sorting ,QA75.5-76.95 ,BJ1-1725 ,mapreduce ,Computer Science - Distributed, Parallel, and Cluster Computing ,010201 computation theory & mathematics ,Electronic computers. Computer science ,Memory footprint ,Operating system ,General Earth and Planetary Sciences ,Distributed, Parallel, and Cluster Computing (cs.DC) ,business ,Cryptography and Security (cs.CR) ,computer - Abstract
Intel SGX has been a popular trusted execution environment (TEE) for protecting the integrity and confidentiality of applications running on untrusted platforms such as the cloud. However, the access patterns of SGX-based programs can still be observed by adversaries, which may leak important information for successful attacks. Researchers have been experimenting with Oblivious RAM (ORAM) to address the privacy of access patterns. ORAM is a powerful low-level primitive that provides application-agnostic protection for any I/O operations, albeit at a high cost. We find that some application-specific access patterns, such as sequential block I/O, do not provide additional information to adversaries. Others, such as sorting, can be replaced with specific oblivious algorithms that are more efficient than ORAM. The challenge is that developers may need to look into all the details of application-specific access patterns to design suitable solutions, which is time-consuming and error-prone. In this paper, we present the lightweight SGX-based MapReduce (SGX-MR) approach, which regulates the dataflow of data-intensive SGX applications for easier application-level access-pattern analysis and protection. It uses the MapReduce framework to cover a large class of data-intensive applications, and the entire framework can be implemented with a small memory footprint. With this framework, we have examined the stages of data processing, identified the access patterns that need protection, and designed corresponding efficient protection methods. Our experiments show that SGX-MR based applications are much more efficient than ORAM-based implementations. (To appear in Privacy Enhancing Technologies Symposium, 2021.)
- Published
- 2020
- Full Text
- View/download PDF
135. Mobile web browsing under memory pressure
- Author
-
Ghulam Murtaza, Ehsan Latif, Theophilus Benson, Zafar Ayyub Qazi, Ihsan Ayyub Qazi, Abrar Tariq, and Abdul Manan
- Subjects
business.product_category ,Multimedia ,Computer Networks and Communications ,Computer science ,020206 networking & telecommunications ,020207 software engineering ,Mobile Web ,02 engineering and technology ,computer.software_genre ,Web page ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,Internet access ,Web navigation ,Quality of experience ,Android (operating system) ,business ,Mobile device ,computer ,Software - Abstract
Mobile devices have become the primary mode of Internet access. Yet, differences in mobile hardware resources, such as device memory, coupled with the rising complexity of Web pages can lead to widely different quality of experience for users. In this work, we analyze how device memory usage affects Web browsing performance. We quantify the memory footprint of popular Web pages over different mobile devices, mobile browsers, and Android versions, analyze the induced memory distribution across different browser components (e.g., JavaScript engine and compositor), investigate how performance gets impacted under memory pressure and propose optimizations to reduce the memory footprint of Web browsing. We show that these optimizations can improve performance and reduce chances of browser crashes in low memory scenarios.
- Published
- 2020
- Full Text
- View/download PDF
136. A Programmable Approach to Neural Network Compression
- Author
-
Animesh Garg, Saurav Muralidharan, Vinu Joseph, Michael Garland, and Ganesh Gopalakrishnan
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Artificial neural network ,Computer science ,Computer Vision and Pattern Recognition (cs.CV) ,Quantization (signal processing) ,Computer Science - Computer Vision and Pattern Recognition ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) ,Computer engineering ,Statistics - Machine Learning ,Hardware and Architecture ,Memory footprint ,Electrical and Electronic Engineering ,Software - Abstract
Deep neural networks (DNNs) frequently contain far more weights, represented at a higher precision, than are required for the specific task which they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization that reduce both the model size and inference time without appreciable loss in accuracy. However, finding the best compression strategy and corresponding target sparsity for a given DNN, hardware platform, and optimization objective currently requires expensive, frequently manual, trial-and-error experimentation. In this paper, we introduce a programmable system for model compression called Condensa. Users programmatically compose simple operators, in Python, to build more complex and practically interesting compression strategies. Given a strategy and user-provided objective (such as minimization of running time), Condensa uses a novel Bayesian optimization-based algorithm to automatically infer desirable sparsities. Our experiments on four real-world DNNs demonstrate memory footprint and hardware runtime throughput improvements of 188x and 2.59x, respectively, using at most ten samples per search. We have released a reference implementation of Condensa at https://github.com/NVlabs/condensa. (This is an updated version of a paper published in IEEE Micro, vol. 40, no. 5, pp. 17-25, Sept.-Oct. 2020: https://ieeexplore.ieee.org/document/9151283.)
- Published
- 2020
- Full Text
- View/download PDF
137. SRNPU: An Energy-Efficient CNN-Based Super-Resolution Processor With Tile-Based Selective Super-Resolution in Mobile Devices
- Author
-
Hoi-Jun Yoo, Juhyoung Lee, and Jinsu Lee
- Subjects
business.industry ,Computer science ,020208 electrical & electronic engineering ,02 engineering and technology ,Convolutional neural network ,Memory management ,Application-specific integrated circuit ,CMOS ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,Bandwidth (computing) ,020201 artificial intelligence & image processing ,Electrical and Electronic Engineering ,business ,Electrical efficiency ,Auxiliary memory ,Computer hardware - Abstract
In this article, we propose an energy-efficient convolutional neural network (CNN) based super-resolution (SR) processor, the super-resolution neural processing unit (SRNPU), for mobile applications. Traditionally, it is hard to realize real-time CNN-based SR on resource-limited platforms like mobile devices due to the massive computation workload and the communication bandwidth with external memory. The SRNPU supports tile-based selective super-resolution (TSSR), which dynamically selects a proper-sized CNN in a tile-by-tile manner. The TSSR reduces the computational workload of CNN-based SR by 31.1% while maintaining image restoration performance. Moreover, the proposed selective-caching-based convolutional layer fusion (SC2LF) can reduce external memory bandwidth by 78.8% with a 93.2% smaller on-chip memory footprint compared with previous layer fusion methods, by caching only intermediate feature maps with short reuse distances. Additionally, the reconfigurable cyclic ring architecture in the SRNPU maintains high PE utilization by amortizing the reloading process caused by SC2LF operation under various convolutional layer configurations. The SRNPU is fabricated in 65-nm CMOS technology and occupies a 4 × 4 mm² die area. It has a peak power efficiency of 1.9 TOPS/W at 0.75 V and 50 MHz. The SRNPU achieves 31.8 fps ×2-scale Full-HD generation and 88.3 fps ×4-scale Full-HD generation with higher restoration performance and power efficiency than previous SR hardware implementations. To the best of our knowledge, the SRNPU is the first ASIC implementation of a CNN-based SR algorithm which supports real-time Full-HD up-scaling.
- Published
- 2020
- Full Text
- View/download PDF
138. Compacted CPU/GPU Data Compression via Modified Virtual Address Translation
- Author
-
Cem Yuksel, Larry Seiler, and Daqi Lin
- Subjects
010302 applied physics ,Lossless compression ,Data compaction ,Computer science ,business.industry ,020207 software engineering ,Data_CODINGANDINFORMATIONTHEORY ,02 engineering and technology ,Lossy compression ,01 natural sciences ,Computer Graphics and Computer-Aided Design ,Computer Science Applications ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,Cache ,Page table ,business ,Computer hardware ,Data compression ,Image compression - Abstract
We propose a method to reduce the footprint of compressed data by using modified virtual address translation to permit random access to the data. This extends our prior work on using page translation to perform automatic decompression and deswizzling upon accesses to fixed rate lossy or lossless compressed data. Our compaction method allows a virtual address space the size of the uncompressed data to be used to efficiently access variable-size blocks of compressed data. Compression and decompression take place between the first and second level caches, which allows fast access to uncompressed data in the first level cache and provides data compaction at all other levels of the memory hierarchy. This improves performance and reduces power relative to compressed but uncompacted data. An important property of our method is that compression, decompression, and reallocation are automatically managed by the new hardware without operating system intervention and without storing compression data in the page tables. As a result, although some changes are required in the page manager, it does not need to know the specific compression algorithm and can use a single memory allocation unit size. We tested our method with two sample CPU algorithms. When performing depth buffer occlusion tests, our method reduces the memory footprint by 3.1x. When rendering into textures, our method reduces the footprint by 1.69x before rendering and 1.63x after. In both cases, the power and cycle time are better than for uncompacted compressed data, and significantly better than for accessing uncompressed data.
- Published
- 2020
- Full Text
- View/download PDF
139. EchoBay
- Author
-
Alessio Micheli, Giuseppe Franco, Luca Cerina, Claudio Gallicchio, and Marco D. Santambrogio
- Subjects
Fitness function ,Computer science ,business.industry ,Bayesian optimization ,Reservoir computing ,Process (computing) ,Cloud computing ,Echo State Networks ,EchoBay ,Random search ,Recurrent neural network ,Computer engineering ,Hardware and Architecture ,Memory footprint ,business ,Software ,Information Systems - Abstract
The increase in computational power of embedded devices and the latency demands of novel applications have brought a paradigm shift in how and where computation is performed. Although AI inference is slowly moving from the cloud to end-devices with limited resources, time-centric recurrent networks like Long Short-Term Memory remain too complex to be transferred to embedded devices without extreme simplifications that limit the performance of many notable applications. To address this issue, the Reservoir Computing paradigm proposes sparse, untrained non-linear networks, the reservoir, that can embed temporal relations without some of the hindrances of Recurrent Neural Network training, and with a lower memory occupation. Echo State Networks (ESNs) and Liquid State Machines are the most notable examples. In this scenario, we propose EchoBay, a comprehensive C++ library for ESN design and training. EchoBay is architecture-agnostic to guarantee maximum performance on different devices (whether embedded or not), and it offers the possibility to optimize and tailor an ESN to a particular case study, minimizing the effort required on the user's side. This is possible thanks to the Bayesian Optimization (BO) process, which efficiently and automatically searches for hyper-parameters that maximize a fitness function. Additionally, we designed different optimization techniques that take into consideration the resource constraints of the device to minimize memory footprint and inference time. Our results in different scenarios show an average speed-up in training time of 119× compared to grid and random search of hyper-parameters, and decreases of 94% in trained model size and 95% in inference time, while maintaining comparable performance on the given task. The EchoBay library is open source and publicly available at https://github.com/necst/Echobay.
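EchoBay itself is a C++ library; the following Python sketch only illustrates the ESN idea it builds on. The hyper-parameters here (spectral radius, sparsity, ridge term) are assumed values of exactly the kind BO would tune.

```python
import numpy as np

class ESN:
    """Minimal echo state network: a fixed random sparse reservoir;
    only the linear readout is trained, via ridge regression."""
    def __init__(self, n_in, n_res, rho=0.9, sparsity=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        w = rng.uniform(-0.5, 0.5, (n_res, n_res))
        w[rng.random((n_res, n_res)) < sparsity] = 0.0   # sparse reservoir
        w *= rho / max(abs(np.linalg.eigvals(w)))        # echo state property
        self.w = w

    def states(self, u):
        """Run the untrained reservoir over an input sequence u: (T, n_in)."""
        x, xs = np.zeros(self.w.shape[0]), []
        for u_t in u:
            x = np.tanh(self.w_in @ u_t + self.w @ x)
            xs.append(x.copy())
        return np.array(xs)

    def fit(self, u, y, ridge=1e-6):
        X = self.states(u)
        self.w_out = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]),
                                     X.T @ y)

    def predict(self, u):
        return self.states(u) @ self.w_out
```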
- Published
- 2020
- Full Text
- View/download PDF
140. ODIN
- Author
-
Calton Pu, Abhijit Suprem, Joy Arulraj, and João Eduardo Ferreira
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Computer science ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,Systems and Control (eess.SY) ,02 engineering and technology ,010501 environmental sciences ,Electrical Engineering and Systems Science - Systems and Control ,01 natural sciences ,Machine Learning (cs.LG) ,FOS: Electrical engineering, electronic engineering, information engineering ,0202 electrical engineering, electronic engineering, information engineering ,Computer vision ,VISÃO COMPUTACIONAL ,Throughput (business) ,0105 earth and related environmental sciences ,business.industry ,Model selection ,General Engineering ,Process (computing) ,Data point ,Analytics ,Memory footprint ,Data analysis ,020201 artificial intelligence & image processing ,Artificial intelligence ,Dashboard ,business - Abstract
Recent advances in computer vision have led to a resurgence of interest in visual data analytics. Researchers are developing systems for effectively and efficiently analyzing visual data at scale. A significant challenge these systems encounter lies in drift in real-world visual data: for instance, a model for self-driving vehicles that is not trained on images containing snow does not work well when it encounters them in practice. This drift phenomenon limits the accuracy of models employed for visual data analytics. In this paper, we present a visual data analytics system, called Odin, that automatically detects and recovers from drift. Odin uses adversarial autoencoders to learn the distribution of high-dimensional images. We present an unsupervised algorithm for detecting drift by comparing the distribution of the given data against those of previously seen data. When Odin detects drift, it invokes a drift recovery algorithm to deploy specialized models tailored towards the novel data points. These specialized models outperform their non-specialized counterparts on accuracy, performance, and memory footprint. Lastly, we present a model selection algorithm for picking an ensemble of best-fit specialized models to process a given input. We evaluate the efficacy and efficiency of Odin on high-resolution dashboard camera videos captured under diverse environments from the Berkeley DeepDrive dataset. We demonstrate that Odin's models deliver 6× higher throughput, 2× higher accuracy, and a 6× smaller memory footprint compared to a baseline system without automated drift detection and recovery.
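A minimal sketch of distribution-based drift detection, assuming an autoencoder callable `ae` and a two-sample Kolmogorov-Smirnov test as the comparison; the paper's adversarial-autoencoder formulation is more involved, so treat this only as the shape of the idea.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(ae, reference_batch, new_batch, alpha=0.01):
    """Compare the reconstruction-error distribution of new data against
    that of previously seen data; a significant shift signals drift."""
    def errors(batch):
        recon = ae(batch)   # ae: array of images -> reconstructions
        axes = tuple(range(1, batch.ndim))
        return np.mean((batch - recon) ** 2, axis=axes)  # per-sample MSE

    result = ks_2samp(errors(reference_batch), errors(new_batch))
    return result.pvalue < alpha   # True -> drift: deploy specialized models
```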
- Published
- 2020
- Full Text
- View/download PDF
141. Increasing the Utilization of Deep Neural Networks for SEM Measurements Through Multiple Task Formulation and Visualization
- Author
-
Narendra Chaudhary and Serap A. Savari
- Subjects
0209 industrial biotechnology ,Creative visualization ,Artificial neural network ,Computer science ,business.industry ,Computation ,media_common.quotation_subject ,Supervised learning ,Pattern recognition ,02 engineering and technology ,Condensed Matter Physics ,Convolutional neural network ,Industrial and Manufacturing Engineering ,Electronic, Optical and Magnetic Materials ,Visualization ,020901 industrial engineering & automation ,Memory footprint ,Artificial intelligence ,Enhanced Data Rates for GSM Evolution ,Electrical and Electronic Engineering ,business ,media_common - Abstract
Scanning electron microscopy images are an attractive option for estimating the roughness of nanostructures. Convolutional neural network (CNN) based algorithms have improved scanning electron microscope (SEM) image denoising and the estimation of line-roughness measurements. However, these algorithms need improvements to run at high speed with a low memory footprint and without compromising accuracy. We introduce two approaches to reduce computation time and memory. We first propose the deep CNNs LineNet1 and LineNet2 to perform simultaneous denoising and edge estimation on rough-line SEM images. This multiple-task formulation in LineNet1 and LineNet2 reduces training time, inference time, and model sizes. LineNet2 also facilitates edge estimation in multiple-line images and generalizes the approach to other geometries. Our training method uses supervised learning datasets of single-line and multiple-line SEM images together with edge-position information. We next consider multiple visualization tools to improve our understanding of the LineNet1 architecture and use the resulting insights to motivate a study of two variations of LineNet1 with fewer neural network layers. One of these visualization techniques is new to the visualization of denoising CNNs. Our results show that these approaches significantly reduce the memory and computation needed for edge estimation with only a slight impact on accuracy.
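The multiple-task formulation, a shared trunk feeding per-task heads so one forward pass serves both denoising and edge estimation, can be sketched as below. The layer counts and names are illustrative assumptions, not the paper's LineNet1 architecture.

```python
import torch
import torch.nn as nn

class MultiTaskLineNet(nn.Module):
    """Illustrative multi-task CNN: a shared trunk feeds a denoising head
    and an edge-probability head, amortizing most of the computation."""
    def __init__(self, ch=32):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.denoise_head = nn.Conv2d(ch, 1, 3, padding=1)
        self.edge_head = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x):
        f = self.trunk(x)
        # Returns (denoised image, per-pixel edge probability map);
        # training would combine an MSE loss and a BCE loss.
        return self.denoise_head(f), torch.sigmoid(self.edge_head(f))
```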
- Published
- 2020
- Full Text
- View/download PDF
142. Hitting set enumeration with partial information for unique column combination discovery
- Author
-
Felix Naumann, Tobias Friedrich, Thomas Bläsius, Thorsten Papenbrock, Johann Birnick, and Martin Schirneck
- Subjects
Theoretical computer science ,business.industry ,Relational database ,Computer science ,Data management ,General Engineering ,Hasso-Plattner-Institut für Digital Engineering gGmbH ,02 engineering and technology ,Column (database) ,Metadata discovery ,Set (abstract data type) ,Data profiling ,020204 information systems ,ddc:000 ,0202 electrical engineering, electronic engineering, information engineering ,Enumeration ,Memory footprint ,020201 artificial intelligence & image processing ,business - Abstract
Unique column combinations (UCCs) are a fundamental concept in relational databases. They identify entities in the data and support various data management activities. Still, UCCs are usually not explicitly defined and need to be discovered. State-of-the-art data profiling algorithms are able to efficiently discover UCCs in moderately sized datasets, but they tend to fail on large and, in particular, wide datasets due to run-time and memory limitations. In this paper, we introduce HPIValid, a novel UCC discovery algorithm that implements a faster and more resource-saving search strategy. HPIValid models metadata discovery as a hitting set enumeration problem in hypergraphs. In this way, it combines efficient discovery techniques from data profiling research with the most recent theoretical insights into enumeration algorithms. Our evaluation shows that HPIValid is not only orders of magnitude faster than related work; it also has a much smaller memory footprint.
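The hitting-set view can be made concrete with a brute-force sketch: UCCs are exactly the column sets that hit (intersect) every pairwise difference set of the relation. HPIValid's contribution is performing this enumeration lazily with partial information; the code below is only the naive exponential baseline.

```python
from itertools import combinations

def difference_sets(rows):
    """Columns on which each pair of rows differs; a UCC must hit
    every such set."""
    diffs = set()
    for r, s in combinations(rows, 2):
        diffs.add(frozenset(i for i, (a, b) in enumerate(zip(r, s)) if a != b))
    return diffs

def minimal_uccs(n_cols, rows):
    """Enumerate minimal hitting sets by increasing size (brute force)."""
    diffs = difference_sets(rows)
    found = []
    for k in range(1, n_cols + 1):
        for cand in combinations(range(n_cols), k):
            c = frozenset(cand)
            if all(c & d for d in diffs) and not any(u <= c for u in found):
                found.append(c)
    return found

# minimal_uccs(3, [("a", 1, "x"), ("a", 2, "x"), ("b", 2, "y")])
# -> [frozenset({1, 0})] style output: column sets that uniquely identify rows
```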
- Published
- 2020
- Full Text
- View/download PDF
143. Two‐tiered face verification with low‐memory footprint for mobile devices
- Author
-
William Dias, Anderson Rocha, Ricardo da Silva Torres, Waldir Rodrigues De Almeida, Rafael Padilha, Thiago Resek, Gabriel Bertocco, Jacques Wainer, and Fernanda A. Andaló
- Subjects
021110 strategic, defence & security studies ,Authentication ,Speedup ,Biometrics ,Computer science ,0211 other engineering and technologies ,Mobile computing ,02 engineering and technology ,Facial recognition system ,Convolutional neural network ,Human–computer interaction ,Signal Processing ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,020201 artificial intelligence & image processing ,Computer Vision and Pattern Recognition ,Mobile device ,Software - Abstract
Mobile devices have greatly increased in popularity and affordability in recent years. As a consequence of their ubiquity, these devices now carry all sorts of personal data that should be accessed only by their owner. Even though knowledge-based procedures are still the main methods of securing the owner's identity, biometric traits have recently been employed for more secure and effortless authentication. In this work, the authors propose a facial verification method optimised for the mobile environment. It consists of a two-tiered procedure that combines hand-crafted features and a convolutional neural network (CNN) to verify whether the person depicted in a photograph corresponds to the device owner. To train a CNN for the verification task, the authors propose a hybrid-image input, which allows the network to process encoded information from a pair of face images. The experiments show that the solution outperforms state-of-the-art face verification methods, providing a 4× speedup when processing an image on recent smartphone models. Additionally, the authors show that the two-tiered procedure can be coupled with existing face verification CNNs, improving their accuracy and efficiency. They also present a new dataset of selfie pictures – the RECOD Selfie dataset – that will hopefully support future research in this scenario.
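The two-tier control flow can be sketched as follows, with assumed score thresholds and callables for the hand-crafted and CNN tiers; the hybrid-image encoding itself is abstracted behind `cnn_verify`. The design point is that the cheap tier answers the easy cases, so most verifications never pay for the CNN.

```python
def verify(probe, owner_template, fast_score, cnn_verify,
           t_accept=0.9, t_reject=0.3):
    """Two-tiered verification sketch (names and thresholds hypothetical).

    fast_score: cheap hand-crafted matcher returning a similarity in [0, 1].
    cnn_verify: expensive CNN tier, called only on ambiguous pairs.
    """
    score = fast_score(probe, owner_template)   # tier 1: hand-crafted features
    if score >= t_accept:
        return True                             # confident accept, no CNN
    if score <= t_reject:
        return False                            # confident reject, no CNN
    return cnn_verify(probe, owner_template)    # tier 2: CNN on the hard cases
```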
- Published
- 2020
- Full Text
- View/download PDF
144. Faster & strong: string dictionary compression using sampling and fast vectorized decompression
- Author
-
Ismail Oukid, Roman Dementiev, Suleyman S. Demirsoy, Kai-Uwe Sattler, Norman May, and Robert Lasch
- Subjects
Speedup ,Hardware and Architecture ,Computer science ,020204 information systems ,MathematicsofComputing_GENERAL ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,020201 artificial intelligence & image processing ,02 engineering and technology ,Dictionary coder ,Algorithm ,Information Systems ,Data compression - Abstract
String dictionaries constitute a large portion of the memory footprint of database applications. While strong string dictionary compression algorithms exist, they come with impractical access and compression times. Therefore, lightweight algorithms such as plain front coding (PFC) are favored in practice. This paper endeavors to make strong string dictionary compression practical. We focus on Re-Pair Front Coding (RPFC), a grammar-based compression algorithm, since it consistently offers better compression ratios than other algorithms in the literature. To accelerate compression times, we propose block-based RPFC (BRPFC), which independently compresses small blocks of the dictionary. For further accelerated compression, especially on large string dictionaries, we also propose an alternative version of BRPFC that uses sampling to speed up compression. Moreover, to accelerate access times, we devise a vectorized access method using Intel® Advanced Vector Extensions 512 (Intel® AVX-512). Our experimental evaluation shows that sampled BRPFC offers compression times up to 190× faster than RPFC, and random string lookups 2.3× faster than RPFC on average. These results move our modified RPFC into a practical range for use in database systems, because the access-time overhead of Re-Pair-based compression can be reduced by 2×.
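Front coding, the lightweight baseline the paper builds on, is easy to sketch: each string in a sorted dictionary is stored as a shared-prefix length plus a suffix relative to its predecessor. BRPFC then applies Re-Pair grammar compression on top of blocks of such entries (not shown here).

```python
def front_code(sorted_strings):
    """Encode a sorted list of strings as (prefix-length, suffix) pairs."""
    out, prev = [], ""
    for s in sorted_strings:
        k = 0
        while k < min(len(s), len(prev)) and s[k] == prev[k]:
            k += 1
        out.append((k, s[k:]))   # shared prefix length + remaining suffix
        prev = s
    return out

def front_decode(coded):
    """Reconstruct the original strings by replaying the shared prefixes."""
    strings, prev = [], ""
    for k, suffix in coded:
        prev = prev[:k] + suffix
        strings.append(prev)
    return strings

# front_code(["tech", "technical", "technique"])
# -> [(0, 'tech'), (4, 'nical'), (6, 'que')]
```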
- Published
- 2020
- Full Text
- View/download PDF
145. CLMIP: cross-layer manifold invariance based pruning method of deep convolutional neural network for real-time road type recognition
- Author
-
Mingqiang Yang, Xinyu Tian, Huake Su, and Qinghe Zheng
- Subjects
Machine vision ,Computer science ,02 engineering and technology ,Convolutional neural network ,Artificial Intelligence ,0202 electrical engineering, electronic engineering, information engineering ,Segmentation ,computer.programming_language ,business.industry ,Applied Mathematics ,Deep learning ,020206 networking & telecommunications ,Pattern recognition ,Python (programming language) ,Manifold ,Object detection ,Computer Science Applications ,Hardware and Architecture ,Signal Processing ,Memory footprint ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,computer ,Software ,Information Systems - Abstract
Recently, deep learning based models have demonstrated superiority in a variety of visual tasks such as object detection and instance segmentation. In practical applications, deploying advanced networks in real-time settings such as autonomous driving remains challenging due to their expensive computational cost and memory footprint. In this paper, to reduce the size of deep convolutional neural networks (CNNs) and accelerate their inference, we propose a cross-layer manifold invariance based pruning method, named CLMIP, for network compression, enabling real-time road type recognition in low-cost vision systems. Manifolds are higher-dimensional analogues of curves and surfaces, which can self-organize to reflect the data distribution and characterize the relationships between data. We therefore aim to preserve the generalization ability of a deep CNN by maintaining the consistency of the data manifolds of each layer in the network, and then remove the parameters with the least influence on the manifold structure. CLMIP can thus also be regarded as a tool for further investigating the dependence of model structure on network optimization and generalization. To the best of our knowledge, this is the first attempt to prune deep CNNs based on the invariance of data manifolds. During the experiments, we used a Python-based keyword crawler to collect 102 first-person-view videos from car cameras, comprising 137,200 images (320 × 240) of four road scenes (urban road, off-road, trunk road, and motorway). The classification results demonstrate that CLMIP achieves state-of-the-art performance at 26 FPS on an NVIDIA Jetson Nano.
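A simplified proxy for the manifold-consistency criterion: score each filter by how much its removal perturbs the pairwise-distance structure of a batch of activations, then prune the lowest-scoring filters. The scoring function and keep ratio below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def manifold_score(features, filter_idx):
    """How much does dropping one filter change the pairwise-distance
    structure of a batch of activations? features: (batch, n_filters)."""
    def pdist(f):
        d = f[:, None, :] - f[None, :, :]
        return np.sqrt((d ** 2).sum(-1))
    full = pdist(features)
    reduced = pdist(np.delete(features, filter_idx, axis=1))
    return np.abs(full - reduced).mean()   # small -> filter barely matters

def prunable_filters(features, keep_ratio=0.7):
    """Return indices of filters to keep, ranked by manifold influence."""
    scores = [manifold_score(features, i) for i in range(features.shape[1])]
    n_keep = int(keep_ratio * len(scores))
    return np.argsort(scores)[-n_keep:]
```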
- Published
- 2020
- Full Text
- View/download PDF
146. CGMBE: a model-based tool for the design and implementation of real-time image processing applications on CPU–GPU platforms
- Author
-
Jing Xie, Jiahao Wu, Alexandre Bardakoff, Walid Keyrouz, Shuvra S. Bhattacharyya, and Timothy Blattner
- Subjects
Multi-core processor ,business.industry ,Computer science ,Dataflow ,Design tool ,020207 software engineering ,Image processing ,02 engineering and technology ,Embedded system ,Digital image processing ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,Software design ,020201 artificial intelligence & image processing ,Central processing unit ,business ,Information Systems - Abstract
Processing large images in real time requires effective image processing algorithms as well as efficient software design and implementation to take full advantage of all CPU cores and GPU resources on state-of-the-art CPU/GPU platforms. Efficiently coordinating computations on both the host (CPU) and the devices (GPUs), along with host–device data transfers, is critical to achieving real-time performance. However, such coordination is challenging for system designers given the complexity of modern image processing applications and the targeted processing platforms. In this paper, we present a novel model-based design tool that automates and optimizes these critical design decisions for real-time image processing implementation. The proposed tool consists of a compile-time static analyzer and a run-time dynamic scheduler. It automates the process of scheduling dataflow tasks (actors) and coordinating CPU–GPU data transfers in an integrated manner. The approach uses an unfolded dataflow graph representation of the application along with thread-pool-based executors, which are optimized for efficient operation on the targeted CPU–GPU platform. This automates the most complicated aspects of the design and implementation process for image processing system designers, while maximizing the utilization of computational power, reducing the memory footprint for both the CPU and the GPU, and facilitating experimentation for tuning performance-oriented designs.
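The coordination problem the tool automates can be miniaturized as a bounded pipeline: overlap CPU pre/post-processing with device work while a bounded queue caps in-flight tiles, keeping the memory footprint fixed regardless of image size. This sketch assumes opaque `cpu_pre`, `gpu_kernel`, and `cpu_post` callables; the real tool derives such a schedule from the dataflow graph rather than hand-coding it.

```python
from concurrent.futures import ThreadPoolExecutor
import queue

def run_pipeline(tiles, cpu_pre, gpu_kernel, cpu_post, depth=4):
    """Toy host/device pipeline: the queue bounds in-flight tiles,
    providing backpressure and a fixed memory budget."""
    q = queue.Queue(maxsize=depth)
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        def producer():
            for t in tiles:
                # Pre-process on the CPU, then hand off to the "device".
                q.put(pool.submit(gpu_kernel, cpu_pre(t)))
            q.put(None)  # sentinel: no more tiles
        pool.submit(producer)
        while (fut := q.get()) is not None:
            results.append(cpu_post(fut.result()))  # post-process in order
    return results
```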
- Published
- 2020
- Full Text
- View/download PDF
147. H-CNN: Spatial Hashing Based CNN for 3D Shape Analysis
- Author
-
Yanlin Weng, Kun Zhou, Tianjia Shao, Yin Yang, and Qiming Hou
- Subjects
FOS: Computer and information sciences ,Computer science ,Hash function ,020207 software engineering ,02 engineering and technology ,Data structure ,Computer Graphics and Computer-Aided Design ,Convolutional neural network ,Graphics (cs.GR) ,Hash table ,Computer Science - Graphics ,Signal Processing ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,Leverage (statistics) ,Computer Vision and Pattern Recognition ,Algorithm ,Software - Abstract
We present a novel spatial hashing based data structure to facilitate 3D shape analysis using convolutional neural networks (CNNs). Our method exploits the sparse occupancy of the 3D shape boundary and builds hierarchical hash tables for an input model under different resolutions. Based on this data structure, we design two efficient GPU algorithms, hash2col and col2hash, so that CNN operations like convolution and pooling can be efficiently parallelized. The spatial hashing is nearly minimal, and our data structure is almost the same size as the raw input. Compared with state-of-the-art octree-based methods, our data structure significantly reduces the memory footprint during CNN training. As the input geometry features are more compactly packed, CNN operations also run faster with our data structure. Experiments show that, under the same network structure, our method yields comparable or better benchmarks than the state of the art while consuming only one-third of the memory. Such superior memory performance allows the CNN to handle high-resolution shape analysis.
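The sparse-occupancy storage and a hash2col-style gather can be sketched in Python: only boundary voxels are stored, so memory scales with surface area rather than volume. A real implementation hashes hierarchically on the GPU; the function names here mirror, but do not reproduce, the paper's kernels.

```python
def build_hash(voxels):
    """voxels: iterable of (x, y, z, feature) for occupied cells only."""
    return {(x, y, z): f for x, y, z, f in voxels}

def gather_neighborhood(table, center, k=3, empty=0.0):
    """hash2col-style gather: assemble the k*k*k input column that a
    convolution at `center` needs, substituting `empty` for holes."""
    cx, cy, cz = center
    r = k // 2
    return [table.get((cx + dx, cy + dy, cz + dz), empty)
            for dx in range(-r, r + 1)
            for dy in range(-r, r + 1)
            for dz in range(-r, r + 1)]

# Example: a single occupied voxel at the origin.
table = build_hash([(0, 0, 0, 1.0)])
col = gather_neighborhood(table, (0, 0, 0))   # 27 values, mostly `empty`
```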
- Published
- 2020
- Full Text
- View/download PDF
148. Exploring compression and parallelization techniques for distribution of deep neural networks over Edge–Fog continuum – a review
- Author
-
Azra Nazir, Shaima Qureshi, and Roohie Naaz Mir
- Subjects
General Computer Science ,Exploit ,business.industry ,Computer science ,Deep learning ,020206 networking & telecommunications ,Cloud computing ,02 engineering and technology ,Parallel computing ,Pipeline (software) ,Field (computer science) ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,Overhead (computing) ,020201 artificial intelligence & image processing ,Enhanced Data Rates for GSM Evolution ,Artificial intelligence ,business - Abstract
Purpose: The trend of "Deep Learning for the Internet of Things (IoT)" has gained fresh momentum, with enormous upcoming applications employing these models as their processing engine and the Cloud as their resource giant. But this picture leads to underutilization of the ever-increasing device pool of the IoT, which had already passed the 15 billion mark in 2015. Thus, it is high time to explore a different approach to this issue, keeping in view the characteristics and needs of the two fields. Processing at the Edge can boost applications with real-time deadlines while complementing security.

Design/methodology/approach: This review paper contributes to three cardinal directions of research in the field of DL for IoT. The first covers the categories of IoT devices and how the Fog can aid in overcoming the underutilization of millions of devices, forming the realm of the things for IoT. The second handles the immense computational requirements of DL models by uncovering specific compression techniques; an appropriate combination of these techniques, including regularization, quantization, and pruning, can aid in building an effective compression pipeline for establishing DL models for IoT use-cases. The third incorporates both these views and introduces a novel approach of parallelization for setting up a distributed-systems view of DL for IoT.

Findings: DL models are growing deeper with every passing year. Well-coordinated distributed execution of such models using the Fog displays a promising future for the IoT application realm. A vertically partitioned compressed deep model can handle the trade-off between size, accuracy, communication overhead, bandwidth utilization, and latency, but at the expense of a considerable additional memory footprint. To reduce the memory budget, we propose to exploit HashedNets as potentially favorable candidates for distributed frameworks (see the sketch after this abstract); however, the critical point between accuracy and size for such models needs further investigation.

Originality/value: To the best of our knowledge, no study has explored the inherent parallelism in deep neural network architectures for their efficient distribution over the Edge–Fog continuum. Besides covering techniques and frameworks that have tried to bring inference to the Edge, the review uncovers significant issues and possible future directions for endorsing deep models as processing engines for real-time IoT. The study is directed at both researchers and industrialists taking various applications to the Edge for a better user experience.
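A minimal sketch of the HashedNets weight-sharing idea mentioned in the findings: a layer's virtual weight matrix is backed by a small parameter vector, and a hash of each position picks the shared parameter to use. Python's built-in tuple hash stands in here for the fast universal hash a deployment would use; all names are illustrative.

```python
import numpy as np

class HashedLayer:
    """HashedNets-style layer: n_in*n_out virtual weights are backed by
    only n_params real parameters, shrinking the memory budget."""
    def __init__(self, n_in, n_out, n_params, seed=0):
        rng = np.random.default_rng(seed)
        self.params = rng.standard_normal(n_params) * 0.01
        # Fixed hash: (row, col) position -> index of a shared parameter.
        self.idx = np.array([[hash((i, j, seed)) % n_params
                              for j in range(n_out)] for i in range(n_in)])

    def forward(self, x):
        w = self.params[self.idx]   # materialize the virtual (n_in, n_out) matrix
        return x @ w

# A 256x256 layer (65,536 virtual weights) backed by 4,096 real parameters:
layer = HashedLayer(256, 256, 4096)
y = layer.forward(np.ones((1, 256)))
```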
- Published
- 2020
- Full Text
- View/download PDF
149. A p-adaptive Matrix-Free Discontinuous Galerkin Method for the Implicit LES of Incompressible Transitional Flows
- Author
-
Alessandro Colombo, Andrea Crivellini, Lorenzo Alessio Botti, Antonio Ghidoni, G. Noventa, Matteo Franciolini, and Francesco Bassi
- Subjects
p-adaptation ,Discretization ,Rosenbrock-type schemes ,Computer science ,General Chemical Engineering ,General Physics and Astronomy ,CPU time ,02 engineering and technology ,Computational fluid dynamics ,01 natural sciences ,010305 fluids & plasmas ,Incompressible flows ,0203 mechanical engineering ,Discontinuous Galerkin method ,Matrix-free ,Discontinuous Galerkin ,0103 physical sciences ,Applied mathematics ,p-multigrid preconditioner ,Physical and Theoretical Chemistry ,business.industry ,Preconditioner ,Generalized minimal residual method ,ILES ,020303 mechanical engineering & transports ,Settore ING-IND/06 - Fluidodinamica ,Memory footprint ,business ,Large eddy simulation - Abstract
In recent years, Computational Fluid Dynamics (CFD) has become widespread practice in industry. The growing need to simulate off-design conditions, characterized by massively separated flows, has strongly promoted research on models and methods that improve computational efficiency and bring the practice of Scale Resolving Simulations (SRS), like the Large Eddy Simulation (LES), to an industrial level. Among the possible approaches to SRS, an appealing choice is to perform Implicit LES (ILES) via a high-order Discontinuous Galerkin (DG) method, where the favourable numerical dissipation of the space discretization scheme directly plays the role of a subgrid-scale model. To reduce the large CPU time for ILES, implicit time integrators, which allow for larger time steps than explicit schemes, can be considered. The main drawbacks of implicit time integration in a DG framework are the large memory footprint, the large CPU time for the operator assembly, and the difficulty of designing highly scalable preconditioners for the linear solvers. In this paper, which aims to significantly reduce the memory requirement and CPU time without spoiling the high-order accuracy of the method, we rely on a p-adaptive algorithm suited for the ILES of turbulent flows and an efficient matrix-free iterative linear solver based on a cheap p-multigrid preconditioner and a Flexible GMRES method. The performance and accuracy of the method have been assessed on the following test cases: (1) the T3L test case of the ERCOFTAC suite, a rounded-leading-edge flat plate at Re_D = 3450; (2) the flow past a sphere at Re_D = 300; and (3) the flow past a circular cylinder at Re_D = 3900.
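The matrix-free ingredient can be illustrated with SciPy: a Krylov solver such as GMRES needs only the action of the operator on a vector, never its assembled matrix, which is where the memory saving comes from. The 1D Laplacian below is a toy stand-in for the DG operator action.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

n = 1000

def apply_operator(v):
    """Operator action computed on the fly (tridiagonal [-1, 2, -1]);
    in a DG code this would be the residual Jacobian applied to v."""
    av = 2.0 * v
    av[:-1] -= v[1:]
    av[1:] -= v[:-1]
    return av

# The matrix is never formed or stored: only its matvec is provided.
A = LinearOperator((n, n), matvec=apply_operator)
b = np.ones(n)
x, info = gmres(A, b)   # info == 0 on successful convergence
```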
- Published
- 2020
- Full Text
- View/download PDF
150. VTR 8
- Author
-
Jean-Philippe Legault, Kevin E. Murray, Panagiotis Patros, Sheng Zhong, Matthew Walker, Kenneth B. Kent, Jean Wu, Vaughn Betz, Aaron G. Graham, Hanqing Zeng, Mohamed Eldafrawy, Jason Luu, Oleg Petelin, Jia Min Wang, and Eugene Sha
- Subjects
010302 applied physics ,General Computer Science ,Computer science ,Design flow ,CAD ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,020202 computer hardware & architecture ,Computer architecture ,Gate array ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Memory footprint ,Verilog ,Computer Aided Design ,Routing (electronic design automation) ,Field-programmable gate array ,computer ,computer.programming_language - Abstract
Developing Field-Programmable Gate Array (FPGA) architectures is challenging due to the competing requirements of various application domains and changing manufacturing process technology. This is compounded by the difficulty of fairly evaluating FPGA architectural choices, which requires sophisticated, high-quality Computer-Aided Design (CAD) tools to target each potential architecture. This article describes version 8.0 of the open-source Verilog to Routing (VTR) project, which provides such a design flow. VTR 8 expands the scope of FPGA architectures that can be modelled, allowing VTR to target and model many details of both commercial and proposed FPGA architectures. The VTR design flow also serves as a baseline for evaluating new CAD algorithms. It is therefore important, both for CAD algorithm comparisons and for the validity of architectural conclusions, that VTR produce high-quality circuit implementations. VTR 8 significantly improves optimization quality (reductions of 15% in minimum routable channel width, 41% in wirelength, and 12% in critical path delay), run-time (5.3× faster), and memory footprint (3.3× lower). Finally, we demonstrate that VTR is run-time- and memory-efficient while producing circuit implementations of reasonable quality compared to highly tuned architecture-specific industrial tools, showing that architecture generality, good implementation quality, and run-time efficiency are not mutually exclusive goals.
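The minimum routable channel width quoted above is typically found by a binary search over channel width, re-running the router at each candidate. A minimal sketch, assuming a hypothetical `route_ok` predicate that invokes the router at a given width:

```python
def min_channel_width(route_ok, lo=2, hi=512):
    """Binary search for the smallest channel width at which the circuit
    still routes; route_ok(w) -> bool wraps a full routing attempt."""
    assert route_ok(hi), "circuit must be routable at the upper bound"
    while lo < hi:
        mid = (lo + hi) // 2
        if route_ok(mid):
            hi = mid        # routable: try narrower channels
        else:
            lo = mid + 1    # unroutable: need wider channels
    return hi
```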
- Published
- 2020
- Full Text
- View/download PDF