Author: "Ahn, Jung Ho" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Ahn, Jung Ho"' showing total 701 results

1. Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching

Author: Yun, Sungmin, Kyung, Kwanhee, Cho, Juhwan, Choi, Jaewan, Kim, Jongmin, Kim, Byeongho, Lee, Sukhan, Sohn, Kyomin, and Ahn, Jung Ho
Subjects: Computer Science - Hardware Architecture, Computer Science - Machine Learning
Abstract: Large language models (LLMs) have emerged due to their capability to generate high-quality content across diverse contexts. To reduce their explosively increasing demands for computing resources, a mixture of experts (MoE) has emerged. The MoE layer enables exploiting a huge number of parameters with less computation. Applying state-of-the-art continuous batching increases throughput; however, it leads to frequent DRAM access in the MoE and attention layers. We observe that conventional computing devices have limitations when processing the MoE and attention layers, which dominate the total execution time and exhibit low arithmetic intensity (Op/B). Processing MoE layers only with devices targeting low-Op/B such as processing-in-memory (PIM) architectures is challenging due to the fluctuating Op/B in the MoE layer caused by continuous batching. To address these challenges, we propose Duplex, which comprises xPU tailored for high-Op/B and Logic-PIM to effectively perform low-Op/B operations within a single device. Duplex selects the most suitable processor based on the Op/B of each layer within LLMs. As the Op/B of the MoE layer is at least 1 and that of the attention layer has a value of 4-8 for grouped query attention, prior PIM architectures, which place processing units inside DRAM dies and only target extremely low-Op/B (under one) operations, are not efficient. Based on recent trends, Logic-PIM adds more through-silicon vias (TSVs) to enable high-bandwidth communication between the DRAM die and the logic die and places powerful processing units on the logic die, which is best suited for handling low-Op/B operations ranging from a few to a few dozen. To maximally utilize the xPU and Logic-PIM, we propose expert and attention co-processing., Comment: 15 pages, 16 figures, accepted at MICRO 2024
Published: 2024
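
Illustrative note: the abstract of record 1 describes choosing a processor per layer from its arithmetic intensity (Op/B). The minimal Python sketch below shows that routing idea only. The threshold, the function names, and the operation and byte counts are hypothetical values chosen for illustration; none of them come from the paper.

```python
def arithmetic_intensity(ops: float, bytes_moved: float) -> float:
    """Return operations per byte (Op/B) for one layer invocation."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return ops / bytes_moved


def select_processor(op_per_byte: float, threshold: float = 32.0) -> str:
    """Route low-Op/B work to Logic-PIM and high-Op/B work to the xPU.

    The default threshold is an assumption made for this sketch only.
    """
    return "Logic-PIM" if op_per_byte < threshold else "xPU"


# Hypothetical layers: (name, operations, bytes moved from DRAM).
layers = [
    ("attention (GQA)", 6.0e9, 1.0e9),                 # Op/B = 6, within the 4-8 range cited
    ("MoE expert, small batch", 2.0e9, 1.0e9),         # Op/B = 2
    ("dense projection, large batch", 2.0e11, 1.0e9),  # Op/B = 200
]

assignments = {
    name: select_processor(arithmetic_intensity(ops, moved))
    for name, ops, moved in layers
}
```

Because continuous batching changes the Op/B of the MoE layer from step to step, a rule of this kind would have to be re-evaluated per invocation rather than fixed per layer type.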

2. Cheddar: A Swift Fully Homomorphic Encryption Library for CUDA GPUs

Author: Kim, Jongmin, Choi, Wonseok, and Ahn, Jung Ho
Subjects: Computer Science - Cryptography and Security, Computer Science - Performance
Abstract: Fully homomorphic encryption (FHE) is a cryptographic technology capable of resolving security and privacy problems in cloud computing by encrypting data in use. However, FHE introduces tremendous computational overhead for processing encrypted data, causing FHE workloads to become 2-6 orders of magnitude slower than their unencrypted counterparts. To mitigate the overhead, we propose Cheddar, an FHE library for CUDA GPUs, which demonstrates significantly faster performance compared to prior GPU implementations. We develop optimized functionalities at various implementation levels ranging from efficient low-level primitives to streamlined high-level operational sequences. Especially, we improve major FHE operations, including number-theoretic transform and base conversion, based on efficient kernel designs using a small word size of 32 bits. By these means, Cheddar demonstrates 2.9 to 25.6 times higher performance for representative FHE workloads compared to prior GPU implementations., Comment: 12 pages, 5 figures
Published: 2024

3. DRAMScope: Uncovering DRAM Microarchitecture and Characteristics by Issuing Memory Commands

Author: Nam, Hwayong, Baek, Seungmin, Wi, Minbok, Kim, Michael Jaemin, Park, Jaehyun, Song, Chihun, Kim, Nam Sung, and Ahn, Jung Ho
Subjects: Computer Science - Cryptography and Security, Computer Science - Hardware Architecture
Abstract: The demand for precise information on DRAM microarchitectures and error characteristics has surged, driven by the need to explore processing in memory, enhance reliability, and mitigate security vulnerability. Nonetheless, DRAM manufacturers have disclosed only a limited amount of information, making it difficult to find specific information on their DRAM microarchitectures. This paper addresses this gap by presenting more rigorous findings on the microarchitectures of commodity DRAM chips and their impacts on the characteristics of activate-induced bitflips (AIBs), such as RowHammer and RowPress. The previous studies have also attempted to understand the DRAM microarchitectures and associated behaviors, but we have found some of their results to be misled by inaccurate address mapping and internal data swizzling, or lack of a deeper understanding of the modern DRAM cell structure. For accurate and efficient reverse-engineering, we use three tools: AIBs, retention time test, and RowCopy, which can be cross-validated. With these three tools, we first take a macroscopic view of modern DRAM chips to uncover the size, structure, and operation of their subarrays, memory array tiles (MATs), and rows. Then, we analyze AIB characteristics based on the microscopic view of the DRAM microarchitecture, such as 6F^2 cell layout, through which we rectify misunderstandings regarding AIBs and discover a new data pattern that accelerates AIBs. Lastly, based on our findings at both macroscopic and microscopic levels, we identify previously unknown AIB vulnerabilities and propose a simple yet effective protection solution., Comment: To appear at the 51st IEEE/ACM International Symposium on Computer Architecture (ISCA)
Published: 2024

4. NeuJeans: Private Neural Network Inference with Joint Optimization of Convolution and Bootstrapping

Author: Ju, Jae Hyung, Park, Jaiyoung, Kim, Jongmin, Kang, Minsik, Kim, Donghwan, Cheon, Jung Hee, and Ahn, Jung Ho
Subjects: Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: Fully homomorphic encryption (FHE) is a promising cryptographic primitive for realizing private neural network inference (PI) services by allowing a client to fully offload the inference task to a cloud server while keeping the client data oblivious to the server. This work proposes NeuJeans, an FHE-based solution for the PI of deep convolutional neural networks (CNNs). NeuJeans tackles the critical problem of the enormous computational cost for the FHE evaluation of CNNs. We introduce a novel encoding method called Coefficients-in-Slot (CinS) encoding, which enables multiple convolutions in one HE multiplication without costly slot permutations. We further observe that CinS encoding is obtained by conducting the first several steps of the Discrete Fourier Transform (DFT) on a ciphertext in conventional Slot encoding. This property enables us to save the conversion between CinS and Slot encodings as bootstrapping a ciphertext starts with DFT. Exploiting this, we devise optimized execution flows for various two-dimensional convolution (conv2d) operations and apply them to end-to-end CNN implementations. NeuJeans accelerates the performance of conv2d-activation sequences by up to 5.68 times compared to state-of-the-art FHE-based PI work and performs the PI of a CNN at the scale of ImageNet within a mere few seconds., Comment: 15 pages, 6 figures
Published: 2023

5. Toward Practical Privacy-Preserving Convolutional Neural Networks Exploiting Fully Homomorphic Encryption

Author: Park, Jaiyoung, Kim, Donghwan, Kim, Jongmin, Kim, Sangpyo, Jung, Wonkyung, Cheon, Jung Hee, and Ahn, Jung Ho
Subjects: Computer Science - Cryptography and Security, Computer Science - Hardware Architecture
Abstract: Incorporating fully homomorphic encryption (FHE) into the inference process of a convolutional neural network (CNN) draws enormous attention as a viable approach for achieving private inference (PI). FHE allows delegating the entire computation process to the server while ensuring the confidentiality of sensitive client-side data. However, practical FHE implementation of a CNN faces significant hurdles, primarily due to FHE's substantial computational and memory overhead. To address these challenges, we propose a set of optimizations, which includes GPU/ASIC acceleration, an efficient activation function, and an optimized packing scheme. We evaluate our method using the ResNet models on the CIFAR-10 and ImageNet datasets, achieving several orders of magnitude improvement compared to prior work and reducing the latency of the encrypted CNN inference to 1.4 seconds on an NVIDIA A100 GPU. We also show that the latency drops to a mere 0.03 seconds with a custom hardware design., Comment: 3 pages, 1 figure, appears at DISCC 2023 (2nd Workshop on Data Integrity and Secure Cloud Computing, in conjunction with the 56th International Symposium on Microarchitecture (MICRO 2023))
Published: 2023

6. CiFHER: A Chiplet-Based FHE Accelerator with a Resizable Structure

Author: Kim, Sangpyo, Kim, Jongmin, Choi, Jaeyoung, and Ahn, Jung Ho
Subjects: Computer Science - Hardware Architecture, Computer Science - Cryptography and Security
Abstract: Fully homomorphic encryption (FHE) is in the spotlight as a definitive solution for privacy, but the high computational overhead of FHE poses a challenge to its practical adoption. Although prior studies have attempted to design ASIC accelerators to mitigate the overhead, their designs require excessive chip resources (e.g., areas) to contain and process massive data for FHE operations. We propose CiFHER, a chiplet-based FHE accelerator with a resizable structure, to tackle the challenge with a cost-effective multi-chip module (MCM) design. First, we devise a flexible core architecture whose configuration is adjustable to conform to the global organization of chiplets and design constraints. Its distinctive feature is a composable functional unit providing varying computational throughput for the number-theoretic transform, the most dominant function in FHE. Then, we establish generalized data mapping methodologies to minimize the interconnect overhead when organizing the chips into the MCM package in a tiled manner, which becomes a significant bottleneck due to the packaging constraints. This study demonstrates that a CiFHER package composed of a number of compact chiplets provides performance comparable to state-of-the-art monolithic ASIC accelerators while significantly reducing the package-wide power consumption and manufacturing cost., Comment: 12 pages, 10 figures, to appear in 2024 International Symposium on Secure and Private Execution Environment Design (SEED)
Published: 2023

7. Corona: System Implications of Emerging Nanophotonic Technology

Author: Vantrease, Dana, Schreiber, Robert, Monchiero, Matteo, McLaren, Moray, Jouppi, Norman P., Fiorentino, Marco, Davis, Al, Binkert, Nathan, Beausoleil, Raymond G., and Ahn, Jung Ho
Subjects: Computer Science - Hardware Architecture, Computer Science - Emerging Technologies, Computer Science - Networking and Internet Architecture
Abstract: We expect that many-core microprocessors will push performance per chip from the 10 gigaflop to the 10 teraflop range in the coming decade. To support this increased performance, memory and inter-core bandwidths will also have to scale by orders of magnitude. Pin limitations, the energy cost of electrical signaling, and the non-scalability of chip-length global wires are significant bandwidth impediments. Recent developments in silicon nanophotonic technology have the potential to meet these off- and on- stack bandwidth requirements at acceptable power levels. Corona is a 3D many-core architecture that uses nanophotonic communication for both inter-core communication and off-stack communication to memory or I/O devices. Its peak floating-point performance is 10 teraflops. Dense wavelength division multiplexed optically connected memory modules provide 10 terabyte per second memory bandwidth. A photonic crossbar fully interconnects its 256 low-power multithreaded cores at 20 terabyte per second bandwidth. We have simulated a 1024 thread Corona system running synthetic benchmarks and scaled versions of the SPLASH-2 benchmark suite. We believe that in comparison with an electrically-connected many-core alternative that uses the same on-stack interconnect power, Corona can provide 2 to 6 times more performance on many memory-intensive workloads, while simultaneously reducing power., Comment: This edition is recompiled from proceedings of ISCA-35 (the 35th International Symposium on Computer Architecture, June 21 - 25, 2008, Beijing, China) and has minor formatting differences. 13 pages; 11 figures
Published: 2023

8. RETROSPECTIVE: Corona: System Implications of Emerging Nanophotonic Technology

Author: Vantrease, Dana, Schreiber, Robert, Monchiero, Matteo, McLaren, Moray, Jouppi, Norman P., Fiorentino, Marco, Davis, Al, Binkert, Nathan, Beausoleil, Raymond G., and Ahn, Jung Ho
Subjects: Computer Science - Hardware Architecture, Computer Science - Networking and Internet Architecture
Abstract: The 2008 Corona effort was inspired by a pressing need for more of everything, as demanded by the salient problems of the day. Dennard scaling was no longer in effect. A lot of computer architecture research was in the doldrums. Papers often showed incremental subsystem performance improvements, but at incommensurate cost and complexity. The many-core era was moving rapidly, and the approach with many simpler cores was at odds with the better and more complex subsystem publications of the day. Core counts were doubling every 18 months, while per-pin bandwidth was expected to double, at best, over the next decade. Memory bandwidth and capacity had to increase to keep pace with ever more powerful multi-core processors. With increasing core counts per die, inter-core communication bandwidth and latency became more important. At the same time, the area and power of electrical networks-on-chip were increasingly problematic: To be reliably received, any signal that traverses a wire spanning a full reticle-sized die would need significant equalization, re-timing, and multiple clock cycles. This additional time, area, and power was the crux of the concern, and things looked to get worse in the future. Silicon nanophotonics was of particular interest and seemed to be improving rapidly. This led us to consider taking advantage of 3D packaging, where one die in the 3D stack would be a photonic network layer. Our focus was on a system that could be built about a decade out. Thus, we tried to predict how the technologies and the system performance requirements would converge in about 2018. Corona was the result of this exercise; now, 15 years later, it's interesting to look back at the effort., Comment: 2 pages. Proceedings of ISCA-50: 50 years of the International Symposia on Computer Architecture (selected papers) June 17-21 Orlando, Florida
Published: 2023

9. X-ray: Discovering DRAM Internal Structure and Error Characteristics by Issuing Memory Commands

Author: Nam, Hwayong, Baek, Seungmin, Wi, Minbok, Kim, Michael Jaemin, Park, Jaehyun, Song, Chihun, Kim, Nam Sung, and Ahn, Jung Ho
Subjects: Computer Science - Cryptography and Security, Computer Science - Hardware Architecture
Abstract: The demand for accurate information about the internal structure and characteristics of dynamic random-access memory (DRAM) has been on the rise. Recent studies have explored the structure and characteristics of DRAM to improve processing in memory, enhance reliability, and mitigate a vulnerability known as rowhammer. However, DRAM manufacturers only disclose limited information through official documents, making it difficult to find specific information about actual DRAM devices. This paper presents reliable findings on the internal structure and characteristics of DRAM using activate-induced bitflips (AIBs), retention time test, and row-copy operation. While previous studies have attempted to understand the internal behaviors of DRAM devices, they have only shown results without identifying the causes or have analyzed DRAM modules rather than individual chips. We first uncover the size, structure, and operation of DRAM subarrays and verify our findings on the characteristics of DRAM. Then, we correct misunderstood information related to AIBs and demonstrate experimental results supporting the cause of rowhammer. We expect that the information we uncover about the structure, behavior, and characteristics of DRAM will help future DRAM research., Comment: 4 pages, 7 figures, accepted at IEEE Computer Architecture Letters
Published: 2023

10. Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices

Author: Sun, Yan, Yuan, Yifan, Yu, Zeduo, Kuper, Reese, Song, Chihun, Huang, Jinghan, Ji, Houxiang, Agarwal, Siddharth, Lou, Jiaqi, Jeong, Ipoom, Wang, Ren, Ahn, Jung Ho, Xu, Tianyin, and Kim, Nam Sung
Subjects: Computer Science - Performance, Computer Science - Hardware Architecture, C.4, D.4, C.0
Abstract: The ever-growing demands for memory with larger capacity and higher bandwidth have driven recent innovations on memory expansion and disaggregation technologies based on Compute eXpress Link (CXL). Especially, CXL-based memory expansion technology has recently gained notable attention for its ability not only to economically expand memory capacity and bandwidth but also to decouple memory technologies from a specific memory interface of the CPU. However, since CXL memory devices have not been widely available, they have been emulated using DDR memory in a remote NUMA node. In this paper, for the first time, we comprehensively evaluate a true CXL-ready system based on the latest 4th-generation Intel Xeon CPU with three CXL memory devices from different manufacturers. Specifically, we run a set of microbenchmarks not only to compare the performance of true CXL memory with that of emulated CXL memory but also to analyze the complex interplay between the CPU and CXL memory in depth. This reveals important differences between emulated CXL memory and true CXL memory, some of which will compel researchers to revisit the analyses and proposals from recent work. Next, we identify opportunities for memory-bandwidth-intensive applications to benefit from the use of CXL memory. Lastly, we propose a CXL-memory-aware dynamic page allocation policy, Caption, to more efficiently use CXL memory as a bandwidth expander. We demonstrate that Caption can automatically converge to an empirically favorable percentage of pages allocated to CXL memory, which improves the performance of memory-bandwidth-intensive applications by up to 24% when compared to the default page allocation policy designed for traditional NUMA systems., Comment: This paper has been accepted by MICRO'23. Please refer to the https://doi.org/10.1145/3613424.3614256 for the official version of this paper
Published: 2023

11. HyPHEN: A Hybrid Packing Method and Optimizations for Homomorphic Encryption-Based Neural Networks

Author: Kim, Donghwan, Park, Jaiyoung, Kim, Jongmin, Kim, Sangpyo, and Ahn, Jung Ho
Subjects: Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence
Abstract: Convolutional neural network (CNN) inference using fully homomorphic encryption (FHE) is a promising private inference (PI) solution due to the capability of FHE that enables offloading the whole computation process to the server while protecting the privacy of sensitive user data. Prior FHE-based CNN (HCNN) work has demonstrated the feasibility of constructing deep neural network architectures such as ResNet using FHE. Despite these advancements, HCNN still faces significant challenges in practicality due to the high computational and memory overhead. To overcome these limitations, we present HyPHEN, a deep HCNN construction that incorporates novel convolution algorithms (RAConv and CAConv), data packing methods (2D gap packing and PRCR scheme), and optimization techniques tailored to HCNN construction. Such enhancements enable HyPHEN to substantially reduce the memory footprint and the number of expensive homomorphic operations, such as ciphertext rotation and bootstrapping. As a result, HyPHEN brings the latency of HCNN CIFAR-10 inference down to a practical level at 1.4 seconds (ResNet-20) and demonstrates HCNN ImageNet inference for the first time at 14.7 seconds (ResNet-18)., Comment: 15 pages, 12 figures
Published: 2023

12. ARK: Fully Homomorphic Encryption Accelerator with Runtime Data Generation and Inter-Operation Key Reuse

Author: Kim, Jongmin, Lee, Gwangho, Kim, Sangpyo, Sohn, Gina, Kim, John, Rhu, Minsoo, and Ahn, Jung Ho
Subjects: Computer Science - Cryptography and Security, Computer Science - Hardware Architecture
Abstract: Homomorphic Encryption (HE) is one of the most promising post-quantum cryptographic schemes that enable privacy-preserving computation on servers. However, noise accumulates as we perform operations on HE-encrypted data, restricting the number of possible operations. Fully HE (FHE) removes this restriction by introducing the bootstrapping operation, which refreshes the data; however, FHE schemes are highly memory-bound. Bootstrapping, in particular, requires loading GBs of evaluation keys and plaintexts from off-chip memory, which makes FHE acceleration fundamentally bottlenecked by the off-chip memory bandwidth. In this paper, we propose ARK, an Accelerator for FHE with Runtime data generation and inter-operation Key reuse. ARK enables practical FHE workloads with a novel algorithm-architecture co-design to accelerate bootstrapping. We first eliminate the off-chip memory bandwidth bottleneck through runtime data generation and inter-operation key reuse. This approach enables ARK to fully exploit on-chip memory by substantially reducing the size of the working set. On top of such algorithmic enhancements, we build ARK microarchitecture that minimizes on-chip data movement through an efficient, alternating data distribution policy based on the data access patterns and a streamlined dataflow organization of the tailored functional units -- including base conversion, number-theoretic transform, and automorphism units. Overall, our co-design effectively handles the heavy computation and data movement overheads of FHE, drastically reducing the cost of HE operations, including bootstrapping., Comment: 18 pages, 9 figures
Published: 2022

13. Synthesis of biosurfactants from polyethylene waste via an integrated chemical and biological process

Author: Buhori, Achmad, Lee, Juwon, Cha, Min Ji, Ahn, Jung Ho, Han, Sung Ok, Choi, Jae-Wook, Kim, Kwang Ho, Ha, Jeong-Myeong, Gong, Gyeongtaek, and Yoo, Chun-Jae
Published: 2024

14. Sustainable production of microbial protein from carbon dioxide in the integrated bioelectrochemical system using recycled nitrogen sources

Author: Lee, Yeon Ji, Moon, Byeong Cheul, Lee, Dong Ki, Ahn, Jung Ho, Gong, Gyeongtaek, Um, Youngsoon, Lee, Sun-Mi, Kim, Kyoung Heon, and Ko, Ja Kyong
Published: 2025

15. AESPA: Accuracy Preserving Low-degree Polynomial Activation for Fast Private Inference

Author: Park, Jaiyoung, Kim, Michael Jaemin, Jung, Wonkyung, and Ahn, Jung Ho
Subjects: Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: Hybrid private inference (PI) protocol, which synergistically utilizes both multi-party computation (MPC) and homomorphic encryption, is one of the most prominent techniques for PI. However, even the state-of-the-art PI protocols are bottlenecked by the non-linear layers, especially the activation functions. Although a standard non-linear activation function can generate higher model accuracy, it must be processed via a costly garbled-circuit MPC primitive. A polynomial activation can be processed via Beaver's multiplication triples MPC primitive but has been incurring severe accuracy drops so far. In this paper, we propose an accuracy preserving low-degree polynomial activation function (AESPA) that exploits the Hermite expansion of the ReLU and basis-wise normalization. We apply AESPA to popular ML models, such as VGGNet, ResNet, and pre-activation ResNet, to show an inference accuracy comparable to those of the standard models with ReLU activation, achieving superior accuracy over prior low-degree polynomial studies. When applied to the all-ReLU baseline on the state-of-the-art Delphi PI protocol, AESPA shows up to 42.1x and 28.3x lower online latency and communication cost., Comment: 11 pages, 5 figures
Published: 2022
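
Illustrative note: the abstract of record 15 refers to the Hermite expansion of the ReLU. The Python sketch below numerically estimates the first three expansion coefficients under a standard normal input and builds a degree-2 polynomial from them. The choice of degree 2, the integration range, and all function names are assumptions made for this sketch. The basis-wise normalization mentioned in the abstract is not reproduced, so this is not the AESPA activation itself.

```python
import math


def relu(x):
    return max(0.0, x)


def hermite_basis(n, x):
    """Normalized probabilists' Hermite polynomials h_n = He_n / sqrt(n!) for n <= 2."""
    if n == 0:
        return 1.0
    if n == 1:
        return x
    if n == 2:
        return (x * x - 1.0) / math.sqrt(2.0)
    raise ValueError("only degrees 0..2 are implemented in this sketch")


def hermite_coefficient(n, lo=-10.0, hi=10.0, steps=20000):
    """Approximate E[relu(x) * h_n(x)] for x ~ N(0, 1) with the trapezoidal rule."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * width
        weight = 0.5 if i in (0, steps) else 1.0
        density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        total += weight * relu(x) * hermite_basis(n, x) * density
    return total * width


coefficients = [hermite_coefficient(n) for n in range(3)]


def poly_relu(x):
    """Degree-2 polynomial stand-in for ReLU built from the truncated expansion."""
    return sum(c * hermite_basis(n, x) for n, c in enumerate(coefficients))
```

A polynomial of this form needs only additions and multiplications, which is why the abstract pairs polynomial activations with multiplication-triple MPC primitives instead of garbled circuits.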

16. BTS: An Accelerator for Bootstrappable Fully Homomorphic Encryption

Author: Kim, Sangpyo, Kim, Jongmin, Kim, Michael Jaemin, Jung, Wonkyung, Rhu, Minsoo, Kim, John, and Ahn, Jung Ho
Subjects: Computer Science - Cryptography and Security, Computer Science - Hardware Architecture
Abstract: Homomorphic encryption (HE) enables the secure offloading of computations to the cloud by providing computation on encrypted data (ciphertexts). HE is based on noisy encryption schemes in which noise accumulates as more computations are applied to the data. The limited number of operations applicable to the data prevents practical applications from exploiting HE. Bootstrapping enables an unlimited number of operations or fully HE (FHE) by refreshing the ciphertext. Unfortunately, bootstrapping requires a significant amount of additional computation and memory bandwidth as well. Prior works have proposed hardware accelerators for computation primitives of FHE. However, to the best of our knowledge, this is the first to propose a hardware FHE accelerator that supports bootstrapping as a first-class citizen. In particular, we propose BTS - Bootstrappable, Technology-driven, Secure accelerator architecture for FHE. We identify the challenges of supporting bootstrapping in the accelerator and analyze the off-chip memory bandwidth and computation required. In particular, given the limitations of modern memory technology, we identify the HE parameter sets that are efficient for FHE acceleration. Based on the insights gained from our analysis, we propose BTS, which effectively exploits the parallelism innate in HE operations by arranging a massive number of processing elements in a grid. We present the design and microarchitecture of BTS, including a network-on-chip design that exploits a deterministic communication pattern. BTS shows 5,556x and 1,306x improved execution time on ResNet-20 and logistic regression over a CPU, with a chip area of 373.6mm^2 and up to 163.2W of power., Comment: 15 pages, 10 figures
Published: 2021

17. Biodegradation of oxidized low density polyethylene by Pelosinus fermentans lipase

Author: Kim, Do-Wook, Lim, Eui Seok, Lee, Ga Hyun, Son, Hyeoncheol Francis, Sung, Changmin, Jung, Jong-Hyun, Park, Hyun June, Gong, Gyeongtaek, Ko, Ja Kyong, Um, Youngsoon, Han, Sung Ok, and Ahn, Jung Ho
Published: 2024

18. Mithril: Cooperative Row Hammer Protection on Commodity DRAM Leveraging Managed Refresh

Author: Kim, Michael Jaemin, Park, Jaehyun, Park, Yeonhong, Doh, Wanju, Kim, Namhoon, Ham, Tae Jun, Lee, Jae W., and Ahn, Jung Ho
Subjects: Computer Science - Cryptography and Security, Computer Science - Hardware Architecture
Abstract: Since its public introduction in the mid-2010s, the Row Hammer (RH) phenomenon has drawn significant attention from the research community due to its security implications. Although many RH-protection schemes have been proposed by processor vendors, DRAM manufacturers, and academia, they still have shortcomings. Solutions implemented in the memory controller (MC) incur increasingly higher costs due to their conservative design for the worst case in terms of the number of DRAM banks and RH threshold to support. Meanwhile, DRAM-side implementation either has a limited time margin for RH-protection measures or requires extensive modifications to the standard DRAM interface. Recently, a new command for RH-protection has been introduced in the DDR5/LPDDR5 standards, referred to as refresh management (RFM). RFM enables the separation of the tasks for RH-protection to both MC and DRAM by having the former generate an RFM command at a specific activation frequency and the latter take proper RH-protection measures within a given time window. Although promising, no existing study presents and analyzes RFM-based solutions for RH-protection. In this paper, we propose Mithril, the first RFM interface-compatible, DRAM-MC cooperative RH-protection scheme providing deterministic protection guarantees. Mithril has minimal energy overheads for common use cases without adversarial memory access patterns. We also introduce Mithril+, an optional extension to provide minimal performance overheads at the expense of a tiny modification to the MC, while utilizing existing DRAM commands., Comment: 16 pages, to appear in HPCA 2022
Published: 2021
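
Illustrative note: the abstract of record 18 says that, under RFM, the memory controller generates an RFM command at a specific activation frequency. The Python sketch below models only that controller-side counting step with a per-bank counter. The class name, method name, and the interval of 4 activations are hypothetical. The sketch does not model the DRAM-side protection measures and is not the Mithril scheme.

```python
from collections import defaultdict


class RfmIssuer:
    """Per-bank activation counter on the memory-controller side (illustrative only)."""

    def __init__(self, activations_per_rfm: int):
        if activations_per_rfm <= 0:
            raise ValueError("activations_per_rfm must be positive")
        self.activations_per_rfm = activations_per_rfm
        self.counts = defaultdict(int)

    def on_activate(self, bank: int) -> bool:
        """Record one ACT to `bank`; return True when an RFM command should follow."""
        self.counts[bank] += 1
        if self.counts[bank] >= self.activations_per_rfm:
            self.counts[bank] = 0
            return True
        return False


# Hypothetical interval; not a value from the DDR5 standard or from the paper.
issuer = RfmIssuer(activations_per_rfm=4)
events = [issuer.on_activate(bank=0) for _ in range(8)]
```

Counting per bank keeps the controller-side state small, while the decision about which rows to protect after each RFM command stays inside the DRAM, matching the division of labor the abstract describes.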

19. Tailored polyhydroxyalkanoate production from renewable non-fatty acid carbon sources using engineered Cupriavidus necator H16

Author: Park, Soyoung, Roh, Soonjong, Yoo, Jin, Ahn, Jung Ho, Gong, Gyeongtaek, Lee, Sun-Mi, Um, Youngsoon, Han, Sung Ok, and Ko, Ja Kyong
Published: 2024

20. Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs

Author: Kim, Sangpyo, Jung, Wonkyung, Park, Jaiyoung, and Ahn, Jung Ho
Subjects: Computer Science - Cryptography and Security, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Homomorphic encryption (HE) draws huge attention as it provides a way of privacy-preserving computations on encrypted messages. Number Theoretic Transform (NTT), a specialized form of Discrete Fourier Transform (DFT) in the finite field of integers, is the key algorithm that enables fast computation on encrypted ciphertexts in HE. Prior works have accelerated NTT and its inverse transformation on a popular parallel processing platform, GPU, by leveraging DFT optimization techniques. However, these GPU-based studies lack a comprehensive analysis of the primary differences between NTT and DFT or only consider small HE parameters that have tight constraints in the number of arithmetic operations that can be performed without decryption. In this paper, we analyze the algorithmic characteristics of NTT and DFT and assess the performance of NTT when we apply the optimizations that are commonly applicable to both DFT and NTT on modern GPUs. From the analysis, we identify that NTT suffers from severe main-memory bandwidth bottleneck on large HE parameter sets. To tackle the main-memory bandwidth issue, we propose a novel NTT-specific on-the-fly root generation scheme dubbed on-the-fly twiddling (OT). Compared to the baseline radix-2 NTT implementation, after applying all the optimizations, including OT, we achieve 4.2x speedup on a modern GPU., Comment: 12 pages, 13 figures, to appear in IISWC 2020
Published: 2020
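
Illustrative note: the abstract of record 20 describes the NTT as a DFT over a finite field and proposes generating twiddle factors on the fly instead of loading them from memory. The Python sketch below is a plain recursive radix-2 NTT in which each twiddle factor is produced by a running modular multiplication, not by a table lookup. The modulus 998244353 with primitive root 3 is a common textbook choice and an assumption here. This simplification is not the paper's OT scheme and says nothing about GPU performance.

```python
MOD = 998244353      # prime with MOD - 1 divisible by 2**23, so power-of-two sizes work
PRIMITIVE_ROOT = 3   # a primitive root modulo MOD


def ntt(values, invert=False):
    """Recursive radix-2 NTT over integers modulo MOD (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return list(values)
    root = pow(PRIMITIVE_ROOT, (MOD - 1) // n, MOD)  # primitive n-th root of unity
    if invert:
        root = pow(root, MOD - 2, MOD)               # modular inverse via Fermat
    even = ntt(values[0::2], invert)
    odd = ntt(values[1::2], invert)
    out = [0] * n
    twiddle = 1                                      # generated on the fly, no table
    for k in range(n // 2):
        t = twiddle * odd[k] % MOD
        out[k] = (even[k] + t) % MOD
        out[k + n // 2] = (even[k] - t) % MOD
        twiddle = twiddle * root % MOD
    return out


def intt(values):
    """Inverse NTT: run the transform with inverted roots, then scale by 1/n."""
    n_inv = pow(len(values), MOD - 2, MOD)
    return [v * n_inv % MOD for v in ntt(values, invert=True)]


def multiply(a, b):
    """Multiply two polynomials (coefficient lists) modulo MOD using the NTT."""
    result_len = len(a) + len(b) - 1
    size = 1
    while size < result_len:
        size *= 2
    fa = ntt(a + [0] * (size - len(a)))
    fb = ntt(b + [0] * (size - len(b)))
    return intt([x * y % MOD for x, y in zip(fa, fb)])[:result_len]
```

The trade-off illustrated is extra multiplications in exchange for fewer memory reads, which is the direction the abstract argues for when main-memory bandwidth is the bottleneck.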

21. HEAAN Demystified: Accelerating Fully Homomorphic Encryption Through Architecture-centric Analysis and Optimization

Author: Jung, Wonkyung, Lee, Eojin, Kim, Sangpyo, Lee, Keewoo, Kim, Namhoon, Min, Chohong, Cheon, Jung Hee, and Ahn, Jung Ho
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Homomorphic Encryption (HE) draws significant attention as a privacy-preserving way for cloud computing because it allows computation on encrypted messages called ciphertexts. Among numerous HE schemes proposed, HE for Arithmetic of Approximate Numbers (HEAAN) is rapidly gaining popularity across a wide range of applications because it supports messages that can tolerate approximate computation with no limit on the number of arithmetic operations applicable to the corresponding ciphertexts. A critical shortcoming of HE is the high computation complexity of ciphertext arithmetic; especially, HE multiplication (HE Mul) is more than 10,000 times slower than the corresponding multiplication between unencrypted messages. This leads to a large body of HE acceleration studies, including ones exploiting FPGAs; however, those did not conduct a rigorous analysis of computational complexity and data access patterns of HE Mul. Moreover, the proposals mostly focused on designs with small parameter sizes, making it difficult to accurately estimate their performance in conducting a series of complex arithmetic operations. In this paper, we first describe how HE Mul of HEAAN is performed in a manner friendly to computer architects. Then we conduct a disciplined analysis on its computational and memory access characteristics, through which we (1) extract parallelism in the key functions composing HE Mul and (2) demonstrate how to effectively map the parallelism to the popular parallel processing platforms, multicore CPUs and GPUs, by applying a series of optimization techniques such as transposing matrices and pinning data to threads. This leads to the performance improvement of HE Mul on a CPU and a GPU by 42.9x and 134.1x, respectively, over the single-thread reference HEAAN running on a CPU. The conducted analysis and optimization would set a new foundation for future HE acceleration research.
Published: 2020
Full Text: View/download PDF

22. Current advancements in the bio-based production of polyamides

Author: Lee, Jong An, Kim, Ji Yeon, Ahn, Jung Ho, Ahn, Yeah-Ji, and Lee, Sang Yup
Published: 2023
Full Text: View/download PDF

23. Hexanoic acid improves the production of lipid and oleic acid in Yarrowia lipolytica: The benefit of integrating biorefinery with organic waste management

Author: Choi, Yeon-Ho, Son, Hyeoncheol Francis, Hwang, Sungmin, Kim, Jiwon, Ko, Ja Kyong, Gong, Gyeongtaek, Ahn, Jung Ho, Um, Youngsoon, Han, Sung Ok, and Lee, Sun-Mi
Published: 2023
Full Text: View/download PDF

24. Enhanced production of poly(3-hydroxybutyrate-co-3-hydroxyvalerate) with modulated 3-hydroxyvalerate fraction by overexpressing acetolactate synthase in Cupriavidus necator H16

Author: Jo, Young Yun, Park, Soyoung, Gong, Gyeongtaek, Roh, Soonjong, Yoo, Jin, Ahn, Jung Ho, Lee, Sun-Mi, Um, Youngsoon, Kim, Kyoung Heon, and Ko, Ja Kyong
Published: 2023
Full Text: View/download PDF

25. Effective hexanol production from carbon monoxide using extractive fermentation with Clostridium carboxidivorans P7

Author: Oh, Hyun Ju, Gong, Gyeongtaek, Ahn, Jung Ho, Ko, Ja Kyong, Lee, Sun-Mi, and Um, Youngsoon
Published: 2023
Full Text: View/download PDF

26. Restructuring Batch Normalization to Accelerate CNN Training

Author: Jung, Wonkyung, Jung, Daejin, Kim, Byeongho, Lee, Sunjung, Rhee, Wonjong, and Ahn, Jung Ho
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Performance
Abstract: Batch Normalization (BN) has become a core design block of modern Convolutional Neural Networks (CNNs). A typical modern CNN has a large number of BN layers in its lean and deep architecture. BN requires mean and variance calculations over each mini-batch during training. Therefore, the existing memory access reduction techniques, such as fusing multiple CONV layers, are not effective for accelerating BN due to their inability to optimize mini-batch related calculations during training. To address this increasingly important problem, we propose to restructure BN layers by first splitting a BN layer into two sub-layers (fission) and then combining the first sub-layer with its preceding CONV layer and the second sub-layer with the following activation and CONV layers (fusion). The proposed solution can significantly reduce main-memory accesses while training the latest CNN models, and the experiments on a chip multiprocessor show that the proposed BN restructuring can improve the performance of DenseNet-121 by 25.7%. Comment: 13 pages, 8 figures, to appear in SysML 2019, added ResNet-50 results
Published: 2018

27. Partitioning Compute Units in CNN Acceleration for Statistical Memory Traffic Shaping

Author: Jung, Daejin, Lee, Sunjung, Rhee, Wonjong, and Ahn, Jung Ho
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: The design complexity of CNNs has been steadily increasing to improve accuracy. To cope with the massive amount of computation needed for such complex CNNs, the latest solutions utilize blocking of an image over the available dimensions and batching of multiple input images to improve data reuse in the memory hierarchy. While there have been numerous works on maximizing data reuse, only a few studies have focused on the memory bottleneck caused by limited bandwidth. A bandwidth bottleneck can easily occur in CNN acceleration because CNN layers have different sizes with varying computation needs and because batching is typically performed over each CNN layer for ideal data reuse. In this case, the data transfer demand for a layer can be relatively low or high compared to the computation requirement of the layer, and hence temporal fluctuations in memory access can be induced, eventually causing bandwidth problems. In this paper, we first show that there exists a high degree of fluctuation in the memory-access-to-computation ratio depending on the CNN layers and the functions in the layer being processed by the compute units (cores), where the units are tightly synchronized to maximize data reuse. Then we propose a strategy of partitioning the compute units where the cores within each partition process a batch of input data synchronously to maximize data reuse but different partitions run asynchronously. As the partitions stay asynchronous and typically process different CNN layers at any given moment, the memory access traffic sizes of the partitions become statistically shuffled. Thus, partitioning the compute units and using them asynchronously smooths the total memory access traffic size over time. We call this smoothing statistical memory traffic shaping, and we show that it can lead to an 8.0 percent performance gain on a commercial 64-core processor when running ResNet-50. Comment: 4 pages, 6 figures, appears at IEEE Computer Architecture Letters
Published: 2018
Full Text: View/download PDF

28. GraNDe: Efficient Near-Data Processing Architecture for Graph Neural Networks

Author: Yun, Sungmin, Nam, Hwayong, Park, Jaehyun, Kim, Byeongho, Ahn, Jung Ho, and Lee, Eojin
Abstract: Graph Neural Network (GNN) models have attracted attention, given their high accuracy in interpreting graph data. One of the primary building blocks of a GNN model is aggregation, which gathers and averages the feature vectors corresponding to the nodes adjacent to each node. Aggregation works by multiplying the adjacency and feature matrices. The size of both matrices exceeds the on-chip cache capacity for many realistic datasets, and the adjacency matrix is highly sparse. These characteristics lead to little data reuse, causing intensive main-memory accesses during the aggregation process. Thus, aggregation exhibits memory-intensive characteristics and dominates most of the total execution time. In this paper, we propose GraNDe, an NDP architecture that accelerates memory-intensive aggregation operations by locating NDP modules near DRAM datapath to exploit rank-level parallelism. GraNDe maximizes bandwidth utilization by separating the memory channel path with the buffer chip in between so that pre-/post-processing in the host processor and reduction in NDP modules operate simultaneously. By exploring the preferred data mappings of the operand matrices to DRAM ranks, we architect GraNDe to support adaptive matrix mapping that applies the optimal mapping for each layer depending on the dimension of the layer and the configuration of a memory system. We also propose adj-bundle broadcasting and re-tiling optimizations to reduce the transfer time for adjacency matrix data and to improve feature vector data reusability by exploiting tiling with consideration of adjacency between nodes. GraNDe achieves 3.01× and 1.69× on average, and up to 4.00× and 1.98× speedups of GCN aggregation over the baseline system and the state-of-the-art NDP architecture for GCN, respectively.
Published: 2024
Full Text: View/download PDF

29. Metabolic engineering for the production of dicarboxylic acids and diamines

Author: Chae, Tong Un, Ahn, Jung Ho, Ko, Yoo-Sung, Kim, Je Woong, Lee, Jong An, Lee, Eon Hui, and Lee, Sang Yup
Published: 2020
Full Text: View/download PDF

30. HyPHEN: A Hybrid Packing Method and Its Optimizations for Homomorphic Encryption-based Neural Networks

Author: Kim, Donghwan, primary, Park, Jaiyoung, additional, Kim, Jongmin, additional, Kim, Sangpyo, additional, and Ahn, Jung Ho, additional
Published: 2024
Full Text: View/download PDF

31. Access Tracking and Probabilistic Promotion for CXL-Attached Memory

Author: Ko, Seoyoung, primary, Doh, Wanju, additional, Moon, Yaebin, additional, and Ahn, Jung-Ho, additional
Published: 2023
Full Text: View/download PDF

32. Escherichia coli is engineered to grow on CO2 and formic acid

Author: Bang, Junho, Hwang, Chang Hun, Ahn, Jung Ho, Lee, Jong An, and Lee, Sang Yup
Published: 2020
Full Text: View/download PDF

33. Separation and purification of three, four, and five carbon diamines from fermentation broth

Author: Lee, Jong An, Ahn, Jung Ho, Kim, Inho, Li, Sheng, and Lee, Sang Yup
Published: 2019
Full Text: View/download PDF

34. High-precision RNS-CKKS on fixed but smaller word-size architectures: theory and application

Author: Agrawal, Rashmi, primary, Ahn, Jung Ho, additional, Bergamaschi, Flavio, additional, Cammarota, Ro, additional, Cheon, Jung Hee, additional, D. M. de Souza, Fillipe, additional, Gong, Huijing, additional, Kang, Minsik, additional, Kim, Duhyeong, additional, Kim, Jongmin, additional, de Lassus, Hubert, additional, Park, Jai Hyun, additional, Steiner, Michael, additional, and Wang, Wen, additional
Published: 2023
Full Text: View/download PDF

35. How to Kill the Second Bird with One ECC: The Pursuit of Row Hammer Resilient DRAM

Author: Kim, Michael Jaemin, primary, Wi, Minbok, additional, Park, Jaehyun, additional, Ko, Seoyoung, additional, Choi, Jaeyoung, additional, Nam, Hwayong, additional, Kim, Nam Sung, additional, Ahn, Jung Ho, additional, and Lee, Eojin, additional
Published: 2023
Full Text: View/download PDF

36. Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices

Author: Sun, Yan, primary, Yuan, Yifan, additional, Yu, Zeduo, additional, Kuper, Reese, additional, Song, Chihun, additional, Huang, Jinghan, additional, Ji, Houxiang, additional, Agarwal, Siddharth, additional, Lou, Jiaqi, additional, Jeong, Ipoom, additional, Wang, Ren, additional, Ahn, Jung Ho, additional, Xu, Tianyin, additional, and Kim, Nam Sung, additional
Published: 2023
Full Text: View/download PDF

37. Biotechnological Plastic Degradation and Valorization Using Systems Metabolic Engineering

Author: Lee, Ga Hyun, primary, Kim, Do-Wook, additional, Jin, Yun Hui, additional, Kim, Sang Min, additional, Lim, Eui Seok, additional, Cha, Min Ji, additional, Ko, Ja Kyong, additional, Gong, Gyeongtaek, additional, Lee, Sun-Mi, additional, Um, Youngsoon, additional, Han, Sung Ok, additional, and Ahn, Jung Ho, additional
Published: 2023
Full Text: View/download PDF

38. A Study on an IoT Sensor-based Monitoring Algorithm for Caring for the Elderly Living Alone

Author: Chang, Kyuchang, primary, Choi, KwonTaeg, additional, and Ahn, Jung-Ho, additional
Published: 2023
Full Text: View/download PDF

39. SHARP: A Short-Word Hierarchical Accelerator for Robust and Practical Fully Homomorphic Encryption

Author: Kim, Jongmin, primary, Kim, Sangpyo, additional, Choi, Jaewan, additional, Park, Jaiyoung, additional, Kim, Donghwan, additional, and Ahn, Jung Ho, additional
Published: 2023
Full Text: View/download PDF

40. Organic Acids: Succinic and Malic Acids

Author: Lee, Jong An, primary, Ahn, Jung Ho, additional, and Lee, Sang Yup, additional
Published: 2019
Full Text: View/download PDF

41. Membrane engineering via trans-unsaturated fatty acids production improves succinic acid production in Mannheimia succiniciproducens

Author: Ahn, Jung Ho, Lee, Jong An, Bang, Junho, and Lee, Sang Yup
Published: 2018
Full Text: View/download PDF

42. The Effect of Additive Element on the Properties of Mechanically Alloyed Fe-Y2O3 Alloys

Author: Ahn Jung-Ho, Kim Tae Kyu, and Ahn Jeongsuk
Subjects: Mechanical alloying, ODS alloys, Nanoparticles, Oxide-dispersion Strengthening, High-energy ball milling, Mining engineering. Metallurgy, TN1-997, Materials of engineering and construction. Mechanics of materials, TA401-492
Abstract: In the present work, we have examined the effect of Ti on the properties of Fe-Y2O3 alloys. The result showed that the addition of Ti was effective for improving mechanical properties. This is due to the reduction of oxides by Ti during mechanical alloying and hot-consolidation. In particular, iron oxides are effectively reduced by the addition of Ti. Compared to the pristine Fe-Y2O3 alloys, titanium-added alloys exhibited fine and uniform microstructures, resulting in at least 60% higher tensile strength.
Published: 2017
Full Text: View/download PDF

43. CRISPR/deadCas9-based high-throughput gene doping analysis (HiGDA): A proof of concept for exogenous human erythropoietin gene doping detection

Author: Yi, Joon-Yeop, primary, Kim, Minyoung, additional, Ahn, Jung Ho, additional, Kim, Byung-Gee, additional, Son, Junghyun, additional, and Sung, Changmin, additional
Published: 2023
Full Text: View/download PDF

44. Homo-succinic acid production by metabolically engineered Mannheimia succiniciproducens

Author: Lee, Jeong Wook, Yi, Jongho, Kim, Tae Yong, Choi, Sol, Ahn, Jung Ho, Song, Hyohak, Lee, Moon-Hee, and Lee, Sang Yup
Published: 2016
Full Text: View/download PDF

45. A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models

Author: Li, Hailong, primary, Choi, Jaewan, additional, Kwon, Yongsuk, additional, and Ahn, Jung Ho, additional
Published: 2023
Full Text: View/download PDF

46. ADT: Aggressive Demotion and Promotion for Tiered Memory

Author: Moon, Yaebin, primary, Doh, Wanju, additional, Kyung, Kwanhee, additional, Lee, Eojin, additional, and Ahn, Jung Ho, additional
Published: 2023
Full Text: View/download PDF

47. Unleashing the Potential of PIM: Accelerating Large Batched Inference of Transformer-Based Generative Models

Author: Choi, Jaewan, primary, Park, Jaehyun, additional, Kyung, Kwanhee, additional, Kim, Nam Sung, additional, and Ahn, Jung Ho, additional
Published: 2023
Full Text: View/download PDF

48. X-ray: Discovering DRAM Internal Structure and Error Characteristics by Issuing Memory Commands

Author: Nam, Hwayong, primary, Baek, Seungmin, additional, Wi, Minbok, additional, Kim, Michael Jaemin, additional, Park, Jaehyun, additional, Song, Chihun, additional, Kim, Nam Sung, additional, and Ahn, Jung Ho, additional
Published: 2023
Full Text: View/download PDF

49. GraNDe: Efficient Near-Data Processing Architecture for Graph Neural Networks

Author: Yun, Sungmin, primary, Nam, Hwayong, additional, Park, Jaehyun, additional, Kim, Byeongho, additional, Ahn, Jung Ho, additional, and Lee, Eojin, additional
Published: 2023
Full Text: View/download PDF

50. MaPHeA: A Framework for Lightweight Memory Hierarchy-aware Profile-guided Heap Allocation

Author: Oh, Deok-Jae, primary, Moon, Yaebin, additional, Ham, Do Kyu, additional, Ham, Tae Jun, additional, Park, Yongjun, additional, Lee, Jae W., additional, Ahn, Jung Ho, additional, and Lee, Eojin, additional
Published: 2022
Full Text: View/download PDF
