Author: "Xiao, Guangxuan" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Xiao, Guangxuan"' showing total 19 results

1. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Author: Xiao, Guangxuan, Tang, Jiaming, Zuo, Jingwei, Guo, Junxian, Yang, Shang, Tang, Haotian, Fu, Yao, and Han, Song
Subjects: Computer Science - Computation and Language
Abstract: Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, a.k.a. Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, which primarily focus on recent tokens and attention sinks--referred to as Streaming Heads--do not require full attention. Based on this insight, we introduce DuoAttention, a framework that only applies a full KV cache to retrieval heads while using a light-weight, constant-length KV cache for streaming heads, which reduces both LLM's decoding and pre-filling memory and latency without compromising its long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method significantly reduces long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models while speeding up decoding by up to 2.18x and 1.50x and accelerating pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.3 million context length on a single A100 GPU. Code is provided at https://github.com/mit-han-lab/duo-attention.
Published: 2024

2. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Author: Tang, Jiaming, Zhao, Yilong, Zhu, Kan, Xiao, Guangxuan, Kasikci, Baris, and Han, Song
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens will dominate the attention outcomes. However, we observe the criticality of a token highly depends on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest can achieve up to 2.23x self-attention speedup, which reduces inference latency by 7.03x while performing well on tasks with long dependencies with negligible accuracy loss. Code is available at http://github.com/mit-han-lab/Quest. Comment: ICML 2024
Published: 2024
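
The page-selection idea in the Quest abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`page_bounds`, `page_score`, `select_pages`) are hypothetical, and the scoring rule used here, a per-channel upper bound on the query-key dot product built from each page's minimal and maximal Key values, is one natural way to estimate criticality from that metadata.

```python
def page_bounds(keys):
    """Per-channel (min, max) over the key vectors stored in one KV cache page."""
    dims = range(len(keys[0]))
    mins = [min(k[d] for k in keys) for d in dims]
    maxs = [max(k[d] for k in keys) for d in dims]
    return mins, maxs


def page_score(query, mins, maxs):
    """Upper bound on the dot product q.k over every key in the page.

    For each channel, the largest possible contribution is reached at either
    the page minimum or the page maximum, depending on the sign of q.
    """
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))


def select_pages(query, pages, top_k):
    """Return the indices (in ascending order) of the top_k highest-scoring pages."""
    scores = [page_score(query, *page_bounds(page)) for page in pages]
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])
```

Only the selected pages would then be loaded for attention; because the score is an upper bound, a page holding a key with a large true attention logit cannot receive a score below that logit.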

3. QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Author: Lin, Yujun, Tang, Haotian, Yang, Shang, Zhang, Zhekai, Xiao, Guangxuan, Gan, Chuang, and Han, Song
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Performance
Abstract: Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library that achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in QoQ algorithm, we introduce progressive quantization that can allow low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100, 1.4x on L40S; and Qwen1.5-72B by 2.4x on A100, 3.5x on L40S, compared to TensorRT-LLM. Remarkably, QServe on L40S GPU can achieve even higher throughput than TensorRT-LLM on A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3x. 
Code is available at https://github.com/mit-han-lab/qserve. Comment: The first three authors contribute equally to this project and are listed in the alphabetical order. Yujun Lin leads the quantization algorithm, Haotian Tang and Shang Yang lead the GPU kernels and the serving system.
Published: 2024

4. Retrieval Head Mechanistically Explains Long-Context Factuality

Author: Wu, Wenhao, Wang, Yizhong, Xiao, Guangxuan, Peng, Hao, and Fu, Yao
Subjects: Computer Science - Computation and Language
Abstract: Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention heads are largely responsible for retrieving information, which we dub retrieval heads. We identify intriguing properties of retrieval heads: (1) universal: all the explored models with long-context capability have a set of retrieval heads; (2) sparse: only a small portion (less than 5%) of the attention heads are retrieval heads; (3) intrinsic: retrieval heads already exist in models pretrained with short context. When extending the context length by continual pretraining, it is still the same set of heads that perform information retrieval; (4) dynamically activated: take Llama-2 7B for example, 12 retrieval heads always attend to the required information no matter how the context is changed. The rest of the retrieval heads are activated in different contexts; (5) causal: completely pruning retrieval heads leads to failure in retrieving relevant information and results in hallucination, while pruning random non-retrieval heads does not affect the model's retrieval ability. We further show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously-generated context. Conversely, tasks where the model directly generates the answer using its intrinsic knowledge are less impacted by masking out retrieval heads. These observations collectively explain which internal part of the model seeks information from the input tokens. We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache. Comment: Preprint
Published: 2024

5. FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention

Author: Xiao, Guangxuan, Yin, Tianwei, Freeman, William T., Durand, Frédo, and Han, Song
Published: 2024

6. BitDelta: Your Fine-Tune May Only Be Worth One Bit

Author: Liu, James, Xiao, Guangxuan, Li, Kai, Lee, Jason D., Han, Song, Dao, Tri, and Cai, Tianle
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x, which can also be translated to enhanced generation latency in multi-tenant settings. We validate BitDelta through experiments across Llama-2 and Mistral model families, and on models up to 70B parameters, showcasing minimal performance degradation over all tested settings. Comment: NeurIPS 2024 acceptance
Published: 2024
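
As a rough illustration of the decomposition described in the BitDelta abstract (fine-tuned weights = pre-trained weights + delta, with the delta stored at 1 bit), the sketch below keeps only the sign of each delta entry plus a single scale per weight group. The abstract does not say how the scale is chosen; using the mean absolute delta is an assumption made here for illustration, and the function names are hypothetical.

```python
def bitdelta_compress(base, fine):
    """Compress fine-tuned weights into 1-bit signs plus one scalar scale.

    base, fine: flat lists of floats of equal length.
    The scale (mean absolute delta) is an illustrative assumption.
    """
    delta = [f - b for f, b in zip(fine, base)]
    scale = sum(abs(d) for d in delta) / len(delta)
    signs = [1 if d >= 0 else -1 for d in delta]  # one bit per weight
    return signs, scale


def bitdelta_reconstruct(base, signs, scale):
    """Approximate the fine-tuned weights from the shared base and a 1-bit delta."""
    return [b + scale * s for b, s in zip(base, signs)]
```

In a multi-tenant setting, one copy of `base` would be shared while each fine-tuned model contributes only its `signs` and `scale`, which is where the memory saving mentioned in the abstract would come from.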

7. InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

Author: Xiao, Chaojun, Zhang, Pengle, Han, Xu, Xiao, Guangxuan, Lin, Yankai, Zhang, Zhengyan, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Large language models (LLMs) have emerged as a cornerstone in real-world applications with lengthy streaming inputs (e.g., LLM-driven agents). However, existing LLMs, pre-trained on sequences with a restricted maximum length, cannot process longer sequences due to the out-of-domain and distraction issues. Common solutions often involve continual pre-training on longer sequences, which will introduce expensive computational overhead and uncontrollable change in model capabilities. In this paper, we unveil the intrinsic capacity of LLMs for understanding extremely long sequences without any fine-tuning. To this end, we introduce a training-free memory-based method, InfLLM. Specifically, InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to look up token-relevant units for attention computation. Thereby, InfLLM allows LLMs to efficiently process long sequences with a limited context window and well capture long-distance dependencies. Without any training, InfLLM enables LLMs that are pre-trained on sequences consisting of a few thousand tokens to achieve comparable performance with competitive baselines that continually train these LLMs on long sequences. Even when the sequence length is scaled to 1,024K, InfLLM still effectively captures long-distance dependencies. Our code can be found at https://github.com/thunlp/InfLLM.
Published: 2024

8. Efficient Streaming Language Models with Attention Sinks

Author: Xiao, Guangxuan, Tian, Yuandong, Chen, Beidi, Han, Song, and Lewis, Mike
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided at https://github.com/mit-han-lab/streaming-llm. Comment: ICLR 2024
Published: 2023
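
The cache policy that the StreamingLLM abstract describes (keep the KV of the initial "sink" tokens together with a window of the most recent tokens) can be illustrated by computing which token positions remain cached. This is a minimal sketch, not the released implementation; the function name and the default values of `num_sink` and `window` are illustrative assumptions.

```python
def streaming_cache_positions(seq_len, num_sink=4, window=8):
    """Token positions whose KV entries stay in the cache.

    Keeps the first `num_sink` positions (attention sinks) and the last
    `window` positions (recent tokens); everything in between is evicted.
    """
    if seq_len <= num_sink + window:
        # The cache is not full yet, so nothing is evicted.
        return list(range(seq_len))
    return list(range(num_sink)) + list(range(seq_len - window, seq_len))
```

Plain window attention corresponds to `num_sink=0`; per the abstract, it is the retention of the initial tokens that largely recovers performance once the text length exceeds the cache size. The cache length stays bounded by `num_sink + window` regardless of `seq_len`.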

9. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Author: Lin, Ji, Tang, Jiaming, Tang, Haotian, Yang, Shang, Chen, Wei-Ming, Wang, Wei-Chen, Xiao, Guangxuan, Dang, Xingyu, Gan, Chuang, and Han, Song
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resource pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not weights. To avoid the hardware-inefficient mix-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to scale the salient weight channels to protect them. The scale is determined by collecting the activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs. Comment: MLSys 2024 Best Paper Award. Code available at: https://github.com/mit-han-lab/llm-awq
Published: 2023

10. FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

Author: Xiao, Guangxuan, Yin, Tianwei, Freeman, William T., Durand, Frédo, and Han, Song
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to the subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend features among subjects. We present FastComposer which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity blending problem in the multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing the attention of reference subjects localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves 300x-2500x speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer. Comment: The first two authors contributed equally to this work
Published: 2023

11. Sparse and Local Networks for Hypergraph Reasoning

Author: Xiao, Guangxuan, Kaelbling, Leslie Pack, Wu, Jiajun, and Mao, Jiayuan
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: Reasoning about the relationships between entities from input facts (e.g., whether Ari is a grandparent of Charlie) generally requires explicit consideration of other entities that are not mentioned in the query (e.g., the parents of Charlie). In this paper, we present an approach for learning to solve problems of this kind in large, real-world domains, using sparse and local hypergraph neural networks (SpaLoc). SpaLoc is motivated by two observations from traditional logic-based reasoning: relational inferences usually apply locally (i.e., involve only a small number of individuals), and relations are usually sparse (i.e., only hold for a small percentage of tuples in a domain). We exploit these properties to make learning and inference efficient in very large domains by (1) using a sparse tensor representation for hypergraph neural networks, (2) applying a sparsification loss during training to encourage sparse representations, and (3) subsampling based on a novel information sufficiency-based sampling process during training. SpaLoc achieves state-of-the-art performance on several real-world, large-scale knowledge graph reasoning benchmarks, and is the first framework for applying hypergraph neural networks on real-world knowledge graphs with more than 10k nodes. Comment: Learning on Graphs Conference (LoG) 2022. Project page: https://spaloc.csail.mit.edu
Published: 2023

12. Offsite-Tuning: Transfer Learning without Full Model

Author: Xiao, Guangxuan, Lin, Ji, and Han, Song
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Transfer learning is important for foundation models to adapt to downstream tasks. However, many foundation models are proprietary, so users must share their data with model owners to fine-tune the models, which is costly and raises privacy concerns. Moreover, fine-tuning large foundation models is computation-intensive and impractical for most downstream users. In this paper, we propose Offsite-Tuning, a privacy-preserving and efficient transfer learning framework that can adapt billion-parameter foundation models to downstream data without access to the full model. In offsite-tuning, the model owner sends a light-weight adapter and a lossy compressed emulator to the data owner, who then fine-tunes the adapter on the downstream data with the emulator's assistance. The fine-tuned adapter is then returned to the model owner, who plugs it into the full model to create an adapted foundation model. Offsite-tuning preserves both parties' privacy and is computationally more efficient than the existing fine-tuning methods that require access to the full model weights. We demonstrate the effectiveness of offsite-tuning on various large language and vision foundation models. Offsite-tuning can achieve accuracy comparable to full model fine-tuning while being privacy-preserving and efficient, achieving 6.5x speedup and 5.6x memory reduction. Code is available at https://github.com/mit-han-lab/offsite-tuning.
Published: 2023

13. FreshGNN: Reducing Memory Access via Stable Historical Embeddings for Graph Neural Network Training

Author: Huang, Kezhao, Jiang, Haitian, Wang, Minjie, Xiao, Guangxuan, Wipf, David, Song, Xiang, Gan, Quan, Huang, Zengfeng, Zhai, Jidong, and Zhang, Zheng
Subjects: Computer Science - Machine Learning
Abstract: A key performance bottleneck when training graph neural network (GNN) models on large, real-world graphs is loading node features onto a GPU. Due to limited GPU memory, expensive data movement is necessary to facilitate the storage of these features on alternative devices with slower access (e.g. CPU memory). Moreover, the irregularity of graph structures contributes to poor data locality which further exacerbates the problem. Consequently, existing frameworks capable of efficiently training large GNN models usually incur a significant accuracy degradation because of the currently-available shortcuts involved. To address these limitations, we instead propose FreshGNN, a general-purpose GNN mini-batch training framework that leverages a historical cache for storing and reusing GNN node embeddings instead of re-computing them through fetching raw features at every iteration. Critical to its success, the corresponding cache policy is designed, using a combination of gradient-based and staleness criteria, to selectively screen those embeddings which are relatively stable and can be cached, from those that need to be re-computed to reduce estimation errors and subsequent downstream accuracy loss. When paired with complementary system enhancements to support this selective historical cache, FreshGNN is able to accelerate the training speed on large graph datasets such as ogbn-papers100M and MAG240M by 3.4x up to 20.5x and reduce the memory access by 59%, with less than 1% influence on test accuracy. Comment: Accepted by VLDB 2024
Published: 2023

14. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Author: Xiao, Guangxuan, Lin, Ji, Seznec, Mickael, Wu, Hao, Demouth, Julien, and Han, Song
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT, BLOOM, GLM, MT-NLG, Llama-1/2, Falcon, Mistral, and Mixtral models. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. SmoothQuant enables serving 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. Code is available at https://github.com/mit-han-lab/smoothquant. Comment: ICML 2023. First two authors contributed equally to this work
Published: 2022
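
The "mathematically equivalent transformation" mentioned in the SmoothQuant abstract can be illustrated on a single linear layer: dividing each activation channel by a per-channel factor and multiplying the matching weight row by the same factor leaves the product unchanged while shrinking activation outliers. The sketch below uses plain Python lists; the helper names are hypothetical, and the scale rule (activation maximum raised to alpha over weight maximum raised to 1 - alpha, with alpha = 0.5) should be read as an illustrative choice, since the abstract itself does not state it. It also assumes no channel is identically zero.

```python
def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Migrate quantization difficulty from activations x to weights w.

    x: (tokens, c_in) activations, w: (c_in, c_out) weights.
    Returns (x_smoothed, w_smoothed, scales) with x_smoothed @ w_smoothed == x @ w
    up to floating-point rounding. Assumes every channel has a nonzero maximum.
    """
    c_in = len(w)
    scales = []
    for j in range(c_in):
        act_max = max(abs(row[j]) for row in x)   # per-input-channel activation range
        w_max = max(abs(v) for v in w[j])         # range of the matching weight row
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    x_s = [[row[j] / scales[j] for j in range(c_in)] for row in x]
    w_s = [[v * scales[j] for v in w[j]] for j in range(c_in)]
    return x_s, w_s, scales
```

Because the scales are computed from activation statistics collected ahead of time, the weight side of the transformation can be applied offline, which matches the abstract's description of the migration as happening offline.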

15. Correction to: Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-level Backdoor Attacks

Author: Zhang, Zhengyan, Xiao, Guangxuan, Li, Yongwei, Lv, Tian, Qi, Fanchao, Liu, Zhiyuan, Wang, Yasheng, Jiang, Xin, and Sun, Maosong
Published: 2024

16. Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-Level Backdoor Attacks

Author: Zhang, Zhengyan, Xiao, Guangxuan, Li, Yongwei, Lv, Tian, Qi, Fanchao, Liu, Zhiyuan, Wang, Yasheng, Jiang, Xin, and Sun, Maosong
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Pre-trained models (PTMs) have been widely used in various downstream tasks. The parameters of PTMs are distributed on the Internet and may suffer backdoor attacks. In this work, we demonstrate the universal vulnerability of PTMs, where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary downstream tasks. Specifically, attackers can add a simple pre-training task, which restricts the output representations of trigger instances to pre-defined vectors, namely neuron-level backdoor attack (NeuBA). If the backdoor functionality is not eliminated during fine-tuning, the triggers can make the fine-tuned model predict fixed labels by pre-defined vectors. In the experiments of both natural language processing (NLP) and computer vision (CV), we show that NeuBA absolutely controls the predictions for trigger instances without any knowledge of downstream tasks. Finally, we apply several defense methods to NeuBA and find that model pruning is a promising direction to resist NeuBA by excluding backdoored neurons. Our findings sound a red alarm for the wide use of PTMs. Our source code and models are available at https://github.com/thunlp/NeuBA. Comment: Published in Machine Intelligence Research (https://link.springer.com/article/10.1007/s11633-022-1377-5)
Published: 2021

17. Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-level Backdoor Attacks

Author: Zhang, Zhengyan, Xiao, Guangxuan, Li, Yongwei, Lv, Tian, Qi, Fanchao, Liu, Zhiyuan, Wang, Yasheng, Jiang, Xin, and Sun, Maosong
Published: 2023

18. ReFresh: Reducing Memory Access from Exploiting Stable Historical Embeddings for Graph Neural Network Training

Author: Huang, Kezhao, Jiang, Haitian, Wang, Minjie, Xiao, Guangxuan, Wipf, David, Song, Xiang, Gan, Quan, Huang, Zengfeng, Zhai, Jidong, and Zhang, Zheng
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: A key performance bottleneck when training graph neural network (GNN) models on large, real-world graphs is loading node features onto a GPU. Due to limited GPU memory, expensive data movement is necessary to facilitate the storage of these features on alternative devices with slower access (e.g. CPU memory). Moreover, the irregularity of graph structures contributes to poor data locality which further exacerbates the problem. Consequently, existing frameworks capable of efficiently training large GNN models usually incur a significant accuracy degradation because of the inevitable shortcuts involved. To address these limitations, we instead propose ReFresh, a general-purpose GNN mini-batch training framework that leverages a historical cache for storing and reusing GNN node embeddings instead of re-computing them through fetching raw features at every iteration. Critical to its success, the corresponding cache policy is designed, using a combination of gradient-based and staleness criteria, to selectively screen those embeddings which are relatively stable and can be cached, from those that need to be re-computed to reduce estimation errors and subsequent downstream accuracy loss. When paired with complementary system enhancements to support this selective historical cache, ReFresh is able to accelerate the training speed on large graph datasets such as ogbn-papers100M and MAG240M by 4.6x up to 23.6x and reduce the memory access by 64.5% (85.7% higher than a raw feature cache), with less than 1% influence on test accuracy.
Published: 2023

19. Erratum to: Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-level Backdoor Attacks

Author: Zhang, Zhengyan, Xiao, Guangxuan, Li, Yongwei, Lv, Tian, Qi, Fanchao, Liu, Zhiyuan, Wang, Yasheng, Jiang, Xin, and Sun, Maosong
Published: 2024
