Author: "Li, Hongkang" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Li, Hongkang"' showing total 17 results

1. Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis

Author: Li, Hongkang, Wang, Meng, Lu, Songtao, Cui, Xiaodong, and Chen, Pin-Yu
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability of large language models by augmenting the query using multiple examples with multiple intermediate steps. Despite the empirical success, the theoretical understanding of how to train a Transformer to achieve the CoT ability remains less explored. This is primarily due to the technical challenges involved in analyzing the nonconvex optimization on nonlinear attention models. To the best of our knowledge, this work provides the first theoretical study of training Transformers with nonlinear attention to obtain the CoT generalization capability so that the resulting model can perform inference on unseen tasks when the input is augmented by examples of the new task. We first quantify the required training samples and iterations to train a Transformer model towards CoT ability. We then prove the success of its CoT generalization on unseen tasks with distribution-shifted testing data. Moreover, we theoretically characterize the conditions for an accurate reasoning output by CoT even when the provided reasoning examples contain noise and are not always accurate. In contrast, in-context learning (ICL), which can be viewed as one-step CoT without intermediate steps, may fail to provide an accurate output when CoT does. These theoretical findings are justified through experiments.
Published: 2024

2. Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis

Author: Li, Hongkang, Wang, Meng, Zhang, Shuai, Liu, Sijia, and Chen, Pin-Yu
Subjects: Computer Science - Machine Learning
Abstract: Efficient training and inference algorithms, such as low-rank adaptation and model pruning, have shown impressive performance for learning Transformer-based large foundation models. However, due to the technical challenges of the non-convex optimization caused by the complicated architecture of Transformers, the theoretical study of why these methods can be applied to learn Transformers is mostly elusive. To the best of our knowledge, this paper shows the first theoretical analysis of the low-rank and sparsity properties of one-layer Transformers by characterizing the trained model after convergence using stochastic gradient descent. By focusing on a data model based on label-relevant and label-irrelevant patterns, we quantify that the gradient updates of trainable parameters are low-rank, which depends on the number of label-relevant patterns. We also analyze how model pruning affects the generalization while improving computation efficiency and conclude that proper magnitude-based pruning has a slight effect on the testing performance. We implement numerical experiments to support our findings.
Comment: IEEE SAM Workshop 2024
Published: 2024

3. What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding

Author: Li, Hongkang, Wang, Meng, Ma, Tengfei, Liu, Sijia, Zhang, Zaixi, and Chen, Pin-Yu
Subjects: Computer Science - Machine Learning
Abstract: Graph Transformers, which incorporate self-attention and positional encoding, have recently emerged as a powerful architecture for various graph learning tasks. Despite their impressive performance, the complex non-convex interactions across layers and the recursive graph structure have made it challenging to establish a theoretical foundation for learning and generalization. This study introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised node classification, comprising a self-attention layer with relative positional encoding and a two-layer perceptron. Focusing on a graph data model with discriminative nodes that determine node labels and non-discriminative nodes that are class-irrelevant, we characterize the sample complexity required to achieve a desirable generalization error by training with stochastic gradient descent (SGD). This paper provides the quantitative characterization of the sample complexity and number of iterations for convergence dependent on the fraction of discriminative nodes, the dominant patterns, and the initial model errors. Furthermore, we demonstrate that self-attention and positional encoding enhance generalization by making the attention map sparse and promoting the core neighborhood during training, which explains the superior feature representation of Graph Transformers. Our theoretical results are supported by empirical experiments on synthetic and real-world benchmarks.
Comment: ICML 2024
Published: 2024

4. Node Identifiers: Compact, Discrete Representations for Efficient Graph Learning

Author: Luo, Yuankai, Li, Hongkang, Liu, Qijiong, Shi, Lei, and Wu, Xiao-Ming
Subjects: Computer Science - Machine Learning
Abstract: We present a novel end-to-end framework that generates highly compact (typically 6-15 dimensions), discrete (int4 type), and interpretable node representations, termed node identifiers (node IDs), to tackle inference challenges on large-scale graphs. By employing vector quantization, we compress continuous node embeddings from multiple layers of a Graph Neural Network (GNN) into discrete codes, applicable under both self-supervised and supervised learning paradigms. These node IDs capture high-level abstractions of graph data and offer interpretability that traditional GNN embeddings lack. Extensive experiments on 34 datasets, encompassing node classification, graph classification, link prediction, and attributed graph clustering tasks, demonstrate that the generated node IDs significantly enhance speed and memory efficiency while achieving competitive performance compared to current state-of-the-art methods.
Published: 2024

5. How does promoting the minority fraction affect generalization? A theoretical study of the one-hidden-layer neural network on group imbalance

Author: Li, Hongkang, Zhang, Shuai, Zhang, Yihua, Wang, Meng, Liu, Sijia, and Chen, Pin-Yu
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Group imbalance has been a known problem in empirical risk minimization (ERM), where the achieved high average accuracy is accompanied by low accuracy in a minority group. Despite algorithmic efforts to improve the minority group accuracy, a theoretical generalization analysis of ERM on individual groups remains elusive. By formulating the group imbalance problem with the Gaussian Mixture Model, this paper quantifies the impact of individual groups on the sample complexity, the convergence rate, and the average and group-level testing performance. Although our theoretical framework is centered on binary classification using a one-hidden-layer neural network, to the best of our knowledge, we provide the first theoretical analysis of the group-level generalization of ERM in addition to the commonly studied average generalization performance. Sample insights of our theoretical results include that when all group-level covariances are in the medium regime and all means are close to zero, the learning performance is most desirable in the sense of a small sample complexity, a fast training rate, and a high average and group-level testing accuracy. Moreover, we show that increasing the fraction of the minority group in the training data does not necessarily improve the generalization performance of the minority group. Our theoretical results are validated on both synthetic and empirical datasets, such as CelebA and CIFAR-10 in image classification.
Published: 2024

6. How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?

Author: Li, Hongkang, Wang, Meng, Lu, Songtao, Cui, Xiaodong, and Chen, Pin-Yu
Subjects: Computer Science - Machine Learning
Abstract: Transformer-based large language models have displayed impressive in-context learning capabilities, where a pre-trained model can handle new tasks without fine-tuning by simply augmenting the query with some input-output examples from that task. Despite the empirical success, the mechanics of how to train a Transformer to achieve ICL and the corresponding ICL capacity are mostly elusive due to the technical challenges of analyzing the nonconvex training problems resulting from the nonlinear self-attention and nonlinear activation in Transformers. To the best of our knowledge, this paper provides the first theoretical analysis of the training dynamics of Transformers with nonlinear self-attention and nonlinear MLP, together with the ICL generalization capability of the resulting model. Focusing on a group of binary classification tasks, we train Transformers using data from a subset of these tasks and quantify the impact of various factors on the ICL generalization performance on the remaining unseen tasks with and without data distribution shifts. We also analyze how different components in the learned Transformers contribute to the ICL performance. Furthermore, we provide the first theoretical analysis of how model pruning affects ICL performance and prove that proper magnitude-based pruning can have a minimal impact on ICL while reducing inference costs. These theoretical findings are justified through numerical experiments.
Comment: ICML 2024
Published: 2024

7. On the Convergence and Sample Complexity Analysis of Deep Q-Networks with $\epsilon$-Greedy Exploration

Author: Zhang, Shuai, Li, Hongkang, Wang, Meng, Liu, Miao, Chen, Pin-Yu, Lu, Songtao, Liu, Sijia, Murugesan, Keerthiram, and Chaudhury, Subhajit
Subjects: Computer Science - Machine Learning
Abstract: This paper provides a theoretical understanding of Deep Q-Network (DQN) with the $\varepsilon$-greedy exploration in deep reinforcement learning. Despite the tremendous empirical achievement of the DQN, its theoretical characterization remains underexplored. First, the exploration strategy is either impractical or ignored in the existing analysis. Second, in contrast to conventional Q-learning algorithms, the DQN employs the target network and experience replay to acquire an unbiased estimation of the mean-square Bellman error (MSBE) utilized in training the Q-network. However, the existing theoretical analysis of DQNs lacks convergence analysis or bypasses the technical challenges by deploying a significantly overparameterized neural network, which is not computationally efficient. This paper provides the first theoretical convergence and sample complexity analysis of the practical setting of DQNs with $\epsilon$-greedy policy. We prove an iterative procedure with decaying $\epsilon$ converges to the optimal Q-value function geometrically. Moreover, a higher level of $\epsilon$ values enlarges the region of convergence but slows down the convergence, while the opposite holds for a lower level of $\epsilon$ values. Experiments justify our established theoretical insights on DQNs.
Published: 2023

8. How Can Context Help? Exploring Joint Retrieval of Passage and Personalized Context

Author: Wan, Hui, Li, Hongkang, Lu, Songtao, Cui, Xiaodong, and Danilevsky, Marina
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: The integration of external personalized context information into document-grounded conversational systems has significant potential business value, but has not been well-studied. Motivated by the concept of personalized context-aware document-grounded conversational systems, we introduce the task of context-aware passage retrieval. We also construct a dataset specifically curated for this purpose. We describe multiple baseline systems to address this task, and propose a novel approach, Personalized Context-Aware Search (PCAS), that effectively harnesses contextual information during passage retrieval. Experimental evaluations conducted on multiple popular dense retrieval systems demonstrate that our proposed approach not only outperforms the baselines in retrieving the most relevant passage but also excels at identifying the pertinent context among all the available contexts. We envision that our contributions will serve as a catalyst for inspiring future research endeavors in this promising direction.
Published: 2023

9. Enhancing Graph Transformers with Hierarchical Distance Structural Encoding

Author: Luo, Yuankai, Li, Hongkang, Shi, Lei, and Wu, Xiao-Ming
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Social and Information Networks
Abstract: Graph transformers need strong inductive biases to derive meaningful attention scores. Yet, current methods often fall short in capturing longer ranges, hierarchical structures, or community structures, which are common in various graphs such as molecules, social networks, and citation networks. This paper presents a Hierarchical Distance Structural Encoding (HDSE) method to model node distances in a graph, focusing on its multi-level, hierarchical nature. We introduce a novel framework to seamlessly integrate HDSE into the attention mechanism of existing graph transformers, allowing for simultaneous application with other positional encodings. To apply graph transformers with HDSE to large-scale graphs, we further propose a high-level HDSE that effectively biases the linear transformers towards graph hierarchies. We theoretically prove the superiority of HDSE over shortest path distances in terms of expressivity and generalization. Empirically, we demonstrate that graph transformers with HDSE excel in graph classification, regression on 7 graph-level datasets, and node classification on 11 large-scale graphs, including those with up to a billion nodes.
Published: 2023

10. A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity

Author: Li, Hongkang, Wang, Meng, Liu, Sijia, and Chen, Pin-Yu
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical success in many vision tasks. Due to non-convex interactions across layers, however, theoretical learning and generalization analysis is mostly elusive. Based on a data model characterizing both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a shallow ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity to achieve a zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error. We also prove that a training process using stochastic gradient descent (SGD) leads to a sparse attention map, which is a formal verification of the general intuition about the success of attention. Moreover, this paper indicates that a proper token sparsification can improve the test performance by removing label-irrelevant and/or noisy tokens, including spurious correlations. Empirical experiments on synthetic data and CIFAR-10 dataset justify our theoretical results and generalize to deeper ViTs.
Published: 2023

11. Learning and generalization of one-hidden-layer neural networks, going beyond standard Gaussian data

Author: Li, Hongkang, Zhang, Shuai, and Wang, Meng
Subjects: Computer Science - Machine Learning
Abstract: This paper analyzes the convergence and generalization of training a one-hidden-layer neural network when the input features follow the Gaussian mixture model consisting of a finite number of Gaussian distributions. Assuming the labels are generated from a teacher model with an unknown ground truth weight, the learning problem is to estimate the underlying teacher model by minimizing a non-convex risk function over a student neural network. With a finite number of training samples, referred to as the sample complexity, the iterations are proved to converge linearly to a critical point with guaranteed generalization error. In addition, for the first time, this paper characterizes the impact of the input distributions on the sample complexity and the learning rate.
Published: 2022

12. Generalization Guarantee of Training Graph Convolutional Networks with Graph Topology Sampling

Author: Li, Hongkang, Wang, Meng, Liu, Sijia, Chen, Pin-Yu, and Xiong, Jinjun
Subjects: Computer Science - Machine Learning
Abstract: Graph convolutional networks (GCNs) have recently achieved great empirical success in learning graph-structured data. To address its scalability issue due to the recursive embedding of neighboring features, graph topology sampling has been proposed to reduce the memory and computational cost of training GCNs, and it has achieved comparable test performance to those without topology sampling in many empirical studies. To the best of our knowledge, this paper provides the first theoretical justification of graph topology sampling in training (up to) three-layer GCNs for semi-supervised node classification. We formally characterize some sufficient conditions on graph topology sampling such that GCN training leads to a diminishing generalization error. Moreover, our method tackles the nonconvex interaction of weights across layers, which is under-explored in the existing theoretical analyses of GCNs. This paper characterizes the impact of graph structures and topology sampling on the generalization performance and sample complexity explicitly, and the theoretical findings are also justified through numerical experiments.
Published: 2022

13. How Can Personalized Context Help? Exploring Joint Retrieval of Passage and Personalized Context

Author: Wan, Hui, Li, Hongkang, Lu, Songtao, Cui, Xiaodong, and Danilevsky, Marina
Published: 2024

14. Colorimetric Aerogel Gas Sensor with High Sensitivity and Stability

Author: Xia, Xiaoli, Wu, Ruonan, Zhang, Lei, Chen, Xiangyu, Yan, Yanling, Yin, Jikun, Ren, Jin, Li, Hongkang, Yin, Jinzhong, Xue, Zhenjie, Yi, Lanhua, and Wang, Tie
Published: 2023

15. How Does Promoting the Minority Fraction Affect Generalization? A Theoretical Study of One-Hidden-Layer Neural Network on Group Imbalance

Author: Li, Hongkang, Zhang, Shuai, Zhang, Yihua, Wang, Meng, Liu, Sijia, and Chen, Pin-Yu
Abstract: Group imbalance has been a known problem in empirical risk minimization (ERM), where the achieved high average accuracy is accompanied by low accuracy in a minority group. Despite algorithmic efforts to improve the minority group accuracy, a theoretical generalization analysis of ERM on individual groups remains elusive. By formulating the group imbalance problem with the Gaussian Mixture Model, this paper quantifies the impact of individual groups on the sample complexity, the convergence rate, and the average and group-level testing performance. Although our theoretical framework is centered on binary classification using a one-hidden-layer neural network, to the best of our knowledge, we provide the first theoretical analysis of the group-level generalization of ERM in addition to the commonly studied average generalization performance. Sample insights of our theoretical results include that when all group-level covariances are in the medium regime and all means are close to zero, the learning performance is most desirable in the sense of a small sample complexity, a fast training rate, and a high average and group-level testing accuracy. Moreover, we show that increasing the fraction of the minority group in the training data does not necessarily improve the generalization performance of the minority group. Our theoretical results are validated on both synthetic and empirical datasets, such as CelebA and CIFAR-10 in image classification.
Published: 2024

16. Learning and generalization of one-hidden-layer neural networks, going beyond standard Gaussian data

Author: Li, Hongkang, Zhang, Shuai, and Wang, Meng
Published: 2022

17. How does promoting the minority fraction affect generalization? A theoretical study of one-hidden-layer neural network on group imbalance_supp1-3374593.pdf

Author: Li, Hongkang
