Author: "Zha, Sheng" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zha, Sheng"' showing total 105 results


1. Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning

Author: Sarkar, Soumajyoti, Lausen, Leonard, Cevher, Volkan, Zha, Sheng, Brox, Thomas, and Karypis, George
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling. These models use conditionally activated feedforward subnetworks in transformer blocks, allowing for a separation between total model parameters and per-example computation. However, large token-routed SMoE models face a significant challenge: during inference, the entire model must be used for a sequence or a batch, resulting in high latencies in a distributed setting that offsets the advantages of per-token sparse activation. Our research explores task-specific model pruning to inform decisions about designing SMoE architectures, mainly modulating the choice of expert counts in pretraining. We investigate whether such pruned models offer advantages over smaller SMoE models trained from scratch, when evaluating and comparing them individually on tasks. To that end, we introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training. Our findings reveal a threshold pruning factor for the reduction that depends on the number of experts used in pretraining, above which, the reduction starts to degrade model performance. These insights contribute to our understanding of model design choices when pretraining with SMoE architectures, particularly useful when considering task-specific inference optimization for later stages.
Published: 2024

2. DEM: Distribution Edited Model for Training with Mixed Data Distributions

Author: Ram, Dhananjay, Rawal, Aditya, Hardalov, Momchil, Pappas, Nikolaos, and Zha, Sheng
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, 68T50, F.2.2, I.2.7
Abstract: Training with mixed data distributions is a common and important part of creating multi-task and instruction-following models. The diversity of the data distributions and the cost of joint training make the optimization procedure extremely challenging. Data mixing methods partially address this problem, but they yield sub-optimal performance across data sources and require multiple expensive training runs. In this paper, we propose a simple and efficient alternative for better optimization of the data sources by combining models individually trained on each data source with the base model using basic element-wise vector operations. The resulting model, namely Distribution Edited Model (DEM), is 11x cheaper than standard data mixing and outperforms strong baselines on a variety of benchmarks, yielding up to 6.2% improvement on MMLU, 11.5% on BBH, 16.1% on DROP, 6% on MathQA, and 9.3% on HELM with models of size 3B to 13B. Notably, DEM does not require full re-training when modifying a single data source, thus making it very flexible and scalable for training with diverse data sources. Comment: Accepted to EMNLP 2024 (Main Conference)
Published: 2024
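
The "basic element-wise vector operations" mentioned in the abstract above can be illustrated with a minimal sketch. It assumes, hypothetically, that each individually trained model contributes its weighted difference from the base model; the function name `combine_models`, the dict-of-lists parameter layout, and the mixing weights are illustrative assumptions, not the authors' released method.

```python
def combine_models(base, finetuned, weights):
    """Add weighted (finetuned - base) differences to the base parameters.

    base:      dict mapping parameter name -> list of floats
    finetuned: list of dicts with the same keys and shapes as `base`
    weights:   one mixing coefficient per finetuned model (assumed, not from the source)
    """
    merged = {}
    for name, base_vec in base.items():
        out = list(base_vec)  # start from a copy of the base parameters
        for model, w in zip(finetuned, weights):
            for i, (p, b) in enumerate(zip(model[name], base_vec)):
                out[i] += w * (p - b)  # element-wise weighted difference
        merged[name] = out
    return merged


# Toy example with one two-element parameter vector.
base = {"w": [1.0, 2.0]}
model_a = {"w": [2.0, 2.0]}  # hypothetically trained on data source A
model_b = {"w": [1.0, 4.0]}  # hypothetically trained on data source B
merged = combine_models(base, [model_a, model_b], [0.5, 0.5])
```

In this form, replacing one data source only requires retraining that source's model and recombining, which is consistent with the abstract's remark that full re-training is not needed.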

3. Pre-training Differentially Private Models with Limited Public Data

Author: Bu, Zhiqi, Zhang, Xinwei, Hong, Mingyi, Zha, Sheng, and Karypis, George
Subjects: Computer Science - Machine Learning, Computer Science - Cryptography and Security
Abstract: The superior performance of large foundation models relies on the use of massive amounts of high-quality data, which often contain sensitive, private and copyrighted material that requires formal protection. While differential privacy (DP) is a prominent method to gauge the degree of security provided to the models, its application is commonly limited to the model fine-tuning stage, due to the performance degradation when applying DP during the pre-training stage. Consequently, DP is not yet capable of protecting a substantial portion of the data used during the initial pre-training process. In this work, we first provide a theoretical understanding of the efficacy of DP training by analyzing the per-iteration loss improvement. We make a key observation that DP optimizers' performance degradation can be significantly mitigated by the use of limited public data, which leads to a novel DP continual pre-training strategy. Empirically, using only 10% of public data, our strategy can achieve DP accuracy of 41.5% on ImageNet-21k (with ε = 8), as well as non-DP accuracy of 55.7% and 60.0% on downstream tasks Places365 and iNaturalist-2021, respectively, on par with state-of-the-art standard pre-training and substantially outperforming existing DP pre-trained models. Our DP pre-trained models are released in fastDP library (https://github.com/awslabs/fast-differential-privacy/releases/tag/v2.1). Comment: Accepted at NeurIPS 2024
Published: 2024

4. Extreme Miscalibration and the Illusion of Adversarial Robustness

Author: Raina, Vyas, Tan, Samson, Cevher, Volkan, Rawal, Aditya, Zha, Sheng, and Karypis, George
Subjects: Computer Science - Computation and Language
Abstract: Deep learning-based Natural Language Processing (NLP) models are vulnerable to adversarial attacks, where small perturbations can cause a model to misclassify. Adversarial Training (AT) is often used to increase model robustness. However, we have discovered an intriguing phenomenon: deliberately or accidentally miscalibrating models masks gradients in a way that interferes with adversarial attack search methods, giving rise to an apparent increase in robustness. We show that this observed gain in robustness is an illusion of robustness (IOR), and demonstrate how an adversary can perform various forms of test-time temperature calibration to nullify the aforementioned interference and allow the adversarial attack to find adversarial examples. Hence, we urge the NLP community to incorporate test-time temperature scaling into their robustness evaluations to ensure that any observed gains are genuine. Finally, we show how the temperature can be scaled during training to improve genuine robustness.
Published: 2024
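
As a rough illustration of the test-time temperature scaling referred to in the abstract above, the sketch below divides logits by a temperature before applying softmax. The function name and the toy logits are assumptions for illustration only; how the paper chooses the temperature is not described in this record.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be positive)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# A model with extreme logits is almost fully saturated at temperature 1,
# which leaves very little gradient signal for an attack search to follow.
logits = [100.0, 0.0]
saturated = softmax_with_temperature(logits, 1.0)
# Rescaling at test time with a larger temperature softens the distribution.
softened = softmax_with_temperature(logits, 100.0)
```

Dividing by a positive temperature does not change which class has the highest probability, so the predicted label is unaffected; only the confidence, and therefore the gradient magnitude, changes.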

5. Zero redundancy distributed learning with differential privacy

Author: Bu, Zhiqi, Chiu, Justin, Liu, Ruixuan, Zha, Sheng, and Karypis, George
Subjects: Computer Science - Machine Learning, Computer Science - Computational Complexity, Computer Science - Cryptography and Security, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Deep learning using large models has achieved great success in a wide range of domains. However, training these models on billions of parameters is very challenging in terms of the training speed, memory cost, and communication efficiency, especially under the privacy-preserving regime with differential privacy (DP). On the one hand, DP optimization has comparable efficiency to the standard non-private optimization on a single GPU, but on multiple GPUs, existing DP distributed learning (such as pipeline parallel) has suffered from significantly worse efficiency. On the other hand, the Zero Redundancy Optimizer (ZeRO) is a state-of-the-art solution to the standard distributed learning, exhibiting excellent training efficiency on large models, but making it work compatibly with DP is technically complicated. In this work, we develop a new systematic solution, DP-ZeRO, (I) to scale up the trainable DP model size, e.g. to GPT-100B, (II) to obtain the same computation and communication efficiency as the standard ZeRO, and (III) to enable mixed-precision DP training. Our DP-ZeRO, like the standard ZeRO, has the potential to train models with arbitrary size and is evaluated on the world's largest DP models in terms of the number of trainable parameters.
Published: 2023

6. On the accuracy and efficiency of group-wise clipping in differentially private optimization

Author: Bu, Zhiqi, Liu, Ruixuan, Wang, Yu-Xiang, Zha, Sheng, and Karypis, George
Subjects: Computer Science - Machine Learning, Computer Science - Computational Complexity, Computer Science - Cryptography and Security
Abstract: Recent advances have substantially improved the accuracy, memory cost, and training speed of differentially private (DP) deep learning, especially on large vision and language models with millions to billions of parameters. In this work, we thoroughly study the per-sample gradient clipping style, a key component in DP optimization. We show that different clipping styles have the same time complexity but instantiate an accuracy-memory trade-off: while the all-layer clipping (of coarse granularity) is the most prevalent and usually gives the best accuracy, it incurs heavier memory cost compared to other group-wise clipping, such as the layer-wise clipping (of finer granularity). We formalize this trade-off through our convergence theory and complexity analysis. Importantly, we demonstrate that the accuracy gap between group-wise clipping and all-layer clipping becomes smaller for larger models, while the memory advantage of the group-wise clipping remains. Consequently, the group-wise clipping allows DP optimization of large models to achieve high accuracy and low peak memory simultaneously.
Published: 2023

7. Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Author: Zhang, Qingru, Ram, Dhananjay, Hawkins, Cole, Zha, Sheng, and Zhao, Tuo
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Pretrained transformer models have demonstrated remarkable performance across various natural language processing tasks. These models leverage the attention mechanism to capture long- and short-range dependencies in the sequence. However, the (full) attention mechanism incurs high computational cost - quadratic in the sequence length, which is not affordable in tasks with long sequences, e.g., inputs with 8k tokens. Although sparse attention can be used to improve computational efficiency, as suggested in existing work, it has limited modeling capacity and often fails to capture complicated dependencies in long sequences. To tackle this challenge, we propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans. Specifically, MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers. For the remaining layers, MASFormer only employs sparse attention to capture short-range dependencies. Our experiments on natural language modeling and generation tasks show that a decoder-only MASFormer model of 1.3B parameters can achieve competitive performance to vanilla transformers with full attention while significantly reducing computational cost (up to 75%). Additionally, we investigate the effectiveness of continual training with long sequence data and how sequence length impacts downstream generation performance, which may be of independent interest. Comment: The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023 Findings)
Published: 2023

8. Coupling public and private gradient provably helps optimization

Author: Liu, Ruixuan, Bu, Zhiqi, Wang, Yu-Xiang, Zha, Sheng, and Karypis, George
Subjects: Computer Science - Machine Learning
Abstract: The success of large neural networks is crucially determined by the availability of data. It has been observed that training only on a small amount of public data, or privately on the abundant private data can lead to undesirable degradation of accuracy. In this work, we leverage both private and public data to improve the optimization, by coupling their gradients via a weighted linear combination. We formulate an optimal solution for the optimal weight in the convex setting to indicate that the weighting coefficient should be hyperparameter-dependent. Then, we prove the acceleration in the convergence of non-convex loss and the effects of hyper-parameters such as privacy budget, number of iterations, batch size, and model size on the choice of the weighting coefficient. We support our analysis with empirical experiments across language and vision benchmarks, and provide a guideline for choosing the optimal weight of the gradient coupling., Comment: 12 pages
Published: 2023

9. HYTREL: Hypergraph-enhanced Tabular Data Representation Learning

Author: Chen, Pei, Sarkar, Soumajyoti, Lausen, Leonard, Srinivasan, Balasubramaniam, Zha, Sheng, Huang, Ruihong, and Karypis, George
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Language models pretrained on large collections of tabular data have demonstrated their effectiveness in several downstream tasks. However, many of these models do not take into account the row/column permutation invariances, hierarchical structure, etc. that exist in tabular data. To alleviate these limitations, we propose HYTREL, a tabular language model, that captures the permutation invariances and three more structural properties of tabular data by using hypergraphs - where the table cells make up the nodes and the cells occurring jointly together in each row, column, and the entire table are used to form three different types of hyperedges. We show that HYTREL is maximally invariant under certain conditions for tabular data, i.e., two tables obtain the same representations via HYTREL iff the two tables are identical up to permutations. Our empirical results demonstrate that HYTREL consistently outperforms other competitive baselines on four downstream tasks with minimal pretraining, illustrating the advantages of incorporating the inductive biases associated with tabular data into the representations. Finally, our qualitative analyses showcase that HYTREL can assimilate the table structures to generate robust representations for the cells, rows, columns, and the entire table., Comment: NeurIPS 2023 (spotlight)
Published: 2023

10. Large Language Models of Code Fail at Completing Code with Potential Bugs

Author: Dinh, Tuan, Zhao, Jinman, Tan, Samson, Negrinho, Renato, Lausen, Leonard, Zha, Sheng, and Karypis, George
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Software Engineering
Abstract: Large language models of code (Code-LLMs) have recently brought tremendous advances to code completion, a fundamental feature of programming assistance and code intelligence. However, most existing works ignore the possible presence of bugs in the code context for generation, which are inevitable in software development. Therefore, we introduce and study the buggy-code completion problem, inspired by the realistic scenario of real-time code suggestion where the code context contains potential bugs -- anti-patterns that can become bugs in the completed program. To systematically study the task, we introduce two datasets: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). We find that the presence of potential bugs significantly degrades the generation performance of the high-performing Code-LLMs. For instance, the passing rates of CODEGEN-2B-MONO on test cases of buggy-HumanEval drop more than 50% given a single potential bug in the context. Finally, we investigate several post-hoc methods for mitigating the adverse effect of potential bugs and find that there remains a significant gap in post-mitigation performance., Comment: 27 pages, accepted to NeurIPS 2023
Published: 2023

11. Better Context Makes Better Code Language Models: A Case Study on Function Call Argument Completion

Author: Pei, Hengzhi, Zhao, Jinman, Lausen, Leonard, Zha, Sheng, and Karypis, George
Subjects: Computer Science - Software Engineering, Computer Science - Machine Learning, I.2.2, I.2.7
Abstract: Pretrained code language models have enabled great progress towards program synthesis. However, common approaches only consider in-file local context and thus miss information and constraints imposed by other parts of the codebase and its external dependencies. Existing code completion benchmarks also lack such context. To resolve these restrictions we curate a new dataset of permissively licensed Python packages that includes full projects and their dependencies and provide tools to extract non-local information with the help of program analyzers. We then focus on the task of function call argument completion which requires predicting the arguments to function calls. We show that existing code completion models do not yield good results on our completion task. To better solve this task, we query a program analyzer for information relevant to a given function call, and consider ways to provide the analyzer results to different code completion models during inference and training. Our experiments show that providing access to the function implementation and function usages greatly improves the argument completion performance. Our ablation study provides further insights on how different types of information available from the program analyzer and different ways of incorporating the information affect the model performance., Comment: 12 pages. Accepted to AAAI 2023
Published: 2023

12. Parameter and Data Efficient Continual Pre-training for Robustness to Dialectal Variance in Arabic

Author: Sarkar, Soumajyoti, Lin, Kaixiang, Sengupta, Sailik, Lausen, Leonard, Zha, Sheng, and Mansour, Saab
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: The use of multilingual language models for tasks in low- and high-resource languages has been a success story in deep learning. In recent times, Arabic has been receiving widespread attention on account of its dialectal variance. While prior research studies have tried to adapt these multilingual models for dialectal variants of Arabic, it still remains a challenging problem owing to the lack of sufficient monolingual dialectal data and parallel translation data of such dialectal variants. It remains an open problem whether the limited dialectal data can be used to improve the models trained in Arabic on its dialectal variants. First, we show that multilingual-BERT (mBERT) incrementally pretrained on Arabic monolingual data takes less training time and yields comparable accuracy when compared to our custom monolingual Arabic model and beats existing models (by an avg metric of +6.41). We then explore two continual pre-training methods -- (1) using small amounts of dialectal data for continual finetuning and (2) parallel Arabic to English data and a Translation Language Modeling loss function. We show that both approaches help improve performance on dialectal classification tasks (+4.64 avg. gain) when used on monolingual models.
Published: 2022

13. Differentially Private Optimization on Large Model at Small Cost

Author: Bu, Zhiqi, Wang, Yu-Xiang, Zha, Sheng, and Karypis, George
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition
Abstract: Differentially private (DP) optimization is the standard paradigm to learn large neural networks that are accurate and privacy-preserving. The computational cost for DP deep learning, however, is notoriously heavy due to the per-sample gradient clipping. Existing DP implementations are 2-1000X more costly in time and space complexity than the standard (non-private) training. In this work, we develop a novel Book-Keeping (BK) technique that implements existing DP optimizers (thus achieving the same accuracy), with a substantial improvement on the computational cost. Specifically, BK enables DP training on large models and high dimensional data to be roughly as fast and memory-saving as the standard training, whereas previous DP algorithms can be inefficient or incapable of training due to memory error. The computational advantage of BK is supported by the complexity analysis as well as extensive experiments on vision and language tasks. Our implementation achieves state-of-the-art (SOTA) accuracy with very small extra cost: on GPT2 and at almost the same memory cost (<1% overhead), BK has 1.03X the time complexity of the standard training (0.83X training speed in practice), and 0.61X the time complexity of the most efficient DP implementation (1.36X training speed in practice). We open-source the codebase for the BK algorithm at the FastDP library (https://github.com/awslabs/fast-differential-privacy).
Published: 2022

14. Differentially Private Bias-Term Fine-tuning of Foundation Models

Author: Bu, Zhiqi, Wang, Yu-Xiang, Zha, Sheng, and Karypis, George
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition
Abstract: We study the problem of differentially private (DP) fine-tuning of large pre-trained models -- a recent privacy-preserving approach suitable for solving downstream tasks with sensitive data. Existing work has demonstrated that high accuracy is possible under strong privacy constraint, yet requires significant computational overhead or modifications to the network architecture. We propose differentially private bias-term fine-tuning (DP-BiTFiT), which matches the state-of-the-art accuracy for DP algorithms and the efficiency of the standard BiTFiT. DP-BiTFiT is model agnostic (not modifying the network architecture), parameter efficient (only training about 0.1% of the parameters), and computation efficient (almost removing the overhead caused by DP, in both the time and space complexity). On a wide range of tasks, DP-BiTFiT is 2~30X faster and uses 2~8X less memory than DP full fine-tuning, even faster than the standard full fine-tuning. This amazing efficiency enables us to conduct DP fine-tuning on language and vision tasks with long-sequence texts and high-resolution images, which were computationally difficult using existing methods. We open-source our code at FastDP (https://github.com/awslabs/fast-differential-privacy)., Comment: Accepted at ICML 2024
Published: 2022

15. Automatic Clipping: Differentially Private Deep Learning Made Easier and Stronger

Author: Bu, Zhiqi, Wang, Yu-Xiang, Zha, Sheng, and Karypis, George
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition
Abstract: Per-example gradient clipping is a key algorithmic step that enables practical differentially private (DP) training for deep learning models. The choice of clipping threshold R, however, is vital for achieving high accuracy under DP. We propose an easy-to-use replacement, called automatic clipping, that eliminates the need to tune R for any DP optimizer, including DP-SGD, DP-Adam, DP-LAMB and many others. The automatic variants are as private and computationally efficient as existing DP optimizers, but require no DP-specific hyperparameters and thus make DP training as amenable as the standard non-private training. We give a rigorous convergence analysis of automatic DP-SGD in the non-convex setting, showing that it can enjoy an asymptotic convergence rate that matches the standard SGD, under a symmetric gradient noise assumption of the per-sample gradients (commonly used in the non-DP literature). We demonstrate on various language and vision tasks that automatic clipping outperforms or matches the state-of-the-art, and can be easily employed with minimal changes to existing codebases. Comment: accepted to NeurIPS 2023
Published: 2022
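
To make the contrast in the abstract above concrete, the sketch below compares a conventional per-sample clipping factor, which depends on the threshold R, with a normalization-style factor that has no R. The normalization form `1 / (norm + gamma)` and the default `gamma` are assumptions made for illustration, since the record does not state the formula; noise addition, which any DP optimizer also needs, is deliberately omitted.

```python
import math


def l2_norm(grad):
    """Euclidean norm of one per-sample gradient given as a list of floats."""
    return math.sqrt(sum(x * x for x in grad))


def clip_factor_standard(grad, R):
    """Conventional clipping: rescale so the gradient norm is at most R."""
    norm = l2_norm(grad)
    return 1.0 if norm <= R else R / norm


def clip_factor_automatic(grad, gamma=0.01):
    """Assumed normalization-style factor with no threshold R.

    gamma is a small stability constant (hypothetical default).
    """
    return 1.0 / (l2_norm(grad) + gamma)


def clipped_sum(per_sample_grads, factor_fn):
    """Sum of rescaled per-sample gradients (noise omitted in this sketch)."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for g in per_sample_grads:
        c = factor_fn(g)
        for i in range(dim):
            total[i] += c * g[i]
    return total


grads = [[3.0, 4.0], [0.3, 0.4]]
standard = clipped_sum(grads, lambda g: clip_factor_standard(g, R=1.0))
automatic = clipped_sum(grads, clip_factor_automatic)
```

With the normalization-style factor, every rescaled per-sample gradient has norm below 1 regardless of its original size, so the sensitivity bound no longer depends on a tuned threshold.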

16. Exploring the Role of Task Transferability in Large-Scale Multi-Task Learning

Author: Padmakumar, Vishakh, Lausen, Leonard, Ballesteros, Miguel, Zha, Sheng, He, He, and Karypis, George
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Recent work has found that multi-task training with a large number of diverse tasks can uniformly improve downstream performance on unseen target tasks. In contrast, literature on task transferability has established that the choice of intermediate tasks can heavily affect downstream task performance. In this work, we aim to disentangle the effect of scale and relatedness of tasks in multi-task representation learning. We find that, on average, increasing the scale of multi-task learning, in terms of the number of tasks, indeed results in better learned representations than smaller multi-task setups. However, if the target tasks are known ahead of time, then training on a smaller set of related tasks is competitive to the large-scale multi-task training at a reduced computational cost., Comment: NAACL 2022 - Camera ready version
Published: 2022

17. Characterization and biological activity evaluation of water-soluble resveratrol complexes obtained by spray drying, ball milling and jet milling

Author: Yang, Tian-Xiao, Li, Hang, Zhu, Yuan, Gao, Yu, Lv, Hong-Ning, Zha, Sheng-Hua, Sun, Xiao-Li, and Zhao, Qing-Sheng
Published: 2024

18. Meta-learning via Language Model In-context Tuning

Author: Chen, Yanda, Zhong, Ruiqi, Zha, Sheng, Karypis, George, and He, He
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: The goal of meta-learning is to learn to adapt to a new task with only a few labeled examples. To tackle this problem in NLP, we propose in-context tuning, which recasts adaptation and prediction as a simple sequence prediction problem: to form the input sequence, we concatenate the task instruction, the labeled examples, and the target input to predict; to meta-train the model to learn from in-context examples, we fine-tune a pre-trained language model (LM) to predict the target label from the input sequences on a collection of tasks. We benchmark our method on two collections of text classification tasks: LAMA and BinaryClfs. Compared to first-order MAML which adapts the model with gradient descent, our method better leverages the inductive bias of LMs to perform pattern matching, and outperforms MAML by an absolute 6% AUC-ROC score on BinaryClfs, with increasing advantage w.r.t. model size. Compared to non-fine-tuned in-context learning (i.e. prompting a raw LM), in-context tuning directly learns to learn from in-context examples. On BinaryClfs, in-context tuning improves the average AUC-ROC score by an absolute 10%, and reduces the variance with respect to example ordering by 6x and example choices by 2x. Comment: ACL 2022 camera-ready version
Published: 2021

19. Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes

Author: Zheng, Shuai, Lin, Haibin, Zha, Sheng, and Li, Mu
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning
Abstract: BERT has recently attracted a lot of attention in natural language understanding (NLU) and achieved state-of-the-art results in various NLU tasks. However, its success requires large deep neural networks and huge amounts of data, which result in long training time and impede development progress. Using stochastic gradient methods with large mini-batches has been advocated as an efficient tool to reduce the training time. Along this line of research, LAMB is a prominent example that reduces the training time of BERT from 3 days to 76 minutes on a TPUv3 Pod. In this paper, we propose an accelerated gradient method called LANS to improve the efficiency of using large mini-batches for training. As the learning rate is theoretically upper bounded by the inverse of the Lipschitz constant of the function, one cannot always reduce the number of optimization iterations by selecting a larger learning rate. In order to use larger mini-batch sizes without accuracy loss, we develop a new learning rate scheduler that overcomes the difficulty of using a large learning rate. Using the proposed LANS method and the learning rate scheme, we scaled up the mini-batch sizes to 96K and 33K in phases 1 and 2 of BERT pretraining, respectively. It takes 54 minutes on 192 AWS EC2 P3dn.24xlarge instances to achieve a target F1 score of 90.5 or higher on SQuAD v1.1, achieving the fastest BERT training time in the cloud. Comment: Technical Report (not under review at any venue)
Published: 2020

20. β-cyclodextrin-based nanosponges for crocetin delivery: Physicochemical characterization, aqueous solubility, and bioactivity

Author: Li, Hang, Cui, Ming-Yue, Zha, Sheng-Hua, Tian, Rong-Rong, and Zhao, Qing-Sheng
Published: 2023

21. Unlearn Dataset Bias in Natural Language Inference by Fitting the Residual

Author: He, He, Zha, Sheng, and Wang, Haohan
Subjects: Computer Science - Computation and Language
Abstract: Statistical natural language inference (NLI) models are susceptible to learning dataset bias: superficial cues that happen to associate with the label on a particular dataset, but are not useful in general, e.g., negation words indicate contradiction. As exposed by several recent challenge datasets, these models perform poorly when such association is absent, e.g., predicting that "I love dogs" contradicts "I don't love cats". Our goal is to design learning algorithms that guard against known dataset bias. We formalize the concept of dataset bias under the framework of distribution shift and present a simple debiasing algorithm based on residual fitting, which we call DRiFt. We first learn a biased model that only uses features that are known to relate to dataset bias. Then, we train a debiased model that fits to the residual of the biased model, focusing on examples that cannot be predicted well by biased features only. We use DRiFt to train three high-performing NLI models on two benchmark datasets, SNLI and MNLI. Our debiased models achieve significant gains over baseline models on two challenge test sets, while maintaining reasonable performance on the original test sets., Comment: DeepLo at EMNLP 2019
Published: 2019

22. GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing

Author: Guo, Jian, He, He, He, Tong, Lausen, Leonard, Li, Mu, Lin, Haibin, Shi, Xingjian, Wang, Chenguang, Xie, Junyuan, Zha, Sheng, Zhang, Aston, Zhang, Hang, Zhang, Zhi, Zhang, Zhongyue, Zheng, Shuai, and Zhu, Yi
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: We present GluonCV and GluonNLP, the deep learning toolkits for computer vision and natural language processing based on Apache MXNet (incubating). These toolkits provide state-of-the-art pre-trained models, training scripts, and training logs, to facilitate rapid prototyping and promote reproducible research. We also provide modular APIs with flexible building blocks to enable efficient customization. Leveraging the MXNet ecosystem, the deep learning models in GluonCV and GluonNLP can be deployed onto a variety of platforms with different programming languages. The Apache 2.0 license has been adopted by GluonCV and GluonNLP to allow for software distribution, modification, and usage.
Published: 2019

23. Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources

Author: Lin, Haibin, Zhang, Hang, Ma, Yifei, He, Tong, Zhang, Zhi, Zha, Sheng, and Li, Mu
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning
Abstract: With an increasing demand for training power for deep learning algorithms and the rapid growth of computation resources in data centers, it is desirable to dynamically schedule different distributed deep learning tasks to maximize resource utilization and reduce cost. In this process, different tasks may receive varying numbers of machines at different times, a setting we call elastic distributed training. Despite the recent successes in large mini-batch distributed training, these methods are rarely tested in elastic distributed training environments and suffer degraded performance in our experiments, when we adjust the learning rate linearly immediately with respect to the batch size. One difficulty we observe is that the noise in the stochastic momentum estimation is accumulated over time and will have delayed effects when the batch size changes. We therefore propose to smoothly adjust the learning rate over time to alleviate the influence of the noisy momentum estimation. Our experiments on image classification, object detection and semantic segmentation have demonstrated that our proposed Dynamic SGD method achieves stabilized performance when varying the number of GPUs from 8 to 128. We also provide theoretical understanding on the optimality of linear learning rate scheduling and the effects of stochastic momentum.
Published: 2019

24. Just-in-Time Dynamic-Batching

Author: Zha, Sheng, Jiang, Ziheng, Lin, Haibin, and Zhang, Zhi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Databases
Abstract: Batching is an essential technique to improve computation efficiency in deep learning frameworks. While batch processing for models with static feed-forward computation graphs is straightforward to implement, batching for dynamic computation graphs such as syntax trees or social network graphs is challenging due to variable computation graph structure across samples. Through simulation and analysis of a Tree-LSTM model, we show the key trade-off between graph analysis time and batching effectiveness in dynamic batching. Based on this finding, we propose a dynamic batching method as an extension to MXNet Gluon's just-in-time compilation (JIT) framework. We show empirically that our method yields up to 6.25 times speed-up on a common dynamic workload, a tree-LSTM model for the semantic relatedness task.
Comment: NeurIPS 2018 Systems for ML Workshop
Published: 2019

25. Preparation of DES lignin-chitosan aerogel and its adsorption performance for dyes, catechin and epicatechin

Author: Zhu, Yuan, Qi, Ben-Kun, Lv, Hong-Ning, Gao, Yu, Zha, Sheng-Hua, An, Rong-Yan, Zhao, Qing-Sheng, and Zhao, Bing
Published: 2023
Full Text: View/download PDF

26. Question Type Guided Attention in Visual Question Answering

Author: Shi, Yang, Furlanello, Tommaso, Zha, Sheng, and Anandkumar, Animashree
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual Question Answering (VQA) requires integration of feature maps with drastically different structures and focus of the correct regions. Image descriptors have structures at multiple spatial scales, while lexical inputs inherently follow a temporal sequence and naturally cluster into semantically different question types. A lot of previous works use complex models to extract feature representations but neglect to use high-level information summary such as question types in learning. In this work, we propose Question Type-guided Attention (QTA). It utilizes the information of question type to dynamically balance between bottom-up and top-down visual features, respectively extracted from ResNet and Faster R-CNN networks. We experiment with multiple VQA architectures with extensive input ablation studies over the TDIUC dataset and show that QTA systematically improves the performance by more than 5% across multiple question type categories such as "Activity Recognition", "Utility" and "Counting" on TDIUC dataset. By adding QTA on the state-of-art model MCB, we achieve 3% improvement for overall accuracy. Finally, we propose a multi-task extension to predict question types which generalizes QTA to applications that lack question type, with minimal performance loss.
Published: 2018

27. Effect of particle size on the physicochemical and antioxidant properties of Forsythia suspensa (Thunb.) Vahl leaf powders

Author: Weng, Di, Zha, Sheng-Hua, Zhu, Yuan, Li, Hang, Hou, Shou-Bu, Zhao, Qing-Sheng, and Zhao, Bing
Published: 2022
Full Text: View/download PDF

28. Better Context Makes Better Code Language Models: A Case Study on Function Call Argument Completion

Author: Pei, Hengzhi, primary, Zhao, Jinman, additional, Lausen, Leonard, additional, Zha, Sheng, additional, and Karypis, George, additional
Published: 2023
Full Text: View/download PDF

29. Preparation and characterisation of wheat starch-based aerogels for procyanidin encapsulation to enhance stability.

Author: Yang, Tian-Xiao, Li, Hang, Zhu, Yuan, Gao, Yu, Lv, Hong-Ning, Zha, Sheng-Hua, Sun, Xiao-Li, and Zhao, Qing-Sheng
Subjects: PROCYANIDINS, WHEAT starch, AEROGELS, LIGHT sources, WHEAT
Abstract: Procyanidins (PC) are formed by the polymerisation of flavan-3-ol monomers, which have excellent bioactivity and present great health benefits. However, the application range of the PC is greatly limited due to their poor stability. Therefore, in this study, starch aerogels were used as carriers to encapsulate PC. After screening, wheat starch aerogel (WSA) was finally chosen to encapsulate PC. Afterward, characterization (SEM, XRD, FT-IR, TGA and BET), stability testing, investigations on antioxidant activity, and in vitro tests simulating digestion were carried out. The results of relevant characterization showed that the micro surface structure of the procyanidin wheat starch aerogel (PC-WSA) presented a network structure, and PC were encapsulated in WSA. In addition, PC-WSA allows PC to be released in the final stage of simulated digestion. Meanwhile, the results of antioxidant experiments indicate that PC-WSA exhibits antioxidant activity after encapsulating PC. Moreover, the retention of PC in PC-WSA was still 53.63 ± 3.23% when stored at 70 °C for 20 days. After 6 days of light exposure, the retention of PC in PC-WSA under different light sources was all above 85%, which was higher than that of PC. In conclusion, the wheat starch aerogel could improve the stability of PC and has great potential for application. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

30. Preparation of phillyrin/cyclodextrin inclusion complexes and study of their physical properties, solubility enhancement, molecular docking and antioxidant activity.

Author: Qin, Qiao, Zhao, Qing-Sheng, Li, Hang, Ren, Yu-Heng, Zha, Sheng-hua, Tian, Rong-Rong, Li, Jing, and Hou, Shou-bu
Subjects: INCLUSION compounds, MOLECULAR docking, OXIDANT status, FUNCTIONAL foods, ANTIOXIDANTS, CYCLODEXTRINS, SOLUBILITY
Abstract: Phillyrin has good biological activity, but it is insoluble in water, which restricts its use in various industries. In our work, we prepared phillyrin using a green and rapid method from Forsythia suspensa leaves, and its purity is greater than 93%. Then, phillyrin was made to interact with different kinds of cyclodextrins (CDs) as a way to improve the water solubility of phillyrin, and the interaction mechanism between CDs and phillyrin was speculated by various characterization methods. First, phase solubility experiments showed that the phase solubility curves of all three CD inclusion complexes were AL-shaped, indicating that all three inclusion complexes were prepared using phillyrin with CDs in a 1 : 1 stoichiometric ratio. And based on this result, three cyclodextrin inclusion compounds, phillyrin/HP-β-CD (complexation efficiency (CE): 82.58%, the loading efficiency (LE): 22.27%), phillyrin/DM-β-CD IC (CE: 89.31%, LE: 24.83%) and phillyrin/β-CD IC (CE: 68.59%, LE: 23.67%), were prepared. Then, phillyrin was successfully encapsulated as determined by various characterization analyses, after which the conformations of the inclusion complex interactions were analyzed by NMR and molecular docking. The result also indicated that the water solubility of the phillyrin/HP-β-CD, phillyrin/DM-β-CD IC and phillyrin/β-CD IC has significantly improved, which was 30.02, 41.91 and 14.56 times higher than that of pure phillyrin. And the inclusion of CDs also considerably improved the antioxidant capacity of phillyrin, and the DPPH clearance rates of the three inclusion compounds were 4.34, 4.46 and 4.79 times higher than those of pure phillyrin, respectively, and the ABTS clearance rates were 1.35, 1.70 and 1.62 times higher than that of pure phillyrin. The prepared inclusion complexes can be applied in the functional food industry as active ingredients. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

31. Characteristics of Geochemistry and Depositional Environment of Terrestrial Shales in the First Member of Qingshankou Formation of Changling Fault Depression in the Southern Songliao Basin.

Author: Zha Chenxu, Li Zhrnigcheng, Gu Shicha, Ba Zhiding, Wei Zha sheng, Li Lei, and Wang Hailmi
Published: 2023
Full Text: View/download PDF

32. Effect of superfine grinding on the physico‐chemical and antioxidant properties of Cistanche deserticola powders

Author: Hou, Shoubu, primary, Zhao, Qing‐Sheng, additional, Chang, Senlin, additional, Li, Hang, additional, Zha, Sheng‐hua, additional, Li, Qiushuang, additional, Weng, Di, additional, Qin, Qiao, additional, Zhao, Bing, additional, and Zhang, Jinyu, additional
Published: 2023
Full Text: View/download PDF

33. Question Type Guided Attention in Visual Question Answering

Author: Shi, Yang, primary, Furlanello, Tommaso, additional, Zha, Sheng, additional, and Anandkumar, Animashree, additional
Published: 2018
Full Text: View/download PDF

34. Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Author: Zhang, Qingru, primary, Ram, Dhananjay, additional, Hawkins, Cole, additional, Zha, Sheng, additional, and Zhao, Tuo, additional
Published: 2023
Full Text: View/download PDF

35. Python Array API Standard: Toward Array Interoperability in the Scientific Python Ecosystem

Author: Meurer, Aaron, primary, Reines, Athan, additional, Gommers, Ralf, additional, Fang, Yao-Lung, additional, Kirkham, John, additional, Barber, Matthew, additional, Hoyer, Stephan, additional, Müller, Andreas, additional, Zha, Sheng, additional, Shanabrook, Saul, additional, Gacha, Stephannie, additional, Lezcano-Casado, Mario, additional, Fan, Thomas, additional, Reddy, Tyler, additional, Passos, Alexandre, additional, Kwon, Hyukjin, additional, Oliphant, Travis, additional, and Standards, Consortium, additional
Published: 2023
Full Text: View/download PDF

36. Identification of maca (Lepidium meyenii Walp.) and its adulterants by a DNA-barcoding approach based on the ITS sequence

Author: CHEN, Jin-Jin, ZHAO, Qing-Sheng, LIU, Yi-Lan, ZHA, Sheng-Hua, and ZHAO, Bing
Published: 2015
Full Text: View/download PDF

37. Differentially Private Bias-Term only Fine-tuning of Foundation Models

Author: Bu, Zhiqi, Wang, Yu-Xiang, Zha, Sheng, and Karypis, George
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Cryptography and Security, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL), Cryptography and Security (cs.CR), Machine Learning (cs.LG)
Abstract: We study the problem of differentially private (DP) fine-tuning of large pre-trained models -- a recent privacy-preserving approach suitable for solving downstream tasks with sensitive data. Existing work has demonstrated that high accuracy is possible under strong privacy constraint, yet requires significant computational overhead or modifications to the network architecture. We propose differentially private bias-term fine-tuning (DP-BiTFiT), which matches the state-of-the-art accuracy for DP algorithms and the efficiency of the standard BiTFiT. DP-BiTFiT is model agnostic (not modifying the network architecture), parameter efficient (only training about $0.1\%$ of the parameters), and computation efficient (almost removing the overhead caused by DP, in both the time and space complexity). On a wide range of tasks, DP-BiTFiT is $2\sim 30\times$ faster and uses $2\sim 8\times$ less memory than DP full fine-tuning, even faster than the standard full fine-tuning. This amazing efficiency enables us to conduct DP fine-tuning on language and vision tasks with long-sequence texts and high-resolution images, which were computationally difficult using existing methods.
Published: 2022

38. Green Extraction of Forsythoside A, Phillyrin and Phillygenol from Forsythia suspensa Leaves Using a β-Cyclodextrin-Assisted Method

Author: Li, Jing, primary, Qin, Qiao, additional, Zha, Sheng-Hua, additional, Zhao, Qing-Sheng, additional, Li, Hang, additional, Liu, Lu-Peng, additional, Hou, Shou-Bu, additional, and Zhao, Bing, additional
Published: 2022
Full Text: View/download PDF

39. Meta-learning via Language Model In-context Tuning

Author: Chen, Yanda, Zhong, Ruiqi, Zha, Sheng, Karypis, George, and He, He
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: The goal of meta-learning is to learn to adapt to a new task with only a few labeled examples. To tackle this problem in NLP, we propose $\textit{in-context tuning}$, which recasts adaptation and prediction as a simple sequence prediction problem: to form the input sequence, we concatenate the task instruction, the labeled examples, and the target input to predict; to meta-train the model to learn from in-context examples, we fine-tune a pre-trained language model (LM) to predict the target label from the input sequences on a collection of tasks. We benchmark our method on two collections of text classification tasks: LAMA and BinaryClfs. Compared to first-order MAML which adapts the model with gradient descent, our method better leverages the inductive bias of LMs to perform pattern matching, and outperforms MAML by an absolute $6\%$ AUC ROC score on BinaryClfs, with increasing advantage w.r.t. model size. Compared to non-fine-tuned in-context learning (i.e. prompting a raw LM), in-context tuning directly learns to learn from in-context examples. On BinaryClfs, in-context tuning improves the average AUC-ROC score by an absolute $10\%$, and reduces the variance with respect to example ordering by 6x and example choices by 2x.
Comment: ACL 2022 camera-ready version
Published: 2022

40. Exploring the Role of Task Transferability in Large-Scale Multi-Task Learning

Author: Padmakumar, Vishakh, primary, Lausen, Leonard, additional, Ballesteros, Miguel, additional, Zha, Sheng, additional, He, He, additional, and Karypis, George, additional
Published: 2022
Full Text: View/download PDF

41. Context, Language Modeling, and Multimodal Data in Finance

Author: Das, Sanjiv, primary, Goggins, Connor, additional, He, John, additional, Karypis, George, additional, Krishnamurthy, Sandeep, additional, Mahajan, Mitali, additional, Prabhala, Nagpurnanand, additional, Slack, Dylan, additional, van Dusen, Rob, additional, Yue, Shenghua, additional, Zha, Sheng, additional, and Zheng, Shuai, additional
Published: 2021
Full Text: View/download PDF

42. Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing

Author: He, Haoyu, primary, Shi, Xingjian, additional, Mueller, Jonas, additional, Zha, Sheng, additional, Li, Mu, additional, and Karypis, George, additional
Published: 2021
Full Text: View/download PDF

43. Multi-wavelength filters of templated blue phase liquid crystal

Author: ZHA Sheng-hao, 查升毫, primary, SUN Chang-li, 孙长俐, additional, FENG Yi-fan, 冯一凡, additional, and LU Jian-gang, 陆建钢, additional
Published: 2019
Full Text: View/download PDF

44. Unlearn Dataset Bias in Natural Language Inference by Fitting the Residual

Author: He, He, primary, Zha, Sheng, additional, and Wang, Haohan, additional
Published: 2019
Full Text: View/download PDF

45. Question Type Guided Attention in Visual Question Answering

Author: Ferrari, Vittorio, Hebert, Martial, Sminchisescu, Cristian, Weiss, Yair, Shi, Yang, Furlanello, Tommaso, Zha, Sheng, and Anandkumar, Animashree
Abstract: Visual Question Answering (VQA) requires integration of feature maps with drastically different structures. Image descriptors have structures at multiple spatial scales, while lexical inputs inherently follow a temporal sequence and naturally cluster into semantically different question types. A lot of previous works use complex models to extract feature representations but neglect to use high-level information summary such as question types in learning. In this work, we propose Question Type-guided Attention (QTA). It utilizes the information of question type to dynamically balance between bottom-up and top-down visual features, respectively extracted from ResNet and Faster R-CNN networks. We experiment with multiple VQA architectures with extensive input ablation studies over the TDIUC dataset and show that QTA systematically improves the performance by more than 5% across multiple question type categories such as “Activity Recognition”, “Utility” and “Counting” on TDIUC dataset compared to the state-of-art. By adding QTA on the state-of-art model MCB, we achieve 3% improvement in overall accuracy. Finally, we propose a multi-task extension to predict question types which generalizes QTA to applications that lack question type, with a minimal performance loss.
Published: 2018

46. Additional file 1: of REM sleep behavior disorder was associated with Parkinson’s disease: a community-based study

Author: Ma, Jian-Fang, Hou, Miao-Miao, Tang, Hui-Dong, Gao, Xiang, Liang, Liang, Zhu, Li-Fang, Zhou, Yi, Zha, Sheng-Yu, Cui, Shi-Shuang, Du, Juan-Juan, Liu, Jun, and Chen, Sheng-Di
Abstract: RBD single questionnaire. (DOCX 13 kb)
Published: 2016
Full Text: View/download PDF

47. REM sleep behavior disorder was associated with Parkinson’s disease: a community-based study

Author: Ma, Jian-Fang, primary, Hou, Miao-Miao, additional, Tang, Hui-Dong, additional, Gao, Xiang, additional, Liang, Liang, additional, Zhu, Li-Fang, additional, Zhou, Yi, additional, Zha, Sheng-Yu, additional, Cui, Shi-Shuang, additional, Du, Juan-Juan, additional, Li, Gen, additional, Liu, Jun, additional, and Chen, Sheng-Di, additional
Published: 2016
Full Text: View/download PDF

48. Biocompatibility and characteristics of theophylline/carboxymethyl chitosan microspheres for pulmonary drug delivery

Author: Zhang, Wei Fen, primary, Zhao, Xin Tong, additional, Zhao, Qing Sheng, additional, Zha, Sheng Hua, additional, Liu, Dong Mei, additional, Zheng, Zeng Juan, additional, Li, Wen Tao, additional, Zhou, Hui Yun, additional, and Yan, Fang, additional
Published: 2013
Full Text: View/download PDF

49. Study on forecasting equilibrium of matrix game based on grey information of players' decision rules

Author: Zha, Sheng-zhong, primary and Wang, Wen-ping, additional
Published: 2011
Full Text: View/download PDF

50. A cross-sectional study of affective, psychiatric, cognitive disorders and motor complications of Parkinson's disease.

Author: JIANG Qian-wen, ZHA Sheng-yu, and WANG Gang
Subjects: AFFECTIVE disorders, ANXIETY diagnosis, COGNITION disorders diagnosis, MENTAL illness, NEUROSURGERY, NEUROLOGY, PARKINSON'S disease, PSYCHIATRIC treatment, DISEASE complications, DIAGNOSIS
Abstract: Objective: To investigate the prevalence, diagnosis and treatment of depression, anxiety, psychiatric symptoms, cognitive impairment and motor complications of Parkinson's disease (PD). Methods: Face-to-face interviews were carried out among patients with idiopathic PD from the Outpatient Department of Neurology, Ruijin Hospital affiliated to Shanghai Jiaotong University School of Medicine, from March to May 2015. The Self-Rating Depression Scale (SDS), Self-Rating Anxiety Scale (SAS) and Mini-Mental State Examination (MMSE) were used to evaluate depression, anxiety and cognitive impairment. Results: A total of 55 patients with PD were enrolled in this study. The prevalences of depression, anxiety, psychiatric symptoms and cognitive impairment in PD were 16.36% (9/55), 14.55% (8/55), 23.64% (13/55) and 9.09% (5/55), respectively. The ratios of previous diagnosis and treatment were 2/9, 2/8, 2/13 and 1/5, respectively. The prevalences of fluctuation and dyskinesia were 27.27% (15/55) and 9.09% (5/55), respectively. There were no significant differences in the prevalences of depression (P = 0.858), anxiety (P = 0.188), psychiatric symptoms (P = 0.926), cognitive impairment (P = 0.286), fluctuation (P = 0.205) or dyskinesia (P = 0.417) between male and female PD patients. Conclusions: The prevalences of depression, anxiety, psychiatric symptoms, cognitive impairment and motor complications in PD were high, while the ratios of diagnosis and treatment were relatively low. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF
