Author: "Chen, Nancy F." / Topic: computer science - machine learning - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Chen, Nancy F."' showing total 17 results

Start Over Author "Chen, Nancy F." Topic computer science - machine learning

17 results on '"Chen, Nancy F."'

1. PRESENT: Zero-Shot Text-to-Prosody Control

Author: Lam, Perry, Zhang, Huayun, Chen, Nancy F., Sisman, Berrak, and Herremans, Dorien
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning
Abstract: Current strategies for achieving fine-grained prosody control in speech synthesis entail extracting additional style embeddings or adopting more complex architectures. To enable zero-shot application of pretrained text-to-speech (TTS) models, we present PRESENT (PRosody Editing without Style Embeddings or New Training), which exploits explicit prosody prediction in FastSpeech2-based models by modifying the inference process directly. We apply our text-to-prosody framework to zero-shot language transfer using a JETS model exclusively trained on English LJSpeech data. We obtain character error rates (CER) of 12.8%, 18.7% and 5.9% for German, Hungarian and Spanish respectively, beating the previous state-of-the-art CER by over 2x for all three languages. Furthermore, we allow subphoneme-level control, a first in this field. To evaluate its effectiveness, we show that PRESENT can improve the prosody of questions, and use it to generate Mandarin, a tonal language where vowel pitch varies at subphoneme level. We attain 25.3% hanzi CER and 13.0% pinyin CER with the JETS model. All our code and audio samples are available online.
Published: 2024

2. In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models

Author: Joaquin, Ayrton San, Wang, Bin, Liu, Zhengyuan, Asher, Nicholas, Lim, Brian, Muller, Philippe, and Chen, Nancy F.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Despite advancements, fine-tuning Large Language Models (LLMs) remains costly due to the extensive parameter count and substantial data requirements for model generalization. Accessibility to computing resources remains a barrier for the open-source community. To address this challenge, we propose the In2Core algorithm, which selects a coreset by analyzing the correlation between training and evaluation samples with a trained model. Notably, we assess the model's internal gradients to estimate this relationship, aiming to rank the contribution of each training point. To enhance efficiency, we propose an optimization to compute influence functions with a reduced number of layers while achieving similar accuracy. By applying our algorithm to instruction fine-tuning data of LLMs, we can achieve similar performance with just 50% of the training data. Meantime, using influence functions to analyze model coverage to certain testing samples could provide a reliable and interpretable signal on the training set's coverage of those test points., Comment: EMNLP 2024 - Findings
Published: 2024

3. LOCOST: State-Space Models for Long Document Abstractive Summarization

Author: Bronnec, Florian Le, Duong, Song, Ravaut, Mathieu, Allauzen, Alexandre, Chen, Nancy F., Guigue, Vincent, Lumbreras, Alberto, Soulier, Laure, and Gallinari, Patrick
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: State-space models are a low-complexity alternative to transformers for encoding long sequences and capturing long-term dependencies. We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs. With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns. We evaluate our model on a series of long document abstractive summarization tasks. The model reaches a performance level that is 93-96% comparable to the top-performing sparse transformers of the same size while saving up to 50% memory during training and up to 87% during inference. Additionally, LOCOST effectively handles input texts exceeding 600K tokens at inference time, setting new state-of-the-art results on full-book summarization and opening new perspectives for long input processing., Comment: 9 pages, 5 figures, 7 tables, EACL 2024 conference
Published: 2024

4. Prompt Optimization via Adversarial In-Context Learning

Author: Do, Xuan Long, Zhao, Yiran, Brown, Hannah, Xie, Yuxi, Zhao, James Xu, Chen, Nancy F., Kawaguchi, Kenji, Shieh, Michael, and He, Junxian
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: We propose a new method, Adversarial In-Context Learning (adv-ICL), to optimize prompt for in-context learning (ICL) by employing one LLM as a generator, another as a discriminator, and a third as a prompt modifier. As in traditional adversarial learning, adv-ICL is implemented as a two-player game between the generator and discriminator, where the generator tries to generate realistic enough output to fool the discriminator. In each round, given an input prefixed by task instructions and several exemplars, the generator produces an output. The discriminator is then tasked with classifying the generator input-output pair as model-generated or real data. Based on the discriminator loss, the prompt modifier proposes possible edits to the generator and discriminator prompts, and the edits that most improve the adversarial loss are selected. We show that adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques for both open and closed-source models on 11 generation and classification tasks including summarization, arithmetic reasoning, machine translation, data-to-text generation, and the MMLU and big-bench hard benchmarks. In addition, because our method uses pre-trained models and updates only prompts rather than model parameters, it is computationally efficient, easy to extend to any LLM and task, and effective in low-resource settings., Comment: ACL 2024
Published: 2023

5. Multiple output samples per input in a single-output Gaussian process

Author: Wong, Jeremy H. M., Zhang, Huayun, and Chen, Nancy F.
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The standard Gaussian Process (GP) only considers a single output sample per input in the training set. Datasets for subjective tasks, such as spoken language assessment, may be annotated with output labels from multiple human raters per input. This paper proposes to generalise the GP to allow for these multiple output samples in the training set, and thus make use of available output uncertainty information. This differs from a multi-output GP, as all output samples are from the same task here. The output density function is formulated to be the joint likelihood of observing all output samples, and latent variables are not repeated to reduce computation cost. The test set predictions are inferred similarly to a standard GP, with a difference being in the optimised hyper-parameters. This is evaluated on speechocean762, showing that it allows the GP to compute a test set output distribution that is more similar to the collection of reference outputs from the multiple human raters., Comment: This paper is presented in the "Symposium for Celebrating 40 Years of Bayesian Learning in Speech and Language Processing and Beyond", which is a satellite event of the ASRU workshop, on 20 December 2023. https://bayesian40.github.io/
Published: 2023

6. EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models

Author: Lam, Perry, Zhang, Huayun, Chen, Nancy F., and Sisman, Berrak
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning
Abstract: Neural models are known to be over-parameterized, and recent work has shown that sparse text-to-speech (TTS) models can outperform dense models. Although a plethora of sparse methods has been proposed for other domains, such methods have rarely been applied in TTS. In this work, we seek to answer the question: what are the characteristics of selected sparse techniques on the performance and model complexity? We compare a Tacotron2 baseline and the results of applying five techniques. We then evaluate the performance via the factors of naturalness, intelligibility and prosody, while reporting model size and training time. Complementary to prior research, we find that pruning before or during training can achieve similar performance to pruning after training and can be trained much faster, while removing entire neurons degrades performance much more than removing parameters. To our best knowledge, this is the first work that compares sparsity paradigms in text-to-speech synthesis.
Published: 2022
Full Text: View/download PDF

7. A Transformer-based Neural Language Model that Synthesizes Brain Activation Maps from Free-Form Text Queries

Author: Ngo, Gia H., Nguyen, Minh, Chen, Nancy F., and Sabuncu, Mert R.
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Neuroimaging studies are often limited by the number of subjects and cognitive processes that can be feasibly interrogated. However, a rapidly growing number of neuroscientific studies have collectively accumulated an extensive wealth of results. Digesting this growing literature and obtaining novel insights remains to be a major challenge, since existing meta-analytic tools are constrained to keyword queries. In this paper, we present Text2Brain, an easy to use tool for synthesizing brain activation maps from open-ended text queries. Text2Brain was built on a transformer-based neural network language model and a coordinate-based meta-analysis of neuroimaging studies. Text2Brain combines a transformer-based text encoder and a 3D image generator, and was trained on variable-length text snippets and their corresponding activation maps sampled from 13,000 published studies. In our experiments, we demonstrate that Text2Brain can synthesize meaningful neural activation patterns from various free-form textual descriptions. Text2Brain is available at https://braininterpreter.com as a web-based tool for efficiently searching through the vast neuroimaging literature and generating new hypotheses., Comment: arXiv admin note: text overlap with arXiv:2109.13814
Published: 2022
Full Text: View/download PDF

8. Multimodal Dialogue State Tracking

Author: Le, Hung, Chen, Nancy F., and Hoi, Steven C. H.
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Designed for tracking user goals in dialogues, a dialogue state tracker is an essential component in a dialogue system. However, the research of dialogue state tracking has largely been limited to unimodality, in which slots and slot values are limited by knowledge domains (e.g. restaurant domain with slots of restaurant name and price range) and are defined by specific database schema. In this paper, we propose to extend the definition of dialogue state tracking to multimodality. Specifically, we introduce a novel dialogue state tracking task to track the information of visual objects that are mentioned in video-grounded dialogues. Each new dialogue utterance may introduce a new video segment, new visual objects, or new object attributes, and a state tracker is required to update these information slots accordingly. We created a new synthetic benchmark and designed a novel baseline, Video-Dialogue Transformer Network (VDTN), for this task. VDTN combines both object-level features and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states. We optimized VDTN for a state generation task as well as a self-supervised video understanding task which recovers video segment or object representations. Finally, we trained VDTN to use the decoded states in a response prediction task. Together with comprehensive ablation and qualitative analysis, we discovered interesting insights towards building more capable multimodal dialogue systems., Comment: Accepted at NAACL 2022 (Oral)
Published: 2022

9. Text2Brain: Synthesis of Brain Activation Maps from Free-form Text Query

Author: Ngo, Gia H., Nguyen, Minh, Chen, Nancy F., and Sabuncu, Mert R.
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Machine Learning
Abstract: Most neuroimaging experiments are under-powered, limited by the number of subjects and cognitive processes that an individual study can investigate. Nonetheless, over decades of research, neuroscience has accumulated an extensive wealth of results. It remains a challenge to digest this growing knowledge base and obtain new insights since existing meta-analytic tools are limited to keyword queries. In this work, we propose Text2Brain, a neural network approach for coordinate-based meta-analysis of neuroimaging studies to synthesize brain activation maps from open-ended text queries. Combining a transformer-based text encoder and a 3D image generator, Text2Brain was trained on variable-length text snippets and their corresponding activation maps sampled from 13,000 published neuroimaging studies. We demonstrate that Text2Brain can synthesize anatomically-plausible neural activation patterns from free-form textual descriptions of cognitive concepts. Text2Brain is available at https://braininterpreter.com as a web-based tool for retrieving established priors and generating new hypotheses for neuroscience research., Comment: MICCAI 2021
Published: 2021

10. $C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues

Author: Le, Hung, Chen, Nancy F., and Hoi, Steven C. H.
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context. Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available. However, the results are partly accomplished by exploiting biases in the datasets rather than developing multimodal reasoning, resulting in limited generalization. In this paper, we propose a novel approach of Compositional Counterfactual Contrastive Learning ($C^3$) to develop contrastive training between factual and counterfactual samples in video-grounded dialogues. Specifically, we design factual/counterfactual sampling based on the temporal steps in videos and tokens in dialogues and propose contrastive loss functions that exploit object-level or action-level variance. Different from prior approaches, we focus on contrastive hidden state representations among compositional output tokens to optimize the representation space in a generation setting. We achieved promising performance gains on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark and showed the benefits of our approach in grounding video and dialogue context., Comment: 24th Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)
Published: 2021

11. VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks

Author: Le, Hung, Chen, Nancy F., and Hoi, Steven C. H.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Neural module networks (NMN) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, very limited work on NMN has been studied in the video-grounded dialogue tasks. These tasks extend the complexity of traditional visual tasks with the additional visual temporal variance and language cross-turn dependencies. Motivated by recent NMN approaches on image-grounded tasks, we introduce Video-grounded Neural Module Network (VGNMN) to model the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VGNMN first decomposes all language components in dialogues to explicitly resolve any entity references and detect corresponding action-based inputs from the question. The detected entities and actions are used as parameters to instantiate neural module networks and extract visual cues from the video. Our experiments show that VGNMN can achieve promising performance on a challenging video-grounded dialogue benchmark as well as a video QA benchmark., Comment: Accepted at NAACL 2022 (Oral)
Published: 2021

12. Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues

Author: Le, Hung, Chen, Nancy F., and Hoi, Steven C. H.
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Compared to traditional visual question answering, video-grounded dialogues require additional reasoning over dialogue context to answer questions in a multi-turn setting. Previous approaches to video-grounded dialogues mostly use dialogue context as a simple text input without modelling the inherent information flows at the turn level. In this paper, we propose a novel framework of Reasoning Paths in Dialogue Context (PDC). PDC model discovers information flows among dialogue turns through a semantic graph constructed based on lexical components in each question and answer. PDC model then learns to predict reasoning paths over this semantic graph. Our path prediction model predicts a path from the current turn through past dialogue turns that contain additional visual cues to answer the current question. Our reasoning model sequentially processes both visual and textual information through this reasoning path and the propagated features are used to generate the answer. Our experimental results demonstrate the effectiveness of our method and provide additional insights on how models use semantic dependencies in a dialogue context to retrieve visual cues., Comment: Accepted at ICLR (International Conference on Learning Representations) 2021
Published: 2021

13. BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

Author: Le, Hung, Sahoo, Doyen, Chen, Nancy F., and Hoi, Steven C. H.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Video-grounded dialogues are very challenging due to (i) the complexity of videos which contain both spatial and temporal variations, and (ii) the complexity of user utterances which query different segments and/or different objects in videos over multiple dialogue turns. However, existing approaches to video-grounded dialogues often focus on superficial temporal-level visual cues, but neglect more fine-grained spatial signals from videos. To address this drawback, we propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos based on textual cues. Specifically, our approach not only exploits both spatial and temporal-level information, but also learns dynamic information diffusion between the two feature spaces through spatial-to-temporal and temporal-to-spatial reasoning. The bidirectional strategy aims to tackle the evolving semantics of user queries in the dialogue setting. The retrieved visual cues are used as contextual information to construct relevant responses to the users. Our empirical results and comprehensive qualitative analysis show that BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark. We also adapt our BiST models to the Video QA setting, and substantially outperform prior approaches on the TGIF-QA benchmark.
Published: 2020

14. UniConv: A Unified Conversational Neural Architecture for Multi-domain Task-oriented Dialogues

Author: Le, Hung, Sahoo, Doyen, Liu, Chenghao, Chen, Nancy F., and Hoi, Steven C. H.
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Building an end-to-end conversational agent for multi-domain task-oriented dialogues has been an open challenge for two main reasons. First, tracking dialogue states of multiple domains is non-trivial as the dialogue agent must obtain complete states from all relevant domains, some of which might have shared slots among domains as well as unique slots specifically for one domain only. Second, the dialogue agent must also process various types of information across domains, including dialogue context, dialogue states, and database, to generate natural responses to users. Unlike the existing approaches that are often designed to train each module separately, we propose "UniConv" -- a novel unified neural architecture for end-to-end conversational systems in multi-domain task-oriented dialogues, which is designed to jointly train (i) a Bi-level State Tracker which tracks dialogue states by learning signals at both slot and domain level independently, and (ii) a Joint Dialogue Act and Response Generator which incorporates information from various input components and models dialogue acts and target responses simultaneously. We conduct comprehensive experiments in dialogue state tracking, context-to-text, and end-to-end settings on the MultiWOZ2.1 benchmark, achieving superior performance over competitive baselines., Comment: Accepted The 2020 Conference on Empirical Methods in Natural Language Processing
Published: 2020

15. Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge

Author: Le, Hung and Chen, Nancy F.
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Audio-Visual Scene-Aware Dialog (AVSD) is an extension from Video Question Answering (QA) whereby the dialogue agent is required to generate natural language responses to address user queries and carry on conversations. This is a challenging task as it consists of video features of multiple modalities, including text, visual, and audio features. The agent also needs to learn semantic dependencies among user utterances and system responses to make coherent conversations with humans. In this work, we describe our submission to the AVSD track of the 8th Dialogue System Technology Challenge. We adopt dot-product attention to combine text and non-text features of input video. We further enhance the generation capability of the dialogue agent by adopting pointer networks to point to tokens from multiple source sequences in each generation step. Our systems achieve high performance in automatic metrics and obtain 5th and 6th place in human evaluation among all submissions., Comment: Accepted at DSTC Workshop at AAAI 2020
Published: 2020

16. Joint Learning of Word and Label Embeddings for Sequence Labelling in Spoken Language Understanding

Author: Wu, Jiewen, D'Haro, Luis Fernando, Chen, Nancy F., Krishnaswamy, Pavitra, and Banchs, Rafael E.
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We propose an architecture to jointly learn word and label embeddings for slot filling in spoken language understanding. The proposed approach encodes labels using a combination of word embeddings and straightforward word-label association from the training data. Compared to the state-of-the-art methods, our approach does not require label embeddings as part of the input and therefore lends itself nicely to a wide range of model architectures. In addition, our architecture computes contextual distances between words and labels to avoid adding contextual windows, thus reducing memory footprint. We validate the approach on established spoken dialogue datasets and show that it can achieve state-of-the-art performance with much fewer trainable parameters., Comment: Accepted for publication at ASRU 2019
Published: 2019

17. Multiple output samples for each input in a single-output Gaussian process

Author: Wong, Jeremy H. M., Zhang, Huayun, and Chen, Nancy F.
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The standard Gaussian Process (GP) only considers a single output sample per input in the training set. Datasets for subjective tasks, such as spoken language assessment, may be annotated with output labels from multiple human raters per input. This paper proposes to generalise the GP to allow for these multiple output samples in the training set, and thus make use of available output uncertainty information. This differs from a multi-output GP, as all output samples are from the same task here. The output density function is formulated to be the joint likelihood of observing all output samples, and latent variables are not repeated to reduce computation cost. The test set predictions are inferred similarly to a standard GP, with a difference being in the optimised hyper-parameters. This is evaluated on speechocean762, showing that it allows the GP to compute a test set output distribution that is more similar to the collection of reference outputs from the multiple human raters.
Published: 2023

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

17 results on '"Chen, Nancy F."'

1. PRESENT: Zero-Shot Text-to-Prosody Control

2. In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models

3. LOCOST: State-Space Models for Long Document Abstractive Summarization

4. Prompt Optimization via Adversarial In-Context Learning

5. Multiple output samples per input in a single-output Gaussian process

6. EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models

7. A Transformer-based Neural Language Model that Synthesizes Brain Activation Maps from Free-Form Text Queries

8. Multimodal Dialogue State Tracking

9. Text2Brain: Synthesis of Brain Activation Maps from Free-form Text Query

10. $C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues

11. VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks

12. Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues

13. BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

14. UniConv: A Unified Conversational Neural Architecture for Multi-domain Task-oriented Dialogues

15. Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge

16. Joint Learning of Word and Label Embeddings for Sequence Labelling in Spoken Language Understanding

17. Multiple output samples for each input in a single-output Gaussian process

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Database

17 results on '"Chen, Nancy F."'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources