Author: "Turc, Iulia" / Topic: fos: computer and information sciences - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Turc, Iulia"' showing total 6 results

Start Over Author "Turc, Iulia" Topic fos: computer and information sciences

6 results on '"Turc, Iulia"'

1. <scp>Canine</scp>: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Author: Clark, Jonathan H., Garrette, Dan, Turc, Iulia, and Wieting, John
Subjects: FOS: Computer and information sciences, Human-Computer Interaction, Computer Science - Machine Learning, Linguistics and Language, Computer Science - Computation and Language, Artificial Intelligence, Communication, Computation and Language (cs.CL), Machine Learning (cs.LG), Computer Science Applications
Abstract: Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters., Comment: TACL Final Version
Published: 2022
Full Text: View/download PDF

2. Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

Author: Lee, Kenton, Joshi, Mandar, Turc, Iulia, Hu, Hexiang, Liu, Fangyu, Eisenschlos, Julian, Khandelwal, Urvashi, Shaw, Peter, Chang, Ming-Wei, and Toutanova, Kristina
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
Abstract: Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images., Comment: Accepted at ICML
Published: 2022
Full Text: View/download PDF

3. Measuring Attribution in Natural Language Generation Models

Author: Rashkin, Hannah, Nikolaev, Vitaly, Lamm, Matthew, Aroyo, Lora, Collins, Michael, Das, Dipanjan, Petrov, Slav, Tomar, Gaurav Singh, Turc, Iulia, and Reitter, David
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to have the means to identify and evaluate whether NLG output is only sharing verifiable information about the external world. In this work, we present a new evaluation framework entitled Attributable to Identified Sources (AIS) for assessing the output of natural language generation models, when such output pertains to the external world. We first define AIS and introduce a two-stage annotation pipeline for allowing annotators to appropriately evaluate model output according to AIS guidelines. We empirically validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset) via human evaluation studies that suggest that AIS could serve as a common framework for measuring whether model-generated statements are supported by underlying sources. We release guidelines for the human evaluation studies.
Published: 2021

4. The MultiBERTs: BERT Reproductions for Robustness Analysis

Author: Sellam, Thibault, Yadlowsky, Steve, Wei, Jason, Saphra, Naomi, D'Amour, Alexander, Linzen, Tal, Bastings, Jasmijn, Turc, Iulia, Eisenstein, Jacob, Das, Dipanjan, Tenney, Ian, and Pavlick, Ellie
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Experiments with pre-trained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact tested in the experiment (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure which includes the architecture, training data, initialization scheme, and loss function. Recent work has shown that repeating the pre-training process can lead to substantially different performance, suggesting that an alternate strategy is needed to make principled statements about procedures. To enable researchers to draw more robust conclusions, we introduce the MultiBERTs, a set of 25 BERT-Base checkpoints, trained with similar hyper-parameters as the original BERT model but differing in random weight initialization and shuffling of training data. We also define the Multi-Bootstrap, a non-parametric bootstrap method for statistical inference designed for settings where there are multiple pre-trained models and limited test data. To illustrate our approach, we present a case study of gender bias in coreference resolution, in which the Multi-Bootstrap lets us measure effects that may not be detected with a single checkpoint. We release our models and statistical library along with an additional set of 140 intermediate checkpoints captured during pre-training to facilitate research on learning dynamics., Accepted at ICLR'22. Checkpoints and example analyses: http://goo.gle/multiberts
Published: 2021

5. Revisiting the Primacy of English in Zero-shot Cross-lingual Transfer

Author: Turc, Iulia, Lee, Kenton, Eisenstein, Jacob, Chang, Ming-Wei, and Toutanova, Kristina
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Despite their success, large pre-trained multilingual models have not completely alleviated the need for labeled data, which is cumbersome to collect for all target languages. Zero-shot cross-lingual transfer is emerging as a practical solution: pre-trained models later fine-tuned on one transfer language exhibit surprising performance when tested on many target languages. English is the dominant source language for transfer, as reinforced by popular zero-shot benchmarks. However, this default choice has not been systematically vetted. In our study, we compare English against other transfer languages for fine-tuning, on two pre-trained multilingual models (mBERT and mT5) and multiple classification and question answering tasks. We find that other high-resource languages such as German and Russian often transfer more effectively, especially when the set of target languages is diverse or unknown a priori. Unexpectedly, this can be true even when the training sets were automatically translated from English. This finding can have immediate impact on multilingual zero-shot systems, and should inform future benchmark designs.
Published: 2021
Full Text: View/download PDF

6. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

Author: Turc, Iulia, Chang, Ming-Wei, Lee, Kenton, and Toutanova, Kristina
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to down-stream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive to more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and properties of unlabeled task data. One surprising observation is that they have a compound effect even when sequentially applied on the same data. To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available., Comment: Added comparison to concurrent work
Published: 2019
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

6 results on '"Turc, Iulia"'

1. <scp>Canine</scp>: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

2. Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

3. Measuring Attribution in Natural Language Generation Models

4. The MultiBERTs: BERT Reproductions for Robustness Analysis

5. Revisiting the Primacy of English in Zero-shot Cross-lingual Transfer

6. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Database

Publisher

6 results on '"Turc, Iulia"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources