Author: "Shetty, Pranav" / Publication Type: Reports - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Shetty, Pranav"' showing total 7 results

Start Over Author "Shetty, Pranav" Publication Type Reports

7 results on '"Shetty, Pranav"'

1. 'What is the value of {templates}?' Rethinking Document Information Extraction Datasets for LLMs

Author: Zmigrod, Ran, Shetty, Pranav, Sibue, Mathieu, Ma, Zhiqiang, Nourbakhsh, Armineh, Liu, Xiaomo, and Veloso, Manuela
Subjects: Computer Science - Computation and Language
Abstract: The rise of large language models (LLMs) for visually rich document understanding (VRDU) has kindled a need for prompt-response, document-based datasets. As annotating new datasets from scratch is labor-intensive, the existing literature has generated prompt-response datasets from available resources using simple templates. For the case of key information extraction (KIE), one of the most common VRDU tasks, past work has typically employed the template "What is the value for the {key}?". However, given the variety of questions encountered in the wild, simple and uniform templates are insufficient for creating robust models in research and industrial contexts. In this work, we present K2Q, a diverse collection of five datasets converted from KIE to a prompt-response format using a plethora of bespoke templates. The questions in K2Q can span multiple entities and be extractive or boolean. We empirically compare the performance of seven baseline generative models on K2Q with zero-shot prompting. We further compare three of these models when training on K2Q versus training on simpler templates to motivate the need of our work. We find that creating diverse and intricate KIE questions enhances the performance and robustness of VRDU models. We hope this work encourages future studies on data quality for generative model training., Comment: Accepted to EMNLP Findings 2024
Published: 2024
Full Text: View/download PDF

2. Accelerating materials discovery for polymer solar cells: Data-driven insights enabled by natural language processing

Author: Shetty, Pranav, Adeboye, Aishat, Gupta, Sonakshi, Zhang, Chao, and Ramprasad, Rampi
Subjects: Condensed Matter - Materials Science, Computer Science - Computation and Language, Physics - Applied Physics
Abstract: We present a simulation of various active learning strategies for the discovery of polymer solar cell donor/acceptor pairs using data extracted from the literature spanning $\sim$20 years by a natural language processing pipeline. While data-driven methods have been well established to discover novel materials faster than Edisonian trial-and-error approaches, their benefits have not been quantified for material discovery problems that can take decades. Our approach demonstrates a potential reduction in discovery time by approximately 75 %, equivalent to a 15 year acceleration in material innovation. Our pipeline enables us to extract data from greater than 3300 papers which is $\sim$5 times larger and therefore more diverse than similar data sets reported by others. We also trained machine learning models to predict the power conversion efficiency and used our model to identify promising donor-acceptor combinations that are as yet unreported. We thus demonstrate a pipeline that goes from published literature to extracted material property data which in turn is used to obtain data-driven insights. Our insights include active learning strategies that can be used to train strong predictive models of material properties or be robust to the initial material system used. This work provides a valuable framework for data-driven research in materials science.
Published: 2024
Full Text: View/download PDF

3. PolyIE: A Dataset of Information Extraction from Polymer Material Scientific Literature

Author: Cheung, Jerry Junyang, Zhuang, Yuchen, Li, Yinghao, Shetty, Pranav, Zhao, Wantian, Grampurohit, Sanjeev, Ramprasad, Rampi, and Zhang, Chao
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: Scientific information extraction (SciIE), which aims to automatically extract information from scientific literature, is becoming more important than ever. However, there are no existing SciIE datasets for polymer materials, which is an important class of materials used ubiquitously in our daily lives. To bridge this gap, we introduce POLYIE, a new SciIE dataset for polymer materials. POLYIE is curated from 146 full-length polymer scholarly articles, which are annotated with different named entities (i.e., materials, properties, values, conditions) as well as their N-ary relations by domain experts. POLYIE presents several unique challenges due to diverse lexical formats of entities, ambiguity between entities, and variable-length relations. We evaluate state-of-the-art named entity extraction and relation extraction models on POLYIE, analyze their strengths and weaknesses, and highlight some difficult cases for these models. To the best of our knowledge, POLYIE is the first SciIE benchmark for polymer materials, and we hope it will lead to more research efforts from the community on this challenging task. Our code and data are available on: https://github.com/jerry3027/PolyIE., Comment: Work in progress
Published: 2023

4. Cross-Geography Generalization of Machine Learning Methods for Classification of Flooded Regions in Aerial Images

Author: Lenka, Sushant, Kerhalkar, Pratyush, Shetty, Pranav, Gupta, Harsh, Vidyarthi, Bhavam, and Verma, Ujjwal
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Identification of regions affected by floods is a crucial piece of information required for better planning and management of post-disaster relief and rescue efforts. Traditionally, remote sensing images are analysed to identify the extent of damage caused by flooding. The data acquired from sensors onboard earth observation satellites are analyzed to detect the flooded regions, which can be affected by low spatial and temporal resolution. However, in recent years, the images acquired from Unmanned Aerial Vehicles (UAVs) have also been utilized to assess post-disaster damage. Indeed, a UAV based platform can be rapidly deployed with a customized flight plan and minimum dependence on the ground infrastructure. This work proposes two approaches for identifying flooded regions in UAV aerial images. The first approach utilizes texture-based unsupervised segmentation to detect flooded areas, while the second uses an artificial neural network on the texture features to classify images as flooded and non-flooded. Unlike the existing works where the models are trained and tested on images of the same geographical regions, this work studies the performance of the proposed model in identifying flooded regions across geographical regions. An F1-score of 0.89 is obtained using the proposed segmentation-based approach which is higher than existing classifiers. The robustness of the proposed approach demonstrates that it can be utilized to identify flooded regions of any region with minimum or no user intervention.
Published: 2022

5. A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing

Author: Shetty, Pranav, Rajan, Arunkumar Chitteth, Kuenneth, Christopher, Gupta, Sonkakshi, Panchumarti, Lakshmi Prerana, Holm, Lauren, Zhang, Chao, and Ramprasad, Rampi
Subjects: Computer Science - Computation and Language, Condensed Matter - Materials Science, Condensed Matter - Soft Condensed Matter
Abstract: The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from published literature. We used natural language processing (NLP) methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets when used as the encoder for text. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available through a web platform at https://polymerscholar.org which can be used to locate material property data recorded in abstracts conveniently. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with a complete set of extracted material property information.
Published: 2022
Full Text: View/download PDF

6. PRBoost: Prompt-Based Rule Discovery and Boosting for Interactive Weakly-Supervised Learning

Author: Zhang, Rongzhi, Yu, Yue, Shetty, Pranav, Song, Le, and Zhang, Chao
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Weakly-supervised learning (WSL) has shown promising results in addressing label scarcity on many NLP tasks, but manually designing a comprehensive, high-quality labeling rule set is tedious and difficult. We study interactive weakly-supervised learning -- the problem of iteratively and automatically discovering novel labeling rules from data to improve the WSL model. Our proposed model, named PRBoost, achieves this goal via iterative prompt-based rule discovery and model boosting. It uses boosting to identify large-error instances and then discovers candidate rules from them by prompting pre-trained LMs with rule templates. The candidate rules are judged by human experts, and the accepted rules are used to generate complementary weak labels and strengthen the current model. Experiments on four tasks show PRBoost outperforms state-of-the-art WSL baselines up to 7.1% and bridges the gaps with fully supervised models. Our Implementation is available at \url{https://github.com/rz-zhang/PRBoost}., Comment: ACL 2022 (Main Conference). Code: https://github.com/rz-zhang/PRBoost
Published: 2022

7. BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition

Author: Li, Yinghao, Shetty, Pranav, Liu, Lucas, Zhang, Chao, and Song, Le
Subjects: Computer Science - Computation and Language
Abstract: We study the problem of learning a named entity recognition (NER) tagger using noisy labels from multiple weak supervision sources. Though cheap to obtain, the labels from weak supervision sources are often incomplete, inaccurate, and contradictory, making it difficult to learn an accurate NER model. To address this challenge, we propose a conditional hidden Markov model (CHMM), which can effectively infer true labels from multi-source noisy labels in an unsupervised way. CHMM enhances the classic hidden Markov model with the contextual representation power of pre-trained language models. Specifically, CHMM learns token-wise transition and emission probabilities from the BERT embeddings of the input tokens to infer the latent true labels from noisy observations. We further refine CHMM with an alternate-training approach (CHMM-ALT). It fine-tunes a BERT-NER model with the labels inferred by CHMM, and this BERT-NER's output is regarded as an additional weak source to train the CHMM in return. Experiments on four NER benchmarks from various domains show that our method outperforms state-of-the-art weakly supervised NER models by wide margins.
Published: 2021

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

7 results on '"Shetty, Pranav"'

1. 'What is the value of {templates}?' Rethinking Document Information Extraction Datasets for LLMs

2. Accelerating materials discovery for polymer solar cells: Data-driven insights enabled by natural language processing

3. PolyIE: A Dataset of Information Extraction from Polymer Material Scientific Literature

4. Cross-Geography Generalization of Machine Learning Methods for Classification of Flooded Regions in Aerial Images

5. A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing

6. PRBoost: Prompt-Based Rule Discovery and Boosting for Interactive Weakly-Supervised Learning

7. BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Publication Type

Database

7 results on '"Shetty, Pranav"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources