Author: "Gadre, Samir Yitzhak" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Gadre, Samir Yitzhak"' showing total 17 results

1. Language models scale reliably with over-training and on downstream tasks

Author: Gadre, Samir Yitzhak, Smyrnis, Georgios, Shankar, Vaishaal, Gururangan, Suchin, Wortsman, Mitchell, Shao, Rulin, Mercat, Jean, Fang, Alex, Li, Jeffrey, Keh, Sedrick, Xin, Rui, Nezhurina, Marianna, Vasiljevic, Igor, Jitsev, Jenia, Soldaini, Luca, Dimakis, Alexandros G., Ilharco, Gabriel, Koh, Pang Wei, Song, Shuran, Kollar, Thomas, Carmon, Yair, Dave, Achal, Heckel, Reinhard, Muennighoff, Niklas, and Schmidt, Ludwig
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., "Chinchilla optimal" regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32× over-trained) and a 6.9B parameter, 138B token run (i.e., a compute-optimal run), each from experiments that take 300× less compute. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models, using experiments that take 20× less compute. Our experiments are available at https://github.com/mlfoundations/scaling.
Published: 2024

2. Improving Multimodal Datasets with Image Captioning

Author: Nguyen, Thao, Gadre, Samir Yitzhak, Ilharco, Gabriel, Oh, Sewoong, and Schmidt, Ludwig
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, the raw web data is noisy, and existing filtering methods to reduce noise often come at the expense of data diversity. Our work focuses on caption quality as one major source of noise, and studies how generated captions can increase the utility of web-scraped datapoints with nondescript text. Through exploring different mixing strategies for raw and generated captions, we outperform the best filtering method proposed by the DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a candidate pool of 128M image-text pairs. Our best approach is also 2x better at Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an effective source of text supervision. In experimenting with different image captioning models, we also demonstrate that the performance of a model on standard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable indicator of the utility of the captions it generates for multimodal training. Finally, our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text, as well as the importance of image curation with increasing training data quantity. The synthetic captions used in our experiments are now available on HuggingFace.
Comment: Accepted at NeurIPS 2023 Datasets & Benchmarks
Published: 2023

3. Objaverse-XL: A Universe of 10M+ 3D Objects

Author: Deitke, Matt, Liu, Ruoshi, Wallingford, Matthew, Ngo, Huong, Michel, Oscar, Kusupati, Aditya, Fan, Alan, Laforte, Christian, Voleti, Vikram, Gadre, Samir Yitzhak, VanderBilt, Eli, Kembhavi, Aniruddha, Vondrick, Carl, Gkioxari, Georgia, Ehsani, Kiana, Schmidt, Ludwig, and Farhadi, Ali
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data. In this work, we present Objaverse-XL, a dataset of over 10 million 3D objects. Our dataset comprises deduplicated 3D objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. Representing the largest scale and diversity in the realm of 3D datasets, Objaverse-XL enables significant new possibilities for 3D vision. Our experiments demonstrate the improvements enabled with the scale provided by Objaverse-XL. We show that by training Zero123 on novel view synthesis, utilizing over 100 million multi-view rendered images, we achieve strong zero-shot generalization abilities. We hope that releasing Objaverse-XL will enable further innovations in the field of 3D vision at scale.
Published: 2023

4. DataComp: In search of the next generation of multimodal datasets

Author: Gadre, Samir Yitzhak, Ilharco, Gabriel, Fang, Alex, Hayase, Jonathan, Smyrnis, Georgios, Nguyen, Thao, Marten, Ryan, Wortsman, Mitchell, Ghosh, Dhruba, Zhang, Jieyu, Orgad, Eyal, Entezari, Rahim, Daras, Giannis, Pratt, Sarah, Ramanujan, Vivek, Bitton, Yonatan, Marathe, Kalyani, Mussmann, Stephen, Vencu, Richard, Cherti, Mehdi, Krishna, Ranjay, Koh, Pang Wei, Saukh, Olga, Ratner, Alexander, Song, Shuran, Hajishirzi, Hannaneh, Farhadi, Ali, Beaumont, Romain, Oh, Sewoong, Dimakis, Alex, Jitsev, Jenia, Carmon, Yair, Shankar, Vaishaal, and Schmidt, Ludwig
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
Comment: NeurIPS 2023 Datasets and Benchmarks Track
Published: 2023

5. Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text

Author: Zhu, Wanrong, Hessel, Jack, Awadalla, Anas, Gadre, Samir Yitzhak, Dodge, Jesse, Fang, Alex, Yu, Youngjae, Schmidt, Ludwig, Wang, William Yang, and Choi, Yejin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also, more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images+text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4, an augmentation of the popular text-only C4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features, a process that we show outperforms alternatives. Multimodal C4 spans everyday topics like cooking, travel, technology, etc. A manual inspection of a random sample of documents shows that a vast majority (88%) of images are topically relevant, and that linear assignment frequently selects individual sentences specifically well-aligned with each image (80%). After filtering NSFW images, ads, etc., the resulting corpus consists of 101.2M documents with 571M images interleaved in 43B English tokens.
Comment: NeurIPS D&B 2023. Project homepage: https://github.com/allenai/mmc4
Published: 2023

6. Patching open-vocabulary models by interpolating weights

Author: Ilharco, Gabriel, Wortsman, Mitchell, Gadre, Samir Yitzhak, Song, Shuran, Hajishirzi, Hannaneh, Kornblith, Simon, Farhadi, Ali, and Schmidt, Ludwig
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Open-vocabulary models like CLIP achieve high accuracy across many image classification tasks. However, there are still settings where their zero-shot performance is far from optimal. We study model patching, where the goal is to improve accuracy on specific tasks without degrading accuracy on tasks where performance is already adequate. Towards this goal, we introduce PAINT, a patching method that uses interpolations between the weights of a model before fine-tuning and the weights after fine-tuning on a task to be patched. On nine tasks where zero-shot CLIP performs poorly, PAINT increases accuracy by 15 to 60 percentage points while preserving accuracy on ImageNet within one percentage point of the zero-shot model. PAINT also allows a single model to be patched on multiple tasks and improves with model scale. Furthermore, we identify cases of broad transfer, where patching on one task increases accuracy on other tasks even when the tasks have disjoint classes. Finally, we investigate applications beyond common benchmarks such as counting or reducing the impact of typographic attacks on CLIP. Our findings demonstrate that it is possible to expand the set of tasks on which open-vocabulary models achieve high accuracy without re-training them from scratch.
Comment: 36th Conference on Neural Information Processing Systems (NeurIPS 2022)
Published: 2022

7. Structure from Action: Learning Interactions for Articulated Object 3D Structure Discovery

Author: Nie, Neil, Gadre, Samir Yitzhak, Ehsani, Kiana, and Song, Shuran
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce Structure from Action (SfA), a framework to discover 3D part geometry and joint parameters of unseen articulated objects via a sequence of inferred interactions. Our key insight is that 3D interaction and perception should be considered in conjunction to construct 3D articulated CAD models, especially for categories not seen during training. By selecting informative interactions, SfA discovers parts and reveals occluded surfaces, like the inside of a closed drawer. By aggregating visual observations in 3D, SfA accurately segments multiple parts, reconstructs part geometry, and infers all joint parameters in a canonical coordinate frame. Our experiments demonstrate that a SfA model trained in simulation can generalize to many unseen object categories with diverse structures and to real-world objects. Empirically, SfA outperforms a pipeline of state-of-the-art components by 25.4 3D IoU percentage points on unseen categories, while matching already performant joint estimation baselines.
Published: 2022

8. Continuous Scene Representations for Embodied AI

Author: Gadre, Samir Yitzhak, Ehsani, Kiana, Song, Shuran, and Mottaghi, Roozbeh
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: We propose Continuous Scene Representations (CSR), a scene representation constructed by an embodied agent navigating within a space, where objects and their relationships are modeled by continuous valued embeddings. Our method captures feature relationships between objects, composes them into a graph structure on-the-fly, and situates an embodied agent within the representation. Our key insight is to embed pair-wise relationships between objects in a latent space. This allows for a richer representation compared to discrete relations (e.g., [support], [next-to]) commonly used for building scene representations. CSR can track objects as the agent moves in a scene, update the representation accordingly, and detect changes in room configurations. Using CSR, we outperform state-of-the-art approaches for the challenging downstream task of visual room rearrangement, without any task specific training. Moreover, we show the learned embeddings capture salient spatial details of the scene and show applicability to real world data. A summary video and code are available at https://prior.allenai.org/projects/csr.
Comment: CVPR 2022
Published: 2022

9. CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation

Author: Gadre, Samir Yitzhak, Wortsman, Mitchell, Ilharco, Gabriel, Schmidt, Ludwig, and Song, Shuran
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: For robots to be generally useful, they must be able to find arbitrary objects described by people (i.e., be language-driven) even without expensive navigation training on in-domain data (i.e., perform zero-shot inference). We explore these capabilities in a unified setting: language-driven zero-shot object navigation (L-ZSON). Inspired by the recent success of open-vocabulary models for image classification, we investigate a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning. To better evaluate L-ZSON, we introduce the Pasture benchmark, which considers finding uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects. We conduct an in-depth empirical study by directly deploying 21 CoW baselines across Habitat, RoboTHOR, and Pasture. In total, we evaluate over 90k navigation episodes and find that (1) CoW baselines often struggle to leverage language descriptions, but are proficient at finding uncommon objects. (2) A simple CoW, with CLIP-based object localization and classical exploration -- and no additional training -- matches the navigation efficiency of a state-of-the-art ZSON method trained for 500M steps on Habitat MP3D data. This same CoW provides a 15.6 percentage point improvement in success over a state-of-the-art RoboTHOR ZSON model.
Published: 2022

10. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

Author: Wortsman, Mitchell, Ilharco, Gabriel, Gadre, Samir Yitzhak, Roelofs, Rebecca, Gontijo-Lopes, Raphael, Morcos, Ari S., Namkoong, Hongseok, Farhadi, Ali, Carmon, Yair, Kornblith, Simon, and Schmidt, Ludwig
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.
Comment: ICML 2022. The last three authors contributed equally
Published: 2022

11. Act the Part: Learning Interaction Strategies for Articulated Object Part Discovery

Author: Gadre, Samir Yitzhak, Ehsani, Kiana, and Song, Shuran
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: People often use physical intuition when manipulating articulated objects, irrespective of object semantics. Motivated by this observation, we identify an important embodied task where an agent must play with objects to recover their parts. To this end, we introduce Act the Part (AtP) to learn how to interact with articulated objects to discover and segment their pieces. By coupling action selection and motion segmentation, AtP is able to isolate structures to make perceptual part recovery possible without semantic labels. Our experiments show AtP learns efficient strategies for part discovery, can generalize to unseen categories, and is capable of conditional reasoning for the task. Although trained in simulation, we show convincing transfer to real world data with no fine-tuning.
Comment: 16 pages, 16 figures
Published: 2021

12. Structure from Action: Learning Interactions for 3D Articulated Object Structure Discovery

Author: Nie, Neil, primary, Gadre, Samir Yitzhak, additional, Ehsani, Kiana, additional, and Song, Shuran, additional
Published: 2023
Full Text: View/download PDF

13. CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation

Author: Gadre, Samir Yitzhak, primary, Wortsman, Mitchell, additional, Ilharco, Gabriel, additional, Schmidt, Ludwig, additional, and Song, Shuran, additional
Published: 2023
Full Text: View/download PDF

14. Continuous Scene Representations for Embodied AI

Author: Gadre, Samir Yitzhak, primary, Ehsani, Kiana, additional, Song, Shuran, additional, and Mottaghi, Roozbeh, additional
Published: 2022
Full Text: View/download PDF

15. Act the Part: Learning Interaction Strategies for Articulated Object Part Discovery

Author: Gadre, Samir Yitzhak, primary, Ehsani, Kiana, additional, and Song, Shuran, additional
Published: 2021
Full Text: View/download PDF

16. Design and implementation of a stereoscopic visual odometry module

Author: Gadre, Samir Yitzhak, Fan, Hongyi, and Yuan, Rong
Published: 2016
Full Text: View/download PDF

17. Abstract 2883: Enhanced anti-tumor activity and bioavailability of chemopreventives by coated polymeric implants

Author: Gadre, Samir-Yitzhak, primary, Aqil, Farrukh, additional, Jeyabalan, Jeyaprakash, additional, Kausar, Hina, additional, Sharma, Ramjee, additional, Singh, Inder P., additional, and Gupta, Ramesh C., additional
Published: 2012
Full Text: View/download PDF
