Search Results
67 results for "Jianwei Yang"
2. Latent Action Pretraining from Videos.
3. Towards Flexible Visual Relationship Segmentation.
4. OmniParser for Pure Vision Based GUI Agent.
5. BiomedParse: a biomedical foundation model for image parsing of everything everywhere all at once.
6. List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs.
7. V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results.
8. Training Small Multimodal Models to Bridge Biomedical Competency Gap: A Case Study in Radiology Imaging.
9. Pix2Gif: Motion-Guided Diffusion for GIF Generation.
10. DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs.
11. Matryoshka Multimodal Models.
12. Efficient Modulation for Vision Networks.
13. Foundation Models for Biomedical Image Segmentation: A Survey.
14. BiomedJourney: Counterfactual Biomedical Image Generation by Instruction-Learning from Multimodal Patient Journeys.
15. A Simple Framework for Open-Vocabulary Segmentation and Detection.
16. LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following.
17. Visual In-Context Prompting.
18. detrex: Benchmarking Detection Transformers.
19. IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks.
20. VCoder: Versatile Vision Encoders for Multimodal Large Language Models.
21. A Strong and Reproducible Object Detector with Only Public Datasets.
22. LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents.
23. Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection.
24. GLIGEN: Open-Set Grounded Text-to-Image Generation.
25. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.
26. GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation.
27. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V.
28. LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models.
29. Learning Customized Visual Models with Retrieval-Augmented Knowledge.
30. Segment Everything Everywhere All at Once.
31. An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models.
32. LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing.
33. Multimodal Foundation Models: From Specialists to General-Purpose Assistants.
34. Interfacing Foundation Models' Embeddings.
35. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day.
36. Semantic-SAM: Segment and Recognize Anything at Any Granularity.
37. Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks.
38. Parameter-efficient Fine-tuning for Vision Transformers.
39. CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks.
40. Generalized Decoding for Pixel, Image, and Language.
41. ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models.
42. K-LITE: Learning Transferable Visual Models with External Knowledge.
43. Unified Contrastive Learning in Image-Text-Label Space.
44. Focal Modulation Networks.
45. Efficient Self-supervised Vision Transformers for Representation Learning.
46. VinVL: Making Visual Representations Matter in Vision-Language Models.
47. Grounded Language-Image Pre-training.
48. RegionCLIP: Region-based Language-Image Pretraining.
49. TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment.
50. Image Scene Graph Generation (SGG) Benchmark.