Search

Your search for "Chang, Shih-Fu" returned 128 results

Search Constraints

Author: "Chang, Shih-Fu"
Database: arXiv
Search Results

1. JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images

2. WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization

3. Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

4. MoDE: CLIP Data Experts via Clustering

5. Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

6. RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos

7. From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models

8. SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos

9. Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning

10. Video Summarization: Towards Entity-Aware Captions

11. Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond

12. Ferret: Refer and Ground Anything Anywhere at Any Granularity

13. UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

14. Enhanced Chart Understanding in Vision and Language Task via Cross-modal Pre-training on Plot Table Pairs

15. Non-Sequential Graph Script Induction via Multimedia Grounding

16. Learning from Children: Improving Image-Caption Pretraining via Curriculum

17. IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

18. Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering

19. What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

20. Supervised Masked Knowledge Distillation for Few-Shot Transformers

21. DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection

22. In Defense of Structural Symbolic Representation for Video Event-Relation Prediction

23. TempCLR: Temporal Alignment Representation with Contrastive Learning

24. Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding

25. Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

26. Video Event Extraction via Tracking Visual States of Arguments

27. Weakly-Supervised Temporal Article Grounding

28. Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy

29. Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

30. Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities

31. Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

32. Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

33. Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks

34. Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting

35. Fine-Grained Visual Entailment

36. Few-Shot Object Detection with Fully Cross-Transformer

37. Learning To Recognize Procedural Activities with Distant Supervision

38. CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

39. CLIP-Event: Connecting Text and Images with Event Structures

40. MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding

41. Query Adaptive Few-Shot Object Detection with Heterogeneous Graph Convolutional Networks

42. SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

43. PreViTS: Contrastive Pretraining with Video Tracking Supervision

44. Joint Multimedia Event Extraction from Video and Article

45. Partner-Assisted Learning for Few-Shot Image Classification

46. Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

47. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

48. Meta Faster R-CNN: Towards Accurate Few-Shot Object Detection with Attentive Feature Alignment

49. Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos

50. VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs
