Author: "Xirong Li" / Publisher: acm - Searchworks@Jio Institute Digital Library Search Results

1. Multi-Modal Multi-Instance Learning for Retinal Disease Recognition

Author: Yang Zhou, Weihong Yu, Youxin Chen, Jianchun Zhao, Dayong Ding, Xirong Li, Jie Wang, and Hailan Lin
Subjects: FOS: Computer and information sciences, Modalities, Modality (human–computer interaction), genetic structures, Artificial neural network, Computer Science - Artificial Intelligence, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Pattern recognition, Multimedia (cs.MM), Domain (software engineering), Artificial Intelligence (cs.AI), Modal, Data acquisition, Artificial intelligence, Set (psychology), business, Computer Science - Multimedia
Abstract: This paper attacks an emerging challenge of multi-modal retinal disease recognition. Given a multi-modal case consisting of a color fundus photo (CFP) and an array of OCT B-scan images acquired during an eye examination, we aim to build a deep neural network that recognizes multiple vision-threatening diseases for the given case. As the diagnostic efficacy of CFP and OCT is disease-dependent, the network's ability to be both selective and interpretable is important. Moreover, as both data acquisition and manual labeling are extremely expensive in the medical domain, the network has to be relatively lightweight for learning from a limited set of labeled multi-modal samples. Prior art on retinal disease recognition focuses either on a single disease or on a single modality, leaving multi-modal fusion largely underexplored. We propose in this paper Multi-Modal Multi-Instance Learning (MM-MIL) for selectively fusing CFP and OCT modalities. Its lightweight architecture (as compared to current multi-head attention modules) makes it suited for learning from relatively small-sized datasets. For an effective use of MM-MIL, we propose to generate a pseudo sequence of CFPs by oversampling a given CFP. The benefits of this tactic include well balancing instances across modalities, increasing the resolution of the CFP input, and finding out regions of the CFP most relevant with respect to the final diagnosis. Extensive experiments on a real-world dataset consisting of 1,206 multi-modal cases from 1,193 eyes of 836 subjects demonstrate the viability of the proposed model. Accepted by ACM Multimedia 2021 (Main Track).
Published: 2021

2. Multi-Level Visual Representation with Semantic-Reinforced Learning for Video Captioning

Author: Fan Hu, Zihan Wang, Xinru Chen, Aozhu Chen, Chengbo Dong, and Xirong Li
Subjects: Closed captioning, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, computer.software_genre, Task (project management), Test set, Encoding (memory), Key (cryptography), Reinforcement learning, Artificial intelligence, Representation (mathematics), business, computer, Decoding methods, Natural language processing
Abstract: This paper describes our bronze-medal solution for the video captioning task of the ACMMM2021 Pre-Training for Video Understanding Challenge. We depart from the Bottom-Up-Top-Down model, with technical improvements on both video content encoding and caption decoding. For encoding, we propose to extract multi-level video features that describe holistic scenes and fine-grained key objects, respectively. The scene-level and object-level features are enhanced separately by multi-head self-attention mechanisms before feeding them into the decoding module. Towards generating content-relevant and human-like captions, we train our network end-to-end by semantic-reinforced learning. Finally, in order to select the best caption from captions produced by distinct models, we perform caption reranking by cross-modal matching between a given video and each candidate caption. Both internal experiments on the MSR-VTT test set and external evaluations by the challenge organizers justify the viability of the proposed solution.
Published: 2021

3. Towards annotation-free evaluation of cross-lingual image captioning

Author: Hailan Lin, Xinyi Huang, Xirong Li, and Aozhu Chen
Subjects: FOS: Computer and information sciences, Closed captioning, Machine translation, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Feature vector, Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, computer.software_genre, Field (computer science), Multimedia (cs.MM), Image (mathematics), Annotation, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Relevance (information retrieval), Artificial intelligence, business, computer, Computer Science - Multimedia, Natural language processing, Word (computer architecture)
Abstract: Cross-lingual image captioning, with its ability to caption an unlabeled image in a target language other than English, is an emerging topic in the multimedia field. In order to save the precious human resource from re-writing reference sentences per target language, in this paper we make a brave attempt towards annotation-free evaluation of cross-lingual image captioning. Depending on whether we assume the availability of English references, two scenarios are investigated. For the first scenario with the references available, we propose two metrics, i.e., WMDRel and CLinRel. WMDRel measures the semantic relevance between a model-generated caption and machine translation of an English reference using their Word Mover's Distance. By projecting both captions into a deep visual feature space, CLinRel is a visual-oriented cross-lingual relevance measure. As for the second scenario, which has zero reference and is thus more challenging, we propose CMedRel to compute a cross-media relevance between the generated caption and the image content, in the same visual feature space as used by CLinRel. We have conducted a number of experiments to evaluate the effectiveness of the three proposed metrics. The combination of WMDRel, CLinRel and CMedRel has a Spearman's rank correlation of 0.952 with the sum of BLEU-4, METEOR, ROUGE-L and CIDEr, four standard metrics computed using references in the target language. CMedRel alone has a Spearman's rank correlation of 0.786 with the standard metrics. The promising results show high potential of the new metrics for evaluation with no need of references in the target language.
Published: 2021
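
The abstract above compares metrics by Spearman's rank correlation. As a minimal sketch of that statistic (not the authors' evaluation code), the following computes it as the Pearson correlation of average ranks. The function names and the sample scores are illustrative assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model scores: a reference-free metric vs. a reference-based one.
proposed_metric = [0.2, 0.5, 0.9, 0.4]
standard_metric = [1.1, 1.9, 2.0, 2.5]
rho = spearman(proposed_metric, standard_metric)
```

A value near 1 means the two metrics order the captioning models almost identically, which is the sense in which the abstract reports 0.952 and 0.786.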

4. A W2VV++ Case Study with Automated and Interactive Text-to-Video Retrieval

Author: František Mejzlík, Chaoxi Xu, Jakub Lokoč, Patrik Veselý, Xirong Li, Tomáš Souček, and Jiaqi Ji
Subjects: Information retrieval, Recall, Interactive video, Process (engineering), Computer science, business.industry, Deep learning, Full text search, 020207 software engineering, 02 engineering and technology, Visualization, Task (project management), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Feature learning
Abstract: As reported by respected evaluation campaigns focusing both on automated and interactive video search approaches, deep learning started to dominate the video retrieval area. However, the results are still not satisfactory for many types of search tasks focusing on high recall. To report on this challenging problem, we present two orthogonal task-based performance studies centered around the state-of-the-art W2VV++ query representation learning model for video retrieval. First, an ablation study is presented to investigate which components of the model are effective in two types of benchmark tasks focusing on high recall. Second, interactive search scenarios from the Video Browser Showdown are analyzed for two winning prototype systems implementing a selected variant of the model and providing additional querying and visualization components. The analysis of collected logs demonstrates that even with the state-of-the-art text search video retrieval model, it is still auspicious to integrate users into the search process for task types, where high recall is essential.
Published: 2020

5. iCap: Interactive Image Captioning with Predictive Text

Author: Xirong Li and Zhengxiong Jia
Subjects: FOS: Computer and information sciences, Closed captioning, Standard test image, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Deep learning, Computer Science - Human-Computer Interaction, Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Inference, 02 engineering and technology, computer.software_genre, Sentence completion tests, Human-Computer Interaction (cs.HC), Asynchronous communication, 0202 electrical engineering, electronic engineering, information engineering, Human-in-the-loop, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Predictive text, Natural language processing
Abstract: In this paper we study a brand new topic of interactive image captioning with human in the loop. Different from automated image captioning where a given test image is the sole input in the inference stage, we have access to both the test image and a sequence of (incomplete) user-input sentences in the interactive scenario. We formulate the problem as Visually Conditioned Sentence Completion (VCSC). For VCSC, we propose asynchronous bidirectional decoding for image caption completion (ABD-Cap). With ABD-Cap as the core module, we build iCap, a web-based interactive image captioning system capable of predicting new text with respect to live input from a user. A number of experiments covering both automated evaluations and real user studies show the viability of our proposals.
Published: 2020

6. W2VV++

Author: Chaoxi Xu, Zhineng Chen, Gang Yang, Jianfeng Dong, and Xirong Li
Subjects: Matching (statistics), Computer science, business.industry, Deep learning, 020207 software engineering, 02 engineering and technology, Machine learning, computer.software_genre, TRECVID, Ranking (information retrieval), Ranking, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Feature learning, computer, Sentence
Abstract: Ad-hoc video search (AVS) is an important yet challenging problem in multimedia retrieval. Different from previous concept-based methods, we propose a fully deep learning method for query representation learning. The proposed method requires no explicit concept modeling, matching and selection. The backbone of our method is the proposed W2VV++ model, a super version of Word2VisualVec (W2VV) previously developed for visual-to-text matching. W2VV++ is obtained by tweaking W2VV with a better sentence encoding strategy and an improved triplet ranking loss. With these simple yet important changes, W2VV++ brings in a substantial improvement. As our participation in the TRECVID 2018 AVS task and retrospective experiments on the TRECVID 2016 and 2017 data show, our best single model, with an overall inferred average precision (infAP) of 0.157, outperforms the state-of-the-art. The performance can be further boosted by model ensemble using late average fusion, reaching a higher infAP of 0.163. With W2VV++, we establish a new baseline for ad-hoc video search.
Published: 2019
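
The abstract above mentions a model ensemble using late average fusion. A minimal sketch of that idea, assuming each model has already produced one relevance score per candidate video for the same query; the function names and scores are hypothetical and not taken from the W2VV++ code.

```python
def late_average_fusion(score_lists):
    """Average per-item scores across models.

    score_lists holds one list per model; position i in every list refers to
    the same candidate item.
    """
    n_items = len(score_lists[0])
    if any(len(scores) != n_items for scores in score_lists):
        raise ValueError("all models must score the same set of items")
    n_models = len(score_lists)
    return [sum(scores[i] for scores in score_lists) / n_models
            for i in range(n_items)]


def rank_items(scores):
    """Return item indices sorted by descending fused score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Two hypothetical models scoring three candidate videos for one query.
model_a = [0.9, 0.1, 0.5]
model_b = [0.3, 0.2, 0.8]
fused = late_average_fusion([model_a, model_b])
ranking = rank_items(fused)
```

Averaging assumes the models' scores are on comparable scales; if they are not, some normalization per model would be needed first.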

7. Deep Learning for Video Retrieval by Natural Language

Author: Xirong Li
Subjects: Information retrieval, Rule-based machine translation, Computer science, business.industry, Deep learning, Code (cryptography), Information needs, Artificial intelligence, Blackboard (design pattern), business, Set (psychology), TRECVID, Natural language
Abstract: Videos are everywhere. Video retrieval, i.e., finding videos that meet the information need of a specific user, is important for a wide range of applications including communication, education, entertainment, business, security etc. Among multiple ways of expressing the information need, a natural-language text is the most intuitive to start a retrieval process. For instance, to find video shots showing "a person in front of a blackboard talking or writing in a classroom". Such a query can be submitted easily, by typing or speech recognition, to a video retrieval system. Given a video as a sequence of frames and a query as a sequence of words, a fundamental problem in video retrieval by natural language is how to properly associate visual and linguistic information presented in sequential order. We attempt to address the fundamental problem by decomposing our quest along the following three dimensions: (1) Query representation, (2) Video representation, (3) Common space. The three dimensions also account for major designs in the state-of-the-art systems. We introduce a set of deep learning methods recently developed by our joint team of RUC, ZJGU, UvA and CAS. We evaluate the deep models on the TRECVID Ad-hoc Video Search (AVS) benchmark over the last three years (2016-2018). Much room exists for future research. Compared to video retrieval with semantic representations, deep learning approaches lack an intuitive explanation of the results obtained, in particular when the results are unsatisfactory. As the retrieval performance continues to improve, the accountability of a video retrieval model requires more research attention. While a well-performed deep model can be largely expected given adequate training data, novel algorithms that enable learning a video retrieval model from limited training resource are much in demand. Consider, for instance, visual annotation and retrieval for a target language other than English. 
Data and code used for this research are available at http://github.com/li-xirong/video-retrieval.
Published: 2019

8. Exploring Content-based Video Relevance for Video Click-Through Rate Prediction

Author: Leimin Zhang, Miao Zhang, Yali Du, Xun Wang, Xirong Li, and Jianfeng Dong
Subjects: Constraint (information theory), business.industry, Computer science, Content (measure theory), Stability (learning theory), Relevance (information retrieval), Artificial intelligence, business, Machine learning, computer.software_genre, Click-through rate, Feature learning, computer
Abstract: This paper describes our solution for the Hulu Challenge. To answer the challenge, we introduce two content-based models, namely, Cascading Mapping Network (CMN) and Relevant-Enhanced Deep Interest Network (REDIN). CMN predicts video Click-Through Rate (CTR) by predicting content-based video relevance. REDIN mainly improves the popular Deep Interest Network by adding explicit video relevance constraint, which provides guidance for low-level video feature learning thus helpful for CTR prediction. Based on the two models, our solution obtains Area Under Curve (AUC) score of 0.6022 and 0.6155 on the TV-shows and Movie track respectively. What is more, we are one of the only two teams giving scores of over 0.6 on both tracks. The results justify the effectiveness and stability of our proposed solution.
Published: 2019

9. Dissimilarity Representation Learning for Generalized Zero-Shot Recognition

Author: Jinlu Liu, Gang Yang, Jieping Xu, and Xirong Li
Subjects: Similarity (geometry), Artificial neural network, Computer science, business.industry, Feature vector, Pattern recognition, 02 engineering and technology, Margin (machine learning), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Benchmark (computing), Feature (machine learning), 020201 artificial intelligence & image processing, Artificial intelligence, business, Representation (mathematics), Feature learning
Abstract: Generalized zero-shot learning (GZSL) aims to recognize any test instance coming either from a known class or from a novel class that has no training instance. To synthesize training instances for novel classes and thus resolve GZSL as a common classification problem, we propose a Dissimilarity Representation Learning (DSS) method. Dissimilarity representation is to represent a specific instance in terms of its (dis)similarity to other instances in a visual or attribute based feature space. In the dissimilarity space, instances of the novel classes are synthesized by an end-to-end optimized neural network. The neural network realizes two-level feature mappings and domain adaptations in the dissimilarity space and the attribute based feature space. Experimental results on five benchmark datasets, i.e., AWA, AWA2, SUN, CUB, and aPY, show that the proposed method improves the state-of-the-art by a large margin, approximately 10% gain in terms of the harmonic mean of the top-1 accuracy. Consequently, this paper establishes a new baseline for GZSL.
Published: 2018
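
The abstract above reports its gain in terms of the harmonic mean of the top-1 accuracy. A minimal sketch, assuming the common GZSL convention that the two quantities combined are the top-1 accuracy on known (seen) classes and on novel (unseen) classes; the function name and the sample accuracies are illustrative, not results from the paper.

```python
def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of two accuracies; defined as 0.0 when both are zero."""
    total = acc_seen + acc_unseen
    if total == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / total


# Illustrative values only: a model that is strong on seen classes but weak on
# unseen ones is penalized more than by an arithmetic mean.
h_unbalanced = harmonic_mean(0.8, 0.2)
h_balanced = harmonic_mean(0.5, 0.5)
```

Both pairs have the same arithmetic mean (0.5), but the unbalanced pair scores lower under the harmonic mean, which is why this measure is used when a model must handle known and novel classes together.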

10. Feature Re-Learning with Data Augmentation for Content-based Video Recommendation

Author: Xun Wang, Xirong Li, Jianfeng Dong, Gang Yang, and Chaoxi Xu
Subjects: business.industry, Computer science, Supervised learning, 02 engineering and technology, Machine learning, computer.software_genre, Feature (computer vision), 020204 information systems, Content (measure theory), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Baseline (configuration management), computer
Abstract: This paper describes our solution for the Hulu Content-based Video Relevance Prediction Challenge. Noting the deficiency of the original features, we propose feature re-learning to improve video relevance prediction. To generate more training instances for supervised learning, we develop two data augmentation strategies, one for frame-level features and the other for video-level features. In addition, late fusion of multiple models is employed to further boost the performance. Evaluation conducted by the organizers shows that our best run outperforms the Hulu baseline, obtaining relative improvements of 26.2% and 30.2% on the TV-shows track and the Movies track, respectively, in terms of recall@100. The results clearly justify the effectiveness of the proposed solution.
Published: 2018

11. Session details: FF-5

Author: Xirong Li
Subjects: Multimedia, Computer science, Session (computer science), computer.software_genre, computer
Published: 2018

12. Imagination Based Sample Construction for Zero-Shot Learning

Author: Jinlu Liu, Gang Yang, and Xirong Li
Subjects: FOS: Computer and information sciences, Imagination, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Feature vector, media_common.quotation_subject, 05 social sciences, Supervised learning, Computer Science - Computer Vision and Pattern Recognition, Probabilistic logic, Pattern recognition, 02 engineering and technology, Zero shot learning, 050105 experimental psychology, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 0501 psychology and cognitive sciences, Artificial intelligence, business, Image retrieval, Classifier (UML), media_common
Abstract: Zero-shot learning (ZSL), which aims to recognize unseen classes with no labeled training sample, efficiently tackles the problem of missing labeled data in image retrieval. Nowadays there are mainly two types of popular methods for ZSL to recognize images of unseen classes: probabilistic reasoning and feature projection. Different from these existing types of methods, we propose a new method, sample construction, to deal with the problem of ZSL. Our proposed method, called Imagination Based Sample Construction (IBSC), innovatively constructs image samples of target classes in feature space by mimicking the human associative cognition process. Based on an association between attribute and feature, target samples are constructed from different parts of various samples. Furthermore, dissimilarity representation is employed to select high-quality constructed samples, which are used as labeled data to train a specific classifier for those unseen classes. In this way, zero-shot learning is turned into a supervised learning problem. As far as we know, this is the first work to construct samples for ZSL; thus, our work can be viewed as a baseline for future sample construction methods. Experiments on four benchmark datasets show the superiority of our proposed method. Accepted as a short paper in ACM SIGIR 2018.
Published: 2018

13. Harvesting Deep Models for Cross-Lingual Image Annotation

Author: Xirong Li, Xiaoxu Wang, and Qijie Wei
Subjects: Vocabulary, Matching (graph theory), Machine translation, Computer science, business.industry, media_common.quotation_subject, 02 engineering and technology, Ambiguity, computer.software_genre, Automatic image annotation, 020204 information systems, Test set, 0202 electrical engineering, electronic engineering, information engineering, Redundancy (engineering), 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Image retrieval, Natural language processing, media_common
Abstract: This paper considers cross-lingual image annotation, harvesting deep visual models from one language to annotate images with labels from another language. This task cannot be accomplished by machine translation, as labels can be ambiguous and a translated vocabulary leaves us limited freedom to annotate images with appropriate labels. Given non-overlapping vocabularies between two languages, we formulate cross-lingual image annotation as a zero-shot learning problem. For cross-lingual label matching, we adapt zero-shot by replacing the current monolingual semantic embedding space by a bilingual alternative. In order to reduce both label ambiguity and redundancy we propose a simple yet effective approach called label-enhanced zero-shot learning. Using three state-of-the-art deep visual models, i.e., ResNet-152, GoogleNet-Shuffle and OpenImages, experiments on the test set of Flickr8k-CN demonstrate the viability of the proposed approach for cross-lingual image annotation.
Published: 2017

14. Detecting Violence in Video using Subclasses

Author: Qin Jin, Yujia Huo, Jieping Xu, and Xirong Li
Subjects: FOS: Computer and information sciences, Computer science, Generalization, business.industry, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 020207 software engineering, 02 engineering and technology, Machine learning, computer.software_genre, Motion (physics), Multimedia (cs.MM), Test set, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Computer Science - Multimedia, Test data
Abstract: This paper attacks the challenging problem of violence detection in videos. Different from existing works focusing on combining multi-modal features, we go one step further by adding and exploiting subclasses visually related to violence. We enrich the MediaEval 2015 violence dataset by manually labeling violence videos with respect to the subclasses. Such fine-grained annotations not only help understand what have impeded previous efforts on learning to fuse the multi-modal features, but also enhance the generalization ability of the learned fusion to novel test data. The new subclass based solution, with AP of 0.303 and P100 of 0.55 on the MediaEval 2015 test set, outperforms the state-of-the-art. Notice that our solution does not require fine-grained annotations on the test set, so it can be directly applied on novel and fully unlabeled videos. Interestingly, our study shows that motion related features (MBH, HOG and HOF), though being essential part in previous systems, are seemingly dispensable. Data is available at http://lixirong.net/datasets/mm2016vsd
Published: 2016

15. Adding Chinese Captions to Images

Author: Xirong Li, Weiyu Lan, Jianfeng Dong, and Hailong Liu
Subjects: Closed captioning, Machine translation, Computer science, business.industry, Context (language use), 02 engineering and technology, computer.software_genre, Set (abstract data type), 03 medical and health sciences, 0302 clinical medicine, Automatic image annotation, 030221 ophthalmology & optometry, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, Dimension (data warehouse), business, Image retrieval, computer, Sentence, Natural language processing
Abstract: This paper extends research on automated image captioning in the dimension of language, studying how to generate Chinese sentence descriptions for unlabeled images. To evaluate image captioning in this novel context, we present Flickr8k-CN, a bilingual extension of the popular Flickr8k set. The new multimedia dataset can be used to quantitatively assess the performance of Chinese captioning and English-Chinese machine translation. The possibility of re-using existing English data and models via machine translation is investigated. Our study reveals to some extent that a computer can master two distinct languages, English and Chinese, at a similar level for describing the visual world. Data is publicly available at http://tinyurl.com/flickr8kcn
Published: 2016

16. Image Tag Assignment, Refinement and Retrieval

Author: Lamberto Ballan, Marco Bertini, Alberto Del Bimbo, Xirong Li, Cees G. M. Snoek, and Tiberio Uricchio
Subjects: tag refinement, Information retrieval, tag retrieval, Computer science, Context (language use), Content-based image retrieval, tag assignment, Session (web analytics), Set (abstract data type), social tagging, tag relevance, Automatic image annotation, Image retrieval
Abstract: This tutorial focuses on challenges and solutions for content-based image annotation and retrieval in the context of online image sharing and tagging. We present a unified review on three closely linked problems, i.e., tag assignment, tag refinement, and tag-based image retrieval. We introduce a taxonomy to structure the growing literature, understand the ingredients of the main works, clarify their connections and difference, and recognize their merits and limitations. Moreover, we present an open-source testbed, with training sets of varying sizes and three test datasets, to evaluate methods of varied learning complexity. A selected set of eleven representative works have been implemented and evaluated. During the tutorial we provide a practice session for hands on experience with the methods, software and datasets. For repeatable experiments all data and code are online at http://www.micc.unifi.it/tagsurvey
Published: 2015

17. Image Retrieval by Cross-Media Relevance Fusion

Author: Duanqing Xu, Xiaoyong Du, Jieping Xu, Xirong Li, Shuai Liao, and Jianfeng Dong
Subjects: Set (abstract data type), Information retrieval, Computer science, Simple (abstract algebra), Key (cryptography), Relevance (information retrieval), Data mining, Visual Word, computer.software_genre, Base (topology), computer, Image retrieval, Image (mathematics)
Abstract: How to estimate cross-media relevance between a given query and an unlabeled image is a key question in the MSR-Bing Image Retrieval Challenge. We answer the question by proposing cross-media relevance fusion, a conceptually simple framework that exploits the power of individual methods for cross-media relevance estimation. Four base cross-media relevance functions are investigated, and later combined by weights optimized on the development set. With DCG25 of 0.5200 on the test dataset, the proposed image retrieval system secures the first place in the evaluation.
Published: 2015

18. Towards structured semantic embedding of multimedia

Author: Shuai Liao, Yujia Huo, Xixi He, Weiyu Lan, and Xirong Li
Subjects: Information retrieval, Multimedia, business.industry, Computer science, WordNet, computer.software_genre, Semantic grid, Semantic similarity, Semantic equivalence, Semantic computing, Embedding, Semantic technology, Artificial intelligence, Semantic Web Stack, business, computer, Natural language processing
Abstract: This abstract paper sketches our research towards Structured Semantic Embedding of multimedia data. Though a tag may have multiple senses with completely different visual imagery, current semantic embedding methods represent the tag by a single vector regardless of its senses. We challenge this convention, arguing the importance of adding semantic structures into semantic embedding. In particular, we develop Hierarchical Semantic Embedding, a simple model that exploits the WordNet hierarchy to make the semantic embedding structured to some extent. We demonstrate the viability of structured semantic embedding for tag disambiguation and zero-shot image tagging.
Published: 2015

19. Zero-shot Image Tagging by Hierarchical Semantic Embedding

Author: Xiaoyong Du, Weiyu Lan, Gang Yang, Shuai Liao, and Xirong Li
Subjects: Similarity (geometry), Hierarchy (mathematics), Computer science, business.industry, Test set, WordNet, Embedding, Pattern recognition, Artificial intelligence, Language model, Object (computer science), business, Image (mathematics)
Abstract: Given the difficulty of acquiring labeled examples for many fine-grained visual classes, there is an increasing interest in zero-shot image tagging, aiming to tag images with novel labels that have no training examples present. Using a semantic space trained by a neural language model, the current state-of-the-art embeds both images and labels into the space, wherein cross-media similarity is computed. However, for labels of relatively low occurrence, its similarity to images and other labels can be unreliable. This paper proposes Hierarchical Semantic Embedding (HierSE), a simple model that exploits the WordNet hierarchy to improve label embedding and consequently image embedding. Moreover, we identify two good tricks, namely training the neural language model using Flickr tags instead of web documents, and using partial match instead of full match for vectorizing a WordNet node. All this lets us outperform the state-of-the-art. On a test set of over 1,500 visual object classes and 1.3 million images, the proposed model beats the current best results (18.3% versus 9.4% in hit@1).
Published: 2015
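
The abstract above reports results in hit@1. A minimal sketch of a hit@k computation, assuming each test image has a single ground-truth label and the model returns a ranked list of predicted labels per image; the function name and the toy labels are hypothetical.

```python
def hit_at_k(ranked_labels_per_image, true_labels, k=1):
    """Fraction of images whose true label appears among the top-k predictions."""
    if len(ranked_labels_per_image) != len(true_labels):
        raise ValueError("need exactly one ranked list per ground-truth label")
    hits = sum(1 for ranked, truth in zip(ranked_labels_per_image, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)


# Toy example with three images (labels are made up for illustration).
predictions = [
    ["husky", "wolf", "malamute"],
    ["tabby", "lynx", "tiger"],
    ["sparrow", "finch", "robin"],
]
ground_truth = ["husky", "lynx", "sparrow"]
hit1 = hit_at_k(predictions, ground_truth, k=1)
hit2 = hit_at_k(predictions, ground_truth, k=2)
```

With k=1 only the top-ranked label counts, so the second image is a miss; raising k to 2 admits its second-ranked label.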

20. Music Positioning and Annotation For Television Videos

Author: Gang Yang, Jieping Xu, and Xirong Li
Subjects: Annotation, Multimedia, Computer science, Annotation database, computer.software_genre, Closed loop, computer, Classifier (UML), Statistic, Extractor
Abstract: This paper proposes a framework for automatically locating and annotating music use within videos, in order to monitor and protect music copyrights. Music copyright holders are paying increasing attention to their rights, so music embedded in TV channel videos should be validated to avoid infringement. Our framework supports the Music Copyrighter Society of China (MCSC) in the statistical work needed to protect copyright owners. In our framework, music positions can be determined effectively through AV separation, feature extraction, classification and assemblage functions. Music fingerprint retrieval is then applied to annotate the music automatically with high accuracy. Moreover, our framework is a closed-loop self-adaptive system, as it can be re-trained regularly to expand the annotation database and enhance the classifier's efficiency. A system based on our framework has been deployed at MCSC and its effectiveness has been evaluated in a real-life scenario. Experiments on real-life TV stations and comparisons with former works show that automatic music positioning and annotation by our system improve working efficiency by over 30 times.
Published: 2015

21. Source Separation Improves Music Emotion Recognition

Author: Jieping Xu, Xirong Li, Gang Yang, and Yun Hao
Subjects: Music and emotion, Computer science, Speech recognition, Source separation, Emotion recognition, Singing, Set (psychology), Music emotion recognition
Abstract: Despite the impressive progress in music emotion recognition, it remains unclear which aspect of a song, i.e., the singing voice or the accompanying music, carries more emotional information. As an initial attempt to answer the question, we introduce source separation into a standard music emotion recognition system. This allows us to compare systems with and without source separation, and consequently reveal the influence of singing voice and accompanying music on emotion recognition. Classification experiments on a set of 267 songs with last.fm annotations verify the new finding that source separation improves music emotion recognition for songs.
Published: 2014

22. Classifying Tag Relevance with Relevant Positive and Negative Examples

Author: Cees G. M. Snoek, Xirong Li, and Intelligent Sensory Information Systems (IVI, FNWI)
Subjects: Support vector machine, Computer science, Benchmark (computing), Process (computing), Relevance (information retrieval), Data mining, computer.software_genre, computer, Image (mathematics)
Abstract: Image tag relevance estimation aims to automatically determine whether what people label about images is factually present in the pictorial content. Different from previous works, which either use only positive examples of a given tag or use positive and random negative examples, we argue the importance of relevant positive and relevant negative examples for tag relevance estimation. We propose a system that selects positive and negative examples, deemed most relevant with respect to the given tag from crowd-annotated images. While applying models for many tags could be cumbersome, our system trains efficient ensembles of Support Vector Machines per tag, enabling fast classification. Experiments on two benchmark sets show that the proposed system compares favorably against five present-day methods. Given extracted visual features, for each image our system can process up to 3,787 tags per second. The new system is both effective and efficient for tag relevance estimation.
Published: 2013

23. Social negative bootstrapping for visual categorization

Author: Cees G. M. Snoek, Xirong Li, Arnold W. M. Smeulders, Marcel Worring, Human-Centered Data Analytics, and Intelligent Sensory Information Systems (IVI, FNWI)
Subjects: Adaptive sampling, Categorization, Computer science, Human interaction, business.industry, Pattern recognition, Artificial intelligence, Machine learning, computer.software_genre, business, computer, Classifier (UML), De facto standard
Abstract: To learn classifiers for many visual categories, obtaining labeled training examples in an efficient way is crucial. Since a classifier tends to misclassify negative examples which are visually similar to positive examples, inclusion of such informative negatives should be stressed in the learning process. However, they are unlikely to be hit by random sampling, the de facto standard in the literature. In this paper, we go beyond random sampling by introducing a novel social negative bootstrapping approach. Given a visual category and a few positive examples, the proposed approach adaptively and iteratively harvests informative negative examples from a large amount of social-tagged images. To label negative examples without human interaction, we design an effective virtual labeling procedure based on simple tag reasoning. Virtual labeling, in combination with adaptive sampling, enables us to select the most misclassified negatives as the informative samples. Learning from the positive set and a series of informative negative sets results in visual classifiers with higher accuracy. Experiments on two present-day image benchmarks employing 650K virtually labeled negative examples show the viability of the proposed approach. On a popular visual categorization benchmark our precision at 20 increases by 34%, compared to baselines trained on randomly sampled negatives. The robustness of the proposed approach is verified by a cross-dataset experiment. The results clearly show the advantage of our approach: more accurate visual categorization without the need for manually labeling any negatives.
Published: 2011
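One round of the harvesting step described in this abstract might look like the sketch below. It is not the authors' implementation: the rule that an image is a virtual negative when none of its tags match the category terms, the `pool` record layout, and the stand-in `score` function are all assumptions for illustration.

```python
import random


def is_virtual_negative(tags, category_terms):
    """Assumed tag-reasoning rule: no tag overlaps with the category terms."""
    return not (set(tags) & set(category_terms))


def harvest_informative_negatives(pool, category_terms, score,
                                  sample_size, keep, seed=0):
    """Sample virtual negatives, then keep those the current classifier
    scores highest, i.e. the ones it misclassifies most strongly."""
    negatives = [img for img in pool
                 if is_virtual_negative(img["tags"], category_terms)]
    rng = random.Random(seed)
    sampled = rng.sample(negatives, min(sample_size, len(negatives)))
    # A high score on a negative example means a confident mistake.
    sampled.sort(key=score, reverse=True)
    return sampled[:keep]


# Hypothetical pool; "score" stands in for the current classifier's output.
pool = [
    {"id": "a", "tags": ["cat", "sofa"], "score": 0.9},
    {"id": "b", "tags": ["dog", "park"], "score": 0.95},
    {"id": "c", "tags": ["street"], "score": 0.2},
    {"id": "d", "tags": ["wolf"], "score": 0.7},
]
informative = harvest_informative_negatives(
    pool, {"dog", "puppy"}, score=lambda img: img["score"],
    sample_size=10, keep=2)
```

In an iterative setting, the selected negatives would be used to train a new classifier, whose scores then drive the next round of selection.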

24. Visual categorization with negative examples for free

Author: Xirong Li and Cees G. M. Snoek
Subjects: Empirical research, Categorization, Computer science, business.industry, Relative loss, Supervised learning, Pascal (programming language), Artificial intelligence, Machine learning, computer.software_genre, business, computer, computer.programming_language
Abstract: Automatic visual categorization is critically dependent on labeled examples for supervised learning. As an alternative to traditional expert labeling, social-tagged multimedia is becoming a novel yet subjective and inaccurate source of learning examples. Different from existing work focusing on collecting positive examples, we study in this paper the potential of substituting social tagging for expert labeling for creating negative examples. We present an empirical study using 6.5 million Flickr photos as a source of social tagging. Our experiments on the PASCAL VOC challenge 2008 show that with a relative loss of only 4.3% in terms of mean average precision, expert-labeled negative examples can be completely replaced by social-tagged negative examples for consumer photo categorization.
Published: 2009

25. Learning tag relevance by neighbor voting for social image retrieval

Author: Xirong Li, Marcel Worring, Cees G. M. Snoek, and Intelligent Sensory Information Systems (IVI, FNWI)
Subjects: Information retrieval, Social image, business.industry, Computer science, Voting, media_common.quotation_subject, Voting algorithm, Pattern recognition, Artificial intelligence, Tag cloud, business, Intuition, media_common
Abstract: Social image retrieval is important for exploiting the increasing amounts of amateur-tagged multimedia such as Flickr images. Since amateur tagging is known to be uncontrolled, ambiguous, and personalized, a fundamental problem is how to reliably interpret the relevance of a tag with respect to the visual content it is describing. Intuitively, if different persons label similar images using the same tags, these tags are likely to reflect objective aspects of the visual content. Starting from this intuition, we propose a novel algorithm that scalably and reliably learns tag relevance by accumulating votes from visually similar neighbors. Further, treated as tag frequency, learned tag relevance is seamlessly embedded into current tag-based social image retrieval paradigms. Preliminary experiments on one million Flickr images demonstrate the potential of the proposed algorithm. Overall comparisons for both single-word queries and multiple-word queries show substantial improvement over the baseline by learning and using tag relevance. Specifically, compared with the baseline using the original tags, on average, retrieval using improved tags increases mean average precision by 24%, from 0.54 to 0.67. Moreover, simulated experiments indicate that performance can be improved further by scaling up the number of images used in the proposed neighbor voting algorithm.
Published: 2008
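The intuition of accumulating votes from visually similar neighbors can be illustrated with a minimal sketch. It uses raw vote counts and brute-force Euclidean nearest neighbors on toy features; any normalization, neighbor constraints, or scalable indexing used in the published algorithm are not reproduced here, and the names below are hypothetical.

```python
import math
from collections import Counter


def neighbor_vote_relevance(query_feature, query_tags, collection, k=3):
    """Score each tag of the query image by how many of its k visually
    nearest neighbors carry the same tag (raw vote count)."""
    neighbors = sorted(
        collection,
        key=lambda item: math.dist(query_feature, item[0]),
    )[:k]
    votes = Counter(tag for _, tags in neighbors for tag in set(tags))
    return {tag: votes[tag] for tag in query_tags}


# Hypothetical collection of (visual feature, tags) pairs.
collection = [
    ([0.1, 0.0], ["dog", "pet"]),
    ([0.0, 0.2], ["dog"]),
    ([0.3, 0.1], ["dog", "grass"]),
    ([5.0, 5.0], ["2008", "car"]),
]
relevance = neighbor_vote_relevance([0.0, 0.0], ["dog", "2008"], collection)
```

Here the tag `dog` collects a vote from each of the three nearest neighbors, while the subjective tag `2008` collects none, so it would rank lower when the scores are used in place of tag frequency.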

26. SBIA

Author: Xin-Jing Wang, Changhu Wang, Xirong Li, and Lei Zhang
Subjects: Annotation, Search engine, Automatic image annotation, Information retrieval, Ranking, Computer science, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Relevance (information retrieval), Cluster analysis, Image retrieval, Image (mathematics), Ranking (information retrieval)
Abstract: In this technical demonstration, we showcase the SBIA system - a search-based image annotation system. At the heart of the system lies a very large-scale image search engine which indexed three million Web images and supports both text and visual queries. Given an image (with initial annotations), SBIA first finds semantically/visually similar images via the search engine, and then mines representative keywords from the retrieved images. These keywords, after annotation rejection and relevance ranking, are finally used to annotate the query image.
Published: 2007

27. The importance of query-concept-mapping for automatic video retrieval

Author: Bo Zhang, Xirong Li, Dong Wang, and Jianmin Li
Subjects: Information retrieval, Concept search, Computer science, Concept map, business.industry, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, computer.software_genre, Lexicon, TRECVID, Query expansion, Human–computer information retrieval, Language model, Visual Word, Artificial intelligence, business, computer, Natural language processing
Abstract: A new video retrieval paradigm of query-by-concept has emerged recently. However, it remains unclear how to exploit the detected concepts in retrieval given a multimedia query. In this paper, we point out that it is important to map the query to a few relevant concepts instead of searching with all concepts. In addition, we show that solving this problem through both text and image inputs is effective for search, and it is possible to determine the number of related concepts by a language modeling approach. Experimental evidence is obtained on the automatic search task of TRECVID 2006 using a large lexicon of 311 learned semantic concept detectors.
Published: 2007

28. Video retrieval with multi-modal features

Author: Wujie Zheng, Bo Zhang, Jianmin Li, Xirong Li, Zhikun Wang, Tongchun Xiao, and Dong Wang
Subjects: Cognitive models of information retrieval, Decision support system, Information retrieval, Modalities, Modal, Computer science, Human–computer information retrieval, Relevance (information retrieval), Visual Word, Visualization
Abstract: In this paper, our video retrieval system is presented. The system acts as a decision support system that helps users find what they want with the many analysis and visualization tools it provides. It consists of three basic retrieval models, which search shots in text, image, and concept space respectively. The results from different modalities are fused to achieve better performance. The relevant shots are shown to users in different threads and expanded in different ways to help users make correct decisions during the retrieval procedure.
Published: 2007

29. Video search in concept subspace

Author: Bo Zhang, Xirong Li, Jianmin Li, and Dong Wang
Subjects: Search engine, Information retrieval, Concept search, Web search query, Computer science, business.industry, Semantic search, Beam search, Full text search, Phrase search, business, TRECVID
Abstract: Though both quantity and quality of semantic concept detection in video are continuously improving, it still remains unclear how to exploit these detected concepts as semantic indices in video search, given a specific query. In this paper, we tackle this problem and propose a video search framework which operates like searching text documents. Noteworthy for its adoption of the well-founded text search principles, this framework first selects a few related concepts for a given query, by employing a tf-idf like scheme, called c-tf-idf, to measure the informativeness of the concepts to this query. These selected concepts form a concept subspace. Then search can be conducted in this concept subspace, either by a Vector Model or a Language Model. Further, two algorithms, i.e., Linear Summation and Random Walk through Concept-Link, are explored to combine the concept search results and other baseline search results in a reranking scheme. This framework is both effective and efficient. Using a lexicon of 311 concepts from the LSCOM concept ontology, experiments conducted on the TRECVID 2006 search data set show that: when used solely, search within the concept subspace achieves the state-of-the-art concept search result; when used to rerank the baseline results, it can improve over the top 20 automatic search runs in TRECVID 2006 on average by approx. 20%, on the most significant one by approx. 50%, all within 180 milliseconds on a normal PC.
Published: 2007
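The two stages named in this abstract, selecting a few query-related concepts with a tf-idf like weight and then combining concept search scores with baseline scores by linear summation, can be sketched roughly as below. The exact c-tf-idf definition is not given in the abstract, so the smoothed idf, the detection `threshold`, the mixing weight `alpha`, and all data are assumptions.

```python
import math


def select_concepts(query_scores, collection_scores, top_n=2, threshold=0.5):
    """Rank concepts by an assumed tf-idf style weight: detector confidence
    on the query (tf) times a smoothed inverse shot frequency (idf)."""
    n_shots = len(collection_scores)
    weights = {}
    for concept, tf in query_scores.items():
        df = sum(1 for shot in collection_scores
                 if shot.get(concept, 0.0) >= threshold)
        idf = math.log((1 + n_shots) / (1 + df))
        weights[concept] = tf * idf
    return sorted(weights, key=weights.get, reverse=True)[:top_n]


def linear_summation(baseline, concept, alpha=0.5):
    """Combine per-shot baseline and concept-search scores linearly."""
    shots = set(baseline) | set(concept)
    return {s: (1 - alpha) * baseline.get(s, 0.0) + alpha * concept.get(s, 0.0)
            for s in shots}


# Hypothetical detector confidences for a query and four collection shots.
query_scores = {"sky": 0.9, "car": 0.8, "boat": 0.1}
collection_scores = [
    {"sky": 0.9, "car": 0.7},
    {"sky": 0.8},
    {"sky": 0.95, "boat": 0.6},
    {"sky": 0.7},
]
subspace = select_concepts(query_scores, collection_scores)
reranked = linear_summation({"s1": 1.0, "s2": 0.0}, {"s2": 1.0})
```

In this toy data `sky` fires on every shot, so its idf is zero and it is dropped despite a high query confidence, which is the behavior a tf-idf style weight is meant to provide.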

30. Image annotation by large-scale content-based image retrieval

Author: Le Chen, Fuzong Lin, Lei Zhang, Xirong Li, and Wei-Ying Ma
Subjects: Information retrieval, Automatic image annotation, Computer science, Image map, Nearest neighbor search, Search engine indexing, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Visual Word, Cluster analysis, Content-based image retrieval, Image retrieval
Abstract: Image annotation has been an active research topic in recent years due to its potentially large impact on both image understanding and Web image search. In this paper, we aim to solve the automatic image annotation problem in a novel search and mining framework. Given an uncaptioned image, first in the search stage, we perform content-based image retrieval (CBIR) facilitated by high-dimensional indexing to find a set of visually similar images from a large-scale image database. The database consists of images crawled from the World Wide Web with rich annotations, e.g. titles and surrounding text. Then in the mining stage, a search result clustering technique is utilized to find the most representative keywords from the annotations of the retrieved image subset. These keywords, after salience ranking, are finally used to annotate the uncaptioned image. Based on search technologies, this framework does not impose an explicit training stage, but efficiently leverages large-scale and well-annotated images, and is potentially capable of dealing with unlimited vocabulary. Based on 2.4 million real Web images, comprehensive evaluation of image annotation on Corel and U. Washington image databases shows the effectiveness and efficiency of the proposed approach.
Published: 2006

30 results on '"Xirong Li"'
