Author: "CHANGSHENG XU" / Topic: computingmethodologies_imageprocessingandcomputervision - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"CHANGSHENG XU"' showing total 113 results



1. Tell, Imagine, and Search: End-to-end Learning for Composing Text and Image to Image Retrieval

Author: Feifei Zhang, Mingliang Xu, and Changsheng Xu
Subjects: Computer Networks and Communications, Hardware and Architecture, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
Abstract: Composing Text and Image to Image Retrieval (CTI-IR) is an emerging task in computer vision, which allows retrieving images relevant to a query image with text describing desired modifications to the query image. Most conventional cross-modal retrieval approaches take data of one modality as the query to retrieve relevant data of another modality. Different from the existing methods, in this article, we propose an end-to-end trainable network for simultaneous image generation and CTI-IR. The proposed model is based on Generative Adversarial Network (GAN) and enjoys several merits. First, it can learn a generative and discriminative feature for the query (a query image with text description) by jointly training a generative model and a retrieval model. Second, our model can automatically manipulate the visual features of the reference image in terms of the text description by the adversarial learning between the synthesized image and target image. Third, global-local collaborative discriminators and attention-based generators are exploited, allowing our approach to focus on both the global and local differences between the query image and the target image. As a result, the semantic consistency and fine-grained details of the generated images can be better enhanced in our model. The generated image can also be used to interpret and empower our retrieval model. Quantitative and qualitative evaluations on three benchmark datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
Published: 2022

2. Joint Expression Synthesis and Representation Learning for Facial Expression Recognition

Author: Feifei Zhang, Changsheng Xu, and Xi Zhang
Subjects: business.industry, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Overfitting, Task (project management), ComputingMethodologies_PATTERNRECOGNITION, Facial expression recognition, Media Technology, Identity (object-oriented programming), Artificial intelligence, Electrical and Electronic Engineering, Representation (mathematics), business, Joint (audio engineering), Feature learning
Abstract: Facial expression recognition (FER) is a challenging task due to the large appearance variations and the lack of sufficient training data. Conventional deep approaches either learn a good representation through deep models or synthesize images automatically to enlarge the training set. In this paper, we perform both tasks jointly and propose an end-to-end deep model for simultaneous facial expression recognition and facial image synthesis. The proposed model is based on Generative Adversarial Network (GAN) and enjoys several merits. First, the facial image synthesis and facial expression recognition tasks can boost each other's performance via the unified model. Second, paired images are not required in our facial image synthesis network, which makes the proposed model much more general and flexible. Meanwhile, the generated facial images largely expand the training set and ease the overfitting problem in our FER task. Third, different expressions are encoded in a disentangled manner in a latent space, which enables us to synthesize facial images with arbitrary expressions by exchanging certain parts of their latent identity features. Quantitative and qualitative evaluations on both controlled and in-the-wild FER benchmarks (Multi-PIE, MMI, and RAF-DB) demonstrate the effectiveness of our proposed method on both the facial image synthesis and facial expression recognition tasks.
Published: 2022

3. Multi-Target Multi-Camera Tracking With Optical-Based Pose Association

Author: Sisi You, Changsheng Xu, and Hantao Yao
Subjects: Similarity (geometry), Matching (graph theory), business.industry, Computer science, Feature extraction, Frame (networking), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Optical flow, Object detection, Visualization, Minimum bounding box, Media Technology, Computer vision, Artificial intelligence, Electrical and Electronic Engineering, business
Abstract: Multi-target multi-camera tracking (MTMCT) aims to automatically generate trajectories of objects that appear under multiple cameras. MTMCT can be treated as a combination of intra-camera tracking and cross-camera tracking. Existing work employs only the global description to generate tracklets. However, the global description cannot model the local similarity between targets, so existing methods are not robust to occlusion and fast motion. To handle this problem, we propose an online Optical-based Pose Association (OPA) for multi-target multi-camera tracking. The proposed method utilizes local pose matching to solve the occlusion problem, and applies optical flow to reduce the distance caused by fast motion. For optical-based pose association, we first employ OpenPose to generate a human pose for each proposal. Then, we utilize the optical flow generated by PWC-Net to adjust the estimated pose for the previous frame. Finally, the modified Object Keypoint Similarity is used to compute the similarity between the pose of the current frame and the adjusted pose in the prior frame. Once the optical-based pose similarity is obtained, we combine it with the visual and bounding box spatial similarities to generate the final similarity matrix, and apply the Kuhn-Munkres algorithm for data association. The experiments on the MTMCT and MOT datasets verify the rationality of using human pose information and prove the superiority of the proposed method.
Published: 2021
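
The association steps named in this abstract (flow-adjust the previous pose, score pose similarity, then solve an assignment) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses the standard COCO-style Object Keypoint Similarity rather than the paper's modified version (which the abstract does not specify), it omits the visual and bounding-box similarities, and it replaces the Kuhn-Munkres algorithm with a brute-force search that returns the same optimal assignment only for very small square matrices. All function names, the `scale` value, and the `kappas` constants are hypothetical.

```python
import math
from itertools import permutations

def adjust_pose(pose, flow):
    """Shift each previous-frame keypoint by its optical-flow vector (dx, dy)."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(pose, flow)]

def oks(pose_a, pose_b, scale, kappas):
    """Standard COCO-style Object Keypoint Similarity (assumed, not the modified one)."""
    terms = [
        math.exp(-((ax - bx) ** 2 + (ay - by) ** 2) / (2 * scale ** 2 * k ** 2))
        for (ax, ay), (bx, by), k in zip(pose_a, pose_b, kappas)
    ]
    return sum(terms) / len(terms)

def best_assignment(sim):
    """Brute-force stand-in for Kuhn-Munkres: maximize total similarity.

    Returns a tuple p where row i is matched to column p[i].
    Only practical for tiny square matrices.
    """
    n = len(sim)
    return max(permutations(range(n)),
               key=lambda p: sum(sim[i][p[i]] for i in range(n)))

# Two tracked people in the previous frame, two keypoints each (toy data).
prev_poses = [[(10, 10), (10, 20)], [(50, 10), (50, 20)]]
flows = [[(5, 0), (5, 0)], [(-5, 0), (-5, 0)]]
# Detections in the current frame, listed in the opposite order.
curr_poses = [[(45, 10), (45, 20)], [(15, 10), (15, 20)]]

adjusted = [adjust_pose(p, f) for p, f in zip(prev_poses, flows)]
kappas = [0.5, 0.5]   # hypothetical per-keypoint constants
scale = 10.0          # hypothetical object scale

# sim[i][j]: similarity of current detection i to previous track j.
sim = [[oks(c, a, scale, kappas) for a in adjusted] for c in curr_poses]
matches = best_assignment(sim)
```

Here `matches[i]` gives the previous-frame track matched to current detection `i`. For realistic numbers of targets, a polynomial-time solver would be the practical choice.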

4. PEN: Pose-Embedding Network for Pedestrian Detection

Author: Hantao Yao, Changsheng Xu, and Yifan Jiao
Subjects: Computer science, business.industry, Pedestrian detection, Feature extraction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, Pedestrian, Object detection, Visualization, Feature (computer vision), 0202 electrical engineering, electronic engineering, information engineering, Media Technology, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, Electrical and Electronic Engineering, business, Pose
Abstract: In the past years, pedestrian detection has achieved significant progress via improving the visual description. However, the visual description is not robust enough to discover occluded pedestrians, which is the bottleneck of existing pedestrian detection methods. To overcome this shortcoming of the visual description, we employ the human pose information, which is complementary to the visual description, to address the occlusion and false positive failure problems in pedestrian detection. The advantage of using human pose information is that the pose estimation model can localize the local part of the pedestrian when the pedestrian is occluded. By embedding the human pose information with the visual description, we propose a novel Pose-Embedding Network for pedestrian detection, which consists of two components: a Region Proposal Network and a Pedestrian Recognition Network. The Region Proposal Network aims to generate many candidate proposals and corresponding confidence scores. Once the candidate proposals are obtained, the Pedestrian Recognition Network distinguishes pedestrian proposals by taking the visual information and pose information into consideration, refining the confidence scores and eliminating the false positives. Given the proposal image, the visual information is extracted with the Visual Feature Module. The Human Pose Module, which is based on the pose estimation model, is used to predict the pose information. Further, the Classification Module is employed to fuse the visual and pose information and generate a pose-embedding pedestrian description. Extensive experiments on three challenging datasets, i.e., Caltech, CityPersons, and COCOPersons, show that the proposed approach achieves a significant improvement upon the state-of-the-art methods.
Published: 2021

5. Learning Dual-Pooling Graph Neural Networks for Few-Shot Video Classification

Author: Junyu Gao, Yufan Hu, and Changsheng Xu
Subjects: Computer science, business.industry, Node (networking), Feature extraction, Pooling, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Construct (python library), Machine learning, computer.software_genre, Semantics, Computer Science Applications, Data modeling, Discriminative model, Signal Processing, Media Technology, Enhanced Data Rates for GSM Evolution, Artificial intelligence, Electrical and Electronic Engineering, business, computer
Abstract: We address the problem of few-shot video classification that learns classifiers for novel concepts from only a few examples. Most current methods do not explicitly consider the relations in both intra-video and inter-video domains, and thus cannot take full advantage of the structural information in few-shot learning. In this paper, we propose to exploit the comprehensive intra-video and inter-video relations via Graph Neural Networks (GNNs). To improve the discriminative ability for accurately selecting the representative video content and refining video relations, a Dual-Pooling GNN (DPGNN) is constructed, which stacks customized graph pooling layers in a hierarchical fashion. Specifically, to select the most representative frames in a video, we build intra-video graphs and utilize a node pooling module to extract robust video-level features. We construct an inter-video graph by taking the video-level features as nodes. By designing an edge pooling module, the proposed method can adaptively eliminate the negative relations in the inter-video graph. Extensive experimental results show that our method consistently outperforms the state-of-the-art on two benchmarks.
Published: 2021

6. Unsupervised Video Summarization via Relation-Aware Assignment Learning

Author: Yingying Zhang, Xiaoshan Yang, Changsheng Xu, and Junyu Gao
Subjects: Relation (database), Computer science, business.industry, Node (networking), Feature extraction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Context (language use), Machine learning, computer.software_genre, Automatic summarization, Computer Science Applications, Constraint (information theory), Signal Processing, Media Technology, Graph (abstract data type), Artificial intelligence, Electrical and Electronic Engineering, CLIPS, business, computer, computer.programming_language
Abstract: We address the problem of unsupervised video summarization that automatically selects key video clips. Most state-of-the-art approaches suffer from two issues: (1) they model video clips without explicitly exploiting their relations, and (2) they learn soft importance scores over all the video clips to generate the summary representation. However, a meaningful video summary should be inferred by taking the relation-aware context of the original video into consideration, and directly selecting a subset of clips with a hard assignment. In this paper, we propose to exploit clip-clip relations to learn relation-aware hard assignments for selecting key clips in an unsupervised manner. First, we consider the clips as graph nodes to construct an assignment-learning graph. Then, we utilize the magnitude of the node features to generate hard assignments as the summary selection. Finally, we optimize the whole framework via a proposed multi-task loss including a reconstruction constraint, and a contrastive constraint. Extensive experimental results on three popular benchmarks demonstrate the favourable performance of our approach.
Published: 2021
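
One way to picture the "hard assignment from node feature magnitude" step in this abstract is the sketch below. It is only an illustration under assumptions: the abstract does not say how magnitudes are turned into a selection, so the L2 norm and the top-k rule used here, along with the function name `select_key_clips` and the toy features, are hypothetical choices.

```python
import math

def select_key_clips(node_features, k):
    """Return a 0/1 hard assignment marking the k clips with the largest L2 norm.

    Assumption: magnitude means the L2 norm and selection is top-k.
    """
    norms = [math.sqrt(sum(v * v for v in feat)) for feat in node_features]
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(norms))]

# Toy node features for four clips (two dimensions each).
features = [[0.1, 0.2], [3.0, 4.0], [0.5, 0.5], [1.0, 2.0]]
assignment = select_key_clips(features, k=2)
```

Unlike soft importance scores, the result is a binary mask over clips, so the summary is the selected subset itself.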

7. Self-Supervised Feature Augmentation for Large Image Object Detection

Author: Yiping Meng, Zhichao Song, Xingjia Pan, Weiming Dong, Yang Gu, Fan Tang, Changsheng Xu, Pengfei Xu, and Oliver Deussen
Subjects: business.industry, Computer science, Pipeline (computing), Feature extraction, Detector, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, 02 engineering and technology, Computer Graphics and Computer-Aided Design, Object detection, Convolution, Upsampling, Feature (computer vision), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Image resolution, Software, Block (data storage)
Abstract: Input scale plays an important role in modern detection frameworks, and an optimal training scale for images exists empirically. However, the optimal one usually cannot be reached in facing extremely large images under the memory constraint. In this study, we explore the scale effect inside the object detection pipeline and find that feature upsampling with the introduction of high-resolution information benefits the detection. Compared with direct input upscaling, feature upsampling trades a small performance loss for a large amount of memory savings. From these observations, we propose a self-supervised feature augmentation network, which takes downsampled images as inputs and aims to generate comparable features with the ones when feeding upscaled images to networks. We present a guided feature upsampling module, which takes downsampled images as inputs, to learn upscaled feature representations with the supervision of real large features acquired from upscaled images. In a self-supervised learning manner, we can introduce detailed information of images to the network. For an efficient feature upsampling, we design a residualized sub-pixel convolution block based on a sub-pixel convolution layer, which involves considerable information in upsampling process. Experiments on Mapillary Vistas Dataset (MVD), Cityscapes, and COCO are conducted to demonstrate the effectiveness of our method. On the MVD and Cityscapes detection benchmarks, in which the images are extremely large, our method surpasses current approaches. On COCO, the proposed method obtains comparable results to existing methods but with higher efficiency.
Published: 2020
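
The sub-pixel convolution layer mentioned in this abstract ends with a channel-to-space rearrangement, often called pixel shuffle. The sketch below shows only that rearrangement on nested lists, using the common channel ordering in which input channel `c*r*r + i*r + j` fills output position `(h*r + i, w*r + j)`. It is not the paper's residualized block, and the function name and toy input are illustrative assumptions.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    channels_in = len(x)
    height, width = len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for h in range(height):
            for w in range(width):
                for i in range(r):
                    for j in range(r):
                        # Each group of r*r channels becomes an r-by-r spatial patch.
                        out[c][h * r + i][w * r + j] = x[c * r * r + i * r + j][h][w]
    return out

# Four 1x1 feature maps become a single 2x2 map.
low_res = [[[1]], [[2]], [[3]], [[4]]]
high_res = pixel_shuffle(low_res, 2)
```

Because the upsampled values come from learned channels rather than interpolation, the preceding convolution decides what detail each sub-pixel position receives.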

8. Text Style Transfer With Decorative Elements

Author: Yuting Ma, Fan Tang, Changsheng Xu, and Weiming Dong
Subjects: Style (visual arts), Engineering drawing, Open source, Computer science, Font, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Information processing, Process (computing), Production efficiency, Glyph (data visualization), ComputingMethodologies_COMPUTERGRAPHICS, Visualization
Abstract: The text rendered by special effects can give a rich visual experience. Text stylization can help users migrate their favorite styles to specified texts, improving production efficiency and saving design cost. This paper proposes a novel text stylization framework, which can transfer mixed text styles, including font glyph and fine decorations, to user-specified texts. The transfer of decorative elements is difficult because the text is obscured by decorative elements to a certain extent. Our method is divided into three stages: first, the position of decorative elements in the image is extracted and retained; second, the effects of font glyph and textures other than decorative elements are migrated; finally, a structure-aware strategy is used to reorganize the decorative elements to complete the entire stylization process. Experiments on open source text data sets demonstrated the advantages of our approach over other state-of-the-art style migration methods.
Published: 2021

9. Image Captioning by Asking Questions

Author: Changsheng Xu and Xiaoshan Yang
Subjects: Closed captioning, Computer Networks and Communications, Process (engineering), business.industry, Computer science, Perspective (graphical), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, computer.software_genre, 03 medical and health sciences, Task (computing), 0302 clinical medicine, Hardware and Architecture, Feature (computer vision), 030221 ophthalmology & optometry, 0202 electrical engineering, electronic engineering, information engineering, Question answering, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Natural language processing, Sentence, Word (computer architecture)
Abstract: Image captioning and visual question answering are typical tasks that connect computer vision and natural language processing. Both of them need to effectively represent the visual content using computer vision methods and smoothly process the text sentence using natural language processing skills. The key problem of these two tasks is to infer the target result based on the interactive understanding of the word sequence and the image. Though they use similar algorithms in practice, they have been studied independently in the past few years. In this article, we attempt to exploit the mutual correlation between these two tasks. We propose the first VQA-improved image-captioning method that transfers the knowledge learned from the VQA corpora to the image-captioning task. A VQA model is first pretrained on image-question-answer instances. Then, the pretrained VQA model is used to extract VQA-grounded semantic representations according to selected free-form open-ended visual question-answer pairs. The VQA-grounded features are complementary to the visual features, because they interpret images from a different perspective. We incorporate the VQA model into the image-captioning model by adaptively fusing the VQA-grounded feature and the attended visual feature. We show that such simple VQA-improved image-captioning (VQA-IIC) models perform better than conventional image-captioning methods on large-scale public datasets.
Published: 2019

10. Cross-domain personalized image captioning

Author: Changsheng Xu, Cuirong Long, and Xiaoshan Yang
Subjects: Closed captioning, Computer Networks and Communications, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, 02 engineering and technology, Expression (mathematics), Domain (software engineering), Image (mathematics), Hardware and Architecture, Human–computer interaction, 0202 electrical engineering, electronic engineering, information engineering, Media Technology, Social media, Software, Word (computer architecture), Sentence
Abstract: Image captioning aims to translate an image to a complete and natural sentence. It involves both computer vision and natural language processing. Though image captioning has achieved good results under the rapid development of deep neural networks, excessively pursuing the evaluation results of the captioning models makes the generated text description too conservative in practical applications. It is necessary to increase the diversity of the text description and account for prior knowledge such as the user’s favorite vocabularies and writing styles. In this paper, we study the personalized image captioning which can generate sentences to describe the user’s own story and feelings of life with the most preferred word expression. Moreover, we propose cross-domain personalized image captioning (CDPIC) to learn domain-invariant captioning models which can be applied on different social media platforms. The proposed method can flexibly model user interest by embedding the user ID as an interest vector. To the best of our knowledge, we propose the first cross-domain personalized image captioning approach by combining the user interest modeling and a simple and effective domain-invariant constraint. The effectiveness of the proposed method is verified on datasets from the Instagram and Lookbook platforms.
Published: 2019

11. Multi-attribute Guided Painting Generation

Author: Fan Tang, Yingying Deng, Minxuan Lin, Weiming Dong, and Changsheng Xu
Subjects: FOS: Computer and information sciences, Structure (mathematical logic), Stylized fact, Painting, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), media_common.quotation_subject, Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Fidelity, Image (mathematics), Style (sociolinguistics), Artificial intelligence, Control (linguistics), business, Focus (optics), ComputingMethodologies_COMPUTERGRAPHICS, media_common
Abstract: Controllable painting generation plays a pivotal role in image stylization. Currently, control of style transfer is limited to exemplar-based reference or random one-hot vector guidance. Few works focus on decoupling the intrinsic properties of painting as control conditions, e.g., artist, genre and period. Under this circumstance, we propose a novel framework adopting multiple attributes from the painting to control the stylized results. An asymmetrical cycle structure is used to preserve the fidelity, together with style-preserving and attribute regression losses to keep the unique distinction of colors and textures between domains. Several qualitative and quantitative results demonstrate the effect of the combinations of multiple attributes and achieve satisfactory performance.
Published: 2020

12. Autosoccer: An Automatic Soccer Live Broadcasting Generator

Author: Hongyun Bao, Changsheng Xu, Zhineng Chen, Chunyang Li, and Caiyan Jia
Subjects: Titan (supercomputer), Computer science, business.industry, Deep learning, ComputerApplications_GENERAL, Real-time computing, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Artificial intelligence, business
Abstract: Low-cost live broadcasting is highly anticipated for soccer matches, which currently involve heavy expenses in personnel and equipment. This demo showcases the autoSoccer system, which approaches this target. It takes a fixed-view panoramic soccer video as the input and automatically generates its live broadcasting, which continuously focuses on the most interesting area in the soccer field. To this end, a novel pipeline is developed by leveraging both low-level motion and visual features, and high-level soccer-related semantics. By appropriately utilizing these clues, autoSoccer is capable of delivering a soccer match video analogous to a human-directed one. A demo on school soccer shows that autoSoccer produces videos with a satisfactory watching experience. Moreover, it executes in real time on a PC with an Intel i7 CPU and one Nvidia Titan XP GPU.
Published: 2020

13. Geometry Guided Pose-invariant Facial Expression Recognition

Author: Feifei Zhang, Tianzhu Zhang, Changsheng Xu, and Qirong Mao
Subjects: Training set, business.industry, Computer science, Deep learning, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Geometry, 02 engineering and technology, Computer Graphics and Computer-Aided Design, Image synthesis, Facial expression recognition, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, Invariant (mathematics), business, Software
Abstract: Driven by recent advances in human-centered computing, Facial Expression Recognition (FER) has attracted significant attention in many applications. However, most conventional approaches either perform face frontalization on a non-frontal facial image or learn a separate classifier for each pose. Different from existing methods, this paper proposes an end-to-end deep learning model that performs simultaneous facial image synthesis and pose-invariant facial expression recognition by exploiting the shape geometry of the face image. The proposed model is based on generative adversarial network (GAN) and enjoys several merits. First, given an input face and a target pose and expression designated by a set of facial landmarks, an identity-preserving face can be generated under the guidance of the target pose and expression. Second, the identity representation is explicitly disentangled from both expression and pose variations through the shape geometry delivered by facial landmarks. Third, our model can automatically generate face images with different expressions and poses in a continuous way to enlarge and enrich the training set for the FER task. Our approach is demonstrated to perform well when compared with state-of-the-art algorithms on both controlled and in-the-wild benchmark datasets including Multi-PIE, BU-3DFE, and SFEW. The code is included in the supplementary material.
Published: 2020

14. Unpaired Images based Generator Architecture for Facial Expression Recognition

Author: Xi Zhang, Feifei Zhang, and Changsheng Xu
Subjects: Facial expression, Training set, Computer science, business.industry, Deep learning, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, Pattern recognition, 02 engineering and technology, Task (computing), Facial expression recognition, 0202 electrical engineering, electronic engineering, information engineering, Identity (object-oriented programming), 020201 artificial intelligence & image processing, Artificial intelligence, business, Generator (mathematics)
Abstract: Facial expression recognition (FER) is a challenging task due to the lack of sufficient training data. Most conventional approaches usually rotate or flip the images for data augmentation. More recently, numerous methods synthesize images automatically by using Generative Adversarial Network (GAN). However, paired images are always required in these methods. Different from existing methods, in this paper, we propose an end-to-end deep learning model for simultaneous facial expression synthesis and facial expression recognition. In our method, paired images are not required, which makes the proposed model much more flexible and general. Furthermore, different expressions are encoded in a disentangled manner in a latent space, which enables us to generate facial images with arbitrary expressions by exchanging certain parts of their latent identity features. Finally, the facial expression synthesis and facial expression recognition tasks can further boost their performance for each other via our model. Quantitative and qualitative evaluations on both controlled and in-the-wild datasets demonstrate that the proposed method performs favorably against state-of-the-art methods.
Published: 2019

15. Deep-Structured Event Modeling for User-Generated Photos

Author: Tianzhu Zhang, Changsheng Xu, and Xiaoshan Yang
Subjects: Conditional random field, Computer science, Event (computing), business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, 02 engineering and technology, 010501 environmental sciences, Object (computer science), 01 natural sciences, Convolutional neural network, Computer Science Applications, Visualization, Recurrent neural network, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Media Technology, 020201 artificial intelligence & image processing, Artificial intelligence, Timestamp, Electrical and Electronic Engineering, Hidden Markov model, business, 0105 earth and related environmental sciences
Abstract: Vision-based event analysis is difficult because of the following challenges. The first challenge is intraclass variation. Photos uploaded by users are sparsely sampled visual appearances of an event over time. Thus, each photo may only capture a single object or scene of a specific complex event. The second challenge is interclass confusion. Photos related to different events may contain similar objects or scenes. Third, unusual events are characterized by scarcity, and only a few samples are available for use in learning event patterns. In this paper, by considering the photo timestamp, we propose a structured event modeling (SEM) framework for event analysis that exploits the temporal information of visual features and event classes in a photo sequence. Specifically, the temporal event patterns of the photo sequence and the relationships of different photos are jointly learned using deep neural networks (convolutional neural networks and recurrent neural networks) and a conditional random field. We evaluate the proposed SEM framework in two applications: multiclass event recognition and unusual event detection in photo sequences. The results of extensive experiments performed on a public event recognition dataset and a collected unusual event dataset demonstrate the effectiveness of the proposed method.
Published: 2018

16. Learning explicit video attributes from mid-level representation for video captioning

Author: Changsheng Xu, Xinyu Wu, Yan Wang, Bingbing Ni, Teng Li, and Fudong Nian
Subjects: Closed captioning, Video post-processing, Computer science, Speech recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, 02 engineering and technology, Video processing, computer.file_format, Smacker video, Video compression picture types, Video tracking, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer Vision and Pattern Recognition, Multiview Video Coding, computer, Software, Block-matching algorithm
Abstract: Recent works on video captioning mainly learn the map from low-level visual features to language description directly without explicitly representing the high-level semantic video concepts (e.g. objects, actions in the annotated sentences). To bridge this semantic gap, in this paper we propose a novel video attribute representation learning algorithm for video concept understanding and utilize the learned explicit video attribute representation to improve video captioning performance. To achieve this, first, inspired by the success of the spectrogram in audio processing, a novel mid-level video representation named “video response map” (VRM) is proposed, by which the frame sequence can be represented by a single image representation. Therefore, the video attribute representation learning can be converted to a well-studied multi-label image classification problem. Then, in the caption prediction step, the learned video attribute feature is integrated with the single frame feature to improve the previous sequence-to-sequence language generation model by adjusting the LSTM (Long Short-Term Memory) input units. The proposed video captioning framework can both handle variable frame inputs and utilize high-level semantic video attribute features. Experimental results on video captioning tasks show that the proposed method, utilizing only RGB frames as input without extra video or text training data, can achieve competitive performance with state-of-the-art methods. Furthermore, the extensive experimental evaluations on the UCF-101 action classification benchmark demonstrate the representation capability of the proposed VRM.
Published: 2017

17. Deep Relative Tracking

Author: Tianzhu Zhang, Changsheng Xu, Xiaoshan Yang, and Junyu Gao
Subjects: Computer science, business.industry, Deep learning, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, Pattern recognition, 02 engineering and technology, Computer Graphics and Computer-Aided Design, Convolutional neural network, Active appearance model, Visualization, Support vector machine, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, Eye tracking, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, Software
Abstract: Most existing tracking methods are direct trackers, which directly exploit foreground or/and background information for object appearance modeling and decide whether an image patch is target object or not. As a result, these trackers cannot perform well when target appearance changes heavily and becomes different from its model. To deal with this issue, we propose a novel relative tracker, which can effectively exploit the relative relationship among image patches from both foreground and background for object appearance modeling. Different from direct trackers, the proposed relative tracker is robust to localize target object by use of the best image patch with the highest relative score to the target appearance model. To model relative relationship among large-scale image patch pairs, we propose a novel and effective deep relative learning algorithm through the convolutional neural network. We test the proposed approach on challenging sequences involving heavy occlusion, drastic illumination changes, and large pose variations. Experimental results show that our method consistently outperforms the state-of-the-art trackers due to the powerful capacity of the proposed deep relative model.
Published: 2017

18. Scene Recognition via Bi-enhanced Knowledge Space Learning

Author: Jin Zhang, Changsheng Xu, and Bing-Kun Bao
Subjects: Class (computer programming), Knowledge space, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Representation (systemics), Cognitive neuroscience of visual object recognition, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Image (mathematics), 0202 electrical engineering, electronic engineering, information engineering, Action recognition, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, 0105 earth and related environmental sciences
Abstract: Scene recognition is one of the hallmark tasks in computer vision, as it provides rich information beyond object recognition and action recognition. It is easy to accept that scene images from the same class always include the same essential objects and relations, for example, scene images of “wedding” usually have bridegroom and bride next to him. Following this observation, we introduce a novel idea to boost the accuracy of scene recognition by mining essential scene sub-graph and learning a bi-enhanced knowledge space. The essential scene sub-graph describes the essential objects and their relations for each scene class. The learned knowledge space is bi-enhanced by global representation on the entire image and local representation on the corresponding essential scene sub-graph. Experimental results on the constructed dataset called Scene 30 demonstrate the effectiveness of our proposed method.
Published: 2019

19. Facial Expression Recognition in the Wild

Author: Feifei Zhang, Tianzhu Zhang, Ling-Yu Duan, Changsheng Xu, and Qirong Mao
Subjects: Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, 02 engineering and technology, ComputingMethodologies_PATTERNRECOGNITION, Discriminative model, Facial expression recognition, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Classifier (UML)
Abstract: Facial expression recognition (FER) is a very challenging problem due to different expressions under arbitrary poses. Most conventional approaches mainly perform FER under laboratory controlled environment. Different from existing methods, in this paper, we formulate the FER in the wild as a domain adaptation problem, and propose a novel auxiliary domain guided Cycle-consistent adversarial Attention Transfer model (CycleAT) for simultaneous facial image synthesis and facial expression recognition in the wild. The proposed model utilizes large-scale unlabeled web facial images as an auxiliary domain to reduce the gap between source domain and target domain based on generative adversarial networks (GAN) embedded with an effective attention transfer module, which enjoys several merits. First, the GAN-based method can automatically generate labeled facial images in the wild through harnessing information from labeled facial images in source domain and unlabeled web facial images in auxiliary domain. Second, the class-discriminative spatial attention maps from the classifier in source domain are leveraged to boost the performance of the classifier in target domain. Third, it can effectively preserve the structural consistency of local pixels and global attributes in the synthesized facial images through pixel cycle-consistency and discriminative loss. Quantitative and qualitative evaluations on two challenging in-the-wild datasets demonstrate that the proposed model performs favorably against state-of-the-art methods.
Published: 2018

20. Joint Pose and Expression Modeling for Facial Expression Recognition

Author: Feifei Zhang, Changsheng Xu, Qirong Mao, and Tianzhu Zhang
Subjects: business.industry, Computer science, Deep learning, Feature extraction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, Pattern recognition, 02 engineering and technology, Facial recognition system, Expression (mathematics), Discriminative model, Face (geometry), 0202 electrical engineering, electronic engineering, information engineering, Identity (object-oriented programming), 020201 artificial intelligence & image processing, Artificial intelligence, business, Representation (mathematics)
Abstract: Facial expression recognition (FER) is a challenging task due to different expressions under arbitrary poses. Most conventional approaches either perform face frontalization on a non-frontal facial image or learn separate classifiers for each pose. Different from existing methods, in this paper, we propose an end-to-end deep learning model by exploiting different poses and expressions jointly for simultaneous facial image synthesis and pose-invariant facial expression recognition. The proposed model is based on generative adversarial network (GAN) and enjoys several merits. First, the encoder-decoder structure of the generator can learn a generative and discriminative identity representation for face images. Second, the identity representation is explicitly disentangled from both expression and pose variations through the expression and pose codes. Third, our model can automatically generate face images with different expressions under arbitrary poses to enlarge and enrich the training set for FER. Quantitative and qualitative evaluations on both controlled and in-the-wild datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
Published: 2018

21. Depth Information Guided Crowd Counting for Complex Crowd Scenes

Author: Pei Lv, Bing Zhou, Gaoge Cui, Changsheng Xu, Mingliang Xu, Xiaoheng Jiang, and Zhaoyang Ge
Subjects: FOS: Computer and information sciences, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Pedestrian detection, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Density estimation, 01 natural sciences, Image (mathematics), Artificial Intelligence, 0103 physical sciences, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer vision, Computer Vision and Pattern Recognition, Artificial intelligence, Crowd density, 010306 general physics, business, Software, Crowd counting
Abstract: It is important to monitor and analyze crowd events for the sake of city safety. In an EDOF (extended depth of field) image with a crowded scene, the distribution of people is highly imbalanced. People far away from the camera look much smaller and often occlude each other heavily, while people close to the camera look larger. In such a case, it is difficult to accurately estimate the number of people by using one technique. In this paper, we propose a Depth Information Guided Crowd Counting (DigCrowd) method to deal with crowded EDOF scenes. DigCrowd first uses the depth information of an image to segment the scene into a far-view region and a near-view region. Then DigCrowd maps the far-view region to its crowd density map and uses a detection method to count the people in the near-view region. In addition, we introduce a new crowd dataset that contains 1000 images. Experimental results demonstrate the effectiveness of our DigCrowd method. 9 pages, 8 figures. The paper is under consideration at Pattern Recognition Letters.
Published: 2018

22. Latent Support Vector Machine Modeling for Sign Language Recognition with Kinect

Author: Changsheng Xu, Chao Sun, and Tianzhu Zhang
Subjects: Phrase, American Sign Language, Computer science, Color image, Speech recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Sign language, language.human_language, Theoretical Computer Science, Support vector machine, Discriminative model, Artificial Intelligence, Depth map, Feature (machine learning), language
Abstract: Vision-based sign language recognition has attracted more and more interest from researchers in the computer vision field. In this article, we propose a novel algorithm to model and recognize sign language performed in front of a Microsoft Kinect sensor. Under the assumption that some frames are expected to be both discriminative and representative in a sign language video, we first assign a binary latent variable to each frame in training videos for indicating its discriminative capability, then develop a latent support vector machine model to classify the signs, as well as localize the discriminative and representative frames in each video. In addition, we utilize the depth map together with the color image captured by the Kinect sensor to obtain a more effective and accurate feature to enhance the recognition accuracy. To evaluate our approach, we conducted experiments on both word-level sign language and sentence-level sign language. An American Sign Language dataset including approximately 2,000 word-level sign language phrases and 2,000 sentence-level sign language phrases was collected using the Kinect sensor, and each phrase contains color, depth, and skeleton information. Experiments on our dataset demonstrate the effectiveness of the proposed method for sign language recognition.
Published: 2015

23. A discriminative graph inferring framework towards weakly supervised image parsing

Author: Bing-Kun Bao, Lei Yu, and Changsheng Xu
Subjects: Computer Networks and Communications, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Cryptography, 02 engineering and technology, Machine learning, computer.software_genre, Computer graphics, Discriminative model, 020204 information systems, Image parsing, 0202 electrical engineering, electronic engineering, information engineering, Media Technology, Computer communication networks, business.industry, Pattern recognition, Active appearance model, ComputingMethodologies_PATTERNRECOGNITION, Automatic image annotation, Hardware and Architecture, Graph (abstract data type), 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Software, Information Systems
Abstract: In this paper, we focus on the task of assigning labels to the over-segmented image patches in a weakly supervised manner, in which the training images contain the labels but do not have the labels' locations in the images. We propose a unified discriminative graph inferring framework by simultaneously inferring patch labels and learning the patch appearance models. On one hand, graph inferring reasons the patch labels by a graph propagation procedure. The graph is constructed by connecting the nearest neighbors which share the same image label, and multiple correlations among patches and image labels are imposed as constraints to the inferring. On the other hand, for each label, the patches which do not contain the target label are adopted as negative samples to learn the appearance model. In this way, the predicted labels will be more accurate in the propagation. Graph inferring and the learned patch appearance models are finally embedded to complement each other in one unified formulation. Experiments on three public datasets demonstrate the effectiveness of our method in comparison with other baselines.
Published: 2015

24. An incremental probabilistic model for temporal theme analysis of landmarks

Author: Weiqing Min, Bing-Kun Bao, and Changsheng Xu
Subjects: Information retrieval, Landmark, Computer Networks and Communications, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, Timeline, Statistical model, 02 engineering and technology, Visualization, Computer graphics, Hardware and Architecture, 0202 electrical engineering, electronic engineering, information engineering, Media Technology, 020201 artificial intelligence & image processing, Social media, Theme (computing), Temporal information, Software, Information Systems
Abstract: Social media sites (e.g., Flickr) generate a huge amount of landmark photos with temporal information in the real-world, such as the photos describing the events happening near landmarks, and those showing different seasonal sceneries. Analyzing this temporal information of landmarks can benefit various applications, such as landmark timeline construction and tour recommendation. In this paper, we propose a novel Incremental Spatio-Temporal Theme Model (ISTTM), which can incrementally mine temporal themes that characterize the temporal information of landmarks, by differentiating them from the other three kinds of themes, i.e., general themes shared by most of all landmarks, local themes related to certain landmarks and the background theme including non-informative content. ISTTM works in an online way and is capable of selectively processing the updates of the distributions on different types of themes. Based on the proposed ISTTM, we present a framework, namely Temporal Theme Analysis for Landmarks (TTAL), which enables both periodic theme detection from discovered temporal themes and temporal theme visualization by selecting the relevant photos. We have conducted experiments on a large-scale landmark dataset from Flickr. Qualitative and quantitative evaluation results demonstrate the effectiveness of the ISTTM as well as the TTAL framework.
Published: 2014

25. Snap & Play

Author: Qiang Chen, Changsheng Xu, Si Liu, Shuicheng Yan, and Hanqing Lu
Subjects: Sequential game, Computer science, business.industry, media_common.quotation_subject, ComputingMilieux_PERSONALCOMPUTING, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Image processing, Image editing, computer.software_genre, Theoretical Computer Science, Image (mathematics), Mode (computer interface), Artificial Intelligence, Computer graphics (images), Quality (business), Computer vision, Artificial intelligence, Game development tool, business, Game Developer, computer, media_common
Abstract: In this article, by taking a popular game, the Find-the-Difference (FiDi) game, as a concrete example, we explore how state-of-the-art image processing techniques can assist in developing a personalized, automatic, and dynamic game. Unlike the traditional FiDi game, where image pairs (source image and target image) with five different patches are manually produced by professional game developers, the proposed Personalized FiDi (P-FiDi) electronic game can be played in a fully automatic Snap & Play mode. Snap means that players first take photos with their digital cameras. The newly captured photos are used as source images and fed into the P-FiDi system to autogenerate the counterpart target images for users to play. Four steps are adopted to autogenerate target images: enhancing the visual quality of source images, extracting some changeable patches from the source image, selecting the most suitable combination of changeable patches and difference styles for the image, and generating the differences on the target image with state-of-the-art image processing techniques. In addition, the P-FiDi game can be easily redesigned for in-game advertising. Extensive experiments show that the P-FiDi electronic game is satisfying in terms of player experience, seamless advertisement, and technical feasibility.
Published: 2014

26. CAMHID: Camera Motion Histogram Descriptor and Its Application to Cinematographic Shot Classification

Author: Xiangjian He, Min Xu, Muhammad Hasan, and Changsheng Xu
Subjects: Motion compensation, Motion analysis, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Motion vector, Quarter-pixel motion, Match moving, Motion field, Computer Science::Computer Vision and Pattern Recognition, Motion estimation, Computer Science::Multimedia, Media Technology, Artificial Intelligence & Image Processing, Computer vision, Artificial intelligence, Electrical and Electronic Engineering, business, Mathematics, Block-matching algorithm
Abstract: © 1991-2012 IEEE. In this paper, we propose a nonparametric camera motion descriptor for video shot classification. In the proposed method, a motion vector field (MVF) is constructed for each consecutive video frame by computing the motion vector (MV) of each macroblock. Then, the MVFs are divided into a number of local regions of equal size. Next, the inconsistent/noisy MVs of each local region are eliminated by a motion consistency analysis. The remaining MVs of each local region from a number of consecutive frames are further collected for a compact representation. Initially, a matrix is formed using the MVs. Then, the matrix is decomposed using a singular value decomposition technique to represent the dominant motion. Finally, the angle of the principal component that retains the most variance is computed and quantized to represent the motion of a local region by using a histogram. In order to represent the global camera motion, the local histograms are combined. The effectiveness of the proposed motion descriptor for video shot classification is tested by using a support vector machine. First, the proposed camera motion descriptors for video shot classification are computed on a video data set consisting of regular camera motion patterns (e.g., pan, zoom, tilt, static). Then, we apply the camera motion descriptors with an extended set of features to the classification of cinematographic shots. The experimental results show that the proposed shot-level camera motion descriptor has a strong discriminative capability to classify different camera motion patterns of different videos effectively. We also show that our approach outperforms state-of-the-art methods.
Published: 2014
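
Note: The SVD-and-angle-histogram step described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each local region supplies a list of 2D motion vectors stacked as rows of an N x 2 matrix (the abstract does not state the layout), it uses the closed-form direction of the first right singular vector of such a matrix instead of a general SVD routine, and the names `dominant_angle`, `angle_histogram`, and the bin count `n_bins` are hypothetical.

```python
import math


def dominant_angle(vectors):
    """Return the dominant motion direction of one region, in [0, pi).

    For an N x 2 matrix M whose rows are the motion vectors, the first
    right singular vector of M is the leading eigenvector of M^T M, whose
    direction has the closed form 0.5 * atan2(2*Sxy, Sxx - Syy).
    The direction is an axis, so theta and theta + pi are equivalent.
    """
    sxx = sum(x * x for x, _ in vectors)
    syy = sum(y * y for _, y in vectors)
    sxy = sum(x * y for x, y in vectors)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return theta % math.pi


def angle_histogram(regions, n_bins=8):
    """Quantize each region's dominant angle into one of n_bins bins.

    `regions` is a list of motion-vector lists, one per local region
    (assumed layout). Empty regions are skipped.
    """
    hist = [0] * n_bins
    for vectors in regions:
        if not vectors:
            continue
        theta = dominant_angle(vectors)
        # Clamp guards against theta rounding up to exactly pi.
        idx = min(int(theta / math.pi * n_bins), n_bins - 1)
        hist[idx] += 1
    return hist
```

In this sketch a region with purely horizontal motion falls into the first bin and a region with purely vertical motion into the middle bin; how the per-region histograms are combined into a global descriptor is left out because the abstract does not specify it.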

27. A new discriminative coding method for image classification

Author: Xiaoshan Yang, Tianzhu Zhang, and Changsheng Xu
Subjects: Contextual image classification, Computer Networks and Communications, Computer science, business.industry, Locality, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Scale-invariant feature transform, Pattern recognition, Computer graphics, Discriminative model, Hardware and Architecture, Bag-of-words model, Media Technology, Artificial intelligence, Quantization (image processing), business, Software, Information Systems, Coding (social sciences)
Abstract: Bag-of-words (BOW) based methods are widely used in image classification. However, a large amount of visual information is inevitably lost in the quantization step of BOW. Recently, NBNN and its improved variants such as Local NBNN were proposed to solve this problem. Nevertheless, these methods do not perform better than the state-of-the-art BOW based methods. In this paper, based on the advantages of BOW and Local NBNN, we introduce a novel locality discriminative coding (LDC) method. We convert each low-level local feature, such as SIFT, into a code vector using the Local Feature-to-Class distance rather than k-means quantization. After coding, sum-pooling combined with SPM is used to construct a single feature representation vector for each image. Extensive experimental results on several challenging benchmark datasets show that our LDC method outperforms six state-of-the-art image classification methods.
Published: 2014

28. Mobile Landmark Search with 3D Models

Author: Changsheng Xu, Min Xu, Bing-Kun Bao, Xian Xiao, and Weiqing Min
Subjects: Landmark, Computer science, business.industry, 3D reconstruction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Mobile computing, Content-based image retrieval, Computer Science Applications, Feature (computer vision), Signal Processing, Media Technology, Artificial Intelligence & Image Processing, Computer vision, Artificial intelligence, Electrical and Electronic Engineering, business, Image resolution, Image retrieval, Data compression
Abstract: Landmark search is crucial to improve the quality of travel experience. Smart phones make it possible to search landmarks anytime and anywhere. Most of the existing work computes image features on smart phones locally after taking a landmark image. Compared with sending original image to the remote server, sending computed features saves network bandwidth and consequently makes sending process fast. However, this scheme would be restricted by the limitations of phone battery power and computational ability. In this paper, we propose to send compressed (low resolution) images to remote server instead of computing image features locally for landmark recognition and search. To this end, a robust 3D model based method is proposed to recognize query images with corresponding landmarks. Using the proposed method, images with low resolution can be recognized accurately, even though images only contain a small part of the landmark or are taken under various conditions of lighting, zoom, occlusions and different viewpoints. In order to provide an attractive landmark search result, a 3D texture model is generated to respond to a landmark query. The proposed search approach, which opens up a new direction, starts from a 2D compressed image query input and ends with a 3D model search result. © 2014 IEEE.
Published: 2014

29. Discriminative Exemplar Coding for Sign Language Recognition With Kinect

Author: Tao Mei, Changsheng Xu, Bing-Kun Bao, Tianzhu Zhang, and Chao Sun
Subjects: Phrase, American Sign Language, Computer science, Speech recognition, Transducers, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Sign language, Pattern Recognition, Automated, Sign Language, Imaging, Three-Dimensional, Discriminative model, Artificial Intelligence, Computer Systems, Humans, Computer Simulation, Whole Body Imaging, Electrical and Electronic Engineering, Computer Peripherals, Contextual image classification, Euclidean space, Image Enhancement, Actigraphy, language.human_language, Computer Science Applications, Human-Computer Interaction, ComputingMethodologies_PATTERNRECOGNITION, Video Games, Control and Systems Engineering, language, Classifier (UML), Algorithms, Software, Information Systems, Coding (social sciences)
Abstract: Sign language recognition is a growing research area in the field of computer vision. A challenge within it is to model various signs, varying with time resolution, visual manual appearance, and so on. In this paper, we propose a discriminative exemplar coding (DEC) approach, as well as utilizing Kinect sensor, to model various signs. The proposed DEC method can be summarized as three steps. First, a quantity of class-specific candidate exemplars are learned from sign language videos in each sign category by considering their discrimination. Then, every video of all signs is described as a set of similarities between frames within it and the candidate exemplars. Instead of simply using a heuristic distance measure, the similarities are decided by a set of exemplar-based classifiers through the multiple instance learning, in which a positive (or negative) video is treated as a positive (or negative) bag and those frames similar to the given exemplar in Euclidean space as instances. Finally, we formulate the selection of the most discriminative exemplars into a framework and simultaneously produce a sign video classifier to recognize sign. To evaluate our method, we collect an American sign language dataset, which includes approximately 2000 phrases, while each phrase is captured by Kinect sensor with color, depth, and skeleton information. Experimental results on our dataset demonstrate the feasibility and effectiveness of the proposed approach for sign language recognition.
Published: 2013

30. M4L: Maximum margin Multi-instance Multi-cluster Learning for scene modeling

Author: Si Liu, Tianzhu Zhang, Hanqing Lu, and Changsheng Xu
Subjects: business.industry, Gaussian, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Mixture model, symbols.namesake, Motion field, Artificial Intelligence, Margin (machine learning), Motion estimation, Video tracking, Signal Processing, symbols, Computer vision, Computer Vision and Pattern Recognition, Artificial intelligence, Cluster analysis, business, Software, Mathematics, Block (data storage)
Abstract: Automatically learning and grouping key motion patterns in a traffic scene captured by a static camera is a fundamental and challenging task for intelligent video surveillance. To learn motion patterns, trajectory obtained by object tracking is parameterized, and scene image is spatially and evenly divided into multiple regular cell blocks which potentially contain several primary motion patterns. Then, for each block, Gaussian Mixture Model (GMM) is adopted to learn its motion patterns based on the parameters of trajectories. Grouping motion pattern can be done by clustering blocks indirectly, and each cluster of blocks corresponds to a certain motion pattern. For one particular block, each of its motion pattern (Gaussian component) can be viewed as an instance, and all motion patterns (Gaussian components) constitute a bag which can correspond to multiple semantic clusters. Therefore, blocks can be grouped as a Multi-instance Multi-cluster Learning (MIMCL) problem, and a novel Maximum Margin Multi-instance Multi-cluster Learning (M^4L) algorithm is proposed. To avoid processing a difficult optimization problem, M^4L is further relaxed and solved by making use of a combination of the Cutting Plane method and Constrained Concave-Convex Procedure (CCCP). Extensive experiments are conducted on multiple real world video sequences containing various patterns and the results validate the effectiveness of our proposed approach.
Published: 2013

31. Weakly Supervised Graph Propagation Towards Collective Image Parsing

Author: Changsheng Xu, Jing Liu, Tianzhu Zhang, Hanqing Lu, Shuicheng Yan, and Si Liu
Subjects: Optimization problem, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Regular polygon, Pattern recognition, Graph theory, Image segmentation, Graph, Computer Science Applications, Automatic image annotation, Computer Science::Computer Vision and Pattern Recognition, Signal Processing, Convex optimization, Media Technology, Graph (abstract data type), Artificial intelligence, Electrical and Electronic Engineering, business, Image retrieval, Feature detection (computer vision)
Abstract: In this work, we propose a weakly supervised graph propagation method to automatically assign the annotated labels at image level to those contextually derived semantic regions. The graph is constructed with the over-segmented patches of the image collection as nodes. Image-level labels are imposed on the graph as weak supervision information over subgraphs, each of which corresponds to all patches of one image, and the contextual information across different images at patch level is then mined to assist the process of label propagation from images to their descendant regions. The ultimate optimization problem is efficiently solved by Convex Concave Programming (CCCP). Extensive experiments on four benchmark datasets clearly demonstrate the effectiveness of our proposed method for the task of collective image parsing. Two extensions, image annotation and concept map based image retrieval, demonstrate that the proposed image parsing algorithm can effectively aid other vision tasks.
Published: 2012

32. Boosted Exemplar Learning for Action Recognition and Annotation

Author: Changsheng Xu, Si Liu, Tianzhu Zhang, Jing Liu, and Hanqing Lu
Subjects: Contextual image classification, Computer science, business.industry, Heuristic, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Feature selection, Machine learning, computer.software_genre, ComputingMethodologies_PATTERNRECOGNITION, Action (philosophy), Discriminative model, Media Technology, Artificial intelligence, AdaBoost, Electrical and Electronic Engineering, Set (psychology), business, computer
Abstract: Human action recognition and annotation is an active research topic in computer vision. How to model various actions, varying with time resolution, visual appearance, and others, is a challenging task. In this paper, we propose a boosted exemplar learning (BEL) approach to model various actions in a weakly supervised manner, i.e., only action bag-level labels are provided but action instance-level ones are not. The proposed BEL method can be summarized in three steps. First, for each action category, a number of class-specific candidate exemplars are learned through an optimization formulation considering their discrimination and co-occurrence. Second, each action bag is described as a set of similarities between its instances and candidate exemplars. Instead of simply using a heuristic distance measure, the similarities are decided by the exemplar-based classifiers through multiple instance learning, in which a positive (or negative) video or image set is deemed a positive (or negative) action bag and those frames similar to the given exemplar in Euclidean space are treated as action instances. Third, we formulate the selection of the most discriminative exemplars as a boosted feature selection framework and simultaneously obtain an action bag-based detector. Experimental results on two publicly available datasets, the KTH dataset and the Weizmann dataset, demonstrate the validity and effectiveness of the proposed approach for action recognition. We also apply BEL to learn representations of actions by using images collected from the Web and use this knowledge to automatically annotate actions in YouTube videos. The results indicate that the proposed algorithm is also practical in unconstrained environments.
Published: 2011

33. Building topographic subspace model with transfer learning for sparse representation

Author: Hanqing Lu, Changsheng Xu, Jian Cheng, and Yang Liu
Subjects: Training set, Scale (ratio), Contextual image classification, Computer science, business.industry, Cognitive Neuroscience, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Sparse approximation, Machine learning, computer.software_genre, Computer Science Applications, ComputingMethodologies_PATTERNRECOGNITION, Discriminative model, Artificial Intelligence, Artificial intelligence, Transfer of learning, Cluster analysis, business, Image retrieval, computer, Subspace topology
Abstract: In this paper, we propose a topographic subspace learning algorithm, named key-coding learning, which utilizes irrelevant unlabeled auxiliary data to facilitate image classification and retrieval tasks. It is worth noting that we do not need to assume the auxiliary data follows the same class labels or generative distribution as the target training data. First, the subspace model is learnt from a large number of scale- and rotation-invariant SURF descriptors extracted from auxiliary and training images, which makes the model insensitive to geometric and photometric image transformations. Then the bases of the model are pooled by clustering to generate topographic basis banks. We provide insights to show that the topographic model is highly biologically plausible in simulating the complex cells in the visual cortex. Finally, we generate succinct sparse representations by mapping target data into this topographic model. Due to its capability of transferring knowledge, the proposed topographic subspace model can effectively address the insufficient training data problem for image classification and is also helpful for generating discriminative features for image retrieval. Extensive experiments are conducted on three image datasets to evaluate the performance of our proposed model, and the experimental results are encouraging and promising.
Published: 2010

34. Personalized Sports Video Customization Using Content and Context Analysis

Author: Chao Liang, Changsheng Xu, and Hanqing Lu
Subjects: Measure (data warehouse), Information retrieval, Article Subject, Social network, Multimedia, business.industry, Computer science, Communication, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, computer.software_genre, Automatic summarization, lcsh:Telecommunication, Personalization, Context analysis, Knapsack problem, lcsh:TK5101-6720, Content (measure theory), Media Technology, Electrical and Electronic Engineering, business, Adaptation (computer science), computer
Abstract: We present an integrated framework on personalized sports video customization, which addresses three research issues: semantic video annotation, personalized video retrieval and summarization, and system adaptation. Sports video annotation serves as the foundation of the video customization system. To acquire detailed description of video content, external web text is adopted to align with the related sports video according to their semantic correspondence. Based on the derived semantic annotation, a user-participant multiconstraint 0/1 Knapsack model is designed to model the personalized video customization, which can unify both video retrieval and summarization with different fusion parameters. As a measure to make the system adaptive to the particular user, a social network based system adaptation algorithm is proposed to learn latent user preference implicitly. Both quantitative and qualitative experiments conducted on twelve broadcast basketball and football videos validate the effectiveness of the proposed method.
Published: 2010
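
Illustrative sketch for record 34: the abstract casts clip selection as a multiconstraint 0/1 Knapsack problem. The code below is a simplified, single-constraint version (total duration only), not the authors' model. The function name `select_clips`, the integer durations in seconds, and the per-clip relevance scores are all assumptions made for illustration.

```python
from typing import List, Tuple


def select_clips(durations: List[int], scores: List[float], budget: int) -> Tuple[float, List[int]]:
    """Pick a subset of clips maximizing total score within a duration budget.

    Hypothetical helper: a plain 0/1 knapsack solved by dynamic programming.
    """
    n = len(durations)
    # best[i][b] = best total score using the first i clips within b seconds
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d, s = durations[i - 1], scores[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip clip i-1
            if d <= b and best[i - 1][b - d] + s > best[i][b]:
                best[i][b] = best[i - 1][b - d] + s  # option 2: take clip i-1

    # Walk the table backwards to recover which clips were taken.
    chosen: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= durations[i - 1]
    chosen.reverse()
    return best[n][budget], chosen
```

A multiconstraint variant, as named in the abstract, would add one table dimension per extra constraint; the abstract does not say which constraints the authors use.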

35. Personalized retrieval of sports video based on multi-modal analysis and user preference acquisition

Author: Changsheng Xu, Hanqing Lu, Xiaoyu Zhang, and Yifan Zhang
Subjects: Information retrieval, Computer Networks and Communications, Event (computing), Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Relevance feedback, Text annotation, Semantics, Preference, Annotation, Hardware and Architecture, Video tracking, Web page, Media Technology, Software
Abstract: In this paper, we present a novel framework on personalized retrieval of sports video, which includes two research tasks: semantic annotation and user preference acquisition. For semantic annotation, web-casting texts which are corresponding to sports videos are firstly captured from the webpages using data region segmentation and labeling. Incorporating the text, we detect events in the sports video and generate video event clips. These video clips are annotated by the semantics extracted from web-casting texts and indexed in a sports video database. Based on the annotation, these video clips can be retrieved from different semantic attributes according to the user preference. For user preference acquisition, we utilize click-through data as a feedback from the user. Relevance feedback is applied on text annotation and visual features to infer the intention and interested points of the user. A user preference model is learned to re-rank the initial results. Experiments are conducted on broadcast soccer and basketball videos and show an encouraging performance of the proposed method.
Published: 2009

36. Using Webcast Text for Semantic Event Detection in Broadcast Sports Video

Author: Guangyu Zhu, Changsheng Xu, Yong Rui, Qingming Huang, Yifan Zhang, and Hanqing Lu
Subjects: Information retrieval, Multimedia, Probabilistic latent semantic analysis, Event (computing), Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Image processing, computer.file_format, computer.software_genre, Smacker video, Automatic summarization, Object detection, Computer Science Applications, Webcast, Video tracking, Signal Processing, Media Technology, Electrical and Electronic Engineering, Face detection, computer
Abstract: Sports video semantic event detection is essential for sports video summarization and retrieval. Extensive research efforts have been devoted to this area in recent years. However, the existing sports video event detection approaches heavily rely on either video content itself, which face the difficulty of high-level semantic information extraction from video content using computer vision and image processing techniques, or manually generated video ontology, which is domain specific and difficult to be automatically aligned with the video content. In this paper, we present a novel approach for sports video semantic event detection based on analysis and alignment of Webcast text and broadcast video. Webcast text is a text broadcast channel for sports game which is co-produced with the broadcast video and is easily obtained from the Web. We first analyze Webcast text to cluster and detect text events in an unsupervised way using probabilistic latent semantic analysis (pLSA). Based on the detected text event and video structure analysis, we employ a conditional random field model (CRFM) to align text event and video event by detecting event moment and event boundary in the video. Incorporation of Webcast text into sports video analysis significantly facilitates sports video semantic event detection. We conducted experiments on 33 hours of soccer and basketball games for Webcast analysis, broadcast video analysis and text/video semantic alignment. The results are encouraging and compared with the manually labeled ground truth.
Published: 2008

37. Automatic composition of broadcast sports video

Author: Qi Tian, Eng Siong Chng, Hanqing Lu, Changsheng Xu, and Jinjun Wang
Subjects: Video production, Multimedia, Computer Networks and Communications, Computer science, Video capture, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Video content analysis, Video processing, computer.software_genre, Video compression picture types, Video editing, Hardware and Architecture, Video tracking, Media Technology, Multiview Video Coding, business, computer, Software, Information Systems
Abstract: This study examines an automatic broadcast soccer video composition system. The research is important as the ability to automatically compose broadcast sports video will not only improve broadcast video generation efficiency, but also provides the possibility to customize sports video broadcasting. We present a novel approach to the two major issues required in the system's implementation, specifically the camera view selection/switching module and the automatic replay generation module. In our implementation, we use multi-modal framework to perform video content analysis, event and event boundary detection from the raw unedited main/sub-camera captures. This framework explores the possible cues using mid-level representations to bridge the gap between low-level features and high-level semantics. The video content analysis results are utilized for camera view selection/switching in the generated video composition, and the event detection results and mid-level representations are used to generate replays which are automatically inserted into the broadcast soccer video. Our experimental results are promising and found to be comparable to those generated by broadcast professionals.
Published: 2008

38. Generation of Personalized Music Sports Video Using Multimodal Cues

Author: Jinjun Wang, Eng Siong Chng, Changsheng Xu, Q. Tian, and Hanqing Lu
Subjects: Multimedia, business.industry, Computer science, Feature extraction, Search engine indexing, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Broadcasting, computer.software_genre, Computer Science Applications, Personalization, Video editing, ComputerApplications_MISCELLANEOUS, Video tracking, Signal Processing, Media Technology, Electrical and Electronic Engineering, business, computer, Content management
Abstract: In this paper, we propose a novel automatic approach for personalized music sports video generation. Two research challenges are addressed, specifically the semantic sports video content extraction and the automatic music video composition. For the first challenge, we propose to use multimodal (audio, video, and text) feature analysis and alignment to detect the semantics of events in broadcast sports video. For the second challenge, we introduce the video-centric and music-centric music video composition schemes and propose a dynamic-programming based algorithm to perform fully or semi-automatic generation of personalized music sports video. The experimental results and user evaluations are promising and show that the music sports video generated by our system is comparable to professionally generated ones. Our proposed system greatly facilitates the music sports video editing task for both professionals and amateurs.
Published: 2007

39. Matching-CNN Meets KNN: Quasi-Parametric Human Parsing

Author: Jianchao Yang, Si Liu, Xiaohui Shen, Xiaochun Cao, Xiaodan Liang, Shuicheng Yan, Changsheng Xu, Liang Lin, and Luoqi Liu
Subjects: FOS: Computer and information sciences, Matching (statistics), Parsing, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, computer.software_genre, Convolutional neural network, k-nearest neighbors algorithm, Image (mathematics), Range (mathematics), Face (geometry), Computer vision, Artificial intelligence, business, computer, Smoothing, Parametric statistics
Abstract: Both parametric and non-parametric approaches have demonstrated encouraging performances in the human parsing task, namely segmenting a human image into several semantic regions (e.g., hat, bag, left arm, face). In this work, we aim to develop a new solution with the advantages of both methodologies, namely supervision from annotated data and the flexibility to use newly annotated (possibly uncommon) images, and present a quasi-parametric human parsing model. Under the classic K Nearest Neighbor (KNN)-based nonparametric framework, the parametric Matching Convolutional Neural Network (M-CNN) is proposed to predict the matching confidence and displacements of the best matched region in the testing image for a particular semantic region in one KNN image. Given a testing image, we first retrieve its KNN images from the annotated/manually-parsed human image corpus. Then each semantic region in each KNN image is matched with confidence to the testing image using M-CNN, and the matched regions from all KNN images are further fused, followed by a superpixel smoothing procedure to obtain the ultimate human parsing result. The M-CNN differs from the classic CNN in that the tailored cross image matching filters are introduced to characterize the matching between the testing image and the semantic region of a KNN image. The cross image matching filters are defined at different convolutional layers, each aiming to capture a particular range of displacements. Comprehensive evaluations over a large dataset with 7,700 annotated human images demonstrate the significant performance gain of the quasi-parametric model over the state of the art for the human parsing task. Comment: This manuscript is the accepted version for CVPR 2015.
Published: 2015

40. Nonparametric motion characterization for robust classification of camera motion patterns

Author: Jesse S. Jin, Ling-Yu Duan, Changsheng Xu, and Qi Tian
Subjects: Motion compensation, Motion analysis, business.industry, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Optical flow, Pattern recognition, Computer Science Applications, Quarter-pixel motion, Match moving, Motion field, Histogram, Motion estimation, Signal Processing, Media Technology, Computer vision, Mean-shift, Artificial intelligence, Electrical and Electronic Engineering, Panning (camera), business, Image retrieval, Block-matching algorithm
Abstract: Motion characterization plays a critical role in video indexing. An effective way of characterizing camera motion facilitates the video representation, indexing and retrieval tasks. This paper describes a novel nonparametric motion representation to achieve an effective and robust recognition of parts of the video in which camera is static, or panning, or tilting, or zooming, etc. This representation employs the mean shift filtering and the vector histograms to produce a compact description of a motion field. The basic idea is to perform spatio-temporal mode-seeking in the motion feature space and use the histograms-based spatial distributions of dominant motion modes to represent a motion field. Unlike most existing approaches, which focus on the estimation of a parametric motion model from a dense optical flow field (OFF) or a block matching-based motion vector field (MVF), the proposed method combines the motion representation and machine learning techniques (e.g., support vector machines) to perform camera motion analysis from the classification point of view. The main motivation lies in the impossibility of uniformly securing a proper parametric assumption in a wide range of video scenarios. The diverse camera shot sizes and frequent occurrences of bad OFF/MVF necessitates a learning mechanism, which can not only capture the domain-independent parametric constraints, but also acquire the domain-dependent knowledge to tolerate the influence of bad OFF/MVF. In order to improve performance, we can use this learning-based method to train enhanced classifiers aiming at a certain context (i.e., shot size, neighbor OFF/MVFs, and video genre). Other visual cues (e.g., dominant color) can also be incorporated for further motion analysis. 
Our main aim is to use a generic feature space analysis method to explore a flexible OFF/MVF representation in a nonparametric technique, which could be fed into a learning framework to robustly capture the global motion by incorporating the context information. Results on videos with various types of content (23 191 MVFs culled from MPEG-7 dataset, and 20 000 MVFs culled from broadcast tennis, soccer, and basketball videos) are reported to validate the proposed approach.
Published: 2006
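
Illustrative sketch for record 40: the abstract describes mode-seeking in a motion feature space via mean shift. The code below is a minimal flat-kernel mean shift over 2D motion vectors, shown only to make the idea concrete. The function name `mean_shift_mode`, the flat kernel, and the `bandwidth` parameter are assumptions; the paper's spatio-temporal formulation and histogram step are not reproduced here.

```python
from typing import List, Tuple

Vec = Tuple[float, float]


def mean_shift_mode(vectors: List[Vec], start: Vec, bandwidth: float,
                    max_iter: int = 100, tol: float = 1e-6) -> Vec:
    """Shift `start` toward a local density mode of `vectors` (flat kernel)."""
    x, y = start
    for _ in range(max_iter):
        # Keep only the motion vectors inside the current window.
        near = [(vx, vy) for vx, vy in vectors
                if (vx - x) ** 2 + (vy - y) ** 2 <= bandwidth ** 2]
        if not near:
            break  # empty window: nothing to move toward
        nx = sum(v[0] for v in near) / len(near)
        ny = sum(v[1] for v in near) / len(near)
        moved_sq = (nx - x) ** 2 + (ny - y) ** 2
        x, y = nx, ny
        if moved_sq < tol ** 2:
            break  # converged: the window mean stopped moving
    return x, y
```

Started from several seeds, a procedure like this yields the dominant motion modes of a field; a pan, for example, would show up as a single strong mode away from the origin.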

41. A unified framework for semantic shot classification in sports video

Author: Min Xu, Qi Tian, Ling-Yu Duan, Changsheng Xu, and Jesse S. Jin
Subjects: Information retrieval, Computer science, Shot (filmmaking), Feature vector, Semantic analysis (machine learning), Supervised learning, Search engine indexing, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Image processing, Semantic network, Computer Science Applications, Video tracking, Signal Processing, Media Technology, Electrical and Electronic Engineering, Cluster analysis, Image retrieval, Semantic gap
Abstract: The extensive amount of multimedia information available necessitates content-based video indexing and retrieval methods. Since humans tend to use high-level semantic concepts when querying and browsing multimedia databases, there is an increasing need for semantic video indexing and analysis. For this purpose, we present a unified framework for semantic shot classification in sports video, which has been widely studied due to tremendous commercial potentials. Unlike most existing approaches, which focus on clustering by aggregating shots or key-frames with similar low-level features, the proposed scheme employs supervised learning to perform a top-down video shot classification. Moreover, the supervised learning procedure is constructed on the basis of effective mid-level representations instead of exhaustive low-level features. This framework consists of three main steps: 1) identify video shot classes for each sport; 2) develop a common set of motion, color, shot length-related mid-level representations; and 3) supervised learning of the given sports video shots. It is observed that for each sport we can predefine a small number of semantic shot classes, about 5-10, which covers 90%-95% of broadcast sports video. We employ nonparametric feature space analysis to map low-level features to mid-level semantic video shot attributes such as dominant object (a player) motion, camera motion patterns, and court shape, etc. Based on the fusion of those mid-level shot attributes, we classify video shots into the predefined shot classes, each of which has clear semantic meanings. With this framework we have achieved good classification accuracy of 85%-95% on the game videos of five typical ball type sports (i.e., tennis, basketball, volleyball, soccer, and table tennis) with over 5500 shots of about 8 h. 
With correctly classified sports video shots, further structural and temporal analysis, such as event detection, highlight extraction, video skimming, and table of content, will be greatly facilitated.
Published: 2005

42. Scene and viewpoint based visual summarization for landmarks

Author: Weiqing Min, Changsheng Xu, and Bing-Kun Bao
Subjects: Topic model, Information retrieval, User experience design, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Computer vision, Artificial intelligence, business, Theme (computing), Automatic summarization, Task (project management)
Abstract: Visual summarization of landmarks is an important task for applications, such as landmark organization, search and browsing. In this work, we make the first attempt towards landmark summarization by simultaneously considering both the scenes (e.g., sunny view and night view) and viewpoints (e.g., front-side and close-distant viewpoint). In the proposed framework of landmark summarization, we first group images into different clusters by viewpoints, then the distinctive scenes for each viewpoint cluster are discovered by the proposed scene-viewpoint based theme modeling. Compared with the existing topic models, our model is capable of mining scene-viewpoint themes directly from all viewpoint clusters and meanwhile differentiating among these themes by viewpoints. The landmark summary is generated by the discovered scene-viewpoint themes, where each theme is represented by the selected images with one certain scene and viewpoint. The experimental results validate the proposed method and demonstrate its advantage in improving user experience.
Published: 2014

43. Occlusion Detection via Structured Sparse Learning for Robust Object Tracking

Author: Narendra Ahuja, Tianzhu Zhang, Changsheng Xu, and Bernard Ghanem
Subjects: Pixel, Computer science, business.industry, Frame (networking), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Sparse approximation, Tracking (particle physics), Video tracking, Eye tracking, Computer vision, Artificial intelligence, business, Particle filter, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: Sparse representation based methods have recently drawn much attention in visual tracking due to good performance against illumination variation and occlusion. They assume the errors caused by image variations can be modeled as pixel-wise sparse. However, in many practical scenarios, these errors are not truly pixel-wise sparse but rather sparsely distributed in a structured way. In fact, pixels in error constitute contiguous regions within the object’s track. This is the case when significant occlusion occurs. To accommodate for nonsparse occlusion in a given frame, we assume that occlusion detected in previous frames can be propagated to the current one. This propagated information determines which pixels will contribute to the sparse representation of the current track. In other words, pixels that were detected as part of an occlusion in the previous frame will be removed from the target representation process. As such, this paper proposes a novel tracking algorithm that models and detects occlusion through structured sparse learning. We test our tracker on challenging benchmark sequences, such as sports videos, which involve heavy occlusion, drastic illumination changes, and large pose variations. Extensive experimental results show that our proposed tracker consistently outperforms the state-of-the-art trackers.
Published: 2014

44. Latent support vector machine for sign language recognition with Kinect

Author: Tianzhu Zhang, Changsheng Xu, Bing-Kun Bao, and Chao Sun
Subjects: Phrase, American Sign Language, Contextual image classification, Computer science, Color image, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Sign language, language.human_language, Support vector machine, ComputingMethodologies_PATTERNRECOGNITION, Discriminative model, Depth map, Feature (machine learning), language, Computer vision, Artificial intelligence, business
Abstract: In this paper, we propose a novel algorithm to model and recognize sign language with Kinect sensor. We assume that in a sign language video, some frames are expected to be both discriminative and representative. Under this assumption, each frame in training videos is assigned a binary latent variable indicating its discriminative capability. A Latent Support Vector Machine model is then developed to classify the signs, as well as localize the discriminative and representative frames in videos. In addition, we utilize the depth map together with color image captured by Kinect sensor to obtain more effective and accurate feature to enhance the recognition accuracy. To evaluate our approach, we collected an American Sign Language (ASL) dataset which included approximately 2000 phrases, while each phrase was captured by Kinect sensor and hence included color, depth and skeleton information. Experiments on our dataset demonstrate the effectiveness of the proposed method for sign language recognition.
Published: 2013

45. Label localization by appearance guided graph inferring

Author: Jing Liu, Changsheng Xu, and Lei Yu
Subjects: Graph database, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Graph theory, computer.software_genre, Active appearance model, Support vector machine, Iterated function, Graph (abstract data type), Computer vision, Artificial intelligence, Semantic information, business, computer, Image retrieval, Mathematics
Abstract: Automatically localizing the image labels to the corresponding regions is a challenging but valuable task, which provides detailed semantic information for better image understanding and image retrieval. In this paper, we propose a novel appearance guided graph inferring (AGI) framework for label localization. The framework iterates with two stages: graph inferring and appearance learning. Given the image set, each image is over-segmented into a bag of small patches. In the first step, we adopt graph propagation based method to infer the patch labels collaboratively on the whole image set. A multi-cue graph is constructed for more consistent spatial layout and image label constraints are imposed in propagation. In the second step, SVM classifiers are trained as appearance models by gradually exploiting the inferring results. And then the patch labels are reevaluated by the learned appearance model and feedback to the first step. The global graph propagation and local appearance model complement each other by iteration. Extensive experiments on three public datasets, MSRC-v1, MSRC-v2 and SAIAPR TC-12, demonstrate the encouraging performance of our method in comparison with other baselines.
Published: 2013

46. Locality discriminative coding for image classification

Author: Xiaoshan Yang, Changsheng Xu, and Tianzhu Zhang
Subjects: Feature coding, Contextual image classification, Computer science, business.industry, Locality, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Scale-invariant feature transform, Pattern recognition, ComputingMethodologies_PATTERNRECOGNITION, Discriminative model, Bag-of-words model, Artificial intelligence, Quantization (image processing), business, Coding (social sciences)
Abstract: The Bag-of-Words (BOW) based methods are widely used in image classification. However, a large amount of visual information is inevitably lost in the quantization step of BOW. Recently, NBNN and its improved variants, such as Local NBNN, were proposed to solve this problem. Nevertheless, these methods do not perform better than the state-of-the-art BOW based methods. In this paper, building on the advantages of BOW and Local NBNN, we introduce a novel locality discriminative coding (LDC) method. We convert each low-level local feature, such as SIFT, into a code vector using the Local Feature-to-Class distance rather than k-means quantization. Extensive experimental results on 4 challenging benchmark datasets show that our LDC method outperforms 6 state-of-the-art image classification methods (3 based on NBNN, 3 based on BOW).
Published: 2013
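
Illustrative sketch for record 46: the abstract says each local feature is coded by its feature-to-class distance instead of by k-means quantization, but it does not give the coding formula. The code below shows only the generic NBNN-style ingredient, the squared distance from one descriptor to its nearest neighbor in each class. The function name `feature_to_class_code` and the dictionary layout are assumptions, and this is not the LDC method itself.

```python
from typing import Dict, List, Sequence


def feature_to_class_code(feature: Sequence[float],
                          class_features: Dict[str, List[Sequence[float]]]) -> Dict[str, float]:
    """Map one local descriptor to a per-class vector of nearest-neighbor distances."""
    code: Dict[str, float] = {}
    for label, feats in class_features.items():
        # Squared Euclidean distance to the closest descriptor of this class.
        code[label] = min(
            sum((a - b) ** 2 for a, b in zip(feature, f)) for f in feats
        )
    return code
```

The resulting code has one entry per class, so its length is fixed by the number of classes rather than by a codebook size.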

47. Object Tracking by Occlusion Detection via Structured Sparse Learning

Author: Narendra Ahuja, Changsheng Xu, Tianzhu Zhang, and Bernard Ghanem
Subjects: Pixel, business.industry, Computer science, Frame (networking), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Sparse approximation, Tracking (particle physics), Object detection, Video tracking, Eye tracking, Computer vision, Artificial intelligence, business, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: Sparse representation based methods have recently drawn much attention in visual tracking due to good performance against illumination variation and occlusion. They assume the errors caused by image variations can be modeled as pixel-wise sparse. However, in many practical scenarios these errors are not truly pixel-wise sparse but rather sparsely distributed in a structured way. In fact, pixels in error constitute contiguous regions within the object's track. This is the case when significant occlusion occurs. To accommodate for non-sparse occlusion in a given frame, we assume that occlusion detected in previous frames can be propagated to the current one. This propagated information determines which pixels will contribute to the sparse representation of the current track. In other words, pixels that were detected as part of an occlusion in the previous frame will be removed from the target representation process. As such, this paper proposes a novel tracking algorithm that models and detects occlusion through structured sparse learning. We test our tracker on challenging benchmark sequences, such as sports videos, which involve heavy occlusion, drastic illumination changes, and large pose variations. Experimental results show that our tracker consistently outperforms the state-of-the-art.
Published: 2013

48. Landmark History Visualization

Author: Bing-Kun Bao, Changsheng Xu, and Weiqing Min
Subjects: Landmark, Information retrieval, Event list, Computer science, business.industry, Event (computing), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, History education, Pattern recognition, Image (mathematics), Visualization, Manifold ranking, Anomaly detection, Artificial intelligence, business
Abstract: Landmark image mining and detection have been studied for many years. However, most existing work focuses on the spatial attributes of landmark images while largely ignoring the temporal information in specific ones that were taken during historical moments. Such images are more valuable than ordinary ones because they not only contain more comprehensive information to illustrate the landmarks at different moments, but are also useful in many real-world applications, such as tour recommendation and history education. In this paper, we present a novel framework named Landmark History Visualization (LHV) to mine relevant and diverse images for each landmark’s historic moments. There are two steps in LHV. The first is to extract the event list of each landmark from Wikipedia. The event keywords are extracted, and some of them are automatically labeled as 3W (What, Who, When). In the second step, images searched by the landmark name are first collected from Flickr and Google Images. We then employ manifold ranking with the detected 3W to retrieve the relevant images, and finally an outlier detection and diversification based re-ranking approach is introduced to provide users with varied results. We implemented our approach on 6 landmarks, and the results demonstrate the effectiveness of LHV.
Published: 2013

49. Multi-cue Based Multi-target Tracking with Boosted MHT

Author: Tianzhu Zhang, Shengsheng Qian, Changsheng Xu, and Long Ying
Subjects: Boosting (machine learning), Computer science, business.industry, Multiple hypotheses, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Optical flow, Video content analysis, Pattern recognition, Virtual reality, Discriminative model, Data association, Multi target tracking, Computer vision, Artificial intelligence, business
Abstract: Tracking multiple objects is critical to automatic video content analysis and virtual reality. The major problem is how to solve the data association problem when ambiguous observations are caused by objects in close proximity or by occlusion. To tackle this problem, we propose a boosted multiple hypotheses tracking (BMHT) algorithm for multi-object tracking. Here, on-line boosting learning is adopted to enhance the discriminative property and enlarge the search space of the generative tracker MHT. To make the tracker more reliable, a multi-cue integration strategy is adopted to consider different kinds of features under the on-line boosting framework. In this paper, we integrate both appearance and motion pattern information. For simplicity, Haar-like features and optical flow are adopted. We test our BMHT tracker on several challenging video sequences that involve heavy occlusion and pose variations. Experimental results show that the proposed BMHT achieves good performance.
Published: 2013

50. Extended MHT algorithm for multiple object tracking

Author: Changsheng Xu, Wen Guo, and Long Ying
Subjects: Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Video content analysis, Virtual reality, Tracking (particle physics), Data association, Computer Science::Computer Vision and Pattern Recognition, Video tracking, Histogram, Identity (object-oriented programming), Computer vision, Artificial intelligence, business, Likelihood function, Algorithm
Abstract: In this paper, we propose an improved, efficient MHT algorithm integrated with an HSV-LBP appearance model and a repulsion-inertia model for multi-object tracking. Simultaneously tracking multiple objects is critical to video content analysis and virtual reality. The main issues we want to address in this paper are the integration of video image patch information into data association and the ambiguous observations caused by objects in close proximity. A likelihood function of the HSV-LBP histogram with a template updating strategy is constructed. A repulsion-inertia model is adopted to explore more useful information from ambiguous detections. Experimental results show that the proposed approach generates better trajectories with fewer missing objects and identity switches.
Published: 2012
