Author: "Wu, Fei" / Journal: ieee transactions on image processing - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Wu, Fei"' showing total 13 results

1. Cross-Modal Learning to Rank via Latent Joint Representation.

Author: Wu, Fei, Jiang, Xinyang, Li, Xi, Tang, Siliang, Lu, Weiming, Zhang, Zhongfei, and Zhuang, Yueting
Subjects: *MACHINE learning, *RANKING (Statistics), *IMAGE representation, *IMAGE analysis, *QUERY (Information retrieval system)
Abstract: Cross-modal ranking is a research topic that is imperative to many applications involving multimodal data. Discovering a joint representation for multimodal data and learning a ranking function are essential in order to boost the cross-media retrieval (i.e., image-query-text or text-query-image). In this paper, we propose an approach to discover the latent joint representation of pairs of multimodal data (e.g., pairs of an image query and a text document) via a conditional random field and structural learning in a listwise ranking manner. We call this approach cross-modal learning to rank via latent joint representation (CML²R). In CML²R, the correlations between multimodal data are captured in terms of their sharing hidden variables (e.g., topics), and a hidden-topic-driven discriminative ranking function is learned in a listwise ranking manner. The experiments show that the proposed approach achieves a good performance in cross-media retrieval and meanwhile has the capability to learn the discriminative representation of multimodal data. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

2. Image Annotation by Input–Output Structural Grouping Sparsity.

Author: Han, Yahong, Wu, Fei, Tian, Qi, and Zhuang, Yueting
Subjects: *THREE-dimensional imaging, *FEATURE extraction, *STATISTICAL correlation, *SEMANTICS, *FEATURE selection, *ANNOTATIONS, *IMAGE retrieval, *IMAGE processing
Abstract: Automatic image annotation (AIA) is very important to image retrieval and image understanding. Two key issues in AIA are explored in detail in this paper, i.e., structured visual feature selection and the implementation of hierarchical correlated structures among multiple tags to boost the performance of image annotation. This paper simultaneously introduces an input and output structural grouping sparsity into a regularized regression model for image annotation. For input high-dimensional heterogeneous features such as color, texture, and shape, different kinds (groups) of features have different intrinsic discriminative power for the recognition of certain concepts. The proposed structured feature selection by structural grouping sparsity can be used not only to select group-of-features but also to conduct within-group selection. Hierarchical correlations among output labels are well represented by a tree structure, and therefore, the proposed tree-structured grouping sparsity can be used to boost the performance of multitag image annotation. In order to efficiently solve the proposed regression model, we relax the solving process as a framework of the bilayer regression model for multilabel boosting by the selection of heterogeneous features with structural grouping sparsity (Bi-MtBGS). The first-layer regression is to select the discriminative features for each label. The aim of the second-layer regression is to refine the feature selection model learned from the first layer, which can be taken as a multilabel boosting process. Extensive experiments on public benchmark image data sets and real-world image data sets demonstrate that the proposed approach has better performance of multitag image annotation and leads to a quite interpretable model for image understanding. [ABSTRACT FROM AUTHOR]
Published: 2012
Full Text: View/download PDF

3. Web and Personal Image Annotation by Mining Label Correlation With Relaxed Visual Graph Embedding.

Author: Yang, Yi, Wu, Fei, Nie, Feiping, Shen, Heng Tao, Zhuang, Yueting, and Hauptmann, Alexander G.
Subjects: *DIGITAL images, *EMBEDDINGS (Mathematics), *STATISTICAL correlation, *ALGORITHMS, *IMAGE quality analysis, *IMAGE databases
Abstract: The number of digital images rapidly increases, and it becomes an important challenge to organize these resources effectively. As a way to facilitate image categorization and retrieval, automatic image annotation has received much research attention. Considering that there are a great number of unlabeled images available, it is beneficial to develop an effective mechanism to leverage unlabeled images for large-scale image annotation. Meanwhile, a single image is usually associated with multiple labels, which are inherently correlated to each other. A straightforward method of image annotation is to decompose the problem into multiple independent single-label problems, but this ignores the underlying correlations among different labels. In this paper, we propose a new inductive algorithm for image annotation by integrating label correlation mining and visual similarity mining into a joint framework. We first construct a graph model according to image visual features. A multilabel classifier is then trained by simultaneously uncovering the shared structure common to different labels and the visual graph embedded label prediction matrix for image annotation. We show that the globally optimal solution of the proposed framework can be obtained by performing generalized eigen-decomposition. We apply the proposed framework to both web image annotation and personal album labeling using the NUS-WIDE, MSRA MM 2.0, and Kodak image data sets, and the AUC evaluation metric. Extensive experiments on large-scale image databases collected from the web and personal album show that the proposed algorithm is capable of utilizing both labeled and unlabeled data for image annotation and outperforms other algorithms. [ABSTRACT FROM AUTHOR]
Published: 2012
Full Text: View/download PDF

4. Training Robust Object Detectors From Noisy Category Labels and Imprecise Bounding Boxes.

Author: Xu, Youjiang, Zhu, Linchao, Yang, Yi, and Wu, Fei
Subjects: *OBJECT recognition (Computer vision), *CONVOLUTIONAL neural networks, *DETECTORS, *FOOD labeling, *SUPERVISED learning
Abstract: Object detection has gained great improvements with the advances of convolutional neural networks and the availability of large amounts of accurate training data. Though the amount of data is increasing significantly, the quality of data annotations is not guaranteed from the existing crowd-sourcing labeling platforms. In addition to noisy category labels, imprecise bounding box annotations commonly exist in object detection data. When the quality of training data degenerates, the performance of the typical object detectors is severely impaired. In this paper, we propose a Meta-Refine-Net (MRNet) to train object detectors from noisy category labels and imprecise bounding boxes. First, MRNet learns to adaptively assign lower weights to proposals with incorrect labels so as to suppress large loss values generated by these proposals on the classification branch. Second, MRNet learns to dynamically generate more accurate bounding box annotations to overcome the misleading effect of imprecisely annotated bounding boxes. Thus, the imprecise bounding boxes could impose positive impacts on the regression branch rather than simply be ignored. Third, we propose to refine the imprecise bounding box annotations by jointly learning from both the category and the localization information. By doing this, the approximation of ground-truth bounding boxes is more accurate while the misleading effect is further alleviated. Our MRNet is model-agnostic and is capable of learning from noisy object detection data with only a few clean examples (less than 2%). Extensive experiments on PASCAL VOC 2012 and MS COCO 2017 demonstrate the effectiveness and efficiency of our method. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

5. ResKD: Residual-Guided Knowledge Distillation.

Author: Li, Xuewei, Li, Songyuan, Omar, Bourahla, Wu, Fei, and Li, Xi
Subjects: *TRAINING of student teachers, *KNOWLEDGE transfer
Abstract: Knowledge distillation, aimed at transferring the knowledge from a heavy teacher network to a lightweight student network, has emerged as a promising technique for compressing neural networks. However, due to the capacity gap between the heavy teacher and the lightweight student, there still exists a significant performance gap between them. In this article, we see knowledge distillation in a fresh light, using the knowledge gap, or the residual, between a teacher and a student as guidance to train a much more lightweight student, called a res-student. We combine the student and the res-student into a new student, where the res-student rectifies the errors of the former student. Such a residual-guided process can be repeated until the user strikes the balance between accuracy and cost. At inference time, we propose a sample-adaptive strategy to decide which res-students are not necessary for each sample, which can save computational cost. Experimental results show that we achieve competitive performance with 18.04%, 23.14%, 53.59%, and 56.86% of the teachers’ computational costs on the CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet datasets. Finally, we do thorough theoretical and empirical analysis for our method. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

6. Interaction-Integrated Network for Natural Language Moment Localization.

Author: Ning, Ke, Xie, Lingxi, Liu, Jianzhuang, Wu, Fei, and Tian, Qi
Subjects: *NATURAL languages, *VIDEO compression, *VIDEO excerpts, *TASK analysis
Abstract: Natural language moment localization aims at localizing video clips according to a natural language description. The key to this challenging task lies in modeling the relationship between verbal descriptions and visual contents. Existing approaches often sample a number of clips from the video, and individually determine how each of them is related to the query sentence. However, this strategy can fail dramatically, in particular when the query sentence refers to some visual elements that appear outside of, or even are distant from, the target clip. In this paper, we address this issue by designing an Interaction-Integrated Network (I2N), which contains a few Interaction-Integrated Cells (I2Cs). The idea lies in the observation that the query sentence not only provides a description of the video clip, but also contains semantic cues on the structure of the entire video. Based on this, I2Cs go one step beyond modeling short-term contexts in the time domain by encoding long-term video content into every frame feature. By stacking a few I2Cs, the obtained network, I2N, enjoys an improved ability of inference, brought by both (I) multi-level correspondence between vision and language and (II) more accurate cross-modal alignment. When evaluated on a challenging video moment localization dataset named DiDeMo, I2N outperforms the state-of-the-art approach by a clear margin of 1.98%. On two other challenging datasets, Charades-STA and TACoS, I2N also reports competitive performance. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

7. FREE: A Fast and Robust End-to-End Video Text Spotter.

Author: Cheng, Zhanzhan, Lu, Jing, Zou, Baorui, Qiao, Liang, Xu, Yunlu, Pu, Shiliang, Niu, Yi, Wu, Fei, and Zhou, Shuigeng
Subjects: *TEXT recognition, *STREAMING media, *VIDEO surveillance, *GLOBAL optimization, *VIDEOS
Abstract: Currently, video text spotting tasks usually fall into the four-staged pipeline: detecting text regions in individual images, recognizing localized text regions frame by frame, tracking text streams and post-processing to generate final results. However, they may suffer from the huge computational cost as well as sub-optimal results due to the interference of low-quality text and the non-trainable pipeline strategy. In this article, we propose a fast and robust end-to-end video text spotting framework named FREE by only recognizing the localized text stream one-time instead of frame-wise recognition. Specifically, FREE first employs a well-designed spatial-temporal detector that learns text locations among video frames. Then a novel text recommender is developed to select the highest-quality text from text streams for recognizing. Here, the recommender is implemented by assembling text tracking, quality scoring and recognition into a trainable module. It not only avoids the interference from the low-quality text but also dramatically speeds up the video text spotting. FREE unites the detector and recommender into a whole framework, and helps achieve global optimization. Besides, we collect a large scale video text dataset for promoting the video text spotting community, containing 100 videos from 21 real-life scenarios. Extensive experiments on public benchmarks show our method greatly speeds up the text spotting process, and also achieves remarkable state-of-the-art performance. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

8. Context-Aware Graph Label Propagation Network for Saliency Detection.

Author: Ji, Wei, Li, Xi, Wei, Lina, Wu, Fei, and Zhuang, Yueting
Subjects: *GRAPH labelings, *SPINE
Abstract: Recently, a large number of existing methods for saliency detection have mainly focused on designing complex network architectures to aggregate powerful features from backbone networks. However, contextual information is not well utilized, which often causes false background regions and blurred object boundaries. Motivated by these issues, we propose an easy-to-implement module that utilizes the edge-preserving ability of superpixels and the graph neural network to exchange context among superpixel nodes. In more detail, we first extract the features from the backbone network and obtain the superpixel information of images. This step is followed by superpixel pooling in which we transfer the irregular superpixel information to a structured feature representation. To propagate the information among the foreground and background regions, we use a graph neural network and self-attention layer to better evaluate the degree of saliency. Additionally, an affinity loss is proposed to regularize the affinity matrix to constrain the propagation path. Moreover, we extend our module to a multiscale structure with different numbers of superpixels. Experiments on five challenging datasets show that our approach can improve the performance of three baseline methods in terms of some popular evaluation metrics. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

9. Adaptive Graph Representation Learning for Video Person Re-Identification.

Author: Wu, Yiming, Bourahla, Omar El Farouk, Li, Xi, Wu, Fei, Tian, Qi, and Zhou, Xue
Subjects: *REPRESENTATIONS of graphs, *DEEP learning, *VIDEOS, *LEARNING
Abstract: Recent years have witnessed the remarkable progress of applying deep learning models in video person re-identification (Re-ID). A key factor for video person Re-ID is to effectively construct discriminative and robust video feature representations for many complicated situations. Part-based approaches employ spatial and temporal attention to extract representative local features. While correlations between parts are ignored in the previous methods, to leverage the relations of different parts, we propose an innovative adaptive graph representation learning scheme for video person Re-ID, which enables the contextual interactions between relevant regional features. Specifically, we exploit the pose alignment connection and the feature affinity connection to construct an adaptive structure-aware adjacency graph, which models the intrinsic relations between graph nodes. We perform feature propagation on the adjacency graph to refine regional features iteratively, and the neighbor nodes’ information is taken into account for part feature representation. To learn compact and discriminative representations, we further propose a novel temporal resolution-aware regularization, which enforces the consistency among different temporal resolutions for the same identities. We conduct extensive evaluations on four benchmarks, i.e., iLIDS-VID, PRID2011, MARS, and DukeMTMC-VideoReID; the experimental results show competitive performance, which demonstrates the effectiveness of our proposed method. Code is available at https://github.com/weleen/AGRL.pytorch. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

10. Long-Form Video Question Answering via Dynamic Hierarchical Reinforced Networks.

Author: Zhao, Zhou, Zhang, Zhu, Xiao, Shuwen, Xiao, Zhenxin, Yan, Xiaohui, Yu, Jun, Cai, Deng, and Wu, Fei
Subjects: *STREAMING video & television, *NATURAL languages, *VIDEOS, *INFORMATION retrieval, *REINFORCEMENT learning, *QUESTIONING
Abstract: Open-ended long-form video question answering is a challenging task in visual information retrieval, which automatically generates a natural language answer from the referenced long-form video contents according to a given question. However, the existing works mainly focus on short-form video question answering, due to the lack of modeling semantic representations from long-form video contents. In this paper, we introduce a dynamic hierarchical reinforced network for open-ended long-form video question answering, which employs an encoder–decoder architecture with a dynamic hierarchical encoder and a reinforced decoder. Concretely, we first propose a frame-level dynamic long-short term memory (LSTM) network with binary segmentation gate to learn frame-level semantic representations according to the given question. We then develop a segment-level highway LSTM network with a question-aware highway gate for segment-level semantic modeling. Furthermore, we devise the reinforced decoder with a hierarchical attention mechanism to generate natural language answers. We construct a large-scale long-form video question answering dataset. The extensive experiments on the long-form dataset and another public short-form dataset show the effectiveness of our method. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

11. User-Ranking Video Summarization With Multi-Stage Spatio–Temporal Representation.

Author: Huang, Siyu, Li, Xi, Zhang, Zhongfei, Wu, Fei, and Han, Junwei
Subjects: *SUPERVISED learning, *SHORT-term memory, *RECURRENT neural networks, *VIDEOS, *ARTIFICIAL neural networks
Abstract: Video summarization is a challenging task, mainly due to the difficulties in learning complicated semantic structural relations between videos and summaries. In this paper, we present a novel supervised video summarization scheme based on three-stage deep neural networks. The scheme takes a divide-and-conquer strategy to resolve the complicated task of 3D video summarization into a set of easy and flexible computational subtasks, and then sequentially applies 2D CNNs, 1D CNNs, and long short-term memory to address the subtasks in a hierarchical fashion. The hierarchical modeling of spatio–temporal structure leads to high performance and efficiency. In addition, we propose a simple but effective user-ranking method to cope with the labeling subjectivity problem of user-created video summarization, leading to the labeling quality refinement for robust supervised learning. Experimental results show that our approach outperforms the state-of-the-art video summarization methods on two benchmark datasets. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

12. Deep Context-Sensitive Facial Landmark Detection With Tree-Structured Modeling.

Author: Zeng, Jiajian, Liu, Siyuan, Li, Xi, Mahdi, Debbah Abderrahmane, Wu, Fei, and Wang, Gang
Subjects: *IMAGE analysis, *CARTOGRAPHY, *HUMAN facial recognition software, *GENETIC algorithms, *SOFTWARE engineering
Abstract: Facial landmark detection is typically cast as a point-wise regression problem that focuses on how to build an effective image-to-point mapping function. In this paper, we propose an end-to-end deep learning approach for contextually discriminative feature construction together with effective facial structure modeling. The proposed learning approach is able to predict more contextually discriminative facial landmarks by capturing their associated contextual information. Moreover, we present a tree model to characterize human face structure and a structural loss function to measure the deformation cost between the ground-truth and predicted tree model, which are further incorporated into the proposed learning approach and jointly optimized within a unified framework. The presented tree model is able to well characterize the spatial layout patterns of facial landmarks for capturing the facial structure information. Experimental results demonstrate the effectiveness of the proposed approach against the state-of-the-art over the MTFL and AFLW-full data sets. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

13. Body Structure Aware Deep Crowd Counting.

Author: Huang, Siyu, Li, Xi, Zhang, Zhongfei, Wu, Fei, Gao, Shenghua, Ji, Rongrong, and Han, Junwei
Subjects: *PEDESTRIAN areas design, *SEMANTIC networks (Information theory), *ARTIFICIAL neural networks, *MATHEMATICAL convolutions, *VISUAL analytics
Abstract: Crowd counting is a challenging task, mainly due to the severe occlusions among dense crowds. This paper aims to take a broader view to address crowd counting from the perspective of semantic modeling. In essence, crowd counting is a task of pedestrian semantic analysis involving three key factors: pedestrians, heads, and their context structure. The information of different body parts is an important cue to help us judge whether there exists a person at a certain position. Existing methods usually perform crowd counting from the perspective of directly modeling the visual properties of either the whole body or the heads only, without explicitly capturing the composite body-part semantic structure information that is crucial for crowd counting. In our approach, we first formulate the key factors of crowd counting as semantic scene models. Then, we convert the crowd counting problem into a multi-task learning problem, such that the semantic scene models are turned into different sub-tasks. Finally, the deep convolutional neural networks are used to learn the sub-tasks in a unified scheme. Our approach encodes the semantic nature of crowd counting and provides a novel solution in terms of pedestrian semantic analysis. In experiments, our approach outperforms the state-of-the-art methods on four benchmark crowd counting data sets. The semantic structure information is demonstrated to be an effective cue in scene of crowd counting. [ABSTRACT FROM PUBLISHER]
Published: 2018
Full Text: View/download PDF
