Author: "Zhong, Yujie" / Publication Type: Electronic Resources - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhong, Yujie"' showing total 35 results


1. Visual retrieval for compound queries

Author: Zhong, Yujie, Zisserman, Andrew, and Arandjelović, Relja
Subjects: 006.3
Abstract: There is now a vast number of images available on the Internet or in personal collections. As a result, searching for images based on their visual content has many useful applications. In this thesis, we focus on compound query image retrieval. Namely, the query can consist of objects of different natures, a set of objects of the same type or multiple examples of an object, and the goal is to retrieve (based on visual content) images that match the query from a large image corpus. However, compound query retrieval is very challenging, as it may require the system to handle queries of different object types. Furthermore, the retrieval should be real-time with high performance. The first task we consider is to retrieve images containing both a target person and a target scene type from a large dataset of images. We propose a hybrid convolutional neural network architecture that produces place-descriptors that are aware of faces and their corresponding descriptors. We also propose an image synthesis system to render high quality fully-labelled face-and-place images which are used to train the network. To facilitate this research, we collect and annotate a dataset of real images containing celebrities in different places, which can be used to evaluate the retrieval system. We demonstrate significantly improved retrieval performance for compound queries using the new face-aware place-descriptors compared to baseline methods. Set retrieval is another example of compound query retrieval. Namely, we wish to rank the images, given a set of query identities, such that those containing all the identities of the query are ranked first, followed by those which satisfy all but one of the query identities, and so on. To this end, we propose a network architecture to achieve the objective: it learns face descriptors and their aggregation over a set to produce a compact fixed length descriptor designed for set retrieval. We also explore the speed vs. retrieval quality trade-off for set retrieval using this compact descriptor. For evaluation, we collect and annotate a large dataset of images containing various numbers of celebrities, which is publicly available. Template-based face recognition, where a set of faces of the same subject is available, is now gaining attention as there is usually more than one example for each subject in real-world situations. To tackle this problem, we propose a network architecture which aggregates and embeds the face descriptors produced by deep convolutional neural networks into a compact template representation. This compact representation requires minimal memory storage and enables efficient similarity computation. The proposed architecture contains a novel GhostVLAD layer which enables the network to deal with poor quality images, i.e. informative images contribute more than the low quality ones. We also show that such quality weighting on the input faces emerges automatically. The performance of the network far exceeds the state-of-the-art on one of the most challenging public benchmarks.
Published: 2018

2. InstaGen: Enhancing Object Detection by Training on Synthetic Dataset

Author: Feng, Chengjian, Zhong, Yujie, Jie, Zequn, Xie, Weidi, and Ma, Lin
Abstract: In this paper, we present a novel paradigm to enhance the ability of an object detector, e.g., expanding categories or improving detection performance, by training on a synthetic dataset generated from diffusion models. Specifically, we integrate an instance-level grounding head into a pre-trained, generative diffusion model, to augment it with the ability of localising instances in the generated images. The grounding head is trained to align the text embedding of category names with the regional visual feature of the diffusion model, using supervision from an off-the-shelf object detector, and a novel self-training scheme on (novel) categories not covered by the detector. We conduct thorough experiments to show that this enhanced version of the diffusion model, termed InstaGen, can serve as a data synthesizer to enhance object detectors by training on its generated samples, demonstrating superior performance over existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse (+1.2 to 5.2 AP) scenarios. Project page with code: https://fcjian.github.io/InstaGen.
Comment: CVPR2024
Published: 2024

3. LaSagnA: Language-based Segmentation Assistant for Complex Queries

Author: Wei, Cong, Tan, Haoxian, Zhong, Yujie, Yang, Yujiu, and Ma, Lin
Abstract: Recent advancements have empowered Large Language Models for Vision (vLLMs) to generate detailed perceptual outcomes, including bounding boxes and masks. Nonetheless, there are two constraints that restrict the further application of these vLLMs: the incapability of handling multiple targets per query and the failure to identify the absence of query objects in the image. In this study, we acknowledge that the main cause of these problems is the insufficient complexity of training queries. Consequently, we define the general sequence format for complex queries. Then we incorporate a semantic segmentation task in the current pipeline to fulfill the requirements of training data. Furthermore, we present three novel strategies to effectively handle the challenges arising from the direct integration of the proposed format. The effectiveness of our model in processing complex queries is validated by the comparable results with conventional methods on both close-set and open-set semantic segmentation datasets. Additionally, we outperform a series of vLLMs in reasoning and referring segmentation, showcasing our model's remarkable capabilities. We release the code at https://github.com/congvvc/LaSagnA.
Published: 2024

4. UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

Author: Zeng, Yingsen, Zhong, Yujie, Feng, Chengjian, and Ma, Lin
Abstract: Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos. Although they focus on different events, we observe that they have a significant connection. For instance, most descriptions in MR involve multiple actions from TAD. In this paper, we aim to investigate the potential synergy between TAD and MR. Firstly, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD or events for MR, into a common embedding space, and utilizes two novel query-dependent decoders to generate a uniform output of classification score and temporal segments. Secondly, we explore the efficacy of two task fusion learning approaches, pre-training and co-training, in order to enhance the mutual benefits between TAD and MR. Extensive experiments demonstrate that the proposed task fusion learning scheme enables the two tasks to help each other and outperform the separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets Ego4D, Charades-STA, and ActivityNet. Our code will be released at https://github.com/yingsen1/UniMD.
Comment: Tech report
Published: 2024

5. Matten: Video Generation with Mamba-Attention

Author: Gao, Yu, Huang, Jiancheng, Sun, Xiaopeng, Jie, Zequn, Zhong, Yujie, and Ma, Lin
Abstract: In this paper, we introduce Matten, a cutting-edge latent diffusion model with Mamba-Attention architecture for video generation. With minimal computational cost, Matten employs spatial-temporal attention for local video content modeling and bidirectional Mamba for global video content modeling. Our comprehensive experimental evaluation demonstrates that Matten has competitive performance with the current Transformer-based and GAN-based models in benchmark performance, achieving superior FVD scores and efficiency. Additionally, we observe a direct positive correlation between the complexity of our designed model and the improvement in video quality, indicating the excellent scalability of Matten.
Published: 2024

6. Genomic profiling of post-transplant lymphoproliferative disorders using cell-free DNA

Author: Veltmaat, Nick, Zhong, Yujie, de Jesus, Filipe Montes, Tan, Geok Wee, Bult, Johanna A.A., Terpstra, Martijn M., Mutsaers, Pim G.N.J., Stevens, Wendy B.C., Mous, Rogier, Vermaat, Joost S.P., Chamuleau, Martine E.D., Noordzij, Walter, Verschuuren, Erik A.M., Kok, Klaas, Kluiver, Joost L., Diepstra, Arjan, Plattel, Wouter J., van den Berg, Anke, and Nijland, Marcel
Abstract: Diagnosing post-transplant lymphoproliferative disorder (PTLD) is challenging and often requires invasive procedures. Analysis of cell-free DNA (cfDNA) isolated from plasma is minimally invasive and highly effective for genomic profiling of tumors. We studied the feasibility of using cfDNA to profile PTLD and explored its potential to serve as a screening tool. We included seventeen patients with monomorphic PTLD after solid organ transplantation in this multi-center observational cohort study. We used low-coverage whole genome sequencing (lcWGS) to detect copy number variations (CNVs) and targeted next-generation sequencing (NGS) to identify Epstein-Barr virus (EBV) DNA load and somatic single nucleotide variants (SNVs) in cfDNA from plasma. Seven out of seventeen (41%) patients had EBV-positive tumors, and 13/17 (76%) had stage IV disease. Nine out of seventeen (56%) patients showed CNVs in cfDNA, with more CNVs in EBV-negative cases. Recurrent gains were detected for 3q, 11q, and 18q. Recurrent losses were observed at 6q. The fraction of EBV reads in cfDNA from EBV-positive patients was 3-log higher compared to controls and EBV-negative patients. 289 SNVs were identified, with a median of 19 per sample. SNV burden correlated significantly with lactate dehydrogenase levels. Similar SNV burdens were observed in EBV-negative and EBV-positive PTLD. The most commonly mutated genes were TP53 and KMT2D (41%), followed by SPEN, TET2 (35%), and ARID1A, IGLL5, and PIM1 (29%), indicating DNA damage response, epigenetic regulation, and B-cell signaling/NFkB pathways as drivers of PTLD. Overall, CNVs were more prevalent in EBV-negative lymphoma, while no difference was observed in the number of SNVs. Our data indicated the potential of analyzing cfDNA as a tool for PTLD screening and response monitoring.
Published: 2023

7. TriDet: Temporal Action Detection with Relative Boundary Modeling

Author: Shi, Dingfeng, Zhong, Yujie, Cao, Qiong, Ma, Lin, Li, Jia, and Tao, Dacheng
Abstract: In this paper, we present a one-stage framework TriDet for temporal action detection. Existing methods often suffer from imprecise boundary predictions due to the ambiguous action boundaries in videos. To alleviate this problem, we propose a novel Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer to mitigate the rank loss problem of self-attention that takes place in the video features and aggregate information across different temporal granularities. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks: THUMOS14, HACS and EPIC-KITCHEN 100, with lower computational costs, compared to previous methods. For example, TriDet hits an average mAP of 69.3% on THUMOS14, outperforming the previous best by 2.5%, but with only 74.6% of its latency. The code is released at https://github.com/sssste/TriDet.
Comment: CVPR2023; Temporal Action Detection; Temporal Action Localization
Published: 2023

8. MotionTrack: Learning Motion Predictor for Multiple Object Tracking

Author: Xiao, Changcheng, Cao, Qiong, Zhong, Yujie, Lan, Long, Zhang, Xiang, Luo, Zhigang, and Tao, Dacheng
Abstract: Significant progress has been achieved in multi-object tracking (MOT) through the evolution of detection and re-identification (ReID) techniques. Despite these advancements, accurately tracking objects in scenarios with homogeneous appearance and heterogeneous motion remains a challenge. This challenge arises from two main factors: the insufficient discriminability of ReID features and the predominant utilization of linear motion models in MOT. In this context, we introduce a novel motion-based tracker, MotionTrack, centered around a learnable motion predictor that relies solely on object trajectory information. This predictor comprehensively integrates two levels of granularity in motion features to enhance the modeling of temporal dynamics and facilitate precise future motion prediction for individual objects. Specifically, the proposed approach adopts a self-attention mechanism to capture token-level information and a Dynamic MLP layer to model channel-level features. MotionTrack is a simple, online tracking approach. Our experimental results demonstrate that MotionTrack yields state-of-the-art performance on datasets such as Dancetrack and SportsMOT, characterized by highly complex object motion.
Published: 2023

9. Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models

Author: Liu, Chang, Wu, Haoning, Zhong, Yujie, Zhang, Xiaoyun, Wang, Yanfeng, and Xie, Weidi
Abstract: Generative models have recently exhibited exceptional capabilities in text-to-image generation, but still struggle to generate image sequences coherently. In this work, we focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling. We make the following three contributions: (i) to fulfill the task of visual storytelling, we propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module that enables generation of the current frame by conditioning on the corresponding text prompt and preceding image-caption pairs; (ii) to address the data shortage of visual storytelling, we collect paired image-text sequences by sourcing from online videos and open-source E-books, establishing a processing pipeline for constructing a large-scale dataset with diverse characters, storylines, and artistic styles, named StorySalon; (iii) quantitative experiments and human evaluations have validated the superiority of our StoryGen, where we show StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters. Code, dataset, and models are available at https://haoningwu3639.github.io/StoryGen_Webpage
Comment: Accepted by CVPR 2024. Project Page: https://haoningwu3639.github.io/StoryGen_Webpage
Published: 2023

10. Bridging the Gap Between End-to-end and Non-End-to-end Multi-Object Tracking

Author: Yan, Feng, Luo, Weixin, Zhong, Yujie, Gan, Yiyang, and Ma, Lin
Abstract: Existing end-to-end Multi-Object Tracking (e2e-MOT) methods have not surpassed non-end-to-end tracking-by-detection methods. One potential reason is their label assignment strategy during training, which consistently binds the tracked objects with tracking queries and then assigns the few newborns to detection queries. With one-to-one bipartite matching, such an assignment will yield unbalanced training, i.e., scarce positive samples for detection queries, especially for an enclosed scene, as the majority of the newborns come on stage at the beginning of videos. Thus, e2e-MOT is more likely to yield a tracking terminal without renewal or re-initialization, compared to other tracking-by-detection methods. To alleviate this problem, we present Co-MOT, a simple and effective method to facilitate e2e-MOT by a novel coopetition label assignment with a shadow concept. Specifically, we add tracked objects to the matching targets for detection queries when performing the label assignment for training the intermediate decoders. For query initialization, we expand each query by a set of shadow counterparts with limited disturbance to itself. With extensive ablations, Co-MOT achieves superior performance without extra costs, e.g., 69.4% HOTA on DanceTrack and 52.8% TETA on BDD100K. Impressively, Co-MOT only requires 38% of the FLOPs of MOTRv2 to attain a similar performance, resulting in a 1.4× faster inference speed.
Published: 2023

11. Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network

Author: Han, Cong, Zhong, Yujie, Li, Dengjie, Han, Kai, and Ma, Lin
Abstract: Recently, the open-vocabulary semantic segmentation problem has attracted increasing attention and the best performing methods are based on two-stream networks: one stream for proposal mask generation and the other for segment classification using a pretrained visual-language model. However, existing two-stream methods require passing a great number of (up to a hundred) image crops into the visual-language model, which is highly inefficient. To address the problem, we propose a network that only needs a single pass through the visual-language model for each input image. Specifically, we first propose a novel network adaptation approach, termed patch severance, to restrict the harmful interference between the patch embeddings in the pre-trained visual encoder. We then propose classification anchor learning to encourage the network to spatially focus on more discriminative features for classification. Extensive experiments demonstrate that the proposed method achieves outstanding performance, surpassing state-of-the-art methods while being 4 to 7 times faster at inference. Code: https://github.com/CongHan0808/DeOP.git
Comment: Accepted by ICCV2023
Published: 2023

12. Adaptive Sparse Pairwise Loss for Object Re-Identification

Author: Zhou, Xiao, Zhong, Yujie, Cheng, Zhen, Liang, Fan, and Ma, Lin
Abstract: Object re-identification (ReID) aims to find instances with the same identity as the given probe from a large gallery. Pairwise losses play an important role in training a strong ReID network. Existing pairwise losses densely exploit each instance as an anchor and sample its triplets in a mini-batch. This dense sampling mechanism inevitably introduces positive pairs that share few visual similarities, which can be harmful to the training. To address this problem, we propose a novel loss paradigm termed Sparse Pairwise (SP) loss that only leverages a few appropriate pairs for each class in a mini-batch, and empirically demonstrate that it is sufficient for the ReID tasks. Based on the proposed loss framework, we propose an adaptive positive mining strategy that can dynamically adapt to diverse intra-class variations. Extensive experiments show that SP loss and its adaptive variant AdaSP loss outperform other pairwise losses, and achieve state-of-the-art performance across several ReID benchmarks. Code is available at https://github.com/Astaxanthin/AdaSP.
Comment: Accepted by CVPR 2023
Published: 2023

13. Genomic profiling of post-transplant lymphoproliferative disorders using cell-free DNA

Author: MS Hematologie, Cancer, Infection & Immunity, Veltmaat, Nick, Zhong, Yujie, de Jesus, Filipe Montes, Tan, Geok Wee, Bult, Johanna A.A., Terpstra, Martijn M., Mutsaers, Pim G.N.J., Stevens, Wendy B.C., Mous, Rogier, Vermaat, Joost S.P., Chamuleau, Martine E.D., Noordzij, Walter, Verschuuren, Erik A.M., Kok, Klaas, Kluiver, Joost L., Diepstra, Arjan, Plattel, Wouter J., van den Berg, Anke, and Nijland, Marcel
Published: 2023

14. SoccerNet 2023 Challenges Results

Author: Cioppa, Anthony, Giancola, Silvio, Somers, Vladimir, Magera, Floriane, Zhou, Xin, Mkhallati, Hassan, Deliège, Adrien, Held, Jan, Hinojosa, Carlos, Mansourian, Amir M., Miralles, Pierre, Barnich, Olivier, De Vleeschouwer, Christophe, Alahi, Alexandre, Ghanem, Bernard, Van Droogenbroeck, Marc, Kamal, Abdullah, Maglo, Adrien, Clapés, Albert, Abdelaziz, Amr, Xarles, Artur, Orcesi, Astrid, Scott, Atom, Liu, Bin, Lim, Byoungkwon, Chen, Chen, Deuser, Fabian, Yan, Feng, Yu, Fufu, Shitrit, Gal, Wang, Guanshuo, Choi, Gyusik, Kim, Hankyul, Guo, Hao, Fahrudin, Hasby, Koguchi, Hidenari, Ardö, Håkan, Salah, Ibrahim, Yerushalmy, Ido, Muhammad, Iftikar, Uchida, Ikuma, Be'ery, Ishay, Rabarisoa, Jaonary, Lee, Jeongae, Fu, Jiajun, Yin, Jianqin, Xu, Jinghang, Nang, Jongho, Denize, Julien, Li, Junjie, Zhang, Junpei, Kim, Juntae, Synowiec, Kamil, Kobayashi, Kenji, Zhang, Kexin, Habel, Konrad, Nakajima, Kota, Jiao, Licheng, Ma, Lin, Wang, Lizhi, Wang, Luping, Li, Menglong, Zhou, Mengying, Nasr, Mohamed, Abdelwahed, Mohamed, Liashuha, Mykola, Falaleev, Nikolay, Oswald, Norbert, Jia, Qiong, Pham, Quoc-Cuong, Song, Ran, Hérault, Romain, Peng, Rui, Chen, Ruilong, Liu, Ruixuan, Baikulov, Ruslan, Fukushima, Ryuto, Escalera, Sergio, Lee, Seungcheon, Chen, Shimin, Ding, Shouhong, Someya, Taiga, Moeslund, Thomas B., Li, Tianjiao, Shen, Wei, Zhang, Wei, Li, Wei, Dai, Wei, Luo, Weixin, Zhao, Wending, Zhang, Wenjie, Yang, Xinquan, Ma, Yanbiao, Joo, Yeeun, Zeng, Yingsen, Gan, Yiyang, Zhu, Yongqiang, Zhong, Yujie, Ruan, Zheng, Li, Zhiheng, Huang, Zhijian, and Meng, Ziyu
Abstract: The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, focusing on retrieving all timestamps related to global actions in soccer, (2) ball action spotting, focusing on retrieving all timestamps related to the soccer ball change of state, and (3) dense video captioning, focusing on describing the broadcast with natural language and anchored timestamps. The second theme, field understanding, relates to the single task of (4) camera calibration, focusing on retrieving the intrinsic and extrinsic camera parameters from images. The third and last theme, player understanding, is composed of three low-level tasks related to extracting information about the players: (5) re-identification, focusing on retrieving the same players across multiple views, (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams, and (7) jersey number recognition, focusing on recognizing the jersey number of players from tracklets. Compared to the previous editions of the SoccerNet challenges, tasks (2-3-7) are novel, including new annotations and data, task (4) was enhanced with more data and annotations, and task (6) now focuses on end-to-end approaches. More information on the tasks, challenges, and leaderboards are available on https://www.soccer-net.org. Baselines and development kits can be found on https://github.com/SoccerNet.
Published: 2023

15. Temporal Action Localization with Enhanced Instant Discriminability

Author: Shi, Dingfeng, Cao, Qiong, Zhong, Yujie, An, Shan, Cheng, Jian, Zhu, Haogang, and Tao, Dacheng
Abstract: Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video. The unclear boundaries of actions in videos often result in imprecise predictions of action boundaries by existing methods. To resolve this issue, we propose a one-stage framework named TriDet. First, we propose a Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. Then, we analyze the rank-loss problem (i.e. instant discriminability deterioration) in transformer-based methods and propose an efficient scalable-granularity perception (SGP) layer to mitigate this issue. To further push the limit of instant discriminability in the video backbone, we leverage the strong representation capability of pretrained large models and investigate their performance on TAD. Last, considering the adequate spatial-temporal context for classification, we design a decoupled feature pyramid network with separate feature pyramids to incorporate rich spatial context from the large model for localization. Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets, including hierarchical (multilabel) TAD datasets.
Comment: An extended version of the CVPR paper arXiv:2303.07347, submitted to IJCV
Published: 2023

16. SoccerNet 2022 Challenges Results

Author: Giancola, Silvio, Cioppa, Anthony, Deliège, Adrien, Magera, Floriane, Somers, Vladimir, Kang, Le, Zhou, Xin, Barnich, Olivier, De Vleeschouwer, Christophe, Alahi, Alexandre, Ghanem, Bernard, Van Droogenbroeck, Marc, Darwish, Abdulrahman, Maglo, Adrien, Clapés, Albert, Luyts, Andreas, Boiarov, Andrei, Xarles, Artur, Orcesi, Astrid, Shah, Avijit, Fan, Baoyu, Comandur, Bharath, Chen, Chen, Zhang, Chen, Zhao, Chen, Lin, Chengzhi, Chan, Cheuk-Yiu, Hui, Chun Chuen, Li, Dengjie, Yang, Fan, Liang, Fan, Da, Fang, Yan, Feng, Yu, Fufu, Wang, Guanshuo, Chan, H. Anthony, Zhu, He, Kan, Hongwei, Chu, Jiaming, Hu, Jianming, Gu, Jianyang, Chen, Jin, Soares, João V. B., Theiner, Jonas, De Corte, Jorge, Brito, José Henrique, Zhang, Jun, Li, Junjie, Liang, Junwei, Shen, Leqi, Ma, Lin, Chen, Lingchi, Santos Marques, Miguel, Azatov, Mike, Kasatkin, Nikita, Wang, Ning, Jia, Qiong, Pham, Quoc Cuong, Ewerth, Ralph, Song, Ran, Li, Rengang, Gade, Rikke, Debien, Ruben, Zhang, Runze, Lee, Sangrok, Escalera, Sergio, Jiang, Shan, Odashima, Shigeyuki, Chen, Shimin, Masui, Shoichi, Ding, Shouhong, Chan, Sin-wai, Chen, Siyu, El-Shabrawy, Tallal, He, Tao, Moeslund, Thomas B., Siu, Wan-Chi, Zhang, Wei, Li, Wei, Wang, Xiangwei, Tan, Xiao, Li, Xiaochuan, Wei, Xiaolin, Ye, Xiaoqing, Liu, Xing, Wang, Xinying, Guo, Yandong, Zhao, Yaqian, Yu, Yi, Li, Yingying, He, Yue, Zhong, Yujie, Guo, Zhenhua, and Li, Zhiheng
Abstract: The SoccerNet 2022 challenges were the second annual video understanding challenges organized by the SoccerNet team. In 2022, the challenges were composed of 6 vision-based tasks: (1) action spotting, focusing on retrieving action timestamps in long untrimmed videos, (2) replay grounding, focusing on retrieving the live moment of an action shown in a replay, (3) pitch localization, focusing on detecting line and goal part elements, (4) camera calibration, dedicated to retrieving the intrinsic and extrinsic camera parameters, (5) player re-identification, focusing on retrieving the same players across multiple views, and (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams. Compared to last year's challenges, tasks (1-2) had their evaluation metrics redefined to consider tighter temporal accuracies, and tasks (3-6) were novel, including their underlying data and annotations. More information on the tasks, challenges and leaderboards are available on https://www.soccer-net.org. Baselines and development kits are available on https://github.com/SoccerNet.
Published: 2022

17. DiP: Learning Discriminative Implicit Parts for Person Re-Identification

Author: Li, Dengjie, Chen, Siyu, Zhong, Yujie, and Ma, Lin
Abstract: In person re-identification (ReID) tasks, many works explore the learning of part features to improve the performance over global image features. Existing methods explicitly extract part features by either using a hand-designed image division or keypoints obtained with external visual systems. In this work, we propose to learn Discriminative implicit Parts (DiPs) which are decoupled from explicit body parts. Therefore, DiPs can learn to extract any discriminative features that help distinguish identities, including features beyond predefined body parts (such as accessories). Moreover, we propose a novel implicit position to give a geometric interpretation for each DiP. The implicit position can also serve as a learning signal to encourage DiPs to be more position-equivariant with the identity in the image. Lastly, an additional DiP weighting is introduced to handle invisible or occluded cases and further improve the feature representation of DiPs. Extensive experiments show that the proposed method achieves state-of-the-art performance on multiple person ReID benchmarks.
Published: 2022

18. AeDet: Azimuth-invariant Multi-view 3D Object Detection

Author: Feng, Chengjian, Jie, Zequn, Zhong, Yujie, Chu, Xiangxiang, and Ma, Lin
Abstract: Recent LSS-based multi-view 3D object detection has made tremendous progress by processing the features in Bird's-Eye-View (BEV) via the convolutional detector. However, the typical convolution ignores the radial symmetry of the BEV features and increases the difficulty of the detector optimization. To preserve the inherent property of the BEV features and ease the optimization, we propose an azimuth-equivariant convolution (AeConv) and an azimuth-equivariant anchor. The sampling grid of AeConv is always in the radial direction, thus it can learn azimuth-invariant BEV features. The proposed anchor enables the detection head to learn to predict azimuth-irrelevant targets. In addition, we introduce a camera-decoupled virtual depth to unify the depth prediction for the images with different camera intrinsic parameters. The resultant detector is dubbed Azimuth-equivariant Detector (AeDet). Extensive experiments are conducted on nuScenes, and AeDet achieves a 62.0% NDS, surpassing the recent multi-view 3D object detectors such as PETRv2 and BEVDepth by a large margin. Project page: https://fcjian.github.io/aedet.
Comment: CVPR2023
Published: 2022

19. Contrastive Video-Language Learning with Fine-grained Frame Sampling

Author: Wang, Zixu, Zhong, Yujie, Miao, Yishu, Ma, Lin, and Specia, Lucia
Abstract: Despite recent progress in video and language representation learning, the weak or sparse correspondence between the two modalities remains a bottleneck in the area. Most video-language models are trained via pair-level loss to predict whether a pair of video and text is aligned. However, even in paired video-text segments, only a subset of the frames are semantically relevant to the corresponding text, with the remainder representing noise; the ratio of noisy frames is higher for longer videos. We propose FineCo (Fine-grained Contrastive Loss for Frame Sampling), an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames. It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence. Building on the well-established VideoCLIP model as a starting point, FineCo achieves state-of-the-art performance on YouCookII, a text-video retrieval benchmark with long videos. FineCo also achieves competitive results on text-video retrieval (MSR-VTT), and video question answering datasets (MSR-VTT QA and MSR-VTT MC) with shorter videos.
Comment: AACL-IJCNLP 2022
Published: 2022

20. CounTR: Transformer-based Generalised Visual Counting

Author: Liu, Chang, Zhong, Yujie, Zisserman, Andrew, and Xie, Weidi
Abstract: In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars", i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) We introduce a novel transformer-based architecture for generalised visual object counting, termed Counting Transformer (CounTR), which explicitly captures the similarity between image patches or with given "exemplars" with the attention mechanism; (2) We adopt a two-stage training regime that first pre-trains the model with self-supervised learning and then applies supervised fine-tuning; (3) We propose a simple, scalable pipeline for synthesizing training images with a large number of instances or with instances from different semantic categories, explicitly forcing the model to make use of the given "exemplars"; (4) We conduct thorough ablation studies on the large-scale counting benchmark, e.g. FSC-147, and demonstrate state-of-the-art performance on both zero-shot and few-shot settings.
Comment: Accepted by BMVC2022
Published: 2022

21. ReAct: Temporal Action Detection with Relational Queries

Author: Shi, Dingfeng, Zhong, Yujie, Cao, Qiong, Zhang, Jing, Ma, Lin, Li, Jia, and Tao, Dacheng
Abstract: This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries, similar to DETR, which has shown great success in object detection. However, the framework suffers from several problems if directly applied to TAD: the insufficient exploration of inter-query relation in the decoder, the inadequate classification training due to a limited number of training samples, and the unreliable classification scores at inference. To this end, we first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations. Moreover, we propose two losses to facilitate and stabilize the training of action classification. Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries. The proposed method, named ReAct, achieves state-of-the-art performance on THUMOS14, with much lower computational costs than previous methods. Besides, extensive ablation studies are conducted to verify the effectiveness of each proposed component. The code is available at https://github.com/sssste/React.
Comment: ECCV2022
Published: 2022

22. DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers

Author: Chen, Xianing, Cao, Qiong, Zhong, Yujie, Zhang, Jing, Gao, Shenghua, and Tao, Dacheng
Abstract: Transformers are successfully applied to computer vision due to their powerful modeling capacity with self-attention. However, the excellent performance of transformers heavily depends on an enormous number of training images. Thus, a data-efficient transformer solution is urgently needed. In this work, we propose an early knowledge distillation framework, termed DearKD, to improve the data efficiency required by transformers. Our DearKD is a two-stage framework that first distills the inductive biases from the early intermediate layers of a CNN and then gives the transformer full play by training without distillation. Further, our DearKD can be readily applied to the extreme data-free case where no real images are available. In this case, we propose a boundary-preserving intra-divergence loss based on DeepInversion to further close the performance gap against the full-data counterpart. Extensive experiments on ImageNet, partial ImageNet, data-free setting and other downstream tasks prove the superiority of DearKD over its baselines and state-of-the-art methods.
Comment: CVPR 2022
Published: 2022

23. PromptDet: Towards Open-vocabulary Detection using Uncurated Images

Author: Feng, Chengjian, Zhong, Yujie, Jie, Zequn, Chu, Xiangxiang, Ren, Haibing, Wei, Xiaolin, Xie, Weidi, and Ma, Lin
Abstract: The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector, where the class-agnostic object proposals are classified with a text encoder from a pre-trained visual-language model; (ii) to pair the visual latent space (of RPN box proposals) with that of the pre-trained text encoder, we propose the idea of regional prompt learning to align the textual embedding space with regional visual object features; (iii) to scale up the learning procedure towards detecting a wider spectrum of objects, we exploit the available online resource via a novel self-training framework, which allows training the proposed detector on a large corpus of noisy uncurated web images. Lastly, (iv) to evaluate our proposed detector, termed PromptDet, we conduct extensive experiments on the challenging LVIS and MS-COCO datasets. PromptDet shows superior performance over existing approaches with fewer additional training images and zero manual annotations whatsoever. Project page with code: https://fcjian.github.io/promptdet.
Comment: ECCV2022
Published: 2022

24. Cross-Architecture Self-supervised Video Representation Learning

Author: Guo, Sheng, Xiong, Zihua, Zhong, Yujie, Wang, Limin, Guo, Xiaobo, Han, Bing, and Huang, Weilin
Abstract: In this paper, we present a new cross-architecture contrastive learning (CACL) framework for self-supervised video representation learning. CACL consists of a 3D CNN and a video transformer which are used in parallel to generate diverse positive pairs for contrastive learning. This allows the model to learn strong representations from such diverse yet meaningful pairs. Furthermore, we introduce a temporal self-supervised learning module able to predict an edit distance explicitly between two video sequences in the temporal order. This enables the model to learn a rich temporal representation that strongly complements the video-level representation learned by the CACL. We evaluate our method on the tasks of video retrieval and action recognition on UCF101 and HMDB51 datasets, where our method achieves excellent performance, surpassing the state-of-the-art methods such as VideoMoCo and MoCo+BE by a large margin. The code is made available at https://github.com/guoshengcv/CACL.
Comment: Accepted to CVPR2022
Published: 2022

25. InsCLR: Improving Instance Retrieval with Self-Supervision

Author: Deng, Zelu, Zhong, Yujie, Guo, Sheng, and Huang, Weilin
Abstract: This work aims at improving instance retrieval with self-supervision. We find that fine-tuning using the recently developed self-supervised learning (SSL) methods, such as SimCLR and MoCo, fails to improve the performance of instance retrieval. In this work, we identify that the learnt representations for instance retrieval should be invariant to large variations in viewpoint, background, etc., whereas self-augmented positives applied by the current SSL methods cannot provide strong enough signals for learning robust instance-level representations. To overcome this problem, we propose InsCLR, a new SSL method that builds on the instance-level contrast, to learn the intra-class invariance by dynamically mining meaningful pseudo positive samples from both mini-batches and a memory bank during training. Extensive experiments demonstrate that InsCLR achieves similar or even better performance than the state-of-the-art SSL methods on instance retrieval. Code is available at https://github.com/zeludeng/insclr.
Comment: Accepted by AAAI 2022
Published: 2021

26. OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification

Author: Chen, Xianing, Xu, Chunlin, Cao, Qiong, Xu, Jialang, Zhong, Yujie, Xu, Jiale, Li, Zhengxin, Wang, Jingya, and Gao, Shenghua
Abstract: Transformers have shown preferable performance on many vision tasks. However, for the task of person re-identification (ReID), vanilla transformers leave the rich contexts on high-order feature relations under-exploited and deteriorate local feature details, which are insufficient due to the dramatic variations of pedestrians. In this work, we propose an Omni-Relational High-Order Transformer (OH-Former) to model omni-relational features for ReID. First, to strengthen the capacity of visual representation, instead of obtaining the attention matrix based on pairs of queries and isolated keys at each spatial location, we take a step further to model high-order statistics information for the non-local mechanism. We share the attention weights in the corresponding layer of each order with a prior mixing mechanism to reduce the computation cost. Then, a convolution-based local relation perception module is proposed to extract the local relations and 2D position information. The experimental results of our model are promising, showing state-of-the-art performance on Market-1501, DukeMTMC, MSMT17 and Occluded-Duke datasets.
Published: 2021

27. Exploring Classification Equilibrium in Long-Tailed Object Detection

Author: Feng, Chengjian, Zhong, Yujie, and Huang, Weilin
Abstract: Conventional detectors tend to make imbalanced classifications and suffer a performance drop when the distribution of the training data is severely skewed. In this paper, we propose to use the mean classification score to indicate the classification accuracy for each category during training. Based on this indicator, we balance the classification via an Equilibrium Loss (EBL) and a Memory-augmented Feature Sampling (MFS) method. Specifically, EBL increases the intensity of the adjustment of the decision boundary for the weak classes by a designed score-guided loss margin between any two classes. On the other hand, MFS improves the frequency and accuracy of the adjustment of the decision boundary for the weak classes through over-sampling the instance features of those classes. Therefore, EBL and MFS work collaboratively for finding the classification equilibrium in long-tailed detection, and dramatically improve the performance of tail classes while maintaining or even improving the performance of head classes. We conduct experiments on LVIS using Mask R-CNN with various backbones including ResNet-50-FPN and ResNet-101-FPN to show the superiority of the proposed method. It improves the detection performance of tail classes by 15.6 AP, and outperforms the most recent long-tailed object detectors by more than 1 AP. Code is available at https://github.com/fcjian/LOCE.
Comment: ICCV2021
Published: 2021

28. TOOD: Task-aligned One-stage Object Detection

Author: Feng, Chengjian, Zhong, Yujie, Gao, Yu, Scott, Matthew R., and Huang, Weilin
Abstract: One-stage object detection is commonly implemented by optimizing two sub-tasks: object classification and localization, using heads with two parallel branches, which might lead to a certain level of spatial misalignment in predictions between the two tasks. In this work, we propose a Task-aligned One-stage Object Detection (TOOD) that explicitly aligns the two tasks in a learning-based manner. First, we design a novel Task-aligned Head (T-Head) which offers a better balance between learning task-interactive and task-specific features, as well as a greater flexibility to learn the alignment via a task-aligned predictor. Second, we propose Task Alignment Learning (TAL) to explicitly pull closer (or even unify) the optimal anchors for the two tasks during training via a designed sample assignment scheme and a task-aligned loss. Extensive experiments are conducted on MS-COCO, where TOOD achieves a 51.1 AP at single-model single-scale testing. This surpasses the recent one-stage detectors by a large margin, such as ATSS (47.7 AP), GFL (48.2 AP), and PAA (49.0 AP), with fewer parameters and FLOPs. Qualitative results also demonstrate the effectiveness of TOOD for better aligning the tasks of object classification and localization. Code is available at https://github.com/fcjian/TOOD.
Comment: ICCV2021 Oral
Published: 2021

29. Mutually-aware Sub-Graphs Differentiable Architecture Search

Author: Tan, Haoxian, Guo, Sheng, Zhong, Yujie, Scott, Matthew R., and Huang, Weilin
Abstract: Differentiable architecture search is prevalent in the field of NAS because of its simplicity and efficiency, where two paradigms, multi-path algorithms and single-path methods, dominate. The multi-path framework (e.g. DARTS) is intuitive but suffers from memory usage and training collapse. Single-path methods (e.g. GDAS and ProxylessNAS) mitigate the memory issue and shrink the gap between searching and evaluation but sacrifice performance. In this paper, we propose a conceptually simple yet efficient method to bridge these two paradigms, referred to as Mutually-aware Sub-Graphs Differentiable Architecture Search (MSG-DAS). The core of our framework is a differentiable Gumbel-TopK sampler that produces multiple mutually exclusive single-path sub-graphs. To alleviate the more severe skip-connect issue brought by the multiple sub-graphs setting, we propose a Dropblock-Identity module to stabilize the optimization. To make best use of the available models (super-net and sub-graphs), we introduce a memory-efficient super-net guidance distillation to improve training. The proposed framework strikes a balance between flexible memory usage and searching quality. We demonstrate the effectiveness of our methods on ImageNet and CIFAR10, where the searched models show performance comparable to the most recent approaches.
Published: 2021

30. Unchain the Search Space with Hierarchical Differentiable Architecture Search

Author: Liu, Guanting, Zhong, Yujie, Guo, Sheng, Scott, Matthew R., and Huang, Weilin
Abstract: Differentiable architecture search (DAS) has made great progress in searching for high-performance architectures with reduced computational cost. However, DAS-based methods mainly focus on searching for a repeatable cell structure, which is then stacked sequentially in multiple stages to form the networks. This configuration significantly reduces the search space, and ignores the importance of connections between the cells. To overcome this limitation, in this paper, we propose a Hierarchical Differentiable Architecture Search (H-DAS) that performs architecture search both at the cell level and at the stage level. Specifically, the cell-level search space is relaxed so that the networks can learn stage-specific cell structures. For the stage-level search, we systematically study the architectures of stages, including the number of cells in each stage and the connections between the cells. Based on insightful observations, we design several search rules and losses, and manage to search for better stage-level architectures. Such a hierarchical search space greatly improves the performance of the networks without introducing expensive search cost. Extensive experiments on CIFAR10 and ImageNet demonstrate the effectiveness of the proposed H-DAS. Moreover, the searched stage-level architectures can be combined with the cell structures searched by existing DAS methods to further boost the performance. Code is available at: https://github.com/MalongTech/research-HDAS
Comment: To appear in AAAI2021. Code is available
Published: 2021

31. Representation Sharing for Fast Object Detector Search and Beyond

Author: Zhong, Yujie, Deng, Zelu, Guo, Sheng, Scott, Matthew R., and Huang, Weilin
Abstract: Region Proposal Network (RPN) provides strong support for handling the scale variation of objects in two-stage object detection. For one-stage detectors which do not have RPN, it is more demanding to have powerful sub-networks capable of directly capturing objects of unknown sizes. To enhance such capability, we propose an extremely efficient neural architecture search method, named Fast And Diverse (FAD), to better explore the optimal configuration of receptive fields and convolution types in the sub-networks for one-stage detectors. FAD consists of a designed search space and an efficient architecture search algorithm. The search space contains a rich set of diverse transformations designed specifically for object detection. To cope with the designed search space, a novel search algorithm termed Representation Sharing (RepShare) is proposed to effectively identify the best combinations of the defined transformations. In our experiments, FAD obtains prominent improvements on two types of one-stage detectors with various backbones. In particular, our FAD detector achieves 46.4 AP on MS-COCO (under single-scale testing), outperforming the state-of-the-art detectors, including the most recent NAS-based detectors, Auto-FPN (searched for 16 GPU-days) and NAS-FCOS (28 GPU-days), while significantly reducing the search cost to 0.6 GPU-days. Beyond object detection, we further demonstrate the generality of FAD on the more challenging instance segmentation, and expect it to benefit more tasks.
Comment: ECCV 2020 accepted
Published: 2020

32. Compact Deep Aggregation for Set Retrieval

Author: Zhong, Yujie, Arandjelović, Relja, and Zisserman, Andrew
Abstract: The objective of this work is to learn a compact embedding of a set of descriptors that is suitable for efficient retrieval and ranking, whilst maintaining discriminability of the individual descriptors. We focus on a specific example of this general problem -- that of retrieving images containing multiple faces from a large scale dataset of images. Here the set consists of the face descriptors in each image, and given a query for multiple identities, the goal is then to retrieve, in order, images which contain all the identities, all but one, etc. To this end, we make the following contributions: first, we propose a CNN architecture -- SetNet -- to achieve the objective: it learns face descriptors and their aggregation over a set to produce a compact fixed length descriptor designed for set retrieval, and the score of an image is a count of the number of identities that match the query; second, we show that this compact descriptor has minimal loss of discriminability up to two faces per image, and degrades slowly after that -- far exceeding a number of baselines; third, we explore the speed vs. retrieval quality trade-off for set retrieval using this compact descriptor; and, finally, we collect and annotate a large dataset of images containing varying numbers of celebrities, which we use for evaluation and which is publicly released.
Comment: 20 pages
Published: 2020

33. Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision

Author: Zhong, Yujie, Xie, Linhai, Wang, Sen, Specia, Lucia, and Miao, Yishu
Abstract: In this paper, we teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations. Firstly, we define a self-supervised learning framework that captures the cross-modal information. A novel adversarial learning module is then introduced to explicitly handle the noise in the natural videos, where the subtitle sentences are not guaranteed to correspond strongly to the video snippets. For training and evaluation, we contribute a new dataset 'ApartmenTour' that contains a large number of online videos and subtitles. We carry out experiments on the bidirectional retrieval tasks between sentences and videos, and the results demonstrate that our proposed model achieves state-of-the-art performance on both retrieval tasks and exceeds several strong baselines. The dataset can be downloaded at https://github.com/zyj-13/WAL.
Comment: NeurIPS 2020 Self-Supervised Learning Workshop
Published: 2020

34. Preparation of a poly(acrylic acid) based hydrogel with fast adsorption rate and high adsorption capacity for the removal of cationic dyes

Author: Yuan, Zhenyu, Wang, Jie, Wang, Y., Liu, Q., Zhong, Yujie, Wang, Yu, Li, Li, Lincoln, Stephen F., and Guo, Xuhong
Abstract: A biocompatible Dex-MA/PAA hydrogel was prepared through copolymerization of glycidyl methacrylate substituted dextran (Dex-MA) with acrylic acid (AA), which was applied as the adsorbent to remove cationic dyes from aqueous solutions. Dex-MA/PAA hydrogel presented a fast adsorption rate and the removal efficiency of Methylene Blue (MB) and Crystal Violet (CV) reached 93.9% and 86.4%, respectively, within one minute at an initial concentration of 50 mg L⁻¹. The adsorption equilibrium data fitted the Sips isotherm model well with high adsorption capacities of 1994 mg g⁻¹ for MB and 2390 mg g⁻¹ for CV. Besides, dye adsorption occurred efficiently over the pH range 3-10 and the temperature range 20-60 °C. Moreover, the removal efficiencies for MB and CV were still >95% even after five adsorption/desorption cycles, which indicates the robust nature of the Dex-MA/PAA hydrogel and its potential as an eco-friendly adsorbent for water treatment., ChemE/Advanced Soft Matter
Published: 2019

35. GhostVLAD for set-based face recognition

Author: Zhong, Yujie, Arandjelović, Relja, and Zisserman, Andrew
Abstract: The objective of this paper is to learn a compact representation of image sets for template-based face recognition. We make the following contributions: first, we propose a network architecture which aggregates and embeds the face descriptors produced by deep convolutional neural networks into a compact fixed-length representation. This compact representation requires minimal memory storage and enables efficient similarity computation. Second, we propose a novel GhostVLAD layer that includes ghost clusters, which do not contribute to the aggregation. We show that a quality weighting on the input faces emerges automatically such that informative images contribute more than those with low quality, and that the ghost clusters enhance the network's ability to deal with poor quality images. Third, we explore how input feature dimension, number of clusters and different training techniques affect the recognition performance. Given this analysis, we train a network that far exceeds the state-of-the-art on the IJB-B face recognition dataset. This is currently one of the most challenging public benchmarks, and we surpass the state-of-the-art on both the identification and verification protocols.
Comment: Accepted by ACCV 2018
Published: 2018
