Author: "Sun, Peize" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Sun, Peize"' showing total 39 results


1. Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

Author: Chen, Huayu, Su, Hang, Sun, Peize, and Zhu, Jun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Classifier-Free Guidance (CFG) is a critical technique for enhancing the sample quality of visual generative models. However, in autoregressive (AR) multi-modal generation, CFG introduces design inconsistencies between language and visual content, contradicting the design philosophy of unifying different modalities for visual AR. Motivated by language model alignment methods, we propose Condition Contrastive Alignment (CCA) to facilitate guidance-free AR visual generation with high performance and analyze its theoretical connection with guided sampling methods. Unlike guidance methods that alter the sampling process to achieve the ideal sampling distribution, CCA directly fine-tunes pretrained models to fit the same distribution target. Experimental results show that CCA can significantly enhance the guidance-free performance of all tested models with just one epoch of fine-tuning (~1% of pretraining epochs) on the pretraining dataset, on par with guided sampling methods. This largely removes the need for guided sampling in AR visual generation and cuts the sampling cost by half. Moreover, by adjusting training parameters, CCA can achieve trade-offs between sample diversity and fidelity similar to CFG. This experimentally confirms the strong theoretical connection between language-targeted alignment and visual-targeted guidance methods, unifying two previously independent research fields. Code and model weights: https://github.com/thu-ml/CCA.
Published: 2024

2. ControlAR: Controllable Image Generation with Autoregressive Models

Author: Li, Zongming, Cheng, Tianheng, Chen, Shoufa, Sun, Peize, Shen, Haocheng, Ran, Longjin, Chen, Xiaoxin, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. Although a natural approach, inspired by advancements in Large Language Models, is to tokenize control images into tokens and prefill them into the autoregressive model before decoding image tokens, it still falls short in generation quality compared to ControlNet and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. Firstly, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., canny edges or depth maps) into control tokens. Then ControlAR exploits the conditional decoding method to generate the next image token conditioned on the per-token fusion between control and image tokens, similar to positional encodings. Compared to prefilling tokens, using conditional decoding significantly strengthens the control capability of AR models while maintaining the model's efficiency. Furthermore, the proposed ControlAR surprisingly empowers AR models with arbitrary-resolution image generation via conditional decoding and specific controls. Extensive experiments demonstrate the controllability of the proposed ControlAR for autoregressive control-to-image generation across diverse inputs, including edges, depths, and segmentation masks. Furthermore, both quantitative and qualitative results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models, e.g., ControlNet++. Code, models, and demo will soon be available at https://github.com/hustvl/ControlAR.
Comment: Preprint. Work in progress
Published: 2024

3. IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model

Author: Ji, Yatai, Zhang, Shilong, Wu, Jie, Sun, Peize, Chen, Weifeng, Xiao, Xuefeng, Yang, Sidi, Yang, Yujiu, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: The rapid advancement of Large Vision-Language Models (LVLMs) has demonstrated a spectrum of emergent capabilities. Nevertheless, current models only focus on the visual content of a single scenario, while their ability to associate instances across different scenes has not yet been explored, which is essential for understanding complex visual content, such as movies with multiple characters and intricate plots. Towards movie understanding, a critical initial step for LVLMs is to unleash the potential of character identity memory and recognition across multiple visual scenarios. To achieve this goal, we propose visual instruction tuning with ID reference and develop an ID-Aware Large Vision-Language Model, IDA-VLM. Furthermore, our research introduces a novel benchmark, MM-ID, to examine LVLMs on instance ID memory and recognition across four dimensions: matching, location, question-answering, and captioning. Our findings highlight the limitations of existing LVLMs in recognizing and associating instance identities with ID reference. This paper paves the way for future artificial intelligence systems to process multi-identity visual inputs, thereby facilitating the comprehension of complex visual narratives like movies.
Published: 2024

4. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Author: Sun, Peize, Jiang, Yi, Chen, Shoufa, Zhang, Shilong, Peng, Bingyue, Luo, Ping, and Yuan, Zehuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce LlamaGen, a new family of image generation models that apply the original "next-token prediction" paradigm of large language models to the visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaled properly. We reexamine design spaces of image tokenizers, scalability properties of image generation models, and their training data quality. The outcome of this exploration consists of: (1) An image tokenizer with downsample ratio of 16, reconstruction quality of 0.94 rFID and codebook usage of 97% on ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on ImageNet 256x256 benchmarks, outperforming popular diffusion models such as LDM and DiT. (3) A text-conditional image generation model with 775M parameters, from two-stage training on LAION-COCO and high aesthetics quality images, demonstrating competitive performance of visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models and achieve 326% - 414% speedup. We release all models and code to support the open-source community of visual generation and multimodal foundation models.
Comment: Codes and models: https://github.com/FoundationVision/LlamaGen
Published: 2024

5. RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

Author: Mu, Yao, Chen, Junting, Zhang, Qinglong, Chen, Shoufa, Yu, Qiaojun, Ge, Chongjian, Chen, Runjian, Liang, Zhixuan, Hu, Mengkang, Tao, Chaofan, Sun, Peize, Yu, Haibao, Yang, Chao, Shao, Wenqi, Wang, Wenhai, Dai, Jifeng, Qiao, Yu, Ding, Mingyu, and Luo, Ping
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal large language models for high-level understanding, it remains challenging to translate these conceptual understandings into detailed robotic actions while achieving generalization across various scenarios. In this paper, we propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX. RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints, and applies code generation to introduce generalization ability across various robotics platforms. To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning. Extensive experiments demonstrate that RoboCodeX achieves state-of-the-art performance in both simulators and real robots on four different kinds of manipulation tasks and one navigation task.
Published: 2024

6. Enhancing Your Trained DETRs with Box Refinement

Author: Chen, Yiqun, Chen, Qiang, Sun, Peize, Chen, Shoufa, Wang, Jingdong, and Cheng, Jian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a conceptually simple, efficient, and general framework for localization problems in DETR-like models. We add plugins to well-trained models instead of inefficiently designing new models and training them from scratch. The method, called RefineBox, refines the outputs of DETR-like detectors by lightweight refinement networks. RefineBox is easy to implement and train as it only leverages the features and predicted boxes from the well-trained detection models. Our method is also efficient as we freeze the trained detectors during training. In addition, we can easily generalize RefineBox to various trained detection models without any modification. We conduct experiments on COCO and LVIS 1.0. Experimental results indicate the effectiveness of our RefineBox for DETR and its representative variants (Figure 1). For example, the performance gains for DETR, Conditional-DETR, DAB-DETR, and DN-DETR are 2.4 AP, 2.5 AP, 1.9 AP, and 1.6 AP, respectively. We hope our work will bring the attention of the detection community to the localization bottleneck of current DETR-like models and highlight the potential of the RefineBox framework. Code and models will be publicly available at: https://github.com/YiqunChen1999/RefineBox.
Published: 2023

7. Semantic-SAM: Segment and Recognize Anything at Any Granularity

Author: Li, Feng, Zhang, Hao, Sun, Peize, Zou, Xueyan, Liu, Shilong, Yang, Jianwei, Li, Chunyuan, Zhang, Lei, and Gao, Jianfeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we introduce Semantic-SAM, a universal image segmentation model that can segment and recognize anything at any desired granularity. Our model offers two key advantages: semantic-awareness and granularity-abundance. To achieve semantic-awareness, we consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts. This allows our model to capture rich semantic information. For the multi-granularity capability, we propose a multi-choice learning scheme during training, enabling each click to generate masks at multiple levels that correspond to multiple ground-truth masks. Notably, this work represents the first attempt to jointly train a model on SA-1B, generic, and part segmentation datasets. Experimental results and visualizations demonstrate that our model successfully achieves semantic-awareness and granularity-abundance. Furthermore, combining SA-1B training with other segmentation tasks, such as panoptic and part segmentation, leads to performance improvements. We will provide code and a demo for further exploration and evaluation.
Published: 2023

8. GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

Author: Zhang, Shilong, Sun, Peize, Chen, Shoufa, Xiao, Min, Shao, Wenqi, Zhang, Wenwei, Liu, Yu, Chen, Kai, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual instruction tuning large language model (LLM) on image-text pairs has achieved general-purpose vision-language abilities. However, the lack of region-text pairs limits their advancements to fine-grained multimodal understanding. In this paper, we propose spatial instruction tuning, which introduces the reference to the region-of-interest (RoI) in the instruction. Before sending to LLM, the reference is replaced by RoI features and interleaved with language embeddings as a sequence. Our model GPT4RoI, trained on 7 region-text pair datasets, brings an unprecedented interactive and conversational experience compared to previous image-level models. (1) Interaction beyond language: Users can interact with our model by both language and drawing bounding boxes to flexibly adjust the referring granularity. (2) Versatile multimodal abilities: A variety of attribute information within each RoI can be mined by GPT4RoI, e.g., color, shape, material, action, etc. Furthermore, it can reason about multiple RoIs based on common sense. On the Visual Commonsense Reasoning (VCR) dataset, GPT4RoI achieves a remarkable accuracy of 81.6%, surpassing all existing models by a significant margin (the second place is 75.6%) and almost reaching human-level performance of 85.0%. The code, dataset, and demo can be found at https://github.com/jshilong/GPT4RoI.
Comment: Code has been released at https://github.com/jshilong/GPT4RoI
Published: 2023

9. Going Denser with Open-Vocabulary Part Segmentation

Author: Sun, Peize, Chen, Shoufa, Zhu, Chenchen, Xiao, Fanyi, Luo, Ping, Xie, Saining, and Yan, Zhicheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Object detection has been expanded from a limited number of categories to open vocabulary. Moving forward, a complete intelligent vision system requires understanding more fine-grained object descriptions, namely object parts. In this paper, we propose a detector with the ability to predict both open-vocabulary objects and their part segmentation. This ability comes from two designs. First, we train the detector jointly on part-level, object-level and image-level data to build the multi-granularity alignment between language and image. Second, we parse the novel object into its parts by its dense semantic correspondence with the base object. These two designs enable the detector to largely benefit from various data sources and foundation models. In open-vocabulary part segmentation experiments, our method outperforms the baseline by 3.3 to 7.3 mAP in cross-dataset generalization on PartImageNet, and improves the baseline by 7.3 novel AP50 in cross-category generalization on Pascal Part. Finally, we train a detector that generalizes to a wide range of part segmentation datasets while achieving better performance than dataset-specific training.
Comment: Code is available at https://github.com/facebookresearch/VLPart
Published: 2023

10. ByteTrackV2: 2D and 3D Multi-Object Tracking by Associating Every Detection Box

Author: Zhang, Yifu, Wang, Xinggang, Ye, Xiaoqing, Zhang, Wei, Lu, Jincheng, Tan, Xiao, Ding, Errui, Sun, Peize, and Wang, Jingdong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multi-object tracking (MOT) aims at estimating bounding boxes and identities of objects across video frames. Detection boxes serve as the basis of both 2D and 3D MOT. The inevitable changing of detection scores leads to object missing after tracking. We propose a hierarchical data association strategy to mine the true objects in low-score detection boxes, which alleviates the problems of object missing and fragmented trajectories. The simple and generic data association strategy shows effectiveness under both 2D and 3D settings. In 3D scenarios, it is much easier for the tracker to predict object velocities in the world coordinate. We propose a complementary motion prediction strategy that incorporates the detected velocities with a Kalman filter to address the problem of abrupt motion and short-term disappearing. ByteTrackV2 leads the nuScenes 3D MOT leaderboard in both camera (56.4% AMOTA) and LiDAR (70.1% AMOTA) modalities. Furthermore, it is nonparametric and can be integrated with various detectors, making it appealing in real applications. The source code is released at https://github.com/ifzhang/ByteTrack-V2.
Comment: Code is available at https://github.com/ifzhang/ByteTrack-V2. arXiv admin note: text overlap with arXiv:2110.06864; substantial text overlap with arXiv:2203.06424 by other authors
Published: 2023

11. Learning Object-Language Alignments for Open-Vocabulary Object Detection

Author: Lin, Chuang, Sun, Peize, Jiang, Yi, Luo, Ping, Qu, Lizhen, Haffari, Gholamreza, Yuan, Zehuan, and Cai, Jianfei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing object detection methods are bounded in a fixed-set vocabulary by costly labeled data. When dealing with novel categories, the model has to be retrained with more bounding box annotations. Natural language supervision is an attractive alternative for its annotation-free attributes and broader object concepts. However, learning open-vocabulary object detection from language is challenging since image-text pairs do not contain fine-grained object-language alignments. Previous solutions rely on either expensive grounding annotations or distilling classification-oriented vision models. In this paper, we propose a novel open-vocabulary object detection framework directly learning from image-text pair data. We formulate object-language alignment as a set matching problem between a set of image region features and a set of word embeddings. It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way. Extensive experiments on two benchmark datasets, COCO and LVIS, demonstrate our superior performance over the competing approaches on novel categories, e.g. achieving 32.0% mAP on COCO and 21.7% mask mAP on LVIS. Code is available at: https://github.com/clin1223/VLDet.
Comment: Technical Report
Published: 2022

12. DiffusionDet: Diffusion Model for Object Detection

Author: Chen, Shoufa, Sun, Peize, Song, Yibing, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose DiffusionDet, a new framework that formulates object detection as a denoising diffusion process from noisy boxes to object boxes. During the training stage, object boxes diffuse from ground-truth boxes to random distribution, and the model learns to reverse this noising process. In inference, the model refines a set of randomly generated boxes to the output results in a progressive way. Our work possesses an appealing property of flexibility, which enables the dynamic number of boxes and iterative evaluation. The extensive experiments on the standard benchmarks show that DiffusionDet achieves favorable performance compared to previous well-established detectors. For example, DiffusionDet achieves 5.3 AP and 4.8 AP gains when evaluated with more boxes and iteration steps, under a zero-shot transfer setting from COCO to CrowdHuman. Our code is available at https://github.com/ShoufaChen/DiffusionDet.
Comment: ICCV2023 (Oral), Camera-ready
Published: 2022

13. Towards Grand Unification of Object Tracking

Author: Yan, Bin, Jiang, Yi, Sun, Peize, Wang, Dong, Yuan, Zehuan, Luo, Ping, and Lu, Huchuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a unified method, termed Unicorn, that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters. Due to the fragmented definitions of the object tracking problem itself, most existing trackers are developed to address a single or part of tasks and overspecialize on the characteristics of specific tasks. By contrast, Unicorn provides a unified solution, adopting the same input, backbone, embedding, and head across all tracking tasks. For the first time, we accomplish the great unification of the tracking network architecture and learning paradigm. Unicorn performs on-par or better than its task-specific counterparts in 8 tracking datasets, including LaSOT, TrackingNet, MOT17, BDD100K, DAVIS16-17, MOTS20, and BDD100K MOTS. We believe that Unicorn will serve as a solid step towards the general vision model. Code is available at https://github.com/MasterBin-IIAU/Unicorn.
Comment: ECCV2022 Oral
Published: 2022

14. Language as Queries for Referring Video Object Segmentation

Author: Wu, Jiannan, Jiang, Yi, Sun, Peize, Yuan, Zehuan, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the target object referred by a language expression in all video frames. In this work, we propose a simple and unified framework built upon Transformer, termed ReferFormer. It views the language as queries and directly attends to the most relevant regions in the video frames. Concretely, we introduce a small set of object queries conditioned on the language as the input to the Transformer. In this manner, all the queries are obligated to find the referred objects only. They are eventually transformed into dynamic kernels which capture the crucial object-level information, and play the role of convolution filters to generate the segmentation masks from feature maps. The object tracking is achieved naturally by linking the corresponding queries across frames. This mechanism greatly simplifies the pipeline and the end-to-end framework is significantly different from the previous methods. Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences show the effectiveness of ReferFormer. On Ref-Youtube-VOS, ReferFormer achieves 55.6 J&F with a ResNet-50 backbone without bells and whistles, which exceeds the previous state-of-the-art performance by 8.4 points. In addition, with the strong Swin-Large backbone, ReferFormer achieves the best J&F of 64.2 among all existing methods. Moreover, we show the impressive results of 55.0 mAP and 43.7 mAP on A2D-Sentences and JHMDB-Sentences respectively, which outperform the previous methods by a large margin. Code is publicly available at https://github.com/wjn922/ReferFormer.
Comment: 14 pages, accepted by CVPR2022
Published: 2022

15. DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion

Author: Sun, Peize, Cao, Jinkun, Jiang, Yi, Yuan, Zehuan, Bai, Song, Kitani, Kris, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: A typical pipeline for multi-object tracking (MOT) is to use a detector for object localization, followed by re-identification (re-ID) for object association. This pipeline is partially motivated by recent progress in both object detection and re-ID, and partially motivated by biases in existing tracking datasets, where most objects tend to have distinguishing appearance and re-ID models are sufficient for establishing associations. In response to such bias, we would like to re-emphasize that methods for multi-object tracking should also work when object appearance is not sufficiently discriminative. To this end, we propose a large-scale dataset for multi-human tracking, where humans have similar appearance, diverse motion and extreme articulation. As the dataset contains mostly group dancing videos, we name it "DanceTrack". We expect DanceTrack to provide a better platform to develop more MOT algorithms that rely less on visual discrimination and depend more on motion analysis. We benchmark several state-of-the-art trackers on our dataset and observe a significant performance drop on DanceTrack when compared against existing benchmarks. The dataset, project code and competition server are released at: https://github.com/DanceTrack.
Comment: add change log
Published: 2021

16. ByteTrack: Multi-Object Tracking by Associating Every Detection Box

Author: Zhang, Yifu, Sun, Peize, Jiang, Yi, Yu, Dongdong, Weng, Fucheng, Yuan, Zehuan, Luo, Ping, Liu, Wenyu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multi-object tracking (MOT) aims at estimating bounding boxes and identities of objects in videos. Most methods obtain identities by associating detection boxes whose scores are higher than a threshold. The objects with low detection scores, e.g. occluded objects, are simply thrown away, which causes non-negligible missed true objects and fragmented trajectories. To solve this problem, we present a simple, effective and generic association method, tracking by associating almost every detection box instead of only the high score ones. For the low score detection boxes, we utilize their similarities with tracklets to recover true objects and filter out the background detections. When applied to 9 different state-of-the-art trackers, our method achieves consistent improvement on IDF1 score ranging from 1 to 10 points. To advance the state-of-the-art performance of MOT, we design a simple and strong tracker, named ByteTrack. For the first time, we achieve 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA on the test set of MOT17 with 30 FPS running speed on a single V100 GPU. ByteTrack also achieves state-of-the-art performance on MOT20, HiEve and BDD100K tracking benchmarks. The source code, pre-trained models with deploy versions and tutorials of applying to other trackers are released at https://github.com/ifzhang/ByteTrack.
Published: 2021

17. Objects in Semantic Topology

Author: Yang, Shuo, Sun, Peize, Jiang, Yi, Xia, Xiaobo, Zhang, Ruiheng, Yuan, Zehuan, Wang, Changhu, Luo, Ping, and Xu, Min
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: A more realistic object detection paradigm, Open-World Object Detection, has attracted increasing research interest in the community recently. A qualified open-world object detector can not only identify objects of known categories, but also discover unknown objects, and incrementally learn to categorize them when their annotations progressively arrive. Previous works rely on independent modules to recognize unknown categories and perform incremental learning, respectively. In this paper, we provide a unified perspective: Semantic Topology. During the life-long learning of an open-world object detector, all object instances from the same category are assigned to their corresponding pre-defined node in the semantic topology, including the 'unknown' category. This constraint builds up discriminative feature representations and consistent relationships among objects, thus enabling the detector to distinguish unknown objects out of the known categories, as well as keeping learned features of known objects undistorted when learning new categories incrementally. Extensive experiments demonstrate that semantic topology, either randomly-generated or derived from a well-trained language model, could outperform the current state-of-the-art open-world object detectors by a large margin, e.g., the absolute open-set error is reduced from 7832 to 2546, exhibiting the inherent superiority of semantic topology on open-world object detection.
Comment: ICLR 2022
Published: 2021

18. Towards High-Quality Temporal Action Detection with Sparse Proposals

Author: Wu, Jiannan, Sun, Peize, Chen, Shoufa, Yang, Jiewen, Qi, Zihao, Ma, Lan, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Temporal Action Detection (TAD) is an essential and challenging topic in video understanding, aiming to localize the temporal segments containing human action instances and predict the action categories. Previous works rely heavily on dense candidates, either by designing varying anchors or by enumerating all the combinations of boundaries on video sequences; therefore, they involve complicated pipelines and sensitive hand-crafted designs. Recently, with the resurgence of Transformer, query-based methods have become a rising solution for their simplicity and flexibility. However, there still exists a performance gap between query-based methods and well-established methods. In this paper, we identify that the main challenge lies in the large variation of action duration and the ambiguous boundaries for short action instances; nevertheless, quadratic-computational global attention prevents query-based methods from building multi-scale feature maps. Towards high-quality temporal action detection, we introduce Sparse Proposals to interact with the hierarchical features. In our method, named SP-TAD, each proposal attends to a local segment feature in the temporal feature pyramid. The local interaction enables utilization of high-resolution features to preserve the details of action instances. Extensive experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds. E.g., we achieve the state-of-the-art performance on THUMOS14 (45.7% on mAP@0.6, 33.4% on mAP@0.7 and 53.5% on mAP@Avg) and competitive results on ActivityNet-1.3 (32.99% on mAP@Avg). Code will be made available at https://github.com/wjn922/SP-TAD.
Comment: 10 pages, 5 figures
Published: 2021

19. DetCo: Unsupervised Contrastive Learning for Object Detection

Author: Xie, Enze, Ding, Jian, Wang, Wenhai, Zhan, Xiaohang, Xu, Hang, Sun, Peize, Li, Zhenguo, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Unsupervised contrastive learning achieves great success in learning image representations with CNN. Unlike most recent methods that focused on improving accuracy of image classification, we present a novel contrastive learning approach, named DetCo, which fully explores the contrasts between global image and local image patches to learn discriminative representations for object detection. DetCo has several appealing benefits. (1) It is carefully designed by investigating the weaknesses of current self-supervised methods, which discard important representations for object detection. (2) DetCo builds hierarchical intermediate contrastive losses between global image and local patches to improve object detection, while maintaining global representations for image recognition. Theoretical analysis shows that the local patches actually remove the contextual information of an image, improving the lower bound of mutual information for better contrastive learning. (3) Extensive experiments on PASCAL VOC, COCO and Cityscapes demonstrate that DetCo not only outperforms state-of-the-art methods on object detection, but also on segmentation, pose estimation, and 3D shape prediction, while it is still competitive on image classification. For example, on PASCAL VOC, DetCo-100ep achieves 57.4 mAP, which is on par with the result of MoCov2-800ep. Moreover, DetCo consistently outperforms supervised method by 1.6/1.2/1.0 AP on Mask RCNN-C4/FPN/RetinaNet with 1x schedule. Code will be released at https://github.com/xieenze/DetCo.
Comment: ICCV 2021
Published: 2021

20. Segmenting Transparent Object in the Wild with Transformer

Author: Xie, Enze, Wang, Wenjia, Wang, Wenhai, Sun, Peize, Xu, Hang, Liang, Ding, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This work presents a new fine-grained transparent object segmentation dataset, termed Trans10K-v2, extending Trans10K-v1, the first large-scale transparent object segmentation dataset. Unlike Trans10K-v1 that only has two limited categories, our new dataset has several appealing benefits. (1) It has 11 fine-grained categories of transparent objects, commonly occurring in the human domestic environment, making it more practical for real-world application. (2) Trans10K-v2 brings more challenges for the current advanced segmentation methods than its former version. Furthermore, a novel transformer-based segmentation pipeline termed Trans2Seg is proposed. Firstly, the transformer encoder of Trans2Seg provides the global receptive field in contrast to CNN's local receptive field, which shows excellent advantages over pure CNN architectures. Secondly, by formulating semantic segmentation as a problem of dictionary look-up, we design a set of learnable prototypes as the query of Trans2Seg's transformer decoder, where each prototype learns the statistics of one category in the whole dataset. We benchmark more than 20 recent semantic segmentation methods, demonstrating that Trans2Seg significantly outperforms all the CNN-based methods, showing the proposed algorithm's potential ability to solve transparent object segmentation.
Comment: Tech. Report
Published: 2021

21. TransTrack: Multiple Object Tracking with Transformer

Author: Sun, Peize, Cao, Jinkun, Jiang, Yi, Zhang, Rufeng, Xie, Enze, Yuan, Zehuan, Wang, Changhu, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this work, we propose TransTrack, a simple but efficient scheme to solve the multiple object tracking problems. TransTrack leverages the transformer architecture, which is an attention-based query-key mechanism. It applies object features from the previous frame as a query of the current frame and introduces a set of learned object queries to enable detecting new-coming objects. It builds up a novel joint-detection-and-tracking paradigm by accomplishing object detection and object association in a single shot, simplifying complicated multi-step settings in tracking-by-detection methods. On MOT17 and MOT20 benchmark, TransTrack achieves 74.5% and 64.5% MOTA, respectively, competitive to the state-of-the-art methods. We expect TransTrack to provide a novel perspective for multiple object tracking. The code is available at: https://github.com/PeizeSun/TransTrack.
Comment: update MOT17 and MOT20
Published: 2020

22. What Makes for End-to-End Object Detection?

Author: Sun, Peize, Jiang, Yi, Xie, Enze, Shao, Wenqi, Yuan, Zehuan, Wang, Changhu, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Object detection has recently achieved a breakthrough for removing the last one non-differentiable component in the pipeline, Non-Maximum Suppression (NMS), and building up an end-to-end system. However, what makes for its one-to-one prediction has not been well understood. In this paper, we first point out that one-to-one positive sample assignment is the key factor, while, one-to-many assignment in previous detectors causes redundant predictions in inference. Second, we surprisingly find that even training with one-to-one assignment, previous detectors still produce redundant predictions. We identify that classification cost in matching cost is the main ingredient: (1) previous detectors only consider location cost, (2) by additionally introducing classification cost, previous detectors immediately produce one-to-one prediction during inference. We introduce the concept of score gap to explore the effect of matching cost. Classification cost enlarges the score gap by choosing positive samples as those of highest score in the training iteration and reducing noisy positive samples brought by only location cost. Finally, we demonstrate the advantages of end-to-end object detection on crowded scenes. The code is available at: https://github.com/PeizeSun/OneNet.
Comment: ICML version
Published: 2020

23. Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

Author: Sun, Peize, Zhang, Rufeng, Jiang, Yi, Kong, Tao, Xu, Chenfeng, Zhan, Wei, Tomizuka, Masayoshi, Li, Lei, Yuan, Zehuan, Wang, Changhu, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present Sparse R-CNN, a purely sparse method for object detection in images. Existing works on object detection heavily rely on dense object candidates, such as $k$ anchor boxes pre-defined on all grids of image feature map of size $H\times W$. In our method, however, a fixed sparse set of learned object proposals, total length of $N$, are provided to object recognition head to perform classification and location. By eliminating $HWk$ (up to hundreds of thousands) hand-designed object candidates to $N$ (e.g. 100) learnable proposals, Sparse R-CNN completely avoids all efforts related to object candidates design and many-to-one label assignment. More importantly, final predictions are directly output without non-maximum suppression post-procedure. Sparse R-CNN demonstrates accuracy, run-time and training convergence performance on par with the well-established detector baselines on the challenging COCO dataset, e.g., achieving 45.0 AP in standard $3\times$ training schedule and running at 22 fps using ResNet-50 FPN model. We hope our work could inspire re-thinking the convention of dense prior in object detectors. The code is available at: https://github.com/PeizeSun/SparseR-CNN.
Comment: add test-dev; add crowdhuman
Published: 2020

24. PolarMask: Single Shot Instance Segmentation with Polar Representation

Author: Xie, Enze, Sun, Peize, Song, Xiaoge, Wang, Wenhai, Liang, Ding, Shen, Chunhua, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we introduce an anchor-box free and single shot instance segmentation method, which is conceptually simple, fully convolutional and can be used as a mask prediction module for instance segmentation, by easily embedding it into most off-the-shelf detection methods. Our method, termed PolarMask, formulates the instance segmentation problem as instance center classification and dense distance regression in a polar coordinate. Moreover, we propose two effective approaches to deal with sampling high-quality center examples and optimization for dense distance regression, respectively, which can significantly improve the performance and simplify the training process. Without any bells and whistles, PolarMask achieves 32.9% in mask mAP with single-model and single-scale training/testing on challenging COCO dataset. For the first time, we demonstrate a much simpler and flexible instance segmentation framework achieving competitive accuracy. We hope that the proposed PolarMask framework can serve as a fundamental and strong baseline for single shot instance segmentation tasks. Code is available at: github.com/xieenze/PolarMask.
Comment: Accepted to Proc. IEEE Conf. Computer Vision and Pattern Recognition 2020
Published: 2019

25. Double Anchor R-CNN for Human Detection in a Crowd

Author: Zhang, Kevin, Xiong, Feng, Sun, Peize, Hu, Li, Li, Boxun, and Yu, Gang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Detecting human in a crowd is a challenging problem due to the uncertainties of occlusion patterns. In this paper, we propose to handle the crowd occlusion problem in human detection by leveraging the head part. Double Anchor RPN is developed to capture body and head parts in pairs. A proposal crossover strategy is introduced to generate high-quality proposals for both parts as a training augmentation. Features of coupled proposals are then aggregated efficiently to exploit the inherent relationship. Finally, a Joint NMS module is developed for robust post-processing. The proposed framework, called Double Anchor R-CNN, is able to detect the body and head for each person simultaneously in crowded scenarios. State-of-the-art results are reported on challenging human detection datasets. Our model yields log-average miss rates (MR) of 51.79pp on CrowdHuman, 55.01pp on COCOPersons (crowded sub-dataset) and 40.02pp on CrowdPose (crowded sub-dataset), which outperforms previous baseline detectors by 3.57pp, 3.82pp, and 4.24pp, respectively. We hope our simple and effective approach will serve as a solid baseline and help ease future research in crowded human detection.
Published: 2019

26. TextSR: Content-Aware Text Super-Resolution Guided by Recognition

Author: Wang, Wenjia, Xie, Enze, Sun, Peize, Wang, Wenhai, Tian, Lixun, Shen, Chunhua, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Scene text recognition has witnessed rapid development with the advance of convolutional neural networks. Nonetheless, most of the previous methods may not work well in recognizing text with low resolution which is often seen in natural scene images. An intuitive solution is to introduce super-resolution techniques as pre-processing. However, conventional super-resolution methods in the literature mainly focus on reconstructing the detailed texture of natural images, which typically do not work well for text due to the unique characteristics of text. To tackle these problems, in this work, we propose a content-aware text super-resolution network to generate the information desired for text recognition. In particular, we design an end-to-end network that can perform super-resolution and text recognition simultaneously. Different from previous super-resolution methods, we use the loss of text recognition as the Text Perceptual Loss to guide the training of the super-resolution network, and thus it pays more attention to the text content, rather than the irrelevant background area. Extensive experiments on several challenging benchmarks demonstrate the effectiveness of our proposed method in restoring a sharp high-resolution image from a small blurred one, and show that the recognition performance clearly boosts up the performance of text recognizer. To our knowledge, this is the first work focusing on text super-resolution. Code will be released in https://github.com/xieenze/TextSR.
Published: 2019

27. ByteTrack: Multi-object Tracking by Associating Every Detection Box

Author: Zhang, Yifu, Sun, Peize, Jiang, Yi, Yu, Dongdong, Weng, Fucheng, Yuan, Zehuan, Luo, Ping, Liu, Wenyu, and Wang, Xinggang
Editor: Goos, Gerhard (Founding Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa (Editorial Board Member), Gao, Wen (Editorial Board Member), Steffen, Bernhard (Editorial Board Member), Yung, Moti (Editorial Board Member), Avidan, Shai, Brostow, Gabriel, Cissé, Moustapha, Farinella, Giovanni Maria, and Hassner, Tal
Published: 2022
Full Text: View/download PDF

28. Towards Grand Unification of Object Tracking

Author: Yan, Bin, primary, Jiang, Yi, additional, Sun, Peize, additional, Wang, Dong, additional, Yuan, Zehuan, additional, Luo, Ping, additional, and Lu, Huchuan, additional
Published: 2022
Full Text: View/download PDF

29. Sparse R-CNN: An End-to-End Framework for Object Detection

Author: Sun, Peize, primary, Zhang, Rufeng, additional, Jiang, Yi, additional, Kong, Tao, additional, Xu, Chenfeng, additional, Zhan, Wei, additional, Tomizuka, Masayoshi, additional, Yuan, Zehuan, additional, and Luo, Ping, additional
Published: 2023
Full Text: View/download PDF

30. A deep-learning-based approach for fast and robust steel surface defects classification

Author: Fu, Guizhong, Sun, Peize, Zhu, Wenbin, Yang, Jiangxin, Cao, Yanlong, Yang, Michael Ying, and Cao, Yanpeng
Published: 2019
Full Text: View/download PDF

31. DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion

Author: Sun, Peize, primary, Cao, Jinkun, additional, Jiang, Yi, additional, Yuan, Zehuan, additional, Bai, Song, additional, Kitani, Kris, additional, and Luo, Ping, additional
Published: 2022
Full Text: View/download PDF

32. Language as Queries for Referring Video Object Segmentation

Author: Wu, Jiannan, primary, Jiang, Yi, additional, Sun, Peize, additional, Yuan, Zehuan, additional, and Luo, Ping, additional
Published: 2022
Full Text: View/download PDF

33. DetCo: Unsupervised Contrastive Learning for Object Detection

Author: Xie, Enze, primary, Ding, Jian, additional, Wang, Wenhai, additional, Zhan, Xiaohang, additional, Xu, Hang, additional, Sun, Peize, additional, Li, Zhenguo, additional, and Luo, Ping, additional
Published: 2021
Full Text: View/download PDF

34. Domain-Invariant Disentangled Network for Generalizable Object Detection

Author: Lin, Chuang, primary, Yuan, Zehuan, additional, Zhao, Sicheng, additional, Sun, Peize, additional, Wang, Changhu, additional, and Cai, Jianfei, additional
Published: 2021
Full Text: View/download PDF

35. Watch Only Once: An End-to-End Video Action Detection Framework

Author: Chen, Shoufa, primary, Sun, Peize, additional, Xie, Enze, additional, Ge, Chongjian, additional, Wu, Jiannan, additional, Ma, Lan, additional, Shen, Jiajun, additional, and Luo, Ping, additional
Published: 2021
Full Text: View/download PDF

36. A deep-learning-based approach for fast and robust steel surface defects classification

Author: Yanpeng Cao, Yanlong Cao, Sun Peize, Zhu Wenbin, Michael Ying Yang, Jiangxin Yang, and Guizhong Fu
Affiliation: Department of Earth Observation Science, UT-I-ITC-ACQUAL, and Faculty of Geo-Information Science and Earth Observation
Subjects: Surface (mathematics), Computer science, Feature extraction, Graphics processing unit, Convolutional neural network, Engineering and technology, Natural sciences, Multi-receptive field, Optics, Physical sciences, Image noise, Computer vision, Electrical and Electronic Engineering, Computing Methodologies: Computer Graphics, Mechanical Engineering, Deep learning, Motion blur, Surface inspection, Nanoscience & nanotechnology, Atomic and Molecular Physics, and Optics, Electronic, Optical and Magnetic Materials, Titan (supercomputer), Artificial intelligence, Nano-technology, Defect classification
Abstract: Automatic visual recognition of steel surface defects provides critical functionality to facilitate quality control of steel strip production. In this paper, we present a compact yet effective convolutional neural network (CNN) model, which emphasizes the training of low-level features and incorporates multiple receptive fields, to achieve fast and accurate steel surface defect classification. Our proposed method adopts the pre-trained SqueezeNet as the backbone architecture. It only requires a small amount of defect-specific training samples to achieve high-accuracy recognition on a diversity-enhanced testing dataset of steel surface defects which contains severe non-uniform illumination, camera noise, and motion blur. Moreover, our proposed light-weight CNN model can meet the requirement of real-time online inspection, running over 100 fps on a computer equipped with a single NVIDIA TITAN X Graphics Processing Unit (12G memory). Codes and a diversity-enhanced testing dataset will be made publicly available.
Published: 2019

37. Segmenting Transparent Objects in the Wild with Transformer

Author: Xie, Enze, primary, Wang, Wenjia, additional, Wang, Wenhai, additional, Sun, Peize, additional, Xu, Hang, additional, Liang, Ding, additional, and Luo, Ping, additional
Published: 2021
Full Text: View/download PDF

38. Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

Author: Sun, Peize, primary, Zhang, Rufeng, additional, Jiang, Yi, additional, Kong, Tao, additional, Xu, Chenfeng, additional, Zhan, Wei, additional, Tomizuka, Masayoshi, additional, Li, Lei, additional, Yuan, Zehuan, additional, Wang, Changhu, additional, and Luo, Ping, additional
Published: 2021
Full Text: View/download PDF

39. PolarMask: Single Shot Instance Segmentation With Polar Representation

Author: Xie, Enze, primary, Sun, Peize, additional, Song, Xiaoge, additional, Wang, Wenhai, additional, Liu, Xuebo, additional, Liang, Ding, additional, Shen, Chunhua, additional, and Luo, Ping, additional
Published: 2020
Full Text: View/download PDF
