Author: "Cao Liangliang" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Cao Liangliang"' showing total 519 results

1. Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

Author: Xu, Dejia, Jiang, Yifan, Huang, Chen, Song, Liangchen, Gernoth, Thorsten, Cao, Liangliang, Wang, Zhangyang, and Tang, Hao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In recent years there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows for joint training with diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To our best knowledge, Cavia is the first of its kind that allows the user to precisely specify camera motion while obtaining object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality. Project Page: https://ir1d.github.io/Cavia/
Published: 2024

2. MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

Author: Hua, Hang, Tang, Yunlong, Zeng, Ziyun, Cao, Liangliang, Yang, Zhengyuan, He, Hangfeng, Xu, Chenliang, and Luo, Jiebo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, visual question answering, and cross-modal retrieval. Despite VLMs' superior capabilities, researchers lack a comprehensive understanding of their compositionality -- the ability to understand and produce novel combinations of known visual and textual components. Prior benchmarks provide only a relatively rough compositionality evaluation from the perspectives of objects, relations, and attributes while neglecting deeper reasoning about object interactions, counting, and complex compositions. However, compositionality is a critical ability that facilitates coherent reasoning and understanding across modalities for VLMs. To address this limitation, we propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating VLMs' compositionality. Our proposed benchmark serves as a complement to these earlier works. With MMCOMPOSITION, we can quantify and explore the compositionality of the mainstream VLMs. Surprisingly, we find GPT-4o's compositionality inferior to the best open-source model, and we analyze the underlying reasons. Our experimental analysis reveals the limitations of VLMs in fine-grained compositional perception and reasoning, and points to areas for improvement in VLM design and training. Resources available at: https://hanghuacs.github.io/MMComposition/
Comment: 21 pages, 15 figures
Published: 2024

3. Apple Intelligence Foundation Language Models

Author: Gunter, Tom, Wang, Zirui, Wang, Chong, Pang, Ruoming, Narayanan, Andy, Zhang, Aonan, Zhang, Bowen, Chen, Chen, Chiu, Chung-Cheng, Qiu, David, Gopinath, Deepak, Yap, Dian Ang, Yin, Dong, Nan, Feng, Weers, Floris, Yin, Guoli, Huang, Haoshuo, Wang, Jianyu, Lu, Jiarui, Peebles, John, Ye, Ke, Lee, Mark, Du, Nan, Chen, Qibin, Keunebroek, Quentin, Wiseman, Sam, Evans, Syd, Lei, Tao, Rathod, Vivek, Kong, Xiang, Du, Xianzhi, Li, Yanghao, Wang, Yongqiang, Gao, Yuan, Ahmed, Zaid, Xu, Zhaoyang, Lu, Zhiyun, Rashid, Al, Jose, Albin Madappally, Doane, Alec, Bencomo, Alfredo, Vanderby, Allison, Hansen, Andrew, Jain, Ankur, Anupama, Anupama Mann, Kamal, Areeba, Wu, Bugu, Brum, Carolina, Maalouf, Charlie, Erdenebileg, Chinguun, Dulhanty, Chris, Moritz, Dominik, Kang, Doug, Jimenez, Eduardo, Ladd, Evan, Shi, Fangping, Bai, Felix, Chu, Frank, Hohman, Fred, Kotek, Hadas, Coleman, Hannah Gillis, Li, Jane, Bigham, Jeffrey, Cao, Jeffery, Lai, Jeff, Cheung, Jessica, Shan, Jiulong, Zhou, Joe, Li, John, Qin, Jun, Singh, Karanjeet, Vega, Karla, Zou, Kelvin, Heckman, Laura, Gardiner, Lauren, Bowler, Margit, Cordell, Maria, Cao, Meng, Hay, Nicole, Shahdadpuri, Nilesh, Godwin, Otto, Dighe, Pranay, Rachapudi, Pushyami, Tantawi, Ramsey, Frigg, Roman, Davarnia, Sam, Shah, Sanskruti, Guha, Saptarshi, Sirovica, Sasha, Ma, Shen, Ma, Shuang, Wang, Simon, Kim, Sulgi, Jayaram, Suma, Shankar, Vaishaal, Paidi, Varsha, Kumar, Vivek, Wang, Xin, Zheng, Xin, Cheng, Walker, Shrager, Yael, Ye, Yang, Tanaka, Yasu, Guo, Yihao, Meng, Yunsong, Luo, Zhao Tang, Ouyang, Zhi, Aygar, Alp, Wan, Alvin, Walkingshaw, Andrew, Lin, Antonie, Farooq, Arsalan, Ramerth, Brent, Reed, Colorado, Bartels, Chris, Chaney, Chris, Riazati, David, Yang, Eric Liang, Feldman, Erin, Hochstrasser, Gabriel, Seguin, Guillaume, Belousova, Irina, Pelemans, Joris, Yang, Karen, Vahid, Keivan Alizadeh, Cao, Liangliang, Najibi, Mahyar, Zuliani, Marco, Horton, Max, Cho, Minsik, Bhendawade, Nikhil, Dong, Patrick, Maj, Piotr, Agrawal, Pulkit, Shan, Qi, Fu, Qichen, Poston, Regan, Xu, Sam, Liu, Shuangning, Rao, Sushma, Heeramun, Tashweena, Merth, Thomas, Rayala, Uday, Cui, Victor, Sridhar, Vivek Rangarajan, Zhang, Wencong, Zhang, Wenqi, Wu, Wentao, Zhou, Xingyu, Liu, Xinwen, Zhao, Yang, Xia, Yin, Ren, Zhile, and Ren, Zhongzheng
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.
Published: 2024

4. Diffusion Model-Based Image Editing: A Survey

Author: Huang, Yi, Huang, Jiancheng, Liu, Yifan, Yan, Mingfu, Lv, Jiaxi, Liu, Jianzhuang, Xiong, Wei, Zhang, He, Chen, Shifeng, and Cao, Liangliang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks, facilitating the synthesis of visual content in an unconditional or input-conditional manner. The core idea behind them is learning to reverse the process of gradually adding noise to images, allowing them to generate high-quality samples from a complex distribution. In this survey, we provide an exhaustive overview of existing methods using diffusion models for image editing, covering both theoretical and practical aspects in the field. We delve into a thorough analysis and categorization of these works from multiple perspectives, including learning strategies, user-input conditions, and the array of specific editing tasks that can be accomplished. In addition, we pay special attention to image inpainting and outpainting, and explore both earlier traditional context-driven and current multimodal conditional methods, offering a comprehensive analysis of their methodologies. To further evaluate the performance of text-guided image editing algorithms, we propose a systematic benchmark, EditEval, featuring an innovative metric, LMM Score. Finally, we address current limitations and envision some potential directions for future research. The accompanying repository is released at https://github.com/SiatMMLab/Awesome-Diffusion-Model-Based-Image-Editing-Methods.
Published: 2024

5. Efficient-NeRF2NeRF: Streamlining Text-Driven 3D Editing with Multiview Correspondence-Enhanced Diffusion Models

Author: Song, Liangchen, Cao, Liangliang, Gu, Jiatao, Jiang, Yifan, Yuan, Junsong, and Tang, Hao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The advancement of text-driven 3D content editing has been blessed by the progress from 2D generative diffusion models. However, a major obstacle hindering the widespread adoption of 3D content editing is its time-intensive processing. This challenge arises from the iterative and refining steps required to achieve consistent 3D outputs from 2D image-based generative models. Recent state-of-the-art methods typically require optimization time ranging from tens of minutes to several hours to edit a 3D scene using a single GPU. In this work, we propose that by incorporating correspondence regularization into diffusion models, the process of 3D editing can be significantly accelerated. This approach is inspired by the notion that the estimated samples during diffusion should be multiview-consistent during the diffusion generation process. By leveraging this multiview consistency, we can edit 3D content at a much faster speed. In most scenarios, our proposed technique brings a 10× speed-up compared to the baseline method and completes the editing of a 3D scene in 2 minutes with comparable quality.
Comment: Project page: https://lsongx.github.io/projects/en2n.html
Published: 2023

6. Ferret: Refer and Ground Anything Anywhere at Any Granularity

Author: You, Haoxuan, Zhang, Haotian, Gan, Zhe, Du, Xianzhi, Zhang, Bowen, Wang, Zirui, Cao, Liangliang, Chang, Shih-Fu, and Yang, Yinfei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination. Code and data will be available at https://github.com/apple/ml-ferret
Comment: 30 pages, 10 figures. Code/Project Website: https://github.com/apple/ml-ferret
Published: 2023

7. Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day

Author: Jiang, Yifan, Tang, Hao, Chang, Jen-Hao Rick, Song, Liangchen, Wang, Zhangyang, and Cao, Liangliang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The task of novel view synthesis aims to generate unseen perspectives of an object or scene from a limited set of input images. Nevertheless, synthesizing novel views from a single image still remains a significant challenge in the realm of computer vision. Previous approaches tackle this problem by adopting mesh prediction, multi-plane image construction, or more advanced techniques such as neural radiance fields. Recently, a pre-trained diffusion model that is specifically designed for 2D image synthesis has demonstrated its capability in producing photorealistic novel views, if sufficiently optimized on a 3D finetuning task. Although the fidelity and generalizability are greatly improved, training such a powerful diffusion model requires a vast volume of training data and model parameters, resulting in a notoriously long time and high computational costs. To tackle this issue, we propose Efficient-3DiM, a simple but effective framework to learn a single-image novel-view synthesizer. Motivated by our in-depth analysis of the inference process of diffusion models, we propose several pragmatic strategies to reduce the training overhead to a manageable scale, including a crafted timestep sampling strategy, a superior 3D feature extractor, and an enhanced training scheme. When combined, our framework is able to reduce the total training time from 10 days to less than 1 day, significantly accelerating the training process under the same computational platform (one instance with 8 Nvidia A100 GPUs). Comprehensive experiments are conducted to demonstrate the efficiency and generalizability of our proposed method.
Published: 2023

8. Instruction-Following Speech Recognition

Author: Lai, Cheng-I Jeff, Lu, Zhiyun, Cao, Liangliang, and Pang, Ruoming
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. This enables a multitude of speech recognition tasks -- ranging from transcript manipulation to summarization -- without relying on predefined command sets. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring LLMs or pre-trained speech modules. It also offers selective transcription options based on instructions like "transcribe first half and then turn off listening," providing an additional layer of privacy and safety compared to existing LLMs. Our findings highlight the significant potential of instruction-following training to advance speech foundation models.
Published: 2023

9. RNF126-mediated ubiquitination of FSP1 affects its subcellular localization and ferroptosis

Author: Xie, Wanqun, Wang, Jiajia, Tian, Shuaiwei, Zhao, Heng, Cao, Liangliang, Liang, Zhuangzhuang, Yang, Jian, Zhao, Yang, Wang, Baocheng, Jiang, Feng, and Ma, Jie
Published: 2024

10. RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture

Author: Song, Liangchen, Cao, Liangliang, Xu, Hongyu, Kang, Kai, Tang, Feng, Yuan, Junsong, and Zhao, Yang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The techniques for 3D indoor scene capturing are widely used, but the meshes produced leave much to be desired. In this paper, we propose "RoomDreamer", which leverages powerful natural language to synthesize a new room with a different style. Unlike existing image synthesis methods, our work addresses the challenge of synthesizing both geometry and texture aligned to the input scene structure and prompt simultaneously. The key insight is that a scene should be treated as a whole, taking into account both scene texture and geometry. The proposed framework consists of two significant components: Geometry Guided Diffusion and Mesh Optimization. Geometry Guided Diffusion for 3D Scene guarantees the consistency of the scene style by applying the 2D prior to the entire scene simultaneously. Mesh Optimization improves the geometry and texture jointly and eliminates the artifacts in the scanned scene. To validate the proposed method, real indoor scenes scanned with smartphones are used for extensive experiments, through which the effectiveness of our method is demonstrated.
Comment: Video results: https://youtu.be/p4xgwj4QJcQ
Published: 2023

11. Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness

Author: Cao, Liangliang, Zhang, Bowen, Chen, Chen, Yang, Yinfei, Du, Xianzhi, Zhang, Wencong, Lu, Zhiyun, and Zheng, Yantao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: The CLIP (Contrastive Language-Image Pre-training) model and its variants are becoming the de facto backbone in many applications. However, training a CLIP model from hundreds of millions of image-text pairs can be prohibitively expensive. Furthermore, the conventional CLIP model doesn't differentiate between the visual semantics and meaning of text regions embedded in images. This can lead to non-robustness when the text in the embedded region doesn't match the image's visual appearance. In this paper, we discuss two effective approaches to improve the efficiency and robustness of CLIP training: (1) augmenting the training dataset while maintaining the same number of optimization steps, and (2) filtering out samples that contain text regions in the image. By doing so, we significantly improve the classification and retrieval accuracy on public benchmarks like ImageNet and COCO. Filtering out images with text regions also protects the model from typographic attacks. To verify this, we build a new dataset named ImageNet with Adversarial Text Regions (ImageNet-Attr). Our filter-based CLIP model demonstrates a top-1 accuracy of 68.78%, outperforming previous models whose accuracy was all below 50%.
Comment: 10 pages, 8 figures
Published: 2023

12. STAIR: Learning Sparse Text and Image Representation in Grounded Tokens

Author: Chen, Chen, Zhang, Bowen, Cao, Liangliang, Shen, Jiguang, Gunter, Tom, Jose, Albin Madappally, Toshev, Alexander, Shlens, Jonathon, Pang, Ruoming, and Yang, Yinfei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image and text retrieval is one of the foundational tasks in the vision and language domain with multiple real-world applications. State-of-the-art approaches, e.g. CLIP and ALIGN, represent images and texts as dense embeddings and calculate the similarity in the dense embedding space as the matching score. On the other hand, sparse semantic features like bag-of-words models are more interpretable, but are believed to be less accurate than dense representations. In this work, we show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations. We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space. Each token in the space is a (sub-)word in the vocabulary, which is not only interpretable but also easy to integrate with existing information retrieval systems. The STAIR model significantly outperforms a CLIP model with +4.9% and +4.3% absolute Recall@1 improvement on COCO-5k text→image and image→text retrieval respectively. It also achieved better performance on both ImageNet zero-shot and linear probing compared to CLIP.
Published: 2023
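Note on the metric: the Recall@1 figure quoted in the abstract above is a retrieval metric computed over paired text and image embeddings. The following is a minimal sketch of that computation, not the authors' code. It assumes that text_embs[i] is paired with image_embs[i] and that the matching score is cosine similarity; the function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(text_embs, image_embs):
    """Fraction of text queries whose top-scoring image is the paired one.

    Assumption: text_embs[i] and image_embs[i] describe the same item.
    """
    hits = 0
    for i, text in enumerate(text_embs):
        scores = [cosine(text, image) for image in image_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(text_embs)


# Made-up 2-D embeddings, for illustration only.
texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
images = [[0.9, 0.1], [0.2, 0.8], [1.0, -1.0]]
toy_recall = recall_at_1(texts, images)  # two of the three queries retrieve their pair
```

Swapping the roles of the two lists gives the image→text direction of the same metric.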

13. Exploiting Category Names for Few-Shot Classification with Vision-Language Models

Author: Xiao, Taihong, Wang, Zirui, Cao, Liangliang, Yu, Jiahui, Dai, Shengyang, and Yang, Ming-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks. Notably, many vision-language models build two encoders (visual and textual) that can map two modalities into the same embedding space. As a result, the learned representations achieve good zero-shot performance on tasks like image classification. However, when there are only a few examples per category, the potential of large vision-language models is often not fully realized, mainly due to the gap between a large number of parameters and a relatively small amount of training data. This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head. With the proposed category name initialization method, our model obtains the state-of-the-art performance on a number of few-shot image classification benchmarks (e.g., 87.37% on ImageNet and 96.08% on Stanford Cars, both using five-shot learning).
Published: 2022

14. Fabrication of ultra-high strength MWCNTs/CI/PI rigid composite foam with excellent microwave absorption performance by pressure foaming method

Author: Cao, Liangliang, Li, Binbin, Shao, Luwei, Liu, Qianli, Gao, Jingmin, Yuan, Shuaichao, Bu, Hengchang, and Zhan, Xiaohong
Published: 2024

15. PriFit: Learning to Fit Primitives Improves Few Shot Point Cloud Segmentation

Author: Sharma, Gopal, Dash, Bidya, RoyChowdhury, Aruni, Gadelha, Matheus, Loizou, Marios, Cao, Liangliang, Wang, Rui, Learned-Miller, Erik, Maji, Subhransu, and Kalogerakis, Evangelos
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present PriFit, a semi-supervised approach for label-efficient learning of 3D point cloud segmentation networks. PriFit combines geometric primitive fitting with point-based representation learning. Its key idea is to learn point representations whose clustering reveals shape regions that can be approximated well by basic geometric primitives, such as cuboids and ellipsoids. The learned point representations can then be re-used in existing network architectures for 3D point cloud segmentation, and improves their performance in the few-shot setting. According to our experiments on the widely used ShapeNet and PartNet benchmarks, PriFit outperforms several state-of-the-art methods in this setting, suggesting that decomposability into primitives is a useful prior for learning representations predictive of semantic parts. We present a number of ablative experiments varying the choice of geometric primitives and downstream tasks to demonstrate the effectiveness of the method.
Published: 2021

16. Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition

Author: Li, Qiujia, Zhang, Yu, Qiu, David, He, Yanzhang, Cao, Liangliang, and Woodland, Philip C.
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning
Abstract: As end-to-end automatic speech recognition (ASR) models reach promising performance, various downstream tasks rely on good confidence estimators for these systems. Recent research has shown that model-based confidence estimators have a significant advantage over using the output softmax probabilities. If the input data to the speech recogniser is from mismatched acoustic and linguistic conditions, the ASR performance and the corresponding confidence estimators may exhibit severe degradation. Since confidence models are often trained on the same in-domain data as the ASR, generalising to out-of-domain (OOD) scenarios is challenging. By keeping the ASR model untouched, this paper proposes two approaches to improve the model-based confidence estimators on OOD data: using pseudo transcriptions and an additional OOD language model. With an ASR model trained on LibriSpeech, experiments show that the proposed methods can greatly improve the confidence metrics on TED-LIUM and Switchboard datasets while preserving in-domain performance. Furthermore, the improved confidence estimators are better calibrated on OOD data and can provide a much more reliable criterion for data selection.
Comment: Accepted as a conference paper at ICASSP 2022
Published: 2021

17. Input Length Matters: Improving RNN-T and MWER Training for Long-form Telephony Speech Recognition

Author: Lu, Zhiyun, Pan, Yanwei, Doutre, Thibault, Haghani, Parisa, Cao, Liangliang, Prabhavalkar, Rohit, Zhang, Chao, and Strohman, Trevor
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language
Abstract: End-to-end models have achieved state-of-the-art results on several automatic speech recognition tasks. However, they perform poorly when evaluated on long-form data, e.g., minutes long conversational telephony audio. One reason the model fails on long-form speech is that it has only seen short utterances during training. In this paper we study the effect of training utterance length on the word error rate (WER) for RNN-transducer (RNN-T) model. We compare two widely used training objectives, log loss (or RNN-T loss) and minimum word error rate (MWER) loss. We conduct experiments on telephony datasets in four languages. Our experiments show that for both losses, the WER on long-form speech reduces substantially as the training utterance length increases. The average relative WER gain is 15.7% for log loss and 8.8% for MWER loss. When training on short utterances, MWER loss leads to a lower WER than the log loss. Such difference between the two losses diminishes when the input length increases.
Comment: submitted to INTERSPEECH 2022
Published: 2021
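Note on the metric: the abstract above reports word error rate (WER) and relative WER gains. As a rough illustration only, and not the authors' evaluation code, the sketch below computes WER as the word-level edit distance divided by the reference length, and a relative reduction between two WER values. The function names, the whitespace tokenization, and the example sentences are assumptions made for this sketch.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative improvement of new_wer over baseline_wer, as a fraction."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Made-up example: one deleted word out of six reference words.
example_wer = wer("the cat sat on the mat", "the cat sat on mat")
# Made-up WER values, in percent: a drop from 20.0 to 17.0.
example_gain = relative_wer_reduction(20.0, 17.0)
```

Under this definition a relative gain of 15.7% means the new WER is 84.3% of the baseline WER, whatever the absolute values are.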

18. Effect of different ternary lithium oxides on the properties of pressureless sintered porous Si3N4

Author: Huang, Xiaofeng, Meng, Fancheng, Cao, Liangliang, Feng, Guoqing, Peng, Haiyi, Xie, Tianyi, Ren, Haishen, Yao, Xiaogang, Jin, Ye, Lin, Huixing, and Li, Hongtao
Published: 2024

19. BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Author: Zhang, Yu, Park, Daniel S., Han, Wei, Qin, James, Gulati, Anmol, Shor, Joel, Jansen, Aren, Xu, Yuanzhong, Huang, Yanping, Wang, Shibo, Zhou, Zongwei, Li, Bo, Ma, Min, Chan, William, Yu, Jiahui, Wang, Yongqiang, Cao, Liangliang, Sim, Khe Chai, Ramabhadran, Bhuvana, Sainath, Tara N., Beaufays, Françoise, Chen, Zhifeng, Le, Quoc V., Chiu, Chung-Cheng, Pang, Ruoming, and Wu, Yonghui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitudes of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.
Comment: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated
Published: 2021

20. Multiple encrypted dynamic fluorescent anti-counterfeiting and optical temperature detection of Ca2Ge7O16:Cr3+, Mn2+

Author: Fang, Fei, Jin, Ye, Chen, Hongtao, Lin, Huayan, Li, Yuyan, Xiong, Yanbin, Meng, Fancheng, Cao, Liangliang, Huang, Fuxiang, Ma, Li, Wang, Xiao-jun, and Ren, Haishen
Published: 2024

21. Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction

Author: Qiu, David, He, Yanzhang, Li, Qiujia, Zhang, Yu, Cao, Liangliang, and McGraw, Ian
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Confidence scores are very useful for downstream applications of automatic speech recognition (ASR) systems. Recent works have proposed using neural networks to learn word or utterance confidence scores for end-to-end ASR. In those studies, word confidence by itself does not model deletions, and utterance confidence does not take advantage of word-level training signals. This paper proposes to jointly learn word confidence, word deletion, and utterance confidence. Empirical results show that multi-task learning with all three objectives improves confidence metrics (NCE, AUC, RMSE) without the need for increasing the model size of the confidence estimation module. Using the utterance-level confidence for rescoring also decreases the word error rates on Google's Voice Search and Long-tail Maps datasets by 3-5% relative, without needing a dedicated neural rescorer.
Comment: Submitted to Interspeech 2021
Published: 2021

22. Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models

Author: Doutre, Thibault, Han, Wei, Chiu, Chung-Cheng, Pang, Ruoming, Siohan, Olivier, and Cao, Liangliang
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real-time. Their minimal latency makes them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context and suffer from higher word error rates (WER). To improve streaming models, a recent study [1] proposed to distill a non-streaming teacher model on unsupervised utterances, and then train a streaming student using the teachers' predictions. However, the performance gap between teacher and student WERs remains high. In this paper, we aim to close this gap by using a diversified set of non-streaming teacher models and combining them using Recognizer Output Voting Error Reduction (ROVER). In particular, we show that, despite being weaker than RNN-T models, CTC models are remarkable teachers. Further, by fusing RNN-T and CTC models together, we build the strongest teachers. The resulting student models drastically improve upon streaming models of previous work [1]: the WER decreases by 41% on Spanish, 27% on Portuguese, and 13% on French.
Published: 2021

23. Exploring Targeted Universal Adversarial Perturbations to End-to-end ASR Models

Author: Lu, Zhiyun, Han, Wei, Zhang, Yu, and Cao, Liangliang
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Although end-to-end automatic speech recognition (e2e ASR) models are widely deployed in many applications, there have been very few studies to understand models' robustness against adversarial perturbations. In this paper, we explore whether a targeted universal perturbation vector exists for e2e ASR models. Our goal is to find perturbations that can mislead the models to predict the given targeted transcript such as "thank you" or empty string on any input utterance. We study two different attacks, namely additive and prepending perturbations, and their performances on the state-of-the-art LAS, CTC and RNN-T models. We find that LAS is the most vulnerable to perturbations among the three models. RNN-T is more robust against additive perturbations, especially on long utterances. And CTC is robust against both additive and prepending perturbations. To attack RNN-T, we find prepending perturbation is more effective than the additive perturbation, and can mislead the models to predict the same short target on utterances of arbitrary length., Comment: Submitted to INTERSPEECH 2021
Published: 2021

24. Residual Energy-Based Models for End-to-End Speech Recognition

Author: Li, Qiujia, Zhang, Yu, Li, Bo, Cao, Liangliang, and Woodland, Philip C.
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: End-to-end models with auto-regressive decoders have shown impressive results for automatic speech recognition (ASR). These models formulate the sequence-level probability as a product of the conditional probabilities of all individual tokens given their histories. However, the performance of locally normalised models can be sub-optimal because of factors such as exposure bias. Consequently, the model distribution differs from the underlying data distribution. In this paper, the residual energy-based model (R-EBM) is proposed to complement the auto-regressive ASR model to close the gap between the two distributions. Meanwhile, R-EBMs can also be regarded as utterance-level confidence estimators, which may benefit many downstream tasks. Experiments on a 100hr LibriSpeech dataset show that R-EBMs can reduce the word error rates (WERs) by 8.2%/6.7% while improving areas under precision-recall curves of confidence scores by 12.6%/28.4% on test-clean/test-other sets. Furthermore, on a state-of-the-art model using self-supervised learning (wav2vec 2.0), R-EBMs still significantly improves both the WER and confidence estimation performance., Comment: To appear in Proc. Interspeech 2021
Published: 2021

25. Learning Word-Level Confidence For Subword End-to-End ASR

Author: Qiu, David, Li, Qiujia, He, Yanzhang, Zhang, Yu, Li, Bo, Cao, Liangliang, Prabhavalkar, Rohit, Bhatia, Deepti, Li, Wei, Hu, Ke, Sainath, Tara N., and McGraw, Ian
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR). Although prior works have proposed training auxiliary confidence models for ASR systems, they do not extend naturally to systems that operate on word-pieces (WP) as their vocabulary. In particular, ground truth WP correctness labels are needed for training confidence models, but the non-unique tokenization from word to WP causes inaccurate labels to be generated. This paper proposes and studies two confidence models of increasing complexity to solve this problem. The final model uses self-attention to directly learn word-level confidence without needing subword tokenization, and exploits full context features from multiple hypotheses to improve confidence accuracy. Experiments on Voice Search and long-tail test sets show standard metrics (e.g., NCE, AUC, RMSE) improving substantially. The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model., Comment: To appear in ICASSP 2021
Published: 2021

26. Analysis of the permeable and retainable components of Cayratia japonica ointment through intact or broken skin after topical application by UPLC-Q-TOF-MS/MS combined with in vitro transdermal assay

Author: Zhao, Xuelong, Dai, Ruixue, Wang, Jing, Cao, Liangliang, Chen, Peidong, Yao, Weifeng, Cheng, Fangfang, Bao, Beihua, and Zhang, Li
Published: 2024
Full Text: View/download PDF

27. Up-conversion phosphor Na2MoO4:Er3+/Yb3+ for the optical temperature sensing and anti-counterfeiting

Author: Chen, Hongtao, Jin, Ye, Fang, Fei, Lin, Huayan, Li, Yuyan, Feng, Guoqing, Xiong, Yanbin, Meng, Fancheng, Cao, Liangliang, and Ren, Haishen
Published: 2024
Full Text: View/download PDF

28. Spatial-Temporal Alignment Network for Action Recognition and Detection

Author: Liang, Junwei, Cao, Liangliang, Xiong, Xuehan, Yu, Ting, and Hauptmann, Alexander
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper studies how to introduce viewpoint-invariant feature representations that can help action recognition and detection. Although we have witnessed great progress of action recognition in the past decade, it remains challenging yet interesting how to efficiently model the geometric variations in large scale datasets. This paper proposes a novel Spatial-Temporal Alignment Network (STAN) that aims to learn geometric invariant representations for action recognition and action detection. The STAN model is very light-weighted and generic, which could be plugged into existing action recognition models like ResNet3D and the SlowFast with a very low extra computational cost. We test our STAN model extensively on AVA, Kinetics-400, AVA-Kinetics, Charades, and Charades-Ego datasets. The experimental results show that the STAN model can consistently improve the state of the arts in both action detection and action recognition tasks. We will release our data, models and code.
Published: 2020

29. Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data

Author: Doutre, Thibault, Han, Wei, Ma, Min, Lu, Zhiyun, Chiu, Chung-Cheng, Pang, Ruoming, Narayanan, Arun, Misra, Ananya, Zhang, Yu, and Cao, Liangliang
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Streaming end-to-end automatic speech recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, compared to their non-streaming counterparts. Consequently, streaming models usually perform worse than non-streaming models. We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher to generate transcripts on an arbitrarily large data set, which is then used to distill knowledge into streaming ASR models. This way, we scale the training of streaming models to up to 3 million hours of YouTube audio. Experiments show that our approach can significantly reduce the word error rate (WER) of RNNT models not only on LibriSpeech but also on YouTube data in four languages. For example, in French, we are able to reduce the WER by 16.4% relatively to a baseline streaming model by leveraging a non-streaming teacher model trained on the same amount of labeled data as the baseline.
Published: 2020

30. Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition

Author: Li, Qiujia, Qiu, David, Zhang, Yu, Li, Bo, He, Yanzhang, Woodland, Philip C., Cao, Liangliang, and Strohman, Trevor
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: For various speech-related tasks, confidence scores from a speech recogniser are a useful measure to assess the quality of transcriptions. In traditional hidden Markov model-based automatic speech recognition (ASR) systems, confidence scores can be reliably obtained from word posteriors in decoding lattices. However, for an ASR system with an auto-regressive decoder, such as an attention-based sequence-to-sequence model, computing word posteriors is difficult. An obvious alternative is to use the decoder softmax probability as the model confidence. In this paper, we first examine how some commonly used regularisation methods influence the softmax-based confidence scores and study the overconfident behaviour of end-to-end models. Then we propose a lightweight and effective approach named confidence estimation module (CEM) on top of an existing end-to-end ASR model. Experiments on LibriSpeech show that CEM can mitigate the overconfidence problem and can produce more reliable confidence scores with and without shallow fusion of a language model. Further analysis shows that CEM generalises well to speech from a moderately mismatched domain and can potentially improve downstream tasks such as semi-supervised learning., Comment: Submitted to ICASSP 2021
Published: 2020

31. Zero-shot Entity Linking with Efficient Long Range Sequence Modeling

Author: Yao, Zonghai, Cao, Liangliang, and Pan, Huapu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: This paper considers the problem of zero-shot entity linking, in which a link at test time may not be present in training. Following the prevailing BERT-based research efforts, we find a simple yet effective way is to expand the long-range sequence modeling. Unlike many previous methods, our method does not require expensive pre-training of BERT with long position embedding. Instead, we propose an efficient position embeddings initialization method called Embedding-repeat, which initializes larger position embeddings based on BERT-Base. On Wikia's zero-shot EL dataset, our method improves the SOTA from 76.06% to 79.08%, and for its long data, the corresponding improvement is from 74.57% to 82.14%. Our experiments suggest the effectiveness of long-range sequence modeling without retraining the BERT model., Comment: 6 pages, 6 figures, Findings of EMNLP2020
Published: 2020
Full Text: View/download PDF

32. RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions

Author: Chiu, Chung-Cheng, Narayanan, Arun, Han, Wei, Prabhavalkar, Rohit, Zhang, Yu, Jaitly, Navdeep, Pang, Ruoming, Sainath, Tara N., Nguyen, Patrick, Cao, Liangliang, and Wu, Yonghui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language
Abstract: In recent years, all-neural end-to-end approaches have obtained state-of-the-art results on several challenging automatic speech recognition (ASR) tasks. However, most existing works focus on building ASR models where train and test data are drawn from the same domain. This results in poor generalization characteristics on mismatched-domains: e.g., end-to-end models trained on short segments perform poorly when evaluated on longer utterances. In this work, we analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models in order to identify model components that negatively affect generalization performance. We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference. On a long-form YouTube test set, when the nonstreaming RNN-T model is trained with shorter segments of data, the proposed combination improves word error rate (WER) from 22.3% to 14.8%; when the streaming RNN-T model is trained on short Search queries, the proposed techniques improve WER on the YouTube set from 67.0% to 25.3%. Finally, when trained on Librispeech, we find that dynamic overlapping inference improves WER on YouTube from 99.8% to 33.0%., Comment: SLT camera-ready version
Published: 2020

33. Label-Efficient Learning on Point Clouds using Approximate Convex Decompositions

Author: Gadelha, Matheus, RoyChowdhury, Aruni, Sharma, Gopal, Kalogerakis, Evangelos, Cao, Liangliang, Learned-Miller, Erik, Wang, Rui, and Maji, Subhransu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: The problems of shape classification and part segmentation from 3D point clouds have garnered increasing attention in the last few years. Both of these problems, however, suffer from relatively small training sets, creating the need for statistically efficient methods to learn 3D shape representations. In this paper, we investigate the use of Approximate Convex Decompositions (ACD) as a self-supervisory signal for label-efficient learning of point cloud representations. We show that using ACD to approximate ground truth segmentation provides excellent self-supervision for learning 3D point cloud representations that are highly effective on downstream tasks. We report improvements over the state-of-the-art for unsupervised representation learning on the ModelNet40 shape classification dataset and significant gains in few-shot part segmentation on the ShapeNetPart dataset. Code available at https://github.com/matheusgadelha/PointCloudLearningACD, Comment: First two authors had equal contribution. ECCV'20 version. 19 pages, 5 figures
Published: 2020

34. Identification and complete genome sequence of a novel sadwavirus discovered in chrysanthemum (Chrysanthemum morifolium Ramat.)

Author: Chen, Jie, Dong, Yafeng, Wang, Hui, Zhang, Jie, Ma, Changnian, Cao, Liangliang, Shen, Leiding, Cao, Kuirong, and Fan, Xudong
Published: 2023
Full Text: View/download PDF

35. Clinical value of classification in the treatment of children with suprasellar arachnoid cysts

Author: Zhao, Heng, Cao, Liangliang, Zhao, Yang, Wang, BaoCheng, Tian, ShauiWei, and Ma, Jie
Published: 2023
Full Text: View/download PDF

36. Progressive Learning Algorithm for Efficient Person Re-Identification

Author: Li, Zhen, Shao, Hanyang, Xue, Nian, Niu, Liang, and Cao, LiangLiang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: This paper studies the problem of Person Re-Identification (ReID) for large-scale applications. Recent research efforts have been devoted to building complicated part models, which introduce considerably high computational cost and memory consumption, inhibiting their practicability in large-scale applications. This paper aims to develop a novel learning strategy to find efficient feature embeddings while maintaining the balance of accuracy and model complexity. More specifically, we find that by enhancing the classical triplet loss together with cross-entropy loss, our method can explore the hard examples and build a discriminant feature embedding that is still compact enough for large-scale applications. Our method is carried out progressively using Bayesian optimization, and we call it the Progressive Learning Algorithm (PLA). Extensive experiments on three large-scale datasets show that our PLA is comparable to or better than the state of the art. In particular, on the challenging Market-1501 dataset, we achieve Rank-1=94.7%/mAP=89.4% while using at least 30% fewer parameters than strong part models., Comment: ICPR2020
Published: 2019

37. Speech Sentiment Analysis via Pre-trained Features from End-to-end ASR Models

Author: Lu, Zhiyun, Cao, Liangliang, Zhang, Yu, Chiu, Chung-Cheng, and Fan, James
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we propose to use pre-trained features from end-to-end ASR models to solve speech sentiment analysis as a down-stream task. We show that end-to-end ASR features, which integrate both acoustic and text information from speech, achieve promising results. We use RNN with self-attention as the sentiment classifier, which also provides an easy visualization through attention weights to help interpret model predictions. We use the well-benchmarked IEMOCAP dataset and a new large-scale speech sentiment dataset SWBD-sentiment for evaluation. Our approach improves the state-of-the-art accuracy on IEMOCAP from 66.6% to 71.7%, and achieves an accuracy of 70.10% on SWBD-sentiment with more than 49,500 utterances.
Published: 2019

38. Product Image Recognition with Guidance Learning and Noisy Supervision

Author: Li, Qing, Peng, Xiaojiang, Cao, Liangliang, Du, Wenbin, Xing, Hao, and Qiao, Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper considers recognizing products from daily photos, which is an important problem in real-world applications but also challenging due to background clutters, category diversities, noisy labels, etc. We address this problem by two contributions. First, we introduce a novel large-scale product image dataset, termed as Product-90. Instead of collecting product images by labor- and time-intensive image capturing, we take advantage of the web and download images from the reviews of several e-commerce websites where the images are casually captured by consumers. Labels are assigned automatically by the categories of e-commerce websites. Totally the Product-90 consists of more than 140K images with 90 categories. Due to the fact that consumers may upload unrelated images, it is inevitable that our Product-90 introduces noisy labels. As the second contribution, we develop a simple yet efficient guidance learning (GL) method for training convolutional neural networks (CNNs) with noisy supervision. The GL method first trains an initial teacher network with the full noisy dataset, and then trains a target/student network with both large-scale noisy set and small manually-verified clean set in a multi-task manner. Specifically, in the stage of student network training, the large-scale noisy data is supervised by its guidance knowledge which is the combination of its given noisy label and the soften label from the teacher network. We conduct extensive experiments on our Product-90 and public datasets, namely Food101, Food-101N, and Clothing1M. Our guidance learning method achieves performance superior to state-of-the-art methods on these datasets., Comment: 10 pages
Published: 2019

39. Accurate and Robust Pulmonary Nodule Detection by 3D Feature Pyramid Network with Self-supervised Feature Learning

Author: Liu, Jingya, Cao, Liangliang, Akin, Oguz, and Tian, Yingli
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: Accurate detection of pulmonary nodules with high sensitivity and specificity is essential for automatic lung cancer diagnosis from CT scans. Although many deep learning-based algorithms make great progress for improving the accuracy of nodule detection, the high false positive rate is still a challenging problem which limits the automatic diagnosis in routine clinical practice. Moreover, the CT scans collected from multiple manufacturers may affect the robustness of Computer-aided diagnosis (CAD) due to the differences in intensity scales and machine noises. In this paper, we propose a novel self-supervised learning assisted pulmonary nodule detection framework based on a 3D Feature Pyramid Network (3DFPN) to improve the sensitivity of nodule detection by employing multi-scale features to increase the resolution of nodules, as well as a parallel top-down path to transit the high-level semantic features to complement low-level general features. Furthermore, a High Sensitivity and Specificity (HS2) network is introduced to eliminate the false positive nodule candidates by tracking the appearance changes in continuous CT slices of each nodule candidate on Location History Images (LHI). In addition, in order to improve the performance consistency of the proposed framework across data captured by different CT scanners without using additional annotations, an effective self-supervised learning schema is applied to learn spatiotemporal features of CT scans from large-scale unlabeled data. The performance and robustness of our method are evaluated on several publicly available datasets with significant performance improvements. The proposed framework is able to accurately detect pulmonary nodules with high sensitivity and specificity and achieves 90.6% sensitivity with 1/8 false positive per scan which outperforms the state-of-the-art results 15.8% on LUNA16 dataset., Comment: 15 pages, 8 figures, 5 tables, under review by Medical Image Analysis. arXiv admin note: substantial text overlap with arXiv:1906.03467
Published: 2019

40. 3DFPN-HS$^2$: 3D Feature Pyramid Network Based High Sensitivity and Specificity Pulmonary Nodule Detection

Author: Liu, Jingya, Cao, Liangliang, Akin, Oguz, and Tian, Yingli
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: Accurate detection of pulmonary nodules with high sensitivity and specificity is essential for automatic lung cancer diagnosis from CT scans. Although many deep learning-based algorithms make great progress for improving the accuracy of nodule detection, the high false positive rate is still a challenging problem which limited the automatic diagnosis in routine clinical practice. In this paper, we propose a novel pulmonary nodule detection framework based on a 3D Feature Pyramid Network (3DFPN) to improve the sensitivity of nodule detection by employing multi-scale features to increase the resolution of nodules, as well as a parallel top-down path to transit the high-level semantic features to complement low-level general features. Furthermore, a High Sensitivity and Specificity (HS$^2$) network is introduced to eliminate the falsely detected nodule candidates by tracking the appearance changes in continuous CT slices of each nodule candidate. The proposed framework is evaluated on the public Lung Nodule Analysis (LUNA16) challenge dataset. Our method is able to accurately detect lung nodules at high sensitivity and specificity and achieves 90.4% sensitivity with 1/8 false positive per scan which outperforms the state-of-the-art results 15.6%., Comment: 8 pages, 3 figures. Accepted to MICCAI 2019
Published: 2019

41. Automatic adaptation of object detectors to new domains using self-training

Author: RoyChowdhury, Aruni, Chakrabarty, Prithvijit, Singh, Ashish, Jin, SouYoung, Jiang, Huaizu, Cao, Liangliang, and Learned-Miller, Erik
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: This work addresses the unsupervised adaptation of an existing object detector to a new target domain. We assume that a large number of unlabeled videos from this domain are readily available. We automatically obtain labels on the target data by using high-confidence detections from the existing detector, augmented with hard (misclassified) examples acquired by exploiting temporal cues using a tracker. These automatically-obtained labels are then used for re-training the original model. A modified knowledge distillation loss is proposed, and we investigate several ways of assigning soft-labels to the training examples from the target domain. Our approach is empirically evaluated on challenging face and pedestrian detection tasks: a face detector trained on WIDER-Face, which consists of high-quality images crawled from the web, is adapted to a large-scale surveillance data set; a pedestrian detector trained on clear, daytime images from the BDD-100K driving data set is adapted to all other scenarios such as rainy, foggy, night-time. Our results demonstrate the usefulness of incorporating hard examples obtained from tracking, the advantage of using soft-labels via distillation loss versus hard-labels, and show promising performance as a simple method for unsupervised domain adaptation of object detectors, with minimal dependence on hyper-parameters., Comment: Accepted at CVPR 2019
Published: 2019

42. Learning Deterministic Policy with Target for Power Control in Wireless Networks

Author: Lu, Yujiao, Lu, Hancheng, Cao, Liangliang, Wu, Feng, and Zhu, Daren
Subjects: Computer Science - Networking and Internet Architecture, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Inter-Cell Interference Coordination (ICIC) is a promising way to improve energy efficiency in wireless networks, especially where small base stations are densely deployed. However, traditional optimization based ICIC schemes suffer from severe performance degradation with complex interference pattern. To address this issue, we propose a Deep Reinforcement Learning with Deterministic Policy and Target (DRL-DPT) framework for ICIC in wireless networks. DRL-DPT overcomes the main obstacles in applying reinforcement learning and deep learning in wireless networks, i.e. continuous state space, continuous action space and convergence. Firstly, a Deep Neural Network (DNN) is involved as the actor to obtain deterministic power control actions in continuous space. Then, to guarantee the convergence, an online training process is presented, which makes use of a dedicated reward function as the target rule and a policy gradient descent algorithm to adjust DNN weights. Experimental results show that the proposed DRL-DPT framework consistently outperforms existing schemes in terms of energy efficiency and throughput under different wireless interference scenarios. More specifically, it improves up to 15% of energy efficiency with faster convergence rate., Comment: 7 pages, 7 figures, GlobeCom2018
Published: 2019

43. The photoluminescence properties of a blue phosphor—Eu2+ doped silicate lutetium strontium with triple sites

Author: Zhou, Zhen, Zhang, Qian, Jin, Ye, Lin, Huayan, Li, Yuyan, Fang, Fei, Chen, Hongtao, Wei, Qiang, Lin, Hong, Feng, Guoqing, Meng, Fancheng, Cao, Liangliang, and Ren, Haishen
Published: 2023
Full Text: View/download PDF

44. All-solid-state printable supercapacitors based on bimetallic sulfide NiCo2S4 with in-plane interdigital electrode architecture

Author: Tian, Zhongqing, Wang, Dandan, Zhang, Chunyan, Meng, Fancheng, Cao, Liangliang, and Lin, Huixing
Published: 2022
Full Text: View/download PDF

45. Effect of BN addition on mechanical and electrical properties of oxidative sintered porous Si3N4-BN composites

Author: Lin, Hong, Feng, Guoqing, Wang, Hongyu, Peng, Haiyi, Cao, Liangliang, Lin, Huixing, Li, Hongtao, and Meng, Fancheng
Published: 2023
Full Text: View/download PDF

46. Matrix Factorization on GPUs with Memory Optimization and Approximate Computing

Author: Tan, Wei, Chang, Shiyu, Fong, Liana, Li, Cheng, Wang, Zijun, and Cao, Liangliang
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: Matrix factorization (MF) discovers latent features from observations, which has shown great promise in the fields of collaborative filtering, data compression, feature extraction, word embedding, etc. While many problem-specific optimization techniques have been proposed, alternating least square (ALS) remains popular due to its general applicability e.g. easy to handle positive-unlabeled inputs, fast convergence and parallelization capability. Current MF implementations are either optimized for a single machine or with a need of a large computer cluster but still are insufficient. This is because a single machine provides limited compute power for large-scale data while multiple machines suffer from the network communication bottleneck. To address the aforementioned challenge, accelerating ALS on graphics processing units (GPUs) is a promising direction. We propose the novel approach in enhancing the MF efficiency via both memory optimization and approximate computing. The former exploits GPU memory hierarchy to increase data reuse, while the latter reduces unnecessary computing without hurting the convergence of learning algorithms. Extensive experiments on large-scale datasets show that our solution not only outperforms the competing CPU solutions by a large margin but also has a 2x-4x performance gain compared to the state-of-the-art GPU solutions. Our implementations are open-sourced and publicly available.
Published: 2018

47. Focal Visual-Text Attention for Visual Question Answering

Author: Liang, Junwei, Jiang, Lu, Cao, Liangliang, Li, Li-Jia, and Hauptmann, Alexander
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent insights on language and vision with neural networks have been successfully applied to simple single-image visual question answering. However, to tackle real-life question answering problems on multimedia collections such as personal photos, we have to look at whole collections with sequences of photos or videos. When answering questions from a large collection, a natural problem is to identify snippets to support the answer. In this paper, we describe a novel neural network called Focal Visual-Text Attention network (FVTA) for collective reasoning in visual question answering, where both visual and text sequence information such as images and text metadata are presented. FVTA introduces an end-to-end approach that makes use of a hierarchical process to dynamically determine what media and what time to focus on in the sequential data to answer the question. FVTA can not only answer the questions well but also provides the justifications which the system results are based upon to get the answers. FVTA achieves state-of-the-art performance on the MemexQA dataset and competitive results on the MovieQA dataset., Comment: In CVPR 2018. Code, models and dataset are available here: https://memexqa.cs.cmu.edu/
Published: 2018
Full Text: View/download PDF

48. Pluronic F127-modified BaTiO3 for ceramic/polymer nanocomposite dielectric capacitor with enhanced energy storage performance

Author: Chen, Jian, Zhou, Chuang, Cai, Wei, Huang, Fuxiang, Zhang, Chunyan, Cao, Liangliang, and Meng, Fancheng
Subjects: Electrical conductivity -- Analysis, Energy storage -- Methods, Dielectrics -- Usage, Capacitors -- Design and construction -- Materials, Engineering and manufacturing industries, Science and technology
Abstract: Ceramic/polymer nanocomposites have been extensively explored for dielectric capacitor application due to their high energy storage performance and ease of processing, flexibility, and low cost. However, improving compatibility between inorganic and organic materials is of great significance and has been a long-standing challenge. In this work, the polymer surfactant Pluronic™ F127 was employed to perform surface modification of BaTiO3 through a facile solution process. The compatibility between BaTiO3 nanoparticles and P(VDF-HFP) can be remarkably strengthened by the affinity between F127 polymer chains and the organic matrix. Structural defects such as pores, cracks, and agglomeration of inorganic particles were obviously reduced, and interfacial polarization has been significantly enhanced. The discharged energy density U_d and charge-discharge efficiency η reach up to 5.0 J/cm³ and 58.1% under the breakdown strength of 3740 kV/cm for the nanocomposite containing 1 vol% BT@F127. These dielectric properties are clearly better than the nanocomposite using unmodified BT, as well as similar hybrids reported previously. KEYWORDS ceramic/polymer nanocomposite, dielectric property, energy density, surface modification, 1 | INTRODUCTION Dielectric capacitors are the basic components of energy storage in modern electronic and electrical systems. They have been widely used in the fields of power transmission, hybrid [...]
Published: 2022
Full Text: View/download PDF

49. Enhanced energy storage density in poly(vinylidene fluoride-hexafluoropropylene) nanocomposites by filling with core-shell structured BaTiO3@MgO nanoparticals

Author: Chen, Jian, Huang, Fuxiang, Zhang, Chunyan, Meng, Fancheng, Cao, Liangliang, and Lin, Huixing
Published: 2022
Full Text: View/download PDF

50. Structure and dielectric properties of BaTi1-x(Sb0.5Nb0.5)xO3 ceramics

Author: Yang, Shu, Zhao, Kai, Tian, Zhongqing, Cao, Liangliang, Zhang, Chunyan, and Meng, Fancheng
Published: 2023
Full Text: View/download PDF
