Author: "Shu, Fangxun" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Shu, Fangxun"' showing total 10 results

Start Over Author "Shu, Fangxun"

10 results on '"Shu, Fangxun"'

1. SAG: Style-Aligned Article Generation via Model Collaboration

Author: Xu, Chenning, Shu, Fangxun, Jin, Dian, Wei, Jinghao, and Jiang, Hao
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) have increased the demand for personalized and stylish content generation. However, closed-source models like GPT-4 present limitations in optimization opportunities, while the substantial training costs and inflexibility of open-source alternatives, such as Qwen-72B, pose considerable challenges. Conversely, small language models (SLMs) struggle with understanding complex instructions and transferring learned capabilities to new contexts, often exhibiting more pronounced limitations. In this paper, we present a novel collaborative training framework that leverages the strengths of both LLMs and SLMs for style article generation, surpassing the performance of either model alone. We freeze the LLMs to harness their robust instruction-following capabilities and subsequently apply supervised fine-tuning on the SLM using style-specific data. Additionally, we introduce a self-improvement method to enhance style consistency. Our new benchmark, NoteBench, thoroughly evaluates style-aligned generation. Extensive experiments show that our approach achieves state-of-the-art performance, with improvements of 0.78 in ROUGE-L and 0.55 in BLEU-4 scores compared to GPT-4, while maintaining a low hallucination rate regarding factual and faithfulness.
Published: 2024

2. LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

Author: Shu, Fangxun, Liao, Yue, Zhuo, Le, Xu, Chenning, Zhang, Lei, Zhang, Guanghao, Shi, Haonan, Chen, Long, Zhong, Tao, He, Wanggui, Fu, Siming, Li, Haoyuan, Li, Bolin, Yu, Zhelun, Liu, Si, Li, Hongsheng, and Jiang, Hao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from large-scale MLLM (l-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, striking a balance between computational efficiency and model expressiveness. Second, we propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration. This strategy begins with mimic distillation, where we minimize the Kullback-Leibler (KL) divergence between output distributions to enable the student model to emulate the teacher network's understanding. Following this, we introduce preference distillation via Direct Preference Optimization (DPO), where the key lies in treating l-MLLM as the reference model. During this phase, the s-MLLM's ability to discriminate between superior and inferior examples is significantly enhanced beyond l-MLLM, leading to a better student that surpasses its teacher, particularly in hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD outperforms existing models across various multimodal benchmarks while maintaining a minimal number of activated parameters and low computational costs. Remarkably, LLaVA-MoD, with only 2B activated parameters, surpasses Qwen-VL-Chat-7B by an average of 8.8% across benchmarks, using merely 0.3% of the training data and 23% trainable parameters. These results underscore LLaVA-MoD's ability to effectively distill comprehensive knowledge from its teacher model, paving the way for the development of more efficient MLLMs. The code will be available on: https://github.com/shufangxun/LLaVA-MoD.
Published: 2024

3. MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

Author: He, Wanggui, Fu, Siming, Liu, Mushui, Wang, Xierui, Xiao, Wenyi, Shu, Fangxun, Wang, Yi, Zhang, Lei, Yu, Zhelun, Li, Haoyuan, Huang, Ziwei, Gan, LeiLei, and Jiang, Hao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Auto-regressive models have made significant progress in the realm of language generation, yet they do not perform on par with diffusion models in the domain of image synthesis. In this work, we introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE). This innovative component integrates pre-trained LLMs by independently processing linguistic and visual information, freezing the textual component while fine-tuning the visual component. This methodology preserves the NLP capabilities of LLMs while imbuing them with exceptional visual understanding. Building upon the powerful base of the pre-trained Qwen-7B, MARS stands out with its bilingual generative capabilities corresponding to both English and Chinese language prompts and the capacity for joint image and text generation. The flexibility of this framework lends itself to migration towards any-to-any task adaptability. Furthermore, MARS employs a multi-stage training strategy that first establishes robust image-text alignment through complementary bidirectional tasks and subsequently concentrates on refining the T2I generation process, significantly augmenting text-image synchrony and the granularity of image details. Notably, MARS requires only 9% of the GPU days needed by SD1.5, yet it achieves remarkable results across a variety of benchmarks, illustrating the training efficiency and the potential for swift deployment in various applications., Comment: 14 pages, 9 figures
Published: 2024

4. Autoregressive Pretraining with Mamba in Vision

Author: Ren, Sucheng, Li, Xianhang, Tu, Haoqin, Wang, Feng, Shu, Fangxun, Zhang, Lei, Mei, Jieru, Yang, Linjie, Wang, Peng, Wang, Heng, Yuille, Alan, and Xie, Cihang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, the autoregressive nature can well capitalize on the Mamba's unidirectional recurrent structure, enabling faster overall training speed compared to other training strategies like mask modeling. Performance-wise, autoregressive pretraining equips the Mamba architecture with markedly higher accuracy over its supervised-trained counterparts and, more importantly, successfully unlocks its scaling potential to large and even huge model sizes. For example, with autoregressive pretraining, a base-size Mamba attains 83.2\% ImageNet accuracy, outperforming its supervised counterpart by 2.0\%; our huge-size Mamba, the largest Vision Mamba to date, attains 85.0\% ImageNet accuracy (85.5\% when finetuned with $384\times384$ inputs), notably surpassing all other Mamba variants in vision. The code is available at \url{https://github.com/OliverRensu/ARM}.
Published: 2024

5. HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Author: Zhang, Wenqiao, Lin, Tianwei, Liu, Jiang, Shu, Fangxun, Li, Haoyuan, Zhang, Lei, Wanggui, He, Zhou, Hao, Lv, Zheqi, Jiang, Hao, Li, Juncheng, Tang, Siliang, and Zhuang, Yueting
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, \emph{e.g.}, LLaVA, transforms visual features into text-like tokens using a \emph{static} vision-language mapper, thereby enabling \emph{static} LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the \emph{static} tuning strategy~\footnote{The static tuning refers to the trained model with static parameters.} that shares the same parameters may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert, respectively. These experts are derived from HyperNetworks, which generates adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. ~\footnote{Our project is available on the link https://github.com/DCDmllm/HyperLLaVA}.
Published: 2024

6. Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data

Author: Zhang, Lei, Shu, Fangxun, Liu, Tianyang, Ren, Sucheng, Jiang, Hao, and Xie, Cihang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The increasing availability of image-text pairs has largely fueled the rapid advancement in vision-language foundation models. However, the vast scale of these datasets inevitably introduces significant variability in data quality, which can adversely affect the model performance. This highlights the critical role of data filtering, not only to enhance training efficiency but also to improve overall data quality. Existing methods typically rely on metrics such as CLIP Score and BLIP Score, which are derived from pre-trained models. However, these models are often trained on uncurated, noisy datasets, which can perpetuate errors and misalignments in the filtered dataset. We present a novel algorithm that incorporates human knowledge on image-text alignment to guide filtering vast corpus of web-crawled image-text datasets into a compact and high-quality form. To systemically capture human preferences on image-text alignments, we collect a diverse image-text dataset where each image is associated with multiple captions from various sources, and establish a comprehensive set of both subjective and objective criteria for critically guiding the alignment assessment from labelers. Additionally, we train a reward model on these human-preference annotations to internalize the nuanced human understanding of image-text alignment. The resulting reward model thus can act as a human-like referee to filter image-text pairs. Extensive experiments demonstrate that we can maintain, sometimes even improve, model performance while compressing the image-text datasets up to ~90%. An impressive example is that, by aggressively reducing the total training sample from 130M to only 15.5M, our BLIP-B/16 models consistently show an average improvement of 2.9% on retrieval tasks and 11.5% on captioning tasks compared to full-size-dataset counterparts.
Published: 2023

7. Audio-Visual LLM for Video Understanding

Author: Shu, Fangxun, Zhang, Lei, Jiang, Hao, and Xie, Cihang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper presents Audio-Visual LLM, a Multimodal Large Language Model that takes both visual and auditory inputs for holistic video understanding. A key design is the modality-augmented training, which involves the integration of modality-specific tokens engineered to activate the appropriate visual and/or auditory encoder selectively. This mechanism is pivotal in enabling end-to-end joint training with video data at different modalities, including visual-only, audio-only, and audio-visual formats. Moreover, we introduce a high-quality video instruction dataset, derived from GPT-4. This dataset allows Audio-Visual LLM to adeptly process a variety of task-oriented video instructions, ranging from multi-turn conversations and audio-visual narratives to complex reasoning tasks. Extensive experiments demonstrate that Audio-Visual LLM impressively achieves strong zero-shot results across a range of video understanding tasks. For example, Audio-Visual LLM achieves an accuracy of 53.7% on MSRVTT-QA, outperforming non-LLM-based InterVideo by 6.6% and LLM-based Valley by 4.4%, respectively. Additionally, our Audio-Visual LLM also achieves competitive performance on audio tasks (e.g., AudioCaps).
Published: 2023

8. Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

Author: Shu, Fangxun, Chen, Biaolong, Liao, Yue, Xiao, Shuwen, Sun, Wenyu, Li, Xiaobo, Zhu, Yousong, Wang, Jinqiao, and Liu, Si
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC), for video-text retrieval tasks. Our MAC aims to reduce video representation's spatial and temporal redundancy in the VidLP model by a mask sampling mechanism to improve pre-training efficiency. Comparing conventional temporal sparse sampling, we propose to randomly mask a high ratio of spatial regions and only feed visible regions into the encoder as sparse spatial sampling. Similarly, we adopt the mask sampling technique for text inputs for consistency. Instead of blindly applying the mask-then-prediction paradigm from MAE, we propose a masked-then-alignment paradigm for efficient video-text alignment. The motivation is that video-text retrieval tasks rely on high-level alignment rather than low-level reconstruction, and multimodal alignment with masked modeling encourages the model to learn a robust and general multimodal representation from incomplete and unstable inputs. Coupling these designs enables efficient end-to-end pre-training: reduce FLOPs (60% off), accelerate pre-training (by 3x), and improve performance. Our MAC achieves state-of-the-art results on various video-text retrieval datasets, including MSR-VTT, DiDeMo, and ActivityNet. Our approach is omnivorous to input modalities. With minimal modifications, we achieve competitive results on image-text retrieval tasks., Comment: Technical Report
Published: 2022

9. Compress & Align: Curating Image-Text Data with Human Knowledge

Author: Zhang, Lei, Shu, Fangxun, Ren, Sucheng, Zhao, Bingchen, Jiang, Hao, Xie, Cihang, Zhang, Lei, Shu, Fangxun, Ren, Sucheng, Zhao, Bingchen, Jiang, Hao, and Xie, Cihang
Abstract: The massive growth of image-text data through web crawling inherently presents the challenge of variability in data quality. This paper introduces a novel algorithm, rooted in human knowledge, to compress this vast corpus of web-crawled image-text datasets to a compact and high-quality form. Our method unfolds in three major steps. First, we collect an image-text dataset, wherein each image is associated with multiple captions sourced from diverse origins. Then, to systemically capture human preferences regarding the best caption paired with each image, we establish a comprehensive set of both subjective and objective criteria for critically guiding the alignment assessment from labelers. Lastly, we train a reward model on the annotated dataset to internalize the nuanced human understanding of image-text alignment. The resulting reward model thus can act as a human-like referee to filter misaligned/low-quality image-text pairs. Extensive experiments demonstrate that we are able to secure (or even improve) model performance by compressing the image-text datasets up to ~90%. An impressive example is that, by aggressively reducing the total training sample from 130M to 15.5M (e.g., ~9x smaller), our BLIP-B/16 models still consistently show superior performance compared with the full-size-dataset counterpart on image-text retrieval (Flickr30K, COCO) by ~2.5% in Recall@1, and on image-captioning (Nocaps, COCO) by ~10.0% in CIDEr and ~2.7% in SPICE.
Published: 2023

10. Lossless Image Compression Using Dynamic Block Encoding

Author: Shu, Fangxun, primary and Yang, Xuelin, additional
Published: 2020
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

10 results on '"Shu, Fangxun"'

1. SAG: Style-Aligned Article Generation via Model Collaboration

2. LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

3. MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

4. Autoregressive Pretraining with Mamba in Vision

5. HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

6. Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data

7. Audio-Visual LLM for Video Understanding

8. Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

9. Compress & Align: Curating Image-Text Data with Human Knowledge

10. Lossless Image Compression Using Dynamic Block Encoding

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Database

Publisher

10 results on '"Shu, Fangxun"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources