Author: "Liu Fenglin" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Liu Fenglin"' showing total 901 results

1. SLaVA-CXR: Small Language and Vision Assistant for Chest X-ray Report Automation

Author: Wu, Jinge, Kim, Yunsoo, Shi, Daqian, Clifton, David, Liu, Fenglin, and Wu, Honghan
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Inspired by the success of large language models (LLMs), there is growing research interest in developing LLMs in the medical domain to assist clinicians. However, for hospitals, using closed-source commercial LLMs involves privacy issues, and developing open-source public LLMs requires large-scale computational resources, which are usually limited, especially in resource-limited regions and low-income countries. We propose an open-source Small Language and Vision Assistant (SLaVA-CXR) that can be used for Chest X-Ray report automation. To efficiently train a small assistant, we first propose the Re$^3$Training method, which simulates the cognitive development of radiologists and optimizes the model in the Recognition, Reasoning, and Reporting training manner. Then, we introduce a data synthesis method, RADEX, which can generate a high-quality and diverse training corpus with privacy regulation compliance. The extensive experiments show that our SLaVA-CXR built on a 2.7B backbone not only outperforms previous state-of-the-art larger models but also achieves 6 times faster inference.
Published: 2024

2. Applying and Evaluating Large Language Models in Mental Health Care: A Scoping Review of Human-Assessed Generative Tasks

Author: Hua, Yining, Na, Hongbin, Li, Zehan, Liu, Fenglin, Fang, Xiao, Clifton, David, and Torous, John
Subjects: Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) are emerging as promising tools for mental health care, offering scalable support through their ability to generate human-like responses. However, the effectiveness of these models in clinical settings remains unclear. This scoping review aimed to assess the current generative applications of LLMs in mental health care, focusing on studies where these models were tested with human participants in real-world scenarios. A systematic search across APA PsycNet, Scopus, PubMed, and Web of Science identified 726 unique articles, of which 17 met the inclusion criteria. These studies encompassed applications such as clinical assistance, counseling, therapy, and emotional support. However, the evaluation methods were often non-standardized, with most studies relying on ad hoc scales that limit comparability and robustness. Privacy, safety, and fairness were also frequently underexplored. Moreover, reliance on proprietary models, such as OpenAI's GPT series, raises concerns about transparency and reproducibility. While LLMs show potential in expanding mental health care access, especially in underserved areas, the current evidence does not fully support their use as standalone interventions. More rigorous, standardized evaluations and ethical oversight are needed to ensure these tools can be safely and effectively integrated into clinical practice.
Published: 2024

3. MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context

Author: Gu, Zishan, Yin, Changchang, Liu, Fenglin, and Zhang, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data, inspiring a large number of studies on LVLM fine-tuning and training. Despite their advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets. In this study, we introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate the hallucination of domain-specific LVLMs. MedVH comprises five tasks to evaluate hallucinations in LVLMs within the medical context, which includes tasks for comprehensive understanding of textual and visual input, as well as long textual response generation. Our extensive experiments with both general and medical LVLMs reveal that, although medical LVLMs demonstrate promising performance on standard medical tasks, they are particularly susceptible to hallucinations, often more so than the general models, raising significant concerns about the reliability of these domain-specific models. For medical LVLMs to be truly valuable in real-world applications, they must not only accurately integrate medical knowledge but also maintain robust reasoning abilities to prevent hallucination. Our work paves the way for future evaluations of these studies.
Published: 2024

4. DTR-Bench: An in silico Environment and Benchmark Platform for Reinforcement Learning Based Dynamic Treatment Regime

Author: Luo, Zhiyao, Zhu, Mingcheng, Liu, Fenglin, Li, Jiali, Pan, Yangchen, Zhou, Jiandong, and Zhu, Tingting
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Reinforcement learning (RL) has garnered increasing recognition for its potential to optimise dynamic treatment regimes (DTRs) in personalised medicine, particularly for drug dosage prescriptions and medication recommendations. However, a significant challenge persists: the absence of a unified framework for simulating diverse healthcare scenarios and a comprehensive analysis to benchmark the effectiveness of RL algorithms within these contexts. To address this gap, we introduce DTR-Bench, a benchmarking platform comprising four distinct simulation environments tailored to common DTR applications, including cancer chemotherapy, radiotherapy, glucose management in diabetes, and sepsis treatment. We evaluate various state-of-the-art RL algorithms across these settings, particularly highlighting their performance amidst real-world challenges such as pharmacokinetic/pharmacodynamic (PK/PD) variability, noise, and missing data. Our experiments reveal varying degrees of performance degradation among RL algorithms in the presence of noise and patient variability, with some algorithms failing to converge. Additionally, we observe that using temporal observation representations does not consistently lead to improved performance in DTR settings. Our findings underscore the necessity of developing robust, adaptive RL algorithms capable of effectively managing these complexities to enhance patient-specific healthcare. We have open-sourced our benchmark and code at https://github.com/GilesLuo/DTR-Bench., Comment: 13 pages for main content
Published: 2024

5. Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning

Author: Gu, Zishan, Liu, Fenglin, Yin, Changchang, and Zhang, Ping
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: The adoption of large language models (LLMs) in healthcare has attracted significant research interest. However, their performance in healthcare remains under-investigated and potentially limited, because i) they lack rich domain-specific knowledge and medical reasoning skills; and ii) most state-of-the-art LLMs are unimodal, text-only models that cannot directly process multimodal inputs. To this end, we propose a multimodal medical collaborative reasoning framework, MultiMedRes, which incorporates a learner agent to proactively gain essential information from domain-specific expert models, to solve medical multimodal reasoning problems. Our method includes three steps: i) Inquire: The learner agent first decomposes given complex medical reasoning problems into multiple domain-specific sub-problems; ii) Interact: The agent then interacts with domain-specific expert models by repeating the "ask-answer" process to progressively obtain different domain-specific knowledge; iii) Integrate: The agent finally integrates all the acquired domain-specific knowledge to accurately address the medical reasoning problem. We validate the effectiveness of our method on the task of difference visual question answering for X-ray images. The experiments demonstrate that our zero-shot prediction achieves state-of-the-art performance, and even outperforms the fully supervised methods. Besides, our approach can be incorporated into various LLMs and multimodal LLMs to significantly boost their performance.
Published: 2024

6. Neuropilin-1 Expression in Treg and PDC Cells in Peripheral Blood of Non-small Cell Lung Cancer Patients and Clinical Significance

Author: SHI Xingyue, CHEN Chong, LIU Fenglin, LYU Di, XU Jie, and HAN Zhengxiang
Subjects: non-small cell lung cancer(nsclc), neuropilin-1(nrp-1), treg, pdc, flow cytology, Neoplasms. Tumors. Oncology. Including cancer and carcinogens, RC254-282
Abstract: Objective: To investigate Neuropilin-1 (NRP-1) expression in Treg and PDC cells in the peripheral blood of non-small cell lung cancer (NSCLC) patients and its clinical significance. Methods: We collected peripheral blood from 49 patients diagnosed with NSCLC, with 33 healthy subjects as the control group. NRP-1 expression on Treg and PDC cells in peripheral blood was detected by flow cytometry. Results: The expression rate of NRP-1 in Treg cells of NSCLC patients was significantly higher than that of healthy subjects (P < 0.05). The expression of NRP-1 in Treg cells of NSCLC patients was correlated with the maximum tumor diameter, lymph node metastasis, TNM stage, and degree of differentiation (P < 0.05); the expression in PDC cells was related to the maximum tumor diameter, lymph node metastasis, and TNM stage (P < 0.05). Conclusion: The expression of NRP-1 in peripheral blood Treg and PDC cells may be associated with immunoregulation of non-small cell lung cancer. It may be a tumor marker for predicting the prognosis of NSCLC patients, and a new target for lung cancer treatment.
Published: 2020

7. Female Objects and Feminist Consciousness for the Purpose to Awake Readers’ Awareness: A Comparative Analysis between Angela Carter’s The Bloody Chamber and Anne Sexton’s Transformations

Author: Liu Fenglin
Subjects: feminism, female object, feminist consciousness, fairy tale, psychoanalysis, Technology (General), T1-995, Social sciences (General), H1-99
Abstract: The female object, a symbolic image created by male authors to reduce the threat that women pose to patriarchy, has become a means of expressing male sexual and domestic fantasies. However, in the fairy-tale adaptations by two mid-twentieth-century female authors, Angela Carter and Anne Sexton, the female object is used to evoke feminist consciousness. Although previous studies have covered some feminist issues, for instance, feminist awareness through the mirror image in Angela Carter’s The Bloody Chamber, and direct metaphors such as “doll” and “soap pop” that lead to female objectification in Anne Sexton’s Transformations, little research has compared the distinctive psychological impacts that the narrative forms of the two texts have on readers. The first section of this paper examines how both authors deconstruct female stereotypes and reinterpret modes of female agency in the original Grimm fairy tales and, from the writers’ perspective, explores the expression of female objects in their works. The second section focuses on the psychoanalysis of Lacan’s mirror stage and covers the awakening processes presented in the mirror images and symbols in the two adaptations. The third section illustrates the different narrative strategies that Carter and Sexton use to stimulate readers’ responses towards feminist consciousness.
Published: 2020

8. Large Language Models in Mental Health Care: a Scoping Review

Author: Hua, Yining, Liu, Fenglin, Yang, Kailai, Li, Zehan, Na, Hongbin, Sheu, Yi-han, Zhou, Peilin, Moran, Lauren V., Ananiadou, Sophia, Beam, Andrew, and Torous, John
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: The integration of large language models (LLMs) in mental health care is an emerging field. There is a need to systematically review the application outcomes and delineate the advantages and limitations in clinical settings. This review aims to provide a comprehensive overview of the use of LLMs in mental health care, assessing their efficacy, challenges, and potential for future applications. A systematic search was conducted across multiple databases including PubMed, Web of Science, Google Scholar, arXiv, medRxiv, and PsyArXiv in November 2023. All forms of original research, peer-reviewed or not, published or disseminated between October 1, 2019, and December 2, 2023, are included without language restrictions if they used LLMs developed after T5 and directly addressed research questions in mental health care settings. From an initial pool of 313 articles, 34 met the inclusion criteria based on their relevance to LLM application in mental health care and the robustness of reported outcomes. Diverse applications of LLMs in mental health care are identified, including diagnosis, therapy, patient engagement enhancement, etc. Key challenges include data availability and reliability, nuanced handling of mental states, and effective evaluation methods. Despite successes in accuracy and accessibility improvement, gaps in clinical applicability and ethical considerations were evident, pointing to the need for robust data, standardized evaluations, and interdisciplinary collaboration. LLMs hold substantial promise for enhancing mental health care. For their full potential to be realized, emphasis must be placed on developing robust datasets, development and evaluation frameworks, ethical guidelines, and interdisciplinary collaborations to address current limitations.
Published: 2024

9. A Survey of Large Language Models in Medicine: Progress, Application, and Challenge

Author: Zhou, Hongjian, Liu, Fenglin, Gu, Boyang, Zou, Xinyu, Huang, Jinfa, Wu, Jinge, Li, Yiru, Chen, Sam S., Zhou, Peilin, Liu, Junling, Hua, Yining, Mao, Chengfeng, You, Chenyu, Wu, Xian, Zheng, Yefeng, Clifton, Lei, Li, Zheng, Luo, Jiebo, and Clifton, David A.
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs), such as ChatGPT, have received substantial attention due to their capabilities for understanding and generating human language. While there has been a burgeoning trend in research focusing on the employment of LLMs in supporting different medical tasks (e.g., enhancing clinical diagnostics and providing medical education), a review of these efforts, particularly their development, practical applications, and outcomes in medicine, remains scarce. Therefore, this review aims to provide a detailed overview of the development and deployment of LLMs in medicine, including the challenges and opportunities they face. In terms of development, we provide a detailed introduction to the principles of existing medical LLMs, including their basic model structures, number of parameters, and sources and scales of data used for model development. It serves as a guide for practitioners in developing medical LLMs tailored to their specific needs. In terms of deployment, we offer a comparison of the performance of different LLMs across various medical tasks, and further compare them with state-of-the-art lightweight models, aiming to provide an understanding of the advantages and limitations of LLMs in medicine. Overall, in this review, we address the following questions: 1) What are the practices for developing medical LLMs? 2) How to measure the medical task performance of LLMs in a medical setting? 3) How have medical LLMs been employed in real-world practice? 4) What challenges arise from the use of medical LLMs? and 5) How to more effectively develop and deploy medical LLMs? By answering these questions, this review aims to provide insights into the opportunities for LLMs in medicine and serve as a practical resource. We also maintain a regularly updated list of practical guides on medical LLMs at https://github.com/AI-in-Health/MedLLMsPracticalGuide, Comment: Preprint. Version 6. Update Figures 1-5; Tables 2-3; 31 pages
Published: 2023

10. Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model

Author: Ye, Qichen, Liu, Junling, Chong, Dading, Zhou, Peilin, Hua, Yining, Liu, Fenglin, Cao, Meng, Wang, Ziming, Cheng, Xuxin, Lei, Zhu, and Guo, Zhenhua
Subjects: Computer Science - Computation and Language
Abstract: Integrating large language models (LLMs) into healthcare holds great potential but faces challenges. Pre-training LLMs from scratch for domains like medicine is resource-heavy and often unfeasible. On the other hand, sole reliance on Supervised Fine-tuning (SFT) can result in overconfident predictions and may not tap into domain-specific insights. In response, we present a multi-stage training method combining Domain-specific Continued Pre-training (DCPT), SFT, and Direct Preference Optimization (DPO). In addition, we publish a 3Gb Chinese Medicine (ChiMed) dataset, encompassing medical question answering, plain texts, knowledge graphs, and dialogues, segmented into three training stages. The medical LLM trained with our pipeline, Qilin-Med, shows substantial performance improvement. In the CPT and SFT phases, Qilin-Med achieved 38.4% and 40.0% accuracy on the CMExam test set, respectively. It outperformed the base model Baichuan-7B (accuracy: 33.5%) by 7.5%. In the DPO phase, it scored 16.66 in BLEU-1 and 27.44 in ROUGE-1 on the Huatuo-26M test set, bringing further improvement to the SFT phase (12.69 in BLEU-1 and 24.21 in ROUGE-1). Additionally, we have further enhanced the model's performance through the Retrieval Augmented Generation (RAG) approach. Experiments demonstrate that Qilin-Med-RAG achieves an accuracy rate of 42.8% on CMExam. These results highlight the contribution of our novel training approach in building LLMs for medical applications.
Published: 2023

11. OSNet & MNetO: Two Types of General Reconstruction Architectures for Linear Computed Tomography in Multi-Scenarios

Author: Wang, Zhisheng, Deng, Zihan, Liu, Fenglin, Huang, Yixing, Yu, Haijun, and Cui, Junning
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, 68T07(Primary) 68U10, 68T20(Secondary)
Abstract: Recently, linear computed tomography (LCT) systems have actively attracted attention. To weaken projection truncation and image the region of interest (ROI) for LCT, the backprojection filtration (BPF) algorithm is an effective solution. However, in BPF for LCT, it is difficult to achieve stable interior reconstruction, and for differentiated backprojection (DBP) images of LCT, multiple rotation-finite inversion of Hilbert transform (Hilbert filtering)-inverse rotation operations will blur the image. To satisfy multiple reconstruction scenarios for LCT, including interior ROI, complete object, and exterior region beyond field-of-view (FOV), and avoid the rotation operations of Hilbert filtering, we propose two types of reconstruction architectures. The first overlays multiple DBP images to obtain a complete DBP image, then uses a network to learn the overlying Hilbert filtering function, referred to as the Overlay-Single Network (OSNet). The second uses multiple networks to train different directional Hilbert filtering models for DBP images of multiple linear scannings, respectively, and then overlays the reconstructed results, i.e., Multiple Networks Overlaying (MNetO). In two architectures, we introduce a Swin Transformer (ST) block to the generator of pix2pixGAN to extract both local and global features from DBP images at the same time. We investigate two architectures from different networks, FOV sizes, pixel sizes, number of projections, geometric magnification, and processing time. Experimental results show that two architectures can both recover images. OSNet outperforms BPF in various scenarios. For the different networks, ST-pix2pixGAN is superior to pix2pixGAN and CycleGAN. MNetO exhibits a few artifacts due to the differences among the multiple models, but any one of its models is suitable for imaging the exterior edge in a certain direction., Comment: 13 pages, 13 figures
Published: 2023

12. MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

Author: Yang, Bang, Liu, Fenglin, Wu, Xian, Wang, Yaowei, Sun, Xu, and Zou, Yuexian
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Supervised visual captioning models typically require a large scale of images or videos paired with descriptions in a specific language (i.e., the vision-caption pairs) for training. However, collecting and labeling large-scale datasets is time-consuming and expensive for many scenarios and languages. Therefore, sufficient labeled pairs are usually not available. To deal with the label shortage problem, we present a simple yet effective zero-shot approach MultiCapCLIP that can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets. In the training stage, MultiCapCLIP only requires text data for input. Then it conducts two main steps: 1) retrieving concept prompts that preserve the corresponding domain knowledge of new scenarios; 2) auto-encoding the prompts to learn writing styles to output captions in a desired language. In the testing stage, MultiCapCLIP instead takes visual data as input directly to retrieve the concept prompts to generate the final visual descriptions. The extensive experiments on image and video captioning across four benchmarks and four languages (i.e., English, Chinese, German, and French) confirm the effectiveness of our approach. Compared with state-of-the-art zero-shot and weakly-supervised methods, our method achieves 4.8% and 21.5% absolute improvements in terms of BLEU@4 and CIDEr metrics. Our code is available at https://github.com/yangbang18/MultiCapCLIP., Comment: ACL'2023, 13 pages, 4 figures
Published: 2023

13. Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels

Author: Yang, Bang, Liu, Fenglin, Li, Zheng, Yin, Qingyu, You, Chenyu, Yin, Bing, and Zou, Yuexian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generating an informative and attractive title for the product is a crucial task for e-commerce. Most existing works follow the standard multimodal natural language generation approaches, e.g., image captioning, and employ large-scale human-labelled datasets to train desirable models. However, for novel products, especially in a different domain, there are few existing labelled data. In this paper, we propose a prompt-based approach, i.e., the Multimodal Prompt Learning framework, to accurately and efficiently generate titles for novel products with limited labels. We observe that the core challenges of novel product title generation are the understanding of novel product characteristics and the generation of titles in a novel writing style. To this end, we build a set of multimodal prompts from different modalities to preserve the corresponding characteristics and writing styles of novel products. As a result, with extremely limited labels for training, the proposed method can retrieve the multimodal prompts to generate desirable titles for novel products. The experiments and analyses are conducted on five novel product categories under both the in-domain and out-of-domain experimental settings. The results show that, with only 1% of downstream labelled data for training, our proposed approach achieves the best few-shot results and even achieves competitive results with fully-supervised methods trained on 100% of training data; with the full labelled data for training, our method achieves state-of-the-art results., Comment: accepted by ACL Findings 2023
Published: 2023

14. BPF Algorithms for Multiple Source-Translation Computed Tomography Reconstruction

Author: Wang, Zhisheng, Yu, Haijun, Huang, Yixing, Wang, Shunli, Ni, Song, Li, Zongfeng, Liu, Fenglin, and Cui, Junning
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Micro-computed tomography (micro-CT) is a widely used state-of-the-art instrument employed to study the morphological structures of objects in various fields. However, its small field-of-view (FOV) cannot meet the pressing demand for imaging relatively large objects at high spatial resolutions. Recently, we devised a novel scanning mode called multiple source translation CT (mSTCT) that effectively enlarges the FOV of the micro-CT and correspondingly developed a virtual projection-based filtered backprojection (V-FBP) algorithm for reconstruction. Although V-FBP skillfully solves the truncation problem in mSTCT, it requires densely sampled projections to arrive at high-resolution reconstruction, which reduces imaging efficiency. In this paper, we developed two backprojection-filtration (BPF)-based algorithms for mSTCT, i.e., S-BPF (derivatives along source) and D-BPF (derivatives along detector). D-BPF can achieve high-resolution reconstruction with fewer projections than V-FBP and S-BPF. Through simulated and real experiments conducted in this paper, we demonstrate that D-BPF can reduce source sampling by 75% compared with V-FBP at the same spatial resolution, which makes mSTCT more feasible in practice. Meanwhile, S-BPF can yield more stable results than D-BPF, which is similar to V-FBP., Comment: 23 pages, 13 figures
Published: 2023

15. Enhancing human behavior recognition with spatiotemporal graph convolutional neural networks and skeleton sequences

Author: Xu, Jianmin, Liu, Fenglin, Wang, Qinghui, Zou, Ruirui, Wang, Ying, Zheng, Junling, Du, Shaoyi, and Zeng, Wei
Published: 2024

16. Ubiquilin-4 induces immune escape in gastric cancer by activating the notch signaling pathway

Author: Jiang, Quan, Chen, Hao, Zhou, Shixin, Zhu, Tao, Liu, Wenshuai, Wu, Hao, Zhang, Yong, Liu, Fenglin, and Sun, Yihong
Published: 2024

17. ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation

Author: Yang, Bang, Liu, Fenglin, Zou, Yuexian, Wu, Xian, Wang, Yaowei, and Clifton, David A.
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Natural Language Generation (NLG) accepts input data in the form of images, videos, or text and generates corresponding natural language text as output. Existing NLG methods mainly adopt a supervised approach and rely heavily on coupled data-to-text pairs. However, for many targeted scenarios and for non-English languages, sufficient quantities of labeled data are often not available. To relax the dependency on labeled data of downstream tasks, we propose an intuitive and effective zero-shot learning framework, ZeroNLG, which can deal with multiple NLG tasks, including image-to-text (image captioning), video-to-text (video captioning), and text-to-text (neural machine translation), across English, Chinese, German, and French within a unified framework. ZeroNLG does not require any labeled downstream pairs for training. During training, ZeroNLG (i) projects different domains (across modalities and languages) to corresponding coordinates in a shared common latent space; (ii) bridges different domains by aligning their corresponding coordinates in this space; and (iii) builds an unsupervised multilingual auto-encoder to learn to generate text by reconstructing the input text given its coordinate in shared latent space. Consequently, during inference, based on the data-to-text pipeline, ZeroNLG can generate target sentences across different languages given the coordinate of input data in the common space. Within this unified framework, given visual (imaging or video) data as input, ZeroNLG can perform zero-shot visual captioning; given textual sentences as input, ZeroNLG can perform zero-shot machine translation. We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and believable outputs and significantly outperforms existing zero-shot methods., Comment: Accepted by TPAMI (Our code and data are available at https://github.com/yangbang18/ZeroNLG)
Published: 2023

18. Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective

Author: You, Chenyu, Dai, Weicheng, Min, Yifei, Liu, Fenglin, Clifton, David A., Zhou, S Kevin, Staib, Lawrence Hamilton, and Duncan, James S
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: For medical image segmentation, contrastive learning is the dominant practice to improve the quality of visual representations by contrasting semantically similar and dissimilar pairs of samples. This is enabled by the observation that without accessing ground truth labels, negative examples with truly dissimilar anatomical features, if sampled, can significantly improve the performance. In reality, however, these samples may come from similar anatomical regions and the models may struggle to distinguish the minority tail-class samples, making the tail classes more prone to misclassification, both of which typically lead to model collapse. In this paper, we propose ARCO, a semi-supervised contrastive learning (CL) framework with stratified group theory for medical image segmentation. In particular, we first propose building ARCO through the concept of variance-reduced estimation and show that certain variance-reduction techniques are particularly beneficial in pixel/voxel-level segmentation tasks with extremely limited labels. Furthermore, we theoretically prove these sampling techniques are universal in variance reduction. Finally, we experimentally validate our approaches on eight benchmarks, i.e., five 2D/3D medical and three semantic segmentation datasets, with different label settings, and our methods consistently outperform state-of-the-art semi-supervised methods. Additionally, we augment the CL frameworks with these sampling techniques and demonstrate significant gains over previous methods. We believe our work is an important step towards semi-supervised medical image segmentation by quantifying the limitation of current self-supervision objectives for accomplishing such challenging safety-critical tasks., Comment: Accepted by Advances in Neural Information Processing Systems (NeurIPS 2023)
Published: 2023

19. Aligning Source Visual and Target Language Domains for Unpaired Video Captioning

Author: Liu, Fenglin, Wu, Xian, You, Chenyu, Ge, Shen, Zou, Yuexian, and Sun, Xu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Training supervised video captioning model requires coupled video-caption pairs. However, for many targeted languages, sufficient paired data are not available. To this end, we introduce the unpaired video captioning task aiming to train models without coupled video-caption pairs in target language. To solve the task, a natural choice is to employ a two-step pipeline system: first utilizing video-to-pivot captioning model to generate captions in pivot language and then utilizing pivot-to-target translation model to translate the pivot captions to the target language. However, in such a pipeline system, 1) visual information cannot reach the translation model, generating visual irrelevant target captions; 2) the errors in the generated pivot captions will be propagated to the translation model, resulting in disfluent target captions. To address these problems, we propose the Unpaired Video Captioning with Visual Injection system (UVC-VI). UVC-VI first introduces the Visual Injection Module (VIM), which aligns source visual and target language domains to inject the source visual information into the target language domain. Meanwhile, VIM directly connects the encoder of the video-to-pivot model and the decoder of the pivot-to-target model, allowing end-to-end inference by completely skipping the generation of pivot captions. To enhance the cross-modality injection of the VIM, UVC-VI further introduces a pluggable video encoder, i.e., Multimodal Collaborative Encoder (MCE). The experiments show that UVC-VI outperforms pipeline systems and exceeds several supervised systems. Furthermore, equipping existing supervised systems with our MCE can achieve 4% and 7% relative margins on the CIDEr scores to current state-of-the-art models on the benchmark MSVD and MSR-VTT datasets, respectively., Comment: Published at IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Published: 2022

20. Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

Author: Jin, Peng, Huang, Jinfa, Liu, Fenglin, Wu, Xian, Ge, Shen, Song, Guoli, Clifton, David A., and Chen, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, where the features could be concisely represented as the linear combinations of these bases. Such feature decomposition of video-and-language representations reduces the rank of the latent space, resulting in increased representing power for the semantics. Extensive experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discriminative video-and-language representations than previous methods, and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing approaches either as a jointly training layer or an out-of-the-box inference module with no extra training, making it easy to incorporate into any existing methods. Comment: Accepted to NeurIPS 2022
Published: 2022

21. DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

Author: Liu, Fenglin, Wu, Xian, Ge, Shen, Ren, Xuancheng, Fan, Wei, Sun, Xu, and Zou, Yuexian
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Vision-and-language (V-L) tasks require the system to understand both vision content and natural language, thus learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance. Recently, various pre-trained V-L models have been proposed to learn V-L representations and achieve improved results in many tasks. However, the mainstream models process both vision and language inputs with the same set of attention matrices. As a result, the generated V-L representations are entangled in one common latent space. To tackle this problem, we propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which is a novel framework that applies separated attention spaces for vision and language, and the representations of multi-modalities can thus be disentangled explicitly. To enhance the correlation between vision and language in disentangled spaces, we introduce the visual concepts to DiMBERT which represent visual information in textual format. In this manner, visual concepts help to bridge the gap between the two modalities. We pre-train DiMBERT on a large amount of image-sentence pairs on two tasks: bidirectional language modeling and sequence-to-sequence language modeling. After pre-training, DiMBERT is further fine-tuned for the downstream tasks. Experiments show that DiMBERT sets new state-of-the-art performance on three tasks (over four datasets), including both generation tasks (image captioning and visual storytelling) and classification tasks (referring expressions). The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, up to a 5% increase on the representative task. Finally, we conduct a systematic analysis and demonstrate the effectiveness of our DiM and the introduced visual concepts. Comment: Published in ACM TKDD2022 (ACM Transactions on Knowledge Discovery from Data)
Published: 2022

22. Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine

Author: Liu, Fenglin, Yang, Bang, You, Chenyu, Wu, Xian, Ge, Shen, Liu, Zhangdaihong, Sun, Xu, Yang, Yang, and Clifton, David A.
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Language models (LMs), including large language models (such as ChatGPT), have the potential to assist clinicians in generating various clinical notes. However, LMs are prone to produce "hallucinations", i.e., generated content that is not aligned with facts and knowledge. In this paper, we propose the Re$^3$Writer method with retrieval-augmented generation and knowledge-grounded reasoning to enable LMs to generate faithful clinical texts. We demonstrate the effectiveness of our method in generating patient discharge instructions. It requires the LMs not only to understand the patients' long clinical documents, i.e., the health records during hospitalization, but also to generate critical instructional information provided both to carers and to the patient at the time of discharge. The proposed Re$^3$Writer imitates the working patterns of physicians to first retrieve related working experience from historical instructions written by physicians, then reason related medical knowledge. Finally, it refines the retrieved working experience and reasoned medical knowledge to extract useful information, which is used to generate the discharge instructions for previously-unseen patients. Our experiments show that, using our method, the performance of five representative LMs can be substantially boosted across all metrics. Meanwhile, we show results from human evaluations to measure the effectiveness in terms of fluency, faithfulness, and comprehensiveness.
Published: 2022

23. Prophet Attention: Predicting Attention with Future Attention for Image Captioning

Author: Liu, Fenglin, Ren, Xuancheng, Wu, Xian, Fan, Wei, Zou, Yuexian, and Sun, Xu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Recently, attention-based models have been used extensively in many sequence-to-sequence learning systems. Especially for image captioning, the attention-based models are expected to ground correct image regions with proper generated words. However, for each time step in the decoding process, the attention-based models usually use the hidden state of the current input to attend to the image regions. Under this setting, these attention models have a "deviated focus" problem: they calculate the attention weights based on previous words instead of the one to be generated, impairing the performance of both grounding and captioning. In this paper, we propose the Prophet Attention, similar to the form of self-supervision. In the training stage, this module utilizes the future information to calculate the "ideal" attention weights towards image regions. These calculated "ideal" weights are further used to regularize the "deviated" attention. In this manner, image regions are grounded with the correct words. The proposed Prophet Attention can be easily incorporated into existing image captioning models to improve their performance of both grounding and captioning. The experiments on the Flickr30k Entities and the MSCOCO datasets show that the proposed Prophet Attention consistently outperforms baselines in both automatic metrics and human evaluations. It is worth noting that we set new state-of-the-art results on the two benchmark datasets and achieve the 1st place on the leaderboard of the online MSCOCO benchmark in terms of the default ranking score, i.e., CIDEr-c40. Comment: Accepted by NeurIPS 2020
Published: 2022

24. Mine yOur owN Anatomy: Revisiting Medical Image Segmentation with Extremely Limited Labels

Author: You, Chenyu, Dai, Weicheng, Liu, Fenglin, Min, Yifei, Dvornek, Nicha C., Li, Xiaoxiao, Clifton, David A., Staib, Lawrence, and Duncan, James S.
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Recent studies on contrastive learning have achieved remarkable performance solely by leveraging few labels in the context of medical image segmentation. Existing methods mainly focus on instance discrimination and invariant mapping. However, they face three common pitfalls: (1) tailness: medical image data usually follows an implicit long-tail class distribution. Blindly leveraging all pixels in training hence can lead to the data imbalance issues, and cause deteriorated performance; (2) consistency: it remains unclear whether a segmentation model has learned meaningful and yet consistent anatomical features due to the intra-class variations between different anatomical features; and (3) diversity: the intra-slice correlations within the entire dataset have received significantly less attention. This motivates us to seek a principled approach for strategically making use of the dataset itself to discover similar yet distinct samples from different anatomical views. In this paper, we introduce a novel semi-supervised 2D medical image segmentation framework termed Mine yOur owN Anatomy (MONA), and make three contributions. First, prior work argues that every pixel equally matters to the model training; we observe empirically that this alone is unlikely to define meaningful anatomical features, mainly due to lacking the supervision signal. We show two simple solutions towards learning invariances - through the use of stronger data augmentations and nearest neighbors. Second, we construct a set of objectives that encourage the model to be capable of decomposing medical images into a collection of anatomical features in an unsupervised manner. Lastly, we demonstrate, both empirically and theoretically, the efficacy of our MONA on three benchmark datasets with different labeled settings, achieving new state-of-the-art under different labeled semi-supervised settings. Comment: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE-TPAMI)
Published: 2022

25. Competence-based Multimodal Curriculum Learning for Medical Report Generation

Author: Liu, Fenglin, Ge, Shen, Zou, Yuexian, and Wu, Xian
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: The medical report generation task, which aims to produce long and coherent descriptions of medical images, has attracted growing research interest recently. Different from the general image captioning tasks, medical report generation is more challenging for data-driven neural models. This is mainly due to 1) the serious data bias and 2) the limited medical data. To alleviate the data bias and make best use of available data, we propose a Competence-based Multimodal Curriculum Learning framework (CMCL). Specifically, CMCL simulates the learning process of radiologists and optimizes the model in a step-by-step manner. Firstly, CMCL estimates the difficulty of each training instance and evaluates the competence of the current model; secondly, CMCL selects the most suitable batch of training instances considering the current model competence. By iterating the above two steps, CMCL can gradually improve the model's performance. The experiments on the public IU-Xray and MIMIC-CXR datasets show that CMCL can be incorporated into existing models to improve their performance. Comment: Accepted by ACL 2021 (Oral)
Published: 2022

26. Graph-in-Graph Network for Automatic Gene Ontology Description Generation

Author: Liu, Fenglin, Yang, Bang, You, Chenyu, Wu, Xian, Ge, Shen, Woicik, Adelaide, and Wang, Sheng
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Gene Ontology (GO) is the primary gene function knowledge base that enables computational tasks in biomedicine. The basic element of GO is a term, which includes a set of genes with the same function. Existing research efforts of GO mainly focus on predicting gene term associations. Other tasks, such as generating descriptions of new terms, are rarely pursued. In this paper, we propose a novel task: GO term description generation. This task aims to automatically generate a sentence that describes the function of a GO term belonging to one of the three categories, i.e., molecular function, biological process, and cellular component. To address this task, we propose a Graph-in-Graph network that can efficiently leverage the structural information of GO. The proposed network introduces a two-layer graph: the first layer is a graph of GO terms where each node is also a graph (gene graph). Such a Graph-in-Graph network can derive the biological functions of GO terms and generate proper descriptions. To validate the effectiveness of the proposed network, we build three large-scale benchmark datasets. By incorporating the proposed Graph-in-Graph network, the performances of seven different sequence-to-sequence models can be substantially boosted across all evaluation metrics, with up to 34.7%, 14.5%, and 39.1% relative improvements in BLEU, ROUGE-L, and METEOR, respectively. Comment: Accepted by KDD 2022 (Research Track)
Published: 2022

27. End-to-end Spoken Conversational Question Answering: Task, Dataset and Model

Author: You, Chenyu, Chen, Nuo, Liu, Fenglin, Ge, Shen, Wu, Xian, and Zou, Yuexian
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts. However, the most natural way that humans seek or test their knowledge is via human conversations. Therefore, we propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows given the speech documents. In this task, our main objective is to build the system to deal with conversational questions based on the audio recordings, and to explore the plausibility of providing more cues from different modalities with systems in information gathering. To this end, instead of directly adopting automatically generated speech transcripts with highly noisy data, we propose a novel unified data distillation approach, DDNet, which effectively ingests cross-modal information to achieve fine-grained representations of the speech and language modalities. Moreover, we propose a simple and novel mechanism, termed Dual Attention, by encouraging better alignments between audio and text to ease the process of knowledge transfer. To evaluate the capacity of SCQA systems in a dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 40k question-answer pairs from 4k conversations. The performance of the existing state-of-the-art methods significantly degrades on our dataset, hence demonstrating the necessity of cross-modal information integration. Our experimental results demonstrate that our proposed method achieves superior performance in spoken conversational question answering tasks. Comment: In Findings of NAACL 2022. arXiv admin note: substantial text overlap with arXiv:2010.08923
Published: 2022

28. AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation

Author: You, Di, Liu, Fenglin, Ge, Shen, Xie, Xiaoxia, Zhang, Jing, and Wu, Xian
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, medical report generation, which aims to automatically generate a long and coherent descriptive paragraph of a given medical image, has received growing research interest. Different from the general image captioning tasks, medical report generation is more challenging for data-driven neural models. This is mainly due to 1) the serious data bias: the normal visual regions dominate the dataset over the abnormal visual regions, and 2) the very long sequence. To alleviate the above two problems, we propose an AlignTransformer framework, which includes the Align Hierarchical Attention (AHA) and the Multi-Grained Transformer (MGT) modules: 1) the AHA module first predicts the disease tags from the input image and then learns the multi-grained visual features by hierarchically aligning the visual regions and disease tags. The acquired disease-grounded visual features can better represent the abnormal regions of the input image, which could alleviate the data bias problem; 2) the MGT module effectively uses the multi-grained features and Transformer framework to generate the long medical report. The experiments on the public IU-Xray and MIMIC-CXR datasets show that the AlignTransformer can achieve results competitive with state-of-the-art methods on the two datasets. Moreover, the human evaluation conducted by professional radiologists further proves the effectiveness of our approach. Comment: Accepted by MICCAI 2021 (the 24th International Conference on Medical Image Computing and Computer Assisted Intervention)
Published: 2022

29. Class-Aware Adversarial Transformers for Medical Image Segmentation

Author: You, Chenyu, Zhao, Ruihan, Liu, Fenglin, Dong, Siyuan, Chinchali, Sandeep, Topcu, Ufuk, Staib, Lawrence, and Duncan, James S.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Transformers have made remarkable progress towards modeling long-range dependencies within the medical image analysis domain. However, current transformer-based models suffer from several disadvantages: (1) existing methods fail to capture the important features of the images due to the naive tokenization scheme; (2) the models suffer from information loss because they only consider single-scale feature representations; and (3) the segmentation label maps generated by the models are not accurate enough without considering rich semantic contexts and anatomical textures. In this work, we present CASTformer, a novel type of adversarial transformers, for 2D medical image segmentation. First, we take advantage of the pyramid structure to construct multi-scale representations and handle multi-scale variations. We then design a novel class-aware transformer module to better learn the discriminative regions of objects with semantic structures. Lastly, we utilize an adversarial training strategy that boosts segmentation accuracy and correspondingly allows a transformer-based discriminator to capture high-level semantically correlated contents and low-level anatomical features. Our experiments demonstrate that CASTformer dramatically outperforms previous state-of-the-art transformer-based approaches on three benchmarks, obtaining 2.54%-5.88% absolute improvements in Dice over previous models. Further qualitative experiments provide a more detailed picture of the model's inner workings, shed light on the challenges in improved transparency, and demonstrate that transfer learning can greatly improve performance and reduce the size of medical image datasets in training, making CASTformer a strong starting point for downstream medical image analysis tasks.
Published: 2022

30. Pig-DTpV: A prior information guided directional TpV algorithm for orthogonal translation computed laminography

Author: Xi, Yarui, Qiao, Zhiwei, Wang, Ao, Fang, Chenyun, and Liu, Fenglin
Published: 2024

31. Impacts of information quantity and display formats on driving behaviors in a connected vehicle environment

Author: Zhao, Wenjing, Gong, Siyuan, Zhao, Dezong, Liu, Fenglin, Sze, N.N., Quddus, Mohammed, Huang, Helai, and Zhao, Xiangmo
Published: 2024

32. A physical perspective to understand myelin. II. The physical origin of myelin development

Author: Liu, Yonghong, Zhang, Yapeng, Yue, Wenji, Zhu, Ran, Guo, Tianruo, Liu, Fenglin, Huang, Yubin, Wu, Tianzhun, and Wang, Hao
Subjects: Physics - Biological Physics
Abstract: The physical principle of myelin development is obtained from our previous study by explaining the Peters quadrant mystery: an externally applied negative and positive E-field can promote and inhibit the growth of the inner tongue of the myelin sheath, respectively. In this study, this principle is considered as a fundamental hypothesis, named Hypothesis-E, to explain more phenomena about myelin development systematically. Specifically, the g-ratio and the fate of the Schwann cell's differentiation are explained in terms of E-field. Moreover, an experiment is proposed to validate this theory.
Published: 2021

33. A physical perspective to understand myelin. I. Peters quadrant mystery

Author: Liu, Yonghong, Zhang, Yapeng, Yue, Wenji, Zhu, Ran, Guo, Tianruo, Liu, Fenglin, Huang, Yubin, Wu, Tianzhun, and Wang, Hao
Subjects: Physics - Biological Physics
Abstract: In the development of oligodendrocytes in the central nervous system, the inner and outer tongue of the myelin sheath tend to be located within the same quadrant, which was named the Peters quadrant mystery. In this study, we conduct in silico investigations to explore the possible mechanisms underlying the Peters quadrant mystery. A biophysically detailed model of oligodendrocytes was used to simulate the effect of the action potential-induced electric field across the myelin sheath. Our simulation suggests that the paranodal channel connecting the inner and outer tongue forms a low impedance route, inducing two high-current zones at the area around the inner and outer tongue. When the inner tongue and outer tongue are located within the same quadrant, the interaction of these two high-current zones will induce a maximum amplitude and a polarity reversal of the voltage upon the inner tongue, resulting in the same quadrant phenomenon. This model indicates that the growth of myelin follows a simple principle: an external negative or positive E-field can promote or inhibit the growth of the inner tongue, respectively.
Published: 2021

34. Auto-Encoding Knowledge Graph for Unsupervised Medical Report Generation

Author: Liu, Fenglin, You, Chenyu, Wu, Xian, Ge, Shen, Wang, Sheng, and Sun, Xu
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Medical report generation, which aims to automatically generate a long and coherent report of a given medical image, has been receiving growing research interests. Existing approaches mainly adopt a supervised manner and heavily rely on coupled image-report pairs. However, in the medical domain, building a large-scale image-report paired dataset is both time-consuming and expensive. To relax the dependency on paired data, we propose an unsupervised model Knowledge Graph Auto-Encoder (KGAE) which accepts independent sets of images and reports in training. KGAE consists of a pre-constructed knowledge graph, a knowledge-driven encoder and a knowledge-driven decoder. The knowledge graph works as the shared latent space to bridge the visual and textual domains; The knowledge-driven encoder projects medical images and reports to the corresponding coordinates in this latent space and the knowledge-driven decoder generates a medical report given a coordinate in this space. Since the knowledge-driven encoder and decoder can be trained with independent sets of images and reports, KGAE is unsupervised. The experiments show that the unsupervised KGAE generates desirable medical reports without using any image-report training pairs. Moreover, KGAE can also work in both semi-supervised and supervised settings, and accept paired images and reports in training. By further fine-tuning with image-report pairs, KGAE consistently outperforms the current state-of-the-art models on two datasets.
Published: 2021

35. O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning

Author: Liu, Fenglin, Ren, Xuancheng, Wu, Xian, Yang, Bang, Ge, Shen, Zou, Yuexian, and Sun, Xu
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Video captioning combines video understanding and language generation. Different from image captioning that describes a static image with details of almost every object, video captioning usually considers a sequence of frames and biases towards focused objects, e.g., the objects that stay in focus regardless of the changing background. Therefore, detecting and properly accommodating focused objects is critical in video captioning. To enforce the description of focused objects and achieve controllable video captioning, we propose an Object-Oriented Non-Autoregressive approach (O2NA), which performs caption generation in three steps: 1) identify the focused objects and predict their locations in the target caption; 2) generate the related attribute words and relation words of these focused objects to form a draft caption; and 3) combine video information to refine the draft caption to a fluent final caption. Since the focused objects are generated and located ahead of other words, it is difficult to apply the word-by-word autoregressive generation process; instead, we adopt a non-autoregressive approach. The experiments on two benchmark datasets, i.e., MSR-VTT and MSVD, demonstrate the effectiveness of O2NA, which achieves results competitive with the state of the art but with both higher diversity and higher inference speed. Comment: Accepted by Findings of ACL 2021 (The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing)
Published: 2021

36. Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and Model

Author: Huang, Zhiqi, Liu, Fenglin, Wu, Xian, Ge, Shen, Wang, Helin, Fan, Wei, and Zou, Yuexian
Subjects: Computer Science - Computation and Language
Abstract: While Machine Comprehension (MC) has attracted extensive research interest in recent years, existing approaches mainly belong to the category of the Machine Reading Comprehension task, which mines textual inputs (paragraphs and questions) to predict the answers (choices or text spans). However, there are a lot of MC tasks that accept audio input in addition to the textual input, e.g., English listening comprehension tests. In this paper, we target the problem of Audio-Oriented Multimodal Machine Comprehension, and its goal is to answer questions based on the given audio and textual information. To solve this problem, we propose a Dynamic Inter- and Intra-modality Attention (DIIA) model to effectively fuse the two modalities (audio and textual). DIIA can work as an independent component and thus be easily integrated into existing MC models. Moreover, we further develop a Multimodal Knowledge Distillation (MKD) module to enable our multimodal MC model to accurately predict the answers based only on either the text or the audio. As a result, the proposed approach can handle various tasks including Audio-Oriented Multimodal Machine Comprehension, Machine Reading Comprehension and Machine Listening Comprehension in a single model, making fair comparisons possible between our model and the existing unimodal MC models. Experimental results and analysis prove the effectiveness of the proposed approaches. First, the proposed DIIA boosts the baseline models by up to 21.08% in terms of accuracy; second, under the unimodal scenarios, the MKD module allows our multimodal MC model to significantly outperform the unimodal models, which are trained and tested with only audio or textual data, by up to 18.87%. Comment: AAAI 2021
Published: 2021

37. Enhancing medical image object detection with collaborative multi-agent deep Q-networks and multi-scale representation

Author: Wang, Qinghui, Liu, Fenglin, Zou, Ruirui, Wang, Ying, Zheng, Chenyang, Tian, Zhiqiang, Du, Shaoyi, and Zeng, Wei
Published: 2023

38. A medical multimodal large language model for future pandemics

Author: Liu, Fenglin, Zhu, Tingting, Wu, Xian, Yang, Bang, You, Chenyu, Wang, Chenyang, Lu, Lei, Liu, Zhangdaihong, Zheng, Yefeng, Sun, Xu, Yang, Yang, Clifton, Lei, and Clifton, David A.
Published: 2023

39. Does resection after neoadjuvant chemotherapy of docetaxel, oxaliplatin, and S-1 (DOS regimen) benefit for gastric cancer patients with single non-curable factor? a multicenter, prospective cohort study (Neo-REGATTA)

Author: Cui, Yuehong, Yu, Yiyi, Zheng, Song, Ying, Jie’er, Du, Yi’an, Wang, Yan, Wang, Xuefei, Shen, Zhenbin, Liu, Fenglin, Lv, Minzhi, Sun, Yihong, and Liu, Tianshu
Published: 2023

40. Exploring Semantic Relationships for Unpaired Image Captioning

Author: Liu, Fenglin, Gao, Meng, Zhang, Tianhao, and Zou, Yuexian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, image captioning has aroused great interest in both academic and industrial worlds. Most existing systems are built upon large-scale datasets consisting of image-sentence pairs, which, however, are time-consuming to construct. In addition, even for the most advanced image captioning systems, it is still difficult to realize deep image understanding. In this work, we achieve unpaired image captioning by bridging the vision and the language domains with high-level semantic information. The motivation stems from the fact that the semantic concepts with the same modality can be extracted from both images and descriptions. To further improve the quality of captions generated by the model, we propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image. Extensive experiments on MSCOCO dataset show that we can generate desirable captions without paired datasets. Furthermore, the proposed approach boosts five strong baselines under the paired setting, where the most significant improvement in CIDEr score reaches 8%, demonstrating that it is effective and generalizes well to a wide range of models.
Published: 2021

41. Contrastive Attention for Automatic Chest X-ray Report Generation

Author: Liu, Fenglin, Yin, Changchang, Wu, Xian, Ge, Shen, Zou, Yuexian, Zhang, Ping, and Sun, Xu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Recently, chest X-ray report generation, which aims to automatically generate descriptions of given chest X-ray images, has received growing research interest. The key challenge of chest X-ray report generation is to accurately capture and describe the abnormal regions. In most cases, the normal regions dominate the entire chest X-ray image, and the corresponding descriptions of these normal regions dominate the final report. Due to such data bias, learning-based models may fail to attend to abnormal regions. In this work, to effectively capture and describe abnormal regions, we propose the Contrastive Attention (CA) model. Instead of solely focusing on the current input image, the CA model compares the current input image with normal images to distill the contrastive information. The acquired contrastive information can better represent the visual features of abnormal regions. According to the experiments on the public IU-X-ray and MIMIC-CXR datasets, incorporating our CA into several existing models can boost their performance across most metrics. In addition, according to the analysis, the CA model can help existing models better attend to the abnormal regions and provide more accurate descriptions which are crucial for an interpretable diagnosis. Specifically, we achieve the state-of-the-art results on the two public datasets. Comment: Appear in Findings of ACL 2021 (The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021))
Published: 2021

42. Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation

Author: Liu, Fenglin, Wu, Xian, Ge, Shen, Fan, Wei, and Zou, Yuexian
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Automatically generating radiology reports can improve current clinical practice in diagnostic radiology. On one hand, it can relieve radiologists from the heavy burden of report writing; on the other hand, it can remind radiologists of abnormalities and help avoid misdiagnosis and missed diagnosis. Yet, this task remains challenging for data-driven neural networks, due to the serious visual and textual data biases. To this end, we propose a Posterior-and-Prior Knowledge Exploring-and-Distilling approach (PPKED) to imitate the working patterns of radiologists, who first examine the abnormal regions and assign disease topic tags to them, and then rely on years of accumulated prior medical knowledge and working experience to write reports. Thus, the PPKED includes three modules: Posterior Knowledge Explorer (PoKE), Prior Knowledge Explorer (PrKE) and Multi-domain Knowledge Distiller (MKD). In detail, PoKE explores the posterior knowledge, which provides explicit abnormal visual regions to alleviate visual data bias; PrKE explores the prior knowledge from the prior medical knowledge graph (medical knowledge) and prior radiology reports (working experience) to alleviate textual data bias. The explored knowledge is distilled by the MKD to generate the final reports. Evaluated on the MIMIC-CXR and IU-Xray datasets, our method outperforms previous state-of-the-art models on both.
Comment: Accepted by CVPR 2021 (2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2021))
Published: 2021

43. Rethinking Skip Connection with Layer Normalization in Transformers and ResNets

Author: Liu, Fenglin, Ren, Xuancheng, Zhang, Zhiyuan, Sun, Xu, and Zou, Yuexian
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Skip connection is a widely used technique to improve the performance and convergence of deep neural networks; it is believed to relieve the difficulty in optimization due to non-linearity by propagating a linear component through the neural network layers. However, from another point of view, it can also be seen as a modulating mechanism between the input and the output, with the input scaled by a pre-defined value of one. In this work, we investigate how the scale factors into the effectiveness of the skip connection and reveal that a trivial adjustment of the scale will lead to spurious gradient exploding or vanishing in line with the depth of the models. This could be addressed by normalization, in particular layer normalization, which induces consistent improvements over the plain skip connection. Inspired by the findings, we further propose to adaptively adjust the scale of the input by recursively applying skip connection with layer normalization, which promotes the performance substantially and generalizes well across diverse tasks including both machine translation and image classification datasets.
Comment: Accepted by COLING2020 (The 28th International Conference on Computational Linguistics (COLING 2020))
Published: 2021

44. Is Pathologic Complete Response a Good Predictor for the Long-Term, Clinical Outcome in Patients with Gastric Cancer After Neoadjuvant Chemotherapy? A Retrospective, Multi-institution Study in China

Author: Lin, Chao, Ma, Junjun, Zhu, Chunchao, Zhao, Xuan, Chen, Yueda, Zang, Lu, and Liu, Fenglin
Published: 2023
Full Text: View/download PDF

45. A novel technique for the detection of myocardial dysfunction using ECG signals based on CEEMD, DWT, PSR and neural networks

Author: Zeng, Wei, Yuan, Jian, Yuan, Chengzhi, Wang, Qinghui, Liu, Fenglin, and Wang, Ying
Published: 2023
Full Text: View/download PDF

46. Developing a new integrated advanced driver assistance system in a connected vehicle environment

Author: Zhao, Wenjing, Gong, Siyuan, Zhao, Dezong, Liu, Fenglin, Sze, N.N., Quddus, Mohammed, and Huang, Helai
Published: 2024
Full Text: View/download PDF

47. Adaptive Bi-directional Attention: Exploring Multi-Granularity Representations for Machine Reading Comprehension

Author: Chen, Nuo, Liu, Fenglin, You, Chenyu, Zhou, Peilin, and Zou, Yuexian
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Recently, attention-enhanced multi-layer encoders, such as the Transformer, have been extensively studied in Machine Reading Comprehension (MRC). To predict the answer, it is common practice to employ a predictor that draws information only from the final encoder layer, which generates the coarse-grained representations of the source sequences, i.e., passage and question. Previous studies have shown that the representation of the source sequence shifts from fine-grained to coarse-grained as the number of encoding layers increases. It is generally believed that with the growing number of layers in deep neural networks, the encoding process will increasingly gather relevant information for each location, resulting in more coarse-grained representations, which increases the likelihood of similarity to other locations (referred to as homogeneity). Such a phenomenon can mislead the model into making wrong judgments and thereby degrade performance. To this end, we propose a novel approach called Adaptive Bidirectional Attention, which adaptively exploits the source representations of different levels for the predictor. Experimental results on the benchmark dataset SQuAD 2.0 demonstrate the effectiveness of our approach, and the results are better than the previous state-of-the-art model by 2.5% EM and 2.3% F1 scores.
Comment: five pages, four figures
Published: 2020

48. Towards Data Distillation for End-to-end Spoken Conversational Question Answering

Author: You, Chenyu, Chen, Nuo, Liu, Fenglin, Yang, Dongchao, and Zou, Yuexian
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Signal Processing
Abstract: In spoken question answering, QA systems are designed to answer questions from contiguous text spans within the related speech transcripts. However, the most natural way that humans seek or test their knowledge is via human conversations. Therefore, we propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling QA systems to model complex dialogue flows given the speech utterances and text corpora. In this task, our main objective is to build a QA system to deal with conversational questions in both spoken and text forms, and to explore the plausibility of providing more cues in spoken documents with systems in information gathering. To this end, instead of adopting automatically generated speech transcripts with highly noisy data, we propose a novel unified data distillation approach, DDNet, which directly fuses audio-text features to reduce the misalignment between automatic speech recognition hypotheses and the reference transcriptions. In addition, to evaluate the capacity of QA systems in a dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 120k question-answer pairs. Experiments demonstrate that our proposed method achieves superior performance in spoken conversational question answering.
Published: 2020

49. PIN: A Novel Parallel Interactive Network for Spoken Language Understanding

Author: Zhou, Peilin, Huang, Zhiqi, Liu, Fenglin, and Zou, Yuexian
Subjects: Computer Science - Computation and Language
Abstract: Spoken Language Understanding (SLU) is an essential part of the spoken dialogue system, which typically consists of intent detection (ID) and slot filling (SF) tasks. Recently, recurrent neural network (RNN) based methods have achieved state-of-the-art results for SLU. In the existing RNN-based approaches, ID and SF tasks are often jointly modeled to utilize the correlation information between them. However, we note that, so far, efforts to obtain better performance by supporting bidirectional and explicit information exchange between ID and SF have not been well studied. In addition, few studies attempt to capture the local context information to enhance the performance of SF. Motivated by these findings, in this paper, a Parallel Interactive Network (PIN) is proposed to model the mutual guidance between ID and SF. Specifically, given an utterance, a Gaussian self-attentive encoder is introduced to generate the context-aware feature embedding of the utterance, which is able to capture local context information. Taking the feature embedding of the utterance, the Slot2Intent and Intent2Slot modules are developed to capture the bidirectional information flow for ID and SF tasks. Finally, a cooperation mechanism is constructed to fuse the information obtained from the Slot2Intent and Intent2Slot modules to further reduce the prediction bias. The experiments on two benchmark datasets, i.e., SNIPS and ATIS, demonstrate the effectiveness of our approach, which achieves competitive results with state-of-the-art models. More encouragingly, by using the feature embedding of the utterance generated by the pre-trained language model BERT, our method achieves the state of the art among all compared approaches.
Published: 2020

50. Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding

Author: Liu, Fenglin, Ren, Xuancheng, Zhao, Guangxiang, You, Chenyu, Ma, Xuewei, Wu, Xian, and Sun, Xu
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: In sequence-to-sequence learning, e.g., natural language generation, the decoder relies on the attention mechanism to efficiently extract information from the encoder. While it is common practice to draw information from only the last encoder layer, recent work has proposed to use representations from different encoder layers for diversified levels of information. Nonetheless, the decoder still obtains only a single view of the source sequences, which might lead to insufficient training of the encoder layer stack due to the hierarchy bypassing problem. In this work, we propose layer-wise multi-view decoding, where for each decoder layer, together with the representations from the last encoder layer, which serve as a global view, those from other encoder layers are supplemented for a stereoscopic view of the source sequences. Systematic experiments and analyses show that we successfully address the hierarchy bypassing problem, require almost negligible parameter increase, and substantially improve the performance of sequence-to-sequence learning with deep representations on five diverse tasks, i.e., machine translation, abstractive summarization, image captioning, video captioning, medical report generation, and paraphrase generation. In particular, our approach achieves new state-of-the-art results on ten benchmark datasets, including a low-resource machine translation dataset and two low-resource medical report generation datasets.
Published: 2020
