7,874 results for "Han, Song"
Search Results
2. Comparative Study on Membrane Concentration of Desulfurization Wastewater by Traditional Pretreatment Process and Ceramic Membrane Process
- Author
Cao, Hongmei, Pang, Li, Han, Song, Guo, Bingchuan, Han, Bin, Li, Haidong, and Wu, Zhongjie
- Published
- 2023
3. X-VILA: Cross-Modality Alignment for Large Language Model
- Author
Ye, Hanrong, Huang, De-An, Lu, Yao, Yu, Zhiding, Ping, Wei, Tao, Andrew, Kautz, Jan, Han, Song, Xu, Dan, Molchanov, Pavlo, and Yin, Hongxu
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
- Abstract
We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA that exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source., Comment: Technical Report
- Published
- 2024
4. QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
- Author
Lin, Yujun, Tang, Haotian, Yang, Shang, Zhang, Zhekai, Xiao, Guangxuan, Gan, Chuang, and Han, Song
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Performance
- Abstract
Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library that achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in QoQ algorithm, we introduce progressive quantization that can allow low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100, 1.4x on L40S; and Qwen1.5-72B by 2.4x on A100, 3.5x on L40S, compared to TensorRT-LLM. Remarkably, QServe on L40S GPU can achieve even higher throughput than TensorRT-LLM on A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3x. 
Code is available at https://github.com/mit-han-lab/qserve., Comment: The first three authors contribute equally to this project and are listed in the alphabetical order. Yujun Lin leads the quantization algorithm, Haotian Tang and Shang Yang lead the GPU kernels and the serving system. Code is available at https://github.com/mit-han-lab/qserve
- Published
- 2024
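The W4A8KV4 recipe above depends on 4-bit weights being cheap to dequantize on GPU. As a rough, framework-free illustration of the weight side only (not QServe's progressive quantization or its CUDA kernels), a symmetric per-channel INT4 quantizer can be sketched as:

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-output-channel 4-bit quantization (values in [-8, 7])."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # in a real serving kernel this multiply is fused into the GEMM
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 16)).astype(np.float32)
q, s = quantize_int4(w)
```

Round-to-nearest keeps the per-element reconstruction error within half a quantization step (scale / 2), which is the property the serving system trades against throughput.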
5. A Survey on Industrial Internet of Things (IIoT) Testbeds for Connectivity Research
- Author
Zhang, Tianyu, Xue, Chuanyu, Wang, Jiachen, Yun, Zelin, Lin, Natong, and Han, Song
- Subjects
Computer Science - Networking and Internet Architecture
- Abstract
Industrial Internet of Things (IIoT) technologies have revolutionized industrial processes, enabling smart automation, real-time data analytics, and improved operational efficiency across diverse industry sectors. IIoT testbeds play a critical role in advancing IIoT research and development (R&D) by providing controlled environments for technology evaluation before real-world deployment. In this article, we conduct a comprehensive literature review on existing IIoT testbeds, aiming to identify benchmark performance and research gaps, and to explore emerging trends in IIoT systems. We first review the state-of-the-art resource management solutions proposed for IIoT applications. We then categorize the reviewed testbeds according to their deployed communication protocols (including TSN, IEEE 802.15.4, IEEE 802.11 and 5G) and discuss the design and usage of each testbed. Driven by the knowledge gained during this study, we present suggestions and good practices for researchers and practitioners who are planning to design and develop IIoT testbeds for connectivity research.
- Published
- 2024
6. Condition-Aware Neural Network for Controlled Image Generation
- Author
Cai, Han, Li, Muyang, Zhang, Zhuoyang, Zhang, Qinsheng, Liu, Ming-Yu, and Han, Song
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
- Abstract
We present Condition-Aware Neural Network (CAN), a new method for adding control to image generative models. In parallel to prior conditional control methods, CAN controls the image generation process by dynamically manipulating the weight of the neural network. This is achieved by introducing a condition-aware weight generation module that generates conditional weight for convolution/linear layers based on the input condition. We test CAN on class-conditional image generation on ImageNet and text-to-image generation on COCO. CAN consistently delivers significant improvements for diffusion transformer models, including DiT and UViT. In particular, CAN combined with EfficientViT (CaT) achieves 2.78 FID on ImageNet 512x512, surpassing DiT-XL/2 while requiring 52x fewer MACs per sampling step., Comment: CVPR 2024
- Published
- 2024
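The idea of generating a layer's weights from the condition, rather than feeding the condition in as an extra input, can be caricatured in a few lines. Everything below (the shapes, the linear generator `gen`, the function names) is hypothetical and much simpler than the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
cond_dim, in_dim, out_dim = 8, 16, 4

# hypothetical weight generator: a linear map from the condition embedding
# to a flattened weight matrix for one linear layer
gen = rng.normal(size=(cond_dim, out_dim * in_dim)) * 0.1

def can_linear(x, cond):
    """A linear layer whose weight is produced from the condition."""
    w = (cond @ gen).reshape(out_dim, in_dim)  # condition-aware weight
    return x @ w.T

x = rng.normal(size=(2, in_dim))
c1, c2 = np.eye(cond_dim)[0], np.eye(cond_dim)[1]
```

Different conditions produce different effective weights, so the same input is transformed differently per condition, which is the control mechanism the abstract describes.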
7. Tiny Machine Learning: Progress and Futures
- Author
Lin, Ji, Zhu, Ligeng, Chen, Wei-Ming, Wang, Wei-Chen, and Han, Song
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
- Abstract
Tiny Machine Learning (TinyML) is a new frontier of machine learning. By squeezing deep learning models into billions of IoT devices and microcontrollers (MCUs), we expand the scope of AI applications and enable ubiquitous intelligence. However, TinyML is challenging due to hardware constraints: the tiny memory resource makes it difficult to hold deep learning models designed for cloud and mobile platforms. There is also limited compiler and inference engine support for bare-metal devices. Therefore, we need to co-design the algorithm and system stack to enable TinyML. In this review, we will first discuss the definition, challenges, and applications of TinyML. We then survey the recent progress in TinyML and deep learning on MCUs. Next, we will introduce MCUNet, showing how we can achieve ImageNet-scale AI applications on IoT devices with system-algorithm co-design. We will further extend the solution from inference to training and introduce tiny on-device training techniques. Finally, we present future directions in this area. Today's large model might be tomorrow's tiny model. The scope of TinyML should evolve and adapt over time., Comment: arXiv admin note: text overlap with arXiv:2206.15472
- Published
- 2024
8. DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
- Author
Li, Muyang, Cai, Tianle, Cao, Jiaxin, Zhang, Qinsheng, Cai, Han, Bai, Junjie, Jia, Yangqing, Liu, Ming-Yu, Li, Kai, and Han, Song
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Diffusion models have achieved great success in synthesizing high-quality images. However, generating high-resolution images with diffusion models is still challenging due to the enormous computational costs, resulting in a prohibitive latency for interactive applications. In this paper, we propose DistriFusion to tackle this problem by leveraging parallelism across multiple GPUs. Our method splits the model input into multiple patches and assigns each patch to a GPU. However, naively implementing such an algorithm breaks the interaction between patches and loses fidelity, while incorporating such an interaction will incur tremendous communication overhead. To overcome this dilemma, we observe the high similarity between the input from adjacent diffusion steps and propose displaced patch parallelism, which takes advantage of the sequential nature of the diffusion process by reusing the pre-computed feature maps from the previous timestep to provide context for the current step. Therefore, our method supports asynchronous communication, which can be pipelined by computation. Extensive experiments show that our method can be applied to recent Stable Diffusion XL with no quality degradation and achieve up to a 6.1$\times$ speedup on eight NVIDIA A100s compared to one. Our code is publicly available at https://github.com/mit-han-lab/distrifuser., Comment: CVPR 2024 Highlight Code: https://github.com/mit-han-lab/distrifuser Website: https://hanlab.mit.edu/projects/distrifusion Blog: https://hanlab.mit.edu/blog/distrifusion
- Published
- 2024
9. BitDelta: Your Fine-Tune May Only Be Worth One Bit
- Author
Liu, James, Xiao, Guangxuan, Li, Kai, Lee, Jason D., Han, Song, Dao, Tri, and Cai, Tianle
- Subjects
Computer Science - Machine Learning, Computer Science - Computation and Language
- Abstract
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x, which can also be translated to enhanced generation latency in multi-tenant settings. We validate BitDelta through experiments across Llama-2 and Mistral model families, and on models up to 70B parameters, showcasing minimal performance degradation over all tested settings.
- Published
- 2024
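The 1-bit delta at the heart of BitDelta is simple to sketch. This is an illustrative reconstruction of the sign-plus-scale idea only; the paper additionally calibrates the scale via distillation, which is omitted here:

```python
import numpy as np

def bitdelta_compress(w_base, w_ft):
    """Approximate the fine-tuning delta by its sign plus one per-tensor scale."""
    delta = w_ft - w_base
    scale = np.abs(delta).mean()  # one scalar; the signs cost 1 bit per weight
    return np.sign(delta), scale

def bitdelta_apply(w_base, signs, scale):
    return w_base + signs * scale

rng = np.random.default_rng(0)
w_base = rng.normal(size=(64, 64))
w_ft = w_base + 0.01 * rng.normal(size=(64, 64))  # stand-in for fine-tuning
signs, scale = bitdelta_compress(w_base, w_ft)
w_hat = bitdelta_apply(w_base, signs, scale)
```

In L2 terms, the sign-times-mean approximation is always at least as close to the fine-tuned weights as dropping the delta entirely, which is why a single base model plus many 1-bit deltas is a viable multi-tenant layout.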
10. EfficientViT-SAM: Accelerated Segment Anything Model Without Accuracy Loss
- Author
Zhang, Zhuoyang, Cai, Han, and Han, Song
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
- Abstract
We present EfficientViT-SAM, a new family of accelerated segment anything models. We retain SAM's lightweight prompt encoder and mask decoder while replacing the heavy image encoder with EfficientViT. For the training, we begin with the knowledge distillation from the SAM-ViT-H image encoder to EfficientViT. Subsequently, we conduct end-to-end training on the SA-1B dataset. Benefiting from EfficientViT's efficiency and capacity, EfficientViT-SAM delivers 48.9x measured TensorRT speedup on A100 GPU over SAM-ViT-H without sacrificing performance. Our code and pre-trained models are released at https://github.com/mit-han-lab/efficientvit., Comment: CVPR 2024 Workshop (Efficient Large Vision Models)
- Published
- 2024
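The first training stage described above is knowledge distillation from the SAM-ViT-H image encoder into EfficientViT. In its simplest form, such feature distillation is just a matching loss between student and teacher embeddings; the sketch below assumes a plain MSE objective and hypothetical array shapes:

```python
import numpy as np

def distill_mse(student_emb, teacher_emb):
    """Feature-distillation objective: pull the student's image embeddings
    toward the (frozen) teacher's embeddings."""
    return np.mean((student_emb - teacher_emb) ** 2)

# toy embeddings standing in for encoder outputs on a batch of two images
teacher = np.ones((2, 8))
student = np.zeros((2, 8))
```

Minimizing this loss over the student encoder's parameters (with the prompt encoder and mask decoder kept from SAM) is the warm-up before the end-to-end SA-1B training the abstract mentions.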
11. Qplacer: Frequency-Aware Component Placement for Superconducting Quantum Computers
- Author
Zhang, Junyao, Wang, Hanrui, Ding, Qi, Gu, Jiaqi, Assouly, Reouven, Oliver, William D., Han, Song, Brown, Kenneth R., Li, Hai "Helen", and Chen, Yiran
- Subjects
Quantum Physics, Computer Science - Hardware Architecture, Electrical Engineering and Systems Science - Systems and Control
- Abstract
Noisy Intermediate-Scale Quantum (NISQ) computers face a critical limitation in qubit numbers, hindering their progression towards large-scale and fault-tolerant quantum computing. A significant challenge impeding scaling is crosstalk, characterized by unwanted interactions among neighboring components on quantum chips, including qubits, resonators, and substrate. We motivate a general approach to systematically resolving multifaceted crosstalk in a limited substrate area. We propose Qplacer, a frequency-aware electrostatic-based placement framework tailored for superconducting quantum computers, to alleviate crosstalk by isolating these components in the spatial and frequency domains alongside a compact substrate design. Qplacer commences with a frequency assigner that ensures frequency-domain isolation for qubits and resonators. It then incorporates a padding strategy and resonator partitioning for layout flexibility. Central to our approach is the conceptualization of quantum components as charged particles, enabling strategic spatial isolation through a 'frequency repulsive force' concept. Our results demonstrate that Qplacer carefully crafts the physical component layout to mitigate various crosstalk impacts while maintaining a compact substrate size. On various device topologies and NISQ benchmarks, Qplacer improves fidelity by an average of 36.7x and reduces spatial violations (susceptible to crosstalk) by an average of 12.76x, compared to classical placement engines. Regarding area optimization, compared to manual designs, Qplacer can reduce the required layout area by 2.14x on average.
- Published
- 2024
12. QuantumSEA: In-Time Sparse Exploration for Noise Adaptive Quantum Circuits
- Author
Chen, Tianlong, Zhang, Zhenyu, Wang, Hanrui, Gu, Jiaqi, Li, Zirui, Pan, David Z., Chong, Frederic T., Han, Song, and Wang, Zhangyang
- Subjects
Quantum Physics, Computer Science - Hardware Architecture, Computer Science - Machine Learning
- Abstract
Parameterized Quantum Circuits (PQC) have obtained increasing popularity thanks to their great potential for near-term Noisy Intermediate-Scale Quantum (NISQ) computers. Achieving quantum advantages usually requires a large number of qubits and quantum circuits with enough capacity. However, limited coherence time and massive quantum noise severely constrain the size of quantum circuits that can be executed reliably on real machines. To address these two pain points, we propose QuantumSEA, an in-time sparse exploration for noise-adaptive quantum circuits, aiming to achieve two key objectives: (1) implicit circuit capacity during training, by dynamically exploring the circuit's sparse connectivity while keeping a fixed, small number of quantum gates throughout training, which satisfies coherence-time limits, incurs only light noise, and enables feasible executions on real quantum devices; (2) noise robustness, by jointly optimizing the topology and parameters of quantum circuits under real device noise models. In each update step of sparsity, we leverage the moving average of historical gradients to grow necessary gates and utilize salience-based pruning to eliminate insignificant gates. Extensive experiments are conducted with 7 Quantum Machine Learning (QML) and Variational Quantum Eigensolver (VQE) benchmarks on 6 simulated or real quantum computers, where QuantumSEA consistently surpasses noise-aware search, human-designed, and randomly generated quantum circuit baselines by a clear performance margin. For example, even in the most challenging on-chip training regime, our method establishes state-of-the-art results with only half the number of quantum gates and ~2x savings in circuit execution time. Codes are available at https://github.com/VITA-Group/QuantumSEA., Comment: IEEE International Conference on Quantum Computing and Engineering (QCE 2023)
- Published
- 2024
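The grow-and-prune update described above (moving-average gradients to grow gates, salience to prune them) can be sketched on a flat gate mask. This toy version invents the `update_mask` signature and ignores circuit structure entirely; it only shows how the gate budget stays fixed across a sparsity step:

```python
import numpy as np

def update_mask(mask, grad_avg, salience, k):
    """One in-time sparsity step (toy version): drop the k least-salient
    active gates, then grow the k inactive gates with the largest
    moving-average gradient magnitude."""
    mask = mask.copy()
    active = np.flatnonzero(mask)
    prune = active[np.argsort(salience[active])[:k]]   # salience-based pruning
    mask[prune] = 0
    inactive = np.flatnonzero(mask == 0)
    grow = inactive[np.argsort(-np.abs(grad_avg[inactive]))[:k]]  # gradient-based growth
    mask[grow] = 1
    return mask

mask = np.array([1, 1, 0, 0])
new_mask = update_mask(mask,
                       grad_avg=np.array([0.0, 0.0, 0.5, 0.2]),
                       salience=np.array([0.1, 0.9, 0.0, 0.0]),
                       k=1)
```

Here gate 0 (lowest salience among active gates) is pruned and gate 2 (largest averaged gradient among inactive gates) is grown, so the number of active gates, and hence circuit depth pressure, is unchanged.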
13. VILA: On Pre-training for Visual Language Models
- Author
Lin, Ji, Yin, Hongxu, Ping, Wei, Lu, Yao, Molchanov, Pavlo, Tao, Andrew, Mao, Huizi, Kautz, Jan, Shoeybi, Mohammad, and Han, Song
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Visual language models (VLMs) have rapidly progressed with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but these efforts lack an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities. In this work, we examine the design options for VLM pre-training by augmenting the LLM towards a VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing LLMs during pre-training can achieve decent zero-shot performance, but it lacks in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial, whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data with image-text data during instruction fine-tuning not only remedies the degradation of text-only tasks, but also boosts VLM task accuracy. With an enhanced pre-training recipe we build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells and whistles. Multi-modal pre-training also helps unveil appealing properties of VILA, including multi-image reasoning, enhanced in-context learning, and better world knowledge., Comment: CVPR 2024
- Published
- 2023
14. DGR: Tackling Drifted and Correlated Noise in Quantum Error Correction via Decoding Graph Re-weighting
- Author
Wang, Hanrui, Liu, Pengyu, Liu, Yilian, Gu, Jiaqi, Baker, Jonathan, Chong, Frederic T., and Han, Song
- Subjects
Quantum Physics, Computer Science - Hardware Architecture, Computer Science - Emerging Technologies, Computer Science - Machine Learning
- Abstract
Quantum hardware suffers from high error rates and noise, which makes directly running applications on them ineffective. Quantum Error Correction (QEC) is a critical technique towards fault tolerance which encodes the quantum information distributively in multiple data qubits and uses syndrome qubits to check parity. Minimum-Weight-Perfect-Matching (MWPM) is a popular QEC decoder that takes the syndromes as input and finds the matchings between syndromes that infer the errors. However, there are two paramount challenges for MWPM decoders. First, as noise in real quantum systems can drift over time, there is a potential misalignment with the decoding graph's initial weights, leading to a severe performance degradation in the logical error rates. Second, while the MWPM decoder addresses independent errors, it falls short when encountering correlated errors typical on real hardware, such as those in the 2Q depolarizing channel. We propose DGR, an efficient decoding graph edge re-weighting strategy with no quantum overhead. It leverages the insight that the statistics of matchings across decoding iterations offer rich information about errors on real quantum hardware. By counting the occurrences of edges and edge pairs in decoded matchings, we can statistically estimate the up-to-date probabilities of each edge and the correlations between them. The reweighting process includes two vital steps: alignment re-weighting and correlation re-weighting. The former updates the MWPM weights based on statistics to align with actual noise, and the latter adjusts the weight considering edge correlations. Extensive evaluations on surface code and honeycomb code under various settings show that DGR reduces the logical error rate by 3.6x on average-case noise mismatch with exceeding 5000x improvement under worst-case mismatch., Comment: 13 pages, 19 figures
- Published
- 2023
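The alignment re-weighting step can be sketched directly from the description above: count how often each decoding-graph edge appears in decoded matchings, estimate its up-to-date error probability, and set the MWPM weight to log((1-p)/p). The sketch below is a toy of that step only; the correlation re-weighting over edge pairs, and the edge naming, are omitted or invented:

```python
import math
from collections import Counter

def alignment_reweight(matched_edges_per_round, n_rounds):
    """Estimate per-edge error probability from matching statistics and
    convert it into an MWPM edge weight log((1 - p) / p)."""
    counts = Counter(edge
                     for round_edges in matched_edges_per_round
                     for edge in round_edges)
    return {edge: math.log((1 - c / n_rounds) / (c / n_rounds))
            for edge, c in counts.items()}

# toy history: edge (q1, q2) matched in 2 of 4 rounds, (q3, q4) in 1 of 4
history = [[("q1", "q2")], [("q1", "q2"), ("q3", "q4")], [], []]
w = alignment_reweight(history, n_rounds=4)
```

Edges that keep showing up in matchings are inferred to be more error-prone and get lower weights, which is exactly the drift-alignment behavior the abstract describes.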
15. Q-Pilot: Field Programmable Qubit Array Compilation with Flying Ancillas
- Author
Wang, Hanrui, Tan, Daniel Bochen, Liu, Pengyu, Liu, Yilian, Gu, Jiaqi, Cong, Jason, and Han, Song
- Subjects
Quantum Physics, Computer Science - Hardware Architecture, Computer Science - Emerging Technologies
- Abstract
Neutral atom arrays have become a promising platform for quantum computing, especially the field programmable qubit array (FPQA) endowed with the unique capability of atom movement. This feature allows dynamic alterations in qubit connectivity during runtime, which can reduce the cost of executing long-range gates and improve parallelism. However, this added flexibility introduces new challenges in circuit compilation. Inspired by the placement and routing strategies for FPGAs, we propose to map all data qubits to fixed atoms while utilizing movable atoms to route for 2-qubit gates between data qubits. Coined flying ancillas, these mobile atoms function as ancilla qubits, dynamically generated and recycled during execution. We present Q-Pilot, a scalable compiler for FPQA employing flying ancillas to maximize circuit parallelism. For two important quantum applications, quantum simulation and the Quantum Approximate Optimization Algorithm (QAOA), we devise domain-specific routing strategies. In comparison to alternative technologies such as superconducting devices or fixed atom arrays, Q-Pilot effectively harnesses the flexibility of FPQA, achieving reductions of 1.4x, 27.7x, and 6.3x in circuit depth for 100-qubit random, quantum simulation, and QAOA circuits, respectively., Comment: 10 pages, 16 figures; Published as a conference paper at DAC 2024
- Published
- 2023
16. Atomique: A Quantum Compiler for Reconfigurable Neutral Atom Arrays
- Author
Wang, Hanrui, Liu, Pengyu, Tan, Daniel Bochen, Liu, Yilian, Gu, Jiaqi, Pan, David Z., Cong, Jason, Acar, Umut A., and Han, Song
- Subjects
Quantum Physics, Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
The neutral atom array has gained prominence in quantum computing for its scalability and operation fidelity. Previous works focus on fixed atom arrays (FAAs) that require extensive SWAP operations for long-range interactions. This work explores a novel architecture reconfigurable atom arrays (RAAs), also known as field programmable qubit arrays (FPQAs), which allows for coherent atom movements during circuit execution under some constraints. Such atom movements, which are unique to this architecture, could reduce the cost of long-range interactions significantly if the atom movements could be scheduled strategically. In this work, we introduce Atomique, a compilation framework designed for qubit mapping, atom movement, and gate scheduling for RAA. Atomique contains a qubit-array mapper to decide the coarse-grained mapping of the qubits to arrays, leveraging MAX k-Cut on a constructed gate frequency graph to minimize SWAP overhead. Subsequently, a qubit-atom mapper determines the fine-grained mapping of qubits to specific atoms in the array and considers load balance to prevent hardware constraint violations. We further propose a router that identifies parallel gates, schedules them simultaneously, and reduces depth. We evaluate Atomique across 20+ diverse benchmarks, including generic circuits (arbitrary, QASMBench, SupermarQ), quantum simulation, and QAOA circuits. Atomique consistently outperforms IBM Superconducting, FAA with long-range gates, and FAA with rectangular and triangular topologies, achieving significant reductions in depth and the number of two-qubit gates., Comment: 17 pages, 26 figures; Published as a conference paper at ISCA 2024
- Published
- 2023
17. Machine learning's own Industrial Revolution
- Author
Luo, Yuan, Han, Song, and Liu, Jingjing
- Subjects
Computer Science - Machine Learning
- Abstract
Machine learning is expected to enable the next Industrial Revolution. However, lacking standardized and automated assembly networks, ML faces significant challenges in meeting ever-growing enterprise demands and empowering broad industries. In this Perspective, we argue that ML needs to first complete its own Industrial Revolution, elaborate on how to best achieve its goals, and discuss new opportunities to enable rapid translation from ML's innovation frontier to mass production and utilization.
- Published
- 2023
18. PockEngine: Sparse and Efficient Fine-tuning in a Pocket
- Author
Zhu, Ligeng, Hu, Lanxiang, Lin, Ji, Wang, Wei-Chen, Chen, Wei-Ming, Gan, Chuang, and Han, Song
- Subjects
Computer Science - Machine Learning
- Abstract
On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data). However, existing training frameworks are designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and lack the optimizations for learning on the edge, which faces challenges of resource limitations and edge hardware diversity. We introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices. PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model with measured memory saving and latency reduction while maintaining the model quality. Secondly, PockEngine is compilation first: the entire training graph (including forward, backward and optimization steps) is derived at compile-time, which reduces the runtime overhead and brings opportunities for graph transformations. PockEngine also integrates a rich set of training graph optimizations, including operator reordering and backend switching, which further reduce the training cost. PockEngine supports diverse applications, frontends and hardware backends: it flexibly compiles and tunes models defined in PyTorch/TensorFlow/Jax and deploys binaries to mobile CPU/GPU/DSPs. We evaluated PockEngine on both vision models and large language models. PockEngine achieves up to 15 $\times$ speedup over off-the-shelf TensorFlow (Raspberry Pi) and 5.6 $\times$ memory savings in back-propagation (Jetson AGX Orin). Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9$\times$ faster than PyTorch.
- Published
- 2023
19. TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs
- Author
Tang, Haotian, Yang, Shang, Liu, Zhijian, Hong, Ke, Yu, Zhongming, Li, Xiuyu, Dai, Guohao, Wang, Yu, and Han, Song
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Performance
- Abstract
Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR, autonomous driving, and graph understanding in recommendation systems. Since the computation pattern is sparse and irregular, specialized high-performance kernels are required. Existing GPU libraries offer two dataflow types for sparse convolution. The gather-GEMM-scatter dataflow is easy to implement but not optimal in performance, while the dataflows with overlapped computation and memory access (e.g. implicit GEMM) are highly performant but have very high engineering costs. In this paper, we introduce TorchSparse++, a new GPU library that achieves the best of both worlds. We create a highly efficient Sparse Kernel Generator that generates performant sparse convolution kernels at less than one-tenth of the engineering cost of the current state-of-the-art system. On top of this, we design the Sparse Autotuner, which extends the design space of existing sparse convolution libraries and searches for the best dataflow configurations for training and inference workloads. Consequently, TorchSparse++ achieves 2.9x, 3.3x, 2.2x and 1.7x measured end-to-end speedup on an NVIDIA A100 GPU over state-of-the-art MinkowskiEngine, SpConv 1.2, TorchSparse and SpConv v2 in inference; and is 1.2-1.3x faster than SpConv v2 in mixed precision training across seven representative autonomous driving benchmarks. It also seamlessly supports graph convolutions, achieving 2.6-7.6x faster inference speed compared with state-of-the-art graph deep learning libraries., Comment: MICRO 2023; Haotian Tang and Shang Yang contributed equally to this project
- Published
- 2023
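The gather-GEMM-scatter dataflow mentioned above is the simpler of the two sparse-convolution dataflows. A minimal sketch for a single weight offset, with hypothetical index arrays mapping active input points to output points, looks like:

```python
import numpy as np

def gather_gemm_scatter(feats, weight, in_idx, out_idx, n_out):
    """One weight-offset step of sparse convolution:
    gather active input features, multiply by the kernel weight,
    and scatter-add the partial results to output positions."""
    gathered = feats[in_idx]                               # gather
    partial = gathered @ weight                            # GEMM
    out = np.zeros((n_out, weight.shape[1]), dtype=feats.dtype)
    np.add.at(out, out_idx, partial)                       # scatter-add
    return out

feats = np.arange(8, dtype=np.float64).reshape(4, 2)  # 4 active points, 2 channels
weight = np.eye(2)                                    # identity kernel for clarity
out = gather_gemm_scatter(feats, weight, in_idx=[0, 2], out_idx=[1, 1], n_out=3)
```

`np.add.at` is used because several input points can map to the same output point; a full kernel repeats this for every offset in the convolution window, which is precisely the memory traffic that implicit-GEMM dataflows overlap with computation.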
20. A comprehensive evaluation of water saving in Weifang, Shandong Province, China
- Author
Wang, Haijun, Han, Song, Lei, Linxi, Cai, Zhenhua, Cong, Xin, and Liu, Yuyu
- Published
- 2023
21. A Unified Startup Control Strategy for Modular Multilevel Converter with the Supercapacitor Energy Storage System
- Author
Han, Song, Deng, Tianbai, Yuan, Tao, Zhu, Qianlong, Tao, Jun, Zhang, Huaying, and Wang, Qing
- Published
- 2023
22. Preparation of foam material via co-sintering of NaCl and PTFE for oil/water separation
- Author
Han, Song, Wang, Yanqing, Xu, Yanru, and Wu, Jinlong
- Published
- 2024
23. Electrospinning and electrospraying of polybutylene succinate/esterified cellulose nanofibril composites
- Author
Kim, Jeong-Ki, Bandi, Rajkumar, Dadigala, Ramakrishna, Han, Song-Yi, Van Hai, Le, Cho, Seung-Woo, Ma, Seo-Young, Lee, Da-Young, Kwon, Gu-Joong, and Lee, Seung-Hwan
- Published
- 2024
24. Reduction-to-the-pole method for aeromagnetic three-component data in different latitudes
- Author
Guo, Hua, Wang, Ming, Han, Song, Chang, Chang, and Yao, Yuyang
- Published
- 2024
25. Toward Carbon Neutral Road Transport: Development Strategies and New R&D Organizational Paradigms
- Author
Hao, Xu, Wang, Hewu, Zheng, Yali, Lin, Yan, Han, Song, Zhong, Ruiheng, and Li, Jialin
- Published
- 2024
26. Figure S3 from Fibroblasts in the Aged Pancreas Drive Pancreatic Cancer Progression
- Author
Zabransky, Daniel J., Chhabra, Yash, Fane, Mitchell E., Kartalia, Emma, Leatherman, James M., Hüser, Laura, Zimmerman, Jacquelyn W., Delitto, Daniel, Han, Song, Armstrong, Todd D., Charmsaz, Soren, Guinn, Samantha, Pramod, Sneha, Thompson, Elizabeth D., Hughes, Steven J., O'Connell, Jennifer, Egan, Josephine M., Jaffee, Elizabeth M., and Weeraratna, Ashani T.
- Published
- 2024
27. Supplementary Table S4 from Fibroblasts in the Aged Pancreas Drive Pancreatic Cancer Progression
- Author
-
Zabransky, Daniel J., primary, Chhabra, Yash, primary, Fane, Mitchell E., primary, Kartalia, Emma, primary, Leatherman, James M., primary, Hüser, Laura, primary, Zimmerman, Jacquelyn W., primary, Delitto, Daniel, primary, Han, Song, primary, Armstrong, Todd D., primary, Charmsaz, Soren, primary, Guinn, Samantha, primary, Pramod, Sneha, primary, Thompson, Elizabeth D., primary, Hughes, Steven J., primary, O'Connell, Jennifer, primary, Egan, Josephine M., primary, Jaffee, Elizabeth M., primary, and Weeraratna, Ashani T., primary
- Published
- 2024
- Full Text
- View/download PDF
28. Table S2 from Fibroblasts in the Aged Pancreas Drive Pancreatic Cancer Progression
- Author
-
Zabransky, Daniel J., primary, Chhabra, Yash, primary, Fane, Mitchell E., primary, Kartalia, Emma, primary, Leatherman, James M., primary, Hüser, Laura, primary, Zimmerman, Jacquelyn W., primary, Delitto, Daniel, primary, Han, Song, primary, Armstrong, Todd D., primary, Charmsaz, Soren, primary, Guinn, Samantha, primary, Pramod, Sneha, primary, Thompson, Elizabeth D., primary, Hughes, Steven J., primary, O'Connell, Jennifer, primary, Egan, Josephine M., primary, Jaffee, Elizabeth M., primary, and Weeraratna, Ashani T., primary
- Published
- 2024
- Full Text
- View/download PDF
29. Data from Fibroblasts in the Aged Pancreas Drive Pancreatic Cancer Progression
- Author
-
Zabransky, Daniel J., primary, Chhabra, Yash, primary, Fane, Mitchell E., primary, Kartalia, Emma, primary, Leatherman, James M., primary, Hüser, Laura, primary, Zimmerman, Jacquelyn W., primary, Delitto, Daniel, primary, Han, Song, primary, Armstrong, Todd D., primary, Charmsaz, Soren, primary, Guinn, Samantha, primary, Pramod, Sneha, primary, Thompson, Elizabeth D., primary, Hughes, Steven J., primary, O'Connell, Jennifer, primary, Egan, Josephine M., primary, Jaffee, Elizabeth M., primary, and Weeraratna, Ashani T., primary
- Published
- 2024
- Full Text
- View/download PDF
30. The Verticillium dahliae effector VdPHB1 promotes pathogenicity in cotton and interacts with the immune protein GhMC4
- Author
-
Song, Qingwei, primary, Han, Song, additional, Hu, Shi, additional, Xu, Yiyang, additional, and Zuo, Kaijing, additional
- Published
- 2024
- Full Text
- View/download PDF
31. Multiple Brillouin Zone Winding of Topological Chiral Edge States for Slow Light Applications
- Author
-
Chen, Fujia, primary, Xue, Haoran, additional, Pan, Yuang, additional, Wang, Maoren, additional, Hu, Yuanhang, additional, Zhang, Li, additional, Chen, Qiaolu, additional, Han, Song, additional, Liu, Gui-geng, additional, Gao, Zhen, additional, Zhou, Peiheng, additional, Yin, Wenyan, additional, Chen, Hongsheng, additional, Zhang, Baile, additional, and Yang, Yihao, additional
- Published
- 2024
- Full Text
- View/download PDF
32. Evaluating the Utility of Atypical Central Neurocytoma Classification and Treatment Strategies
- Author
-
Sun, Feixia, primary, Yang, Zuocheng, additional, Kong, Ronghua, additional, and Han, Song, additional
- Published
- 2024
- Full Text
- View/download PDF
33. Exploring the efficacy and mechanism of Bailing capsule to improve polycystic ovary syndrome in mice based on intestinal-derived LPS-TLR4 pathway
- Author
-
Guan, Hao-ru, primary, Li, Bo, additional, Zhang, Ze-hua, additional, Wu, Han-song, additional, Wang, Ning, additional, Chen, Xian-fang, additional, Zhou, Cheng-liang, additional, Bian, Xue-ren, additional, Li, Lu, additional, Xu, Wan-feng, additional, He, Xing-lishang, additional, Dong, Ying-jie, additional, Jiang, Ning-hua, additional, Su, Jie, additional, Lv, Gui-yuan, additional, and Chen, Su-hong, additional
- Published
- 2024
- Full Text
- View/download PDF
34. Multi-hop relay selection for underwater acoustic sensor networks: A dynamic combinatorial multi-armed bandit learning approach
- Author
-
Dai, Jun, primary, Li, Xinbin, additional, Han, Song, additional, Liu, Zhixin, additional, Zhao, Haihong, additional, and Yan, Lei, additional
- Published
- 2024
- Full Text
- View/download PDF
35. Efficient Streaming Language Models with Attention Sinks
- Author
-
Xiao, Guangxuan, Tian, Yuandong, Chen, Beidi, Han, Song, and Lewis, Mike
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided at https://github.com/mit-han-lab/streaming-llm., Comment: ICLR 2024
- Published
- 2023
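The attention-sink cache policy described in the abstract above can be sketched in a few lines (an illustrative token-index model, not the authors' implementation, which operates on per-layer KV tensors; the function name and default sizes are hypothetical):

```python
def streaming_kv_policy(num_tokens, n_sink=4, window=1024):
    """Sketch of the attention-sink cache policy: keep the KV entries of the
    first n_sink "sink" tokens plus a sliding window of the most recent
    tokens, and evict everything in between."""
    keep = set(range(min(n_sink, num_tokens)))   # initial attention-sink tokens
    start = max(n_sink, num_tokens - window)     # oldest recent token still cached
    keep.update(range(start, num_tokens))        # the sliding recency window
    return sorted(keep)
```

After 8 tokens with a window of 3, only token 4 is evicted: `streaming_kv_policy(8, n_sink=4, window=3)` returns `[0, 1, 2, 3, 5, 6, 7]`.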
36. LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
- Author
-
Chen, Yukang, Qian, Shengju, Tang, Haotian, Lai, Xin, Liu, Zhijian, Han, Song, and Jia, Jiaya
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training with a context length of 8192 incurs 16x the computational cost in self-attention layers compared to 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shifted sparse attention effectively enables context extension, leading to non-trivial computation saving with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA combines this improved LoRA with S^2-Attn. LongLoRA demonstrates strong empirical results on various tasks on Llama2 models from 7B/13B to 70B. LongLoRA extends Llama2 7B from 4k context to 100k, or Llama2 70B to 32k on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like Flash-Attention2. In addition, we further conduct supervised fine-tuning with LongLoRA and our long instruction-following LongAlpaca dataset., Comment: Code, models, dataset, and demo are available at https://github.com/dvlab-research/LongLoRA
- Published
- 2023
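The token shift behind S^2-Attn can be illustrated as follows (a sketch under assumed shapes, using NumPy rather than the authors' PyTorch code; `shift_tokens` is a hypothetical helper):

```python
import numpy as np

def shift_tokens(x, group_size, shift):
    """Sketch of the S^2-Attn token shift. x: (seq_len, dim). For half of the
    attention heads, tokens are rolled by half a group so that the
    local-attention groups of the two halves overlap, letting information
    cross group boundaries during sparse local attention."""
    if shift:
        x = np.roll(x, -group_size // 2, axis=0)  # move each token half a group earlier
    # local attention is then computed independently inside each group
    return x.reshape(-1, group_size, x.shape[-1])

x = np.arange(8, dtype=float)[:, None]            # seq_len=8, dim=1
groups = shift_tokens(x, group_size=4, shift=True)
# The shifted groups start at tokens 2 and 6, with tokens 0-1 wrapping around.
```

The unshifted half of the heads keeps groups [0..3] and [4..7]; the shifted half sees [2..5] and [6,7,0,1], so adjacent groups share context.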
37. DISQ: Dynamic Iteration Skipping for Variational Quantum Algorithms
- Author
-
Zhang, Junyao, Wang, Hanrui, Ravi, Gokul Subramanian, Chong, Frederic T., Han, Song, Mueller, Frank, and Chen, Yiran
- Subjects
Quantum Physics ,Electrical Engineering and Systems Science - Systems and Control - Abstract
This paper proposes DISQ to craft a stable landscape for VQA training and tackle the noise drift challenge. DISQ adopts a "drift detector" with a reference circuit to identify and skip iterations that are severely affected by noise drift errors. Specifically, the circuits from the previous training iteration are re-executed as a reference circuit in the current iteration to estimate noise drift impacts. The iteration is deemed compromised by noise drift errors and thus skipped if noise drift flips the direction of the ideal optimization gradient. To enhance noise drift detection reliability, we further propose to leverage multiple reference circuits from previous iterations to provide a well-founded judgment of current noise drift. Nevertheless, multiple reference circuits also introduce considerable execution overhead. To mitigate the extra overhead, we propose Pauli-term subsetting (prime and minor subsets) to execute only observable circuits with large coefficient magnitudes (prime subset) during drift detection. Only the minor subset is executed when the current iteration is drift-free. Evaluations across various applications and QPUs demonstrate that DISQ can mitigate a significant portion of the noise drift impact on VQAs and achieve 1.51-2.24x fidelity improvement over the traditional baseline. DISQ's benefit is 1.1-1.9x over the best alternative approach while boosting average noise detection speed by 2.07x.
- Published
- 2023
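The core skip criterion can be reduced to a toy scalar check (an assumption drawn from the abstract's wording alone, not the paper's implementation):

```python
def drift_compromised(ideal_delta, measured_delta):
    """Toy scalar version of the drift check: the previous iteration's circuit
    is re-executed as a reference; if the re-measured energy change flips the
    sign of the change the ideal gradient step predicted, the iteration is
    flagged as drift-compromised and skipped."""
    return ideal_delta * measured_delta < 0
```

For instance, `drift_compromised(-0.3, 0.1)` is `True`: the optimizer predicted the energy should fall, but the drifted measurement rose.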
38. Inhibiting the NLRP3 Inflammasome with MCC950 Alleviates Neurological Impairment in the Brain of EAE Mice
- Author
-
Hou, Baohua, Yin, Jun, Liu, Shuyan, Guo, Jincheng, Zhang, Baobao, Zhang, Zhenzhen, Yang, Lanping, Tan, Xiying, Long, Yijiao, Feng, Sijie, Zhou, Jingchun, Wu, Yifan, Wang, Xueyang, Han, Song, Wang, Zhenhui, and He, Xiaohua
- Published
- 2024
- Full Text
- View/download PDF
39. Research on the characteristics of total-field data converted from aeromagnetic vertical gradient data based on a continuation conversion filtering algorithm
- Author
-
Guo, Hua, Xu, Xi, Han, Song, Zheng, Qiang, and Liu, Haojun
- Published
- 2024
- Full Text
- View/download PDF
40. Upregulation of Spinal MDGA1 in Rats After Nerve Injury Alters Interactions Between Neuroligin-2 and Postsynaptic Scaffolding Proteins and Increases GluR1 Subunit Surface Delivery in the Spinal Cord Dorsal Horn
- Author
-
Li, Hui-Li, Guo, Rui-Juan, Ai, Zhang-Ran, Han, Song, Guan, Yun, Li, Jun-Fa, and Wang, Yun
- Published
- 2024
- Full Text
- View/download PDF
41. A rare case report of infratentorial cisternal angiolipoma with review of literature
- Author
-
Li, Shi-Ze, Shen, Fang, Xu, Tao, Yang, Yue, Zhou, Ling-Li, Bai, Guang-Hui, and Sheng, Han-Song
- Published
- 2024
- Full Text
- View/download PDF
42. MicroRNA-124 conducts neuroprotective effect via inhibiting AK4/ATF3 after subarachnoid hemorrhage
- Author
-
Jiang, Wei, Jia, Qingge, Ma, Hongxin, Han, Song, Bi, Shijun, Zhu, Kunyuan, Chen, Ligang, and Liang, Guobiao
- Published
- 2024
- Full Text
- View/download PDF
43. Retrospective: EIE: Efficient Inference Engine on Sparse and Compressed Neural Network
- Author
-
Han, Song, Liu, Xingyu, Mao, Huizi, Pu, Jing, Pedram, Ardavan, Horowitz, Mark A., and Dally, William J.
- Subjects
Computer Science - Hardware Architecture - Abstract
EIE proposed to accelerate pruned and compressed neural networks, exploiting weight sparsity, activation sparsity, and 4-bit weight-sharing in neural network accelerators. Since its publication in ISCA'16, it has opened a new design space for accelerating pruned and sparse neural networks and spawned many algorithm-hardware co-designs for model compression and acceleration, both in academia and in commercial AI chips. In retrospect, we review the background of this project, summarize the pros and cons, and discuss new opportunities where pruning, sparsity, and low precision can accelerate emerging deep learning workloads., Comment: Invited retrospective paper at ISCA 2023
- Published
- 2023
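The three techniques named in the abstract (weight sparsity via compressed storage, activation sparsity via zero-skipping, and 4-bit weight sharing via a codebook) can be combined in a short reference model (illustrative Python; EIE itself is a hardware design, and these names are assumptions):

```python
import numpy as np

def eie_matvec(row_ptr, col_idx, weight_idx, codebook, x):
    """Sketch of EIE-style inference on one sparse layer: nonzero weights are
    stored in CSR form as 4-bit indices into a small shared codebook, and
    multiplies for zero input activations are skipped entirely."""
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            a = x[col_idx[k]]
            if a != 0.0:                        # exploit activation sparsity
                y[r] += codebook[weight_idx[k]] * a
    return y

codebook = np.array([0.0, 1.0, 2.0])            # 4-bit sharing allows 16 entries
y = eie_matvec([0, 2, 3], [0, 2, 1], [1, 2, 1], codebook, np.array([1.0, 0.0, 3.0]))
```

Here row 0 holds codebook weights 1.0 (column 0) and 2.0 (column 2), so y[0] = 1.0*1.0 + 2.0*3.0 = 7.0, while row 1's only product is skipped because its activation is zero.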
44. Time-Sensitive Networking (TSN) for Industrial Automation: A Survey
- Author
-
Wang, Gang, Zhang, Tianyu, Xue, Chuanyu, Wang, Jiachen, Nixon, Mark, and Han, Song
- Subjects
Computer Science - Networking and Internet Architecture - Abstract
With the introduction of Cyber-Physical Systems (CPS) and Internet of Things (IoT) into industrial applications, industrial automation is undergoing tremendous change, especially with regard to improving efficiency and reducing the cost of products. Industrial automation applications are often required to transmit time- and safety-critical data to monitor and control industrial processes, especially for critical control systems. There are a number of solutions to meet these requirements (e.g., priority-based real-time schedules and closed-loop feedback control systems). However, due to their different processing capabilities (e.g., in the end devices and network switches), different vendors may offer distinct solutions, and this makes the large-scale integration of devices from different vendors difficult or impossible. IEEE 802.1 Time-Sensitive Networking (TSN) is a standardization group formed to enhance and optimize the IEEE 802.1 network standards, especially for Ethernet-based networks. These solutions can be evolved and adapted into a cross-industry scenario, such as a large-scale distributed industrial plant, which requires multiple industrial entities working collaboratively. This paper provides a comprehensive review of the current advances in TSN standards for industrial automation. We present the state-of-the-art IEEE TSN standards and discuss the opportunities and challenges when integrating each protocol into the industry domains. Finally, we discuss promising research on applying the TSN technology to industrial automation applications.
- Published
- 2023
45. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- Author
-
Lin, Ji, Tang, Jiaming, Tang, Haotian, Yang, Shang, Chen, Wei-Ming, Wang, Wei-Chen, Xiao, Guangxuan, Dang, Xingyu, Gan, Chuang, and Han, Song
- Subjects
Computer Science - Computation and Language - Abstract
Large language models (LLMs) have fundamentally transformed the capabilities of numerous applications, from natural language processing to more intricate domain-specific tasks in robotics and autonomous driving. Moreover, the importance of on-device LLMs has grown significantly in recent years. Running LLMs on edge devices not only promises reduced latency and improved user experience but also aligns with the increasing need for user privacy, as data processing can occur locally. However, the astronomical model sizes of modern LLMs and constraints of the edge devices, primarily in terms of memory size and bandwidth, pose significant deployment challenges. In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations, not the weights. AWQ does not rely on any backpropagation or reconstruction, so it can well preserve LLMs' generalization ability on different domains and modalities, without overfitting to the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for on-device LLM/VLMs, offering more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs., Comment: Code available at: https://github.com/mit-han-lab/llm-awq
- Published
- 2023
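The activation-aware scaling idea can be sketched as follows (a simplified model: the paper searches the scaling exponent per layer and uses grouped low-bit storage, whereas this sketch fixes the exponent at 0.5 and quantizes per tensor; all names are illustrative):

```python
import numpy as np

def awq_quantize(W, act_mag, n_bits=4, alpha=0.5):
    """Sketch of activation-aware weight quantization: input channels with
    larger average activation magnitude are scaled up before rounding, which
    shrinks their relative quantization error; the inverse scale is undone at
    dequantization (in practice it is folded into the preceding op)."""
    s = act_mag ** alpha                    # per-input-channel scale from activations
    s = s / s.mean()                        # normalize to keep overall range stable
    Ws = W * s                              # enlarge salient channels pre-rounding
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(Ws).max() / qmax          # symmetric per-tensor step (simplified)
    Wq = np.clip(np.round(Ws / step), -qmax, qmax)
    return (Wq * step) / s                  # dequantize, then undo the scale

W = np.array([[0.05, 1.0], [-0.03, -0.8]])
act = np.array([10.0, 1.0])                 # first input channel is salient
W_hat = awq_quantize(W, act)
```

Without the scaling, the small salient weight 0.05 would round to zero at 4 bits; with it, the reconstruction stays within about 0.005 of the original.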
46. Lightening-Transformer: A Dynamically-operated Optically-interconnected Photonic Transformer Accelerator
- Author
-
Zhu, Hanqing, Gu, Jiaqi, Wang, Hanrui, Jiang, Zixuan, Zhang, Zhekai, Tang, Rongxing, Feng, Chenghao, Han, Song, Chen, Ray T., and Pan, David Z.
- Subjects
Computer Science - Emerging Technologies ,Computer Science - Hardware Architecture ,Physics - Optics - Abstract
The wide adoption and significant computing resource demands of attention-based transformers, e.g., Vision Transformers and large language models (LLMs), have driven the demand for efficient hardware accelerators. There is a growing interest in exploring photonics as an alternative technology to digital electronics due to its high energy efficiency and ultra-fast processing speed. Photonic accelerators have shown promising results for CNNs, which mainly rely on weight-static linear operations. However, they encounter issues when efficiently supporting Transformer architectures, questioning the applicability of photonics to advanced ML tasks. The primary hurdle lies in their inefficiency in handling the unique workloads in Transformers, i.e., dynamic and full-range tensor multiplication. In this work, we propose Lightening-Transformer, the first light-empowered, high-performance, and energy-efficient photonic Transformer accelerator. To overcome prior designs' fundamental limitations, we introduce a novel dynamically-operated photonic tensor core, DPTC, a crossbar array of interference-based optical vector dot-product engines supporting highly parallel, dynamic, and full-range matrix multiplication. Furthermore, we design a dedicated accelerator that integrates our novel photonic computing cores with photonic interconnects for inter-core data broadcast, fully unleashing the power of optics. Comprehensive evaluations show that our design achieves >2.6x energy and >12x latency reductions compared to prior photonic accelerators and delivers the lowest energy cost and 2 to 3 orders of magnitude lower energy-delay product compared to electronic Transformer accelerators, all while maintaining digital-comparable accuracy. Our work highlights the immense potential of photonics for advanced ML workloads, such as Transformer-backboned LLMs. Our work is available at https://github.com/zhuhanqing/Lightening-Transformer., Comment: Published as a conference paper in HPCA 2024. Received the Reproducibility Badges at IEEE. Our implementation is available at https://github.com/zhuhanqing/Lightening-Transformer
- Published
- 2023
47. Real-Time Scheduling for 802.1Qbv Time-Sensitive Networking (TSN): A Systematic Review and Experimental Study
- Author
-
Xue, Chuanyu, Zhang, Tianyu, Zhou, Yuanbin, Nixon, Mark, Loveless, Andrew, and Han, Song
- Subjects
Computer Science - Networking and Internet Architecture ,Computer Science - Distributed, Parallel, and Cluster Computing - Abstract
Time-Sensitive Networking (TSN) has been recognized as one of the key enabling technologies for Industry 4.0 and has been deployed in many mission- and safety-critical applications e.g., automotive and aerospace systems. Given the stringent real-time requirements of these applications, the Time-Aware Shaper (TAS) draws special attention among TSN's many traffic shapers due to its ability to achieve deterministic timing guarantees. Many scheduling methods for TAS shapers have been recently developed that claim to improve system schedulability. However, these scheduling methods have yet to be thoroughly evaluated, especially through experimental comparisons, to provide a systematical understanding of their performance using different evaluation metrics in diverse application scenarios. In this paper, we fill this gap by presenting a systematic review and experimental study on existing TAS-based scheduling methods for TSN. We first categorize the system models employed in these works along with the specific problems they aim to solve, and outline the fundamental considerations in the designs of TAS-based scheduling methods. We then perform an extensive evaluation on 17 representative solutions using both high-fidelity simulations and a real-life TSN testbed, and compare their performance under both synthetic scenarios and real-life industrial use cases. Through these experimental studies, we identify the limitations of individual scheduling methods and highlight several important findings. We expect this work will provide foundational knowledge and performance benchmarks needed for future studies on real-time TSN scheduling., Comment: 21 pages, 6 authors, RTAS24 tech report
- Published
- 2023
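The Time-Aware Shaper that these scheduling methods target can be modeled with a simple gate-control-list check (a toy model of IEEE 802.1Qbv semantics; the entry format and names are illustrative):

```python
def gate_open(gcl, queue, t, cycle_time):
    """Toy model of 802.1Qbv gate evaluation: gcl is a list of
    (offset_us, duration_us, open_queues) entries that repeat every
    cycle_time; a frame queued on `queue` may start transmitting at time t
    only if some entry covering t % cycle_time has that queue's gate open."""
    phase = t % cycle_time
    for offset, duration, open_queues in gcl:
        if offset <= phase < offset + duration and queue in open_queues:
            return True
    return False

# Queue 7 gets an exclusive 40us window; best-effort queues 0-1 share the rest.
gcl = [(0, 40, {7}), (40, 60, {0, 1})]
```

At t = 125 with a 100us cycle, the phase is 25, so queue 7 may transmit but queue 0 must wait for the second window; computing such conflict-free windows offline is exactly the scheduling problem the surveyed methods solve.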
48. FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
- Author
-
Xiao, Guangxuan, Yin, Tianwei, Freeman, William T., Durand, Frédo, and Han, Song
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to the subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend features among subjects. We present FastComposer which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity blending problem in the multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing the attention of reference subjects localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves 300$\times$-2500$\times$ speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer., Comment: The first two authors contributed equally to this work
- Published
- 2023
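The cross-attention localization supervision can be approximated by a small penalty on attention mass outside a subject's mask (a sketch inferred from the abstract; the paper's exact loss may differ, and the names are hypothetical):

```python
import numpy as np

def localization_loss(attn, mask):
    """Penalty on cross-attention mass falling outside the subject's region.
    attn: one subject token's attention map over image patches; mask: binary
    segmentation of where that subject appears in the target image."""
    attn = attn / attn.sum()                   # normalize to a distribution
    return float((attn * (1.0 - mask)).sum())  # mass leaked outside the mask

mask = np.array([[1.0, 0.0], [0.0, 0.0]])
leaky = localization_loss(np.ones((2, 2)), mask)   # uniform attention leaks 75%
focused = localization_loss(mask.copy(), mask)     # perfectly localized: zero loss
```

Minimizing this during training pushes each reference subject's attention into its own region, which is how the abstract describes preventing identity blending between subjects.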
49. Ginseng extracts improve circadian clock gene expression and reduce inflammation directly and indirectly through gut microbiota and PI3K signaling pathway
- Author
-
Zhang, Xue-Ying, primary, Khakisahneh, Saeid, additional, Han, Song-Yi, additional, Song, Eun-Ji, additional, Nam, Young-Do, additional, and Kim, Hojun, additional
- Published
- 2024
- Full Text
- View/download PDF
50. Micromechanical modeling for longitudinal tensile property of unidirectional CFRP considering dispersion of fiber properties
- Author
-
Wang, Hao, primary, Zhong, Xiang-Yu, additional, Jia, He, additional, Zhang, Lian-Wang, additional, Liu, Han-Song, additional, Sun, Ming-Chen, additional, Liu, Tian-Wei, additional, Bai, Jiang-Bo, additional, Ge, Si-Cheng, additional, and Bao, Jian-Wen, additional
- Published
- 2024
- Full Text
- View/download PDF