Author: "Stoica, A." - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Stoica, A."' showing total 32,659 results

Start Over Author "Stoica, A."

32,659 results on '"Stoica, A."'

1. BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Author: Zhao, Yilong, Yang, Shuo, Zhu, Kan, Zheng, Lianmin, Kasikci, Baris, Zhou, Yang, Xing, Jiarong, and Stoica, Ion
Subjects: Computer Science - Machine Learning
Abstract: Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality makes requests more diverse in compute and memory demands, creating unique opportunities for throughput improvement by resource overlapping. However, a request schedule that maximizes resource overlapping can conflict with the schedule that maximizes prefix sharing, a widely-used performance optimization, causing sub-optimal inference throughput. We present BlendServe, a system that maximizes resource utilization of offline batch inference by combining the benefits of resource overlapping and prefix sharing using a resource-aware prefix tree. BlendServe exploits the relaxed latency requirements in offline batch inference to reorder and overlap requests with varied resource demands while ensuring high prefix sharing. We evaluate BlendServe on a variety of synthetic multi-modal workloads and show that it provides up to $1.44\times$ throughput boost compared to widely-used industry standards, vLLM and SGLang.
Published: 2024

2. MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

Author: Cao, Shiyi, Liu, Shu, Griggs, Tyler, Schafhalter, Peter, Liu, Xiaoxuan, Sheng, Ying, Gonzalez, Joseph E., Zaharia, Matei, and Stoica, Ion
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Efficient deployment of large language models, particularly Mixture of Experts (MoE), on resource-constrained platforms presents significant challenges, especially in terms of computational efficiency and memory utilization. The MoE architecture, renowned for its ability to increase model capacity without a proportional increase in inference cost, greatly reduces the token generation latency compared with dense models. However, the large model size makes MoE models inaccessible to individuals without high-end GPUs. In this paper, we propose a high-throughput MoE batch inference system, that significantly outperforms past work. MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization, and a performance model, HRM, based on a Hierarchical Roofline Model we introduce to help find policies with higher throughput than existing systems. MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB). When the theoretical system throughput is bounded by the GPU memory, MoE-Lightning can reach the throughput upper bound with 2-3x less CPU memory, significantly increasing resource utilization. MoE-Lightning also supports efficient batch inference for much larger MoEs (e.g., Mixtral 8x22B and DBRX) on multiple low-cost GPUs (e.g., 2-4 T4).
Published: 2024

3. Pie: Pooling CPU Memory for LLM Inference

Author: Xu, Yi, Mao, Ziming, Mo, Xiangxi, Liu, Shu, and Stoica, Ion
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: The rapid growth of LLMs has revolutionized natural language processing and AI analysis, but their increasing size and memory demands present significant challenges. A common solution is to spill over to CPU memory; however, traditional GPU-CPU memory swapping often results in higher latency and lower throughput. This paper introduces Pie, an LLM inference framework that addresses these challenges with performance-transparent swapping and adaptive expansion. By leveraging predictable memory access patterns and the high bandwidth of modern hardware like the NVIDIA GH200 Grace Hopper Superchip, Pie enables concurrent data swapping without affecting foreground computation, expanding effective memory without added latency. Adaptive expansion dynamically adjusts CPU memory allocation based on real-time information, optimizing memory usage and performance under varying conditions. Pie maintains low computation latency, high throughput, and high elasticity. Our experimental evaluation demonstrates that Pie achieves optimal swapping policy during cache warmup and effectively balances increased memory capacity with negligible impact on computation. With its extended capacity, Pie outperforms vLLM by up to 1.9X in throughput and 2X in latency. Additionally, Pie can reduce GPU memory usage by up to 1.67X while maintaining the same performance. Compared to FlexGen, an offline profiling-based swapping solution, Pie achieves magnitudes lower latency and 9.4X higher throughput.
Published: 2024

4. Min-Max Framework for Majorization-Minimization Algorithms in Signal Processing Applications: An Overview

Author: Saini, Astha, Stoica, Petre, Babu, Prabhu, and Arora, Aakash
Subjects: Electrical Engineering and Systems Science - Signal Processing
Abstract: This monograph presents a theoretical background and a broad introduction to the Min-Max Framework for Majorization-Minimization (MM4MM), an algorithmic methodology for solving minimization problems by formulating them as min-max problems and then employing majorization-minimization. The monograph lays out the mathematical basis of the approach used to reformulate a minimization problem as a min-max problem. With the prerequisites covered, including multiple illustrations of the formulations for convex and non-convex functions, this work serves as a guide for developing MM4MM-based algorithms for solving non-convex optimization problems in various areas of signal processing. As special cases, we discuss using the majorization-minimization technique to solve min-max problems encountered in signal processing applications and min-max problems formulated using the Lagrangian. Lastly, we present detailed examples of using MM4MM in ten signal processing applications such as phase retrieval, source localization, independent vector analysis, beamforming and optimal sensor placement in wireless sensor networks. The devised MM4MM algorithms are free of hyper-parameters and enjoy the advantages inherited from the use of the majorization-minimization technique such as monotonicity., Comment: 84 pages, no figures, published in Foundations and Trends in Signal Processing: Vol. 18: No. 4, pp 310-389. http://dx.doi.org/10.1561/2000000129
Published: 2024
Full Text: View/download PDF

5. Terahertz-permittivity of Carbon Nitrides: Revealing humidity-enhanced dielectric properties on the picosecond timescales relevant for charge carrier photogeneration

Author: Jahangir, Reehab, Podjaski, Filip, Alimard, Paransa, Hillman, Sam A. J., Davidson, Stuart, Stoica, Stefan, Kafizas, Andreas, Naftaly, Mira, and Durrant, James R.
Subjects: Condensed Matter - Materials Science, Condensed Matter - Soft Condensed Matter, Physics - Chemical Physics
Abstract: Organic based semiconductor materials offer emerging and sustainable solutions for solar energy conversion technologies and electronics. However, knowledge of their intrinsic (photo)physical properties and light-matter interactions is often limited, especially with respect to the frequency dependent dielectric properties on the relevant timescales of exciton separation and charge generation (fs-ps). By using terahertz-time-domain spectroscopy (THz-TDS), we show that the complex permittivity and THz conductivity of different polymer materials and graphitic carbon nitrides - melon and poly(heptazine imides) (K-PHI) - can be measured directly and accurately in different environmental humidities. Its effects are most strongly observed in the ionic and 2D carbon nitride, K-PHI, where both real permittivity and THz conductivity double from dry to humid conditions (~4-8 and 75-150 S/m, respectively) surpassing the intrinsic dielectric response from water or K-PHI through synergistic effects. Our findings are backed by fs-ps transient absorption spectroscopy (TAS), confirming the impact of humidity on light conversion behaviour on the ps-timescale in K-PHI. We highlight the importance of dielectric property characterization at THz frequencies as critical for understanding photophysical behaviour at exciton or charge separation time scales, especially in the presence of water and hydrated ions, which may also be beneficial for computation, exploring next generation photocatalysts, electronics or ionotronics.
Published: 2024

6. SkyServe: Serving AI Models across Regions and Clouds with Spot Instances

Author: Mao, Ziming, Xia, Tian, Wu, Zhanghao, Chiang, Wei-Lin, Griggs, Tyler, Bhardwaj, Romil, Yang, Zongheng, Shenker, Scott, and Stoica, Ion
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence
Abstract: Recent years have witnessed an explosive growth of AI models. The high cost of hosting AI services on GPUs and their demanding service requirements, make it timely and challenging to lower service costs and guarantee service quality. While spot instances have long been offered with a large discount, spot preemptions have discouraged users from using them to host model replicas when serving AI models. To address this, we introduce SkyServe, a system that efficiently serves AI models over a mixture of spot and on-demand replicas across regions and clouds. SkyServe intelligently spreads spot replicas across different failure domains (e.g., regions or clouds) to improve availability and reduce correlated preemptions, overprovisions cheap spot replicas than required as a safeguard against possible preemptions, and dynamically falls back to on-demand replicas when spot replicas become unavailable. We compare SkyServe with both research and production systems on real AI workloads: SkyServe reduces cost by up to 44% while achieving high resource availability compared to using on-demand replicas. Additionally, SkyServe improves P50, P90, and P99 latency by up to 2.6x, 3.1x, 2.7x compared to other research and production systems.
Published: 2024

7. NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

Author: Jiang, Xuanlin, Zhou, Yang, Cao, Shiyi, Stoica, Ion, and Yu, Minlan
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on expensive GPU accelerators. However, the limited GPU memory has largely limited the batch size achieved in practice, leaving significant GPU compute resources wasted. We present NEO, an online LLM inference system that offloads part of attention compute and KV cache states from the GPU to the local host CPU, effectively increasing the GPU batch size and thus inference throughput. To this end, NEO proposes asymmetric GPU-CPU pipelining and load-aware scheduling to balance GPU and CPU loads and fully utilize their compute and memory resources. We evaluate NEO on a wide range of workloads (i.e., code generation, text summarization), GPUs (i.e., T4, A10G, H100), and LLM models (i.e., 7B, 8B, 70B). NEO achieves up to 7.5$\times$, 26%, and 14% higher throughput compared to GPU-only approach on T4, A10G, and H100 GPUs, respectively, while maintaining the same latency; with more powerful CPUs, NEO achieves up to 79.3% throughput gain on A10G GPU.
Published: 2024

8. Model merging with SVD to tie the Knots

Author: Stoica, George, Ramesh, Pratik, Ecsedi, Boglarka, Choshen, Leshem, and Hoffman, Judy
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent model merging methods demonstrate that the parameters of fully-finetuned models specializing in distinct tasks can be combined into one model capable of solving all tasks without retraining. Yet, this success does not transfer well when merging LoRA finetuned models. We study this phenomenon and observe that the weights of LoRA finetuned models showcase a lower degree of alignment compared to their fully-finetuned counterparts. We hypothesize that improving this alignment is key to obtaining better LoRA model merges, and propose KnOTS to address this problem. KnOTS uses the SVD to jointly transform the weights of different LoRA models into an aligned space, where existing merging methods can be applied. In addition, we introduce a new benchmark that explicitly evaluates whether merged models are general models. Notably, KnOTS consistently improves LoRA merging by up to 4.3% across several vision and language benchmarks, including our new setting. We release our code at: https://github.com/gstoica27/KnOTS.
Published: 2024

9. Managing Bandwidth: The Key to Cloud-Assisted Autonomous Driving

Author: Krentsel, Alexander, Schafhalter, Peter, Gonzalez, Joseph E., Ratnasamy, Sylvia, Shenker, Scott, and Stoica, Ion
Subjects: Computer Science - Networking and Internet Architecture, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Systems and Control
Abstract: Prevailing wisdom asserts that one cannot rely on the cloud for critical real-time control systems like self-driving cars. We argue that we can, and must. Following the trends of increasing model sizes, improvements in hardware, and evolving mobile networks, we identify an opportunity to offload parts of time-sensitive and latency-critical compute to the cloud. Doing so requires carefully allocating bandwidth to meet strict latency SLOs, while maximizing benefit to the car., Comment: 6 pages
Published: 2024

10. How to Evaluate Reward Models for RLHF

Author: Frick, Evan, Li, Tianle, Chen, Connor, Chiang, Wei-Lin, Angelopoulos, Anastasios N., Jiao, Jiantao, Zhu, Banghua, Gonzalez, Joseph E., and Stoica, Ion
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback). The gold-standard approach is to run a full RLHF training pipeline and directly probe downstream LLM performance. However, this process is prohibitively expensive. To address this, we build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks. These proxy tasks consist of a large-scale human preference and a verifiable correctness preference dataset, in which we measure 12 metrics across 12 domains. To investigate which reward model metrics are most correlated to gold-standard RLHF outcomes, we launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth. Ultimately, we compile our data and findings into Preference Proxy Evaluations (PPE), the first reward model benchmark explicitly linked to post-RLHF real-world human preference performance, which we open-source for public use and further development. Our code and evaluations can be found at https://github.com/lmarena/PPE .
Published: 2024

11. JudgeBench: A Benchmark for Evaluating LLM-based Judges

Author: Tan, Sijun, Zhuang, Siyuan, Montgomery, Kyle, Tang, William Y., Cuadron, Alejandro, Wang, Chenguang, Popa, Raluca Ada, and Stoica, Ion
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench ., Comment: preprint
Published: 2024

12. Efficient LLM Scheduling by Learning to Rank

Author: Fu, Yichao, Zhu, Siqi, Su, Runlong, Qiao, Aurick, Stoica, Ion, and Zhang, Hao
Subjects: Computer Science - Machine Learning
Abstract: In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption -- we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git
Published: 2024

13. The Fairness-Quality Trade-off in Clustering

Author: Hakim, Rashida, Stoica, Ana-Andreea, Papadimitriou, Christos H., and Yannakakis, Mihalis
Subjects: Computer Science - Machine Learning, Computer Science - Computers and Society
Abstract: Fairness in clustering has been considered extensively in the past; however, the trade-off between the two objectives -- e.g., can we sacrifice just a little in the quality of the clustering to significantly increase fairness, or vice-versa? -- has rarely been addressed. We introduce novel algorithms for tracing the complete trade-off curve, or Pareto front, between quality and fairness in clustering problems; that is, computing all clusterings that are not dominated in both objectives by other clusterings. Unlike previous work that deals with specific objectives for quality and fairness, we deal with all objectives for fairness and quality in two general classes encompassing most of the special cases addressed in previous work. Our algorithm must take exponential time in the worst case as the Pareto front itself can be exponential. Even when the Pareto front is polynomial, our algorithm may take exponential time, and we prove that this is inevitable unless P = NP. However, we also present a new polynomial-time algorithm for computing the entire Pareto front when the cluster centers are fixed, and for perhaps the most natural fairness objective: minimizing the sum, over all clusters, of the imbalance between the two groups in each cluster.
Published: 2024

14. Post-Training Sparse Attention with Double Sparsity

Author: Yang, Shuo, Sheng, Ying, Gonzalez, Joseph E., Stoica, Ion, and Zheng, Lianmin
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: The inference process for large language models is slow and memory-intensive, with one of the most critical bottlenecks being excessive Key-Value (KV) cache accesses. This paper introduces "Double Sparsity," a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access. Double Sparsity combines token sparsity, which focuses on utilizing only the important tokens for computing self-attention, with channel sparsity, an approach that uses important feature channels for identifying important tokens. Our key insight is that the pattern of channel sparsity is relatively static, allowing us to use offline calibration to make it efficient at runtime, thereby enabling accurate and efficient identification of important tokens. Moreover, this method can be combined with offloading to achieve significant memory usage reduction. Experimental results demonstrate that Double Sparsity can achieve $\frac{1}{16}$ token and channel sparsity with minimal impact on accuracy across various tasks, including wiki-2 perplexity, key-value retrieval, and long context benchmarks with models including Llama-2-7B, Llama-2-70B, and Mixtral-8x7B. It brings up to a 14.1$\times$ acceleration in attention operations and a 1.9$\times$ improvement in end-to-end inference on GPUs. With offloading, it achieves a decoding speed acceleration of 16.3$\times$ compared to state-of-the-art solutions at a sequence length of 256K. Our code is publicly available at https://github.com/andy-yang-1/DoubleSparse.
Published: 2024

15. MPC-Minimized Secure LLM Inference

Author: Rathee, Deevashwer, Li, Dacheng, Stoica, Ion, Zhang, Hao, and Popa, Raluca
Subjects: Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Many inference services based on large language models (LLMs) pose a privacy concern, either revealing user prompts to the service or the proprietary weights to the user. Secure inference offers a solution to this problem through secure multi-party computation (MPC), however, it is still impractical for modern LLM workload due to the large overhead imposed by MPC. To address this overhead, we propose Marill, a framework that adapts LLM fine-tuning to minimize MPC usage during secure inference. Marill introduces high-level architectural changes during fine-tuning that significantly reduce the number of expensive operations needed within MPC during inference, by removing some and relocating others outside MPC without compromising security. As a result, Marill-generated models are more efficient across all secure inference protocols and our approach complements MPC-friendly approximations for such operations. Compared to standard fine-tuning, Marill results in 3.6-11.3x better runtime and 2.4-6.9x better communication during secure inference across various MPC settings, while typically preserving over 90% performance across downstream tasks.
Published: 2024

16. Classical Many-Worlds Interpretation

Author: Stoica, Ovidiu Cristinel
Subjects: Quantum Physics, Physics - History and Philosophy of Physics
Abstract: I present a simple baby-steps reconstruction of quantum mechanics as a fully classical theory. The most radical conceptual leap required is that there are many coexisting classical worlds, but even this is justified by the necessity of objective probabilities. These baby steps lead to a version of the many-worlds interpretation of quantum mechanics with built-in probabilities, built-in classicality at the macroscopic level, and an explanation of the complex numbers in quantum mechanics. Despite its simplicity and minimalism of radical concepts, this is not a toy model, being equivalent with quantum field theory., Comment: Easy introduction, 21 pages, 15 figures
Published: 2024

17. Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design

Author: Davis, Jared Quincy, Hanin, Boris, Chen, Lingjiao, Bailis, Peter, Stoica, Ion, and Zaharia, Matei
Subjects: Computer Science - Artificial Intelligence
Abstract: As practitioners seek to surpass the current reliability and quality frontier of monolithic models, Compound AI Systems consisting of many language model inference calls are increasingly employed. In this work, we construct systems, which we call Networks of Networks (NoNs) organized around the distinction between generating a proposed answer and verifying its correctness, a fundamental concept in complexity theory that we show empirically extends to Language Models (LMs). We introduce a verifier-based judge NoN with K generators, an instantiation of "best-of-K" or "judge-based" compound AI systems. Through experiments on synthetic tasks such as prime factorization, and core benchmarks such as the MMLU, we demonstrate notable performance gains. For instance, in factoring products of two 3-digit primes, a simple NoN improves accuracy from 3.7\% to 36.6\%. On MMLU, a verifier-based judge construction with only 3 generators boosts accuracy over individual GPT-4-Turbo calls by 2.8\%. Our analysis reveals that these gains are most pronounced in domains where verification is notably easier than generation--a characterization which we believe subsumes many reasoning and procedural knowledge tasks, but doesn't often hold for factual and declarative knowledge-based settings. For mathematical and formal logic reasoning-based subjects of MMLU, we observe a 5-8\% or higher gain, whilst no gain on others such as geography and religion. We provide key takeaways for ML practitioners, including the importance of considering verification complexity, the impact of witness format on verifiability, and a simple test to determine the potential benefit of this NoN approach for a given problem distribution. This work aims to inform future research and practice in the design of compound AI systems.
Published: 2024

18. Sentient observers and the ontology of spacetime

Author: Stoica, Ovidiu Cristinel
Subjects: Quantum Physics, Physics - History and Philosophy of Physics
Abstract: I show that, by the same criteria that led to Galilean and Special Relativity and gauge symmetries, there is no way to identify a unique set of observables that give the structure of space or spacetime. In some sense, space is lost in the state space itself. Moreover, the relationship between the observables and the physical properties they represent becomes relative. But we can verify that they are not relative, and the spacetime structure is unique. I show that this implies that not all structures isomorphic with observers can be observers, contradicting Structural Realism and Physicalism. This indicates a strong connection between spacetime and the sentience of the observers, as anticipated by some early contributors to Special and General Relativity., Comment: 13 pages, to appear in Eric Ling and Annachiara Piubello (Eds), Spacetime 1908-2023. Selected peer-reviewed papers presented at the Third Hermann Minkowski Meeting on the Foundations of Spacetime Physics, 11-14 September 2023, Albena, Bulgaria (Minkowski Institute Press, Montreal 2024). ISBN 978-1-998902-25-5 (softcover), ISBN 978-1-998902-26-2 (ebook)
Published: 2024

19. RouteLLM: Learning to Route LLMs with Preference Data

Author: Ong, Isaac, Almahairi, Amjad, Wu, Vincent, Chiang, Wei-Lin, Wu, Tianhao, Gonzalez, Joseph E., Kadous, M Waleed, and Stoica, Ion
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference, aiming to optimize the balance between cost and response quality. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance. Our evaluation on widely-recognized benchmarks shows that our approach significantly reduces costs-by over 2 times in certain cases-without compromising the quality of responses. Interestingly, our router models also demonstrate significant transfer learning capabilities, maintaining their performance even when the strong and weak models are changed at test time. This highlights the potential of these routers to provide a cost-effective yet high-performance solution for deploying LLMs.
Published: 2024

20. Fairness in Social Influence Maximization via Optimal Transport

Author: Chowdhary, Shubham, De Pasquale, Giulia, Lanzetti, Nicolas, Stoica, Ana-Andreea, and Dorfler, Florian
Subjects: Computer Science - Social and Information Networks, Computer Science - Multiagent Systems
Abstract: We study fairness in social influence maximization, whereby one seeks to select seeds that spread a given information throughout a network, ensuring balanced outreach among different communities (e.g. demographic groups). In the literature, fairness is often quantified in terms of the expected outreach within individual communities. In this paper, we demonstrate that such fairness metrics can be misleading since they overlook the stochastic nature of information diffusion processes. When information diffusion occurs in a probabilistic manner, multiple outreach scenarios can occur. As such, outcomes such as ``In 50\% of the cases, no one in group 1 gets the information, while everyone in group 2 does, and in the other 50%, it is the opposite'', which always results in largely unfair outcomes, are classified as fair by a variety of fairness metrics in the literature. We tackle this problem by designing a new fairness metric, mutual fairness, that captures variability in outreach through optimal transport theory. We propose a new seed-selection algorithm that optimizes both outreach and mutual fairness, and we show its efficacy on several real datasets. We find that our algorithm increases fairness with only a minor decrease (and at times, even an increase) in efficiency.
Published: 2024

21. Optical Control of Adaptive Nanoscale Domain Networks

Author: Zajac, Marc, Zhou, Tao, Yang, Tiannan, Das, Sujit, Cao, Yue, Guzelturk, Burak, Stoica, Vladimir, Cherukara, Mathew, Freeland, John W., Gopalan, Venkatraman, Ramesh, Ramamoorthy, Martin, Lane W., Chen, Long-Qing, Holt, Martin, Hruszkewycz, Stephan, and Wen, Haidan
Subjects: Condensed Matter - Mesoscale and Nanoscale Physics, Condensed Matter - Disordered Systems and Neural Networks, Physics - Applied Physics
Abstract: Adaptive networks can sense and adjust to dynamic environments to optimize their performance. Understanding their nanoscale responses to external stimuli is essential for applications in nanodevices and neuromorphic computing. However, it is challenging to image such responses on the nanoscale with crystallographic sensitivity. Here, the evolution of nanodomain networks in (PbTiO3)n/(SrTiO3)n superlattices was directly visualized in real space as the system adapts to ultrafast repetitive optical excitations that emulate controlled neural inputs. The adaptive response allows the system to explore a wealth of metastable states that were previously inaccessible. Their reconfiguration and competition were quantitatively measured by scanning x-ray nanodiffraction as a function of the number of applied pulses, in which crystallographic characteristics were quantitatively assessed by assorted diffraction patterns using unsupervised machine-learning methods. The corresponding domain boundaries and their connectivity were drastically altered by light, holding promise for light-programmable nanocircuits in analogy to neuroplasticity. Phase-field simulations elucidate that the reconfiguration of the domain networks is a result of the interplay between photocarriers and transient lattice temperature. The demonstrated optical control scheme and the uncovered nanoscopic insights open opportunities for remote control of adaptive nanoscale domain networks.
Published: 2024

22. Optimizing Speculative Decoding for Serving Large Language Models Using Goodput

Author: Liu, Xiaoxuan, Daniel, Cade, Hu, Langxiang, Kwon, Woosuk, Li, Zhuohan, Mo, Xiangxi, Cheung, Alvin, Deng, Zhijie, Stoica, Ion, and Zhang, Hao
Subjects: Computer Science - Artificial Intelligence, Computer Science - Performance
Abstract: Reducing the inference latency of large language models (LLMs) is crucial, and speculative decoding (SD) stands out as one of the most effective techniques. Rather than letting the LLM generate all tokens directly, speculative decoding employs effective proxies to predict potential outputs, which are then verified by the LLM without compromising the generation quality. Yet, deploying SD in real online LLM serving systems (with continuous batching) does not always yield improvement -- under higher request rates or low speculation accuracy, it paradoxically increases latency. Furthermore, there is no best speculation length work for all workloads under different system loads. Based on the observations, we develop a dynamic framework SmartSpec. SmartSpec dynamically determines the best speculation length for each request (from 0, i.e., no speculation, to many tokens) -- hence the associated speculative execution costs -- based on a new metric called goodput, which characterizes the current observed load of the entire system and the speculation accuracy. We show that SmartSpec consistently reduces average request latency by up to 3.2x compared to non-speculative decoding baselines across different sizes of target models, draft models, request rates, and datasets. Moreover, SmartSpec can be applied to different styles of speculative decoding, including traditional, model-based approaches as well as model-free methods like prompt lookup and tree-style decoding.
Published: 2024

23. From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Author: Li, Tianle, Chiang, Wei-Lin, Frick, Evan, Dunlap, Lisa, Wu, Tianhao, Zhu, Banghua, Gonzalez, Joseph E., and Stoica, Ion
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: The rapid evolution of Large Language Models (LLMs) has outpaced the development of model evaluation, highlighting the need for continuous curation of new, challenging benchmarks. However, manual curation of high-quality, human-aligned benchmarks is expensive and time-consuming. To address this, we introduce BenchBuilder, an automated pipeline that leverages LLMs to curate high-quality, open-ended prompts from large, crowd-sourced datasets, enabling continuous benchmark updates without human in the loop. We apply BenchBuilder to datasets such as Chatbot Arena and WildChat-1M, extracting challenging prompts and utilizing LLM-as-a-Judge for automatic model evaluation. To validate benchmark quality, we propose new metrics to measure a benchmark's alignment with human preferences and ability to separate models. We release Arena-Hard-Auto, a benchmark consisting 500 challenging prompts curated by BenchBuilder. Arena-Hard-Auto provides 3x higher separation of model performances compared to MT-Bench and achieves 98.6% correlation with human preference rankings, all at a cost of $20. Our work sets a new framework for the scalable curation of automated benchmarks from extensive data.
Published: 2024

24. High-precision measurements of the atomic mass and electron-capture decay $Q$ value of $^{95}$Tc

Author: Ge, Zhuang, Eronen, Tommi, Sevestrean, Vasile Alin, Niţescu, Ovidiu, Stoica, Sabin, Ramalho, Marlom, Suhonen, Jouni, de Roubin, Antoine, Nesterenko, Dmitrii, Kankainen, Anu, Ascher, Pauline, Andres, Samuel Ayet San, Beliuskina, Olga, Delahaye, Pierre, Flayol, Mathieu, Gerbaux, Mathias, Grévy, Stéphane, Hukkanen, Marjut, Jaries, Arthur, Jokinen, Ari, Husson, Audric, Kahl, Daid, Kostensalo, Joel, Kotila, Jenni, Moore, Iain, Nikas, Stylianos, Stryjczyk, Marek, and Virtanen, Ville
Subjects: Nuclear Experiment, Nuclear Theory
Abstract: A direct measurement of the ground-state-to-ground-state electron-capture decay $Q$ value of $^{95}$Tc has been performed utilizing the double Penning trap mass spectrometer JYFLTRAP. The $Q$ value was determined to be 1695.92(13) keV by taking advantage of the high resolving power of the phase-imaging ion-cyclotron-resonance technique to resolve the low-lying isomeric state of $^{95}$Tc (excitation energy of 38.910(40) keV) from the ground state. The mass excess of $^{95}$Tc was measured to be $-$86015.95(18) keV/c$^2$, exhibiting a precision of about 28 times higher and in agreement with the value from the newest Atomic Mass Evaluation (AME2020). Combined with the nuclear energy-level data for the decay-daughter $^{95}$Mo, two potential ultra-low $Q$-value transitions are identified for future long-term neutrino-mass determination experiments. The atomic self-consistent many-electron Dirac--Hartree--Fock--Slater method and the nuclear shell model have been used to predict the partial half-lives and energy-release distributions for the two transitions. The dominant correction terms related to those processes are considered, including the exchange and overlap corrections, and the shake-up and shake-off effects. The normalized distribution of the released energy in the electron-capture decay of $^{95}$Tc to excited states of $^{95}$Mo is compared to that of $^{163}$Ho currently being used for electron-neutrino-mass determination., Comment: 13 pages, 5 figures
Published: 2024
Full Text: View/download PDF

25. Causal Inference from Competing Treatments

Author: Stoica, Ana-Andreea, Nastl, Vivian Y., and Hardt, Moritz
Subjects: Computer Science - Computer Science and Game Theory
Abstract: Many applications of RCTs involve the presence of multiple treatment administrators -- from field experiments to online advertising -- that compete for the subjects' attention. In the face of competition, estimating a causal effect becomes difficult, as the position at which a subject sees a treatment influences their response, and thus the treatment effect. In this paper, we build a game-theoretic model of agents who wish to estimate causal effects in the presence of competition, through a bidding system and a utility function that minimizes estimation error. Our main technical result establishes an approximation with a tractable objective that maximizes the sample value obtained through strategically allocating budget on subjects. This allows us to find an equilibrium in our model: we show that the tractable objective has a pure Nash equilibrium, and that any Nash equilibrium is an approximate equilibrium for our general objective that minimizes estimation error under broad conditions. Conceptually, our work successfully combines elements from causal inference and game theory to shed light on the equilibrium behavior of experimentation under competition., Comment: 37 pages, 3 figures, accepted at ICML'24
Published: 2024

26. OR-Bench: An Over-Refusal Benchmark for Large Language Models

Author: Cui, Justin, Chiang, Wei-Lin, Stoica, Ion, and Hsieh, Cho-Jui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often come with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that appear harmful but are benign. This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts" (benign prompts likely rejected by LLMs). Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families. Our datasets are available at https://huggingface.co/datasets/bench-llm/or-bench and the demo can be found at https://huggingface.co/spaces/bench-llm/or-bench. We hope this benchmark can help the community develop better safety aligned models., Comment: version 2, 10 pages main, 22 pages total
Published: 2024

27. Crafting Interpretable Embeddings by Asking LLMs Questions

Author: Benara, Vinamra, Singh, Chandan, Morris, John X., Antonello, Richard, Stoica, Ion, Huth, Alexander G., and Gao, Jianfeng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Quantitative Biology - Neurons and Cognition
Abstract: Large language models (LLMs) have rapidly improved text embeddings for a growing array of natural-language processing tasks. However, their opaqueness and proliferation into scientific domains such as neuroscience have created a growing need for interpretability. Here, we ask whether we can obtain interpretable embeddings through LLM prompting. We introduce question-answering embeddings (QA-Emb), embeddings where each feature represents an answer to a yes/no question asked to an LLM. Training QA-Emb reduces to selecting a set of underlying questions rather than learning model weights. We use QA-Emb to flexibly generate interpretable models for predicting fMRI voxel responses to language stimuli. QA-Emb significantly outperforms an established interpretable baseline, and does so while requiring very few questions. This paves the way towards building flexible feature spaces that can concretize and evaluate our understanding of semantic brain representations. We additionally find that QA-Emb can be effectively approximated with an efficient model, and we explore broader applications in simple NLP tasks.
Published: 2024

28. Certified Inventory Control of Critical Resources

Author: Hult, Ludvig, Zachariah, Dave, and Stoica, Petre
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Inventory control is subject to service-level requirements, in which sufficient stock levels must be maintained despite an unknown demand. We propose a data-driven order policy that certifies any prescribed service level under minimal assumptions on the unknown demand process. The policy achieves this using any online learning method along with integral action. We further propose an inference method that is valid in finite samples. The properties and theoretical guarantees of the method are illustrated using both synthetic and real-world data.
Published: 2024

29. Stylus: Automatic Adapter Selection for Diffusion Models

Author: Luo, Michael, Wong, Justin, Trabucco, Brandon, Huang, Yanping, Gonzalez, Joseph E., Chen, Zhifeng, Salakhutdinov, Ruslan, and Stoica, Ion
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: Beyond scaling base models with more data or parameters, fine-tuned adapters provide an alternative way to generate high fidelity, custom images at reduced costs. As such, adapters have been widely adopted by open-source communities, accumulating a database of over 100K adapters-most of which are highly customized with insufficient descriptions. This paper explores the problem of matching the prompt to a set of relevant adapters, built on recent work that highlight the performance gains of composing adapters. We introduce Stylus, which efficiently selects and automatically composes task-specific adapters based on a prompt's keywords. Stylus outlines a three-stage approach that first summarizes adapters with improved descriptions and embeddings, retrieves relevant adapters, and then further assembles adapters based on prompts' keywords by checking how well they fit the prompt. To evaluate Stylus, we developed StylusDocs, a curated dataset featuring 75K adapters with pre-computed adapter embeddings. In our evaluation on popular Stable Diffusion checkpoints, Stylus achieves greater CLIP-FID Pareto efficiency and is twice as preferred, with humans and multimodal models as evaluators, over the base model. See stylus-diffusion.github.io for more., Comment: Project Website: https://stylus-diffusion.github.io
Published: 2024

30. Stability of the regular $n$-gon rotating equilibria with logarithm interaction

Author: Muscas, Anna-Monika, Pasca, Daniel, and Stoica, Cristina
Subjects: Mathematics - Dynamical Systems, Astrophysics - Astrophysics of Galaxies, 70F10, 70F15
Abstract: We study the linear stability of regular $n$-gon rotating equilibria in the $n$-body problem with logarithm interaction. In the presence of a central mass $M$, linear stability is insured if $M$ is bounded below and above by constants depending on the number and mass of the (equal) outer $n$ bodies. Moreover, we provide explicit equations of these bounds. In the absence of a central mass we find that the regular $n$-gon is linearly stable for $n =2,3,\ldots 6$ only.
Published: 2024

31. M\'elange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Author: Griggs, Tyler, Liu, Xiaoxuan, Yu, Jiaxiang, Kim, Doyoung, Chiang, Wei-Lin, Cheung, Alvin, and Stoica, Ion
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances. Prior work has addressed the high cost of LLM serving by improving the inference engine, but less attention has been given to selecting the most cost-efficient GPU type(s) for a specific LLM service. There is a large and growing landscape of GPU types and, within these options, higher cost does not always lead to increased performance. Instead, through a comprehensive investigation, we find that three key LLM service characteristics (request size, request rate, SLO) strongly influence GPU cost efficiency, and differing GPU types are most cost efficient for differing LLM service settings. As a result, the most cost-efficient allocation for a given service is typically a mix of heterogeneous GPU types. Based on this analysis, we introduce M\'elange, a GPU allocation framework that navigates these diverse LLM service characteristics and heterogeneous GPU option space to automatically and efficiently derive the minimal-cost GPU allocation for a given LLM service. We formulate the GPU allocation task as a cost-aware bin packing problem where GPUs are bins and items are slices of the service workload. Our formulation's constraints account for a service's unique characteristics, allowing M\'elange to be flexible to support diverse service settings and heterogeneity-aware to adapt the GPU allocation to a specific service. Compared to using only a single GPU type, M\'elange reduces deployment costs by up to 77% in conversational settings, 33% in document-based settings, and 51% in a mixed setting.
Published: 2024

32. GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications

Author: Patil, Shishir G., Zhang, Tianjun, Fang, Vivian, C., Noppapon, Huang, Roy, Hao, Aaron, Casado, Martin, Gonzalez, Joseph E., Popa, Raluca Ada, and Stoica, Ion
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large Language Models (LLMs) are evolving beyond their classical role of providing information within dialogue systems to actively engaging with tools and performing actions on real-world applications and services. Today, humans verify the correctness and appropriateness of the LLM-generated outputs (e.g., code, functions, or actions) before putting them into real-world execution. This poses significant challenges as code comprehension is well known to be notoriously difficult. In this paper, we study how humans can efficiently collaborate with, delegate to, and supervise autonomous LLMs in the future. We argue that in many cases, "post-facto validation" - verifying the correctness of a proposed action after seeing the output - is much easier than the aforementioned "pre-facto validation" setting. The core concept behind enabling a post-facto validation system is the integration of an intuitive undo feature, and establishing a damage confinement for the LLM-generated actions as effective strategies to mitigate the associated risks. Using this, a human can now either revert the effect of an LLM-generated output or be confident that the potential risk is bounded. We believe this is critical to unlock the potential for LLM agents to interact with applications and services with limited (post-facto) human involvement. We describe the design and implementation of our open-source runtime for executing LLM actions, Gorilla Execution Engine (GoEX), and present open research questions towards realizing the goal of LLMs and applications interacting with each other with minimal human supervision. We release GoEX at https://github.com/ShishirPatil/gorilla/.
Published: 2024

33. Can we accurately read or write quantum data?

Author: Stoica, Ovidiu Cristinel
Subjects: Quantum Physics, Mathematical Physics
Abstract: Applications of quantum mechanics rely on the accuracy of reading and writing data. This requires accurate measurements and preparations of the quantum states. I show that accurate measurements and preparations are impossible if the total Hamiltonian is bounded from below (as thought to be in our universe). This result invites a reevaluation of the limitations of quantum control, quantum computing, and other quantum technologies dependent on the accuracy of quantum preparations and measurements, and maybe of the assumption that the Hamiltonian is bounded from below., Comment: 6 pages, comments welcome!
Published: 2024

34. Trustless Audits without Revealing Data or Models

Author: Waiwitlikhit, Suppakit, Stoica, Ion, Sun, Yi, Hashimoto, Tatsunori, and Kang, Daniel
Subjects: Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence, Computer Science - Computers and Society, Computer Science - Machine Learning
Abstract: There is an increasing conflict between business incentives to hide models and data as trade secrets, and the societal need for algorithmic transparency. For example, a rightsholder wishing to know whether their copyrighted works have been used during training must convince the model provider to allow a third party to audit the model and data. Finding a mutually agreeable third party is difficult, and the associated costs often make this approach impractical. In this work, we show that it is possible to simultaneously allow model providers to keep their model weights (but not architecture) and data secret while allowing other parties to trustlessly audit model and data properties. We do this by designing a protocol called ZkAudit in which model providers publish cryptographic commitments of datasets and model weights, alongside a zero-knowledge proof (ZKP) certifying that published commitments are derived from training the model. Model providers can then respond to audit requests by privately computing any function F of the dataset (or model) and releasing the output of F alongside another ZKP certifying the correct execution of F. To enable ZkAudit, we develop new methods of computing ZKPs for SGD on modern neural nets for simple recommender systems and image classification models capable of high accuracies on ImageNet. Empirically, we show it is possible to provide trustless audits of DNNs, including copyright, censorship, and counterfactual audits with little to no loss in accuracy.
Published: 2024

35. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving

Author: Duan, Jiangfei, Lu, Runyu, Duanmu, Haojie, Li, Xiuhong, Zhang, Xingcheng, Lin, Dahua, Stoica, Ion, and Zhang, Hao
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Large language models (LLMs) have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use-cases like chat, programming and search. However, efficiently serving multiple LLMs poses significant challenges for existing approaches due to varying popularity of LLMs. In the paper, we present MuxServe, a flexible spatial-temporal multiplexing system for efficient multiple LLM serving. The key insight behind is to colocate LLMs considering their popularity to multiplex memory resources, and leverage the characteristics of prefill and decoding phases to separate and flexibly colocate them to multiplex computation resources. MuxServe formally formulates the multiplexing problem, and proposes a novel placement algorithm and adaptive batch scheduling strategy to identify optimal colocations and maximize utilization. MuxServe designs a unified resource manager to enable flexible and efficient multiplexing. Evaluation results show that MuxServe can achieves up to $1.8\times$ higher throughput or processes $2.9\times$ more requests within $99\%$ SLO attainment. The code is available at: \url{https://github.com/hao-ai-lab/MuxServe}.
Published: 2024

36. Unexpected sustained soil carbon flux in response to simultaneous warming and nitrogen enrichment compared with single factors alone

Author: Knorr, Melissa A., Contosta, A. R., Morrison, E. W., Muratore, T. J., Anthony, M. A., Stoica, I., Geyer, K. M., Simpson, M. J., and Frey, S. D.
Published: 2024
Full Text: View/download PDF

37. Enhancement of W Nanoparticles Synthesis by Injecting H2 in a Magnetron Sputtering Gas Aggregation Cluster Source Operated in Ar

Author: Acsente, Tomy, Stoica, Silviu Daniel, Craciun, Cristina, Mitu, Bogdana, and Dinescu, Gheorghe
Published: 2024
Full Text: View/download PDF

38. Non-equilibrium pathways to emergent polar supertextures

Author: Stoica, Vladimir A., Yang, Tiannan, Das, Sujit, Cao, Yue, Wang, Huaiyu (Hugo), Kubota, Yuya, Dai, Cheng, Padma, Hari, Sato, Yusuke, Mangu, Anudeep, Nguyen, Quynh L., Zhang, Zhan, Talreja, Disha, Zajac, Marc E., Walko, Donald A., DiChiara, Anthony D., Owada, Shigeki, Miyanishi, Kohei, Tamasaku, Kenji, Sato, Takahiro, Glownia, James M., Esposito, Vincent, Nelson, Silke, Hoffmann, Matthias C., Schaller, Richard D., Lindenberg, Aaron M., Martin, Lane W., Ramesh, Ramamoorthy, Matsuda, Iwao, Zhu, Diling, Chen, Long-Q., Wen, Haidan, Gopalan, Venkatraman, and Freeland, John W.
Published: 2024
Full Text: View/download PDF

39. RAFT: Adapting Language Model to Domain Specific RAG

Author: Zhang, Tianjun, Patil, Shishir G., Jain, Naman, Shen, Sheng, Zaharia, Matei, Stoica, Ion, and Gonzalez, Joseph E.
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Pretraining Large Language Models (LLMs) on large corpora of textual data is now a standard paradigm. When using these LLMs for many downstream applications, it is common to additionally bake in new knowledge (e.g., time-critical news, or private domain knowledge) into the pretrained model either through RAG-based-prompting, or fine-tuning. However, the optimal methodology for the model to gain such new knowledge remains an open question. In this paper, we present Retrieval Augmented FineTuning (RAFT), a training recipe that improves the model's ability to answer questions in a "open-book" in-domain settings. In RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don't help in answering the question, which we call, distractor documents. RAFT accomplishes this by citing verbatim the right sequence from the relevant document that would help answer the question. This coupled with RAFT's chain-of-thought-style response helps improve the model's ability to reason. In domain-specific RAG, RAFT consistently improves the model's performance across PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe to improve pre-trained LLMs to in-domain RAG. RAFT's code and demo are open-sourced at github.com/ShishirPatil/gorilla.
Published: 2024

40. depyf: Open the Opaque Box of PyTorch Compiler for Machine Learning Researchers

Author: You, Kaichao, Bai, Runsheng, Cao, Meng, Wang, Jianmin, Stoica, Ion, and Long, Mingsheng
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Programming Languages
Abstract: PyTorch \texttt{2.x} introduces a compiler designed to accelerate deep learning programs. However, for machine learning researchers, adapting to the PyTorch compiler to full potential can be challenging. The compiler operates at the Python bytecode level, making it appear as an opaque box. To address this, we introduce \texttt{depyf}, a tool designed to demystify the inner workings of the PyTorch compiler. \texttt{depyf} decompiles bytecode generated by PyTorch back into equivalent source code, and establishes connections between in-memory code objects and their on-disk source code counterparts. This feature enables users to step through the source code line by line using debuggers, thus enhancing their understanding of the underlying processes. Notably, \texttt{depyf} is non-intrusive and user-friendly, primarily relying on two convenient context managers for its core functionality. The project is \href{https://github.com/thuml/depyf}{ openly available} and is recognized as a \href{https://pytorch.org/ecosystem/}{PyTorch ecosystem project}., Comment: 16 pages, 2 figures
Published: 2024

41. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Author: Jain, Naman, Han, King, Gu, Alex, Li, Wen-Ding, Yan, Fanjia, Zhang, Tianjun, Wang, Sida, Solar-Lezama, Armando, Sen, Koushik, and Stoica, Ion
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and May 2024. We have evaluated 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks as well as individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and model, Comment: Website - https://livecodebench.github.io/
Published: 2024

42. Optimizing LLM Queries in Relational Workloads

Author: Liu, Shu, Biswal, Asim, Cheng, Audrey, Mo, Xiangxi, Cao, Shiyi, Gonzalez, Joseph E., Stoica, Ion, and Zaharia, Matei
Subjects: Computer Science - Machine Learning, Computer Science - Databases
Abstract: Analytical database providers (e.g., Redshift, Databricks, BigQuery) have rapidly added support for invoking Large Language Models (LLMs) through native user-defined functions (UDFs) to help users perform natural language tasks, such as classification, entity extraction, and translation, inside analytical workloads. For instance, an analyst might want to extract customer sentiments on millions of product reviews. However, LLM inference is highly expensive in both computational and economic terms: for example, an NVIDIA L4 GPU running Llama2-7B can only process 6 KB of text per second. In this paper, we explore how to optimize LLM inference for analytical workloads that invoke LLMs within relational queries. We show that relational queries present novel opportunities for accelerating LLM inference, including reordering rows to maximize key-value (KV) cache reuse within the LLM inference engine, reordering columns within a row to further increase cache reuse, and deduplicating redundant inference requests. We implement these optimizations in Apache Spark, with vLLM as the model serving backend and achieve up to 4.4x improvement in end-to-end latency on a benchmark of diverse LLM-based queries on real datasets. To the best of our knowledge, this is the first work to explicitly address the problem of optimizing LLM invocations within SQL queries.
Published: 2024

43. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Author: Chiang, Wei-Lin, Zheng, Lianmin, Sheng, Ying, Angelopoulos, Anastasios Nikolas, Li, Tianle, Li, Dacheng, Zhang, Hao, Zhu, Banghua, Jordan, Michael, Gonzalez, Joseph E., and Stoica, Ion
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at \url{https://chat.lmsys.org}.
Published: 2024

44. Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

Author: Chen, Lingjiao, Davis, Jared Quincy, Hanin, Boris, Bailis, Peter, Stoica, Ion, Zaharia, Matei, and Zou, James
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Systems and Control
Abstract: Many recent state-of-the-art results in language tasks were achieved using compound systems that perform multiple Language Model (LM) calls and aggregate their responses. However, there is little understanding of how the number of LM calls - e.g., when asking the LM to answer each question multiple times and taking a majority vote - affects such a compound system's performance. In this paper, we initiate the study of scaling properties of compound inference systems. We analyze, theoretically and empirically, how the number of LM calls affects the performance of Vote and Filter-Vote, two of the simplest compound system designs, which aggregate LM responses via majority voting, optionally applying LM filters. We find, surprisingly, that across multiple language tasks, the performance of both Vote and Filter-Vote can first increase but then decrease as a function of the number of LM calls. Our theoretical results suggest that this non-monotonicity is due to the diversity of query difficulties within a task: more LM calls lead to higher performance on "easy" queries, but lower performance on "hard" queries, and non-monotone behavior can emerge when a task contains both types of queries. This insight then allows us to compute, from a small number of samples, the number of LM calls that maximizes system performance, and define an analytical scaling model for both systems. Experiments show that our scaling model can accurately predict the performance of Vote and Filter-Vote systems and thus find the optimal number of LM calls to make.
Published: 2024

45. Fairness Rising from the Ranks: HITS and PageRank on Homophilic Networks

Author: Stoica, Ana-Andreea, Litvak, Nelly, and Chaintreau, Augustin
Subjects: Computer Science - Social and Information Networks, Computer Science - Information Retrieval
Abstract: In this paper, we investigate the conditions under which link analysis algorithms prevent minority groups from reaching high ranking slots. We find that the most common link-based algorithms using centrality metrics, such as PageRank and HITS, can reproduce and even amplify bias against minority groups in networks. Yet, their behavior differs: one one hand, we empirically show that PageRank mirrors the degree distribution for most of the ranking positions and it can equalize representation of minorities among the top ranked nodes; on the other hand, we find that HITS amplifies pre-existing bias in homophilic networks through a novel theoretical analysis, supported by empirical results. We find the root cause of bias amplification in HITS to be the level of homophily present in the network, modeled through an evolving network model with two communities. We illustrate our theoretical analysis on both synthetic and real datasets and we present directions for future work., Comment: Accepted for publication in Proceedings of The Web Conference, 2024
Published: 2024
Full Text: View/download PDF

46. Theoretical analysis and predictions for the double electron capture of $^{124}$Xe

Author: Niţescu, Ovidiu, Ghinescu, Stefan, Sevestrean, Vasile-Alin, Horoi, Mihai, Šimkovic, Fedor, and Stoica, Sabin
Subjects: Nuclear Theory
Abstract: We provide a complete theoretical description of the two-neutrino electron capture in $^{124}$Xe, improving both the nuclear and the atomic structure calculations. We improve the general formalism through the use of the Taylor expansion method, leading to higher order terms in the decay rate of the process. The nuclear part is treated with pn-QRPA and interacting shell model (ISM) methods. The nuclear matrix elements (NMEs) are calculated with the pn-QRPA method with spin restoration by fixing the input parameters so that the experimental decay rate is reproduced, resulting in values significantly lower than in previous calculations. The validity of the pn-QRPA NMEs is tested by showing their values to be comparable with the ones for double-beta decay with emission of two electrons of $^{128,130}$Te, which have similar pairing features. Within the ISM, we reproduce the total experimental half-life within a factor of two and predict the capture fraction to the KK channel of about 74\%. We also predict the capture fractions to other decay channels and show that for the cumulative decay to the $\rm{KL_{1}}$-$\rm{KO_{1}}$ channels, a capture fraction of about 24\% could be observed experimentally. On the atomic side, calculations are improved by accounting for the Pauli blocking of the decay of innermost nucleon states and by considering all $s$-wave electrons available for capture, expanding beyond the K and L$_1$ orbitals considered in previous studies. We also provide improved atomic relaxation energies of the final atomic states of $^{124}$Te, which may be used as input for background modeling in liquid Xenon experiments., Comment: 9 pages, 2 figures
Published: 2024

47. Hidden domain boundary dynamics towards crystalline perfection

Author: Mangu, A., Stoica, V. A., Zheng, H., Yang, T., Zhang, M., Wang, H., Nguyen, Q. L., Song, S., Das, S., Meisenheimer, P., Donoway, E., Chollet, M., Sun, Y., Turner, J. J., Freeland, J. W., Wen, H., Martin, L. W., Chen, L. -Q., Gopalan, V., Zhu, D., Cao, Y., and Lindenberg, A. M.
Subjects: Condensed Matter - Materials Science, Condensed Matter - Mesoscale and Nanoscale Physics
Abstract: A central paradigm of non-equilibrium physics concerns the dynamics of heterogeneity and disorder, impacting processes ranging from the behavior of glasses to the emergent functionality of active matter. Understanding these complex mesoscopic systems requires probing the microscopic trajectories associated with irreversible processes, the role of fluctuations and entropy growth, and the timescales on which non-equilibrium responses are ultimately maintained. Approaches that illuminate these processes in model systems may enable a more general understanding of other heterogeneous non-equilibrium phenomena, and potentially define ultimate speed and energy cost limits for information processing technologies. Here, we apply ultrafast single shot x-ray photon correlation spectroscopy to resolve the non-equilibrium, heterogeneous, and irreversible mesoscale dynamics during a light-induced phase transition. This approach defines a new way of capturing the nucleation of the induced phase, the formation of transient mesoscale defects at the boundaries of the nuclei, and the eventual annihilation of these defects, even in systems with complex polarization topologies. A non-equilibrium response spanning >10 orders of magnitude in timescales is observed, with multistep behavior similar to the plateaus observed in supercooled liquids and glasses. We show how the observed time-dependent long-time correlations can be understood in terms of the stochastic dynamics of domain walls, encoded in effective waiting-time distributions with power-law tails. This work defines new possibilities for probing the non-equilibrium and correlated dynamics of disordered and heterogeneous media.
Published: 2024

48. Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Author: Fu, Yichao, Bailis, Peter, Stoica, Ion, and Zhang, Hao
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Autoregressive decoding of large language models (LLMs) is memory bandwidth bounded, resulting in high latency and significant wastes of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding often require a draft model (e.g., speculative decoding), which is nontrivial to obtain and unable to generalize. In this paper, we introduce Lookahead decoding, an exact, parallel decoding algorithm that accelerates LLM decoding without needing auxiliary models or data stores. It allows trading per-step log(FLOPs) to reduce the number of total decoding steps, is more parallelizable on single or multiple modern accelerators, and is compatible with concurrent memory-efficient attention (e.g., FlashAttention). Our implementation of Lookahead decoding can speed up autoregressive decoding by up to 1.8x on MT-bench and 4x with strong scaling on multiple GPUs in code completion tasks. Our code is avialable at https://github.com/hao-ai-lab/LookaheadDecoding
Published: 2024

49. Does the Hamiltonian determine the tensor product structure and the 3d space?

Author: Stoica, Ovidiu Cristinel
Subjects: Quantum Physics, General Relativity and Quantum Cosmology, Mathematical Physics, Physics - History and Philosophy of Physics
Abstract: It was proposed that the tensor product structure of the Hilbert space is uniquely determined by the Hamiltonian's spectrum, for most finite-dimensional cases satisfying certain conditions. I show that any such method would lead to infinitely many tensor product structures. The dimension of the space of solutions grows exponentially with the number of qudits. In addition, even if the result were unique, such a Hamiltonian would not entangle subsystems. These results affect the proposals to recover the 3d space from the Hamiltonian., Comment: 3 pages, comments welcome!
Published: 2024

50. Effects of Retrofitting Diesel Engine Injector Nozzles, Focusing on Engine Performances

Author: Suciu, Cosmin Constantin, Stoica, Virgil, Igret, Sorin Vlad, Ionel, Ioana, Chiru, Anghel, editor, and Covaciu, Dinu, editor
Published: 2025
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Region

Database

Publisher

32,659 results on '"Stoica, A."'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources