What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions

Authors :: Choe, Sang Keun
Ahn, Hwijeen
Bae, Juhan
Zhao, Kewen
Kang, Minsoo
Chung, Youngseog
Pratapa, Adithya
Neiswanger, Willie
Strubell, Emma
Mitamura, Teruko
Schneider, Jeff
Hovy, Eduard
Grosse, Roger
Xing, Eric
Publication Year :: 2024
Abstract: Large language models (LLMs) are trained on a vast amount of human-written data, but data providers often remain uncredited. In response to this issue, data valuation (or data attribution), which quantifies the contribution or value of each data point to the model output, has been discussed as a potential solution. Nevertheless, applying existing data valuation methods to recent LLMs and their vast training datasets has been largely limited by prohibitive compute and memory costs. In this work, we focus on influence functions, a popular gradient-based data valuation method, and significantly improve its scalability with an efficient gradient projection strategy called LoGra that leverages the gradient structure in backpropagation. We then provide a theoretical motivation of gradient projection approaches to influence functions to promote trust in the data valuation process. Lastly, we lower the barrier to implementing data valuation systems by introducing LogIX, a software package that can transform existing training code into data valuation code with minimal effort. In our data valuation experiments, LoGra achieves competitive accuracy against more expensive baselines while showing up to 6,500x improvement in throughput and 5x reduction in GPU memory usage when applied to Llama3-8B-Instruct and the 1B-token dataset.

Subjects :: Computer Science - Machine Learning
Computer Science - Artificial Intelligence
Computer Science - Computation and Language

Details

Database :: arXiv
Publication Type :: Report
Accession number :: edsarx.2405.13954
Document Type :: Working Paper
