Author: "Eugene Yang" / Topic: computer science - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Eugene Yang"' showing total 11 results

1. Certifying One-Phase Technology-Assisted Reviews

Author: David D. Lewis, Ophir Frieder, and Eugene Yang
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Operations research, Total cost, Computer science, Active learning (machine learning), Sampling (statistics), Tar, Sample (statistics), Statistical process control, Machine Learning (cs.LG), Computer Science - Information Retrieval, Workflow, Information Retrieval (cs.IR), Quantile
Abstract: Technology-assisted review (TAR) workflows based on iterative active learning are widely used in document review applications. Most stopping rules for one-phase TAR workflows lack valid statistical guarantees, which has discouraged their use in some legal contexts. Drawing on the theory of quantile estimation, we provide the first broadly applicable and statistically valid sample-based stopping rules for one-phase TAR. We further show theoretically and empirically that overshooting a recall target, which has been treated as innocuous or desirable in past evaluations of stopping rules, is a major source of excess cost in one-phase TAR workflows. Counterintuitively, incurring a larger sampling cost to reduce excess recall leads to lower total cost in almost all scenarios. Comment: 10 pages, 4 figures. Accepted at CIKM 2021
Published: 2021

2. Heuristic Stopping Rules For Technology-Assisted Review

Author: David D. Lewis, Ophir Frieder, and Eugene Yang
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Recall, Computer science, Heuristic, Active learning (machine learning), Machine learning, Variety (cybernetics), Computer Science - Information Retrieval, Machine Learning (cs.LG), Range (mathematics), Workflow, Stopping rules, Artificial intelligence, Heuristics, Information Retrieval (cs.IR)
Abstract: Technology-assisted review (TAR) refers to human-in-the-loop active learning workflows for finding relevant documents in large collections. These workflows often must meet a target for the proportion of relevant documents found (i.e., recall) while also holding down costs. A variety of heuristic stopping rules have been suggested for striking this tradeoff in particular settings, but none have been tested against a range of recall targets and tasks. We propose two new heuristic stopping rules, Quant and QuantCI, based on model-based estimation techniques from survey research. We compare them against a range of proposed heuristics and find they are accurate at hitting a range of recall targets while substantially reducing review costs. Comment: 10 pages, 2 figures. Accepted at DocEng 21
Published: 2021

3. On Minimizing Cost in Legal Document Review Workflows

Author: Eugene Yang, David D. Lewis, and Ophir Frieder
Subjects: FOS: Computer and information sciences, Iterative and incremental development, Information retrieval, Point (typography), Active learning (machine learning), Computer science, Computer Science - Human-Computer Interaction, Computer Science - Information Retrieval, Task (project management), Human-Computer Interaction (cs.HC), Workflow, Factor (programming language), Key (cryptography), Legal document, Information Retrieval (cs.IR)
Abstract: Technology-assisted review (TAR) refers to human-in-the-loop machine learning workflows for document review in legal discovery and other high recall review tasks. Attorneys and legal technologists have debated whether review should be a single iterative process (one-phase TAR workflows) or whether model training and review should be separate (two-phase TAR workflows), with implications for the choice of active learning algorithm. The relative cost of manual labeling for different purposes (training vs. review) and of different documents (positive vs. negative examples) is a key and neglected factor in this debate. Using a novel cost dynamics analysis, we show analytically and empirically that these relative costs strongly impact whether a one-phase or two-phase workflow minimizes cost. We also show how category prevalence, classification task difficulty, and collection size impact the optimal choice not only of workflow type, but of active learning method and stopping point. Comment: 10 pages, 3 figures. Accepted at DocEng 21
Published: 2021

4. GUIR at SemEval-2020 Task 12: Domain-Tuned Contextualized Models for Offensive Language Detection

Author: Sajad Sotudeh, Hao-Ren Yao, Sean MacAvaney, Ophir Frieder, Tong Xiang, Nazli Goharian, and Eugene Yang
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Language identification, Computer science, Offensive, SemEval, Task (project management), Domain (software engineering), Support vector machine, Artificial intelligence, Language model, Computation and Language (cs.CL), Natural language processing
Abstract: Offensive language detection is an important and challenging task in natural language processing. We present our submissions to the OffensEval 2020 shared task, which includes three English sub-tasks: identifying the presence of offensive language (Sub-task A), identifying the presence of a target in offensive language (Sub-task B), and identifying the categories of the target (Sub-task C). Our experiments explore using a domain-tuned contextualized language model (namely, BERT) for this task. We also experiment with different components and configurations (e.g., a multi-view SVM) stacked upon BERT models for specific sub-tasks. Our submissions achieve F1 scores of 91.7% in Sub-task A, 66.5% in Sub-task B, and 63.2% in Sub-task C. We perform an ablation study which reveals that domain tuning considerably improves the classification performance. Furthermore, error analysis shows common misclassification errors made by our model and outlines directions for future research. Comment: SemEval 2020
Published: 2020

5. Text Retrieval Priors for Bayesian Logistic Regression

Author: David D. Lewis, Ophir Frieder, and Eugene Yang
Subjects: Training set, Computer science, Gaussian, Information storage and retrieval, Engineering and technology, Logistic regression, Machine learning, Regularization (mathematics), Naive Bayes classifier, Generative model, Discriminative model, Information systems, Prior probability, Electrical engineering, electronic engineering, information engineering, Domain knowledge, Artificial intelligence, Heuristics, Discriminative learning
Abstract: Discriminative learning algorithms such as logistic regression excel when training data are plentiful, but falter when they are meager. An extreme case is text retrieval (zero training data), where discriminative learning is impossible and heuristics such as BM25, which combine domain knowledge (a topical keyword query) with generative learning (Naive Bayes), are dominant. Building on past work, we show that BM25-inspired Gaussian priors for Bayesian logistic regression based on topical keywords provide better effectiveness than the usual L2 (zero mode, uniform variance) Gaussian prior. On two high recall retrieval datasets, the resulting models transition smoothly from BM25-level effectiveness to discriminative effectiveness as training data volume increases, dominating L2 regularization even when substantial training data are available.
Published: 2019
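
The contrast drawn in this abstract, between a zero-mode Gaussian prior (ordinary L2 regularization) and a Gaussian prior whose mode is set from keywords, can be illustrated with a small sketch. This is not the authors' method: the function name `fit_map_logistic`, the prior modes and variances, and the plain gradient-descent fit are all assumptions chosen for illustration, and the BM25-inspired construction of the prior is not reproduced here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_map_logistic(X, y, prior_mode, prior_var, lr=0.1, steps=500):
    """MAP estimate for logistic regression with independent Gaussian priors.

    Hypothetical helper: weight j has prior N(prior_mode[j], prior_var[j]).
    Setting every prior_mode[j] to 0 with a shared variance gives the usual
    L2 penalty.
    """
    w = list(prior_mode)  # start at the prior mode
    for _ in range(steps):
        # Gradient of the negative log prior pulls w toward the prior mode.
        grad = [(w[j] - prior_mode[j]) / prior_var[j] for j in range(len(w))]
        # Gradient of the negative log likelihood over the labeled examples.
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Illustrative values only: feature 0 is a "keyword" with a positive prior mode.
keyword_prior = [1.0, 0.0]
unit_variance = [1.0, 1.0]

no_data = fit_map_logistic([], [], keyword_prior, unit_variance)
with_data = fit_map_logistic([[1, 0], [0, 1]], [0, 1], keyword_prior, unit_variance)
```

With no labeled examples the estimate stays at the prior mode, so the keyword weights alone rank documents; as labeled examples arrive, the likelihood term moves the weights away from the prior.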

6. A Regularization Approach to Combining Keywords and Training Data in Technology-Assisted Review

Author: Ophir Frieder, Eugene Yang, and David D. Lewis
Subjects: Operations research, Recall, Computer science, Supervised learning, Bayesian probability, Other engineering and technologies, Engineering and technology, Environmental sciences, Machine learning, Natural sciences, Regularization (mathematics), Software, Computational learning theory, SAFER, Stochastic optimization, Artificial intelligence, Earth and related environmental sciences
Abstract: Manual keyword queries and supervised learning (technology-assisted review) have been viewed as conflicting approaches to high recall retrieval tasks (such as civil discovery and sunshine law requests) in the law. We propose a synthesis that uses a keyword list as a regularizer when learning a logistic regression model from labeled examples. Balancing keywords against training data requires knowing how the regularization penalty should scale with training set size. We show, however, that advice on scaling from theory is contradictory, software defaults are inconsistent, and standard practice (validation-based tuning) is impractical in many high-recall retrieval settings. Through experiments on simulated e-discovery data sets, we show that the penalization scheme suggested by a Bayesian interpretation is substantially safer than alternatives from stochastic optimization and computational learning theory. Combining keywords and training data provides better effectiveness on our datasets than using either alone, showing that both approaches bring value.
Published: 2019

7. A Practical Obstacle Avoidance Method Using Q-Learning with Local Information

Author: J. L. Chen, Eugene Yang, S. C. Chen, and Eric J. Tzeng
Subjects: Computer science, Obstacle avoidance, Q-learning, Reinforcement learning, Robot, Generalizability theory, Artificial intelligence, Focus (optics)
Abstract: Various methods have been proposed for solving the obstacle avoidance problem. However, many of them are based on information that might not be available for robots in real-world settings. We focus on the generalizability and the practical aspects of the problem instead of studying yet another obstacle avoidance method. We propose a simple but robust method based on reinforcement learning for obstacle avoidance using only local information that could be gathered by the sensors on the robot. We train the model with simple and random cases having only static obstacles in a simulated environment and deploy the trained model to an actual robot car. The robot successfully avoided the static and, surprisingly, dynamic obstacles and eventually reached the target.
Published: 2019
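
For readers unfamiliar with the technique named in this title, the standard tabular Q-learning update can be sketched as follows. The abstract does not give the paper's state encoding, action set, or reward, so the tuple-of-sensor-readings state, the `ACTIONS` tuple, and the `alpha` and `gamma` values below are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical action set for a robot car; not taken from the paper.
ACTIONS = ("left", "forward", "right")


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one standard Q-learning update and return the new Q-value.

    q maps (state, action) pairs to value estimates. A state here is any
    hashable summary of local sensor readings (an assumed encoding).
    """
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q_table = defaultdict(float)  # unseen (state, action) pairs default to 0.0

# Assumed encoding: (left, front, right) obstacle flags from local sensors.
first = q_update(q_table, (0, 1, 0), "left", 1.0, (0, 0, 0))
second = q_update(q_table, (0, 0, 1), "forward", 0.0, (0, 1, 0))
```

Because the table is keyed only on local readings, the same learned values apply wherever the robot sees the same local configuration, which is one way to read the abstract's emphasis on generalizability.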

8. Fuzz testing & software composition analysis in software engineering

Author: Eugene Yang
Subjects: Software, Computer science, Mandate, Computing methodologies (general), Open source software, Fuzz testing, Software engineering
Abstract: Today's world is filled with complexity. New threats are waiting for cracks to appear. How can we see the cracks and defend the threats? The new imperative is to stop building walls and quick fixes. Rise to the mandate of building a more resilient world — from the networks that power the connected world to the software that enables global connection.
Published: 2018

9. Effectiveness results for popular e-discovery algorithms

Author: Roman Yurchak, Eugene Yang, Ophir Frieder, and David Grossman
Subjects: Computer science, Suite, Engineering and technology, Natural sciences, Weighting, Term (time), Statistics & probability, Categorization, Electrical engineering, electronic engineering, information engineering, Feature (machine learning), Benchmark (computing), Test suite, Artificial intelligence & image processing, Relevance (information retrieval), Mathematics, Algorithm
Abstract: E-Discovery applications rely upon binary text categorization to determine relevance of documents to a particular case. Although many such categorization algorithms exist, at present, vendors often deploy tools that typically include only one text categorization approach. Unlike previous studies that vary many evaluation parameters simultaneously, fail to include common current algorithms, weights, or features, or use small document collections which are no longer meaningful, we systematically evaluate binary text categorization algorithms using modern benchmark e-Discovery queries (topics) on a benchmark e-Discovery data set. We demonstrate the wide variance of performance obtained using the different parameter combinations, motivating this evaluation. Specifically, we compare five text categorization algorithms, three term weighting techniques and two feature types on a large standard dataset and evaluate the results of this test suite (30 variations) using metrics of greatest interest to the e-Discovery community. Our findings systematically demonstrate that an e-Discovery project is better served by a suite of, rather than a single, algorithms since performance varies greatly depending on the topic, and no approach is uniformly superior across the range of conditions and topics. To that end, we developed an open source project called FreeDiscovery that provides e-Discovery projects with simplified access to a suite of algorithms.
Published: 2017

10. Hate speech detection: Challenges and solutions

Author: Nazli Goharian, Katina Russell, Eugene Yang, Sean MacAvaney, Hao-Ren Yao, and Ophir Frieder
Subjects: Facebook, Computer science, Datasets as Topic, Social Sciences, Task (project management), Machine Learning, Sociology, Electrical engineering, electronic engineering, information engineering, Data Mining, Psychology, Language, Interpretability, Multidisciplinary, Hate, Social Communication, Social Networks, Engineering and Technology, Medicine, Artificial intelligence & image processing, Network Analysis, Research Article, Computer and Information Sciences, Science, Twitter, MEDLINE, Violence, Machine learning, Artificial Intelligence, Support Vector Machines, Information systems, Humans, Speech, Voice activity detection, Cognitive Psychology, Biology and Life Sciences, Linguistics, Communications, Support vector machine, Speech Signal Processing, Signal Processing, Cognitive Science, Artificial intelligence, Social Media, Neuroscience
Abstract: As online content continues to grow, so does the spread of hate speech. We identify and examine challenges faced by online automatic approaches for hate speech detection in text. Among these difficulties are subtleties in language, differing definitions of what constitutes hate speech, and limitations of data availability for training and testing of these systems. Furthermore, many recent approaches suffer from an interpretability problem: it can be difficult to understand why the systems make the decisions that they do. We propose a multi-view SVM approach that achieves near state-of-the-art performance, while being simpler and producing more easily interpretable decisions than neural methods. We also discuss both technical and practical challenges that remain for this task.
Published: 2019

11. Identifying Political Sentiment between Nation States with Social Media

Author: Victor Bowen, Ganesh Harihara, Xisen Tian, Nathanael Chambers, Ethan Genco, Eugene Yang, and Eric Young
Subjects: Politics, Alliance, Computer science, Sentiment analysis, Opinion poll, Social media, Public relations, Filter (software)
Abstract: This paper describes an approach to large-scale modeling of sentiment analysis for the social sciences. The goal is to model relations between nation states through social media. Many cross-disciplinary applications of NLP involve making predictions (such as predicting political elections), but this paper instead focuses on a model that is applicable to broader analysis. Do citizens express opinions in line with their home country's formal relations? When opinions diverge over time, what is the cause and can social media serve to detect these changes? We describe several learning algorithms to study how the populace of a country discusses foreign nations on Twitter, ranging from state-of-the-art contextual sentiment analysis to some required practical learners that filter irrelevant tweets. We evaluate on standard sentiment evaluations, but we also show strong correlations with two public opinion polls and current international alliance relationships. We conclude with some political science use cases.
Published: 2015
