32 results on '"Ce, Zhang"'
Search Results
2. BRIGHT - Graph Neural Networks in Real-time Fraud Detection
- Author
-
Mingxuan Lu, Zhichao Han, Susie Xi Rao, Zitao Zhang, Yang Zhao, Yinan Shan, Ramesh Raghunathan, Ce Zhang, and Jiawei Jiang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, graph neural networks, fraud detection, heterogeneous graph, dynamic graph, graph inference, Machine Learning (cs.LG)
- Abstract
Detecting fraudulent transactions is an essential component to control risk in e-commerce marketplaces. Apart from rule-based and machine learning filters that are already deployed in production, we want to enable efficient real-time inference with graph neural networks (GNNs), which is useful to catch multihop risk propagation in a transaction graph. However, two challenges arise in the implementation of GNNs in production. First, future information in a dynamic graph should not be considered in message passing to predict the past. Second, the latency of graph query and GNN model inference is usually up to hundreds of milliseconds, which is costly for some critical online services. To tackle these challenges, we propose a Batch and Real-time Inception GrapH Topology (BRIGHT) framework to conduct an end-to-end GNN learning that allows efficient online real-time inference. The BRIGHT framework consists of a graph transformation module (Two-Stage Directed Graph) and a corresponding GNN architecture (Lambda Neural Network). The Two-Stage Directed Graph guarantees that the information passed through neighbors is only from the historical payment transactions. It consists of two subgraphs representing historical relationships and real-time links, respectively. The Lambda Neural Network decouples inference into two stages: batch inference of entity embeddings and real-time inference of transaction prediction. Our experiments show that BRIGHT outperforms the baseline models by >2% on average w.r.t. precision. Furthermore, BRIGHT is computationally efficient for real-time fraud detection. Regarding end-to-end performance (including neighbor query and inference), BRIGHT can reduce the P99 latency by >75%. For the inference stage, our speedup is on average 7.8× compared to the traditional GNN. Comment: CIKM Acceptance
- Published
- 2022
- Full Text
- View/download PDF
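The first challenge named in the BRIGHT abstract (no future information in message passing) can be illustrated with a minimal sketch that filters a transaction's neighbors by timestamp before aggregating their features. The `Edge` record, `historical_neighbors`, and `mean_aggregate` are hypothetical names chosen for illustration; they are not BRIGHT's API, and the paper's Two-Stage Directed Graph is more elaborate than this filter.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge schema: a link from a neighbor to a target transaction."""
    src: str          # neighbor id
    dst: str          # target transaction id
    timestamp: float  # time at which the link was observed


def historical_neighbors(edges: List[Edge], target: str, target_time: float) -> List[str]:
    """Return neighbors of `target` whose links were observed strictly before `target_time`."""
    return [e.src for e in edges if e.dst == target and e.timestamp < target_time]


def mean_aggregate(features: Dict[str, List[float]], neighbors: List[str]) -> List[float]:
    """Average the neighbors' feature vectors; return an empty list if there are none."""
    if not neighbors:
        return []
    dim = len(features[neighbors[0]])
    return [sum(features[n][i] for n in neighbors) / len(neighbors) for i in range(dim)]


edges = [
    Edge("t1", "t3", 1.0),
    Edge("t2", "t3", 2.0),
    Edge("t4", "t3", 5.0),  # observed after t3 was scored, so it must be ignored
]
features = {"t1": [1.0, 0.0], "t2": [3.0, 2.0], "t4": [100.0, 100.0]}

past = historical_neighbors(edges, "t3", target_time=3.0)
message = mean_aggregate(features, past)
```

Filtering before aggregation keeps the future link from `t4` out of the message for `t3`, which is the property the abstract attributes to the graph transformation.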
3. Variational Graph Author Topic Modeling
- Author
-
Delvin Ce Zhang and Hady W. Lauw
- Published
- 2022
- Full Text
- View/download PDF
4. dcbench
- Author
-
Sabri Eyuboglu, Bojan Karlaš, Christopher Ré, Ce Zhang, and James Zou
- Published
- 2022
- Full Text
- View/download PDF
5. HUNTER: An Online Cloud Database Hybrid Tuning System for Personalized Requirements
- Author
-
Baoqing Cai, Yu Liu, Ce Zhang, Guangyu Zhang, Ke Zhou, Li Liu, Chunhua Li, Bin Cheng, Jie Yang, and Jiashu Xing
- Published
- 2022
- Full Text
- View/download PDF
6. A Deep Markov Model for Clickstream Analytics in Online Shopping
- Author
-
Yilmazcan Ozyurt, Tobias Hatt, Ce Zhang, and Stefan Feuerriegel
- Published
- 2022
- Full Text
- View/download PDF
7. Topic Modeling for Multi-Aspect Listwise Comparisons
- Author
-
Hady W. Lauw and Delvin Ce Zhang
- Subjects
Topic model, Set (abstract data type), Probabilistic method, Information retrieval, Ranking, Computer science, Plain text, Rank (computer programming), computer.file_format, Semantics, computer, Generative grammar
- Abstract
As a well-established probabilistic method, topic models seek to uncover latent semantics from plain text. In addition to having textual content, we observe that documents are usually compared in listwise rankings based on their content. For instance, countries worldwide are compared in an international ranking in terms of electricity production based on their national reports. Such document comparisons constitute additional information that reveals documents' relative similarities. Incorporating them into topic modeling could yield comparative topics that help to differentiate and rank documents. Furthermore, based on different comparison criteria, the observed document comparisons usually cover multiple aspects, each expressing a distinct ranked list. For example, a country may be ranked higher in terms of electricity production, but fall behind others in terms of life expectancy or government budget. Each comparison criterion, or aspect, observes a distinct ranking. Considering such multiple aspects of comparisons based on different ranking criteria allows us to derive one set of topics that inform heterogeneous document similarities. We propose a generative topic model aimed at learning topics that are well aligned to multi-aspect listwise comparisons. Comprehensive experiments on public datasets demonstrate the advantage of the proposed method over baselines in jointly modeling topics and ranked lists.
- Published
- 2021
- Full Text
- View/download PDF
8. AutoML: From Methodology to Application
- Author
-
Zhen Wang, Yaliang Li, Yuexiang Xie, Ce Zhang, Bolin Ding, and Kai Zeng
- Subjects
Hyperparameter, Range (mathematics), Model architecture, business.industry, Computer science, Process (engineering), Hyperparameter optimization, Feature generation, Architecture, Software engineering, business, External Data Representation
- Abstract
Machine Learning methods have been adopted for a wide range of real-world applications, ranging from social networks, online image/video-sharing platforms, and e-commerce to education, healthcare, etc. However, in practice, a large amount of effort is required to tune several components of machine learning methods, including data representation, hyperparameter, and model architecture, in order to achieve a good performance. To alleviate the required tuning efforts, Automated Machine Learning (AutoML), which can automate the process of applying machine learning methods, has been studied in both academy and industry recently. In this tutorial, we will introduce the main research topics of AutoML, including Hyperparameter Optimization, Neural Architecture Search, and Meta-Learning. Two emerging topics of AutoML, Automatic Feature Generation and Machine Learning Guided Database, will also be discussed since they are important components for real-world applications. For each topic, we will motivate it with application examples from industry, illustrate the state-of-the-art methodologies, and discuss some future research directions based on our experience from industry and the trends in academy.
- Published
- 2021
- Full Text
- View/download PDF
9. AutoML
- Author
-
Ce Zhang, Bolin Ding, Yaliang Li, and Zhen Wang
- Subjects
Hyperparameter, Meta learning (computer science), Computer science, Process (engineering), Scale (chemistry), Hyperparameter optimization, Perspective (graphical), Architecture, External Data Representation, Data science
- Abstract
Machine learning methods have been adopted for various real-world applications, ranging from social networks, online image/video-sharing platforms, and e-commerce to education, healthcare, etc. However, several components of machine learning methods, including data representation, hyperparameter and model architecture, can largely affect their performance in practice. Moreover, the explosions of data scale and model size make the optimization of these components more and more time-consuming for machine learning developers. To tackle these challenges, Automated Machine Learning (AutoML) aims to automate the process of applying machine learning methods to solve real-world application tasks, reducing the time of tuning machine learning methods while maintaining good performance. In this tutorial, we will introduce the main research topics of AutoML, including Hyperparameter Optimization, Neural Architecture Search and Meta-Learning. Two emerging topics of AutoML, DNN-based Feature Generation and Machine Learning Guided Database, will also be discussed as they are important components for real-world applications. For each topic, we will motivate it with examples from industry, illustrate the state-of-the-art methods, and discuss their pros and cons from both perspectives of industry and academy. We will also discuss some future research directions based on our experience from industry and the trends in academy.
- Published
- 2021
- Full Text
- View/download PDF
10. FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data
- Author
-
Ce Zhang, Yuexiang Xie, Wei Lin, Yaliang Li, Bolin Ding, Jingren Zhou, Nezihe Merve Gürel, Minlie Huang, and Zhen Wang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Computer science, business.industry, Feature vector, Machine Learning (stat.ML), Cloud computing, Recommender system, computer.software_genre, Machine Learning (cs.LG), Tree traversal, Statistics - Machine Learning, Feature (computer vision), Benchmark (computing), Graph (abstract data type), Data mining, business, computer, Interpretability
- Abstract
High-order interactive features capture the correlation between different columns and thus are promising to enhance various learning tasks on ubiquitous tabular data. To automate the generation of interactive features, existing works either explicitly traverse the feature space or implicitly express the interactions via intermediate activations of some designed models. These two kinds of methods show that there is essentially a trade-off between feature interpretability and search efficiency. To possess both of their merits, we propose a novel method named Feature Interaction Via Edge Search (FIVES), which formulates the task of interactive feature generation as searching for edges on the defined feature graph. Specifically, we first present our theoretical evidence that motivates us to search for useful interactive features with increasing order. Then we instantiate this search strategy by optimizing both a dedicated graph neural network (GNN) and the adjacency tensor associated with the defined feature graph. In this way, the proposed FIVES method simplifies the time-consuming traversal as a typical training course of GNN and enables explicit feature generation according to the learned adjacency tensor. Experimental results on both benchmark and real-world datasets show the advantages of FIVES over several state-of-the-art methods. Moreover, the interactive features identified by FIVES are deployed on the recommender system of Taobao, a worldwide leading e-commerce platform. Results of an online A/B testing further verify the effectiveness of the proposed method FIVES, and we further provide FIVES as AI utilities for the customers of Alibaba Cloud. Comment: Accepted by KDD-21
- Published
- 2021
- Full Text
- View/download PDF
11. Towards understanding end-to-end learning in the context of data
- Author
-
Ce Zhang and Wentao Wu
- Subjects
Computer science, business.industry, Process (engineering), Data management, Feature extraction, Context (language use), Machine learning, computer.software_genre, Pipeline (software), Robustness (computer science), Table (database), Artificial intelligence, business, computer, Pace
- Abstract
Recent advances in machine learning (ML) systems have made it much easier to train ML models given a training set. However, our understanding of the behavior of the model training process has not been improving at the same pace. Consequently, a number of key questions remain: How can we systematically assign importance or value to training data with respect to the utility of the trained models, be it accuracy, fairness, or robustness? How does noise in the training data, either injected by noisy data acquisition processes or adversarial parties, have an impact on the trained models? How can we find the right data that can be cleaned and labeled to improve the utility of the trained models? Just as we have recently started to understand these important questions for ML models in isolation, we now have to face the reality that most real-world ML applications are way more complex than a single ML model. In this article, an extended abstract for an invited talk at the DEEM workshop, we will discuss our current efforts in revisiting these questions for an end-to-end ML pipeline, which consists of a noise model for data and a feature extraction pipeline, followed by the training of an ML model. In our opinion, this poses a unique challenge on the joint analysis of data processing and learning. Although we will describe some of our recent results towards understanding this interesting problem, this article is more of a "confession" on our technical struggles and a "cry for help" to our data management community.
- Published
- 2021
- Full Text
- View/download PDF
12. Research on SRGM Parameter Optimization Based on Improved Particle Swarm Optimization Algorithm
- Author
-
Wenqian Jiang, Ce Zhang, Zhichao Sun, Miaomiao Fan, Wenyu Li, Yafei Wen, Wen Song, and Kaiwei Liu
- Published
- 2021
- Full Text
- View/download PDF
13. Analysis of the Influence of Total Number of Software Faults on SRGM Performance
- Author
-
Zhichao Sun, Ce Zhang, YuFei Yuan, Wenqian Jiang, Miaomiao Fan, Wenyu Li, Yafei Wen, Wen Song, and Kaiwei Liu
- Published
- 2021
- Full Text
- View/download PDF
14. Learning User Representations with Hypercuboids for Recommender Systems
- Author
-
Ce Zhang, Yumeng Li, Yue Hu, Shaojian He, Wenwu Ou, Tanchao Zhu, Aston Zhang, Huoyu Liu, and Shuai Zhang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Point (typography), business.industry, Computer science, Novelty, 02 engineering and technology, Recommender system, Space (commercial competition), Machine learning, computer.software_genre, Computer Science - Information Retrieval, Machine Learning (cs.LG), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Collaborative filtering, Key (cryptography), 020201 artificial intelligence & image processing, Artificial intelligence, Architecture, Representation (mathematics), business, computer, Information Retrieval (cs.IR)
- Abstract
Modeling user interests is crucial in real-world recommender systems. In this paper, we present a new user interest representation model for personalized recommendation. Specifically, the key novelty behind our model is that it explicitly models user interests as a hypercuboid instead of a point in the space. In our approach, the recommendation score is learned by calculating a compositional distance between the user hypercuboid and the item. This helps to alleviate the potential geometric inflexibility of existing collaborative filtering approaches, enabling a greater extent of modeling capability. Furthermore, we present two variants of hypercuboids to enhance the capability in capturing the diversities of user interests. A neural architecture is also proposed to facilitate user hypercuboid learning by capturing the activity sequences (e.g., buy and rate) of users. We demonstrate the effectiveness of our proposed model via extensive experiments on both public and commercial datasets. Empirical results show that our approach achieves very promising results, outperforming existing state-of-the-art. Comment: Accepted by WSDM 2021
- Published
- 2021
- Full Text
- View/download PDF
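The abstract above scores an item by a compositional distance between a user hypercuboid and the item point, without giving the formula. The sketch below shows one plausible form, assuming an axis-aligned box given by a center and non-negative half-widths, with the distance split into an outside part (item to nearest box point) and a down-weighted inside part (nearest box point to center). The function name, the `inside_weight` parameter, and this particular decomposition are assumptions for illustration and may differ from the paper's definition.

```python
import math
from typing import Sequence


def hypercuboid_distance(
    center: Sequence[float],
    offset: Sequence[float],
    item: Sequence[float],
    inside_weight: float = 0.5,  # hypothetical weight on the inside term
) -> float:
    """Distance from an item point to an axis-aligned box [center - offset, center + offset]."""
    outside_sq = 0.0
    inside_sq = 0.0
    for c, o, x in zip(center, offset, item):
        lo, hi = c - o, c + o
        nearest = min(max(x, lo), hi)     # closest point of the box along this axis
        outside_sq += (x - nearest) ** 2  # zero when the item lies inside the box
        inside_sq += (nearest - c) ** 2   # how far that closest point is from the center
    return math.sqrt(outside_sq) + inside_weight * math.sqrt(inside_sq)


user_center = [0.0, 0.0]
user_offset = [1.0, 1.0]

d_center = hypercuboid_distance(user_center, user_offset, [0.0, 0.0])
d_inside = hypercuboid_distance(user_center, user_offset, [0.5, 0.0])
d_outside = hypercuboid_distance(user_center, user_offset, [3.0, 0.0])
```

Under this assumed form, items inside the box incur only the discounted inside term, so a wider box tolerates a broader range of items than a single point embedding would.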
15. vChain: A Blockchain System Ensuring Query Integrity
- Author
-
Cheng Xu, Haixin Wang, Ce Zhang, and Jianliang Xu
- Subjects
Blockchain, Database, Computer science, business.industry, Usability, 02 engineering and technology, Service provider, computer.software_genre, Visualization, 020204 information systems, Data integrity, 0202 electrical engineering, electronic engineering, information engineering, business, computer
- Abstract
This demonstration presents vChain, a blockchain system that ensures query integrity. With the proliferation of blockchain applications and services, there has been an increasing demand for querying the data stored in a blockchain database. However, existing solutions either are at the risk of losing query integrity, or require users to maintain a full copy of the blockchain database. In comparison, by employing a novel verifiable query processing framework, vChain enables a lightweight user to authenticate the query results returned from a potentially untrusted service provider. We demonstrate its verifiable query operations, usability, and performance with visualization for better insights. We also showcase how users can detect falsified results in the case that the service provider is compromised.
- Published
- 2020
- Full Text
- View/download PDF
16. Smart Surface Classification for Accessible Routing through Built Environment
- Author
-
Ce Zhang, Janick Edinger, Zheng Cao, Vaskar Raychoudhury, Valeria Mokrenko, and Osman Gani
- Subjects
Empirical research, Computer science, Order (business), law, Human–computer interaction, Scalability, Gyroscope, Routing (electronic design automation), Accelerometer, Built environment, law.invention
- Abstract
In order to provide individuals with restricted mobility the opportunity to travel more efficiently, various systems have proposed modeling techniques and routing algorithms that handle accessible navigation through the built environment which is otherwise dotted with mobility barriers. Such systems use data gathered from smartphone sensors or crowd-sourcing to pinpoint the location of the barriers as well as the facilities, such as crosswalks with traffic signals or access ramps to curbs. Though the previous works have identified the type of surface and incline to be important features to determine accessibility, no extensive empirical research exists on how these parameters affect navigation. In order to address this problem, we propose to build a novel system called WheelShare, which uses machine learning to classify surfaces into accessible or otherwise and uses that knowledge to generate accessible routes for wheelchair users. We have trained our system with accelerometer and gyroscope data obtained from 26 different surfaces found frequently in indoor and outdoor environments across Europe and USA. More data is collected by the system through crowd-sourcing based contribution from interested users. Our evaluation shows that WheelShare can achieve an accuracy of up to 96% in identifying surfaces in one of the 5 different accessibility classes. Overall, WheelShare is a novel, scalable and data-centric approach to objectively identify the accessible features of a surface and can generate end-to-end routes for wheelchair users using frequently updated crowd-sourced information.
- Published
- 2019
- Full Text
- View/download PDF
17. vChain
- Author
-
Jianliang Xu, Cheng Xu, and Ce Zhang
- Subjects
FOS: Computer and information sciences, Computer Science - Cryptography and Security, Blockchain, Database, Range query (data structures), Computer science, Databases (cs.DB), 02 engineering and technology, computer.software_genre, Computer Science - Databases, Robustness (computer science), 020204 information systems, Data integrity, Trie, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Verifiable secret sharing, Accumulator (computing), Cryptography and Security (cs.CR), computer
- Abstract
Blockchains have recently been under the spotlight due to the boom of cryptocurrencies and decentralized applications. There is an increasing demand for querying the data stored in a blockchain database. To ensure query integrity, the user can maintain the entire blockchain database and query the data locally. However, this approach is not economic, if not infeasible, because of the blockchain's huge data size and considerable maintenance costs. In this paper, we take the first step toward investigating the problem of verifiable query processing over blockchain databases. We propose a novel framework, called vChain, that alleviates the storage and computing costs of the user and employs verifiable queries to guarantee the results' integrity. To support verifiable Boolean range queries, we propose an accumulator-based authenticated data structure that enables dynamic aggregation over arbitrary query attributes. Two new indexes are further developed to aggregate intra-block and inter-block data records for efficient query verification. We also propose an inverted prefix tree structure to accelerate the processing of a large number of subscription queries simultaneously. Security analysis and empirical study validate the robustness and practicality of the proposed techniques.
- Published
- 2019
- Full Text
- View/download PDF
18. Network Scheduling in the Dark
- Author
-
Ce Zhang, Muhsen Owaida, Bojan Karlas, Sangeetha Abdu Jyothi, Vojislav Đukić, and Ankit Singla
- Subjects
Computer science, Distributed computing, Scheduling (computing)
- Published
- 2018
- Full Text
- View/download PDF
19. DimBoost
- Author
-
Ce Zhang, Jiawei Jiang, Bin Cui, and Fangcheng Fu
- Subjects
Boosting (machine learning), Computer science, business.industry, 02 engineering and technology, Machine learning, computer.software_genre, Bottleneck, 020204 information systems, Histogram, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Curse of dimensionality
- Abstract
Gradient boosting decision tree (GBDT) is one of the most popular machine learning models widely used in both academia and industry. Although GBDT has been widely supported by existing systems such as XGBoost, LightGBM, and MLlib, one system bottleneck appears when the dimensionality of the data becomes high. As a result, when we tried to support our industrial partner on datasets of the dimension up to 330K, we observed suboptimal performance for all these aforementioned systems. In this paper, we ask "Can we build a scalable GBDT training system whose performance scales better with respect to dimensionality of the data?" The first contribution of this paper is a careful investigation of existing systems by developing a performance model with respect to the dimensionality of the data. We find that the collective communication operations in many existing systems only implement the algorithm designed for small messages. By just fixing this problem, we are able to speed up these systems by up to 2X. Our second contribution is a series of optimizations to further optimize the performance of collective communications. These optimizations include a task scheduler, a two-phase split finding method, and low-precision gradient histograms. Our third contribution is a sparsity-aware algorithm to build gradient histograms and a novel index structure to build histograms in parallel. We implement these optimizations in DimBoost and show that it can be 2-9X faster than existing systems.
- Published
- 2018
- Full Text
- View/download PDF
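The DimBoost abstract mentions a sparsity-aware algorithm for building gradient histograms. A minimal sketch of the general idea, under assumptions of our own: each row stores only its nonzero features as `{feature_index: bin_index}`, gradients are accumulated for those entries only, and the gradient mass belonging to each feature's zero bin is recovered by subtraction from the total. The data layout, the `zero_bin` parameter, and the function name are hypothetical; DimBoost's actual algorithm and index structure are described in the paper.

```python
from typing import Dict, List


def build_histograms(
    rows: List[Dict[int, int]],  # per row: nonzero feature index -> bin index
    grads: List[float],          # per row: gradient value
    num_features: int,
    num_bins: int,
    zero_bin: int = 0,           # bin that holds the implicit zero values
) -> List[List[float]]:
    """Build per-feature gradient histograms while touching only nonzero entries."""
    hist = [[0.0] * num_bins for _ in range(num_features)]
    total = sum(grads)

    # Pass over stored (nonzero) entries only: cost is proportional to the
    # number of nonzeros, not rows * features.
    for row, g in zip(rows, grads):
        for feature, bin_index in row.items():
            hist[feature][bin_index] += g

    # Rows that did not store a feature fall in its zero bin; their gradient
    # sum is the total minus what was accumulated above.
    for feature in range(num_features):
        hist[feature][zero_bin] += total - sum(hist[feature])
    return hist


rows = [{0: 1}, {1: 2}, {0: 2, 1: 1}]
grads = [1.0, 2.0, 4.0]
histograms = build_histograms(rows, grads, num_features=2, num_bins=3)
```

The subtraction step is what keeps the work independent of the number of zero entries, which matters when the dimensionality is high and most values are zero.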
20. Sensing Social Media Signals for Cryptocurrency News
- Author
-
Johannes Beck, Roberta Huang, David Lindner, Tian Guo, Ce Zhang, Dirk Helbing, and Nino Antulov-Fantulin
- Published
- 2019
- Full Text
- View/download PDF
21. How good are machine learning clouds for binary classification with good features?
- Author
-
Luyuan Zeng, Wentao Wu, Hantian Zhang, and Ce Zhang
- Subjects
business.industry, Active learning (machine learning), Computer science, Algorithmic learning theory, Stability (learning theory), Online machine learning, Multi-task learning, 020206 networking & telecommunications, 02 engineering and technology, Machine learning, computer.software_genre, Robot learning, Computational learning theory, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, Instance-based learning, business, computer
- Abstract
In spite of the recent advancement of machine learning research, modern machine learning systems are still far from easy to use, at least from the perspective of business users or even scientists without a computer science background. Recently, there is a trend toward pushing machine learning onto the cloud as a "service," a.k.a. machine learning clouds. By putting a set of machine learning primitives on the cloud, these services significantly raise the level of abstraction for machine learning. For example, with Amazon Machine Learning, users only need to upload the dataset and specify the type of task (classification or regression). The cloud will then train machine learning models without any user intervention.
- Published
- 2017
- Full Text
- View/download PDF
22. An Overreaction to the Broken Machine Learning Abstraction
- Author
-
Ce Zhang, Wentao Wu, and Tian Li
- Subjects
Syntax (programming languages), Point (typography), Computer science, business.industry, Interface (Java), 0102 computer and information sciences, Machine learning, computer.software_genre, 01 natural sciences, Domain (software engineering), 010201 computation theory & mathematics, 0103 physical sciences, Artificial intelligence, business, 010303 astronomy & astrophysics, computer, Abstraction (linguistics)
- Abstract
After hours of teaching astrophysicists TensorFlow and then seeing them, nevertheless, continue to struggle in the most creative way possible, we asked, What is the point of all of these efforts? It was a warm winter afternoon, Zurich was not gloomy at all; while Seattle was sunny as usual, and Beijing's air was crystal clear. One of the authors stormed out of a marathon meeting with biologists, and our journey of overreaction begins. We ask, Can we build a system that gets domain experts completely out of the machine learning loop? Can this system have exactly the same interface as linear regression, the bare minimum requirement of a scientist? We started trial-and-error and discussions with domain experts, all of whom not only have a great sense of humor but also generously offered to be our "guinea pigs." After months of exploration, the architecture of our system, ease.ml, starts to get into shape: it is not as general as TensorFlow but not completely useless; in fact, many applications we are supporting can be built completely with ease.ml, and many others just need some syntactic sugar. During development, we find that building ease.ml in the right way raises a series of technical challenges. In this paper, we describe our ease.ml vision, discuss each of these technical challenges, and map out our research agenda for the months and years to come.
- Published
- 2017
- Full Text
- View/download PDF
23. Heterogeneity-aware Distributed Parameter Servers
- Author
-
Lele Yu, Bin Cui, Ce Zhang, and Jiawei Jiang
- Subjects
020203 distributed computing, Range (mathematics), Schedule, Stochastic gradient descent, Constant (computer programming), Computer science, 020204 information systems, Distributed computing, Server, Synchronization (computer science), 0202 electrical engineering, electronic engineering, information engineering, 02 engineering and technology
- Abstract
We study distributed machine learning in heterogeneous environments in this work. We first conduct a systematic study of existing systems running distributed stochastic gradient descent; we find that, although these systems work well in homogeneous environments, they can suffer performance degradation, sometimes up to 10x, in heterogeneous environments where stragglers are common because their synchronization protocols cannot fit a heterogeneous setting. Our first contribution is a heterogeneity-aware algorithm that uses a constant learning rate schedule for updates before adding them to the global parameter. This allows us to suppress stragglers' harm on robust convergence. As a further improvement, our second contribution is a more sophisticated learning rate schedule that takes into consideration the delayed information of each update. We theoretically prove the valid convergence of both approaches and implement a prototype system in the production cluster of our industrial partner Tencent Inc. We validate the performance of this prototype using a range of machine-learning workloads. Our prototype is 2-12x faster than other state-of-the-art systems, such as Spark, Petuum, and TensorFlow; and our proposed algorithm takes up to 6x fewer iterations to converge.
- Published
- 2017
- Full Text
- View/download PDF
24. Extracting Databases from Dark Data with DeepDive
- Author
-
Ce Zhang, Feng Niu, Christopher Ré, Jaeho Shin, and Michael Cafarella
- Subjects
Database, Relational database, Computer science, business.industry, Big data, Probabilistic logic, 02 engineering and technology, computer.software_genre, Dark data, Data science, Article, Set (abstract data type), Information extraction, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Precision and recall, business, computer, Data integration
- Abstract
DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data (scientific papers, Web classified ads, customer service notes, and so on) were instead in a relational database, it would give analysts access to a massive and highly-valuable new set of "big data" to exploit. DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.
- Published
- 2016
- Full Text
- View/download PDF
25. Caffe con Troll
- Author
-
Firas Abuzaid, Stefan Hadjis, Christopher Ré, and Ce Zhang
- Subjects
FOS: Computer and information sciences, Speedup, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (stat.ML), 010103 numerical & computational mathematics, 02 engineering and technology, Parallel computing, 01 natural sciences, Convolutional neural network, Article, Machine Learning (cs.LG), Statistics - Machine Learning, 0202 electrical engineering, electronic engineering, information engineering, 0101 mathematics, Throughput (business), Caffè, Artificial neural network, business.industry, Deep learning, 020206 networking & telecommunications, FLOPS, Computer Science - Learning, Computer architecture, Central processing unit, Artificial intelligence, business
- Abstract
We present Caffe con Troll (CcT), a fully compatible end-to-end version of the popular framework Caffe with rebuilt internals. We built CcT to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures. We find that, by employing standard batching optimizations for CPU training, we achieve a 4.5x throughput improvement over Caffe on popular networks like CaffeNet. Moreover, with these improvements, the end-to-end training time for CNNs is directly proportional to the FLOPS delivered by the CPU, which enables us to efficiently train hybrid CPU-GPU systems for CNNs.
- Published
- 2015
- Full Text
- View/download PDF
26. A Markov logic framework for recognizing complex events from multimodal data
- Author
Henry Kautz, Ce Zhang, Jiebo Luo, James F. Allen, Young Chol Song, Mary Swift, and Yuncheng Li
- Subjects
Structure (mathematical logic) ,Parsing ,Markov chain ,Computer science ,business.industry ,Probabilistic logic ,Inference ,Machine learning ,computer.software_genre ,Multimodal interaction ,Task (project management) ,Artificial intelligence ,business ,computer ,Gesture
- Abstract
We present a general framework for complex event recognition that is well-suited for integrating information that varies widely in detail and granularity. Consider the scenario of an agent in an instrumented space performing a complex task while describing what he is doing in a natural manner. The system takes in a variety of information, including objects and gestures recognized by RGB-D and descriptions of events extracted from recognized and parsed speech. The system outputs a complete reconstruction of the agent's plan, explaining actions in terms of more complex activities and filling in unobserved but necessary events. We show how to use Markov Logic (a probabilistic extension of first-order logic) to create a model in which observations can be partial, noisy, and refer to future or temporally ambiguous events; complex events are composed from simpler events in a manner that exposes their structure for inference and learning; and uncertainty is handled in a sound probabilistic manner. We demonstrate the effectiveness of the approach for tracking kitchen activities in the presence of noisy and incomplete observations.
- Published
- 2013
- Full Text
- View/download PDF
27. GeoDeepDive
- Author
Shanan E. Peters, Jackson Borchardt, Christopher Ré, Ce Zhang, Tim Foltz, and Vidhya Govindaraju
- Subjects
Feature engineering ,SQL ,Information retrieval ,Computer science ,business.industry ,Data management ,Search engine indexing ,Python (programming language) ,computer.software_genre ,Data science ,Data processing system ,Data extraction ,business ,computer ,Data integration ,computer.programming_language
- Abstract
We describe our proposed demonstration of GeoDeepDive, a system that helps geoscientists discover information and knowledge buried in the text, tables, and figures of geology journal articles. This requires solving a host of classical data management challenges including data acquisition (e.g., from scanned documents), data extraction, and data integration. SIGMOD attendees will see demonstrations of three aspects of our system: (1) an end-to-end system that is of a high enough quality to perform novel geological science, but is written by a small enough team so that each aspect can be manageably explained; (2) a simple feature engineering system that allows a user to write in familiar SQL or Python; and (3) the effect of different sources of feedback on result quality including expert labeling, distant supervision, traditional rules, and crowd-sourced data. Our prototype builds on our work integrating statistical inference and learning tools into traditional database systems. If successful, our demonstration will allow attendees to see that data processing systems that use machine learning contain many familiar data processing problems such as efficient querying, indexing, and supporting tools for database-backed websites, none of which are machine-learning problems, per se.
- Published
- 2013
- Full Text
- View/download PDF
28. Towards high-throughput Gibbs sampling at scale
- Author
Ce Zhang and Christopher Ré
- Subjects
symbols.namesake ,Computer science ,Scalability ,symbols ,Sampling (statistics) ,Data mining ,computer.software_genre ,computer ,Throughput (business) ,Synthetic data ,Factor graph ,Gibbs sampling
- Abstract
Factor graphs and Gibbs sampling are a popular combination for Bayesian statistical methods that are used to solve diverse problems including insurance risk models, pricing models, and information extraction. Given a fixed sampling method and a fixed amount of time, an implementation of a sampler that achieves a higher throughput of samples will achieve a higher quality than a lower-throughput sampler. We study how (and whether) traditional data processing choices about materialization, page layout, and buffer-replacement policy need to be changed to achieve high-throughput Gibbs sampling for factor graphs that are larger than main memory. We find that both new theoretical and new algorithmic techniques are required to understand the tradeoff space for each choice. On both real and synthetic data, we demonstrate that traditional baseline approaches may achieve two orders of magnitude lower throughput than an optimal approach. For a handful of popular tasks across several storage backends, including HBase and traditional Unix files, we show that our simple prototype achieves competitive (and sometimes better) throughput compared to specialized state-of-the-art approaches on factor graphs that are larger than main memory.
- Published
- 2013
- Full Text
- View/download PDF
29. Content-enriched classifier for web video classification
- Author
Ce Zhang, Gao Cong, and Bin Cui
- Subjects
Training set ,Information retrieval ,Categorization ,Computer science ,Classifier (UML)
- Abstract
With the explosive growth of online videos, automatic real-time categorization of Web videos plays a key role in organizing, browsing, and retrieving the huge number of videos on the Web. Previous work shows that, in addition to text features, content features of videos are also useful for Web video classification. Unfortunately, extracting content features is computationally prohibitive for real-time video classification. In this paper we propose a novel video classification framework that is able to exploit both content and text features for video classification while avoiding the expensive computation of extracting content features at classification time. The main idea of our approach is to use the content features extracted from training data to enrich the text-based semantic kernels, yielding content-enriched semantic kernels. The content-enriched semantic kernels make it possible to use both content and text features for classifying new videos without extracting their content features. The experimental results show that our approach significantly outperforms the state-of-the-art video classification methods.
- Published
- 2010
- Full Text
- View/download PDF
30. Multiple feature fusion for social media applications
- Author
Bin Cui, Zhe Zhao, Anthony K. H. Tung, and Ce Zhang
- Subjects
User information ,Markov random field ,Computer science ,business.industry ,Social media ,Statistical model ,Artificial intelligence ,Similarity measure ,business ,Machine learning ,computer.software_genre ,computer ,Social relation
- Abstract
The emergence of social media as a crucial paradigm has posed new challenges to the research and industry communities, where media are designed to be disseminated through social interaction. Recent literature has noted the generality of multiple features in the social media environment, such as textual, visual, and user information. However, most studies employ only a relatively simple mechanism to merge the features rather than fully exploiting feature correlation for social media applications. In this paper, we propose a novel approach to fusing multiple features and their correlations for similarity evaluation. Specifically, we first build a Feature Interaction Graph (FIG) by taking features as nodes and the correlations between them as edges. Then, we employ a probabilistic model based on a Markov Random Field to describe the graph and measure similarity between multimedia objects. Using that, we design an efficient retrieval algorithm for large social media data. Further, we integrate temporal information into the probabilistic model for social media recommendation. We evaluate our approach using a large real-life corpus collected from Flickr, and the experimental results indicate the superiority of our proposed method over state-of-the-art techniques.
- Published
- 2010
- Full Text
- View/download PDF
31. A smart method for tracking of moving objects on production line
- Author
Ce Zhang, Jiangping Mei, Yabin Ding, and Wenchang Zhang
- Subjects
Production line ,Identification (information) ,business.industry ,Computer science ,Machine vision ,Robot ,Computer vision ,Artificial intelligence ,business ,Tracking (particle physics) ,Automation ,Displacement (vector) ,Servo
- Abstract
A decision-making analysis method is described for tracking moving objects on an automated production line. Based on the coordinates of moving objects in an image sequence, combined with the displacement information provided by a servo-controlled conveyor, this method solves the problem of targets being identified repeatedly or missed. Dependable target localization information is thereby obtained and provided to the packing robot.
- Published
- 2008
- Full Text
- View/download PDF
32. Semantic similarity based on compact concept ontology
- Author
Gao Cong, Bin Cui, Ce Zhang, and Yujing Wang
- Subjects
Information retrieval ,Semantic similarity ,Similarity (network science) ,Computer science ,Semantic computing ,WordNet ,Ontology ,Upper ontology ,Ontology (information science)
- Abstract
This paper presents a new method of calculating the semantic similarity between two articles based on WordNet. To further improve the performance of the proposed method, we build a new Compact Concept Ontology (CCO) from WordNet by combining the words with similar semantic meanings. The experimental results show that our approach significantly outperforms a recent proposal of computing semantic similarity, and demonstrate the superiority of the proposed CCO method.
- Published
- 2008
- Full Text
- View/download PDF