6,486 results on '"Web query classification"'
Search Results
2. Penerapan Klasifikasi Kueri untuk Meningkatkan Efektivitas Mesin Pencari [Applying Query Classification to Improve Search Engine Effectiveness]
- Author
-
Lutfi Rahmatuti Maghfiroh and Handy Geraldy
- Subjects
business.industry ,Computer science ,media_common.quotation_subject ,Machine learning ,computer.software_genre ,Random forest ,Support vector machine ,Search engine ,Naive Bayes classifier ,Web query classification ,Artificial intelligence ,Gradient boosting ,business ,Function (engineering) ,Precision and recall ,computer ,media_common
- Abstract
In its role as a data provider, Badan Pusat Statistik (BPS) offers the public access to BPS data. One such service is the search feature on the BPS website. However, the search service has not yet met user expectations. One way to meet those expectations is to improve search effectiveness so that results are more relevant to the user's intent. This study therefore aims to build a query classification function for the search engine and to test whether that function improves search effectiveness. The query classification function is built using machine learning models. We compare five algorithms: SVM, Random Forest, Gradient Boosting, KNN, and Naive Bayes. Of these five, SVM yields the best model. The function is then implemented in the search engine, whose effectiveness is measured by precision and recall. The results show that the query classification function can narrow the search results for certain queries, thereby increasing precision. However, the query classification function does not affect recall.
- Published
- 2021
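As an illustration of the precision and recall measures used in result 2 above, the following minimal sketch computes both for a single query before and after the result set is narrowed to a predicted class. The document ids and result lists are invented for illustration and are not taken from the study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: three documents are relevant to the query.
relevant = {"d1", "d2", "d3"}
# Unfiltered search returns six documents.
unfiltered = ["d1", "d2", "d3", "d4", "d5", "d6"]
# Restricting results to the predicted query class drops two off-topic
# documents but keeps every relevant one.
filtered = ["d1", "d2", "d3", "d4"]

p_before, r_before = precision_recall(unfiltered, relevant)
p_after, r_after = precision_recall(filtered, relevant)
print(p_before, r_before, p_after, r_after)
```

In this toy case, narrowing the result set raises precision while recall stays the same, which is the pattern the abstract reports.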
3. Achieving Secure, Universal, and Fine-Grained Query Results Verification for Secure Search Scheme Over Encrypted Cloud Data
- Author
-
Hui Yin, Zheng Qin, Lu Ou, Jixin Zhang, and Keqin Li
- Subjects
Database ,Computer Networks and Communications ,Computer science ,business.industry ,Cloud computing ,Cryptography ,Query optimization ,computer.software_genre ,Encryption ,Computer Science Applications ,Query expansion ,Hardware and Architecture ,Web query classification ,Sargable ,business ,computer ,Software ,Information Systems ,RDF query language ,computer.programming_language
- Abstract
Secure search techniques over encrypted cloud data allow an authorized user to query data files of interest by submitting encrypted query keywords to the cloud server in a privacy-preserving manner. However, in practice, the returned query results may be incorrect or incomplete in a dishonest cloud environment. For example, the cloud server may intentionally omit some qualified results to save computational resources and communication overhead. Thus, a well-functioning secure query system should provide a query results verification mechanism that allows the data user to verify results. In this paper, we design a secure, easily integrated, and fine-grained query results verification mechanism by which, given an encrypted query results set, the query user can not only verify the correctness of each data file in the set but also, if the set is incomplete, check before decryption how many or which qualified data files were not returned. The verification scheme is loosely coupled to concrete secure search techniques and can be easily integrated into any secure query scheme. We achieve this goal by constructing a secure verification object for encrypted cloud data. Furthermore, a short signature technique with extremely small storage cost is proposed to guarantee the authenticity of the verification object, and a verification object request technique is presented to allow the query user to securely obtain the desired verification object. Performance evaluation shows that the proposed schemes are practical and efficient.
- Published
- 2021
4. Improving Efficiency of Web Application Firewall to Detect Code Injection Attacks with Random Forest Method and Analysis Attributes HTTP Request
- Author
-
Nguyen Manh Thang
- Subjects
Hypertext Transfer Protocol ,business.industry ,computer.internet_protocol ,Computer science ,020207 software engineering ,Denial-of-service attack ,0102 computer and information sciences ,02 engineering and technology ,Computer security ,computer.software_genre ,01 natural sciences ,Firewall (construction) ,010201 computation theory & mathematics ,Web query classification ,0202 electrical engineering, electronic engineering, information engineering ,Web application ,The Internet ,Application firewall ,business ,computer ,Software ,Computer technology
- Abstract
In the era of information technology, the use of computer technology for both work and personal purposes is growing rapidly. Unfortunately, as computer networks and systems grow in number and size, their vulnerability also increases. Protecting the web applications of organizations is becoming increasingly relevant, as most transactions are carried out over the Internet. Traditional security devices control attacks at the network level, but modern web attacks occur through the HTTP protocol at the application level. Moreover, attacks often come together. For example, a denial of service attack is used to hide code injection attacks. The system administrator spends a lot of time keeping the system running and may overlook the code injection attacks. Therefore, the main task for system administrators is to detect network attacks at the application level using a web application firewall and to apply effective algorithms that train the firewall automatically to increase its efficiency. The article introduces a parameterization of the task to increase the accuracy of query classification by the random forest method, thereby creating the basis for detecting attacks at the application level.
- Published
- 2020
5. RETRACTED ARTICLE: Automated query classification based web service similarity technique using machine learning
- Author
-
V. Jeyakrishnan, S. Balakrishnan, K. Venkatachalam, and B. Saravana Balaji
- Subjects
General Computer Science ,business.industry ,Computer science ,SOAP ,computer.internet_protocol ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,Hash function ,020206 networking & telecommunications ,02 engineering and technology ,computer.file_format ,computer.software_genre ,Machine learning ,Business Process Execution Language ,ebXML ,Web query classification ,0202 electrical engineering, electronic engineering, information engineering ,Table (database) ,020201 artificial intelligence & image processing ,The Internet ,Artificial intelligence ,Web service ,business ,computer
- Abstract
With the tremendous growth of the internet, the services provided through it are increasing day by day. For the adoption of web service techniques, several standards such as ebXML, SOAP, WSDL, UDDI, and BPEL have been proposed and approved by the W3C. Most web services operate on a query-response model. The user has to submit a query according to the adopted standard, and services nowadays support natural language queries. The inputs processed by the web services server can reveal a few similarities within sentences, such as nouns. The keywords for nouns are filtered accurately and saved in a list as a table for each domain. At the same time, the input query words are stored in the domain. The words stored in the domain are matched with the given input queries and later used to find the similarity between the queries. In this paper, an automated technique for finding web service similarity based on query classification is proposed. The proposed method adopts a machine learning approach, KNN, and the data are maintained in hash-indexed storage tables. As a result, the relationships between the input query and the stored database are reported in terms of precision, recall, F1-score, and support.
- Published
- 2020
6. Query expansion based on term selection for Hindi – English cross lingual IR
- Author
-
Ganesh Chandra and Sanjay K. Dwivedi
- Subjects
General Computer Science ,Computer science ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,02 engineering and technology ,Query optimization ,computer.software_genre ,Query language ,lcsh:QA75.5-76.95 ,Query expansion ,Okapi BM25 ,Web query classification ,0202 electrical engineering, electronic engineering, information engineering ,Cross-language information retrieval ,Information retrieval ,Web search query ,business.industry ,05 social sciences ,020201 artificial intelligence & image processing ,Sargable ,lcsh:Electronic computers. Computer science ,Artificial intelligence ,0509 other social sciences ,050904 information & library sciences ,business ,computer ,Natural language processing
- Abstract
Retrieving accurate information from the collections available on the web in a cross-lingual environment is a very difficult task. To retrieve information, the user specifies the information need in the form of a query. Sometimes the query cannot express the need specifically because of ambiguity or untranslated query words. This problem can be reduced by expanding the query with other suitable words that make it more specific. The purpose of query expansion (Q.E.) is to improve the performance and quality of retrieved information in cross-language information retrieval (CLIR). In this paper, Q.E. is explored for a Hindi-English CLIR in which Hindi queries are used to search English documents. We use Okapi BM25 for document ranking and then expand the translated queries using Term Selection Value (TSV). All experiments were performed on the FIRE 2012 dataset by analysing the impact of the occurrence of terms in the top 3 ranked documents. Our results show that the relevancy of retrieved results of Hindi-English CLIR using Q.E., performed by adding a lowest-frequency term from the corpus of the top 3 ranked documents, is 51.33%, which is higher than before and after Q.E. (i.e. Case1, Case2). Keywords: Okapi BM25, Term selection value (TSV), Query expansion, Information retrieval, Cross language information retrieval
- Published
- 2020
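Result 6 above ranks documents with Okapi BM25 before expanding the query. A minimal sketch of BM25 scoring follows. The parameter defaults (`k1=1.2`, `b=0.75`), the non-negative IDF variant, and the toy documents are assumptions chosen for illustration, not values from the paper, and the term-selection (TSV) step is not shown.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF with +1 inside the log so it never goes negative (an assumed variant).
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection, already tokenized.
docs = [
    ["query", "expansion", "for", "retrieval"],
    ["cross", "language", "retrieval", "of", "documents"],
    ["power", "quality", "signal"],
]
scores = bm25_scores(["query", "expansion", "retrieval"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

The top-ranked documents from such a ranking would be the pool from which expansion terms are then selected.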
7. Query Design Analysis
- Author
-
Grant Fritchey
- Subjects
Spatial query ,Information retrieval ,Web query classification ,Computer science ,View ,InformationSystems_DATABASEMANAGEMENT ,Online aggregation ,Query by Example ,Sargable ,Query optimization ,Query language ,computer ,computer.programming_language
- Abstract
A database schema may include a number of performance-enhancement features such as indexes, statistics, and stored procedures. But none of these features guarantees good performance if your queries are written badly in the first place. The SQL queries may not be able to use the available indexes effectively. The structure of the SQL queries may add avoidable overhead to the query cost. Queries may be attempting to deal with data in a row-by-row fashion (or to quote Jeff Moden, Row By Agonizing Row, which is abbreviated to RBAR and pronounced “reebar”) instead of in logical sets. To improve the performance of a database application, it is important to understand the cost associated with varying ways of writing a query.
- Published
- 2022
8. Performance Evaluation of Policy-Based SQL Query Classification for Data-Privacy Compliance
- Author
-
Peter K. Schwab, Maximilian S. Langohr, Klaus Meyer-Wegener, and Jonas Röckl
- Subjects
Metadata ,SQL ,Information privacy ,Computer science ,Web query classification ,ddc:000 ,Overhead (computing) ,Data mining ,Latency (engineering) ,computer.software_genre ,computer ,Graph ,computer.programming_language
- Abstract
Data science must respect privacy in many situations. We have built a query repository with automatic SQL query classification according to data-privacy directives. It can intercept queries that violate the directives, since a JDBC proxy driver inserted between the end-users’ SQL tooling and the target data consults the repository for the compliance of each query. Still, this slows down query processing. This paper presents two optimizations implemented to increase classification performance and describes a measurement environment that allows quantifying the induced performance overhead. We present measurement results and show that our optimized implementation significantly reduces classification latency. The query metadata (QM) is stored in both relational and graph-based databases. Whereas query classification can be done in a few ms on average using relational QM, a graph-based classification is orders of magnitude more expensive at 137 ms on average. However, the graphs contain more precise information, and thus in some cases the final decision requires checking them, too. Our optimizations considerably reduce the number of graph-based classifications and, thus, decrease the latency to 0.35 ms in 87% of the classification cases.
- Published
- 2021
9. CoST: An annotated Data Collection for Complex Search
- Author
-
Aline Chevalier, Jose G. Moreno, Cheyenne Dosso, and Lynda Tamine
- Subjects
Data collection ,Information retrieval ,Computer science ,05 social sciences ,Intelligent decision support system ,Information access ,Cognitive complexity ,02 engineering and technology ,Information science ,Domain (software engineering) ,Task (project management) ,Web query classification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,0509 other social sciences ,050904 information & library sciences
- Abstract
While great progress has been made in the area of information access, there are still open issues that involve designing intelligent systems supporting task-based search. Despite the importance of task-based search, the information retrieval and information science communities still lack open-ended and annotated datasets that enable the evaluation of a number of related facets of search tasks in downstream applications. Existing datasets are either sampled from large-scale logs but provide poor annotations, or sampled from lower-scale user studies but focus on ranked list evaluation. In this work, we present CoST: a novel richly annotated dataset for evaluating complex search tasks, collaboratively designed by researchers from the computer science and cognitive psychology domains, and intended to answer a wide range of research questions dealing with task-based search. CoST includes 5667 queries recorded in 630 task-based sessions that result from a user study involving 70 French native participants, each an expert in one of 3 domains of expertise (computer science, medicine, psychology). Each participant completed 15 tasks with 5 different types of cognitive complexity (fact-finding, exploratory learning, decision-making, problem-solving, multicriteria-inferential). In addition to search data (e.g., queries and clicks), CoST provides task and session-related data, task annotations and query annotations. We illustrate possible usages of CoST through the evaluation of query classification models and the understanding of the effect of task complexity and domain on users' search behavior.
- Published
- 2021
10. Securing Aggregate Queries for DNA Databases
- Author
-
Mohamed Nassar, Qutaibah M. Malluhi, Mikhail J. Atallah, and Abdullatif Shikfa
- Subjects
0303 health sciences ,Theoretical computer science ,Logical disjunction ,Database ,Computer Networks and Communications ,Computer science ,0206 medical engineering ,02 engineering and technology ,Range query (database) ,computer.software_genre ,Query optimization ,Computer Science Applications ,03 medical and health sciences ,Hardware and Architecture ,Web query classification ,Server ,Sargable ,Special case ,computer ,020602 bioinformatics ,Software ,Boolean conjunctive query ,030304 developmental biology ,Information Systems
- Abstract
This paper addresses the problem of sharing person-specific genomic sequences without violating the privacy of their data subjects to support large-scale biomedical research projects. The proposed method builds on the framework proposed by Kantarcioglu et al. [1] but extends the results in a number of ways. One improvement is that our scheme is deterministic, with zero probability of a wrong answer (as opposed to a low probability). We also provide a new operating point in the space-time tradeoff, by offering a scheme that is twice as fast as theirs but uses twice the storage space. This point is motivated by the fact that storage is cheaper than computation in current cloud computing pricing plans. Moreover, our encoding of the data makes it possible for us to handle a richer set of queries than exact matching between the query and each sequence of the database, including: (i) counting the number of matches between the query symbols and a sequence; (ii) logical OR matches where a query symbol is allowed to match a subset of the alphabet thereby making it possible to handle (as a special case) a “not equal to” requirement for a query symbol (e.g., “not a G”); (iii) support for the extended alphabet of nucleotide base codes that encompasses ambiguities in DNA sequences (this happens on the DNA sequence side instead of the query side); (iv) queries that specify the number of occurrences of each kind of symbol in the specified sequence positions (e.g., two ‘A’ and four ‘C’ and one ‘G’ and three ‘T’, occurring in any order in the query-specified sequence positions); (v) a threshold query whose answer is ‘yes’ if the number of matches exceeds a query-specified threshold (e.g., “7 or more matches out of the 15 query-specified positions”); (vi) for all query types, we can hide the answers from the decrypting server, so that only the client learns the answer; (vii) in all cases, the client deterministically learns only the query's answer, except for query type (v) where we quantify the (very small) statistical leakage to the client of the actual count.
- Published
- 2019
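To make query types (i), (ii) and (v) from result 10 above concrete, here is a plaintext sketch of their semantics only. It contains no encryption and says nothing about how the paper's secure scheme computes these answers. The function names, the query encoding (position mapped to a set of allowed symbols), and the sample sequence are all assumptions made for illustration.

```python
def count_matches(query, sequence):
    """Count query positions whose sequence symbol is allowed.

    query maps a 0-based position to a set of allowed symbols; a set with
    several symbols models a logical OR match, and a set holding every
    symbol except one models a "not equal to" requirement.
    """
    return sum(1 for pos, allowed in query.items() if sequence[pos] in allowed)


def threshold_query(query, sequence, threshold):
    """Answer True when at least `threshold` positions match ("7 or more" style)."""
    return count_matches(query, sequence) >= threshold


# Hypothetical sequence and query.
sequence = "ACGTAC"
query = {
    0: {"A"},            # exact match
    2: {"A", "C", "T"},  # "not a G"
    3: {"T", "G"},       # logical OR: T or G
    5: {"C"},            # exact match
}

matches = count_matches(query, sequence)
print(matches, threshold_query(query, sequence, 3))
```

Position 2 holds a G, so the "not a G" requirement fails there while the other three positions match.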
11. Research on power quality disturbance identification and classification technology in high noise background
- Author
-
Li Yonggang, Li Jianwen, Qin Gang, and Xiaofei Ruan
- Subjects
Computer science ,020209 energy ,Noise reduction ,020208 electrical & electronic engineering ,Energy Engineering and Power Technology ,02 engineering and technology ,Filter (signal processing) ,Time–frequency analysis ,Noise ,Tree structure ,Interference (communication) ,Control and Systems Engineering ,Web query classification ,0202 electrical engineering, electronic engineering, information engineering ,Noise control ,Electrical and Electronic Engineering ,Algorithm
- Abstract
In order to solve the problem of noise interference in power quality disturbance identification, a new method of multiresolution hyperbolic S-transform for noise abatement based on energy density is proposed. First, multiresolution hyperbolic S-transform is performed on the power quality disturbance signal. This method combines suitable time-frequency resolution with good noise suppression performances. Second, the transient disturbance time-frequency domain is determined according to the fluctuation energy density. By using a mean time-frequency filter, the interference of the noise in the non-transient disturbance time-frequency domain with signals is eliminated, and the noise in the non-signal time-frequency domain is suppressed. Then, the signal time-frequency domain is determined according to the energy density, and the noise in the non-signal time-frequency domain is further suppressed by the denoising time-frequency filter. The characteristic curve is extracted from the complex matrix module after the noise abatement. Finally, a time-frequency database of tree structure is established. The dynamic time warping distance query classification method is used for quick classification according to the relationship of membership degree, which reduces the number of queries and improves the recognition accuracy. The classification result shows the effectiveness of the algorithm in high noise background and the applicability in actual fields.
- Published
- 2019
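The final classification step in result 11 above relies on dynamic time warping (DTW) distance. The sketch below shows the standard DTW recurrence and a flat nearest-template lookup. The template curves and labels are hypothetical, and the paper's tree-structured database and membership-degree logic are not reproduced here.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] is matched to b[j-1] again
                cost[i][j - 1],      # b[j-1] is matched to a[i-1] again
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def classify(curve, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(curve, templates[label]))


# Hypothetical characteristic curves for two disturbance types.
templates = {
    "sag": [1, 1, 0.5, 0.5, 1, 1],
    "swell": [1, 1, 1.5, 1.5, 1, 1],
}
label = classify([1, 1, 1, 0.5, 0.5, 0.5, 1], templates)
print(label)
```

Because DTW lets one sample align with several, a dip that is longer or shifted in time can still match the "sag" template exactly.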
12. A Framework for Extracting Information from Semi-Structured Web Data Sources
- Author
-
Mahmoud Shaker, Hamidah Ibrahim, Lili Nurliyana Abdullah, and Aida Mustapha
- Subjects
medicine.medical_specialty ,Information retrieval ,Computer science ,computer.software_genre ,World Wide Web ,Information extraction ,Web query classification ,Web page ,medicine ,Web navigation ,Web mapping ,Web intelligence ,computer ,Web modeling ,Data Web
- Abstract
Nowadays, many users rely on web search engines to find and gather information. Users face an increasing number of varied semi-structured information sources, so correlating, integrating and presenting related information to them becomes important. When a user queries a search engine such as Yahoo or Google for specific information, the results contain not only information about the availability of the desired information but also information about other pages on which it is mentioned. The number of returned pages is enormous. Therefore, the performance, the overlap among results for the same queries, and the limitations of web search engines form an important and large area of research. Extracting information from web data sources has also become very important, because the massive and growing amount of diverse semi-structured information sources on the Internet and the variety of web pages make information extraction from the web a challenging problem. This paper proposes a framework for extracting, classifying and browsing semi-structured web data sources. The framework is able to extract relevant information from different web data sources and to classify the extracted information based on the standard classification of Nokia products.
- Published
- 2021
13. Modeling Across-Context Attention For Long-Tail Query Classification in E-commerce
- Author
-
Junhao Zhang, Weidi Xu, Hongbo Deng, Xi Chen, Keping Yang, and Jianhui Ji
- Subjects
Product category ,Information retrieval ,Computer science ,Context (language use) ,02 engineering and technology ,Task (computing) ,Search engine ,Web query classification ,020204 information systems ,Component (UML) ,Taxonomy (general) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Product (category theory)
- Abstract
Product query classification is the basic component for query understanding, which aims to classify the user queries into multiple categories under a predefined product category taxonomy for the E-commerce search engine. It is a challenging task due to the tremendous amount of product categories. And a slight modification to a query will change its corresponding categories entirely, e.g., appending the "button" to the query "shirt". The problem is more severe for the tail queries which lack enough supervision information from customers. Motivated by this phenomenon, this paper proposes to model the contrasting/similar relationships between such similar queries. Our framework is composed of a base model and an across-context attention module. The across-context attention module plays the role of deriving and extracting external information from these variant queries by predicting their categories. We conduct both offline and online experiments on the real-world E-commerce search engine. Experimental results demonstrate the effectiveness of our across-context attention module.
- Published
- 2021
14. Students Query Classification System
- Author
-
K Veeresh, S Nithish Kumar, M Sai Subhakar, and S Nithish Kumar
- Subjects
Information retrieval ,Computer science ,Web query classification ,Management of Technology and Innovation ,ComputingMilieux_COMPUTERSANDEDUCATION ,General Engineering ,Classification, Complaints, Departments, Machine Learning, TF-IDF (term frequency-inverse document frequency), Vectors
- Abstract
A university or educational institute generally receives a large number of complaints from students every day. The issues relate to academics, examinations, or other aspects of their education. This volume of daily complaints makes it difficult for the university to sort and classify them and to send them to the respective departments for resolution. In this project, we classify these complaints by the class or department they belong to. TF-IDF (term frequency-inverse document frequency) converts the complaints to vectors and identifies the terms most related to a specific document. By capturing keywords in the complaints, weighting them, and applying different machine learning classifiers, we classify each complaint based on these keywords. This classification makes the university's work easier, saves the time otherwise spent on sorting, and gives better service to students. Complaints can now be sent directly to the respective departments with ease.
- Published
- 2021
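Result 14 above turns complaints into TF-IDF vectors before classifying them. A minimal sketch of that weighting follows, using the plain `tf * log(N / df)` form. The exact TF-IDF variant used in the paper is not stated, and the sample complaints are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append(
            {t: (count / len(d)) * math.log(n / df[t]) for t, count in tf.items()}
        )
    return vectors


# Hypothetical complaints, tokenized on whitespace.
complaints = [
    "exam hall ticket not issued".split(),
    "exam results not published".split(),
    "hostel fee receipt not issued".split(),
]
vectors = tfidf_vectors(complaints)
print(vectors[0])
```

A word that occurs in every complaint ("not") gets weight zero, while a word unique to one complaint ("hall") gets the highest weight, which is what makes such terms useful as department-specific keywords.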
15. Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries
- Author
-
Benny Kimelfeld, Mirek Riedewald, Nikolaos Tziavelis, Nofar Carmeli, and Wolfgang Gatterbauer
- Subjects
FOS: Computer and information sciences ,Class (set theory) ,Theoretical computer science ,Selection (relational algebra) ,Computer science ,Databases (cs.DB) ,Data structure ,Task (project management) ,Decidability ,Computer Science - Databases ,Web query classification ,Lexicographic preferences ,Computer Science - Data Structures and Algorithms ,Conjunctive query ,Data Structures and Algorithms (cs.DS) ,Information Systems
- Abstract
We study the question of when we can provide direct access to the k-th answer to a Conjunctive Query (CQ) according to a specified order over the answers in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in database size. Specifically, we embark on the challenge of identifying the tractable answer orderings, that is, those orders that allow for such complexity guarantees. To better understand the computational challenge at hand, we also investigate the more modest task of providing access to only a single answer (i.e., finding the answer at a given position), a task that we refer to as the selection problem, and ask when it can be performed in quasilinear time. We also explore the question of when selection is indeed easier than ranked direct access. We begin with lexicographic orders. For each of the two problems, we give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orders for every CQ without self-joins. We then continue to the more general orders by the sum of attribute weights and establish the corresponding decidable characterizations, for each of the two problems, of the tractable CQs without self-joins. Finally, we explore the question of when the satisfaction of Functional Dependencies (FDs) can be utilized for tractability, and establish the corresponding generalizations of our characterizations for every set of unary FDs. (44 pages)
- Published
- 2020
16. Query Understanding via Intent Description Generation
- Author
-
Xueqi Cheng, Yanyan Lan, Yixing Fan, Jiafeng Guo, and Ruqing Zhang
- Subjects
FOS: Computer and information sciences ,Computer Science - Computation and Language ,Web search query ,Information retrieval ,Computer science ,02 engineering and technology ,SemEval ,Ranking (information retrieval) ,Task (computing) ,Web query classification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Relevance (information retrieval) ,Cluster analysis ,Computation and Language (cs.CL) ,Natural language
- Abstract
Query understanding is a fundamental problem in information retrieval (IR), which has attracted continuous attention through the past decades. Many different tasks have been proposed for understanding users' search queries, e.g., query classification or query clustering. However, understanding a search query at the intent class/cluster level is not very precise, because much detailed information is lost. As we may find in many benchmark datasets, e.g., TREC and SemEval, queries are often associated with a detailed description provided by human annotators which clearly describes its intent to help evaluate the relevance of the documents. If a system could automatically generate a detailed and precise intent description for a search query, like human annotators, that would indicate much better query understanding has been achieved. In this paper, therefore, we propose a novel Query-to-Intent-Description (Q2ID) task for query understanding. Unlike existing ranking tasks, which leverage the query and its description to compute the relevance of documents, Q2ID is a reverse task which aims to generate a natural language intent description based on both relevant and irrelevant documents of a given query. To address this new task, we propose a novel Contrastive Generation model, CtrsGen for short, to generate the intent description by contrasting the relevant documents with the irrelevant documents given a query. We demonstrate the effectiveness of our model by comparing it with several state-of-the-art generation models on the Q2ID task. We discuss the potential usage of the Q2ID technique through an example application. (Accepted as Long Research Paper in CIKM2020)
- Published
- 2020
17. Query Classification with Multi-objective Backoff Optimization
- Author
-
Hang Yu and Lester Litchfield
- Subjects
0209 industrial biotechnology ,Hierarchy (mathematics) ,Computer science ,Reliability (computer networking) ,media_common.quotation_subject ,02 engineering and technology ,computer.software_genre ,Search engine ,020901 industrial engineering & automation ,Categorization ,Web query classification ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Relevance (information retrieval) ,Quality (business) ,Data mining ,computer ,media_common
- Abstract
E-commerce platforms greatly benefit from high-quality search that retrieves relevant search results in response to search terms. For the sake of search relevance, Query Classification (QC) has been widely adopted to make search engines robust against low text quality and complex category hierarchies. Generally, QC solutions categorize search queries and direct users to the suggested categories, from which the search results are then retrieved. In this way, the search scope is contextually constrained to increase search relevance. However, such operations risk deteriorating e-commerce metrics when irrelevant categories are suggested. Thus, QC solutions are expected to demonstrate high accuracy. Unfortunately, existing QC methods mainly focus on the intrinsic performance of classifiers but fail to consider post-inference optimization that could further improve reliability. To fill this research gap, we propose Query Classification with Multi-objective Backoff (QCMB). The proposed solution consists of two steps: 1) hierarchical text classification that classifies search queries into multi-level categories; and 2) multi-objective backoff that substitutes potentially misclassified leaf categories with appropriate ancestors that optimize the trade-off between accuracy and depth. The proposed QCMB is evaluated using the real-world search data of Trade Me, the largest e-commerce platform in New Zealand. Compared with the benchmarks, QCMB delivers superior solutions with flexible tuning to satisfy different users' demands. To the best of our knowledge, this work is the first attempt to enhance QC with multi-objective optimization.
- Published
- 2020
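The backoff step described in the abstract above can be illustrated with a minimal sketch. It assumes a single confidence threshold rather than the multi-objective optimization the authors propose, and the function name `back_off`, the category names, and the confidence values are all hypothetical.

```python
from typing import Dict, List


def back_off(path: List[str], confidence: Dict[str, float], threshold: float) -> List[str]:
    """Return the deepest prefix of `path` whose last category meets the threshold.

    `path` runs from the root category to the predicted leaf. The root is
    always kept so that a query is never left without a category.
    """
    kept = list(path)
    # Walk upwards from the leaf, dropping categories the classifier is unsure about.
    while len(kept) > 1 and confidence.get(kept[-1], 0.0) < threshold:
        kept.pop()
    return kept


# Hypothetical prediction for one query (values invented for illustration).
path = ["Home & Living", "Furniture", "Office chairs"]
confidence = {"Home & Living": 0.97, "Furniture": 0.88, "Office chairs": 0.41}
print(back_off(path, confidence, threshold=0.8))
```

Raising the threshold trades depth for accuracy: the suggested category becomes more general but is less likely to be wrong.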
18. We Know What You Did Last Session
- Author
-
Klaus Meyer-Wegener, Maximilian S. Langohr, and Peter K. Schwab
- Subjects
050101 languages & linguistics ,SQL ,Information privacy ,Information retrieval ,Computer science ,05 social sciences ,02 engineering and technology ,Filter (signal processing) ,Session (web analytics) ,Compliance (psychology) ,Test (assessment) ,Web query classification ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,0501 psychology and cognitive sciences ,computer ,computer.programming_language ,Complement (set theory) - Abstract
This paper explains the demonstration of the DataEconomist, a framework for policy-based SQL query classification according to data-privacy directives. Our framework automatically derives query meta-information based on query-log analysis and provides user-friendly, graphical interfaces for browsing and filtering queries based on this meta-information. We aim to complement existing data-privacy approaches and enable privacy officers to define domain-specific compliance policy rules based on the graphical filter mechanisms. Policies automatically classify queries as compliant or non-compliant regarding their processing of personal data. During our demonstration, conference attendees assess our system in several scenarios. They filter queries based on various query meta-information, learn how to define compliance policies for automatic query classification without profound technical knowledge, and test this classification by formulating non-compliant queries.
- Published
- 2020
19. A Framework for DSL-Based Query Classification Using Relational and Graph-Based Data Models
- Author
-
Klaus Meyer-Wegener, Maximilian S. Langohr, and Peter K. Schwab
- Subjects
SQL ,Information retrieval ,Computer science ,Relational database ,05 social sciences ,Graph based ,02 engineering and technology ,Data modeling ,Task (computing) ,Digital subscriber line ,Web query classification ,0502 economics and business ,0202 electrical engineering, electronic engineering, information engineering ,Graph (abstract data type) ,050211 marketing ,020201 artificial intelligence & image processing ,computer ,computer.programming_language - Abstract
In this paper, we demonstrate a framework for DSL-based SQL query classification according to data-privacy directives. Based on query-log analysis, this framework automatically derives query meta-information (QMI) and provides interfaces for browsing and filtering queries based on this QMI. Domain-specific policy rules enable automatic classification of queries concerning their access to personal data. The generic policy-rule definition based on the QMI covers many syntactical SQL variations. To optimize classification performance, our framework stores the QMI both in relational and graph-based databases (DBs). This case study compares the behavior of a relational DB with that of a graph-based DB with respect to a particular task, namely searching for the policy rules applicable to a given query. It turned out that both solutions have their benefits, so a hybrid solution has been chosen in the end.
- Published
- 2020
20. Event-Related Query Classification with Deep Neural Networks
- Author
-
Behrooz Mansouri, Sahaj Gandhi, Ricardo Campos, and Adam Jatowt
- Subjects
business.industry ,Computer science ,Event (computing) ,SIGNAL (programming language) ,computer.software_genre ,Search engine ,Contextual design ,Recurrent neural network ,Web query classification ,Aperiodic graph ,The Internet ,Data mining ,business ,computer - Abstract
Users tend to search the Internet to get the most up-to-date news when an event occurs. Search engines should then be capable of effectively retrieving relevant documents for event-related queries. As previous studies have shown, different retrieval models are needed for different types of events. Therefore, the first step for improving effectiveness is identifying the event-related queries and determining their types. In this paper, we propose a novel model based on deep neural networks to classify event-related queries into four categories: periodic, aperiodic, one-time-only, and non-event. The proposed model combines recurrent neural networks (by feeding two LSTM layers with query frequencies) and visual recognition models (by transforming time-series data from a 1D signal to a 2D image - later passed to a CNN model) for effective query type estimation. Worth noting is that our method uses only the time-series data of query frequencies, without the need to resort to any external sources such as contextual data, which makes it language- and domain-independent with regard to the query issued. For evaluation, we build upon the previous datasets on event-related queries to create a new dataset that fits the purpose of our experiments. The obtained results show that our proposed model can achieve an F1-score of 0.87.
- Published
- 2020
21. Characterizing Robotic and Organic Query in SPARQL Search Sessions
- Author
-
Han Yang, Xinyue Zhang, Ruyang Liu, Meng Wang, Bingchen Zhao, and Jingyuan Zhang
- Subjects
050101 languages & linguistics ,Information retrieval ,Computer science ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,05 social sciences ,InformationSystems_DATABASEMANAGEMENT ,Usage analysis ,02 engineering and technology ,computer.file_format ,Research opportunities ,Query language ,computer.software_genre ,Partition (database) ,Knowledge graph ,Web query classification ,Scripting language ,0202 electrical engineering, electronic engineering, information engineering ,SPARQL ,020201 artificial intelligence & image processing ,0501 psychology and cognitive sciences ,computer - Abstract
SPARQL, as one of the most powerful query languages over knowledge graphs, has gained significant popularity in recent years. A large number of SPARQL query logs have become available, providing new research opportunities to discover user interests, understand query intentions, and model search behaviors. However, a significant portion of the queries to SPARQL endpoints on the Web are robotic queries that are generated by automated scripts. Detecting and separating these robotic queries from the organic ones issued by human users is crucial to deep usage analysis of knowledge graphs. In light of this, in this paper, we propose a novel method to identify SPARQL queries based on session-level query features. Specifically, we define and partition SPARQL queries into different sessions. Then, we design an algorithm to detect loop patterns, which are an important characteristic of robotic queries, in a given query session. Finally, we employ a pipeline method that leverages loop pattern features and query request frequency to distinguish the robotic and organic SPARQL queries. Differing from other machine learning based methods, the proposed method can identify the query types accurately without labelled data. We conduct extensive experiments on six real-world SPARQL query log datasets. The results demonstrate that our approach can distinguish robotic and organic queries effectively and needs only \(7.63 \times 10^{-4}\) s on average to process a query.
- Published
- 2020
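One way to picture the loop patterns mentioned in the abstract above is as runs of consecutive queries that differ only in their constants. The sketch below is not the authors' algorithm: it assumes a crude regex-based normalisation, and the helpers `template` and `longest_template_run` and the example session are hypothetical.

```python
import re
from typing import List, Optional


def template(query: str) -> str:
    """Replace IRIs, quoted literals and bare numbers with placeholders."""
    query = re.sub(r"<[^>]*>", "<IRI>", query)
    query = re.sub(r'"[^"]*"', '"LIT"', query)
    query = re.sub(r"\b\d+\b", "NUM", query)
    return " ".join(query.split())  # collapse whitespace


def longest_template_run(session: List[str]) -> int:
    """Length of the longest run of consecutive queries sharing one template."""
    best = run = 0
    previous: Optional[str] = None
    for query in session:
        current = template(query)
        run = run + 1 if current == previous else 1
        best = max(best, run)
        previous = current
    return best


# Hypothetical session: three queries from the same script, then one ad hoc query.
session = [
    'SELECT ?p WHERE { <http://example.org/a> ?p "x" }',
    'SELECT ?p WHERE { <http://example.org/b> ?p "y" }',
    'SELECT ?p WHERE { <http://example.org/c> ?p "z" }',
    "ASK { ?s ?p ?o }",
]
print(longest_template_run(session))
```

A long run combined with a high request frequency would be a signal of a robotic session; any cut-off for either feature would need to be tuned on real logs.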
22. Data Munging with Power Query
- Author
-
Dan Clark
- Subjects
Query expansion ,Information retrieval ,Web query classification ,Computer science ,View ,Sargable ,Query by Example ,Query language ,Query optimization ,computer ,computer.programming_language ,RDF query language - Abstract
Although Power Pivot provides many types of connections that you can use to query data, there are times when you need to clean and shape it (commonly called data munging) before loading it into the model. This is where Power Query really shines and is a very useful part of your BI arsenal. Power Query provides an easy-to-use interface for discovering and transforming data. It contains tools to clean and shape data such as removing duplicates, replacing values, and grouping data. In addition, it supports a vast array of data sources, both structured and unstructured, such as relational databases, web pages, and Hadoop, just to name a few. Once the data is extracted and transformed, you can then easily load it into a Power Pivot model.
- Published
- 2020
23. Automatic prediction of news intent for search queries
- Author
-
Wei Lu, Xiaojuan Zhang, and Shuguang Han
- Subjects
Information retrieval ,Point (typography) ,Computer science ,media_common.quotation_subject ,05 social sciences ,02 engineering and technology ,Dynamic web page ,Library and Information Sciences ,Computer Science Applications ,Search engine ,Originality ,Web query classification ,Similarity (psychology) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Social media ,0509 other social sciences ,Macro ,050904 information & library sciences ,media_common - Abstract
Purpose: The purpose of this paper is to predict news intent by exploring contextual and temporal features directly mined from a general search engine query log. Design/methodology/approach: First, a ground-truth data set with correctly marked news and non-news queries was built. Second, a detailed analysis of the search goals and topics distribution of news/non-news queries was conducted. Third, three news features, that is, the relationship between entity and contextual words extended from query sessions, topical similarity among clicked results and temporal burst point were obtained. Finally, to understand the utilities of the new features and prior features, extensive prediction experiments on SogouQ (a Chinese search engine query log) were conducted. Findings: News intent can be predicted with high accuracy by using the proposed contextual and temporal features, and the macro average F1 of classification is around 0.8677. Contextual features are more effective than temporal features. All the three new features are useful and significant in improving the accuracy of news intent prediction. Originality/value: This paper provides a new and different perspective in recognizing queries with news intent without use of such large corpora as social media (e.g. Wikipedia, Twitter and blogs) and news data sets. The research will be helpful for general-purpose search engines to address search intents for news events. In addition, the authors believe that the approaches described here in this paper are general enough to apply to other verticals with dynamic content and interest, such as blog or financial data.
- Published
- 2018
24. A Wikipedia powered state-based approach to automatic search query enhancement
- Author
-
Kyle Goslin and Markus Hofmann
- Subjects
Automatic Search Query Enhancement ,Computer science ,02 engineering and technology ,Library and Information Sciences ,Management Science and Operations Research ,Query optimization ,Query language ,computer.software_genre ,Query expansion ,Web query classification ,0202 electrical engineering, electronic engineering, information engineering ,Media Technology ,Callback ,Query Drift ,Web search query ,Information retrieval ,Computer Sciences ,05 social sciences ,Computer Science Applications ,Weighting ,Information Retrieval ,020201 artificial intelligence & image processing ,Sargable ,Data mining ,0509 other social sciences ,050904 information & library sciences ,computer ,Wikipedia ,Information Systems - Abstract
This paper describes the development and testing of a novel Automatic Search Query Enhancement (ASQE) algorithm, the Wikipedia N Sub-state Algorithm (WNSSA), which utilises Wikipedia as the sole data source for prior knowledge. This algorithm is built upon the concept of iterative states and sub-states, harnessing the power of Wikipedia’s data set and link information to identify and utilise reoccurring terms to aid term selection and weighting during enhancement. This algorithm is designed to prevent query drift by making callbacks to the user’s original search intent by persisting the original query between internal states with additional selected enhancement terms. The developed algorithm has been shown to improve both short and long queries by providing a better understanding of the query and available data. The proposed algorithm was compared against five existing ASQE algorithms that utilise Wikipedia as the sole data source, showing an average Mean Average Precision (MAP) improvement of 0.273 over the tested existing ASQE algorithms.
- Published
- 2018
25. Course Recommendation Based on Query Classification Approach
- Author
-
A. Anny Leema and Zameer Gulzar
- Subjects
Information retrieval ,Computer science ,E-learning (theory) ,05 social sciences ,050301 education ,02 engineering and technology ,Recommender system ,Ontology (information science) ,Computer Science Applications ,Education ,Course (navigation) ,Personalization ,Web query classification ,Computer software ,0202 electrical engineering, electronic engineering, information engineering ,Spite ,020201 artificial intelligence & image processing ,0503 education - Abstract
This article describes how, in non-formal education, a scholar has to choose courses from various domains to meet research aims. However, the large number of available courses makes selecting the appropriate course a tedious, time-consuming, and risky decision, and the course selection directly affects the performance of a scholar. The best approach to solve such problems and to produce desirable results is to use a “recommendation system.” Recommender systems at their core employ information retrieval techniques, and the ongoing effort of such information retrieval systems is to deliver the most relevant information to the learner. Therefore, if a recommender system is able to recognize the intent and requirements that a user expresses in the form of queries, it can generate more valid recommendations. This article presents an N-Gram classification technique which can be used to generate course recommendations for scholars depending on their requirements and domain of interest. This form of personalization can improve the quality of research and the learning experience by recommending courses that scholars would otherwise overlook, since it takes time to go through the curriculum and find the best possible match.
- Published
- 2018
26. Research on Web Page Classification Method Based on Query Log
- Author
-
Feiyue Ye and Yixing Ma
- Subjects
Multidisciplinary ,Information retrieval ,Web search query ,Computer science ,Static web page ,02 engineering and technology ,010501 environmental sciences ,Query optimization ,01 natural sciences ,Web query classification ,020204 information systems ,Web page ,0202 electrical engineering, electronic engineering, information engineering ,Vertical search ,Web log analysis software ,Web content ,0105 earth and related environmental sciences - Abstract
Web page classification is an important application in many fields of Internet information retrieval, such as providing directory classification and vertical search. Methods based on query logs, which are a lightweight form of Web page classification, avoid Web content crawling and are therefore relatively efficient, but the sparsity of user click data makes the logs difficult to use directly for constructing a classifier. To solve this problem, we explore the semantic relations among different queries through word embedding, and propose three improved graph structure classification algorithms. To reflect the semantic relevance between queries, we first map each user query into a low-dimensional space according to its query vector. Then, we calculate the uniform resource locator (URL) vector according to the relationship between the query and URL. Finally, we use the improved label propagation algorithm (LPA) and the bipartite graph expansion algorithm to classify the unlabeled Web pages. Experiments show that our methods improve F1-value by about 20% over other Web page classification methods based on query logs.
- Published
- 2018
27. Mining user queries with information extraction methods and linked data
- Author
-
Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza, and Anne Chardonnens
- Subjects
FOS: Computer and information sciences ,Web analytics ,Computer science ,02 engineering and technology ,Library and Information Sciences ,computer.software_genre ,Computer Science - Information Retrieval ,Named-entity recognition ,Web query classification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Digital Libraries (cs.DL) ,Technologies de l'information et de la communication (TIC) ,User query ,Knowledge bases ,Information retrieval ,Digital libraries ,business.industry ,End user ,05 social sciences ,Computer Science - Digital Libraries ,Linked data ,Query classification ,Object (computer science) ,Digital library ,Information extraction ,Cultural heritage ,0509 other social sciences ,050904 information & library sciences ,business ,computer ,Information Retrieval (cs.IR) ,Information Systems - Abstract
Purpose: Advanced usage of web analytics tools makes it possible to capture the content of user queries. Despite their relevant nature, the manual analysis of large volumes of user queries is problematic. The purpose of this paper is to address the problem of named entity recognition in digital library user queries. Design/methodology/approach: The paper presents a large-scale case study conducted at the Royal Library of Belgium in its online historical newspapers platform BelgicaPress. The object of the study is a data set of 83,854 queries resulting from 29,812 visits over a 12-month period. By making use of information extraction methods, knowledge bases (KBs) and various authority files, this paper presents the possibilities and limits of identifying what percentage of end users are looking for person and place names. Findings: Based on a quantitative assessment, the method can successfully identify the majority of person and place names from user queries. Due to the specific character of user queries and the nature of the KBs used, a limited number of queries remained too ambiguous to be treated in an automated manner. Originality/value: This paper demonstrates in an empirical manner how user queries can be extracted from a web analytics tool and how named entities can then be mapped with KBs and authority files, in order to facilitate automated analysis of their content. Methods and tools used are generalisable and can be reused by other collection holders.
- Published
- 2018
28. An efficient image retrieval system with structured query based feature selection and filtering initial level relevant images using range query
- Author
-
C. Seldev Christopher and J. Annrose
- Subjects
Normalization (statistics) ,SQL ,Range query (data structures) ,Computer science ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,Feature extraction ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Feature selection ,02 engineering and technology ,Content-based image retrieval ,Query optimization ,Ranking (information retrieval) ,Query expansion ,Web query classification ,0202 electrical engineering, electronic engineering, information engineering ,Visual Word ,Electrical and Electronic Engineering ,Image retrieval ,computer.programming_language ,Web search query ,business.industry ,Search engine indexing ,020207 software engineering ,Pattern recognition ,Atomic and Molecular Physics, and Optics ,Electronic, Optical and Magnetic Materials ,Euclidean distance ,Feature (computer vision) ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,Precision and recall ,computer ,Subspace topology - Abstract
Content Based Image Retrieval (CBIR) is a proficient way of storing, managing, indexing, searching, browsing, mining or retrieving images from a large image repository. Many researchers are competing to develop an efficient and precise image retrieval system with low time and space requirements. The proposed work introduces two techniques to reduce these space and time requirements. The first develops an efficient CBIR system by reducing the number of features to obtain an optimal feature subset, using SQL query based feature selection on the normalized feature set. The second uses an SQL range query to filter out initial-level relevant images; the Euclidean distance is then applied to refine the filtered subspace in order to obtain the most relevant images. Gray-level co-occurrence matrix, region based image descriptors and dominant color descriptor are used to extract the features. Elapsed time, retrieval precision and recall are the evaluation metrics used to compare performance with other image retrieval systems. The experiment was performed on the Corel dataset and shows superior performance over the previous systems.
- Published
- 2018
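The two-stage retrieval described in the abstract above (an SQL range query as a coarse filter, then Euclidean distance as a refinement) might look roughly like the following sketch. It assumes a toy two-dimensional feature table in an in-memory SQLite database; the table name, column names, file names, and feature values are invented for illustration and do not come from the paper.

```python
import math
import sqlite3
from typing import List, Tuple

# Hypothetical feature table: one row per image, two normalised features.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (image TEXT, f1 REAL, f2 REAL)")
conn.executemany(
    "INSERT INTO features VALUES (?, ?, ?)",
    [
        ("a.jpg", 0.10, 0.20),
        ("b.jpg", 0.15, 0.22),
        ("c.jpg", 0.90, 0.80),
        ("d.jpg", 0.12, 0.60),
    ],
)


def retrieve(query: Tuple[float, float], radius: float, top_k: int) -> List[str]:
    """Filter with a range query, then rank the survivors by Euclidean distance."""
    q1, q2 = query
    # Stage 1: cheap box filter executed inside the database.
    rows = conn.execute(
        "SELECT image, f1, f2 FROM features "
        "WHERE f1 BETWEEN ? AND ? AND f2 BETWEEN ? AND ?",
        (q1 - radius, q1 + radius, q2 - radius, q2 + radius),
    ).fetchall()
    # Stage 2: exact distance computed only on the filtered subset.
    ranked = sorted(rows, key=lambda row: math.dist(query, row[1:]))
    return [image for image, _, _ in ranked[:top_k]]


print(retrieve((0.11, 0.21), radius=0.1, top_k=2))
```

The range query keeps the distance computation away from most of the repository, which is where the time saving in such a design would come from.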
29. Rank fusion and semantic genetic notion based automatic query expansion model
- Author
-
Aditi Sharan and Jagendra Singh
- Subjects
Information retrieval ,Web search query ,General Computer Science ,Computer science ,General Mathematics ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,05 social sciences ,050301 education ,02 engineering and technology ,Query optimization ,computer.software_genre ,Term (time) ,Ranking (information retrieval) ,Query expansion ,Web query classification ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Sargable ,Data mining ,0503 education ,computer ,Boolean conjunctive query - Abstract
Query expansion term selection methods are important for improving the accuracy and efficiency of pseudo-relevance feedback based automatic query expansion in information retrieval systems, because they remove irrelevant and redundant terms from the corpus of top retrieved feedback documents with respect to the user query. Individual query expansion term selection methods have been widely investigated with the aim of improving their performance. However, it is always a challenging task to find an individual query expansion term selection method that would outperform other individual query expansion term selection methods in most cases. In this paper, first we explore the possibility of improving the overall performance using individual query expansion term selection methods. Second, we propose a model for combining multiple query expansion term selection methods by using a rank combination approach, called multiple ranks combination based query expansion. Third, semantic filtering is used to filter out semantically irrelevant terms obtained after combining multiple query expansion term selection methods, called ranks combination and semantic filtering based query expansion. Fourth, the genetic algorithm is used to make an optimal combination of query terms and candidate terms obtained after the rank combination and semantic filtering approach, called semantic genetic filtering and rank combination based query expansion. Our experimental results demonstrated that our proposed approaches achieved significant improvement over each individual query expansion term selection method and related state-of-the-art approaches.
- Published
- 2018
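The rank combination idea in the abstract above can be sketched with a Borda count, which is one common fusion rule and is used here purely as a stand-in. The abstract does not say which combination rule the authors use, and the function `borda_fuse` and the example term lists are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def borda_fuse(rankings: List[List[str]]) -> List[str]:
    """Fuse several ranked term lists: a term at position i of an n-item list earns n - i points."""
    scores: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, term in enumerate(ranking):
            scores[term] += n - position
    # Highest total first; ties broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda term: (-scores[term], term))


# Hypothetical candidate expansion terms from three term selection methods.
rankings = [
    ["engine", "search", "web"],
    ["search", "engine", "index"],
    ["search", "web", "engine"],
]
print(borda_fuse(rankings))
```

Terms that several methods rank highly rise to the top, while a term favoured by only one method ("index" here) sinks; later filtering stages would then operate on this fused list.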
30. Inducing and Refining Topics for Web Query Classification Using a Semantic Network
- Author
-
R. Sathish Kumar and M. Chandrasekaran
- Subjects
Computational Mathematics ,Information retrieval ,Web query classification ,Computer science ,Refining ,General Materials Science ,General Chemistry ,Electrical and Electronic Engineering ,Condensed Matter Physics ,Semantic network - Abstract
Web query classification, the task of inferring topical categories from a web search query, is a non-trivial problem in the Information Retrieval domain. The topic categories inferred by a Web query classification system may provide a rich set of features for improving query expansion and web advertising. Conventional methods for Web query classification derive corpus statistics from the web and employ machine-learning techniques to infer Open Directory Project categories. But they suffer from two major drawbacks: the computational overhead of deriving corpus statistics, and the inference of topic categories that are too abstract for semantic discrimination due to polysemy. Concepts too shallow or too deep in the semantic gradient are produced due to the wrong senses of the query terms coalescing with the correct senses. This paper proposes and demonstrates a succinct solution to these problems through a method based on the Tree Cut Model and the WordNet thesaurus to infer fine-grained topic categories for Web query classification, and also suggests an enhancement to the Tree Cut Model to resolve sense ambiguities.
- Published
- 2018
31. Query completion in community-based Question Answering search
- Author
-
Heyan Huang, Dan Wang, Xian-Ling Mao, and Yi-Jing Hao
- Subjects
Web search query ,Information retrieval ,Computer science ,Cognitive Neuroscience ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,Information needs ,02 engineering and technology ,Query optimization ,Query language ,Computer Science Applications ,Ranking (information retrieval) ,Query expansion ,Ranking ,Artificial Intelligence ,Web query classification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Question answering ,020201 artificial intelligence & image processing ,Sargable - Abstract
Query completion has long been proved useful to help a user explore and express his information need. In general search, such completions can be generated from a large scale query log and other accessory information. However, without query log, how to generate query completion for community-based Question Answering (cQA) search remains a challenging problem. In this work, we propose a novel query completion algorithm based on ranking cQA questions with entity and phrase information for cQA search, and a demonstration system has been developed. Without involvement of query log, this method clearly helps users complete their queries. Empirical experiments on a large scale cQA dataset show that the proposed algorithm can successfully improve user experience.
- Published
- 2018
32. A Prospect-Guided global query expansion strategy using word embeddings
- Author
-
Manuel Montes-y-Gómez, Francis C. Fernández-Reyes, and Jorge Hermosillo-Valadez
- Subjects
InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,Relevance feedback ,02 engineering and technology ,Library and Information Sciences ,Management Science and Operations Research ,Query optimization ,computer.software_genre ,Query language ,Ranking (information retrieval) ,Query expansion ,Web query classification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Media Technology ,Mathematics ,Information retrieval ,business.industry ,Computer Science Applications ,020201 artificial intelligence & image processing ,Sargable ,Artificial intelligence ,business ,computer ,Natural language processing ,Boolean conjunctive query ,Information Systems - Abstract
The effectiveness of query expansion methods depends essentially on identifying good candidates, or prospects, semantically related to query terms. Word embeddings have been used recently in an attempt to address this problem. Nevertheless, query disambiguation is still necessary: the semantic relatedness of each word in the corpus is modeled, but choosing the right terms for expansion from the standpoint of the un-modeled query semantics remains an open issue. In this paper we propose a novel query expansion method using word embeddings that models the global query semantics from the standpoint of prospect vocabulary terms. The proposed method makes it possible to explore query-vocabulary semantic closeness in such a way that new terms, semantically related to more relevant topics, are elicited and added as a function of the query as a whole. The method includes candidate pooling strategies that address disambiguation issues without using exogenous resources. We tested our method with three topic sets over CLEF corpora and compared it across different Information Retrieval models and against another expansion technique that also uses word embeddings. Our experiments indicate that our method achieves significant results that outperform the baselines, improving both recall and precision metrics without relevance feedback.
- Published
- 2018
33. Nearest close friend search in geo-social networks
- Author
-
Changbeom Shim, Sungmin Yi, Wan Heo, Wooil Kim, and Yon Dohn Chung
- Subjects
Information Systems and Management ,Social network ,Computer science ,business.industry ,Closeness ,02 engineering and technology ,Computer Science Applications ,Theoretical Computer Science ,World Wide Web ,Artificial Intelligence ,Control and Systems Engineering ,Web query classification ,020204 information systems ,Location-based service ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,business ,Software - Abstract
The proliferation of GPS-enabled devices has led to the development of location-based social network services such as Facebook, Twitter, and Foursquare. Users of these services not only make new friends but also post various content that contains their location. Although the existing services have continued to improve, they are still weak in handling some situations. If some users want to make a new friend, for example, they could manually search for potential friends among the acquaintances of their friends by considering both spatial proximity and social closeness one by one. However, conventional studies have not yet sufficiently tackled this problem. In this paper, we define a novel type of geo-social query called the k-Nearest -Close Friends query, which retrieves the k nearest data objects from among the -hop friends of the query user. We also propose three approaches for processing a k-NCF query: Neighboring Cell Search, Friend-Cell Search, and Personal-Cell Search. In addition, we develop an efficient method of index update for supporting dynamic environments. We conduct a variety of experiments on synthetic and real data sets to evaluate and compare our methods.
- Published
- 2018
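A brute-force reading of the query defined in the abstract above is: collect the friends reachable within a bounded number of hops, then keep the k that are spatially nearest. The sketch below follows that reading only and uses none of the three cell-based approaches the authors propose. The hop-limit symbol is missing from the abstract as listed, so the parameter name `max_hops`, the function names, and the toy graph and coordinates are all hypothetical.

```python
import math
from collections import deque
from typing import Dict, List, Set, Tuple


def friends_within_hops(graph: Dict[str, List[str]], user: str, max_hops: int) -> Set[str]:
    """Breadth-first search for everyone reachable from `user` in at most `max_hops` hops."""
    seen = {user}
    found: Set[str] = set()
    frontier = deque([(user, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(node, ()):
            if friend not in seen:
                seen.add(friend)
                found.add(friend)
                frontier.append((friend, depth + 1))
    return found


def nearest_close_friends(
    graph: Dict[str, List[str]],
    locations: Dict[str, Tuple[float, float]],
    user: str,
    k: int,
    max_hops: int,
) -> List[str]:
    """Return the k spatially nearest users among those within `max_hops` hops."""
    origin = locations[user]
    candidates = friends_within_hops(graph, user, max_hops)
    # Sort by distance, then by name so that ties are resolved deterministically.
    return sorted(candidates, key=lambda f: (math.dist(origin, locations[f]), f))[:k]


# Toy friendship graph and coordinates (invented for illustration).
graph = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}
locations = {"ann": (0, 0), "bob": (5, 0), "cat": (1, 1), "dan": (2, 0), "eve": (0.5, 0)}
print(nearest_close_friends(graph, locations, "ann", k=2, max_hops=2))
```

Here "eve" is the closest user spatially but is three hops away, so she is excluded; this interplay between social and spatial constraints is what makes index design for such queries non-trivial.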
34. Query personalization using social network information and collaborative filtering techniques
- Author
-
Panagiotis Georgiadis, Costas Vassilakis, and Dionisis Margaris
- Subjects
Information retrieval ,Web search query ,Social network ,Computer Networks and Communications ,Computer science ,business.industry ,02 engineering and technology ,Query language ,Query optimization ,Personalization ,Query expansion ,Hardware and Architecture ,Web query classification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Collaborative filtering ,020201 artificial intelligence & image processing ,business ,Software - Abstract
Query personalization has emerged as a means to handle the issue of information volume growth, aiming to tailor query answer results to match the goals and interests of each user. Query personalization dynamically enhances queries, based on information regarding user preferences or other contextual information; typically, enhancements involve adding conditions that filter out results deemed of low value to the user and/or ordering results so that data of high value are presented first. In the domain of personalization, social network information can prove valuable; users’ social network profiles, including their interests, influence from social friends, etc., can be exploited to personalize queries. In this paper, we present a query personalization algorithm, which employs collaborative filtering techniques and takes into account influence factors between social network users, leading to personalized results that are better-targeted to the user.
- Published
- 2018
35. A query privacy-enhanced and secure search scheme over encrypted data in cloud computing
- Author
-
Zheng Qin, Keqin Li, Lu Ou, and Hui Yin
- Subjects
020203 distributed computing ,Database ,Computer Networks and Communications ,Computer science ,business.industry ,Applied Mathematics ,020206 networking & telecommunications ,Cloud computing ,02 engineering and technology ,Construct (python library) ,Bloom filter ,computer.software_genre ,Query optimization ,Encryption ,Theoretical Computer Science ,Query expansion ,Computational Theory and Mathematics ,Web query classification ,0202 electrical engineering, electronic engineering, information engineering ,Sargable ,business ,computer - Abstract
With the emergence of cloud computing, secure search over encrypted cloud data has become an active research topic. Previous schemes offer weak query privacy because of the limitations of their query trapdoor generation mechanisms. In these schemes, a data owner usually knows the full query contents of data users, and a data user can also easily analyze the query contents of another data user. In some application scenarios, data users may be unwilling to leak their query privacy to anyone else. We propose a privacy-enhanced search scheme that allows the data user to generate a random query trapdoor every time. We leverage a Bloom filter and the bilinear pairing operation to construct a secure index for each data file, which enables the cloud to perform search without obtaining any useful information. We prove that our scheme is secure, and extensive experiments demonstrate the correctness and practicality of the proposed scheme.
- Published
- 2017
36. Joint Learning of Distance Metric and Query Model for Posteriorgram-Based Keyword Search
- Author
-
Bolaji Yusuf, Batuhan Gundogdu, and Murat Saraclar
- Subjects
Dynamic time warping ,Web search query ,Computer science ,business.industry ,Feature vector ,020206 networking & telecommunications ,02 engineering and technology ,Machine learning ,computer.software_genre ,Query optimization ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Query expansion ,Keyword density ,Web query classification ,Signal Processing ,0202 electrical engineering, electronic engineering, information engineering ,Artificial intelligence ,Electrical and Electronic Engineering ,0305 other medical science ,business ,Hidden Markov model ,computer ,Natural language processing - Abstract
In this paper, we propose a novel approach to keyword search (KWS) in low-resource languages, which provides an alternative method for retrieving the terms of interest, especially for the out of vocabulary (OOV) ones. Our system incorporates the techniques of query-by-example retrieval tasks into KWS and conducts the search by means of the subsequence dynamic time warping (sDTW) algorithm. For this, text queries are modeled as sequences of feature vectors and used as templates in the search. A Siamese neural network-based model is trained to learn a frame-level distance metric to be used in sDTW and the proper query model frame representations for this learned distance. Experiments conducted on Intelligence Advanced Research Projects Activity Babel Program's Turkish, Pashto, and Zulu datasets demonstrate the effectiveness of our approach. In each of the languages, the proposed system outperforms the large vocabulary continuous speech recognition (LVCSR) based baseline for OOV terms. Furthermore, the fusion of the proposed system with the baseline system provides an average relative actual term weighted value (ATWV) improvement of 13.9% on all terms and, more significantly, the fusion yields an average relative ATWV improvement of 154.5% on OOV terms. We show that this new method can be used as an alternative to conventional LVCSR-based KWS systems, or in combination with them, to achieve the goal of closing the gap between OOV and in-vocabulary retrieval performances.
- Published
- 2017
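The subsequence dynamic time warping (sDTW) search mentioned in the abstract above can be sketched as follows. This is a minimal textbook-style version, not the authors' system: the paper learns a frame-level distance with a Siamese network, whereas this sketch takes an arbitrary `dist` callable and is demonstrated with absolute difference on scalars. The function name and return convention are assumptions for illustration.

```python
def subsequence_dtw(query, sequence, dist):
    """Find the best-matching subsequence of `sequence` for `query`.

    Returns (cost, end_index), where end_index is the 0-based position in
    `sequence` at which the best alignment ends. The match may start at any
    position because the first row of the cost table is all zeros.
    """
    if not query or not sequence:
        raise ValueError("query and sequence must be non-empty")

    m = len(sequence)
    inf = float("inf")
    prev = [0.0] * (m + 1)  # row 0: starting anywhere is free
    for q_frame in query:
        cur = [inf] * (m + 1)  # column 0 stays inf: query frames need a match
        for j in range(1, m + 1):
            step = min(prev[j], cur[j - 1], prev[j - 1])
            cur[j] = dist(q_frame, sequence[j - 1]) + step
        prev = cur

    best_end = min(range(1, m + 1), key=lambda j: prev[j])
    return prev[best_end], best_end - 1


def abs_diff(a, b):
    """Stand-in distance on scalars (assumption; the paper learns its metric)."""
    return abs(a - b)


cost, end = subsequence_dtw([1, 2, 3], [0, 0, 1, 2, 3, 0], abs_diff)
print(cost, end)
```

In a keyword search setting, the query would be a sequence of feature vectors and `dist` a vector distance; the low-cost end positions are the candidate detections.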
37. A single quadtree-based algorithm for top-k spatial keyword query
- Author
-
Wan-Yu Tsai, Ge-Ming Chiu, and Hsiang-Jen Hong
- Subjects
Web search query ,Information retrieval ,Computer Networks and Communications ,Computer science ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,02 engineering and technology ,computer.software_genre ,Query optimization ,Query language ,Computer Science Applications ,Spatial query ,Query expansion ,Keyword density ,Hardware and Architecture ,Web query classification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Sargable ,Data mining ,computer ,Algorithm ,Software ,Information Systems - Abstract
Recent years have witnessed the generation of a massive amount of spatial-textual data. In view of this, a new type of query, the spatial keyword query, has been proposed to handle location-based services with an additional keyword constraint. This paper studies one of the most popular spatial keyword queries, the Top-k Spatial Keyword Query (TkSKQ). Specifically, given a set of objects, a TkSKQ finds the k objects that are closest to the querier, with each of these k objects satisfying all the keywords specified by the query. This kind of query is of paramount importance in a variety of application domains such as location-based recommendation and advertisement. The state-of-the-art algorithm for processing a TkSKQ is highly sensitive to the number of query keywords, and its performance degrades significantly as the number of keywords increases. To remedy this drawback, this paper proposes a novel mechanism that uses an additional keyword list to enhance the efficiency of the existing solution. Based on this indexing technique, our algorithm needs to traverse only a single quadtree when processing a TkSKQ. Moreover, we study how to prioritize the keywords in the vocabulary so as to optimize the performance of our technique. Furthermore, we deal with a generalized version of the TkSKQ problem, called HkSKQ. A similar technique can also be useful for solving HkSKQ. Experimental results on both synthetic and real data reveal the superiority of our proposed scheme.
- Published
- 2017
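The TkSKQ definition in the abstract above (the k closest objects that each contain all query keywords) can be made concrete with a brute-force sketch. This is only a reference baseline under assumed data structures; it does not use the quadtree or keyword-list index that the paper proposes. The dictionary layout, function name, and Euclidean distance are assumptions for illustration.

```python
import heapq
import math


def top_k_spatial_keyword(objects, query_point, query_keywords, k):
    """Return the k nearest objects whose keyword sets cover all query keywords."""
    required = set(query_keywords)
    qx, qy = query_point

    def distance(obj):
        return math.hypot(obj["x"] - qx, obj["y"] - qy)

    # Keep only objects satisfying every query keyword (subset test).
    matching = (obj for obj in objects if required <= obj["keywords"])
    return heapq.nsmallest(k, matching, key=distance)


# Toy data, invented for this example only.
objects = [
    {"name": "cafe_a", "x": 1.0, "y": 1.0, "keywords": {"coffee", "wifi"}},
    {"name": "cafe_b", "x": 0.5, "y": 0.0, "keywords": {"coffee"}},
    {"name": "cafe_c", "x": 3.0, "y": 0.0, "keywords": {"coffee", "wifi", "bakery"}},
    {"name": "shop_d", "x": 0.1, "y": 0.1, "keywords": {"books"}},
]

result = top_k_spatial_keyword(objects, (0.0, 0.0), {"coffee", "wifi"}, k=2)
print([obj["name"] for obj in result])
```

Every object is inspected once per query here, so the cost grows linearly with the data set; a spatial index lets a real system prune most objects without examining them.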
38. Using online data sources to make query suggestions for children
- Author
-
Yiu-Kai Ng and Maria Soledad Pera
- Subjects
Information retrieval ,Computer Networks and Communications ,Computer science ,05 social sciences ,050301 education ,02 engineering and technology ,Backpropagation ,World Wide Web ,Artificial Intelligence ,Web query classification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,0503 education ,Software - Published
- 2017
39. Privacy-Preserving Similarity Joins Over Encrypted Data
- Author
-
Cong Wang, Sarana Nutanong, Xinyu Wang, Chenyun Yu, and Xingliang Yuan
- Subjects
Web search query ,Computer Networks and Communications ,Computer science ,Nearest neighbor search ,Data security ,020206 networking & telecommunications ,02 engineering and technology ,Query optimization ,Query language ,computer.software_genre ,Data set ,Query expansion ,Web query classification ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Sargable ,Data mining ,Safety, Risk, Reliability and Quality ,computer - Abstract
Similarity search on high-dimensional data has been intensively studied for data processing and analytics. Despite its broad applicability, data security and privacy concerns arising from the trend of data outsourcing have not been fully addressed. In this paper, we investigate privacy-preserving similarity join queries, i.e., a pivotal primitive of similarity search that finds pairwise similar data points across two data sets. We start from locality-sensitive hashing and searchable symmetric encryption, i.e., the most practical techniques for similarity search and encrypted search, respectively. However, directly combining the two techniques discloses the distribution of the query set, which can be exploited to compromise the confidentiality of queries. To enhance security, we propose the frequency hiding query scheme, which allows the server to see only the flattened query distribution. To improve scalability, we further design the result sharing query scheme, which processes a small portion of query points and shares the results with other nearby points. In addition, we set up a strict constraint to carefully select query points to achieve “as-strong-as-possible” guarantees. We formalize the leakage functions in the context of similarity joins and conduct a rigorous security analysis. We implement and evaluate the proposed query schemes on the Azure cloud. Experimental results indicate that they offer different tradeoffs among security, efficiency, and accuracy, and can be used flexibly for different deployment scenarios.
- Published
- 2017
40. Knowledge-infused and consistent Complex Event Processing over real-time and persistent streams
- Author
-
Viktor K. Prasanna, Yogesh Simmhan, and Qunzhi Zhou
- Subjects
FOS: Computer and information sciences ,Online and offline ,Computer Networks and Communications ,Computer science ,Relational database ,Distributed computing ,Big data ,Complex event processing ,02 engineering and technology ,Query language ,computer.software_genre ,Query optimization ,68U35 ,H.3.4 ,Computer Science - Databases ,H.2.4 ,Web query classification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Semantic Web ,business.industry ,Data stream mining ,Databases (cs.DB) ,Knowledge base ,Hardware and Architecture ,Analytics ,Scalability ,020201 artificial intelligence & image processing ,Sargable ,Data mining ,business ,computer ,Software - Abstract
Emerging applications in Internet of Things (IoT) and Cyber-Physical Systems (CPS) present novel challenges to Big Data platforms for performing online analytics. Ubiquitous sensors from IoT deployments are able to generate data streams at high velocity, that include information from a variety of domains, and accumulate to large volumes on disk. Complex Event Processing (CEP) is recognized as an important real-time computing paradigm for analyzing continuous data streams. However, existing work on CEP is largely limited to relational query processing, exposing two distinctive gaps for query specification and execution: (1) infusing the relational query model with higher level knowledge semantics, and (2) seamless query evaluation across temporal spaces that span past, present and future events. These allow accessible analytics over data streams having properties from different disciplines, and help span the velocity (real-time) and volume (persistent) dimensions. In this article, we introduce a Knowledge-infused CEP (X-CEP) framework that provides domain-aware knowledge query constructs along with temporal operators that allow end-to-end queries to span across real-time and persistent streams. We translate this query model to efficient query execution over online and offline data streams, proposing several optimizations to mitigate the overheads introduced by evaluating semantic predicates and in accessing high-volume historic data streams. The proposed X-CEP query model and execution approaches are implemented in our prototype semantic CEP engine, SCEPter. We validate our query model using domain-aware CEP queries from a real-world Smart Power Grid application, and experimentally analyze the benefits of our optimizations for executing these queries, using event streams from a campus-microgrid IoT deployment. Comment: 34 pages, 16 figures, accepted in Future Generation Computer Systems, October 27, 2016.
- Published
- 2017
41. A NOVEL APPROACH OF INDEXING AND RETRIEVING SPATIAL POLYGONS FOR EFFICIENT SPATIAL REGION QUERIES
- Author
-
J. H. Zhao, X. Z. Wang, F. Y. Wang, Z. H. Shen, Y. C. Zhou, and Y. L. Wang
- Subjects
lcsh:Applied optics. Photonics ,Web search query ,Information retrieval ,Range query (data structures) ,lcsh:T ,Computer science ,Spatial database ,lcsh:TA1501-1820 ,020206 networking & telecommunications ,02 engineering and technology ,Query optimization ,computer.software_genre ,lcsh:Technology ,Spatial query ,Query expansion ,lcsh:TA1-2040 ,Web query classification ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Sargable ,Data mining ,lcsh:Engineering (General). Civil engineering (General) ,computer - Abstract
Spatial region queries are increasingly used in web-based applications, so mechanisms that provide efficient query processing over geospatial data are essential. However, due to the massive geospatial data volume, heavy geometric computation, and high access concurrency, it is difficult to get a response in real time. Spatial indexes are usually used in this situation. In this paper, based on the k-d tree, we introduce a distributed KD-Tree (DKD-Tree) suitable for polygon data, and a two-step query algorithm. The spatial index construction is recursive and iterative, and the query is an in-memory process. Both the index and query methods can be processed in parallel, and are implemented on HDFS, Spark, and Redis. Experiments on a large volume of remote sensing image metadata have been carried out, and the advantages of our method are investigated by comparison with spatial region queries executed on PostgreSQL and PostGIS. Results show that our approach not only greatly improves the efficiency of spatial region queries but also scales well. Moreover, the two-step spatial range query algorithm can also save cluster resources to support a large number of concurrent queries. Therefore, this method is very useful when building large geographic information systems.
- Published
- 2017
42. Improved continuous query plan with cluster weighted dominant querying in synthetic datasets
- Author
-
C. Suresh Gnana Dhas and M. Madhankumar
- Subjects
Computer Networks and Communications ,Computer science ,InformationSystems_DATABASEMANAGEMENT ,020206 networking & telecommunications ,02 engineering and technology ,Query optimization ,computer.software_genre ,Query plan ,Spatial query ,Query expansion ,Web query classification ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Sargable ,Data mining ,Tuple ,computer ,Software ,Boolean conjunctive query - Abstract
The arrival of voluminous sets of continuous queries for a given query leads to insignificant insights. Certain data tuples are eliminated in order to balance the system load: the streaming query removes improper data tuples and uses proper data tuples in the form of defined tables or sets. However, major drawbacks arise from unbounded streaming and inadequate access to end data. Because of these constraints, many stream processing methods make the processed data unavailable to applications or to the related queries of neighborhood branches. This paper avoids such problems when processing data tuples at query generation. The study uses a streaming model that executes effective query plans over continuous data. The streaming model aims to reduce the communication cost and improve the scalability of continuous aggregation queries. It subdivides the client query and executes it over data aggregators within the incoherent limit. A weighted dominant query algorithm is formulated to provide the top dominant value in each sub-query cluster. This reduces the computation cost in synthetic databases. The experimental results show that the proposed model with the weighted dominant query algorithm effectively improves scalability by reducing the computational cost.
- Published
- 2017
43. Semantic Extension of Query for the Linked Data
- Author
-
Zhilei Yin, Ju Wang, Pu Li, and Yuncheng Jiang
- Subjects
Information retrieval ,Computer Networks and Communications ,Computer science ,InformationSystems_DATABASEMANAGEMENT ,02 engineering and technology ,Linked data ,computer.file_format ,Query language ,Web query classification ,020204 information systems ,Semantic computing ,0202 electrical engineering, electronic engineering, information engineering ,SPARQL ,020201 artificial intelligence & image processing ,Semantic Web Stack ,RDF ,computer ,Information Systems ,RDF query language ,computer.programming_language - Abstract
With the advent of the Big Data era, users prefer to get knowledge rather than pages from the Web. Linked Data, a new form of knowledge representation and publishing described by RDF, can provide a more precise and comprehensible semantic structure to satisfy this requirement. Further, the SPARQL query language for RDF is the foundation of much current research on Linked Data querying. However, these SPARQL-based methods cannot fully express the semantics of the query, so they cannot unleash the potential of Linked Data. To fill this gap, this paper designs a new querying method that extends the SPARQL pattern. First, the authors present some new semantic properties for predicates in RDF triples and design a Semantic Matrix for Predicates (SMP). They then establish a well-defined framework for the notion of a Semantically-Extended Query Model for the Linked Data (SEQMLD). Moreover, the authors propose some novel algorithms for executing queries by integrating semantic extension into the SPARQL pattern. Lastly, experimental results show that the authors' proposal generalizes well and performs better than some of the most representative similarity search methods.
- Published
- 2017
44. Efficient Retrieval of Bounded-Cost Informative Routes
- Author
-
Jiannong Cao, Shuigeng Zhou, Wengen Li, Man Lung Yiu, and Jihong Guan
- Subjects
Information retrieval ,Web search query ,Computer science ,02 engineering and technology ,Query optimization ,computer.software_genre ,Query language ,Electronic mail ,Computer Science Applications ,Constraint (information theory) ,Data set ,Query expansion ,Computational Theory and Mathematics ,Web query classification ,Bounded function ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Pruning (decision trees) ,Data mining ,computer ,Information Systems - Abstract
The widespread location-aware applications produce a vast amount of spatio-textual data that contains both spatial and textual attributes. To make use of this enriched information for users to describe their preferences for travel routes, we propose a Bounded-Cost Informative Route (BCIR) query to retrieve the routes that are the most textually relevant to the user-specified query keywords subject to a travel cost constraint. BCIR query is particularly helpful for tourists and city explorers to plan their travel routes. We will show that BCIR query is an NP-hard problem. To answer BCIR query efficiently, we propose an exact solution with effective pruning techniques and two approximate solutions with performance guarantees. Extensive experiments over real data sets demonstrate that the proposed solutions achieve the expected performance.
- Published
- 2017
45. Characterizing, predicting, and handling web search queries that match very few or no results
- Author
-
Roi Blanco, Rifat Ozcan, B. Barla Cambazoglu, Erdem Sarigil, Özgür Ulusoy, Ismail Sengor Altingovde, Ulusoy, Özgür, and Sarıgil, Erdem
- Subjects
Information Systems and Management ,Information retrieval ,Web search query ,Computer Networks and Communications ,Computer science ,05 social sciences ,02 engineering and technology ,Range query (database) ,Library and Information Sciences ,computer.software_genre ,Query language ,Spatial query ,Query expansion ,Search engine ,Web query classification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Queries per second ,Data mining ,0509 other social sciences ,050904 information & library sciences ,computer ,Information Systems - Abstract
A non‐negligible fraction of user queries end up with very few or even no matching results in leading commercial web search engines. In this work, we provide a detailed characterization of such queries and show that search engines try to improve such queries by showing the results of related queries. Through a user study, we show that these query suggestions are usually perceived as relevant. Also, through a query log analysis, we show that users are dissatisfied at least 88.5% of the time after submitting a query that matches no results. As a first step towards solving these no‐answer queries, we devised a large number of features that can be used to identify such queries and built machine‐learning models. These models can be useful for scenarios such as mobile‐ or meta‐search, where identifying a query that will retrieve no results at the client device (i.e., even before submitting it to the search engine) may yield gains in terms of bandwidth usage, power consumption, and/or monetary costs. Experiments over query logs indicate that, despite the heavy skew in class sizes, our models achieve good prediction quality, with accuracy (in terms of area under the curve) up to 0.95.
- Published
- 2017
46. A continuous reverse skyline query processing scheme for multimedia data sharing in mobile environments
- Author
-
Kyoungsoo Bok, Jaesoo Yoo, and Jongtae Lim
- Subjects
Skyline ,Web search query ,Multimedia ,Computer Networks and Communications ,Computer science ,InformationSystems_DATABASEMANAGEMENT ,020207 software engineering ,02 engineering and technology ,computer.software_genre ,Query optimization ,Data sharing ,Query expansion ,Hardware and Architecture ,Web query classification ,Location-based service ,0202 electrical engineering, electronic engineering, information engineering ,Media Technology ,Sargable ,Mobile device ,computer ,Software - Abstract
Recently, various query processing schemes in mobile environments have been studied. In particular, the reverse skyline query, a variation of the skyline query, has been receiving much attention for multimedia data. However, existing reverse skyline query processing schemes do not consider the mobility of devices. In this paper, we propose a continuous reverse skyline query processing scheme that considers the mobility of mobile devices. The proposed scheme uses a pruning method to remove the devices that do not affect a query and continuously monitors the areas of candidate devices to update the query result incrementally.
- Published
- 2017
47. A novel framework to facilitate personalized web search in a dual mode
- Author
-
J. Selvakumar and J. Jayanthi
- Subjects
Information retrieval ,Web search query ,User profile ,Query string ,Computer Networks and Communications ,Computer science ,Search analytics ,User modeling ,Information needs ,02 engineering and technology ,Ranking (information retrieval) ,World Wide Web ,Search engine ,Query expansion ,Ranking ,Web query classification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Software - Abstract
Nowadays, web search engines provide good services in terms of retrieving and presenting information to the user. A foremost difficulty in the modern and ever-growing web is the lack of adaptation to user interests in the web search process. All users are presented with the same set of search engine result pages (SERPs) for a given input query string, since the search is keyword based. The limitations of keyword-based search are (i) uncertain user needs and (ii) improper query selection. If a programmer searches for the query “switch”, it refers to the switch statement of a programming language, whereas for an electrical engineer the context of the search is the physical household switch component. In addition, a user may fall short in choosing the query that best articulates their information need. Hence, it is evident that keyword searches have a tough time distinguishing the user context of a query. A typical approach to this challenge is a personalized web search strategy, where the results are retrieved based on the user's interests and preferences. The three major search modules are: (i) building user profiles, (ii) re-ranking the SERPs in personal mode, and (iii) re-ranking the SERPs in group mode. The proposed work contributes to the field of user profile construction and personalized page ranking. A new method of user model representation, termed Preference Network, is constructed. The proposed system can work in both initialization and maintenance mode to build a new model or update an existing one. Both short-term and long-term interests are used to rank the SERPs. The user interest score and group interest score are computed dynamically.
- Published
- 2017
48. DYNAMIC QUERY FORMS FOR HANDLING RANK BASED DATABASE QUERIES
- Author
-
K.Sai Prasad
- Subjects
Web search query ,Information retrieval ,business.industry ,View ,Computer science ,Online aggregation ,Pattern recognition ,Query language ,Query optimization ,Spatial query ,Web query classification ,Conjunctive query ,Artificial intelligence ,business - Published
- 2017
49. Decision fusion-based approach for content-based image classification
- Author
-
Rik Das, Sudeep D. Thepade, and Saurav Ghosh
- Subjects
General Computer Science ,Contextual image classification ,Computer science ,business.industry ,Feature vector ,Feature extraction ,020207 software engineering ,Pattern recognition ,Linear classifier ,02 engineering and technology ,Machine learning ,computer.software_genre ,Automatic image annotation ,Web query classification ,Feature (computer vision) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Visual Word ,Artificial intelligence ,business ,computer - Abstract
Purpose: Current practices in data classification and retrieval have experienced a surge in the use of multimedia content. Identifying desired information in huge image databases has made designing an efficient feature extraction process increasingly complex. Conventional approaches to image classification with text-based image annotation have faced assorted limitations due to erroneous interpretation of vocabulary and the large amount of time consumed by manual annotation. Content-based image recognition has emerged as an alternative to combat these limitations. However, exploring rich feature content in an image with a single technique is less likely to extract meaningful signatures than multi-technique feature extraction. Therefore, the purpose of this paper is to explore the possibilities of enhanced content-based image recognition by fusion of classification decisions obtained using diverse feature extraction techniques. Design/methodology/approach: Three novel techniques of feature extraction have been introduced in this paper and have been tested with four different classifiers individually. The four classifiers used for performance testing were the K nearest neighbor (KNN) classifier, the RIDOR classifier, the artificial neural network classifier, and the support vector machine classifier. Thereafter, classification decisions obtained using the KNN classifier for different feature extraction techniques have been integrated by Z-score normalization and feature scaling to create a fusion-based framework of image recognition. This has been followed by the introduction of a fusion-based retrieval model to validate the retrieval performance with a classified query. Earlier works on content-based image identification have adopted a fusion-based approach. However, to the best of the authors’ knowledge, fusion-based query classification has been addressed for the first time as a precursor of retrieval in this work.
Findings: The proposed fusion techniques have outperformed the state-of-the-art techniques in classification and retrieval performance. Four public data sets, namely, the Wang data set, the Oliva and Torralba (OT-scene) data set, the Corel data set, and the Caltech data set, comprising 22,615 images in total, are used for evaluation. Originality/value: To the best of the authors’ knowledge, fusion-based query classification has been addressed for the first time as a precursor of retrieval in this work. The novel idea of exploring rich image features by fusion of multiple feature extraction techniques has also encouraged further research on dimensionality reduction of feature vectors for enhanced classification results.
- Published
- 2017
50. Document Language Models, Query Models, and Risk Minimization for Information Retrieval
- Author
-
ChengXiang Zhai and John Lafferty
- Subjects
Computer science ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,02 engineering and technology ,Query language ,Query optimization ,Management Information Systems ,Ranking (information retrieval) ,Query expansion ,Web query classification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Data control language ,Query by Example ,Document retrieval ,computer.programming_language ,Web search query ,Concept search ,Information retrieval ,05 social sciences ,Ranking ,Hardware and Architecture ,Sargable ,Language model ,0509 other social sciences ,050904 information & library sciences ,computer ,Boolean conjunctive query ,RDF query language - Abstract
We present a framework for information retrieval that combines document models and query models using a probabilistic ranking function based on Bayesian decision theory. The framework suggests an operational retrieval model that extends recent developments in the language modeling approach to information retrieval. A language model for each document is estimated, as well as a language model for each query, and the retrieval problem is cast in terms of risk minimization. The query language model can be exploited to model user preferences, the context of a query, synonymy and word senses. While recent work has incorporated word translation models for this purpose, we introduce a new method using Markov chains defined on a set of documents to estimate the query models. The Markov chain method has connections to algorithms from link analysis and social networks. The new approach is evaluated on TREC collections and compared to the basic language modeling approach and vector space models together with query expansion using Rocchio. Significant improvements are obtained over standard query expansion methods for strong baseline TF-IDF systems, with the greatest improvements attained for short queries on Web data.
- Published
- 2017