37 results for "Jefrey Lijffijt"
Search Results
2. Covering the Egonet: A Crowdsourcing Approach to Social Circle Discovery on Twitter
- Author
-
Karmen Dykstra, Jefrey Lijffijt, and Aristides Gionis
- Abstract
Twitter and other social media provide the functionality of manually grouping users into lists. The goal is to enable selective viewing of content and easier information acquisition. However, creating lists manually requires significant time and effort. To mitigate this effort, a number of recent methods attempt to create lists automatically using content and/or network structure, but results are far from perfect. In this work, we study the power of the millions of lists already created by other Twitter users in order to “crowdsource” the task of list creation. We find that in a large dataset, collected specifically for this study, an optimal matching of existing lists from other Twitter users to the ground-truth lists in egonets gives an F1 score of 0.43, while the best existing method achieves only 0.21. We explore the informativeness of features derived from network structure, existing lists, and posted content. We observe that different types of features are informative for different lists, and introduce a simple algorithm for ranking candidate lists. The proposed algorithm outperforms existing methods, but still falls short of the optimal selection of existing lists.
- Published
- 2021
3. Evaluating Representation Learning and Graph Layout Methods for Visualization
- Author
-
Edith Heiter, Bo Kang, Tijl De Bie, Jefrey Lijffijt, and Mike Potel
- Subjects
Benchmark testing, Technology and Engineering, Layout, network embedding, Machine learning algorithms, Empirical Research, Computer Graphics and Computer-Aided Design, t-SNE, Representation learning, Boosting, Computational efficiency, Machine Learning, Benchmarking, Research Design, graph visualization, Prediction algorithms, Software, Algorithms, dimensionality reduction
- Abstract
Graphs and other structured data have come to the forefront in machine learning over the past few years due to the efficacy of novel representation learning methods boosting the prediction performance in various tasks. Representation learning methods embed the nodes in a low-dimensional real-valued space, enabling the application of traditional machine learning methods on graphs. These representations have been widely presumed to be suited also for graph visualization. However, no benchmarks or encompassing studies on this topic exist. We present an empirical study comparing several state-of-the-art representation learning methods with two recent graph layout algorithms, using readability and distance-based measures as well as the link prediction performance. Generally, no method consistently outperformed the others across quality measures. The graph layout methods provided qualitatively superior layouts when compared to representation learning methods. Embedding graphs in a higher-dimensional space and applying t-distributed stochastic neighbor embedding for visualization improved the preservation of local neighborhoods, albeit at substantially higher computational cost.
- Published
- 2022
4. Mining explainable local and global subgraph patterns with surprising densities
- Author
-
Junning Deng, Bo Kang, Tijl De Bie, and Jefrey Lijffijt
- Subjects
Theoretical computer science, Computer Networks and Communications, Computer science, Graph summarization, Of the form, 02 engineering and technology, Graph, Computer Science Applications, Homogeneous, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Partition (number theory), 020201 artificial intelligence & image processing, Special case, Prior information, Information Systems, Clustering coefficient
The connectivity structure of graphs is typically related to the attributes of the vertices. In social networks for example, the probability of a friendship between any pair of people depends on a range of attributes, such as their age, residence location, workplace, and hobbies. The high-level structure of a graph can thus possibly be described well by means of patterns of the form ‘the subgroup of all individuals with certain properties X are often (or rarely) friends with individuals in another subgroup defined by properties Y’, ideally relative to their expected connectivity. Such rules present potentially actionable and generalizable insight into the graph. Prior work has already considered the search for dense subgraphs (‘communities’) with homogeneous attributes. The first contribution in this paper is to generalize this type of pattern to densities between a pair of subgroups, as well as between all pairs from a set of subgroups that partition the vertices. Second, we develop a novel information-theoretic approach for quantifying the subjective interestingness of such patterns, by contrasting them with prior information an analyst may have about the graph’s connectivity. We demonstrate empirically that in the special case of dense subgraphs, this approach yields results that are superior to the state-of-the-art. Finally, we propose algorithms for efficiently finding interesting patterns of these different types.
- Published
- 2020
5. Adversarial Robustness of Probabilistic Network Embedding for Link Prediction
- Author
-
Xi Chen, Bo Kang, Jefrey Lijffijt, and Tijl De Bie
- Published
- 2021
6. Machine Learning and Knowledge Discovery in Databases
- Author
-
Jefrey Lijffijt
- Published
- 2021
7. Artificial Intelligence and Machine Learning
- Author
-
Mitra Baratchi, Lu Cao, Frank W. Takes, Walter A. Kosters, Jefrey Lijffijt, and Jan N. van Rijn
- Subjects
Coronavirus disease 2019 (COVID-19), business.industry, Computer science, Volume (computing), Artificial intelligence, business, Machine learning, computer.software_genre, computer, Game theory, Selection (genetic algorithm)
This book contains a selection of the best papers of the 32nd Benelux Conference on Artificial Intelligence, BNAIC/Benelearn 2020, held in Leiden, The Netherlands, in November 2020. Due to the COVID-19 pandemic the conference was held online. The 12 papers presented in this volume were carefully reviewed and selected from 41 regular submissions. They address various aspects of artificial intelligence such as natural language processing, agent technology, game theory, problem solving, machine learning, human-agent interaction, AI and education, and data analysis.
- Published
- 2021
8. Machine Learning and Knowledge Discovery in Databases
- Author
-
Frank Hutter, Isabel Valera, Kristian Kersting, and Jefrey Lijffijt
- Subjects
Artificial architecture, Commonsense knowledge, Computer science, business.industry, Open Knowledge Base Connectivity, Rule-based system, Marketing and artificial intelligence, Mathematical knowledge management, Machine learning, computer.software_genre, Data science, Knowledge extraction, Software mining, Artificial intelligence, business, computer
- Published
- 2021
9. Machine Learning and Knowledge Discovery in Databases
- Author
-
Isabel Valera, Kristian Kersting, Jefrey Lijffijt, and Frank Hutter
- Subjects
World Wide Web, Knowledge extraction, Computer science
- Published
- 2021
10. History of English as punctuated equilibria? A meta-analysis of the rate of linguistic change in Middle English
- Author
-
Terttu Nevalainen, Turo Vartiainen, Aatu Liimatta, Jefrey Lijffijt, and Tanja Säily
- Subjects
Norman Conquest, 050101 languages & linguistics, Linguistics and Language, History, Language Change Database, Language change, Punctuated equilibrium, media_common.quotation_subject, corpus linguistics, Black Death, Languages and Literatures, 050105 experimental psychology, Language and Linguistics, methods, History of English, Corpus linguistics, Historical linguistics, 6121 Languages, 0501 psychology and cognitive sciences, Pace, media_common
In this paper, we explore the rate of language change in the history of English. Our main focus is on detecting periods of accelerated change in Middle English (1150–1500), but we also compare the Middle English data with the Early Modern period (1500–1700) in order to establish a longer diachrony for the pace at which English has changed over time. Our study is based on a meta-analysis of existing corpus research, which is made available through a new linguistic resource, the Language Change Database (LCD). By aggregating the rates of 44 individual changes, we provide a critical assessment of how well the theory of punctuated equilibria (Dixon, Robert M. W. 1997. The rise and fall of languages. Cambridge: Cambridge University Press) fits with our results. More specifically, by comparing the rate of language change with major language-external events, such as the Norman Conquest and the Black Death, we provide the first corpus-based meta-analysis of whether these events, which had significant societal consequences, also had an impact on the rate of language change. Our results indicate that major changes in the rate of linguistic change in the late medieval period could indeed be connected to the social and cultural after-effects of the Norman Conquest. We also make a methodological contribution to the field of English historical linguistics: by re-using data from existing research, linguists can start to ask new, fundamental questions about the ways in which language change progresses.
- Published
- 2020
11. FONDUE: Framework for Node Disambiguation Using Network Embeddings
- Author
-
Ahmad Mel, Tijl De Bie, Bo Kang, and Jefrey Lijffijt
- Subjects
Theoretical computer science ,Computer science ,Node (networking) ,media_common.quotation_subject ,02 engineering and technology ,Range (mathematics) ,Identification (information) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Benchmark (computing) ,020201 artificial intelligence & image processing ,Quality (business) ,Downstream (networking) ,Feature learning ,Biological network ,media_common - Abstract
Real-world data often presents itself in the form of a network. Examples include social networks, citation networks, biological networks, and knowledge graphs. In their simplest form, networks represent real-life entities (e.g. people, papers, proteins, concepts) as nodes, and describe them in terms of their relations with other entities by means of edges between these nodes. This can be valuable for a range of purposes from the study of information diffusion to bibliographic analysis, bioinformatics research, and question-answering. The quality of networks is often problematic though, affecting downstream tasks. This paper focuses on the common problem where a node in the network in fact corresponds to multiple real-life entities. In particular, we introduce FONDUE, an algorithm based on network embedding for node disambiguation. Given a network, FONDUE identifies nodes that correspond to multiple entities, for subsequent splitting. Extensive experiments on fourteen benchmark datasets demonstrate that FONDUE is substantially and uniformly more accurate for ambiguous node identification compared to the existing state-of-the-art with a lower computational cost, while less optimal for determining the best way to split ambiguous nodes.
- Published
- 2020
12. Block-Approximated Exponential Random Graphs
- Author
-
Florian Adriaens, Alexandru Mara, Jefrey Lijffijt, Tijl De Bie, Webb, G., Zhang, Z., Tseng, V. S., Williams, G., Vlachos, M., and Cao, L.
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Technology and Engineering, Computer science, Machine Learning (stat.ML), 02 engineering and technology, 01 natural sciences, Machine Learning (cs.LG), Global information, 010104 statistics & probability, Statistics - Machine Learning, P-ASTERISK MODELS, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 0101 mathematics, exponential random graphs, link prediction, Clustering coefficient, Network model, network modeling, Random graph, Social and Information Networks (cs.SI), Assortativity, Computer Science - Social and Information Networks, maximum entropy models, Exponential function, Mathematics and Statistics, Scalability, Embedding, Algorithm
An important challenge in the field of exponential random graphs (ERGs) is the fitting of non-trivial ERGs on large graphs. By utilizing fast matrix block-approximation techniques, we propose an approximative framework for fitting such non-trivial ERGs that results in dyadic independence (i.e., edge-independent) distributions, while being able to meaningfully model both local information of the graph (e.g., degrees) as well as global information (e.g., clustering coefficient, assortativity, etc.) if desired. This allows one to efficiently generate random networks with similar properties as an observed network, and the models can be used for several downstream tasks such as link prediction. Our methods are scalable to sparse graphs consisting of millions of nodes. Empirical evaluation demonstrates competitiveness in terms of both speed and accuracy with state-of-the-art methods -- which are typically based on embedding the graph into some low-dimensional space -- for link prediction, showcasing the potential of a more direct and interpretable probabilistic model for this task. Accepted for the DSAA 2020 conference.
- Published
- 2020
13. CSNE: Conditional Signed Network Embedding
- Author
-
Alexandru Mara, Tijl De Bie, Jefrey Lijffijt, and Yoosof Mashayekhi
- Subjects
Social and Information Networks (cs.SI), FOS: Computer and information sciences, Computer Science - Machine Learning, Technology and Engineering, Theoretical computer science, Computer science, Principle of maximum entropy, Probabilistic logic, Context (language use), Computer Science - Social and Information Networks, Machine Learning (stat.ML), 02 engineering and technology, Function (mathematics), Machine Learning (cs.LG), Statistics - Machine Learning, 020204 information systems, Prior probability, 0202 electrical engineering, electronic engineering, information engineering, Embedding, 020201 artificial intelligence & image processing, Sign (mathematics)
Signed networks are mathematical structures that encode positive and negative relations between entities such as friend/foe or trust/distrust. Recently, several papers studied the construction of useful low-dimensional representations (embeddings) of these networks for the prediction of missing relations or signs. Existing network embedding methods for sign prediction, however, generally enforce different notions of status or balance theories in their optimization function. These theories are often inaccurate or incomplete, which negatively impacts method performance. In this context, we introduce conditional signed network embedding (CSNE). Our novel probabilistic approach models structural information about the signs in the network separately from fine-grained detail. Structural information is represented in the form of a prior, while the embedding itself is used for capturing fine-grained information. These components are then integrated in a rigorous manner. CSNE's accuracy depends on the existence of sufficiently powerful structural priors for modelling signed networks, currently unavailable in the literature. Thus, as a second main contribution, which we find to be highly valuable in its own right, we also introduce a novel approach to construct priors based on the Maximum Entropy (MaxEnt) principle. These priors can model the polarity of nodes (the degree to which their links are positive) as well as signed triangle counts (a measure of the degree to which structural balance holds in a network). Experiments on a variety of real-world networks confirm that CSNE outperforms the state-of-the-art on the task of sign prediction. Moreover, the MaxEnt priors on their own, while less accurate than full CSNE, achieve accuracies competitive with the state-of-the-art at very limited computational cost, thus providing an excellent runtime-accuracy trade-off in resource-constrained situations.
- Published
- 2020
14. Gibbs Sampling Subjectively Interesting Tiles
- Author
-
Tijl De Bie, Jefrey Lijffijt, Anes Bendimerad, Céline Robardet, Marc Plantevit, Berthold, MR, Feelders, A, and Krempl, G
- Subjects
Technology and Engineering, Computer science, 02 engineering and technology, KNOWLEDGE DISCOVERY, computer.software_genre, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Interpretation (model theory), Local pattern, Set (abstract data type), symbols.namesake, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Gibbs sampling, Pattern mining, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Pattern sampling, [INFO.INFO-DB]Computer Science [cs]/Databases [cs.DB], Subjective interestingness, business.industry, Trawling, Usability, Mathematics and Statistics, Large set (Ramsey theory), symbols, 020201 artificial intelligence & image processing, Data mining, Computational problem, business, computer
The local pattern mining literature has long struggled with the so-called pattern explosion problem: the size of the set of patterns found exceeds the size of the original data. This causes computational problems (enumerating a large set of patterns will inevitably take a substantial amount of time) as well as problems for interpretation and usability (trawling through a large set of patterns is often impractical). Two complementary research lines aim to address this problem. The first aims to develop better measures of interestingness, in order to reduce the number of uninteresting patterns that are returned [6, 10]. The second aims to avoid an exhaustive enumeration of all 'interesting' patterns (where interestingness is quantified in a more traditional way, e.g. frequency), by directly sampling from this set in a way that more 'interesting' patterns are sampled with higher probability [2]. Unfortunately, the first research line does not reduce computational cost, while the second may miss out on the most interesting patterns. In this paper, we combine the best of both worlds for mining interesting tiles [8] from binary databases. Specifically, we propose a new pattern sampling approach based on Gibbs sampling, where the probability of sampling a pattern is proportional to its subjective interestingness [6], an interestingness measure reported to better represent true interestingness. The experimental evaluation confirms the theory, but also reveals an important weakness of the proposed approach, which we speculate is shared with any other pattern sampling approach. We thus conclude with a broader discussion of this issue, and a forward look.
- Published
- 2020
15. SICA: subjectively interesting component analysis
- Author
-
Raul Santos-Rodriguez, Tijl De Bie, Jefrey Lijffijt, and Bo Kang
- Subjects
Technology and Engineering, Information theory, Information retrieval, Subjective interestingness, Computer Networks and Communications, Computer science, Dimensionality reduction, FORSIED, 02 engineering and technology, Construct (python library), Computer Science Applications, Term (time), Exploratory data mining, Data point, 020204 information systems, Principal component analysis, Outlier, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Representation (mathematics), Information Systems
The information in high-dimensional datasets is often too complex for human users to perceive directly. Hence, it may be helpful to use dimensionality reduction methods to construct lower-dimensional representations that can be visualized. The natural question that arises is: how do we construct the most informative low-dimensional representation? We study this question from an information-theoretic perspective and introduce a new method for linear dimensionality reduction. The obtained model that quantifies the informativeness also allows us to flexibly account for prior knowledge a user may have about the data. This enables us to provide representations that are subjectively interesting. We call the method Subjectively Interesting Component Analysis (SICA) and expect it is mainly useful for iterative data mining. SICA is based on a model of a user’s belief state about the data. This belief state is used to search for surprising views. The initial state is chosen by the user (it may be empty up to the data format) and is updated automatically as the analysis progresses. We study several types of prior beliefs: if a user only knows the scale of the data, SICA yields the same cost function as Principal Component Analysis (PCA), while if a user expects the data to have outliers, we obtain a variant that we term t-PCA. Finally, scientifically more interesting variants are obtained when a user has more complicated beliefs, such as knowledge about similarities between data points. The experiments suggest that SICA enables users to find subjectively more interesting representations.
- Published
- 2018
16. Discovering Interesting Cycles in Directed Graphs
- Author
-
Aristides Gionis, Florian Adriaens, Jefrey Lijffijt, Cigdem Aslay, and Tijl De Bie
- Subjects
FOS: Computer and information sciences ,Mathematical optimization ,Polynomial ,Computer science ,Heuristic ,02 engineering and technology ,Directed graph ,Computational Complexity (cs.CC) ,Measure (mathematics) ,Graph ,Set (abstract data type) ,Computer Science - Computational Complexity ,020204 information systems ,Computer Science - Data Structures and Algorithms ,0202 electrical engineering, electronic engineering, information engineering ,Data Structures and Algorithms (cs.DS) ,020201 artificial intelligence & image processing - Abstract
Cycles in graphs often signify interesting processes. For example, cyclic trading patterns can indicate inefficiencies or economic dependencies in trade networks, cycles in food webs can identify fragile dependencies in ecosystems, and cycles in financial transaction networks can be an indication of money laundering. Identifying such interesting cycles, which can also be constrained to contain a given set of query nodes, although not extensively studied, is thus a problem of considerable importance. In this paper, we introduce the problem of discovering interesting cycles in graphs. We first address the problem of quantifying the extent to which a given cycle is interesting for a particular analyst. We then show that finding cycles according to this interestingness measure is related to the longest cycle and maximum mean-weight cycle problems (in the unconstrained setting) and to the maximum Steiner cycle and maximum mean Steiner cycle problems (in the constrained setting). A complexity analysis shows that finding interesting cycles is NP-hard, and is NP-hard to approximate within a constant factor in the unconstrained setting, and within a factor polynomial in the input size for the constrained setting. The latter inapproximability result implies a similar result for the maximum Steiner cycle and maximum mean Steiner cycle problems. Motivated by these hardness results, we propose a number of efficient heuristic algorithms. We verify the effectiveness of the proposed methods and demonstrate their practical utility on two real-world use cases: a food web and an international trade-network dataset. Accepted for CIKM'19.
- Published
- 2019
17. SIMIT: Subjectively Interesting Motifs in Time Series
- Author
-
Jefrey Lijffijt, Junning Deng, Tijl De Bie, and Bo Kang
- Subjects
Theoretical computer science, Computer science, General Physics and Astronomy, lcsh:Astrophysics, 02 engineering and technology, Information theory, Measure (mathematics), Article, Synthetic data, 020204 information systems, lcsh:QB460-466, Subsequence, 0202 electrical engineering, electronic engineering, information engineering, Constraint programming, Time series, lcsh:Science, motif detection, information theory, exploratory data mining, Series (mathematics), Solver, lcsh:QC1-999, subjective interestingness, lcsh:Q, 020201 artificial intelligence & image processing, time series, lcsh:Physics, pattern mining
Numerical time series data are pervasive, originating from sources as diverse as wearable devices, medical equipment, and sensors in industrial plants. In many cases, time series contain interesting information in terms of subsequences that recur in approximate form, so-called motifs. Major open challenges in this area include how one can formalize the interestingness of such motifs and how the most interesting ones can be found. We introduce a novel approach that tackles these issues. We formalize the notion of such subsequence patterns in an intuitive manner and present an information-theoretic approach for quantifying their interestingness with respect to any prior expectation a user may have about the time series. The resulting interestingness measure is thus a subjective measure, enabling a user to find motifs that are truly interesting to them. Although finding the best motif appears computationally intractable, we develop relaxations and a branch-and-bound approach implemented in a constraint programming solver. As shown in experiments on synthetic data and two real-world datasets, this enables us to mine interesting patterns in small or mid-sized time series.
- Published
- 2019
18. SuMoTED: An intuitive edit distance between rooted unordered uniquely-labelled trees
- Author
-
Cédric Mesnage, Matt McVicar, Jefrey Lijffijt, Tijl De Bie, Eirini Spyropoulou, and Benjamin Sach
- Subjects
Focus (computing), Technology and Engineering, Theoretical computer science, Tree edit distance, ALGORITHMS, 020207 software engineering, 02 engineering and technology, Measure (mathematics), Tree (data structure), Tree structure, Artificial Intelligence, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Scalable algorithms, 020201 artificial intelligence & image processing, Edit distance, Computer Vision and Pattern Recognition, CONSENSUS, Time complexity, Software, Taxonomies, Mathematics
Defining and computing distances between tree structures is a classical area of study in theoretical computer science, with practical applications in the areas of computational biology, information retrieval, text analysis, and many others. In this paper, we focus on rooted, unordered, uniquely-labelled trees such as taxonomies and other hierarchies. For trees such as these, we introduce the intuitive concept of a ‘local move’ operation as an atomic edit of a tree. We then introduce SuMoTED, a new edit distance measure between such trees, defined as the minimal number of local moves required to convert one tree into another. We show how SuMoTED can be computed using a scalable algorithm with quadratic time complexity. Finally, we demonstrate its use on a collection of music genre taxonomies.
- Published
- 2016
19. Quantifying and Minimizing Risk of Conflict in Social Networks
- Author
-
Tijl De Bie, Xi Chen, and Jefrey Lijffijt
- Subjects
Technology and Engineering, Social network, business.industry, Computer science, conflict, disagreement measures, Polarization (politics), 020206 networking & telecommunications, 02 engineering and technology, Social networks, Risk analysis (engineering), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Organizational structure, business, controversy
Controversy, disagreement, conflict, polarization and opinion divergence in social networks have been the subject of much recent research. In particular, researchers have addressed the question of how such concepts can be quantified given people’s prior opinions, and how they can be optimized by influencing the opinion of a small number of people or by editing the network’s connectivity. Here, rather than optimizing such concepts given a specific set of prior opinions, we study whether they can be optimized in the average case and in the worst case over all sets of prior opinions. In particular, we derive the worst-case and average-case conflict risk of networks, and we propose algorithms for optimizing these. For some measures of conflict, these are non-convex optimization problems with many local minima. We provide a theoretical and empirical analysis of the nature of some of these local minima, and show how they are related to existing organizational structures. Empirical results show how a small number of edits quickly decreases a network’s conflict risk, both average-case and worst-case. Furthermore, they show that minimizing average-case conflict risk often does not reduce worst-case conflict risk. Minimizing worst-case conflict risk, on the other hand, while computationally more challenging, is generally effective at minimizing both worst-case and average-case conflict risk.
- Published
- 2018
20. Interactive Visual Data Exploration with Subjective Feedback: An Information-Theoretic Approach
- Author
-
Emilia Oikarinen, Kai Puolamäki, Bo Kang, Tijl De Bie, Jefrey Lijffijt, and Department of Computer Science
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Technology and Engineering, Information theory, Computer Networks and Communications, Computer science, Computer Science - Information Theory, Machine Learning (stat.ML), 02 engineering and technology, Machine learning, computer.software_genre, Machine Learning (cs.LG), Set (abstract data type), Exploratory data analysis, Data visualization, Statistics - Machine Learning, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Use case, NONLINEAR DIMENSIONALITY REDUCTION, Representation (mathematics), Maximum entropy distribution, ta113, Data exploration, Subjective interestingness, Syntax (programming languages), business.industry, Dimensionality reduction, Information Theory (cs.IT), Construct (python library), 113 Computer and information sciences, FIT, Computer Science Applications, Maximum entropy probability distribution, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Information Systems
Visual exploration of high-dimensional real-valued datasets is a fundamental task in exploratory data analysis (EDA). Existing methods use predefined criteria to choose the representation of data. There is a lack of methods that (i) elicit from the user what she has learned from the data and (ii) show patterns that she does not know yet. We construct a theoretical model where identified patterns can be input as knowledge to the system. The knowledge syntax here is intuitive, such as "this set of points forms a cluster", and requires no knowledge of maths. This background knowledge is used to find a Maximum Entropy distribution of the data, after which the system provides the user data projections in which the data and the Maximum Entropy distribution differ the most, hence showing the user aspects of the data that are maximally informative given the user's current knowledge. We provide an open source EDA system with tailored interactive visualizations to demonstrate these concepts. We study the performance of the system and present use cases on both synthetic and real data. We find that the model and the prototype system allow the user to learn information efficiently from various data sources and the system works sufficiently fast in practice. We conclude that the information theoretic approach to exploratory data analysis where patterns observed by a user are formalized as constraints provides a principled, intuitive, and efficient basis for constructing an EDA system.
- Published
- 2017
- Full Text
- View/download PDF
21. Subjectively Interesting Connecting Trees
- Author
-
Tijl De Bie, Florian Adriaens, Jefrey Lijffijt, Michelangelo Ceci, Jaakko Hollmén, Ljupčo Todorovski, and Celine Vens
- Subjects
Technology and Engineering ,Information theory ,Theoretical computer science ,Subjective interestingness ,Exploratory Data Mining ,business.industry ,Computer science ,Heuristic ,Node (networking) ,02 engineering and technology ,Telecommunications network ,Set (abstract data type) ,Graph pattern mining ,Tree (data structure) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,The Internet ,Heuristics ,business ,Graphs - Abstract
Consider a large network, and a user-provided set of query nodes between which the user wishes to explore relations. For example, a researcher may want to connect research papers in a citation network, an analyst may wish to connect organized crime suspects in a communication network, or an internet user may want to organize their bookmarks given their location in the world wide web. A natural way to show how query nodes are related is in the form of a tree in the network that connects them. However, in sufficiently dense networks, most such trees will be large or somehow trivial (e.g. involving high-degree nodes) and thus not insightful. In this paper, we define and investigate the new problem of mining subjectively interesting trees connecting a set of query nodes in a network, i.e., trees that are highly surprising to the specific user at hand. Using information-theoretic principles, we formalize the notion of interestingness of such trees mathematically, taking into account any prior beliefs the user has specified about the network. We then propose heuristic algorithms to find the best trees efficiently, given a specified prior belief model. Modeling the user's prior belief state is, however, not necessarily computationally tractable. Yet, we show how a highly generic class of prior beliefs, namely about individual node degrees in combination with the density of particular sub-networks, can be dealt with in a tractable manner. Such beliefs can be used to model knowledge of a partial or total order of the network nodes, e.g. where the nodes represent events in time (such as papers in a citation network). An empirical validation of our methods on a large real network evaluates the different heuristics and validates the interestingness of the resulting trees.
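The heuristics themselves are driven by the subjective interestingness model described above, but the core step of connecting query nodes with a tree can be illustrated with a standard shortest-path attachment heuristic. The sketch below is an illustration, not the authors' algorithm; `adj` and `connecting_tree` are hypothetical names, and `adj` is assumed to be an adjacency-list dictionary:

```python
from collections import deque

def connecting_tree(adj, query):
    """Greedy heuristic: grow a tree that spans `query` by repeatedly
    attaching the nearest not-yet-covered query node via a shortest path."""
    query = list(query)
    tree_nodes = {query[0]}
    tree_edges = set()
    remaining = set(query[1:])
    while remaining:
        # Multi-source BFS from every node already in the tree.
        parent = {v: None for v in tree_nodes}
        q = deque(tree_nodes)
        hit = None
        while q and hit is None:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u
                    if v in remaining:
                        hit = v
                        break
                    q.append(v)
        if hit is None:
            raise ValueError("query nodes are not connected in the network")
        # Walk the BFS parent chain back, grafting the path onto the tree.
        v = hit
        while parent[v] is not None:
            tree_edges.add((parent[v], v))
            tree_nodes.add(v)
            v = parent[v]
        remaining -= tree_nodes
    return tree_edges
```

Each iteration grafts on the shortest path to the nearest uncovered query node; a subjectively interesting tree miner would additionally score candidate trees against the user's prior belief model rather than by path length alone.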
- Published
- 2017
22. Hierarchical Novelty Detection
- Author
-
Matthew J McVicar, Raul Santos-Rodriguez, Tijl De Bie, Paolo Simeone, and Jefrey Lijffijt
- Subjects
Optimization ,Point (typography) ,Hierarchy (mathematics) ,Music genre classification ,Computer science ,business.industry ,Music information retrieval ,Node (networking) ,Novelty ,Pattern recognition ,02 engineering and technology ,Hierarchical classification ,Novelty detection ,ComputingMethodologies_PATTERNRECOGNITION ,Data point ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Artificial intelligence ,Layer (object-oriented design) ,business - Abstract
Hierarchical classification is commonly defined as multi-class classification where the classes are hierarchically nested. Many practical hierarchical classification problems also share features with multi-label classification (i.e., each data point can have any number of labels, even non-hierarchically related ones) and novelty detection (i.e., some data points are novelties at some level of the hierarchy). A further complication is that training data are commonly incompletely labelled, e.g. the most specific labels are not always provided. In music genre classification, for example, there are numerous music genres (multi-class) which are hierarchically related. Songs can belong to different (even non-nested) genres (multi-label), and a song labelled as Rock may not belong to any of its sub-genres, such that it is a novelty within this genre (novelty detection). Finally, the training data may label a song as Rock whereas it could more correctly be labelled with the more specific genre Blues Rock. In this paper we develop a new method for hierarchical classification that naturally accommodates all of these properties. To achieve this, we model the task as a Hierarchical Novelty Detection problem that can be trained by solving a single convex second-order cone programming problem. This contrasts with most existing approaches, which typically require a model to be trained for each layer or internal node in the label hierarchy. Empirical results on a music genre classification problem are reported, comparing with a state-of-the-art method as well as simple benchmarks.
- Published
- 2017
23. A Constrained Randomization Approach to Interactive Visual Data Exploration with Subjective Feedback
- Author
-
Bo Kang, Tijl De Bie, Jefrey Lijffijt, Kai Puolamäki, Department of Computer Science, and Helsinki Institute for Information Technology
- Subjects
Clustering high-dimensional data ,Technology and Engineering ,Computer science ,Reactive power ,data randomization ,LINES ,02 engineering and technology ,Synthetic data ,Field (computer science) ,Data modeling ,Tools ,Exploratory data mining ,Data visualization ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Information system ,NONLINEAR DIMENSIONALITY REDUCTION ,Data mining ,Visualization ,dimensionality reduction ,Information retrieval ,business.industry ,Data models ,Computational modeling ,113 Computer and information sciences ,FIT ,Computer Science Applications ,Projection (relational algebra) ,Computational Theory and Mathematics ,subjective interestingness ,business ,Information Systems - Abstract
Data visualization and iterative/interactive data mining are attracting rapidly growing attention, in both research and industry. However, while there is a plethora of advanced data mining methods and a wealth of work on visualization, integrated methods that combine advanced visualization and/or interaction with data mining techniques in a principled way are rare. We present a framework based on constrained randomization which lets users explore high-dimensional data via 'subjectively informative' two-dimensional data visualizations. The user is presented with 'interesting' projections and can express their observations through visual interactions that update a background model representing the user's belief state. This background model is then taken into account by a projection-finding algorithm employing data randomization to compute a new 'interesting' projection. By providing users with information that contrasts with the background model, we maximize the chance that the user encounters striking new information present in the data. This process can be iterated until the user runs out of time or until the difference between the randomized and the real data is insignificant. We present two case studies, one controlled study on synthetic data and another on census data, using the proof-of-concept tool SIDE that demonstrates the presented framework.
- Published
- 2019
24. A statistical significance testing approach to mining the most informative set of patterns
- Author
-
Kai Puolamäki, Panagiotis Papapetrou, and Jefrey Lijffijt
- Subjects
Computer Networks and Communications ,Computer science ,Existential quantification ,computer.software_genre ,Small set ,Computer Science Applications ,Set (abstract data type) ,Null (SQL) ,p-value ,Data mining ,Special case ,Greedy algorithm ,computer ,Computer Science::Databases ,Information Systems ,Statistical hypothesis testing - Abstract
Hypothesis testing using constrained null models can be used to compute the significance of data mining results given what is already known about the data. We study the novel problem of finding the smallest set of patterns that explains most about the data in terms of a global p-value. The resulting set of patterns, such as frequent patterns or clusterings, is the smallest set that statistically explains the data. We show that the newly formulated problem is, in its general form, NP-hard and that no efficient algorithm with a finite approximation ratio exists. However, we show that in a special case a solution can be computed efficiently with a provable approximation ratio. We find that a greedy algorithm gives good results on real data and demonstrate our method on several data mining tasks. We conclude that our framework is able to identify, in various settings, a small set of patterns that statistically explains the data, and to formulate data mining problems in terms of statistical significance.
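The tractable special case is closely related to set cover. As a rough illustration only (not the paper's exact objective, which scores pattern sets by a global p-value), a greedy selection that repeatedly adds the pattern explaining the most still-unexplained data points might look like this; `patterns` maps a hypothetical pattern id to the set of data points it covers:

```python
def greedy_pattern_selection(patterns, universe):
    """Greedy set-cover-style selection: repeatedly add the pattern that
    explains the most still-unexplained data points in `universe`."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(patterns, key=lambda p: len(patterns[p] & uncovered))
        if not patterns[best] & uncovered:
            break  # remaining points cannot be explained by any pattern
        chosen.append(best)
        uncovered -= patterns[best]
    return chosen
```

Greedy set cover carries the classic logarithmic approximation guarantee, which is why such special cases admit a provable approximation ratio while the general problem does not.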
- Published
- 2012
25. A Tool for Subjective and Interactive Visual Data Exploration
- Author
-
Jefrey Lijffijt, Bo Kang, Kai Puolamäki, and Tijl De Bie
- Subjects
Clustering high-dimensional data ,Visual analytics ,Technology and Engineering ,Data exploration ,Exploratory Data Mining ,Computer science ,business.industry ,Dimensionality reduction ,Contrast (statistics) ,02 engineering and technology ,computer.software_genre ,Set (abstract data type) ,Data visualization ,Human–computer interaction ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Data mining ,business ,Projection (set theory) ,computer ,Subjective Interestingness ,Dimensionality Reduction ,Data Visualisation - Abstract
We present SIDE, a tool for Subjective and Interactive Visual Data Exploration, which lets users explore high-dimensional data via subjectively informative 2D data visualizations. Many existing visual analytics tools are either restricted to specific problems and domains, or they aim to find visualizations that align with the user's beliefs about the data. In contrast, our generic tool computes data visualizations that are surprising given the user's current understanding of the data. The user's belief state is represented as a set of projection tiles. This user-awareness offers an efficient way to interactively explore yet-unknown features of complex high-dimensional datasets.
- Published
- 2016
26. P-N-RMiner: A generic framework for mining interesting structured relational patterns
- Author
-
Jefrey Lijffijt, Eirini Spyropoulou, Bo Kang, and Tijl De Bie
- Subjects
Structure (mathematical logic) ,Technology and Engineering ,Syntax (programming languages) ,Computer science ,Relational database ,Applied Mathematics ,02 engineering and technology ,computer.software_genre ,Data type ,Computer Science Applications ,Set (abstract data type) ,Relational calculus ,Computational Theory and Mathematics ,020204 information systems ,Modeling and Simulation ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Semi-structured data ,Data mining ,Categorical variable ,computer ,Information Systems - Abstract
Local pattern mining methods are fragmented along two dimensions: the pattern syntax, and the data types to which they are applicable. Pattern syntaxes considered in the literature include subgroups, n-sets, itemsets, and many more; common data types include binary, categorical, and real-valued. Recent research on pattern mining in relational databases has shown how the aforementioned pattern syntaxes can be unified in a single framework. However, a unified understanding of how to deal with various data types is lacking, especially for more complex structured types such as time of day (which is circular), geographical location, and terms from a taxonomy. In this paper, we introduce a generic approach for mining interesting local patterns in (relational) data involving such structured data types as attributes. Importantly, we show how this can be done in a generic manner, by modelling the structure within a set of attribute values as a partial order. We then derive a measure of subjective interestingness of such patterns using information theory, and propose an algorithm for effectively enumerating all patterns of this syntax. Through empirical evaluation, we find that (a) the new interestingness derivation is relevant and cannot be approximated using existing tools, (b) the new tool, P-N-RMiner, finds patterns that are substantially more informative, and (c) the new enumeration algorithm is considerably faster.
- Published
- 2015
27. Interactively exploring supply and demand in the UK independent music scene
- Author
-
Matt McVicar, Tijl De Bie, Jefrey Lijffijt, and Cédric Mesnage
- Subjects
Geographic distribution ,Software ,Exploratory data mining ,Multimedia ,business.industry ,Computer science ,Meaning (existential) ,business ,computer.software_genre ,computer ,Data science ,Supply and demand - Abstract
We present an exploratory data mining tool useful for finding patterns in the geographic distribution of independent UK-based music artists. Our system is interactive, highly intuitive, and entirely browser-based, meaning it can be used without any additional software installations from any device. The target audiences are artists, other music professionals, and the general public. Potential uses of our software include highlighting discrepancies in supply and demand of specific music genres in different parts of the country, and identifying at a glance which areas have the highest densities of independent music artists.
- Published
- 2015
28. Supply and demand of independent UK music artists on the web
- Author
-
Cédric Mesnage, Matt McVicar, Jefrey Lijffijt, Eirini Spyropoulou, and Tijl De Bie
- Subjects
media_common.quotation_subject ,Disequilibrium ,Supply and demand ,Popular music ,State (polity) ,Proof of concept ,medicine ,Sociology ,Social science ,Web resource ,medicine.symptom ,Constant (mathematics) ,Industrial organization ,media_common - Abstract
As in any dynamic market, supply and demand of music are in a constant state of disequilibrium. Music charts have for many years documented the demand for the most popular music, but a more comprehensive understanding of this market has remained beyond reach. In this paper, we provide a proof of concept for how web resources now make it possible to study both demand and supply sides, accounting also for smaller, independent artists.
- Published
- 2015
29. Review of (2008): International Journal of Corpus Linguistics
- Author
-
Jefrey Lijffijt and Stefan Th. Gries
- Subjects
Linguistics and Language ,Corpus linguistics ,Psychology ,Language and Linguistics ,Linguistics - Published
- 2012
30. A Fast and Simple Method for Mining Subsequences with Surprising Event Counts
- Author
-
Jefrey Lijffijt
- Subjects
Sequence ,Joint probability distribution ,Large numbers ,Data mining ,Spurious relationship ,computer.software_genre ,Algorithm ,Upper and lower bounds ,computer ,Data type ,Mathematics ,Statistical hypothesis testing ,Event (probability theory) - Abstract
We consider the problem of mining subsequences with surprising event counts. When mining patterns, we often test a very large number of potentially present patterns, leading to a high likelihood of finding spurious results. Typically, this problem grows as the size of the data increases. Existing methods for statistical testing are not usable for mining patterns in big data, because they are either computationally too demanding, or fail to take into account the dependency structure between patterns, leading to true findings going unnoticed. We propose a new method to compute the significance of event frequencies in subsequences of a long data sequence. The method is based on analyzing the joint distribution of the patterns, obviating the need for randomization. We argue that computing the p-values exactly is computationally costly, but that an upper bound is easy to compute. We investigate the tightness of the upper bound and compare the power of the test with the alternative of post-hoc correction. We demonstrate the utility of the method on two types of data: text and DNA. We show that the proposed method is easy to implement and can be computed quickly. Moreover, we conclude that the upper bound is sufficiently tight and that meaningful results can be obtained in practice.
- Published
- 2013
31. Premodifying -ing participles in the parsed BNC
- Author
-
Turo Vartiainen and Jefrey Lijffijt
- Subjects
060201 languages & linguistics ,Literature ,Parsing ,business.industry ,media_common.quotation_subject ,Indo-European languages ,Applied linguistics ,06 humanities and the arts ,02 engineering and technology ,Art ,computer.software_genre ,Focus (linguistics) ,0602 languages and literature ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,business ,computer ,media_common - Abstract
In this article we will focus on premodifying -ing participles in English. By premodifying -ing participles we refer to NP-internal -ing forms, such as the ones found in a charming man or a barking dog. Our first goal is to see how these participles are used in four registers of Present-Day English: academic prose, newspaper articles, fiction and conversations. Furthermore, we will attempt to find corpus evidence for Vartiainen (forthcoming), where it was suggested that there are solid grounds for dividing the class of premodifying -ing participles into adjectival and verbal -ing participles. Our hypothesis is that if this division is evident in the syntactic behaviour of the participles, then it might also be reflected in the way -ing participles are used in different registers.
- Published
- 2012
32. Size Matters: Finding the Most Informative Set of Window Lengths
- Author
-
Jefrey Lijffijt, Panagiotis Papapetrou, and Kai Puolamäki
- Subjects
Set (abstract data type) ,Sequence ,Burstiness ,Computer science ,Window (computing) ,Granularity ,Data mining ,Cluster analysis ,computer.software_genre ,computer ,Synthetic data ,Event (probability theory) - Abstract
Event sequences often contain continuous variability at different levels. In other words, their properties and characteristics change at different rates, concurrently. For example, the sales of a product may slowly become more frequent over a period of several weeks, but there may be interesting variation within a week at the same time. To provide an accurate and robust "view" of such multi-level structural behavior, one needs to determine the appropriate levels of granularity for analyzing the underlying sequence. We introduce the novel problem of finding the best set of window lengths for analyzing discrete event sequences. We define suitable criteria for choosing window lengths and propose an efficient method to solve the problem. We give examples of tasks that demonstrate the applicability of the problem and present extensive experiments on both synthetic data and real data from two domains: text and DNA. We find that the optimal sets of window lengths themselves can provide new insight into the data, e.g., the burstiness of events affects the optimal window lengths for measuring the event frequencies.
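As a minimal sketch of the underlying bookkeeping only (a hypothetical helper, not the paper's selection criterion), the event counts over all sliding windows for several candidate window lengths can be computed with a single prefix-sum array over a 0/1 event sequence:

```python
def window_counts(events, lengths):
    """Event counts in every sliding window of each candidate length,
    over a 0/1 event sequence, via one shared prefix-sum array."""
    prefix = [0]
    for e in events:
        prefix.append(prefix[-1] + e)
    return {L: [prefix[i + L] - prefix[i]
                for i in range(len(events) - L + 1)]
            for L in lengths}
```

Choosing the most informative set of lengths would then score these per-length count profiles against each other, e.g. penalizing redundant granularities.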
- Published
- 2012
33. Analyzing Word Frequencies in Large Text Corpora Using Inter-arrival Times and Bootstrapping
- Author
-
Panagiotis Papapetrou, Heikki Mannila, Jefrey Lijffijt, and Kai Puolamäki
- Subjects
Text corpus ,Structure (mathematical logic) ,Computer science ,business.industry ,Speech recognition ,Bootstrapping (linguistics) ,Statistical semantics ,computer.software_genre ,Word lists by frequency ,Artificial intelligence ,Bernoulli process ,business ,computer ,Natural language ,Word (computer architecture) ,Natural language processing - Abstract
Comparing frequency counts over texts or corpora is an important task in many applications and scientific disciplines. Given a text corpus, we want to test a hypothesis such as "word X is frequent", "word X has become more frequent over time", or "word X is more frequent in male than in female speech". For this purpose we need a null model of word frequencies. The commonly used bag-of-words model, which corresponds to a Bernoulli process with a fixed parameter, does not account for any structure present in natural languages. Using this model for word frequencies results in large numbers of words being reported as unexpectedly frequent. We address how to take the inherent occurrence patterns of words into account in significance testing of word frequencies. Based on studies of words in two large corpora, we propose two methods for modeling word frequencies that both take into account the occurrence patterns of words and go beyond the bag-of-words assumption. The first method models word frequencies based on the spatial distribution of individual words in the language. The second method is based on bootstrapping and takes into account only word frequency at the text level. The proposed methods are compared to the current gold standard in a series of experiments on both corpora. We find that words obey different spatial patterns in the language, ranging from bursty to non-bursty/uniform, independent of their frequency, showing that the traditional approach leads to many false positives.
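A minimal sketch of the second, text-level bootstrap idea, under the assumption that each text is a list of tokens (the function names are illustrative, not from the paper):

```python
import random

def bootstrap_word_freq(texts, word, n_boot=1000, seed=0):
    """Text-level bootstrap: resample whole texts with replacement and
    record the word's relative frequency in each pseudo-corpus."""
    rng = random.Random(seed)
    freqs = []
    for _ in range(n_boot):
        sample = [rng.choice(texts) for _ in texts]
        count = sum(t.count(word) for t in sample)  # t is a token list
        total = sum(len(t) for t in sample)
        freqs.append(count / total)
    return freqs

def empirical_p(freqs, observed):
    """One-sided empirical p-value with add-one smoothing: the fraction
    of bootstrap frequencies at least as large as the observed one."""
    hits = sum(1 for f in freqs if f >= observed)
    return (hits + 1) / (len(freqs) + 1)
```

Resampling at the text level keeps each text's internal occurrence pattern intact, which is exactly what the bag-of-words null model throws away.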
- Published
- 2011
34. Benchmarking dynamic time warping for music retrieval
- Author
-
Vassilis Athitsos, Jaakko Hollmén, Panagiotis Papapetrou, and Jefrey Lijffijt
- Subjects
Set (abstract data type) ,Dynamic programming ,Dynamic time warping ,Matching (statistics) ,MIDI ,Computer science ,Sliding window protocol ,Speech recognition ,Subsequence ,computer.file_format ,Algorithm ,computer ,Query by humming - Abstract
We study the performance of three dynamic programming methods on music retrieval. The methods are designed for time series matching but can be directly applied to retrieval of music. Dynamic Time Warping (DTW) identifies an optimal alignment between two time series, and computes the matching cost corresponding to that alignment. Significant speed-ups can be achieved by constrained Dynamic Time Warping (cDTW), which narrows down the set of positions in one time series that can be matched with specific positions in the other time series. Both methods are designed for full sequence matching but can also be applied for subsequence matching, by using a sliding window over each database sequence to compute a matching score for each database subsequence. In addition, SPRING is a dynamic programming approach designed for subsequence matching, where the query is matched with a database subsequence without requiring the match length to be equal to the query length. SPRING has a lower computational cost than DTW and cDTW. Our database consists of a set of MIDI files taken from the web. Each MIDI file has been converted to a 2-dimensional time series, taking into account both note pitches and durations. We have used synthetic queries of fixed size and different noise levels. Surprisingly, when looking for the top-K best matches, all three approaches show similar behavior in terms of retrieval accuracy for small values of K. This suggests that for the specific application area, a computationally cheaper method, such as SPRING, is sufficient to retrieve the best top-K matches.
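For reference, a textbook implementation of full-sequence DTW on 1-dimensional numeric sequences, with an optional band constraint in the spirit of cDTW (a generic sketch; the paper's benchmark uses 2-dimensional pitch/duration series and also evaluates SPRING for subsequence matching):

```python
import math

def dtw(x, y, band=None):
    """Full-sequence DTW cost between 1-D sequences; `band` restricts
    the warping path to cells with |i - j| <= band (constrained DTW)."""
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue  # cells outside the band stay infeasible
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # x[i-1] matched again
                                 D[i][j - 1],      # y[j-1] matched again
                                 D[i - 1][j - 1])  # one-to-one match
    return D[n][m]
```

Narrowing the band trades alignment flexibility for the speed-ups the abstract attributes to cDTW; SPRING further drops the requirement that the match length equal the query length.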
- Published
- 2010
35. Tracking your steps on the track
- Author
-
Jaakko Hollmén, Panagiotis Papapetrou, and Jefrey Lijffijt
- Subjects
Computer science ,business.industry ,Track (disk drive) ,Accelerometer ,Tracking (particle physics) ,Pressure sensor ,Motion (physics) ,Data set ,Acceleration ,Computer vision ,Artificial intelligence ,business ,Wireless sensor network ,Simulation - Abstract
Monitoring human motion has recently received great attention and can be used in many applications, such as human motion prediction. We present a data set collected from a body sensor network attached to the human body. The set of sensors consists of accelerometers, measuring acceleration in three directions, attached to the upper and lower back as well as the knees and ankles. In addition, pressure on the insoles is measured with four pressure sensors inside each shoe. Two types of motion are considered: walking backwards on a straight line and walking forwards on a figure-8 path. Finally, we study and present basic statistics of the data.
- Published
- 2010
36. CLIPPR: Maximally Informative CLIPped PRojections with Bounding Regions
- Author
-
Bo Kang, Dylan Cashman, Remco Chang, Jefrey Lijffijt, and Tijl De Bie
37. Mining Subjectively Interesting Attributed Subgraphs (original title: Fouille de sous-graphes attribués subjectivement intéressants)
- Author
-
Anes Bendimerad, Ahmad Mel, Jefrey Lijffijt, Marc Plantevit, Céline Robardet, and Tijl De Bie; affiliations: Data Mining and Machine Learning (DM2L), Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université Claude Bernard Lyon 1 (UCBL), École Centrale de Lyon (ECL), Université Lumière - Lyon 2 (UL2), Centre National de la Recherche Scientifique (CNRS), Université de Lyon
- Subjects
Social and Information Networks (cs.SI) ,FOS: Computer and information sciences ,Technology and Engineering ,Subgraph Mining ,Computer Science - Social and Information Networks ,Networks ,Graphs ,Subjective Interestingness ,[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] - Abstract
Community detection in graphs, data clustering, and local pattern mining are three mature fields of data mining and machine learning. In recent years, attributed subgraph mining has emerged as a new powerful data mining task at the intersection of these areas. Given a graph and a set of attributes for each vertex, attributed subgraph mining aims to find cohesive subgraphs for which (a subset of) the attributes take exceptional values in some sense. While research on this task can borrow from the three above-mentioned fields, the principled integration of graph and attribute data poses two challenges: the definition of a pattern language that is intuitive and lends itself to efficient search strategies, and the formalization of the interestingness of such patterns. We propose an integrated solution to both of these challenges. The proposed pattern language improves upon prior work in being both highly flexible and intuitive. We show how an effective and principled algorithm can enumerate patterns of this language. The proposed approach to quantifying the interestingness of patterns of this language is rooted in information theory, and is able to account for prior knowledge on the data. Prior work typically quantifies interestingness based on the cohesion of the subgraph and the exceptionality of its attributes separately, combining these in a parametrized trade-off. Instead, in our proposal this trade-off is handled implicitly in a principled, parameter-free manner. Extensive empirical results confirm that the proposed pattern syntax is intuitive, and that the interestingness measure aligns well with actual subjective interestingness. International Workshop On Mining And Learning With Graphs, held with SIGKDD 2018