Please consider a donation to the Higher Intellect project. See https://preterhuman.net/donate.php or the Donate to Higher Intellect page for more info.

# DMKD Vol 1, Issue 3 - September, 1997

An International Journal

Volume 1, Issue 3

## Editorial

The KDD process targets the broad goal of helping humans make sense out of data. This goal spans the spectrum of tasks to help a user find {\it structures} in the data. Sometimes simple data reduction operations to help a user "understand" a data set (or parts of it) are all that is needed. Sometimes verifying user hypotheses about the data is sufficient. In situations where one is faced with high dimensional data sets that are not well understood, algorithms capable of  making explicit some implicit subtle structure in the data or even synthesizing structure from data become a necessity. Within the KDD process, data mining is but a single step, albeit a central one when it comes to finding "structure" (patterns and models). As it is practiced today, the KDD process is iterative, interactive, and very much a trial and error activity. We lack an understanding of how the various components of the process fit together: from data selection/sampling, to data cleaning, to data transformation, to data mining, to evaluation/interpretation, to visualization of the derived results, and to the final step of deciding which actions to take based on the derived "knowledge" at the end of the KDD process. For example, it is difficult to predict the effect of selection or cleaning on the data mining step. Often one invests a significant amount of effort before realizing that some key steps were not performed early on and the only recovery is a restart.

As we advance towards the goal of facilitating the extraction of useful, meaningful, or interesting structure from data, it is crucial to develop a basic understanding of how the various components of the KDD process fit together and affect each other. While the human should always be in the loop somewhere, the challenge is to raise the level of interaction to insulate a user from the details of interactions between the components and their intricacies and nuances. The user should be able to interact with the process in terms of the domain of application to maximize the efficiency of the interaction. The analogy is akin to a driver driving an automobile. When the driver turns the steering wheel, it is to change direction of the vehicle. The interactions between the hydraulic power systems, the suspension, the traction of the tires, and so forth are of little interest. While we are very far from this goal in KDD, we should not lose vision of it as a general guiding light (regardless of how far off and how dim that light may seem to be at the moment).

Another notion that we cannot lose track of is that KDD is an interdisciplinary activity that must draw on a variety of fields. In a recent workshop that targeted bringing together researchers working in statistics, database systems, artificial intelligence, data visualization, parallel and distributed computing, and practical KDD applications, it was very interesting to observe the interaction between the various communities [for more details on the content, participants, and reports see the UW/Microsoft Research Summer Institute on Data Mining, July 6-11, 1997 posted on the web at https://web.archive.org/web/20020621152258/http://www.research.microsoft.com/uwmsrdmi]. It quickly became clear that work on data mining techniques that has been evolving from the database community (e.g association rules) was fundamentally different from what a statistician would think about as extracting models from data. It was also clear that problems of scaling to massive data sets and of integration of analysis capabilities with database systems have not been a major consideration in statistics. We need to reach a basic understanding for important issues such as: What is the strategy for analysis with massive datasets that are live" and that grow in ways that are not well understood? Which analysis techniques are usable directly by end-users? How do we guard effectively against overfit, random correlations, and many of the "mines" in data mining? Are there meaningful general data reduction techniques that can help people effectively visualize and understand" massive datasets? Among the constituent fields of KDD, how can we avoid the problem of "reinventing" some techniques in one field that have been thoroughly worked out in another?

In this editorial, I would like to touch upon two additional issues of importance to KDD. The first issue concerns the exciting opportunities that arise from combining techniques from various KDD fields. Consider the interaction between data mining and data visualization. I see a great opportunity for coupling data mining and visualization techniques to scale human ability to visualize high-dimensional data. For example, a scalable clustering algorithm can sift through an unwieldy massive database and identify subsets of the data that consist of intrinsically "more homogeneous" clusters of data items. Attempting to summarize/visualize a single cluster is a significantly more tenable endeavour than attempting visualize the entire population or a random subsample of it. The difficulty is that from "global views" of the data it is exceedingly difficult for humans to home in on some local structure (or clusters of interest). In fact, clustering can be thought of as a way of choosing the right samples such that visualization becomes easier. This is true of many local modelling methods that find localized fragments of structure in the data (e.g. decision trees and association rules). In fact such local methods hold great promise of ushering in a new way to think about data modelling in general: at the piecewise level instead of at the traditional global data model.

The second issue concerns benchmarks for evaluating progress in data mining and KDD. The database community, and to some extent the information retrieval (IR) community have met with great success in establishing benchmark data sets for comparing new algorithm performance and assessing advances. In KDD, we could benefit from such benchmarks both for data mining algorithms and for the various steps in the KDD process. A good start may be the creation of some large challenge synthetic data sets. For example, one could take a model consisting of K clusters, each characterizable in some low dimensional subspace and then embed the clusters in a high dimensional large data set. The target would then be to extract the clusters of interest given the large databases. On some of these data sets, the actual correct answers need not be revealed: results can be submitted to a committee of judges and responses can be given back as to whether the discovered models are close to the correct answers, and so forth. Benchmarks can also have some standard prediction problems where the goal would be to derive models with as low a cost as possible in terms of required time, memory, etc. Some competitions are now being held at major data warehousing conferences where contestants are given data sets on the spot and are then judged by how quickly and effectively they manage to extract meaningful results. Such competitions assess the effectiveness of the various components of a KDD system, including steps that are difficult to assess quantitatively.

While benchmarks come with their own risks of focusing the work on "optimizing against the benchmark", the benefits often outweigh the risks. Enabling researchers to compare and replicate results, as well as enabling independent evaluation boards of assessing the efficacy of KDD and data mining systems is an important consideration. Of course, real applications should always be given the highest consideration when judging progress. While KDD has a substantial set of significant success stories to motivate researchers, practioners, and users, standard benchmarks also have an important role to play.

Usama Fayyad, Editor-in-chief, Data Mining and Knowledge Discovery

Usama Fayyad is a Senior Researcher at Microsoft Research . His research interests include knowledge discovery in large databases, data mining, machine learning, statistical pattern recognition, and clustering. After receiving the Ph.D. degree from The University of Michigan, Ann Arbor, in 1991, he joined the Jet Propulsion Laboratory (JPL), California Institute of Technology where (until early 1996) he headed the Machine Learning Systems Group and developed data mining systems for automated science data analysis. He is a co-editor of Advances in Knowledge Discovery and Data Mining (MIT Press, 1996). He was program co-chair of KDD-94 and KDD-95 (the First International Conference on Knowledge Discovery and Data Mining) and a general chair of KDD-96.

## Papers

### Levelwise Search and Borders of Theories in Knowledge Discovery

Heikki Mannila and Hannu Toivonen

Abstract:

One of the basic problems in knowledge discovery in databases (KDD) is the following: given a data set r, a class L of sentences for defining subgroups of r, and a selection predicate, find all sentences of L deemed interesting by the selection predicate. We analyze the simple levelwise algorithm for finding all such descriptions. We give bounds for the number of database accesses that the algorithm makes. For this, we introduce the concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm. We also consider the verification problem of a KDD process: given r and a subset of sentences S from L, determine whether S is exactly the set of interesting statements about r. We show strong connections between the verification problem and the hypergraph transversal problem. The verification problem arises in a natural way when using sampling to speed up the pattern discovery step in KDD.

### Discovery of Frequent Episodes in Event Sequences

Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo

Abstract:

Sequences of events describing the behavior and actions of users or systems can be collected in several domains. An episode is a collection of events that occur relatively close to each other in a given partial order. We consider the problem of discovering frequently occurring episodes in a sequence. Once such episodes are known, one can produce rules for describing or predicting the behavior of the sequence. We give efficient algorithms for the discovery of all frequent episodes from a given class of episodes, and present detailed experimental results. The methods are in use in telecommunication alarm management.

### Data Mining for Adaptive Fraud Detection

Tom Fawcett and Foster Provost

Abstract:

One method for detecting fraud is to check for suspicious changes in user behavior. This paper describes the automatic design of user profiling methods for the purpose of fraud detection, using a series of data mining techniques. Specifically, we use a rule-learning program to uncover indicators of fraudulent behavior from a large database of customer transactions. Then the indicators are used to create a set of monitors, which profile legitimate customer behavior and indicate anomalies. Finally, the outputs of the monitors are used as features in a system that learns to combine evidence to generate high-confidence alarms. The system has been applied to the problem of detecting cellular cloning fraud based on a database of call records. Experiments indicate that this automatic approach performs better than hand-crafted methods for detecting fraud. Furthermore, this approach can adapt to the changing conditions typical of fraud detection environments.

## Methodological Note

### On Camparing Classifiers: Pitfalls to Avoid and a Recommended Approach

Stephen L. Salzberg

Abstract:

An important component of many data mining projects is finding a good classification algorithm, a process that requires very careful thought about experimental design. If not done very carefully, comparative studies of classification and other types of algorithms can easily result in statistically invalid conclusions. This is especially true when one is using data mining techniques to analyze very large databases, which inevitably contain some statistically unlikely data. This paper describes several phenomena that can, if ignored, invalidate an experimental comparison. These phenomena and the conclusions that follow apply not only to classification, but to computational experiments in almost any aspect of data mining. The paper also discusses why comparative analysis is more important in evaluating some types of algorithms than for others, and provides some suggestions about how to avoid the pitfalls suffered by many experimental studies.