Please consider a donation to the Higher Intellect project. See https://preterhuman.net/donate.php or the Donate to Higher Intellect page for more info.

# DMKD Vol 1, Issue 2 - June, 1997

An International Journal

Volume 1, Issue 2

## Editorial

The field of Data Mining and Knowledge Discovery in Databases continues to grow at a tremendous pace. In addition to the continuously growing activities in the commercial and corporate fronts, we are witnessing a large number of technical and applications-oriented conferences on the topic. Besides the main KDD conference (Note 1), there are now 5 or 6 regional conferences on this topic. In addition, specialized sessions within established conferences in statistics and in databases are being dedicated to this topic, sometimes as a whole conference theme (e.g. the Interface'97 statistics conference, KDD workshops at the ACM SIGMOD conferences in 96 and 97, special sessions at VLDB Conference, the 1997 conference of the International Association of Statistical Computing, and so forth). Finally, there are a few additional conferences dedicated to applied aspects of KDD. Of course there is a host of related conferences on data warehousing and OLAP (On-line Analytical Processing).

While this growth is healthy and encouraging, we must not lose sight of the importance of establishing some firm underpinnings to the field to avoid building expectations based on hype only. The need for the kinds of capabilities that data mining and KDD aim to offer is undeniably present and strong. It is unclear, however, that today's technology is capable of meeting expectations. End-users, who are now capable of building formidable data warehouses" from on-line transaction processing systems, often do so thinking that data mining" techniques will help them get value out of this data. Hence, we are witnessing a period of great expectations that comes with the anticipation of being able to mine" data warehouses for hidden treasures. Unfortunately, there is a significant gap between what data mining can do in principle and what can be done in practice. We need to brace ourselves for the oncoming wave of crashing expectations as users begin to realize how wide the gap is between {\it research prototypes} and off-the-shelf commercial systems.

We have not yet reached the stage where robust off-the-shelf tools exist that can easily be deployed in data mining applications. Most of the existing tools still need quite a bit of expertise to deploy and fine tune until good analysis results are achieved. The tools have no built-in mechanisms to evaluate when their use is appropriate, they are often stand-alone systems that are not integrated with the external" data store. The reliance is still primarily on an expert analyst/data engineer to make sure that a data mining algorithm is used correctly and yields good results.

What are the implications for KDD? Does this mean we are heading into failure with KDD endeavours? The good news is that the answer is: probably not. The reassuring fact is that many of the useful mining tasks are possible (in theory and in prototype). We do have several proof-of-concept systems and many successes that indicate the benefits of data mining algorithms and how they can be of tremendous help in many significant contexts. What is needed, and fairly urgently, is quite a bit of development effort to build up the right set of tools and systems. Hence the inevitable coming wave of disappointments should not be read as the "fading of a trend", but rather as a warning sign that expectations exceed capabilties. The encouraging facts are that we know that (most) desired goals are achievable, and it is only a matter of time that tools will be built-up. Indeed, tools will be developed because there is a healthy economic demand for them and they do add significant value when successful. Our duty, as researchers, practitioners, or contributors to this field, is to guard against building unreasonable expectations and to help disseminate information on what works, what does not work, and what the limitations are of the current state-of-the-art.

The goal of establishing an end-to-end strategy for a systematic approach to analysis over large databases is fundamentally important to the long-term survival of the field. While this is a long-term goal, in the short term we can advance over many other fronts and still deliver much needed capabilities. Assuming our techniques and methods begin to migrate into tools and get integrated with databases and data warehouses, we will have to face the question: how do we make these tools easy to use for users who do not know much about data analysis? This problem is a very challenging one, and certainly needs much research work. I do not anticipate to see robust and general tools for data mining in the short-term. These are many years away at this stage.

What are, then, the implications to KDD and its practice in the short-term? Again there is a bit of good news. In the near term, we can effectively avoid" the hurdle of making robust general tools usable by non-experts by concentrating only on specialized classes of problems. If integrated data mining capabilities exist, then it becomes possible to build specialized solutions on a case-by-case basis: data mining applications dedicated to very specialized tasks but using the underlying general (and common) infrastructure for communicating with a database, representing data and models, extracting useful statistical quantities from data, and so forth. As more specialized tools are deployed, we will begin to learn what the true issues are in developing general tools usable by non-experts. For certain classes of problems, it will be possible to field general tools, while for others, the services of specialists will always need to be contracted.

In my warnings about heightened expectations I do not mean to paint a dark picture. In fact, I strongly believe that the picture is quite bright on nearly all fronts. For research, we have our plates full of interesting problems of scaling mining algorithms, of integrations with the DBMS, and of optimization and the use of effective heuristics to improve the search for models and patterns. New statistical pattern recognition and modeling techniques that work with high dimensional, large volume data are still needed. There are also plenty of research problems to be addressed in data cleaning, preprocessing, visualization of results of data mining, and strategies for sampling and data selection from warehouses and so-called "data marts" (specialized views/subsets of data warehouses). On the development and applications fronts plenty of good work needs to be achieved in building up an infra-structure of connectivity to databases, integrated mining tools, and systems to make it easy to develop specialized data mining applications.

With the second issue of this journal, we continue to pursue our mission of bringing forth methods and techniques from the variety of fields that constitute the KDD community. The paper by Zhang, Ramakrishnan, and Livny presents a database approach to the fundamental and important problem of clustering with large databases. The papere by Mangasarian shows how mathematical programming techniques from the operations research community can play a role in classification and clustering. While these techniques still have scalability considerations to be addressed, they draw upon well-established and understood methodologies for performing the equivalent of efficient search". Cooper presents an efficient algorithms for learning graphical probabilistic models from data. These models are very promising for many reasons including: ability to deal with missing data, correct and robust inference, ability to show users insight into the data by pointing out important influences, and ability to easily incorporate and account for prior knowledge when available. See the article by Heckerman in Issue 1 of this journal for a tutorial on this topic. Finally, we continue our Brief Application Summary section of the journal with the paper by Cox, Eick, Wills, and Brachman. This paper shows an important application to a large analysis problem, with a heavy emphasis on a visualization-based approach to help people detect new and interesting patterns.

In conclusion, I continue to be very optimistic about the prospects for Data Mining and KDD and the great impact we stand to bring to the way people interact with large databases.

Notes:

1. The Third International Conference on Knowledge Discovery and Data Mining https://web.archive.org/web/20021104033255/http://www-aig.jpl.nasa.gov/kdd97 (KDD-97), will be held in August'97 in Newport Beach, California; see https://web.archive.org/web/20021104033255/http://www-aig.jpl.nasa.gov/kdd97

Usama Fayyad, Editor-in-chief, Data Mining and Knowledge Discovery

Usama Fayyad is a Senior Researcher at Microsoft Research . His research interests include knowledge discovery in large databases, data mining, machine learning, statistical pattern recognition, and clustering. After receiving the Ph.D. degree from The University of Michigan, Ann Arbor, in 1991, he joined the Jet Propulsion Laboratory (JPL), California Institute of Technology where (until early 1996) he headed the Machine Learning Systems Group and developed data mining systems for automated science data analysis. He is a co-editor of Advances in Knowledge Discovery and Data Mining (MIT Press, 1996). He was program co-chair of KDD-94 and KDD-95 (the First International Conference on Knowledge Discovery and Data Mining) and a general chair of KDD-96.

## Papers

### BIRCH: A New Data Clustering Algorithm and Its Applications

Tian Zhang, Raghu Ramakrishnan, Miron Livny

Abstract:

Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called *data mining,* and data clustering is regarded as a particular branch. However existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and cpu cycles). So as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality.

In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called *CF-tree.* which serves as an in-memory summary of the data distribution. We have it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.

### Mathematical Programming in Data Mining

O. L. Mangasarian

Abstract:

Mathematical programming approaches to three fundamental problems will be described: feature selection, clustering and robust representation. The feature selection problem considered is that of discriminating between two sets while recognizing irrelevant and redundant features and suppressing them. This creates a lean model that often generalizes better to new unseen data. Computational results on real data confirm improved generalization of leaner models. Clustering is exemplified by the unsupervised leaning of patterns and clusters that may exist in a given database and is a useful tool for knowledge discovery in databases (KDD). A mathematical programming formulation of this problem is proposed that is theoretically justifiable and computationally implementable in a finite number of steps. A resulting k-Median Algorithm is utilized to discover very useful survival curves for breast cancer patients from a medical database. Robust representation is concerned with minimizing trained model degradation when applied to new problems. A novel approach is proposed that purposely tolerates a small error in the training process in order to avoid overfitting data that may contain errors. Examples of applications of these concepts are given.

### A Simple Constraint-Based Algorithm for Efficiently Mining Observational Databases for Causal Relationships

Gregory F. Cooper

Abstract:

This paper presents a simple, efficient computer-based method for discovering causal relationships from databases that contain observational data. Observational data is passively observed, as contrasted with experimental data. Most of the databases available for data mining are observational. There is a great potential for mining such databases to discover causal relationships. We illustrate how observational data can constrain the causal relationships among measured variables, sometimes to the point that we can conclude that one variable is causing another variable. The presentation here is based on a constraint-based approach to causal discovery. A primary purpose of this paper is to present the constraint-based causal discovery method in the simplest possible fashion in order to (1) readily convey the basic ideas that underlie more complex constraint-based causal discovery techniques, and (2) permit interested readers to rapidly program and apply the method to their own databases, as a start toward using more elaborate causal discovery algorithms.

## Brief Application Summaries

### Visual Data Mining: Recognizing Telephone Calling Fraud

Kenneth C. Cox, Stephen G. Eick, Graham J. Wills, and Ronald J. Brachman

Abstract:

Human pattern recognition skills are remarkable and in many situations far exceed the ability of automated mining algorithms. By building domain-specific interfaces that present information visually, we can combine human detection with machines far greater computational capacity. We illustrate our ideas by describing a suite of visual interfaces we built for telephone fraud detection.