Table of Contents
Preface
Introduction
Chapter 1 Data Exploration As a Process
Chapter 2 The Nature of the World and Its Impact on Data Preparation
Chapter 3 Data Preparation as a Process
Chapter 4 Getting the Data: Basic Preparation
Chapter 5 Sampling, Variability and Confidence
Chapter 6 Handling Non-Numerical Variables
Chapter 7 Normalizing and Redistributing Variables
Chapter 8 Replacing Missing and Empty Values
Chapter 9 Series Variables
Chapter 10 Preparing the Data Set
Chapter 11 The Data Survey
Chapter 12 Using Prepared Data
Appendix A Using the Demonstration Code on the CD
Appendix B Further Reading
Index
Interviews & Essays
From the Author
Thank you for your interest in my book! The book is about exactly what the title suggests, how to prepare data for mining. I wrote it because in data mining, one of the most important parts of the whole process is to properly prepare the data. The importance of preparation is acknowledged at conferences, seminars, presentations and in books about data mining. Yet despite its importance, it is not really addressed in detail anywhere else. Data mining is becoming very popular today, and many people are interested in using these new and powerful tools. Perhaps you are one of them.
You may not have a background in statistics or data analysis, but you still want to get the most out of what data mining offers. But how do you begin? Most data mining books talk at length about what various algorithms do, and how to apply them to prepared data. But how do you get started? This book will help you to see the process, understand what is needed, and get the most out of your data in solving real-world business problems. Of course, data preparation is a technical subject. I do assume that you know the basics of computing, and that at some point you took high school math (although you may well have forgotten most of what you learned about it!) That's O.K. Basic knowledge of computing and forgotten high school math, plus an interest in understanding how to get the most out of your data, is all you will need to understand what is in this book. There is very little math here, and even what there is can be ignored if you only want an overview.
If you are a programmer, or understand how to read computer programs, all of the tools that are described in the text are illustrated with code on the accompanying CD-ROM. Once again, you don't need to understand the code to use the tools and techniques. It's there if you want it, but this is not a book about programming. My focus throughout is on helping you to understand what to do with and to data to get the most out of it. And so that you can experiment for yourself, there are some sample data sets provided for you to explore. The code is provided ready-compiled for you to use on the data, as well as in source form.
My book is mainly intended for people who need to work with data and to mine it. However, if you only need to understand what is involved in the preparation and mining process, and what can realistically be expected from it, this book will help you too. You will certainly want to skip the more technical parts, but there is plenty of non-technical material that will give you a good idea of the process. I really enjoyed writing the book. I have spent a lot of my professional life working with data sets to find out what is in them and to get value out of them. I hope that you enjoy reading it, and that by doing so, you can avoid making some of the mistakes that I made along the way! Most of what I learned was as a result of discovering what didn't work, and then discovering what did on many, many projects. I wish you much luck and success in your mining efforts.
Dorian Pyle (dpyle@dca.net), the Author
Read an Excerpt
Chapter 5: Sampling, Variability, and Confidence
5.7 Problems and Shortcomings of Taking Samples Using Variability
The discussion so far has established the need for sampling, for using measurement of variability as a means to decide how much data is enough, and the use of confidence measures to determine what constitutes enough. Taken together, this provides a firm foundation to begin to determine how much data is enough. Although the basic method has now been established, there are a number of practical issues that need to be addressed before attempting to implement this methodology.
5.7.1 Missing Values
Missing or empty values present a problem. What value, if any, should be in the place of the missing value cannot yet be determined.
The practical answer for determining variability is that missing values are simply ignored as if they did not exist. Simply put, a missing value does not count as an instance of data, and the variability calculation is made using only those instances that have a measured value.
This implies that, for numerical variables particularly, a missing value must be distinguished from the value 0. In some database programs and data warehouses, it is possible to distinguish variables that have not been assigned values. The demonstration program works with data in character-only format (.txt files) and regards values consisting entirely of spaces as missing.
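The rule above can be sketched in a few lines of Python. This is only an illustration, not the demonstration program's code: the function names are hypothetical, and the sample standard deviation stands in for whatever variability measure is actually used.

```python
import statistics

def is_missing(raw: str) -> bool:
    """Treat an empty or all-space field as missing (per the text above)."""
    return raw.strip() == ""

def variability(raw_values):
    """Return (std_dev, n_measured) using only the measured values.

    Missing fields are skipped entirely, so a blank is never
    confused with the number 0. The standard deviation is an
    assumed stand-in for the variability measure.
    """
    measured = [float(v) for v in raw_values if not is_missing(v)]
    if len(measured) < 2:
        return None, len(measured)  # too few values to measure spread
    return statistics.stdev(measured), len(measured)

# Two of the five fields are missing; "0" is a real measurement.
fields = ["10", "   ", "0", "", "20"]
sd, n = variability(fields)
print(n, sd)
```

Note that the field "0" counts as an instance while the blank fields do not, which is exactly the distinction the text asks for.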
The second problem with missing values is deciding at what threshold of density the variable is not worth bothering with. As a practical choice, the demonstration program uses the confidence level here as a density measure. A 95% confidence level will generate a minimum density requirement of 5% (100 - 95). This is very low, and in practice such low-density variables probably contribute little information of value. It's probably better to remove them.
The program does, however, measure the density of all variables. The cutoff occurs when an appropriate confidence level can be established that the variable is below the minimum density. For the 95% confidence level, this translates into being 95% certain that the variable is less than 5% populated.
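One way to picture the cutoff is as a one-sided binomial test. The sketch below is an assumption about how such a check could be made, not a description of the demonstration program: the function names are hypothetical, and the test asks how likely it would be to see so few populated instances if the true density were exactly at the minimum.

```python
from math import comb

def min_density(confidence: float) -> float:
    """Minimum density implied by the confidence level, e.g. 0.95 -> 0.05."""
    return 1.0 - confidence

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def below_min_density(populated: int, seen: int, confidence: float = 0.95) -> bool:
    """True when the sample gives `confidence` that density is under the minimum.

    Assumed test: if the true density were at the threshold, the chance
    of seeing this few populated instances must be at most 1 - confidence.
    """
    threshold = min_density(confidence)
    p_value = binom_cdf(populated, seen, threshold)
    return p_value <= 1.0 - confidence

print(below_min_density(populated=0, seen=10))    # too few instances to decide
print(below_min_density(populated=0, seen=100))   # enough evidence of low density
```

With only 10 instances seen, an empty variable cannot yet be rejected; after 100 empty instances the evidence is strong enough under this assumed test.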
5.7.2 Constants (Variables with Only One Value)
A problem similar in some respects to missing values is that of variables that are in fact constants. That is to say, they contain only a single value. These should be removed before sampling begins. However, they are easy to overlook. Perhaps the sample is about people who are now divorced. From a broader population it is easy to extract all those who are presently divorced. However, if there is a field answering the question "Was this person ever married?" or "Is the tax return filed jointly?" obviously the answer to the first question has to be "Yes." It's hard to get divorced if you've never married. Equally, divorced people do not file joint tax returns.
For whatever reason, variables with only one value do creep unwittingly into data sets for preparation. The demonstration program will flag them as such when the appropriate level of confidence has been reached that there is no variability in the variable.
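The idea of reaching a level of confidence that a variable has no variability can be sketched as follows. This is a simplified illustration under a stated assumption, not the demonstration program's method: `min_share` is a hypothetical parameter giving the smallest share of "other" values worth detecting. If other values made up at least that share, the chance of drawing none of them in n random instances is (1 - min_share)^n, and the variable is flagged once that chance falls to 1 - confidence or below.

```python
import math

def instances_needed(confidence: float, min_share: float) -> int:
    """Smallest n with (1 - min_share)**n <= 1 - confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - min_share))

def looks_constant(values, confidence: float = 0.95, min_share: float = 0.01) -> bool:
    """Flag a variable as constant only after enough identical instances.

    `min_share` is an assumed tolerance, not something the text specifies.
    """
    if len(set(values)) != 1:
        return False  # empty, or more than one distinct value seen
    return len(values) >= instances_needed(confidence, min_share)

print(instances_needed(0.95, 0.01))
print(looks_constant(["Yes"] * 100))   # identical so far, but too few to be sure
print(looks_constant(["Yes"] * 299))
```

In the divorce example, an "ever married?" field that reads "Yes" in every sampled instance would be flagged once enough instances had been seen.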
5.7.3 Problems with Sampling
Sampling inherently has limitations. The whole idea of sampling is that the variability is captured without inspecting all of the data. The sample specifically does not examine all of the values present in the data set; that is the whole point of sampling.
A problem arises with alpha variables. The demonstration software does achieve a satisfactory representative sampling of the categorical values. However, not all the distinct values are necessarily captured. The PIE only knows how to translate those values that it has encountered in the data. (How to determine what particular value a given categorical should be assigned is explained in Chapter 6.) There is no way to tell how to translate values for the alpha values that exist in the data but were not encountered in the sample.
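The gap can be made concrete with a small Python sketch. The names are hypothetical, and the integer codes are placeholders only; how values are actually assigned is the subject of Chapter 6.

```python
def build_translation(sample_values):
    """Give each distinct value seen in the sample a placeholder code."""
    mapping = {}
    for value in sample_values:
        if value not in mapping:
            mapping[value] = len(mapping)  # placeholder, not the Chapter 6 method
    return mapping

def untranslatable(mapping, full_values):
    """Values present in the data that the sample never encountered."""
    return sorted(set(full_values) - set(mapping))

sample = ["red", "blue", "red"]
full_data = ["red", "green", "blue", "teal"]

mapping = build_translation(sample)
print(untranslatable(mapping, full_data))
```

Any value the second function reports has no translation, because it never appeared in the sample.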
This is not a problem with alpha variables having a small and restricted number of values they can assume. With a restricted number of possible discrete values, sampling will find them all. The exact number sampled depends on the selected confidence level. Many real-world data sets contain categorical fields that demonstrate the limitations of sampling high-discrete-count categorical variables. (Try the data set SHOE on the CD-ROM.)
In general, names and addresses are pretty hopeless. There are simply too many of them. If ZIP codes are used and turn out to be too numerous, it is often helpful to try limiting the numbers by using just the three-digit ZIP. SIC codes have similar problems.
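A minimal sketch of the three-digit ZIP idea, assuming ZIP codes are held as text (so that leading zeros survive) and using a hypothetical helper name:

```python
def zip3(zip_code: str) -> str:
    """Keep only the first three characters of a ZIP or ZIP+4 code."""
    return zip_code.strip()[:3]

zips = ["19103", "19104", "19147-1234", "02139", "02140"]
prefixes = {zip3(z) for z in zips}

# Five distinct codes collapse to two three-digit prefixes.
print(len(set(zips)), len(prefixes))
```

Fewer distinct values means the sample has a much better chance of encountering all of them.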
The demonstration code does not have the ability to be forced to comprehensively sample alpha variables. Such a modification would be easy to make, but there are drawbacks. The higher sampling rate can be forced by placing higher levels of confidence on selected variables. If it is known that there are high rates of categorical variable incidence, and that the sample data actually contains a complete and representative distribution of them, this will force the data preparation program to sample most or all of any number of distinct values. This feature should be used cautiously as very high confidence on high distinct-value count variables may require enormous amounts of data.
5.7.4 Monotonic Variable Detection
Monotonic variables are those that increase continuously, usually with time. Indeed, time increment variables are often monotonic if both time and date are included. Julian dates, a system of measurement that uses the number of days and fractions of days elapsed from a specified starting point (rather like Star Trek's star dates), are a perfect example of monotonic variables.
There are many other examples, such as serial numbers, order numbers, invoice numbers, membership numbers, account numbers, ISBN numbers, and a host of others. What they all have in common is that they increase without bound. There are many ways of dealing with these and either encoding or recoding them. This is discussed in Chapter 9. Suitably prepared, they are no longer monotonic. Here the focus is on how best to deal with the issue if monotonic variables accidentally slip into the data set.
The problem for variability capture is that only those values present in the sample available to the miner can be accessed. In any limited number of instances there will be some maximum and some minimum for each variable, including the monotonic variables. The full range will be sampled with any required degree of accuracy. The problem is that as soon as actual data for modeling is used from some other source, the monotonic variable will very likely take on values outside the range sampled. Even if the range of the new values is inside the range sampled, the distribution will likely be totally different from that in the original sample. Any modeled inferences or relationships made that rely on this data will be invalid.
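The out-of-range effect is easy to check once new data arrives. The sketch below uses hypothetical names and made-up invoice numbers purely for illustration. It catches only the first symptom described above (values outside the sampled range), not a shift in distribution within the range.

```python
def sampled_range(sample):
    """Minimum and maximum seen in the sample."""
    return min(sample), max(sample)

def fraction_outside(sample, new_values):
    """Share of new values falling outside the sampled range."""
    low, high = sampled_range(sample)
    outside = sum(1 for v in new_values if v < low or v > high)
    return outside / len(new_values)

# Illustrative invoice numbers: the sample was drawn earlier,
# so every later invoice lies beyond the sampled maximum.
sample_invoices = list(range(1000, 2000))
later_invoices = list(range(2000, 2100))

print(fraction_outside(sample_invoices, later_invoices))
```

A fraction near 1 for a variable that was fully in range during sampling is a strong hint that a monotonic variable has slipped into the data set.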
This is a very tricky problem to detect. It is in the nature of monotonic variables to have a trend. That is, there is a natural ordering of one following another in sequence. However, one founding pillar of sampling is random sampling. Random sampling destroys any order. Even if random sampling were to be selectively abandoned, it does no good, for the values of any variable can be ordered if so desired. Such an ordering, however, is likely to...