Discovering Knowledge in Data: An Introduction to Data Mining, 2nd Edition

Daniel T. Larose

ISBN: 9788126558346

336 pages

INR 899


This book provides the tools needed to thrive in today’s big data world. The author demonstrates how to leverage a company’s existing databases to increase profits and market share and carefully explains the most current data science methods and techniques. The reader will “learn data mining by doing data mining”. By adding chapters on data modelling preparation, imputation of missing data and multivariate statistical analysis, Discovering Knowledge in Data, Second Edition remains the eminent reference on data mining.



Chapter 1 An Introduction to Data Mining

1.1 What is Data Mining?

1.2 Wanted: Data Miners

1.3 The Need for Human Direction of Data Mining

1.4 The Cross-Industry Standard Practice for Data Mining

1.5 Fallacies of Data Mining

1.6 What Tasks Can Data Mining Accomplish?


Chapter 2 Data Preprocessing

2.1 Why Do We Need to Preprocess the Data?

2.2 Data Cleaning

2.3 Handling Missing Data

2.4 Identifying Misclassifications

2.5 Graphical Methods for Identifying Outliers

2.6 Measures of Center and Spread

2.7 Data Transformation

2.8 Min-Max Normalization

2.9 Z-Score Standardization

2.10 Decimal Scaling

2.11 Transformations to Achieve Normality

2.12 Numerical Methods for Identifying Outliers

2.13 Flag Variables

2.14 Transforming Categorical Variables into Numerical Variables

2.15 Binning Numerical Variables

2.16 Reclassifying Categorical Variables

2.17 Adding an Index Field

2.18 Removing Variables That Are Not Useful

2.19 Variables That Should Probably Not Be Removed

2.20 Removal of Duplicate Records

2.21 A Word About ID Fields


Chapter 3 Exploratory Data Analysis

3.1 Hypothesis Testing Versus Exploratory Data Analysis

3.2 Getting to Know the Data Set

3.3 Exploring Categorical Variables

3.4 Exploring Numeric Variables

3.5 Exploring Multivariate Relationships

3.6 Selecting Interesting Subsets of the Data for Further Investigation

3.7 Using EDA to Uncover Anomalous Fields

3.8 Binning Based on Predictive Value

3.9 Deriving New Variables: Flag Variables

3.10 Deriving New Variables: Numerical Variables

3.11 Using EDA to Investigate Correlated Predictor Variables

3.12 Summary


Chapter 4 Univariate Statistical Analysis

4.1 Data Mining Tasks in Discovering Knowledge in Data

4.2 Statistical Approaches to Estimation and Prediction

4.3 Statistical Inference

4.4 How Confident Are We in Our Estimates?

4.5 Confidence Interval Estimation of the Mean

4.6 How to Reduce the Margin of Error

4.7 Confidence Interval Estimation of the Proportion

4.8 Hypothesis Testing for the Mean

4.9 Assessing the Strength of Evidence Against the Null Hypothesis

4.10 Using Confidence Intervals to Perform Hypothesis Tests

4.11 Hypothesis Testing for the Proportion


Chapter 5 Multivariate Statistics

5.1 Two-Sample t-Test for Difference in Means

5.2 Two-Sample Z-Test for Difference in Proportions

5.3 Test for Homogeneity of Proportions

5.4 Chi-Square Test for Goodness of Fit of Multinomial Data

5.5 Analysis of Variance

5.6 Regression Analysis

5.7 Hypothesis Testing in Regression

5.8 Measuring the Quality of a Regression Model

5.9 Dangers of Extrapolation

5.10 Confidence Intervals for the Mean Value of y Given x

5.11 Prediction Intervals for a Randomly Chosen Value of y Given x

5.12 Multiple Regression

5.13 Verifying Model Assumptions


Chapter 6 Preparing to Model the Data

6.1 Supervised Versus Unsupervised Methods

6.2 Statistical Methodology and Data Mining Methodology

6.3 Cross-Validation

6.4 Overfitting

6.5 Bias-Variance Trade-Off

6.6 Balancing the Training Data Set

6.7 Establishing Baseline Performance


Chapter 7 k-Nearest Neighbor Algorithm

7.1 Classification Task

7.2 k-Nearest Neighbor Algorithm

7.3 Distance Function

7.4 Combination Function

7.5 Quantifying Attribute Relevance: Stretching the Axes

7.6 Database Considerations

7.7 k-Nearest Neighbor Algorithm for Estimation and Prediction

7.8 Choosing k

7.9 Application of k-Nearest Neighbor Algorithm using IBM / SPSS Modeler


Chapter 8 Decision Trees

8.1 What is a Decision Tree?

8.2 Requirements for Using Decision Trees

8.3 Classification and Regression Trees

8.4 C4.5 Algorithm

8.5 Decision Rules

8.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data


Chapter 9 Neural Networks

9.1 Input and Output Encoding

9.2 Neural Networks for Estimation and Prediction

9.3 Simple Example of a Neural Network

9.4 Sigmoid Activation Function

9.5 Back-Propagation

9.6 Termination Criteria

9.7 Learning Rate

9.8 Momentum Term

9.9 Sensitivity Analysis

9.10 Application of Neural Network Modeling


Chapter 10 Hierarchical and k-Means Clustering

10.1 The Clustering Task

10.2 Hierarchical Clustering Methods

10.3 Single-Linkage Clustering

10.4 Complete-Linkage Clustering

10.5 k-Means Clustering

10.6 Example of k-Means Clustering at Work

10.7 Behavior of MSB, MSE and Pseudo-F as the k-Means Algorithm Proceeds

10.8 Application of k-Means Clustering using SAS Enterprise Miner

10.9 Using Cluster Membership to Predict Churn


Chapter 11 Kohonen Networks

11.1 Self-Organizing Maps

11.2 Kohonen Networks

11.2.1 Kohonen Networks Algorithm

11.3 Example of a Kohonen Network Study

11.4 Cluster Validity

11.5 Application of Clustering using Kohonen Networks

11.6 Interpreting the Clusters

11.6.1 Cluster Profiles

11.7 Using Cluster Membership as Input to Downstream Data Mining Models


Chapter 12 Association Rules

12.1 Affinity Analysis and Market Basket Analysis

12.2 Support, Confidence, Frequent Itemsets and the A Priori Property

12.3 How Does the A Priori Algorithm Work?

12.4 Extension from Flag Data to General Categorical Data

12.5 Information-Theoretic Approach: Generalized Rule Induction Method

12.5.1 J-Measure

12.6 Association Rules Are Easy to Do Badly

12.7 How Can We Measure the Usefulness of Association Rules?

12.8 Do Association Rules Represent Supervised or Unsupervised Learning?

12.9 Local Patterns Versus Global Models


Chapter 13 Imputation of Missing Data

13.1 Need for Imputation of Missing Data

13.2 Imputation of Missing Data: Continuous Variables

13.3 Standard Error of the Imputation

13.4 Imputation of Missing Data: Categorical Variables

13.5 Handling Patterns in Missingness


Chapter 14 Model Evaluation Techniques

14.1 Model Evaluation Techniques for the Description Task

14.2 Model Evaluation Techniques for the Estimation and Prediction Tasks

14.3 Model Evaluation Techniques for the Classification Task

14.4 Error Rate, False Positives, and False Negatives

14.5 Sensitivity and Specificity

14.6 Misclassification Cost Adjustment to Reflect Real-World Concerns

14.7 Decision Cost/Benefit Analysis

14.8 Lift Charts and Gains Charts

14.9 Interweaving Model Evaluation with Model Building

14.10 Confluence of Results: Applying a Suite of Models


The R Zone



Hands-On Analysis

Appendix: Data Summarization and Visualization