Data Mining part 2

Description

Data Mining final review part 2
Kim Graff

Resource summary

Question Answer
Overfitting and Tree Pruning: An induced tree may overfit the training data. Too many branches, some of which may reflect anomalies due to noise or outliers, lead to poor accuracy for unseen samples.
Two approaches to avoid overfitting: PREPRUNING: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold. POSTPRUNING: Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees, then use a set of data different from the training data to decide which is the "best pruned tree".
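For reference, a minimal postpruning sketch, assuming scikit-learn is available: it uses cost-complexity pruning to generate a sequence of progressively pruned trees from the fully grown tree and picks the one that scores best on a validation set kept separate from the training data. The dataset and split sizes are illustrative placeholders.

    # Sketch: postpruning via cost-complexity pruning (dataset is a placeholder)
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    # Hold back a validation set, separate from the training data, to pick the best pruned tree
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    # ccp_alphas defines a sequence of progressively pruned trees from the fully grown tree
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

    best_alpha, best_score = 0.0, -1.0
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
        score = tree.score(X_val, y_val)  # accuracy on data not used for training
        if score > best_score:
            best_alpha, best_score = alpha, score
    print("best alpha:", best_alpha, "validation accuracy:", best_score)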
Naive Bayesian Classifier: Comments
Associative Classification: Association rules are generated and analyzed for use in classification. Search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels.
Associative Classification: Why effective? It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time. In many studies, associative classification has been found to be more accurate than some traditional classification methods, such as C4.5.
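As an illustration only, a sketch of mining class association rules, assuming the mlxtend library is installed; the toy one-hot table, its column names, and the support/confidence thresholds are made up for the example. Frequent patterns over attribute-value pairs are mined first, then only high-confidence rules whose consequent is a class label are kept.

    # Sketch: class association rules with mlxtend (assumed installed; toy data)
    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # One-hot encoded attribute-value pairs; the last two columns are class labels
    df = pd.DataFrame({
        "age=youth":      [1, 1, 0, 0, 1],
        "income=high":    [1, 0, 0, 1, 1],
        "student=yes":    [0, 1, 1, 0, 1],
        "class=buys":     [0, 1, 1, 1, 1],
        "class=not_buys": [1, 0, 0, 0, 0],
    }, dtype=bool)

    frequent = apriori(df, min_support=0.4, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
    # Keep only rules that predict a class label (consequent is a single class item)
    class_rules = rules[rules["consequents"].apply(
        lambda items: len(items) == 1 and next(iter(items)).startswith("class="))]
    print(class_rules[["antecedents", "consequents", "support", "confidence"]])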
Classifier Accuracy Measures
Evaluating the Accuracy of a Classifier or Predictor: HOLDOUT METHOD: The given data is randomly partitioned into two independent sets: a TRAINING SET (e.g., 2/3) for model construction and a TEST SET (e.g., 1/3) for accuracy estimation. RANDOM SAMPLING: a variation of holdout; repeat the holdout k times, accuracy = average of the accuracies obtained.
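A minimal holdout sketch, assuming scikit-learn and NumPy; the dataset and classifier are placeholders. It also shows the random-sampling variation (repeat the holdout k times and average the accuracies).

    # Sketch: holdout method, 2/3 training / 1/3 test (dataset and model are placeholders)
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)
    model = DecisionTreeClassifier().fit(X_train, y_train)   # model construction on the training set
    print("holdout accuracy:", model.score(X_test, y_test))  # accuracy estimation on the test set

    # Random sampling variation: repeat the holdout k times and average the accuracies
    accs = []
    for seed in range(10):
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=1/3, random_state=seed)
        accs.append(DecisionTreeClassifier().fit(Xtr, ytr).score(Xte, yte))
    print("average accuracy over 10 repeats:", np.mean(accs))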
Evaluating the Accuracy of a Classifier or Predictor: CROSS-VALIDATION (k-fold, where k = 10 is most popular): Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size. At the i-th iteration, use Di as the test set and the remaining subsets as the training set. LEAVE-ONE-OUT: k folds where k = # of tuples, for small-sized data. STRATIFIED CROSS-VALIDATION: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data.
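A minimal cross-validation sketch, assuming scikit-learn; the dataset, classifier, and k = 10 are illustrative choices. It covers stratified 10-fold cross-validation and leave-one-out.

    # Sketch: k-fold and leave-one-out cross-validation (dataset and model are placeholders)
    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(random_state=0)

    # Stratified k-fold: each fold keeps roughly the same class distribution as the full data
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    print("10-fold accuracy:", cross_val_score(clf, X, y, cv=cv).mean())

    # Leave-one-out: k = number of tuples, practical only for small data sets
    print("leave-one-out accuracy:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())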
Ensemble Methods: Increasing the Accuracy: Ensemble methods use a combination of models to increase accuracy: combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*. POPULAR ENSEMBLE METHODS: Bagging: average the prediction over a collection of classifiers. Boosting: weighted vote with a collection of classifiers.
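A minimal bagging/boosting sketch, assuming scikit-learn; the dataset, the number of models k = 50, and the default decision-tree base learners are illustrative choices.

    # Sketch: bagging and boosting ensembles (dataset and k are placeholders)
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # Bagging: majority vote / average over k classifiers trained on bootstrap samples
    bagging = BaggingClassifier(n_estimators=50, random_state=0)
    # Boosting: classifiers are trained sequentially and combined by a weighted vote
    boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

    for name, model in [("bagging", bagging), ("boosting", boosting)]:
        print(name, "accuracy:", cross_val_score(model, X, y, cv=10).mean())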
An example of K-Means Clustering
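A small k-means sketch, assuming scikit-learn and NumPy; the toy 2-D points and k = 2 are chosen for illustration.

    # Sketch: k-means with k = 2 on toy 2-D data
    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],   # one loose group
                       [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])  # another loose group

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print("cluster labels: ", km.labels_)            # cluster assignment of each point
    print("cluster centers:", km.cluster_centers_)   # the mean of each cluster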
Hierarchical Clustering
Density-Based Clustering: Basic Concepts
Density reachable
Density-connected
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
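A minimal DBSCAN sketch, assuming scikit-learn and NumPy; eps, min_samples (MinPts), and the toy points are illustrative values.

    # Sketch: DBSCAN on toy 2-D data (eps and min_samples are illustrative)
    import numpy as np
    from sklearn.cluster import DBSCAN

    points = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 0.8],   # dense region -> cluster 0
                       [5.0, 5.0], [5.1, 5.2], [4.9, 4.8],   # dense region -> cluster 1
                       [9.0, 0.0]])                           # isolated point -> noise

    db = DBSCAN(eps=0.5, min_samples=3).fit(points)
    print(db.labels_)   # cluster ids per point; -1 marks points labeled as noise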
The Curse of Dimensionality: Data in only one dimension is relatively tightly packed. Adding a dimension stretches the points across that dimension, pushing them further apart. Adding more dimensions makes the points further apart still: high-dimensional data is extremely sparse, and the distance measure becomes meaningless due to equi-distance.
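A small numeric illustration of the equi-distance effect, assuming NumPy and SciPy; the sample size and the list of dimensions are arbitrary choices.

    # Sketch: distances concentrate as dimensionality grows
    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    for d in (1, 10, 100, 1000):
        X = rng.random((200, d))   # 200 random points in the d-dimensional unit hypercube
        dist = pdist(X)            # all pairwise Euclidean distances
        # As d grows, (max - min) / min shrinks: points become nearly equidistant,
        # so the distance measure loses its discriminating power
        print(f"d={d:5d}  relative spread = {(dist.max() - dist.min()) / dist.min():.2f}")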

Similar

Chapter 19 Key Terms
Monica Holloway
Data Warehousing and Mining
i7752068
Insurance Policy Advisor
Sufiah Takeisu
Data Mining Part 1
Kim Graff
Minería de Datos.
Marcos Soledispa
Machine Learning
Alberto Ochoa
Data Mining from Big Data 4V-s
Prohor Leykin
Model Roles
Steve Hiscock
Data Mining Process
Steve Hiscock
Data Mining Tasks
Steve Hiscock
Distribution Types
Steve Hiscock