Data Pre-processing

Description

Graduate Diploma in Computing (Data Mining) mind map on Data Pre-processing, created by Freda Fung on 13/09/2016.

Resource summary

Data Pre-processing
  1. Knowledge Discovery Flow

    Annotations:

    • Data preparation/pre-processing is estimated to take 70-80% of the time and effort
    1. Data Quality Measure
      1. Accuracy
        1. Completeness
          1. Consistency
            1. Timeliness
              1. Believability
                1. Value added
                    1. Interpretability
                      1. Accessibility
                      2. Poor-quality data
                        1. Incomplete

                          Annotations:

                          • missing attribute values, missing certain attributes of interest, containing only aggregate data
                          1. Noisy

                            Annotations:

                            • Filled with errors or outliers
                            1. Inconsistent

                              Annotations:

                              • Containing discrepancies in codes, names or values
                            2. Main Tasks
                              1. Data Cleaning
                                1. Fill in missing values
                                  1. Missing Data

                                    Annotations:

                                    • May be due to: equipment malfunction; inconsistent with other recorded data and thus deleted; data not entered due to misunderstanding; certain data not considered important at the time of entry; not registered in the history or changes of the data
                                    1. Ignore

                                      Annotations:

                                      • Especially when the class label is missing, e.g. remove the instance from the dataset. Bad if there are many instances with missing values (the missingness may itself mean something). If the dataset is big it is OK to remove them; if it is small, you have to live with it.
                                      1. Fill in value manually
                                        1. Fill in with attribute mean

                                          Annotations:

                                          • Use the attribute mean to fill in every missing value for that attribute, which can itself become noise
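Mean imputation as described above can be sketched in a few lines of Python (the column and values are made up for illustration):

```python
def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Hypothetical 'age' attribute with two missing values
ages = [25, None, 40, 35, None]
print(impute_with_mean(ages))  # both gaps become the mean of 25, 40, 35
```

Note how every gap gets the same value: if many values are missing, this cluster of identical values is exactly the "becomes noise" problem the note warns about.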
                                          1. Bayesian formula or decision tree

                                            Annotations:

                                            • Use a Bayesian formula or decision tree to find the most probable value to fill in the missing value; downside: risk of bias.
                                            1. Use different models

                                              Annotations:

                                              • Solution: use a different model, e.g. if the model being built is Naive Bayes, use another learner such as J48 to predict the missing values
                                          2. Expectation Maximization

                                            Annotations:

                                            • Build a model of the data (ignoring missing values); use the model to estimate the missing values; build a new model of the data (including the estimated values); use the new model to re-estimate the missing values; repeat until convergence (old model = new model)
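The EM loop above can be sketched with a simple linear model as the "model of the data" (a toy sketch, not the general EM algorithm; the two-column data and the choice of least-squares regression are assumptions for illustration):

```python
def em_impute(xs, ys, max_iter=50, tol=1e-9):
    """EM-style imputation of missing ys (None) via a linear model y = a + b*x:
    fit on complete pairs, estimate the missing ys, refit including the
    estimates, and repeat until the fit stops changing."""
    def fit(pairs):
        n = len(pairs)
        mx = sum(x for x, _ in pairs) / n
        my = sum(y for _, y in pairs) / n
        sxx = sum((x - mx) ** 2 for x, _ in pairs)
        sxy = sum((x - mx) * (y - my) for x, y in pairs)
        b = sxy / sxx
        return my - b * mx, b  # intercept a, slope b

    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit(complete)                       # model ignoring missing values
    filled = list(ys)
    for _ in range(max_iter):
        filled = [a + b * x if y is None else y for x, y in zip(xs, ys)]
        a2, b2 = fit(list(zip(xs, filled)))    # new model incl. estimates
        if abs(a2 - a) < tol and abs(b2 - b) < tol:
            break                              # old model = new model
        a, b = a2, b2
    return filled
```

With `xs = [1, 2, 3, 4]` and `ys = [2, 4, None, 8]` the observed pairs already lie on y = 2x, so the missing value is estimated as 6 and the loop converges immediately.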
                                        2. Smooth noisy data

                                          Annotations:

                                          • Data still within the normal range, but actually a wrong value.
                                          • Incorrect attribute values may be due to: faulty data collection instruments; data entry problems; data transmission problems; technology limitations; inconsistency in naming conventions
                                          1. Binning Method

                                            Annotations:

                                            • Sort the data and partition it into (equal-depth) bins; calculate the frequency of each bin.
                                            • Smooth by bin means, bin medians, bin boundaries, etc.: change all values in a bin to one value (e.g. mean/median/boundary). Boundaries: find the mean of the group of data; values below the mean change to the minimum, values above change to the maximum within the group.
                                            1. Equal-width (distance) partitioning

                                              Annotations:

                                              • Divide the range into N intervals of equal size: for N bins, the width of each interval is W = (maxValue - minValue)/N
                                              • Straightforward, but outliers may dominate the presentation; worst for skewed data.
                                              1. Equal-depth (frequency) partitioning

                                                Annotations:

                                                • Divide the range into N intervals so that each interval contains approximately the same number of samples (the intervals need not have the same width); good data scaling
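The two partitioning schemes can be contrasted in a short Python sketch (the sample values are made up; equal-width assigns a bin index from W = (max - min)/N, equal-depth chunks the sorted data):

```python
def equal_width_bins(data, n):
    """Assign each value a bin index 0..n-1 using intervals of equal
    width W = (max - min)/n. Assumes the data has some spread (max > min)."""
    lo, hi = min(data), max(data)
    w = (hi - lo) / n
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / w), n - 1) for v in data]

def equal_depth_bins(data, n):
    """Split the sorted data into n bins holding (roughly) equal counts."""
    s = sorted(data)
    k, r = divmod(len(s), n)
    bins, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)  # spread the remainder
        bins.append(s[start:end])
        start = end
    return bins

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_depth_bins(data, 3))  # three bins of three values each
print(equal_width_bins(data, 3))  # bin counts skewed by the spread
```

On this data equal-depth yields bins of 3/3/3 values, while equal-width yields 2/3/4: the equal-width counts follow the distribution, which is why outliers can dominate it.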
                                              2. Clustering

                                                Annotations:

                                                • Detect & remove outliers
                                                • Remove data that does not belong to any group
                                                1. Combined Computer & Human Inspection

                                                  Annotations:

                                                  • Detect suspicious values and have a human check them; handle inconsistent data
                                                  • Semi-automatic: detect violations of known functional dependencies and data constraints, e.g. using a dictionary or grammar rules; correct redundant data using correlation analysis or similarity measures
                                                  1. Smoothing
                                                    1. Partition into (equi-depth) bins
                                                      1. Smoothing by bin means

                                                        Annotations:

                                                        • change all values in a bin to the mean of the bin
                                                        1. Smoothing by bin boundaries

                                                          Annotations:

                                                          • val < mean(bin) --> minVal(bin); val > mean(bin) --> maxVal(bin)
                                                          1. Regression

                                                            Annotations:

                                                            • Smooth all the values according to the best-fit line; smooth by fitting the data into regression functions
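The two bin-smoothing rules above can be written directly (a sketch; the example bin is made up):

```python
def smooth_by_bin_means(bin_values):
    """Replace every value in the bin with the bin mean."""
    m = sum(bin_values) / len(bin_values)
    return [m] * len(bin_values)

def smooth_by_bin_boundaries(bin_values):
    """Values below the bin mean go to the bin minimum,
    values above it go to the bin maximum."""
    lo, hi = min(bin_values), max(bin_values)
    m = sum(bin_values) / len(bin_values)
    return [lo if v < m else hi for v in bin_values]

bin_ = [4, 8, 15]                       # one equal-depth bin
print(smooth_by_bin_means(bin_))        # all values become the mean, 9.0
print(smooth_by_bin_boundaries(bin_))   # 4 and 8 snap to 4, 15 stays at 15
```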
                                                        2. Identify or remove outliers

                                                          Annotations:

                                                          • Outliers = values outside the normal range; define the normal range first, e.g. using the mean & SD
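A minimal sketch of the mean & SD rule (the data and the cutoff of 2 standard deviations are illustrative choices, not from the source):

```python
import statistics

def outliers(data, k=2.0):
    """Flag values more than k sample standard deviations from the mean."""
    m = statistics.mean(data)
    sd = statistics.stdev(data)
    return [v for v in data if abs(v - m) > k * sd]

readings = [10, 11, 9, 10, 12, 10, 11, 100]
print(outliers(readings))  # only the far-out value is flagged
```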
                                                          1. Resolve inconsistencies
                                                            1. Remove duplicate records
                                                            2. Data integration

                                                              Annotations:

                                                              • Combines data from multiple sources into a coherent store; integrate metadata from different sources
                                                              • Possible problems: the same attribute may have different names in different data sources, e.g. CustID and CustomerNo; one attribute may be a “derived” attribute in another table, e.g. annual revenue; different representations and scales, e.g. metric vs. British units, different currencies, different time zones
                                                              1. Data Reduction

                                                                Annotations:

                                                                • Complex data analysis may take a very long time to run on the complete data set; obtain a reduced representation of the data set that is much smaller in volume but produces (almost) the same analytical results
                                                                1. Strategy
                                                                  1. Dimensionality reduction
                                                                    1. Feature Selection

                                                                      Annotations:

                                                                      • Select a minimum set of features so that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
                                                                      • Reduces the number of attributes in the discovered patterns and makes the patterns easier to understand
                                                                      • Ways to select attributes include decision tree induction (information gain and gain ratio) and Principal Component Analysis (covered in two weeks' time)
                                                                      • Generally, keep the top 50 attributes ** assignment: top 10.
                                                                      1. Ways
                                                                        1. Decision Tree
                                                                          1. Principal Component Analysis
                                                                          2. Approach
                                                                            1. Wrapper approach

                                                                              Annotations:

                                                                              • (Find the best attributes for the chosen classifier.) Try all possible combinations of feature subsets; train on the training set, evaluate on a validation set (or use cross-validation); use the set of features that performs best on the validation set. Algorithm dependent.
                                                                              1. Proxy method

                                                                                Annotations:

                                                                                • Determine which features are important without knowing/using the learning algorithm that will be employed: information gain, gain ratio, cosine similarity, etc. Algorithm independent & fast, but may not be suitable for all algorithms
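Information gain, the first proxy measure named above, can be computed without any learner (a sketch for nominal features; the toy feature/label vectors are made up):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Information gain of a nominal feature w.r.t. the class labels:
    H(class) minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# A feature that perfectly separates the classes gets gain 1 bit;
# a feature unrelated to the classes gets gain 0.
print(info_gain(['a', 'a', 'b', 'b'], [0, 0, 1, 1]))
print(info_gain(['a', 'b', 'a', 'b'], [0, 0, 1, 1]))
```

Ranking features by this score and keeping the top k (e.g. the top 50 or top 10 mentioned earlier) is a fast, algorithm-independent selection step.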
                                                                            2. Sampling

                                                                              Annotations:

                                                                              • Choose a representative subset of the data; simple random sampling may perform very poorly in the presence of skew
                                                                              • Develop adaptive (stratified) sampling methods: approximate the percentage of each class and sample the data so that the class distribution stays the same after sampling
                                                                            3. Data Compression
                                                                              1. Discretization

                                                                                Annotations:

                                                                                • Divide the range of a continuous attribute into intervals; interval labels can then be used to replace actual data values
                                                                                • Converting numeric to ordinal; converting ordinal to numeric
                                                                                1. Binning Methods
                                                                                  1. Use info gain/gain ratio to find the best splitting points
                                                                                    1. Clustering analysis
                                                                                    2. Concept hierarchy generalization

                                                                                      Annotations:

                                                                                      • Replace low-level concepts by higher-level concepts, e.g. Age: 15, 65, 3 becomes Age: teen, senior, child, middle-aged, etc.; instead of street, use city, state or country for the geographical location
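The age example above amounts to a small mapping function (the concept boundaries are illustrative assumptions, not from the source):

```python
def age_concept(age):
    """Generalize a raw age to a higher-level concept label."""
    if age < 13:
        return "child"
    if age < 20:
        return "teen"
    if age < 45:
        return "adult"
    if age < 65:
        return "middle-aged"
    return "senior"

print([age_concept(a) for a in (15, 65, 3)])  # ['teen', 'senior', 'child']
```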
                                                                                    3. Quantity

                                                                                      Annotations:

                                                                                      • Generally, 5000 or more instances are desired; with fewer, results are less reliable and special methods like boosting might be needed. There should be at least 10 instances for each unique attribute value and 100 or more instances for each class label.
                                                                                      1. stratified sampling

                                                                                        Annotations:

                                                                                        • If the classes are unbalanced, use stratified sampling. Stratified sampling = the same number of instances per class label, rather than following the original distribution.
                                                                                        1. Random Sampling

                                                                                          Annotations:

                                                                                          • Sample instances as they come
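Stratified sampling in the balanced sense described above (equal counts per class label) can be sketched as follows; the instance/label lists and `per_class` parameter are assumptions for illustration:

```python
import random
from collections import defaultdict

def stratified_sample(instances, labels, per_class, seed=0):
    """Draw the same number of instances for each class label
    (equal counts, not the original distribution)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)
    sample = []
    for y, xs in by_class.items():
        sample.extend((x, y) for x in rng.sample(xs, per_class))
    return sample

# 7 instances of class 'a', 3 of class 'b' -> 2 of each after sampling
data = stratified_sample(list(range(10)), ['a'] * 7 + ['b'] * 3, per_class=2)
print(data)
```

Plain random sampling, by contrast, would draw roughly 70% class 'a' here, which is the skew problem noted under Sampling above.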
                                                                                      2. Data Transformation

                                                                                        Annotations:

                                                                                        • Sometimes it is better to convert nominal to numeric attributes so you can use mathematical comparisons on the fields, e.g. instead of cold, warm, hot -> -5, 25, 33; or A -> 85, A- -> 80, B+ -> 75, B -> 70
                                                                                        1. Normalization
                                                                                          1. Aggregation
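The transformation steps above (nominal-to-numeric conversion, then normalization) can be sketched together; the temperature codes follow the example in the note, and min-max scaling is one common normalization choice (the source does not name a specific method):

```python
# Nominal -> numeric mapping, as in the cold/warm/hot example above
temperature_codes = {"cold": -5, "warm": 25, "hot": 33}

def min_max_normalize(values, new_lo=0.0, new_hi=1.0):
    """Linearly rescale values into [new_lo, new_hi]."""
    lo, hi = min(values), max(values)
    return [new_lo + (v - lo) * (new_hi - new_lo) / (hi - lo) for v in values]

readings = [temperature_codes[t] for t in ("cold", "warm", "hot")]
print(min_max_normalize(readings))  # cold -> 0.0, hot -> 1.0
```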
