Data Preprocessing

Description

Data Preprocessing Technique in Data Mining
Mind Map by nithi131993, updated more than 1 year ago
Created by nithi131993 about 9 years ago

Resource summary

Data Preprocessing
  1. Why Preprocess the data?
    1. Consistency

      Annotations:

      • Reduced representation of the data set.
      • Dimensionality reduction and numerosity reduction.
      1. Data Transformation

        Annotations:

        • Normalization, data discretization, and concept hierarchy generation are forms of data transformation.
        1. Completeness

          Annotations:

          • Integrating multiple databases, data cubes, or files.
          1. Accuracy

            Annotations:

            • There are many possible reasons for inaccurate data. For example, a person's age must be entered as a value within 100; an entry outside that range is an error.
            1. Believability

              Annotations:

              • Believability reflects how much the data are trusted by users.
              1. Interpretability

                Annotations:

                • Interpretability reflects how easily the data are understood by users.
              2. Major Tasks in Data Preprocessing
                1. Data Cleaning

                  Annotations:

                  • Cleans the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
                  • 1. Missing Values. 2. Noisy Data. 3. Data Cleaning as a Process.
                  1. Missing Values

                    Annotations:

                    • 1. Ignore the tuple.
                    • 2. Fill in the missing value manually.
                    • 3. Use a global constant to fill in the missing value.
                    • 4. Use a measure of central tendency for the attribute (the mean or median) to fill in the missing value.
                    • 5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
                    • 6. Use the most probable value to fill in the missing value.
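The class-based fill strategy listed above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function name `fill_with_class_mean`, the attribute names `income` and `risk`, and the sample rows are all hypothetical, and missing values are assumed to be stored as `None`.

```python
from statistics import mean

def fill_with_class_mean(rows, attr, class_attr):
    """Return copies of rows where a missing `attr` (None) is replaced by the
    mean of `attr` over rows that share the same `class_attr` value."""
    # Collect the observed (non-missing) values for each class.
    observed = {}
    for row in rows:
        if row[attr] is not None:
            observed.setdefault(row[class_attr], []).append(row[attr])
    class_means = {label: mean(values) for label, values in observed.items()}

    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        if new_row[attr] is None:
            # Assumption: every class has at least one observed value;
            # otherwise this lookup raises KeyError.
            new_row[attr] = class_means[new_row[class_attr]]
        filled.append(new_row)
    return filled

# Hypothetical sample data.
customers = [
    {"income": 30000, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": 90000, "risk": "high"},
    {"income": None, "risk": "high"},
]
filled = fill_with_class_mean(customers, "income", "risk")
```

Swapping `mean` for `statistics.median` gives the median variant, which is usually the safer choice when the attribute is skewed.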
                    1. Noisy Data

                      Annotations:

                      • 1. Binning -> Smoothing by bin means. -> Smoothing by bin medians. -> Smoothing by bin boundaries.
                      • 2. Regression -> Linear Regression
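The binning methods above can be illustrated with a short sketch. It assumes equal-frequency bins over sorted values; the helper names and the sample prices are hypothetical and chosen only for illustration.

```python
from statistics import mean

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of `bin_size`."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Ties are resolved toward the lower boundary (an arbitrary choice).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# Hypothetical sample data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)            # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
by_boundaries = smooth_by_boundaries(bins)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by bin medians follows the same pattern as `smooth_by_means`, with `statistics.median` in place of `mean`.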
                      1. Data Cleaning as a Process

                        Annotations:

                        • Rules Used: -> Unique Rule. -> Consecutive Rule. -> Null Rule.
                        • Tools Used: -> Data Scrubbing Tools. -> Data Auditing Tools. -> Data Migration Tools.
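As a rough sketch, the three rules above can each be turned into a simple check. The function names and the set of null markers are assumptions made for this example; real data sets define their own null conventions.

```python
def violates_unique_rule(values):
    """Unique rule: every value of the attribute must differ from all others."""
    return len(set(values)) != len(values)

def missing_in_sequence(values):
    """Consecutive rule: no values may be missing between the lowest and
    highest value. Returns the integers that are absent from the range."""
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]

# Null rule: which strings stand for "no value". This set is an assumption.
NULL_MARKERS = {"", "?", "N/A"}

def count_nulls(values):
    """Count entries that are None or match one of the null markers."""
    return sum(1 for v in values if v is None or str(v).strip() in NULL_MARKERS)

# Hypothetical sample data: check numbers with a gap at 103.
check_numbers = [101, 102, 104, 105]
gaps = missing_in_sequence(check_numbers)  # [103]
```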
                      2. Data Integration

                        Annotations:

                        • Integrates multiple databases, data cubes, or files.
                        1. Redundancy and Correlation Analysis

                          Annotations:

                          • Redundancy can be detected by Correlation Analysis.
                          • 1. Correlation Test for Nominal Data. 2. Correlation Coefficient for Numeric Data. 3. Covariance of Numeric Data.
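The three measures above can be sketched with the standard library alone. This example assumes the correlation test for nominal data is the Pearson chi-square test on a contingency table, and it uses population (divide by n) covariance; the function names and sample figures are hypothetical.

```python
from math import sqrt

def covariance(xs, ys):
    """Population covariance of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient, in [-1, 1].
    Assumes neither sequence is constant (otherwise division by zero)."""
    std_x = sqrt(covariance(xs, xs))
    std_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical sample data.
stock_a = [6, 5, 4, 3, 2]
stock_b = [20, 10, 14, 5, 5]
cov_ab = covariance(stock_a, stock_b)
r_ab = correlation_coefficient(stock_a, stock_b)
chi2 = chi_square([[250, 200], [50, 1000]])
```

A coefficient near +1 or -1, or a chi-square statistic well above the critical value for the table's degrees of freedom, suggests that one attribute may be redundant given the other.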
                          1. Entity Identification Problem
                            1. Tuple Duplication

                              Annotations:

                              • Duplication should also be detected at tuple level.
                              1. Data Value Conflict Detection and Resolution

                                Annotations:

                                • Data Integration also involves the detection and resolution of Data Value Conflicts.
                              2. Data Transformation and Data Discretization

                                Annotations:

                                • Normalization, data discretization, and concept hierarchy generation are forms of Data Transformation.
                                1. Data Transformation Strategies

                                  Annotations:

                                  • 1. Smoothing. 2. Attribute Construction. 3. Aggregation. 4. Normalization. 5. Discretization. 6. Concept Hierarchy Generation for nominal data.
                                  1. Data Transformation by Normalization

                                    Annotations:

                                    • -> Min-Max Normalization. -> Z-Score Normalization. -> Decimal Scaling.
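The three normalization methods above can be sketched as follows. The function names are hypothetical, the z-score uses the population standard deviation, and the inputs are assumed to be non-constant so that no division by zero occurs.

```python
from math import sqrt

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the (population) standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10^j, where j is the smallest integer such that the
    largest absolute scaled value is below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Hypothetical sample data.
scaled = min_max([10, 20, 30])            # [0.0, 0.5, 1.0]
standardized = z_score([10, 20, 30])
shifted = decimal_scaling([-986, 917])    # [-0.986, 0.917]
```

Min-max normalization preserves the relationships among the original values but is sensitive to outliers; z-score normalization is the usual fallback when the true minimum and maximum are unknown.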
                                    1. Discretization by Binning

                                      Annotations:

                                      • Binning is a top-down splitting technique based on a specified number of bins.
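A minimal sketch of discretization by equal-width binning, assuming the number of bins is given and the values are not all identical. The function name and the sample values are hypothetical.

```python
def discretize_equal_width(values, num_bins):
    """Map each numeric value to the index (0-based) of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # assumes hi > lo
    labels = []
    for v in values:
        # The maximum value would fall just past the last bin; clamp it back.
        labels.append(min(int((v - lo) / width), num_bins - 1))
    return labels

# Hypothetical sample data.
ages = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92]
bin_labels = discretize_equal_width(ages, 3)
```

Each index can then be replaced by an interval label or a concept name, which is how binning feeds into concept hierarchy generation.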
                                      1. Discretization by Histograms

                                        Annotations:

                                        • Histogram analysis is an unsupervised discretization technique because it does not use class information.
                                        1. Discretization by Cluster, Decision Tree, and Correlation Analyses
                                          1. Concept Hierarchy Generation for Nominal Data

                                            Annotations:

                                            • 1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts.
                                            • 2. Specification of a portion of a hierarchy by explicit data grouping.
                                            • 3. Specification of a set of attributes, but not of their partial ordering.
                                            • 4. Specification of only a partial set of attributes.
                                          2. Data Reduction

                                            Annotations:

                                            • Obtains a reduced representation of the data set.
                                            1. Data Reduction Strategies

                                              Annotations:

                                              • Dimensionality Reduction. Numerosity Reduction.
                                              • Data Compression. -> Lossless. -> Lossy.
                                              1. Wavelet Transforms

                                                Annotations:

                                                • Discrete Wavelet Transform.
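To make the discrete wavelet transform concrete, here is a sketch of the simplest case: an unnormalized Haar transform built from pairwise averages and half-differences. The choice of the Haar wavelet, the function name, and the sample vector are assumptions for illustration; other wavelet families follow the same recursive pattern with different filters.

```python
def haar_transform(data):
    """Unnormalized Haar wavelet transform.
    Returns [overall average, coarsest detail, ..., finest details]."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    current = list(data)
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        differences = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = differences + details  # coarser details go in front
        current = averages
    return current + details

# Hypothetical sample data.
coefficients = haar_transform([2, 2, 0, 2, 3, 5, 4, 4])
```

Data reduction comes from keeping only the largest coefficients and treating the rest as zero, which gives a compressed approximation of the original vector.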
                                                1. Principal Components Analysis

                                                  Annotations:

                                                  • Also called the Karhunen-Loeve (K-L) method.
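The core idea of principal components analysis can be sketched without external libraries by finding only the first component. This example uses power iteration on the covariance matrix, a simplification chosen here in place of a full eigendecomposition; the function names and sample points are hypothetical, and the starting vector is assumed not to be orthogonal to the dominant direction.

```python
from math import sqrt

def first_principal_component(rows, iterations=100):
    """Approximate the unit vector along the direction of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vector = [1.0] * d  # assumed starting guess
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in product))  # zero only if data has no variance
        vector = [x / norm for x in product]
    return vector

def project(rows, component):
    """Reduce each centered row to a single coordinate along the component."""
    n, d = len(rows), len(component)
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [sum((row[j] - means[j]) * component[j] for j in range(d)) for row in rows]

# Hypothetical sample data lying exactly on the line y = 2x.
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
component = first_principal_component(points)
reduced = project(points, component)
```

Here the two attributes collapse to one coordinate with no loss, because the points are perfectly correlated; on real data the discarded components carry the (hopefully small) remaining variance.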
                                                  1. Attribute Subset Selection

                                                    Annotations:

                                                    • Reduces the data set size by removing irrelevant or redundant attributes.
                                                    • -> Stepwise Forward Selection. -> Stepwise Backward Elimination. -> Combination of both. -> Decision Tree Induction.
                                                    1. Regression and Log-Linear Models
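Forward selection, the first method listed above, can be sketched as a greedy loop. Everything specific here is an assumption: the `score` callback stands in for whatever evaluation measure is used (for example, a statistical significance test or classifier accuracy), and the `relevance` table is a toy stand-in invented for the example.

```python
def forward_selection(attributes, score, min_gain=0.0):
    """Start from the empty set and repeatedly add the attribute that most
    improves `score`, stopping when no attribute gains more than `min_gain`."""
    selected = []
    best_score = score(selected)
    remaining = list(attributes)
    while remaining:
        candidate, candidate_score = max(
            ((a, score(selected + [a])) for a in remaining),
            key=lambda pair: pair[1],
        )
        if candidate_score - best_score <= min_gain:
            break  # no remaining attribute helps enough
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

# Toy scoring function (hypothetical): each attribute has a fixed relevance.
relevance = {"A1": 0.5, "A2": 0.0, "A3": 0.0, "A4": 0.3, "A5": 0.0, "A6": 0.2}

def toy_score(subset):
    return sum(relevance[a] for a in subset)

chosen = forward_selection(list(relevance), toy_score)
```

Backward elimination mirrors this loop: it starts from the full attribute set and repeatedly removes the attribute whose removal hurts the score least.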

                                                      Annotations:

                                                      • Linear Regression. Multiple Linear Regression. Log-Linear models.
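As a sketch of how regression reduces data, the example below fits a straight line by least squares, so that only two parameters need to be stored instead of the raw points. The function name and sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept.
    Assumes xs contains at least two distinct values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical sample data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
estimate = slope * 5 + intercept  # approximate y for an unseen x
```

Multiple linear regression extends the same idea to two or more predictor attributes, at the cost of solving a small linear system rather than a closed-form ratio.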
                                                      1. Histograms

                                                        Annotations:

                                                        • Histograms use binning to approximate data distributions and are a popular form of data reduction.
                                                        • -> Equal Width. -> Equal Frequency.
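The two partitioning rules above can be compared side by side. In this sketch each bucket is returned as a (low, high, count) tuple; the function names and sample values are hypothetical, and the equal-frequency version assumes there are at least as many values as buckets.

```python
def equal_width_histogram(values, num_buckets):
    """Buckets that each span the same value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets  # assumes hi > lo
    counts = [0] * num_buckets
    for v in values:
        # Clamp the maximum value into the last bucket.
        counts[min(int((v - lo) / width), num_buckets - 1)] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]

def equal_frequency_histogram(values, num_buckets):
    """Buckets that each hold roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(num_buckets):
        chunk = ordered[i * n // num_buckets:(i + 1) * n // num_buckets]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

# Hypothetical sample data.
values = [1, 2, 2, 3, 5, 8, 13, 21, 21, 30, 30, 30]
width_counts = [count for _, _, count in equal_width_histogram(values, 3)]
frequency_buckets = equal_frequency_histogram(values, 3)
```

On skewed data the equal-width buckets end up unevenly filled, while the equal-frequency buckets adapt their ranges to the data.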
                                                        1. Clustering

                                                          Annotations:

                                                          • Clustering techniques consider data tuples as objects. Centroid distance is an alternative to cluster diameter as a measure of cluster quality.
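The cluster quality measures can be sketched for a single cluster of points. This assumes Euclidean distance, takes centroid distance to be the average distance of each object from the cluster centroid, and takes diameter to be the maximum distance between any two objects; the function names and sample points are hypothetical. `math.dist` requires Python 3.8 or later.

```python
from itertools import combinations
from math import dist

def centroid(points):
    """Coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def centroid_distance(points):
    """Average Euclidean distance from each point to the centroid."""
    center = centroid(points)
    return sum(dist(p, center) for p in points) / len(points)

def diameter(points):
    """Largest Euclidean distance between any two points (needs >= 2 points)."""
    return max(dist(p, q) for p, q in combinations(points, 2))

# Hypothetical sample data: corners of a 2 x 2 square.
cluster = [(0, 0), (2, 0), (0, 2), (2, 2)]
quality_centroid = centroid_distance(cluster)
quality_diameter = diameter(cluster)
```

Smaller values of either measure indicate a tighter cluster, and so a cluster representative that stands in for its members more faithfully.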
                                                          1. Sampling

                                                            Annotations:

                                                            • Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random data sample.
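Random sampling for data reduction can be sketched with the standard `random` module. The example shows simple random samples drawn without and with replacement; the function names, the fixed seed, and the data set of 100 tuple identifiers are assumptions for illustration.

```python
import random

def sample_without_replacement(data, size, rng):
    """Simple random sample in which each tuple can be drawn at most once."""
    return rng.sample(data, size)

def sample_with_replacement(data, size, rng):
    """Simple random sample in which a tuple may be drawn more than once."""
    return [rng.choice(data) for _ in range(size)]

# A seeded generator keeps the example reproducible (the seed is arbitrary).
rng = random.Random(42)
tuples = list(range(1, 101))  # hypothetical tuple identifiers
srswor = sample_without_replacement(tuples, 10, rng)
srswr = sample_with_replacement(tuples, 10, rng)
```

The cost of drawing the sample depends on the sample size rather than on the size of the full data set, which is what makes sampling attractive for very large data.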