Data Preprocessing

Description

Data Preprocessing Technique in Data Mining
Mind map by nithi131993, updated more than 1 year ago
Created by nithi131993 more than 9 years ago

Resource summary

Data Preprocessing
  1. Why Preprocess the Data?
    1. Consistency

      Notes:

      • Obtain a reduced representation of the data set.
      • Dimensionality Reduction. Numerosity Reduction.
      1. Data Transformation

        Notes:

        • Normalization, data discretization, and concept hierarchy generation are forms of Data Transformation.
        1. Completeness

          Notes:

          • Integrating multiple databases, data cubes, or files.
          1. Accuracy

            Notes:

            • There are many possible reasons for inaccurate data. For example, the age of a person must be entered as a value within 100.
            1. Believability

              Notes:

              • Believability reflects how much the data are trusted by users.
              1. Interpretability

                Notes:

                • Interpretability reflects how easily the data are understood by users.
              2. Major Tasks in Data Preprocessing
                1. Data Cleaning

                  Notes:

                  • Cleans the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
                  • 1. Missing Values. 2. Noisy Data. 3. Data Cleaning as a Process.
                  1. Missing Values

                    Notes:

                    • 1. Ignore the tuple.
                    • 2. Fill in the missing value manually.
                    • 3. Use a global constant to fill in the missing value.
                    • 5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
                    • 6. Use the most probable value to fill in the missing value.
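
A few of the missing-value strategies listed above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the use of `None` as the missing marker, and the sample income data are all assumptions made for this example.

```python
from statistics import mean

def ignore_tuples(rows):
    """Strategy 1: drop every tuple that contains a missing (None) field."""
    return [row for row in rows if all(field is not None for field in row)]

def fill_with_constant(values, constant="Unknown"):
    """Strategy 3: replace every missing entry with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Strategy 5: replace a missing entry with the mean of the known
    values that share its class label."""
    known = {}
    for v, label in zip(values, labels):
        if v is not None:
            known.setdefault(label, []).append(v)
    class_means = {label: mean(vs) for label, vs in known.items()}
    # Raises KeyError if a class has no known value at all.
    return [class_means[label] if v is None else v
            for v, label in zip(values, labels)]

# Hypothetical data: income per customer and a credit-risk class label.
incomes = [30.0, None, 50.0, 80.0, None, 100.0]
risk = ["low", "low", "low", "high", "high", "high"]
filled = fill_with_class_mean(incomes, risk)
print(filled)
```

Swapping `mean` for `statistics.median` gives the median variant, which is usually the safer choice when the attribute is skewed.
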
                    1. Noisy Data

                      Notes:

                      • 1. Binning -> Smoothing by bin means. -> Smoothing by bin medians. -> Smoothing by bin boundaries.
                      • 2. Regression -> Linear Regression
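
Smoothing by binning can be sketched as follows. The example assumes equal-frequency bins over sorted values; the function names and the price list are illustrative choices, not part of the original notes.

```python
from statistics import mean

def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max.
    Ties go to the lower boundary (an arbitrary choice for this sketch)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))       # three bins with means 9, 22, 29
print(smooth_by_boundaries(bins))
```

Smoothing by bin medians follows the same pattern with `statistics.median` in place of `mean`.
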
                      1. Data Cleaning as a Process

                        Notes:

                        • Rules Used: -> Unique Rule. -> Consecutive Rule. -> Null Rule.
                        • Tools Used: -> Data Scrubbing Tools. -> Data Auditing Tools. -> Data Migration Tools.
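
The three rules can be read as simple checks on a column of values. The sketch below is one possible interpretation: the function names and the set of null markers are assumptions, and the consecutive rule is checked only for integer-valued attributes.

```python
from collections import Counter

def duplicates(values):
    """Unique rule: return the set of values that occur more than once."""
    return {v for v, count in Counter(values).items() if count > 1}

def missing_in_sequence(values):
    """Consecutive rule: return the integers missing between the lowest
    and highest value (assumes an integer attribute such as a check number)."""
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]

def find_nulls(values, null_markers=(None, "", "?", "N/A")):
    """Null rule: return the positions whose value is one of the markers
    that this (hypothetical) data set uses to encode 'no value'."""
    return [i for i, v in enumerate(values) if v in null_markers]

print(duplicates(["a", "b", "a"]))
print(missing_in_sequence([1001, 1002, 1004]))
print(find_nulls(["x", "?", None, "y"]))
```
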
                      2. Data Integration

                        Notes:

                        • To integrate multiple databases, data cubes, or files.
                        1. Redundancy and Correlation Analysis

                          Notes:

                          • Redundancy can be detected by Correlation Analysis.
                          • 1. Correlation Test for Nominal Data. 2. Correlation Coefficient for Numeric Data. 3. Covariance of Numeric Data.
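
The three measures named above can be sketched in plain Python. The code assumes the chi-square statistic as the correlation test for nominal data and the population form (division by n) for covariance; the function names and the contingency table are illustrative.

```python
from math import sqrt

def chi_square(observed):
    """Chi-square statistic for a contingency table of two nominal attributes."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (o - expected) ** 2 / expected
    return chi2

def covariance(a, b):
    """Population covariance of two numeric attributes of equal length."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

def correlation(a, b):
    """Pearson correlation coefficient; undefined if either attribute is constant."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))

# Hypothetical 2x2 table of counts for two nominal attributes.
table = [[250, 200], [50, 1000]]
print(round(chi_square(table), 2))
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))
```

A correlation coefficient close to +1 or -1 suggests that one of the two numeric attributes may be redundant; a large chi-square value suggests that two nominal attributes are not independent.
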
                          1. Entity Identification Problem
                            1. Tuple Duplication

                              Notes:

                              • Duplication should also be detected at tuple level.
                              1. Data Value Conflict Detection and Resolution

                                Notes:

                                • Data Integration also involves the detection and resolution of Data Value Conflicts.
                              2. Data Transformation and Data Discretization

                                Notes:

                                • Normalization, data discretization, and concept hierarchy generation are forms of Data Transformation.
                                1. Data Transformation Strategies

                                  Notes:

                                  • 1. Smoothing. 2. Attribute Construction. 3. Aggregation. 4. Normalization. 5. Discretization. 6. Concept Hierarchy Generation for nominal data.
                                  1. Data Transformation by Normalization

                                    Notes:

                                    • -> Min-Max Normalization. -> Z-Score Normalization. -> Decimal Scaling.
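
The three normalization methods can be sketched as below. The function names and sample values are assumptions for illustration; the z-score variant here uses the population standard deviation.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] onto [new_min, new_max]; requires max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer that brings every
    absolute value below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max([12000, 73600, 98000]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 917]))
```

Min-max normalization preserves the relationships among the original values but is sensitive to outliers, since a single extreme value stretches the range; z-score normalization is often preferred when the true minimum and maximum are unknown.
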
                                    1. Discretization by Binning

                                      Notes:

                                      • Binning is a top-down splitting technique based on a specified number of bins.
                                      1. Discretization by Histograms

                                        Notes:

                                        • Histogram analysis is an unsupervised discretization technique because it does not use class information.
                                        1. Discretization by Cluster, Decision Tree, and Correlation Analyses
                                          1. Concept Hierarchy Generation for Nominal Data

                                            Notes:

                                            • 1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts.
                                            • 2. Specification of a portion of a hierarchy by explicit data grouping.
                                            • 3. Specification of a set of attributes, but not of their partial ordering.
                                            • 4. Specification of only a partial set of attributes.
                                          2. Data Reduction

                                            Notes:

                                            • To obtain the reduced representation of the data set.
                                            1. Data Reduction Strategies

                                              Notes:

                                              • Dimensionality Reduction. Numerosity Reduction.
                                              • Data Compression. -> Lossless. -> Lossy.
                                              1. Wavelet Transforms

                                                Notes:

                                                • Discrete Wavelet Transform.
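
As a rough illustration of the discrete wavelet transform, the sketch below applies the Haar wavelet by repeated pairwise averaging and differencing. It assumes the unnormalized averaging form and an input whose length is a power of two; the function names are made up for this example.

```python
def haar_step(signal):
    """One level: pairwise averages (smooth part) and half-differences (detail)."""
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages, details

def haar_transform(signal):
    """Full transform: overall average first, then details from coarse to fine.
    The input length must be a power of two."""
    coefficients = []
    current = list(signal)
    while len(current) > 1:
        current, details = haar_step(current)
        coefficients = details + coefficients
    return current + coefficients

def truncate(coefficients, threshold):
    """Data reduction step: zero out coefficients with small magnitude."""
    return [c if abs(c) >= threshold else 0.0 for c in coefficients]

coeffs = haar_transform([2, 2, 0, 2, 3, 5, 4, 4])
print(coeffs)
print(truncate(coeffs, 1.0))
```

The reduction comes from storing only the coefficients that survive the threshold; the remaining ones are treated as zero when the data are reconstructed.
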
                                                1. Principal Components Analysis

                                                  Notes:

                                                  • Also called the Karhunen-Loève method.
                                                  1. Attribute Subset Selection

                                                    Notes:

                                                    • Reduces the data set size by removing irrelevant or redundant attributes.
                                                    • -> Stepwise Forward Selection. -> Stepwise Backward Elimination. -> Combination of both. -> Decision Tree Induction.
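
Stepwise forward selection can be sketched as a greedy loop that starts from an empty set and keeps adding the attribute that improves a quality score the most. The `score` callback, the `min_gain` stopping threshold, and the toy relevance weights below are all hypothetical; in practice the score would come from something like a statistical significance test or a model's validation accuracy.

```python
def forward_selection(attributes, score, min_gain=1e-9):
    """Greedily add the attribute with the best score until no candidate
    improves the score by more than min_gain."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        new_score = score(selected + [candidate])
        if new_score - best <= min_gain:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = new_score
    return selected

# Toy score: each attribute contributes a fixed, made-up relevance weight.
relevance = {"age": 0.5, "income": 0.3, "customer_id": 0.0, "zip": 0.0}

def toy_score(subset):
    return sum(relevance[a] for a in subset)

chosen = forward_selection(list(relevance), toy_score)
print(chosen)
```

Stepwise backward elimination mirrors this loop: it starts from the full attribute set and repeatedly removes the attribute whose removal hurts the score the least.
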
                                                    1. Regression and Log-Linear Models

                                                      Notes:

                                                      • Linear Regression. Multiple Linear Regression. Log-Linear models.
                                                      1. Histograms

                                                        Notes:

                                                        • Histograms use binning to approximate data distributions and are a popular form of data reduction.
                                                        • -> Equal Width. -> Equal Frequency.
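
The two partitioning rules can be contrasted with a short sketch. The function names and sample values are illustrative assumptions; the equal-width version returns a count per bucket, while the equal-frequency version returns the bucket contents.

```python
def equal_width_counts(values, k):
    """Split [min, max] into k ranges of equal width and count values per range.
    Requires max > min; the maximum value falls into the last bucket."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        index = min(int((v - lo) / width), k - 1)
        counts[index] += 1
    return counts

def equal_frequency_buckets(values, k):
    """Split the sorted values into k buckets holding (roughly) the same
    number of values; leftovers go to the first buckets."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), k)
    buckets, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        buckets.append(ordered[start:end])
        start = end
    return buckets

data = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10]
print(equal_width_counts(data, 3))
print(equal_frequency_buckets(data, 3))
```

Either way, the data set is replaced by k bucket summaries instead of the individual values, which is where the reduction comes from.
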
                                                        1. Clustering

                                                          Notes:

                                                          • Clustering techniques consider data tuples as objects. Centroid distance is an alternative measure of cluster quality.
                                                          1. Sampling

                                                            Notes:

                                                            • Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random data sample.
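
Random sampling for data reduction can be sketched with Python's standard `random` module. The note above only mentions a random sample; the stratified variant is added here as a common refinement, and the function names, the `seed` parameter, and the example rows are assumptions.

```python
import random

def simple_random_sample(data, n, seed=None):
    """Draw n distinct tuples without replacement."""
    rng = random.Random(seed)
    return rng.sample(data, n)

def stratified_sample(data, key, fraction, seed=None):
    """Draw the same fraction from every stratum, so that small groups
    stay represented. Each stratum contributes at least one tuple."""
    rng = random.Random(seed)
    strata = {}
    for row in data:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for rows in strata.values():
        n = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, n))
    return sample

# Hypothetical data set: 80 tuples of group "a" and 20 of group "b".
rows = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
reduced = stratified_sample(rows, key=lambda r: r[0], fraction=0.1, seed=1)
print(len(reduced))
```

Passing a fixed `seed` makes the sample reproducible, which helps when results on the reduced data need to be compared across runs.
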