Data Preprocessing

Why Preprocess the data?
1. Consistency
  Annotations:
  - Reduced representation of the data set.
  - Dimensionality Reduction, Numerosity Reduction.
2. Data Transformation
  Annotations:
  - Normalization, data discretization, and concept hierarchy generation are forms of Data Transformation.
3. Completeness
  Annotations:
  - Integrating multiple databases, data cubes, or files.
4. Accuracy
  Annotations:
  - There are many possible reasons for inaccurate data. For example, the age of a person must be entered as a value within 100.
5. Believability
  Annotations:
  - Believability reflects how much the data are trusted by users.
6. Interpretability
  Annotations:
  - Interpretability reflects how easily the data are understood by users.
Major Tasks in Data Preprocessing
1. Data Cleaning
  Annotations:
  - To clean the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
  - 1. Missing Values. 2. Noisy Data. 3. Data Cleaning as a Process.
  1. Missing Values
    Annotations:
    - 1. Ignore the tuple.
    - 2. Fill in the missing value manually.
    - 3. Use a global constant to fill in the missing value.
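    - 4. Use a measure of central tendency for the attribute (for example, the mean or median) to fill in the missing value.
    - 5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.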
    - 6. Use the most probable value to fill in the missing value.
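Three of the strategies above (ignoring the tuple, filling with a global constant, and filling with the class-wise mean) can be sketched in a few lines of Python. This is only an illustration: the `records` list, the `-1.0` constant, and the function names are assumptions made for the example, not part of the original notes.

```python
from statistics import mean

# Hypothetical toy records: (class_label, income); None marks a missing value.
records = [
    ("low", 20.0), ("low", None), ("low", 30.0),
    ("high", 80.0), ("high", 100.0), ("high", None),
]

def drop_incomplete(rows):
    """Ignore the tuple: keep only rows whose value is present."""
    return [(c, v) for c, v in rows if v is not None]

def fill_with_constant(rows, constant):
    """Replace every missing value with one global constant."""
    return [(c, constant if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace a missing value with the mean of the known values in the same class.

    Assumes every class has at least one known value.
    """
    known = {}
    for c, v in rows:
        if v is not None:
            known.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in known.items()}
    return [(c, class_mean[c] if v is None else v) for c, v in rows]

complete_only = drop_incomplete(records)
filled_constant = fill_with_constant(records, -1.0)
filled_by_class = fill_with_class_mean(records)
```

Filling by class keeps the "low" and "high" groups apart, which a single overall mean would blur.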
  2. Noisy Data
    Annotations:
    - 1. Binning -> Smoothing by bin means. -> Smoothing by bin medians. -> Smoothing by bin boundaries.
    - 2. Regression -> Linear Regression
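The three binning variants can be illustrated with a short sketch. It assumes equal-frequency bins of a fixed size and, for smoothing by bin boundaries, that ties go to the lower boundary; the `prices` list and the function names are made up for the example.

```python
from statistics import median

def partition_equal_frequency(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sorted prices
bins = partition_equal_frequency(prices, 3)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Each variant consults only the neighbours inside a bin, so the smoothing is local.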
  3. Data Cleaning as a Process
    Annotations:
    - Rules Used: -> Unique Rule. -> Consecutive Rule. -> Null Rule.
    - Tools Used: -> Data Scrubbing Tools. -> Data Auditing Tools. -> Data Migration Tools.
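The three rules can be read as simple checks on a column: a unique rule says every value must differ from all others, a consecutive rule says there are no gaps between the lowest and highest value, and a null rule says which markers stand for "no value". The sketch below is one possible reading; the marker set and the function names are assumptions for illustration.

```python
# Hypothetical set of strings that a null rule might declare as "missing".
NULL_MARKERS = {"", "?", "N/A"}

def unique_rule_violations(values):
    """Return the values that occur more than once."""
    seen, duplicates = set(), set()
    for v in values:
        if v in seen:
            duplicates.add(v)
        seen.add(v)
    return sorted(duplicates)

def consecutive_rule_gaps(values):
    """Return the integers missing between the lowest and highest value."""
    present = set(values)
    return [n for n in range(min(values), max(values) + 1) if n not in present]

def is_null(value):
    """Apply the null rule: None or any declared marker counts as missing."""
    return value is None or str(value).strip() in NULL_MARKERS

cheque_numbers = [101, 102, 102, 104]  # made-up example column
duplicates = unique_rule_violations(cheque_numbers)
gaps = consecutive_rule_gaps(cheque_numbers)
```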
2. Data Integration
  Annotations:
  - To integrate multiple databases, data cubes, or files.
  1. Redundancy and Correlation Analysis
    Annotations:
    - Redundancy can be detected by Correlation Analysis.
    - 1. Correlation Test for Nominal Data. 2. Correlation Coefficient for Numeric Data. 3. Covariance of Numeric Data.
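As a rough illustration of the three measures, the sketch below computes a chi-square statistic for a contingency table of two nominal attributes, and the covariance and Pearson correlation coefficient for two numeric attributes. It uses population (divide by n) formulas; the data and the function names are invented for the example.

```python
from math import sqrt

def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def covariance(xs, ys):
    """Population covariance: mean product of deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def pearson(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

observed = [[250, 200], [50, 1000]]  # hypothetical counts for two nominal attributes
chi2 = chi_square(observed)

stock_a = [6, 5, 4, 3, 2]            # hypothetical numeric attributes
stock_b = [20, 10, 14, 5, 5]
cov_ab = covariance(stock_a, stock_b)
r_ab = pearson(stock_a, stock_b)
```

A large chi-square value or a coefficient near +1 or -1 suggests that one attribute may be redundant given the other; a positive covariance only says the two tend to rise together.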
  2. Entity Identification Problem
  3. Tuple Duplication
    Annotations:
    - Duplication should also be detected at tuple level.
  4. Data Value Conflict Detection and Resolution
    Annotations:
    - Data Integration also involves the detection and resolution of Data Value Conflicts.
3. Data Transformation and Data Discretization
  Annotations:
  - Normalization, data discretization, and concept hierarchy generation are forms of Data Transformation.
  1. Data Transformation Strategies
    Annotations:
    - 1. Smoothing. 2. Attribute Construction. 3. Aggregation. 4. Normalization. 5. Discretization. 6. Concept Hierarchy Generation for nominal data.
  2. Data Transformation by Normalization
    Annotations:
    - -> Min-Max Normalization. -> Z-Score Normalization. -> Decimal Scaling.
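The three normalization methods can be sketched as follows. The sketch assumes the population standard deviation for the z-score and a non-constant attribute (otherwise the divisions fail); the function names and sample values are illustrative only.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] of the data onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo  # assumed non-zero
    return [(v - lo) / span * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # sigma assumed non-zero
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12000, 73600, 98000]  # hypothetical attribute values
scaled = min_max(incomes)
standardized = z_score([2, 4, 4, 4, 5, 5, 7, 9])
shifted = decimal_scaling([-986, 917])
```

Min-max preserves the relationships among the original values, while the z-score is useful when the true minimum and maximum are unknown or when outliers dominate the range.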
  3. Discretization by Binning
    Annotations:
    - Binning is a top-down splitting technique based on a specified number of bins.
  4. Discretization by Histograms
    Annotations:
    - Histogram analysis is an unsupervised discretization technique because it does not use class information.
  5. Discretization by Cluster, Decision Tree, and Correlation Analyses
  6. Concept Hierarchy Generation for Nominal Data
    Annotations:
    - 1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts.
    - 2. Specification of a portion of a hierarchy by explicit data grouping.
    - 3. Specification of a set of attributes, but not of their partial ordering.
    - 4. Specification of only a partial set of attributes.
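When a set of attributes is given without an ordering (the third case above), one common heuristic is to order them by their number of distinct values: the attribute with the fewest distinct values goes to the top of the hierarchy and the one with the most goes to the bottom. The sketch below shows that heuristic on made-up location rows; the function name and data are assumptions.

```python
def order_hierarchy(rows, attributes):
    """Order attributes from the top (fewest distinct values) to the bottom level."""
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    ordering = sorted(attributes, key=lambda a: distinct[a])
    return ordering, distinct

# Hypothetical location records.
rows = [
    {"country": "Canada", "city": "Vancouver", "street": "Main St"},
    {"country": "Canada", "city": "Vancouver", "street": "Oak St"},
    {"country": "Canada", "city": "Toronto", "street": "King St"},
    {"country": "USA", "city": "Chicago", "street": "Lake St"},
    {"country": "USA", "city": "Chicago", "street": "State St"},
]

hierarchy, counts = order_hierarchy(rows, ["street", "country", "city"])
```

The heuristic is not foolproof (a time dimension may hold more distinct years than weekdays, for instance), so the generated ordering should be reviewed by a user or expert.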
4. Data Reduction
  Annotations:
  - To obtain a reduced representation of the data set.
  1. Data Reduction Strategies
    Annotations:
    - Dimensionality Reduction. Numerosity Reduction.
    - Data Compression. -> Lossless. -> Lossy.
  2. Wavelet Transforms
    Annotations:
    - Discrete Wavelet Transform.
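To make the idea concrete, here is a minimal sketch of a discrete wavelet transform using the Haar wavelet in its unnormalized "pairwise average and half-difference" form. This particular variant, the power-of-two length requirement, and the `truncate` helper are assumptions chosen for simplicity; other wavelet families and normalizations are also in use.

```python
def haar_dwt(signal):
    """Full Haar decomposition: [overall average, coarse details, ..., fine details]."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = [float(v) for v in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Coarser levels are placed in front of the finer ones.
        details = [(a - b) / 2 for a, b in pairs] + details
        approx = [(a + b) / 2 for a, b in pairs]
    return approx + details

def haar_idwt(coeffs):
    """Invert haar_dwt by rebuilding each level from averages and differences."""
    approx = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        size = len(approx)
        detail = coeffs[pos:pos + size]
        approx = [x for a, d in zip(approx, detail) for x in (a + d, a - d)]
        pos += size
    return approx

def truncate(coeffs, threshold):
    """Lossy reduction: zero out coefficients whose magnitude is below threshold."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]

signal = [2, 2, 0, 2, 3, 5, 4, 4]  # hypothetical data vector
coeffs = haar_dwt(signal)
restored = haar_idwt(coeffs)
approximation = haar_idwt(truncate(coeffs, 1.0))
```

Keeping only the strongest coefficients and storing the rest as zeros gives a sparse, compressed approximation of the original vector.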
  3. Principal Components Analysis
    Annotations:
    - Also called the Karhunen-Loeve (K-L) method.
  4. Attribute Subset Selection
    Annotations:
    - Reduces the data set size by removing irrelevant or redundant attributes.
    - -> Stepwise Forward Selection. -> Stepwise Backward Elimination. -> Combination of both. -> Decision Tree Induction.
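A greedy forward selection can be sketched as below: start from an empty set and repeatedly add the attribute that improves a quality score the most, stopping when nothing helps. The `score` function here is a hypothetical stand-in (in practice it might be information gain or a significance test), and the attribute names are invented.

```python
def forward_selection(attributes, score, min_gain=0.0):
    """Greedily add the best remaining attribute while the score keeps improving."""
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        candidate, candidate_score = max(
            ((a, score(selected + [a])) for a in remaining),
            key=lambda pair: pair[1],
        )
        if candidate_score - best_score <= min_gain:
            break  # no remaining attribute adds useful information
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

# Hypothetical scoring: each attribute carries one "piece of information";
# salary and income carry the same piece (redundant), row_id carries none.
INFORMATION = {"age": "age", "salary": "income", "income": "income", "row_id": None}
VALUE = {"age": 0.5, "income": 0.3}

def score(subset):
    pieces = {INFORMATION[a] for a in subset} - {None}
    return sum(VALUE[p] for p in pieces)

chosen = forward_selection(["row_id", "salary", "age", "income"], score)
```

Backward elimination works the same way in reverse: start from the full set and repeatedly drop the attribute whose removal hurts the score least.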
  5. Regression and Log-Linear Models
    Annotations:
    - Linear Regression. Multiple Linear Regression. Log-Linear models.
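As an illustration of parametric reduction, the sketch below fits a straight line by least squares so that only two numbers (slope and intercept) need to be stored in place of the raw points. The data and function names are made up, and the x values are assumed not to be all identical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # assumed non-zero
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Estimate y from the stored parameters alone."""
    slope, intercept = model
    return slope * x + intercept

xs = [1, 2, 3, 4]  # hypothetical predictor values
ys = [3, 5, 7, 9]  # hypothetical response values
model = fit_line(xs, ys)
estimate = predict(model, 10)
```

Multiple linear regression extends the same idea to two or more predictor attributes.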
  6. Histograms
    Annotations:
    - Histograms use binning to approximate data distributions and are a popular form of data reduction.
    - -> Equal Width. -> Equal Frequency.
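The two partitioning rules can be compared with a short sketch that summarizes a list of values as (low, high, count) buckets. The `prices` data and function names are assumptions; the equal-width version assumes the values are not all identical.

```python
def equal_width_histogram(values, num_buckets):
    """Buckets that each cover the same value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets  # assumed non-zero
    counts = [0] * num_buckets
    for v in values:
        # The maximum value is placed in the last bucket.
        counts[min(int((v - lo) / width), num_buckets - 1)] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]

def equal_frequency_histogram(values, num_buckets):
    """Buckets that each hold roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(num_buckets):
        chunk = ordered[i * n // num_buckets:(i + 1) * n // num_buckets]
        if chunk:
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

prices = [1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 20, 31]  # hypothetical values
by_width = equal_width_histogram(prices, 3)
by_frequency = equal_frequency_histogram(prices, 3)
```

Storing three buckets instead of twelve values is the reduction; equal-frequency buckets tend to cope better with skewed data.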
  7. Clustering
    Annotations:
    - Clustering techniques consider data tuples as objects. The quality of a cluster may be measured by its diameter; centroid distance is an alternative measure of cluster quality.
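Two cluster-quality measures can be sketched for a single cluster of points: the diameter (the maximum distance between any two objects) and the centroid distance (the average distance of each object from the cluster centroid). Euclidean distance, the sample points, and the function names are assumptions for the example.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def centroid(points):
    """Coordinate-wise mean of the cluster's points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def centroid_distance(points):
    """Average distance of each point from the cluster centroid."""
    centre = centroid(points)
    return sum(dist(p, centre) for p in points) / len(points)

def diameter(points):
    """Maximum distance between any two points in the cluster."""
    return max(dist(p, q) for p in points for q in points)

cluster = [(0, 0), (2, 0), (0, 2), (2, 2)]  # hypothetical cluster
centre = centroid(cluster)
avg_spread = centroid_distance(cluster)
max_spread = diameter(cluster)
```

For data reduction, the cluster representation (for example, the centroid and a spread measure) replaces the tuples it summarizes.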
  8. Sampling
    Annotations:
    - Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random data sample.
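Three common sampling schemes are simple random sampling without replacement, simple random sampling with replacement, and stratified sampling. The sketch below assumes a list of tuples, a fixed seed for repeatability, and at least one item drawn per stratum; all names and data are illustrative.

```python
import random

def sample_without_replacement(data, size, rng):
    """Each tuple can be drawn at most once."""
    return rng.sample(data, size)

def sample_with_replacement(data, size, rng):
    """A drawn tuple is placed back, so it may be drawn again."""
    return [rng.choice(data) for _ in range(size)]

def stratified_sample(data, key, fraction, rng):
    """Draw the same fraction from every stratum (at least one item each)."""
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        count = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, count))
    return sample

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical customers: (id, age_group), with a small "senior" group.
customers = [(i, "young") for i in range(90)] + [(i, "senior") for i in range(90, 100)]

without = sample_without_replacement(customers, 5, rng)
with_repl = sample_with_replacement(customers, 5, rng)
stratified = stratified_sample(customers, key=lambda c: c[1], fraction=0.1, rng=rng)
```

Stratifying guarantees that the small "senior" group is represented, which a plain random sample of the same size might miss.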


	Created by nithi131993 almost 11 years ago