Data Pre-processing

Description

Graduate Diploma in Computing (Data Mining) mind map on Data Pre-processing, created by Freda Fung on 13/09/2016.

Resource summary

Data Pre-processing
  1. Knowledge Discovery Flow

    Annotations:

    • Data preparation/pre-processing is estimated to take 70-80% of the time and effort
    1. Data Quality Measure
      1. Accuracy
        1. Completeness
          1. Consistency
            1. Timeliness
              1. Believability
                1. Value added
                    1. Interpretability
                      1. Accessibility
                      2. Poor-quality data
                        1. Incomplete

                          Annotations:

                          • missing attribute values, missing certain attributes of interest, containing only aggregate data
                          1. Noisy

                            Annotations:

                            • Filled with errors or outliers
                            1. Inconsistent

                              Annotations:

                              • Containing discrepancies in codes, names or values
                            2. Main Tasks
                              1. Data Cleaning
                                1. Fill in missing values
                                  1. Missing Data

                                    Annotations:

                                    • May be due to: equipment malfunction; inconsistent with other recorded data and thus deleted; data not entered due to misunderstanding; certain data not considered important at the time of entry; not registered in the history or changes of the data
                                    1. Ignore

                                      Annotations:

                                      • Especially when the class label is missing, e.g. remove the instance from the dataset. Bad if there are many instances with missing values (the missingness may itself mean something). If the dataset is big it is OK to remove them; if it is small, you have to live with it.
                                      1. Fill in value manually
                                        1. Fill in with attribute mean

                                          Annotations:

                                          • Use the attribute mean to fill in every missing value for that attribute, which can itself become noise
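Mean imputation as described above can be sketched in a few lines of Python (the column and values are made up for illustration):

```python
def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Hypothetical 'age' attribute with two missing values
ages = [25, None, 40, 35, None]
print(impute_with_mean(ages))  # both gaps become the mean of 25, 40, 35
```

Note how every gap gets the same value: if many values are missing, this cluster of identical values is exactly the "becomes noise" problem the note warns about.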
                                          1. Bayesian formula or decision tree

                                            Annotations:

                                            • Use a Bayesian formula or decision tree to find the most probable value to fill in the missing value; downside: risk of bias.
                                            1. Use different models

                                              Annotations:

                                              • Solution: use a different model, e.g. if the model being built is Naive Bayes, use another learner such as J48 to predict the missing values
                                          2. Expectation Maximization

                                            Annotations:

                                            • Build a model of the data (ignoring missing values); use the model to estimate the missing values; build a new model of the data (including the estimated values); use the new model to re-estimate the missing values; repeat until convergence (old model = new model)
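The EM loop above can be sketched with a simple linear model as the "model of the data" (a toy sketch, not the general EM algorithm; the two-column data and the choice of least-squares regression are assumptions for illustration):

```python
def em_impute(xs, ys, max_iter=50, tol=1e-9):
    """EM-style imputation of missing ys (None) via a linear model y = a + b*x:
    fit on complete pairs, estimate the missing ys, refit including the
    estimates, and repeat until the fit stops changing."""
    def fit(pairs):
        n = len(pairs)
        mx = sum(x for x, _ in pairs) / n
        my = sum(y for _, y in pairs) / n
        sxx = sum((x - mx) ** 2 for x, _ in pairs)
        sxy = sum((x - mx) * (y - my) for x, y in pairs)
        b = sxy / sxx
        return my - b * mx, b  # intercept a, slope b

    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit(complete)                       # model ignoring missing values
    filled = list(ys)
    for _ in range(max_iter):
        filled = [a + b * x if y is None else y for x, y in zip(xs, ys)]
        a2, b2 = fit(list(zip(xs, filled)))    # new model incl. estimates
        if abs(a2 - a) < tol and abs(b2 - b) < tol:
            break                              # old model = new model
        a, b = a2, b2
    return filled
```

With `xs = [1, 2, 3, 4]` and `ys = [2, 4, None, 8]` the observed pairs already lie on y = 2x, so the missing value is estimated as 6 and the loop converges immediately.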
                                        2. Smooth noisy data

                                          Annotations:

                                          • Data still within the normal range, but actually a wrong value.
                                          • Incorrect attribute values may be due to: faulty data collection instruments; data entry problems; data transmission problems; technology limitations; inconsistency in naming conventions
                                          1. Binning Method

                                            Annotations:

                                            • Sort the data and partition it into (equal-depth) bins; calculate the frequency of each bin.
                                            • Smooth by bin means, bin medians, bin boundaries, etc.: change all values in a bin to one value (e.g. mean/median/boundary). Boundaries: find the mean of the group of data; values below the mean change to the minimum, values above change to the maximum within the group.
                                            1. Equal-width (distance) partitioning

                                              Annotations:

                                              • Divide the range into N intervals of equal size: for N bins, the width of each interval is W = (maxValue - minValue)/N
                                              • Straightforward, but outliers may dominate the presentation; worst for skewed data.
                                              1. Equal-depth (frequency) partitioning

                                                Annotations:

                                                • Divide the range into N intervals so that each interval contains approximately the same number of samples (the intervals need not have the same width); good data scaling
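The two partitioning schemes can be contrasted in a short Python sketch (the sample values are made up; equal-width assigns a bin index from W = (max - min)/N, equal-depth chunks the sorted data):

```python
def equal_width_bins(data, n):
    """Assign each value a bin index 0..n-1 using intervals of equal
    width W = (max - min)/n. Assumes the data has some spread (max > min)."""
    lo, hi = min(data), max(data)
    w = (hi - lo) / n
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / w), n - 1) for v in data]

def equal_depth_bins(data, n):
    """Split the sorted data into n bins holding (roughly) equal counts."""
    s = sorted(data)
    k, r = divmod(len(s), n)
    bins, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)  # spread the remainder
        bins.append(s[start:end])
        start = end
    return bins

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_depth_bins(data, 3))  # three bins of three values each
print(equal_width_bins(data, 3))  # bin counts skewed by the spread
```

On this data equal-depth yields bins of 3/3/3 values, while equal-width yields 2/3/4: the equal-width counts follow the distribution, which is why outliers can dominate it.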
                                              2. Clustering

                                                Annotations:

                                                • Detect & remove outliers
                                                • Remove data that does not belong to any group
                                                1. Combined Computer & Human Inspection

                                                  Annotations:

                                                  • Detect suspicious values and have a human check them; handle inconsistent data
                                                  • Semi-automatic: detect violations of known functional dependencies and data constraints, e.g. using a dictionary or grammar rules; correct redundant data using correlation analysis or similarity measures
                                                  1. Smoothing
                                                    1. Partition into (equi-depth) bins
                                                      1. Smoothing by bin means

                                                        Annotations:

                                                        • change all values in a bin to the mean of the bin
                                                        1. Smoothing by bin boundaries

                                                          Annotations:

                                                          • val < mean(bin) --> minVal(bin); val > mean(bin) --> maxVal(bin)
                                                          1. Regression

                                                            Annotations:

                                                            • Smooth all the values according to the best-fit line; smooth by fitting the data into regression functions
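The two bin-smoothing rules above can be written directly (a sketch; the example bin is made up):

```python
def smooth_by_bin_means(bin_values):
    """Replace every value in the bin with the bin mean."""
    m = sum(bin_values) / len(bin_values)
    return [m] * len(bin_values)

def smooth_by_bin_boundaries(bin_values):
    """Values below the bin mean go to the bin minimum,
    values above it go to the bin maximum."""
    lo, hi = min(bin_values), max(bin_values)
    m = sum(bin_values) / len(bin_values)
    return [lo if v < m else hi for v in bin_values]

bin_ = [4, 8, 15]                       # one equal-depth bin
print(smooth_by_bin_means(bin_))        # all values become the mean, 9.0
print(smooth_by_bin_boundaries(bin_))   # 4 and 8 snap to 4, 15 stays at 15
```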
                                                        2. Identify or remove outliers

                                                          Annotations:

                                                          • Outliers = values outside the normal range; define the normal range first, e.g. using the mean & SD
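A minimal sketch of the mean & SD rule (the data and the cutoff of 2 standard deviations are illustrative choices, not from the source):

```python
import statistics

def outliers(data, k=2.0):
    """Flag values more than k sample standard deviations from the mean."""
    m = statistics.mean(data)
    sd = statistics.stdev(data)
    return [v for v in data if abs(v - m) > k * sd]

readings = [10, 11, 9, 10, 12, 10, 11, 100]
print(outliers(readings))  # only the far-out value is flagged
```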
                                                          1. Resolve inconsistencies
                                                            1. Remove duplicate records
                                                            2. Data integration

                                                              Annotations:

                                                              • Combines data from multiple sources into a coherent store; integrate metadata from different sources
                                                              • Possible problems: the same attribute may have different names in different data sources, e.g. CustID and CustomerNo; one attribute may be a “derived” attribute in another table, e.g. annual revenue; different representations and scales, e.g. metric vs. British units, different currencies, different time zones
                                                              1. Data Reduction

                                                                Annotations:

                                                                • Complex data analysis may take a very long time to run on the complete data set; obtain a reduced representation of the data set that is much smaller in volume but produces (almost) the same analytical results
                                                                1. Strategy
                                                                  1. Dimensionality reduction
                                                                    1. Feature Selection

                                                                      Annotations:

                                                                      • Select a minimum set of features so that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
                                                                      • Reduces the number of attributes in the discovered patterns and makes the patterns easier to understand
                                                                      • Ways to select attributes include decision tree induction (information gain and gain ratio) and Principal Component Analysis (covered in two weeks' time)
                                                                      • Generally, keep the top 50 attributes ** assignment: top 10.
                                                                      1. Ways
                                                                        1. Decision Tree
                                                                          1. Principal Component Analysis
                                                                          2. Approach
                                                                            1. Wrapper approach

                                                                              Annotations:

                                                                              • (Find the best attributes for the chosen classifier.) Try all possible combinations of feature subsets; train on the training set, evaluate on a validation set (or use cross-validation); use the set of features that performs best on the validation set. Algorithm dependent.
                                                                              1. Proxy method

                                                                                Annotations:

                                                                                • Determine which features are important without knowing/using the learning algorithm that will be employed: information gain, gain ratio, cosine similarity, etc. Algorithm independent & fast, but may not be suitable for all algorithms
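Information gain, the first proxy measure named above, can be computed without any learner (a sketch for nominal features; the toy feature/label vectors are made up):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Information gain of a nominal feature w.r.t. the class labels:
    H(class) minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# A feature that perfectly separates the classes gets gain 1 bit;
# a feature unrelated to the classes gets gain 0.
print(info_gain(['a', 'a', 'b', 'b'], [0, 0, 1, 1]))
print(info_gain(['a', 'b', 'a', 'b'], [0, 0, 1, 1]))
```

Ranking features by this score and keeping the top k (e.g. the top 50 or top 10 mentioned earlier) is a fast, algorithm-independent selection step.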
                                                                            2. Sampling

                                                                              Annotations:

                                                                              • Choose a representative subset of the data; simple random sampling may perform very poorly in the presence of skew
                                                                              • Develop adaptive (stratified) sampling methods: approximate the percentage of each class and sample the data so that the class distribution stays the same after sampling
                                                                            3. Data Compression
                                                                              1. Discretization

                                                                                Annotations:

                                                                                • Divide the range of a continuous attribute into intervals; interval labels can then be used to replace actual data values
                                                                                • Converting numeric to ordinal; converting ordinal to numeric
                                                                                1. Binning Methods
                                                                                  1. Use info gain/gain ratio to find the best splitting points
                                                                                    1. Clustering analysis
                                                                                    2. Concept hierarchy generalization

                                                                                      Annotations:

                                                                                      • Replace low-level concepts by higher-level concepts, e.g. Age: 15, 65, 3 becomes Age: teen, senior, child, middle-aged, etc.; instead of street, use city, state or country for the geographical location
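The age example above amounts to a small mapping function (the concept boundaries are illustrative assumptions, not from the source):

```python
def age_concept(age):
    """Generalize a raw age to a higher-level concept label."""
    if age < 13:
        return "child"
    if age < 20:
        return "teen"
    if age < 45:
        return "adult"
    if age < 65:
        return "middle-aged"
    return "senior"

print([age_concept(a) for a in (15, 65, 3)])  # ['teen', 'senior', 'child']
```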
                                                                                    3. Quantity

                                                                                      Annotations:

                                                                                      • Generally, 5000 or more instances are desired; with fewer, results are less reliable and special methods like boosting might be needed. There should be at least 10 instances for each unique attribute value and 100 or more instances for each class label.
                                                                                      1. stratified sampling

                                                                                        Annotations:

                                                                                        • If the classes are unbalanced, use stratified sampling. Stratified sampling = the same number of instances per class label, rather than following the original distribution.
                                                                                        1. Random Sampling

                                                                                          Annotations:

                                                                                          • Sample instances as they come
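Stratified sampling in the balanced sense described above (equal counts per class label) can be sketched as follows; the instance/label lists and `per_class` parameter are assumptions for illustration:

```python
import random
from collections import defaultdict

def stratified_sample(instances, labels, per_class, seed=0):
    """Draw the same number of instances for each class label
    (equal counts, not the original distribution)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)
    sample = []
    for y, xs in by_class.items():
        sample.extend((x, y) for x in rng.sample(xs, per_class))
    return sample

# 7 instances of class 'a', 3 of class 'b' -> 2 of each after sampling
data = stratified_sample(list(range(10)), ['a'] * 7 + ['b'] * 3, per_class=2)
print(data)
```

Plain random sampling, by contrast, would draw roughly 70% class 'a' here, which is the skew problem noted under Sampling above.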
                                                                                      2. Data Transformation

                                                                                        Annotations:

                                                                                        • Sometimes it is better to convert nominal to numeric attributes so you can use mathematical comparisons on the fields, e.g. instead of cold, warm, hot -> -5, 25, 33; or A -> 85, A- -> 80, B+ -> 75, B -> 70
                                                                                        1. Normalization
                                                                                          1. Aggregation
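The transformation steps above (nominal-to-numeric conversion, then normalization) can be sketched together; the temperature codes follow the example in the note, and min-max scaling is one common normalization choice (the source does not name a specific method):

```python
# Nominal -> numeric mapping, as in the cold/warm/hot example above
temperature_codes = {"cold": -5, "warm": 25, "hot": 33}

def min_max_normalize(values, new_lo=0.0, new_hi=1.0):
    """Linearly rescale values into [new_lo, new_hi]."""
    lo, hi = min(values), max(values)
    return [new_lo + (v - lo) * (new_hi - new_lo) / (hi - lo) for v in values]

readings = [temperature_codes[t] for t in ("cold", "warm", "hot")]
print(min_max_normalize(readings))  # cold -> 0.0, hot -> 1.0
```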
