Summary of the Resource
Data Pre-processing
- Knowledge Discovery Flow
Notes:
- Data preparation/pre-processing is estimated to take 70-80% of the time and effort
- Data Quality Measure
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
- Poor-quality data
- Incomplete
Notes:
- Missing attribute values, missing certain attributes of interest, or containing only aggregate data
- Noisy
Notes:
- Filled with errors or outliers
- Inconsistent
Notes:
- Containing discrepancies in codes, names or values
- Main Tasks
- Data Cleaning
- Fill in missing values
- Missing Data
Notes:
- May be due to:
Equipment malfunction
Inconsistent with other recorded data and thus deleted
Data not entered due to misunderstanding
Certain data may not be considered important at the time of entry
Not registered in the history or changes of the data
- Ignore the instance
Notes:
- Especially when the class label is missing – e.g. remove the instance from the dataset.
Bad if there are many instances with missing values (the missingness may itself mean something).
If the dataset is big, removing them is fine; if it is small, you have to live with them.
- Fill in the value manually
- Fill in with attribute mean
Notes:
- Use the attribute mean to fill in every missing value for that attribute – this can itself become noise (a small sketch follows below)
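A minimal sketch of mean imputation with pandas (the column names and values are made up for illustration):

```python
import pandas as pd

# Toy data with a missing value in the numeric attribute "age" (illustrative)
df = pd.DataFrame({"age": [23, 45, None, 31], "label": ["y", "n", "y", "n"]})

# Replace every missing "age" with the attribute mean
df["age"] = df["age"].fillna(df["age"].mean())
print(df)
```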
- Bayesian formula or decision tree
Notes:
- Use a Bayesian formula or a decision tree to predict the most probable value for the missing entry; downside: it can bias the data.
- Use different models
Notes:
- Solution: use a different model for imputation than for the final prediction, e.g. if Naive Bayes is the main model, use another learner such as J48 to predict the missing values.
- Expectation Maximization
Notes:
- Build a model of the data (ignoring missing values)
Use the model to estimate the missing values
Build new models of the data values (including the estimated values)
Use the new models to re-estimate the missing values
Repeat until convergence (old model = new model) – see the sketch below
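scikit-learn's IterativeImputer follows the same build/estimate/re-estimate loop; a small sketch, assuming made-up numeric data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, activates the estimator
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Repeatedly model each feature from the others, re-estimate the missing
# entries, and stop when the estimates converge (or max_iter is reached)
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```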
- Smooth noisy data
Notes:
- Data still within the normal range, but with an actually wrong value.
- Incorrect attribute values may be due to:
Faulty data collection instruments
Data entry problems
Data transmission problems
Technology limitation
Inconsistency in naming conventions
- Binning Method
Notes:
- Sort data and partition into (equal-depth) bins
Sort the data, partition them into bins, and calculate the frequency of each bin.
- Smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
Change all values in a bin to one value (e.g. the mean/median/boundaries).
Boundaries: find the mean of the group of data; values below the mean change to the minimum, values above it to the maximum within the group (see the sketch below).
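A small sketch of both smoothing variants on a sorted toy list, following the rule described above (values are illustrative):

```python
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 4)  # equal-depth bins of 3 values each

# Smoothing by bin means: every value in a bin becomes the bin mean
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smoothing by bin boundaries: values below the bin mean drop to the bin
# minimum, values above it rise to the bin maximum
by_boundaries = [np.where(b < b.mean(), b.min(), b.max()) for b in bins]

print(by_means)
print(by_boundaries)
```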
- Equal-width (distance) partitioning
Notes:
- Divide the range into N intervals of equal size:
for N bins, the width of each interval will be W = (maxValue - minValue)/N
- Straightforward, but outliers may dominate the presentation
- Works worst for skewed data.
- Equal - depth (frequency) partitioning
Notes:
- Divide the range into N intervals so that each interval contains approximately the same number of samples (the intervals do not need to have the same width)
- Good data scaling (see the sketch below)
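A quick way to compare the two partitioning schemes is pandas' cut (equal-width) and qcut (equal-depth); the values below are illustrative:

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of width W = (maxValue - minValue) / N
equal_width = pd.cut(values, bins=3)

# Equal-depth: 3 intervals holding roughly the same number of samples
equal_depth = pd.qcut(values, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```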
- Clustering
Notes:
- Detect & remove outliers
- Remove data that does not belong to any group
- Combined Computer & Human Inspection
Notes:
- Detect suspicious values & check by human
- Handle inconsistent data
- Semi-automatic
Detect violation of known functional dependencies and data constraints
E.g. use dictionary or grammar rules
Correct redundant data
Use correlational analysis or similarity measures to detect redundant data
- Smoothing
- Partition into (equal-depth) bins
- Smoothing by bin means
Notes:
- change all values in a bin to the mean of the bin
- Smoothing by bin boundaries
Notes:
- val < mean(bin) --> minVal(bin)
val > mean(bin) --> maxVal(bin)
- Regression
Notes:
- Smooth all the values according to the best-fit line (see the sketch below)
Smooth by fitting the data into regression functions
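A minimal sketch of regression smoothing with a straight-line fit (the data is synthetic):

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2 * x + 1 + np.random.default_rng(0).normal(scale=2.0, size=10)  # noisy values

# Fit the best straight line and replace each value by the fitted (smoothed) value
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept
print(np.round(y_smoothed, 2))
```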
- Identify or remove outliers
Notes:
- Outliers = values outside the normal range – define the normal range first, e.g. via mean & standard deviation (see the sketch below)
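A minimal sketch of the mean-and-SD rule, flagging everything more than 2 standard deviations from the mean (the threshold and values are illustrative):

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0, 11.5])

# Define the "normal range" as mean +/- 2 standard deviations,
# then flag everything outside it as an outlier
mean, sd = values.mean(), values.std()
outliers = values[np.abs(values - mean) > 2 * sd]
print(outliers)  # -> [95.]
```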
- Resolve inconsistencies
- Remove duplicate records
- Data integration
Notes:
- Combines data from multiple sources into a coherent store
Integrate metadata from different sources
- Possible problems
The same attribute may have different names in different data sources, e.g. CustID and CustomerNo
One attribute may be a “derived” attribute in another table, e.g. annual revenue
Different representations and scales, e.g. metric vs. British units, different currencies, different timezones (see the sketch below)
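A small sketch of aligning two sources that name the customer key differently (table and column names are made up):

```python
import pandas as pd

# Two sources with different names for the same attribute
a = pd.DataFrame({"CustID": [1, 2], "revenue_usd": [100.0, 250.0]})
b = pd.DataFrame({"CustomerNo": [1, 2], "city": ["Berlin", "Oslo"]})

# Align the schemas, then merge into one coherent table
b = b.rename(columns={"CustomerNo": "CustID"})
merged = a.merge(b, on="CustID", how="inner")
print(merged)
```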
- Data Reduction
Notes:
- Complex data analysis may take a very long time to run on the complete data set
Obtain a reduced representation of the data set that is much smaller in volume but produces (almost) the same analytical results
- Strategy
- Dimensionality reduction
- Feature Selection
Notes:
- Select a minimum set of features so that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
- Reduce the number of attributes in the discovered patterns
Makes the patterns easier to understand
- Ways to select attributes include
Decision Tree induction (information gain and gain ratio)
Principal Component Analysis (in 2 weeks time)
- Generally, keep top 50 attributes ** assignment: top 10.
- Ways
- Decision Tree
- Principal Component Analysis
- Approach
- Wrapper approach
Notes:
- (find the best attributes for the chosen classifier)
Try all possible combinations of feature subsets
Train on train set, evaluate on a validation set (or use cross-validation)
Use set of features that performs best on the validation set
Algorithm dependent
- Proxy method
Notes:
- Determine which features are important without knowing/using what learning algorithm will be employed
Information gain, gain ratio, cosine similarity, etc.
Algorithm independent & fast, but may not be suitable for all algorithms (see the sketch below)
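A sketch of the proxy/filter style using mutual information as the score; the Iris dataset and k=2 are only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score each feature independently of any learning algorithm, keep the top 2
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(selector.scores_, X_reduced.shape)
```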
- Sampling
Notes:
- Choose a representative subset of the data
Simple random sampling may have very poor performance in the presence of skew
- Develop adaptive (stratified) sampling methods
Approximate the percentage of each class
Sample the data so that the class distribution stays the same after sampling (see the sketch below)
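A minimal sketch of drawing a sample that preserves the class distribution, using scikit-learn's stratify option (dataset and sample size are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Draw a 20% sample whose class distribution matches the full data
_, X_sample, _, y_sample = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(len(y_sample))
```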
- Data Compression
- Discretization
Notes:
- Divide the range of a continuous attribute into intervals
Interval labels can then be used to replace actual data values
- Normal
Converting Numeric to Ordinal
Converting Ordinal to Numeric
- Binning Methods
- Use info gain/gain ratio to find the best splitting points
- Clustering analysis
- Concept hierarchy generalization
Notes:
- Replace low-level concepts by higher-level concepts
E.g. Age: 15, 65, 3 to Age: teen, senior, child, middle-aged, etc. (see the sketch below)
Instead of street, use city or state or country for the geographical location
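A small sketch of replacing raw ages with higher-level concepts (the bin edges are assumptions for illustration):

```python
import pandas as pd

ages = pd.Series([3, 15, 42, 65])

# Map raw ages to higher-level concepts
labels = pd.cut(ages, bins=[0, 12, 19, 59, 120],
                labels=["child", "teen", "middle-aged", "senior"])
print(labels.tolist())  # -> ['child', 'teen', 'middle-aged', 'senior']
```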
- Quantity
Notes:
- Generally
5000 or more instances are desired
If fewer, results are less reliable; might need to use special methods like boosting
There should be at least 10 or more instances for each unique attribute value
100 or more instances for each class label
- stratified sampling
Notes:
- If the data is unbalanced, use stratified sampling
Stratified sampling = you have the same number of instances per class label, without looking at the distribution
- Random Sampling
- Data Transformation
Notes:
- Sometimes it is better to convert nominal to numeric attributes
So you can use mathematical comparisons on the fields
E.g. instead of cold, warm, hot -> -5, 25, 33
Or A -> 85, A- -> 80, B+ -> 75, B -> 70 (see the sketch below)
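A minimal sketch of the grade mapping from the note above (the scores follow the example given):

```python
import pandas as pd

grades = pd.Series(["A", "A-", "B+", "B"])

# Map ordinal grades to numeric scores so mathematical comparisons work
scores = grades.map({"A": 85, "A-": 80, "B+": 75, "B": 70})
print(scores.tolist())  # -> [85, 80, 75, 70]
```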
- Normalization
- Aggregation