Data Preprocessing Techniques
Introduction
Real-world training data is usually not clean: it often has missing values for certain features, features on different scales, non-numeric attributes, and so on. The data therefore needs to be pre-processed to make it amenable to training a machine learning model.
Some of the typical problems that require pre-processing are:
- Missing values in features
- Numerical features are not on the same scale.
- Categorical attributes need to be represented with a sensible numerical encoding.
- Too many features, so the number of features needs to be reduced.
- Features need to be extracted from non-numeric data.
Sklearn API
- Sklearn provides a rich set of transformers for data pre-processing.
- Sklearn provides Pipeline to chain multiple transformers together and apply them uniformly across train, eval and test sets (see the sketch below).
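For instance, a minimal sketch of such a pipeline, chaining an imputer with a scaler (both transformers are covered later in this section):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Chain imputation and scaling; the fitted pipeline is then applied
# identically to train, eval and test sets.
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

X_train = np.array([[7.0, 1.0], [np.nan, 8.0], [2.0, np.nan], [9.0, 6.0]])
pipe.fit_transform(X_train)   # learn parameters on the training set
# pipe.transform(X_eval)      # reuse the learnt parameters on new data
```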
Sklearn Transformers
Sklearn provides the following library of transformers for data preprocessing:
- Feature Extraction (`sklearn.feature_extraction`)
- Data Cleaning (`sklearn.preprocessing`), such as standardization, missing value imputation, feature scaling etc.
- Feature Reduction (`sklearn.decomposition.PCA`)
- Feature Expansion (`sklearn.kernel_approximation`)
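As a small illustration of the feature extraction module, a minimal sketch using DictVectorizer (one of the transformers in `sklearn.feature_extraction`), which turns a list of dicts into a numeric feature matrix; the data here is made up for illustration:

```python
from sklearn.feature_extraction import DictVectorizer

data = [{'city': 'Chennai', 'temp': 32.0},
        {'city': 'Delhi', 'temp': 30.0}]

dv = DictVectorizer(sparse=False)
dv.fit_transform(data)
# Columns: city=Chennai, city=Delhi, temp
# -> [[1., 0., 32.],
#     [0., 1., 30.]]
```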
Transformer methods
Each transformer has the following methods:
- The `fit()` method learns model parameters from a training set.
- The `transform()` method applies the learnt transformation to new data.
- The `fit_transform()` method performs the function of both `fit()` and `transform()`.
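A minimal sketch of these methods in action, using MinMaxScaler (covered later in this section). Note that parameters are learnt from the training set only and then reused on new data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [2.0], [4.0]])
X_new = np.array([[3.0]])

mms = MinMaxScaler()
mms.fit(X_train)            # learns x_min = 1 and x_max = 4 from the training set
mms.transform(X_new)        # -> [[0.666...]], scaled with the learnt min and max
mms.fit_transform(X_train)  # fit() and transform() in a single call
```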
1. Data Cleaning
1.1. Handling Missing Values
Missing values occur due to errors in data capture, such as sensor malfunction, measurement errors etc.
The `sklearn.impute` API provides functionality to fill missing values and contains the classes SimpleImputer and KNNImputer. MissingIndicator provides indicators for missing values.
1.1.1. SimpleImputer
- Fills missing values with one of the following strategies: ‘mean’, ‘median’, ‘most_frequent’, and ‘constant’.
Example usage: Consider original feature matrix $\mathbf{X}$
$$ X = \begin{bmatrix} 7 & 1 \\ \text{nan} & 8 \\ 2 & \text{nan} \\ 9 & 6 \\ \end{bmatrix} $$

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7, 1], [np.nan, 8], [2, np.nan], [9, 6]])
si = SimpleImputer(strategy='mean')
si.fit_transform(X)
```
- The nan in the 1st column is filled with the column mean, $(7+2+9)/3 = 6$.
- The nan in the 2nd column is filled with the column mean, $(1+8+6)/3 = 5$.
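Since `fit()` learns the per-column statistics, they can also be inspected on the fitted imputer (continuing the snippet above):

```python
si.statistics_   # array([6., 5.]): the learnt column means used for filling
```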
Transformed feature matrix $\mathbf{X}'$
$$ X' = \begin{bmatrix} 7 & 1 \\ 6 & 8 \\ 2 & 5 \\ 9 & 6 \\ \end{bmatrix} $$

1.1.2. KNNImputer
- Uses the k-nearest neighbors approach to fill missing values in the dataset.
- The missing value of an attribute in a specific example is filled with the mean value of the same attribute of the `n_neighbors` closest neighbors.
- The nearest neighbors are decided based on Euclidean distance, computed over the coordinates that are present in both examples.
Example usage: Consider original feature matrix $\mathbf{X}$
$$ X_{4 \times 3} = \begin{bmatrix} 1 & 2 & \text{nan}\\ 3 & 4 & 3 \\ \text{nan} & 6 & 5 \\ 8 & 8 & 7\\ \end{bmatrix} $$

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
knni = KNNImputer(n_neighbors=2, weights='uniform')
knni.fit_transform(X)
```
Transformed feature matrix $\mathbf{X}'$
$$ X_{4 \times 3}' = \begin{bmatrix} 1 & 2 & 4\\ 3 & 4 & 3 \\ 5.5 & 6 & 5 \\ 8 & 8 & 7\\ \end{bmatrix} $$

1.2 Numeric Transformers
There are three types of numeric transformers:
- Feature scaling
- Polynomial transformation
- Discretization
1.2.1. Feature scaling
Numerical features with different scales lead to slower convergence of iterative optimization procedures like gradient descent, mini-batch gradient descent and stochastic gradient descent. Sklearn provides the following APIs for feature scaling:
- StandardScaler
- MaxAbsScaler
- MinMaxScaler
StandardScaler: Transforms the original feature vector $x$ into a new feature vector $x'$ using the formula

$$ x' = \frac{x - \mu}{\sigma} $$

Example usage: Consider original feature matrix $\mathbf{X}$

$$ X_{5 \times 1} = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 5 \\ 6 \\ \end{bmatrix} \quad \text{and} \quad \mu = 4, \ \sigma = \sqrt 2 $$

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[4], [3], [2], [5], [6]])
ss = StandardScaler()
ss.fit_transform(X)
```
Transformed feature matrix $\mathbf{X}'$
$$ X_{5 \times 1}' = \begin{bmatrix} 0 \\ \frac{-1}{\sqrt 2}\\ \frac{-2}{\sqrt 2}\\ \frac{1}{\sqrt 2}\\ \frac{2}{\sqrt 2}\\ \end{bmatrix} \quad \text{and} \quad \mu' = 0, \ \sigma' = 1 $$

MinMaxScaler: Transforms the original feature vector $x$ into a new feature vector $x'$ so that all values fall within the range $[0,1]$:

$$ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} $$

where $x_{\min}$ and $x_{\max}$ are the smallest and largest values of the original feature vector $x$, respectively.
Example usage: Consider original feature matrix $\mathbf{X}$
$$ X_{5 \times 1} = \begin{bmatrix} 15 \\ 2 \\ 5 \\ -1 \\ -5 \\ \end{bmatrix} \quad \text{and} \quad x_{\max} = 15, \ x_{\min} = -5 $$

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[15], [2], [5], [-1], [-5]])
mms = MinMaxScaler()
mms.fit_transform(X)
```
Transformed feature matrix $\mathbf{X}'$
$$ X_{5 \times 1}' = \begin{bmatrix} 1 \\ 0.35 \\ 0.5 \\ 0.2 \\ 0 \\ \end{bmatrix} $$

MaxAbsScaler: Transforms the original feature vector $x$ into a new feature vector $x'$ so that all values fall within the range $[-1,1]$:

$$ x' = \frac{x}{\max(x_{\max}, |x_{\min}|)} $$

where $\max(x_{\max}, |x_{\min}|)$ is the maximum absolute value of the feature.

Example usage: Consider original feature matrix $\mathbf{X}$
$$ X_{5 \times 1} = \begin{bmatrix} 4 \\ 2 \\ 5 \\ -1 \\ -100 \\ \end{bmatrix} \quad \text{and} \quad \max(5, |-100|) = 100 $$

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[4], [2], [5], [-1], [-100]])
mas = MaxAbsScaler()
mas.fit_transform(X)
```
Transformed feature matrix $\mathbf{X}'$
$$ X_{5 \times 1}' = \begin{bmatrix} 0.04 \\ 0.02 \\ 0.05 \\ -0.01 \\ -1 \\ \end{bmatrix} $$

1.2.2. Polynomial Transformation
- Polynomial transformation helps us learn non-linear relationships between features and labels.
- Polynomial transformation generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree, as shown in the sketch below.
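A minimal sketch using sklearn's PolynomialFeatures with two input features and degree 2:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])              # two features: x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2)
poly.fit_transform(X)
# All combinations of degree <= 2: [1, x1, x2, x1^2, x1*x2, x2^2]
# -> [[1., 2., 3., 4., 6., 9.]]
```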
1.2.3. Discretization
KBinsDiscretizer
- Divides a continuous variable into bins.
- One hot encoding or ordinal encoding is further applied to the bin labels.
```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# The original input matrix is not shown in the source; this X is one
# example input consistent with the transformed output below.
X = np.array([[0], [1], [2], [3], [4], [5], [6], [7], [8]])

kbd = KBinsDiscretizer(n_bins=5, strategy='uniform', encode='ordinal')
kbd.fit_transform(X)
```
With strategy='uniform', the entire range is split into 5 equal-width bins, and each value is replaced by the ordinal label of its bin.
$$ X_{9 \times 1}' = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 2 \\ 3 \\ 3 \\ 4 \\ 4 \\ \end{bmatrix} $$
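With encode='onehot-dense' instead of 'ordinal', each example is returned as a 5-dimensional one-hot vector indicating its bin (continuing the assumed input above):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0], [1], [2], [3], [4], [5], [6], [7], [8]])  # same assumed input as above
kbd = KBinsDiscretizer(n_bins=5, strategy='uniform', encode='onehot-dense')
kbd.fit_transform(X)   # e.g. the row for x = 4 (bin 2) becomes [0., 0., 1., 0., 0.]
```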