Preprocessing data

Standardization of datasets is a common requirement for many machine learning methods; learning algorithms generally benefit from standardized data. The sklearn.preprocessing module provides several commonly used utility functions and transformer classes to convert raw feature vectors into a representation better suited for downstream machine learning methods.

Standardization methods and functions

  • In practice we often ignore the shape of the distribution and simply center the data by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
    Zero-mean normalization, also called z-score standardization, transforms the data to have mean 0 and standard deviation 1. The formula is:
    $$x^*=\frac{x-\bar{x}}{\sigma}$$
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X_train)

print(X_scaled)
print('----'*10)
print((X_train-np.mean(X_train,axis=0))/np.std(X_train,axis=0))
[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
----------------------------------------
[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
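As a quick sanity check (not part of the original post), each column of the scaled array should indeed come out with zero mean and unit standard deviation:

```python
from sklearn import preprocessing
import numpy as np

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_scaled = preprocessing.scale(X_train)

# Per-column mean should be ~0 and per-column std should be 1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```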

StandardScaler

StandardScaler implements the Transformer API: it computes the mean and standard deviation on a training set so that the same transformation can later be re-applied to the test set.

scaler = preprocessing.StandardScaler().fit(X_train)
print(scaler.mean_)
print('----'*10)
print(scaler.var_)
print('----'*10)
print(scaler.scale_)
print('----'*10)
print(scaler.transform(X_train))
[1.         0.         0.33333333]
----------------------------------------
[0.66666667 0.66666667 1.55555556]
----------------------------------------
[0.81649658 0.81649658 1.24721913]
----------------------------------------
[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
X_test = [[-1., 1., 0.]]

print(scaler.transform(X_test))  # standardize the new test data with the training set's statistics (mean, std), so the test data matches the training data's scale
print('----'*10)
print((X_test-np.mean(X_train,axis=0))/np.std(X_train,axis=0))
[[-2.44948974  1.22474487 -0.26726124]]
----------------------------------------
[[-2.44948974  1.22474487 -0.26726124]]
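In a full workflow this fit-on-train / transform-on-test pattern is usually handled by a Pipeline. Below is a minimal sketch, not from the original post: the toy labels y_train and the LogisticRegression estimator are assumptions purely for illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
y_train = np.array([0, 1, 0])  # hypothetical toy labels for illustration

# The pipeline fits the scaler on the training data only, then automatically
# applies the same mean/std transform to any data passed to predict.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
pred = pipe.predict([[-1., 1., 0.]])
print(pred)
```

This avoids the common mistake of fitting the scaler on the test set (or on train and test together), which would leak test-set statistics into the model.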

MinMaxScaler and MaxAbsScaler

  • These scalers map features into a fixed range; MaxAbsScaler in particular is well suited to scaling sparse data.
    MinMaxScaler: min-max normalization (deviation standardization), which rescales each feature to [0, 1]:
    $$x^*=\frac{x-min(x)}{max(x)-min(x)}$$
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)
print('----'*10)
print((X_train-np.min(X_train,axis=0))/((np.max(X_train,axis=0))-np.min(X_train,axis=0)))
[[0.5        0.         1.        ]
 [1.         0.5        0.33333333]
 [0.         1.         0.        ]]
----------------------------------------
[[0.5        0.         1.        ]
 [1.         0.5        0.33333333]
 [0.         1.         0.        ]]

The transform method can likewise apply the same mapping to other data:

X_test = np.array([[-3., -1.,  4.]])
X_test_minmax = min_max_scaler.transform(X_test)
print(X_test_minmax)
print('----'*10)
print((X_test-np.min(X_train,axis=0))/((np.max(X_train,axis=0))-np.min(X_train,axis=0)))
[[-1.5         0.          1.66666667]]
----------------------------------------
[[-1.5         0.          1.66666667]]
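MinMaxScaler also accepts a feature_range parameter to target an interval other than the default [0, 1]. A small sketch with feature_range=(-1, 1):

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# feature_range rescales each column to the given interval instead of [0, 1]:
# each column minimum maps to -1 and each column maximum maps to 1.
scaler = MinMaxScaler(feature_range=(-1, 1))
X_range = scaler.fit_transform(X_train)
print(X_range)
```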

MaxAbsScaler

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
print(X_train/np.max(np.abs(X_train),axis=0))
[[ 0.5 -1.   1. ]
 [ 1.   0.   0. ]
 [ 0.   1.  -0.5]]
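Because MaxAbsScaler only divides each column by its maximum absolute value and never centers the data, zero entries stay zero, which is why it is the recommended choice for sparse input. A sketch using a scipy.sparse matrix (the CSR input here is my addition, not from the original post):

```python
from sklearn.preprocessing import MaxAbsScaler
from scipy import sparse
import numpy as np

X = sparse.csr_matrix([[1., -1., 2.],
                       [2., 0., 0.],
                       [0., 1., -1.]])

# Dividing by a per-column constant preserves the sparsity structure:
# the result is still a sparse matrix with the same zero pattern.
X_scaled = MaxAbsScaler().fit_transform(X)
print(X_scaled.toarray())
```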

RobustScaler

Scaling data with outliers: RobustScaler standardizes using statistics that are robust to outliers, namely the median and the interquartile range (IQR).

from sklearn.preprocessing import RobustScaler
X = [[ 1., -2.,  2.],
     [ -2.,  1.,  3.],
     [ 4.,  1., -2.]]
transformer = RobustScaler().fit(X)
print(transformer.transform(X))
[[ 0.  -2.   0. ]
 [-1.   0.   0.4]
 [ 1.   0.  -1.6]]
Q1 = np.quantile(X, 0.25, axis=0)
Q2 = np.quantile(X, 0.50, axis=0)
Q3 = np.quantile(X, 0.75, axis=0)

# Manual check: subtract the median, divide by the IQR
print((X - Q2)/(Q3 - Q1))
[[ 0.  -2.   0. ]
 [-1.   0.   0.4]
 [ 1.   0.  -1.6]]
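To see why the median/IQR matters, consider a column with one extreme outlier. This comparison is my own sketch (the data is made up): the outlier inflates the mean and standard deviation, so StandardScaler squeezes the inliers together, while RobustScaler leaves them well spread.

```python
from sklearn.preprocessing import StandardScaler, RobustScaler
import numpy as np

# Four inliers and one extreme outlier in a single feature.
X = np.array([[1.], [2.], [3.], [4.], [100.]])

std_scaled = StandardScaler().fit_transform(X)
rob_scaled = RobustScaler().fit_transform(X)

# StandardScaler: the outlier dominates mean/std, compressing the inliers.
# RobustScaler: median=3, IQR=2, so inliers map to [-1, -0.5, 0, 0.5].
print(std_scaled.ravel())
print(rob_scaled.ravel())
```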
