1. Installing the third-party library (full of pitfalls)
First, some notes on problems encountered while installing sklearn:
1. After installing with pip install sklearn, calling the library raised a version-mismatch error. (Note: the actual package name on PyPI is scikit-learn; pip install scikit-learn is the correct command, as sklearn is only a deprecated alias.)
(I found many suggested fixes online: a version mismatch, installing Anaconda3, and so on. In the end it only worked after installing some additional third-party packages.)
2. Install numpy and scipy first, then the mkl package, and finally update scikit-learn.
3. That resolved the problem.
2. Data normalization
Normalization: rescaling the data into a fixed range (MinMaxScaler defaults to [0, 1]).
1. First import the third-party packages and pick an arbitrary dataset:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
2. Then create the normalization object and normalize the data (converting to a DataFrame is optional here):
t = pd.DataFrame(data)  # optional: a DataFrame works just as well as a list of lists
scaler = MinMaxScaler()
result = scaler.fit_transform(data)
print(result)
print(scaler.inverse_transform(result))  # undo the scaling
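MinMaxScaler also accepts a feature_range parameter to scale into a range other than [0, 1]. A minimal sketch on the same toy data (the target range (5, 10) is just an arbitrary example):

```python
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

# Scale into [5, 10] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(5, 10))
result = scaler.fit_transform(data)
print(result)        # every value now lies in [5, 10]
print(result.min())  # 5.0
print(result.max())  # 10.0
```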
3. Data standardization (mean 0, variance 1)
Note: standardization centers and rescales each column; it does not by itself make the data normally distributed.
1. First import the packages and pick a dataset (same as above):
from sklearn.preprocessing import StandardScaler
import pandas as pd
data = [[-1., 1.3], [-0.5, 6], [0, 10], [1, 18]]
2. Then create the standardization object and standardize the data (again, converting to a DataFrame is optional):
data = pd.DataFrame(data)
scaler = StandardScaler()
scaled = scaler.fit_transform(data)
print(scaled)
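The result can be checked numerically: after StandardScaler, each column should have mean 0 and variance 1. A small sketch of that check:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = [[-1.0, 1.3], [-0.5, 6], [0, 10], [1, 18]]
scaled = StandardScaler().fit_transform(data)

# Each column now has mean ~0 and variance ~1
print(scaled.mean(axis=0))  # close to [0, 0]
print(scaled.var(axis=0))   # close to [1, 1]
```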
4. Handling missing values (mean, median, or a specific value)
1. First import the packages and pick a dataset (same as above):
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np
data = [[-1., 1.3], [-0.5, 6], [0, 10], [1, 18]]
2. Then create the imputer and fill in the missing values (SimpleImputer defaults to strategy='mean'):
data = pd.DataFrame(data)
data.iloc[1:2, 1] = np.nan   # introduce a missing value
imp_mean = SimpleImputer()   # default: fill with the column mean
imp = imp_mean.fit_transform(data)
print(imp)
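The section heading also mentions median and specific-value imputation; those are selected with SimpleImputer's strategy parameter. A sketch on the same data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.DataFrame([[-1.0, 1.3], [-0.5, 6], [0, 10], [1, 18]])
data.iloc[1, 1] = np.nan  # introduce a missing value

# Fill with the column median
imp_median = SimpleImputer(strategy='median').fit_transform(data)
# Fill with a specific constant value
imp_const = SimpleImputer(strategy='constant', fill_value=0).fit_transform(data)

print(imp_median)  # NaN replaced by the median of [1.3, 10, 18] = 10
print(imp_const)   # NaN replaced by 0
```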
5. Dropping missing data
1. First import the packages and pick a dataset (same as above):
import pandas as pd
import numpy as np
data = [[-1., 1.3], [-0.5, 6], [0, 10], [1, 18]]
2. Convert to a DataFrame, then set some entries to nan (the default missing-value marker):
data = pd.DataFrame(data)
data.iloc[1:2, 1] = np.nan
3. Then call the dropna method to drop the rows containing missing values:
test = data.dropna(axis=0, inplace=False)
print(test)
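dropna can also drop by column instead of by row via the axis argument. A short sketch contrasting the two (the column names 'a' and 'b' are just illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[-1.0, 1.3], [-0.5, np.nan], [0, 10], [1, 18]],
                    columns=['a', 'b'])

# axis=0 drops rows that contain NaN; axis=1 drops columns instead
drop_rows = data.dropna(axis=0)
drop_cols = data.dropna(axis=1)

print(drop_rows.shape)  # (3, 2): one row removed
print(drop_cols.shape)  # (4, 1): column 'b' removed
```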
6. Converting categorical data to numeric labels (handle missing values beforehand)
1. First import the packages and load the data:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
data = pd.read_csv(r'd:\\data_\\data.csv')
2. Then select the label column and fit the encoder:
y = data.iloc[:, -1]
le = LabelEncoder()
le = le.fit(y)
3. You can inspect how many distinct classes there are, then transform the data:
print(le.classes_)
label = le.transform(y)
4. Assign the encoded labels back to the original data and print:
data.iloc[:, -1] = label
print(data.head())
5. Of course, this can also be done in one step:
data.iloc[:, -1] = LabelEncoder().fit_transform(data.iloc[:, -1])
print(data.head())
OrdinalEncoder works similarly, but encodes several feature columns at once (note it must be imported from sklearn.preprocessing; data_ here denotes the feature DataFrame):
t = OrdinalEncoder().fit(data_.iloc[:, 1:-1]).categories_
data_.iloc[:, 1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:, 1:-1])
print(data_.head())
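Since the CSV above is not available here, the LabelEncoder round trip can be illustrated on a small hand-made label list (the label values are invented for the example). classes_ are stored in sorted order, and inverse_transform recovers the original strings:

```python
from sklearn.preprocessing import LabelEncoder

y = ['no', 'yes', 'yes', 'unknown', 'no']
le = LabelEncoder().fit(y)

print(le.classes_)  # ['no' 'unknown' 'yes'] -- sorted alphabetically
codes = le.transform(y)
print(codes)        # [0 2 2 1 0]
print(le.inverse_transform(codes))  # back to the original labels
```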
7. Converting categories to dummy variables (the columns must first be numeric-compatible; each dummy is a feature)
1. First import the libraries and the data processed as above:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
data = pd.read_csv(r'd:\\data_\\data.csv')
data = data.dropna(axis=0, inplace=False)
data_1 = data.copy()
data_1.iloc[:, -1] = LabelEncoder().fit_transform(data_1.iloc[:, -1])
2. Create the OneHotEncoder class and perform the dummy-variable conversion:
X = data_1.iloc[:, 1:-1]
enc = OneHotEncoder(categories='auto').fit(X)
print(enc.get_feature_names())  # renamed get_feature_names_out() in newer scikit-learn versions
result = enc.transform(X).toarray()
# note: after dropna, resetting the index may be needed so this concat aligns rows correctly
data_1 = pd.concat([data_1, pd.DataFrame(result)], axis=1)
data_1.drop(["Sex", "Embarked"], axis=1, inplace=True)
data_1.columns = ["Age", "Survived", "Female", "Male", "Embarked_C", "Embarked_Q", "Embarked_S"]
print(result)
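Without the CSV file, the same idea can be sketched on a tiny invented frame with the same two categorical columns. Two Sex categories plus three Embarked categories give five dummy columns, and each row has exactly one 1 per original feature:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# invented toy data standing in for the CSV columns
X = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],
                  'Embarked': ['S', 'C', 'Q', 'S']})

enc = OneHotEncoder(categories='auto').fit(X)
result = enc.transform(X).toarray()  # transform() returns a sparse matrix
print(result.shape)                  # (4, 5): 2 sex + 3 embarkation dummies
print(result.sum(axis=1))            # each row sums to 2: one dummy per feature
```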
8. Binarizing data
1. Import the packages and the data:
import pandas as pd
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
data = pd.read_csv(r'd:\\data_\\data.csv')
2. Perform the other preprocessing on the data:
data = data.dropna(axis=0, inplace=False)
data_1 = data.copy()
data_1.iloc[:, 1:-1] = OrdinalEncoder().fit_transform(data_1.iloc[:, 1:-1])
data_1.iloc[:, -1] = LabelEncoder().fit_transform(data_1.iloc[:, -1])
3. Apply the binary threshold:
X = data_1.iloc[:, 0].values.reshape(-1, 1)
transformer = Binarizer(threshold=30).fit_transform(X)
data_1.iloc[:, 0] = transformer
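Binarizer on its own is easy to verify with a small invented column of ages: values above the threshold become 1, the rest become 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

ages = np.array([22, 38, 26, 35, 54], dtype=float).reshape(-1, 1)

# Values above the threshold become 1, the rest 0
binary = Binarizer(threshold=30).fit_transform(ages)
print(binary.ravel())  # [0. 1. 0. 1. 1.]
```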
Another option: KBinsDiscretizer, which bins continuous data into intervals. Its key parameters:
(1) encode: 'onehot' produces dummy variables; 'ordinal' encodes each feature's bin as an integer
(2) strategy: 'uniform' gives equal-width bins; 'quantile' gives equal-frequency bins; 'kmeans' places bins by k-means clustering
1. Import the packages and the data:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import OrdinalEncoder
data = pd.read_csv(r'd:\\data_\\data.csv')
2. Process the data:
data = data.dropna(axis=0, inplace=False)
data_1 = data.copy()
data_1.iloc[:, 1:-1] = OrdinalEncoder().fit_transform(data_1.iloc[:, 1:-1])
data_1.iloc[:, -1] = LabelEncoder().fit_transform(data_1.iloc[:, -1])
3. Create the discretizer (pick one encoding; the second assignment below overrides the first):
X = data_1.iloc[:, 0].values.reshape(-1, 1)
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
est = KBinsDiscretizer(n_bins=3, encode='onehot', strategy='uniform')
4. Transform the data:
print(est.fit_transform(X).toarray())  # .toarray() is for the sparse 'onehot' output; 'ordinal' already returns a dense array
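The uniform strategy is easiest to see on a small invented column: with n_bins=3 over values spanning [1, 22], the bin edges are at 1, 8, 15, 22, and ordinal encoding labels each value with its bin index.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([1, 2, 3, 10, 11, 12, 20, 21, 22], dtype=float).reshape(-1, 1)

# Three equal-width bins over [1, 22]: [1, 8), [8, 15), [15, 22]
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
binned = est.fit_transform(X)
print(binned.ravel())     # [0. 0. 0. 1. 1. 1. 2. 2. 2.]
print(est.bin_edges_[0])  # [ 1.  8. 15. 22.]
```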
Addendum:
1. When the dataset is too large to fit in one call, fit() cannot process it directly, so use partial_fit() to fit the scaler incrementally, then transform:
scaler.partial_fit(data)
result = scaler.transform(data)
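The idea above can be sketched end to end: call partial_fit once per chunk, then transform the whole dataset. The chunk size and random data here are arbitrary choices for illustration; StandardScaler accumulates running statistics across the chunks, so the result matches a single full fit.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(10_000, 3))

# Feed the data in chunks instead of all at once
scaler = StandardScaler()
for start in range(0, len(data), 1000):
    scaler.partial_fit(data[start:start + 1000])

result = scaler.transform(data)
print(result.mean(axis=0))  # ~0 in every column
print(result.std(axis=0))   # ~1 in every column
```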