Background:
This is the first assignment from a one-week hands-on data practice course.
1. Data import
1.1 Import the required packages
import numpy as np
import pandas as pd
1.2 Load the data to be processed
data = pd.read_csv(r'/Users/bf/mlearning_task/data_set/data.csv')
2. Data inspection
2.1 head()
data.head()
2.2 info()
data.info()
2.3 describe()
data.describe()
2.4 Check the bad/good distribution
data['status'].value_counts()
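Class imbalance affects later modeling choices, so it can help to look at proportions as well as raw counts. The following is a minimal sketch on a made-up toy frame (the values are illustrative and not taken from data.csv), assuming status is coded as 0/1:

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({"status": [0, 0, 0, 1]})

counts = toy["status"].value_counts()                 # raw count per class
ratios = toy["status"].value_counts(normalize=True)   # share of each class

print(counts)
print(ratios)
```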
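3. Removing uninformative features
3.1 Dropping non-modeling features
Unnamed: 0 appears to be the index column left over from an earlier export, custid is the customer ID, trade_no is the transaction serial number, bank_card_no is the card number, source takes a single value, and id_name is the customer's name. None of these fields carries information for modeling, so they are dropped as a first step.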
data = data.drop(['Unnamed: 0', 'custid','trade_no','bank_card_no','source','id_name'],axis=1)
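One way to spot such columns without reading through them by hand is to count distinct values per column. This is only a sketch on a made-up toy frame (the amount column and all values are hypothetical): a column with one distinct value is constant, and a column with as many distinct values as rows behaves like an identifier.

```python
import pandas as pd

# Toy stand-in; "amount" and all values are hypothetical.
toy = pd.DataFrame({
    "custid": [101, 102, 103, 104],
    "source": ["xs", "xs", "xs", "xs"],
    "amount": [10.0, 12.5, 10.0, 8.0],
})

n_unique = toy.nunique()  # distinct non-missing values per column

# Constant columns cannot separate good from bad samples.
constant_cols = n_unique[n_unique == 1].index.tolist()
# Columns that are unique per row look like identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print(constant_cols, id_like_cols)
```

A continuous numeric feature can also be unique per row, so the identifier check is a hint to review rather than a rule for automatic deletion.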
3.2 Dropping features with too many missing values
student_feature has a missing rate of 63.06%, which is too high, so this variable is not used.
data = data.drop(['student_feature'],axis=1)
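A per-column missing rate like the one quoted above can be computed with isnull().mean(). The sketch below uses a made-up toy frame, and the 0.6 cutoff is a hypothetical threshold chosen for illustration, not a value from the assignment:

```python
import numpy as np
import pandas as pd

# Toy stand-in; the values are made up for illustration.
toy = pd.DataFrame({
    "student_feature": [1.0, np.nan, np.nan, np.nan],
    "amount": [10.0, 12.5, np.nan, 8.0],
})

# Fraction of missing entries per column, highest first.
missing_rate = toy.isnull().mean().sort_values(ascending=False)

threshold = 0.6  # hypothetical cutoff
to_drop = missing_rate[missing_rate > threshold].index.tolist()
toy = toy.drop(columns=to_drop)

print(missing_rate)
print(to_drop)
```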
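4. Missing value handling
4.1 Separating variable types
1) Categorical variables
# .copy() makes this an independent frame, so later column assignments do not trigger SettingWithCopyWarning
data_classify = data[['reg_preference_for_trad','latest_query_time','loans_latest_time']].copy()
2) Numeric variables
data_numeric = data.drop(['reg_preference_for_trad','latest_query_time','loans_latest_time'],axis=1)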
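The column lists above are written out by hand. As an alternative sketch, the split can be derived from the dtypes with select_dtypes. This assumes the categorical and date columns were read as non-numeric (string) columns; the toy frame and its values are made up:

```python
import pandas as pd

# Toy stand-in; the values are made up for illustration.
toy = pd.DataFrame({
    "reg_preference_for_trad": ["city_a", "city_b", None],
    "latest_query_time": ["2018-04-25", "2018-05-03", None],
    "amount": [10.0, 12.5, 8.0],
})

# Everything that is not numeric is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

toy_classify = toy[categorical_cols].copy()
toy_numeric = toy[numeric_cols].copy()

print(categorical_cols, numeric_cols)
```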
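4.2 Filling missing values:
1) Numeric variables are filled with the column mean
data_numeric = data_numeric.fillna(data_numeric.mean())
2) Categorical variables are filled with the next valid value (backward fill)
# bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas releases; missing values in the last rows stay unfilled
data_classify = data_classify.bfill()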
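To make the two fill strategies concrete, here is a small sketch on made-up toy data. It also shows that a backward fill leaves a missing value in the last row untouched. The extra ffill() at the end is an assumption added here as one possible fallback and is not part of the original steps:

```python
import numpy as np
import pandas as pd

# Toy stand-ins; column names and values are made up for illustration.
num = pd.DataFrame({"amount": [10.0, np.nan, 20.0]})
cat = pd.DataFrame({"city": ["a", None, "b", None]})

# Mean fill: the NaN becomes the mean of 10 and 20.
num_filled = num.fillna(num.mean())

# Backward fill: row 1 takes the next valid value "b"; the last row has
# no later value, so it stays missing.
cat_bfilled = cat.bfill()

# Hypothetical fallback: fill what is left from the previous valid value.
cat_filled = cat_bfilled.ffill()

print(num_filled)
print(cat_bfilled)
print(cat_filled)
```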
5. Converting categorical variables to numeric
5.1 reg_preference_for_trad (city variable): one-hot encoding
dummies = pd.get_dummies(data_classify['reg_preference_for_trad'],prefix='reg_preference_for_trad')
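As an illustration of what get_dummies produces, the sketch below encodes a made-up column (the category labels city_a and city_b are hypothetical): one indicator column per distinct value, named with the given prefix, and exactly one indicator set per row.

```python
import pandas as pd

# Toy stand-in; the category labels are hypothetical.
toy = pd.DataFrame({"reg_preference_for_trad": ["city_a", "city_b", "city_a"]})

toy_dummies = pd.get_dummies(
    toy["reg_preference_for_trad"],
    prefix="reg_preference_for_trad",
)

print(toy_dummies.columns.tolist())
print(toy_dummies)
```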
5.2 latest_query_time/loans_latest_time (time variables)
# Convert to datetime
data_classify['latest_query_time'] = pd.to_datetime(data_classify['latest_query_time'])
data_classify['loans_latest_time'] = pd.to_datetime(data_classify['loans_latest_time'])
# Month
data_classify['latest_query_time_month'] = data_classify['latest_query_time'].dt.month
data_classify['loans_latest_time_month'] = data_classify['loans_latest_time'].dt.month
# Day of week (0 = Monday, 6 = Sunday)
data_classify['latest_query_time_week'] = data_classify['latest_query_time'].dt.weekday
data_classify['loans_latest_time_week'] = data_classify['loans_latest_time'].dt.weekday
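The sketch below runs the same extraction on a few made-up dates (they are illustrative and not taken from data.csv). It shows that .dt.weekday returns the day of the week with Monday as 0, and that a missing date stays missing in the derived columns, which also turns them into floats:

```python
import pandas as pd

# Toy stand-in; the dates are made up for illustration.
toy = pd.DataFrame({"latest_query_time": ["2018-04-25", "2018-05-03", None]})

ts = pd.to_datetime(toy["latest_query_time"])  # None becomes NaT

toy["latest_query_time_month"] = ts.dt.month
toy["latest_query_time_week"] = ts.dt.weekday  # 0 = Monday, 6 = Sunday

# 2018-04-25 is a Wednesday (2) and 2018-05-03 is a Thursday (3).
print(toy)
```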
6. Merging the data
6.1 Merging the categorical variables
# Append the one-hot columns to the categorical frame
data_classify = data_classify.join(dummies)
6.2 Merging categorical and numeric variables
data = pd.concat([data_classify,data_numeric],axis=1)
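After this merge the frame still contains the raw city column and the raw datetime columns next to their encoded versions. If the goal is a purely numeric matrix for modeling, one possible follow-up is to drop the raw columns and check that nothing non-numeric remains. This is a sketch under that assumption, on made-up toy frames that mirror the variable names above:

```python
import pandas as pd

# Toy stand-ins mirroring the frames above; all values are made up.
data_classify = pd.DataFrame({
    "reg_preference_for_trad": ["city_a", "city_b"],
    "latest_query_time": pd.to_datetime(["2018-04-25", "2018-05-03"]),
    "latest_query_time_month": [4, 5],
})
dummies = pd.get_dummies(
    data_classify["reg_preference_for_trad"],
    prefix="reg_preference_for_trad",
)
data_numeric = pd.DataFrame({"amount": [10.0, 12.5], "status": [0, 1]})

# Raw columns that have already been encoded (assumed no longer needed).
raw_cols = ["reg_preference_for_trad", "latest_query_time"]

features = pd.concat(
    [data_classify.drop(columns=raw_cols).join(dummies), data_numeric],
    axis=1,
)

# Sanity check: list any columns that are neither numeric nor boolean.
non_numeric = features.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(features.columns.tolist())
print(non_numeric)
```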