
Modeling: Data Preprocessing

2024-11-21 Source: 个人技术集锦

Background:
I am taking part in a one-week hands-on data practice course; this post is the first assignment.

1. Data import
1.1 Import the required packages

import numpy as np
import pandas as pd

1.2 Load the data to be processed

data = pd.read_csv(r'/Users/bf/mlearning_task/data_set/data.csv')

2. Inspecting the data
2.1 head()

data.head()

2.2 info()

data.info()

2.3 describe()

data.describe()

2.4 Check the bad/good distribution

data['status'].value_counts()
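Absolute counts alone can hide how imbalanced the two classes are, so it may help to look at proportions as well. A minimal sketch on a made-up frame; only the column name `status` comes from the post, and the values are invented for illustration:

```python
import pandas as pd

# Toy stand-in for the real data set; the values are made up.
toy = pd.DataFrame({"status": [0, 0, 0, 1]})

counts = toy["status"].value_counts()                # rows per class
ratios = toy["status"].value_counts(normalize=True)  # share of each class

print(counts)
print(ratios)
```

If one class turns out to be much rarer than the other, that is worth keeping in mind when the data is later split and evaluated.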

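3. Dropping unusable features
3.1 Dropping non-modeling features
Unnamed: 0 appears to be the index column left over from an earlier export, custid is the customer ID, trade_no is the transaction serial number, bank_card_no is the card number, source takes a single value, and id_name is the customer's name. None of these fields carries information useful for modeling, so they are dropped as a first pass.

data = data.drop(['Unnamed: 0', 'custid', 'trade_no', 'bank_card_no', 'source', 'id_name'], axis=1)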
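One way to spot such columns without reading them by eye is to count distinct values per column. The sketch below is an illustration on invented data, not the check used in the post: the column names are borrowed from the list above, while the values and the helper names `constant_cols` and `id_like_cols` are hypothetical.

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "custid": [101, 102, 103, 104],      # one distinct value per row
    "source": ["xs", "xs", "xs", "xs"],  # a single value everywhere
    "status": [0, 1, 0, 0],
})

n_unique = toy.nunique()

# Columns with one distinct value cannot separate the classes.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with as many distinct values as rows behave like identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print(constant_cols, id_like_cols)
```

A continuous numeric feature can also have one distinct value per row, so the second list is only a set of candidates to review by hand.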

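3.2 Dropping features with too many missing values
The student_feature field has a missing rate of 63.06%, which is too high, so this variable is not used.

data = data.drop(['student_feature'], axis=1)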
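A per-column missing rate like the one quoted above can be computed with `isnull().mean()`. The following is a small sketch on invented data; the 0.6 cut-off and the column `other` are assumptions made for the example, not values taken from the post.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "student_feature": [1.0, np.nan, np.nan, np.nan],
    "other": [1.0, 2.0, np.nan, 4.0],
})

# Fraction of missing values per column, highest first.
missing_rate = toy.isnull().mean().sort_values(ascending=False)

threshold = 0.6  # hypothetical cut-off, chosen only for this example
to_drop = missing_rate[missing_rate > threshold].index.tolist()

toy = toy.drop(columns=to_drop)
print(missing_rate)
print(to_drop)
```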

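4. Handling missing values
4.1 Separating variable types
1) Categorical variables

# .copy() so that new columns can be added later without a SettingWithCopyWarning
data_classify = data[['reg_preference_for_trad', 'latest_query_time', 'loans_latest_time']].copy()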

2) Numeric variables

data_numeric = data.drop(['reg_preference_for_trad','latest_query_time','loans_latest_time'],axis=1)
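The split above names the three non-numeric columns by hand. As an alternative, the same split could be derived from the column dtypes. This is a sketch under the assumption that every non-numeric column should be treated as categorical; the frame, the `amount` column, and the `toy_*` names are invented for the example.

```python
import pandas as pd

# Toy frame; the values and the `amount` column are made up.
toy = pd.DataFrame({
    "reg_preference_for_trad": ["A", "B", None],
    "latest_query_time": ["2018-04-25", None, "2018-05-03"],
    "amount": [1.0, 2.0, 3.0],
    "status": [0, 1, 0],
})

# Numeric columns on one side, everything else on the other.
toy_numeric = toy.select_dtypes(include="number").copy()
toy_classify = toy.select_dtypes(exclude="number").copy()

print(toy_classify.columns.tolist())
print(toy_numeric.columns.tolist())
```

Listing the columns explicitly, as the post does, is easier to audit; the dtype-based version is mainly convenient when the column list is long.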

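4.2 Filling missing values:
1) Numeric variables are filled with the column mean

# numeric_only=True skips any non-numeric column instead of raising an error
data_numeric = data_numeric.fillna(data_numeric.mean(numeric_only=True))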

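2) Categorical variables are filled with the next valid value (backward fill)

# bfill() replaces the deprecated fillna(method='bfill'); the behavior is the same
data_classify = data_classify.bfill()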
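The two fill strategies behave differently at the edges, which is easy to see on a tiny example. The sketch below uses invented data: mean filling replaces every gap, while a backward fill leaves a trailing gap untouched because there is no later value to copy from. Chaining a forward fill afterwards is one possible way to close that gap; it is an addition for illustration, not something the post does.

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up for illustration.
num = pd.DataFrame({"x": [1.0, np.nan, 3.0]})
cat = pd.DataFrame({"city": ["A", None, "B", None]})

# Mean fill: the gap becomes (1.0 + 3.0) / 2 = 2.0.
num_filled = num.fillna(num.mean())

# Backward fill: each gap takes the next valid value below it.
cat_bfill = cat.bfill()

# The last row has nothing below it, so it is still missing here.
remaining = int(cat_bfill["city"].isnull().sum())

# Optional follow-up: a forward fill closes the trailing gap.
cat_filled = cat.bfill().ffill()

print(num_filled)
print(cat_bfill)
print(cat_filled)
```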

5. Converting categorical variables to numeric
5.1 reg_preference_for_trad (city variable): one-hot encoding

dummies = pd.get_dummies(data_classify['reg_preference_for_trad'],prefix='reg_preference_for_trad')
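To make the shape of the result concrete, here is the same call on a three-row toy column. The category labels "A" and "B" are placeholders, since the post does not list the real values, and `dtype=int` is an optional addition that forces 0/1 integers (recent pandas versions return booleans by default).

```python
import pandas as pd

# Toy column; "A" and "B" are placeholder category labels.
toy = pd.DataFrame({"reg_preference_for_trad": ["A", "B", "A"]})

toy_dummies = pd.get_dummies(
    toy["reg_preference_for_trad"],
    prefix="reg_preference_for_trad",
    dtype=int,  # 0/1 integers instead of booleans
)

print(toy_dummies)
```

Each distinct category becomes its own column, named with the prefix followed by the category label.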

5.2 latest_query_time / loans_latest_time (time variables)

# Convert the date strings to datetime
data_classify['latest_query_time'] = pd.to_datetime(data_classify['latest_query_time'])
data_classify['loans_latest_time'] = pd.to_datetime(data_classify['loans_latest_time'])
# Month (1-12)
data_classify['latest_query_time_month'] = data_classify['latest_query_time'].dt.month
data_classify['loans_latest_time_month'] = data_classify['loans_latest_time'].dt.month
# Day of the week (Monday=0 ... Sunday=6); the _week suffix is kept from the original column names
data_classify['latest_query_time_week'] = data_classify['latest_query_time'].dt.weekday
data_classify['loans_latest_time_week'] = data_classify['loans_latest_time'].dt.weekday
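A short check on two known dates shows what these accessors return. The dates below are made up for the example and are not taken from the data set; note that `dt.weekday` gives the day of the week with Monday as 0, not a week number.

```python
import pandas as pd

# Toy dates, invented for illustration.
toy = pd.DataFrame({"latest_query_time": ["2018-04-25", "2018-05-03"]})

toy["latest_query_time"] = pd.to_datetime(toy["latest_query_time"])
toy["latest_query_time_month"] = toy["latest_query_time"].dt.month
toy["latest_query_time_week"] = toy["latest_query_time"].dt.weekday

# 2018-04-25 was a Wednesday (2) and 2018-05-03 a Thursday (3).
print(toy)
```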

6. Merging the data
6.1 Merge the categorical variables

# join aligns on the row index, so dummies lines up with data_classify
data_classify = data_classify.join(dummies)

6.2 Merge the categorical and numeric variables

data = pd.concat([data_classify,data_numeric],axis=1)
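`pd.concat` with `axis=1` lines rows up by index label rather than by position, so the merge is only safe when both frames still share the original index. The sketch below illustrates this on invented frames; `left`, `right`, and `right_filtered` are hypothetical names.

```python
import pandas as pd

# Toy frames sharing the same index; the values are made up.
left = pd.DataFrame({"a": [1, 2, 3]})
right = pd.DataFrame({"b": [10.0, 20.0, 30.0]})

# Same index on both sides: rows pair up one to one.
merged = pd.concat([left, right], axis=1)

# If one side lost a row earlier, the missing label turns into NaN.
right_filtered = right[right["b"] > 10]
merged_gap = pd.concat([left, right_filtered], axis=1)

print(merged)
print(merged_gap)
```

In the pipeline above neither frame is filtered or re-sorted after the split, so the indexes should still match; a quick look at `data.shape` and `data.isnull().sum()` after the merge would confirm it.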