
Supervised Learning Regression Explained, with a Boston House-Price Prediction Exercise



Fair warning: this is a long read, so please be patient.

Hi everyone, I'm W.

Foreword: Why does this article have such an odd title? Mainly because this is how machine learning itself is categorized, hence the name. Back to the topic: this time we'll study the regression algorithms of machine learning. Before getting to regression proper, we'll go over the broad categories of machine learning and the principle behind linear regression, walk through its API in detail, and finally use sklearn's Boston housing data for a house-price prediction exercise.

Article outline: machine learning categories, how linear regression works, the LinearRegression API in detail, supporting APIs in detail, and a hands-on Boston house-price prediction.

Machine Learning Categories

[Figure: a classification of machine learning algorithms]

How to choose a machine learning model

Which machine learning model to choose depends on your actual needs. Before solving a problem, two questions must be answered:

  1. What is the goal of using machine learning, and which tasks need to be done?
  2. What does the data to be analyzed look like?

Put plainly: pick the model according to your goal, and before getting to work, get to know the dataset and form a rough plan for handling its features.

  1. Consider the algorithm's goal
    1. If you need to predict a target value, consider a supervised learning algorithm; otherwise consider an unsupervised one
    2. Given supervised learning, if the target values are discrete, consider a classification algorithm; otherwise consider a regression algorithm

This choice is not set in stone; some classification algorithms can also be used for regression.

  2. Consider the data
    1. Understand thoroughly what the data means
    2. Think carefully about how to handle missing values

Only with a thorough understanding of the data can you do good feature engineering; otherwise prediction accuracy suffers badly.

How Linear Regression Works

Linear regression is one kind of supervised learning. It builds a model from the training data and uses that model to process and predict test data. Put simply, it finds a line (in higher dimensions, a hyperplane) in feature space that fits the bulk of the data; when test data arrives, the line drawn from the training set determines where the prediction falls.

An example in two dimensions

In Figure 1, a series of real data points sit in a 2D coordinate system. Linear regression's goal is to find the line in this 2D space that best fits the bulk of the data (i.e., a line that can represent most of the points) and use it to estimate new inputs.

For the single-variable linear regression of Figure 1, the price of a house depends on its size, so training the model yields the formula y = w*x1 + bias, where w is the weight, x1 is the house size, and bias is the intercept term.

An example in three dimensions

In Figure 2, in 3D space, linear regression needs to find the plane that best fits the data, so that this plane can represent the bulk of the points.

Likewise, the multivariable linear regression of Figure 2 yields y = w1*x1 + w2*x2 + bias, and so on for higher dimensions.

How to measure fit (computing the loss)

I've used the word "fit" several times above; fit describes how accurate the predictions are relative to the real data. So how do we measure it? With a loss function, one of the most important concepts in machine learning, one that runs through the whole field.

For a linear regression model, we call the distance between the model and the actual data points the error; the smaller the distance, the better the fit. We measure it with the squared-error loss:

Loss(w, bias) = Σᵢ (y_pred⁽ⁱ⁾ − y_true⁽ⁱ⁾)², i.e. the sum over all samples of the squared difference between the predicted and the true value.
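To make the loss concrete, here is a minimal sketch (plain NumPy, with toy numbers of my own) that evaluates this squared-error loss for two candidate lines y = w*x + bias:

import numpy as np

# Toy data: house sizes (x) and prices (y); the numbers are made up for illustration
x = np.array([50.0, 80.0, 100.0, 120.0])
y = np.array([150.0, 230.0, 290.0, 350.0])

def squared_error_loss(w, bias):
    """Sum of squared differences between predictions and true values."""
    y_pred = w * x + bias
    return np.sum((y_pred - y) ** 2)

# A line that fits the points more closely has a smaller loss
print(squared_error_loss(w=2.0, bias=30.0))  # poor fit -> large loss (12000.0)
print(squared_error_loss(w=2.9, bias=0.0))   # close fit -> small loss (33.0)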

The LinearRegression API in Detail

# imports
from sklearn.linear_model import LinearRegression


# Parameters

LinearRegression's __init__ signature (in the scikit-learn version this post was written against; note that the normalize parameter has since been deprecated and removed in newer releases):
def __init__(self, fit_intercept=True, normalize=False, copy_X=True, n_jobs=None)

# fit_intercept=True: whether to fit an intercept, i.e. the bias in the formulas above

# normalize=False:
    This parameter is ignored when ``fit_intercept`` is set to False.
    If True, the regressors X will be normalized before regression by
    subtracting the mean and dividing by the l2-norm.
    If you wish to standardize, please use
    :class:`sklearn.preprocessing.StandardScaler` before calling ``fit`` on
    an estimator with ``normalize=False``.
	In short: ignored when fit_intercept=False; if True, each feature of X is normalized before the regression; if you want standardization, use sklearn's StandardScaler before fitting.

n_jobs : int or None, optional (default=None)
    The number of jobs to use for the computation. This will only provide
    speedup for n_targets > 1 and sufficient large problems.
    ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
    ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
    for more details.
	In short: the number of CPU cores to use; None means 1, -1 means all processors.


# Attributes
Attributes
----------
coef_ : array of shape (n_features, ) or (n_targets, n_features)
    Estimated coefficients for the linear regression problem.
    If multiple targets are passed during the fit (y 2D), this
    is a 2D array of shape (n_targets, n_features), while if only
    one target is passed, this is a 1D array of length n_features.
	In short: the estimated coefficients of the regression; a 2D array if fit received multiple targets, a 1D array for a single target.

# Methods
def fit(self, X, y, sample_weight=None):
# X : {array-like, sparse matrix} of shape (n_samples, n_features) Training data
The training set; an array or sparse matrix

# y : array-like of shape (n_samples,) or (n_samples, n_targets) Target values. Will be cast to X's dtype if necessary
The training targets; an array

# sample_weight : array-like of shape (n_samples,), default=None Individual weights for each sample
Sample weights; an array giving each sample's weight

# Returns
    -------
    self : returns an instance of self.

def predict(self, X):
"""
Predict target values for the samples in X using the trained model
"""
# X : array_like or sparse matrix, shape (n_samples, n_features) Samples.
The test set; an array or sparse matrix

# Returns
    -------
    C : array, shape (n_samples,)
        Returns predicted values.
	An array of the predicted values

It's been a while since my last post, so this API walkthrough is a bit messy; bear with me.
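Putting fit, predict, and coef_ together, here is a minimal sketch with made-up numbers (the targets follow y = 2*x1 + 3*x2 + 5 exactly, so the fitted parameters are easy to verify):

import numpy as np
from sklearn.linear_model import LinearRegression

# Four samples with two features each; toy data constructed so y = 2*x1 + 3*x2 + 5
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

lr = LinearRegression()
lr.fit(X, y)

print(lr.coef_)                   # ~[2. 3.]  -> the learned weights w1, w2
print(lr.intercept_)              # ~5.0      -> the learned bias
print(lr.predict([[5.0, 5.0]]))   # ~[30.]    -> 2*5 + 3*5 + 5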

Supporting APIs in Detail

Splitting the dataset into training and test sets

from sklearn.model_selection import train_test_split

def train_test_split(*arrays, **options):
	Pass the feature set and the target set in; this method splits the features into a training set and a test set, does the same for the targets, and keeps the target rows aligned one-to-one with their feature rows

# Parameters
Parameters
----------
*arrays : sequence of indexables with same length / shape[0]
    Allowed inputs are lists, numpy arrays, scipy-sparse
    matrices or pandas dataframes.
	Accepts lists, numpy arrays, scipy sparse matrices, and pandas DataFrames

test_size : float, int or None, optional (default=None)
    If float, should be between 0.0 and 1.0 and represent the proportion
    of the dataset to include in the test split. If int, represents the
    absolute number of test samples. If None, the value is set to the
    complement of the train size. If ``train_size`` is also None, it will
    be set to 0.25.
	Sets the size of the test split; when both test_size and train_size are None, it defaults to 0.25

train_size : float, int, or None, (default=None)
    If float, should be between 0.0 and 1.0 and represent the
    proportion of the dataset to include in the train split. If
    int, represents the absolute number of train samples. If None,
    the value is automatically set to the complement of the test size.
	Sets the size of the training split; when None it is the complement of test_size (0.75 by default)

random_state : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator;
    If RandomState instance, random_state is the random number generator;
    If None, the random number generator is the RandomState instance used
    by `np.random`.
	The random seed; passing a fixed value makes every split come out identical, which is convenient for testing

shuffle : boolean, optional (default=True)
    Whether or not to shuffle the data before splitting. If shuffle=False
    then stratify must be None.
	Whether to shuffle the data before splitting


# Usage

# x_train: training features; x_test: test features; y_train: training targets; y_test: test targets
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.25)
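Here is a self-contained sketch (toy arrays of my own) showing the split and how fixing random_state makes it reproducible:

import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20).reshape(10, 2)  # 10 samples with 2 features (toy data)
target = np.arange(10)

# The same random_state produces the identical split on every run
x_train, x_test, y_train, y_test = train_test_split(
    data, target, test_size=0.25, random_state=42)

print(x_test)  # always the same 3 test samples (10 * 0.25, rounded up)
print(y_test)  # the matching 3 target values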
Scoring regression predictions

from sklearn.metrics import explained_variance_score

score = explained_variance_score(y_true, y_pred)
	# y_true : array-like of shape (n_samples,) or (n_samples, n_outputs)
        Ground truth (correct) target values.
	The ground-truth (correct) target values, array-like
	
	# y_pred : array-like of shape (n_samples,) or (n_samples, n_outputs)
        Estimated target values.
	The estimated (predicted) target values

	# sample_weight : array-like of shape (n_samples,), optional
        Sample weights.
	Sample weights

	# multioutput : string in ['raw_values', 'uniform_average', \
                'variance_weighted'] or array-like of shape (n_outputs)
        Defines aggregating of multiple output scores.
        Array-like value defines weights used to average scores.

        'raw_values' :
            Returns a full set of scores in case of multioutput input.

        'uniform_average' :
            Scores of all outputs are averaged with uniform weight.

        'variance_weighted' :
            Scores of all outputs are averaged, weighted by the variances
            of each individual output.
	Options for multiple outputs:
		'raw_values' returns one score per output
		'uniform_average' averages the outputs' scores with equal weights
		'variance_weighted' averages the outputs' scores, each weighted by that output's variance

	# Returns
    -------
    score : float or ndarray of floats
        The explained variance or ndarray if 'multioutput' is 'raw_values'.
	The returned score: a float, or an ndarray of floats when multioutput='raw_values'
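A quick sketch with made-up values (1.0 is a perfect score; the lower, the worse):

from sklearn.metrics import explained_variance_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(explained_variance_score(y_true, y_pred))  # ~0.957, close to a perfect 1.0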
Mean squared error

from sklearn.metrics import mean_squared_error

def mean_squared_error(y_true, y_pred,
                   sample_weight=None,
                   multioutput='uniform_average', squared=True):

	# y_true, y_pred, and sample_weight are the same as above
	# multioutput : string in ['raw_values', 'uniform_average']
        or array-like of shape (n_outputs)
        Defines aggregating of multiple output values.
        Array-like value defines weights used to average errors.

        'raw_values' :
            Returns a full set of errors in case of multioutput input.

        'uniform_average' :
            Errors of all outputs are averaged with uniform weight.
		Only 'raw_values' and 'uniform_average' are available here
	
	# squared : boolean value, optional (default = True)
	    If True returns MSE value, if False returns RMSE value.
		A boolean, True by default: if True the function returns the MSE (mean squared error); if False it returns the RMSE (root mean squared error)

		MSE: the average of the squared differences between each predicted value and its true value
		RMSE: the square root of the MSE
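A sketch of MSE vs RMSE on the same toy values (note: the squared flag exists in the scikit-learn versions this post targets; very recent releases replace it with a separate root_mean_squared_error function):

from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)                  # mean of the squared errors
rmse = mean_squared_error(y_true, y_pred, squared=False)  # square root of the MSE

print(mse)   # 0.375
print(rmse)  # ~0.612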
Standardization

from sklearn.preprocessing import StandardScaler

StandardScaler's __init__ signature:
def __init__(self, copy=True, with_mean=True, with_std=True):

#---------------------------------------------------
The standard score of a sample `x` is calculated as: z = (x - u) / s
Every sample value x is re-mapped through this formula, where u is the mean of the training samples (0 when with_mean=False) and s is their standard deviation (1 when with_std=False)

#---------------------------------------------------
Methods:
def fit(self, X, y=None):
# Computes the mean and standard deviation, to be used by subsequent transforms
# X : the data the scaler is fitted on

def transform(self, X, copy=None):
# Standardizes the data
# X : same as above

def inverse_transform(self, X, copy=None):
# Undoes the standardization: pass standardized data in as X and get back the original scale

def fit_transform(self, X, y=None, **fit_params):
# Runs fit first, then transform
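A round-trip sketch on a single toy column, showing z = (x - u) / s at work and that inverse_transform undoes it:

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one feature column (toy data)

std = StandardScaler()
z = std.fit_transform(x)  # z = (x - mean) / std_dev, computed column-wise

print(std.mean_)   # [3.]     -> u, the column mean
print(std.scale_)  # [~1.414] -> s, the column standard deviation
print(z.ravel())   # [-1.414 -0.707  0.  0.707  1.414]
print(std.inverse_transform(z).ravel())  # back to [1. 2. 3. 4. 5.]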

Hands-On: Boston House-Price Prediction

Now for the practice: we'll predict Boston house prices with sklearn's built-in dataset, apply a series of processing steps, inspect the metrics at each stage, and finally pick the model that performs best.

Exploring the dataset

# The dataset lives in sklearn.datasets as load_boston
# (note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this code needs an older version)
from sklearn.datasets import load_boston

# Instantiating it lets us inspect the dataset's structure
boston = load_boston()
print(boston)

Data structure (output abridged here):

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02, 4.9800e+00],
       ...,
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02, 7.8800e+00]]),
 'target': array([24. , 21.6, 34.7, ..., 23.9, 22. , 11.9]),
 'feature_names': array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7'),
 'DESCR': "Boston house prices dataset ... 506 instances, 13 numeric/categorical predictive attributes; MEDV (attribute 14) is usually the target.
    Attribute information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    ...",
 'filename': '...\\sklearn\\datasets\\data\\boston_house_prices.csv'}

As you can see, the dataset is wrapped in a dict (an sklearn Bunch).

boston = load_boston()
print(boston.data)  # the feature matrix
print(boston.target)  # the target values
print(boston.feature_names)  # the feature names

Boston has too many feature names for me to translate one by one, but before doing this project we still need to understand what each feature means; the attribute descriptions in the DESCR field above cover all of them.

Predicting with plain linear regression (no preprocessing; this code runs as-is)
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import explained_variance_score, mean_squared_error

# Load the data from the built-in dataset
data = load_boston().data
target = load_boston().target  # target values

# Split the data directly; test set = 0.25
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=.25)

# Train plain linear regression on the training set
lr = LinearRegression()  # create the estimator
lr.fit(X=x_train, y=y_train)  # pass the training features and training targets

# Call predict on x_test
y_predict = lr.predict(x_test)
print('Predicted test-set prices:', y_predict)

# Check the explained-variance score
score = explained_variance_score(y_true=y_test, y_pred=y_predict)
print('Score:', score)

# Check the MSE (mean squared error)
mse = mean_squared_error(y_true=y_test, y_pred=y_predict)
print('MSE:', mse)

Results:

Predicted test-set prices: [21.46883434 25.12932984 25.46229753 24.29381014 10.77338021 23.6629044
 23.37791703 22.49010597 12.27856207 29.37523746 27.40418595 10.76141686
 17.18846688 34.25705244 29.39977567 21.07789847 13.28080843 26.40665946
 25.0711256  14.36977619 27.72165633 24.97684893  7.14863525 31.9399898
 34.1872509  18.28335265 13.78038185 30.71205105 25.05699424 31.3071684
 23.73084061 24.50247385 23.80750114 22.13219131 24.19862954 18.62644109
 20.33034999 23.75066724 17.86150927 24.76372503 36.45888055 15.58456341
 25.14583162 35.39194339 27.4397019  18.53702861 36.67282595 18.06890836
 33.40175423 19.29671574  8.36035202  4.91058191 43.74501362 23.70449026
 21.18441041 32.86471409 14.54004421 37.14938937 17.11641782 28.49420956
 17.12114592 14.17674722 23.88892771 36.87368112 15.58474291 26.22316745
 13.31399778 18.21362041 11.89413312 23.86939921 16.86764505 13.61438561
 17.6953322  20.14680644 18.69063686 19.41916909 16.0092952  10.97551792
 16.31726352 22.77547615 17.75585647 18.08693647 19.85811416 17.91815524
 34.72957214 25.13865124 43.09269266 17.41278109 30.01004371  6.27039642
 19.21093827 23.687117   31.91416595 30.9285005  23.04447041 19.39716369
  7.93354099 16.63689471 21.48214823 30.77142739 34.55218054 21.27732932
 28.55869948 25.82598709 19.18739213 15.04329649 26.88848523 16.32920073
 16.71884264  6.66521178  3.2633921  11.61679565 27.638216   19.34970675
 18.09687596 25.19032192 24.94721181 31.79454369 19.09566954 22.19418141
 15.75264749 17.70929177 21.25536948 35.56393675 16.53093603 21.88363038
 30.32002178]
Score: 0.6852667652008373
MSE: 24.41094186688598
Standardizing the dataset

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import explained_variance_score, mean_squared_error
from sklearn.preprocessing import StandardScaler

# Load the data from the built-in dataset
data = load_boston().data
target = load_boston().target  # target values

# Split the data directly; test set = 0.25
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=.25)

# ------------------------------
# Standardize the data; use separate scalers for the features and the targets
# so the feature statistics aren't overwritten when fitting on y
std_x = StandardScaler()
x_train = std_x.fit_transform(x_train)
x_test = std_x.transform(x_test)  # already fitted above, so just transform

std_y = StandardScaler()
y_train = std_y.fit_transform(y_train.reshape(-1, 1))
y_test = std_y.transform(y_test.reshape(-1, 1))
# ------------------------------


# Train plain linear regression on the training set
lr = LinearRegression()  # create the estimator
lr.fit(X=x_train, y=y_train)  # pass the training features and training targets

# Call predict on x_test
y_predict = lr.predict(x_test)
print('Predicted test-set prices:', y_predict)

# Check the explained-variance score
score = explained_variance_score(y_true=y_test, y_pred=y_predict)
print('Score:', score)

# Check the MSE (mean squared error)
mse = mean_squared_error(y_true=y_test, y_pred=y_predict)
print('MSE:', mse)

Results:

Predicted test-set prices: [[-1.40879126]
 [ 0.86696205]
 [-0.40592665]
 [ 0.53045858]
 [ 0.01976869]
 [-1.38855636]
 [-1.08893422]
 .....
 [-0.34454702]
 [ 0.13888412]
 [ 0.0853247 ]
 [ 0.1180086 ]
 [-0.77342024]
 [-1.1975903 ]
 [ 0.58981398]
 [ 0.23569153]
 [ 0.80432651]
 [ 1.39147711]
 [-0.60182932]]
Score: 0.6719360149536837
MSE: 0.34810740050393685

Standardization is meant to suppress the influence of extreme feature values on the data as a whole; looking through the data, this dataset has no particularly extreme outliers, so standardization barely helps here.

The MSE, on the other hand, changed a lot in absolute terms, dropping from the tens down to a fraction of one. But that is only because it is now computed on the standardized scale: standardization rescales the data to zero mean and unit variance (it does not squeeze values into the 0~1 range; that would be min-max scaling), and since the MSE is computed from differences of these shrunken values, the number is small because the scale changed, not because the model got better.
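To compare like with like, you can map the standardized predictions back to the original price scale with inverse_transform and recompute the MSE there; a short sketch, reusing the std_y scaler and the standardized y_test / y_predict from the code above:

# Map the predictions and targets back to the original price units
y_predict_orig = std_y.inverse_transform(y_predict)
y_test_orig = std_y.inverse_transform(y_test)

# This MSE is directly comparable to the unstandardized run above
mse_orig = mean_squared_error(y_true=y_test_orig, y_pred=y_predict_orig)
print('MSE in original units:', mse_orig)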

Feature selection (this code runs as-is)

In this step we drop features that are correlated with others or that barely affect house prices, and see whether that improves the model.

After looking through the features, I think a few can be dropped: NOX (nitric oxide concentration), PTRATIO (pupil-teacher ratio by town), and B (proportion of the Black population by town).
My reasoning:

NOX: we can assume the nitric oxide concentration is similar across all of Boston

PTRATIO: students can attend schools across district lines, so the pupil-teacher ratio has little effect on a single town

B: this feature promotes racial discrimination, so drop it!

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import explained_variance_score, mean_squared_error
from sklearn.preprocessing import StandardScaler

# Load the data from the built-in dataset
data = load_boston().data
target = load_boston().target  # target values
print(load_boston().feature_names)

# Feature selection (column labels keep their original indices after each drop)
data = pd.DataFrame(data)
data.drop([4], inplace=True, axis=1)  # drop NOX
data.drop([10], inplace=True, axis=1)  # drop PTRATIO
data.drop([11], inplace=True, axis=1)  # drop B
print(data)

# Split the data directly; test set = 0.25
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=.25)

# ------------------------------
# Standardize the data; separate scalers for features and targets
std_x = StandardScaler()
x_train = std_x.fit_transform(x_train)
x_test = std_x.transform(x_test)  # already fitted above, so just transform

std_y = StandardScaler()
y_train = std_y.fit_transform(y_train.reshape(-1, 1))
y_test = std_y.transform(y_test.reshape(-1, 1))


# Train plain linear regression on the training set
lr = LinearRegression()  # create the estimator
lr.fit(X=x_train, y=y_train)  # pass the training features and training targets

# Call predict on x_test
y_predict = lr.predict(x_test)
print('Predicted test-set prices:', y_predict)

# Check the explained-variance score
score = explained_variance_score(y_true=y_test, y_pred=y_predict)
print('Score:', score)

# Check the MSE (mean squared error)
mse = mean_squared_error(y_true=y_test, y_pred=y_predict)
print('MSE:', mse)

Dropping those features surprised me at first: several runs in a row scored above 0.70, and the MSE fell from 0.3+ to around 0.2. Run it a few more times, though, and you'll find that while a good split really does improve quality, an unlucky split can also drag the score way down (as low as 0.5).

Score: 0.756400957032327
MSE: 0.20941409554630147
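Since a single lucky or unlucky split swings the score this much, cross-validation gives a more stable estimate (cross_val_score is already imported above but unused); a sketch, reusing the feature-selected data and target from the code above:

# 5-fold cross-validation on the whole feature-selected dataset;
# LinearRegression's default scorer is R^2, one score per fold
scores = cross_val_score(LinearRegression(), data, target, cv=5)
print(scores)         # per-fold scores: see how much they vary
print(scores.mean())  # a steadier overall estimate than one split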

Saving and loading the model

import joblib

# Saving a model
def dump(value, filename, compress=0, protocol=None, cache_size=None):
# value: the Python object to save; can be any Python object
# filename: the path to save to
# compress: compression level (0~9), 9 being the highest

# Loading a model
def load(filename, mmap_mode=None):
# filename: the path

Below is the same code with model saving and loading added.

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import explained_variance_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import joblib

# Load the data from the built-in dataset
data = load_boston().data
target = load_boston().target  # target values
print(load_boston().feature_names)

# Feature selection
data = pd.DataFrame(data)
data.drop([4], inplace=True, axis=1)  # drop NOX
data.drop([10], inplace=True, axis=1)  # drop PTRATIO
data.drop([11], inplace=True, axis=1)  # drop B
print(data)

# Split the data directly; test set = 0.25
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=.25)

# ------------------------------
# Standardize the data; separate scalers for features and targets
std_x = StandardScaler()
x_train = std_x.fit_transform(x_train)
x_test = std_x.transform(x_test)  # already fitted above, so just transform

std_y = StandardScaler()
y_train = std_y.fit_transform(y_train.reshape(-1, 1))
y_test = std_y.transform(y_test.reshape(-1, 1))

# Load the model
# lr = joblib.load('linearModel.pkl')

# Train plain linear regression on the training set
lr = LinearRegression()  # create the estimator
lr.fit(X=x_train, y=y_train)  # pass the training features and training targets

# *******************************
# Save the model
joblib.dump(value=lr, filename='linearModel.pkl')

# *******************************


# Call predict on x_test
y_predict = lr.predict(x_test)
print('Predicted test-set prices:', y_predict)

# Check the explained-variance score
score = explained_variance_score(y_true=y_test, y_pred=y_predict)
print('Score:', score)

# Check the MSE (mean squared error)
mse = mean_squared_error(y_true=y_test, y_pred=y_predict)
print('MSE:', mse)

Saving and loading here simply means serializing the fitted LinearRegression object (trained on a split that performed well) to a file, and loading it back from that file the next time the model is needed.

That way, the model you use next time is still the one that performed well last time.

Summary

First, the algorithmic flow of machine learning is not all that complicated; the complexity lies up front, in processing the data and the features. Once you know the broad categories of machine learning, you can pick a model to suit your needs.

The core idea of LinearRegression is to find the best-fitting line (or hyperplane) in a multi-dimensional space. And for regression problems, the most important indicator is not just the prediction score; metrics such as the mean squared error matter even more.
