3.1 API: DecisionTreeClassifier, DecisionTreeRegressor


Introduction

All decision trees implemented in sklearn are binary trees.

1. DecisionTreeClassifier

In most cases the default Gini index is fine: entropy involves a logarithm, which is slower to compute (see the timing sketch below).
sklearn's decision trees use the CART algorithm.

from sklearn.tree import DecisionTreeClassifier
# Decision tree classifier
DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2,
 min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, 
 max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, ccp_alpha=0.0)
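To make the speed remark above concrete, here is a minimal sketch that times fitting with both criteria; the synthetic dataset and its size are arbitrary assumptions:

import time
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic dataset, size chosen only for illustration
X, y = make_classification(n_samples=20000, n_features=40, random_state=0)

for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)
    print(criterion, "fit time: %.3fs" % (time.perf_counter() - start))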

Parameters

criterion : {"gini", "entropy"}, default="gini"
Measures the quality of a split: "gini" is the Gini impurity, "entropy" is information entropy.

splitter : {"best", "random"}, default="best"
The strategy used to choose the split at each node:
"best" picks the best split, "random" picks the best random split.

max_depth : int, default=None
The maximum depth of the tree; commonly used to curb overfitting.

min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node.
If int, the value itself is used as the minimum; if float, the minimum is ceil(min_samples_split * n_samples), i.e. a fraction of the training set rounded up.

min_samples_leaf : int or float, default=1
The minimum number of samples required at a leaf node.
If int, the value itself is used as the minimum;
if float, the minimum is ceil(min_samples_leaf * n_samples), rounded up.
If a leaf node would end up with fewer samples than this, it is pruned away together with its sibling.
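A quick sketch of the int-versus-float semantics for these two parameters; the numbers are made up for illustration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

n_samples = 150
# float values are fractions of the training set, rounded up:
print(np.ceil(0.10 * n_samples))  # min_samples_split=0.10 -> at least 15 samples to split
print(np.ceil(0.05 * n_samples))  # min_samples_leaf=0.05 -> at least 8 samples per leaf

# equivalent int form for a 150-sample training set:
clf = DecisionTreeClassifier(min_samples_split=15, min_samples_leaf=8)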

min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the total sample weight (over all input samples) required to be at a leaf node.
Samples have equal weight when sample_weight is not provided.

max_features : int, float or {"auto", "sqrt", "log2"}, default=None
The number of features to consider when looking for the best split.
If int, max_features features are considered at each split.
If float, max_features is a fraction, and int(max_features * n_features) features are considered at each split.
If "auto", max_features=sqrt(n_features).
If "sqrt", max_features=sqrt(n_features).
If "log2", max_features=log2(n_features).
If None (the default), max_features=n_features.
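For example, with n_features=20 the options resolve as follows (a minimal sketch):

import numpy as np

n_features = 20
print(int(np.sqrt(n_features)))   # "auto"/"sqrt": 4
print(int(np.log2(n_features)))   # "log2": 4
print(int(0.5 * n_features))      # max_features=0.5: 10
                                  # None: all 20 features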

random_state : int, RandomState instance or None, default=None
The random seed; controls the randomness of the estimator.

max_leaf_nodes : int, default=None
The maximum number of leaf nodes.
With few features this can usually be left unset; with many features, a suitable value can be found by cross-validation, as in the sketch below.
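A sketch of that cross-validation approach using GridSearchCV; the candidate grid here is an arbitrary assumption:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_leaf_nodes": [3, 5, 10, 20, None]},
                      cv=5)
search.fit(iris.data, iris.target)
print(search.best_params_)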

min_impurity_decrease : float, default=0.0
Impurity-decrease threshold for splitting.
This limits the growth of the tree: if the impurity decrease at a node (information gain or Gini decrease) is smaller than this threshold, the node is not split into children.
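A small sketch of the effect: raising the threshold stops splitting earlier, so the tree ends up with fewer leaves (the threshold values are arbitrary):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
for threshold in (0.0, 0.01, 0.1):
    clf = DecisionTreeClassifier(min_impurity_decrease=threshold, random_state=0)
    clf.fit(iris.data, iris.target)
    print(threshold, clf.get_n_leaves())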

class_weight : dict, list of dict or "balanced", default=None
Class weights.
If None, all classes are given weight one.
For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
For a four-class multilabel classification, the weights should be
[{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].
The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to the class frequencies in the input data, as n_samples / (n_classes * np.bincount(y)).
For multi-output, the weights of each column of y are multiplied together.
Note that if sample_weight is specified (passed through the fit method), these class weights are multiplied with it.
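The "balanced" formula is easy to check by hand (a sketch with a made-up label vector):

import numpy as np

y = np.array([0, 0, 0, 0, 1, 1])  # imbalanced toy labels
n_samples, n_classes = len(y), len(np.unique(y))
# n_samples / (n_classes * np.bincount(y)) -> [0.75 1.5]
# the rarer class 1 receives the larger weight
print(n_samples / (n_classes * np.bincount(y)))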

ccp_alpha : non-negative float, default=0.0
Complexity parameter for Minimal Cost-Complexity Pruning; with the default of 0.0, no pruning is performed.
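A minimal sketch of how ccp_alpha is typically chosen: compute the candidate alphas with cost_complexity_pruning_path (listed under Methods below), then refit with one of them; picking the middle value here is arbitrary:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(iris.data, iris.target)
print(path.ccp_alphas)  # effective alphas of the subtrees met during pruning

alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary pick for illustration
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(iris.data, iris.target)
print(pruned.get_n_leaves())  # fewer leaves than the unpruned tree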

Attributes

classes_ : ndarray of shape (n_classes,) or list of ndarray
The class labels (single-output problem), or a list of arrays of class labels (multi-output problem).

feature_importances_ : ndarray of shape (n_features,)
The importance of each feature.

max_features_ : int
The inferred value of max_features.

n_classes_ : int or list of int
The number of classes (single-output problem), or a list containing the number of classes for each output (multi-output problem).

n_features_ : int
The number of features when fit is performed.

n_outputs_ : int
The number of outputs when fit is performed.

tree_ : Tree instance
The underlying Tree object.

Methods

apply(X[, check_input])
Return the index of the leaf that each sample is predicted as.

cost_complexity_pruning_path(X, y[, …])
Compute the pruning path during Minimal Cost-Complexity Pruning.

decision_path(X[, check_input])
Return the decision path in the tree.

fit(X, y[, sample_weight, check_input, …])
Build a decision tree classifier from the training set (X, y).

get_depth()
Return the depth of the decision tree.

get_n_leaves()
Return the number of leaves of the decision tree.

get_params([deep])
Get the parameters of this estimator.

predict(X[, check_input])
Predict the class of X.

predict_log_proba(X)
Predict the class log-probabilities of the input samples X.

predict_proba(X[, check_input])
Predict the class probabilities of the input samples X.

score(X, y[, sample_weight])
Return the mean accuracy on the given test data and labels.

set_params(**params)
Set the parameters of this estimator.

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([ 1.     ,  0.93...,  0.86...,  0.93...,  0.93...,
        0.93...,  0.93...,  1.     ,  0.93...,  1.      ])
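A short continuation showing a few of the methods listed above on a fitted tree:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
print(clf.get_depth(), clf.get_n_leaves())  # size of the fitted tree
print(clf.predict(iris.data[:2]))           # hard class predictions
print(clf.predict_proba(iris.data[:2]))     # per-class probabilities
print(clf.apply(iris.data[:2]))             # leaf index each sample falls into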

2. DecisionTreeRegressor

from sklearn.tree import DecisionTreeRegressor
# Decision tree regressor
DecisionTreeRegressor(*, criterion='mse', splitter='best', max_depth=None, min_samples_split=2, 
min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None,
 max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, ccp_alpha=0.0)

The parameters are largely the same as for DecisionTreeClassifier, so only the differences are described here.
Parameters

criterion : {"mse", "friedman_mse", "mae", "poisson"}, default="mse"
Measures the quality of a split:
"mse" is the mean squared error; "friedman_mse" is mean squared error with Friedman's improvement score; "mae" is the mean absolute error; "poisson" uses the reduction in Poisson deviance to find splits.
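A quick usage sketch. Note that the criterion names above match older sklearn releases; in newer versions "mse" and "mae" were renamed to "squared_error" and "absolute_error":

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X[:, 0] + 0.1 * rng.randn(200)  # toy regression target

reg = DecisionTreeRegressor(criterion='mse', max_depth=3, random_state=0)
reg.fit(X, y)
print(reg.predict(X[:3]))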

Attributes

There are no classes_ and n_classes_ attributes.

Methods

predict(X[, check_input])
Predict the regression value for X.

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
import numpy as np

n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X = rng.randn(n_samples, n_features)
# positive integer target correlated with X[:, 5] with many zeros:
y = rng.poisson(lam=np.exp(X[:, 5]) / 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
regressor = DecisionTreeRegressor(criterion='poisson', random_state=0)
regressor.fit(X_train, y_train)
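As a follow-up, the fitted regressor can be evaluated on the held-out split (for regressors, score returns the R^2 coefficient):

print(regressor.score(X_test, y_test))  # R^2 on the test split
print(regressor.predict(X_test[:5]))    # predicted counts for a few test samples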

3. Example

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
import pydotplus
import matplotlib as mpl

# Load the data
def loaddata():
    features = ["age", "work", "house", "credit"]
    x_train = pd.DataFrame([
        ["青年", "否", "否", "一般"],
        ["青年", "否", "否", "好"],
        ["青年", "是", "否", "好"],
        ["青年", "是", "是", "一般"],
        ["青年", "否", "否", "一般"],
        ["中年", "否", "否", "一般"],
        ["中年", "否", "否", "好"],
        ["中年", "是", "是", "好"],
        ["中年", "否", "是", "非常好"],
        ["中年", "否", "是", "非常好"],
        ["老年", "否", "是", "非常好"],
        ["老年", "否", "是", "好"],
        ["老年", "是", "否", "好"],
        ["老年", "是", "否", "非常好"],
        ["老年", "否", "否", "一般"]
    ])
    y_train = pd.DataFrame(["否", "否", "是", "是", "否", "否", "否", "是", "是", "是", "是", "是", "是", "是", "否"])
    y_type = [str(k) for k in np.unique(y_train)]
    # label-encode the categorical features (integer codes, not one-hot)
    le_x = LabelEncoder()
    le_x.fit(np.unique(x_train))
    x_train = x_train.apply(le_x.transform)

    le_y = LabelEncoder()
    le_y.fit(y_train.values.ravel())
    y_train = le_y.transform(y_train.values.ravel())
    return x_train, y_train, features, le_x, le_y, y_type


# Visualize the decision tree
def show(clf,feature,y_type):
    dot_data = tree.export_graphviz(clf,out_file=None,
                                    feature_names=feature,
                                    class_names=y_type,filled=True,
                                    rounded=True,special_characters=True)
    # render the dot data to a PNG image
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_png('DT_show.png')

if __name__ == '__main__':
    # matplotlib font settings so Chinese text displays correctly
    mpl.rcParams["font.sans-serif"] = [u'simHei']
    mpl.rcParams['axes.unicode_minus'] = False
    # load the data
    x_train, y_train, features, le_x, le_y, y_type = loaddata()
    # fit the classifier
    clf = DecisionTreeClassifier()
    clf.fit(x_train, y_train)
    # visualize the tree
    show(clf, features,y_type)

    # predict a new sample
    X_show = pd.DataFrame([["青年", "否", "否", "一般"]])
    X_test = X_show.apply(le_x.transform)
    y_predict = clf.predict(X_test)
    # print the result
    X_show = [{features[i]: X_show.values[0][i]} for i in range(len(features))]
    print("{0} is classified as {1}".format(X_show, le_y.inverse_transform(y_predict)))

[{'age': '青年'}, {'work': '否'}, {'house': '否'}, {'credit': '一般'}] is classified as ['否']
