Machine Learning in Practice: Predicting Stroke Patients with 3 Classification Models


WeChat Official Account: 尤而小屋
Author: Peter
Editor: Peter

Hi everyone, I'm Peter~

Today I'm sharing a modeling case study based on a stroke dataset from Kaggle; the article's main contents are outlined in the mind map:

Original dataset: http://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?datasetId=1120859&sortBy=voteCount&select=healthcare-dataset-stroke-data.csv

Importing Libraries
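The import cell isn't shown in the original post; a minimal set that covers everything used below would look roughly like this:

```python
import numpy as np
import pandas as pd

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# modeling
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score, f1_score)

# SMOTE oversampling, from the imbalanced-learn package
from imblearn.over_sampling import SMOTE
```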

Basic Data Information

First, load the data.
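The loading cell isn't shown either; assuming the CSV downloaded from the Kaggle page above sits in the working directory:

```python
# the file name comes from the Kaggle dataset page
df = pd.read_csv("healthcare-dataset-stroke-data.csv")
```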

Then take a look at its basic information:

In [3]:

df.shape

Out[3]:

(5110, 12)

In [4]:

df.dtypes

Out[4]:

```
id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object    # string field
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object
```

In [5]:

df.describe()  # descriptive statistics

Out[5]:

Field Distributions

gender counts

In [6]:

```python
plt.figure(1, figsize=(12,5))
sns.countplot(y="gender", data=df)
plt.show()
```

age distribution

In [7]:

```python
px.violin(y=df["age"])
```

```python
fig = px.histogram(df, x="age", color_discrete_sequence=['firebrick'])
fig.show()
```

ever_married

In [9]:

```python
plt.figure(1, figsize=(12,5))
sns.countplot(y="ever_married", data=df)
plt.show()
```

In this dataset, married people outnumber unmarried ones roughly two to one.

work_type

Count the number of people in each work type:

In [10]:

```python
plt.figure(1, figsize=(12,8))
sns.countplot(y="work_type", data=df)
plt.show()
```

Residence_type

In [11]:

```python
plt.figure(1, figsize=(12,8))
sns.countplot(y="Residence_type", data=df)
plt.show()
```

avg_glucose_level

Distribution of the average glucose level:

```python
fig = px.histogram(df, x="avg_glucose_level", color_discrete_sequence=['firebrick'])
fig.show()
```

Most people's glucose level is below 100, which is in the normal range.

bmi

Distribution of the bmi field.
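The plotting cell isn't shown; a sketch that mirrors the glucose histogram above:

```python
# same style as the avg_glucose_level histogram
fig = px.histogram(df, x="bmi", color_discrete_sequence=['firebrick'])
fig.show()
```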

The mean bmi is around 28, and the distribution is roughly normal.

smoking_status

Counts by smoking status:

```python
plt.figure(1, figsize=(12,8))
sns.countplot(y="smoking_status", data=df)
plt.show()
```

Relatively fewer people currently smoke or formerly smoked.

Missing Values

Missing value counts:

```python
df.isnull().sum()
```

```
id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
```

```python
201 / len(df)  # proportion of missing values
```

```
0.03933463796477495
```

Visualizing the missing values
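The visualization cell isn't shown; one common choice (an assumption, not necessarily what the original used) is the missingno library:

```python
import missingno as msno

# matrix plot: white gaps mark the missing entries (here, in the bmi column)
msno.matrix(df)
plt.show()
```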

Handling the missing values

We use a decision tree regressor to predict the missing BMI values: train on age and gender over the rows where bmi is present, then fill in the missing rows with the predictions.

```python
dt_bmi = Pipeline(steps=[
    ("scale", StandardScaler()),                    # standardize the features
    ("lr", DecisionTreeRegressor(random_state=42))  # regressor
])
```

Take the 3 relevant columns:

```python
X = df[["age","gender","bmi"]].copy()

dic = {"Male":0, "Female":1, "Other":-1}
X["gender"] = X["gender"].map(dic).astype(np.uint8)
X.head()
```

Split out the non-missing rows for training:

```python
# rows where bmi is missing
missing = X[X.bmi.isna()]

# rows where bmi is present
X = X[~X.bmi.isna()]
y = X.pop("bmi")
```

```python
# fit the model
dt_bmi.fit(X, y)
```

Pipeline(steps=[('scale', StandardScaler()), ('lr', DecisionTreeRegressor(random_state=42))])

In [23]:

```python
# predict bmi for the missing rows
y_pred = dt_bmi.predict(missing[["age","gender"]])
y_pred[:5]
```

Out[23]:

array([29.87948718, 30.55609756, 27.24722222, 30.84186047, 33.14666667])

Convert the predictions to a Series, keeping the original row index:

```python
predict_bmi = pd.Series(y_pred, index=missing.index)
predict_bmi
```

```
1       29.879487
8       30.556098
13      27.247222
19      30.841860
27      33.146667
          ...    
5039    32.716000
5048    28.313636
5093    31.459322
5099    28.313636
5105    28.476923
Length: 201, dtype: float64
```

Fill the predictions back into the original df:

```python
df.loc[missing.index, "bmi"] = predict_bmi
```

After this imputation, checking again shows that no missing values remain:

```python
df.isnull().sum()
```

```
id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
```

Exploratory Data Analysis (EDA)

```python
# every field except the id and stroke columns
variables = [variable for variable in df.columns if variable not in ['id','stroke']]
variables
```

```
['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status']
```

Continuous variables
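The plotting cell isn't shown; a sketch of one way to draw the three distributions:

```python
# histogram + KDE for each continuous variable
continuous = ["age", "avg_glucose_level", "bmi"]
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, col in zip(axes, continuous):
    sns.histplot(df[col], kde=True, ax=ax)
    ax.set_title(col)
plt.show()
```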

A few observations:

  • age: the distribution is fairly even, with small differences across age groups
  • avg_glucose_level: mostly concentrated below 100
  • bmi: roughly normally distributed

Stroke vs. no stroke

Above we looked at the distributions of the continuous variables; bmi shows a noticeably skewed distribution. Next, let's compare the stroke and no-stroke groups:
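The density-plot cell isn't shown; a sketch using seaborn KDE plots, one panel per variable, split by outcome:

```python
# density of each continuous variable, stroke vs. no stroke
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, col in zip(axes, ["age", "avg_glucose_level", "bmi"]):
    sns.kdeplot(data=df, x=col, hue="stroke", common_norm=False, ax=ax)
plt.show()
```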

The three density plots show one thing clearly: age is the single most important factor in whether someone has a stroke.

Comparing glucose and BMI across ages

```python
px.scatter(df, x="age", y="avg_glucose_level", color="stroke", trendline='ols')
```

Age vs. glucose and bmi

```python
px.scatter(df, x="age", y="bmi", color="stroke", trendline='ols')
```

Age and stroke probability
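The cell behind this figure isn't shown; one plausible reconstruction plots the empirical stroke rate at each age:

```python
# empirical stroke rate per age -- an assumption about how the figure was built
rate = df.groupby("age")["stroke"].mean().reset_index()
fig = px.line(rate, x="age", y="stroke")
fig.show()
```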

The scatter plots suggest that age really is an important factor, with some relationship to both BMI and the average glucose level.

It looks like the risk increases with age. Is that really the case?

The figure above shows two things:

  1. The older the age, the higher the probability of stroke
  2. The overall probability of stroke is very low (the y-axis values are small), which reflects the imbalance between stroke and no-stroke samples

Of the 5,110 samples in the original data, only 249 are stroke cases, a ratio of roughly 1:20.

Class imbalance

Distributions across all attributes

First, we drop the cases where gender is Other.

In [34]:

```python
str_only = df[df['stroke'] == 1]     # stroke
no_str_only = df[df['stroke'] == 0]  # no stroke
```

In [35]:

len(str_only)

Out[35]:

249

In [36]:

```python
# drop gender == Other
no_str_only = no_str_only[(no_str_only['gender'] != 'Other')]
```

Compare the stroke and no-stroke groups across the different attributes, as sketched below:
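The comparison plots aren't shown; a sketch that walks over `variables` and overlays the two groups (KDE for the continuous fields, normalized bar charts for the categorical ones):

```python
continuous = ["age", "avg_glucose_level", "bmi"]
fig, axes = plt.subplots(5, 2, figsize=(14, 22))
for ax, col in zip(axes.ravel(), variables):
    if col in continuous:
        sns.kdeplot(str_only[col], ax=ax, label="stroke")
        sns.kdeplot(no_str_only[col], ax=ax, label="no stroke")
        ax.legend()
    else:
        # categorical: compare the relative frequencies of each level
        pd.concat([str_only[col].value_counts(normalize=True).rename("stroke"),
                   no_str_only[col].value_counts(normalize=True).rename("no stroke")],
                  axis=1).plot.bar(ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```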

Modeling

Model baseline

In [38]:

len(str_only)

Out[38]:

249

In [39]:

249 / len(df)

Out[39]:

0.0487279843444227

So 249 people in total had a stroke, out of len(df) samples; the expression above gives the baseline for this model.

In other words, 4.8% is the baseline for recall on positive stroke cases; a useful model needs to do better than this.

Field encoding

Encode the 4 string-typed fields:

In [40]:

```python
df['gender'] = df['gender'].replace({'Male':0, 'Female':1, 'Other':-1}).astype(np.uint8)

df['Residence_type'] = df['Residence_type'].map({'Rural':0, 'Urban':1}).astype(np.uint8)

df['work_type'] = df['work_type'].map({'Private':0, 'Self-employed':1, 'Govt_job':2,
                                       'children':-1, 'Never_worked':-2}).astype(np.uint8)

df['ever_married'] = df['ever_married'].map({'No':0, 'Yes':1}).astype(np.uint8)

df.head()
```

One-hot encode the smoking status:

In [41]:

df["smoking_status"].value_counts()

Out[41]:

```
never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: smoking_status, dtype: int64
```

In [42]:

```python
df = df.join(pd.get_dummies(df["smoking_status"]))
df.drop("smoking_status", axis=1, inplace=True)
```

Splitting the data

In [43]:

```python
# features
X = df.drop("stroke", axis=1)

# target
y = df['stroke']

from sklearn.model_selection import train_test_split

# 3:7 split (train_size=0.3 keeps 30% for training)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=42)
```

Oversampling

As mentioned earlier, the stroke to no-stroke ratio in this case is close to 1:20, so here we apply SMOTE-based oversampling to the training set.

In [44]:

```python
oversample = SMOTE()
X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train.ravel())
```

In [45]:

len(y_train_smote)

Out[45]:

2914

In [46]:

len(X_train_smote)

Out[46]:

2914

Building the models

We build three different classifiers: Random Forest, SVM, and Logistic Regression.

In [47]:

```python
rf_pipeline = Pipeline(steps=[('scale', StandardScaler()),                       # standardization
                              ('RF', RandomForestClassifier(random_state=42))])  # model

svm_pipeline = Pipeline(steps=[('scale', StandardScaler()),
                               ('SVM', SVC(random_state=42))])

logreg_pipeline = Pipeline(steps=[('scale', StandardScaler()),
                                  ('LR', LogisticRegression(random_state=42))])
```

10-fold cross-validation
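The cross-validation cell isn't shown; assuming `cross_val_score` on the SMOTE-resampled training set (the default accuracy scoring would match the numbers printed below):

```python
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation for each pipeline
rf_cv = cross_val_score(rf_pipeline, X_train_smote, y_train_smote, cv=10)
svm_cv = cross_val_score(svm_pipeline, X_train_smote, y_train_smote, cv=10)
logreg_cv = cross_val_score(logreg_pipeline, X_train_smote, y_train_smote, cv=10)
```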

Comparing the three models' scores

In [49]:

```python
print('Random Forest:', rf_cv.mean())
print('SVM:', svm_cv.mean())
print('Logistic Regression:', logreg_cv.mean())
```

```
Random Forest: 0.9628909366701726
SVM: 0.9363667907023254
Logistic Regression: 0.8859930523017683
```

Clearly, the random forest performs best!

Fitting the models

In [50]:

```python
rf_pipeline.fit(X_train_smote, y_train_smote)
svm_pipeline.fit(X_train_smote, y_train_smote)
logreg_pipeline.fit(X_train_smote, y_train_smote)
```

Out[50]:

Pipeline(steps=[('scale', StandardScaler()), ('LR', LogisticRegression(random_state=42))])

In [51]:

```python
# predictions from the three models
rf_pred = rf_pipeline.predict(X_test)
svm_pred = svm_pipeline.predict(X_test)
logreg_pred = logreg_pipeline.predict(X_test)
```

Evaluation metrics

In [52]:

```python
# 1. confusion matrices
rf_cm = confusion_matrix(y_test, rf_pred)
svm_cm = confusion_matrix(y_test, svm_pred)
logreg_cm = confusion_matrix(y_test, logreg_pred)
```

In [53]:

```python
print(rf_cm)
print("----")
print(svm_cm)
print("----")
print(logreg_cm)
```

```
[[3338   66]
 [ 164    9]]
----
[[3196  208]
 [ 148   25]]
----
[[3138  266]
 [ 116   57]]
```
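2. F1 scores. The cell computing `rf_f1`, `svm_f1`, and `logreg_f1` isn't shown; presumably the F1 score on the positive (stroke) class, which reproduces the numbers printed below from the confusion matrices above:

```python
# F1 on the positive class for each model
rf_f1 = f1_score(y_test, rf_pred)
svm_f1 = f1_score(y_test, svm_pred)
logreg_f1 = f1_score(y_test, logreg_pred)
```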

```python
print('RF mean :', rf_f1)
print('SVM mean :', svm_f1)
print('LR mean :', logreg_f1)
```

```
RF mean : 0.07258064516129033
SVM mean : 0.1231527093596059
LR mean : 0.22983870967741934
```

Classification report for the random forest:

```python
from sklearn.metrics import plot_confusion_matrix, classification_report

print(classification_report(y_test, rf_pred))
print('Accuracy Score: ', accuracy_score(y_test, rf_pred))
```

```
              precision    recall  f1-score   support

           0       0.95      0.98      0.97      3404
           1       0.12      0.05      0.07       173

    accuracy                           0.94      3577
   macro avg       0.54      0.52      0.52      3577
weighted avg       0.91      0.94      0.92      3577

Accuracy Score:  0.9357003075202683
```

Tuning the random forest

Parameter tuning with grid search:

```python
from sklearn.model_selection import GridSearchCV

n_estimators = [64, 100, 128, 200]
max_features = [2, 3, 5, 7]
bootstrap = [True, False]

param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'bootstrap': bootstrap}

rfc = RandomForestClassifier()
```

```python
grid = GridSearchCV(rfc, param_grid)
grid.fit(X_train, y_train)
```

```
GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'bootstrap': [True, False],
                         'max_features': [2, 3, 5, 7],
                         'n_estimators': [64, 100, 128, 200]})
```

grid.best_params_  # the best parameters found

{'bootstrap': False, 'max_features': 3, 'n_estimators': 200}

```python
# rebuild the random forest with the tuned parameters
rfc = RandomForestClassifier(max_features=3, n_estimators=200, bootstrap=False)

rfc.fit(X_train_smote, y_train_smote)

rfc_tuned_pred = rfc.predict(X_test)
```

```python
# classification report for the tuned model
print(classification_report(y_test, rfc_tuned_pred))
print('Accuracy Score: ', accuracy_score(y_test, rfc_tuned_pred))
print('F1 Score: ', f1_score(y_test, rfc_tuned_pred))
```

```
              precision    recall  f1-score   support

           0       0.95      0.98      0.97      3404
           1       0.05      0.02      0.03       173

    accuracy                           0.94      3577
   macro avg       0.50      0.50      0.50      3577
weighted avg       0.91      0.94      0.92      3577

Accuracy Score:  0.9362594352809617
F1 Score:  0.025641025641025644
```

Tuning the logistic regression

```python
penalty = ['l1', 'l2']
C = [0.001, 0.01, 0.1, 1, 10, 100]

log_param_grid = {'penalty': penalty, 'C': C}

logreg = LogisticRegression()
grid = GridSearchCV(logreg, log_param_grid)
```

```python
grid.fit(X_train_smote, y_train_smote)
```

```
GridSearchCV(estimator=LogisticRegression(),
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100],
                         'penalty': ['l1', 'l2']})
```

grid.best_params_

{'C': 1, 'penalty': 'l2'}

```python
logreg_pipeline = Pipeline(steps=[('scale', StandardScaler()),
                                  ('LR', LogisticRegression(C=1, penalty='l2', random_state=42))])

logreg_pipeline.fit(X_train_smote, y_train_smote)
```

Out[65]:

Pipeline(steps=[('scale', StandardScaler()), ('LR', LogisticRegression(C=1, random_state=42))])

In [66]:

```python
logreg_new_pred = logreg_pipeline.predict(X_test)  # new predictions
```

In [67]:

```python
print(classification_report(y_test, logreg_new_pred))
print('Accuracy Score: ', accuracy_score(y_test, logreg_new_pred))
print('F1 Score: ', f1_score(y_test, logreg_new_pred))
```

```
              precision    recall  f1-score   support

           0       0.96      0.92      0.94      3404
           1       0.18      0.33      0.23       173

    accuracy                           0.89      3577
   macro avg       0.57      0.63      0.59      3577
weighted avg       0.93      0.89      0.91      3577

Accuracy Score:  0.8932065977075762
F1 Score:  0.22983870967741934
```

Tuning the SVM

In [68]:

```python
svm_param_grid = {'C': [0.1, 1, 10, 100, 1000],
                  'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                  'kernel': ['rbf']}

svm = SVC(random_state=42)

grid = GridSearchCV(svm, svm_param_grid)
```

In [69]:

grid.fit(X_train_smote,y_train_smote)

Out[69]:

```
GridSearchCV(estimator=SVC(random_state=42),
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['rbf']})
```

In [70]:

grid.best_params_

Out[70]:

{'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}

In [71]:

```python
svm_pipeline = Pipeline(steps=[('scale', StandardScaler()),
                               ('SVM', SVC(C=100, gamma=0.0001, kernel='rbf', random_state=42))])

svm_pipeline.fit(X_train_smote, y_train_smote)

svm_tuned_pred = svm_pipeline.predict(X_test)
```

In [72]:

```python
print(classification_report(y_test, svm_tuned_pred))
print('Accuracy Score: ', accuracy_score(y_test, svm_tuned_pred))
print('F1 Score: ', f1_score(y_test, svm_tuned_pred))
```

```
              precision    recall  f1-score   support

           0       0.96      0.93      0.94      3404
           1       0.16      0.27      0.20       173

    accuracy                           0.90      3577
   macro avg       0.56      0.60      0.57      3577
weighted avg       0.92      0.90      0.91      3577

Accuracy Score:  0.8951635448700028
F1 Score:  0.19700214132762314
```

Conclusions

  1. The random forest performed best during cross-validation.
  2. Comparing the three models on the test set: the random forest has the best accuracy, but its F1-score is the lowest.
  3. A likely characteristic of these models: they are better at predicting who will have a stroke than who will not.