Machine Learning Code Roundup for Mathematical Modeling

This article was last updated 457 days ago; the information in it may have since evolved or changed.

AutoGluon

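AutoGluon is an AutoML framework that helps us build machine learning models quickly.

For tabular data, the following few lines of code quickly train a regression or classification model and make predictions: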

from autogluon.tabular import TabularDataset, TabularPredictor

data_root = 'https://autogluon.s3.amazonaws.com/datasets/Inc/'
train_data = TabularDataset(data_root + 'train.csv')
test_data = TabularDataset(data_root + 'test.csv')

predictor = TabularPredictor(label='class').fit(train_data=train_data)
predictions = predictor.predict(test_data)

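For more detailed usage instructions, see the official AutoGluon documentation.

When you get a problem, start with some simple data processing, such as flattening nested tables, then split the data into train and test sets and feed them to AutoGluon to obtain a result. Treat that result as the baseline for everything that follows: you can either tune AutoGluon further or build models by hand, following the manual machine learning steps below.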

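Manual machine learning

EDA (Exploratory Data Analysis)

For a machine learning task, the first step after getting the data is exploratory data analysis (EDA), which gives a better understanding of the data and informs the later steps.

DataPrep

DataPrep is an automated EDA toolkit. A detailed EDA takes only a few lines of code: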

import pandas as pd
from dataprep.eda import create_report
df = pd.read_csv("parking_violations.csv")
create_report(df)

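These few lines produce a report containing the following:

Overview: detects the data type of each column
Variables: variable type, special values, number of distinct values, missing values
Quantile statistics, such as minimum, Q1, median, Q3, maximum, range, and interquartile range
Descriptive statistics, such as mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, and skewness
Text analysis: length, samples, letters
Correlation analysis: highlights highly correlated variables; Spearman, Pearson, and Kendall matrices
Missing values: bar chart, heatmap, and spectrum of missing values

If DataPrep's results are not enough and further EDA is needed, you have to do it by hand with the functions that pandas and related libraries provide.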
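
When the automated report is not enough, a handful of pandas calls cover much of the same ground. The sketch below is only an illustration: the DataFrame and its column names are made up, and Pearson correlation is used where Spearman or Kendall could be chosen instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real data set
df = pd.DataFrame({
    "age": [22, 38, np.nan, 35, 54],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "sex": ["male", "female", "female", None, "male"],
})

dtypes = df.dtypes                 # data type of each column
missing = df.isna().sum()          # number of missing values per column
distinct = df.nunique()            # number of distinct (non-missing) values per column
numeric_summary = df.describe()    # count, mean, std, min, quartiles, max
corr = df[["age", "fare"]].corr(method="pearson")  # correlation matrix

print(dtypes)
print(missing)
print(numeric_summary)
print(corr)
```

Each result is an ordinary pandas object, so it can be filtered, sorted, or plotted further as the problem requires.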

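Data preparation

Data preprocessing

Handling missing values (label data does not need to be filled):
- Numeric data: replace with the mean: data[A].fillna(data[A].mean())
- Categorical data: replace with the most common category: find it with data[A].value_counts(), then data[A].fillna("the most common category found above"); when many values are missing, fill with a string that stands for unknown: data[A].fillna("U")
- Predict the missing values with a model, for example K-NN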
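
The three strategies above can be sketched as follows. This is a minimal illustration on made-up data; the column names are hypothetical, and scikit-learn's KNNImputer is one possible choice for the K-NN approach, not necessarily the one the author had in mind.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: "age" and "fare" are numeric, "embarked" is categorical
data = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "fare": [10.0, 20.0, 40.0, 30.0],
    "embarked": ["S", "C", None, "S"],
})

# Numeric column: replace missing values with the column mean
age_mean_filled = data["age"].fillna(data["age"].mean())

# Categorical column: replace missing values with the most frequent category
most_common = data["embarked"].value_counts().idxmax()
embarked_filled = data["embarked"].fillna(most_common)

# Model-based: fill each gap from the rows that are nearest in the other numeric columns
knn = KNNImputer(n_neighbors=2)
numeric_filled = pd.DataFrame(
    knn.fit_transform(data[["age", "fare"]]), columns=["age", "fare"]
)

print(age_mean_filled.tolist())
print(embarked_filled.tolist())
print(numeric_filled)
```
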
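Data normalization/standardization:
- For scale-sensitive models such as SVM, standardization is strongly recommended so that the model parameters are not dominated by extreme values; for scale-invariant models such as logistic regression, standardization is still worthwhile because it speeds up training
- There are two common methods. Min-max scales to [0,1]: (x-min(x))/(max(x)-min(x)), or sklearn.preprocessing.MinMaxScaler; it suits data distributed within a bounded range with fairly concentrated values, but an unstable min/max will affect the result
- Z-score scales to mean 0 and variance 1: (x-mean(x))/std(x), or sklearn.preprocessing.scale(); it suits data whose maximum/minimum is unknown, or that contains outliers beyond the expected range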
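
To make the two formulas concrete, the sketch below applies both to a small made-up column and checks the scikit-learn scalers against the hand-written formulas. The data is invented for illustration, and StandardScaler is used as the class-based counterpart of the scale() function.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy column with one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max: (x - min) / (max - min), result lies in [0, 1]
x_minmax = MinMaxScaler().fit_transform(x)
manual_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score: (x - mean) / std, result has mean 0 and variance 1
x_zscore = StandardScaler().fit_transform(x)
manual_zscore = (x - x.mean()) / x.std()

print(x_minmax.ravel())
print(x_zscore.ravel())
```

In practice the scaler would normally be fitted on the training set only and then applied to the test set with transform, so that no information from the test data leaks into training.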

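Feature engineering

Feature engineering extracts and combines new features from the existing ones, and uses calculations to find and drop features with little relevance. It has to be done case by case: the features worth extracting differ from one problem background to another.

Numeric data: can generally be used directly, or transformed into new features through arithmetic
Categorical data:
- Two categories: for example, encode gender as 1 and 0: df.A = df.A.map({"male": 1, "female": 0})
- More than two categories: one-hot encoding: dummies = pd.get_dummies(df.A, prefix="prefix"); data = pd.concat([data, dummies], axis=1)
- String data, names: every name contains a title; extract it with split, remove surrounding spaces with .strip(), group the titles by defining a dictionary and replacing with map, then one-hot encode
- String data, cabin numbers: a[n] returns the character at index n of a string; extract it, then one-hot encode
Time series data, i.e. data collected at regular intervals over a period of time: can be converted into year, month, and day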
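
The encodings listed above can be sketched on a few Titanic-style rows. Everything here is illustrative: the rows, the column names, and the title grouping dictionary are assumptions, not data from the article.

```python
import pandas as pd

# Hypothetical Titanic-style rows; column names are assumptions for illustration
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
             "Futrelle, Mrs. Jacques Heath"],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Cabin": ["C85", "E46", "C123"],
    "Date": ["2023-01-15", "2023-02-20", "2023-03-25"],
})

# Two categories: map to 1 / 0
df["Sex"] = df["Sex"].map({"male": 1, "female": 0})

# More than two categories: one-hot encode and append the new columns
embarked_dummies = pd.get_dummies(df["Embarked"], prefix="Embarked")
df = pd.concat([df, embarked_dummies], axis=1)

# Name is "Surname, Title. Given names": take the title, strip spaces, then group it
title_map = {"Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss"}  # hypothetical grouping
df["Title"] = (
    df["Name"].map(lambda s: s.split(",")[1].split(".")[0].strip()).map(title_map)
)

# Cabin number: the first character is the deck letter
df["Deck"] = df["Cabin"].map(lambda c: c[0])

# Time data: split a date into year, month, and day
dates = pd.to_datetime(df["Date"])
df["Year"], df["Month"], df["Day"] = dates.dt.year, dates.dt.month, dates.dt.day

print(df.drop(columns=["Name"]))
```

The Title and Deck columns are still strings at this point; they would be one-hot encoded with pd.get_dummies in the same way as Embarked.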

Data splitting: training data and test data

Generally 60% to 80% (usually 70%) of the data is used as the training set and the rest as the test set.
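
A 70/30 split could look like the sketch below, which uses scikit-learn's train_test_split on made-up arrays. The fixed random_state and the stratify argument are optional choices added here for reproducibility and balanced classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # 100 samples, 2 features (toy data)
y = np.arange(100) % 2               # toy binary labels, 50 of each class

# 30% of the rows are held out for testing, the remaining 70% are used for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```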

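Model selection

Model selection means choosing the best modeling method for a given task, or the best parameters for a given model. The model (algorithm) is run on the training set and its performance is tested on the test set, and the data and model are revised iteratively. This is the idea behind cross-validation: split the data into a training set and a test set, build the model on the training set, and evaluate it on the test set to guide revisions (a single split is hold-out validation; cross-validation repeats it over several different splits). Try as many algorithms as practical and compare their results.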
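
Comparing several algorithms under cross-validation might be sketched as follows. The iris data set, the three candidate models, and the choice of 5 folds are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# 5-fold cross-validation: each model is trained on 4 folds and scored on the
# remaining fold, rotating so that every fold is used for validation once
results = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(results, key=results.get)

print(results)
print("best:", best_name)
```

Averaging over folds gives a steadier estimate than a single train/test split, which matters when the candidates score close to one another.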

Linear regression

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s) and values must be numeric and numpy arrays

x_train=input_variables_values_training_datasets
y_train=target_variables_values_training_datasets
x_test=input_variables_values_test_datasets

# Create linear regression object
linear = linear_model.LinearRegression()

# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)

#Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

#Predict Output
predicted= linear.predict(x_test)
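
The template above leaves the data as placeholders. A runnable variant on synthetic data is sketched below; the generating line y = 3x + 2 and the noise level are arbitrary choices made for the example, so that the fitted coefficient and intercept can be compared with known values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=(50, 1))
# Synthetic target: y = 3x + 2 plus a little Gaussian noise
y_train = 3.0 * x_train[:, 0] + 2.0 + rng.normal(0, 0.1, size=50)
x_test = np.array([[0.0], [5.0]])

linear = LinearRegression()
linear.fit(x_train, y_train)
r2 = linear.score(x_train, y_train)   # R^2 on the training data
predicted = linear.predict(x_test)

print("Coefficient:", linear.coef_)
print("Intercept:", linear.intercept_)
print("R^2:", r2)
print("Predictions:", predicted)
```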

Logistic regression

#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set

# Create logistic regression object

model = LogisticRegression()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

#Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)

#Predict Output
predicted= model.predict(x_test)

Decision tree

#Import Library
#Import other necessary libraries like pandas, numpy...

from sklearn import tree
#Assumed you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set

# Create tree object
# For classification; criterion can be 'gini' (the default) or 'entropy' (information gain)
model = tree.DecisionTreeClassifier(criterion='gini')

# For regression, use instead:
# model = tree.DecisionTreeRegressor()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

#Predict Output
predicted= model.predict(x_test)

SVM

#Import Library
from sklearn import svm
#Assumed you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set
# Create SVM classification object

# SVC has many options (kernel, C, gamma, ...); the defaults give a simple classifier. See the scikit-learn documentation for more detail.
model = svm.SVC()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

#Predict Output
predicted= model.predict(x_test)
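
The data preprocessing section recommends standardizing features for SVM. One way to combine the two is a scikit-learn pipeline, sketched below under the assumption that the iris data set stands in for real data; the score is taken on a held-out split, not on the training set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The pipeline fits the scaler on the training data only, then passes the
# standardized features to the SVM classifier
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)   # accuracy on held-out data
predicted = model.predict(X_test)

print("test accuracy:", test_accuracy)
```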

Naive Bayes

#Import Library
from sklearn.naive_bayes import GaussianNB
#Assumed you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set

# Create Gaussian Naive Bayes object
# Other variants exist for other feature distributions, e.g. MultinomialNB and BernoulliNB
model = GaussianNB()

# Train the model using the training sets and check score
model.fit(X, y)

#Predict Output
predicted= model.predict(x_test)

KNN

#Import Library
from sklearn.neighbors import KNeighborsClassifier

#Assumed you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set
# Create KNeighbors classifier object
model= KNeighborsClassifier(n_neighbors=6) # default value for n_neighbors is 5

# Train the model using the training sets and check score
model.fit(X, y)

#Predict Output
predicted= model.predict(x_test)

K-Means

#Import Library
from sklearn.cluster import KMeans

#Assumed you have X (attributes) for the training data set and x_test (attributes) of the test data set
# Create KMeans clustering object
model = KMeans(n_clusters=3, random_state=0)

# Fit the model on the training set (clustering is unsupervised, so no labels are passed)
model.fit(X)

#Predict Output
predicted= model.predict(x_test)

Random forest

#Import Library
from sklearn.ensemble import RandomForestClassifier
#Assumed you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set

# Create Random Forest object
model= RandomForestClassifier()

# Train the model using the training sets and check score
model.fit(X, y)

#Predict Output
predicted= model.predict(x_test)

PCA

#Import Library
from sklearn import decomposition
#Assumed you have training and test data sets as train and test
# Create PCA object; k is the number of components to keep
# (if n_components is omitted, all min(n_samples, n_features) components are kept)
k = 2  # example value, chosen here only for illustration
pca = decomposition.PCA(n_components=k)
# For factor analysis
#fa = decomposition.FactorAnalysis()
# Reduce the dimension of the training data set using PCA

train_reduced = pca.fit_transform(train)

#Reduced the dimension of test dataset
test_reduced = pca.transform(test)
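
Instead of fixing the number of components by hand, PCA can be asked to keep just enough components to reach a target share of the variance. The sketch below assumes the iris features as stand-in data and a 95% threshold, both chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X, _ = load_iris(return_X_y=True)
train, test = train_test_split(X, test_size=0.3, random_state=0)

# A float n_components in (0, 1) keeps the smallest number of components
# whose cumulative explained variance exceeds that fraction
pca = PCA(n_components=0.95, svd_solver="full")
train_reduced = pca.fit_transform(train)
test_reduced = pca.transform(test)   # reuse the projection learned on train

print("components kept:", pca.n_components_)
print("cumulative explained variance:", pca.explained_variance_ratio_.cumsum())
```

The explained_variance_ratio_ attribute is also useful with an integer n_components, as a check on how much information the chosen dimension retains.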

GBM

#Import Library
from sklearn.ensemble import GradientBoostingClassifier
#Assumed you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set
# Create Gradient Boosting Classifier object
model= GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)

# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)

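Hyperparameter tuning

Hyperparameters are essentially the parameters of the machine learning algorithm itself, and they directly affect the learning process and the predictive performance. Because no one-size-fits-all hyperparameter setting works for every data set, hyperparameter optimization (also called hyperparameter tuning or model tuning) is needed.

optuna is a hyperparameter tuning framework written in Python. A minimal optuna program has just three core concepts: the objective function (objective), a single trial (trial), and the study (study). The objective defines the function to optimize and specifies the ranges of the parameters/hyperparameters; a trial corresponds to one execution of the objective; and the study manages the optimization, deciding how to optimize, how many trials to run in total, how trial results are recorded, and so on.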

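objective: the function to be optimized; it defines the hyperparameter search space and returns the score of one trial.
trial: a single execution of the objective function.
study: an optimization session made up of a series of trials; it finds the best hyperparameters from the results of those trials.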
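
To show how the three roles fit together without relying on optuna itself, the sketch below hand-rolls a plain random search with scikit-learn. It is not optuna's API or its sampling algorithm; the loop, the trials list, and the search ranges are stand-ins chosen for illustration.

```python
import random

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def objective(params):
    """Score one hyperparameter setting (the role of the objective)."""
    model = RandomForestClassifier(random_state=0, **params)
    return cross_val_score(model, X, y, cv=3).mean()

rng = random.Random(0)
trials = []              # the record a study would keep
for _ in range(10):      # each iteration plays the role of one trial
    params = {
        "n_estimators": rng.randint(5, 20),
        "criterion": rng.choice(["gini", "entropy"]),
    }
    trials.append((objective(params), params))

# The study's job: pick the best result among all trials
best_value, best_params = max(trials, key=lambda t: t[0])

print(best_value)
print(best_params)
```

A framework such as optuna replaces the random draws with smarter sampling and handles the bookkeeping, but the division of labour is the same.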

Tuning a random forest on iris

import optuna
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

def objective(trial):
    # Split the data set: 70% for training, 30% for testing
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
    # Hyperparameters to search and their ranges
    param = {
        "n_estimators": trial.suggest_int('n_estimators', 5, 20),
        "criterion": trial.suggest_categorical('criterion', ['gini', 'entropy'])
    }

    # Train with the suggested hyperparameters and return the test accuracy
    rf_clf = RandomForestClassifier(**param)
    rf_clf.fit(X_train, y_train)
    pred = rf_clf.predict(X_test)
    score = (y_test == pred).sum() / len(y_test)
    return score

study = optuna.create_study(direction='maximize')
n_trials = 20  # run 20 trials
study.optimize(objective, n_trials=n_trials)
print(study.best_value)
print(study.best_params)

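SHAP explainable AI (XAI)

SHAP can be used after a machine learning model has been trained to analyze how each feature relates to its predictions; see the reference article.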

import shap
from sklearn import datasets, ensemble, model_selection

# load_boston was removed in scikit-learn 1.2; load_diabetes is used here as a stand-in regression data set
dataset = datasets.load_diabetes()
features = dataset.feature_names
X_train, X_test, y_train, y_test = model_selection.train_test_split(dataset.data, dataset.target, random_state=0)
regressor = ensemble.RandomForestRegressor()
regressor.fit(X_train, y_train)
# Create object that can calculate SHAP values
explainer = shap.TreeExplainer(regressor)
# Calculate SHAP values
shap_values = explainer.shap_values(X_train)
# SHAP feature importance
shap.summary_plot(shap_values, X_train, feature_names=features, plot_type="bar")
# SHAP summary plot
shap.summary_plot(shap_values, X_train, feature_names=features)
# SHAP dependence plot for the feature at index 5
shap.dependence_plot(5, shap_values, X_train, feature_names=features)
