Spark Machine Learning Toolchain: MLflow Tutorial

Date: 2019-11-06

This article is a translation of https://www.mlflow.org/docs/latest/concepts.html
Article address: http://www.javashuo.com/article/p-sbrldptq-nt.html, by openthings, 2018.06.07.

References:

The MLflow project was created by Databricks.
- Official homepage: https://www.mlflow.org/
- Official documentation: https://www.mlflow.org/docs/latest/index.html
Machine learning systems based on Kubernetes: http://www.javashuo.com/article/p-bpgpkqza-dt.html
Kubeflow, a machine learning workflow framework: https://my.oschina.net/u/2306127/blog/1807785
Spark machine learning toolchain, MLflow: https://my.oschina.net/u/2306127/blog/1825638

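What are we building?

In this tutorial, we walk through an example of a data scientist using MLflow to build a linear regression model end to end. We show how to use MLflow to package the code that trains the model and saves it in a reusable, reproducible model format. Finally, we use MLflow to create a simple HTTP server that serves predictions.

We use a dataset to predict wine quality from quantitative features such as "fixed acidity", "pH", and "residual sugar". The dataset comes from UCI's machine learning repository [Ref].

What do you need first?

This tutorial uses MLflow, conda, and the sample code under example/tutorial in the MLflow repository. Download the code as follows: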

git clone https://github.com/databricks/mlflow

Training the model

The first step is to train a linear regression model that takes two hyperparameters: alpha and l1_ratio.
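For reference, the objective that scikit-learn's ElasticNet minimizes, as stated in its documentation, shows how the two hyperparameters interact. The symbols are not part of the original tutorial: n is the number of training samples, X the feature matrix, y the target vector, and w the coefficient vector.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  + \alpha \, \rho \, \lVert w \rVert_1
  + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2,
\qquad \rho = \texttt{l1\_ratio}
```

With l1_ratio = 1 the penalty is purely L1 (lasso), with l1_ratio = 0 it is purely L2 (ridge), and alpha scales the overall penalty strength.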

The code is in example/tutorial/train.py, as follows:

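# Imports and helper that the excerpt below relies on.
import os
import sys

import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

import mlflow
import mlflow.sklearn


def eval_metrics(actual, pred):
    # Return RMSE, MAE, and R2 for the predictions.
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


# Read the wine-quality csv file (make sure you're running this from the root of MLflow!)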
wine_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "wine-quality.csv")
data = pd.read_csv(wine_path)

# Split the data into training and test sets. (0.75, 0.25) split.
train, test = train_test_split(data)

# The predicted column is "quality" which is a scalar from [3, 9]
train_x = train.drop(["quality"], axis=1)
test_x = test.drop(["quality"], axis=1)
train_y = train[["quality"]]
test_y = test[["quality"]]

alpha = float(sys.argv[1]) if len(sys.argv) > 1 else 0.5
l1_ratio = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5

with mlflow.start_run():
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    lr.fit(train_x, train_y)

    predicted_qualities = lr.predict(test_x)

    (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

    print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
    print("  RMSE: %s" % rmse)
    print("  MAE: %s" % mae)
    print("  R2: %s" % r2)

    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.sklearn.log_model(lr, "model")

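Here we use the pandas, numpy, and sklearn APIs to build a simple machine learning model. In addition, we use the MLflow tracking APIs to record information about each training run: the hyperparameters alpha and l1_ratio used for training, and metrics such as root mean square error used to evaluate the model. We also serialize the model in a format that MLflow can deploy.

Run the code: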

python example/tutorial/train.py

Try other values of alpha and l1_ratio by passing them as arguments to train.py:

python example/tutorial/train.py <alpha> <l1_ratio>

After each run, MLflow records the run's information in the mlruns directory.
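To make the contents of mlruns more concrete, here is a minimal sketch of reading one run back with the standard library only. It rests on assumptions that the tutorial does not state: that each run directory holds params/ and metrics/ subdirectories with one file per name, that a param file contains the value, and that each metric line starts with a timestamp followed by the value. The helper names read_run and write_fake_run, the run name, and the sample values are all hypothetical; the sketch builds a stand-in directory so it runs without MLflow installed.

```python
import os
import tempfile


def read_run(run_dir):
    """Collect params and the latest value of each metric from one run directory."""
    params, metrics = {}, {}
    params_dir = os.path.join(run_dir, "params")
    metrics_dir = os.path.join(run_dir, "metrics")
    if os.path.isdir(params_dir):
        for name in sorted(os.listdir(params_dir)):
            with open(os.path.join(params_dir, name)) as f:
                params[name] = f.read().strip()
    if os.path.isdir(metrics_dir):
        for name in sorted(os.listdir(metrics_dir)):
            with open(os.path.join(metrics_dir, name)) as f:
                lines = [line for line in f.read().splitlines() if line.strip()]
            if lines:
                # Assumed line format: "<timestamp> <value> ..."
                metrics[name] = float(lines[-1].split()[1])
    return params, metrics


def write_fake_run(root):
    """Create a stand-in run directory with placeholder values."""
    run_dir = os.path.join(root, "mlruns", "0", "example_run")
    for sub, name, text in [
        ("params", "alpha", "0.5"),
        ("params", "l1_ratio", "0.5"),
        ("metrics", "rmse", "1528300000 0.82\n"),
    ]:
        os.makedirs(os.path.join(run_dir, sub), exist_ok=True)
        with open(os.path.join(run_dir, sub, name), "w") as f:
            f.write(text)
    return run_dir


with tempfile.TemporaryDirectory() as tmp:
    params, metrics = read_run(write_fake_run(tmp))
    print(params, metrics)
```

If the layout matches, pointing read_run at a real directory under mlruns/0/ should list the same parameters and metrics that the UI displays; if it does not match, treat the UI as the source of truth.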

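Comparing the models

Next, we use the MLflow UI to compare the models we just produced. Run mlflow ui in the same working directory (the one containing mlruns) and open http://localhost:5000 in a browser.

On this page you can see the metrics that were produced.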

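From this page we can see that a lower alpha suits our model better. We can also use the search box to filter models quickly. For example, the query metrics.rmse < 0.8 returns all models whose root mean squared error is below 0.8. For more complex analysis, download the table as CSV and analyze it with your preferred software.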
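As one way to analyze the downloaded table, the following sketch applies the same rmse < 0.8 filter to CSV text using only the standard library. The column names and every value in SAMPLE are made up for illustration and are not the actual export schema; adjust them to match the header row of the file you download.

```python
import csv
import io


def filter_runs(csv_text, metric, threshold):
    """Return the rows whose `metric` column is strictly below `threshold`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if float(row[metric]) < threshold]


# Invented sample data; column names are placeholders, not the real export format.
SAMPLE = """run_id,alpha,l1_ratio,rmse
a1,0.1,0.5,0.74
b2,0.5,0.5,0.82
c3,1.0,0.5,0.86
"""

good = filter_runs(SAMPLE, "rmse", 0.8)
print([row["run_id"] for row in good])
```

For a real file, read its contents with open(path).read() and pass the text to filter_runs.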

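Packaging the training code

Now that the training code is written, we want to package it so that other data scientists can easily reuse the model, or so that it can run on a remote server. To package it, we use the MLflow Projects conventions to specify the code's dependencies and entry points. In the example/tutorial/MLproject file, we state that the project's dependencies are listed in a Conda environment file named conda.yaml, and that the project has one entry point that takes two parameters, alpha and l1_ratio: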

# example/tutorial/MLproject

name: tutorial

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      alpha: float
      l1_ratio: {type: float, default: 0.1}
    command: "python train.py {alpha} {l1_ratio}"

# example/tutorial/conda.yaml

name: tutorial
channels:
  - defaults
dependencies:
  - numpy=1.14.3
  - pandas=0.22.0
  - scikit-learn=0.19.1
  - pip:
    - mlflow

To run the project, call mlflow run example/tutorial -P alpha=0.42. MLflow then runs the training code in a new conda environment with the dependencies specified in conda.yaml.
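The relationship between the -P flag, the declared parameters, and the command line in the MLproject file can be illustrated with a few lines of Python. This is only a sketch of the idea, not MLflow's implementation: render_command is a hypothetical helper, and it assumes a parameter without a default (alpha) is required while l1_ratio falls back to its declared default of 0.1.

```python
def render_command(template, declared, supplied):
    """Fill a command template from declared parameters and -P style overrides.

    `declared` maps each parameter name to its default, or None if it is required.
    """
    values = {}
    for name, default in declared.items():
        if name in supplied:
            values[name] = supplied[name]
        elif default is not None:
            values[name] = default
        else:
            raise ValueError("missing required parameter: %s" % name)
    return template.format(**values)


# Mirrors the entry point declared in example/tutorial/MLproject.
declared = {"alpha": None, "l1_ratio": 0.1}
command = render_command(
    "python train.py {alpha} {l1_ratio}", declared, {"alpha": 0.42}
)
print(command)  # python train.py 0.42 0.1
```

Under these assumptions, running with -P alpha=0.42 corresponds to train.py receiving 0.42 and 0.1 as its two positional arguments.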

Projects can also be run directly from GitHub if the repository has an MLproject file in the root. We've duplicated this tutorial to the https://github.com/databricks/mlflow-example repository, which can be run with mlflow run git@github.com:databricks/mlflow-example.git -P alpha=0.42.

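Serving the model

Now that we have packaged the code as an MLproject and identified the best model, it is time to deploy the model with MLflow Models. An MLflow Model is a standard format for packaging machine learning models so that a range of downstream tools can use them, for example real-time serving through a REST API or batch inference on Spark.

In our training code, after training the linear regression model, we called an MLflow function that saves the model as an artifact of the run.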

mlflow.sklearn.log_model(lr, "model")

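To view this artifact, we use the UI again. Click the run in the list on the page.

Below, we see that the call to mlflow.sklearn.log_model produced two files in /Users/mlflow/mlflow-prototype/mlruns/0/7c1a0d5c42844dcdb8f5191146925174/artifacts/model. The first, MLmodel, is a metadata file that tells MLflow how to load the model. The second, model.pkl, is the serialized version of the linear regression model we trained.

In this example, we show how to use the MLmodel format with MLflow to deploy a local REST server that serves predictions.

To deploy the server, run: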

mlflow sklearn serve /Users/mlflow/mlflow-prototype/mlruns/0/7c1a0d5c42844dcdb8f5191146925174/artifacts/model -p 1234

Note:

The version of Python used to create the model must be the same as the one running mlflow sklearn. Otherwise you may see an error such as UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 1: ordinal not in range(128) or raise ValueError, "unsupported pickle protocol: %d".
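The "unsupported pickle protocol" error arises because a pickle records the protocol it was written with, and an interpreter whose pickle module only supports lower protocols rejects it. The sketch below shows where that number lives; the payload dictionary is a stand-in object with invented values, not a real trained model.

```python
import pickle

# Stand-in object with invented values; a real model.pkl holds a fitted estimator.
payload = {"coef": [0.1, -0.2], "intercept": 5.8}

data = pickle.dumps(payload, protocol=2)

# Pickles written with protocol 2 or later begin with the PROTO opcode (0x80),
# followed by one byte holding the protocol number.
written_protocol = data[1]
print("written with protocol", written_protocol)
print("highest protocol this interpreter supports:", pickle.HIGHEST_PROTOCOL)

# Loading succeeds here because the written protocol is supported.
restored = pickle.loads(data)
```

When the interpreter that wrote the file supports a higher protocol than the one reading it, loading fails, which is why the training and serving environments should use the same Python version.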

To call the prediction service, run:

curl -X POST -H "Content-Type:application/json" --data '[{"fixed acidity": 6.2, "volatile acidity": 0.66, "citric acid": 0.48, "residual sugar": 1.2, "chlorides": 0.029, "free sulfur dioxide": 29, "total sulfur dioxide": 75, "density": 0.98, "pH": 3.33, "sulphates": 0.39, "alcohol": 12.8}]' http://127.0.0.1:1234/invocations

# RESPONSE
# {"predictions": [6.379428821398614]}
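The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server started above is listening on 127.0.0.1:1234; build_request and send are hypothetical helper names, and the record repeats the values from the curl example.

```python
import json
import urllib.request


def build_request(url, records):
    """Build a JSON POST request carrying a list of feature records."""
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


record = {
    "fixed acidity": 6.2, "volatile acidity": 0.66, "citric acid": 0.48,
    "residual sugar": 1.2, "chlorides": 0.029, "free sulfur dioxide": 29,
    "total sulfur dioxide": 75, "density": 0.98, "pH": 3.33,
    "sulphates": 0.39, "alcohol": 12.8,
}
request = build_request("http://127.0.0.1:1234/invocations", [record])
# Call send(request) once the server is running to get the prediction.
```

Building the request separately from sending it makes it possible to inspect the payload without a running server.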
