10 Machine Learning Tips for Python Developers

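Sometimes, as data scientists, we forget what we are here for. We are developers first, researchers second, and mathematicians perhaps only third. Our first responsibility is to deliver bug-free solutions quickly.

Being able to build models does not make us gods. It is no excuse for writing garbage code.

I have made plenty of mistakes since I started learning machine learning, so I want to share the skills I consider most commonly needed in machine learning engineering. In my view, they are also the skills the industry lacks most right now.

Let's begin.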

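Learn to write abstract classes

Once you start writing abstract classes, you will see the benefit they bring. An abstract class forces its subclasses to use the same methods and method names. When many people work on the same project and each one defines different methods, the result is unnecessary and easily causes confusion.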

import os
from abc import ABCMeta, abstractmethod


class DataProcessor(metaclass=ABCMeta):
    """Base processor to be used for all preparation."""

    def __init__(self, input_directory, output_directory):
        self.input_directory = input_directory
        self.output_directory = output_directory

    @abstractmethod
    def read(self):
        """Read raw data."""

    @abstractmethod
    def process(self):
        """Processes raw data. This step should create the raw dataframe with all the required features. Shouldn't implement statistical or text cleaning."""

    @abstractmethod
    def save(self):
        """Saves processed data."""


class Trainer(metaclass=ABCMeta):
    """Base trainer to be used for all models."""

    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def preprocess(self):
        """This takes the preprocessed data and returns clean data. This is more about statistical or text cleaning."""

    @abstractmethod
    def set_model(self):
        """Define model here."""

    @abstractmethod
    def fit_model(self):
        """This takes the vectorised data and returns a trained model."""

    @abstractmethod
    def generate_metrics(self):
        """Generates metric with trained model and test data."""

    @abstractmethod
    def save_model(self, model_name):
        """This method saves the model in our required format."""


class Predict(metaclass=ABCMeta):
    """Base predictor to be used for all models."""

    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def load_model(self):
        """Load model here."""

    @abstractmethod
    def preprocess(self):
        """This takes the raw data and returns clean data for prediction."""

    @abstractmethod
    def predict(self):
        """This is used for prediction."""


class BaseDB(metaclass=ABCMeta):
    """Base database class to be used for all DB connectors."""

    @abstractmethod
    def get_connection(self):
        """This creates a new DB connection."""

    @abstractmethod
    def close_connection(self):
        """This closes the DB connection."""

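Fix the random seed

Reproducibility of experiments is very important, and an unset random seed is our enemy. Pay particular attention to setting the seed. Otherwise you get different train/test splits and different weight initializations in neural networks, which ultimately lead to inconsistent results.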

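import random
import numpy as np
import torch

def set_seed(args):
    # args is expected to expose .seed and .n_gpu (for example, an argparse namespace).
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)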
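
The effect of a fixed seed on a train/test split can be sketched with the standard library alone. The split_indices helper below is a hypothetical name made up for this example, and it uses Python's random module, not NumPy or PyTorch.

```python
import random


def split_indices(n_rows, test_fraction, seed):
    """Shuffle row indices with a private, seeded RNG and split them."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# Two runs with the same seed produce the same split.
train_a, test_a = split_indices(10, 0.2, seed=42)
train_b, test_b = split_indices(10, 0.2, seed=42)
print(test_a == test_b)
```

Using a private random.Random(seed) instance, not the global generator, keeps the split stable even if other code draws random numbers in between.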

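Load a small amount of data first

If your dataset is very large and you are working on a later part of the code, such as data cleaning or modeling, use nrows to avoid loading the whole thing every time. Use this when you only want to test the code, not run the full program.

This works well when your local machine cannot handle the full data volume but you like developing in Jupyter/VS Code/Atom.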

import pandas as pd

f_train = pd.read_csv('train.csv', nrows=1000)
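
One way to avoid editing the read_csv call back and forth is a single debug switch. This is only a sketch: the DEBUG flag and the in-memory CSV are assumptions made up for the example. It relies on nrows=None, the pandas default, which reads every row.

```python
import io

import pandas as pd

# Stand-in for a large file on disk; the contents are made up.
csv_text = "id,value\n" + "\n".join(f"{i},{i * i}" for i in range(10_000))

DEBUG = True  # hypothetical switch: set to False for the full run

# nrows=None is the read_csv default and reads every row.
df_sample = pd.read_csv(io.StringIO(csv_text), nrows=1000 if DEBUG else None)
print(df_sample.shape)
```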

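Anticipate failure (the mark of a mature developer)

Always check your data for NA (missing values), because they can cause problems later. Even if your current data has none, that does not mean they will not show up in a future training loop. So keep an eye on this regardless.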

print(len(df))
print(df.isna().sum())  # NA count per column
df = df.dropna()  # dropna returns a new frame, so assign it back
print(len(df))
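
In a pipeline that is retrained regularly, you can go one step further and fail fast when missing values appear. The assert_no_missing helper below is a hypothetical function, not part of pandas, and the data is made up.

```python
import pandas as pd


def assert_no_missing(df, columns):
    """Raise ValueError naming every listed column that contains NA values."""
    na_counts = df[columns].isna().sum()
    bad = na_counts[na_counts > 0]
    if not bad.empty:
        raise ValueError(f"Missing values found: {bad.to_dict()}")


# Made-up data with one missing label.
df = pd.DataFrame({"text": ["a", "b", "c"], "label": [1.0, None, 0.0]})

try:
    assert_no_missing(df, ["text", "label"])
    message = None
except ValueError as exc:
    message = str(exc)
    print(message)

# After dropping the incomplete row, the same check passes silently.
clean = df.dropna(subset=["label"])
assert_no_missing(clean, ["text", "label"])
```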

Show processing progress

When you process a large amount of data, it helps a great deal to know how far along you are and how much time is left.

Option 1: tqdm

from tqdm import tqdm
import time

tqdm.pandas()  # registers progress_apply on pandas objects

# df is assumed to be an existing DataFrame with a numeric column 'col'.
df['col'] = df['col'].progress_apply(lambda x: x**2)

text = ""
for char in tqdm(["a", "b", "c", "d"]):
    time.sleep(0.25)
    text = text + char

Option 2: fastprogress

from fastprogress.fastprogress import master_bar, progress_bar
from time import sleep

mb = master_bar(range(10))
for i in mb:
    for j in progress_bar(range(100), parent=mb):
        sleep(0.01)
        mb.child.comment = 'second bar stat'
    # Older fastprogress releases named this attribute first_bar.
    mb.main_bar.comment = 'first bar stat'
    mb.write(f'Finished loop {i}.')

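Fix slow pandas

If you have used pandas, you know how slow it can be at times, especially for operations such as groupby. Instead of racking your brain for a speedup, try modin by changing one line of code.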

import modin.pandas as pd

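Time your functions

Not all functions are created equal.

Even if all of your code runs, that does not mean you have written good code. Some soft bugs make your code slower without breaking it, so you need to find them. Use this decorator to log how long a function takes.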

import time
from functools import wraps


def timing(f):
    """Decorator for timing functions
    Usage:
    @timing
    def function(a):
        pass
    """

    @wraps(f)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = f(*args, **kwargs)
        end = time.time()
        print('function:%r took: %2.2f sec' % (f.__name__, end - start))
        return result
    return wrapper
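
A short usage sketch follows. It repeats the decorator so that it runs on its own, with one deliberate change: it uses time.perf_counter, a monotonic clock better suited to measuring intervals than time.time. The slow_square function is a hypothetical stand-in for real work.

```python
import time
from functools import wraps


def timing(f):
    """Timing decorator, here based on time.perf_counter."""
    @wraps(f)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = f(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print('function:%r took: %2.2f sec' % (f.__name__, elapsed))
        return result
    return wrapper


@timing
def slow_square(x):
    """Hypothetical slow function used only for the demo."""
    time.sleep(0.1)
    return x * x


value = slow_square(4)
```

Because of functools.wraps, the decorated function keeps its own name and docstring, so logs and tracebacks still point at slow_square and not at wrapper.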

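Don't burn money in the cloud

Nobody likes an engineer who wastes cloud resources.

Some of our experiments run for hours. It is hard to keep track of them and shut down the cloud instance when they finish. I have made this mistake myself, and I have seen people leave instances running for days.

This typically happens when we start something on a Friday, leave it running, and only realize it on Monday.

Just call this function at the end of execution and never get your ass on fire again!

Wrap the main function in try and call this function in the except block as well, so that the server is not left running if an exception occurs. I have dealt with cases like this.

Let's be a bit more responsible and do our part to cut carbon emissions.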

import math
import os


def run_command(cmd):
    return os.system(cmd)


def shutdown(seconds=0, os_name='linux'):
    """Shutdown system after seconds given. Useful for shutting EC2 to save costs."""
    # The parameter is called os_name so that it does not shadow the os module.
    if os_name == 'linux':
        # Linux shutdown takes its delay in minutes (+0 means now), so round up.
        run_command('sudo shutdown -h +%d' % math.ceil(seconds / 60))
    elif os_name == 'windows':
        run_command('shutdown -s -t %s' % seconds)
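
The try/except advice could be wired up as in the sketch below. It uses a finally clause, so the shutdown request is written once and runs on both success and failure. The shutdown here is a stub that only records the call, and main and run_and_shutdown are hypothetical names for this example. In a real script you would call the actual shutdown function.

```python
import logging

logging.basicConfig(level=logging.INFO)
shutdown_calls = []


def shutdown(seconds=0):
    """Stub: records the request and does not power the machine off."""
    shutdown_calls.append(seconds)


def main():
    """Hypothetical experiment that fails partway through."""
    raise RuntimeError("out of memory")


def run_and_shutdown(job, delay_seconds=60):
    """Run job, log any exception, and always request a shutdown afterwards."""
    try:
        job()
    except Exception:
        # Keep the traceback in the log so the failure can be inspected later.
        logging.exception("Experiment failed")
    finally:
        shutdown(seconds=delay_seconds)


run_and_shutdown(main)
print(shutdown_calls)
```

A short delay before shutdown leaves time for logs to flush to disk, or for you to cancel if you are still logged in.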

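Create and save reports

After a certain point in modeling, all the deep insights come from analyzing errors and metrics. Make sure you create and save well-formatted reports for yourself and your manager.

Management loves reports anyway, right?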

import json
import os

from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, fbeta_score)


def get_metrics(y, y_pred, beta=2, average_method='macro', y_encoder=None):
    if y_encoder:
        y = y_encoder.inverse_transform(y)
        y_pred = y_encoder.inverse_transform(y_pred)
    return {
        'accuracy': round(accuracy_score(y, y_pred), 4),
        'f1_score_macro': round(f1_score(y, y_pred, average=average_method), 4),
        # beta is passed by keyword, as recent scikit-learn releases require.
        'fbeta_score_macro': round(fbeta_score(y, y_pred, beta=beta, average=average_method), 4),
        'report': classification_report(y, y_pred, output_dict=True),
        'report_csv': classification_report(y, y_pred, output_dict=False).replace('\n', '\r\n')
    }


def save_metrics(metrics: dict, model_directory, file_name):
    path = os.path.join(model_directory, file_name + '_report.txt')
    # classification_report_to_csv is a helper that this article does not show.
    classification_report_to_csv(metrics['report_csv'], path)
    metrics.pop('report_csv')
    path = os.path.join(model_directory, file_name + '_metrics.json')
    with open(path, 'w') as f:
        json.dump(metrics, f, indent=4)

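Write good APIs

If it ends badly, it is all bad.

You can do great data cleaning and modeling and still make a huge mess at the end. My experience working with people tells me that many are unclear about how to write good APIs, documentation, and server setup. I will write a separate article on this soon, but let me share part of it briefly here.

The approach below suits both classical machine learning and deep learning deployments under moderate load (about 1000 requests per minute).

Meet the combo: FastAPI + uvicorn + gunicorn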

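Fastest: write the API in fastapi, because it is the fastest. See this article for the reasons.
Documentation: writing the API in fastapi gives us free documentation and test endpoints at http://url/docs, which fastapi generates and updates automatically as we change the code.
Workers: deploy the API with the gunicorn server, because gunicorn can start more than one worker, and you should keep at least 2 workers.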

Run these commands to deploy with 4 workers. You can tune the number of workers through load testing.

pip install fastapi uvicorn gunicorn
gunicorn -w 4 -k uvicorn.workers.UvicornH11Worker main:app

Original article: http://suo.im/5MoQTN