Custom Application Metrics with Django, Prometheus, and Kubernetes

Date: 2019-12-31

English original: https://labs.meanpug.com/cust...
Author: Bobby Steinbach
Translator: 马若飞

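Editor's note

This article emphasizes the importance of custom application metrics and uses code examples to show how to design metrics and integrate Prometheus into a Django project, offering a reference for developers building applications with Django.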

Why are custom metrics important?

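Although this topic has been discussed at length, the importance of custom application metrics cannot be overstated. Unlike the core service metrics you collect for a Django application (application and web server statistics, key database and cache operation metrics), custom metrics are business-specific data points whose bounds and thresholds are known only to you. That is actually the fun part.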

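What kinds of metrics are useful? Consider the following:

You run an e-commerce site and track the average order size. Suddenly the order size isn't so average anymore. With solid application metrics and monitoring, you can catch the bug before it breaks the bank.
You're writing a scraper that pulls the latest articles from a news site every hour. Suddenly the most recent articles aren't so recent. Solid metrics and monitoring reveal the breakage sooner.
I think you get the point.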

Setting up the Django application

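Beyond the obvious dependency (pip install Django), we need a few extra packages for our pet project (translator's note: the demo). Go ahead and run pip install django-prometheus. This gives us a Python Prometheus client plus some useful Django hooks, including middleware and a neat DB wrapper. Next, we'll run the Django management commands to start the project, update our settings to use the Prometheus client, and add the Prometheus URLs to the URL config.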

Start a new project and app

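For the purposes of this post, and in keeping with our agency's brand, we'll build a dog-walking service. Mind you, it won't actually do much, but it's enough to serve as a teaching example. Run the following commands: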

django-admin startproject demo
cd demo
python manage.py startapp walker

#settings.py

INSTALLED_APPS = [
    ...
    'walker',
    ...
]

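Now let's add some basic models and views. For brevity, I'll only include the parts we're going to instrument. If you want the complete example, grab the source from the demo app.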

# walker/models.py
from django.db import models
from django_prometheus.models import ExportModelOperationsMixin


class Walker(ExportModelOperationsMixin('walker'), models.Model):
    name = models.CharField(max_length=127)
    email = models.CharField(max_length=127)

    def __str__(self):
        return f'{self.name} // {self.email} ({self.id})'


class Dog(ExportModelOperationsMixin('dog'), models.Model):
    SIZE_XS = 'xs'
    SIZE_SM = 'sm'
    SIZE_MD = 'md'
    SIZE_LG = 'lg'
    SIZE_XL = 'xl'
    DOG_SIZES = (
        (SIZE_XS, 'xsmall'),
        (SIZE_SM, 'small'),
        (SIZE_MD, 'medium'),
        (SIZE_LG, 'large'),
        (SIZE_XL, 'xlarge'),
    )

    size = models.CharField(max_length=31, choices=DOG_SIZES, default=SIZE_MD)
    name = models.CharField(max_length=127)
    age = models.IntegerField()

    def __str__(self):
        return f'{self.name} // {self.age}y ({self.size})'


class Walk(ExportModelOperationsMixin('walk'), models.Model):
    dog = models.ForeignKey(Dog, related_name='walks', on_delete=models.CASCADE)
    walker = models.ForeignKey(Walker, related_name='walks', on_delete=models.CASCADE)

    distance = models.IntegerField(default=0, help_text='walk distance (in meters)')

    start_time = models.DateTimeField(null=True, blank=True, default=None)
    end_time = models.DateTimeField(null=True, blank=True, default=None)

    @property
    def is_complete(self):
        return self.end_time is not None

    @classmethod
    def in_progress(cls):
        """ get the list of `Walk`s currently in progress """
        return cls.objects.filter(start_time__isnull=False, end_time__isnull=True)

    def __str__(self):
        return f'{self.walker.name} // {self.dog.name} @ {self.start_time} ({self.id})'

# walker/views.py
from django.shortcuts import render, redirect
from django.views import View
from django.core.exceptions import ObjectDoesNotExist
from django.http import HttpResponseNotFound, JsonResponse, HttpResponseBadRequest, Http404
from django.urls import reverse
from django.utils.timezone import now
from walker import models, forms


class WalkDetailsView(View):
    def get_walk(self, walk_id=None):
        try:
            return models.Walk.objects.get(id=walk_id)
        except ObjectDoesNotExist:
            raise Http404(f'no walk with ID {walk_id} in progress')


class CheckWalkStatusView(WalkDetailsView):
    def get(self, request, walk_id=None, **kwargs):
        walk = self.get_walk(walk_id=walk_id)
        return JsonResponse({'complete': walk.is_complete})


class CompleteWalkView(WalkDetailsView):
    def get(self, request, walk_id=None, **kwargs):
        walk = self.get_walk(walk_id=walk_id)
        return render(request, 'index.html', context={'form': forms.CompleteWalkForm(instance=walk)})

    def post(self, request, walk_id=None, **kwargs):
        try:
            walk = models.Walk.objects.get(id=walk_id)
        except ObjectDoesNotExist:
            return HttpResponseNotFound(content=f'no walk with ID {walk_id} found')

        if walk.is_complete:
            return HttpResponseBadRequest(content=f'walk {walk.id} is already complete')

        form = forms.CompleteWalkForm(data=request.POST, instance=walk)

        if form.is_valid():
            updated_walk = form.save(commit=False)
            updated_walk.end_time = now()
            updated_walk.save()

            return redirect(f'{reverse("walk_start")}?walk={walk.id}')

        return HttpResponseBadRequest(content=f'form validation failed with errors {form.errors}')


class StartWalkView(View):
    def get(self, request):
        return render(request, 'index.html', context={'form': forms.StartWalkForm()})

    def post(self, request):
        form = forms.StartWalkForm(data=request.POST)

        if form.is_valid():
            walk = form.save(commit=False)
            walk.start_time = now()
            walk.save()

            return redirect(f'{reverse("walk_start")}?walk={walk.id}')

        return HttpResponseBadRequest(content=f'form validation failed with errors {form.errors}')

Update the app settings and add the Prometheus URLs

Now that we have a Django project and its settings in place, we can add the configuration django-prometheus needs. Add the following to settings.py:

INSTALLED_APPS = [
    ...
    'django_prometheus',
    ...
]

MIDDLEWARE = [
    'django_prometheus.middleware.PrometheusBeforeMiddleware',
    ...
    'django_prometheus.middleware.PrometheusAfterMiddleware',
]

# requires `import os` at the top of settings.py
# we're assuming a Postgres DB here because, well, that's just the right choice :)
DATABASES = {
    'default': {
        'ENGINE': 'django_prometheus.db.backends.postgresql',
        'NAME': os.getenv('DB_NAME'),
        'USER': os.getenv('DB_USER'),
        'PASSWORD': os.getenv('DB_PASSWORD'),
        'HOST': os.getenv('DB_HOST'),
        'PORT': os.getenv('DB_PORT', '5432'),
    },
}

Add the URL config to urls.py:

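from django.urls import include, path

urlpatterns = [
    ...
    # exposes the /metrics endpoint provided by django-prometheus
    path('', include('django_prometheus.urls')),
]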

We now have a basic, configured application that is ready to be instrumented.

Adding Prometheus metrics

Since django-prometheus works out of the box, we can immediately track some basic model operations, such as inserts and deletes. You can see these at the /metrics endpoint:

Figure: default metrics provided by django-prometheus

Let's make it a little more interesting.

Add a walker/metrics.py file that defines some basic metrics to track.

# walker/metrics.py
from prometheus_client import Counter, Histogram


walks_started = Counter('walks_started', 'number of walks started')
walks_completed = Counter('walks_completed', 'number of walks completed')
invalid_walks = Counter('invalid_walks', 'number of walks attempted to be started, but invalid')

walk_distance = Histogram('walk_distance', 'distribution of distance walked', buckets=[0, 50, 200, 400, 800, 1600, 3200])

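Simple, isn't it? The Prometheus documentation explains the purpose of each metric type well. In short, we use counters for values that only ever increase over time, and histograms for metrics that track a distribution of values. Now let's instrument the application code.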

# walker/views.py
...
from walker import metrics
...

class CompleteWalkView(WalkDetailsView):
    ...
    def post(self, request, walk_id=None, **kwargs):
        ...
        if form.is_valid():
            updated_walk = form.save(commit=False)
            updated_walk.end_time = now()
            updated_walk.save()

            metrics.walks_completed.inc()
            metrics.walk_distance.observe(updated_walk.distance)

            return redirect(f'{reverse("walk_start")}?walk={walk.id}')

        return HttpResponseBadRequest(content=f'form validation failed with errors {form.errors}')

...

class StartWalkView(View):
    ...
    def post(self, request):
        form = forms.StartWalkForm(data=request.POST)

        if form.is_valid():
            walk = form.save(commit=False)
            walk.start_time = now()
            walk.save()

            metrics.walks_started.inc()

            return redirect(f'{reverse("walk_start")}?walk={walk.id}')

        metrics.invalid_walks.inc()

        return HttpResponseBadRequest(content=f'form validation failed with errors {form.errors}')
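
To make the `walk_distance` buckets more concrete, here is a minimal pure-Python sketch of how a Prometheus-style histogram counts observations. It is a toy stand-in, not the prometheus_client implementation; the class name `SketchHistogram` and the sample distances are assumptions made for illustration. The key idea is that each bucket is cumulative: it counts every observation less than or equal to its upper bound.

```python
from bisect import bisect_left

# Same upper bounds as the walk_distance histogram; +Inf is added implicitly.
BUCKETS = [0, 50, 200, 400, 800, 1600, 3200]


class SketchHistogram:
    """Toy stand-in for a Prometheus histogram (illustration only)."""

    def __init__(self, bounds):
        self.bounds = list(bounds) + [float("inf")]
        self.counts = [0] * len(self.bounds)  # per-bucket, non-cumulative
        self.total = 0.0                      # running sum of observed values

    def observe(self, value):
        # bisect_left returns the first bound >= value ("le" semantics).
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Return {upper_bound: number of observations <= upper_bound}."""
        running, out = 0, {}
        for bound, count in zip(self.bounds, self.counts):
            running += count
            out[bound] = running
        return out


hist = SketchHistogram(BUCKETS)
for meters in (30, 450, 450, 5000):  # hypothetical walk distances
    hist.observe(meters)
print(hist.cumulative())
```

A walk of 5000 meters lands only in the implicit +Inf bucket, which is a hint that the largest explicit bound should sit above the distances you realistically expect.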

Send a few sample requests and you'll see the new metrics show up.

Figure: metrics showing walk distance and walks created

Figure: our defined metrics can now be looked up in Prometheus

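At this point we've added custom metrics in code, instrumented the application to track them, and verified that they are updated and available at /metrics. Let's move on to deploying the instrumented application to a Kubernetes cluster.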
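
If you would rather check /metrics from a script than by eye, a rough sketch along these lines can parse the text exposition format into a dictionary. The `parse_metrics` helper and the sample payload are assumptions written for this illustration; the payload is not captured from the demo app. Note that recent versions of prometheus_client typically expose counters with a `_total` suffix, which is why the sample uses `walks_started_total`.

```python
def parse_metrics(text):
    """Parse samples from Prometheus text exposition format.

    Returns {sample name (including any label set): float value}.
    Assumes no trailing timestamps on sample lines.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Illustrative payload only, NOT real output from the demo application.
SAMPLE = """\
# HELP walks_started_total number of walks started
# TYPE walks_started_total counter
walks_started_total 3.0
# HELP walks_completed_total number of walks completed
# TYPE walks_completed_total counter
walks_completed_total 2.0
walk_distance_bucket{le="50.0"} 1.0
walk_distance_bucket{le="+Inf"} 2.0
walk_distance_count 2.0
walk_distance_sum 480.0
"""

metrics = parse_metrics(SAMPLE)
# Walks started but not yet completed, derived from the two counters.
in_progress = metrics["walks_started_total"] - metrics["walks_completed_total"]
print(in_progress)
```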

Deploying the application with Helm

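I'll only list the configuration relevant to tracking and exporting metrics; the full Helm chart deployment and service configuration can be found in the demo app. As a starting point, here are the deployment and configmap snippets related to exporting metrics: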

# helm/demo/templates/nginx-conf-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "demo.fullname" . }}-nginx-conf
  ...
data:
  demo.conf: |
    upstream app_server {
      server 127.0.0.1:8000 fail_timeout=0;
    }

    server {
      listen 80;
      client_max_body_size 4G;

      # set the correct host(s) for your site
      server_name{{ range .Values.ingress.hosts }} {{ . }}{{- end }};

      keepalive_timeout 5;

      root /code/static;

      location / {
        # checks for static file, if not found proxy to app
        try_files $uri @proxy_to_app;
      }

      location ^~ /metrics {
        auth_basic           "Metrics";
        auth_basic_user_file /etc/nginx/secrets/.htpasswd;

        proxy_pass http://app_server;
      }

      location @proxy_to_app {
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Host $http_host;
        # we don't want nginx trying to do something clever with
        # redirects, we set the Host: header above already.
        proxy_redirect off;
        proxy_pass http://app_server;
      }
    }

# helm/demo/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
...
    spec:
      metadata:
        labels:
          app.kubernetes.io/name: {{ include "demo.name" . }}
          app.kubernetes.io/instance: {{ .Release.Name }}
          app: {{ include "demo.name" . }}
      volumes:
        ...
        - name: nginx-conf
          configMap:
            name: {{ include "demo.fullname" . }}-nginx-conf
        - name: prometheus-auth
          secret:
            secretName: prometheus-basic-auth
        ...
      containers:
        - name: {{ .Chart.Name }}-nginx
          image: "{{ .Values.nginx.image.repository }}:{{ .Values.nginx.image.tag }}"
          imagePullPolicy: IfNotPresent
          volumeMounts:
            ...
            - name: nginx-conf
              mountPath: /etc/nginx/conf.d/
            - name: prometheus-auth
              mountPath: /etc/nginx/secrets/.htpasswd
          ports:
            - name: http
              containerPort: 80
              protocol: TCP
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          command: ["gunicorn", "--worker-class", "gthread", "--threads", "3", "--bind", "0.0.0.0:8000", "demo.wsgi:application"]
          env:
{{ include "demo.env" . | nindent 12 }}
          ports:
            - name: gunicorn
              containerPort: 8000
              protocol: TCP
           ...

Nothing magical here, just some YAML. Two points are worth calling out:

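We put /metrics behind authentication via an nginx reverse proxy, setting the auth_basic directives on its location block. You probably want to run gunicorn behind a reverse proxy anyway, but doing it this way has the added benefit of protecting the metrics.
We use multi-threaded gunicorn rather than multiple workers. Although the Prometheus client supports a multiprocess mode, setting it up is more involved in a Kubernetes environment. Why does this matter? The risk of running multiple workers in a single pod is that each worker reports its own set of metric values at scrape time. Because the service is targeted at the pod level in the Prometheus Kubernetes SD scrape config, these (potentially) jumping values get misclassified as counter resets, which leads to inconsistent measurements. You don't necessarily need to follow every step above, but the takeaway is: if you're unsure, start with either a single-thread, single-worker gunicorn setup or a single-worker, multi-threaded one.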
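
The counter-reset problem in the second point can be shown with a small simulation. This is a simplified sketch: the helper `increase_with_resets` only mimics the rule that any drop in a counter is treated as a reset to zero (real `rate()` and `increase()` also extrapolate), and the two workers, their values, and the strictly alternating scrapes are assumptions chosen for illustration.

```python
def increase_with_resets(samples):
    """Sum a counter's growth across scrapes, treating any drop as a reset.

    Simplified model of Prometheus counter handling (no extrapolation).
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev   # normal growth
        else:
            total += curr          # drop: assume the counter restarted from 0
    return total


# Hypothetical per-worker counter values at four scrape times.
worker_a = [10, 10, 11, 11]
worker_b = [3, 3, 3, 4]

# Each scrape is answered by only one worker; here they alternate A, B, A, B.
observed = [(worker_a if i % 2 == 0 else worker_b)[i] for i in range(4)]

true_increase = (worker_a[-1] - worker_a[0]) + (worker_b[-1] - worker_b[0])
apparent_increase = increase_with_resets(observed)

print(observed, true_increase, apparent_increase)
```

Only two walks actually started during the window, yet the series seen by Prometheus suggests far more, because every switch to the worker with the lower count looks like a restart.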

Deploying Prometheus with Helm

Following the Helm documentation, deploying Prometheus is very simple and needs no extra work:

helm upgrade --install prometheus stable/prometheus

After a few minutes, you should be able to reach the Prometheus pod via port-forward (the default container port is 9090).

Configuring a Prometheus scrape target for the application

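The Prometheus Helm chart has a huge number of customization options, but we only need to set extraScrapeConfigs. Create a values.yaml file; you can also skip this part and use the demo app as a reference. Its contents are as follows: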

extraScrapeConfigs: |
  - job_name: demo
    scrape_interval: 5s
    metrics_path: /metrics
    basic_auth:
      username: prometheus
      password: prometheus
    tls_config:
      insecure_skip_verify: true
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - default
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: demo
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: http
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        target_label: service
      - source_labels: [__meta_kubernetes_service_name]
        target_label: job
      - target_label: endpoint
        replacement: http
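
The two `keep` rules in relabel_configs decide which discovered endpoints are scraped at all. The sketch below imitates that filtering in plain Python to show the effect. It is not how Prometheus is implemented; the `kept` helper and the list of discovered targets are hypothetical. One detail it does mirror is that relabel regexes are anchored at both ends, so `demo` does not match `demo-staging`.

```python
import re

# The two keep rules from the scrape config: (source label, regex).
KEEP_RULES = [
    ("__meta_kubernetes_service_label_app", "demo"),
    ("__meta_kubernetes_endpoint_port_name", "http"),
]


def kept(target_labels, rules=KEEP_RULES):
    """Return True if a discovered target survives every keep rule.

    A missing label is treated as an empty string, and the regex must
    match the whole value (fullmatch), as Prometheus anchors its regexes.
    """
    return all(
        re.fullmatch(pattern, target_labels.get(label, "")) is not None
        for label, pattern in rules
    )


# Hypothetical discovered endpoints, for illustration only.
discovered = [
    {"__meta_kubernetes_service_label_app": "demo",
     "__meta_kubernetes_endpoint_port_name": "http"},
    {"__meta_kubernetes_service_label_app": "demo",
     "__meta_kubernetes_endpoint_port_name": "gunicorn"},
    {"__meta_kubernetes_service_label_app": "demo-staging",
     "__meta_kubernetes_endpoint_port_name": "http"},
    {"__meta_kubernetes_endpoint_port_name": "http"},
]

scraped = [t for t in discovered if kept(t)]
print(len(scraped))
```

Keeping only the port named http means Prometheus goes through nginx, and therefore through basic auth, rather than hitting gunicorn on port 8000 directly.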

Once it's created, update the configuration of the Prometheus deployment with the following command.

helm upgrade --install prometheus stable/prometheus -f values.yaml

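To verify that everything is configured correctly, open http://localhost:9090/targets in a browser (assuming you've already port-forwarded to the pod running Prometheus). If the demo app appears in the list of targets, everything is working.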
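
The same check can be scripted against the Prometheus HTTP API, which serves the target list as JSON at /api/v1/targets. The following is a sketch under assumptions: the helper names are made up, the sample response is trimmed and invented for illustration, and `fetch_targets` only works while the port-forward to localhost:9090 is running. Targets are matched on `scrapePool`, which carries the `job_name` from the scrape config, because the relabel rules above overwrite the `job` label with the service name.

```python
import json
from urllib.request import urlopen


def scrape_pool_health(payload, pool="demo"):
    """Map scrape URL -> health for active targets in one scrape pool."""
    targets = payload["data"]["activeTargets"]
    return {t["scrapeUrl"]: t["health"] for t in targets if t.get("scrapePool") == pool}


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target listing; assumes the port-forward is active."""
    with urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)


# Trimmed, made-up response for illustration; real responses carry more fields.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapePool": "demo", "health": "up",
             "scrapeUrl": "http://10.0.0.12:80/metrics", "labels": {}},
            {"scrapePool": "kubernetes-pods", "health": "up",
             "scrapeUrl": "http://10.0.0.7:9100/metrics", "labels": {}},
        ]
    },
}

health = scrape_pool_health(SAMPLE_RESPONSE)
print(health)
# Against a live server: scrape_pool_health(fetch_targets())
```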

Try it yourself

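Let me stress one point: capturing custom application metrics and setting up the corresponding reporting and monitoring is one of the most important tasks in software engineering. Fortunately, integrating Prometheus metrics into a Django application is actually quite simple, as this article has shown. If you want to start monitoring your own application, refer to the complete sample application, or simply fork the repository. Have fun.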
