Deploying the Prometheus Monitoring System with Email Alerts

2018/09/28 Prometheus

1. Installation and Deployment

1.1 Environment

IP address      Role
10.100.4.182    Prometheus Server
10.100.4.183    node_exporter

Version details:

  • Tested on: CentOS 7.5.1804
  • Prometheus: 2.4.2.linux-amd64
  • Alertmanager: 0.15.2.linux-amd64
  • node_exporter: 0.16.0.linux-amd64

Package downloads: https://prometheus.io/download/

1.2 Deploying Prometheus Server

Download the release tarballs:

$ cd /usr/local/src/
$ wget https://github.com/prometheus/prometheus/releases/download/v2.4.2/prometheus-2.4.2.linux-amd64.tar.gz
$ wget https://github.com/prometheus/alertmanager/releases/download/v0.15.2/alertmanager-0.15.2.linux-amd64.tar.gz
$ wget https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz
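
The download page lists a SHA256 checksum next to each tarball, so it may be worth verifying the files before unpacking them. The following is a minimal sketch: the helper name verify_sha256 is made up for illustration, and the expected checksum has to be copied from the download page by hand.

```shell
# Hypothetical helper: compare a file's SHA256 digest with an expected value.
# Returns 0 on a match, non-zero otherwise (including when the file is missing).
verify_sha256() {
    file="$1"
    expected="$2"
    actual=$(sha256sum "$file" 2>/dev/null | awk '{ print $1 }')
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Usage (replace the placeholder with the checksum shown on the download page):
# verify_sha256 prometheus-2.4.2.linux-amd64.tar.gz "<sha256 from download page>" \
#     && echo "checksum OK" || echo "checksum MISMATCH"
```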

Install Prometheus:

$ tar xf prometheus-2.4.2.linux-amd64.tar.gz -C /usr/local/
$ ln -sv /usr/local/prometheus-2.4.2.linux-amd64/ /usr/local/prometheus

Create the systemd unit file:

$ vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=prometheus
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus --storage.tsdb.retention=15d --log.level=info
Restart=on-failure
[Install]
WantedBy=multi-user.target

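Create the prometheus user (its home directory, /var/lib/prometheus, doubles as the TSDB data directory used in the unit file above):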

$ groupadd prometheus
$ useradd -g prometheus -m -d /var/lib/prometheus -s /sbin/nologin prometheus

1.3 Installing node_exporter

Install node_exporter on both the Prometheus node and the other node:

$ tar xf node_exporter-0.16.0.linux-amd64.tar.gz -C /usr/local/
$ ln -sv /usr/local/node_exporter-0.16.0.linux-amd64/ /usr/local/node_exporter

Create the node_exporter systemd unit file:

$ vim /usr/lib/systemd/system/node_exporter.service 
[Unit]
Description=node_exporter
Documentation=https://github.com/prometheus/node_exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target

Note: node_exporter also runs as the prometheus user, so that user must be created on every node.

Start the node_exporter service:

$ systemctl enable node_exporter.service
$ systemctl start node_exporter.service
$ systemctl status node_exporter.service
$ ss -tnl|grep 9100
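
A plain grep for 9100 would also match other ports such as 19100. As a rough sketch of a stricter check, the hypothetical helper below (port_is_listening is not part of any tool) reads the `ss -tnl` output and succeeds only when a LISTEN socket's local address ends in exactly ":<port>".

```shell
# Hypothetical helper: read `ss -tnl` output on stdin and succeed only if
# some LISTEN socket's local address ends in ":<port>".
port_is_listening() {
    awk -v suffix=":$1" '
        $1 == "LISTEN" {
            addr = $4
            if (length(addr) >= length(suffix) &&
                substr(addr, length(addr) - length(suffix) + 1) == suffix)
                found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Usage on a node:
# ss -tnl | port_is_listening 9100 && echo "node_exporter is listening"
```

Fetching http://localhost:9100/metrics with curl is another quick way to confirm that the exporter is actually serving metrics.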

2. Configuring Prometheus Scrape Targets

$ cd /usr/local/prometheus
$ vim prometheus.yml 
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090','localhost:9100'] # also scrape the node_exporter on this host
  # Newly added job: scrape the other node
  - job_name: 'linux-node01'
    # Override the global scrape interval (15s) with 5s for this job.
    scrape_interval: 5s
    static_configs:
    - targets: ['10.100.4.183:9100']

Start the Prometheus service:

$ systemctl enable prometheus.service
$ systemctl start prometheus.service
$ systemctl status prometheus.service

Open the Prometheus web UI to check the targets defined above: http://10.100.4.182:9090/targets
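
The same information is available from the HTTP API at /api/v1/targets, which can be handy on a host without a browser. As a rough sketch (the helper name is hypothetical, and the grep pattern assumes the API returns compact JSON with no whitespace after the colons), the following counts the targets whose health is not "up":

```shell
# Hypothetical helper: read the /api/v1/targets JSON on stdin and print how
# many targets report a health other than "up".
# Assumes compact JSON of the form "health":"up" (no whitespace).
count_unhealthy_targets() {
    grep -o '"health":"[a-z]*"' | grep -vc '"health":"up"' || true
}

# Usage:
# curl -s http://10.100.4.182:9090/api/v1/targets | count_unhealthy_targets
```

A result of 0 means every target is currently being scraped successfully.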

3. Configuring Prometheus Alerting

3.1 Installing and Configuring Alertmanager

$ tar xf alertmanager-0.15.2.linux-amd64.tar.gz -C /usr/local/
$ ln -sv /usr/local/alertmanager-0.15.2.linux-amd64/ /usr/local/alertmanager

# Create the systemd unit file
$ vim /usr/lib/systemd/system/alertmanager.service 
[Unit]
Description=alertmanager
Documentation=https://github.com/prometheus/alertmanager
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alert-test.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target

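The Alertmanager installation directory ships with a default alertmanager.yml. You can also create a separate configuration file and point to it at startup with --config.file, as the unit file above does.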

$ cd /usr/local/alertmanager
$ vim alert-test.yml
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'W_enzhi@163.com'
  smtp_auth_username: 'W_enzhi@163.com'
  smtp_auth_password: 'xxxxxxx' # the mailbox authorization code, not the login password
  smtp_require_tls: false

templates:
  - '/alertmanager/template/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 10m
  receiver: default-receiver

receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'wangenzhi@bd-yg.com'
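    html: '{{ template "alert.html" . }}'
    headers: { Subject: "[WARN] Alert email test" }

  • smtp_smarthost: the SMTP server address and port of the mailbox used to send alerts;
  • smtp_auth_password: the sending mailbox's authorization code, not its login password;
  • smtp_require_tls: defaults to true when unset. With true, sending failed with a starttls error in this setup, so it is set to false here for simplicity;
  • templates: the path of the email template files;
  • html (under receivers): the name of the template used for the email body. Here it is "alert.html", which is defined in one of the files under the template path;
  • headers: extra email headers; here it sets the Subject line.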

3.2 Configuring Alert Rules

Create rule.yml:

$ cd /usr/local/prometheus
$ vim rule.yml
groups:
- name: alert-rules.yml
  rules:
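  - alert: InstanceStatus # alert name
    expr: up{job="linux-node01"} == 0 # alert condition
    for: 10s # the condition must hold for 10s before the alert fires
    labels: # labels attached to the alert
      severity: "critical"
    annotations:  # extra information that is not used to identify the alert
      description: "Server {{ $labels.instance }} has been down for more than 10s"
      summary: "Server {{ $labels.instance }} status"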

Point prometheus.yml at rule.yml through rule_files:

root@k8s03-ops-bjqw:/usr/local/prometheus # cat prometheus.yml 
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093 # changed to localhost

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/usr/local/prometheus/rule.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090','localhost:9100']

  - job_name: 'linux-node01'
    scrape_interval: 5s
    static_configs:
    - targets: ['10.100.4.183:9100']
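
A syntax error in either file will keep Prometheus from starting, so it can help to validate them first with promtool, which ships in the same tarball. The wrapper below is only a sketch: the function name is hypothetical, and the default directory is the install path used in this article.

```shell
# Hypothetical wrapper around promtool (shipped in the Prometheus tarball).
# Checks the main config and the rule file; returns non-zero if either fails.
validate_prometheus_config() {
    prom_dir="${1:-/usr/local/prometheus}"
    "$prom_dir/promtool" check config "$prom_dir/prometheus.yml" &&
        "$prom_dir/promtool" check rules "$prom_dir/rule.yml"
}

# Usage:
# validate_prometheus_config && systemctl restart prometheus
```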

Restart the Prometheus service:

$ systemctl restart prometheus

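3.3 Writing the Email Template

Note: the file extension must be .tmpl so that it matches the templates path in alert-test.yml.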

$ mkdir -pv /alertmanager/template/
$ vim /alertmanager/template/alert.tmpl

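{{ define "alert.html" }}
<table>
    <tr><td>Alert name</td><td>Start time</td></tr>
    {{ range $i, $alert := .Alerts }}
        <tr><td>{{ index $alert.Labels "alertname" }}</td><td>{{ $alert.StartsAt }}</td></tr>
    {{ end }}
</table>
{{ end }}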

3.4 Starting Alertmanager

$ systemctl daemon-reload
$ systemctl start alertmanager.service
$ systemctl status alertmanager.service
$ ss -tnl|grep 9093
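
To test mail delivery without taking a node down, an alert can be pushed to Alertmanager by hand. The sketch below assumes the v1 HTTP API (/api/v1/alerts) of the 0.15.x series; the helper name, the alert name and the summary text are all made up for illustration.

```shell
# Hypothetical helper: print a one-element JSON array describing a test alert.
# Arguments are inserted verbatim, so they must not contain quotes or backslashes.
build_test_alert() {
    name="${1:-TestAlert}"
    severity="${2:-warning}"
    printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"manual test alert"}}]\n' \
        "$name" "$severity"
}

# Usage: post the alert to the local Alertmanager:
# build_test_alert TestAlert critical |
#     curl -s -XPOST -H 'Content-Type: application/json' --data @- http://localhost:9093/api/v1/alerts
```

With group_wait set to 30s in the route above, the email should go out roughly half a minute after the alert is posted.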

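4. Result

Stop the node_exporter service on linux-node01 (10.100.4.183):

$ systemctl stop node_exporter.service

Once the alert fires, the email arrives at the receiver address configured in alert-test.yml.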
