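Our university lab manages a network with a dozen or so compute servers, but only some of them are currently monitored. I wanted to be able to monitor all of them, and this article documents how I did it.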

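Background

The current setup has the following problems.

- There is no single place to see how much CPU or memory is in use on each compute server in the lab network; the only way is to SSH into each server and check with commands.
- When another student is using most of a server's resources, you have to use one of the remaining servers, so students must run top or a similar command every time to check whether someone is already using a machine.
- There is no mechanism to send a notification when, for example, a program running on a compute server uses too much memory.
- There is no liveness monitoring for the compute servers or the management servers, so when a server is found stopped we cannot tell since when it has been down, which makes it hard to identify the cause.
- A senior lab member set up Munin monitoring for some of the compute servers, but the monitoring server running Munin is itself a compute server, so I want to move it to a separate management server.
- All Munin configuration has to be done by hand, so human errors such as copy-paste mistakes can happen.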

Munin is an agent-based monitoring tool: the monitoring server runs munin-master and each monitored server runs munin-node. Liveness monitoring by ping, on the other hand, does not require an agent on the monitored host.

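Goals

To solve these problems, I set the following goals.

- Monitor every compute server with Munin
- Send a notification to the lab's Slack when a monitored value is abnormal
- Also perform liveness monitoring
- Automate the Munin monitoring configuration with Ansible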

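Munin configuration

Overview

First, install Munin by hand on the management server. In our lab, a CentOS 6.9 (Final) server manages the shared accounts and storage via NFS and NIS, so I made this server the master for both Munin and Ansible. (Hereafter it is called the master server, 10.0.0.0.)

Installing Munin

The repository URLs on master still pointed at the old location, so fix them first.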

[root@master ~]# sed -i -e "s|mirror\.centos\.org/centos/\$releasever|vault\.centos\.org/6.9|g" /etc/yum.repos.d/CentOS-Base.repo
[root@master ~]# sed -i -e "s|#baseurl=|baseurl=|g" /etc/yum.repos.d/CentOS-Base.repo
[root@master ~]# sed -i -e "s|mirrorlist=|#mirrorlist=|g" /etc/yum.repos.d/CentOS-Base.repo

Delete the existing cache. After this, the yum command worked normally.

[root@master ~]# yum clean all
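
To see what the three sed substitutions do before touching the real file, they can be tried on a throwaway copy. The following is a minimal sketch: the sample stanza is a hypothetical excerpt written for illustration (not copied from the actual CentOS-Base.repo), and fix_repo is a helper name introduced here. It prints the result instead of editing in place with -i as the article does.

```shell
# Hypothetical sample stanza, written to a temp file so no real repo file is touched.
repo=$(mktemp)
cat > "$repo" <<'EOF'
[base]
name=CentOS-$releasever - Base
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
#baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/
EOF

# fix_repo FILE: print FILE with the same three substitutions as in the article.
fix_repo() {
  sed -e 's|mirror\.centos\.org/centos/\$releasever|vault.centos.org/6.9|g' \
      -e 's|#baseurl=|baseurl=|g' \
      -e 's|mirrorlist=|#mirrorlist=|g' "$1"
}

fixed=$(fix_repo "$repo")
printf '%s\n' "$fixed"
rm -f "$repo"
```

Note that the substitutions are not idempotent: applying them twice to the same file would turn #mirrorlist= into ##mirrorlist=, so they should be run only once per file.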

Install httpd (skipped here because it was already installed).
Start httpd.

[root@master ~]# service httpd start
httpd を起動中:                               [  OK  ]
[root@master ~]#

Install munin (the master side) and munin-node

[root@master ~]# yum --enablerepo=epel -y install munin munin-node

By the way, the epel repository here is EPEL6. With EPEL5 you would probably run into package dependency problems. (I don't fully understand this part.)

[root@master ~]# yum repolist all | grep -i epel
 * epel: ftp.iij.ad.jp
epel                   Extra Packages for Enterprise Linux 6 - x86_ 有効: 12,581
epel-debuginfo         Extra Packages for Enterprise Linux 6 - x86_ 無効
epel-source            Extra Packages for Enterprise Linux 6 - x86_ 無効
epel-testing           Extra Packages for Enterprise Linux 6 - Test 無効
epel-testing-debuginfo Extra Packages for Enterprise Linux 6 - Test 無効
epel-testing-source    Extra Packages for Enterprise Linux 6 - Test 無効
[root@master ~]#

Configuring the Munin master

Because Ansible will write to this file later, the settings for monitored hosts go in /etc/munin/conf.d/hosts.conf rather than /etc/munin/munin.conf.

[root@master ~]# vi /etc/munin/conf.d/hosts.conf
# Add the following
[munin-master]
      address 127.0.0.1
      use_node_name yes
[root@master ~]# vi /etc/munin/conf.d/local.conf
# Comment out the following
[localhost]
      address 127.0.0.1
      use_node_name yes
[root@master ~]# service httpd restart
httpd を停止中:                                            [  OK  ]
httpd を起動中:                                            [  OK  ]
[root@master ~]#

Start munin-node and enable it at boot

[root@master ~]# service munin-node start
Starting munin-node:                                       [  OK  ]
[root@master ~]# chkconfig munin-node on

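After a while, accessing hxxp://10.0.0.0/munin/ showed the munin-master entry! By default Munin is run by cron every 5 minutes, so you may need to wait up to 5 minutes before the page is updated.

Changing the Munin layout

The default layout is a bit hard to read, so I changed it as follows.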

[root@master ~]# git clone https://github.com/munin-monitoring/contrib.git
[root@master ~]# cd /etc/munin/
[root@master munin]# cp -rb /root/contrib/templates/munstrap/templates .
[root@master munin]# cp -rb /root/contrib/templates/munstrap/static .

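Setting up munin-node on a monitored host

Before automating with Ansible, I checked what the procedure looks like when done by hand.
The monitored host here is an Ubuntu server (hereafter ubuntu1, with IP address 10.0.0.1).

Install munin-node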

root@ubuntu1:~$ apt-get install munin-node

Allow connections from the munin-master IP

root@ubuntu1:~$ vim /etc/munin/munin-node.conf
# Add the following (the master's IP address, with the dots escaped)
allow ^10\.0\.0\.0$
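
The allow line is a regular expression that munin-node matches against the address of the connecting host, which should be the master (10.0.0.0 in this article). The sketch below uses grep -E as a stand-in for that matching, which is an assumption made for illustration (munin-node itself is written in Perl). matches_allow is a helper name introduced here. It shows why escaping the dots matters.

```shell
# matches_allow PATTERN ADDRESS: succeed if ADDRESS matches the allow PATTERN.
matches_allow() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

matches_allow '^10\.0\.0\.0$' '10.0.0.0' && echo "master (10.0.0.0) is allowed"
matches_allow '^10\.0\.0\.0$' '10.0.0.1' || echo "ubuntu1 (10.0.0.1) is rejected"

# An unescaped dot matches any character, so the pattern is looser than intended.
matches_allow '^10.0.0.0$' '10a0b0c0' && echo "unescaped pattern also matches 10a0b0c0"
```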

Restart munin-node

root@ubuntu1:~$ systemctl restart munin-node

Enable automatic startup

root@ubuntu1:~$ systemctl enable munin-node

Add the monitored host on munin-master

[root@master ~]# vi /etc/munin/conf.d/hosts.conf
# Append the following
[calc-server;ubuntu1]
      address 10.0.0.1

This covers the basic metrics, but I also want ping-based liveness monitoring, so add a plugin as well.
Set the host name to ping-ubuntu1.

[root@master munin]# ln -s /usr/share/munin/plugins/ping_ /etc/munin/plugins/ping_10.0.0.1
[root@master munin]# vim  /etc/munin/plugin-conf.d/munin-node
[ping_10.0.0.1]
    host_name ping-ubuntu1
    env.packetloss_critical 50

Confirm that the plugin runs

[root@master munin]# munin-run ping_10.0.0.1
packetloss.value 0
ping.value 0.000107
[root@master munin]#
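
As a rough illustration of how the packetloss value relates to env.packetloss_critical 50, the sketch below checks the sample output above against that threshold. It assumes that a single threshold value acts as an upper bound, so a value above 50 counts as critical. check_packetloss is a helper name introduced here and is not part of Munin.

```shell
# Sample output copied from the munin-run example above.
sample_output='packetloss.value 0
ping.value 0.000107'

# check_packetloss OUTPUT LIMIT: print CRITICAL if packetloss.value exceeds LIMIT, else OK.
check_packetloss() {
  printf '%s\n' "$1" | awk -v limit="$2" '
    $1 == "packetloss.value" { print (($2 + 0 > limit + 0) ? "CRITICAL" : "OK") }'
}

check_packetloss "$sample_output" 50
```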

munin-node has to be restarted here! I forgot this and got stuck.

[root@master munin]# service munin-node restart
Stopping munin-node:                                       [  OK  ]
Starting munin-node:                                       [  OK  ]
[root@master munin]#

Create ping-ubuntu1 under the group healthcheck

[root@master munin]# vim /etc/munin/conf.d/hosts.conf
[healthcheck;ping-ubuntu1]
      address 127.0.0.1
      use_node_name no
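
Each monitored host ends up with two stanzas in hosts.conf: one for the node itself and one for the ping check, which is served by the munin-node on master (hence address 127.0.0.1). A minimal sketch of generating that pair for a given host is shown below. host_stanzas is a helper name introduced here for illustration; it only prints text and does not modify any file.

```shell
# host_stanzas NAME ADDRESS: print the two hosts.conf stanzas used in this article.
host_stanzas() {
  cat <<EOF
[calc-server;$1]
      address $2

[healthcheck;ping-$1]
      address 127.0.0.1
      use_node_name no
EOF
}

host_stanzas ubuntu1 10.0.0.1
```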

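With this, ping-based liveness monitoring for ubuntu1 works too!

Automating with Ansible

The idea is that the master server, where Ansible is installed, operates the target servers through Ansible.
As with Munin, Ansible is installed on master, and the target here is ubuntu2 (10.0.0.2), an Ubuntu server similar to ubuntu1.

SSH setup

For Ansible to make changes on a target server, master must be able to log in to it with SSH public key authentication, so set that up.

First, uncomment PubkeyAuthentication yes on ubuntu2 so that it accepts public key authentication.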

root@ubuntu2:~$ vim /etc/ssh/sshd_config
# Uncomment PubkeyAuthentication yes
root@ubuntu2:~$

Next, copy the public key created on the Ansible host to the target host.

root@ubuntu2:~# mkdir .ssh
root@ubuntu2:~# cd .ssh/
root@ubuntu2:~/.ssh# vim authorized_keys
# Paste the public key (id_rsa.pub) created on the Ansible host as is

Now the Ansible host can SSH in with public key authentication.

[root@master ~]# ssh root@10.0.0.2
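
The manual copy-and-paste step can also be scripted. The sketch below is one possible way to do it, not the procedure used in this article: install_pubkey is a helper name introduced here, the key is a placeholder string rather than a real key, and a temporary directory stands in for root's home directory on ubuntu2. It also sets the restrictive permissions that sshd commonly expects on .ssh and authorized_keys.

```shell
# install_pubkey HOME_DIR PUBKEY_LINE: append the key to HOME_DIR/.ssh/authorized_keys.
install_pubkey() {
  mkdir -p "$1/.ssh"
  chmod 700 "$1/.ssh"
  # Append only if the exact line is not already present.
  grep -qxF -- "$2" "$1/.ssh/authorized_keys" 2>/dev/null \
    || printf '%s\n' "$2" >> "$1/.ssh/authorized_keys"
  chmod 600 "$1/.ssh/authorized_keys"
}

demo_home=$(mktemp -d)                              # stand-in for /root on ubuntu2
demo_key='ssh-rsa PLACEHOLDER-KEY-DATA root@master' # placeholder, not a real key
install_pubkey "$demo_home" "$demo_key"
install_pubkey "$demo_home" "$demo_key"             # the second call adds nothing
cat "$demo_home/.ssh/authorized_keys"
```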

Ansible directory layout

[root@master ansible]# tree
.
├── ansible.cfg                            
├── group_vars                           # directory for the per-group variables of the inventory file
│   └── master                             # the master group
│       └── alert_conf.yml             # Munin alert thresholds and related values
├── playbooks
│   ├── munin-setting.yml           # playbook that makes the groups defined in the inventory file Munin targets
│   └── roles
│       └── munin                          # directory for the Munin-related configuration
│           └── tasks
│               ├── main.yml               # file executed first when the role is called
│               ├── master-config.yml      # tasks that operate on munin-master
│               └── node-default-config.yml   # tasks that operate on munin-node (the monitored hosts)
└── production                         # inventory file

./ansible.cfg file

[defaults]
retry_files_enabled = False

./production file

[master]
10.0.0.0 ansible_connection=local

[ubuntu_calcservers_python3]
10.0.0.1 name=ubuntu1 cpucore=32 memory=125
10.0.0.2 name=ubuntu2 cpucore=32 memory=125

[ubuntu_calcservers_python3:vars]
ansible_python_interpreter=/usr/bin/python3

[calcservers:children]
ubuntu_calcservers_python3

./group_vars/master/alert_conf.yml file

---
# Ratios used for the Munin alert thresholds
cpu_warning_ratio: 1.0

memory_warning_ratio: 0.8

memory_critical_ratio: 1.0

./playbooks/munin-setting.yml file

---
# munin-node settings for the Ubuntu group that uses /usr/bin/python3
- hosts: ubuntu_calcservers_python3
  roles:
    - munin

# Append the munin-master settings
- hosts: master
  tasks:
    - include_tasks: roles/munin/tasks/master-config.yml
      loop: "{{ groups['calcservers'] }}"
      loop_control:
        loop_var: server

./playbooks/roles/munin/tasks/main.yml file

---
- name: node default config
  include_tasks: node-default-config.yml

./playbooks/roles/munin/tasks/master-config.yml file

---
- name: Append monitoring settings for host {{server}}
  blockinfile:
    path: /etc/munin/conf.d/hosts.conf
    insertafter: "^.*$"
    marker: "# {mark} ANSIBLE MANAGED BLOCK {{ hostvars[server].name }}"
    block: |
      [calc-server;{{hostvars[server].name}}]
          address {{server}}
          cpu.user.warning :{{ (hostvars[server].cpucore * 100 * cpu_warning_ratio) | int }}
          memory.apps.warning :{{ (hostvars[server].memory * 1073741824 * memory_warning_ratio) | int }}
          memory.apps.critical :{{ (hostvars[server].memory * 1073741824 * memory_critical_ratio) | int }}

      [healthcheck;ping-{{ hostvars[server].name }}]
          address 127.0.0.1
          use_node_name no

- name: Create the symlink for the ping plugin
  file:
    src: /usr/share/munin/plugins/ping_
    dest: /etc/munin/plugins/ping_{{server}}
    state: link

- name: Append the ping plugin settings
  blockinfile:
    path: /etc/munin/plugin-conf.d/00-default
    insertafter: "^.*$"
    marker: "# {mark} ANSIBLE MANAGED BLOCK {{ hostvars[server].name }}"
    block: |
      [ping_{{server}}]
          host_name ping-{{ hostvars[server].name }}
          env.packetloss_critical 50

- name: Restart munin-node
  service:
    name: munin-node
    state: restarted
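
The thresholds written to hosts.conf are plain arithmetic on the inventory variables: CPU cores times 100 (percent per core) times the ratio, and memory in GiB times 1073741824 (bytes per GiB) times the ratio. The sketch below reproduces that calculation with awk as a sanity check, using the values from the inventory and alert_conf.yml in this article. cpu_warning and mem_threshold are helper names introduced here, and int() is used to mimic the truncation of the template's int filter.

```shell
# cpu_warning CORES RATIO: CPU warning threshold in percent (100 per core).
cpu_warning() {
  awk -v c="$1" -v r="$2" 'BEGIN { printf "%.0f\n", int(c * 100 * r) }'
}

# mem_threshold GIB RATIO: memory threshold in bytes (1 GiB = 1073741824 bytes).
mem_threshold() {
  awk -v g="$1" -v r="$2" 'BEGIN { printf "%.0f\n", int(g * 1073741824 * r) }'
}

cpu_warning 32 1.0       # cpu.user.warning for a 32-core host
mem_threshold 125 0.8    # memory.apps.warning for 125 GiB
mem_threshold 125 1.0    # memory.apps.critical for 125 GiB
```

For a 32-core host with 125G of memory this gives 3200, 107374182400 and 134217728000 respectively.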

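./playbooks/roles/munin/tasks/node-default-config.yml file
I wanted to automate everything with Ansible, including installing munin-node on the target hosts, but the lab servers are poorly maintained and apt-get or yum sometimes does not work properly in the first place, so that step has to be done by hand.

---
# This sometimes fails, so do it by hand...
#- name: apt-get install munin-node
#  apt:
#    name: munin-node
#    force_apt_get: true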

- name: Allow connections from the munin-master IP
  blockinfile:
    path: /etc/munin/munin-node.conf
    insertafter: "^allow .*$"
    block: |
      allow ^10\.0\.0\.0$

- name: Restart munin-node
  service:
    name: munin-node
    state: restarted

- name: Enable munin-node at boot
  service:
    name: munin-node
    enabled: yes

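Example: deploying Munin with Ansible

Here is an example of deploying Munin to the Ubuntu server ubuntu2 with Ansible.
First, install munin-node on ubuntu2 by hand with apt-get install munin-node.

Assume that master can log in over SSH with public key authentication, following the steps described earlier.
Also assume that Python can be run as /usr/bin/python3 on ubuntu2.

Next, register the monitored host in the inventory file on master.
It is an Ubuntu server where the python3 command is available, so add the following under [ubuntu_calcservers_python3].
It has 32 CPU cores and 125G of usable memory (checked with free -h).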

[root@master ansible]# pwd
/root/ansible
[root@master ansible]# vim production
10.0.0.2 name=ubuntu2 cpucore=32 memory=125

Now run the Ansible playbook! If nothing is reported as failed, it worked.

[root@master ansible]# ansible-playbook playbooks/munin-setting.yml -i production
(output omitted)
PLAY RECAP *********************************************************************
10.0.0.0             : ok=46   changed=9    unreachable=0    failed=0
10.0.0.1            : ok=5    changed=2    unreachable=0    failed=0
10.0.0.2             : ok=5    changed=1    unreachable=0    failed=0
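
Checking the recap by eye works for three hosts, but with more than ten servers it may be easier to filter it. The sketch below is one way to list only the hosts with a nonzero failed or unreachable count. failed_hosts is a helper name introduced here, and the recap text is the sample output above.

```shell
# Recap lines copied from the run above.
recap='10.0.0.0             : ok=46   changed=9    unreachable=0    failed=0
10.0.0.1            : ok=5    changed=2    unreachable=0    failed=0
10.0.0.2             : ok=5    changed=1    unreachable=0    failed=0'

# failed_hosts RECAP: print hosts whose line has a nonzero unreachable= or failed= count.
failed_hosts() {
  printf '%s\n' "$1" | awk '{
    for (i = 2; i <= NF; i++) {
      split($i, kv, "=")
      if ((kv[1] == "failed" || kv[1] == "unreachable") && kv[2] + 0 > 0) {
        print $1
        break
      }
    }
  }'
}

if [ -z "$(failed_hosts "$recap")" ]; then
  echo "all hosts OK"
fi
```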

Checking hxxp://10.0.0.0/munin/ about 5 minutes later shows that an entry for ubuntu2 has been added. Done!

Slack notifications

Someone kindly wrote a shell script for sending the notifications, so I made use of it!

gist.github.com

Create a file with the contents described below.

[root@master munin]# vi /usr/local/bin/notify_slack_munin

The file contents are the script from the gist above.
Just set SLACK_CHANNEL, SLACK_WEBHOOK_URL, SLACK_USERNAME and SLACK_ICON_EMOJI to appropriate values.
Grant execute permission.

[root@master munin]# chmod 775 /usr/local/bin/notify_slack_munin

Append the following to /etc/munin/munin.conf

contact.slack.command MUNIN_SERVICESTATE="${var:worst}" MUNIN_HOST="${var:host}" MUNIN_SERVICE="${var:graph_title}" MUNIN_GROUP=${var:group} /usr/local/bin/notify_slack_munin
contact.slack.always_send  warning critical
contact.slack.text ${if:cfields \u000A* CRITICALs:${loop<,>:cfields  ${var:label} is ${var:value} (outside range [${var:crange}])${if:extinfo : ${var:extinfo}}}.}${if:wfields \u000A* WARNINGs:${loop<,>:wfields  ${var:label} is ${var:value} (outside range [${var:wrange}])${if:extinfo : ${var:extinfo}}}.}${if:ufields \u000A* UNKNOWNs:${loop<,>:ufields  ${var:label} is ${var:value}${if:extinfo : ${var:extinfo}}}.}${if:fofields \u000A* OKs:${loop<,>:fofields  ${var:label} is ${var:value}${if:extinfo : ${var:extinfo}}}.}

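After all this, no notification arrived in Slack...

Looking more closely, there was a message saying that mail was waiting in /var/spool/mail/root.

The mail showed that the script was failing with an error.

/usr/local/bin/notify_slack_munin: line 55: 1: Bad file descriptor

The error occurs on the last line of /usr/local/bin/notify_slack_munin, so I changed that line as follows, which fixed it. I'm not sure why, but it works!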

curl -sX POST --data "payload=${PAYLOAD}" "$SLACK_WEBHOOK_URL"
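
For reference, the payload sent to a Slack incoming webhook is a small JSON document. The sketch below shows one way such a payload could be assembled and sent. It is an illustration under assumptions and not the contents of the gist script: json_escape and build_payload are helper names introduced here, the escaping covers only backslashes and double quotes, and the actual send is left commented out because it needs a real webhook URL.

```shell
# json_escape STRING: escape backslashes and double quotes (newlines are not handled).
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_payload CHANNEL USERNAME TEXT: print a JSON payload for an incoming webhook.
build_payload() {
  printf '{"channel": "%s", "username": "%s", "text": "%s"}' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")"
}

build_payload '#munin' 'munin' 'ubuntu1: memory is "critical"'
echo

# Sending it (assumes SLACK_CHANNEL, SLACK_USERNAME and SLACK_WEBHOOK_URL are set):
# curl -sX POST --data-urlencode \
#   "payload=$(build_payload "$SLACK_CHANNEL" "$SLACK_USERNAME" "test message")" \
#   "$SLACK_WEBHOOK_URL"
```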

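Closing thoughts

Automating the Munin monitoring configuration with Ansible let me configure many servers in a short time!
It was a great feeling when the playbook I wrote configured more than 10 servers almost instantly.

Remaining concerns include:
- It is unclear whether the Ansible setup will be handed over to junior students.
- The playbook code needs improvement.
- Ansible is running on the old CentOS 6.9, so I am uneasy about its behavior and its future.
Ansible was difficult at first, but it turned out to be convenient once I used it, so I would like to use it again when I get the chance. Also, working on an old version of CentOS was very painful.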

高林の雑記ブログ


Automating Munin monitoring of a lab network with Ansible
