Monitoring a MongoDB Sharded Cluster with Nagios: A Hands-On Guide

Reposted from: http://blog.itpub.net/26230597/viewspace-1293589/

1, Downloading the monitoring plugin

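The MongoDB plugin can be fetched with: git clone git://github.com/mzupan/nagios-plugin-mongodb.git (GitHub has since stopped serving the unauthenticated git:// protocol; the same repository is reachable at https://github.com/mzupan/nagios-plugin-mongodb.git). I did not have git installed at first, so a fellow netizen (草根) downloaded the plugin for me, and I then uploaded it to the CSDN resources page. The alternative download address is: http://download.csdn.net/detail/mchdba/8019077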

2, Adding the new mongodb monitoring commands

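The mongodb service shares a physical machine with a mysql slave, and basic Nagios monitoring plus mysql service monitoring were already set up on it, so only the mongodb commands and services need to be added on top of the existing configuration. For Nagios monitoring of mysql, see http://blog.itpub.net/26230597/viewspace-760141/ and http://blog.itpub.net/26230597/viewspace-1217246/. The mongodb monitoring commands to add are as follows: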

[root@wgq objects]# cd /usr/local/nagios/etc/objects
[root@wgq objects]# vim commands.cfg
# Generic check: -A action, -P port, -W warning threshold, -C critical threshold
define command {
    command_name check_mongodb
    command_line $USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ -A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$
}
# Per-database check: adds -d database
define command {
    command_name check_mongodb_database
    command_line $USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ -A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$ -d $ARG5$
}
# Per-collection check: adds -d database and -c collection
define command {
    command_name check_mongodb_collection
    command_line $USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ -A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$ -d $ARG5$ -c $ARG6$
}
# Replica set check: adds -r replica set name
define command {
    command_name check_mongodb_replicaset
    command_line $USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ -A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$ -r $ARG5$
}
# Query-rate check: adds -q query type
define command {
    command_name check_mongodb_query
    command_line $USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ -A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$ -q $ARG5$
}
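
To make the link between a service's check_command and these command definitions concrete, the following Python sketch mimics how a check_command is split on "!" and the pieces are substituted into $ARG1$..$ARGn$. It is only an illustration: Nagios performs this substitution internally, the function name expand_check_command is made up, and the values given for $USER1$ and $HOSTADDRESS$ are assumptions.

```python
import re

# Command templates copied from the commands.cfg definitions above (two shown).
COMMANDS = {
    "check_mongodb": (
        "$USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ "
        "-A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$"
    ),
    "check_mongodb_database": (
        "$USER1$/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ "
        "-A $ARG1$ -P $ARG2$ -W $ARG3$ -C $ARG4$ -d $ARG5$"
    ),
}


def expand_check_command(check_command, macros):
    """Illustrative helper (not part of Nagios): expand a check_command string."""
    name, *args = check_command.split("!")
    values = dict(macros)
    # The n-th "!"-separated argument becomes $ARGn$.
    for index, arg in enumerate(args, start=1):
        values["ARG%d" % index] = arg
    # Replace every $NAME$ macro; in this sketch unknown macros become empty.
    return re.sub(r"\$(\w+)\$", lambda m: values.get(m.group(1), ""), COMMANDS[name])


# Hypothetical macro values for the example host.
macros = {"USER1": "/usr/local/nagios/libexec", "HOSTADDRESS": "192.168.12.62"}
print(expand_check_command("check_mongodb!connect!30000!2!5", macros))
```

Read this way, check_mongodb!connect!30000!2!5 means action connect, port 30000, warning threshold 2 and critical threshold 5.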

3, Adding the mongodb monitoring services

The mongodb services also have to be added separately, as follows:

# Check the connection time to the mongodb service: warning above 2 seconds, critical above 5 seconds
define service{
        host_name dbm1slave1
        service_description Mongo Connect Check
        check_command check_mongodb!connect!30000!2!5
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check mongodb connection usage: warning at 70, critical at 80 (percentage of the connection pool in use)
define service{
        host_name dbm1slave1
        service_description Mongo Free Connections
        check_command check_mongodb!connections!27017!70!80
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check mongodb replication lag, to make sure the secondary stays in step with the primary: warning above 15 seconds, critical above 30 seconds
define service{
        host_name dbm1slave1
        service_description Mongo Replication Lag
        check_command check_mongodb!replication_lag!27017!15!30
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check mongodb memory usage; the thresholds depend on the total memory of the machine running mongodb
define service{
        host_name dbm1slave1
        service_description Mongo Memory Usage
        check_command check_mongodb!memory!27017!20!28
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check mongodb mapped memory usage; the thresholds depend on the total memory of the machine running mongodb
define service{
        host_name dbm1slave1
        service_description Mongo Mapped Memory Usage
        check_command check_mongodb!memory_mapped!27017!20!28
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check the lock time percentage: warning when lock time reaches 5% of mongo's running time, critical above 10%
define service{
        host_name dbm1slave1
        service_description Mongo Lock Percentage
        check_command check_mongodb!lock!27017!5!10
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check Average Flush Time: the average flush time of the mongo server
define service{
        host_name dbm1slave1
        service_description Mongo Flush Average
        check_command check_mongodb!flushing!27017!100!200
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check Last Flush Time: warning above 200 ms, critical above 400 ms
define service{
        host_name dbm1slave1
        service_description Mongo Last Flush Time
        check_command check_mongodb!last_flush_time!27017!200!400
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check status of mongodb replicaset: the state of mongo replication
define service{
        host_name dbm1slave1
        service_description MongoDB state
        check_command check_mongodb!replset_state!27017!0!0
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check the index miss ratio
define service{
        host_name dbm1slave1
        service_description MongoDB Index Miss Ratio
        check_command check_mongodb!index_miss_ratio!27017!.005!.01
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check number of databases and number of collections
define service{
        host_name dbm1slave1
        service_description MongoDB Number of databases
        check_command check_mongodb!databases!27017!300!500
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
define service{
        host_name dbm1slave1
        service_description MongoDB Number of collections
        check_command check_mongodb!collections!27017!300!500
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check size of a database
define service{
        host_name dbm1slave1
        service_description MongoDB Database size your-database
        check_command check_mongodb_database!database_size!27017!300!500!your-database
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check index size of a database
define service{
        host_name dbm1slave1
        service_description MongoDB Database index size your-database
        check_command check_mongodb_database!database_indexes!27017!50!100!your-database
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check index size of a collection
define service{
        host_name dbm1slave1
        service_description MongoDB Collection index size your-collection
        check_command check_mongodb_collection!collection_indexes!27017!50!100!your-database!your-collection
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check the primary server of replicaset
define service{
        host_name dbm1slave1
        service_description MongoDB Replicaset Master Monitor: your-replicaset
        check_command check_mongodb_replicaset!replica_primary!27017!0!1!your-replicaset
        # Example: check_command check_mongodb_replicaset!replica_primary!27017!0!1!shard2
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check the number of queries per second
define service{
        host_name dbm1slave1
        service_description MongoDB Updates per Second
        check_command check_mongodb_query!queries_per_second!27017!200!150!update
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check Primary Connection: the connection time to the primary of the replica set; warning above 2 seconds, critical above 4 seconds
define service{
        host_name dbm1slave1
        service_description Mongo Primary Connect Check
        check_command check_mongodb!connect_primary!27017!2!4
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
# Check Collection State: runs against every host in the mongo server group and verifies the availability of important collections (locking, timeouts, config server availability); a single failed query raises an alert.
define service{
        host_name dbm1slave1
        service_description Mongo Collection State
        check_command check_mongodb!collection_state!27017!your-database!your-collection
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups ops
        }
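
Most of the services above hand a warning and a critical threshold to the plugin. As a rough sketch of what those two numbers mean, the function below maps a measured value to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The name classify and the "greater than or equal" comparison are assumptions for illustration; the real comparison logic lives inside check_mongodb.py and may differ from check to check.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(value, warning, critical):
    """Illustrative helper: map a measurement to a Nagios state.

    Assumes "higher is worse" and that reaching a threshold triggers it.
    """
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


# Thresholds taken from the "Mongo Connect Check" service (2 and 5 seconds).
for seconds in (0.3, 2.5, 6.0):
    print(seconds, classify(seconds, warning=2, critical=5))
```

Any exit code outside 0 to 3 is reported by Nagios as "out of bounds", which is what Error 1 in section 6 shows.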

4, Viewing some of the monitoring results

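After configuring the services on the Nagios side, restart Nagios with service nagios restart. Within a few minutes the complete set of Mongo services appears in the Nagios monitoring interface.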

 


5, Determining the mongodb architecture from ps

[root@db-m1-slave-1 ~]# ps -eaf|grep mongo
mongodb   2457     1  0  2013 ?        2-03:39:08 ./mongod --configsvr --dbpath /home/data/mongodb/config --port 20000 --logpath /home/data/mongodb/config.log --logappend --fork
mongodb   2804     1  0  2013 ?        1-10:02:33 mongos --configdb 192.168.12.62:20000,192.168.12.63:20000,192.168.12.72:20000 --port 30000 --chunkSize 64 --logpath /home/data/mongodb/mongos.log --logappend --fork
mongodb   3072     1  0  2013 ?        1-10:17:20 mongod --shardsvr --replSet shard1 --port 27017 --dbpath /home/data/mongodb/shard11 --oplogSize 2048 --logpath /home/data/mongodb/shard11.log --logappend --fork
root     11179  9391  0 11:14 pts/1    00:00:00 grep mongo
mongodb  30414     1  0 Feb14 ?        1-06:20:50 mongod --shardsvr --replSet shard2 --port 27018 --dbpath /home/data/mongodb/shard21 --oplogSize 2048 --logpath /home/data/mongodb/shard21.log --logappend --fork
[root@db-m1-slave-1 ~]#

 

The output shows four mongo processes:

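a) The process whose startup parameters include "--configdb" is the entry point of the cluster;
b) Shard Server: a process started with "--shardsvr --replSet" is a member of one shard's replica set and stores the actual data chunks; these are the mongodb instances on ports 27017 and 27018. To find out which member of the port 27017 replica set is the primary and which is the secondary, log in on port 27017 and run rs.status().
c) Config Server: the process started with "--configsvr" stores the metadata of the whole cluster, including the chunk information; this is the mongodb instance on port 20000.
d) Route Server: the process started with "mongos --configdb" is the front-end router. Clients connect through it, and it makes the whole cluster look like a single database that front-end applications can use transparently; this is the mongodb instance on port 30000.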
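
The rules in the list above can be sketched as a small classifier over a process's command line. The function name classify_mongo_process and the role labels are illustrative assumptions; the flags are the ones visible in the ps output, written with double hyphens.

```python
import shlex


def classify_mongo_process(command_line):
    """Illustrative helper: guess a process's cluster role from its flags."""
    args = shlex.split(command_line)
    program = args[0].rsplit("/", 1)[-1]  # strip a leading "./" or directory
    if program == "mongos":
        return "route server"
    if program == "mongod" and "--configsvr" in args:
        return "config server"
    if program == "mongod" and "--shardsvr" in args:
        shard = "?"
        # The value following --replSet names the shard's replica set.
        if "--replSet" in args[:-1]:
            shard = args[args.index("--replSet") + 1]
        return "shard server (%s)" % shard
    return "unknown"


# Command lines abbreviated from the ps output above.
samples = [
    "./mongod --configsvr --dbpath /home/data/mongodb/config --port 20000",
    "mongos --configdb 192.168.12.62:20000 --port 30000 --chunkSize 64",
    "mongod --shardsvr --replSet shard1 --port 27017",
    "mongod --shardsvr --replSet shard2 --port 27018",
]
for line in samples:
    print(classify_mongo_process(line))
```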


6, Errors encountered during debugging

Error 1

[root@wgq nagios ~]# tail -f /usr/local/nagios/var/nagios.log
[1412819956] Warning: Return code of 13 for check of service 'Mongo Memory Usage' on host 'dbm1slave1' was out of bounds.
[1412819956] SERVICE ALERT: dbm1slave1;Mongo Memory Usage;CRITICAL;SOFT;1;(Return code of 13 is out of bounds)
[1412819975] Warning: Return code of 13 for check of service 'Mongodb Connect Check' on host 'dbm1slave1' was out of bounds.
[1412819975] SERVICE ALERT: dbm1slave1;Mongodb Connect Check;CRITICAL;SOFT;1;(Return code of 13 is out of bounds)
[1412820058] Warning: Return code of 13 for check of service 'Mongo Free Connections' on host 'dbm1slave1' was out of bounds.

 

The nagios user needs to own the plugin file and have execute permission on it:

chmod 770 /usr/lib/nagios/plugins/check_mongodb.py

chown nagios:nagios /usr/lib/nagios/plugins/check_mongodb.py
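
Before or after changing permissions, it can help to confirm whether the plugin file is actually executable. The sketch below reports the file's mode and whether the user running the script may execute it, so it is most meaningful when run as the nagios user (for example via sudo -u nagios). The helper name describe_plugin is an assumption; the path is the one used in the chmod command above. As a side note, 13 is also the numeric value of EACCES (permission denied) on Linux, which is at least consistent with this fix.

```python
import os
import stat


def describe_plugin(path):
    """Illustrative helper: report mode bits and executability for `path`."""
    if not os.path.exists(path):
        return "%s: missing" % path
    mode = stat.filemode(os.stat(path).st_mode)  # e.g. "-rwxrwx---"
    runnable = os.access(path, os.X_OK)  # checked for the user running this script
    return "%s: mode %s, executable by current user: %s" % (path, mode, runnable)


print(describe_plugin("/usr/lib/nagios/plugins/check_mongodb.py"))
```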

 

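Error 2

The Status Information column of the monitoring interface shows the error message "No module named pymongo":

This message means that the pymongo module is not installed. Running easy_install pymongo installs it, as shown below: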

[root@wgq objects]# easy_install pymongo
Searching for pymongo
Reading http://pypi.python.org/simple/pymongo/
Best match: pymongo 2.7.2
……
zip_safe flag not set; analyzing archive contents…
Adding pymongo 2.7.2 to easy-install.pth file

Installed /usr/lib/python2.6/site-packages/pymongo-2.7.2-py2.6-linux-x86_64.egg
Processing dependencies for pymongo
Finished processing dependencies for pymongo
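
A quick way to confirm the installation is to ask the interpreter that runs the plugin whether it can locate the module. This is a minimal sketch assuming a Python 3 interpreter (importlib.util.find_spec is not available on the Python 2.6 shown in the output above); the helper name module_available is illustrative.

```python
import importlib.util


def module_available(name):
    """Illustrative helper: True if `name` can be imported by this interpreter."""
    return importlib.util.find_spec(name) is not None


if module_available("pymongo"):
    import pymongo
    print("pymongo", pymongo.version)
else:
    print("pymongo is not installed for this interpreter")
```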

 

Reference: https://github.com/mzupan/nagios-plugin-mongodb/blob/master/README.md
