Monitoring and Alert Systems

EPICS Alarm Handler

During production the EPICS alarm handler should always be running.

Munin

MUNIN is an open-source, general-purpose monitoring system used to track a wide variety of systems across Hall C. The service is always running and is independent of the EPICS infrastructure, apart from using EPICS PVs as a data source for some systems.

Run "go_munin" on a Hall C computer to bring up the monitoring graphs, or connect to https://hallcweb.jlab.org/munin/ directly.

Monitored systems presently include:

Deployment

  • The primary MUNIN server runs on cvideo1.jlab.org, and MUNIN clients run on the majority of Linux hosts in the Hall. MUNIN is a 'pluggable' system that can be extended with scripts that deliver data to the software in a standardized format. See the MUNIN documentation for details.
  • The Hall C Puppet system automatically deploys the MUNIN client on new hosts, but those hosts must be added manually to the server config under 'cvideo1:/etc/munin/conf.d/'.
  • Aspects of the configuration can be modified under /etc/munin/conf.d/ if you are in the (local) 'munin' unix group on cvideo1.
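
As a rough sketch of the manual step above: a new host entry on the server usually amounts to a short stanza in a file under conf.d/. The helper below only prints such a stanza so it can be reviewed before it is saved. The function name munin_host_stanza and the host newhost.jlab.org are made up for illustration. The address and use_node_name directives are standard Munin server settings, not something confirmed for the Hall C setup, so check an existing file under conf.d/ for the local conventions.

```shell
# Print a minimal Munin host stanza for the given hostname.
# Nothing is written to disk; redirect the output once it looks right.
munin_host_stanza() {
  printf '[%s]\n' "$1"
  printf '    address %s\n' "$1"
  printf '    use_node_name yes\n'
}

# Hypothetical host name, for illustration only.
munin_host_stanza newhost.jlab.org
```

The output can then be saved (with sudo, or as a member of the 'munin' group) into a file under cvideo1:/etc/munin/conf.d/.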


Alerts

Munin can be configured to send notifications via email and/or text message when a monitored value exceeds its threshold. Email notifications can repeat every 5 minutes until the problem is addressed, so appropriate filtering/redirection in your mail client is recommended.

NOTE: This system is completely independent of the EPICS alarm handler(s) that make noise in the Counting House.

Notifications are sent to the HallC_Alarm_Notifications Mailing List. Subscribe if you wish to see them.

NOTE: This can be a very high-volume list when things go south in the Hall.

      It is strongly recommended to configure your mail reader/system to filter 
      messages from that list into a dedicated folder.

Notification Management

Just a few rough notes on how to enable/disable notifications. (The Munin documentation has the full story, of course.)

Unless otherwise noted, all of the configuration scripts are located at cvideo1:/etc/munin/. Revisions are managed using git. The git log is a good 'how-to' and 'examples' resource.
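
To use the git log as a how-to, something along these lines may help. This is a sketch that assumes the git repository root is /etc/munin itself (the notes above only say the files live there and are tracked with git). The function names and the MUNIN_CONF_DIR variable are hypothetical conveniences, not part of Munin.

```shell
# Show the most recent commits along with the files each one touched.
# Optional argument: number of commits to show (default 10).
munin_history() {
  git -C "${MUNIN_CONF_DIR:-/etc/munin}" log --stat -n "${1:-10}"
}

# Show the full diffs for a single file, given relative to the repository
# root, e.g. conf.d/cvideo1.conf.
munin_file_history() {
  git -C "${MUNIN_CONF_DIR:-/etc/munin}" log -p -- "$1"
}
```

For example, munin_file_history conf.d/cvideo1.conf lists every recorded change to that file, which is often the quickest way to find an earlier threshold or contacts edit to copy from.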

Many of the conf files are editable by the cvideo1 user, but it is generally simplest to use sudo both to make edits and to run the systemctl commands noted below.

  • Default trip thresholds are baked into the monitoring scripts under /etc/munin/plugins/ on the server (cvideo1) and the respective nodes.
    • Thresholds can (and should) be overridden using the files under cvideo1:/etc/munin/conf.d/. You will need to work out which node (host) is providing the underlying data and modify the corresponding file.
    • Run systemctl restart munin-node to ensure changes are picked up on the next run (systems are polled every 5 minutes).
  • Enable/disable email notifications from a node (host)
    • The default notification behavior is set by the contacts directive in cvideo1:/etc/munin/munin.conf. Change that line to contacts no to disable (default) notifications.
    • The contacts line can be (and is) overridden on a host-by-host basis in the respective cvideo1:/etc/munin/conf.d/* files.
    • Note: most of the EPICS-related monitoring is handled on the cvideo1 host and is configured in /etc/munin/conf.d/cvideo1.conf.
    • Run systemctl restart munin to ensure changes are picked up on the next run (systems are polled every 5 minutes).
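
A threshold override in a conf.d/ file generally takes the form plugin.field.warning and plugin.field.critical inside the stanza of the host that provides the data. The sketch below prints such lines for review. The function name munin_threshold_lines is hypothetical, and the plugin (df), field (_dev_sda1) and limits are placeholder values, not taken from the Hall C configuration.

```shell
# Print warning/critical override lines for one field of one plugin.
# Arguments: plugin, field, warning limit, critical limit.
# The lines belong inside the existing [host] stanza for the node that
# provides the data; they are printed here for review, not written anywhere.
munin_threshold_lines() {
  printf '    %s.%s.warning %s\n' "$1" "$2" "$3"
  printf '    %s.%s.critical %s\n' "$1" "$2" "$4"
}

# Illustrative values only: warn at 90, go critical at 95.
munin_threshold_lines df _dev_sda1 90 95

# After editing the real files, restart as described in the notes above:
#   sudo systemctl restart munin-node
#   sudo systemctl restart munin
```

A contacts no line placed in the same host stanza silences notifications for that host only, per the notes above.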
NOTE: Logging changes with git is strongly encouraged.
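
One way to follow that note is a small wrapper that stages only the files you touched and commits them with a message. This is a sketch that assumes the repository root is /etc/munin. The function name munin_commit, the MUNIN_CONF_DIR variable and the example message are hypothetical, and on cvideo1 the git commands will likely need sudo.

```shell
# Stage the named files (paths relative to the repository root) and commit
# them with the given message.
# Usage: munin_commit "message" file [file ...]
munin_commit() {
  msg="$1"
  shift
  git -C "${MUNIN_CONF_DIR:-/etc/munin}" add -- "$@" &&
  git -C "${MUNIN_CONF_DIR:-/etc/munin}" commit -m "$msg"
}

# Example (message text is illustrative):
#   munin_commit "cvideo1: disable contacts during maintenance" conf.d/cvideo1.conf
```

Naming the files explicitly keeps unrelated edits that are still in progress out of the commit.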