Monitoring and Alert Systems

From HallCWiki
Revision as of 21:54, 5 June 2022 by Brads (Talk | contribs) (Alerts)

EPICS Alarm Handler

During production the EPICS alarm handler should always be running.

Munin

MUNIN is an open-source, general-purpose monitoring system used to track a wide variety of systems across Hall C. This service is always running and is independent of the EPICS infrastructure, although it does use EPICS PVs as a data source for some systems.

Run "go_munin" on a Hall C computer to bring up the monitoring graphs, or connect to https://hallcweb.jlab.org/munin/ directly.

Monitored systems presently include:

Deployment

  • The primary MUNIN server runs on cvideo1.jlab.org, and MUNIN clients run on the majority of Linux hosts in the Hall. MUNIN is a 'pluggable' system that can be broadly extended with scripts that deliver data to the software in a standardized format. See the documentation on MUNIN for details.
  • The Hall C Puppet system automatically deploys the MUNIN client on new hosts, but those hosts must be manually added to the server config under 'cvideo1:/etc/munin/conf.d/'.
  • Aspects of the configuration can be modified under /etc/munin/conf.d/ on cvideo1 if you are in the (local) 'munin' Unix group on that host.
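
As an illustration of the 'standardized format' mentioned above, here is a minimal sketch of a MUNIN plugin in Python. It is not one of the Hall C plugins. The graph title, category, and the field name 'load' are hypothetical choices made for this example. The only conventions assumed are the general MUNIN plugin ones: when run with the argument 'config' the plugin prints graph metadata, and when run without arguments it prints one '<field>.value <number>' line per field.

```python
#!/usr/bin/env python3
"""Sketch of a MUNIN plugin; the graph and field names are hypothetical."""
import os
import sys


def config_lines():
    """Graph metadata, printed when the plugin is run with 'config'."""
    return [
        "graph_title Example load average (hypothetical)",
        "graph_vlabel load",
        "graph_category system",
        "load.label 1 minute load average",
    ]


def value_lines(load1):
    """Current reading, one '<field>.value <number>' line per field."""
    return [f"load.value {load1:.2f}"]


def main(argv):
    if len(argv) > 1 and argv[1] == "config":
        lines = config_lines()
    else:
        # os.getloadavg() returns the 1, 5 and 15 minute load averages.
        lines = value_lines(os.getloadavg()[0])
    print("\n".join(lines))
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

A plugin like this is normally made executable and installed on the client host. Where plugins are installed on Hall C hosts is not covered here.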


Alerts

Munin can be configured to send notifications via email and/or text message if a monitored value exceeds its configured threshold. Email notifications can repeat every 5 minutes until the problem is addressed, so appropriate filtering/redirection in your mail client is recommended.
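
To make the threshold idea concrete, the sketch below classifies a reading against warning and critical ranges written in the 'min:max' style that Munin uses for field thresholds, where either bound may be omitted. It only illustrates the logic and is not Munin's own implementation. The function names and the example numbers are assumptions made for this sketch, as is the choice to treat a bare number as a maximum.

```python
def parse_range(spec):
    """Split a 'min:max' spec into (low, high); a missing bound is None.

    A bare number such as '80' is treated here as a maximum (':80').
    """
    low, sep, high = spec.partition(":")
    if not sep:
        low, high = "", low
    return (float(low) if low else None, float(high) if high else None)


def in_range(value, spec):
    """True if value lies inside the acceptable range described by spec."""
    low, high = parse_range(spec)
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


def classify(value, warning=None, critical=None):
    """Return 'critical', 'warning' or 'ok'; critical takes precedence."""
    if critical is not None and not in_range(value, critical):
        return "critical"
    if warning is not None and not in_range(value, warning):
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical thresholds: warn outside 10..80, critical outside 5..95.
    for reading in (50, 85, 99):
        print(reading, classify(reading, warning="10:80", critical="5:95"))
```

In this sketch a notification would be sent whenever the result is anything other than 'ok'.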

Notifications are sent to the HallC_Alarm_Notifications Mailing List. Subscribe if you wish to see them.

NOTE: This can be a very high-volume list when things go south in the Hall. It is strongly recommended to configure your mail reader/system to filter messages from that list into a dedicated folder.