Latest revision as of 11:05, 17 June 2022
Monitoring and Alert Systems
EPICS Alarm Handler
During production the EPICS alarm handler should always be running.
- See Hall C Alarm Handler and Hall C EPICS
Munin
MUNIN is an open source general purpose monitoring system used to track and monitor a large variety of systems across Hall C. This service is always running and is unrelated to the EPICS infrastructure (other than using EPICS PVs as a data source for some systems.)
Run "go_munin" on a Hall C computer to bring up the monitoring graphs, or connect to https://hallcweb.jlab.org/munin/ directly.
Monitored systems presently include:
- The majority of Hall C linux hosts
- Gas system flows, temperatures, and pressures (Gas Shed, Hall A GEM gas)
- HVAC status in G0 cage (where HV crates and other critical systems reside)
- DAQ crate power and temperature information
Deployment
- The primary MUNIN server runs on cvideo1.jlab.org, but MUNIN clients run on the majority of linux hosts in the Hall. MUNIN is a 'pluggable' system that can be broadly extended with scripts that deliver data to the software in a standardized format. See the documentation on MUNIN for details.
- The Hall C Puppet system automatically deploys the munin client on new hosts, but those hosts must be manually added to the server config under 'cvideo1:/etc/munin/conf.d/'.
- Aspects of the configuration can be modified under /etc/munin/conf.d/ if you are in the (local) 'munin' unix group on cvideo1.
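As a rough illustration of the manual server-side step, an entry for a new host might look like the sketch below. The host name hcexample1.jlab.org and the file name are hypothetical placeholders. The address and use_node_name directives are standard Munin server options, but the existing files under conf.d/ remain the best guide to local conventions.

```
# Hypothetical file: cvideo1:/etc/munin/conf.d/hcexample1.conf
# (the host name is a placeholder, not a real Hall C node)
[hcexample1.jlab.org]
    address hcexample1.jlab.org
    use_node_name yes
```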
Alerts
Munin can be configured to send notifications via email and/or text message if a monitored value exceeds its threshold. Email notifications can recur every 5 minutes until the problem is addressed, so appropriate filtering/redirection in your mail client is recommended.
NOTE: This system is completely independent of the EPICS alarm handler(s) that make noise in the Counting House.
Notifications are sent to the HallC_Alarm_Notifications Mailing List. Subscribe if you wish to see them.
NOTE: This can be a very high-volume list when things go south in the Hall. It is strongly recommended to configure your mail reader/system to filter messages from that list into a dedicated folder.
Notification Management
Just a few rough notes on how to enable/disable notifications. (The Munin documentation has the full story, of course.)
Unless otherwise noted, all of the configuration scripts are located at cvideo1:/etc/munin/. Revisions are managed using git. The git log is a good 'how-to' and 'examples' resource.
Many of the conf files are editable by the cvideo1 user, but it is generally simplest/best to use sudo to make edits (and run the systemctl commands noted).
- Default trip thresholds are baked into the monitoring scripts under /etc/munin/plugins/ on the server (cvideo1) and the respective nodes.
  - Thresholds can (and should) be overridden using the files under cvideo1:/etc/munin/conf.d/. You will need to work out which node (host) is providing the underlying data and modify the corresponding file.
  - Run systemctl restart munin-node to ensure changes are picked up on the next run (systems are polled every 5 minutes).
- Enable/disable email notifications from a node (host)
  - The default notification behavior is defined by the contacts directive in cvideo1:/etc/munin/munin.conf. Change that line to contacts no to disable (default) notifications.
  - The contacts line can be (and is) overridden on a host-by-host basis in the respective cvideo1:/etc/munin/conf.d/* files.
  - Note: most of the EPICS related monitoring is handled on the cvideo1 host (and is configured under /etc/munin/conf.d/cvideo1.conf).
  - Run systemctl restart munin to ensure changes are picked up on the next run (systems are polled every 5 minutes).
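For orientation, a per-host override combining a threshold change and a notification change could look roughly like the fragment below. This is a sketch, not a copy of any file on cvideo1. The host name and the gas_flow plugin and field names are hypothetical. The plugin.field.warning / plugin.field.critical form, the min:max range syntax, and contacts no are standard Munin server directives.

```
# Hypothetical excerpt from a host file under cvideo1:/etc/munin/conf.d/
[hcexample1.jlab.org]
    address hcexample1.jlab.org
    use_node_name yes

# Override the thresholds baked into the 'load' plugin
    load.load.warning 10
    load.load.critical 20

# Range form min:max (either side may be left empty);
# 'gas_flow' and 'flow1' are made-up plugin and field names
    gas_flow.flow1.warning 5:15

# Suppress notifications for this host only
    contacts no
```

Deleting the contacts no line lets the host fall back to the default contacts setting in munin.conf.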
NOTE: Logging changes with git is strongly encouraged.