Difference between revisions of "Monitoring and Alert Systems"
Latest revision as of 11:05, 17 June 2022
Monitoring and Alert Systems
EPICS Alarm Handler
During production the EPICS alarm handler should always be running.
- See Hall C Alarm Handler and Hall C EPICS
Munin
MUNIN is an open source, general-purpose monitoring system used to track a wide variety of systems across Hall C. This service is always running and is unrelated to the EPICS infrastructure (other than using EPICS PVs as a data source for some systems).
Run "go_munin" on a Hall C computer to bring up the monitoring graphs, or connect to https://hallcweb.jlab.org/munin/ directly.
Monitored systems presently include:
- The majority of Hall C Linux hosts
- Gas system flows, temperatures, and pressures (Gas Shed, Hall A GEM gas)
- HVAC status in the G0 cage (where HV crates and other critical systems reside)
- DAQ crate power and temperature information
Deployment
- The primary MUNIN server runs on cvideo1.jlab.org, but MUNIN clients run on the majority of Linux hosts in the Hall. MUNIN is a 'pluggable' system that can be broadly extended with scripts that deliver data to the software in a standardized format. See the documentation on MUNIN for details.
- The Hall C Puppet system automatically deploys the munin client on new hosts, but those hosts must be manually added to the server config under 'cvideo1:/etc/munin/conf.d/'
- Aspects of the configuration can be modified under /etc/munin/conf.d/ if you are in the (local) 'munin' unix group on cvideo1.
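As a rough illustration of the 'pluggable' design described above, here is a minimal sketch of a Munin plugin written as a shell script. The plugin function name, graph labels, and threshold values are hypothetical and are not taken from the Hall C configuration. Only the general convention follows the standard Munin plugin interface: print graph metadata when called with the argument config, and print field.value lines otherwise.

```shell
#!/bin/sh
# Sketch only: names, labels, and thresholds are illustrative assumptions.

# Return the 1-minute load average, or "U" (Munin's marker for an
# unknown value) if /proc/loadavg is not readable on this host.
read_value() {
    if [ -r /proc/loadavg ]; then
        cut -d' ' -f1 /proc/loadavg
    else
        echo "U"
    fi
}

example_plugin() {
    case "${1:-}" in
        config)
            # Graph metadata plus the default trip thresholds
            echo "graph_title Example load (sketch)"
            echo "graph_vlabel load"
            echo "graph_category system"
            echo "load.label 1-minute load"
            echo "load.warning 10"
            echo "load.critical 20"
            ;;
        *)
            # Normal poll: one value line per field
            echo "load.value $(read_value)"
            ;;
    esac
}

example_plugin "$@"
```

The warning and critical lines in the config output are what "default thresholds baked into the plugin" looks like in practice; they can be overridden on the server side without editing the plugin itself.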
Alerts
Munin can be configured to send notifications via email and/or text message if a monitored value exceeds its threshold. Email notifications can occur every 5 minutes until the problem is addressed, so appropriate filtering/redirection in your mail client is recommended.
NOTE: This system is completely independent of the EPICS alarm handler(s) that make noise in the Counting House.
Notifications are sent to the HallC_Alarm_Notifications Mailing List. Subscribe if you wish to see them.
NOTE: This can be a very high-volume list when things go south in the Hall. It is strongly recommended to configure your mail reader/system to filter messages from that list into a dedicated folder.
Notification Management
A few rough notes on how to enable/disable notifications. (The Munin documentation has the full story, of course.)
Unless otherwise noted, all of the configuration scripts are located at cvideo1:/etc/munin/. Revisions are managed using git. The git log is a good 'how-to' and 'examples' resource.
Many of the conf files are editable by the cvideo1 user, but it is generally simplest/best to use sudo to make edits (and run the systemctl commands noted).
- Default trip thresholds are baked into the monitoring scripts under /etc/munin/plugins/ on the server (cvideo1) and the respective nodes.
  - Thresholds can (and should) be overridden using the files under cvideo1:/etc/munin/conf.d/. You will need to work out which node (host) is providing the underlying data and modify the corresponding file.
  - Run systemctl restart munin-node to ensure changes are picked up on the next run (systems are polled every 5 minutes).
- Enable/disable email notifications from a node (host)
  - The default notification behavior is set by the contacts directive in cvideo1:/etc/munin/munin.conf. Change that line to contacts no to disable (default) notifications.
  - The contacts line can be (and is) overridden on a host-by-host basis in the respective cvideo1:/etc/munin/conf.d/* files.
  - Note: most of the EPICS-related monitoring is handled on the cvideo1 host (and is configured under /etc/munin/conf.d/cvideo1.conf).
  - Run systemctl restart munin to ensure changes are picked up on the next run (systems are polled every 5 minutes).
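Before changing a threshold or a contacts line, it can help to see where overrides are already set. The following is a minimal sketch of a helper that lists them. The function name show_overrides, the host examplehost.jlab.org, the contact name email, and the threshold value are all assumptions made for illustration; the scratch config only mimics the usual Munin layout (a host section with plugin.field.warning and contacts lines) and is not a copy of anything on cvideo1.

```shell
#!/bin/sh
# Sketch only: the host name, contact name, and threshold below are made up.

# Print every contacts directive and warning/critical override found in a
# Munin config tree (munin.conf plus conf.d/), with file name and line number.
show_overrides() {
    conf_dir="${1:-/etc/munin}"
    grep -rnE '^[[:space:]]*(contacts[[:space:]]|[^#[:space:]]+\.(warning|critical)[[:space:]])' \
        "$conf_dir/munin.conf" "$conf_dir/conf.d" 2>/dev/null
}

# Build a scratch tree so the sketch runs anywhere; on cvideo1 you would
# call show_overrides with no argument to inspect /etc/munin instead.
demo=$(mktemp -d)
mkdir -p "$demo/conf.d"
echo "contacts no" > "$demo/munin.conf"
cat > "$demo/conf.d/examplehost.conf" <<'EOF'
[examplehost.jlab.org]
    address examplehost.jlab.org
    contacts email
    load.load.warning 10
EOF

show_overrides "$demo"
rm -rf "$demo"
```

Commented-out lines are skipped, so the output reflects only the directives that are currently in effect in the files.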
NOTE: Logging changes with git is strongly encouraged.
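A minimal sketch of what logging a change and consulting the history might look like, using plain git commands. The helper names log_munin_change and show_munin_history are hypothetical (they are not existing scripts on cvideo1), and the default directory simply reuses the /etc/munin path mentioned above.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not existing Hall C scripts.

# Stage and commit all pending changes in a git-managed config directory.
# Usage: log_munin_change "commit message" [directory]
log_munin_change() {
    msg="$1"
    conf_dir="${2:-/etc/munin}"
    git -C "$conf_dir" add -A &&
        git -C "$conf_dir" commit -q -m "$msg"
}

# Show recent history (with diffs) for the per-host files, newest first.
# Usage: show_munin_history [directory] [count]
show_munin_history() {
    conf_dir="${1:-/etc/munin}"
    git -C "$conf_dir" log -p -n "${2:-5}" -- conf.d
}
```

On cvideo1 the underlying git commands would typically need to be run with sudo, in line with the editing advice above. A short, specific commit message (which host, which threshold, why) keeps the log useful as a how-to resource.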