Difference between revisions of "Hall C Compute Cluster"

Revision as of 16:48, 5 May 2022

Hall C Compute Cluster

Systems and (nominal) Functions

The Hall C compute cluster is composed of roughly 4 'classes' of machines. Hosts within these classes are intended to be largely interchangeable, allowing for easier upgrades and failover.

CODA / DAQ nodes (rackmount)

  • cdaql5, cdaql6
  • These are modestly provisioned rackmount servers dedicated to running CODA. They each have ~5.5 TB of local disk intended to be used as a local buffer to the large NFS fileserver nodes if needed. That buffer has never been needed; to date we have been fine just pushing data over an NFS mount to cdaql1/2/3.

Compute / Fileserver nodes (rackmount)

  • cdaql1, cdaql2, cdaql3
  • These are generally pretty beefy machines with a lot of CPU and disk. They are intended for data storage and online replays (when feasible).
Right now, cdaql1:/data1 is the primary (NFS) destination volume for CODA data from cdaql5 and cdaql6.
This is fine under present DAQ loads, but this will need to change for NPS/LAD and other data-heavy experiments.

'hcdesk' / User nodes (desktop)

  • hcdesk1, 2, 3, ... : 'User' consoles in the Hall C counting house used by shift crew
  • shmshut, hmshut : 'User' consoles in the respective spectrometer huts

These are relatively low-powered computers that primarily serve as consoles to hang monitors and a keyboard off of. All the real work is done on other hosts.

Miscellaneous

  • cvideo1 -- rackmount machine that runs the munin service and the motion software that handles the cameras. See also Video capture systems
  • cvideo2 -- desktop host that handles the 2 left-most large wall display screens
  • cvideo3 -- desktop host that handles the 4 newest display screens
  • cmagnets (VM) -- A Win10 VM hosted on cdaqfs1 that handles the 'go_magnets' spectrometer magnet controls
  • skylla10 -- a rackmount Win10 host that hosts the Rockwell HMI software used to interact with the SHMS/HMS spectrometer PLCs
  • cdaqbackup1 -- a rackmount host used to provide backups of (linux) Hall C systems. See #System Backups below.
  • cdaqfs1 -- a rackmount host that is the primary file server for the cluster
  • CNAMES (DNS 'aliases' allowing systems to be pointed at a new physical host with a single DNS change)
    • hcpxeboot -> cdaqfs1
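
Which physical host an alias currently points at can be checked from a cluster machine. The following is a minimal Python sketch, assuming the standard library's socket.gethostbyname_ex is an acceptable way to follow the alias. The function name canonical_host and its injectable resolver parameter are illustrative only and are not part of any Hall C tooling.

```python
import socket


def canonical_host(alias, resolver=socket.gethostbyname_ex):
    """Return (canonical_name, addresses) for a DNS alias.

    `resolver` defaults to socket.gethostbyname_ex, which returns a
    (hostname, aliaslist, ipaddrlist) triple. It is injectable so the
    function can be exercised without touching the network.
    """
    hostname, _aliases, addresses = resolver(alias)
    return hostname, list(addresses)
```

Calling canonical_host("hcpxeboot") on the Hall C network should report the host behind the alias; off that network the lookup raises an OSError subclass.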

cdaqfs1

The primary dedicated file server for the Hall C cluster plays a few roles.

  • NFS mounts are exported to the cluster and mounted on the clients using autofs under the /net/{cdaq,cdaqfs1}/ paths.
    • home/, home/coda/
    • Cluster-local copies of /site, /apps (synced manually when needed) (see cdaqfs1:/local/hallc/RHEL7-x86_64/README for notes/quirks)
    • opt/ contains cluster-local copies of ROOT, some Singularity containers and modules used with prior experiments
  • Hosts the files needed for PXE booting the linux ROCs.
  • Runs the 'cmagnets' Win10 virtual machine under VirtualBox.
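
The /net/{cdaq,cdaqfs1}/ shorthand above expands to one client-side path per root and export. A small Python sketch of that expansion follows; the exact autofs map layout is an assumption, and the names AUTOFS_ROOTS, EXPORTS, and client_paths are hypothetical.

```python
from pathlib import PurePosixPath

# Mount roots and exported subtrees as listed above; the exact autofs
# map layout on the clients is an assumption.
AUTOFS_ROOTS = ("/net/cdaq", "/net/cdaqfs1")
EXPORTS = ("home", "home/coda", "opt")


def client_paths(roots=AUTOFS_ROOTS, exports=EXPORTS):
    """Return every root/export combination as a POSIX path string."""
    return [str(PurePosixPath(root) / export)
            for root in roots
            for export in exports]
```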

Puppet Configuration Management

  • The Hall C cluster machines are configured and maintained using the open-source [Puppet] system.
    • Main repo is hosted at: git@hallcgit.jlab.org:brads/hallc-puppet.git
    • Updates/Upgrades are handled manually to minimize any surprises during Production
      • Brad uses 'cssh' to periodically run global updates and/or push out configuration changes -- bug him for support.

System Backups

  • cdaqbackup1 is an older rackmount host repurposed to provide backups of some important systems
    • All cdaqfs1 NFS exports are backed up nightly (rsync images; no snapshotting)
    • cdaqfs1:home/ is backed up nightly with snapshots
      • The backup software is [Borg Backup]
      • This is handled by the script: cdaqbackup1:/data1/cdaqfs-backup/BACKUP-borg/borg-backup-cdaqfs-home.sh running on cdaqbackup1
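
The difference between the two nightly strategies can be sketched as the command lines they might reduce to. This is an illustration only, not the contents of the script named above: the function names and the archive naming scheme are hypothetical, and the code only builds argument lists without running anything.

```python
from datetime import date


def rsync_image_cmd(source, dest):
    """Mirror `source` into `dest`: a plain image with no history.

    -a preserves permissions, times, and symlinks; --delete removes files
    from the mirror that no longer exist at the source.
    """
    return ["rsync", "-a", "--delete", source.rstrip("/") + "/", dest]


def borg_snapshot_cmd(repo, source, day=None):
    """Create a dated Borg archive, so each night is kept as a snapshot."""
    day = day or date.today()
    archive = f"{repo}::home-{day.isoformat()}"  # hypothetical naming scheme
    return ["borg", "create", "--stats", archive, source]
```

The rsync image only ever holds the latest state, while each Borg archive is a separate restore point, which is why only the second approach counts as snapshotting.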

Network Configuration / Management

  • All systems on the Hall C network should be registered with the central systems. Talk to Brad Sawatzky and he will set you up quickly.
    • Do not throw something on the network with a hardcoded IP address. That was fine 15 years ago, but it is not a good plan in a modern network.
  • The network layout is roughly described on the Hall_C_Network page, but that is deprecated and may be out of date. JNET should be considered canonical.

vxWorks boot

  • vxWorks hosts presently boot off cdaql1 (129.57.168.41)

PXE boot (intel/Linux ROCs)

  • Intel/Linux ROCs boot using the PXE mechanism. The PXE stanza is delivered by the CNI DHCP service to hosts on the 168 subnet:
  tftp-host:    hcpxeboot                        # TFTP server
  tftp-path:    linux-diskless/pxelinux.0        # Bootloader program