Hall C Compute Cluster

Systems and (nominal) Functions

The Hall C compute cluster is composed of roughly 4 'classes' of machines. Hosts within these classes are intended to be largely interchangeable, allowing for easier upgrades and failover.

CODA / DAQ nodes (rackmount)

  • cdaql5, cdaql6
  • These are 'modestly' provisioned rackmount servers dedicated to running CODA. Each has ~5.5 TB of local disk intended to serve as a local buffer in front of the large NFS fileserver nodes if needed. To date that has never been necessary; we have been fine just pushing data over an NFS mount to cdaql1/2/3 (see the client-side mount sketch below).
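
For orientation, this is roughly what the client-side NFS mount from a DAQ node to the data volume could look like in /etc/fstab. This is a minimal sketch: the mount options shown are common NFSv3 defaults chosen for illustration, not the cluster's actual settings.

  # hypothetical /etc/fstab entry on cdaql5/cdaql6 (options are illustrative, not the real configuration)
  cdaql1:/data1   /data1   nfs   rw,hard,vers=3,rsize=1048576,wsize=1048576   0 0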

Compute / Fileserver nodes (rackmount)

  • cdaql1, cdaql2, cdaql3
  • These are generally pretty beefy machines with a lot of CPU and disk. They are intended for data storage and online replays (when feasible).
Right now, cdaql1:/data1 is the primary (NFS) destination volume for CODA data from cdaql5 and cdaql6 (the matching server-side export is sketched below).
This is fine under present DAQ loads, but this will need to change for NPS/LAD and other data-heavy experiments.
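
A minimal sketch of what the corresponding server-side export on cdaql1 might look like; the subnet and export options are assumptions for illustration only.

  # hypothetical /etc/exports entry on cdaql1, restricted to the Hall C 168 subnet
  /data1   129.57.168.0/24(rw,sync,no_root_squash)
  # re-export after editing:
  #   exportfs -ra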

'hcdesk' / User nodes (desktop)

  • hcdesk1, 2, 3, ... : 'User' consoles in the Hall C counting house used by shift crew
  • shmshut, hmshut : 'User' consoles in the respective spectrometer huts

These are relatively low-powered computers that primarily serve as consoles to hang monitors and a keyboard off of. All the real work is done on other hosts.

Miscellaneous

  • cvideo1 -- rackmount machine that runs the munin monitoring service and the 'motion' software that handles the cameras. See also Video capture systems
  • cvideo2 -- desktop host that handles the 2 left-most large wall display screens
  • cvideo3 -- desktop host that handles the 4 newest display screens
  • cmagnets (VM) -- A Win10 VM hosted on cdaqfs1 that handled the 'go_magnets' spectrometer magnet controls in the past (no longer used)
  • cmagnets1 (VM) -- A Win10 VM hosted on Hall ESX clusters that handles the 'go_magnets' spectrometer magnet controls
  • skylla10 -- a rackmount Win10 host that hosts the Rockwell HMI software used to interact with the SHMS/HMS spectrometer PLCs
  • cdaqbackup1 -- a rackmount host used to provide backups of (linux) Hall C systems. See #System Backups below.
  • cdaqfs1 -- a rackmount host that was the primary file server for the cluster (retired, but still online)
  • cdaqfs -- a rackmount NetApp server that has replaced "cdaqfs1"
  • cdaqpxe -- a VM hosted on the Hall ESX clusters that handles PXE booting and home directories for all the SBCs.
  • CNAMES (DNS 'aliases' allowing systems to be pointed at a new physical host with a single DNS change; an example record is sketched below)
    • hcpxeboot -> cdaqpxe
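
In DNS zone-file terms, the alias above amounts to a record like the one below. The fully-qualified target is an assumption, and the real record is managed in the lab's central (CNI/JNET) DNS rather than a local zone file.

  ; illustrative only -- the actual alias lives in the CNI-managed DNS
  hcpxeboot   IN   CNAME   cdaqpxe.jlab.org.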

cdaqfs1 (retired)

cdaqfs1 was the primary dedicated file server for the Hall C cluster and played a few roles.

  • NFS mounts are exported to the cluster and mounted on the clients using autofs under the /net/{cdaq,cdaqfs1}/ paths.
    • home/, home/coda/
    • Cluster-local copies of /site, /apps (synced manually when needed) (see cdaqfs1:/local/hallc/RHEL7-x86_64/README for notes/quirks)
    • opt/ contains cluster-local copies of ROOT, some Singularity containers and modules used with prior experiments
  • Hosts the files needed for PXE booting the linux ROCs.
  • Runs the 'cmagnets' Win10 virtual machine under VirtualBox.
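
For reference, a VM run under VirtualBox like this is normally driven with the stock VBoxManage commands; the sketch below is generic usage (only the VM name comes from above), not a record of how cmagnets was actually administered.

  # list registered VMs, then start/stop the cmagnets guest without a GUI
  VBoxManage list vms
  VBoxManage startvm "cmagnets" --type headless
  VBoxManage controlvm "cmagnets" acpipowerbutton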

cdaqfs

The primary dedicated file server for the Hall C cluster plays a few roles. It's maintained by Paul Letta from CST.

  • NFS mounts are exported to the cluster and mounted on the clients using autofs under the /net/{cdaq,cdaqfs}/ paths (see the usage sketch below).
    • /home is mounted on the clients using autofs under /home
    • /coda_home is mounted on the clients using autofs under /net/cdaqfs/cdaqfs-coda-home
    • Local software is installed at /capps and mounted on the clients using autofs under /net/cdaqfs/apps/
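
Because these are autofs mounts, nothing needs to be mounted by hand; simply referencing a path triggers the mount. A quick sanity check from any client might look like the following (paths taken from the list above).

  # autofs mounts these on first access
  ls /net/cdaqfs/cdaqfs-coda-home
  ls /net/cdaqfs/apps
  df -h /home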

Puppet Configuration Management

  • The Hall C cluster machines are configured and maintained using the open-source [Puppet] system.
    • Main repo is hosted at: git@hallcgit.jlab.org:brads/hallc-puppet.git
    • Updates/Upgrades are handled manually to minimize any surprises during Production
      • Brad uses 'cssh' to periodically run global updates and/or push out configuration changes -- bug him for support. (A rough example of such a pass is sketched below.)
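
A rough sketch of what such a manual, cluster-wide pass can look like; the host list and exact commands are illustrative assumptions (how Puppet is actually invoked on this cluster is not specified here), so check with Brad before running anything like this.

  # open one synchronized terminal per host with ClusterSSH
  cssh cdaql1 cdaql2 cdaql3 cdaql5 cdaql6
  # then, in the synchronized session:
  sudo dnf update                 # (yum on the older RHEL7 hosts)
  sudo puppet agent --test        # re-apply the Puppet configuration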

System Backups

  • cdaqbackup1 is an older rackmount host repurposed to provide backups of some important systems
    • All cdaqfs1 NFS exports are backed up nightly (rsync images; no snapshotting)
    • cdaqfs1:home/ is backed up nightly with snapshots
      • The backup software is [Borg Backup]
      • This is handled by the script: cdaqbackup1:/data1/cdaqfs-backup/BACKUP-borg/borg-backup-cdaqfs-home.sh running on cdaqbackup1
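
The authoritative logic lives in that script, but the core of a Borg-based home backup generally reduces to a create/prune pair like the following sketch; the repository path, archive naming, and retention numbers here are illustrative guesses only.

  # create a dated snapshot archive of the home area
  borg create --stats --compression lz4 \
      /data1/cdaqfs-backup/BACKUP-borg/home-repo::home-{now:%Y-%m-%d} \
      /net/cdaqfs1/home
  # prune old snapshots per a retention policy
  borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 \
      /data1/cdaqfs-backup/BACKUP-borg/home-repo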

Network Configuration / Management

  • All systems on the Hall C network should be registered with the central systems. Talk to Brad Sawatzky and he will set you up quickly.
    • Do not throw something on the network with a hardcoded IP address. That was fine 15 years ago, but it is not a good plan on a modern network.
  • The network layout is roughly described on the Hall_C_Network page, but that page is deprecated and may be out of date. JNET should be considered canonical.

vxWorks boot

  • vxWorks hosts presently boot off cdaql1 (129.57.168.41); representative boot parameters are sketched below.
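
Each vxWorks ROC stores a boot line pointing back at that host. The parameter block below is shown purely for orientation: only the host inet address comes from above, and the boot device, file path, and user are placeholders.

  boot device          : ei0                    (placeholder)
  processor number     : 0
  host name            : cdaql1
  file name            : /vxworks/vxWorks       (placeholder path)
  inet on ethernet (e) : 129.57.168.xxx
  host inet (h)        : 129.57.168.41
  user (u)             : daq                    (placeholder account)
  flags (f)            : 0x0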

PXE boot (intel/Linux ROCs)

  • Intel/Linux ROCs boot using the PXE mechanism. The PXE stanza is delivered by the CNI DHCP service to hosts on the 168 subnet:
  tftp-host:    hcpxeboot                        # TFTP server (hcpxeboot is currently a CNAME for cdaqpxe)
  tftp-path:    linux-diskless/pxelinux.0        # Bootloader program
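
After fetching the bootloader, PXELINUX looks for a configuration file under pxelinux.cfg/ relative to that TFTP path. A minimal sketch of such a stanza follows; the kernel, initrd, and NFS-root values are placeholders, not the actual Hall C configuration.

  # linux-diskless/pxelinux.cfg/default  (illustrative only)
  DEFAULT roc
  LABEL roc
      KERNEL vmlinuz
      APPEND initrd=initramfs.img root=/dev/nfs nfsroot=hcpxeboot:/diskless/rootfs ip=dhcp ro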

2024 RHEL 9 Upgrade

  1. Preparation