Batch Analysis on the Jlab Batch Farm using BATCHMAN

created for Gen01 by frw

updated for version 2.4.16

Executive Summary
BATCHMAN automates the batch farm analysis of a large number of data sets, provided the procedure is identical for all sets.
Specifically, BATCHMAN handles the entire progression, from staging the raw data files out of the tape silo to submitting the analysis jobs and monitoring them through to completion. Numerous error conditions are detected and flagged for easy follow-up.
A graphical user interface makes controlling the system easy and convenient, leaving only the subsequent evaluation of the analyzed results to the user.

Quick Start Instructions

Recent Updates
Overview

Batch Analysis at Jlab

Gen01 Specific Considerations

Gen01 Batch Replay Setup

The BatchMan GUI

The Batch DEAMON

BatchMan Parameters

The LOCKfile

BatchMan Job Progression

Quick Start Instructions
Make a new, dedicated directory in your space on the Jlab work disks as the home for your Gen01 batch replay. Here you will keep the batch replay setup and the results (recommendation: /work/hallc/e93026/your_user_id/batch/). You will be controlling your batch replay from the Jlab interactive farm computers ifarml1 or ifarml2!

Copy the latest complete batch replay setup from the group space (or the web site) and extract its contents into your new batch directory (tar -xzf filename). You should now have the requisite directory structure and the latest runtime setup, except (maybe) for the needed executables. (In the following, replace {home} with the path to the directory you just created.) Check if your {home}/Runtime directory contains the files engine_Linux.exe and syncfilter. If not, obtain the latest versions from the Gen01 group space (or web page). Be sure to fetch syncfilter_Linux and to rename it to just syncfilter!

Now take a look at {home}/Runtime/REPLAY.PARM and make sure that it reflects your wishes. This file will be used by all your batch replays! If you have it as you want, return to {home} and simply execute the command batchman. That's it! All you have to do now is follow the instructions and provide a list of runs to analyze.
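
For reference, the whole setup boils down to a handful of commands, roughly as follows (a sketch only -- the tar file name is a placeholder for whatever the group space or web site currently provides, and the paths follow the recommendation above):

    # on ifarml1 or ifarml2
    mkdir -p /work/hallc/e93026/your_user_id/batch
    cd /work/hallc/e93026/your_user_id/batch
    tar -xzf gen01_batch_setup.tar.gz                 # placeholder name for the setup tarball
    ls Runtime/engine_Linux.exe Runtime/syncfilter    # both should exist
    # if syncfilter is missing: fetch syncfilter_Linux from the group space
    # and rename it:  mv syncfilter_Linux Runtime/syncfilter
    # review Runtime/REPLAY.PARM, then start the GUI:
    batchman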

Be aware, however, that a background process called batch_deamon.tcl will automatically be started and that it must continue to run until all your runs are analyzed. You can always start additional instances of this background process if you desire, even on different computers, as long as they are Linux machines at Jlab which have access to the batch and tape silo commands.
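
Starting such an additional instance by hand amounts to no more than this (sketch; any Jlab Linux machine with access to the work disks, the tape silo and the batch commands will do):

    ssh ifarml2                                    # or another suitable Jlab Linux machine
    cd /work/hallc/e93026/your_user_id/batch       # the same batch directory as before
    batch_deamon.tcl &                             # runs until all tracked jobs have terminated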

Recent Updates -- now at Version 2.10.14 -- please see the changelog in the batchman script for details
The recent upgrades make BatchMan easier to use, more reliable and stable, and easier to adapt to another experiment. If you are upgrading, please be sure to replace all files with their updated versions and, before any other use of the system, run the GUI and click on "save" in the "Params" window to ensure your parameter file reflects the latest options -- naturally, you should check the values displayed in the parameter window to make sure they are correct and as desired.

The reliability improvements are mostly due to additional error trapping in the deamon; the GUI is safer because a change in the lockfile status now causes the status file to be re-loaded before any changes are allowed. The subsets of runs have been renamed and the actions relabeled to make it easier to identify what they are and do, and to provide more consistency. Also, the GUI now displays its interface before reading and decoding the status file, so everything comes up a little quicker.

The layout of the parameter pop-up has been changed to make room for the new additions. The description below is not yet updated, but the items are either identical or should be rather self-explanatory. Note that the parameters have now been distributed across three "tabs" which can be individually selected inside the pop-up window. All the functionality is the same as before.

Several options have been added to the parameter set and to the "Params" window. The most significant are the toggles that control the automatic re-running of failed jobs, the releasing of raw files of failed jobs, and the checking for active and pending staging requests. Please turn to the respective descriptions in the "Parameters" section, below, for details. Other items have been added but are not yet documented, for example the experiment label. All are self-explanatory and should not actually require changing.

Lastly, the helper scripts have been modified to obtain the file names and locations from the BatchMan scripts so that the choices you make in the parameter settings are reflected there as well. This should significantly reduce any modifications needed if you use the system for anything other than Gen01 on Jlab CUE computers.


Overview
The analysis replay for experiment Gen01 condenses the information contained in the raw experiment data files and compiles reports, histograms, and Ntuples with the results. The desired output and many additional analysis options and experiment parameters are specified via a database structure consisting of numerous small parameter files.

The large volume of raw data, grouped into "runs", requires significant computing time to analyze -- either sequentially on a single CPU or in parallel on multiple machines. The Jlab batch farm is a large collection of (essentially) identically configured computers, permitting parallel execution. The batch farm also has ready access to the tape silo, where the raw data files are stored.

By necessity, use of the batch farm is not interactive. Therefore, provisions need to be made to allow the analysis to progress unattended between "submission" of the batch request and its completion. The standard batch system mechanism accomplishes this via a command file that specifies the input and output options as well as the actual command to execute. Since this command file needs to be customized for every run to be analyzed, a large number of analysis jobs quickly makes its preparation a tedious task. Additional preparations and requirements complicate this further, and any serious effort will need to utilize some automation tools.

So far, these tools have usually been custom created and they have generally not been very user-friendly. BATCHMAN was created to handle all the details of the batch analysis yet be user-friendly and reasonably robust, isolating the user from the batch farm, the tape silo and their command interfaces.


Batch Analysis at Jlab
As indicated above, the analysis on the batch farm is not interactive and all input and output options, as well as the command parameters, must be specified up front in a standard format. Further, the local disk space of the batch computers is not directly accessible by the user. While the common shared disk systems are available to the batch computers, parallel calculations need to be aware of each other or isolated in order to avoid interference. This means that we cannot use a shared working space for the simultaneously occurring parallel analyses on the batch farm (since we cannot guarantee non-interference with the Hall C analysis replay).

Analyzing on the batch farm involves two steps: getting the data files staged and submitting the analysis job. These tasks correspond to the two main commands, jcache and jsub, which are extensively documented on the Jlab Scientific Computing User Documentation web page.
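
Done by hand, the two steps would look roughly like the following (illustrative only: the run number is made up, and the exact argument syntax of both commands is described on the Scientific Computing pages):

    # ask the silo to stage one raw data file onto the cache disks
    jcache /mss/hallc/e93026/raw/e93026_12345.log.0    # hypothetical run 12345, segment 0
    # once everything is staged, hand a command file describing the job to the batch farm
    jsub analyze_12345.jsub                            # hypothetical command file name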


Gen01 Specific Considerations
For Gen01's batch analysis, a few other issues come into play. The batch system generally expects one job-specific input file. If multiple files are submitted for analysis together, it therefore offers two options: it can automatically start multiple jobs, one for each file, or it can start just one job for all files. Since our raw data files are split into segments and we need these files analyzed by the same job, in sequence, we need to utilize the latter option and submit each run's analysis job individually.

Unfortunately, this puts us at odds with another consideration: the batch computers do have access to the tape system's cache disks and to the CUE work disks, so we could have the batch jobs work in those directories. This, however, would mean that every bit of data input and output occurs across the network. While this is generally acceptable and certainly commonplace for interactive jobs, it does represent a significant inefficiency and increases the potential for system problems affecting the analysis.

It is much more efficient to simply copy all the requisite files to the batch computer's local hard drive and have it process all data I/O locally. The results are then copied to the work disk for final storage when the analysis completes. This also reduces the time during which the analysis is susceptible to network problems.

Implementing this approach consistently, then, requires that we copy our runtime setup and the raw data files to the local disk before starting the replay. However, each of our raw data file segments is as large as 2GB, so only runs with no more than 2 segments (discounting runtime setup and output) fit into the 4GB disk space allocated for each job by the computing center. While this does cover at least half of our runs, it leaves a large amount of data non-analyzable.

We could easily solve this problem with a mixed approach, copying only the runtime setup to the batch computer's disk and accessing the raw data remotely. This, however, is precisely the inefficiency we are trying to avoid -- we are not the only ones using the batch farm and we do want our jobs to run as quickly as possible.

Instead, we have implemented a different approach: Prior to starting the analysis, we copy only the first file segment of the raw data to the local disk. Then, while the analysis is proceeding, a background task running on this same batch computer, in parallel with our analysis, keeps an eye on the analysis progress. While one segment is being analyzed, the next one is copied to the local disk by the background task. It also deletes those segments whose analysis is done, freeing up that disk space.

This method does exceed the (apparently administrative) limit on the disk space available to each job, but only intermittently and only by the space used by the runtime setup and the output files, which is significantly less than that taken up by the raw data files.

The background task monitors the stats file to determine which segment is currently being analyzed. If this has recently changed, it deletes the previous data file segment (i-1), whose analysis has apparently completed, and then starts copying the next one (i+1) to the local disk -- the segment currently being analyzed (i) is already present, since it was copied in previously.
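
In outline, that rolling window could be pictured as the following loop (a minimal sketch, not the actual monitor_segments script -- the stats file name, the way the current segment number is extracted from it, and the polling interval are all assumptions):

    run=12345                    # hypothetical run number
    prev=""
    while true; do
        # which segment is the replay engine working on right now? (assumed stats format)
        cur=`grep -i segment replay.stats | tail -1 | awk '{print $NF}'`
        if [ -n "$cur" ] && [ "$cur" != "$prev" ]; then
            if [ "$cur" -gt 0 ]; then
                rm -f e93026_${run}.log.`expr $cur - 1`      # segment i-1 is done, free the space
            fi
            next=`expr $cur + 1`
            cp /cache/mss/hallc/e93026/raw/e93026_${run}.log.$next . 2>/dev/null &   # prefetch i+1
            prev=$cur
        fi
        sleep 60
    done
    # (the real script also knows when the last segment has been analyzed and exits)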


Gen01 Batch Replay Setup
The BatchMan analysis "system" consists of four scripts: the Graphical User Interface script batchman, the actual job managing script batch_deamon.tcl which runs continuously (in the background), the script executed on the batch computer (batch_job), and monitor_segments which runs in the background on the batch computer to manage the raw data file segments, as discussed above.

The batch replay setup, as expected by the BatchMan scripts, is illustrated in this diagram:

The two main groups, "Batch Home" and "Output Storage", are actually the same directory; they are shown separately only due to their logical function. Thus, the proper Gen01 batch replay setup will:

The system then works as follows:
The user controls the system's actions via the Graphical User Interface script batchman. The actual tasks of determining which files are needed, initiating the staging requests, and submitting the jobs to the batch farm are handled by the (background) script batch_deamon.tcl. Once a job starts running on the batch farm, the script that gets executed there is batch_job, which in turn invokes monitor_segments to run locally in the background and handle the copying of the raw data file segments. It then starts the replay engine and, after the analysis concludes, copies the results back to the user's work disk directories.
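
Schematically, the part that runs on the batch node does something like this (a sketch under assumptions -- the real batch_job script includes more bookkeeping and error handling, the monitor_segments arguments and the output file types are guesses, and the directory names follow the parameter defaults listed below):

    # executed in the job's scratch directory on the batch node
    BATCH_HOME=/work/hallc/e93026/your_user_id/batch   # handed over by batch_deamon.tcl
    RUN=12345                                          # hypothetical run number

    cp -r $BATCH_HOME/Runtime .            # install the runtime setup locally
    cd Runtime
    $BATCH_HOME/monitor_segments $RUN &    # keep the raw data segments coming (and going)

    ./engine_Linux.exe                     # the actual replay

    # ship the results back to the work disk
    # (the STATS report is copied back to the output directory as well, see the Parameters section)
    cp *.hbook $BATCH_HOME/output 2>/dev/null          # histograms/Ntuples (file types assumed)
    cp *.log   $BATCH_HOME/logs   2>/dev/null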


The BatchMan GUI
The Graphical User Interface batchman gives the user control over the jobs handled by the batchman setup. Initially, a list of runs to be analyzed needs to be supplied. As the analysis progresses, the user has the option of aborting jobs, starting them over or running completed jobs once more. The option to purge completed or aborted jobs from the task list is also available.


BatchMan GUI

The GUI displays each job on one line, 10 lines per screen; the buttons at the lower left and right serve to page forward and backward, one page at a time, five at once, or all the way to the beginning or end of the list. For each run, the run number is displayed (far left), along with the raw data file segment count. The latter is actually a button which invokes a window listing the files explicitly. The far right displays the time that has elapsed since the current state was entered; it only gets updated when the batch_deamon.tcl script iterates.

The content of the remaining fields changes depending on the specific job status. It could be a button detailing the files that have been staged so far, or a field containing the batch system's job sequence ID. Other options include the time at which a job entered the current state, and, if it completed, its finish time, and the state the job was in when it was aborted. If a job terminates abnormally, most likely some error indicator can be found here as well.

The status file contains exactly the same information as is displayed by the batchman GUI. Via this file, the GUI communicates with the workhorse script batch_deamon.tcl. Both read from and write to the same status file, updating the individual jobs' states as they progress. The user can change a job's state to an action request (e.g. new or kill) using the GUI. The workhorse script continually runs in the background and updates the job's status as it acts upon it.

A sample BatchMan status file corresponding to the above example of the GUI is shown here; it is in HTML format to make it more easily readable (using a web browser). This also allows remote monitoring if the file is suitably located. Comparing the two, you might note the additional job marked "Unknown!": it was found to be running on the batch farm under the current user ID but is not currently accounted for in the status file. The user has the option of merging this job into the task list for ease of tracking.

Important! Be careful when exiting the BatchMan GUI: make sure to use the Exit button, not the X-Windows manager's "close window" icon! Otherwise, the lock will not be released (see below) and your jobs will cease to progress! The resulting exit dialog will ask you to confirm that you want to make the changes you have requested permanent. If you indicate No, they will not be entered into the status file and will not be acted upon. You also get the option of not exiting after all.


The Batch DEAMON
The script batch_deamon.tcl is the one that actually tracks your analysis jobs, initiates tape staging requests, and submits jobs to the batch farm. It is designed to run continually in the background as long as you have jobs that require tracking (i.e. it exits when all jobs have terminated). It is automatically started by batchman upon exiting the GUI, provided there are runs to be tracked and it cannot detect a batch_deamon.tcl already running on this CPU under the current user ID.

batch_deamon.tcl first attempts to claim the lock file (see below) and, once it succeeds, it reads the status file. If so indicated by the corresponding flag (see Parameters), a call to jtstat is issued and a list of active and pending staging requests is obtained. Then it issues a jobstat command to query the status of the batch queue. This is checked against the status file and the various jobs' states are updated as needed. Where possible, the completion of the previous state is verified prior to moving on, or an error state is assigned if the check is unsuccessful.

Abort commands (kill and restart) are executed first. Next, the list of files for which staging requests have been issued is checked and the files already present are noted. If the staging system was queried, as per the parameter setting, any files that are neither already present nor still in the staging task list are considered AWOL; if their staging request is older than a few minutes (to cover any latency in the task handling system), the run is terminated with an error status. If, however, all the files of a given run are present, the run is submitted for analysis on the batch farm. For newly added runs, the list of files is obtained from the tape staging system and the files are requested. Then, if indicated by the corresponding parameter setting, any unsuccessfully terminated runs are resubmitted. Finally, the status file is updated and the lock released. batch_deamon.tcl then sleeps for a certain time (see Parameters) and afterwards another iteration is started.
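
One iteration therefore boils down to the outline below (a shell-flavored sketch of the sequence just described, not the actual Tcl code; the bare jtstat/jobstat invocations are assumptions about those commands' usage):

    one_iteration() {
        # 1. claim the lock file, then read the status file

        # 2. query the staging queue (only if "Verify Staging Queue" is set) and the batch queue
        staging=`jtstat 2>/dev/null`
        batch=`jobstat`

        # 3. update each job's state from the query results, verifying the
        #    completion of the previous state where possible

        # 4. act on user requests first: kill / restart

        # 5. staging runs: note the files already on /cache, flag AWOL files,
        #    and submit fully staged runs to the batch farm (jsub)
        # 6. new runs: determine the file list under /mss and issue jcache requests
        # 7. optionally resubmit failed runs (see the Auto-Restart flag)

        # 8. write the status file back, release the lock, and go to sleep
    }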

Since all scripts that access the status file make use of the lock file, it is possible to have any number of deamons running under the same user ID and from inside the same directory (BatchMan systems running from different directories are independent by design, though they may require the user to start a batch_deamon.tcl manually). It is even possible to run them on different computers, as long as they have access to all the needed disks and can interface with the MSS tape system and the batch farm. One should probably increase the sleep time in that case, though (see parameters).

In addition to the background deamon mode, started with the command

batch_deamon.tcl &
two other modes are available, primarily for debugging purposes:

In oneshot mode the script goes through only one iteration and then terminates. In this mode, it generates status messages on the terminal as it progresses. This can also be used to force an update, for example after a session with the batchman GUI. The command is simply

batch_deamon.tcl oneshot
or, avoiding the extra screen output,
batch_deamon.tcl quietone

Useful mostly in the case of intermittent problems, the verbose mode generates the same messages as the oneshot mode but it does not exit after one iteration. Instead, it continues based on the same rules as the background deamon would. This requires that the command window (xterm?) in which the command was issued stay open. In exchange, a log of the progress is printed to the screen:

batch_deamon.tcl verbose
or
batch_deamon.tcl interactive


BatchMan Parameters
The BatchMan parameter file BatchMan.PARAMS is automatically created the first time batchman is executed. The user is strongly discouraged from modifying this file directly -- instead, use the dialog displayed by the batchman GUI for this purpose. It is always accessible via the Params button:


BatchMan Parameter Dialog

Here is a description of the parameters and their default values ({thisdir} refers to the directory in which your batch replay is located):

Parameters controlling the operation of batchman and batch_deamon.tcl:
    Delay between iterations of batch_deamon.tcl
If this is too short, batchman (the GUI) will hardly ever be able to claim the lock to make any changes, and you may well be causing excessive load on the system. If this value is too large, jobs may be picked up by the batch system and finish before BatchMan realizes they were picked up at all (as opposed to having vanished), which will result in an ERROR status for the affected job(s).
15
The directory dedicated for temporary files
This will be used by batch_deamon.tcl for the job files that get submitted to the batch system. The default setup includes such a directory at the location the default indicates.
Should only need changing if you are not using the default setup.
{thisdir}/tmp
Location of the status file
As is indicated in the dialog, making this location some place that is accessible via WWW allows you to monitor the job progress from any web browser. The default is a particularly handy location which allows access from anywhere in the world at http://www.jlab.org/~your_userID/batchman.html
~/public_html/batchman.html
Information used to stage the raw data files:
Maximum number of raw data file segments staged at any one time
This is a fair-play value: since each staged file takes up space on the tape staging system's cache disks, we need to limit ourselves so as not to monopolize it. Only files that BatchMan currently has staged count against this limit, including all files for which a staging request has been issued and the files of runs currently in the batch queue, whether they are pending or already being analyzed. Once a run completes, successfully or not, or is aborted, its files are released in the staging system and no longer count toward this sum. The default value should allow an average of 100 runs to be staged and queued at a time, which ought to be sufficient (expecting 2-4 segments per run).
300
Location of the staging system's stub files
If you were to compare the process of staging in a raw data file from tape to copying the file, this would be the location you are copying from. The default is the correct value if you are analyzing experiment Gen01 at Jlab. See the Jlab MSS user documentation for details.
/mss/hallc/e93026/raw
Location of the already staged files
If you were to compare the process of staging in a raw data file from tape to copying the file, this would be the location you are copying to. The default is the correct value if you are analyzing experiment Gen01 at Jlab. Unless you are not caching the files, which would mean you changed almost all of the BatchMan scripts... See the Jlab MSS user documentation for details.
/cache/mss/hallc/e93026/raw
Prefix of raw data files' names
This, together with the run number, the suffix (see next parameter) and the segment number, makes up the raw data file's name.
The default is the correct value if you are analyzing experiment Gen01 at Jlab.
e93026_
Suffix of raw data files' names
This, together with the prefix (see previous parameter), the run number, and the segment number, makes up the raw data file's name.
The default is the correct value if you are analyzing experiment Gen01 at Jlab.
.log.
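Put together, prefix, run number, suffix and segment number yield file names like the one below (the run number is made up):

    prefix=e93026_
    suffix=.log.
    run=12345                                      # hypothetical run number
    seg=0
    file=${prefix}${run}${suffix}${seg}            # -> e93026_12345.log.0
    echo /mss/hallc/e93026/raw/$file               # stub file used for the staging request
    echo /cache/mss/hallc/e93026/raw/$file         # where the file appears once staged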
Information used for the batch system:
Runtime Setup directory
This is where the batch job's runtime setup will be copied from.
Should only need changing if you are not using the default setup.
{thisdir}/Runtime
Output directory to which the STATS report will be copied
The STATS report is used by the monitor_segments script to determine the data file segment currently being analyzed. This entry, however, tells BatchMan where to find this report once the analysis has finished. It is now used to determine if the analysis terminated prematurely (incomplete) or if it completed properly.
Should only need changing if you are not using the default setup.
{thisdir}/output
Directory to which the log file(s) will be copied upon job completion
Right now, this is only used to delete old logs when jobs are (re)started to eliminate confusion. Note that it is not used to determine where the log files end up! That is always assumed to be the /logs subdirectory of the batch home directory (see below).
Should only need changing if you are not using the default setup.
{thisdir}/logs
Command to be run on the batch computer
This is the command that is actually given to the batch system (via a command file in the TEMP directory) for execution on the batch computer. It is responsible for getting the Runtime setup installed, actually executing the replay, and copying the results back to the batch home directory. You could create slightly modified versions of this script under different names and use this parameter to pick which one to use. The default value is the script that BatchMan comes with: batch_job. Be sure to include the fully qualified path!
{thisdir}/batch_job
Your Batch Home Directory
This parameter is supplied to batch_job (see above). Our version uses it to determine where to copy the analysis results once the analysis is completed, since we do not have access to the batch computer's disk.
Should only need changing if you are not using the default setup.
{thisdir}
Other Options:
Release FAILED Runs' Files
When a run completes successfully, its raw data files are flagged for removal from the staging cache disk to make room for other files. This flag determines whether the files of a failed run are released as well; keeping them is beneficial if the run will be resubmitted soon (see also the "Auto-Restart FAILED Runs" option), but it comes at the expense of available disk space, which may slow down the overall progress.
yes
Auto-Restart FAILED Runs
This toggle determines whether to automatically resubmit a failed job. Depending on the reason for a run's failure, a second attempt may be successful. If this option is active, a bad run may be in progress forever; use with caution.
no
Verify Staging Queue
If a run has been submitted to the batch system for analysis, it is expected to show up in the batch status query. If it is not there, the expected output is checked to determine whether the run completed successfully or not. This flag determines whether the analogous check is made in the file staging system: any raw data file that was requested by batchman either ought to have been staged successfully or show up in the staging queue as active or pending. Any file that is neither found to be present nor listed in the staging queue is flagged as AWOL. If the staging request is NOT recent, allowing for some latency in the system, the run is terminated with an error state. Reasonably accurate, and it can save lots of time.
yes

For completeness only:   The defaults used by batch_deamon.tcl in the absence of an entry in the parameter file (or the absence of a parameter file altogether) are NOT the same as the GUI dialog's defaults. The latter are recommended values which have a high expectation of giving satisfactory results, while the former are designed to allow operation even in unusual situations. Don't push the issue, though: a missing or incomplete parameter file is bound to cause problems!


The LOCKfile
From the batchman "Lock File Found" dialog:
The current status of all your jobs is stored in a file. This status file is used by BATCHMAN and by the automatic background process to communicate.

Right now, it appears that the background process is updating your status file. To ensure that you do not overwrite these updates, you are given only read access to the status file until the automatic process is done. Usually, this only takes a few moments -- another check will be done every ??? seconds.

However, it is possible that the "lock file" is actually a remnant from an aborted or crashed old process. You can exit BATCHMAN and check the contents of the lock file for the PID of the process that created it. If that process is no longer running, you can delete the lock file and restart BATCHMAN. Your lock file is ???.

If you feel that no other process ought to be claiming this lock right now, get the PID from the lock file and check if such a process exists. If so, you should give it a few more minutes and then, after first checking if the lock is still taken and by this same process, you could kill the process before deleting the lockfile.
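
In practice, that check amounts to a few commands along these lines (sketch -- `LOCKfile' stands in for whatever name the dialog reports, and 4321 for the PID found inside it):

    cat LOCKfile          # shows the PID of the process that created the lock
    ps -p 4321            # is that process still alive?
    # only if it is truly gone, or still there but hung:
    kill 4321             # stop a hung process first...
    rm LOCKfile           # ...then remove the stale lock and restart batchman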

Keep in mind, though, that a process running on another CPU might also access this same lockfile, and then it will not show up in this CPU's task list. This will of course only be an issue if you ran batchman (or batch_deamon) on this other CPU and in this same directory. Also, you should check the modification date of the lockfile: if your sleep time is too short, you will have a hard time catching batch_deamon asleep and the lockfile will seem to be permanent. Simply run BatchMan, even if it can't get the lock, and click on "Params" to open the parameter window and change the sleep time.

The best way to avoid problems with the lockfile is to make sure you exit the BatchMan GUI properly -- via the "Exit" dialog. Taking the shortcut of closing the window via the X-Windows window manager (the corners of the window frame) is certain to cause these problems. (Note: if you know how to force a tcl script running under wish to use *my* exit command, please let me know so I can bypass this problem.)


BatchMan Job Progression
BatchMan (and batch_deamon) use a state-based system to track the jobs' progression. This is motivated primarily by the fact that at two different stages the progression depends on the completion of external processes of unknown duration (file staging and batch analysis).

The following chart illustrates the different states known to BatchMan and how a transition from one to another is initiated, be it by user interaction or due to job progression.

The job states are listed below in the order a job normally progresses through them; the user-requested actions (Kill! and Restart!) follow at the end. The normal progression is: new, unstaged, staging..., staged, submitted, queued, running, completed.

new
    The run was just added by the user.

unstaged
    The newly requested run has been received by batch_deamon, the requisite input files have been determined, and old logs have been deleted, but the staging request has not been issued yet -- probably on hold due to the staging limit (see PARAMS).

staging...
    The needed files have been requested from the staging system; we are waiting for them to appear. Note that the files already present are tracked. If this takes very long, the staging request may have been unsuccessful -- the staging system ought to have sent an email to that effect. Either way, you can Restart this run if needed.

staged
    All needed files are present and the run can be sent to the batch queue. This state is purely administrative and should never actually be encountered in the status file. Not a problem if it does show up, though.

submitted
    The run has been submitted to the batch queue but has yet to appear in our job listing. There is some latency in the batch queueing, so this is OK. It could, however, also indicate that the run was received, started, and completed (or crashed!) before the next iteration of batch_deamon. This is especially likely for short jobs (or jobs shorter than the iteration delay), for problems with the analysis itself (try it interactively), or if the BatchMan GUI was left running for too long!

queued
    The job has shown up in the batch queue ("jobstat") and is on hold there. If this takes too long (days?), there is a realistic chance that the data files will have been deleted by the tape staging system. Restart this job if needed.

running
    The job is in the batch queue job list and is indicated to be currently running. Good!

completed
    The job has disappeared from the batch queue job list, but the log files have been found in the local storage, so this job should have completed as desired! If you don't like the results, you can run it again; otherwise you can purge it from the listing if you find it too cluttered...

incomplete!
    The difference between this and completed is that the status file indicates that we did not analyze all data file segments. Maybe some were deleted from the tape staging disk before we got around to using them? Just re-run this job (or purge it if you are giving up on it).

aborted!
    This job was going along just fine (as far as batchman could tell), but the user requested it be killed. Well, it's dead now. You can re-run it or purge it if you want. It should tell you at what stage in its life it was so mercilessly killed.

ERROR
    Something went horribly wrong! The universe is collapsing or something equally horrible is underfoot. Well, maybe it's just a problem reading the status file. You didn't change anything, did you?! This could be due to many, many, many different causes and, unless it's systematic, the best choice is to re-run this job. If it is systematic, then you need to figure out what's causing the problem. There might be a clue somewhere to help you...

Kill! (user action)
    The user has indicated that they dislike this job and would rather not have it stick around. The system has not yet received the contract but will attend to it post haste.

Restart! (user action)
    While the user appears to have an issue with this job, too, at least they are willing to give it another chance to rehabilitate itself. Unfortunately, our batch system is ill-equipped to handle such optimism, and the only option we have is to first kill this job and then resurrect it. It will be terminated and its status changed to "new" in the next iteration of batch_deamon.

aborting...
    Once again an administrative state only. You should never see it either, but it is harmless if you do. This is the limbo between batch_deamon having issued the kill order and its getting around to issuing the death certificate. We do not stick around to verify the job's demise, so this really just takes BatchMan's "signature" to become aborted!.


frw, November 2002