Network Weather Service


Enabling your machine to be part of the QBETS system is straightforward if you are the administrator of the system, have administrator-like privileges, or are a regular user and the site allows you to inspect the batch queue prediction log files (in fact, the latter is the most common case for the machines currently being serviced). This document outlines the process of setting up a batch queue sensor: a lightweight process which gathers batch queue job data and sends a sanitized subset of the data to the prediction system at UCSB. Once the sensor is running, we can immediately start making quantile bound predictions for the amount of time individual jobs wait in your batch queues.

In a nutshell, one performs the following steps:

Choose a machine

Find a single machine in the system that has access to the batch queue log data and/or the batch queue front-end tools. We refer to this machine's hostname as HOSTNAME.

Choose an installation path

This is where the NWS installation will live; call it IPATH. You must specify the full pathname for IPATH: do not use '~' or any other character which the shell normally expands (for example '*' or '?').
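As a quick sanity check before using a candidate IPATH in the commands below, something like the following sketch could be used. The function name `check_ipath` and the exact set of rejected characters are illustrative assumptions, not part of NWS.

```python
# Characters the shell commonly expands; an illustrative, non-exhaustive set.
SHELL_EXPANDED = set("~*?[]{}$`")


def check_ipath(path):
    """Return path unchanged if it looks usable as IPATH, else raise ValueError.

    Hypothetical helper: requires a full (absolute) pathname that contains
    none of the characters listed in SHELL_EXPANDED.
    """
    if not path.startswith("/"):
        raise ValueError("IPATH must be a full pathname: %r" % path)
    bad = sorted(set(path) & SHELL_EXPANDED)
    if bad:
        raise ValueError("IPATH contains shell-expanded characters %s: %r" % (bad, path))
    return path
```

For example, `check_ipath("/opt/nws")` returns the path unchanged, while `check_ipath("~/nws")` raises `ValueError`.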

Write the parser script or use ours

The sensor will periodically call an external program called 'sitescript' which, each time it is executed, outputs job data to files which are read by the NWS sensor. We suggest that you first try to use our sitescript, which is the program we use on the majority of systems currently being serviced. This sitescript supports PBS (and variants like Torque, OpenPBS, etc.), LSF and Sun Grid Engine (SGE). At the top of the script is a configuration section where you must select which scheduler to use and the path to the selected scheduler's logfiles (see the script itself for details). If you choose to use our sitescript, simply download it and install it in your IPATH/bin directory.

If you are using an unsupported scheduler or wish to write your own sitescript, that is also reasonable, since the program itself can be very simple. Fundamentally, the 'sitescript' program will be called by the NWS sensor periodically with no inputs and will output two files: 'waittime_db' and 'waittime_log_db'. The former should be populated with job data which is 'fresher' than that gathered from any available logfiles, typically obtained from front-end batch queue system commands like 'qstat', 'showq', etc. The latter (waittime_log_db) should contain data gleaned directly from the batch queue log files (/var/spool/torque/server_priv/accounting/* for instance). Note that even if your script only generates data for one of the two files, it must still create both (use 'touch' or an equivalent). Each line of these files represents one job and should be of the form

"jobid" "submit_time" "queue_waittime" "nodes_requested" "walltime_requested" "queue_name"

For example, a line (one unique job) may look something like the following:

18854.tgfoo 1170655220 25 1 900 dque

Such a line indicates that a job with id '18854.tgfoo' was submitted to the queue 'dque' at UNIX timestamp '1170655220'. This job, which requested 1 node for 900 seconds, waited in the queue for exactly '25' seconds before executing.
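For those writing their own sitescript, the six-field line format can be sketched in Python as follows. This is only an illustration: the names `JobRecord`, `format_job` and `parse_job` are hypothetical, and the sketch assumes fields are separated by single spaces and written without quotes, as in the example line above.

```python
from typing import NamedTuple


class JobRecord(NamedTuple):
    """One job, with fields in the order the line format lists them."""
    jobid: str
    submit_time: int         # UNIX timestamp at which the job was submitted
    queue_waittime: int      # seconds the job waited before executing
    nodes_requested: int
    walltime_requested: int  # seconds
    queue_name: str


def format_job(rec):
    """Render a record as one space-separated line (no trailing newline)."""
    return " ".join(str(field) for field in rec)


def parse_job(line):
    """Parse one line back into a JobRecord, rejecting malformed lines."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError("expected 6 fields, got %d: %r" % (len(parts), line))
    jobid, submit, wait, nodes, wall, queue = parts
    return JobRecord(jobid, int(submit), int(wait), int(nodes), int(wall), queue)
```

Parsing the example line and formatting the result again reproduces the same text, which is a convenient self-check when testing a new sitescript.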

The sitescript program must tolerate the fact that every time it is executed, the 'waittime*' files are read and emptied, and so only new jobs should be written to them. For example, if you are reading log files and the same jobs get read every time sitescript executes, sitescript should write each job's data to 'waittime_log_db' only once, and record the fact that the job has already been processed in a different file. 'sitescript' should be installed into IPATH/bin and should output, minimally, two files: IPATH/bin/waittime_db and IPATH/bin/waittime_log_db.
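One way to meet the "write each job only once" requirement is to keep the ids of already-reported jobs in a separate state file. The sketch below illustrates the idea; it is not how our sitescript is implemented, and the function name `write_new_jobs` and the state file are hypothetical.

```python
import os


def write_new_jobs(lines, out_path, seen_path):
    """Append only not-yet-reported job lines to out_path.

    lines     -- iterable of six-field job lines (the jobid is the first field)
    out_path  -- output file, e.g. IPATH/bin/waittime_log_db
    seen_path -- hypothetical state file holding one reported jobid per line
    Returns the number of lines written.
    """
    seen = set()
    if os.path.exists(seen_path):
        with open(seen_path) as f:
            seen = {line.strip() for line in f if line.strip()}

    written = 0
    # Append mode creates both files if they are missing, so the output file
    # exists even when there is nothing new to report.
    with open(out_path, "a") as out, open(seen_path, "a") as state:
        for line in lines:
            fields = line.split()
            if not fields or fields[0] in seen:
                continue  # blank line, or job already reported earlier
            out.write(line.rstrip("\n") + "\n")
            state.write(fields[0] + "\n")
            seen.add(fields[0])
            written += 1
    return written
```

Calling this once per output file on every run also guarantees that both 'waittime_db' and 'waittime_log_db' exist, even when one of them receives no data. The state file grows without bound in this sketch; a real script would likely prune old entries.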

Install the Network Weather Service (NWS)

The NWS is very easy to build and install. Please use the version of NWS linked here as the stable NWS installation does not yet include batch queue monitoring support.

      ./configure --enable-debug --disable-sigalarm --enable-nonblocking --enable-experimental --prefix=IPATH
      make install

Bring up the sensor

     IPATH/bin/nws_sensor -N batchq.cs.ucsb.edu -M batchq.cs.ucsb.edu -c no -A yes  -n HOSTNAME

NOTE: HOSTNAME is the public, static hostname of the machine on which nws_sensor is running

Start the batch queue activity

     IPATH/bin/start_activity -F HOSTNAME skillName:batchQueueMonitor dbpath:IPATH/bin period:60

If all goes well, your 'sitescript' should be running once every 60 seconds, after which the contents of 'waittime_db' and 'waittime_log_db' are consumed by 'nws_sensor' and sent over the network to our server 'batchq.cs.ucsb.edu'. Once we start getting batch queue data on our server, and you explicitly give us permission to enable your machine, you can instantly start making predictions! When your sitescript is running and you want to start using the prediction tools, or you have any questions/concerns/curiosities regarding QBETS, send email to Daniel Nurmi (nurmi0;cs.ucsb.edu).

Here is an example sitescript that is used on many of the machines we're currently monitoring.