Job Monitoring

CRAB Monitoring

To ask CRAB about the status of your jobs, use the following command:

 crab -status (-c crab_directory)
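
For example, assuming a hypothetical CRAB task directory named crab_myTask, the full command would be:

 crab -status -c crab_myTask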

This monitoring webpage gives extensive information about jobs up to a week old:

 http://dashb-cms-job.cern.ch/dashboard/request.py/jobsummary#user=&site=T3_US_NotreDame

There are many options on the left side of the page, which should be fairly self-explanatory. This link [1] has the presets which are generally the most useful.

CONDOR Monitoring

Official documentation of all condor command-line utilities:

 http://www.cs.wisc.edu/condor/manual/v7.0/9_Command_Reference.html

To view the status of all jobs in the condor queue (or for a specific user), use the following command:

 condor_q (username)

Note: if you have submitted jobs from outside ND, condor maps your username to "uscms01".
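
For example, to view the whole queue, or only the jobs belonging to one user (such as the mapped "uscms01" account from the note above), run one of:

 condor_q
 condor_q uscms01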

If you have jobs in the "Idle" state, you can ask condor why they are not running, using the following command:

 condor_q -better-analyze (jobID)

This command will list the reasons a job can be rejected and tell you how many of the 600 nodes are rejecting your job for each reason.
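
As a quick sketch, assuming an idle job with the hypothetical ID 1234.0:

 condor_q -better-analyze 1234.0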


It is possible to manually abort your jobs using the following command:

 condor_rm (jobID or username)
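
For instance, to remove a single job with the hypothetical ID 1234.0, or every job belonging to you:

 condor_rm 1234.0
 condor_rm your_username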

However, it is recommended that you cancel jobs via CRAB instead, because CRAB often gets confused if its jobs suddenly disappear. To abort jobs using CRAB, use:

 crab -kill (job_range or "all")
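
For example, to kill jobs 1 through 10 of a task, or the entire task (again assuming the hypothetical task directory crab_myTask):

 crab -kill 1-10 -c crab_myTask
 crab -kill all -c crab_myTask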

The memory restriction per node on the condor queue is set to 1 GB by default. This limit is measured against the virtual memory of the process, meaning that once a job's image size (virtual memory) grows larger than 1 GB, the job can be idled indefinitely. Many CMSSW jobs accumulate much more virtual memory than physical memory (up to 2x in some cases), so jobs can end up idled when they don't necessarily need to be.

You can change the default 1 GB memory restriction with the command condor_qedit. If you choose to change this value, be reasonably sure that your jobs won't use more than 1 GB of real memory, as exceeding it could cause problems for the machine the job is running on (including possibly crashing it). To change the memory limit, do the following:

1) find the job requirements string for one of your jobs:

> condor_q -l <condor_job_id> | grep Requirements

2) copy the full requirements expression that comes after "Requirements = " in the output.

The requirements string will have an element like

( ( TARGET.Memory * 1024 ) >= ImageSize )

Since we can't edit the value of 'TARGET.Memory', we'll instead change the proportionality factor from 1024 to, say, 2048, which allows a job's image size to grow to twice a machine's advertised memory before it is rejected.

3) reset the Requirements string for a job or group of jobs, using the old requirements string with the new proportionality factor.

> condor_qedit <user/condor_job_id> Requirements <new_req_string>

Note that since the default string contains double quotation marks (""), you'll need to surround your <new_req_string> with single quotes (''). In addition, if you copy and paste the output from step one, you'll need to remove the space in "> =" so that it reads ">=" instead.

Remember that 'condor_qedit' will work on a specific job ID (xxx.y), a job cluster ID (xxx), or a username.
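
Putting the three steps together, here is a sketch using a hypothetical job ID (1234.0) and a deliberately shortened requirements string; your real string will be longer, and you should copy it verbatim from step one, changing only the 1024:

 > condor_q -l 1234.0 | grep Requirements
 Requirements = ( ( TARGET.Memory * 1024 ) >= ImageSize ) && ( TARGET.OpSys == "LINUX" )

 > condor_qedit 1234.0 Requirements '( ( TARGET.Memory * 2048 ) >= ImageSize ) && ( TARGET.OpSys == "LINUX" )'

After the edit, condor_q -l should show the new Requirements value, and the job will be matched against the relaxed memory requirement at its next negotiation cycle.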