
March 24, 2019 Maintenance

Systems with GPU cards are in very high demand across both faculty-owned and our general-access systems. To ensure fair and even usage across all of these machines, we are making several configuration and naming changes. These changes will be implemented during our routine maintenance on the morning of Sunday, March 24.

For most users, this change will not affect running jobs, and we anticipate only a short window during which new jobs will be delayed from starting.

The primary change is that all batch jobs sent to any GPU queue will now have the environment variable CUDA_VISIBLE_DEVICES automatically set as part of the job script. This variable ensures that a job can only make use of the GPU resources it requested from the scheduler.
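
As a rough sketch of what this looks like from inside a job, the example below requests two GPUs and prints the value the scheduler set. The queue name and the gpu_card resource request are placeholders for illustration and may not match the exact names used on our systems.

 #!/bin/bash
 #$ -q gpu                # placeholder GPU queue name
 #$ -l gpu_card=2         # placeholder request for two GPU cards

 # The scheduler sets CUDA_VISIBLE_DEVICES before the script body runs,
 # so CUDA applications in this job see only the two assigned cards.
 echo "Assigned GPUs: $CUDA_VISIBLE_DEVICES"   # e.g. "1,2"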

The scheduling system reports which devices are assigned to a particular job through an environment variable named SGE_HGR_gpu_card. Previously, this variable returned device names in the form gpuX, where X was a number between 1 and 4. Going forward, it will return only 0-indexed values. For example, if your job requests 2 GPUs and is assigned the second and third devices, you would see this:

 $ echo $SGE_HGR_gpu_card
 1 2

The motivation for this change is to mirror more closely the numbering scheme used by CUDA_VISIBLE_DEVICES. For interactive bash sessions, you may use the following assignment once your session is scheduled:

 export CUDA_VISIBLE_DEVICES=${SGE_HGR_gpu_card// /,}

Similarly, interactive (t)csh sessions can use this command:

 setenv CUDA_VISIBLE_DEVICES `echo $SGE_HGR_gpu_card | tr ' ' ','`
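
Note that once CUDA_VISIBLE_DEVICES is set, CUDA renumbers the visible cards starting at 0 inside your application, regardless of their physical indices. A minimal bash sketch of the effect:

 export CUDA_VISIBLE_DEVICES=${SGE_HGR_gpu_card// /,}   # e.g. "1,2"
 # A CUDA program launched now sees two devices, which it addresses
 # as device 0 and device 1 rather than by the physical indices above.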

Jobs that circumvent these settings may be removed at our discretion.