|Dell R815 AMD "Abu Dhabi" Servers||54||64||Quad 16-core||128GB|
DACCSS FRONT-END MACHINE
You can securely log into the front-end DACCSS machine using an SSH client, e.g.:
> ssh -Y firstname.lastname@example.org
The software environment on the DACCSS cluster is managed with modules. You can modify your programming and/or application environment by loading and unloading the required modules. The most useful module commands are:
• module avail (view the complete module list)
• module load xyz (load the module xyz)
• module unload xyz (unload the module xyz)
• module list (list currently loaded modules)
Jobs are submitted to the compute nodes via the Univa Grid Engine (UGE) batch submission system. Basic UGE batch scripts should conform to the following template:
#!/bin/csh
#$ -M email@example.com   # Email address for job notification
#$ -m bea                 # Send mail when job begins, ends and aborts
#$ -pe mpi-64 640         # Specify parallel environment and legal core size
#$ -q *@@daccss           # Specify queue
#$ -N job_name            # Specify job name

module load xyz           # Required modules
mpiexec -n $NSLOTS ./app  # Application to execute
Parallel job scripts must request a parallel environment for execution:
1. (parallel jobs running within a single machine)
2. mpi-64 (parallel jobs running across multiple 64-core machines)
Note: If no parallel environment is requested (i.e. you do not specify a -pe parameter), then the default execution environment is a single-core serial job.
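For comparison with the parallel template, a serial job script simply leaves out the -pe line and runs the application directly instead of through mpiexec. The following is a minimal sketch, not an official template: print_serial_job_script is a hypothetical helper name, and the email address, module and application names are the same placeholders used in the template above.

```shell
# Hypothetical helper: print a minimal serial UGE job script to stdout.
# Everything except the job name is a placeholder taken from the template.
print_serial_job_script() {
    job_name=$1
    printf '%s\n' \
        '#!/bin/csh' \
        '#$ -M email@example.com   # Email address for job notification' \
        '#$ -m bea                 # Send mail when job begins, ends and aborts' \
        '#$ -q *@@daccss           # Specify queue' \
        "#\$ -N $job_name          # Specify job name" \
        '' \
        'module load xyz           # Required modules' \
        './app                     # Application to execute (single core)'
}

print_serial_job_script my_serial_job
```

Redirecting the output to a file (e.g. print_serial_job_script my_serial_job > job.script) gives a script that can be edited before submission.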
When requesting a parallel environment you must also specify a valid core size. Legal core sizes for the parallel environments are:
• the single-machine environment (1, 2, 3, ..., 64)
• mpi-64 (a multiple of 64)
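Because an illegal core size is one way a job ends up rejected, it can help to check the number before submitting. The sketch below tests the mpi-64 rule (a positive multiple of 64) in plain shell; is_legal_mpi64_size is a hypothetical helper name, not part of UGE.

```shell
# Hypothetical helper: succeed if the argument is a legal core count for
# mpi-64, i.e. a positive whole number that is a multiple of 64.
is_legal_mpi64_size() {
    n=$1
    case $n in
        ''|0*|*[!0-9]*) return 1 ;;   # reject empty, zero/leading-zero, non-numeric
    esac
    [ $((n % 64)) -eq 0 ]
}

for cores in 64 100 640; do
    if is_legal_mpi64_size "$cores"; then
        echo "$cores: legal for mpi-64"
    else
        echo "$cores: not a multiple of 64"
    fi
done
```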
JOB SUBMISSION AND MONITORING
Job scripts can be submitted to the UGE batch submission system using the qsub command:
> qsub job.script
Once your job script is submitted, you will receive a numerical job id from the batch submission system, which you can use to monitor the progress of the job.
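If you want to use the job id in later commands, it can be pulled out of the qsub confirmation message. The sketch below assumes the usual Grid Engine wording, Your job 12345 ("job_name") has been submitted; the exact text on DACCSS may differ, array jobs are not handled, and extract_job_id is a hypothetical helper name.

```shell
# Hypothetical helper: extract the numeric job id from a qsub confirmation
# message. Assumes the usual Grid Engine wording:
#   Your job 12345 ("job_name") has been submitted
extract_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Illustrative message; on the cluster this would be the output of qsub.
msg='Your job 12345 ("job_name") has been submitted'
job_id=$(extract_job_id "$msg")
echo "Submitted job id: $job_id"
```

On the cluster the message would come from the submission itself, e.g. msg=$(qsub job.script). Grid Engine also provides a -terse option for qsub that prints only the job id, which avoids the parsing step where it is available.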
Well-formed Job Scripts
Job scripts that UGE determines to have made valid resource requests enter the queuing system with a queue-waiting (qw) status; once the requested resources become available, the job enters the running (r) status. Job scripts that are determined not to be valid enter the queuing system with an error queue-waiting (Eqw) status.
To see the running/queued status of your job submissions, invoke the qstat command with your username (netid) and observe the status column:
> qstat -u username
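With many submissions, it can be easier to filter the listing by status than to read it by eye. The sketch below assumes the default qstat layout, in which job lines start with a numeric job id and the state is the fifth whitespace-separated column; jobs_in_state is a hypothetical helper name and the sample listing is made up for illustration.

```shell
# Hypothetical helper: read qstat-style output on stdin and print the ids of
# jobs whose state column equals the requested state (e.g. qw, r, Eqw).
# Assumes the state is the 5th column and job lines start with a numeric id.
jobs_in_state() {
    awk -v want="$1" '$1 ~ /^[0-9]+$/ && $5 == want { print $1 }'
}

# Illustrative sample, not real cluster output.
sample='job-ID  prior   name     user     state submit/start at     queue  slots
---------------------------------------------------------------------
   101 0.50000 job_a    netid    r     01/01/2024 10:00:00 q@host 64
   102 0.00000 job_b    netid    qw    01/01/2024 10:05:00        64
   103 0.00000 job_c    netid    Eqw   01/01/2024 10:06:00        64'

printf '%s\n' "$sample" | jobs_in_state Eqw   # prints 103
```

On the cluster, the same filter would be fed from the real listing, e.g. qstat -u username | jobs_in_state Eqw.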
For a complete overview of your job submission, invoke the qstat command with the job id:
> qstat -j job_id
Note: The main reasons for invalid job scripts (i.e. having Eqw status) typically are:
- Illegal specification of parallel environments and/or core size requests
- Illegal queue specification
- Copying job scripts from a Windows OS environment to the Linux OS environment on the front-end machines (invisible Windows control codes are not parsed correctly by UGE). This can be fixed by running the script through the dos2unix command.
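The Windows control codes in question are carriage-return characters at the end of each line, and they can be detected before submission. The sketch below is one possible check, with has_windows_line_endings as a hypothetical helper name and a throwaway demo file created purely for illustration.

```shell
# Hypothetical helper: succeed if the file contains carriage returns,
# the usual sign of Windows (CRLF) line endings.
has_windows_line_endings() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Demonstration on a throwaway file (created here only for illustration).
demo=$(mktemp)
printf '#!/bin/csh\r\nmodule load xyz\r\n' > "$demo"

if has_windows_line_endings "$demo"; then
    echo "CRLF line endings found; converting"
    # dos2unix "$demo" converts in place; tr is a fallback if it is unavailable.
    tr -d '\r' < "$demo" > "$demo.unix" && mv "$demo.unix" "$demo"
fi
```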
To delete a running or queued job (e.g. a submission with Eqw status), use the following command:
> qdel -j job_id
Job Resource Monitoring
To better understand the resource usage (e.g. memory, CPU and I/O utilization) of your running jobs, you can monitor the runtime behavior of your job's tasks as they execute on the compute nodes.
To determine the nodes on which your tasks are running, enter the following qstat command along with your username. Record the machine names (e.g. q16copt003.crc.nd.edu) associated with each task (both MASTER and SLAVE):
> qstat -u username -g t
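Recording the machine names by hand becomes tedious for jobs with many tasks. The sketch below collects the unique host names from such a listing, assuming each task line contains a queue@hostname field; task_hosts is a hypothetical helper name, and the sample lines (including the queue name and the second host) are made up for illustration.

```shell
# Hypothetical helper: read "qstat -g t"-style output on stdin and print the
# unique host names. Assumes each task line has a queue@hostname field.
task_hosts() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /@/) { split($i, p, "@"); print p[2] } }' | sort -u
}

# Illustrative sample, not real cluster output.
sample='101 0.5 job_a netid r 01/01/2024 10:00:00 long@q16copt003.crc.nd.edu MASTER
101 0.5 job_a netid r 01/01/2024 10:00:00 long@q16copt003.crc.nd.edu SLAVE
101 0.5 job_a netid r 01/01/2024 10:00:00 long@q16copt004.crc.nd.edu SLAVE'

printf '%s\n' "$sample" | task_hosts
```

On the cluster, the listing would come from the command itself, e.g. qstat -u username -g t | task_hosts.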