CRC provides the following front-end machines for compilation and job submission. Each machine is configured with identical software stacks.
FRONT-ENDs are NOT for large, long-running (>1hr) jobs. For such jobs, please use the queuing system and compute nodes.
- crcfe01.crc.nd.edu (two 12-core Intel(R) Haswell processors with 256 GB RAM)
- crcfe02.crc.nd.edu (two 12-core Intel(R) Haswell processors with 256 GB RAM)
- crcfeIB01.crc.nd.edu* (four 16-core 2.4 GHz AMD Opteron processors with 512 GB RAM)
*This machine has InfiniBand and is only accessible from the campus network or VPN.
You can securely log into the front-end machines (enabling X forwarding for GUI displays) using an SSH client, e.g.
> ssh -Y firstname.lastname@example.org
Further info: Setting up your computer
Further info: Available Hardware
Using SSH (Microsoft Windows): MobaXterm.
The software environment on the front-end machines is managed with modules. You can easily modify your programming and/or application environment by simply loading and removing the required modules. The most useful module commands are:
- module avail (view the complete module list)
- module load xyz (load the module xyz)
- module unload xyz (unload the module xyz)
- module list (list currently loaded modules)
Further info: Modules
CRC provides two complementary file systems for storing programs and data: one for longer-term storage and one for runtime scratch space:
- Distributed file system (AFS)
  - Initial 100GB allocation on crc.nd.edu cell
  - Longer-term storage; backup taken daily
  - You can check your current AFS usage with the following command:
- High-performance parallel file system (/scratch365)
  - You must request an allocation before using /scratch365
  - Used for runtime working storage; no backup
  - You can check your current /scratch365 usage with the following command:
> pan_df -H /scratch365/netid
Further info: Available_Storage
FILE TRANSFERS TO OR FROM WINDOWS AND MAC SYSTEMS
To transfer files from your local desktop filesystem to your CRC filesystem space we recommend installing and using the following file transfer (GUI) client:
Jobs are submitted to the compute nodes via the Univa Grid Engine (UGE) batch submission system (note: currently there is no interactive access to compute nodes). Basic SGE batch scripts should conform to the following template:
#!/bin/csh
#$ -M email@example.com   # Email address for job notification
#$ -m abe                 # Send mail when job begins, ends and aborts
#$ -pe mpi-24 24          # Specify parallel environment and legal core size
#$ -q long                # Specify queue
#$ -N job_name            # Specify job name

module load xyz           # Required modules

mpirun -np $NSLOTS ./app  # Application to execute
Parallel job scripts must request a parallel environment for execution:
1. smp (parallel jobs running within a single machine)
2. mpi-24 (parallel jobs running across multiple 24-core machines)
Note: If no parallel environment is requested (i.e. you do not specify a -pe parameter), then the default execution environment is a single-core serial job.
When requesting a parallel environment you must also specify a valid core size. Legal core sizes for the parallel environments are:
- smp (1, 2, 3, ... 24)
- mpi-24 (a multiple of 24)
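The legal core sizes above can be checked before a script is submitted. The following is a minimal sketch in POSIX sh, not a CRC-provided tool: the function name valid_core_size is hypothetical, and the limits are simply the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the core count is legal for the
# given parallel environment, using the limits listed in this guide.
valid_core_size() {
    pe=$1
    cores=$2
    # Reject anything that is not a plain positive integer.
    case $cores in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$cores" -ge 1 ] || return 1
    case $pe in
        smp)    [ "$cores" -le 24 ] ;;          # 1, 2, 3, ... 24
        mpi-24) [ $((cores % 24)) -eq 0 ] ;;    # a multiple of 24
        *)      return 1 ;;                     # unknown environment
    esac
}

# Example checks
valid_core_size smp 12    && echo "smp 12: ok"
valid_core_size mpi-24 48 && echo "mpi-24 48: ok"
valid_core_size mpi-24 36 || echo "mpi-24 36: not a multiple of 24"
```

A check of this kind only mirrors the rules stated here; the batch system remains the authority on what it accepts.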
Further info: CRC_SGE_Environment
CRC provides two general-purpose queues for the submission of jobs (using the -q parameter):
1. long (queue for production jobs; maximum running wall-time of 15 days)
2. debug (quick turnaround testing/debugging queue; the current maximum wall-time is 4 hours)
Note: The debug queue will only accept jobs with 24-core parallel environments, i.e. smp 24 and mpi-24.
If you wish to target a specific architecture for your jobs, then you can specify a host group instead of a general-purpose queue. Valid host groups are:
1. @@debug_d12chas (Dual 12-core Intel Haswell general access machines in debug queue, 64GB RAM)
2. @@crc_d12chas (Dual 12-core Intel Haswell general access machines in long queue, 256GB RAM)
Host Group Monitoring
The free_nodes.sh command can be used to show how many nodes and cores are available in a given host group or queue. For example:
> free_nodes.sh @crc_d12chas
You can monitor the status of CRC queues by using the qstat command.
To view all running or pending jobs in the queues, enter the following command:
Further info: CRC_SGE_Environment
JOB SUBMISSION AND MONITORING
Job scripts can be submitted to the SGE batch submission system using the qsub command:
> qsub job.script
Once your job script is submitted, you will receive a numerical job id from the batch submission system, which you can use to monitor the progress of the job.
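If you want to reuse the job id in later commands, it can be extracted from the confirmation line that qsub prints. The sketch below assumes the usual Grid Engine message format shown in its comments; the helper name job_id_from_qsub is hypothetical, and the demonstration runs on a sample line rather than submitting a job.

```shell
#!/bin/sh
# Hypothetical helper: read qsub's confirmation text on stdin and print
# the numeric job id. Assumes the usual Grid Engine message format:
#   Your job 12345 ("job_name") has been submitted
#   Your job-array 12346.1-10:1 ("job_name") has been submitted
job_id_from_qsub() {
    sed -n 's/^Your job\(-array\)\{0,1\} \([0-9][0-9]*\).*/\2/p'
}

# Demonstration on a sample line (no job is submitted here).
echo 'Your job 12345 ("job_name") has been submitted' | job_id_from_qsub

# On a front-end machine the helper could be used like this:
#   job_id=$(qsub job.script | job_id_from_qsub)
#   qstat -j "$job_id"
```

If the message format on your system differs, the helper prints nothing, which is easy to test for before using the result.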
Well-formed Job Scripts
Job scripts that are determined by SGE to have made valid resource requests will enter the queuing system with a queue-waiting (qw) status (once the requested resources become available, the job will enter the running (r) status). Job scripts that are determined not to be valid will enter the queuing system with an error queue-waiting (Eqw) status.
To see the running/queued status of your job submissions, invoke the qstat command with your username (netid) and observe the status column:
> qstat -u username
For a complete overview of your job submission, invoke the qstat command with the job id:
> qstat -j job_id
Note: The main reasons for invalid job scripts (i.e. having Eqw status) typically are:
- Illegal specification of parallel environments and/or core size requests
- Illegal queue specification
- Copying job scripts from a Windows OS environment to the Linux OS environment on the front-end machines (invisible Windows control codes are not parsed correctly by SGE). This can be fixed by running the script through the dos2unix command
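Windows line endings are invisible in most editors, so it can help to test for them before submitting. The following is a minimal sketch in POSIX sh; the function name has_crlf and the throwaway file are hypothetical, and only standard tools (printf, grep, mktemp) are used.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the file contains a carriage return,
# the character Windows adds before each newline.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Demonstration with a throwaway file written with Windows line endings.
tmp=$(mktemp)
printf '#!/bin/csh\r\n#$ -q long\r\n' > "$tmp"
if has_crlf "$tmp"; then
    echo "Windows line endings found; convert with: dos2unix job.script"
fi
rm -f "$tmp"
```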
To delete a running or queued job (including submissions with Eqw status), use the following command:
> qdel job_id
Job Resource Monitoring
To better understand the resource usage (e.g. memory, CPU and I/O utilization) of your running jobs, you can monitor the runtime behavior of your job’s tasks as they execute on the compute nodes.
To determine the nodes on which your tasks are running, enter the following qstat command along with your username. Record the machine names (e.g. d6copt283.crc.nd.edu) associated with each task (both MASTER and SLAVE):
> qstat -u username -g t
There are two methods for analyzing the behavior of tasks (once you have a machine name):
- Xymon GUI Tool (detailed breakdown per task on a given machine)
- qhost command (aggregate summary across all tasks on a given machine)
CRC provides a GUI tool to analyze the behavior of processes on a given CRC machine. Xymon can be accessed at the following URL:
Use Xymon to navigate to the specific machine and then view the runtime resource usage of tasks on the machine.
You can summarize the resource utilization of all tasks on a given machine using the following qhost command:
> qhost -h machine_name
Further info: Submitting_Batch/SGE_jobs
If you have a large number of job scripts to run that are largely identical in terms of executable and processing (e.g. a 'parameter sweep' where only the input deck changes per run), then you should use a job array to submit your jobs.
An example job array script is provided below. The SGE batch system will repeatedly submit jobs differentiated by the $SGE_TASK_ID variable, which is assigned a value within the task range indicated by the -t SGE task request parameter. To avoid overloading the email server, please do not use email notification when submitting an array job.
#!/bin/csh
#$ -pe smp 12        # Specify parallel environment and legal core size
#$ -q long           # Specify queue (use 'debug' for development)
#$ -N job_name       # Specify job name
#$ -t 1-10           # Specify number of tasks in array

module load mpich2   # Required modules

mpiexec -n 12 ./foo < data.$SGE_TASK_ID  # Application to execute
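Each task in the array above reads its own input deck, data.1 through data.10, so a missing file only shows up when that task runs. As a hedged sketch, the POSIX sh helper below lists the decks that a given task range would need but that are absent from the current directory. The name missing_inputs is hypothetical, and the data.N naming is taken from the example script.

```shell
#!/bin/sh
# Hypothetical helper: for a task range FIRST..LAST (as in "-t FIRST-LAST"),
# print every data.N input deck missing from the current directory.
missing_inputs() {
    i=$1
    while [ "$i" -le "$2" ]; do
        [ -f "data.$i" ] || echo "data.$i"
        i=$((i + 1))
    done
}

# Demonstration in a scratch directory that holds only data.1 and data.2.
dir=$(mktemp -d)
(
    cd "$dir" || exit 1
    touch data.1 data.2
    missing_inputs 1 4    # lists data.3 and data.4
)
rm -rf "$dir"
```

Running missing_inputs 1 10 in the submission directory before qsub would cover the range used in the example script.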