
CRC Quick Start Guide


FRONT-END SYSTEMS

CRC provides the following front-end machines for compilation and job submission. Each machine is configured with identical software stacks.

Front-end machines are NOT for large, long-running (>1 hr) jobs. For such jobs, please use the queuing system and the compute nodes.

  1. crcfe01.crc.nd.edu (two 12-core Intel(R) Haswell processors with 256 GB RAM)
  2. crcfe02.crc.nd.edu (two 12-core Intel(R) Haswell processors with 256 GB RAM)
  3. crcfeIB01.crc.nd.edu* (four 16-core 2.4 GHz AMD Opteron processors with 512 GB RAM)

You can log in securely to the front-end machines (enabling X forwarding for GUI displays) using an SSH client, e.g.

> ssh -Y netid@crcfe01.crc.nd.edu

Further info: Setting up your computer

Further info: Available Hardware

Video: Using SSH (Mac OS X)


*This machine has InfiniBand and is only accessible from the campus network or the VPN.

MODULES

The software environment on the front-end machines is managed with modules. You can modify your programming and/or application environment by loading and unloading the required modules. The most useful module commands are:

module avail (view the complete module list)
module load xyz (load the module xyz)
module unload xyz (unload the module xyz)
module list (list currently loaded modules)
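
For example, a typical session might look like the following (the gcc module name is purely illustrative; use module avail to see what is actually installed):

> module avail
> module load gcc
> module list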

Further info: Modules

FILE SYSTEMS

CRC provides two complementary file systems for storing programs and data, covering both runtime scratch space and longer-term storage:

AFS

  • Distributed file system
  • Initial 100GB allocation on crc.nd.edu cell
  • Longer-term storage; backup taken daily
  • You can check your current AFS usage with the following command:
> quota

Further info: CRC_AFS_Cell; see below for how to transfer files

Panasas

  • High-performance parallel file system
  • An allocation on /scratch365 must be requested
  • Used for runtime working storage; no backup
  • You can check your current /scratch365 usage with the following command:
> pan_df -H /scratch365/netid

Further info: Available_Storage

FILE TRANSFERS TO OR FROM WINDOWS AND MAC SYSTEMS

To transfer files from your local desktop file system to your CRC file system space, we recommend installing and using the following file transfer (GUI) client:

 Cyberduck
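
If you prefer the command line, scp or sftp from a terminal also works; for example (myfile.dat is a placeholder, and with no remote path given the file lands in your home directory on the front end):

> scp myfile.dat netid@crcfe01.crc.nd.edu: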

JOB SCRIPTS

Jobs are submitted to the compute nodes via the Univa Grid Engine (UGE) batch submission system (note: currently there is no interactive access to compute nodes). Basic SGE batch scripts should conform to the following template:

#!/bin/csh

#$ -M netid@nd.edu	 # Email address for job notification
#$ -m abe		 # Send mail when job begins, ends and aborts
#$ -pe mpi-24 24	 # Specify parallel environment and legal core size
#$ -q long		 # Specify queue
#$ -N job_name	         # Specify job name

module load xyz	         # Required modules

mpirun -np $NSLOTS ./app # Application to execute

Further info: Submitting Batch/SGE jobs

PARALLEL ENVIRONMENTS

Parallel job scripts must request a parallel environment for execution:

1. smp (parallel jobs running within a single machine)
2. mpi-24 (parallel jobs running across multiple 24-core machines)

Note: If no parallel environment is requested (i.e. you do not specify a -pe parameter), then the default execution environment is a single-core serial job.
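
For reference, a minimal serial job script (a sketch; the module and application names are placeholders) therefore omits the -pe line entirely:

#!/bin/csh

#$ -q long               # Specify queue
#$ -N serial_job         # Specify job name

module load xyz          # Required modules

./app                    # Application to execute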

When requesting a parallel environment you must also specify a valid core size. Legal core sizes for the parallel environments are:

smp (1, 2, 3, ... 24)
mpi-24 (a multiple of 24)
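
For example, the following directives (core counts chosen for illustration) request each environment:

#$ -pe smp 8             # 8 cores on a single machine
#$ -pe mpi-24 48         # 48 cores spread across two 24-core machines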

Further info: CRC_SGE_Environment

QUEUES

CRC provides two general-purpose queues for the submission of jobs (using the -q parameter):

1. long (queue for production jobs; maximum running wall-time of 15 days)
2. debug (quick turnaround testing/debugging queue; the current maximum wall-time is 4 hours)

Note: The debug queue will only accept jobs with 8-core parallel environments i.e. smp 8 and mpi-8

If you wish to target a specific architecture for your jobs, then you can specify a host group instead of a general-purpose queue. Valid host groups are:

1. @@debug_d12chas (Dual 12-core Intel Haswell general access machines in the debug queue)
2. @@crc_d12chas (Dual 12-core Intel Haswell general access machines in the long queue)
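
For example, to place a job on the general-access Haswell machines in the long queue, the queue request in the job script could take the following form (a sketch; see CRC_SGE_Environment for the exact specification accepted):

#$ -q long@@crc_d12chas  # Long queue restricted to the crc_d12chas host group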

The free_nodes.sh command can be used to show how many nodes are available in a given host group. For example:

   > free_nodes.sh @crc_d12chas

Queue Monitoring

You can monitor the status of CRC queues by using the qstat command.

To view all running or pending jobs in the queues, enter the following command:

> qstat

Further info: CRC_SGE_Environment

JOB SUBMISSION AND MONITORING

Job scripts can be submitted to the SGE batch submission system using the qsub command:

> qsub job.script

Once your job script is submitted, you will receive a numerical job id from the batch submission system, which you can use to monitor the progress of the job.

Well-formed Job Scripts

Job scripts that are determined by SGE to have made valid resource requests will enter the queuing system with a queue-waiting (qw) status (once the requested resources become available, the job will enter the running (r) status). Job scripts that are determined not to be valid will enter the queuing system with an error queue-waiting (Eqw) status.

To see the running/queued status of your job submissions, invoke the qstat command with your username (netid) and observe the status column:

> qstat -u username

For a complete overview of your job submission, invoke the qstat command with the job id:

> qstat -j job_id

Note: The main reasons for invalid job scripts (i.e. having Eqw status) typically are:

  1. Illegal specification of parallel environments and/or core size requests
  2. Illegal queue specification
  3. Copying job scripts from a Windows OS environment to the Linux OS environment on the front-end machines (invisible Windows control codes are not parsed correctly by SGE). This can be fixed by running the script through the dos2unix command, as shown below
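
For example (job.script is the file you intend to submit):

> dos2unix job.script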

Job Deletion

To delete a running or queued job (e.g. a submission with Eqw status), use the following command:

> qdel job_id

Job Resource Monitoring

To better understand the resource usage (e.g. memory, CPU and I/O utilization) of your running jobs, you can monitor the runtime behavior of your job’s tasks as they execute on the compute nodes.

To determine the nodes on which your tasks are running, enter the following qstat command along with your username. Record the machine names (e.g. d6copt283.crc.nd.edu) associated with each task (both MASTER and SLAVE):

> qstat -u username -g t

There are two methods for analyzing the behavior of tasks (once you have a machine name):

  1. Xymon GUI Tool (detailed breakdown per task on a given machine)
  2. qhost command (aggregate summary across all tasks on a given machine)


Xymon

CRC provides a GUI tool to analyze the behavior of processes on a given CRC machine. Xymon can be accessed at the following URL:

CRC Xymon

Use Xymon to navigate to the specific machine and then view the runtime resource usage of tasks on the machine.

qhost

You can summarize the resource utilization of all tasks on a given machine using the following qhost command:

> qhost -h machine_name

Further info: Submitting_Batch/SGE_jobs

JOB ARRAYS

If you have a large number of job scripts to run that are largely identical in terms of executable and processing, e.g. a 'parameter sweep' where only the input deck changes per run, then you should use a job array to submit your jobs.

An example job array script is provided below. The SGE batch system will repeatedly submit jobs differentiated by the $SGE_TASK_ID variable, which is assigned a value within the task range given by the -t task request parameter. To avoid overloading the email server, please do not use email notification when submitting an array job.

#!/bin/csh

#$ -pe smp 12          # Specify parallel environment and legal core size
#$ -q long             # Specify queue (use 'debug' for development)
#$ -N job_name         # Specify job name
#$ -t 1-1000           # Specify number of tasks in array

module load mpich2     # Required modules

mpiexec -n 12 ./foo < data.$SGE_TASK_ID # Application to execute
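
As with any other script, the whole array is submitted with a single qsub call, and the batch system then launches one task per value in the -t range. The template above assumes the input decks are named data.1 through data.1000, one per task. For example (the script file name is a placeholder):

> qsub array_job.script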