CRC Wiki

CRC Quick Start Guide


FRONT-END SYSTEMS

CRC provides the following front-end machines for compilation and job submission. Each machine is configured with identical software stacks.

Front-end machines are NOT for large, long-running (>1 hr) jobs. For such jobs, please use the queuing system and the compute nodes.

  1. crcfe01.crc.nd.edu (two 12-core Intel(R) Haswell processors, 256 GB RAM)
  2. crcfe02.crc.nd.edu (two 12-core Intel(R) Haswell processors, 256 GB RAM)
  3. crcfeIB01.crc.nd.edu* (four 16-core 2.4 GHz AMD Opteron processors, 512 GB RAM)

*This machine has InfiniBand and is only accessible from the campus network or via VPN.

You can securely log into the front-end machines (enabling X forwarding for GUI displays) using an SSH client, e.g.

> ssh -Y netid@crcfe01.crc.nd.edu

Please view the Front-end workflow page for more information on proper Front-end usage.

Further info: Setting up your computer

Further info: Available Hardware

Video: Using SSH (Mac OS X)

Using SSH (Microsoft Windows): MobaXterm.

MODULES

The software environment on the front-end machines is managed with modules. You can modify your programming and application environment by loading and unloading the required modules. The most useful module commands are:

module avail (view the complete module list)
module load xyz (load the module xyz)
module unload xyz (unload the module xyz)
module list (list currently loaded modules)
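
For example, a typical session might look like the following (the package name gcc and the version shown are only illustrations; run module avail to see what is actually installed):

> module avail gcc          # list available versions of a package
> module load gcc/9.1.0     # load a specific version
> module list               # confirm what is currently loaded
> module unload gcc/9.1.0   # remove it again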

Further info: Modules

FILE SYSTEMS

CRC provides two complementary file systems for storing programs and data, covering both runtime scratch space and longer-term storage:

AFS

  • Distributed file system
  • Initial 100GB allocation on crc.nd.edu cell
  • Longer-term storage; backup taken daily
  • You can check your current AFS usage with the following command:
> quota

Further info: CRC_AFS_Cell (see below for how to transfer files).

Panasas

  • High-performance parallel file system
  • An allocation on /scratch365 must be requested
  • Used for runtime working storage; no backup
  • You can check your current /scratch365 usage with the following command:
> pan_df -H /scratch365/netid

Further info: Available_Storage

FILE TRANSFERS TO OR FROM WINDOWS AND MAC SYSTEMS

To transfer files between your local macOS desktop and your CRC file system space, we recommend installing and using the following graphical file transfer client:

 Cyberduck

For Windows users, we recommend MobaXterm as both an SSH client and a file transfer client.
If you would like to transfer data between the CRC servers and Google Drive, we recommend the Rclone tool.
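
As a rough sketch, once a Google Drive remote has been set up with rclone config (the remote name gdrive and the paths below are only examples, and the availability of an rclone module is an assumption), a transfer from scratch space might look like:

> module load rclone                                    # load rclone, if it is provided as a module
> rclone config                                         # one-time interactive setup of a remote, e.g. "gdrive"
> rclone copy /scratch365/netid/results gdrive:results  # copy a directory to Google Drive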


JOB SCRIPTS

Jobs are submitted to the compute nodes via the Univa Grid Engine (UGE) batch submission system (note: currently there is no interactive access to compute nodes). Basic SGE batch scripts should conform to the following template:

#!/bin/csh

#$ -M netid@nd.edu	 # Email address for job notification
#$ -m abe		 # Send mail when job begins, ends and aborts
#$ -pe mpi-24 24	 # Specify parallel environment and legal core size
#$ -q long		 # Specify queue
#$ -N job_name	         # Specify job name

module load xyz	         # Required modules

mpirun -np $NSLOTS ./app # Application to execute

Further info: Submitting Batch/SGE jobs
For more examples: Sample User Scripts

PARALLEL ENVIRONMENTS

Parallel job scripts must request a parallel environment for execution:

1. smp (parallel jobs running within a single machine)
2. mpi-24 (parallel jobs running across multiple 24-core machines)

Note: If no parallel environment is requested (i.e. you do not specify a -pe parameter), then the default execution environment is a single-core serial job.
Every machine has one thread per core!
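
For illustration, a minimal serial job script therefore needs no -pe line at all (a sketch; the module and application names are placeholders, as in the template above):

#!/bin/csh

#$ -q long             # Specify queue
#$ -N serial_job       # Specify job name

module load xyz        # Required modules

./app                  # Application to execute on a single core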

When requesting a parallel environment you must also specify a valid core size. Legal core sizes for the parallel environments are:

smp (1, 2, 3, ... 24)
mpi-24 (a multiple of 24)
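
For example, the corresponding job script directives might look like this (the core counts shown are just illustrations):

#$ -pe smp 8           # 8 cores on a single machine
#$ -pe mpi-24 48       # 48 cores spread across two 24-core machines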

Further info: CRC_SGE_Environment

QUEUES

CRC provides two general-purpose queues for the submission of jobs (using the -q parameter):

1. long (queue for production jobs; maximum running wall-time of 15 days)
2. debug (quick turnaround testing/debugging queue; the current maximum wall-time is 4 hours)

Note: The debug queue will only accept jobs with full 24-core parallel environments, i.e. smp 24 and mpi-24

If you wish to target a specific architecture for your jobs, then you can specify a host group instead of a general-purpose queue. Valid host groups are:

1. @@debug_d12chas (dual 12-core Intel Haswell general-access machines in the debug queue, 64 GB RAM)
2. @@crc_d12chas (dual 12-core Intel Haswell general-access machines in the long queue, 256 GB RAM)
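
As a sketch, a queue name can be combined with a host group in the -q directive; the form below follows standard Grid Engine queue@hostgroup syntax and is an assumption here (check the CRC_SGE_Environment page for the exact form CRC expects):

#$ -q long@@crc_d12chas    # long queue restricted to hosts in the @crc_d12chas host group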

Host Group Monitoring

The free_nodes.sh command can be used to show how many nodes and cores are available in a given host group or queue. For example:

> free_nodes.sh @crc_d12chas

Queue Monitoring

You can monitor the status of CRC queues by using the qstat command.

To view all running or pending jobs in the queues, enter the following command:

> qstat

Further info: CRC_SGE_Environment

JOB SUBMISSION AND MONITORING

Job scripts can be submitted to the SGE batch submission system using the qsub command:

> qsub job.script

Once your job script is submitted, you will receive a numerical job id from the batch submission system, which you can use to monitor the progress of the job.
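
The confirmation message looks roughly like this (the job id and job name shown are just examples):

> qsub job.script
Your job 123456 ("job_name") has been submitted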

Well-formed Job Scripts

Job scripts that are determined by SGE to have made valid resource requests will enter the queuing system with a queue-waiting (qw) status (once the requested resources become available, the job will enter the running (r) status). Job scripts that are determined not to be valid will enter the queuing system with an error queue-waiting (Eqw) status.

To see the running/queued status of your job submissions, invoke the qstat command with your username (netid) and observe the status column:

> qstat -u username

For a complete overview of your job submission, invoke the qstat command with the job id:

> qstat -j job_id

Note: The most common reasons for invalid job scripts (i.e. those with Eqw status) are:

  1. Illegal specification of parallel environments and/or core size requests
  2. Illegal queue specification
  3. Copying job scripts from a Windows OS environment to the Linux OS environment on the front-end machines (invisible Windows control codes are not parsed correctly by SGE). This can be fixed by running the script through the dos2unix command
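
For the third case, for example:

> dos2unix job.script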

Job Deletion

To delete a running or queued job (e.g. a submission with Eqw status), use the following command:

> qdel job_id

Job Resource Monitoring

To better understand the resource usage (e.g. memory, CPU and I/O utilization) of your running jobs, you can monitor the runtime behavior of your job’s tasks as they execute on the compute nodes.

To determine the nodes on which your tasks are running, enter the following qstat command along with your username. Record the machine names (e.g. d6copt283.crc.nd.edu) associated with each task (both MASTER and SLAVE):

> qstat -u username -g t

There are two methods for analyzing the behavior of tasks (once you have a machine name):

  1. Xymon GUI Tool (detailed breakdown per task on a given machine)
  2. qhost command (aggregate summary across all tasks on a given machine)


Xymon

CRC provides a GUI tool to analyze the behavior of processes on a given CRC machine. Xymon can be accessed at the following URL:

CRC Xymon

Use Xymon to navigate to the specific machine and then view the runtime resource usage of tasks on the machine.

qhost

You can summarize the resource utilization of all tasks on a given machine using the following qhost command:

> qhost -h machine_name

Further info: Submitting_Batch/SGE_jobs

JOB ARRAYS

If you have a large number of jobs to run that are largely identical in terms of executable and processing (e.g. a 'parameter sweep' where only the input deck changes per run), you should submit them as a job array.

An example job array script is provided below. The SGE batch system runs one task for each value of the $SGE_TASK_ID variable, which is assigned a value from the task range given by the -t parameter. To avoid overloading the email server, please do not enable email notification when submitting an array job.

#!/bin/csh

#$ -pe smp 12          # Specify parallel environment and legal core size
#$ -q long             # Specify queue (use 'debug' for development)
#$ -N job_name         # Specify job name
#$ -t 1-10             # Specify the task id range for the array

module load mpich2     # Required modules

mpiexec -n 12 ./foo < data.$SGE_TASK_ID # Application to execute
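
Submit the array script with qsub in the usual way. Each task receives the same job id with its own task number, so in this sketch task 1 reads data.1, task 2 reads data.2, and so on up to data.10.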