Bertini Hardware

BERTINI FRONT-END MACHINES


hostnames: bertini1.crc.nd.edu, bertini2.crc.nd.edu, and bertini3.crc.nd.edu

You can securely log into one of the front-end Bertini machines using an SSH client, e.g.

> ssh -Y netid@bertini1.crc.nd.edu

CLUSTER SPECIFICATION

Sommese Skoll (Newer Xeon)

Description                              Host Group Name      Nodes  Cores/Node  Node Specification  Memory/Node
Dell PowerEdge R410 Servers              @sommese_dqcneh      9      8           Dual Quad-core      12GB

Sommese Skoll (Older Xeon)

Description                              Host Group Name      Nodes  Cores/Node  Node Specification  Memory/Node
Western Scientific ASUS DSBF-DE Servers  @sommese_dqcxeon     28     8           Dual Quad-core      8GB

Sommese Large Memory (New)

Description                              Host Group Name      Nodes  Cores/Node  Node Specification  Memory/Node
Dell R815 AMD "Abu Dhabi" Servers        @sommese_q16copt     4      64          Quad 16-core        128GB

Sommese Large Memory (Older)

Description                              Host Group Name      Nodes  Cores/Node  Node Specification  Memory/Node
HP ProLiant SL165z G7 Servers            @sommese_d8copt_96GB 2      16          Dual Eight-Core     96GB

Hauenstein (New)

Description                              Host Group Name      Nodes  Cores/Node  Node Specification  Memory/Node
Dell R815 AMD "Abu Dhabi" Servers        @hauenstein_q16copt  4      64          Quad 16-core        128GB

Hauenstein (From NC State)

Description                              Host Group Name      Nodes  Cores/Node  Node Specification  Memory/Node
Dell R815 AMD "Abu Dhabi" Servers        @hauenstein_q16copt  12     64          Quad 16-core        128GB


MODULES

The software environment on the Bertini cluster is managed with modules. You can modify your programming and application environment by loading and unloading the required modules. The most useful module commands are:

module avail          (view the complete module list)
module load xyz       (load the module xyz)
module unload xyz     (unload the module xyz)
module list           (list currently loaded modules)
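
For example, to set up a Bertini run you would load the bertini module and then confirm that it is active (the module name bertini is taken from the job script template below; check module avail for the exact names available on the cluster):

> module load bertini
> module list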


JOB SCRIPTS

Jobs are submitted to the compute nodes via the Univa Grid Engine (UGE) batch submission system. Basic UGE batch scripts should conform to the following template:

#!/bin/csh

#$ -M netid@nd.edu      # Email address for job notification
#$ -m bea               # Send mail when job begins, ends and aborts
#$ -pe mpi-64 128       # Specify parallel environment and legal core size
#$ -q long@@bertini     # Specify queue
#$ -N job_name          # Specify job name

module load bertini     # Required modules

mpiexec -np $NSLOTS bertini input_file    # Application to execute


PARALLEL ENVIRONMENTS

Parallel job scripts must request a parallel environment for execution:

1. smp        (parallel jobs running within a single machine)
2. mpi-64     (parallel jobs running across multiple 64-core machines)
3. mpi-16     (parallel jobs running across multiple 16-core machines)
4. mpi-8      (parallel jobs running across multiple 8-core machines)

Note: If no parallel environment is requested (i.e. you do not specify a -pe parameter), then the default execution environment is a single-core serial job.

When requesting a parallel environment you must also specify a valid core size. Legal core sizes for the parallel environments are:

smp        (1, 2, 3, ..., 64)
mpi-64     (a multiple of 64)
mpi-16     (a multiple of 16)
mpi-8      (a multiple of 8)
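
For instance, a minimal sketch of a job script requesting the smp parallel environment with 8 cores on a single machine (the queue and module names are carried over from the template above; adjust the core count to any legal value up to 64):

#!/bin/csh

#$ -M netid@nd.edu      # Email address for job notification
#$ -m bea               # Send mail when job begins, ends and aborts
#$ -pe smp 8            # Single-machine parallel environment, 8 cores
#$ -q long@@bertini     # Specify queue
#$ -N smp_job           # Specify job name

module load bertini     # Required modules

mpiexec -np $NSLOTS bertini input_file    # $NSLOTS expands to the 8 requested cores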

JOB SUBMISSION AND MONITORING

Job scripts can be submitted to the UGE batch submission system using the qsub command:

> qsub job.script

Once your job script is submitted, you will receive a numerical job id from the batch submission system, which you can use to monitor the progress of the job.
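
For example, a submission typically echoes a confirmation of the following form (the job id shown here is illustrative):

> qsub job.script
Your job 123456 ("job_name") has been submitted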

Well-formed Job Scripts

Job scripts that UGE determines to have made valid resource requests enter the queuing system in the queue-waiting (qw) state; once the requested resources become available, the job enters the running (r) state. Job scripts determined to be invalid enter the queuing system in the error queue-waiting (Eqw) state.

To see the running/queued status of your job submissions, invoke the qstat command with your username (netid) and observe the state column:

> qstat -u username
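
An illustrative (abbreviated) listing for a running job might look like the following; the exact columns on the cluster may differ slightly:

job-ID  prior    name      user   state  submit/start at      queue                       slots
123456  0.50500  job_name  netid  r      01/15/2024 10:00:00  long@q16copt003.crc.nd.edu  128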

For a complete overview of your job submission, invoke the qstat command with the job id:

> qstat -j job_id

Note: The main reasons for invalid job scripts (i.e. having Eqw status) typically are:

  1. Illegal specification of parallel environments and/or core size requests
  2. Illegal queue specification
  3. Copying job scripts from a Windows OS environment to the Linux OS environment on the front-end machines (invisible Windows control codes, such as carriage returns, are not parsed correctly by UGE). This can be fixed by running the script through the dos2unix command, as shown below
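
A minimal fix for the third case, run on the front-end machine before resubmitting (dos2unix converts the file in place):

> dos2unix job.script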

Job Deletion

To delete a running or queued job (including submissions stuck in the Eqw state), use the following command:

> qdel job_id

Job Resource Monitoring

To better understand the resource usage (e.g. memory, CPU, and I/O utilization) of your running jobs, you can monitor the runtime behavior of your job's tasks as they execute on the compute nodes.

To determine the nodes on which your tasks are running, enter the following qstat command along with your username. Record the machine names (e.g. q16copt003.crc.nd.edu) associated with each task (both MASTER and SLAVE):

> qstat -u username -g t
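
In the task-level listing, the queue column carries the node name and the master column distinguishes the MASTER task from the SLAVE tasks. An illustrative (abbreviated) excerpt:

job-ID  prior    name      user   state  queue                       master
123456  0.50500  job_name  netid  r      long@q16copt003.crc.nd.edu  MASTER
123456  0.50500  job_name  netid  r      long@q16copt003.crc.nd.edu  SLAVE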