Opportunistic High Energy Physics

On this webpage, we present a (rough) guide to software and data access for NDCMS on the opportunistic CRC condor cluster at Notre Dame. Since the dedicated NDCMS cluster has only 800 cores available for analysis, it is important to be able to use the opportunistic systems as well (roughly 5000-6000 cores are available).

Background

Setting your .cshrc/.bashrc file for use with Parrot

With CSH:

#Check http://crc.nd.edu/wiki for login problems
#Contact crcsupport@nd.edu if further problems

if ( -r /opt/crc/Modules/current/init/csh ) then
        source /opt/crc/Modules/current/init/csh
endif

if ( $?PARROT_ENABLED ) then
    set prompt = " (Parrot) %n@%m%~%# "
else
    set prompt = " %n@%m%~%# "
endif

#Needed in order for parrot's fork, exec, etc. to work correctly.
setenv PARROT_HELPER /afs/nd.edu/user37/ccl/software/cctools/bin/parrot_helper.so

#gives you parrot.
setenv PATH /afs/nd.edu/user37/ccl/software/cctools/bin:$PATH

#needed so that parrot knows where to look for the repositories
setenv HTTP_PROXY "http://ndcms.crc.nd.edu:3128"

#CMS software location in /cvmfs
setenv VO_CMS_SW_DIR /cvmfs/cms.cern.ch

#Needed in order to find info about SITECONFIG
setenv CMS_LOCAL_SITE T3_US_NotreDame

#CVS access without a CMS CERN account: undergraduate students, REU
setenv CVSROOT :pserver:anonymous@cmssw.cvs.cern.ch:/local/reps/CMSSW

With BASH:

# Check http://crc.nd.edu/wiki for login problems
# Contact crcsupport@nd.edu if further problems

if [ -f /opt/crc/Modules/current/init/sh ]; then
        source /opt/crc/Modules/current/init/sh
fi

# Needed in order for parrot's fork, exec, etc. to work correctly.
export PARROT_HELPER=/afs/nd.edu/user37/ccl/software/cctools/bin/parrot_helper.so

# Gives you parrot.
export PATH=/afs/nd.edu/user37/ccl/software/cctools/bin:$PATH

# Needed so that parrot knows where to look for the repositories
export HTTP_PROXY="http://ndcms.crc.nd.edu:3128"

# CMS software location in /cvmfs
export VO_CMS_SW_DIR=/cvmfs/cms.cern.ch

# Needed in order to find info about SITECONFIG
export CMS_LOCAL_SITE=T3_US_NotreDame

# CVS access without a CMS CERN account: undergraduate students, REU
export CVSROOT=":pserver:anonymous@cmssw.cvs.cern.ch:/local/reps/CMSSW"

Files Needed

In addition to the copies on this page, these scripts exist in ~dskeehan/Public, accessible to anyone on AFS. Place these scripts in your home directory, and untar the grid.tgz tarball into your home directory as well. Bash versions of these scripts also exist in ~dskeehan/Public.

grid.tgz

(Contains the grid software, taken from /cvmfs/grid.cern.ch. Update it when new versions come out.)

cmssw_init.csh

(Initialization of the CMS environment to enable commands from /cvmfs/cms.cern.ch)

#!/bin/csh

#modules needed in order to make sure dependencies are correct
module load gcc/4.6.2

#some environmental variables must be set in order to use the cmssw and grid software
setenv SCRAM_ARCH slc5_amd64_gcc462

#takes care of cmssw setup.
source $VO_CMS_SW_DIR/cmsset_default.csh

#takes care of crab setup
source /scratch365/ndcms/crab/current/crab.csh


grid_init.csh

(Initialization of the grid environment to enable the grid commands)

#!/bin/csh

#takes care of grid-ui setup. Got this from the grid.cern.ch repository.
#as new grid versions come out this will need to be upgraded.
setenv GRID_SW_DIR ~
source $GRID_SW_DIR/grid.new/grid.cern.ch/3.2.11-1/etc/profile.d/grid-env.csh

#as parrot support becomes better, eventually it will be fast enough to grab directly from /cvmfs/
#source /cvmfs/grid.cern.ch/3.2.11-1/etc/profile.d/grid-env.csh


setup.csh

(Updates the CRLs in the grid-security directory from the /cvmfs/grid.cern.ch repository)

#!/bin/csh

#This portion copies over the updated (daily) grid-security directory that currently resides in /cvmfs/grid.cern.ch.
#It does this because cvmfs publishes Certificate Revocation Lists (CRLs) daily, and these need to be current.
#If they are not updated, VOMS will not recognize your certificate.

parrot_run cp -avr /cvmfs/grid.cern.ch/etc/grid-security/certificates .
rm -rf ~/grid.new/grid.cern.ch/etc/grid-security/certificates
cp -avr ~/certificates ~/grid.new/grid.cern.ch/etc/grid-security
rm -rf ~/certificates

job_fixer.pl

(Fixes the jdl submission script that crab creates in order to run in an opportunistic setting)


#!/usr/bin/perl


#HOW TO USE THIS SCRIPT
#PLEASE RUN IN TOP LEVEL DIRECTORY FOR EACH CRAB JOB
#FIRST ARGUMENT SHOULD BE THE NAME OF THE JDL FILE THAT SHOULD BE MODIFIED
#FOR EXAMPLE: perl job_fixer.pl ./share/.condor_temp/name.jdl
#SCRIPT WILL CREATE A NEW JDL FILE WITH THE ENDING .OPPORTUNISTIC.
#THIS SUBMISSION SCRIPT CAN BE USED WITH THE OPPORTUNISTIC POOL
#TO SUBMIT, NAVIGATE TO THE JDL LOCATION AND RUN: condor_submit name.jdl.opportunistic

use strict;
use warnings;
use Cwd;

my $local_dir = getcwd;
my $arg;
my $file = $ARGV[0];

print "$local_dir\n";

rename $file, "$file.old";

open (my $in, '<', "$file.old") || die "Can't open $file.old for reading!";
open (my $out, '>', "$file.opportunistic") || die "Can't open $file for editing!";

while (<$in>)
{
	if (/Arguments\s\s=\s[0-9]+\s[0-9]+/) {
		$arg = substr($_, 13);
		print $out "Arguments  = -t cache $local_dir/job/CMSSW.sh $arg";  }
	elsif (/Executable/) {
		print $out "Executable = /afs/nd.edu/user37/ccl/software/cctools/bin/parrot_run\n"; }
	elsif (/environment = CONDOR_ID/) {
		print $out $_;
		print $out "getenv = true\n"; }
	elsif (/stream_output/) {
		print $out "stream_output = true\n";}
	elsif (/stream_error/) {
		print $out "stream_error = true\n"; }
	elsif (/when_to_transfer_output/) {
		print $out "when_to_transfer_output = ON_EXIT_OR_EVICT\n"; }
	else {
		print $out $_; }
}

close ($in) || die "Can't close $file.old!";
close ($out) || die "Can't close $file!";

cmssw_fixer.pl

(Fixes the CMSSW.sh script to allow condor to serve as middleware)

#!/usr/bin/perl


#HOW TO USE THIS SCRIPT
#PLEASE RUN IN TOP LEVEL DIRECTORY FOR EACH CRAB JOB
#FIRST ARGUMENT SHOULD BE THE CMSSW.SH SCRIPT
#FOR EXAMPLE: perl cmssw_fixer.pl ./job/CMSSW.sh
#SCRIPT WILL MODIFY CMSSW.sh TO ALLOW CONDOR TO SERVE AS THE MIDDLEWARE BATCH ENGINE.
use strict;
use warnings;
use Cwd;

my $local_dir = getcwd;
my $file = $ARGV[0];

rename $file, "$file.old";

open (my $in, '<', "$file.old") || die "Can't open $file.old for reading!";
open (my $out, '>', "$file") || die "Can't open $file for editing!";

while (<$in>) 
{
	if (/middleware == PBS/) {
		print $out 'elif [ $middleware == PBS ] || [ $middleware == CONDOR ]; then' . "\n"; }
	else {
		print $out $_;}
}

close ($in) || die "Can't close $file.old!";
close ($out) || die "Can't close $file!";

chmod 0700, $file;

Setting up the CMSSW working directory

This process is identical to what is done on earth (and should be done on earth!), and any previous setup from earth should work fine on any of the opportunistic head nodes.

Getting your environment ready for opportunistic computation

  1. Run ./setup.csh. This will take a few minutes, but only needs to be done once every couple of days in order to keep the CRLs up to date in your local directory.
  2. Run source grid_init.csh. This will only take about a second. This sets up your environment to use the grid commands. Make sure you have a valid proxy by running
     grid-proxy-init -valid 192:00 
     voms-proxy-init -voms cms -valid 192:00 
  3. Run
     parrot_run tcsh 
    This will give you a parrot enabled shell. You can now access /cvmfs/ and its related repositories.
  4. Run
     source cmssw_init.csh 
    It should take a few seconds, and it will give you access to the scram, crab, etc. commands.
  5. Navigate to your local CMSSW directory. You now need to edit both the jdl submission script and the CMSSW.sh runtime script in order for the worker node setup to work. Navigate to the .condor_temp in the dataset/share/ directory you wish to change. If this directory doesn't exist, have crab create it by running
     multicrab -c crab_directory -submit 
    Once the job has been created, kill it by running condor_rm username.
  6. Run
     perl job_fixer.pl ./share/.condor_temp/username+dataset.jdl 
    This will edit the jdl and produce two files: *.jdl.old and *.jdl.opportunistic.
  7. Run
     perl cmssw_fixer.pl ./job/CMSSW.sh 
    This will edit CMSSW.sh, producing CMSSW.sh.old and a new CMSSW.sh.

Setup should now be complete. In the future, a wrapper script will be available to call both job_fixer and cmssw_fixer together for each dataset directory automatically.
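Until that wrapper exists, a minimal bash sketch of one is shown below. This is hypothetical, not part of the distributed tools: it assumes job_fixer.pl and cmssw_fixer.pl sit in your home directory (as described under Files Needed) and that each dataset directory follows the share/.condor_temp and job/CMSSW.sh layout described above.

```shell
#!/bin/bash
# Hypothetical wrapper: for each dataset directory given on the command line,
# run job_fixer.pl on the first .jdl found in share/.condor_temp, then
# cmssw_fixer.pl on job/CMSSW.sh. Assumes both fixer scripts live in $HOME.

fix_dataset() {
    local dir="$1"
    local jdl
    jdl=$(ls "$dir"/share/.condor_temp/*.jdl 2>/dev/null | head -n 1)
    if [ -z "$jdl" ]; then
        echo "no .jdl in $dir/share/.condor_temp -- submit with multicrab first" >&2
        return 1
    fi
    ( cd "$dir" &&
      perl "$HOME/job_fixer.pl" "share/.condor_temp/$(basename "$jdl")" &&
      perl "$HOME/cmssw_fixer.pl" ./job/CMSSW.sh )
}

for d in "$@"; do
    fix_dataset "$d"
done
```

Run it once per dataset directory, e.g. ./fix_all.sh dataset1 dataset2, then submit the resulting .jdl.opportunistic files as usual.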

Submitting jobs

In order to submit jobs, navigate to the dataset/share/.condor_temp/ directory again. You can now run

 condor_submit username+dataset.jdl.opportunistic 

If everything is set up correctly, the condor scheduler should take it from here.

Getting output

Output currently resides on /store/username.

How to best use this system

Analysis of runtime eviction conditions coming soon!

Troubleshooting

Help! Parrot isn't responding!

It's possible that a parrot process currently has a lock on the cvmfs directory. Currently, each user can only maintain one process lock per repository. Check whether this is the case by issuing

ps aux | grep parrot

If there is a rogue parrot process, kill it!
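The check and the kill can be combined into one small snippet. This assumes pgrep (from procps) is available on the head node, which is typical for the CRC Linux machines:

```shell
#!/bin/bash
# Find and kill any of your own stray parrot processes.
kill_stray_parrot() {
    local pids
    pids=$(pgrep -u "${USER:-$(id -un)}" -f parrot_run)
    if [ -n "$pids" ]; then
        echo "killing stray parrot processes: $pids"
        kill $pids
    else
        echo "no stray parrot processes found"
    fi
}
kill_stray_parrot
```

Matching on the user's own processes avoids accidentally signalling someone else's parrot_run on a shared head node.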

That didn't work!

It's also possible that the parrot local cache on the machine you're using is corrupted. To fix this, navigate to /tmp and delete your parrot.local.XXXXXX directory.
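For example, the following sketch deletes only the parrot.local.* caches in /tmp that you own, so it is safe to run on a shared machine:

```shell
#!/bin/bash
# Remove only your own parrot local caches from /tmp.
clear_parrot_cache() {
    find /tmp -maxdepth 1 -user "${USER:-$(id -un)}" -name 'parrot.local.*' \
         -exec rm -rf {} + 2>/dev/null
    echo "parrot caches cleared"
}
clear_parrot_cache
```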

My jobs are held!

To see why your jobs are held, refer to this documentation (LINK HERE)

My jobs finish very quickly, but it doesn't seem like it worked!

Check the stdout and stderr in the .condor_temp directory. This gives excellent insight into why some jobs might have failed.
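A small helper like the following makes skimming those logs quicker. It is hypothetical, and the *.stderr naming is an assumption; adjust the glob to whatever file names you actually see in your .condor_temp directory:

```shell
#!/bin/bash
# Print the tail of every stderr log under a crab .condor_temp directory.
show_job_errors() {
    local dir="${1:-share/.condor_temp}"
    local f found=0
    for f in "$dir"/*.stderr; do
        [ -e "$f" ] || continue
        found=1
        echo "== $f =="
        tail -n 5 "$f"
    done
    [ "$found" -eq 1 ] || echo "no stderr files in $dir"
}
show_job_errors "$@"
```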

Further Reference

CCTools: (insert link)
CVMFS: (insert link)

--Dskeehan 16:23, 16 August 2013 (EDT)