Compute Element RPM Installation
The installation of the Compute Element (CE) software makes your cluster available to users within SURAgrid and any other VOs you choose to support. In this example we follow the OSG Release3 Compute Element Documentation and assume a Torque (PBS) installation. The OSG documentation provides details on supporting other schedulers: SGE, LSF, and Condor. Rather than reproduce every step of the OSG document, we comment only on the steps where extra guidance helps with your installation of the CE software. You will need the following before you begin:
- A personal certificate - Requesting Personal Certificates
- Certificates for host, http, rsv - Requesting Certificates
- Registration of your resource in OIM - OIM_Registration
- A batch scheduler, either on your CE or accessible by your CE.
- Root access to an x86_64 real or virtual machine running RHEL5 or RHEL6 (or CentOS/SL) with current updates. Minimum 1GB RAM and 8GB disk. FQDN in DNS with a static IP. We'll assume RHEL6 for this example.
- Firewall port openings for GRAM (2119/tcp), GridFTP(2811/tcp), and callback ports. You can choose the callback range. E.g., 20000-24999/tcp. These will need to be opened on your campus firewall as well as in /etc/sysconfig/iptables on the CE host.
- User accounts: apache (uid 48) and tomcat (uid 91) on your CE; rsv, sura000-sura050, and sgvoadmin on your CE, head node, and compute nodes, with consistent uids across all hosts. The sgvoadmin account is used for maintaining software packages such as R and Octave.
- Decide on locations for $OSG_APP and $OSG_DATA. These should be accessible from every node.
Important: set umask 022 before installing, to ensure that everything is installed with proper permissions.
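As an illustration of the firewall prerequisite above, the corresponding rules in /etc/sysconfig/iptables might look like the following sketch. The 20000-24999 callback range is just the example range given earlier; substitute whatever range you chose.

```
# GRAM gatekeeper
-A INPUT -m state --state NEW -m tcp -p tcp --dport 2119 -j ACCEPT
# GridFTP control channel
-A INPUT -m state --state NEW -m tcp -p tcp --dport 2811 -j ACCEPT
# Globus callback ports (must match GLOBUS_TCP_PORT_RANGE)
-A INPUT -m state --state NEW -m tcp -p tcp --dport 20000:24999 -j ACCEPT
```

Remember to restart iptables after editing, and to request the same openings on your campus firewall.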
Following the OSG Release3 Compute Element Documentation, carry out the preliminary steps through Section 5.
If you haven't done so already, install your certificates as described in Requesting Certificates.
In section 6, choose the package for your batch system. For a base RHEL6 installation, this step may include over 350 package dependencies, over 250MB of downloads, and over 600MB of installed packages.
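Assuming the OSG yum repositories are already configured per the preliminary steps, and Torque/PBS as the batch system in this example, the install command would be along the lines of:

```
yum install osg-ce-pbs
```

Substitute the metapackage matching your scheduler (Condor, LSF, or SGE) as described in the OSG document.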
Section 7.1 describes the configuration of the OSG stack. Go to /etc/osg/config.d/ and edit the following files:
- 10-misc.ini - set gums_host = null.yourdomain.edu. We're not using GUMS in this example.
- 10-storage.ini - set app_dir, data_dir, worker_node_temp, site_read (usually the same as data_dir), and site_write (usually the same as data_dir). Permissions on app_dir and data_dir should be 1777.
- 20-pbs.ini (or 20-your_scheduler.ini)
- 30-cemon.ini - set enabled = False if your CE is not yet registered in OIM
- 30-gip.ini - set batch = yourscheduler, advertise_gsiftp = TRUE, gsiftp_host = your-ce-name.yourdomain.edu. Create [Subcluster ALLNODES] and describe your cluster.
- 30-gratia.ini - set enabled = False if your CE is not yet registered in OIM
- 40-siteinfo.ini - fill in all your site info, which should correlate to your OIM registration.
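As a sketch of the [Subcluster ALLNODES] section mentioned above, a hypothetical 8-node cluster might be described like this in 30-gip.ini. All hardware values here are illustrative placeholders; check the comments in the distributed 30-gip.ini for the full set of subcluster options.

```
[Subcluster ALLNODES]
; Placeholder values - describe your own hardware here.
name = ALLNODES
node_count = 8
cpu_vendor = Intel
cpu_model = Xeon E5620
cpus_per_node = 2
cores_per_node = 8
ram_mb = 24576
inbound_network = FALSE
outbound_network = TRUE
```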
Run osg-configure -v to verify your settings, then osg-configure -c to apply the configuration.
Sections 7.2-7.4 deal with scheduler-specific items. There's nothing in particular about PBS for our case.
Section 8 presents two authentication options: GUMS and edg-mkgridmap. For our example case, we will use edg-mkgridmap, which is a script used to generate /etc/grid-security/grid-mapfile.
Not present in the OSG documentation at the time of this writing is a step required to disable GUMS. In /etc/grid-security/gsi-authz.conf, comment out the existing globus_mapping line:
#globus_mapping liblcas_lcmaps_gt4_mapping.so lcmaps_callout
Next, save a copy of the distributed /etc/edg-mkgridmap.conf.
cp -p /etc/edg-mkgridmap.conf /etc/edg-mkgridmap.conf.dist
Edit /etc/edg-mkgridmap.conf so that it contains the following lines:
#### GROUP: group URI [lcluser]
#
#-------------------
group vomss://voms.hpcc.ttu.edu:8443/voms/suragrid AUTO
group vomss://voms.hpcc.ttu.edu:8443/voms/suragrid?/suragrid/sgvoadmin sgvoadmin
gmf_local /etc/grid-security/local-grid-mapfile
The AUTO definition for lcluser above will use a /usr/libexec/edg-mkgridmap/local-subject2user script to generate local usernames for a given grid certificate's subject DN. This script does not exist, so you will need to create it. A simple example is given below.
cat << 'EOT' > /usr/libexec/edg-mkgridmap/local-subject2user
#!/bin/bash
#
# Simple local-subject2user script called by edg-mkgridmap.
# Persistent mappings are stored in $mapfile - sufficient
# for small VOs. Last used index is stored in $lastfile.
# Could extend this to a simple sqlite3 DB.
#
mapfile=/usr/local/etc/subject2user.lis
lastfile=/usr/local/etc/subject2user.last
dn="$1"
base="sura"
fmt="%03d"
test -f $mapfile || touch $mapfile
test -f $lastfile || echo -1 > $lastfile
last=`cat $lastfile`
if fgrep -q "$dn" $mapfile; then
    # Known DN: return the previously assigned username.
    fgrep "$dn" $mapfile | awk '{print $NF}'
else
    # New DN: allocate the next username and record the mapping.
    last=$((last+1))
    user=`printf "%s$fmt" $base $last`
    echo $user
    echo "$dn" $user >> $mapfile
    echo $last > $lastfile
fi
exit 0
EOT
chmod u+x /usr/libexec/edg-mkgridmap/local-subject2user
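The allocation logic can be smoke-tested in isolation before wiring it into edg-mkgridmap. This sketch re-implements the same logic as a function against temporary state files instead of the real /usr/local/etc paths; the DNs are made-up examples.

```shell
#!/bin/bash
# Sanity-check the subject-to-user mapping logic with throwaway files.
mapfile=$(mktemp)
lastfile=$(mktemp)
echo -1 > "$lastfile"
base="sura"
fmt="%03d"

map_dn() {
    dn="$1"
    if fgrep -q "$dn" "$mapfile"; then
        # Known DN: return the previously assigned username.
        fgrep "$dn" "$mapfile" | awk '{print $NF}'
    else
        # New DN: allocate the next username and record it.
        last=$(( $(cat "$lastfile") + 1 ))
        user=$(printf "%s$fmt" "$base" "$last")
        echo "$dn $user" >> "$mapfile"
        echo "$last" > "$lastfile"
        echo "$user"
    fi
}

map_dn "/DC=org/O=Example/CN=Alice"   # sura000
map_dn "/DC=org/O=Example/CN=Bob"     # sura001
map_dn "/DC=org/O=Example/CN=Alice"   # sura000 again - mapping persists
```

Each distinct DN gets the next sequential username, and repeated lookups of the same DN return the same name, which is the behavior edg-mkgridmap needs.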
Next, create a new file, /etc/grid-security/local-grid-mapfile containing the mapping for your RSV user:
"/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=rsv/your-ce.yourdomain.edu" rsv
Run edg-mkgridmap and check the file /etc/grid-security/grid-mapfile.
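After a successful run, /etc/grid-security/grid-mapfile should contain one line per VO member plus your local entries. Hypothetical entries (made-up DNs) might look like:

```
"/DC=org/DC=example/O=SURAgrid/CN=Jane Doe" sura000
"/DC=org/DC=example/O=SURAgrid/CN=John Smith" sura001
"/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=rsv/your-ce.yourdomain.edu" rsv
```

The sura* names come from the local-subject2user script, and the rsv line comes from your local-grid-mapfile.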
Go ahead and install RSV per Section 9.
By default, Gratia will report all job accounting records to OSG, including those of your local, non-grid users. To send only grid accounting data to OSG, edit /etc/gratia/pbs-lsf/ProbeConfig and set the following:
Globus Callback Ports
The port range defined in 40-network.ini needs to be set in some additional files.
echo export GLOBUS_TCP_PORT_RANGE=begin_port,end_port >> /etc/sysconfig/globus-gatekeeper
echo export GLOBUS_TCP_PORT_RANGE=begin_port,end_port >> /var/lib/osg/globus-firewall
echo export GLOBUS_TCP_PORT_RANGE=begin_port,end_port >> /etc/profile.d/globus-firewall.sh
echo setenv GLOBUS_TCP_PORT_RANGE begin_port,end_port >> /etc/profile.d/globus-firewall.csh
Services and Cron Scripts
If you have not yet registered your resource with OIM, or if your resource has not yet been approved, go ahead and disable the reporting cron scripts. Comment out the scripts in /etc/cron.d/gratia* and /etc/cron.d/osg-info-services.
Start and enable the services as described in Sections 10 and 11. Do not start or enable tomcat5 or gratia-probes-cron if your site is not registered in OIM.
When your site's GIP scripts begin publishing data to the OSG BDII, they will advertise job managers in the form ce.yourdomain.edu/jobmanager-pbs-queuename. However, if a user tries to submit a job to that resource, your CE will reject it, claiming that the resource doesn't exist. One way to publish this queue and have it handled as a legitimate resource is to add it to the /etc/grid-services/ directory. For this example, let's assume that your queue is named "suragrid" and you're running Torque. The changes for other job managers should be similar.
cd /etc/grid-services/available
cp -p jobmanager-pbs-seg jobmanager-pbs-suragrid-seg
# - or, if -seg files aren't present -
cp -p jobmanager-pbs jobmanager-pbs-suragrid
cd /etc/grid-services
ln -s available/jobmanager-pbs-suragrid-seg jobmanager-pbs-suragrid-seg
ln -s jobmanager-pbs-suragrid-seg jobmanager-pbs-suragrid
If your installation does not have -seg files you can just make the analogous symlinks using the regular files.
Next, you'll need to add a Perl handler for the jobmanager-pbs-suragrid resource. Our version of Perl is 5.8.8, as shown in the directory below; yours may differ.
cd /usr/lib/perl5/vendor_perl/5.8.8/Globus/GRAM/JobManager
cp -p pbs.pm pbs-suragrid.pm
vi pbs-suragrid.pm
# - Near the top, change JobManager::pbs; to JobManager::pbs-suragrid;
# - Search for the section of code containing "PBS -q". Replace the
#   entire if section starting at:
#       if ($description->queue() ne ...
#   with:
#       print JOB "#PBS -q suragrid\n";
Try it out:
service globus-gatekeeper start
service globus-gridftp-server start
voms-proxy-init -voms suragrid
globus-job-run ce.yourdomain.edu/jobmanager-pbs-suragrid /bin/env