Frequently Asked Questions






Users

Condor C Submission

http://www.t2.ucsd.edu/twiki2/bin/view/UCSDTier2/FkwUafCondorC

Specifying how many CPUs your job requests

Add the following line to your Condor submit file:


+request_cpus=8
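A minimal sketch of a complete submit description carrying that line, written from a shell function. The universe, executable and output/log file names are hypothetical placeholders, not site settings.

```shell
#!/bin/bash
# Write a minimal Condor submit description that requests $2 CPUs (default 8).
# universe, executable and the output/error/log names are hypothetical placeholders.
write_submit_file() {
  local out=$1 ncpus=${2:-8}
  cat > "$out" <<EOF
universe   = vanilla
executable = myjob.sh
output     = myjob.out
error      = myjob.err
log        = myjob.log
+request_cpus=${ncpus}
queue
EOF
}

# Usage: write_submit_file myjob.sub 8 && condor_submit myjob.sub
```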

Information

Site Readiness

http://lhcweb.pic.es/cms/SiteReadinessReports/SiteReadinessReport.html

Static Condor Information

Static Condor information updated every few minutes

http://www.t2.ucsd.edu/condorstatus/

Cacti

Cacti Plots

https://sentry.t2.ucsd.edu/cacti/

Ganglia Plots

Ganglia Plots

http://t2gw02.t2.ucsd.edu/ganglia/

Glidein Mon

http://glidein-mon.t2.ucsd.edu/ucsd/overview.html

Administrative Questions

How to generate a grid host cert

After backing up the old cert files, run the following:

source $VDT_LOCATION/setup.sh
$VDT_LOCATION/globus/bin/grid-cert-request -host <hostname>
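A sketch of the backup step, assuming the host certificate lives in /etc/grid-security as hostcert.pem and hostkey.pem (the usual location, but check your install). It copies both files into a new timestamped subdirectory and prints that directory's path.

```shell
#!/bin/bash
# Copy hostcert.pem and hostkey.pem from $1 (default /etc/grid-security, an
# assumption) into a new timestamped subdirectory and print its path.
backup_host_certs() {
  local dir=${1:-/etc/grid-security}
  local dest
  dest="$dir/backup-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" || return 1
  local f
  for f in hostcert.pem hostkey.pem; do
    if [ -e "$dir/$f" ]; then
      # -p keeps mode and timestamps (and ownership when run as root),
      # so the copied key stays private
      cp -p "$dir/$f" "$dest/" || return 1
    fi
  done
  echo "$dest"
}

# Usage: backup_host_certs /etc/grid-security
```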



How do I adjust the memory parameter on the kernel command line in a Rocks CD-ROM kernel roll?

This fix allows Rocks to create very large RAID partitions on 64-bit machines.

Edit the following file:

rocks/src/roll/kernel/src/rocks-boot/enterprise/4/images/x86_64/isolinux.cfg

change:

label internal
       kernel vmlinuz
       append ramdisk_size=150000 initrd=initrd.img devfs=nomount ks ksdevice=eth0 kssendmac selinux=0

to:

label internal
       kernel vmlinuz
       append ramdisk_size=150000 initrd=initrd.img devfs=nomount ks ksdevice=eth0 kssendmac selinux=0 mem=1024M
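The same edit can be scripted with GNU sed. A sketch, assuming the stanza starts with a line beginning label internal as shown above: it appends mem=1024M to the append line of that stanza only and does nothing if mem= is already present, so it is safe to run twice.

```shell
#!/bin/bash
# Append mem=<value> to the "append" line of the "label internal" stanza in
# an isolinux.cfg (GNU sed -i). Lines that already contain mem= are left alone.
add_mem_param() {
  local cfg=$1 mem=${2:-1024M}
  # Range: from "label internal" up to the next "label" line
  sed -i '/^label internal/,/^label /{/^[[:space:]]*append /{/mem=/!s/$/ mem='"$mem"'/;};}' "$cfg"
}

# Usage: add_mem_param rocks/src/roll/kernel/src/rocks-boot/enterprise/4/images/x86_64/isolinux.cfg 1024M
```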

then rebuild the kernel roll:

   # cd rocks/src/roll/kernel
   # make roll 



Why does /dev not appear with the right files on RHEL4 in chroot?

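On RHEL4, /dev is not an ordinary directory but a separately mounted file system, so a chroot does not see its contents unless you bind-mount it:

mount --bind /dev /sysroot/dev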



LCG VOMSRS

https://lcg-voms.cern.ch:8443/vo/cms/vomrs



Complete Listing of OSG Configuration Variables

A complete list of OSG Configuration variables

https://twiki.grid.iu.edu/twiki/bin/view/Main/OSGConfigurationParameters

Retrieving ERT Times for the Site

Log in to a gatekeeper, source the VDT setup script, and run:

ldapsearch -xLLL -h is.grid.iu.edu:2170 -b mds-vo-name=UCSDT2,mds-vo-name=local,o=grid '(&(GlueCEAccessControlBaseRule=VO:cms)(GlueCEUniqueID=*))' GlueCEStateEstimatedResponseTime
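A small sketch for reading that output: it unfolds LDIF continuation lines (ldapsearch wraps long DNs) and prints each CE ID with its estimated response time. It assumes each entry's DN begins with GlueCEUniqueID=, as the filter above suggests.

```shell
#!/bin/bash
# Read "ldapsearch -LLL" output on stdin and print "<CE unique ID> <ERT>" pairs.
ert_values() {
  awk '
    # Handle one complete (unfolded) LDIF line
    function emit(l,   ce) {
      if (l ~ /^dn: /) {
        ce = l
        sub(/^dn: GlueCEUniqueID=/, "", ce)
        sub(/,.*/, "", ce)          # keep only the first RDN value
        current = ce
      } else if (l ~ /^GlueCEStateEstimatedResponseTime: /) {
        sub(/^GlueCEStateEstimatedResponseTime: /, "", l)
        print current, l
      }
    }
    /^ / { buf = buf substr($0, 2); next }   # continuation of a wrapped line
    { if (buf != "") emit(buf); buf = $0 }
    END { if (buf != "") emit(buf) }
  '
}

# Usage: append "| ert_values" to the ldapsearch command above
```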



How to Output the Text form of Grid Certificates (Host Cert)

openssl x509 -in cert.pem -text
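For a quick look at just the fields that usually matter (subject, issuer, expiry), openssl can print them directly, and -checkend tests whether a certificate expires within a given window. A small sketch; the 30-day window in the usage line is an arbitrary example.

```shell
#!/bin/bash
# Print the subject, issuer and expiry date of a PEM certificate.
cert_summary() {
  openssl x509 -in "${1:-cert.pem}" -noout -subject -issuer -enddate
}

# Succeed (exit 0) if the certificate expires within $2 days.
# openssl's -checkend exits 0 when the cert will NOT expire in that window,
# so the result is negated (an unreadable file therefore also counts as expiring).
cert_expires_within() {
  local cert=$1 days=$2
  ! openssl x509 -in "$cert" -noout -checkend $((days * 86400)) >/dev/null
}

# Usage: cert_summary /etc/grid-security/hostcert.pem
#        cert_expires_within /etc/grid-security/hostcert.pem 30 && echo "renew soon"
```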

Converting PEM x509 Format Files to P12(DER) for import into Mozilla/Firefox

Log in to the machine that holds your x509 PEM certificate files.

cd ~/.globus
openssl pkcs12 -in foo.pem -inkey bar.pem -export -out foo.p12 

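Here foo.pem is your certificate and bar.pem your private key (usercert.pem and userkey.pem in a standard ~/.globus). openssl first asks for the passphrase of your private key, the same one you use with voms-proxy-init, and then for an export password that protects the new file. The resulting foo.p12 can then be imported into your browser.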



Entering a UCSD ACT Customer Service Request

http://blink.ucsd.edu/go/csr



Preserving the RSL file from submitted grid jobs

On the server side, edit $VDT_LOCATION/globus/etc/globus-job-manager.conf and add the line

-save-logfile always

This should preserve the gram_job_mgr log files in the user's home directory for debugging.
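A sketch for making that change from a script without duplicating the line: it replaces an existing -save-logfile setting or appends one. It assumes the file holds one option per line and ends with a newline.

```shell
#!/bin/bash
# Ensure globus-job-manager.conf ($1) contains exactly "-save-logfile always".
set_save_logfile() {
  local conf=$1
  if grep -q '^-save-logfile' "$conf"; then
    # An existing setting (any value) is rewritten in place
    sed -i 's/^-save-logfile.*/-save-logfile always/' "$conf"
  else
    printf '%s\n' '-save-logfile always' >> "$conf"
  fi
}

# Usage: set_save_logfile "$VDT_LOCATION/globus/etc/globus-job-manager.conf"
```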



Updating OSG CA Certificates

http://vdt.cs.wisc.edu/releases/1.6.1/certificate_authorities.html

Then run the following in $VDT_LOCATION:

# pacman -update CA-Certificates





Fixing broken CRLs by hand

The CRLs come with the OSG wn-client installation, which is exported from codefs to all of the worker nodes. Each CRL consists of several files; the one with the .r0 ending is the one updated by cron on codefs.

Over Christmas 2009, three CRLs failed to update properly. Their .r0 files had zero size and could no longer be overwritten by the CRL updater, which seems to refuse to update zero-size files. I therefore copied the three zero-size .r0 files by hand from the VDT client installed on the uaf.

Once the files were copied, all worker nodes worked again.
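A sketch of that manual repair, assuming the live and reference certificate directories are passed as arguments (their exact paths on codefs and the uaf are not given here). Only zero-size .r0 files are replaced, and only from a non-empty reference copy.

```shell
#!/bin/bash
# $1: live certificates directory (the wn-client copy exported from codefs)
# $2: reference certificates directory (e.g. the VDT client on the uaf)
repair_empty_crls() {
  local live=$1 ref=$2 f name
  for f in "$live"/*.r0; do
    [ -e "$f" ] || continue          # no .r0 files at all
    [ -s "$f" ] && continue          # non-empty: leave it alone
    name=$(basename "$f")
    if [ -s "$ref/$name" ]; then
      cp "$ref/$name" "$f" && echo "repaired $name"
    else
      echo "no usable copy of $name in $ref" >&2
    fi
  done
}
```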

Useful Grid Twiki for GRAM Errors

http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq



Installing Ubuntu on a Mac Mini

Using Grub

http://doc.gwos.org/index.php/UbuntuOnApple#Introduction_to_Linux_Installation_on_i386_Mac_Mini



UCSD Testing and Monitoring Links

Links, tools and sites related to monitoring the UCSD Tier2.

UCSDT2 ITB RSV Monitoring Link

https://osg-gw-3.t2.ucsd.edu:8443/rsv/

Query the ITB ReSS and BDII

condor_status -pool osg-ress-4.fnal.gov -constraint 'GlueSiteName=="UCSDT2-ITB1"' -l 

ldapsearch -x -h is-itb.grid.iu.edu -p 2170 -b mds-vo-name=UCSDT2-ITB1,mds-vo-name=local,o=grid   

Checking BDII publishing

Use the following page to check whether UCSD is publishing correctly to the BDII:

http://is.grid.iu.edu/cgi-bin/status.cgi

SAM Test Page

https://twiki.cern.ch/twiki/bin/view/CMS/SAMForCMS

CMS Prod Exit Code Results (CMS)

http://t2.unl.edu/pa/xml/quality_map_query?team=OSG

Job Robot Report

http://jobrobot.web.cern.ch/JobRobot/

VORS Monitoring

http://vors.grid.iu.edu/cgi-bin/index.cgi



SCRAM Template.pm Error

Error:

SCRAM Error: It appears that the module "Template.pm" is not installed. Please check your installaion. If you are an administrator, you can find the Perl Template Toolkit at www.cpan.org or at the web site of the author (Andy Wardley):

Fix: Install perl-Template-Toolkit and supporting packages

Purging CE Jobs

To fully purge the CE of jobs you need to

  1. Remove or move the contents of the condor home area (eg. /state/data/condor_local)
  2. Remove or move the contents of the GRAM area $GLOBUS_LOCATION/tmp/gram_job_state/gram_condor_log.*
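A sketch of the "move" option, which keeps the old state around for inspection. The backup directory is a hypothetical choice; an optional name pattern limits the move, as needed for the gram_condor_log.* files in step 2.

```shell
#!/bin/bash
# Move entries of directory $1 matching pattern $3 (default: everything)
# into a timestamped directory under $2, and print that directory.
move_contents() {
  local src=$1 pattern=${3:-*}
  local dest
  dest="$2/$(basename "$src")-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" || return 1
  find "$src" -mindepth 1 -maxdepth 1 -name "$pattern" -exec mv {} "$dest/" \;
  echo "$dest"
}

# Usage (backup location is hypothetical):
# move_contents /state/data/condor_local /root/ce-purge
# move_contents "$GLOBUS_LOCATION/tmp/gram_job_state" /root/ce-purge 'gram_condor_log.*'
```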


Installing the cert infrastructure only from the VDT

This will install the parts needed to request host certs as well as keep CRLs and CAs up to date on a machine.


#!/bin/bash
# Install only the VDT certificate tools (CA certs, CA/CRL updaters, cert scripts)

# Download and unpack pacman, then load its environment
mkdir -p /data/vdt
cd /data/vdt
wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-3.25.tar.gz
tar zxvf pacman-3.25.tar.gz
chown root:root -R pacman-3.25
cd pacman-3.25
source setup.sh
cd /data/vdt

# Answer the VDT setup questions non-interactively
export VDTSETUP_AGREE_TO_LICENSES=y
export VDTSETUP_ENABLE_ROTATE=y
export VDTSETUP_EDG_CRL_UPDATE=y
export VDTSETUP_CA_CERT_UPDATER=y
export VDTSETUP_INSTALL_CERTS=r

# Treat this host as RHEL 4, then fetch the certificate packages
pacman -pretend-platform:linux-rhel-4
pacman -get http://vdt.cs.wisc.edu/vdt_1101_cache:CA-Certificates
pacman -get http://vdt.cs.wisc.edu/vdt_1101_cache:CA-Certificates-Updater
pacman -get http://vdt.cs.wisc.edu/vdt_1101_cache:PPDG-Cert-Scripts



Condor jobs cannot find /state/data/condor_local/execute/dir_XXXX

Mount order matters: make sure all of the underlying file systems are mounted before the remounts that sit on top of them.

WS Gram Performance Optimization

http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Performance_Guide.html

Resetting priority factors on the Condor cluster

Example: set the priority factor to 100 for every LIGO user on osg-gw-2 through osg-gw-5.

for i in `condor_userprio -all -allusers | grep "@" | awk -F"@" '{print $1}' | grep ligo`; do
  for j in `seq 2 5`; do
    condor_userprio -setfactor ${i}@osg-gw-${j}.t2.ucsd.edu 100
  done
done
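If a user appears under more than one gateway in the userprio output, the loop above sets the same factors once per appearance. A sketch that pulls out the unique user names first; the sample lines in the test are illustrative, not real userprio output from this site.

```shell
#!/bin/bash
# Read condor_userprio output on stdin and print each unique user name
# (the part before "@") that matches the grep pattern in $1.
userprio_names() {
  grep '@' | awk -F'@' '{ sub(/^[ \t]+/, "", $1); print $1 }' | grep -e "$1" | sort -u
}

# Usage: same factors as above, each user/gateway pair set exactly once
# for u in $(condor_userprio -all -allusers | userprio_names ligo); do
#   for j in 2 3 4 5; do
#     condor_userprio -setfactor "${u}@osg-gw-${j}.t2.ucsd.edu" 100
#   done
# done
```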

Local Users Mappings

uscms048
uscms1581
uscms099
uscms1658
uscms1633
uscms076
uscms1586
uscms1285
uscms1674

WS GRAM Errors

Error initializing GAHP

Check that Java is installed and that condor_config correctly points to its location.

Additional CMS Config for OSG

Copy the following files from the old install to the new one:

add-attributes.conf

./lcg/etc/add-attributes.conf

alter-attributes.conf

./lcg/etc/alter-attributes.conf
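A sketch of that copy, assuming the ./lcg/etc paths above are relative to the root of each install and that both roots are passed as arguments (for example the old and new $VDT_LOCATION).

```shell
#!/bin/bash
# Copy the CMS attribute files from old install root $1 to new install root $2.
copy_cms_attrs() {
  local old=$1 new=$2 f
  mkdir -p "$new/lcg/etc" || return 1
  for f in add-attributes.conf alter-attributes.conf; do
    # -p keeps permissions and timestamps of the old copies
    cp -p "$old/lcg/etc/$f" "$new/lcg/etc/$f" || return 1
  done
}

# Usage: copy_cms_attrs /path/to/old/install /path/to/new/install
```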

Getting a slot wn-client environment on a node interactively

Log into a node as root

# chroot /chroot/cafuser1
# su - cafuser1
# source /code/osgcode/wn-client-itb/setup.sh

Rocks Commands

Adding a cabinet to rocks

rocks add appliance cabinet-5 membership="Cabinet 5" short-name='c' node='cab5-compute'

OSG-RSV Commands at UCSD

osg-gw-4

$VDT_LOCATION/osg-rsv/setup/configure_osg_rsv --user rsv --init --server y  --ce-probes --ce-uri "osg-gw-4.t2.ucsd.edu" --srm-probes --srm-uri "srm-3.t2.ucsd.edu" -srm-dir /pnfs/t2.ucsd.edu/data4/cms/phedex/store/user/tmartin --gridftp-probes  --gratia --grid-type "OSG"  --consumers --verbose --setup-for-apache --proxy /tmp/x509up_u59001

osg-gw-2

$VDT_LOCATION/osg-rsv/setup/configure_osg_rsv --user rsv --init --server y  --ce-probes --ce-uri "osg-gw-2.t2.ucsd.edu" --srm-probes --srm-uri "srm-3.t2.ucsd.edu" -srm-dir /pnfs/t2.ucsd.edu/data4/cms/phedex/store/user/tmartin --gridftp-probes  --gratia --grid-type "OSG"  --consumers --verbose --setup-for-apache --proxy /tmp/x509up_u59001

OSG RSV

Testing CA Cert Probe by hand

 su rsv -c "./cacert-crl-expiry-probe -m org.osg.certificates.cacert-expiry -u osg-gw-4.t2.ucsd.edu -x /tmp/x509up_u59001"

Gratia Search Links

https://t2.unl.edu/gratia/xml/dn_efficiency_summary?vo=cms&facility=UCSD&fixed-height=False
https://t2.unl.edu/gratia/xml/dn_wasted_summary?vo=cms&facility=UCSD&fixed-height=False

Making the RAID devices on the nodes by hand

If you need to rebuild the RAID devices by hand:

Create the partitions on the new disk

Stop the existing devices, then re-create them as RAID 0 arrays:

mdadm --stop /dev/md0
mdadm --stop /dev/md1
mdadm --create /dev/md0 --chunk=256 --level=0 --raid-devices=4 /dev/sda2 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm --create /dev/md1 --chunk=256 --level=0 --raid-devices=4 /dev/sda5 /dev/sdb3 /dev/sdc3 /dev/sdd3
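Before making the file systems it is worth confirming that both arrays came up. A sketch that checks /proc/mdstat for an active device; the optional file argument exists only so the check can be tried against a saved copy.

```shell
#!/bin/bash
# Exit 0 if md device $1 (e.g. md0) is listed as active in $2 (default /proc/mdstat).
# mdstat lines look like: "md0 : active raid0 sdd1[3] sdc1[2] ..."
md_is_active() {
  awk -v dev="$1" '$1 == dev && $3 == "active" { found = 1 } END { exit !found }' "${2:-/proc/mdstat}"
}

# Usage: md_is_active md0 && md_is_active md1 || echo "array not active"
```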

Make the file systems

mkfs.ext3 -i 16384 /dev/md0; mkfs.ext3 -i 16384 /dev/md1
tune2fs -m0 /dev/md0; tune2fs -m0 /dev/md1

Fixing a corrupt ext3 Journal

debugfs -w -R "feature ^has_journal,^needs_recovery" /dev/md2
fsck -y /dev/md2
tune2fs -j /dev/md2

or

debugfs -w -R "feature ^has_journal,^needs_recovery" /dev/md1 && fsck -y /dev/md1 && tune2fs -j /dev/md1

Bulk CA Certs for Web Browsers

TACAR keeps a repository of all the IGTF CAs. You can individually install the ones you care about directly in your browser (or try a bulk download and install)

https://www.tacar.org/repos/

SRM Ping

srm-ping srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server

VOMS Proxy and FTS

https://twiki.cern.ch/twiki/bin/view/CMS/PhedexAdminDocsVomsProxies

HADOOP

Error 1

Exception in thread "main" java.io.IOException: Mkdirs failed to create /cms/store/user/tmartin                        
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)                                 
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)                                                 
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)                                                 
Call to org.apache.hadoop.conf.FileSystem::create((Lorg/apache/hadoop/fs/Path;ZISJ)Lorg/apache/hadoop/fs/FSDataOutputStream;) failed!                 

Check that hadoop-site.xml is configured properly and that CLASSPATH is set correctly.
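Hadoop's Java client reads hadoop-site.xml from the classpath, so the directory holding it has to be an entry in CLASSPATH. A sketch of that check; /etc/hadoop/conf is an assumption taken from the hadoop-env.sh path used elsewhere on this page.

```shell
#!/bin/bash
# Exit 0 if directory $2 is an entry of the colon-separated classpath $1
# (with or without a trailing slash).
classpath_has() {
  case ":$1:" in
    *":$2:"* | *":$2/:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: classpath_has "$CLASSPATH" /etc/hadoop/conf || echo "hadoop conf dir missing from CLASSPATH"
```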

Rocks Command Add Appliance at UCSD

rocks add appliance cabinet-5 membership="Cabinet 5" short-name='c' node='cab5-compute'
rocks add appliance cabinet-4 membership="Cabinet 4" short-name='c' node='cab4-compute'
rocks add appliance cabinet-6 membership="Cabinet 6" short-name='c' node='cab6-compute' 
rocks add appliance cabinet-7 membership="Cabinet 7" short-name='c' node='cab7-compute'
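The four commands follow one pattern, so they can be generated in a loop. A sketch that prints the commands by default and only runs them when RUN=1 is set (RUN is a hypothetical switch for this example).

```shell
#!/bin/bash
# For each cabinet number given, print (default) or run (RUN=1) the
# "rocks add appliance" command in the form used above.
add_cabinets() {
  local n
  for n in "$@"; do
    if [ "${RUN:-0}" = 1 ]; then
      rocks add appliance "cabinet-$n" membership="Cabinet $n" short-name='c' node="cab$n-compute"
    else
      echo "rocks add appliance cabinet-$n membership=\"Cabinet $n\" short-name='c' node='cab$n-compute'"
    fi
  done
}

# Usage: add_cabinets 4 5 6 7          (review)
#        RUN=1 add_cabinets 4 5 6 7    (execute)
```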

Memory copy of hadoop fsimage when restarting

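First put Hadoop into safe mode, then run metasave, which writes the namenode's primary data structures to the named file in the Hadoop log directory:

hadoop dfsadmin -safemode enter
hadoop dfsadmin -metasave <filename>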

Checking for black hole nodes with condor

These commands list the hosts on which cmsprod jobs ended after less than 120 seconds of wall-clock time during the last four days, with a count per host. Hosts near the top of the list are black-hole candidates.

Remote

globus-job-run osg-gw-2.t2.ucsd.edu /bin/sh -c 'source $OSG_LOCATION/setup.sh; condor_history -constraint "RemoteWallClockTime < 120 && Owner == \"cmsprod\" && CurrentTime-EnteredCurrentStatus < 3600*24*4" -format "%s\n" LastRemoteHost' \
  | sed 's/slot.*@//g' | sort | uniq -c | sort -r -n

Local

condor_history -constraint "RemoteWallClockTime < 120 && Owner == \"cmsprod\" && CurrentTime-EnteredCurrentStatus < 3600*24*4" \
  -format "%s\n" LastRemoteHost | sed 's/slot.*@//g' | sort | uniq -c | sort -r -n
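The counting half of those pipelines can be kept as a function with a minimum-count filter, so only hosts with many short jobs are shown. A sketch; the threshold of 20 in the usage line is an arbitrary example.

```shell
#!/bin/bash
# Read LastRemoteHost values on stdin, strip the "slotN@" prefix, and list
# hosts with at least $1 short jobs (default 1), busiest first.
count_short_job_hosts() {
  local min=${1:-1}
  sed 's/slot.*@//g' | sort | uniq -c | sort -r -n | awk -v min="$min" '$1 >= min'
}

# Usage:
# condor_history -constraint "RemoteWallClockTime < 120 && Owner == \"cmsprod\" && CurrentTime-EnteredCurrentStatus < 3600*24*4" \
#   -format "%s\n" LastRemoteHost | count_short_job_hosts 20
```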

Hadoop mount is responding slow

This can be caused by the Hadoop namenode getting stuck in a loop. It is usually obvious: the namenode process sits at around 100% of a single CPU, whereas under normal operating conditions its usage should be much lower. After carefully checking the namespace backup, restart the namenode.

If the FUSE mount process is at 100%, remount it. This may be a memory issue; add the following line to the file

/etc/hadoop/conf/hadoop-env.sh

export LIBHDFS_OPTS=-Xmx4096m
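To see whether the namenode or the FUSE client is pinned near 100%, a sketch that filters ps output by CPU percentage. The threshold of 90 is an arbitrary example. Note that ps reports CPU usage averaged over the life of the process, so top gives a better instantaneous reading.

```shell
#!/bin/bash
# Read "ps -eo pcpu,pid,comm" output on stdin (header first) and print the
# processes at or above $1 percent CPU (default 90).
high_cpu() {
  awk -v t="${1:-90}" 'NR > 1 && $1 + 0 >= t'
}

# Usage: ps -eo pcpu,pid,comm | high_cpu 90
```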

Remounting hadoop on UAF

umount -l /hadoop
mount /hadoop

New CRL check script

The following is the cron job on codefs that runs the new CRL check script. The script confirms that all of the CRLs are valid CRL files and forces a re-run of the fetch-crl script if they are not.

root@codefs /code/osgcode/tmartin# cat /etc/cron.d/checkcrl
22 0,3,6,9,12,15,18,22 * * * root /code/osgcode/tmartin/checkcrl.sh
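The checkcrl.sh script itself is not shown here. As a hypothetical sketch of the validity check it describes, openssl can be asked to parse each .r0 file; this assumes PEM-encoded CRLs, and the default certificates directory is an example.

```shell
#!/bin/bash
# Print every *.r0 file in directory $1 that openssl cannot parse as a CRL
# (a zero-size file fails too). The default directory is an assumption.
invalid_crls() {
  local dir=${1:-/etc/grid-security/certificates} f
  for f in "$dir"/*.r0; do
    [ -e "$f" ] || continue
    openssl crl -in "$f" -noout >/dev/null 2>&1 || echo "$f"
  done
}

# Hypothetical use: re-fetch when anything is broken
# [ -n "$(invalid_crls /path/to/certificates)" ] && /usr/sbin/fetch-crl
```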

Adding a site to Glidein Factory

Log into the glide 1

Restarting the Glidein 1 Factory


When you have finished moving things, remember to start the services in this order:
1) the httpd (/etc/init.d as root)
2) Condor (/etc/init.d as root)
3) the gfactory (~/glideinsubmit/glidein_Productio3_1 as gfactory)

HKspecInt

https://hepix.caspur.it/benchmarks/doku.php?id=bench:results

Converting P12 Certificates to x509

OSGCMSCertificateSetup
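For reference, the usual openssl commands for this conversion can be wrapped as below. This is a sketch, not necessarily the procedure on the OSGCMSCertificateSetup page; move any existing usercert.pem and userkey.pem out of the way first.

```shell
#!/bin/bash
# Split a PKCS#12 bundle ($1) into usercert.pem and userkey.pem in directory $2.
# Extra arguments (e.g. -passin/-passout) are passed through to openssl;
# without them openssl prompts for the import and PEM pass phrases.
p12_to_pem() {
  local p12=$1 out=$2
  shift 2
  # Certificate only, no private key
  openssl pkcs12 -in "$p12" -clcerts -nokeys -out "$out/usercert.pem" "$@" || return 1
  # Private key only, re-encrypted with the PEM pass phrase
  openssl pkcs12 -in "$p12" -nocerts -out "$out/userkey.pem" "$@" || return 1
  # Globus tools expect the key to be readable only by its owner
  chmod 644 "$out/usercert.pem"
  chmod 400 "$out/userkey.pem"
}

# Usage: p12_to_pem mycert.p12 ~/.globus
```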

Setting up Putty with RSA keys

http://www.andremolnar.com/how_to_set_up_ssh_keys_with_putty_and_not_get_server_refused_our_key

Glidein Factory FAQ

GlideinFactoryFAQ

Kerberos Help for Mac Users

http://uscms.org/uscms_at_work/data_computing/facility_operations/uaf.shtml

XrootD Install

https://twiki.cern.ch/twiki/bin/view/Main/HdfsXrootdInstall

Gaining access to the Grub prompt in a DomU

xm create -c domain

Useful SRM Commands

Copy

lcg-cp -v -b -D srmv2 file:/home/users/tmartin/smallfile.zero srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN=/hadoop/cms/store/user/tmartin/deleteme



 srmcp -2 --debug=true -delegate=false srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN=/hadoop/cms/store/user/tmartin/srmtest/testfile-today file://localhost//tmp/testfile-b.zero

List

lcg-ls  -l -v -b -D srmv2 srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN=/hadoop/cms/store/user/tmartin/deleteme

Delete

srmrm -2 -delegate=false srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN=/hadoop/cms/store/user/tmartin/deleteme
Note: srmrm removes one file per run, so to delete many files, iterate over a list of files with a for loop in bash.
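A sketch of such a loop, reading file names relative to a base path from stdin and running the srmrm command above once per file. SRMRM is a hypothetical override (for example SRMRM=echo for a dry run).

```shell
#!/bin/bash
# Read file names (relative to base path $1) on stdin; run srmrm once per file.
srm_rm_files() {
  local base=$1 f
  while IFS= read -r f; do
    [ -n "$f" ] || continue          # skip blank lines
    ${SRMRM:-srmrm} -2 -delegate=false "srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN=$base/$f"
  done
}

# Usage: printf '%s\n' file1 file2 | srm_rm_files /hadoop/cms/store/user/tmartin
```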

Make a directory

srmmkdir -2 -delegate=false  srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN=/hadoop/cms/store/user/tmartin/dirfordelete

Remove a directory

srmrmdir -2 -delegate=false  srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN=/hadoop/cms/store/user/tmartin/dirfordelete

Creating a Xen instance

First copy the default config to a file named after the instance you are creating and edit it as needed. Then run:

xm create -c devg-2 extra=" init 1 xencons=xvc0"

Tuning NFS Server

Number of threads

cat /proc/net/rpc/nfsd



fh 278562 0 0 0 0
io 4244458892 3311147501
th 96 72034 108604.717 46238.689 43326.894 1563.190 980.777 1653.901 105.886 73.850 58.314 102.676
ra 192 1422103166 0 0 0 0 0 0 0 0 0 2354117

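th

  • Number of nfsd threads (96)
  • Number of times the last free thread was used, i.e. all threads were busy at once (72034)
  • Seconds during which 0-10% of the threads were busy (108604.717); with 96 threads that is up to about 10 threads
  • Seconds during which 10-20% of the threads were busy (46238.689)
  • The remaining values continue in 10% steps up to 90-100%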
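A sketch that labels the fields of the th line, so the histogram can be read without counting columns. It assumes the th layout described above: thread count, full count, then ten 10% buckets.

```shell
#!/bin/bash
# Read /proc/net/rpc/nfsd-format text on stdin and label the "th" line.
nfsd_thread_histogram() {
  awk '$1 == "th" {
    printf "threads=%s fullcnt=%s\n", $2, $3   # fullcnt: times all threads were busy
    for (i = 4; i <= 13; i++)
      printf "%d-%d%% busy: %s s\n", (i - 4) * 10, (i - 3) * 10, $i
  }'
}

# Usage: nfsd_thread_histogram < /proc/net/rpc/nfsd
```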

GUMS Manual Account mapping instructions for RSV


Very sorry about letting this lie dormant. I haven't yet sent instructions.

The plan is to add to the Maven-generated GUMS documentation a recipe
for doing the one-to-one user mappings. I'll make sure the info makes it
into the next release.

Briefly, the recipe would be:

1) Under User Groups, create a User Group for the user, e.g.
JohnSmithUserGroup
2) Under Manual User Group Members, add the intended user's DN and
optional FQAN and email to the group created in #1
3) Under Account Mappers, create a new Account Mapper, e.g.
JohnSmithMapper of type "manual" pointed at the UNIX account you want
John Smith to go to (e.g. jsmith).
4) Under Group To Account Mappings, create a Group To Account Mapping,
e.g. JohnSmithGTAMapping using user group from #1 and Account Mapper
from #3, defining VO accounting info.
5) Under Host To Group Mappings, create or edit a relevant host to group
mapping definition and include the GroupToAccount mapping from #4.

Note that where a name is defined, I've chosen a distinct name that
includes what kind of thing it is. In theory the namespace shouldn't
matter, but it makes what you're doing clearer.

Note also that this is all quite complicated to do on a per-user basis.
That is because GUMS was never designed or intended to do manual
per-user mapping. Rather it was intended to be a Grid-ID-to-UNIX-ID
*policy* tool where you handle a whole VO with one chain.

Hope this made sense.

Cheers,

--john

Apache build modules

 apxs -I /usr/include/libxml -I . -i -c mod_proxy_html.c
 apxs -I /usr/include/libxml -I . -i -c mod_xml2enc.c

GLEXEC Test

export GLEXEC_CLIENT_CERT=/tmp/x509up_u583 
export X509_USER_PROXY=/tmp/x509up_u583
/usr/sbin/glexec /usr/bin/id

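Installing GRUB on the second RAID device

Map (hd0) to the second disk with the device command before running root and setup, so that the boot loader is written to /dev/sdb. Example session: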

# grub
Probing devices to guess BIOS drives. This may take a long time.


    GNU GRUB  version 0.97  (640K lower / 3072K upper memory)

 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename.]
grub> find /grub/stage1
find /grub/stage1
 (hd0,0)
 (hd1,0)
grub> device (hd0) /dev/sdb 
device (hd0) /dev/sdb 
grub> root (hd0,0)
root (hd0,0)
 Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd0)
setup (hd0)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hd0)"...  15 sectors are embedded.
succeeded
 Running "install /grub/stage1 (hd0) (hd0)1+15 p (hd0,0)/grub/stage2 /grub/grub.conf"... succeeded
Done.
grub> quit
quit

Tastwiki

New User Registration

http://www.t2.ucsd.edu/tastwiki/bin/view/TWiki/TWikiRealRegistration

GUMS

Banning a user

https://www.opensciencegrid.org/bin/view/Documentation/Release3/BanningUsersAtSite

Install certificate components and fetch-crl

Before installing the OSG 3 certificate and CRL infrastructure, retire the old pacman-based install: remove its soft link and disable its cron-based CRL and certificate updates.

First remove the soft link for the old pacman-based cert-inf install:

rm /etc/grid-security/certificates

Then turn off its cron jobs:

cd /cert-inf
source setup.sh
vdt-control --off

Then install the CA certificates and fetch-crl from the EPEL and OSG repositories:

rpm -Uvh http://dl.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
rpm -Uvh http://repo.grid.iu.edu/osg-el5-release-latest.rpm

yum -y install osg-ca-certs
yum -y install fetch-crl
chkconfig fetch-crl-cron on
service fetch-crl-cron start

To fetch the latest CRLs immediately, run:
/usr/sbin/fetch-crl

Remove the old pacman cert area

rm -rf /cert-inf

mtest

Mtest is a process that runs once an hour from cron on the worker nodes and checks for the Hadoop mount. If the mount is missing, mtest tries to remount the Hadoop file system at /hadoop. The test produces a large amount of Hadoop log output from the nodes that run it.
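A hypothetical sketch of the kind of check mtest performs (the actual mtest script is not shown here): remount /hadoop from /etc/fstab when it is not a mount point. It uses mountpoint from util-linux; MOUNT_CMD is an illustrative override for dry runs.

```shell
#!/bin/bash
# If $1 is not a mount point, try to mount it (using its /etc/fstab entry).
ensure_mounted() {
  local mp=$1
  if mountpoint -q "$mp"; then
    echo "ok: $mp is mounted"
  else
    echo "remounting $mp"
    ${MOUNT_CMD:-mount} "$mp"
  fi
}

# Usage: ensure_mounted /hadoop
```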

Condor Requirements change

Example of changing a condor requirements line for all jobs.

condor_cron_qedit -const 'Owner=!=UNDEFINED' Requirements '( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" )'

Create a mirrored Logical Volume

Create the physical volumes and volume group as usual, then create the mirrored logical volume:

lvcreate -L 223G -m1 --mirrorlog mirrored --alloc anywhere -n osg-ce-2_vol vg_osg-ce-2

Installing Condor UAF Glidein

Install, in order:

  1. condor
  2. glideinwms-userschedd

Add in /etc/condor/config.d/99_local.config

CONDOR_HOST = uaf-2.t2.ucsd.edu

Set up the security configuration:

glidecondor_addDN -daemon "My own DN from hostcert" /etc/grid-security/hostcert.pem condor
glidecondor_addDN -daemon "The collector of the UAF pool" '/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=uaf-2.t2.ucsd.edu' coll

Disable the second schedd that is enabled by default:

glidecondor_createSecSched ""

Xrootd Testing

  xrdcp -d 2 -f root://xrootd.t2.ucsd.edu//store/test/xrootd/T2_US_UCSD//store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root /dev/null
  xrdcp -d 2 -f root://cmsxrootd.fnal.gov//store/test/xrootd/T2_US_UCSD//store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root /dev/null
  xrdcp -d 2 -f root://cms-xrd-global.cern.ch//store/test/xrootd/T2_US_UCSD//store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root /dev/null
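The xrdcp commands above differ only in the redirector, so a loop can try all of them and summarize the result. A sketch; XRDCP is a hypothetical override so the loop can be dry-run, and the xrdcp debug output still goes to stderr.

```shell
#!/bin/bash
# Test file used by the xrdcp commands above
TEST_FILE=/store/test/xrootd/T2_US_UCSD//store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root

# Fetch TEST_FILE through each redirector given as an argument; print OK/FAIL
# per host and return non-zero if any fetch failed.
xrootd_test() {
  local host rc=0
  for host in "$@"; do
    if ${XRDCP:-xrdcp} -d 2 -f "root://$host/$TEST_FILE" /dev/null; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      rc=1
    fi
  done
  return $rc
}

# Usage: xrootd_test xrootd.t2.ucsd.edu cmsxrootd.fnal.gov cms-xrd-global.cern.ch
```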

Authors

-- TerrenceMartin - 3/17/2017
Topic revision: r98 - 2017/03/17 - 15:52:27 - TerrenceMartin
 
This site is powered by the TWiki collaboration platform. Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.