Frequently Asked Questions
Users
Condor C Submission
http://www.t2.ucsd.edu/twiki2/bin/view/UCSDTier2/FkwUafCondorC
Specifying how many CPUs you want your job to request
+request_cpus=8
Information
Site Readiness
http://lhcweb.pic.es/cms/SiteReadinessReports/SiteReadinessReport.html
Static Condor Information
Static Condor information updated every few minutes
http://www.t2.ucsd.edu/condorstatus/
Cacti
Cacti Plots
https://sentry.t2.ucsd.edu/cacti/
Ganglia Plots
Ganglia Plots
http://t2gw02.t2.ucsd.edu/ganglia/
Glidein Mon
http://glidein-mon.t2.ucsd.edu/ucsd/overview.html
Administrative Questions
How to generate a grid host cert
After backing up the old cert files, run the following:
source $VDT_LOCATION/setup.sh
./globus/bin/grid-cert-request -host <hostname>
How do I adjust the memory parameter on the kernel config line in a Rocks CDROM kernel roll?
The purpose of this fix is to allow Rocks to create very large RAID partitions on 64-bit machines.
Edit the following file:
rocks/src/roll/kernel/src/rocks-boot/enterprise/4/images/x86_64/isolinux.cfg
change:
label internal
kernel vmlinuz
append ramdisk_size=150000 initrd=initrd.img devfs=nomount ks ksdevice=eth0 kssendmac selinux=0
to:
label internal
kernel vmlinuz
append ramdisk_size=150000 initrd=initrd.img devfs=nomount ks ksdevice=eth0 kssendmac selinux=0 mem=1024M
then rebuild the kernel roll:
# cd rocks/src/roll/kernel
# make roll
Why does /dev not appear with the right files on RHEL4 in chroot?
/dev is not a plain directory but a mount, so bind-mount it into the chroot:
mount -t tmpfs --bind /dev /sysroot/dev
LCG VOMRS
https://lcg-voms.cern.ch:8443/vo/cms/vomrs
Complete Listing of OSG Configuration Variables
A complete list of OSG Configuration variables
https://twiki.grid.iu.edu/twiki/bin/view/Main/OSGConfigurationParameters
Retrieving ERT Times for the Site
Log in to a gatekeeper, source the VDT setup, and run:
ldapsearch -xLLL -h is.grid.iu.edu:2170 -b mds-vo-name=UCSDT2,mds-vo-name=local,o=grid '(&(GlueCEAccessControlBaseRule=VO:cms)(GlueCEUniqueID=*))' GlueCEStateEstimatedResponseTime
How to Output the Text form of Grid Certificates (Host Cert)
openssl x509 -in cert.pem -text
Converting PEM x509 Format Files to P12(DER) for import into Mozilla/Firefox
Log into the machine on which you have your x509 PEM files for your cert.
cd ~/.globus
openssl pkcs12 -in foo.pem -inkey bar.pem -export -out foo.p12
It will then ask for your password. This is the same password that you would use, e.g., when you run voms-proxy-init.
The resulting foo.p12 file can then be imported into your browser.
Entering a UCSD ACT Customer Service Request
http://blink.ucsd.edu/go/csr
Preserving the RSL file from submitted grid jobs
On the server side, edit $VDT_LOCATION/globus/etc/globus-job-manager.conf and set "-save-logfile always"
That should preserve the gram_job_mgr files in the user home dir for debugging.
Updating OSG CA Certificates
From http://vdt.cs.wisc.edu/releases/1.6.1/certificate_authorities.html
Run the following in $VDT_LOCATION:
# pacman -update CA-Certificates
Fixing broken CRLs by hand
The CRLs come with the wn-client installation of OSG, which is exported from codefs to all the worker nodes. Each CRL is a set of several files, one of which is updated via cron on codefs. The one that is updated has the ending .r0.
Over Christmas 2009, three CRLs failed to update properly. Their .r0 files had zero size and could no longer be overwritten by the CRL updater, which seems to refuse to update zero-size files. I thus had to copy the three zero-size .r0 files by hand from the VDT client that's installed on the uaf.
Once the files were copied, all worker nodes worked again.
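A minimal sketch for spotting this condition, assuming all CRL files sit in one directory. The function name find_empty_crls and the default directory are assumptions for illustration; pass the directory actually used by the wn-client installation.

```shell
# Print every CRL file (*.r0) in the given directory that has zero size.
# The default directory is an assumption; pass the real one as $1.
find_empty_crls() {
    crl_dir=${1:-/etc/grid-security/certificates}
    for crl in "$crl_dir"/*.r0; do
        # -e guards against the unexpanded glob when nothing matches;
        # -s is true only for files with a size greater than zero.
        if [ -e "$crl" ] && [ ! -s "$crl" ]; then
            printf '%s\n' "$crl"
        fi
    done
}
```

Any file it lists is a candidate for being replaced by hand from a known-good copy, as described above.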
Useful Grid Twiki for GRAM Errors
http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq
Installing Ubuntu on a Mac Mini
Using Grub
http://doc.gwos.org/index.php/UbuntuOnApple#Introduction_to_Linux_Installation_on_i386_Mac_Mini
UCSD Testing and Monitoring Links
Links, tools and sites related to monitoring the UCSD Tier2.
UCSDT2 ITB RSV Monitoring Link
https://osg-gw-3.t2.ucsd.edu:8443/rsv/
Query the ITB ReSS and BDII
condor_status -pool osg-ress-4.fnal.gov -constraint 'GlueSiteName=="UCSDT2-ITB1"' -l
ldapsearch -x -h is-itb.grid.iu.edu -p 2170 -b mds-vo-name=UCSDT2-ITB1,mds-vo-name=local,o=grid
Checking BDII publishing
Site to check whether UCSD is properly reporting to the BDII:
http://is.grid.iu.edu/cgi-bin/status.cgi
SAM Test Page
https://twiki.cern.ch/twiki/bin/view/CMS/SAMForCMS
CMS Prod Exit Code Results (CMS)
http://t2.unl.edu/pa/xml/quality_map_query?team=OSG
Job Robot Report
http://jobrobot.web.cern.ch/JobRobot/
VORS Monitoring
http://vors.grid.iu.edu/cgi-bin/index.cgi
SCRAM Template.pm Error
Error:
SCRAM Error: It appears that the module "Template.pm" is not installed. Please check your installaion. If you are an administrator, you can find the Perl Template Toolkit at www.cpan.org or at the web site of the author (Andy Wardley):
Fix: Install perl-Template-Toolkit and supporting packages
Purging CE Jobs
To fully purge the CE of jobs you need to:
- Remove or move the contents of the condor home area (e.g. /state/data/condor_local)
- Remove or move the contents of the GRAM area $GLOBUS_LOCATION/tmp/gram_job_state/gram_condor_log.*
Installing the cert infrastructure only from the VDT
This will install the parts needed to request host certs as well as keep CRLs and CAs up to date on a machine.
#!/bin/sh
mkdir -p /data/vdt
cd /data/vdt
wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-3.25.tar.gz
tar zxvf pacman-3.25.tar.gz
chown root:root -R pacman-3.25
cd pacman-3.25
source setup.sh
cd /data/vdt
VDTSETUP_AGREE_TO_LICENSES=y
export VDTSETUP_AGREE_TO_LICENSES
VDTSETUP_ENABLE_ROTATE=y
export VDTSETUP_ENABLE_ROTATE
VDTSETUP_EDG_CRL_UPDATE=y
export VDTSETUP_EDG_CRL_UPDATE
VDTSETUP_CA_CERT_UPDATER=y
export VDTSETUP_CA_CERT_UPDATER
VDTSETUP_INSTALL_CERTS=r
export VDTSETUP_INSTALL_CERTS
pacman -pretend-platform:linux-rhel-4
pacman -get http://vdt.cs.wisc.edu/vdt_1101_cache:CA-Certificates
pacman -get http://vdt.cs.wisc.edu/vdt_1101_cache:CA-Certificates-Updater
pacman -get http://vdt.cs.wisc.edu/vdt_1101_cache:PPDG-Cert-Scripts
Condor jobs cannot find /state/data/condor_local/execute/dir_XXXX
Because of the remounts, mount order is important. Check that all underlying file systems are mounted before the remounts happen.
WS Gram Performance Optimization
http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Performance_Guide.html
Resetting priority factors on the Condor cluster
for i in `condor_userprio -all -allusers |grep "@" | awk -F"@" '{print $1}'|grep ligo`; do for j in `seq 2 5`; do condor_userprio -setfactor ${i}@osg-gw-${j}.t2.ucsd.edu 100; done; done
Local Users Mappings
uscms048
uscms1581
uscms099
uscms1658
uscms1633
uscms076
uscms1586
uscms1285
uscms1674
WS GRAM Errors
Error initializing GAHP
Check that Java is installed and that the condor_config correctly points to its location
Additional CMS Config for OSG
Copy the following files from the old install to the new one:
add-attributes.conf
./lcg/etc/add-attributes.conf
alter-attributes.conf
./lcg/etc/alter-attributes.conf
Getting a slot wn-client environment on a node interactively
Log into a node as root
# chroot /chroot/cafuser1
# su - cafuser1
# source /code/osgcode/wn-client-itb/setup.sh
Rocks Commands
Adding a cabinet to rocks
rocks add appliance cabinet-5 membership="Cabinet 5" short-name='c' node='cab5-compute'
OSG-RSV Commands at UCSD
osg-gw-4
$VDT_LOCATION/osg-rsv/setup/configure_osg_rsv --user rsv --init --server y --ce-probes --ce-uri "osg-gw-4.t2.ucsd.edu" --srm-probes --srm-uri "srm-3.t2.ucsd.edu" -srm-dir /pnfs/t2.ucsd.edu/data4/cms/phedex/store/user/tmartin --gridftp-probes --gratia --grid-type "OSG" --consumers --verbose --setup-for-apache --proxy /tmp/x509up_u59001
osg-gw-2
$VDT_LOCATION/osg-rsv/setup/configure_osg_rsv --user rsv --init --server y --ce-probes --ce-uri "osg-gw-2.t2.ucsd.edu" --srm-probes --srm-uri "srm-3.t2.ucsd.edu" -srm-dir /pnfs/t2.ucsd.edu/data4/cms/phedex/store/user/tmartin --gridftp-probes --gratia --grid-type "OSG" --consumers --verbose --setup-for-apache --proxy /tmp/x509up_u59001
OSG RSV
Testing CA Cert Probe by hand
su rsv -c "./cacert-crl-expiry-probe -m org.osg.certificates.cacert-expiry -u osg-gw-4.t2.ucsd.edu -x /tmp/x509up_u59001"
Gratia Search Links
https://t2.unl.edu/gratia/xml/dn_efficiency_summary?vo=cms&facility=UCSD&fixed-height=False
https://t2.unl.edu/gratia/xml/dn_wasted_summary?vo=cms&facility=UCSD&fixed-height=False
Making the RAID devices on the nodes by hand
In the event you need to do this by hand
Create the partitions on the new disk
Stop the devices
mdadm --stop /dev/md0
mdadm --stop /dev/md1
mdadm --create /dev/md0 --chunk=256 --level=0 --raid-devices=4 /dev/sda2 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm --create /dev/md1 --chunk=256 --level=0 --raid-devices=4 /dev/sda5 /dev/sdb3 /dev/sdc3 /dev/sdd3
Make the file systems
mkfs.ext3 -i 16384 /dev/md0; mkfs.ext3 -i 16384 /dev/md1
tune2fs -m0 /dev/md0; tune2fs -m0 /dev/md1
Fixing a corrupt ext3 Journal
debugfs -w -R "feature ^has_journal,^needs_recovery" /dev/md2
fsck -y /dev/md2
tune2fs -j /dev/md2
or
debugfs -w -R "feature ^has_journal,^needs_recovery" /dev/md1 && fsck -y /dev/md1 && tune2fs -j /dev/md1
Bulk CA Certs for Web Browsers
TACAR keeps a repository of all the IGTF CAs. You can individually install the ones you care about directly in your browser (or try a bulk download and install)
https://www.tacar.org/repos/
SRM Ping
srm-ping srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server
VOMS Proxy and FTS
https://twiki.cern.ch/twiki/bin/view/CMS/PhedexAdminDocsVomsProxies
HADOOP
Error 1
Exception in thread "main" java.io.IOException: Mkdirs failed to create /cms/store/user/tmartin
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:358)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
Call to org.apache.hadoop.conf.FileSystem::create((Lorg/apache/hadoop/fs/Path;ZISJ)Lorg/apache/hadoop/fs/FSDataOutputStream;) failed!
Check that hadoop-site.xml is properly configured and that the CLASSPATH is set correctly.
Rocks Command Add Appliance at UCSD
rocks add appliance cabinet-5 membership="Cabinet 5" short-name='c' node='cab5-compute'
rocks add appliance cabinet-4 membership="Cabinet 4" short-name='c' node='cab4-compute'
rocks add appliance cabinet-6 membership="Cabinet 6" short-name='c' node='cab6-compute'
rocks add appliance cabinet-7 membership="Cabinet 7" short-name='c' node='cab7-compute'
Memory copy of hadoop fsimage when restarting
First put hadoop into safe mode, then run (metasave takes the name of the output file as its argument):
hadoop dfsadmin -metasave <filename>
Checking for black hole nodes with condor
Remote
globus-job-run osg-gw-2.t2.ucsd.edu /bin/sh -c 'source $OSG_LOCATION/setup.sh; condor_history -constraint "RemoteWallClockTime < 120 && Owner == \"cmsprod\" && CurrentTime-EnteredCurrentStatus < 3600*24*4" -format "%s\n" LastRemoteHost ' | sed 's/slot.*@//g' | sort | uniq -c | sort -r -n
Local
condor_history -constraint "RemoteWallClockTime < 120 && Owner == \"cmsprod\" && CurrentTime-EnteredCurrentStatus < 3600*24*4" -format "%s\n" LastRemoteHost | sed 's/slot.*@//g' | sort | uniq -c | sort -r -n
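To see what the tail of these pipelines does, here is the same sed/sort/uniq chain wrapped in a function that reads host names from standard input. The function name count_hosts is made up for illustration; it is only a sketch of the post-processing step and does not call condor_history itself.

```shell
# Read LastRemoteHost values (e.g. slot3@somenode) on stdin, strip the
# slot prefix, and print a job count per host with the largest count first.
count_hosts() {
    sed 's/slot.*@//g' | sort | uniq -c | sort -r -n
}
```

Piping the condor_history output into count_hosts gives one line per worker node. A node with an unusually high count of short-lived jobs is a black hole candidate.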
Hadoop mount is responding slowly
This can be caused by the hadoop namenode getting stuck in a loop. This is often obvious when the namenode is sitting at around 100% of a single CPU; under normal operating conditions it should be much lower. After carefully checking the namespace backup, restart the namenode.
If the mount process for fuse is at 100% CPU, remount it. There is possibly a memory issue; if so, add the following to the file
/etc/hadoop/conf/hadoop-env.sh
export LIBHDFS_OPTS=-Xmx4096m
Remounting hadoop on UAF
umount -l /hadoop
mount /hadoop
New CRL check script
The following is the new CRL check script location and cron job on codefs. It confirms that all of the CRLs are valid CRL files and forces a re-run of the fetch-crl script if they are not.
root@codefs /code/osgcode/tmartin# cat /etc/cron.d/checkcrl
22 0,3,6,9,12,15,18,22 * * * root /code/osgcode/tmartin/checkcrl.sh
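The contents of checkcrl.sh are not reproduced on this page. As a rough sketch of the kind of check described, assuming the CRLs are PEM-encoded *.r0 files and that openssl is installed, each file can be parsed with "openssl crl" and a failure reported if any of them does not parse. The function name crls_all_valid and the default directory are assumptions, not the real script.

```shell
# Return 0 if every *.r0 file in the directory parses as a CRL, 1 otherwise.
# Prints the name of each file that fails. This is NOT the real checkcrl.sh.
crls_all_valid() {
    crl_dir=${1:-/etc/grid-security/certificates}
    crl_rc=0
    for crl in "$crl_dir"/*.r0; do
        [ -e "$crl" ] || continue    # glob did not match: no CRL files
        # openssl exits non-zero when the file is empty or not a valid CRL.
        if ! openssl crl -in "$crl" -noout >/dev/null 2>&1; then
            printf 'invalid CRL: %s\n' "$crl"
            crl_rc=1
        fi
    done
    return $crl_rc
}
```

A cron wrapper could then run something like "crls_all_valid || fetch-crl", using whatever path fetch-crl is installed under on the machine.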
Adding a site to Glidein Factory
Log into the glide 1
Restarting the Glidein 1 Factory
When you have finished moving things, remember to start the following (in this order):
1) the httpd (/etc/init.d as root)
2) Condor (/etc/init.d as root)
3) the gfactory (~/glideinsubmit/glidein_Productio3_1 as gfactory)
HKspecInt
https://hepix.caspur.it/benchmarks/doku.php?id=bench:results
Converting P12 Certificates to x509
OSGCMSCertificateSetup
Setting up PuTTY with RSA keys
http://www.andremolnar.com/how_to_set_up_ssh_keys_with_putty_and_not_get_server_refused_our_key
Glidein Factory FAQ
GlideinFactoryFAQ
Kerberos Help for Mac Users
http://uscms.org/uscms_at_work/data_computing/facility_operations/uaf.shtml
XrootD Install
https://twiki.cern.ch/twiki/bin/view/Main/HdfsXrootdInstall
Gaining access to Grub prompt in DomU
xm create -c domain
Useful SRM Commands
Copy
lcg-cp -v -b -D srmv2 file:/home/users/tmartin/smallfile.zero srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN=/hadoop/cms/store/user/tmartin/deleteme
srmcp -2 --debug=true -delegate=false srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN=/hadoop/cms/store/user/tmartin/srmtest/testfile-today file://localhost//tmp/testfile-b.zero
List
lcg-ls -l -v -b -D srmv2 srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN=/hadoop/cms/store/user/tmartin/deleteme
Delete
srmrm -2 -delegate=false srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN=/hadoop/cms/store/user/tmartin/deleteme
Note: You need to run this once per file, so you probably want to iterate over a list of files with a for loop in bash.
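A minimal sketch of such a loop, assuming a plain-text file with one path per line. The function name srm_delete_list and the DRY_RUN switch are illustrative assumptions; the srmrm options and endpoint are the ones used above. It uses a while/read loop rather than a for loop so that each line is taken as one path.

```shell
# Delete every path listed (one per line) in the file given as $1.
# With DRY_RUN=1 the srmrm commands are printed instead of executed.
srm_delete_list() {
    list_file=$1
    endpoint='srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN='
    while IFS= read -r sfn; do
        [ -n "$sfn" ] || continue    # skip blank lines
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo srmrm -2 -delegate=false "${endpoint}${sfn}"
        else
            # </dev/null keeps srmrm from consuming the rest of the list.
            srmrm -2 -delegate=false "${endpoint}${sfn}" </dev/null
        fi
    done < "$list_file"
}
```

Running it once with DRY_RUN=1 set is a cheap way to review the exact commands before deleting anything.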
Make a directory
srmmkdir -2 -delegate=false srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN=/hadoop/cms/store/user/tmartin/dirfordelete
Remove a directory
srmrmdir -2 -delegate=false srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?SFN=/hadoop/cms/store/user/tmartin/dirfordelete
Creating a Xen instance
First copy the default config to the name of the instance you are creating. Edit it as needed, then run:
xm create -c devg-2 extra=" init 1 xencons=xvc0"
Tuning NFS Server
Number of threads
cat /proc/net/rpc/nfsd
fh 278562 0 0 0 0
io 4244458892 3311147501
th 96 72034 108604.717 46238.689 43326.894 1563.190 980.777 1653.901 105.886 73.850 58.314 102.676
ra 192 1422103166 0 0 0 0 0 0 0 0 0 2354117
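th
- The first number (96) is how many threads you have
- The second number (72034) is the number of times you have used the last thread, i.e. all threads were busy
- The third number (108604.717) is the number of seconds during which up to 10% of the threads were in use
- The fourth number (46238.689) is the number of seconds during which between 10% and 20% of the threads were in use, and so on in 10% steps for the remaining values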
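A small sketch that prints the th line in labelled form. It assumes the reading given above: thread count, then the count of times the last thread was used, then ten values giving the seconds spent in successive 10% bands of thread usage. The function name nfsd_thread_summary is hypothetical.

```shell
# Summarise the "th" line of an nfsd stats file (default /proc/net/rpc/nfsd).
nfsd_thread_summary() {
    stats_file=${1:-/proc/net/rpc/nfsd}
    awk '$1 == "th" {
        printf "threads=%d all_busy_count=%d\n", $2, $3
        # Fields 4 onward: seconds spent in each 10% band of thread usage.
        for (i = 4; i <= NF; i++)
            printf "%d-%d%% busy: %s s\n", (i - 4) * 10, (i - 3) * 10, $i
    }' "$stats_file"
}
```

If the higher bands or the all-busy count keep growing, that is usually taken as a hint that more nfsd threads would help.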
GUMS Manual Account mapping instructions for RSV
Very sorry about letting this lie dormant. I haven't yet sent instructions.
The plan is to add to the Maven-generated GUMS documentation a recipe
for doing the one-to-one user mappings. I'll make sure the info makes it
into the next release.
Briefly, the recipe would be:
1) Under User Groups, create a User Group for the user, e.g.
JohnSmithUserGroup
2) Under Manual User Group Members, add the intended user's DN and
optional FQAN and email to the group created in #1
3) Under Account Mappers, create a new Account Mapper, e.g.
JohnSmithMapper of type "manual" pointed at the UNIX account you want
John Smith to go to (e.g. jsmith).
4) Under Group To Account Mappings, create a Group To Account Mapping,
e.g. JohnSmithGTAMapping using user group from #1 and Account Mapper
from #3, defining VO accounting info.
5) Under Host To Group Mappings, create or edit a relevant host to group
mapping definition and include the GroupToAccount mapping from #4.
Note that where a name is defined, I've chosen a distinct name that
includes what kind of thing it is. In theory the namespace shouldn't
matter, but it makes what you're doing clearer.
Note also that this is all quite complicated to do on a per-user basis.
That is because GUMS was never designed or intended to do manual
per-user mapping. Rather it was intended to be a Grid-ID-to-UNIX-ID
*policy* tool where you handle a whole VO with one chain.
Hope this made sense.
Cheers,
--john
Apache build modules
apxs -I /usr/include/libxml -I . -i -c mod_proxy_html.c
apxs -I /usr/include/libxml -I . -i -c mod_xml2enc.c
GLEXEC Test
export GLEXEC_CLIENT_CERT=/tmp/x509up_u583
export X509_USER_PROXY=/tmp/x509up_u583
/usr/sbin/glexec /usr/bin/id
Installing Grub on the second RAID device
# grub
Probing devices to guess BIOS drives. This may take a long time.
GNU GRUB version 0.97 (640K lower / 3072K upper memory)
[ Minimal BASH-like line editing is supported. For the first word, TAB
lists possible command completions. Anywhere else TAB lists the possible
completions of a device/filename.]
grub> find /grub/stage1
find /grub/stage1
(hd0,0)
(hd1,0)
grub> device (hd0) /dev/sdb
device (hd0) /dev/sdb
grub> root (hd0,0)
root (hd0,0)
Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd0)
setup (hd0)
Checking if "/boot/grub/stage1" exists... no
Checking if "/grub/stage1" exists... yes
Checking if "/grub/stage2" exists... yes
Checking if "/grub/e2fs_stage1_5" exists... yes
Running "embed /grub/e2fs_stage1_5 (hd0)"... 15 sectors are embedded.
succeeded
Running "install /grub/stage1 (hd0) (hd0)1+15 p (hd0,0)/grub/stage2 /grub/grub.conf"... succeeded
Done.
grub> quit
quit
Tastwiki
New User Registration
http://www.t2.ucsd.edu/tastwiki/bin/view/TWiki/TWikiRealRegistration
GUMS
Banning a user
https://www.opensciencegrid.org/bin/view/Documentation/Release3/BanningUsersAtSite
Install certificate components and fetch-crl
First remove any existing soft links to the old pacman certificate install and disable the cron-based CRL and cert updates for the pacman-based install.
To install the OSG 3 CRL infrastructure
First remove the soft link for the old pacman-based cert-inf install
rm /etc/grid-security/certificates
Then remove the cron jobs
cd /cert-inf
source setup.sh
vdt-control --off
rpm -Uvh http://dl.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
rpm -Uvh http://repo.grid.iu.edu/osg-el5-release-latest.rpm
yum -y install osg-ca-certs
yum -y install fetch-crl
chkconfig fetch-crl-cron on
service fetch-crl-cron start
To grab the latest immediately run
/usr/sbin/fetch-crl
Remove the old pacman cert area
rm -rf /cert-inf
mtest
Mtest is a process that runs once an hour from cron on the worker nodes to check for the hadoop mount. If the mount is not there, mtest tries to remount the hadoop filesystem at /hadoop. The process creates a lot of logs in hadoop from the nodes that process the test.
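The mtest script itself is not shown here. As a sketch of the mount check it performs, assuming the check is done against the kernel mount table, the following function tests whether a path is currently a mount point. The name is_mounted and the optional second argument (a mounts file, useful for testing) are assumptions.

```shell
# Succeed if mount_point is listed as a mount target in the mounts table.
# The second argument defaults to /proc/mounts.
is_mounted() {
    mount_point=$1
    mounts_file=${2:-/proc/mounts}
    # Field 2 of each line in the mounts table is the mount target.
    awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file"
}
```

A cron job in the spirit of mtest could then run "is_mounted /hadoop || mount /hadoop".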
Condor Requirements change
Example of changing a condor requirements line for all jobs.
condor_cron_qedit -const 'Owner=!=UNDEFINED' Requirements '( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" )'
Create a mirrored Logical Volume
Create the physical volumes and volume group as normal.
lvcreate -L 223G -m1 --mirrorlog mirrored --alloc anywhere -n osg-ce-2_vol vg_osg-ce-2
Installing Condor UAF Glidein
Install, in order:
- condor
- glideinwms-userschedd
Add in /etc/condor/config.d/99_local.config
CONDOR_HOST = uaf-2.t2.ucsd.edu
Run security stuff
glidecondor_addDN -daemon "My own DN from hostcert" /etc/grid-security/hostcert.pem condor
glidecondor_addDN -daemon "The collector of the UAF pool" '/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=uaf-2.t2.ucsd.edu' coll
Disable the second schedd that is enabled by default
glidecondor_createSecSched ""
Xrootd Testing
xrdcp -d 2 -f root://xrootd.t2.ucsd.edu//store/test/xrootd/T2_US_UCSD//store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root /dev/null
xrdcp -d 2 -f root://cmsxrootd.fnal.gov//store/test/xrootd/T2_US_UCSD//store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root /dev/null
xrdcp -d 2 -f root://cms-xrd-global.cern.ch//store/test/xrootd/T2_US_UCSD//store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root /dev/null
Authors
--
TerrenceMartin - 3/17/2017