PhEDEx Monitoring
Introduction
The role of the PhEDEx data transfer system is to catalog, replicate and register CMS data sets. PhEDEx is therefore
the end user of many services in CMS and at UCSD. If anything breaks anywhere in the chain of services that
PhEDEx uses, transfers will stop. Such services include:
- DBS - the data set database for CMS at CERN, used by analysis jobs to find out where the data is
- TMDB - the PhEDEx data transfer system database at CERN, including web services
- UCSD dCache, including:
- SRM srm-3.t2.ucsd.edu
- gftp doors
- pools
- pnfs
- Remote site storage element, usually CASTOR (at ASGC and RAL) or dCache (elsewhere), including:
- Network links between the T1 sites and UCSD
- Network links between FNAL, ASGC and UCSD have had issues in the past 12 months.
- Lost, corrupted or unreadable files either at UCSD or at a remote site, e.g. files that cannot be read from tape
PhEDEx Data Transfers
In general, if there are successful data transfers in AND out of UCSD, there are no problems and all of the above services are functioning normally. Check the data transfers in the Debug and Prod instances of PhEDEx. There should be successful transfers in Debug at any time, since we have continuous injection of new files, and in Prod only if there is production data in the pipeline for transfer to our site.
- PhEDEx Debug Transfers, Last Hour
- If there are no errors: SRM, dCache, network and remote sites are all probably healthy.
- If there are errors, click on the number of errors in the table to see the details. The error logs are sometimes very cryptic!
- If the errors are on all links, we are probably the source of the problem.
- If the errors are on a particular link, then the source site or (rarely) the network between us and them is most likely the culprit. Check the source site's other transfers by replacing "UCSD" with the source site name (e.g. FNAL, CERN) in the "From" and "To" fields.
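The triage rules above can be sketched as a small shell function. This is only an illustration: the name `triage_links` and its two arguments (the number of links showing errors and the total number of links checked) are assumptions and not part of PhEDEx, and the counts would have to be read off the transfer table by hand.

```shell
# Hypothetical helper, not part of PhEDEx: encodes the rule of thumb above.
# $1 = number of links showing errors, $2 = total number of links checked.
triage_links() {
    failing=$1
    total=$2
    if [ "$failing" -eq 0 ]; then
        echo "no errors: SRM, dCache, network and remote sites are probably healthy"
    elif [ "$failing" -eq "$total" ]; then
        echo "errors on all links: UCSD is probably the source of the problem"
    else
        echo "errors on some links: suspect the source site, or (rarely) the network"
    fi
}

triage_links 0 5
triage_links 5 5
triage_links 1 5
```

The result is only a first guess about where to look; the error details in the table remain the real evidence.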
There may not be any production instance transfers if nothing is subscribed to us today. The first link shows the last hour of Prod transfers, and the second link shows what is in our transfer pipeline.
PhEDEx Agents
PhEDEx operates by running a set of "agents" on phedex-1.t2.ucsd.edu. These agents are responsible for file downloads, exports, deletion requests, etc. Agents that die are automatically restarted by an hourly cron job. However, agents can lose their connection to the central databases at CERN and appear to be down even if they are still running locally. To see the health of the various data transfer links from the T1 sites to UCSD, follow the links below. They should ALL be green. If not, an agent might be down either at UCSD or at the source T1. Click on the box to see the problem.
To stop or start an agent (or all agents), run the following as the phedex user on phedex-1.t2.ucsd.edu:
cd SITECONF/UCSD/PhEDEx
./phedex 'Prod|Debug' 'start|stop' [agent name]
where agent name is the name of the agent as given in the ConfigPart.* configuration file. The agent name is optional; leave it out to act on all agents.
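As a minimal sketch of how the syntax above expands into concrete command lines, the function below validates its arguments and prints the resulting command without running it. The function name `phedex_cmd` and the agent name `example-agent` are placeholders invented for illustration; real agent names come from the ConfigPart.* files.

```shell
# Hypothetical dry-run helper: prints the wrapper command instead of running it.
# Assumes only the ./phedex syntax shown above; it never contacts PhEDEx.
phedex_cmd() {
    instance=$1
    action=$2
    agent=${3:-}   # optional agent name, as listed in ConfigPart.*
    case "$instance" in
        Prod|Debug) ;;
        *) echo "instance must be Prod or Debug" >&2; return 1 ;;
    esac
    case "$action" in
        start|stop) ;;
        *) echo "action must be start or stop" >&2; return 1 ;;
    esac
    if [ -n "$agent" ]; then
        echo "./phedex $instance $action $agent"
    else
        echo "./phedex $instance $action"
    fi
}

phedex_cmd Debug stop                  # all Debug agents
phedex_cmd Prod start example-agent    # one agent; the name is a placeholder
```

Printing the command first makes it easy to check the instance and action before running anything as the phedex user.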
What to do if there is a problem
If there is a problem with a remote site, look in SiteDB for the email address of the appropriate Data Manager at that site.
If there is a problem at UCSD, then depending on what the problem is, contact:
- for networking or hardware problems, T. Martin
- for SRM, dCache or storage problems, A. Rana
- for PhEDEx problems, J. Letts
As a rule of thumb, about 50% of problems are with our storage and about 50% are caused by a remote site being down.
--
JamesLetts - 26 Jul 2008