--
NaveenKashyap - 2016/08/24
Purpose
This document describes the current lack of a factory monitoring system and the design of our proposed solution.
Problem
The data produced by each factory is currently not aggregated anywhere. Such aggregation is needed to monitor and analyze that data.
Solution
Our solution utilizes the Collector within each factory.
We want to use Tyson's script (information found here) to query a list of factories. However, we do not want to query the factories sequentially: a failure at one factory could hold up the queries to the others, and sequential queries are slower. Instead, we want to query the factories in parallel, with each query writing to its own logging/output files. The queries should be sent every 15 minutes.
Technicalities:
- Each factory should have its own thread/daemon to allow for parallel computations
- Each thread/daemon should have its own logging/output files
- The user (probably crontab) should only have to make one call to initiate the queries to all factories, and the list of factories to query should be easy to edit
- It is important that we use only pieces of Condor and Python scripts and bindings.
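
The requirements above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the actual implementation: the factory names, the `factory_logs` directory, and every function name are hypothetical, and `query_factory` is a placeholder for whatever call into Tyson's script or the Condor Python bindings is eventually used.

```python
import logging
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical factory list: edit this (or load it from a file) to change
# which factories are queried.
FACTORIES = ["factory-a.example.org", "factory-b.example.org"]
LOG_DIR = "factory_logs"  # assumed location for the per-factory log files


def make_logger(factory, log_dir):
    """Return a logger that writes only to this factory's own log file."""
    logger = logging.getLogger("factory." + factory)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep each factory's output separate
    if not logger.handlers:
        handler = logging.FileHandler(os.path.join(log_dir, factory + ".log"))
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def query_factory(factory):
    """Placeholder for the real query; replace with the actual call."""
    return {"factory": factory, "status": "ok"}


def run_one(factory, log_dir, query):
    """Query one factory and log the outcome; failures stay in this thread."""
    logger = make_logger(factory, log_dir)
    try:
        result = query(factory)
        logger.info("query succeeded: %s", result)
        return factory, True
    except Exception:
        logger.exception("query failed")
        return factory, False


def run_all(factories, log_dir=LOG_DIR, query=query_factory):
    """Query all factories in parallel, one worker thread per factory."""
    os.makedirs(log_dir, exist_ok=True)
    if not factories:
        return {}
    with ThreadPoolExecutor(max_workers=len(factories)) as pool:
        return dict(pool.map(lambda f: run_one(f, log_dir, query), factories))


if __name__ == "__main__":
    # Single entry point: one call starts the queries to every factory.
    for name, ok in run_all(FACTORIES).items():
        print(name, "ok" if ok else "FAILED")
```

Because each query runs in its own thread and catches its own exceptions, a failure at one factory is recorded in that factory's log and does not stop the others. If the script were saved as, say, `query_factories.py` (a hypothetical name and path), a crontab entry such as `*/15 * * * * python3 /path/to/query_factories.py` would run it every 15 minutes.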