Difference: InfluxFactoryMonitoring (1 vs. 3)

Revision 3 2016/09/08 - Main.NaveenKashyap

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
-- NaveenKashyap - 2016/08/24

Purpose

Line: 11 to 11
  Our solution utilizes the Collector within each factory.
Changed:
<
<
We want to use Tyson's script to query a list of factories. However, we do not want to query the factories sequentially, so that one slow or failing factory cannot block the rest and so that performance improves. Therefore, we want to query the factories in parallel, each with its own logging/output files. The queries should be sent every 15 minutes.
>
>
We want to use Tyson's script (information found here) to query a list of factories. However, we do not want to query the factories sequentially, so that one slow or failing factory cannot block the rest and so that performance improves. Therefore, we want to query the factories in parallel, each with its own logging/output files. The queries should be sent every 15 minutes.
  Technicalities:

  • Each factory should have its own thread/daemon to allow for parallel computations
  • Each thread/daemon should have its own logging/output files
  • The user (probably crontab) should only have to make one call to initiate all queries to all factories and it should be easy to edit the list of factories to query
Added:
>
>
  • It is important that we build only from existing Condor components and Python scripts and bindings.
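The thread-per-factory and log-per-thread requirements above can be sketched in Python. The `monitor_factory`/`monitor_all` names, the `query` callable, and the `<factory>.log` file layout are illustrative assumptions for this sketch, not Tyson's actual script.

```python
# Sketch of the per-factory fan-out: one thread per factory, each with
# its own log file, so a failure in one factory stays in its own thread.
import logging
import os
import threading

def monitor_factory(factory, query, log_dir="."):
    """Query one factory, writing results to that factory's own log file."""
    logger = logging.getLogger(factory)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(os.path.join(log_dir, f"{factory}.log"))
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(message)s"))
    logger.addHandler(handler)
    try:
        result = query(factory)          # placeholder for the real query call
        logger.info("query ok: %s", result)
    except Exception as exc:             # a failure stays inside this thread
        logger.error("query failed: %s", exc)
    finally:
        handler.close()
        logger.removeHandler(handler)

def monitor_all(factories, query, log_dir="."):
    """One call (e.g. from crontab) fans out a thread per factory and waits."""
    threads = [threading.Thread(target=monitor_factory, args=(f, query, log_dir))
               for f in factories]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because each thread owns its handler, two factories never interleave lines in the same file, and one hung query only delays its own thread.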

Revision 2 2016/09/07 - Main.NaveenKashyap

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
-- NaveenKashyap - 2016/08/24

Purpose

Line: 6 to 6
 This document will outline the current lack of a factory monitoring system and the design of our proposed solution.

Problem

Changed:
<
<
The current problem we face is a lack of visibility into individual factories. Such visibility is important to the monitoring and analysis of data aggregated in the factory. There are many implied benefits to factory transparency, such as: assurance of data quality, a detailed report on loads and performance, and a clear understanding of the network of computational services (UCSD, Blue Waters, CERN, etc.). Without this level of direct visibility into a factory, monitoring aggregated data requires ...
>
>
The current problem we face is a lack of aggregation of data produced by each factory. Such aggregation is important to the monitoring and analysis of data.
 

Solution

Our solution utilizes the Collector within each factory.

Deleted:
<
<

 
Changed:
<
<
  • Workflow
    • Each factory has two functional components: a Collector and a collection of Schedulers. The Collector presides over all Schedulers, and each Scheduler schedules jobs to be computed by its respective computational service (UCSD, Blue Waters, CERN, etc.). When a user wants to run a job for computation, a request will be sent to the Collector. The Collector will contact a Scheduler (the specific Scheduler can be specified by the user) and schedule the job. The Scheduler functions as a queue, and when it finds that its computational service is available for computation, it will send a job to the computational service. The result is then reported to the user. In this way, the user is abstracted from all scheduling and computing, creating low visibility into the progress of a job.
    • Our solution will solve this problem by communicating directly with the Collector to discover all Schedulers, the computational service for which they schedule, and the jobs they currently have queued and running.
  • How
    • A daemonized script will periodically query every known factory (and thus the Collector within it) for a list of all Schedulers. The Collector will return the locations and names of the Schedulers for which it is responsible. Then, for each Scheduler, every scheduled and running job will be queried and stored in a time-series manner using InfluxDB. This information will be read by Grafana for a front-facing display of all currently active or planned jobs for all Schedulers in all factories.
>
>
We want to use Tyson's script to query a list of factories. However, we do not want to query the factories sequentially, so that one slow or failing factory cannot block the rest and so that performance improves. Therefore, we want to query the factories in parallel, each with its own logging/output files. The queries should be sent every 15 minutes.

Technicalities:

  • Each factory should have its own thread/daemon to allow for parallel computations
  • Each thread/daemon should have its own logging/output files
  • The user (probably crontab) should only have to make one call to initiate all queries to all factories and it should be easy to edit the list of factories to query
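The single-call requirement above suggests one crontab entry that launches the wrapper every 15 minutes. The interpreter and script paths below are placeholder assumptions for illustration, not an existing deployment.

```
# crontab(5) fields: minute hour day-of-month month day-of-week command
*/15 * * * * /usr/bin/python /opt/factory-monitor/query_all_factories.py
```

With this in place, editing the list of factories only means changing the script's input, never the crontab.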

Revision 1 2016/08/24 - Main.NaveenKashyap

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="WebHome"
-- NaveenKashyap - 2016/08/24

Purpose

This document will outline the current lack of a factory monitoring system and the design of our proposed solution.

Problem

The current problem we face is a lack of visibility into individual factories. Such visibility is important to the monitoring and analysis of data aggregated in the factory. There are many implied benefits to factory transparency, such as: assurance of data quality, a detailed report on loads and performance, and a clear understanding of the network of computational services (UCSD, Blue Waters, CERN, etc.). Without this level of direct visibility into a factory, monitoring aggregated data requires ...

Solution

Our solution utilizes the Collector within each factory.

  • Workflow
    • Each factory has two functional components: a Collector and a collection of Schedulers. The Collector presides over all Schedulers, and each Scheduler schedules jobs to be computed by its respective computational service (UCSD, Blue Waters, CERN, etc.). When a user wants to run a job for computation, a request will be sent to the Collector. The Collector will contact a Scheduler (the specific Scheduler can be specified by the user) and schedule the job. The Scheduler functions as a queue, and when it finds that its computational service is available for computation, it will send a job to the computational service. The result is then reported to the user. In this way, the user is abstracted from all scheduling and computing, creating low visibility into the progress of a job.
    • Our solution will solve this problem by communicating directly with the Collector to discover all Schedulers, the computational service for which they schedule, and the jobs they currently have queued and running.
  • How
    • A daemonized script will periodically query every known factory (and thus the Collector within it) for a list of all Schedulers. The Collector will return the locations and names of the Schedulers for which it is responsible. Then, for each Scheduler, every scheduled and running job will be queried and stored in a time-series manner using InfluxDB. This information will be read by Grafana for a front-facing display of all currently active or planned jobs for all Schedulers in all factories.
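The discover-then-query loop above could be sketched as follows. The `htcondor` calls follow the HTCondor Python bindings, but the `factory_jobs` measurement name, the tag names, and the point layout are assumptions about an eventual InfluxDB schema, not code from this project.

```python
# Sketch: ask a factory's Collector for its Schedds, then turn every job
# ClassAd into a time-series point suitable for InfluxDB.
import time

def job_to_point(factory, schedd_name, job_ad, now=None):
    """Turn one job ClassAd (here a plain dict) into an InfluxDB-style point."""
    return {
        "measurement": "factory_jobs",               # assumed measurement name
        "tags": {"factory": factory, "schedd": schedd_name},
        "time": int(now if now is not None else time.time()),
        "fields": {
            "cluster_id": int(job_ad.get("ClusterId", -1)),
            "status": int(job_ad.get("JobStatus", -1)),  # 1=idle, 2=running
        },
    }

def query_factory(collector_addr):
    """Discover every Schedd via the factory's Collector, then list its jobs."""
    import htcondor  # imported lazily; requires the HTCondor Python bindings
    collector = htcondor.Collector(collector_addr)
    points = []
    for schedd_ad in collector.locateAll(htcondor.DaemonTypes.Schedd):
        schedd = htcondor.Schedd(schedd_ad)
        for job_ad in schedd.query("true", ["ClusterId", "JobStatus"]):
            points.append(job_to_point(collector_addr,
                                       schedd_ad["Name"], job_ad))
    return points  # e.g. hand these to an InfluxDB client's write call
```

Keeping `job_to_point` free of any HTCondor or InfluxDB dependency makes the schema testable without a live factory.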

 