Sitemon flow

One of the most critical use cases we cover for WLCG is Sitemon. This flow is based on metrics provided by ETF (Experiments Test Framework), which are aggregated to produce the so-called "Sitemon" metrics: endpoint, flavour and site statuses.

All this information is then used to create the "Availability/Reliability" reports, which are shared every month with the WLCG community and used for different purposes. Site managers who disagree with what a report shows are entitled to request a recomputation to alter the reported results. More specific information on how to request recomputations, as well as the produced reports, can be found on the wlcg-sitemon website.

Flows

The ETF team is the main party responsible for the raw data in this flow, as they are in charge of running the tests and submitting the results to MONIT. Once the data is in, it goes through several processing steps to end up in the expected shape.

The ETF team also runs a pre-production "QA" environment and submits a set of metrics from it, which MONIT processes in parallel to the production data.

Input	Processing type	Storage	Index Pattern	Dashboards
ETF	Enrichment	OpenSearch LT	monit_prod_sam3_enr_metric	historical tests
ETF	Aggregation	OpenSearch LT	monit_prod_sitemon_agg	historical Profiles
ETF-QA	Enrichment	OpenSearch LT	monit_prod_sam3-qa_enr_metric	historical tests
ETF-QA	Aggregation	OpenSearch LT	monit_prod_sitemon-qa_agg	historical Profiles QA

How does the flow work?

ETF produces status metrics for the different endpoints configured by the experiments. These metrics are then aggregated, following the profile definitions, into endpoint, flavour and site status metrics.

Currently the flow supports the following statuses:

OK: The tests returned success (100% available/reliable, 0% unknown)
CRITICAL: The tests returned failure (0% available/reliable, 0% unknown)
DOWNTIME: The site was marked as in downtime (0% available, 100% reliable, 0% unknown)
UNKNOWN/NODATA: There were no tests for the period of time (0% available/reliable, 100% unknown)
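
To make the percentages above concrete, here is a minimal Python sketch of how each status could contribute to the availability, reliability and unknown fractions of a period. It is only an illustration and not the actual MONIT aggregation code. The names `Fractions`, `STATUS_FRACTIONS` and `summarize` are hypothetical, and the time-weighted average is an assumption about how statuses are combined over a period. CRITICAL is taken as 0% available/reliable, since it represents failed tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fractions:
    """Share of a period counted as available, reliable and unknown (0.0 to 1.0)."""
    available: float
    reliable: float
    unknown: float


# Per-status contribution, following the status list in this section.
# CRITICAL is assumed to count as 0% available/reliable (failed tests).
STATUS_FRACTIONS = {
    "OK": Fractions(1.0, 1.0, 0.0),
    "CRITICAL": Fractions(0.0, 0.0, 0.0),
    "DOWNTIME": Fractions(0.0, 1.0, 0.0),
    "UNKNOWN": Fractions(0.0, 0.0, 1.0),
    "NODATA": Fractions(0.0, 0.0, 1.0),
}


def summarize(intervals):
    """Time-weighted average of the per-status fractions.

    `intervals` is a list of (status, hours) pairs. Weighting by duration
    is an assumption made for this sketch, not a documented rule.
    """
    total = sum(hours for _, hours in intervals)
    if total <= 0:
        raise ValueError("intervals must cover a positive amount of time")

    def weighted(field):
        # Sum each interval's fraction multiplied by its duration.
        return sum(
            getattr(STATUS_FRACTIONS[status], field) * hours
            for status, hours in intervals
        ) / total

    return Fractions(weighted("available"), weighted("reliable"), weighted("unknown"))


# Example: a 24-hour day with 18 h OK, 4 h of downtime and 2 h without tests.
day = [("OK", 18), ("DOWNTIME", 4), ("UNKNOWN", 2)]
result = summarize(day)
print(result)
```

In this example the downtime lowers the availability (18 of 24 hours) but not the reliability (22 of 24 hours), while the two hours without tests only show up in the unknown fraction.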