Sitemon flow
One of the most critical use cases we cover for WLCG is Sitemon. This flow is based on metrics provided by ETF (Experiments Test Framework), which are aggregated to produce the so-called "Sitemon" metrics: endpoint, flavour and site statuses.
All this information is then used to create the "Availability/Reliability" reports, which are shared every month with the WLCG community and used for different purposes. Site managers who disagree with what a report shows are entitled to request a recomputation to alter the reported results. More specific information on how to request recomputations, as well as the produced reports, can be found on the wlcg-sitemon website.
Flows
The ETF team is mainly responsible for the raw data in this flow, as they are in charge of running the tests and submitting the results to MONIT. Once the data is in, it goes through several processing steps to end up in the expected shape.
The ETF team also submits a set of metrics from its pre-production "QA" environment, which MONIT processes in parallel to the production data.
Input | Processing type | Storage | Index Pattern | Dashboards |
---|---|---|---|---|
ETF | Enrichment | OpenSearch LT | monit_prod_sam3_enr_metric | historical tests |
ETF | Aggregation | OpenSearch LT | monit_prod_sitemon_agg | historical Profiles |
ETF-QA | Enrichment | OpenSearch LT | monit_prod_sam3-qa_enr_metric | historical tests |
ETF-QA | Aggregation | OpenSearch LT | monit_prod_sitemon-qa_agg | historical Profiles QA |
How does the flow work?
ETF produces status metrics for the different endpoints configured by the experiments. These metrics are then aggregated, following the profile definitions, into endpoint, flavour and site status metrics.
Currently the flow supports the following statuses:
- OK: The tests succeeded (100% available/reliable, 0% unknown)
- CRITICAL: The tests failed (0% available/reliable, 0% unknown)
- DOWNTIME: The site was marked as downtime (0% available / 100% reliable, 0% unknown)
- UNKNOWN/NODATA: There were no tests for the period of time (0% available/reliable, 100% unknown)
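
The status list above can be read as a lookup table from status to its availability, reliability and unknown contribution. The following is a minimal sketch of that reading, not the actual MONIT implementation. The names `Contribution`, `STATUS_CONTRIBUTION`, `contribution` and `naive_summary` are hypothetical. It also assumes that CRITICAL contributes 0% available/reliable, since a failed test cannot count as available, and that all time bins have equal length. The plain average in `naive_summary` is only an illustration and is not claimed to be the formula used for the monthly reports.

```python
from typing import Iterable, NamedTuple


class Contribution(NamedTuple):
    """Percentages that a single status contributes for its time bin."""
    available: float
    reliable: float
    unknown: float


# Hypothetical lookup table built from the status list in this section.
# Assumption: CRITICAL counts as 0% available/reliable (failed tests).
STATUS_CONTRIBUTION = {
    "OK": Contribution(100, 100, 0),
    "CRITICAL": Contribution(0, 0, 0),
    "DOWNTIME": Contribution(0, 100, 0),
    "UNKNOWN": Contribution(0, 0, 100),
    "NODATA": Contribution(0, 0, 100),
}


def contribution(status: str) -> Contribution:
    """Return the contribution of one status (case-insensitive)."""
    return STATUS_CONTRIBUTION[status.upper()]


def naive_summary(statuses: Iterable[str]) -> Contribution:
    """Average the contributions over equal-length time bins.

    An empty input is treated like NODATA (100% unknown).
    """
    rows = [contribution(s) for s in statuses]
    if not rows:
        return contribution("NODATA")
    n = len(rows)
    return Contribution(
        available=sum(r.available for r in rows) / n,
        reliable=sum(r.reliable for r in rows) / n,
        unknown=sum(r.unknown for r in rows) / n,
    )


# Four equal bins: two OK, one DOWNTIME, one UNKNOWN.
print(naive_summary(["OK", "OK", "DOWNTIME", "UNKNOWN"]))
# Contribution(available=50.0, reliable=75.0, unknown=25.0)
```

Keeping the mapping in a single table makes it easy to see that DOWNTIME is the only status where availability and reliability differ.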