Send Alerts Implementing a Custom Producer

As part of the MONIT infrastructure we offer the possibility to inject alert-type documents. Depending on the integration method, these documents are either treated as real alerts and forwarded to notification endpoints (GNI), or just stored in the MONIT infrastructure so that they can be visualized later, like metrics.

Please refer to the first step section to find out what you need to do before using any of the integrations.

GNI

As part of the MONIT infrastructure we have implemented the concept of "GNI" (Generic Notification Infrastructure) and we offer several ways to interact with it. Once an alert has been submitted to GNI, it is delivered to a notification endpoint and generally triggers some kind of notification.

Supported Endpoints

SNOW

It is possible to create ServiceNow incidents from alerts integrated in the MONIT infrastructure. Several "SNOW" fields are supported when integrating an alert, and the name of each field varies depending on the integration method. This section only covers the SNOW concepts; for the exact field names please refer to the integration documentation of the supported producers.

Service Element: SNOW SE of the incident
Functional Element: SNOW FE of the incident
Assignment Level: SNOW assignment level of the incident (as an int)
Hostgroup Grouping: Boolean specifying if the incident needs to group events by hostgroup
Auto Closing: Boolean specifying if the incident should resolve automatically when receiving an OK event (only if Hostgroup Grouping is False)
Notification Type: The alarm type, possible values:
- app: Default type, used for any alert that's specific to application layer
- hw: Used for alerts related to hardware components (disks, memory modules...)
- os: Used for alerts related to OS (cpu/memory utilisation...)
Watchlist: String representing a comma separated list of emails that will form the watchlist of the SNOW incident e.g: "email1,email2,..."

In the Puppet world, alerts generated using Collectd have a set of default values that can be overridden as follows:

cerncollectd::config::snow_alarms_enabled: true
cerncollectd::config::snow_fe: populated from local FE fact
cerncollectd::config::snow_se: the default SE associated to the FE
cerncollectd::config::snow_assignment_level: 3
cerncollectd::config::snow_grouping: true (deprecated in favour of hostgroup_grouping)
cerncollectd::config::snow_hostgroup_grouping: true
cerncollectd::config::snow_auto_closing: false (only when hostgroup_grouping is false)
cerncollectd::config::snow_fe_category: undef
cerncollectd::config::snow_watchlist: undef

Grouping

All alerts are grouped by default, and the infrastructure lets you specify whether alerts are grouped by "entity" or by "hostgroup". To control this selection please use the snow_hostgroup_grouping parameter.

cerncollectd::config::snow_hostgroup_grouping: true

The grouping of alerts follows these rules:
- If the "hostgroup_grouping" field is missing or set to false, events are grouped into an incident by "alarm_name" and "entity".
- If "hostgroup_grouping" is set to true but there is no "submitter_hostgroup", events are grouped by "alarm_name" and "entity".
- If "hostgroup_grouping" is set to true and "submitter_hostgroup" is present, events are grouped by "alarm_name" and "submitter_hostgroup".
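The grouping rules above can be sketched as a small function. This is only an illustration, not the GNI implementation: it assumes the event is available as a flat dictionary whose keys match the quoted field names, which is a simplification made for this example.

```python
from typing import Mapping, Tuple


def grouping_key(event: Mapping[str, object]) -> Tuple[str, str]:
    """Return the (alarm_name, group) pair an event would be grouped under."""
    alarm_name = str(event["alarm_name"])
    hostgroup = event.get("submitter_hostgroup")
    # Group by hostgroup only when the flag is true AND a hostgroup is present.
    if event.get("hostgroup_grouping") is True and hostgroup:
        return (alarm_name, str(hostgroup))
    # Every other case falls back to grouping by entity.
    return (alarm_name, str(event["entity"]))
```

Two events that return the same key would end up in the same incident under these rules.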

Auto closing

Incidents created in SNOW can be closed automatically when an OK event is received that matches an open ticket by "alarm_name" and "entity".

cerncollectd::config::snow_auto_closing: true

These events are only sent to SNOW when the "hostgroup_grouping" option is disabled (set to false), since it is not possible to know when a ticket containing multiple entities should be closed.
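As a rough illustration of the auto-closing condition, the sketch below looks up the incident that an OK event would resolve. The flat event dictionary, the "auto_closing" and "hostgroup_grouping" keys, and the in-memory table of open incidents are all hypothetical and only serve the example; the real matching happens inside GNI and SNOW.

```python
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical in-memory view of open incidents, keyed by (alarm_name, entity).
OpenIncidents = Dict[Tuple[str, str], str]


def incident_to_close(
    event: Mapping[str, object], open_incidents: OpenIncidents
) -> Optional[str]:
    """Return the id of the incident an event would resolve, or None."""
    if event.get("status") != "OK":
        return None
    if not event.get("auto_closing"):
        return None
    # Grouped incidents may span several entities, so they are never auto-closed.
    if event.get("hostgroup_grouping"):
        return None
    key = (str(event["alarm_name"]), str(event["entity"]))
    return open_incidents.get(key)
```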

Email

Email is another supported endpoint for GNI notifications. The following optional parameters are available, listed with their default values where applicable:

email/to: the recipient(s) of the email
email/cc: the address(es) to include as CC
email/bcc: the address(es) to include as BCC
email/reply_to: the email to reply to
email/send_ok: if enabled, emails will be sent upon receiving OK alarms; defaults to false

In the Puppet world only two of these parameters are supported as Hiera variables:

cerncollectd::config::email_alarms_enabled: false
cerncollectd::config::email_to: [] - List of email recipients

SMS

Finally, alerts can also be forwarded as SMS to CERN numbers. The following parameters are offered:

sms/to: the recipient of the text message in international format (restricted to CERN numbers, i.e. +4175411xxxx)
sms/send_ok: if enabled, text messages will be sent upon receiving OK alarms; defaults to false
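A producer may want to check these two parameters before submitting an alert. The sketch below is a client-side convenience under stated assumptions: it reads the documented pattern "+4175411xxxx" literally as the prefix followed by exactly four digits, and it assumes that every non-OK status triggers a message. Neither detail is confirmed by this page.

```python
import re

# Assumption: "+4175411xxxx" means the prefix followed by exactly four digits.
CERN_MOBILE = re.compile(r"\+4175411\d{4}")


def is_cern_mobile(number: str) -> bool:
    """Check that a number matches the documented CERN pattern."""
    return CERN_MOBILE.fullmatch(number) is not None


def should_send_sms(status: str, send_ok: bool = False) -> bool:
    """OK alarms only trigger a message when sms/send_ok is enabled."""
    return status != "OK" or send_ok
```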

Storages

All alerts integrated using the GNI endpoints will be stored by default in our short term OpenSearch cluster, under the monit_prod_alarm_raw_gni* index pattern.

Supported Producers

There are several ways to integrate with GNI depending on the producer. The supported ones are:

JSON+HTTP: Please refer to the JSON Alerts documentation
Grafana: Please refer to the Grafana Alerts documentation
Prometheus Alertmanager: Please refer to the Prometheus Alerts documentation

Inhibitions (Roger)

The GNI infrastructure will always check the status of the alerting entity in Roger. In addition, if an alert is shipped with roger_alarmed=false, all notification endpoints will be ignored and no action will be taken (for example, no ticket will be created in SNOW).

JSON

As with other types of documents, we offer the possibility to submit alerts through an HTTP+JSON endpoint in order to make use of the GNI integration and generate notifications.

Storages

Check the GNI storages section for more information.

Send data

Send your alarms to the HTTP endpoint listening at http://monit-alarms.cern.ch:10011. The available JSON fields are:

(mandatory) timestamp: the timestamp of the alarm (check #1 below)
(mandatory) source: the name of your dataset/producer
(mandatory) alarm_name: the name of the alarm
(mandatory) entities: the entity being alarmed, such as hosts, services, etc
(mandatory) status: the alarm state, possible values: OK, WARNING, FAILURE (check #2 below)
(optional) metric: the name of the metric that generated the alarm
(optional) summary: a short summary of the alarm; if provided it is used as the email/ticket subject
(optional) description: a more verbose description of the alarm
(optional) correlation: the condition that was met to generate the alarm
(optional) troubleshooting: A link providing troubleshooting details
(optional) targets: An array with information about where to send the alarm, possible values: snow, email and sms e.g: ["snow"]
(optional) snow/service_element: the SNOW SE of the incident
(optional) snow/functional_element: the SNOW FE of the incident
(optional) snow/assignment_level: the SNOW assignment level of the incident (as an int)
(optional) snow/grouping: (deprecated, use hostgroup_grouping) if the incident should be grouped by hostgroup
(optional) snow/hostgroup_grouping: if the incident should be grouped by hostgroup
(optional) snow/auto_closing: if grouping is disabled, OK alerts automatically close tickets
(optional) snow/notification_type: the alarm type, possible values: app (default), hw, os
(optional) snow/watchlist: String representing a comma separated list of emails that will form the watchlist of the SNOW incident e.g: "email1,email2,..."
(optional) email/to: the recipient(s) of the email
(optional) email/cc: the address(es) to include as CC
(optional) email/bcc: the address(es) to include as BCC
(optional) email/reply_to: the email to reply to
(optional) email/send_ok: if enabled, emails will be sent upon receiving OK alarms; defaults to false
(optional) sms/to: the recipient of the text message in international format (restricted to CERN numbers, i.e. +4175411xxxx)
(optional) sms/send_ok: if enabled, text messages will be sent upon receiving OK alarms; defaults to false
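To make the field list more concrete, here is a minimal sketch that assembles one alert document. Several details are assumptions made for illustration: that "snow/..." and "email/..." keys map to nested JSON objects, that entities and email/to can be plain strings, and all sample values (producer name, alarm name, host, FE, address), which are hypothetical.

```python
import json
import time
from typing import Any, Dict, List, Optional


def build_alert(
    source: str,
    alarm_name: str,
    entities: str,
    status: str,
    summary: Optional[str] = None,
    targets: Optional[List[str]] = None,
    snow: Optional[Dict[str, Any]] = None,
    email: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble one alert document from the fields listed above."""
    alert: Dict[str, Any] = {
        "timestamp": int(time.time() * 1000),  # UTC epoch milliseconds, integer
        "source": source,
        "alarm_name": alarm_name,
        "entities": entities,
        "status": status,
    }
    # Optional fields are only added when a value was provided.
    if summary is not None:
        alert["summary"] = summary
    if targets:
        alert["targets"] = targets
    # Assumption: "snow/..." and "email/..." keys become nested objects.
    if snow:
        alert["snow"] = snow
    if email:
        alert["email"] = email
    return alert


example = build_alert(
    source="my_producer",  # hypothetical producer name
    alarm_name="disk_almost_full",
    entities="myhost.cern.ch",
    status="WARNING",
    summary="Disk usage above 90%",
    targets=["snow", "email"],
    snow={"functional_element": "My FE", "assignment_level": 3, "hostgroup_grouping": False},
    email={"to": "someone@example.org"},
)
print(json.dumps(example, indent=2))
```

Serialising with json.dumps guarantees double-quoted keys and strings in the output.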

Please pay attention to the following:

1. All timestamps must be in UTC milliseconds or seconds, without any fractional part
2. FAILURE status can also be sent as CRITICAL or ERROR and will be converted internally
3. Use double quotes, not single quotes (single quotes are not valid in JSON)
4. Anything that is considered metadata for the infrastructure will be promoted to the metadata field in the produced JSON, and the rest will be put inside data
5. The alarm must be a JSON document with the fields above
6. When possible, please send multiple documents in the same batch by grouping them in a JSON array
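The notes above can be applied on the client side before sending. The following sketch uses only the Python standard library. The POST method and the Content-Type header are assumptions, since this page only gives the endpoint URL. The status aliasing is optional because the infrastructure converts CRITICAL and ERROR internally anyway.

```python
import json
import urllib.request
from typing import Any, Dict, List

MONIT_ALARMS_URL = "http://monit-alarms.cern.ch:10011"  # endpoint from this page
STATUS_ALIASES = {"CRITICAL": "FAILURE", "ERROR": "FAILURE"}


def normalise(alert: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with an integer timestamp and a canonical status."""
    fixed = dict(alert)
    fixed["timestamp"] = int(alert["timestamp"])  # drop any fractional part
    status = str(alert["status"]).upper()
    fixed["status"] = STATUS_ALIASES.get(status, status)
    return fixed


def encode_batch(alerts: List[Dict[str, Any]]) -> bytes:
    """Serialise several alerts as one JSON array (double-quoted by json.dumps)."""
    return json.dumps([normalise(a) for a in alerts]).encode("utf-8")


def send_batch(alerts: List[Dict[str, Any]], url: str = MONIT_ALARMS_URL) -> int:
    """POST the batch and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=encode_batch(alerts),
        # Assumption: the endpoint expects a JSON content type.
        headers={"Content-Type": "application/json; charset=UTF-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling send_batch with a list of alert dictionaries submits them in a single request, which follows the recommendation to batch documents in a JSON array.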

Prometheus Alertmanager

Alerting with Prometheus is separated into two parts. Alerting rules in Prometheus servers send alerts to an Alertmanager. The Alertmanager then manages those alerts, including silencing, inhibition, aggregation and sending out notifications via methods such as email, on-call notification systems, and chat platforms.

Storages

When sending data using Prometheus Alertmanager, documents are stored in OpenSearch under two different index patterns:
- monit_prod_alertmanager_raw_alerts: contains the full JSON representation of the alert as provided by Prometheus Alertmanager
- monit_prod_alarm_raw_gni: if the alert has been integrated with GNI, contains the GNI representation of your alert

Send data

The alerts produced by Prometheus and sent by Alertmanager should contain the labels described by the schema below. Please note that configuring any of the SNOW parameters will trigger GNI integration.

(mandatory) timestamp: the timestamp of the alarm (check #1 below)
(default) source: the source will be set to "prometheus"
(mandatory) alarm_name: the name of the alarm
(default) entities: for Prometheus, entities is composed by joining together whichever of the following fields are available in the message:
"instance", "pod_name", "pod", "cluster", "service", "producer"
(mandatory) status: the alarm state, possible values: "resolved" -> OK, "firing" -> FAILURE
(optional) summary: a short summary of the alarm; if provided it is used as the ticket subject
(optional) description: a more verbose description of the alarm
(optional) submitter_environment: environment from where the alarm was fired
(optional) submitter_hostgroup: hostgroup from where the alarm was fired
(optional) troubleshooting: A link providing troubleshooting details
(optional) roger_alarmed: if the alarm is masked by roger or not
(optional) snow_service_element: the SNOW SE of the incident
(optional) snow_functional_element: the SNOW FE of the incident
(optional) snow_assignment_level: the SNOW assignment level of the incident (as an int)
(optional) snow_hostgroup_grouping: if the incident should be grouped by hostgroup
(optional) snow_auto_closing: if grouping is disabled, OK alerts automatically close tickets
(optional) snow_notification_type: the alarm type, possible values: app (default), hw, os
(optional) snow_watchlist: list of emails that will form the watchlist of the SNOW incident
(optional) snow_troubleshooting: used historically by nocontacts, will be overridden by "troubleshooting" if specified
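The mapping described in this list can be approximated as follows. This is a sketch of the idea and not the code MONIT runs. It assumes the standard Alertmanager webhook shape for a single alert (a "status" string and a "labels" object). The "_" separator, the join order, and the fallback from an "alarm_name" label to the standard "alertname" label are guesses made for illustration.

```python
from typing import Any, Dict, Mapping

STATUS_MAP = {"resolved": "OK", "firing": "FAILURE"}
ENTITY_LABELS = ("instance", "pod_name", "pod", "cluster", "service", "producer")


def to_gni_fields(alert: Mapping[str, Any], separator: str = "_") -> Dict[str, Any]:
    """Translate one Alertmanager webhook alert into the fields listed above."""
    labels = alert.get("labels", {})
    # Join whichever entity labels are present (order and separator are assumed).
    entities = separator.join(labels[name] for name in ENTITY_LABELS if name in labels)
    fields: Dict[str, Any] = {
        "source": "prometheus",
        # Assumption: fall back to the standard "alertname" label.
        "alarm_name": labels.get("alarm_name", labels.get("alertname")),
        "entities": entities,
        "status": STATUS_MAP[alert["status"]],
    }
    # Pass through the optional snow_* labels unchanged.
    fields.update({k: v for k, v in labels.items() if k.startswith("snow_")})
    return fields
```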

To integrate alerts with the MONIT infrastructure you need to configure a new receiver pointing to the MONIT endpoint. Here is a working example:

global:
  resolve_timeout: 5m
route:
  receiver: default-receiver
  group_by:
  - alertname
  - cluster
  - service
  - pod_name
  routes:
  - match:
      monit_forward: "true"
    receiver: cern-central-monitoring
    continue: true
  group_wait: 1m
  group_interval: 5m
receivers:
- name: default-receiver
- name: cern-central-monitoring
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: http://monit-alarms.cern.ch:10014