Send Alerts Implementing a Custom Producer
As part of the MONIT infrastructure we offer the possibility to inject alert-type documents. Depending on the integration method, these documents can be treated as real alerts and forwarded to notification endpoints (GNI), or simply stored in the MONIT infrastructure so that they can be visualized later, like metrics.
Please refer to the first step section to know what you should do before using any of the integrations.
GNI
As part of the MONIT infrastructure we have implemented the concept of "GNI" (Generic Notification Infrastructure), and we offer several ways to interact with it. Once an alert has been submitted to GNI, it arrives at a notification endpoint and generally generates some kind of notification.
Supported Endpoints
SNOW
It is possible to create ServiceNow incidents from alerts integrated in the MONIT infrastructure. Several "SNOW" fields are supported when integrating an alert, but their names vary depending on the integration method. This section covers only the SNOW concepts; for the exact field names, please refer to the integration documentation of the supported producers.
- Service Element: SNOW SE of the incident
- Functional Element: SNOW FE of the incident
- Assignment Level: SNOW assignment level of the incident (as an int)
- Hostgroup Grouping: Boolean specifying if the incident needs to group events by hostgroup
- Auto Closing: Boolean specifying if the incident should resolve automatically when receiving an OK event (only if Hostgroup Grouping is False)
- Notification Type: The alarm type, possible values:
- app: Default type, used for any alert that's specific to application layer
- hw: Used for alerts related to hardware components (disks, memory modules...)
- os: Used for alerts related to OS (cpu/memory utilisation...)
- Watchlist: String representing a comma separated list of emails that will form the watchlist of the SNOW incident e.g: "email1,email2,..."
In the Puppet world, alerts generated using Collectd have a set of default values that can be overridden as follows:
- cerncollectd::config::snow_alarms_enabled: true
- cerncollectd::config::snow_fe: populated from local FE fact
- cerncollectd::config::snow_se: the default SE associated to the FE
- cerncollectd::config::snow_assignment_level: 3
- cerncollectd::config::snow_grouping: true (deprecated in favour of hostgroup_grouping)
- cerncollectd::config::snow_hostgroup_grouping: true
- cerncollectd::config::snow_auto_closing: false (only when hostgroup_grouping is false)
- cerncollectd::config::snow_fe_category: undef
- cerncollectd::config::snow_watchlist: undef
Grouping
All alerts are grouped by default, and the infrastructure lets you specify whether alerts are grouped by "entity" or "hostgroup". To control this selection, please use the snow_hostgroup_grouping parameter.
The grouping of alerts follows these rules:
- If there is no "hostgroup_grouping" field, or it is set to false, the events will be grouped in an incident by "alarm_name" and "entity".
- If "hostgroup_grouping" is set to true and there is no "submitter_hostgroup", the events will be grouped by "alarm_name" and "entity".
- If "hostgroup_grouping" is set to true and there is a "submitter_hostgroup", the events will be grouped by "alarm_name" and "submitter_hostgroup".
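The grouping rules can be summarised in a short Python sketch. The function name `grouping_key` and the flat dictionary layout of the event are assumptions made for illustration; only the field names are taken from the rules, and this is not the actual GNI implementation.

```python
def grouping_key(event: dict) -> tuple:
    """Return the pair of values an event would be grouped by.

    Hypothetical helper: the event is assumed to be a flat dict holding
    "alarm_name", "entity" and, optionally, "hostgroup_grouping" and
    "submitter_hostgroup".
    """
    hostgroup = event.get("submitter_hostgroup")
    # Group by hostgroup only when the flag is true AND a hostgroup is present.
    if event.get("hostgroup_grouping") is True and hostgroup:
        return (event["alarm_name"], hostgroup)
    # Missing flag, false flag, or missing hostgroup: group by entity.
    return (event["alarm_name"], event["entity"])
```

Two events that yield the same key would end up in the same incident under these rules.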
Auto closing
Incidents created in SNOW can be closed automatically when an OK event is received that matches an open ticket by "alarm_name" and "entity".
These events are only sent to SNOW when the "hostgroup_grouping" option is disabled (set to false), since it is not possible to know when a ticket containing multiple entities should be closed.
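As a rough illustration of the auto-closing conditions, the following Python sketch checks whether an OK event would resolve an open incident. The function name `should_auto_close`, the flat event dictionary and the representation of open tickets as a set of ("alarm_name", "entity") pairs are all assumptions, not the real GNI logic.

```python
def should_auto_close(event: dict, open_tickets: set) -> bool:
    """Return True if this event would resolve a matching open incident.

    Hypothetical helper: open_tickets is assumed to be a set of
    (alarm_name, entity) pairs for incidents that are currently open.
    """
    if event.get("status") != "OK":
        return False  # only OK events can close a ticket
    if event.get("hostgroup_grouping"):
        return False  # grouped tickets may cover several entities
    if not event.get("auto_closing"):
        return False  # auto closing has to be enabled explicitly
    return (event["alarm_name"], event["entity"]) in open_tickets
```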
Email
Another supported endpoint for GNI notifications is email. To enable it, the following optional parameters can be set (defaults are noted where they apply):
- email/to: the recipient(s) of the email
- email/cc: the address(es) to include as CC
- email/bcc: the address(es) to include as BCC
- email/reply_to: the email to reply to
- email/send_ok: if enabled, emails will be sent upon receiving OK alarms; defaults to false
In the Puppet world only two of these parameters are supported as Hiera variables:
- cerncollectd::config::email_alarms_enabled: false
- cerncollectd::config::email_to: [] - List of email recipients
SMS
Finally, alerts can also be forwarded as SMS to CERN numbers. To do this, the following parameters are offered:
- sms/to: the recipient of the text message in international format (restricted to CERN numbers, i.e. +4175411xxxx)
- sms/send_ok: if enabled, text messages will be sent upon receiving OK alarms; defaults to false
Storages
All alerts integrated using the GNI endpoints will be stored by default in our short term OpenSearch cluster, under the monit_prod_alarm_raw_gni* index pattern.
Supported Producers
There are several ways to integrate with GNI depending on the producer. The supported ones are:
- JSON+HTTP: Please refer to the JSON Alerts documentation
- Grafana: Please refer to the Grafana Alerts documentation
- Prometheus Alertmanager: Please refer to the Prometheus Alerts documentation
Inhibitions (Roger)
The GNI infrastructure will always check the status of the alerting entity in Roger. On top of that, if an alert is shipped with roger_alarmed=false, all notification endpoints will be ignored and no action will be taken (for example, no ticket will be created in SNOW).
JSON
As with other types of documents, we offer the possibility to submit alerts through an HTTP+JSON endpoint in order to make use of the GNI integration and generate notifications.
Storages
Check the GNI storages section for more information.
Send data
Send your alarms to the HTTP endpoint listening at http://monit-alarms.cern.ch:10011. The available JSON fields are:
- (mandatory) timestamp: the timestamp of the alarm (check #1 below)
- (mandatory) source: the name of your dataset/producer
- (mandatory) alarm_name: the name of the alarm
- (mandatory) entities: the entity being alarmed, such as hosts, services, etc
- (mandatory) status: the alarm state, possible values: OK, WARNING, FAILURE (check #2 below)
- (optional) metric: the name of the metric that generated the alarm
- (optional) summary: a short summary of the alarm; if provided it is used as the email/ticket subject
- (optional) description: a more verbose description of the alarm
- (optional) correlation: the condition that was met to generate the alarm
- (optional) troubleshooting: A link providing troubleshooting details
- (optional) targets: An array with information about where to send the alarm, possible values: snow, email and sms e.g: ["snow"]
- (optional) snow/service_element: the SNOW SE of the incident
- (optional) snow/functional_element: the SNOW FE of the incident
- (optional) snow/assignment_level: the SNOW assignment level of the incident (as an int)
- (optional) snow/grouping: (deprecated, use hostgroup_grouping) if the incident should be grouped by hostgroup
- (optional) snow/hostgroup_grouping: if the incident should be grouped by hostgroup
- (optional) snow/auto_closing: if grouping is disabled, OK alerts automatically close tickets
- (optional) snow/notification_type: the alarm type, possible values: app (default), hw, os
- (optional) snow/watchlist: String representing a comma separated list of emails that will form the watchlist of the SNOW incident e.g: "email1,email2,..."
- (optional) email/to: the recipient(s) of the email
- (optional) email/cc: the address(es) to include as CC
- (optional) email/bcc: the address(es) to include as BCC
- (optional) email/reply_to: the email to reply to
- (optional) email/send_ok: if enabled, emails will be sent upon receiving OK alarms; defaults to false
- (optional) sms/to: the recipient of the text message in international format (restricted to CERN numbers, i.e. +4175411xxxx)
- (optional) sms/send_ok: if enabled, text messages will be sent upon receiving OK alarms; defaults to false
Please pay attention to the following:
1. All timestamps must be in UTC milliseconds or seconds, without any decimal part.
2. FAILURE status can also be sent as CRITICAL or ERROR and will be converted internally.
3. Use double quotes, not single quotes (single quotes are not valid in JSON).
4. Anything that is considered metadata for the infrastructure will be promoted to the metadata field in the produced JSON, and the rest will be put inside data.
5. The alarm must be a JSON document with the fields above.
6. When possible, please send multiple documents in the same batch by grouping them in a JSON array.
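The following Python sketch shows one way a custom producer could build such documents and submit them as a single JSON array, using only the standard library. Only the endpoint URL and the field names come from this section. The helper names, the use of POST with a `Content-Type: application/json` header, and passing `entities` as a plain string are assumptions to check against your own setup.

```python
import json
import time
import urllib.request

MONIT_ALARMS_URL = "http://monit-alarms.cern.ch:10011"  # endpoint from this section

# Aliases that the infrastructure converts to FAILURE; normalising them
# on the client side is optional and done here only for clarity.
_STATUS_ALIASES = {"CRITICAL": "FAILURE", "ERROR": "FAILURE"}
_VALID_STATUSES = {"OK", "WARNING", "FAILURE"}


def build_alarm(source, alarm_name, entities, status, summary=None, timestamp_ms=None):
    """Build one alarm document containing the mandatory fields (hypothetical helper)."""
    status = _STATUS_ALIASES.get(status.upper(), status.upper())
    if status not in _VALID_STATUSES:
        raise ValueError(f"unsupported status: {status}")
    if timestamp_ms is None:
        # Epoch time in milliseconds as an integer, so no decimal part.
        timestamp_ms = int(time.time() * 1000)
    document = {
        "timestamp": timestamp_ms,
        "source": source,
        "alarm_name": alarm_name,
        "entities": entities,  # assumption: a single entity passed as a string
        "status": status,
    }
    if summary is not None:
        document["summary"] = summary
    return document


def encode_batch(alarms):
    """Serialise several alarms as one JSON array (json.dumps emits double quotes)."""
    return json.dumps(list(alarms)).encode("utf-8")


def send_alarms(alarms, url=MONIT_ALARMS_URL, timeout=10):
    """POST the batch to the endpoint and return the HTTP status code.

    Assumption: the endpoint accepts a POST request with a JSON body.
    """
    request = urllib.request.Request(
        url,
        data=encode_batch(alarms),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A producer would typically collect a few documents with `build_alarm` and pass the list to `send_alarms` in one call rather than sending them one by one.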
Prometheus Alertmanager
Alerting with Prometheus is separated into two parts. Alerting rules in Prometheus servers send alerts to an Alertmanager. The Alertmanager then manages those alerts, including silencing, inhibition, aggregation and sending out notifications via methods such as email, on-call notification systems, and chat platforms.
Storages
When sending data using Prometheus Alertmanager, documents will be stored in OpenSearch under two different index patterns:
- monit_prod_alertmanager_raw_alerts: contains the full JSON representation of the alert as provided by Prometheus Alertmanager
- monit_prod_alarm_raw_gni: if the integration with GNI has been made, contains the GNI representation of your alert
Send data
The alerts produced by Prometheus and sent by the Alertmanager should contain the labels described by the following schema. Please note that configuring any of the SNOW parameters will trigger the GNI integration.
- (mandatory) timestamp: the timestamp of the alarm (check #1 below)
- (default) source: the source will be set to "prometheus"
- (mandatory) alarm_name: the name of the alarm
- (default) entities: for Prometheus, entities are composed by joining together any of the following fields available in the message:
- "instance", "pod_name", "pod", "cluster", "service", "producer"
- (mandatory) status: the alarm state, possible values: "resolved" -> OK, "firing" -> FAILURE
- (optional) summary: a short summary of the alarm; if provided it is used as the ticket subject
- (optional) description: a more verbose description of the alarm
- (optional) submitter_environment: environment from where the alarm was fired
- (optional) submitter_hostgroup: hostgroup from where the alarm was fired
- (optional) troubleshooting: A link providing troubleshooting details
- (optional) roger_alarmed: if the alarm is masked by roger or not
- (optional) snow_service_element: the SNOW SE of the incident
- (optional) snow_functional_element: the SNOW FE of the incident
- (optional) snow_assignment_level: the SNOW assignment level of the incident (as an int)
- (optional) snow_hostgroup_grouping: if the incident should be grouped by hostgroup
- (optional) snow_auto_closing: if grouping is disabled, OK alerts automatically close tickets
- (optional) snow_notification_type: the alarm type, possible values: app (default), hw, os
- (optional) snow_watchlist: list of emails that will form the watchlist of the SNOW incident
- (optional) snow_troubleshooting: used historically by nocontacts, will be overridden by "troubleshooting" if specified
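To make the default mappings in this schema more concrete, here is a small Python sketch that derives the "source", "status" and "entities" values from a single alert in the Alertmanager webhook format (a dictionary with "status" and "labels" keys). The function name `to_gni_fields`, the "_" separator and the order in which labels are joined are assumptions; the source only states that the listed fields are joined together.

```python
# Labels that may contribute to the entity, as listed in the schema.
ENTITY_LABELS = ("instance", "pod_name", "pod", "cluster", "service", "producer")

# Alertmanager status mapped to the alarm status.
STATUS_MAP = {"resolved": "OK", "firing": "FAILURE"}


def to_gni_fields(alert: dict, separator: str = "_") -> dict:
    """Derive the default fields from one Alertmanager alert (hypothetical helper)."""
    labels = alert.get("labels", {})
    # Keep only the entity labels that are present and non-empty.
    parts = [labels[name] for name in ENTITY_LABELS if labels.get(name)]
    return {
        "source": "prometheus",
        "status": STATUS_MAP[alert["status"]],
        # Assumption: the separator and join order are illustrative only.
        "entities": separator.join(parts),
    }
```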
You will need to configure a new receiver with the endpoint to integrate alerts with the MONIT infrastructure. Here is a working example:
global:
  resolve_timeout: 5m
route:
  receiver: default-receiver
  group_by:
    - alertname
    - cluster
    - service
    - pod_name
  routes:
    - match:
        monit_forward: "true"
      receiver: cern-central-monitoring
      continue: true
  group_wait: 1m
  group_interval: 5m
receivers:
  - name: default-receiver
  - name: cern-central-monitoring
    webhook_configs:
      - send_resolved: true
        http_config: {}
        url: http://monit-alarms.cern.ch:10014