GNI

As part of the MONIT infrastructure, we have implemented GNI (Generic Notification Infrastructure) and we offer several ways to interact with it.

Once an alert has been submitted to the GNI infrastructure, it is stored in OpenSearch and, depending on its content, is also forwarded to one or more of the supported targets.

Document schema

When integrating with GNI, the shaped document has the following set of fields available. The names of the labels to submit may vary depending on the integration mechanism; check the documentation of the supported integrations for details.

alarm_name: The name of the alarm
alarm_type: The alarm type, possible values:
- app: Default type, used for any alert that's specific to application layer
- hw: Used for alerts related to hardware components (disks, memory modules...)
- os: Used for alerts related to OS (cpu/memory utilisation...)
collectd_namespace: Information about the namespace, if the alert was generated by Collectd
correlation: The condition that was met to generate the alarm
description: A more verbose (than summary) description of the alarm
entities: The entity being alarmed, such as hosts, services, etc.
metrics: The name of the metrics that triggered the alert
roger_alarmed: Whether the entity is alarmed in the Roger service; when false, the alert is masked and no notifications are sent
source: The name of your dataset/producer
status: The alarm state, possible values: OK, WARNING, FAILURE
status_flag: A numeric value for the state: OK -> 1, WARNING -> 0.5, FAILURE -> 0
summary: A short summary of the alarm; if not available, the alertname is used instead
targets: An array listing where to send the alarm; possible values: snow, email and sms, e.g. ["snow"]
troubleshooting: A link providing troubleshooting details
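To illustrate the schema, the sketch below assembles an alert document as a plain Python dictionary, deriving status_flag from status and falling back to the alarm name when no summary is given. It is only an illustration: the helper build_alert is hypothetical, the fallback assumes that "alertname" corresponds to alarm_name, and the actual payload you submit depends on the integration mechanism you use.

```python
from typing import Any, Dict, Iterable, Optional

# Mapping documented for the status_flag field.
STATUS_FLAGS = {"OK": 1, "WARNING": 0.5, "FAILURE": 0}
ALARM_TYPES = {"app", "hw", "os"}
TARGETS = {"snow", "email", "sms"}


def build_alert(
    alarm_name: str,
    status: str,
    entities: str,
    source: str,
    summary: Optional[str] = None,
    alarm_type: str = "app",
    targets: Iterable[str] = (),
    **extra: Any,
) -> Dict[str, Any]:
    """Hypothetical helper returning a GNI-shaped alert document."""
    if status not in STATUS_FLAGS:
        raise ValueError(f"unsupported status: {status!r}")
    if alarm_type not in ALARM_TYPES:
        raise ValueError(f"unsupported alarm_type: {alarm_type!r}")
    targets = list(targets)
    unknown = set(targets) - TARGETS
    if unknown:
        raise ValueError(f"unsupported targets: {sorted(unknown)}")
    doc = {
        "alarm_name": alarm_name,
        "alarm_type": alarm_type,
        "status": status,
        "status_flag": STATUS_FLAGS[status],
        "entities": entities,
        "source": source,
        # Assumption: the "alertname" fallback is the alarm name.
        "summary": summary if summary else alarm_name,
        "targets": targets,
    }
    # Optional fields such as description, metrics or troubleshooting.
    doc.update(extra)
    return doc


alert = build_alert(
    "disk_full", "WARNING", "myhost01", "my_producer",
    targets=["snow"], description="Root partition above 90%",
)
print(alert["status_flag"], alert["summary"])  # 0.5 disk_full
```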

In addition, there are target-specific fields, which appear in a nested object per target (e.g. "{snow: {}, email: {}, sms: {}}"):

snow:
- service_element: the SNOW SE of the incident
- functional_element: the SNOW FE of the incident
- assignment_level: the SNOW assignment level of the incident (as an int)
- grouping: (deprecated, use hostgroup_grouping) if the incident should be grouped by hostgroup
- hostgroup_grouping: if the incident should be grouped by hostgroup
- auto_closing: if hostgroup grouping is disabled, OK alerts automatically close the matching tickets
- notification_type: the alarm type, possible values: app (default), hw, os
- watchlist: a string containing a comma-separated list of emails that will form the watchlist of the SNOW incident, e.g. "email1,email2,..."
email:
- to: the recipient(s) of the email
- cc: the address(es) to include as CC
- bcc: the address(es) to include as BCC
- reply_to: the email to reply to
- send_ok: if enabled, emails will be sent upon receiving OK alarms; defaults to false
sms:
- to: the recipient of the text message in international format (restricted to CERN numbers, i.e. +4175411xxxx)
- send_ok: if enabled, text messages will be sent upon receiving OK alarms; defaults to false
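As an illustration of the nested fields, the sketch below attaches target-specific sections to an alert and decides whether the email or sms target should be notified. It assumes a target is only used when it is listed in targets and applies send_ok as documented (false unless set). The function names and all field values are hypothetical placeholders; SNOW handling of OK alerts is left to the auto-closing logic described later.

```python
from typing import Any, Dict, List

alert: Dict[str, Any] = {
    "alarm_name": "disk_full",
    "status": "OK",
    "entities": "myhost01",
    "source": "my_producer",
    "targets": ["snow", "email", "sms"],
    "snow": {
        "service_element": "My Service",      # placeholder value
        "functional_element": "My Function",  # placeholder value
        "assignment_level": 2,
        "hostgroup_grouping": False,
        "watchlist": "alice@cern.ch,bob@cern.ch",
    },
    "email": {"to": "team@cern.ch", "send_ok": True},
    "sms": {"to": "+41754111234"},  # send_ok not set, so it defaults to false
}


def sends_notification(doc: Dict[str, Any], target: str) -> bool:
    """Hypothetical check for the email and sms targets."""
    if target not in doc.get("targets", []):
        return False
    if doc.get("status") == "OK":
        # OK alarms are only forwarded when send_ok is enabled (default false).
        return bool(doc.get(target, {}).get("send_ok", False))
    return True


def watchlist(doc: Dict[str, Any]) -> List[str]:
    """Split the comma-separated SNOW watchlist into individual emails."""
    raw = doc.get("snow", {}).get("watchlist", "")
    return [email.strip() for email in raw.split(",") if email.strip()]


print(sends_notification(alert, "email"))  # True
print(sends_notification(alert, "sms"))    # False
```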

Supported targets

SNOW

It is possible to create ServiceNow incidents from alerts integrated in the MONIT infrastructure.

Ticket formatting

Short Description: *summary*

Description: most lines only appear if the corresponding label is available

  Alarm: *alarm_name*
  Metrics: *metrics*
  Status: *status*
  Entities: *entities*
  Producer: *source*
  Grouping: *Hostgroup if hostgroup_grouping, else Entities*
  Hostgroup: *submitter_hostgroup*
  Collectd namespace: *collectd_namespace*
  *description*
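The template above can be sketched as a function that emits each line only when its label has a value. This is a minimal illustration, not the actual GNI renderer: snow_ticket and _fmt are hypothetical names, list values are assumed to be joined with commas, and the Grouping line is assumed to print the words "Hostgroup" or "Entities" depending on hostgroup_grouping.

```python
from typing import Any, Dict, Optional


def _fmt(value: Any) -> Optional[str]:
    """Join list values with commas; return None for missing/empty values."""
    if value is None or value == "" or value == []:
        return None
    if isinstance(value, (list, tuple)):
        return ", ".join(str(v) for v in value)
    return str(value)


def snow_ticket(doc: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical renderer for the SNOW short description and description."""
    by_hostgroup = bool(doc.get("snow", {}).get("hostgroup_grouping"))
    rows = [
        ("Alarm", doc.get("alarm_name")),
        ("Metrics", doc.get("metrics")),
        ("Status", doc.get("status")),
        ("Entities", doc.get("entities")),
        ("Producer", doc.get("source")),
        # Assumption: the Grouping line names the grouping mode in use.
        ("Grouping", "Hostgroup" if by_hostgroup else "Entities"),
        ("Hostgroup", doc.get("submitter_hostgroup")),
        ("Collectd namespace", doc.get("collectd_namespace")),
    ]
    lines = [f"{label}: {_fmt(value)}" for label, value in rows if _fmt(value)]
    description = _fmt(doc.get("description"))
    if description:
        lines.append(description)  # free-text description goes last
    return {
        "short_description": doc.get("summary") or doc.get("alarm_name", ""),
        "description": "\n".join(lines),
    }


ticket = snow_ticket({
    "alarm_name": "disk_full",
    "status": "FAILURE",
    "entities": ["myhost01", "myhost02"],
    "source": "my_producer",
    "summary": "Disk almost full",
})
print(ticket["description"])
```

With no metrics, hostgroup or namespace in the input, those lines are simply left out of the description.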

Grouping

All alerts are grouped by default, and you can choose whether they are grouped by "entity" or by "hostgroup". To control this selection, use the snow_hostgroup_grouping parameter.

The grouping of alerts for SNOW follows these rules:
- If there is no "hostgroup_grouping" field, or it is set to false, the events are grouped in an incident by "alarm_name" and "entity".
- If "hostgroup_grouping" is set to true but there is no "submitter_hostgroup", the events are grouped by "alarm_name" and "entity".
- If "hostgroup_grouping" is set to true and "submitter_hostgroup" is present, the events are grouped by "alarm_name" and "submitter_hostgroup".
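The three grouping rules can be expressed as a function that returns the key identifying an incident. This is a sketch under stated assumptions: snow_grouping_key is a hypothetical name, hostgroup_grouping is read from the nested snow section, submitter_hostgroup is assumed to be a top-level field, and list-valued entities are converted to tuples only so the key stays hashable.

```python
from collections import defaultdict
from typing import Any, Dict, Tuple


def snow_grouping_key(doc: Dict[str, Any]) -> Tuple[str, Any]:
    """Return the (alarm_name, group) pair that identifies a SNOW incident."""
    hostgroup_grouping = bool(doc.get("snow", {}).get("hostgroup_grouping", False))
    hostgroup = doc.get("submitter_hostgroup")
    if hostgroup_grouping and hostgroup:
        # Grouping by hostgroup is enabled and the hostgroup is known.
        return (doc["alarm_name"], hostgroup)
    # No/false hostgroup_grouping, or no submitter_hostgroup: group by entity.
    entities = doc.get("entities")
    if isinstance(entities, list):
        entities = tuple(entities)  # keep the key hashable
    return (doc["alarm_name"], entities)


events = [
    {"alarm_name": "disk_full", "entities": "h1", "submitter_hostgroup": "hg",
     "snow": {"hostgroup_grouping": True}},
    {"alarm_name": "disk_full", "entities": "h2", "submitter_hostgroup": "hg",
     "snow": {"hostgroup_grouping": True}},
    {"alarm_name": "disk_full", "entities": "h3"},
]

# Collect events into incidents keyed by the grouping rules.
incidents = defaultdict(list)
for event in events:
    incidents[snow_grouping_key(event)].append(event)
print(len(incidents))  # 2
```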

Auto closing

Incidents created in SNOW can be closed automatically when an OK event is received that matches an open ticket by "alarm_name" and "entity".

These events are only sent to SNOW when the "hostgroup_grouping" option is disabled (set to false), since it is not possible to know when a ticket containing multiple entities should be closed.
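A minimal sketch of the auto-closing decision follows. The ticket record is hypothetical (a dictionary holding the alarm_name and entity the incident was opened for), and it is an assumption that auto_closing must be explicitly enabled; the real service may store and match tickets differently.

```python
from typing import Any, Dict


def closes_ticket(alert: Dict[str, Any], ticket: Dict[str, Any]) -> bool:
    """Decide whether an incoming alert should close an open SNOW ticket."""
    snow = alert.get("snow", {})
    if alert.get("status") != "OK":
        return False
    # Assumption: auto_closing has to be enabled explicitly.
    if not snow.get("auto_closing", False):
        return False
    # Hostgroup-grouped tickets may hold several entities: never auto-close.
    if snow.get("hostgroup_grouping", False):
        return False
    # Match the open ticket by alarm_name and entity.
    return (
        alert.get("alarm_name") == ticket.get("alarm_name")
        and alert.get("entities") == ticket.get("entity")
    )


alert = {
    "alarm_name": "disk_full",
    "entities": "h1",
    "status": "OK",
    "snow": {"auto_closing": True, "hostgroup_grouping": False},
}
ticket = {"alarm_name": "disk_full", "entity": "h1"}  # hypothetical open ticket
print(closes_ticket(alert, ticket))  # True
```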

Email

GNI notifications can also be delivered by email.

Email formatting

Subject: *summary* on *entities*

Body: most lines only appear if the corresponding label is available

  Exception: *alarm_name*
  Status: *status*
  Entities: *entities*
  Producer: *source*
  Collectd namespace: *collectd_namespace*
  Troubleshooting: *troubleshooting*
  Correlation: *correlation*
  Grouping: *Hostgroup if hostgroup_grouping, else Entities*
  Hostgroup: *submitter_hostgroup*
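The subject, body and nested email fields can be combined into a message with Python's standard email.message.EmailMessage. This sketch only builds the message and does not send it. build_email is a hypothetical name, list values are assumed to be comma-joined, and reading hostgroup_grouping from the snow section for the Grouping line is an assumption.

```python
from email.message import EmailMessage
from typing import Any, Dict, Optional


def _fmt(value: Any) -> Optional[str]:
    """Join list values with commas; return None for missing/empty values."""
    if value is None or value == "" or value == []:
        return None
    if isinstance(value, (list, tuple)):
        return ", ".join(str(v) for v in value)
    return str(value)


def build_email(doc: Dict[str, Any]) -> EmailMessage:
    """Hypothetical builder following the subject/body template above."""
    opts = doc.get("email", {})
    msg = EmailMessage()
    summary = doc.get("summary") or doc.get("alarm_name", "")
    entities = _fmt(doc.get("entities"))
    msg["Subject"] = f"{summary} on {entities}" if entities else summary
    # Recipient headers come from the nested email section when present.
    for header, key in (("To", "to"), ("Cc", "cc"), ("Bcc", "bcc"), ("Reply-To", "reply_to")):
        value = _fmt(opts.get(key))
        if value:
            msg[header] = value
    by_hostgroup = bool(doc.get("snow", {}).get("hostgroup_grouping"))
    rows = [
        ("Exception", doc.get("alarm_name")),
        ("Status", doc.get("status")),
        ("Entities", doc.get("entities")),
        ("Producer", doc.get("source")),
        ("Collectd namespace", doc.get("collectd_namespace")),
        ("Troubleshooting", doc.get("troubleshooting")),
        ("Correlation", doc.get("correlation")),
        ("Grouping", "Hostgroup" if by_hostgroup else "Entities"),
        ("Hostgroup", doc.get("submitter_hostgroup")),
    ]
    msg.set_content("\n".join(f"{label}: {_fmt(v)}" for label, v in rows if _fmt(v)))
    return msg


msg = build_email({
    "alarm_name": "disk_full",
    "status": "WARNING",
    "entities": "myhost01",
    "source": "my_producer",
    "summary": "Disk almost full",
    "email": {"to": "team@cern.ch"},
})
print(msg["Subject"])  # Disk almost full on myhost01
```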

SMS

Finally, alerts can also be forwarded as SMS (text messages) to CERN phone numbers.
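Because SMS recipients are restricted to CERN numbers in international format, a client could check the sms.to field before submitting. This sketch assumes that "+4175411xxxx" means the prefix +4175411 followed by exactly four digits; the pattern and the function name are illustrative assumptions, not the service's actual validation.

```python
import re

# Assumption: CERN numbers are "+4175411" followed by four digits.
CERN_MOBILE = re.compile(r"\+4175411\d{4}")


def valid_sms_recipient(number: str) -> bool:
    """Return True if the number matches the assumed CERN format."""
    return CERN_MOBILE.fullmatch(number) is not None


print(valid_sms_recipient("+41754111234"))  # True
print(valid_sms_recipient("0754111234"))    # False
```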

Inhibitions (Roger)

The GNI infrastructure always checks the status of the alerting entity in Roger. In addition, if an alert is shipped with roger_alarmed=false, all notification endpoints are ignored and no action is taken (for example, no ticket is created in SNOW).
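The inhibition logic can be sketched as a filter over the alert's targets. The Roger lookup is passed in as a hypothetical callable (no real Roger API is called), and it is an assumption that an entity not alarmed in Roger suppresses notifications in the same way as roger_alarmed=false. Since labels may arrive as strings, both false and "false" are treated as false.

```python
from typing import Any, Callable, Dict, List


def _is_false(value: Any) -> bool:
    # Accept both the boolean False and the string "false".
    return str(value).strip().lower() == "false"


def notification_targets(
    doc: Dict[str, Any],
    entity_alarmed_in_roger: Callable[[Any], bool],
) -> List[str]:
    """Return the targets to notify, or [] when the alert is inhibited."""
    if _is_false(doc.get("roger_alarmed", True)):
        return []
    # Assumption: a Roger-masked entity suppresses all notifications.
    if not entity_alarmed_in_roger(doc.get("entities")):
        return []
    return list(doc.get("targets", []))


def roger_lookup(entity: Any) -> bool:
    """Hypothetical stand-in for a Roger query."""
    return entity != "masked-host"


print(notification_targets({"entities": "h1", "targets": ["snow"]}, roger_lookup))  # ['snow']
```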

Storages

All alerts integrated using the GNI endpoints are stored by default in our short-term OpenSearch cluster, under the monit_prod_alarm_raw_gni* index pattern.