Using SLA alerting to build a 2-tier monitoring system

About SLA alerting

  • SLA alerting concepts were introduced in TrackMe 2.0.92.

  • The concepts of SLA alerting can be leveraged to design a 2-tier monitoring system, where a first alert is emitted when TrackMe entities switch to a red state, and a second independent alert is emitted when the SLA is breached.

  • In practice, this translates to “alert: this TrackMe entity is red!”, followed later by “alert: this entity has now spent too long in alert and is breaching its SLA!”.

  • Each entity is associated with an SLA class. Each SLA class has a threshold in seconds and a numerical rank value.

  • The available SLA classes and their configuration are defined in the TrackMe system configuration, as a JSON object which lists the classes and their parameters.

  • You can then define the SLA class per entity (which, if defined, takes precedence over policy-based classes), or via SLA policies, which are regular expressions orchestrated by TrackMe that set the SLA class of each matching entity.

  • Finally, you can create an SLA alert, which leverages the SLA information to emit an alert when the SLA is breached.

SLA Classes and Thresholds

SLA classes are defined in the TrackMe system configuration. Each class defines a threshold in seconds and a numerical rank value:

configure_classes.png

The default SLA classes are:

{
    "gold": {
        "sla_threshold": 14400,
        "rank": 3
    },
    "silver": {
        "sla_threshold": 86400,
        "rank": 2
    },
    "platinum": {
        "sla_threshold": 172800,
        "rank": 1
    }
}

Notes:

  • You can add / remove / change classes as needed.

  • You can define the default class to be used when TrackMe entities are discovered. (parameter: default_sla_class)

  • The threshold value is in seconds. For the SLA to be breached, a TrackMe entity must be in alert continuously for longer than the threshold.

  • The rank is a numerical value used to resolve conflicts when applying SLA policies: the highest rank value always wins.
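The threshold rule described in the notes above can be illustrated with a minimal Python sketch. It only mirrors the documented logic and is not TrackMe's implementation: the function name `is_sla_breached` and the `seconds_in_alert` parameter are hypothetical, and the class definitions are copied from the default JSON shown earlier.

```python
import json

# Default SLA classes, copied from the JSON object shown above.
SLA_CLASSES = json.loads("""
{
    "gold": {"sla_threshold": 14400, "rank": 3},
    "silver": {"sla_threshold": 86400, "rank": 2},
    "platinum": {"sla_threshold": 172800, "rank": 1}
}
""")


def is_sla_breached(sla_class: str, seconds_in_alert: float) -> bool:
    """Hypothetical helper: True when the continuous time in alert
    exceeds the threshold of the given SLA class."""
    threshold = SLA_CLASSES[sla_class]["sla_threshold"]
    return seconds_in_alert > threshold


# An entity that has been continuously in alert for 5 hours (18000 seconds):
print(is_sla_breached("gold", 18000))    # True: 18000 > 14400
print(is_sla_breached("silver", 18000))  # False: 18000 <= 86400
```

With the default classes, the same five-hour outage breaches a gold SLA (4 hours) but not a silver one (24 hours).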

SLA Tab in TrackMe UI

The SLA feature is first exposed through a tab called SLA, which describes the current SLA status of the selected TrackMe entity:

The entity is in a green state, therefore the SLA cannot be breached:

sla_tab_green.png

The entity is in a red state, but it has not yet breached the SLA because its continuous time in alert is still below the threshold:

sla_tab_red_not_breached.png

The entity is in a red state, and it has breached the SLA because its continuous time in alert is now higher than the threshold:

sla_tab_red_breached.png

Defining the SLA class per entity

You can define the SLA class manually on a per-entity basis, via the UI or via the associated endpoint:

Hint

If the SLA class has been defined manually, it takes precedence over policy-based classes (see next section).

sla_per_entity_001.png sla_per_entity_002.png
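The precedence rule from the hint above can be sketched in a few lines of Python. This is an illustration only: the function and parameter names are hypothetical, and falling back to a default class when neither a manual class nor a policy applies is an assumption based on the `default_sla_class` parameter, not a confirmed description of TrackMe's internals.

```python
from typing import Optional


def effective_sla_class(
    manual_class: Optional[str],
    policy_class: Optional[str],
    default_class: str,
) -> str:
    """Hypothetical helper: pick the SLA class that applies to an entity."""
    if manual_class:
        # A manually defined class always wins over policy-based classes.
        return manual_class
    if policy_class:
        # Otherwise use the class assigned by a matching SLA policy.
        return policy_class
    # Assumption: fall back to the default class when nothing else applies.
    return default_class


print(effective_sla_class("gold", "silver", "platinum"))  # gold
print(effective_sla_class(None, "silver", "platinum"))    # silver
print(effective_sla_class(None, None, "platinum"))        # platinum
```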

Defining SLA policies to assign SLA classes automatically

You can define policies, which are regular expressions applied and orchestrated by TrackMe automatically:

sla_policies_001.png

For instance, say we want to match entities containing “cribl”:

sla_policies_002.png sla_policies_003.png

Let’s execute the SLA tracker now:

sla_policies_004.png

From this point, each matching entity is assigned the class of the highest-ranked matching policy, along with its associated threshold:

sla_policies_005.png
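The way policies and ranks interact can be illustrated with a small Python sketch. The policy list, the entity names, and the `resolve_policy_class` function are hypothetical, and using `re.search` (a match anywhere in the entity name) is an assumption; the exact matching semantics applied by TrackMe are not described here.

```python
import re
from typing import Optional

# Default SLA classes, as defined in the system configuration.
SLA_CLASSES = {
    "gold": {"sla_threshold": 14400, "rank": 3},
    "silver": {"sla_threshold": 86400, "rank": 2},
    "platinum": {"sla_threshold": 172800, "rank": 1},
}

# Hypothetical policies: (regular expression, SLA class to assign).
POLICIES = [
    (r"cribl", "gold"),
    (r"^linux", "silver"),
]


def resolve_policy_class(entity_name: str) -> Optional[str]:
    """Return the class of the highest-ranked matching policy, or None."""
    matched = [
        sla_class
        for pattern, sla_class in POLICIES
        if re.search(pattern, entity_name)
    ]
    if not matched:
        return None
    # When several policies match, the highest rank value wins.
    return max(matched, key=lambda name: SLA_CLASSES[name]["rank"])


print(resolve_policy_class("linux:cribl_logs"))  # gold (rank 3 beats rank 2)
print(resolve_policy_class("linux:syslog"))      # silver
print(resolve_policy_class("windows:events"))    # None
```

In the first call both policies match, so the rank decides: gold (rank 3) wins over silver (rank 2).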

SLA Alerts

From TrackMe alert tabs, you can now create an SLA alert:

Hint

The TrackMe component must be defined and must match your target (there is one alert per component in the tenant).

sla_alerts_001.png

Notes:

  • This alert is independent from the TrackMe main alerts and notable alert

  • If it triggers, it means that the SLA was breached for one or more entities

  • The idea is that a first alert was raised when the entity turned red. If the issue is still not fixed after an acceptable amount of time, a second alert is generated once the SLA threshold is breached (two-tier alerting system).
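The two-tier idea described in the notes above can be summarised with a conceptual Python sketch. In TrackMe the two tiers are separate, independent alerts; the single function below, its name, and the alert labels are hypothetical and only show when each tier would fire.

```python
from typing import List


def alerts_for_entity(
    state: str, seconds_in_alert: float, sla_threshold: float
) -> List[str]:
    """Hypothetical helper: list which alert tiers apply to an entity."""
    alerts: List[str] = []
    if state == "red":
        # Tier 1: the entity is in alert.
        alerts.append("tier 1: entity is red")
        if seconds_in_alert > sla_threshold:
            # Tier 2: the entity has stayed in alert longer than its SLA allows.
            alerts.append("tier 2: SLA breached")
    return alerts


# Gold class threshold: 14400 seconds (4 hours).
print(alerts_for_entity("green", 0, 14400))    # []
print(alerts_for_entity("red", 3600, 14400))   # ['tier 1: entity is red']
print(alerts_for_entity("red", 20000, 14400))  # ['tier 1: entity is red', 'tier 2: SLA breached']
```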

Example:

sla_alerts_002.png