Using SLA alerting to build a 2-tier monitoring system
About SLA alerting
SLA alerting concepts were introduced in TrackMe 2.0.92.
The concepts of SLA alerting can be leveraged to design a 2-tier monitoring system, where a first alert is emitted when TrackMe entities switch to a red state, and a second independent alert is emitted when the SLA is breached.
In other words: first “alert: this TrackMe entity is red!”, then later on “alert: this entity has now spent too long in alert and is breaching its SLA!”.
Each entity is associated with an SLA class, and each SLA class has a threshold in seconds and a numerical rank value.
The available SLA classes and their configuration are defined at the level of the TrackMe system configuration, as a JSON object which defines the list of classes and their parameters.
You can then define the SLA class per entity (which, if defined, takes precedence over policy-based classes), or via SLA policies, which are regular expressions orchestrated by TrackMe that set the SLA class of each matching entity.
Finally, you can create an SLA alert, which leverages the SLA information to emit an alert when the SLA is breached.
SLA Classes and Thresholds
SLA classes are defined at the level of the TrackMe system configuration. Each SLA class defines a threshold in seconds and a numerical rank value.
The default SLA classes are:
{
    "gold": {
        "sla_threshold": 14400,
        "rank": 3
    },
    "silver": {
        "sla_threshold": 86400,
        "rank": 2
    },
    "platinum": {
        "sla_threshold": 172800,
        "rank": 1
    }
}
Notes:
You can add / remove / change classes as needed.
You can define the default class to be used when TrackMe entities are discovered (parameter: default_sla_class).
The threshold value is in seconds. For the SLA to be breached, a given TrackMe entity must be in alert for a continuous amount of time higher than the threshold.
The rank value is a numerical value used to resolve conflicts when applying SLA policies: the highest rank value always wins.
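The threshold and rank rules in these notes can be illustrated with a minimal Python sketch. This is not TrackMe code: the function names `is_sla_breached` and `winning_class` are hypothetical, and the sketch only assumes the default classes listed above and a strict "higher than" comparison.

```python
import json

# Default SLA classes, copied from the system configuration shown above.
SLA_CLASSES = json.loads("""
{
    "gold": {"sla_threshold": 14400, "rank": 3},
    "silver": {"sla_threshold": 86400, "rank": 2},
    "platinum": {"sla_threshold": 172800, "rank": 1}
}
""")


def is_sla_breached(sla_class: str, seconds_in_alert: int) -> bool:
    """True when the continuous time in alert is higher than the class threshold."""
    return seconds_in_alert > SLA_CLASSES[sla_class]["sla_threshold"]


def winning_class(candidates: list[str]) -> str:
    """Resolve a conflict between classes: the highest rank value wins."""
    return max(candidates, key=lambda name: SLA_CLASSES[name]["rank"])
```

With the default classes, an entity in the gold class would breach its SLA after more than 14400 seconds (4 hours) of continuous alert, and gold (rank 3) would win over silver (rank 2) in a conflict.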
SLA Tab in TrackMe UI
The SLA feature first appears in the UI as a tab called SLA, which describes the current SLA status for the selected TrackMe entity:
The entity is in a green state, therefore the SLA cannot be breached.
The entity is in a red state, but it has not yet breached the SLA: the continuous amount of time spent in alert is still lower than the threshold.
The entity is in a red state, and it has breached the SLA: the continuous amount of time spent in alert is higher than the threshold.
Defining the SLA class per entity
You can define the SLA class manually on a per-entity basis, via the UI or via the associated endpoint.
Hint
If the SLA class has been defined manually, it takes precedence over policy-based classes (see next section).
Defining SLA policies to assign SLA classes automatically
You can define policies, which are regular expressions applied and orchestrated by TrackMe automatically:
For instance, say we want to match entities containing “cribl”:
Let’s execute the SLA tracker now:
From this stage, each matching entity is assigned the SLA class of its highest-ranked matching policy, along with the associated threshold.
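The precedence rules described so far (manual class first, then the highest-ranked matching policy, then the default class) can be sketched in Python as follows. This is an illustration only, not how TrackMe implements it: the policy list, the `resolve_sla_class` function, the use of `re.search` for matching, and the `"silver"` default are all assumptions made for the example.

```python
import re

# Rank per SLA class, taken from the default classes listed earlier.
SLA_RANKS = {"gold": 3, "silver": 2, "platinum": 1}

# Hypothetical policies: (regular expression, SLA class).
SLA_POLICIES = [
    (r"cribl", "gold"),
    (r"^prod", "silver"),
]


def resolve_sla_class(entity, manual_class=None, default_class="silver"):
    """Pick the SLA class for an entity name (illustrative logic only)."""
    # A manually defined class takes precedence over any policy.
    if manual_class is not None:
        return manual_class
    # Collect the class of every policy whose regex matches the entity.
    matched = [cls for pattern, cls in SLA_POLICIES if re.search(pattern, entity)]
    if not matched:
        # No policy matched: fall back to the default class (assumed value).
        return default_class
    # Several policies may match: the highest rank value wins.
    return max(matched, key=SLA_RANKS.__getitem__)
```

For instance, a hypothetical entity named `prod:cribl:logs` matches both policies and ends up in the gold class, unless a class was set manually on the entity.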
SLA Alerts
From TrackMe alert tabs, you can now create an SLA alert:
Hint
The TrackMe component must be defined and match your target (there is one alert per component in the tenant).
Notes:
This alert is independent from the TrackMe main alerts and notable alert.
If it triggers, it means that the SLA was breached for one or more entities.
The idea behind the SLA alert is that a first alert was already emitted, the issue was not fixed within an acceptable amount of time, and so a second alert is generated once the SLA threshold has been breached (2-tier alerting system).
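The two tiers can be summarised with a short Python sketch. It is purely conceptual: in TrackMe the two tiers are separate alerts, and the `EntityStatus` structure, field names, and message strings below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EntityStatus:
    name: str
    state: str             # "green" or "red" in this sketch
    seconds_in_alert: int  # continuous time spent in alert
    sla_threshold: int     # threshold of the entity's SLA class, in seconds


def alerts_for(entity: EntityStatus) -> list[str]:
    """Return the alerts that would apply to an entity under the 2-tier model."""
    alerts = []
    if entity.state == "red":
        # Tier 1: the entity is in alert.
        alerts.append(f"tier1: {entity.name} is red")
        if entity.seconds_in_alert > entity.sla_threshold:
            # Tier 2: the entity stayed in alert longer than its SLA allows.
            alerts.append(f"tier2: {entity.name} is breaching its SLA")
    return alerts
```

A green entity yields no alert, a red entity under its threshold yields only the first tier, and a red entity past its threshold yields both.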
Example: