Using SLA alerting to build a 2-tier monitoring system
About SLA alerting
SLA alerting concepts were introduced in TrackMe 2.0.92.
The concepts of SLA alerting can be leveraged to design a 2-tier monitoring system, where a first alert is emitted when TrackMe entities switch to a red state, and a second independent alert is emitted when the SLA is breached.
In other words: first “alert: this TrackMe entity is red!”, then later “alert: this entity has now spent too long in alert and is breaching its SLA!”.
Each entity is associated with an SLA class, and each SLA class has a threshold in seconds and a numerical rank value.
The available SLA classes and their parameters are defined in the TrackMe system configuration, as a JSON object listing the classes.
You can then define the SLA class per entity (which, if defined, takes precedence over policy-based classes) or via SLA policies. SLA policies support two modes:
Regex-based policies match regular expressions against the entity name and assign an SLA class.
Lookup-based policies leverage existing Splunk lookups (CSV or KVstore) to assign SLA classes based on field mappings, making it easy to integrate with a CMDB or asset inventory.
If multiple policies match, the highest ranked SLA class wins (based on the rank value defined in the SLA class configuration).
Finally, you can create an SLA alert, which leverages the SLA information to emit an alert when the SLA is breached.
SLA Classes and Thresholds
SLA classes are defined in the TrackMe system configuration. Each SLA class defines a threshold in seconds and a numerical rank value.
The default SLA classes are:
{
    "gold": {
        "sla_threshold": 14400,
        "rank": 3
    },
    "silver": {
        "sla_threshold": 86400,
        "rank": 2
    },
    "platinum": {
        "sla_threshold": 172800,
        "rank": 1
    }
}
Notes:
You can add / remove / change classes as needed.
You can define the default class to be used when TrackMe entities are discovered. (parameter: default_sla_class)
The threshold value is in seconds. For an SLA to be breached, a given TrackMe entity must be in alert for a continuous amount of time exceeding the threshold.
The rank value is a numerical value used to resolve conflicts when several SLA policies apply: the highest rank value always wins.
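The threshold rule above can be sketched in a few lines of Python. This is only an illustration, not TrackMe's implementation: the helper name sla_status, its return values, and the assumption that only the red state counts as "in alert" are all hypothetical.

```python
# Default SLA classes as listed above; thresholds are in seconds.
SLA_CLASSES = {
    "gold": {"sla_threshold": 14400, "rank": 3},
    "silver": {"sla_threshold": 86400, "rank": 2},
    "platinum": {"sla_threshold": 172800, "rank": 1},
}


def sla_status(state, seconds_in_alert, sla_class, classes=SLA_CLASSES):
    """Hypothetical helper: classify an entity against its SLA threshold.

    seconds_in_alert is the continuous time the entity has spent in alert.
    Assumption: only the "red" state is treated as being in alert.
    """
    if state != "red":
        return "ok"  # not in alert, so the SLA cannot be breached
    threshold = classes[sla_class]["sla_threshold"]
    if seconds_in_alert > threshold:
        return "sla_breached"  # in alert for longer than the threshold
    return "in_alert"  # red, but still within the SLA threshold


print(sla_status("red", 3600, "gold"))   # in_alert
print(sla_status("red", 14401, "gold"))  # sla_breached
```

With the default classes, the same five hours spent in alert breaches a gold SLA (4 hours) but not a silver one (24 hours).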
SLA Tab in TrackMe UI
The SLA feature first appears as a tab called SLA, which shows the current SLA status for the selected TrackMe entity:
The entity is in a green state, therefore the SLA cannot be breached:
The entity is in a red state, but its continuous time in alert has not yet exceeded the SLA threshold:
The entity is in a red state, and its continuous time in alert has exceeded the SLA threshold, so the SLA is breached:
Defining the SLA class per entity
You can define the SLA class manually on a per-entity basis, via the UI or via the associated endpoint:
Hint
If the SLA class has been defined manually, it takes precedence over policy-based classes (see next section).
Defining SLA policies to assign SLA classes automatically
TrackMe supports the management of SLA classes via policy, which can be defined per Virtual Tenant:
SLA policies allow you to automate SLA class assignment across entities using regex-based or lookup-based rules.
Regex-based policies match regular expressions against the entity name (object field) and assign the configured SLA class.
Lookup-based policies leverage existing Splunk lookups (CSV or KVstore) to assign SLA classes based on field mappings, making it easy to integrate with a CMDB or asset inventory.
Matching entities are automatically updated with the SLA class defined in the policy.
If multiple policies match a given entity, the highest ranked SLA class takes precedence across all policy types (based on the rank value defined in the SLA class configuration).
An entity managed by policies can still be updated manually, and the policy will not override the manual update.
Lookup-based SLA policies
Lookup-based policies leverage existing Splunk lookup transforms (CSV files or KVstore collections) to assign SLA classes to entities based on field mappings. This is the recommended approach when integrating with an external CMDB, asset inventory, or any structured data source.
Creating a lookup-based SLA policy:
Click Create new policy and select the Lookup mode
Enter a policy identifier (or let TrackMe auto-generate one)
Select a Splunk lookup transform from the dropdown (TrackMe lists all available lookup transforms in your Splunk environment)
Configure field mappings to map lookup fields to entity fields: for example, map the lookup field index to the entity field data_index, and the lookup field host to the entity field object
Select the SLA class field, that is, the field in the lookup that contains the SLA class value
Optionally configure SLA class value mappings to translate foreign values to TrackMe’s SLA classes (e.g., tier1 → gold, tier2 → silver)
Select the match mode: Exact (case-insensitive string matching) or Wildcard (supports * and ? patterns)
Click Simulate to preview which entities would be matched and what SLA classes would be assigned
Lookup policy configuration details
Field mappings: Define how lookup fields map to entity fields. All mapped fields must match for an entity to be selected. For example, if you map index → data_index and host → object, both conditions must be satisfied.
SLA class field: The column in the lookup containing the SLA class value (e.g., a sla_tier or service_level column).
SLA class value mappings (optional): If your lookup uses custom values like tier1, tier2, tier3, you can map them to TrackMe SLA classes (gold, silver, platinum, etc.). If no mappings are configured, the lookup values must already match TrackMe’s SLA class names.
Match modes: Exact performs case-insensitive string comparison. Wildcard supports * (matches any characters) and ? (matches a single character) for flexible pattern matching.
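To make the interaction between field mappings, value mappings, and match modes more concrete, here is a minimal Python sketch. It is not TrackMe's code: the function names, the sample lookup rows, and the entity are hypothetical. It also assumes that the lookup side holds the wildcard pattern, that wildcard matching is case-sensitive, and that the first matching row wins, none of which is stated above.

```python
from fnmatch import fnmatchcase


def value_matches(lookup_value, entity_value, match_mode="exact"):
    """Compare one lookup value with one entity value."""
    lookup_value, entity_value = str(lookup_value), str(entity_value)
    if match_mode == "wildcard":
        # Assumption: the lookup value is the pattern, with * and ? wildcards.
        # Note that fnmatch also gives [seq] a special meaning.
        return fnmatchcase(entity_value, lookup_value)
    # Exact mode: case-insensitive string comparison.
    return lookup_value.lower() == entity_value.lower()


def assign_sla_class(entity, lookup_rows, field_mappings, sla_class_field,
                     value_mappings=None, match_mode="exact"):
    """Return the SLA class of the first lookup row matching the entity, else None."""
    value_mappings = value_mappings or {}
    for row in lookup_rows:
        # All mapped fields must match for the row to apply to the entity.
        if all(
            value_matches(row.get(lookup_field, ""),
                          entity.get(entity_field, ""), match_mode)
            for lookup_field, entity_field in field_mappings.items()
        ):
            raw_value = row[sla_class_field]
            # Translate foreign values (tier1, ...) when a mapping exists.
            return value_mappings.get(raw_value, raw_value)
    return None


# Hypothetical lookup content and entity, for illustration only.
rows = [
    {"index": "linux_*", "host": "web-??", "sla_tier": "tier1"},
    {"index": "network", "host": "*", "sla_tier": "tier2"},
]
field_mappings = {"index": "data_index", "host": "object"}
entity = {"data_index": "linux_os", "object": "web-01"}

result = assign_sla_class(entity, rows, field_mappings, "sla_tier",
                          {"tier1": "gold", "tier2": "silver"}, "wildcard")
print(result)  # gold
```

In practice, the Simulate button remains the reliable way to check what a given policy would match, since the actual matching rules may differ from this sketch.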
Regex-based SLA policies
Regex-based policies match a regular expression pattern against the entity name (object field). When the pattern matches, the configured SLA class is assigned to the entity.
Creating a regex-based SLA policy:
Click Create new policy and select the Regex mode
Enter a policy identifier (or let TrackMe auto-generate one)
Enter the regex pattern to match against entities
Select the SLA class to assign to matched entities
Click Simulate to preview which entities would be matched before saving
For instance, say we want to match entities containing “cribl”:
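As a rough illustration of what such a pattern would select, the sketch below applies the regex cribl to a few made-up entity names using Python's re module. The entity names are hypothetical, and the unanchored, case-sensitive search semantics are an assumption rather than documented TrackMe behavior.

```python
import re

# Pattern entered in the policy: matches any entity name containing "cribl".
pattern = re.compile(r"cribl")

# Hypothetical entity names (values of the object field).
entities = ["cribl:cribl_metrics", "linux:syslog", "network:cribl_stream"]

# re.search matches anywhere in the name; no anchoring is assumed here.
matched = [name for name in entities if pattern.search(name)]
print(matched)  # ['cribl:cribl_metrics', 'network:cribl_stream']
```

A stricter pattern such as ^cribl: would only select names starting with that prefix, which is worth checking with Simulate before saving.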
SLA class hierarchy and conflict resolution
When multiple policies (regex, lookup, or a mix of both) match the same entity, TrackMe applies the highest ranked SLA class across all matching policies.
The rank is determined by the numerical rank value defined in the SLA class configuration (see SLA Classes and Thresholds above). The SLA class with the highest rank value wins.
For example, if a regex policy assigns silver (rank 2) and a lookup policy assigns gold (rank 3) to the same entity, the entity will receive the gold SLA class.
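The resolution rule can be sketched as follows. This is a hedged illustration rather than TrackMe's implementation: the function name resolve_sla_class and its arguments are hypothetical, and the manual override branch reflects the precedence rule described earlier for per-entity classes.

```python
# Default SLA classes as listed in the SLA Classes and Thresholds section.
SLA_CLASSES = {
    "gold": {"sla_threshold": 14400, "rank": 3},
    "silver": {"sla_threshold": 86400, "rank": 2},
    "platinum": {"sla_threshold": 172800, "rank": 1},
}


def resolve_sla_class(policy_classes, manual_class=None, classes=SLA_CLASSES):
    """Hypothetical helper: pick the effective SLA class for one entity.

    policy_classes: classes assigned by all matching policies (regex or lookup).
    manual_class: class set manually on the entity, if any.
    """
    if manual_class is not None:
        return manual_class  # a manual definition takes precedence over policies
    # Ignore values that are not configured SLA classes.
    candidates = [name for name in policy_classes if name in classes]
    if not candidates:
        return None  # no policy matched
    # The class with the highest rank value wins.
    return max(candidates, key=lambda name: classes[name]["rank"])


# A regex policy assigns silver (rank 2), a lookup policy assigns gold (rank 3).
print(resolve_sla_class(["silver", "gold"]))  # gold
```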
Running the SLA policy tracker
After creating one or more policies, you can execute the SLA policy tracker to apply all configured policies to your entities:
Click the Run policy tracker button in the policy management screen
TrackMe will evaluate all regex and lookup policies against all entities in the selected component
A summary of the results is displayed, including the number of entities updated, matched, and any errors encountered
From this point, each matching entity is assigned the highest ranked SLA class among its matching policies, along with the associated threshold:
SLA Alerts
From TrackMe alert tabs, you can now create an SLA alert:
Hint
The TrackMe component must be defined and match your target (there is one alert per component in the tenant).
Notes:
This alert is independent from the TrackMe main alerts and notable alert
If it triggers, it means that the SLA has been breached for one or more entities
The idea is that a first alert was already raised when the entity turned red; if the issue is not fixed within an acceptable amount of time, a second alert is generated once the SLA threshold has been breached (2-tier alerting system).
Example: