Splunk Workload (splk-wlk)

Note

TrackMe Workload component

  • This feature is part of TrackMe’s restricted features and allows tracking the key activity and KPIs of your Splunk scheduled workload, from alerts to scheduled reports and Enterprise Security correlation searches.

  • It tracks and alerts on sophisticated conditions such as execution anomalies, delayed executions, and changes in consumption behaviour detected using Machine Learning. It also provides version control to detect changes in the searches themselves.

  • The TrackMe Workload component is powerful and provides the missing deep visibility into your critical Splunk workload, whether you are using Splunk for security, IT operations, or any other use case.

1. Introduction

The objective of the Splunk Workload component for TrackMe is to continuously track the scheduling activity of a Splunk deployment, and to perform the following key activities and detections:

  • Discover Splunk scheduled activity across the environment. An entity is the association of the Splunk application namespace, the owner, and the search name

  • Track the behaviour of scheduled activity, and detect issues that can affect the proper functioning of Splunk scheduled reports and alerts

  • Track and identify the Metadata of scheduled objects, such as their definition, to investigate and detect changes that impact search behaviour over time

  • Generate consistent, high-value metrics from Splunk introspection and scheduler data, as well as SVC usage for Splunk Cloud customers

  • Use Machine Learning Outliers detection to detect abnormal behaviours and changes, such as an abnormal increase in costs, run time or consumption

  • Alert when scheduled reports and alerts are affected by issues, and benefit from the full set of TrackMe’s features and workflow to handle the life cycle of Splunk scheduled activity

The TrackMe Workload component, associated with TrackMe’s workflow, is a companion for Splunk that allows you to quickly detect and alert on issues affecting your most critical Splunk activity, and provides the keys to understanding schedule-related costs.

screen0.png screen1.png

Note

Grouping in the Workload component: group and overgroup

  • By default, the Workload component groups entities by app, which represents the Splunk application namespace hosting the scheduled report or alert.

  • When creating the tenant, and also later on when creating trackers manually, you can optionally override this behaviour by setting an overgroup value (since TrackMe 2.0.70).

  • By doing so, the Workload component groups entities based on a custom term instead of applications. This can be useful if, for instance, you want to have multiple Search Head tiers in the same tenant.

  • Refer to the “Creating a Workload tenant to host multiple Search Head tiers with overgroup” section for more information.

2. Workload entities

Once a Workload TrackMe Virtual Tenant has been configured, TrackMe starts tracking any scheduled activity, and creates and maintains the associated entities.

An entity name is composed of:

  • app + “:” + owner + “:” + savedsearch name

Should any of this information change, TrackMe will consider this as a new entity to be created and maintained. This is the case, for instance, if the search is reassigned to another owner, if the search is moved between different application namespaces, or if the search knowledge object identifier is changed.
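As a minimal sketch of the naming rule above (the function name build_entity_name and the sample values are hypothetical and not part of TrackMe), the entity name can be assembled like this:

```python
def build_entity_name(app: str, owner: str, savedsearch_name: str) -> str:
    """Join the three components with ":" as described above."""
    return f"{app}:{owner}:{savedsearch_name}"


# Reassigning the search to another owner produces a different name,
# which is why it is handled as a new entity.
before = build_entity_name("search", "admin", "My alert")
after = build_entity_name("search", "svc_account", "My alert")
print(before)
print(after)
```

Because the name is derived from all three components, changing any one of them is enough to produce a distinct entity.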

3. Anomaly reason

3.1 Anomaly reason definition

TrackMe considers each entity individually, and will trigger a status change based on the following criteria:

Anomaly Reason: Conditions

  • none: The entity is green and healthy, there are no issues detected currently

  • skipping_searches_detected: TrackMe detected skipped executions for that scheduled search; by default, the entity is orange below 5% of skipped executions and red above 5%

  • orphan_search_detected: A previously active scheduled search is now orphan, which means the owner is no longer valid and the search cannot be executed

  • execution_errors_detected: Execution errors are detected and the scheduled report/alert is not working properly

  • anomaly_outliers_detected: The Machine Learning outliers engine detected issues in one or more active ML models, for instance an abnormal increase in the run time of the scheduled search

  • execution_delayed: TrackMe uses the Metadata and the cron schedule translation to monitor if the search is delayed compared to its expected schedule (with 5 min of grace time by default)

  • status_not_met: The status of the entity is red, for an unclassified reason

  • out_of_monitoring_window: The status of the entity is not healthy, however the current period is out of the monitoring window set for this entity

3.2 Checking the Anomaly Reason

You can observe the current anomaly reason in different locations:

  • In the table column called “Anomaly Reason”

  • By right clicking on the entity, in the contextual menu

  • When the alert is sent over, as part of the fields resulting from the alert

  • When the alert fires, as part of the TrackMe notable event

table anomaly reason column:

anomaly_reason_column.png

right click contextual menu on the entity name:

anomaly_reason_context.png

The status message is conditioned by the anomaly reason value and translated into a detailed message:

status_message.png

The anomaly reason is part of the notable event when an alert triggers:

notable_event.png

3.3 Filtering on Anomaly Reasons

You can use the filter functions to check all entities with a given anomaly reason:

anomaly_reason_filter.png

4. Workload metrics

Once a scheduled entity has been discovered, TrackMe tracks its activity and generates various key metrics:

Metric name: Description

  • count_execution: Scheduler metric: the number of requested executions for this schedule

  • count_completed: Scheduler metric: the number of successfully completed executions for this schedule

  • count_skipped: Scheduler metric: the number of skipped executions for this schedule

  • count_errors: Scheduler metric: the number of execution errors detected for this schedule

  • elapsed: Introspection metric: the run time of the search, in seconds

  • pct_cpu: Introspection metric: the aggregated percentage of CPU used for this schedule (which can be more than 100%)

  • pct_memory: Introspection metric: the aggregated percentage of Memory used for this schedule (which can be more than 100%)

  • scan_count: Introspection metric: the number of events scanned for this schedule

  • svc_usage: Splunk Cloud metric: the SVC usage for this consumer

The Workload metrics are then used to condition the status of the entity, and feed the Machine Learning Outliers engine.

4.1 Accessing metrics

4.1.1 Metrics summary table

Metrics summary in the table:

TrackMe shows a summary of the metrics of the last known 24 hours per entity:

metrics_column.png

Note that the metrics shown in the summary JSON vary depending on the metrics available. For instance, an entity that has not experienced any skipped searches will not show that metric, as it is null.
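The following is a minimal sketch of that behaviour, not TrackMe’s actual code: the function name summarize_metrics and the sample values are made up for illustration, and only the metric names come from the table above.

```python
import json


def summarize_metrics(metrics: dict) -> str:
    """Serialize metrics to JSON, leaving out any metric whose value is None."""
    available = {name: value for name, value in metrics.items() if value is not None}
    return json.dumps(available, sort_keys=True)


# Illustrative values only: this entity never skipped, so count_skipped is null.
example = {
    "count_execution": 288,
    "count_completed": 288,
    "count_skipped": None,
    "elapsed": 1.42,
}
summary = summarize_metrics(example)
print(summary)
```

With this input, the count_skipped key is simply absent from the resulting JSON rather than being reported as null.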

4.1.2 Metrics chart over time

You can access the metric chart selector by opening an entity overview:

metrics_chart.png

Use the metric selector, and optionally other settings according to your needs. In the following example, we are looking at the average elapsed for that entity:

metrics_chart2.png

5. Metadata Versioning

TrackMe monitors the versioning of scheduled entities, to detect changes and allow linking a knowledge object state with its performance metrics.

Hint

Diff search/earliest/latest (new in TrackMe 2.0.72)

  • Since TrackMe 2.0.72, the diff of the search and/or earliest/latest quantifiers is performed and stored as part of the versioning events and records

  • This means that TrackMe automatically identifies the difference of the search when a change is detected against the previously known version, stored as part of the diff_<context> fields

  • It will also attempt to identify the user that performed the change and when this change occurred (if the change occurred via Splunk Web/Splunk API, not application configuration push)

metadata-diff.png
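To illustrate what a diff between two versions of a search looks like, here is a minimal sketch using Python’s standard difflib module. This is an assumption for illustration only: the source does not describe how TrackMe computes the diff or how the diff_<context> fields are formatted, and the function name diff_search and the sample searches are hypothetical.

```python
import difflib


def diff_search(previous: str, current: str) -> str:
    """Return a unified diff between two versions of a search definition."""
    lines = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return "\n".join(lines)


# Hypothetical example: the index constraint was widened in the new version.
old_spl = "index=web status=500\n| stats count by host"
new_spl = "index=* status=500\n| stats count by host"
diff_text = diff_search(old_spl, new_spl)
print(diff_text)
```

Lines prefixed with "-" belong to the previously known version and lines prefixed with "+" to the new one; identical inputs produce an empty diff.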

5.1 How the Metadata Versioning works under the cover

This activity is performed by two associated scheduled TrackMe trackers:

  • metadata

  • orphan

metadata1.png

Via a Python-based integration, TrackMe maintains the knowledge of scheduled entities, and identifies a given state as an MD5 hash composed of:

  • The search code definition, that is, the whole SPL statement, whatever its complexity

  • The earliest time quantifier

  • The latest time quantifier
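A minimal sketch of such a state hash is shown below. The way the three parts are combined before hashing is an assumption (the source only lists the inputs), and the function name state_hash and the sample values are hypothetical.

```python
import hashlib


def state_hash(search: str, earliest: str, latest: str) -> str:
    """Return an MD5 hex digest over the search definition and its time range."""
    # Assumption: the three parts are joined with newlines before hashing.
    payload = "\n".join([search, earliest, latest])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Changing only the earliest time quantifier is enough to change the hash.
v1 = state_hash("index=web | stats count", "-15m", "now")
v2 = state_hash("index=web | stats count", "-60m", "now")
print(v1)
print(v2)
```

Comparing the stored hash with a freshly computed one is a cheap way to decide whether a knowledge object has changed, without comparing the full definitions.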

5.2 Accessing Metadata information via the UI

Active entities are inspected regularly. If a change is detected, a new version_id is defined, and a new set of Metadata is extracted and stored in a KVstore collection:

metadata2.png

5.3 Accessing Metadata information via the KVstore

The Metadata version is stored in a JSON structure within the KVstore record associated with a given entity, in a persistent fashion:

| inputlookup trackme_wlk_versioning_tenant_<replace with tenant_id> | search object="<replace with object name>"
metadata3.png

5.4 Accessing Metadata information via TrackMe indexed events

When TrackMe detects a new version of a monitored scheduled entity, it also generates an event in the trackme_summary index of the tenant, with the sourcetype trackme:wlk:version_id:

index=trackme_summary sourcetype=trackme:wlk:version_id tenant_id="<replace with tenant_id>" object="<replace with object name>"
metadata4.png

5.5 Use cases for versioning

So, what are the use cases for the TrackMe versioning feature, associated with TrackMe’s workflow, incident management, metrics generation and so forth? There are many.

For instance, it becomes fairly easy to link a change in the search logic to a massive increase of its run time or computing costs, potentially suddenly leading to alerts in TrackMe caused by skipping searches.

When accessing the metrics overview, use the break down selector to add the version_id as part of the break by statement:

metadata5.png

Perhaps TrackMe started to detect bad behaviour of the scheduled object, for instance starting to generate a certain level of skipping searches:

metadata6.png

TrackMe also records flipping statuses, so we know “when” the behaviour changed:

metadata7.png

As we can see in the chart, a clear increase in the elapsed metric (run time in seconds) suddenly became visible, which we have been able to easily associate with a change of the Splunk knowledge object itself:

metadata8.png

A search which was very efficient suddenly starts consuming a huge amount of resources, potentially impacting the functional use case itself, perhaps even the platform and starting to cost real money to the company, with the need for additional resources and so forth!

Use cases for the Workload features and the versioning are numerous, and their value grows along with your deployment size, scale and maturity.

6. Delayed searches execution detection

As part of the Workload component, TrackMe verifies that scheduled entities have been executing properly according to their requested schedule.

A delayed search execution can have severe implications. From a functional perspective, it can mean, for instance, that security events are not taken into account properly, with the full range of potential consequences.

This verification corresponds to the “execution_delayed” anomaly reason, and works as follows:

  • When the versioning component inspects the Metadata of an entity, it also retrieves the cron schedule expression, which defines how often the search is scheduled to run

  • This cron schedule is converted (using the croniter Python library) into a value in seconds defining how often the search should execute, and stored in the field cron_exec_sequence_sec along with the metadata record of the entity

delayed1.png

Finally, TrackMe verifies and stores the traces of successful scheduled executions. If, for instance, a search that should execute every 5 minutes has not executed for more than 5 minutes plus the grace period, TrackMe considers this entity unhealthy and triggers an alert accordingly.

For instance, let’s disable the scheduling of an active scheduled search (a mistake? bad behaviour?). After a short period of time, TrackMe detects the condition, which immediately impacts the entity, and generates an alert and a TrackMe notable event.

We can observe the dates of the latest inspection and latest execution in the contextual menu:
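The check described above can be sketched as follows. This is an illustration under stated assumptions, not TrackMe’s implementation: the function name is_execution_delayed and its parameters are hypothetical, the schedule interval is taken as an already computed cron_exec_sequence_sec value, and the default grace period of 300 seconds reflects the 5 minutes mentioned earlier.

```python
def is_execution_delayed(
    last_execution_epoch: float,
    now_epoch: float,
    cron_exec_sequence_sec: int,
    grace_sec: int = 300,
) -> bool:
    """Return True when the last successful execution is older than the
    expected interval between executions plus the grace period."""
    seconds_since_last_run = now_epoch - last_execution_epoch
    return seconds_since_last_run > cron_exec_sequence_sec + grace_sec


# A search scheduled every 5 minutes (300 seconds), evaluated at two moments.
now = 1_700_000_000
on_time = is_execution_delayed(now - 400, now, cron_exec_sequence_sec=300)
late = is_execution_delayed(now - 700, now, cron_exec_sequence_sec=300)
print(on_time, late)
```

In this example, 400 seconds without an execution is still within the 600 second allowance (interval plus grace), while 700 seconds exceeds it.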

delayed2.png

After some minutes, TrackMe detected that the scheduled search is delayed:

delayed3.png

From this stage, we automatically detect an abnormal condition for our critical scheduled activity, and we can start acting accordingly before it starts having any severe consequences!

7. Orphan searches detection

The orphan search condition is also detected by TrackMe, for any search that has been actively discovered. In short:

  • Once per hour, TrackMe verifies via the orphan tracker the current status of all active scheduled objects for the past 7 days

  • It updates the Metadata information with the orphan boolean

  • In turn, as part of its status conditions, TrackMe verifies that the search has not been detected as an orphan search; otherwise, this is considered a critical condition for this entity

orphan1.png orphan2.png orphan3.png

8. Skipping searches detection

The skipping search condition is evaluated by comparing the collected count_execution and count_completed metrics, which gives the skipping search percentage:

  • If the skipping search percentage is greater than 0% and less than 5%, the entity status will be orange

  • If the skipping search percentage is over 5%, the entity will turn red

  • The skipping percentage can be seen in the summary metrics JSON in the table, or via the metric over time inspector

skipping1.png skipping2.png
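The thresholds above can be sketched as follows. The exact formula is an assumption: the source only states that the percentage results from comparing count_execution and count_completed, so the sketch uses the share of requested executions that did not complete. The function name skipping_status is hypothetical, and since the behaviour at exactly 5% is not specified, the sketch treats it as red.

```python
def skipping_status(
    count_execution: int,
    count_completed: int,
    red_threshold_pct: float = 5.0,
) -> tuple:
    """Return (skipping percentage, status) for one scheduled entity."""
    if count_execution <= 0:
        # Nothing was requested, so nothing can have been skipped.
        return 0.0, "green"
    # Assumption: skipped share = requested executions that did not complete.
    pct = (count_execution - count_completed) / count_execution * 100.0
    if pct <= 0:
        status = "green"
    elif pct < red_threshold_pct:
        status = "orange"
    else:
        status = "red"
    return pct, status


healthy = skipping_status(100, 100)
degraded = skipping_status(100, 98)
critical = skipping_status(100, 90)
print(healthy, degraded, critical)
```

Keeping the threshold as a parameter mirrors the fact that 5% is described as a default rather than a fixed value.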

9. Error executions detection

Detecting execution errors automatically and at scale is another challenge that TrackMe tackles in the Workload component:

  • Execution errors can happen for all sorts of reasons, for instance a related knowledge object (macro, lookup, etc.) that is not available to the search logic

  • Errors in application development, lack of qualification, unexpected deployment side effects: many combinations can lead a scheduled search to fail pretty much silently

  • TrackMe handles this challenge by continuously looking at the scheduler activity and extracting the execution errors, turning them into a metric that can be taken into account

errors1.png errors2.png

From the overview (currently looking at the metrics), you can also directly access the scheduler errors:

errors4.png errors5.png

10. Machine Learning Outliers detection

Machine Learning Outliers detection can be used to detect abnormal trends for the scheduled entities, based on any of the available KPIs:

outliers1.png

The Machine Learning models created and trained by default depend on the Virtual Tenant configuration. Additional ML models can be created on demand:

outliers2.png