Monitor Splunk Workload with TrackMe’s Workload component

Monitoring the Splunk scheduling workload across Search Head tiers with the TrackMe Workload component

  • This tutorial demonstrates how to monitor the Splunk scheduling workload with the TrackMe Workload component.

  • The Workload component is restricted to licensed customers. Please contact Splunk Sales for more information.

  • Using these steps will enable TrackMe to continuously monitor Splunk scheduled activity, detect execution anomalies and changes in behaviour, and version search definitions.

  • Monitoring the Splunk scheduling workload is critical: for instance, detecting that your SIEM correlation searches are affected by an issue is as crucial as detecting that feeds have stopped feeding your use cases.

  • With TrackMe’s remote search capabilities, you can monitor as many Search Head tiers as you need from a single pane of glass.

Requirements

In most cases, you will want to target a different Search Head tier than the one where the TrackMe instance is running.

For example, Splunk Enterprise Security normally runs on its own Search Head tier. We will therefore start by creating a Splunk Remote Account in TrackMe to access that Search Head tier programmatically.

Step 1: Create a Splunk Remote Deployment Account for the SHC

The first step is to create a Splunk Remote Deployment Account for the Search Head Cluster. For more information about TrackMe Remote Search capabilities and configuration:

Splunk remote deployments (splunkremotesearch)

On the Search Head Cluster, create a new Splunk bearer token for the TrackMe Remote Deployment Account:

screen001.png

In TrackMe, click on Configure / Remote Deployment Accounts and add a new account:

  • If running a Search Head Cluster, you can specify each of the SHC members in a comma-separated list, or use a load balancer URL.

  • If running a standalone Search Head, you can specify the Splunk API URL of the Search Head.

  • If multiple endpoints are specified, TrackMe automatically dispatches searches randomly amongst the available and responding members (it validates connectivity and authentication), as shown in the example below.
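
For example, a multi-member endpoint definition could look like the following (hostnames are illustrative; 8089 is the default splunkd management port):

https://sh1.mycompany.com:8089,https://sh2.mycompany.com:8089,https://sh3.mycompany.com:8089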

screen002.png screen003.png

Managing multiple Search Head tiers

  • If you have multiple Search Head tiers to be monitored, you can create a Remote Deployment Account for each tier.

  • You will then be able to manage and monitor as many Search Head tiers as you need from a single pane of glass in TrackMe.

Step 2: Create a Workload tenant and use the wizard to create trackers automatically

Access the tenant creation wizard and select the Workload component

screen001.png screen002.png

Choose the tenant_id and other main information

screen003.png

Define the type of Splunk deployment (Enterprise/Cloud)

Splunk Enterprise vs Splunk Cloud

  • The main difference here is that if Splunk Cloud is selected, TrackMe will create the trackers necessary to monitor the SVC consumption of the monitored scheduled searches.

  • Select Splunk Cloud if the environment being monitored is a Splunk Cloud environment.

screen004.png

Select the Search Head tier target

Search Head target

  • Although some of the search logic TrackMe implements could indeed run on the TrackMe instance itself if both tiers can search the same indexers, much deeper API-related logic comes into account for the Workload component.

  • Therefore, it is important to select the right Splunk Remote Deployment Account targeting the desired Search Head tier.

screen005.png

Define Search Restrictions

Search Restrictions

  • For each of the main categories of activity, you need to define and/or restrict the Search Heads to be considered for monitoring.

  • This is important to avoid monitoring searches that are out of the scope of the Search Head tier you target, but whose underlying activity is nonetheless searchable from the TrackMe instance.

  • In the example below, we restrict to a given Search Head tier using the filter (host=sh1 OR host=sh2 OR host=sh3).
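
To give a rough idea of what this restriction translates to at search time, here is a simplified sketch of a scheduler-log search scoped with that filter (TrackMe's actual generated searches are more elaborate):

index=_internal sourcetype=scheduler (host=sh1 OR host=sh2 OR host=sh3)
| stats count by app, savedsearch_name, status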

screen006.png

Tracking Enterprise Security Notables

Tracking Notable events

  • The Workload component can track notable events generated by correlation searches, and turn this into a TrackMe metric.

  • You can then use TrackMe to review the number of notable events created by your correlation searches.
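
For reference, a manual equivalent of this metric in Enterprise Security would be a simple count of notables per correlation search (a simplified sketch using the standard ES notable index):

index=notable
| stats count as notable_count by search_name
| sort - notable_count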

screen007.png

ML Outliers Detection

Outliers detection and Workload

  • TrackMe can use ML Outliers detection to automatically detect abnormal trends for a given metric.

  • In the context of the Workload component, TrackMe by default generates and trains models against the run time in seconds of the searches, so it can detect abnormal trends in search execution time.

  • You can customise and/or disable this behaviour.
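
As a purely conceptual illustration of run-time outlier detection (this is not TrackMe's actual model pipeline; it assumes the Splunk Machine Learning Toolkit is installed and uses an illustrative search name):

index=_internal sourcetype=scheduler status=success savedsearch_name="My scheduled search"
| timechart span=1h avg(run_time) as run_time
| fit DensityFunction run_time threshold=0.005
| where 'IsOutlier(run_time)'=1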

screen008.png

Grouping

Workload grouping

  • In the Workload component, the default behaviour is to group TrackMe entities per Splunk application namespace.

  • This means you should not monitor multiple Search Head tiers within the same tenant (entities would be mixed together).

  • If you wish to target multiple Search Head tiers in the same tenant, use the grouping option to group against a custom pattern.

  • You can also simply use this setting to force TrackMe not to group per application namespace.

screen009.png

Inactive entities

Managing entities without any activity

  • When a search schedule is disabled, the search will become inactive after some time.

  • This setting defines what TrackMe should do with these entities; the default behaviour is to remove entities that have not been active for more than 15 days.

screen010.png

Validate the tenant creation and start

After the tenant is created, wait up to 5 minutes before seeing the first entities in TrackMe

  • Immediately after the tenant is created, TrackMe starts orchestrating the Workload detection logic.

  • It can take up to 5 minutes before you start seeing entities created in the Workload component.

screen011.png screen012.png

Step 3: Start using the Workload component, detect and investigate scheduling issues

Ignoring specific Splunk application namespaces

  • You can ignore specific Splunk application namespaces from the Workload component.

  • This can be useful to exclude Splunk out-of-the-box applications that are not relevant to your monitoring.

  • Click on “Manage: enable/disable apps” to access the configuration screen

screen013.png screen014.png

Use case: detecting skipping searches

What skipping searches are and how they affect the Splunk workload

  • Skipping searches happen when a given search takes too long to execute and its next scheduled run is skipped by the scheduler.

  • The cron schedule defines how often the search is executed, and indirectly defines how long, in seconds, the search can run before a new execution should normally start.

  • If Splunk finds that an instance of the search is still running at that point, the new execution is skipped (see the sketch below).
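
You can approximate the skip ratio yourself from the scheduler logs, as in this simplified sketch of the kind of computation TrackMe performs:

index=_internal sourcetype=scheduler (status=success OR status=skipped OR status=continued)
| stats count(eval(status=="skipped")) as skipped, count as total by app, savedsearch_name
| eval pct_skipped=round(100*skipped/total, 2)
| sort - pct_skipped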

Depending on the percentage of skipped searches, TrackMe sets the entity status to orange or red:

uc_skipped001.png uc_skipped002.png

TrackMe’s versioning, cron schedule and cron sequence

  • TrackMe performs versioning of all monitored searches, and detects when a search is updated.

  • It also extracts the cron schedule, and uses the croniter Python library to calculate the maximum run-time sequence in seconds of the search.

  • You can use this information to easily understand, without leaving TrackMe, the reasons for skipping searches.

In this example, we can see that the cron schedule is set to run every 5 minutes, and that the maximum run time sequence is 300 seconds:

uc_skipped003.png

However, the run time is well over acceptable values, and the search is skipped:

uc_skipped004.png

This search is clearly encountering quality issues, and needs to be reviewed and redesigned to perform properly, or given a more suitable cron schedule.
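
To verify this kind of situation from the scheduler logs, you can plot the run time against the 300-second window from this example (a simplified sketch; the search name is illustrative):

index=_internal sourcetype=scheduler savedsearch_name="My 5 minutes search"
| timechart span=30m avg(run_time) as avg_run_time, count(eval(status=="skipped")) as skipped_runs
| eval cron_exec_sequence_sec=300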

Use case: detecting searches suffering from execution anomalies

Execution errors typically indicate non-working searches and may silently affect your Splunk use cases

  • There are plenty of reasons why a Splunk search could start failing at some point in its life cycle.

  • This may happen due to temporary infrastructure issues, incorrect syntax, breaking changes, third-party application installations or updates… plenty!

  • This also gets more complex and challenging as the environment and use cases continuously grow; very often, use cases fail silently and are not detected until a major issue is encountered.

  • With TrackMe's Workload component, execution anomalies are detected at scale, surfaced, and automatically investigated for use case and platform owners to review as soon as possible (see the sketch below).
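
For reference, a manual way of spotting scheduler execution errors could look like this simplified sketch (TrackMe's detection and investigation logic is more elaborate):

index=_internal sourcetype=scheduler log_level=ERROR
| stats count as error_count, latest(_time) as last_error by app, savedsearch_name
| convert ctime(last_error)
| sort - error_count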

Searches suffering from execution anomalies are tagged with anomaly_reason=execution_errors_detected:

uc_execution_anomalies001.png

Open the entity and review when execution errors were detected

uc_execution_anomalies002.png

In a single click, you can use the SmartStatus to retrieve a meaningful sample of the Splunk scheduler execution errors:

Hint

SmartStatus provides access to the execution errors from _internal for use case owners with no access to the _internal index

  • With the SmartStatus and TrackMe’s RBAC management, use case owners can transparently access the execution errors from _internal, even if they do not have access to the _internal index.

  • This is possible via a controlled elevation of privileges: users that are granted access to the TrackMe API endpoints through the RBAC configuration can execute the SmartStatus action, which itself runs with system-level privileges.

  • Finally, TrackMe also executes the SmartStatus automatically via the built-in alert action, and indexes its results into the associated TrackMe summary index (see the sketch below).
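
Because the automated SmartStatus results are indexed, you can also review them directly from the TrackMe summary index; a generic way to discover the relevant sourcetypes (names vary depending on the TrackMe version):

index=trackme_summary
| stats count by sourcetype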

uc_execution_anomalies003.png uc_execution_anomalies004.png uc_execution_anomalies005.png uc_execution_anomalies006.png

Use case: Use TrackMe’s versioning to establish relations between search updates and changes in behaviour

Say you have a search which suddenly started to generate execution anomalies; TrackMe captures this fact and impacts the entity:

uc_versionning001.png uc_versionning002.png

Looking at the versioning tab, you can see the changes made to the search, and when these were made:

uc_versionning003.png uc_versionning004.png

From the metrics perspective, we break down by version_id and identify when the change was made and how it impacted the search behaviour:

uc_versionning005.png uc_versionning006.png

You can also look at the corresponding versioning events:

index=trackme_summary sourcetype="trackme:wlk:version_id" object=*

uc_versionning007.png

Use case: Detect delayed searches

By monitoring scheduled activity, TrackMe also detects if a search has stopped being executed

  • Using its versioning capabilities, TrackMe interprets the cron schedule and defines the cron_exec_sequence_sec (using the Python library croniter).

  • This value in seconds represents the time expected between two executions of the search.

  • If a search that was discovered suddenly stops being executed, TrackMe will detect it and tag the entity as delayed.

  • TrackMe applies a 60-minute grace period before impacting the entity.

  • This is a vital verification, especially for SIEM environments: being able to detect use cases that are no longer executed is very important and should lead to an investigation of the issue (a simplified sketch of the logic follows).
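
Conceptually, the detection compares the time since the last execution against cron_exec_sequence_sec plus the grace period, as in this simplified sketch (the 300-second sequence value is illustrative; TrackMe derives it per search via croniter):

index=_internal sourcetype=scheduler status=success
| stats latest(_time) as last_exec by app, savedsearch_name
| eval cron_exec_sequence_sec=300, grace_sec=3600
| eval seconds_since_last=now() - last_exec
| where seconds_since_last > cron_exec_sequence_sec + grace_sec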

In the versioning tab, you can see the cron_schedule as well as the cron_exec_sequence_sec:

uc_delayed001.png

TrackMe detects that the search has stopped being executed, and tags the entity as delayed:

uc_delayed002.png

The search was indeed disabled at the Splunk level, which may or may not have been intentional, and should be reviewed:

uc_delayed003.png

Use case: Monitoring Data Model Acceleration and Report Acceleration behaviour and performance

Splunk Accelerated Searches in the Workload component

  • TrackMe also discovers and monitors searches that relate to Splunk Data Model Acceleration (DMA) and Report Acceleration.

  • Especially for DMA, it is important to monitor the level of skipped searches per data model, as well as the elapsed-time performance of these searches.

  • TrackMe will, for instance, automatically alert if these are suffering from high skipped-search ratios.

  • Finally, TrackMe also uses Machine Learning to detect outliers in the performance of these searches, and may alert if an abnormal increasing trend is detected.
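
For reference, DMA acceleration searches can be recognised in the scheduler logs by their _ACCELERATE_ naming convention, which also makes a manual skip-ratio check straightforward (a simplified sketch):

index=_internal sourcetype=scheduler savedsearch_name="_ACCELERATE_*"
| stats count(eval(status=="skipped")) as skipped, count as total by savedsearch_name
| eval pct_skipped=round(100*skipped/total, 2)
| sort - pct_skipped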

DMA and report acceleration searches are automatically discovered and monitored by TrackMe:

uc_dma001.png

All metrics are available, such as elapsed (the run time in seconds):

uc_dma002.png

For instance, this environment is struggling to maintain DMA, indicating underlying issues that can affect your security use cases:

uc_dma003.png

Splunk Cloud customers can easily review related SVC usage:

uc_dma004.png

Splunk Enterprise and Splunk Cloud customers can also review CPU usage and memory consumption metrics:

uc_dma005.png uc_dma006.png

TrackMe’s unique ML Outliers detection clearly caught the degraded situation:

uc_dma007.png

There are plenty more use cases that can be covered by the Workload component, and the above are just a few examples.

Some more documentation: