Use Case Demo: This is All About Feeds

  • This white paper provides an overview of Splunk feeds tracking with TrackMe.

  • It covers the key concepts and features around tracking the availability and indexing performance of data sources and hosts.

  • Feeds tracking is a key feature of TrackMe: it represents its root use case and, in many cases, the first reason why customers adopt TrackMe.

  • This use case demo documentation explains and demonstrates the key features of TrackMe feeds tracking.

High Level Overview

At the core of TrackMe's concepts stands the idea of tracking the availability and indexing performance of data sources and hosts.

The key features can be summarized as follows:

  • Discovery: Discover and organize data sources automatically based on meaningful key metadata.

  • Persistence: Persist entities and their key information over iterations, using various features from Splunk KVstore collections to events and metrics indexing.

  • Availability: Track entities availability over time based on delay thresholds, to detect when a given entity has stopped forwarding events to Splunk.

  • Performance: Track entities indexing performance over time based on latency thresholds, to detect when a given entity is suffering from indexing performance issues.

  • Quality: Provide out-of-the-box and automated capabilities to detect a meaningful lack of quality in the entities' data.

  • Metrics: Generate and store metrics in a high-performance way, using Splunk metric store capabilities.

  • Knowledge: Accumulate knowledge over time to improve investigations and decision-making.

  • Machine Learning: Implement meaningful and automated Machine Learning models to detect volume variations and potential anomalies on entities.

  • Flexibility: Provide flexible and powerful capabilities to cover from basic tracking use cases to the most complex ones.

  • Alerting: Provide out-of-the-box powerful and advanced alerting capabilities, from state-aware alerting to rich email deliveries, state-aware actions, and more.

  • Scalability: Scale at any level, from environments of a few terabytes to environments of hundreds of terabytes.

  • User Experience: Provide a rich, easy, and fast user experience, with user interfaces and APIs that deliver fast answers and easy browsing of TrackMe entities.

This use case demo documentation covers five of the most common and valuable feeds-related use cases:

  • Feed Interruption Detection

  • Feed Indexing Performance Issues Detection

  • Feed Volume Variation Detection

  • Feed Quality Issues Detection

  • Feed Hosts Distinct Count Anomaly Detection

Many more use cases can be implemented with TrackMe, and through our Flex Trackers components, you can create custom monitoring solutions tailored to your specific needs.

Feeds Tracking with TrackMe Data Sources Monitoring (splk-dsm)

The Data Sources Monitoring TrackMe component (splk-dsm) is the key component to track the availability and indexing performance of data sources.

In its simplest form, it tracks feeds, creating and organizing entities based on index and sourcetype metadata.
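
Conceptually, this default breakdown by index and sourcetype can be pictured with a simple SPL sketch like the following (an illustration only, TrackMe trackers are far richer than this and handle persistence, thresholds, metrics, and more):

    | tstats count latest(_time) as last_event_time where index=* by index, sourcetype
    | eval last_event_time = strftime(last_event_time, "%c") ``` one row per index/sourcetype couple, i.e. per feed entity ```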

Virtual Tenants allow customers to segment use cases based on their needs, from geographically oriented use cases to use cases based on data sensitivity or perimeters:

tenants_overview01.png

Virtual Tenant Entities Overview:

tenants_overview02.png tenants_overview03.png

The Home UI is the main user interface to manage your TrackMe entities:

home_ui01.png

To access the status of a given entity, representing a Splunk feed (by default identified by its index and sourcetype metadata), click on the left-hand side icon:

home_ui02.png

The entity details page provides a rich set of information about the entity, including its status, availability, indexing performance, metrics, accumulated knowledge, Machine Learning insights, alerting, and more:

entity_details01.png

Hosts Tracking with TrackMe Hosts Sources Monitoring (splk-dhm)

The second main component for feeds tracking is the Hosts Sources Monitoring TrackMe component (splk-dhm).

It acts similarly to the Data Sources Monitoring TrackMe component (splk-dsm), with the difference that it tracks data from the host (events producer) perspective, which means the sourcetypes associated with a given host, depending on the tracker’s data scope:

hosts_overview01.png hosts_overview02.png
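
As an illustration, and under the same caveat that this is only a simplified sketch, the host-centric view is comparable to grouping events per host rather than per index/sourcetype couple:

    | tstats count values(sourcetype) as sourcetypes latest(_time) as last_event_time where index=* by host
    | eval delay_sec = now() - last_event_time ``` seconds since the host last produced events ```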

Primary Feeds Key Performance Indicators (KPIs)

TrackMe is all about Key Performance Indicators (KPIs) and high-performing metrics: whatever the use case, we turn data into metrics, then act on these.

In all components, you will find a tab called Performance Metrics which lists available metrics for the given entity.
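
Because TrackMe stores these KPIs in the Splunk metric store, they can also be queried directly with mstats. The sketch below assumes the default trackme_metrics metric index (adjust to your deployment) and uses a wildcard on the metric names; refer to the Performance Metrics tab of your entities for the exact metric names to use:

    | mstats avg(_value) as value where index=trackme_metrics metric_name="trackme.*" span=5m by metric_name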

For Data Sources and Hosts, the primary KPIs are:

  • Latency: The performance at the indexing level from the Splunk perspective, as the difference between the indexed time and the event time.

  • Delay: How late the data is, calculated as the difference between now when the tracker is executed, and the very last event time.

Latency metrics:

performance_metrics01.png

Delay metric:

performance_metrics02.png
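
For illustration, both KPIs can be approximated with a basic SPL search; the example below uses the _internal splunkd feed simply because it exists on any Splunk instance, and is only a sketch of the calculations TrackMe performs at scale:

    index=_internal sourcetype=splunkd
    | eval latency_sec = _indextime - _time ``` latency: indexed time minus event time ```
    | stats avg(latency_sec) as avg_latency_sec max(latency_sec) as max_latency_sec max(_time) as last_event_time
    | eval delay_sec = now() - last_event_time ``` delay: now minus the very last event time ```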

These two primary KPIs are those linked to the primary thresholds used to detect potential interruptions or performance issues for sources:

primary_thresholds01.png primary_thresholds02.png

Use Case: Feed Interruption Detection

A first essential use case you want to cover when it comes to feeds availability monitoring is the detection of feed interruptions. In many use cases, you might expect the data to be flowing into Splunk in real-time or semi-real-time, though this isn’t necessarily the case, and you need to have the capabilities to handle this on a case-by-case basis.

To cover this use case, TrackMe tracks the very latest event available on a per-entity basis, that is, the latest event from the timestamp point of view (field _time). We calculate the difference between now, when the tracker is executed, and this last event, which gives an indicator in seconds.

If this indicator goes beyond the established delay threshold, this is considered an anomaly condition, with anomaly_reason="delay_threshold_breached":

example_delay_threshold01.png example_delay_threshold02.png example_delay_threshold03.png example_delay_threshold04.png
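
For illustration, the underlying logic can be sketched with a simple SPL search that flags feeds whose last event is older than a threshold; the 4 hours value below is an arbitrary example, whereas TrackMe manages thresholds on a per-entity basis (including adaptively, as described in the following section):

    | tstats latest(_time) as last_event_time where index=* by index, sourcetype
    | eval delay_sec = now() - last_event_time
    | eval anomaly_reason = if(delay_sec > 14400, "delay_threshold_breached", null()) ``` 14400s = 4 hours, example threshold ```
    | where isnotnull(anomaly_reason)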

Depending on the conditions, an interruption in the data flow can also generate an Outliers anomaly detection:

example_delay_threshold_mldetection.png

If alerting has been set up, TrackMe generates an alert with the following content:

example_delay_threshold05.png example_delay_threshold06.png example_delay_threshold07.png

At some point, the issue is resolved and the data is back in Splunk:

example_delay_threshold08.png example_delay_threshold09.png

The alert is resolved and a closure notification is generated and sent:

example_delay_threshold10.png example_delay_threshold11.png example_delay_threshold12.png example_delay_threshold13.png

About Adaptive Delay Thresholding

In addition to the threshold concept above, TrackMe also leverages Machine Learning to adapt the delay threshold based on the knowledge accumulated over time on a per-entity basis, and depending on the conditions.

This is called “Adaptive Delay Thresholding” and is widely documented here: Splunk Feeds Delayed & Inactive Entities (splk-feeds).

This feature is very valuable to reduce administration costs by automatically defining the most suitable values according to the entity behaviour: it can increase and decrease thresholds depending on the events and statuses.
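
As a purely conceptual illustration of the idea (this is not TrackMe's actual Machine Learning model, which is documented in the splk-feeds documentation referenced above), a suitable delay threshold could be derived from the gaps observed in the entity's own history:

    | tstats count where index=_internal sourcetype=splunkd earliest=-7d by _time span=10m
    | where count > 0
    | streamstats current=f window=1 last(_time) as prev_bucket
    | eval gap_sec = _time - prev_bucket ``` gap between consecutive non-empty 10 minutes buckets ```
    | stats max(gap_sec) as max_gap_sec perc95(gap_sec) as p95_gap_sec
    | eval suggested_delay_threshold_sec = round(max_gap_sec * 1.2, 0) ``` add a 20% margin on top of the worst observed gap ```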

Use Case: Feed Indexing Performance Issues Detection

Understanding Latency in Splunk

Latency is a critical metric that measures the time gap between when an event is generated and when it becomes available in Splunk. Think of it as the “travel time” for your data:

example_latency5.png

Why Latency Matters

When latency issues occur, they can cause significant problems:

  • Search Inconsistency: Searches run at different times may return different results

  • Delayed Insights: Critical information arrives later than expected

  • Resource Impact: Can indicate underlying system problems

Common Causes of Latency

Latency issues can stem from various sources:

  • Queue congestion

  • Resource constraints

  • Network bottlenecks

  • System overload

  • Configuration issues

How TrackMe Detects Latency Issues

TrackMe monitors latency by comparing the indexed time against the event time. When latency exceeds your configured threshold, TrackMe triggers an anomaly with anomaly_reason="latency_threshold_breached".
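
For illustration, the latency KPI can be charted over time with a basic SPL sketch; the 600 seconds threshold below is an arbitrary example, whereas TrackMe applies the latency threshold configured for each entity:

    index=_internal sourcetype=splunkd
    | eval latency_sec = _indextime - _time
    | timechart span=5m avg(latency_sec) as avg_latency_sec max(latency_sec) as max_latency_sec
    | eval status = if(max_latency_sec > 600, "latency_threshold_breached", "ok") ``` 600s is an example threshold ```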

Let’s consider the following entity; currently, the latency is very low and the entity is considered healthy:

example_latency_threshold01.png example_latency_threshold02.png

Suddenly, the entity starts experiencing latency issues. We begin receiving events with a significant time gap between their generation and indexing.

Some additional comments about latency:

  • It is likely that latency will also affect the delay, but this is not necessarily the case.

  • You may well be receiving a mix of real-time or quasi-real-time events while, at the same time, receiving events made available with a large latency.

  • This can happen for a variety of reasons, from data feeds catching up after an interruption to collectors suffering from performance issues while others sending to the same context do not.

The situation escalates as the entity experiences high latency combined with a dramatic increase in event volume:

example_latency_threshold03.png example_latency_threshold04.png example_latency_threshold05.png

TrackMe immediately responds by opening an incident and sending a notification:

example_latency_threshold06.png example_latency_threshold07.png example_latency_threshold08.png

After the issue is resolved, the entity returns to a healthy state and the incident is automatically closed:

example_latency_threshold09.png example_latency_threshold10.png example_latency_threshold11.png example_latency_threshold12.png

Use Case: Abnormally Low Volume Trend Detection (Outliers Detection)

In the following example, our entity is healthy: data is making it to Splunk, it is not suffering from latency, and the volume is stable and expected:

example_outliers01.png example_outliers02.png example_outliers03.png example_outliers04.png example_outliers05.png
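
As a simplified illustration of the concept (TrackMe's Outliers engine trains and maintains per-entity Machine Learning models, so this is only a sketch), an abnormally low hourly volume could be spotted against a moving average and standard deviation; the 7 days window and the 3-sigma lower bound are arbitrary example values:

    | tstats count where index=_internal sourcetype=splunkd by _time span=1h
    | streamstats current=f window=168 avg(count) as avg_count stdev(count) as stdev_count ``` 168 hours = 7 days of hourly buckets ```
    | eval lower_bound = avg_count - (3 * stdev_count)
    | eval is_low_outlier = if(count < lower_bound, 1, 0)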

At some point, we start to see issues: we are still receiving events, but in a proportion that is not expected and indicates a potential problem:

example_outliers06.png example_outliers07.png

As this entity has a disruption queue configured, once the duration is breached, the entity transitions to red and is officially in alert:

example_outliers08.png

A significant amount of events is still making it to Splunk; without looking at this through the lens of Machine Learning and accumulated knowledge, we would have a hard time understanding that an abnormally low volume is occurring:

example_outliers09.png example_outliers10.png

Once the disruption duration is over, the entity is in alert:

example_outliers11.png

An incident is opened and a notification is sent:

example_outliers12.png example_outliers13.png example_outliers14.png

Once the issue is resolved, the entity returns to green after the Outliers anomaly condition is resolved and TrackMe has processed the ML rendering for this entity (this can take some time depending on your settings and scale):

example_outliers15.png

The incident is closed and a closure notification is sent:

example_outliers16.png example_outliers17.png

Use Case: Quality Issues Detection (Event Format Recognition)

In Data Source monitoring (splk-dsm), TrackMe performs automated event format recognition using the events format recognition engine.

During this process, samples of events are picked regularly, 10k events by default, and processed through the engine to detect the format of events, classified by types.

By default, TrackMe alerts on a lack of quality if it detects that the main format falls below 98% of the events being correctly classified.
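
As an illustration of the concept only (this is not the actual TrackMe events format recognition engine), classifying a sample of events into formats and measuring the share of the main format could be sketched as follows; the regular expressions are arbitrary examples:

    index=_internal sourcetype=splunkd
    | head 10000 ``` sample size, 10k events by default in TrackMe ```
    | eval format = case(match(_raw, "^\s*\{"), "json", match(_raw, "^<\d+>"), "syslog", match(_raw, "^\d{4}-\d{2}-\d{2}"), "date_prefixed", true(), "other")
    | stats count by format
    | eventstats sum(count) as total
    | eval pct = round(100 * count / total, 2) ``` the main format is expected to stay above 98% by default ```
    | sort - pct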

Let’s consider the following entity: the data sampling engine detected a single format and the entity is considered healthy:

example_quality01.png example_quality02.png

At some point, new sources are on-boarded within the same sourcetype; however, the expected format is mistakenly not configured properly, and events are making it in with a different and unexpected structure:

example_quality03.png

This may or may not be a major issue, but in most use cases, receiving different types of events within the same sourcetype is not expected and leads to failures at index time (for instance, timestamp extraction) or at search time (failures to extract fields from unstructured events).

The data sampling engine detects that the format has changed: the main format previously detected no longer matches the required percentage of events, and new event formats have been introduced, leading to an alert for quality issues:

example_quality04.png example_quality05.png example_quality06.png example_quality07.png

The quality issue detected influences the entity status, and an anomaly_reason="data_sampling_anomaly" is generated:

example_quality08.png example_quality09.png example_quality10.png

Once the root cause is identified and the issue is fixed, after some time the entity will return to green when the expected format is correctly detected again:

example_quality11.png example_quality12.png example_quality13.png

Use Case: Hosts Distinct Count Anomaly

Another use case is to detect anomalies in your feeds based on the number of distinct hosts making it to Splunk.

This is a valuable use case, especially at large scale, as it allows easily detecting issues such as the potential loss of intermediate collectors or the loss of a major source.

This can be performed using a static threshold against one of the distinct hosts metrics, or, for more advanced use cases, using Machine Learning Outliers detection.

The following example demonstrates the use of a static threshold to detect anomalies in the distinct hosts count; we can observe with the following entity that a certain number of hosts are systematically available:

example_hosts_distinct_count01.png

The big picture must be considered:

example_hosts_distinct_count02.png

Based on this observation, we can configure a static minimal threshold for this entity: if the number of distinct hosts goes below this threshold, the entity will be considered in alert:

example_hosts_distinct_count03.png
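
For illustration, the equivalent manual check could be sketched as follows, where 5 is an arbitrary example of a minimal threshold:

    | tstats dc(host) as distinct_hosts where index=_internal sourcetype=splunkd by _time span=1h
    | eval status = if(distinct_hosts < 5, "alert", "ok") ``` 5 is an example minimal threshold ```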

An alternative method, which can also be used in conjunction with the static threshold, is Machine Learning Outliers detection, which detects anomalies in the distinct hosts count while taking time patterns into account:

example_hosts_distinct_count04.png example_hosts_distinct_count05.png example_hosts_distinct_count06.png example_hosts_distinct_count07.png
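
As a simplified illustration of time-pattern-aware detection (TrackMe's actual Outliers models are richer than this sketch), the distinct hosts count could be compared against the typical value for the same hour of the day; the 30 days lookback and the 3-sigma lower bound are arbitrary example values:

    | tstats dc(host) as distinct_hosts where index=_internal sourcetype=splunkd earliest=-30d by _time span=1h
    | eval HourOfDay = strftime(_time, "%H")
    | eventstats avg(distinct_hosts) as avg_dc stdev(distinct_hosts) as stdev_dc by HourOfDay
    | eval lower_bound = avg_dc - (3 * stdev_dc)
    | eval is_outlier = if(distinct_hosts < lower_bound, 1, 0)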

If, at some point, the number of distinct hosts goes below the threshold, the entity will be considered in anomaly, and an incident will be opened.

Notes:

  • The “global” distinct count KPI measures the number of unique hosts across the entire tracker period. Due to this global aggregation, there may be a delay in detecting anomalies as the system needs to accumulate enough data to establish a clear pattern.

  • A decrease in the number of distinct hosts often correlates with a decrease in event volume. In such cases, TrackMe’s Machine Learning Outliers detection may trigger alerts for both volume-based and distinct hosts models simultaneously, providing multiple indicators of the same underlying issue.

For the purposes of this demo, we have set up both the static threshold and Machine Learning Outliers detection. After some time, both detection methods trigger:

example_hosts_distinct_count08.png example_hosts_distinct_count09.png example_hosts_distinct_count10.png example_hosts_distinct_count11.png example_hosts_distinct_count12.png example_hosts_distinct_count13.png example_hosts_distinct_count14.png example_hosts_distinct_count15.png example_hosts_distinct_count16.png

As usual, once the issue is resolved, the entity returns to green after the Outliers anomaly condition is resolved and TrackMe has processed the ML rendering for this entity:

example_hosts_distinct_count17.png example_hosts_distinct_count18.png example_hosts_distinct_count19.png example_hosts_distinct_count20.png