.. _white_paper_log_messages_logging_level_and_mldetection:

Analyse log messages logging level to detect behaviour anomalies using TrackMe's Flex Object and Machine Learning Anomaly Detection
####################################################################################################################################

.. admonition:: Detecting Splunk abnormal conditions through log messages logging level with TrackMe

   - This TrackMe whitepaper tutorial demonstrates how you can leverage TrackMe to detect **abnormal conditions on Splunk instances and deployments by analysing log messages and their logging level through the lens of Machine Learning Outliers detection**.
   - This use case tracks Splunk internal logs and their logging level, turning these into KPIs and submitting them to TrackMe's Machine Learning Outliers detection engine.
   - The objective is then to detect abnormal trends, such as an increase in the number of errors or a lack of informational messages, which would be symptomatic of potentially serious conditions affecting Splunk instances and deployments.
   - The same philosophy can be applied to any application's log messages, to detect abnormal behaviour in any application or system.

.. hint::

   **Requires a TrackMe licence:**

   - This use case is designed using TrackMe's restricted component Flex Object.
   - This component is only available to licensed users of TrackMe, and is not available with the free community edition of TrackMe.

Objective: Making sense of Splunk log messages logging level and detecting Splunk abnormal conditions
=====================================================================================================

**The basic logic is to track Splunk internal log messages through the lens of their logging level, similarly to the following Splunk search:**

*Let's take an example:*

::

    index=_internal host=* log_level=* (log_level=ERROR OR log_level=INFO OR log_level=WARN OR log_level=FATAL)
    ``` The break by statement operates on a per host basis, you may want to update the logic to group by tiers instead, such as grouping by indexer cluster, Search Head, Search Head Cluster and so forth (see the sketch below) ```
    | bucket _time span=30m
    | stats count(eval(log_level="INFO")) as info, count(eval(log_level="WARN")) as warn, count(eval(log_level="ERROR")) as error, count(eval(log_level="FATAL")) as fatal by _time, host
    | fillnull value=0
    | timechart span=30m sum(info) as info, sum(warn) as warn, sum(error) as error, sum(fatal) as fatal
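If you prefer to monitor logical tiers rather than individual hosts, as mentioned in the comment above, a minimal sketch of a per-tier variant follows. It assumes a hypothetical lookup named ``splunk_instances_tiers`` that maps each ``host`` to a ``tier`` value (for instance an indexer cluster, a Search Head Cluster and so forth); adapt the lookup and field names to your environment:

::

    index=_internal host=* log_level=* (log_level=ERROR OR log_level=INFO OR log_level=WARN OR log_level=FATAL)
    ``` hypothetical lookup mapping each Splunk host to a logical tier (indexer cluster, SHC, etc.) ```
    | lookup splunk_instances_tiers host OUTPUT tier
    | fillnull value="standalone" tier
    | bucket _time span=30m
    ``` same KPI calculation as previously, broken down by tier instead of host ```
    | stats count(eval(log_level="INFO")) as info, count(eval(log_level="WARN")) as warn, count(eval(log_level="ERROR")) as error, count(eval(log_level="FATAL")) as fatal by _time, tier
    | fillnull value=0
    ``` chart one KPI per tier (when splitting by a field, use a single aggregation) ```
    | timechart span=30m limit=0 sum(error) as error by tier

The same break by adaptation can be applied to the backfill search shown in Step 3 below.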
**Several challenges need to be tackled:**

- **volume**: the volume of log messages can be very high; running large searches can be very resource consuming, or not even realistic in a continuous monitoring context.
- **variability**: the volume of log messages can vary a lot, depending on time conditions, user activity and so forth.
- **relevance**: simply tracking errors is not meaningful; errors happen all the time and are not necessarily symptomatic of a problem. The key is to detect abnormal trends.

**The following chart shows the challenges quite well:**

.. image:: img_v2/white_papers/logging_level/chart_uc_challenge1.png
   :alt: chart_uc_challenge1.png
   :align: center
   :width: 1200px
   :class: with-border

**An ad-hoc investigation can possibly detect an abnormal condition; our goal is to leverage TrackMe to do this work efficiently:**

.. image:: img_v2/white_papers/logging_level/chart_uc_challenge2.png
   :alt: chart_uc_challenge2.png
   :align: center
   :width: 1200px
   :class: with-border

**TrackMe's Machine Learning Outliers detection:**

.. image:: img_v2/white_papers/logging_level/chart_uc_challenge3.png
   :alt: chart_uc_challenge3.png
   :align: center
   :width: 1200px
   :class: with-border

Implementation in TrackMe
=========================

Step 1: Create a Virtual Tenant
-------------------------------

**First, let's create a new Virtual Tenant for our use case. You could equally use an existing Virtual Tenant; for the purposes of this documentation, we will simply create a new one:**

.. image:: img_v2/white_papers/logging_level/create_vtenant1.png
   :alt: create_vtenant1.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/white_papers/logging_level/create_vtenant2.png
   :alt: create_vtenant2.png
   :align: center
   :width: 1200px
   :class: with-border

Step 2: Create the new Flex Object tracker
-------------------------------------------

**The use case was integrated in the Flex Object library in TrackMe 2.0.97, so you can simply call the template splk_splunk_infra_log_level_variations:**

*Note: You can find the use case search at the end of this document*

.. image:: img_v2/white_papers/logging_level/create_flexobject1.png
   :alt: create_flexobject1.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/white_papers/logging_level/create_flexobject2.png
   :alt: create_flexobject2.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/white_papers/logging_level/create_flexobject3.png
   :alt: create_flexobject3.png
   :align: center
   :width: 1200px
   :class: with-border

**Once the tracker has been executed at least once, we start tracking the log messages logging level on a per Splunk instance basis (which you can customise):**

.. image:: img_v2/white_papers/logging_level/create_flexobject4.png
   :alt: create_flexobject4.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/white_papers/logging_level/create_flexobject5.png
   :alt: create_flexobject5.png
   :align: center
   :width: 1200px
   :class: with-border

Step 3: (optional) backfill KPIs
--------------------------------

**You can optionally backfill the KPIs, so that the ML models can immediately be trained against a proper dataset:**

*Note: make sure to replace the name of the tenant in the Splunk search*

::

    index=_internal host=* log_level=* (log_level=ERROR OR log_level=INFO OR log_level=WARN OR log_level=FATAL) earliest=-30d@d latest=now
    ``` The break by statement operates on a per host basis, you may want to update the logic to group by tiers instead, such as grouping by indexer cluster, Search Head, Search Head Cluster and so forth ```
    | bucket _time span=30m
    | stats count(eval(log_level="INFO")) as "trackme.splk.flx.splunk.log_level.info" count(eval(log_level="WARN")) as "trackme.splk.flx.splunk.log_level.warn" count(eval(log_level="ERROR")) as "trackme.splk.flx.splunk.log_level.error" count(eval(log_level="FATAL")) as "trackme.splk.flx.splunk.log_level.fatal" by _time, host
    | fillnull value=0
    | eval alias=host
    | lookup trackme_flx_tenant_splunk-log-level alias OUTPUT _key as object_id, object, object_category, tenant_id
    | mcollect index=trackme_metrics split=t object, object_category, object_id, tenant_id

**Once the search has completed, the KPIs should be backfilled, depending on whether the internal index retention allowed it:**

.. image:: img_v2/white_papers/logging_level/backfill_kpis1.png
   :alt: backfill_kpis1.png
   :align: center
   :width: 1200px
   :class: with-border
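To confirm that the KPI metrics were properly generated, you can query the metric index directly with ``mstats``. The following is a minimal sketch, assuming the ``trackme_metrics`` index targeted by the ``mcollect`` above and a tenant named ``splunk-log-level`` as in this example; adjust the index, the ``tenant_id`` filter and the time range to your environment:

::

    | mstats avg(trackme.splk.flx.splunk.log_level.error) as avg_error avg(trackme.splk.flx.splunk.log_level.info) as avg_info
      WHERE index=trackme_metrics AND tenant_id="splunk-log-level" span=30m BY object
    ``` each object should return one row per 30 minutes bucket over the backfilled period ```
    | sort 0 object, _time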
Step 4: Review Machine Learning Outliers detection
--------------------------------------------------

**Once TrackMe has trained the Machine Learning models, you can review the results in the Machine Learning Outliers detection dashboard:**

*Note: You can manually run the mltrain job to force training the models now, or train a particular model through the UI*

.. image:: img_v2/white_papers/logging_level/ml_outliers1.png
   :alt: ml_outliers1.png
   :align: center
   :width: 1200px
   :class: with-border

**In our demo case, we know we already have an abnormal period, which we can add as an exclusion so that the models are not affected by this period:**

.. image:: img_v2/white_papers/logging_level/ml_outliers2.png
   :alt: ml_outliers2.png
   :align: center
   :width: 1200px
   :class: with-border

**Detecting an abnormal increasing trend in error logging:**

.. image:: img_v2/white_papers/logging_level/ml_outliers3.png
   :alt: ml_outliers3.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/white_papers/logging_level/ml_outliers4.png
   :alt: ml_outliers4.png
   :align: center
   :width: 1200px
   :class: with-border

**Detecting insufficient informational logging:**

.. image:: img_v2/white_papers/logging_level/ml_outliers5.png
   :alt: ml_outliers5.png
   :align: center
   :width: 1200px
   :class: with-border

**Our detection is now fully ready, and TrackMe will alert us if abnormal conditions are detected on the Splunk instances and deployments.**
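As a closing illustration, the simplified search below sketches the general idea behind boundary-based outlier detection on these KPIs: it flags 30 minutes buckets where errors rise well above, or informational messages drop well below, each object's historical average. This is only a rough conceptual sketch and not TrackMe's actual Machine Learning Outliers engine, which trains and manages models and boundaries for you; the index and tenant names follow the earlier example and should be adapted to your environment:

::

    | mstats avg(trackme.splk.flx.splunk.log_level.error) as error avg(trackme.splk.flx.splunk.log_level.info) as info
      WHERE index=trackme_metrics AND tenant_id="splunk-log-level" span=30m BY object
    ``` derive simple static per-object boundaries from the mean and standard deviation of each KPI ```
    | eventstats avg(error) as avg_error stdev(error) as stdev_error avg(info) as avg_info stdev(info) as stdev_info by object
    | eval upper_bound_error=avg_error+(3*stdev_error), lower_bound_info=avg_info-(3*stdev_info)
    ``` keep only the time buckets breaching a boundary ```
    | where error>upper_bound_error OR info<lower_bound_info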