.. _trackme_admin_outliers:

Outliers Anomaly Detection
##########################

Machine Learning Outliers Anomaly Detection in TrackMe
======================================================

**TrackMe implements Machine Learning Outliers Anomaly detection in every component, from the feeds tracking to the monitoring of scheduled activity in Splunk, based on the following concepts:**

- TrackMe relies on the **Splunk Machine Learning Toolkit** and the **Python Scientific Packages**, with its own custom logic and workflow which orchestrates the life cycle of anomaly detection in the product

- We use the ``apply`` and ``fit`` commands from the toolkit and orchestrate their usage when entities are discovered and maintained, the ``density function`` is used for the anomaly detection calculation

- TrackMe orchestrates the Anomaly Detection workflow in two essential steps: the ML models generation and training phase (``mltrain``), and the ML models rendering phase (``mlmonitor``) where TrackMe verifies the Anomaly detection status for a given entity

- Depending on the TrackMe ``component``, ML models are generated automatically using metrics that are relevant for the component activity, models can be created, deleted, and customised easily to change the model behaviours if necessary

- See the following white paper for a great use case around Machine Learning: :ref:`white_paper_detect_events_drop`

.. hint:: **ML models learn over time from historical data**

    - ML detection in TrackMe requires historical data and gets more accurate over time
    - Depending on the settings, this can require up to several weeks or months of historical data with the most granular parameters
    - Historical data translates to the metrics stored in TrackMe's metric store indexes, not the raw data itself
    - Metrics start to get generated as soon as TrackMe has discovered and started to maintain an entity
    - While TrackMe learns about the data by training ML models, it applies various safeties to avoid generating false positive alerts

.. hint:: **ML confidence (new in TrackMe 2.0.72)**

    - Since TrackMe 2.0.72, ML models have a confidence level which is calculated based on the number of days of historical data available for the models training
    - By default, TrackMe defines a minimal requirement of 7 days before the confidence level is set to ``normal``, otherwise it is set to ``low``
    - The confidence level and confidence_reason are stored in the rules KVstore collection, used while rendering models and also displayed in the **manage Outliers screen**
    - The minimal number of days of historical metrics required to set the confidence level to ``normal`` is driven by a system wide configuration option called ``splk_outliers_min_days_history``

.. hint:: **Time factor "none" for non seasonality driven outliers (new in TrackMe 2.0.72)**

    - Since TrackMe 2.0.72, ML models can be set with the ``time_factor`` defined to ``none``
    - This enables TrackMe's outliers calculations to exclude seasonal concepts, in some cases this can better address KPIs which are not driven by seasonality
    - This can be set on a per entity/model basis, or chosen as the default model setting when TrackMe initiates ML models (option: ``splk_outliers_detection_timefactor_default``)

.. hint:: **TrackMe 2.0.84 Outliers evolutions and enhancements**

    - In TrackMe 2.0.84, we have released major enhancements to the Outliers engine, the following notes describe the most important changes.

    - **splunk-system-user and private ownership:** ML Models are now owned by the splunk-system-user and private, this avoids having the models showing up in Splunk Web or the Splunk App for lookup editing, and improves Splunk API response time when loading a large number of lookups.

    - **schema-upgrade:** TrackMe via its schema upgrade process will automatically reassign existing ML models within 5 minutes of the upgrade.

    - **Improved Outliers endpoints and custom commands:** TrackMe's Outliers related endpoints have been improved with better, more sophisticated, smarter and more efficient code! Training and rendering Outliers TrackMe commands are notably improved with enhanced outputs and behaviors.

    - **Minimal thresholds for LowerBound and UpperBound Outliers:** You can now define, on a per model basis, minimal threshold values for LowerBound and UpperBound outliers, values not respecting these thresholds are rejected automatically.

    - **Outliers investigation and management user interfaces:** Outliers related user interfaces have been improved to provide more visibility, this includes single views showing the distinction between LowerBound and UpperBound outliers, rejected Outliers and corrected Outliers for each category.

    - **True context Simulation model training:** TrackMe's Outliers simulation now runs in ``true context``, this means that when you run a simulation, TrackMe trains a simulation model which 100% respects the live model behaviours, this provides a true context and prevents any deviation or inconsistency compared with the results of the live Outliers view.
    - **Automated ML training at the backend level when TrackMe calls rendering functions:** TrackMe now automatically trains an ML model when it calls the rendering phase if it detects that the model is out of date and has not been recently updated, the maximum number of days since the last training is controlled by a system wide parameter ``splk_outliers_max_days_since_last_train_default``. (15 days by default)

    - **Orphan ML records and ML models cleanup:** TrackMe now automatically cleans up orphan ML records, as well as orphan ML models, this is controlled by a new global health tracker called ``trackme_general_health_manager`` which runs daily, this job handles global health related tasks for all TrackMe tenants.

    - **Bulk actions:** Various key actions for Outliers management can be performed in bulk via the user interface: reset Outliers status, enable/disable, run ML train and ML monitor operations.

.. hint:: **TrackMe 2.0.89 Using custom algorithms, fit and apply extra parameters, additional selectable time periods and customisable boundaries extraction**

    - In TrackMe 2.0.89, we have released various additional capabilities, notably for customers with advanced Machine Learning requirements or practices.

    - **Custom algorithms:** You can define custom algorithms at the global configuration level (Configuration UI), these algorithms become available for selection when creating or updating ML models, and you can also define the default algorithm to use when TrackMe initiates ML models.

    - **boundaries extraction macro:** TrackMe now refers to a Splunk macro for the extraction of the boundaries, to address any requirements, you can also add custom boundaries extraction macros, and influence the default macro used when creating or updating ML models.
    - **fit extra parameters:** You can define extra parameters for the ``fit`` command, defined by default and/or modified per ML model, with this feature, you can define extra parameters allowed by the Splunk MLTK, for more information see: https://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms. (As an example, you can set the extra parameters to ``exclude_dist="beta"`` to exclude the Beta distributions in the density function)

    - **apply extra parameters:** Similarly, you can define extra parameters for the ``apply`` command, defined by default and/or modified per ML model.

    - **additional time periods:** More period options have been added, so you can extend the training of models to long time ranges beyond 90 days.

.. hint:: **TrackMe 2.1.10 Machine Learning Outliers selective enablement and UI charting fix**

    - In TrackMe 2.1.10, we have released the capability to selectively enable or disable Machine Learning Outliers on a per tenant AND component basis.
    - This relies on the ``mloutliers`` and ``mloutliers_allowlist`` parameters, the allowlist parameter restricts the list of components for which Machine Learning Outliers enablement applies.
    - We have also fixed a remaining UI issue that was leading to charting not loading under some circumstances.

Data seasonality and behaviours
===============================

In most cases, data have typical patterns which we can eventually recognize when running investigations.

The situation is however more complex when it comes to automating this recognition, Machine Learning and the current major progress in AI are leading the way with powerful new ways to tackle these challenges.

.. note:: **Generating samples for outliers detection**

    - You can find, download and use with no restrictions the following content: https://github.com/trackme-limited/mlgen-python
    - We use this content to generate samples with seasonality concepts for the purposes of development, qualification and documentation

*Sample pattern over past 30 days, seasonality by week days with higher activity during the working hours*

.. image:: img_v2/ml_outliers/sample_pattern.png
   :alt: sample_pattern.png
   :align: center
   :width: 1200px
   :class: with-border

Data not driven by seasonality
================================

In some cases, applying seasonal concepts to the data may not be the best approach, KPIs with no variation depending on week days or hours are good examples.

Since TrackMe 2.0.72, ML models can be set with the ``time_factor`` defined to ``none``, this enables TrackMe's outliers calculations to exclude seasonal concepts:

.. image:: img_v2/ml_outliers/time_factor_none.png
   :alt: time_factor_none.png
   :align: center
   :width: 1200px
   :class: with-border

Confidence level
================

Since TrackMe 2.0.72, TrackMe establishes a confidence level when training ML models, this confidence level can be:

- ``low``: TrackMe maintains ML Outliers for training and rendering purposes, but the Outliers status will not influence the entity status
- ``normal``: TrackMe maintains ML Outliers for training and rendering purposes, and the Outliers status will influence the entity status

The values of confidence and confidence_reason are stored in the rules KVstore collection, they can be easily viewed in the **manage Outliers screen** as well as by accessing the rules:

.. image:: img_v2/ml_outliers/confidence_level.png
   :alt: confidence_level.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/confidence_level2.png
   :alt: confidence_level2.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/confidence_level3.png
   :alt: confidence_level3.png
   :align: center
   :width: 1200px
   :class: with-border

Minimal and Maximum thresholds for LowerBound and UpperBound Outliers breaches
==============================================================================

Since TrackMe 2.0.84, you can define minimal and maximal thresholds for LowerBound and UpperBound Outliers, values not respecting these thresholds are rejected automatically:

- ``min_lower_bound_threshold``: The minimal value for the LowerBound Outliers, values below this threshold are rejected (for LowerBound breaches)
- ``max_upper_bound_threshold``: The maximal value for the UpperBound Outliers, values above this threshold are rejected (for UpperBound breaches)

When an outlier is detected, TrackMe's backend verifies if a min or max threshold has been defined (depending on the type of outlier), if the value does not respect the threshold, the outlier is rejected and not taken into account in the entity status.

*True context Outliers simulation screen:*

.. image:: img_v2/ml_outliers/min_max_thresholds1.png
   :alt: min_max_thresholds1.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/min_max_thresholds2.png
   :alt: min_max_thresholds2.png
   :align: center
   :width: 1200px
   :class: with-border

*Live Outliers screen:*

.. image:: img_v2/ml_outliers/min_max_thresholds3.png
   :alt: min_max_thresholds3.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/min_max_thresholds4.png
   :alt: min_max_thresholds4.png
   :align: center
   :width: 1200px
   :class: with-border

*TrackMe's rendering commands also show the rejected counters and reasons:*

.. image:: img_v2/ml_outliers/min_max_thresholds5.png
   :alt: min_max_thresholds5.png
   :align: center
   :width: 1200px
   :class: with-border

Demonstrating Machine Outliers detection in TrackMe
===================================================

How ML Outliers works in TrackMe
--------------------------------

**In short:**

- All components are eligible for Machine Learning Outliers
- Outliers rely on TrackMe generated metrics only
- This allows running fast and efficient training and rendering searches, with minimal costs in terms of resources

**For the purpose of this demonstration, we create a Flex Object TrackMe tenant which takes into account our ML generator:**

- For more information about the TrackMe restricted Flex Objects component: :ref:`trackme_admin_flx`

*Our Flex tracker:*

- Tracker name: "demo"
- Runs every 5 minutes (earliest: -5m, latest: now)

::

    index=mlgen ref_sample=*
    | stats avg(dcount_hosts) as dcount_hosts, avg(events_count) as events_count by ref_sample
    | eval group = "demo"
    | eval object = "mlgen" . ":" . ref_sample, alias = ref_sample
    | eval object_description = "Demo Outliers in TrackMe"
    | eval metrics = "{'dcount_hosts': " . dcount_hosts . ", 'events_count': " . events_count . "}"
    | eval outliers_metrics = "{'dcount_hosts': {'alert_lower_breached': 1, 'alert_upper_breached': 1}, 'events_count': {'alert_lower_breached': 1, 'alert_upper_breached': 1}}"
    | eval status=1
    | eval status_description="Machine Learning Outliers detection demo"
    | table group, object, alias, object_description, metrics, outliers_metrics, status, status_description
    ``` alert if inactive for more than 3600 sec```
    | eval max_sec_inactive=3600

This Flex Tracker creates entities by monitoring the availability of data in our ML index, it also generates metrics and automates the definition of models which alert on both lower bound and upper bound outliers.
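The ``metrics`` and ``outliers_metrics`` fields produced by the tracker above are dictionary-like strings using single quotes. As a minimal sketch, assuming you want to preview the same payload shape outside of SPL, the following Python snippet builds equivalent strings. The function names ``build_metrics`` and ``build_outliers_metrics`` are hypothetical and are not part of TrackMe:

```python
def build_metrics(dcount_hosts, events_count):
    """Return the metrics payload as a single-quoted, dict-like string."""
    # str() on a dict yields the same single-quoted layout as the SPL eval
    return str({"dcount_hosts": dcount_hosts, "events_count": events_count})


def build_outliers_metrics(metric_names, alert_lower=1, alert_upper=1):
    """Return the outliers_metrics payload: one alerting definition per metric."""
    return str(
        {
            name: {
                "alert_lower_breached": alert_lower,
                "alert_upper_breached": alert_upper,
            }
            for name in metric_names
        }
    )


if __name__ == "__main__":
    # Illustrative values only, not taken from a real tracker run
    print(build_metrics(12.5, 1000.0))
    print(build_outliers_metrics(["dcount_hosts", "events_count"]))
```

This only illustrates the structure of the two fields; in the tracker itself the values come from the ``stats`` results of the search.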
**Our ML generator is running and has backfilled the past 90 days of data, it currently does not generate any outliers:**

::

    index=mlgen ref_sample=sample001 earliest=-90d
    | timechart span=1h avg(events_count) as events_count

.. image:: img_v2/ml_outliers/sample_pattern2.png
   :alt: sample_pattern2.png
   :align: center
   :width: 1200px
   :class: with-border

**Our ML generator takes into account the week days, we can use the following search to compare the activity of the current day against the same week day of the previous weeks:**

::

    index=mlgen ref_sample=sample001 earliest=@d latest=+1d@d
    | timechart span=5m avg(events_count) as events_count_today
    | appendcols [ search index=mlgen ref_sample=sample001 earliest=-7d@d latest=-6d@d | timechart span=5m avg(events_count) as events_count_ref ]
    | appendcols [ search index=mlgen ref_sample=sample001 earliest=-14d@d latest=-13d@d | timechart span=5m avg(events_count) as events_count_ref2 ]
    | appendcols [ search index=mlgen ref_sample=sample001 earliest=-21d@d latest=-20d@d | timechart span=5m avg(events_count) as events_count_ref3 ]

*results:*

.. image:: img_v2/ml_outliers/sample_pattern3.png
   :alt: sample_pattern3.png
   :align: center
   :width: 1200px
   :class: with-border

*TrackMe automatically discovered the entity, let's take note of its internal identifier which we will use to manually backfill TrackMe metrics, as if we had been monitoring this entity since the beginning:*

.. image:: img_v2/ml_outliers/ex_entity1.png
   :alt: ex_entity1.png
   :align: center
   :width: 1200px
   :class: with-border

*We use mcollect to force backfilling metrics, make sure to use the valid tenant_id value:*

::

    index=mlgen ref_sample=* earliest=-90d latest=-5m
    | bucket _time span=5m
    | stats avg(dcount_hosts) as trackme.splk.flx.dcount_hosts, avg(events_count) as trackme.splk.flx.events_count by _time, ref_sample
    | eval alias=ref_sample
    | lookup trackme_flx_tenant_demo-outliers alias OUTPUT tenant_id, _key as object_id, object, object_category
    | where isnotnull(object_id)
    | mcollect index=trackme_metrics split=t object, object_category, object_id, tenant_id

*Opening the entity shows we have backfilled metrics now:*

.. image:: img_v2/ml_outliers/ex_entity2.png
   :alt: ex_entity2.png
   :align: center
   :width: 1200px
   :class: with-border

*Depending on whether TrackMe has already run the ML training job for the tenant, ML Outliers may not be ready yet:*

.. image:: img_v2/ml_outliers/ex_entity3.png
   :alt: ex_entity3.png
   :align: center
   :width: 1200px
   :class: with-border

*We can either run the mltrain job manually, or train the models via the UI:*

.. image:: img_v2/ml_outliers/ex_entity4.png
   :alt: ex_entity4.png
   :align: center
   :width: 1200px
   :class: with-border

*If we refresh TrackMe, we can now see ML is ready:*

- We have no outliers yet
- TrackMe applied default settings for the ML definition (training over the past 30 days, time per hour)

.. image:: img_v2/ml_outliers/ex_entity5.png
   :alt: ex_entity5.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/ex_entity6.png
   :alt: ex_entity6.png
   :align: center
   :width: 1200px
   :class: with-border

*Let's access and review the models definition, for now we will only increase the training period to the past 90 days:*

- Click on "Manage Outliers detection"
- Update the models to increase the time range for the calculation
- Manually run a training for each model
- Click on Simulate Selected to review the results (we have selected the event count model), this is looking great for now

.. image:: img_v2/ml_outliers/ex_entity7.png
   :alt: ex_entity7.png
   :align: center
   :width: 1200px
   :class: with-border

*You can click on See ML model details to access and review the models information:*

.. image:: img_v2/ml_outliers/ex_entity8.png
   :alt: ex_entity8.png
   :align: center
   :width: 1200px
   :class: with-border

Scenario: Detecting a lower bound outliers
------------------------------------------

Although we know we have a week day behaviour in the data, for now we will stick with the default settings and we will start generating a lower bound outlier.

To achieve this, we stop the ``run_backfill.sh`` and we start ``run_gen_lowerbound_outlier.sh``, which basically:

- Influences metrics with a large decrease of the curve, by approximately 75%, according to the magnitude of the week day / hour range

**After a few minutes, we effectively start to see a clear outlier using the previous days comparison timechart search:**

.. image:: img_v2/ml_outliers/ex_entity9.png
   :alt: ex_entity9.png
   :align: center
   :width: 1200px
   :class: with-border

**This outlier will also be reflected in TrackMe, this can take 5 to 10 minutes to be detected as an effective outlier:**

.. image:: img_v2/ml_outliers/ex_entity10.png
   :alt: ex_entity10.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/ex_entity10-2.png
   :alt: ex_entity10-2.png
   :align: center
   :width: 1200px
   :class: with-border

**Let's zoom in on the period:**

.. image:: img_v2/ml_outliers/ex_entity11.png
   :alt: ex_entity11.png
   :align: center
   :width: 1200px
   :class: with-border

Very nice, the sudden decrease in the activity has been detected successfully!

The next phase is to review the ML rendering phase, this means:

- The ``_mlmonitor_`` scheduled backend runs on a regular basis and attempts to review as many entities as possible in a given amount of time
- Depending on the volume of entities, this process can therefore require some time before the outlier is effectively noticed

**Let's run the mlmonitor job:**

.. image:: img_v2/ml_outliers/ex_entity12.png
   :alt: ex_entity12.png
   :align: center
   :width: 1200px
   :class: with-border

- By default, the ML monitor will attempt to verify any entity that has not been verified for more than an hour

**This behaviour can be customised in the system wide configuration:**

.. image:: img_v2/ml_outliers/ex_entity13.png
   :alt: ex_entity13.png
   :align: center
   :width: 1200px
   :class: with-border

For the purposes of the documentation, we will reduce this to 5 minutes and will re-run the monitor backend:

.. image:: img_v2/ml_outliers/ex_entity14.png
   :alt: ex_entity14.png
   :align: center
   :width: 1200px
   :class: with-border

From this output, we already know that the outliers were detected, TrackMe stores the outliers results in a dedicated KVstore collection:

*the name of the lookup transforms is trackme__outliers_entity_data_tenant_*

::

    | inputlookup trackme_flx_outliers_entity_data_tenant_02-demo-outliers

.. image:: img_v2/ml_outliers/ex_entity15.png
   :alt: ex_entity15.png
   :align: center
   :width: 1200px
   :class: with-border

**After some minutes, the TrackMe tracker updated the Metadata and the entity appears in red:**

.. image:: img_v2/ml_outliers/ex_entity16.png
   :alt: ex_entity16.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/ex_entity17.png
   :alt: ex_entity17.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/ex_entity18.png
   :alt: ex_entity18.png
   :align: center
   :width: 1200px
   :class: with-border

The job is all done, and we have successfully detected an abnormal change in behaviour, the entity was impacted, and our alerting configuration would raise an alert accordingly!

.. image:: img_v2/ml_outliers/ex_entity19.png
   :alt: ex_entity19.png
   :align: center
   :width: 1200px
   :class: with-border

Fine tuning the ML models
-------------------------

True context simulation since TrackMe 2.0.84
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since this TrackMe release, the principal Simulation screen runs simulations in ``true context``, this means that TrackMe will train an ML model dedicated to that simulation, allowing the Outliers detection to 100% reflect the live Outliers detection behaviors.

.. image:: img_v2/ml_outliers/true_context_simulation1.png
   :alt: true_context_simulation1.png
   :align: center
   :width: 1200px
   :class: with-border

Once a true context simulation has run, the simulation model details become visible in the entity rules: (click on the button ``See ML models rules``)

::

    | trackmesplkoutliersgetrules tenant_id="02-demo-outliers" component="flx" object="demo:mlgen:sample001"

.. image:: img_v2/ml_outliers/true_context_simulation2.png
   :alt: true_context_simulation2.png
   :align: center
   :width: 1200px
   :class: with-border

Fine tuning models
^^^^^^^^^^^^^^^^^^

As we previously mentioned, our generated data does have a concept of week days behaviours, which we do not leverage yet in the calculation.

To demonstrate this behaviour, we will stop our current data generation, update our sample entity name and generate a new data set by running the script ``run_backfill.sh`` again.
- We follow the same steps as previously to backfill the metrics using mcollect once the entity has been discovered
- Then, we edit the models to increase the period, and this time we ask TrackMe to take into account the week days in the outliers calculation

We can observe that the behaviour is slightly different, with much closer lower bound and upper bound ranges, this is because our data is a very good fit and has shown enough stability over time.

.. image:: img_v2/ml_outliers/ex_entity20.png
   :alt: ex_entity20.png
   :align: center
   :width: 1200px
   :class: with-border

Focus last 24 hours:

.. image:: img_v2/ml_outliers/ex_entity21.png
   :alt: ex_entity21.png
   :align: center
   :width: 1200px
   :class: with-border

Note that this also means that our outlier detection will be much more sensitive, which can lead to false positive alerts.

.. image:: img_v2/ml_outliers/ex_entity22.png
   :alt: ex_entity22.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/ex_entity23.png
   :alt: ex_entity23.png
   :align: center
   :width: 1200px
   :class: with-border

TrackMe implements a concept of "auto correction" based on minimal variation lower and upper dimensions, to avoid generating false positives:

.. image:: img_v2/ml_outliers/ex_entity24.png
   :alt: ex_entity24.png
   :align: center
   :width: 1200px
   :class: with-border

This time, we will generate upper bound outliers, we stop the current script and start the ``run_gen_upper_outlier.sh`` which slightly increases the volume of our metrics.

After some minutes, we can observe the variation:

.. image:: img_v2/ml_outliers/ex_entity25.png
   :alt: ex_entity25.png
   :align: center
   :width: 1200px
   :class: with-border

A little moment later, TrackMe notices the upper bound outlier:

.. image:: img_v2/ml_outliers/ex_entity26.png
   :alt: ex_entity26.png
   :align: center
   :width: 1200px
   :class: with-border

After a run of the ML monitor (which happens automatically), the upper bound condition is detected and the alert is raised accordingly.

.. image:: img_v2/ml_outliers/ex_entity27.png
   :alt: ex_entity27.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/ex_entity28.png
   :alt: ex_entity28.png
   :align: center
   :width: 1200px
   :class: with-border

Now, let's assume the "storm" is over, we stop the upper bound outlier gen script and instead we call ``run_normal.sh``, after some time the outliers no longer happen, TrackMe will notice and the entity will come back to a green status.

.. image:: img_v2/ml_outliers/ex_entity29.png
   :alt: ex_entity29.png
   :align: center
   :width: 1200px
   :class: with-border

Soon after, TrackMe sees the entity back to green:

.. image:: img_v2/ml_outliers/ex_entity30.png
   :alt: ex_entity30.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/ex_entity31.png
   :alt: ex_entity31.png
   :align: center
   :width: 1200px
   :class: with-border

And the job is done!

Note that from TrackMe version 2.0.62, you can exclude the anomaly period from the ML model, this will allow the ML model to learn from the data without the incident, and will therefore be more accurate in the future.

See: :ref:`ML period exception: excluding periods of time`

SmartStatus and Outliers
========================

The SmartStatus is a TrackMe feature which automatically runs investigations when a given entity enters an alerting mode (red), when it comes to Outliers, the SmartStatus automatically investigates the Outliers condition:

.. image:: img_v2/ml_outliers/smartstatus_outliers1.png
   :alt: smartstatus_outliers1.png
   :align: center
   :width: 1200px
   :class: with-border

SmartStatus is also an alert action which indexes its results in TrackMe's summary index, so you can review the actual condition when the alert came through:

::

    `trackme_idx(02-demo-outliers)` source="flx_smart_status" sourcetype=trackme:smart_status object="demo:mlgen:sample001"

.. image:: img_v2/ml_outliers/smartstatus_outliers2.png
   :alt: smartstatus_outliers2.png
   :align: center
   :width: 1200px
   :class: with-border

Accessing the ML models
=======================

**You can access the ML models rules using the following command:**

::

    | trackmesplkoutliersgetrules tenant_id="" component="" object=""

**TrackMe stores the ML models definition in a KVstore:**

::

    | inputlookup trackme__outliers_entity_rules_tenant_

Accessing the ML models current results
=======================================

**You can access the ML models data (results) using the following command:**

::

    | trackmesplkoutliersgetdata tenant_id="" component="" object=""

**TrackMe stores the ML models current results in a KVstore:**

::

    | inputlookup trackme__outliers_entity_data_tenant_

Disabling alerting on Outliers
==============================

**You can very simply disable alerting on Outliers on a per Virtual Tenant basis, access the Configuration UI and the Virtual Tenant account:**

.. image:: img_v2/ml_outliers/disable_outliers.png
   :alt: disable_outliers.png
   :align: center
   :width: 1200px
   :class: with-border

ML training scheduled jobs
==========================

When creating a Virtual Tenant, TrackMe creates a scheduled job called ``_ml_train_``, this job is responsible for training the ML models for the entities of the tenant and a given component.

.. image:: img_v2/ml_outliers/mltrain_job1.png
   :alt: mltrain_job1.png
   :align: center
   :width: 1200px
   :class: with-border

**The scheduled ML training job behaves as follows:**

- The job is scheduled to run once per hour

- It runs for a certain amount of time which is driven by an implicit argument ``max_runtime_sec`` to the command ``trackmesplkoutlierstrainhelper`` (defaults to 60 minutes minus a margin)

::

    max_runtime_sec = Option(
        doc="""
        **Syntax:** **max_runtime_sec=****
        **Description:** The max runtime for the job in seconds, defaults to 60 minutes less 120 seconds of margin.""",
        require=False,
        default="3600",
        validate=validators.Match("max_runtime_sec", r"^\d*$"),
    )

- The job is influenced by system wide options, see: :ref:`ML Outliers system wide options`

- The job runs a first Splunk search to discover entities to be trained

- The job logs its activity as follows:

::

    index=_internal sourcetype=trackme:custom_commands:trackmesplkoutlierstrainhelper

*For instance:*

::

    2023-10-17 16:07:02,992 INFO trackmesplkoutlierstrainhelper.py generate 481 {
        "tenant_id": "01-feeds",
        "action": "success",
        "results": "outliers models training job successfully executed",
        "run_time": 11.839,
        "entities_count": 12,
        "processed_entities": [
            {
                "object_category": "splk-dsm",
                "object": "webserver:apache:access:json",
                "search": "| trackmesplkoutlierstrain tenant_id=\"01-feeds\" component=\"dsm\" object=\"webserver:apache:access:json\"",
                "runtime": "0.5111777782440186"
            },
            {
                "object_category": "splk-dsm",
                "object": "webserver:nginx:plus:kv",
                "search": "| trackmesplkoutlierstrain tenant_id=\"01-feeds\" component=\"dsm\" object=\"webserver:nginx:plus:kv\"",
                "runtime": "0.5510256290435791"
            }
        ],
        "failures_entities": [],
        "search_errors_count": 0,
        "upstream_search_query": "| inputlookup trackme_dsm_outliers_entity_rules_tenant_01-feeds where object_category=\"splk-dsm\"\n | `trackme_exclude_badentities`\n | lookup local=t trackme_dsm_tenant_01-feeds object OUTPUT monitored_state\n | where 
        monitored_state=\"enabled\"\n | eval duration_since_last=if(last_exec!=\"pending\", now()-last_exec, 0)\n | where duration_since_last=0 OR duration_since_last>=600\n | sort - duration_since_last"
    }

In some cases and if you wish to reduce the number of changes performed by TrackMe, especially in a co-located SHC context (which we would recommend), you can update the scheduling plan and reduce its frequency.

**After having loaded the list of entities to be trained, the ML training backend attempts to sequentially train as many entities as possible in the allowed max run time.**

- Each search is a highly efficient search relying on TrackMe metrics (mstats search)
- Searches call the ``MLTK apply`` command to load the previously trained models
- Searches are driven by a TrackMe command called ``trackmesplkoutlierstrain``
- Logs are available here:

::

    index=_internal sourcetype=trackme:custom_commands:trackmesplkoutlierstrain

ML monitor scheduled jobs
=========================

When creating a Virtual Tenant, TrackMe creates an ``_mlmonitor_`` scheduled job, this job is responsible for monitoring the ML models for the entities of the tenant and a given component.

.. image:: img_v2/ml_outliers/mlmonitor_job1.png
   :alt: mlmonitor_job1.png
   :align: center
   :width: 1200px
   :class: with-border

**The scheduled ML monitor job behaves as follows:**

- The job is scheduled to run every 20 minutes

- It runs for a maximum of 15 minutes (to avoid generating skipped searches), influenced by an implicit argument to the TrackMe command:

::

    max_runtime = Option(
        doc="""
        **Syntax:** **max_runtime=****
        **Description:** Optional, The max value in seconds for the total runtime of the job, defaults to 900 (15 min) which is subtracted by 120 sec of margin.
        Once the job reaches this, it gets terminated""",
        require=False,
        default="900",
        validate=validators.Match("object", r"^\d*$"),
    )

- The job is influenced by system wide options, see: :ref:`ML Outliers system wide options`

- The job runs a first Splunk search to discover entities to be rendered

- The job logs its activity as follows:

::

    index=_internal sourcetype=trackme:custom_commands:trackmesplkoutlierstrackerhelper

- The job sequentially processes the entities to be rendered and runs a highly efficient search to render the ML models using mstats, orchestrated by the TrackMe command ``trackmesplkoutliersrender``

- The command logs its activity as follows:

::

    index=_internal sourcetype=trackme:custom_commands:trackmesplkoutlierstrackerhelper

- The command automatically updates the records in the outliers data KVstore collection

::

    | inputlookup trackme__outliers_entity_data_tenant_

ML period exception: excluding periods of time
==============================================

**From TrackMe version 2.0.62, you can exclude one or more periods of time on a per ML model basis:**

- If an incident occurs, you can exclude the period of time where the incident happened
- Doing so will allow the ML model to learn from the data without the incident, and will therefore be more accurate in the future
- When an exclusion period expires because its latest time is now out of the model training period, it is deleted automatically from the ML model during the next ML training phase

**For instance, the following entity was impacted by an abnormal behaviour, we can exclude the period of time where the incident happened:**

.. image:: img_v2/ml_outliers/exclude_period1.png
   :alt: exclude_period1.png
   :align: center
   :width: 1200px
   :class: with-border

**To exclude this period, click on "Manage Outliers detection" and then on "Period Exclusions", note that exclusions apply per ML model:**

.. image:: img_v2/ml_outliers/exclude_period2.png
   :alt: exclude_period2.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/exclude_period3.png
   :alt: exclude_period3.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/exclude_period4.png
   :alt: exclude_period4.png
   :align: center
   :width: 1200px
   :class: with-border

**When the next ML training happens for this entity, accessing the ML models details will show the exclusion period: (click on See ML model details)**

.. image:: img_v2/ml_outliers/exclude_period5.png
   :alt: exclude_period5.png
   :align: center
   :width: 1200px
   :class: with-border

**Finally, when the excluded period is out of the time range scope of the ML training, for instance if the ML is trained for the past 30 days and the exclusion period is beyond that, the exclusion period is deleted automatically:**

::

    index=_internal sourcetype="trackme:custom_commands:trackmesplkoutlierstrain" "period exclusion"

*example:*

::

    2023-10-17 16:46:15,044 INFO trackmesplkoutlierstrain.py generate 486 tenant_id=02-demo-outliers, object=demo:mlgen:sample111, model_id=model_144991952201473 rejecting period exclusion as it is now out of the model period calculation: {
        "period_exclusion_id": "180ea7b1a6ec737c823cfc38a6cd9414",
        "earliest": 1682947800,
        "earliest_human": "Mon May 1 13:30:00 2023",
        "latest": 1683041400,
        "latest_human": "Tue May 2 15:30:00 2023",
        "ctime": "1697561162.0"
    }

ML Outliers system wide options
===============================

**The following options are applied globally, these influence the Outliers detection behaviours and/or ML models definition: (when entities are discovered or ML is being reset)**

.. image:: img_v2/ml_outliers/system_wide_options.png
   :alt: system_wide_options.png
   :align: center
   :width: 1200px
   :class: with-border

**Options:**

.. list-table:: ML Outliers system wide options
   :widths: 30 70
   :header-rows: 1

   * - Option
     - Purpose
   * - **Min days historical metrics for confidence**
     - The minimal number of days of historical metrics required to compute the confidence level of the outliers detection, defaults to 7 days
   * - **Requested time models training**
     - The time value in seconds requested for ML models to be trained for entities, a given entity will regularly get ML models trained if possible based on this value. (defaults to 1 day)
   * - **Requested time models monitor**
     - The time value in seconds requested for ML models to be monitored for entities, a given entity will regularly get ML models monitored if possible based on this value. (defaults to 1 hour)
   * - **Max runtime models training**
     - The time value in seconds requested to limit the max duration of the ML training models, defaults to 15 min (reduced by 30 sec) and should be set according to the cron schedule of the ML training job
   * - **Max time since last training**
     - When executing a rendering operation, TrackMe verifies the last time this model was trained, if this time exceeds the value set here, the model will be retrained automatically before rendering. (defaults to 15 days)
   * - **Disable outliers at discovery**
     - When a new entity is discovered, enable or disable the volume based outliers detection by default for that entity. The feature can still be managed on demand for that entity.
   * - **Outliers default calculation**
     - The default calculation mode used for anomaly outliers detection, can be updated per entity.
   * - **Density lower threshold**
     - The default value of the lower threshold applied to the DensityFunction algorithm, set at discovery and can be updated per entity.
   * - **Density upper threshold**
     - The default value of the upper threshold applied to the DensityFunction algorithm, set at discovery and can be updated per entity.
   * - **Volume lower breached**
     - Alert when the lower bound threshold is breached for volume based KPIs.
   * - **Volume upper breached**
     - Alert when the upper bound threshold is breached for volume based KPIs.
   * - **Latency lower breached**
     - Alert when the lower bound threshold is breached for latency based KPIs.
   * - **Latency upper breached**
     - Alert when the upper bound threshold is breached for latency based KPIs.
   * - **Default period for calculation**
     - The relative period used by default for outliers calculations, applied during entity discovery and can be updated per entity
   * - **Default outliers time factor**
     - The default time factor applied for the outliers dynamic thresholds calculation
   * - **Default latency kpi metric**
     - The default kpi metric for latency outliers detection
   * - **Default volume kpi metric**
     - The default kpi metric for volume outliers detection
   * - **Default auto correct**
     - When defining the model, enable or disable auto_correct by default, which uses the concept of auto correction based on min lower and upper deviation.
   * - **Perc min lower deviation**
     - If an outlier does not deviate (LowerBound) by at least this percentage from the current KPI value, it is considered a false positive.
   * - **Perc min upper deviation**
     - If an outlier does not deviate (UpperBound) by at least this percentage from the current KPI value, it is considered a false positive.
   * - **splk_outliers_mltk_algorithms_list**
     - TrackMe uses the MLTK DensityFunction algorithm, you can add custom algorithms as a comma separated list of values, these will become selectable automatically in the different Outliers configuration screens in TrackMe.
   * - **splk_outliers_mltk_algorithms_default**
     - If you have multiple algorithms, you can define here which algorithm should be used by default when TrackMe defines the ML models rules, which usually happens at the entities discovery, or when adding/resetting ML models.
   * - **splk_outliers_fit_extra_parameters**
     - You can optionally add extra parameters to be added to the MLTK fit command (training phase) at the time of the definition of the ML rules (generally when entities are discovered), for instance: ``exclude_dist="beta"`` to exclude Beta distributions for the density function, see MLTK documentation for more information.
   * - **splk_outliers_apply_extra_parameters**
     - You can optionally add extra parameters to be added to the MLTK apply command (rendering phase) at the time of the definition of the ML rules (generally when entities are discovered), for instance: ``sample="True"``, see MLTK documentation for more information. Default is empty for no extra parameters.
   * - **splk_outliers_boundaries_extraction_macro_default**
     - This defines the name of the boundaries extraction macro which is used when defining ML models rules, usually at the time of the entity discovery or when defining a new model.
   * - **splk_outliers_boundaries_extraction_macros_list**
     - This defines the list of boundaries macros, if you need to define a custom macro to extract boundaries according to a custom algorithm, you can add a comma separated list of macros which will become automatically selectable in TrackMe Outliers management screens.

ML Outliers options
===================

**Options per ML models can be accessed via the TrackMe UI:**

.. image:: img_v2/ml_outliers/access_ml_options1.png
   :alt: access_ml_options1.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/access_ml_options2.png
   :alt: access_ml_options2.png
   :align: center
   :width: 1200px
   :class: with-border

**The following options can be defined per ML Outlier model:**

.. list-table:: ML Outliers models options
   :widths: 30 70
   :header-rows: 1

   * - Option
     - Purpose
   * - **kpi_metric**
     - The Key Performance Indicator associated with the ML model
   * - **kpi_span**
     - The span time value used for the calculations
   * - **method_calculation**
     - The calculation method to be applied (e.g., average, perc95...)
   * - **period_calculation**
     - The period for the calculation for the model training purposes
   * - **time_factor**
     - Defines the time-based granularity for the ML model training
   * - **density_lowerthreshold**
     - The lower bound threshold for the MLTK density function
   * - **density_upperthreshold**
     - The upper bound threshold for the MLTK density function
   * - **auto_correct**
     - Enable or disable the auto-correction features; its goal is to limit false positives using the deviation settings
   * - **perc_min_lowerbound_deviation**
     - The min percentage of deviation between the lower bound and the KPI current value
   * - **perc_min_upperbound_deviation**
     - The min percentage of deviation between the upper bound and the KPI current value
   * - **alert_lower_breached**
     - Alert if the lower threshold is breached
   * - **alert_upper_breached**
     - Alert if the upper threshold is breached
   * - **min_value_for_lowerbound_breached**
     - The min value for the lower bound to be breached, outliers below this will be rejected
   * - **min_value_for_upperbound_breached**
     - The min value for the upper bound to be breached, outliers above this will be rejected
   * - **is_disabled**
     - Enable or disable training and monitoring for this model

Understanding and Troubleshooting ML rendering results
======================================================

**When TrackMe calls the ML rendering, the following command is called:**

.. image:: img_v2/ml_outliers/access_render1.png
   :alt: access_render1.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/access_render2.png
   :alt: access_render2.png
   :align: center
   :width: 1200px
   :class: with-border

**You can switch to statistics mode and add the following SPL to review the ML decisions in detail:**

::

    | fields _time, _raw
    | trackmeprettyjson fields=_raw

.. image:: img_v2/ml_outliers/access_render3.png
   :alt: access_render3.png
   :align: center
   :width: 1200px
   :class: with-border

The JSON object contains various information such as the original MLTK rendering results per time frame, and the TrackMe auto correction decisions.

**While in simulation mode, the same command is called with the simulation definition of the ML model instead:**

.. image:: img_v2/ml_outliers/access_render4.png
   :alt: access_render4.png
   :align: center
   :width: 1200px
   :class: with-border

You can apply the same SPL lines to review the ML decisions:

.. image:: img_v2/ml_outliers/access_render5.png
   :alt: access_render5.png
   :align: center
   :width: 1200px
   :class: with-border

Troubleshooting ML training logs
================================

The first level of the training logic is handled via the command ``trackmesplkoutlierstrackerhelper``, you can access logs via:

::

    index=_internal sourcetype=trackme:custom_commands:trackmesplkoutlierstrackerhelper

*Note: this command orchestrates the ML training activity for the tenant and the component, it then iterates through the entities and trains each of them by calling the trackmesplkoutlierstrain command.*

The per entity ML training logic is handled via the command ``trackmesplkoutlierstrain``, you can access logs via:

::

    index=_internal sourcetype=trackme:custom_commands:trackmesplkoutlierstrain

Troubleshooting ML rendering (monitoring) logs
==============================================

Similarly, the ML monitor phase is orchestrated by the ``trackmesplkoutlierstrackerhelper`` command, you can access logs via:

::

    index=_internal sourcetype=trackme:custom_commands:trackmesplkoutlierstrackerhelper

It will then call the command ``trackmesplkoutliersrender``, you can access logs via:

::

    index=_internal sourcetype=trackme:custom_commands:trackmesplkoutliersrender

REST API endpoints for ML in TrackMe
====================================

**TrackMe exposes different REST API endpoints for the ML Outliers purposes, the same endpoints are used by the UI:**

.. image:: img_v2/ml_outliers/rest_api1.png
   :alt: rest_api1.png
   :align: center
   :width: 1200px
   :class: with-border

.. image:: img_v2/ml_outliers/rest_api2.png
   :alt: rest_api2.png
   :align: center
   :width: 1200px
   :class: with-border

**TrackMe provides usage examples with every endpoint, both in cURL and in SPL:**

.. image:: img_v2/ml_outliers/rest_api3.png
   :alt: rest_api3.png
   :align: center
   :width: 1200px
   :class: with-border

**You can for instance reset ML models, or force training via the REST API:**

*train:*

::

    | trackme mode=post url="/services/trackme/v2/splk_outliers_engine/write/outliers_train_models" body="{'tenant_id':'02-demo-outliers','component':'flx','object':'demo:mlgen:sample101'}"

*reset:*

::

    | trackme mode=post url="/services/trackme/v2/splk_outliers_engine/write/outliers_reset_models" body="{'tenant_id':'02-demo-outliers','component':'flx','object':'demo:mlgen:sample101'}"

Expanding ML models results and definition
==========================================

TrackMe version 2.0.67 introduced a streaming command, ``trackmesplkoutliersexpand``, which can expand the models results or definition. This can be useful to add more context for custom alerting or reporting.

ML models are stored in complex dictionaries, even more so when a given entity has more than a single model. Accessing the models deep information is challenging, even with the Splunk ``spath`` command; ``trackmesplkoutliersexpand`` is designed to handle these use cases properly.

*The following example expands results from our Flex tenant 02-demo-outliers:*

*The point is to access information within the models, which are complex objects to manipulate:*

::

    | inputlookup trackme_flx_outliers_entity_data_tenant_02-demo-outliers

.. image:: img_v2/ml_outliers/expand1.png
   :alt: expand1.png
   :align: center
   :width: 1200px
   :class: with-border

*The following command expands the models for each entity, we then get one row per entity and model, with access to the model details:*

::

    | inputlookup trackme_flx_outliers_entity_data_tenant_02-demo-outliers
    | fields object_category object models_summary
    | trackmesplkoutliersexpand

.. image:: img_v2/ml_outliers/expand2.png
   :alt: expand2.png
   :align: center
   :width: 1200px
   :class: with-border

*For instance, if you wish to extract the time and percentage of decrease when Outliers are detected:*

::

    | inputlookup trackme_flx_outliers_entity_data_tenant_02-demo-outliers where object="demo:mlgen:sample401"
    | fields object_category object models_summary
    | trackmesplkoutliersexpand
    | table object model isOutlierReason
    | rex field=isOutlierReason "time=\"(?