Demystifying AI in Zabbix: Can AI Correlate Events?

Good morning, everyone! Dimitri Bellini here, back with you on Quadrata, my YouTube channel dedicated to the open-source world and the IT topics I’m passionate about. This week, I wanted to tackle a question that I, and many members of the Zabbix community, get asked all the time: Why doesn’t Zabbix have more built-in AI?

It seems like every monitoring product out there is touting its AI capabilities, promising to solve all your problems with a touch of magic. But is it all hype? My colleagues and I have been digging deep into this, exploring whether an AI engine can truly correlate events within Zabbix and make our lives easier. This blog post, based on my recent video, will walk you through our thought process.

The AI Conundrum: Monitoring Tools and Artificial Intelligence

Let’s be honest: integrating AI into a monitoring tool isn’t a walk in the park. It requires time, patience, and a willingness to experiment with different technologies. More importantly, it demands a good dose of introspection to understand how all the pieces of your monitoring setup fit together. But why even bother?

Anyone who’s managed a complex IT environment knows the struggle. You can be bombarded with hundreds, even thousands, of alerts every single day. Identifying the root cause and prioritizing issues becomes a monumental task, even for seasoned experts. Severity levels help, but they often fall short.

Understanding the Challenges

Zabbix gives us a wealth of metrics – CPU usage, memory consumption, disk space, and more. We typically use these to create triggers and set alarm thresholds. However, these metrics, on their own, often don’t provide enough context when a problem arises. Here are some key challenges we face:

  • Limited Metadata: Event information and metadata, like host details, aren’t always comprehensive enough. We often need to manually enrich this data.
  • Lack of Visibility: Monitoring teams often lack a complete picture of what’s happening across the entire organization. They might not know the specific applications running on a host or the impact of a host failure on the broader ecosystem.
  • Siloed Information: In larger enterprises, different departments (e.g., operating systems, databases, networks) might operate in silos, hindering the ability to connect the dots.
  • Zabbix Context: While Zabbix excels at collecting metrics and generating events, it doesn’t automatically discover application dependencies. Creating custom solutions to address this is possible but can be complex.

Our Goals: Event Correlation and Noise Reduction

Our primary goal is to improve event correlation using AI. We want to:

  • Link related events together.
  • Reduce background noise by filtering out less important alerts.
  • Identify the true root cause of problems, even when buried beneath a mountain of alerts.

Possible AI Solutions for Zabbix

So, what tools can we leverage? Here are some solutions we considered:

  • Time Correlation: Analyzing the sequence of events within a specific timeframe to identify relationships.
  • Host and Host Group Proximity: Identifying correlations based on the physical or logical proximity of hosts and host groups.
  • Semantic Similarities: Analyzing the names of triggers, tags, and hosts to find connections based on their meaning.
  • Severity and Tag Patterns: Identifying correlations based on event severity and patterns in tags.
  • Metric Pattern Analysis: Analyzing how metrics evolve over time to identify patterns associated with specific problems.
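
To make the first three ideas concrete, here is a minimal sketch of a pairwise score that mixes time proximity, shared host groups and word overlap in problem names. The `Event` fields, the 300-second window and the weights are assumptions picked for illustration, not values from our tests, and metric pattern analysis is left out entirely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    clock: int          # event time, Unix seconds
    host: str
    groups: frozenset   # host group names
    name: str           # problem name

def jaccard(a: set, b: set) -> float:
    """Overlap of two sets: |a & b| / |a | b|, or 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def correlation_score(e1: Event, e2: Event, window: int = 300) -> float:
    """Combine three hints into a 0..1 score. The weights are arbitrary."""
    dt = abs(e1.clock - e2.clock)
    time_score = max(0.0, 1.0 - dt / window)          # 1.0 when simultaneous
    group_score = jaccard(set(e1.groups), set(e2.groups))
    name_score = jaccard(set(e1.name.lower().split()),
                         set(e2.name.lower().split()))
    return 0.5 * time_score + 0.3 * group_score + 0.2 * name_score

# Made-up events for illustration.
router = Event(1000, "router-1", frozenset({"Network"}), "Interface eth0 down")
host_b = Event(1030, "host-b", frozenset({"Network", "Linux"}), "Host unreachable")
disk = Event(5000, "db-1", frozenset({"Databases"}), "Disk space low")

print(correlation_score(router, host_b))  # close in time, one shared group
print(correlation_score(router, disk))    # far apart, nothing in common
```

A score like this only ranks candidate pairs. Deciding where to cut, and which event in a pair is the cause, still needs extra context.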

Leveraging scikit-learn

One promising solution we explored involves using scikit-learn, an open-source machine learning library. Our proposed pipeline looks like this:

  1. Event Processing: Collect events from our Zabbix server using streaming capabilities.
  2. Encoding Events: Use machine learning techniques to vectorize and transform events into a usable format.
  3. Cluster Creation: Apply algorithms like DBSCAN to create clusters of related events (e.g., network problems, operating system problems).
  4. Merging Clusters: Merge clusters based on identified correlations.
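
Steps 2 and 3 can be sketched in a few lines of scikit-learn. This is a simplified illustration and not our actual pipeline: the event texts are made up, TF-IDF on the problem name is just one possible encoding, and the `eps` and `min_samples` values are untuned guesses. Streaming ingestion (step 1) and cluster merging (step 4) are not shown.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up problem names standing in for events collected from Zabbix.
events = [
    "Interface eth0 down on router-1 network",
    "Interface eth1 down on router-1 network",
    "Filesystem /var full on db-1 storage",
    "Filesystem /data full on db-1 storage",
    "CPU load high on app-7",
]

# Step 2: encode each event as a TF-IDF vector of its words.
vectors = TfidfVectorizer().fit_transform(events).toarray()

# Step 3: group events whose cosine distance is at most eps.
# eps and min_samples are illustrative values, not tuned ones.
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(vectors)

for label, text in zip(labels, events):
    print(label, text)  # label -1 means "noise": the event joined no cluster
```

With this toy data the two interface events and the two filesystem events should each form a cluster, while the CPU event is left as noise. Real event names are far messier, so the encoding usually needs host, tag and time features as well.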

A Simple Example

Imagine a scenario where a router interface goes down and host B becomes unreachable. It’s highly likely that the router issue is the root cause, and host B’s unreachability is a consequence.
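
If we already knew the topology, this case would not need any machine learning. The sketch below assumes a hypothetical `UPSTREAM` map saying which device each host sits behind; Zabbix does not provide such a map out of the box, so treat it as hand-maintained data.

```python
# Hypothetical topology: each host maps to the device it depends on.
UPSTREAM = {"host-b": "router-1"}

def split_root_and_symptoms(events, upstream):
    """events: list of (host, problem). Returns (roots, symptoms)."""
    failing = {host for host, _ in events}
    roots, symptoms = [], []
    for host, problem in events:
        # Walk up the dependency chain; a failing ancestor makes this a symptom.
        parent, seen, is_symptom = upstream.get(host), set(), False
        while parent is not None and parent not in seen:
            if parent in failing:
                is_symptom = True
                break
            seen.add(parent)  # guards against loops in the map
            parent = upstream.get(parent)
        (symptoms if is_symptom else roots).append((host, problem))
    return roots, symptoms

events = [("router-1", "Interface down"), ("host-b", "Host unreachable")]
roots, symptoms = split_root_and_symptoms(events, UPSTREAM)
print("root cause:", roots)
print("symptoms:", symptoms)
```

The hard part in practice is not this function but keeping the dependency map complete and up to date.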

Implementation Steps

To implement this solution, we suggest a phased approach:

  1. Temporal Regrouping: Start by grouping events based on their timing.
  2. Host and Group Context: Add context by incorporating host and host group information.
  3. Semantic Analysis: Include semantic analysis of problem names to identify connections.
  4. Tagging: Enrich events with tags to define roles and provide additional information.
  5. Iterative Feedback: Gather feedback from users to fine-tune the system and improve its accuracy.
  6. Scaling Considerations: Optimize data ingestion and temporal window size based on Zabbix load.
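
The first step, temporal regrouping, is the easiest to prototype. A minimal sketch, assuming events arrive as `(clock, name)` pairs and that a gap of more than `max_gap` seconds starts a new group; the 120-second default is a placeholder for the window size that step 6 says to tune against Zabbix load.

```python
def group_by_time(events, max_gap=120):
    """Group events so that consecutive ones at most max_gap seconds apart share a group."""
    groups = []
    for ev in sorted(events, key=lambda e: e[0]):
        if groups and ev[0] - groups[-1][-1][0] <= max_gap:
            groups[-1].append(ev)   # close enough to the previous event
        else:
            groups.append([ev])     # gap too large: start a new group
    return groups

# Made-up (clock, name) pairs.
events = [(100, "a"), (150, "b"), (400, "c"), (460, "d"), (1000, "e")]
groups = group_by_time(events)
for g in groups:
    print([name for _, name in g])
```

The later steps then work inside each group, adding host, group, name and tag information to decide which events really belong together.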

Improvements Using Existing Zabbix Features

We can also leverage existing Zabbix features:

  • Trigger Dependencies: Utilize trigger dependencies to define static relationships.
  • Low-Level Discovery: Use low-level discovery to gather detailed information about network interfaces and connected devices.
  • Enriched Tagging: Encourage users to add more informative tags to events.
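
Trigger dependencies can also be set through the Zabbix API rather than by hand. The sketch below only builds the JSON-RPC request body. The method name `trigger.adddependencies` and its two parameters are as I recall them from the Zabbix API documentation, so check them against your version. The trigger IDs are placeholders, and authentication and the HTTP call itself are deliberately left out.

```python
import json

def dependency_request(trigger_id: str, depends_on_id: str, request_id: int = 1) -> dict:
    """Build a JSON-RPC body asking Zabbix to make one trigger depend on another."""
    return {
        "jsonrpc": "2.0",
        "method": "trigger.adddependencies",  # verify against your Zabbix version
        "params": {"triggerid": trigger_id, "dependsOnTriggerid": depends_on_id},
        "id": request_id,
    }

# Placeholder IDs: "host B unreachable" (1001) depends on "router interface down" (1002).
body = dependency_request("1001", "1002")
print(json.dumps(body, indent=2))
```

With such a dependency in place, the dependent trigger is not raised as a separate problem while the trigger it depends on is in a problem state, which removes the symptom at the source.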

The Reality Check: It’s Not So Simple

While the theory sounds great, real-world testing revealed significant challenges. The timing of events in Zabbix can be inconsistent due to update intervals and threshold configurations. This can create temporary discrepancies and make accurate correlation difficult.
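
A small calculation shows how update intervals alone can reorder events. This is a simplified model that assumes each item is polled at fixed times and that the trigger fires on the first bad sample; real triggers that need several samples add even more delay. The intervals are made-up values.

```python
import math

def first_detection(failure_time, interval, offset=0):
    """Time of the first poll at or after failure_time, for polls at offset + k * interval."""
    k = math.ceil((failure_time - offset) / interval)
    return offset + max(k, 0) * interval

failure = 130  # the router interface fails at t = 130 s

cause_seen = first_detection(failure, interval=300)   # interface item, polled every 5 min
symptom_seen = first_detection(failure, interval=60)  # ping to host B, polled every minute

print("symptom seen at", symptom_seen)  # 180
print("cause seen at", cause_seen)      # 300
```

Here the symptom shows up two minutes before its cause, so an algorithm that assumes "first event is the root cause" would pick the wrong one.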

Consider this scenario:

  • File system full
  • CRM down
  • DB instance down
  • Unreachable host

A human might intuitively understand that a full file system could cause a database instance to fail, which in turn could bring down a CRM application. However, a machine learning algorithm might struggle to make these connections without additional context.

Exploring Large Language Models (LLMs)

To address these limitations, we explored using Large Language Models (LLMs). LLMs have the potential to understand event descriptions and make connections based on their inherent knowledge. For example, an LLM might know that a CRM system typically relies on a database, which in turn requires a file system.
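
As a rough idea of what we would send to a model, here is a sketch that turns a list of problems into a prompt. The wording of the prompt and the requested answer format are assumptions for illustration, and the call to an actual LLM, cloud or local, is left out.

```python
def build_prompt(events):
    """Turn a list of problem names into a root-cause question for an LLM."""
    lines = [f"{i}. {name}" for i, name in enumerate(events, start=1)]
    return (
        "You are assisting a monitoring team.\n"
        "These problems were raised within a few minutes of each other:\n"
        + "\n".join(lines)
        + "\nWhich one is the most likely root cause, and which are symptoms? "
        "Answer with the number of the root cause and one sentence of reasoning."
    )

events = ["File system full", "CRM down", "DB instance down", "Unreachable host"]
prompt = build_prompt(events)
print(prompt)
```

Adding host names, tags and timestamps to each line gives the model more to work with, at the price of longer and more expensive prompts.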

However, even with LLMs, challenges remain. Identifying the root cause versus the symptoms can be tricky, and LLMs might not always accurately correlate events. Additionally, using high-end LLMs in the cloud can be expensive, while local models might not provide sufficient accuracy.

Conclusion: The Complex Reality of AI in Monitoring

Integrating AI into Zabbix for event correlation is a complex challenge. A one-size-fits-all solution is unlikely to be effective, so any approach has to be tailored to each client’s environment. While LLMs offer promise, the cost and complexity of using them effectively remain significant concerns.

We’re continuing to explore this topic and welcome your thoughts and ideas!

Let’s Discuss!

What are your thoughts on using AI in monitoring? Have you had any success with similar approaches? Share your insights in the comments below or join the conversation on the ZabbixItalia Telegram Channel! Let’s collaborate and find new directions for our reasoning.

Thanks for watching! See you next week!

Bye from Dimitri!

Watch the original video: Quadrata YouTube Channel
