News/Trends, Tech/Engineering

Machine Learning to Detect Anomalies from Application Logs

February 13, 2017 Adwait Bhave

Much of the massive amount of data today is generated by automated systems, and harnessing this information to create value is central to modern technology and business strategies. Machine learning has emerged as a valuable method for many applications—image recognition, natural language processing, robotic control, and much more.

By applying machine learning to system-generated debugging logs, we’ve gained key insights and transformed these logs into critically valuable data sources.

Existing Monitoring: Manual Thresholds

Most software products generate logs that are used for root-cause analysis and troubleshooting. Though these logs offer useful insights into real-time performance, mining them for actionable knowledge is challenging. The data they contain lacks structure, and often, they simply don’t contain enough analytical information.

Software as a service (SaaS) solutions present even more difficulties. Cloud operations teams must monitor the logs in real time, in addition to the numerous other data streams under their watch.

Typically, monitoring teams are tasked with:

Identifying and alerting if any part of the service is not operating as expected.
Looking for signals that indicate imminent cascading failures.
Understanding how normal cycles of operations shift during upgrades, patches, and hotfixes.

Traditionally, teams and developers monitoring cloud operations depend on their past experiences with the systems, as well as certain manually defined rules, to detect anomalous behavior. However, rule-based monitoring is convoluted and isn’t scalable. The complexities of modern systems with multiple components in a dynamic business environment make it difficult for a single team to see and understand all of the patterns. In addition, these static, manual rules often fail to catch anomalies, creating false negatives—or even worse, they trigger alerts about anomalies when there are none, creating false positives.

At Druva, we wanted to develop an efficient way to detect anomalies as quickly as possible. Rather than relying on product knowledge and manually set thresholds to detect anomalous behavior, our goal was to leverage technologies that learn the complex patterns, account for seasonality, and perform with better precision than manual systems.

Deeper Look At Anomalies

What exactly are anomalies? Simply put, an anomaly is any deviation from standard behavior. In the world of software, there are three main types of anomalies, which are depicted in the following graphs.

Figure 1 shows a normal data representation.

Figure 2 illustrates point anomalies, which are anomalies in a single value in the data.

Figure 3 shows context anomalies, in which a point's value lies within the range of typical values seen so far but is anomalous given the cyclic nature of the data, such as a peak-level value at a time when the cycle normally reaches a trough.

Figure 4 depicts collective anomalies. Rather than a single point, a collective anomaly is a subsequence of data that doesn't conform to recent or frequent patterns. These anomalies are harder to define and detect.
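To make the three types concrete, here is a minimal sketch that injects each kind of anomaly into a synthetic cyclic signal. The signal shape, the function names, and the positions and values of the injected anomalies are all illustrative assumptions, not the data behind the figures.

```python
import math

def normal_series(n=96, period=24):
    """A clean cyclic signal (hypothetical stand-in for normal data)."""
    return [10 + 5 * math.sin(2 * math.pi * t / period) for t in range(n)]

def with_point_anomaly(series, at=40, value=40.0):
    """Point anomaly: a single value far outside anything seen elsewhere."""
    out = list(series)
    out[at] = value
    return out

def with_context_anomaly(series, at=42, period=24):
    """Context anomaly: a value inside the overall range, but wrong for its
    position in the cycle. We copy the value from half a period earlier,
    which places a peak where a trough belongs."""
    out = list(series)
    out[at] = series[at - period // 2]
    return out

def with_collective_anomaly(series, start=60, length=10, level=10.0):
    """Collective anomaly: a run of individually ordinary values that,
    taken together, break the cyclic pattern (here, a flat line)."""
    out = list(series)
    for t in range(start, start + length):
        out[t] = level
    return out

base = normal_series()
point = with_point_anomaly(base)
context = with_context_anomaly(base)
collective = with_collective_anomaly(base)
```

Note that only the point anomaly would be caught by a simple range check. The other two stay within the minimum and maximum of the normal signal, which is why they need a model that understands the cycle.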

Review of Analytical Anomaly Detection

The most popular method of anomaly detection is statistical analysis, which uses a forecast model to predict the next point in the stream. Statistical analysis assumes that the observed data is generated by a stochastic model and that most data instances are normal, with a few anomalies. Normal instances typically appear in high-probability regions of the stochastic model, and anomalies in low-probability regions.
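As a rough illustration of the statistical approach, the sketch below uses the mean of a trailing window as the forecast and flags any point that lands more than k standard deviations away from it. The rolling-mean forecast, the function name, the window size, and the value of k are assumptions chosen for brevity; real forecast models are usually richer.

```python
from statistics import mean, stdev

def zscore_anomalies(series, window=12, k=3.0):
    """Return indices whose value is more than k standard deviations
    from the mean of the preceding `window` values."""
    flagged = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        mu = mean(history)        # the "forecast" for point t
        sigma = stdev(history)    # spread of recent values
        if sigma == 0:
            # Constant history: treat any change as a deviation.
            if series[t] != mu:
                flagged.append(t)
            continue
        if abs(series[t] - mu) > k * sigma:
            flagged.append(t)
    return flagged

data = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 9, 10, 50, 10, 11]
print(zscore_anomalies(data))  # prints [13]
```

The spike at index 13 falls in a low-probability region of the recent distribution, so it is flagged. The points right after it are not, because the spike itself inflates the standard deviation of their history.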

Another type of modeling is distance-based clustering. The idea here is intuitive. Anomalies are the points located farthest from the densest regions of the distribution.
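A minimal sketch of the distance-based idea follows. For simplicity it scores each point by its mean distance to its k nearest neighbors rather than running a full clustering algorithm, so treat it as a simplified stand-in; the function name and the sample points are assumptions.

```python
import math

def knn_scores(points, k=3):
    """Score each point by its mean distance to its k nearest neighbors.
    Points far from dense regions receive high scores."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = knn_scores(points)
outlier = scores.index(max(scores))  # index of the most isolated point
```

Here the point at (10, 10) sits far from the dense group near the origin, so it receives the highest score.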

Given that both of these types of models are well established, we wanted to apply cutting-edge techniques to solve the problem by creating the model with no explicit feature creation or assumption about data distributions.

Machine-Learning Networks

There are two main types of networks—simple feed-forward and recurrent neural—we can use to detect anomalies. Let’s explore the results on each type of network.

Simple Feed-Forward Networks

Simple feed-forward networks learn from each example independently, with no connection between any two examples. In a training sequence, the second example fed to the network carries no context from the first. What the network learns is the weights Wi and Wo. (Note that the Hidden Units box in this diagram is merely representative; hidden units can have multiple layers of neurons.) The arrows show the forward pass of the computation. The backward pass updates the weights Wi and Wo based on how close the calculated output is to the actual output.
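The forward and backward passes can be sketched in a few lines. This is a toy network with one hidden layer of tanh units and a single linear output, trained on squared error. The layer sizes, activation, learning rate, and sample values are assumptions for illustration only; Wi and Wo correspond to the input-to-hidden and hidden-to-output weights.

```python
import math
import random

def forward(x, Wi, Wo):
    """Forward pass: inputs -> hidden units (tanh) -> one linear output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in Wi]
    output = sum(w * h for w, h in zip(Wo, hidden))
    return hidden, output

def backward(x, target, Wi, Wo, lr=0.05):
    """One gradient step on 0.5 * (output - target)^2.
    Updates Wi and Wo in place and returns the loss before the update."""
    hidden, output = forward(x, Wi, Wo)
    err = output - target
    for j, h in enumerate(hidden):
        # Gradient reaching hidden unit j, using Wo[j] before it changes.
        grad_hidden = err * Wo[j] * (1 - h * h)
        Wo[j] -= lr * err * h
        for i, xi in enumerate(x):
            Wi[j][i] -= lr * grad_hidden * xi
    return 0.5 * err * err

rng = random.Random(0)
Wi = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(4)]
Wo = [rng.uniform(-0.5, 0.5) for _ in range(4)]
x, target = [0.5, -1.0], 0.8
losses = [backward(x, target, Wi, Wo) for _ in range(200)]
```

Nothing is carried over from one call to the next except the weights themselves, which is why the order of the examples gives the network no context.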

Recurrent Neural Networks

A recurrent neural network (RNN) is a special kind of neural net that has produced great results when modeling sequence data. The fundamental feature of RNNs is that the network contains at least one feedback connection, so the activations flow in a loop. RNNs also maintain a hidden state, which can learn and remember part of a previously fed training example—making them well suited to sequence prediction problems in which context (the data seen thus far) is important.
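The feedback connection can be illustrated with a scalar recurrent unit, where the hidden state from the previous step is fed back in alongside the current input. The weights and the tanh activation below are illustrative assumptions.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b=0.0):
    """New hidden state from the current input and the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, h0=0.0):
    """Feed a sequence through the unit, returning the state at each step."""
    h = h0
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h)  # h loops back into the next step
        states.append(h)
    return states

# Both sequences end with the same input (0.0), but different histories
# leave the unit in different final states.
after_highs = run_rnn([1.0, 1.0, 0.0])
after_lows = run_rnn([-1.0, -1.0, 0.0])
```

The final state is positive after the run of high inputs and negative after the run of low ones, even though the last input is identical. That retained context is what a feed-forward network lacks.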

We decided to use the long short-term memory (LSTM) variant of RNN, because it’s easier to train, and it remembers more time steps than an ordinary RNN. LSTM also has input and output gates that control updates to the network’s internal memory.
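For intuition, here is a scalar sketch of a single LSTM step. It follows the commonly used formulation, which also includes a forget gate in addition to the input and output gates mentioned above. The parameter names and values are hypothetical, and a real layer would use vectors and matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with scalar state. `p` maps parameter names to floats:
    w* weights the input, u* weights the previous hidden state, b* is a bias."""
    i = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x_t + p["ug"] * h_prev + p["bg"])  # candidate
    c = f * c_prev + i * g   # internal memory: keep old, admit new
    h = o * math.tanh(c)     # exposed hidden state
    return h, c

# Gates set to "remember everything, admit nothing": the memory survives.
params = {name: 0.0 for name in
          ["wi", "ui", "wf", "uf", "wo", "uo", "wg", "ug", "bo", "bg"]}
params["bi"] = -20.0  # input gate almost closed
params["bf"] = 20.0   # forget gate almost fully open
h, c = lstm_step(1.0, 0.0, 0.7, params)
```

With the gates configured this way, the cell value stays very close to 0.7 regardless of the input, which shows how gating lets the memory persist across many time steps.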

The diagram below contains two inputs. One (t) is a current learning example, and the other (t-1) is the learning example which was fed just before the current one. The sequence can be longer than one, allowing the network to remember input from previous steps to provide context for the current example. This process largely mirrors that of feed-forward networks.

We parse the log file and convert it into counts of variables in fixed time windows for the RNN to model. This creates a distribution for all statistically significant log messages. Over time, as the service cycles, these counts form patterns. Each time window can be configured so that the count is large enough to be modeled. Example window sizes range from five to 30 minutes, depending on the update frequency of the logs.
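A minimal sketch of the windowing step is shown below. The log line format (a timestamp, then " | ", then the message), the keyword matching, and the function name are all assumptions made for illustration; real logs would need their own parser. The window size is assumed to divide 60 evenly.

```python
from collections import Counter
from datetime import datetime, timedelta

def window_counts(lines, keyword, window_minutes=5):
    """Count lines containing `keyword` per fixed time window.
    Returns counts in time order, with empty windows filled in as 0."""
    counts = Counter()
    for line in lines:
        stamp, _, message = line.partition(" | ")
        if keyword not in message:
            continue
        ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        # Round the timestamp down to the start of its window.
        bucket = ts.replace(
            minute=ts.minute - ts.minute % window_minutes, second=0
        )
        counts[bucket] += 1
    if not counts:
        return []
    step = timedelta(minutes=window_minutes)
    t, end = min(counts), max(counts)
    series = []
    while t <= end:
        series.append(counts[t])  # a Counter returns 0 for missing windows
        t += step
    return series

lines = [
    "2017-02-13 10:01:05 | backup started",
    "2017-02-13 10:03:40 | backup started",
    "2017-02-13 10:04:00 | upload failed",
    "2017-02-13 10:16:10 | backup started",
]
print(window_counts(lines, "backup started"))  # prints [2, 0, 0, 1]
```

Filling empty windows with zeros matters, because a sudden silence in a normally busy message can itself be an anomaly.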

Once you model the variable, it looks like a sequence of counts over time.

Each training example can be created from a specific time period in history. The neural network is trained to predict the next value in the sequence. The general idea is to predict the message count in the next window and compare it to the actual count. If the deviation exceeds the threshold percentage, it's an anomaly.
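The two steps described here, building training examples from history and checking the deviation, might look like the following sketch. The history length, the 50% threshold, and the choice to measure deviation relative to the predicted count are assumptions; the function names are hypothetical.

```python
def make_examples(counts, history=4):
    """Slide a window over the counts: each example pairs `history`
    consecutive counts with the count that follows them."""
    return [
        (counts[i:i + history], counts[i + history])
        for i in range(len(counts) - history)
    ]

def is_anomaly(predicted, actual, threshold_pct=50.0):
    """True when the actual count deviates from the prediction by more
    than threshold_pct percent of the prediction."""
    baseline = max(abs(predicted), 1.0)  # guard against division by zero
    deviation_pct = abs(actual - predicted) / baseline * 100.0
    return deviation_pct > threshold_pct

examples = make_examples([1, 2, 3, 4, 5, 6], history=4)
# examples == [([1, 2, 3, 4], 5), ([2, 3, 4, 5], 6)]
```

For instance, with a predicted count of 100, an actual count of 120 is a 20% deviation and passes, while an actual count of 180 is an 80% deviation and is flagged.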

Once you have the data in the correct form, it’s time to figure out the layers of the network, tune the parameters, and see what works best. We used a deep network with more than 36 LSTM cells to model the problem. We also experimented with training the network in batch and online modes.

Here is the data on which the network is trained:

And here are the sample results of prediction:

Predictions are plotted in blue, and actual results are in gray. Anomalies are highlighted where predictions differ from actual results by a large margin.

Summary

System-generated logs contain information showing the cyclic nature of the system, which can then be modeled as a continuous time series. The advances of recurrent neural networks have made it possible to learn data patterns in noisy time series. RNNs can be used to detect anomalous behaviors, and by adding machine learning intelligence, teams can control how to monitor their data. Since anomaly limits aren’t hard-coded, this opens up a number of exciting possibilities for SaaS and cloud operations teams to detect anomalies that can identify threats like ransomware.
