I’m pleased to announce that VMware has acquired Log Insight together with its technology and team from Pattern Insight. The technology provides real-time search and analytics for large-scale unstructured log data and will enable VMware to meet our customers’ growing need for advanced analytics-based approaches to operations management in highly dynamic virtualized and cloud environments. Pattern Insight’s log search and analytics technology will enhance VMware’s portfolio of solutions to help customers more easily detect, diagnose, and prevent IT infrastructure issues, delivering improved quality and decreased cost of service.
Operational logs are like a Twitter feed from infrastructure software. Every infrastructure software component generates logs as streams of log messages, and these logs contain information that is not available to traditional metrics-based monitoring tools. Each log message is a short string of unstructured text from a particular software component. Log messages can have any content, but are often used to describe internal state changes and make note of unexpected conditions. Most messages are of no interest or value to an operator, but other messages identify significant events or can help to recreate how software reached a certain state. Log analytics technology allows an operator to rapidly answer questions about their infrastructure software, track trends, and recreate the steps leading up to a failure. Log analytics can answer questions like “which of my customers encountered error 303 in my software” or “show me the trend in the number of recoverable failures over the last 12 hours.”
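To make those two example questions concrete, here is a toy sketch in Python. It is not Log Insight’s API; the log format, field names (`customer=`), and message text are all made up for illustration:

```python
# Toy illustration of two log-analytics queries over hypothetical log lines.
from collections import Counter
from datetime import datetime

logs = [
    "2012-08-27T10:01:12 host-a WARN recoverable failure: retrying write",
    "2012-08-27T10:42:55 host-b ERROR 303 customer=acme checkout failed",
    "2012-08-27T11:15:03 host-a WARN recoverable failure: retrying write",
    "2012-08-27T11:59:48 host-c ERROR 303 customer=globex checkout failed",
]

# "Which of my customers encountered error 303?"
customers = {
    field.split("=")[1]
    for line in logs if "ERROR 303" in line
    for field in line.split() if field.startswith("customer=")
}

# "Show me the trend in recoverable failures," bucketed by hour.
trend = Counter(
    datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S").hour
    for line in logs if "recoverable failure" in line
)

print(sorted(customers))  # ['acme', 'globex']
print(dict(trend))        # {10: 1, 11: 1}
```

The point is that both queries treat unstructured text as queryable data; real log analytics does the same thing at far larger scale, across many hosts and formats.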
To help understand the importance of log analytics to VMware’s customers, let us consider the prototypical VMware IT environment. VMware products are at the core of many of the largest enterprise IT deployments in the world and the larger and more complex an IT environment, the harder it is to prevent, detect, and diagnose issues within it. VMware technologies like vMotion and High Availability have significantly increased the quality of service in such customer IT environments, but there are still many ways that failures or degradations can occur. For example, disk drives can start to experience “soft” errors that cause no data loss, but can slow down I/O. Failing network cables can cause unpredictable performance degradations. Software misconfigurations can cause all sorts of havoc. The list goes on.
Ask any experienced IT operator after a few beers and they will reveal many things that “went wrong,” even in the most highly available environment. In a more perfect IT world, every possible failure would result in the IT operator receiving a well-worded alert telling them what went wrong and how to fix it. But in our pre-Nirvana, 21st century reality, IT operators frequently encounter situations where their system is failing or underperforming, and there is no useful alert to help them diagnose the problem. That is where operational logs come in.
Sharp IT Ops folks know these operational logs can be a goldmine when trying to understand imperfections in the operation of their environments. They can frequently avoid costly support calls and resolve their issues more quickly by consulting their operational logs. These logs can be used to isolate the source of an issue and provide detailed information about what failed, and when. For example, a failing disk drive results in clearly identifiable operating system kernel error messages. With a failing network cable, you may see repeated renegotiation of link speed in ESXi logs, and configuration errors can often be detected through careful analysis of the log files. Sophisticated log search and analytics technology will help our customers search through and make sense of potentially massive quantities of log data generated by their IT infrastructure.
Yesterday’s simple tools for analyzing logs will not be sufficient in the cloud-scale world. In the old days, when I was an assistant sysadmin for the campus’ single Unix machine (a VAX running BSD Unix), the standard POSIX text-search utility grep was all I needed to find the truth in log files. But my eyes were opened when I became CTO of Mozy.
Mozy had more than a million customers, over 50 Petabytes of storage (on a distributed storage system developed in-house), and many Gb/s of new data streaming in 24×7. We had to deal with more than a Terabyte (TB) of log files being generated daily across thousands of nodes. Grep can take more than an hour to run on a TB of data, even if you’re lucky enough to have all the data in one place. And typically you don’t know exactly what you’re looking for, so you have to experiment with different searches, time ranges, and data sources. As a result, troubleshooting is slow and costly, and issues can easily go unnoticed or unfixed. Mozy spent a great deal of time writing in-house tools to make massive log volumes more usable. VMware would like to relieve our customers of this burden.
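One common way such in-house tools beat repeated grep runs is to pay the cost of scanning the data once, building an index, and then answering many queries against the index. Here is a minimal sketch of that idea; the inverted index is a generic technique, not Mozy’s or Log Insight’s actual implementation, and the log lines are invented:

```python
# Sketch of an inverted index over log lines: scan once, query many times.
from collections import defaultdict

log_lines = [
    "kernel: sd 0:0:0:0 I/O error on sector 1184",
    "vmkernel: link down on vmnic2, renegotiating",
    "kernel: sd 0:0:0:0 I/O error on sector 2051",
]

# Build the index once: token -> set of line numbers containing it.
index = defaultdict(set)
for n, line in enumerate(log_lines):
    for token in line.lower().replace(":", " ").split():
        index[token].add(n)

def search(*terms):
    """Return line numbers containing all of the terms (an AND query)."""
    hits = set.intersection(*(index.get(t.lower(), set()) for t in terms))
    return sorted(hits)

print(search("error", "sector"))   # [0, 2]
print(search("renegotiating"))     # [1]
```

Each query now touches only the index entries for its terms instead of rescanning a terabyte of raw text, which is exactly what makes the experimental, iterative style of troubleshooting described above practical at scale.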
We believe our customers will increasingly face larger and more complex log analysis challenges, so we want to help them stay on top of those challenges by providing them with better tools. Trends like Cloud Computing and Big Data will drive datacenters of increased scale and complexity. Over the next decade, in order to remain competitive, companies will be driven to collect more data about their customers and their own processes, and to perform deeper and deeper analysis. Datacenters will need significantly more storage nodes for that data, in addition to more compute nodes for processing it. And they’ll need more complex analysis tools. These new IT environments will be more challenging to troubleshoot when they fail, both because of the volume of the logs and because of process complexity.
There are powerful synergies between log search and analytics and our existing VMware vCenter Operations Management Suite. Today, vCenter Operations Management Suite features patented analytics that crunch numeric time series data to learn what is normal for your application workload from millions of collected metrics, and then detect anomalous activity within your IT environment. vC Ops does this with an amazingly low false-positive rate. vCenter Operations Management Suite can further help you locate the objects in your datacenter that were the root cause of an anomalous incident (like the machine with a hard drive experiencing soft errors). However, in order to get insight about how that machine or software is failing, our customers often still need to look in the logs. Time series analytics plus log search and analytics make a powerful combination that delivers sophisticated root cause analysis for our customers.
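To illustrate the general idea of metric-based anomaly detection (this is a toy baseline-and-threshold sketch, not vC Ops’ patented analytics, and the sample values are invented):

```python
# Toy metric anomaly detection: learn a "normal" baseline from history,
# then flag samples that fall far outside it.
from statistics import mean, stdev

history = [48, 52, 50, 49, 51, 50, 47, 53, 50, 49]  # hypothetical metric samples
mu, sigma = mean(history), stdev(history)

def is_anomalous(sample, threshold=3.0):
    """Flag a sample more than `threshold` standard deviations from the baseline."""
    return abs(sample - mu) > threshold * sigma

print(is_anomalous(51))   # False: within the learned normal range
print(is_anomalous(120))  # True: an anomalous spike worth investigating
```

In the combined workflow described above, a detector like this points you at *which* object is behaving abnormally, and log search then tells you *why*, e.g. the kernel I/O error messages from a drive with soft errors.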
In closing, I just want to call out how thrilled I am to have the opportunity to work with the folks from Pattern Insight. They are a world-class team, they’ve done a fantastic job developing their log analytics technology, and they are a great addition to the VMware team. Welcome aboard!