
Designing a Metrics & Functions System for Monitoring, Machine Learning & AIOPS

Ideas & Trade-offs for a Unified Operations Data Platform


Our Siglos development team was in the market for a new metrics and function specification for our core monitoring, alerting, graphing, machine learning, rule engine & AI system. Here is the why and how of what we came up with.

We looked at a LOT of different systems and have tried to merge all the best practices and our own biases into a unified system we hope you’ll find interesting. Or at least you’ll point out all the things we did wrong.

Taking a step back, we have a unified IT operations platform, Siglos, where we combine lots of different monitoring, configuration, log, and other data. On top of that we do various alerting things, plus graphs, then mix in machine learning, expert systems, and recently a bit of AI-oriented stuff, too.

We have a common data interface so all the UI, Dashboards, API, Graphs, Governance Policy Engines, Expert Systems, Tech Centers, Cloud Consoles, and so on use a common data system. This lets us completely abstract all data consumers from the diverse set of data producers and integrations we have.

Ideally we can represent all the data and all the mathematical functions in a simple unified manner, one that’s easy to understand, use, and expand. All the caller does is supply a set of keys, scopes & time-ranges plus the function, and get back advanced and diverse data processing on demand.

Our goal is to push as much definition and functionality down the stack as we can, simplifying the upper layers and data consumers so they can be very generic, even for very advanced calculations, visualizations and transformations.

Our solution has evolved a few times, starting with direct source system access, which was very messy. It then evolved into a simple data definition system that includes things like the data source (Zabbix, Datadog, CloudWatch, logs, etc.), the metric’s key/ID, units, labels, display format, common name, etc.

That was very helpful in simplifying our data consumers (alerts, graphs, dashboards) and avoided a lot of duplication we had across the stack. But it lacked time-range control, scoping filters, and any processing power at all.

So we looked very carefully at what everyone else in this space is doing, across a wide range of monitoring and metric tools, providers, and services. We found a rather vast array of methods, structures, and underlying ideas.

Some, like Datadog, are somewhat close to what we were looking for, but very hard to generalize or expand, especially to add more advanced functionality. Also, their inner and outer functions plus separated keywords struck us as too messy and inconsistent for complex use.

Several other solutions were also very inconsistent in their naming, formatting, and functional specifications. This is partly because this is a very hard problem overall, and partly because several of these systems have grown organically over the years.

Speaking of problems, what exactly was the problem we were trying to solve?

Overall, our problem was straightforward, as follows:

Our most common source data is a single key for a single host in a monitoring system, such as CPU% for a web server over a time range. This is a simple time series which comes to us as a horizontal vector.

If we have multiple servers, we’ll have multiple sets of values, which form semi-synchronized time series in the shape of a matrix. This matrix is defined by a key (cpu), a scope (the servers), and a time range (though the times are not exact, as not all data is gathered at the same instant).

This is usually raw time series data, but it can also be bucketed into downsampled or processed aggregations (such as hourly/daily), which would have multiple values per time point (e.g. min, max, avg), per server.

The result is a 3D or nested matrix, with one row per host, each row containing that host’s CPU time series (itself a matrix of values and times). This gets a little complicated.
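To make these shapes concrete, here is a minimal sketch in Python. The structures and field names are purely illustrative, not our actual internal format.

    # Single host, single key: a horizontal vector of (timestamp, value) samples.
    cpu_web1 = [(1700000000, 12.5), (1700000060, 14.1), (1700000120, 11.8)]

    # Multiple hosts for the same key: a semi-synchronized matrix,
    # one row per host, keyed by scope.
    cpu_matrix = {
        "web1": [(1700000000, 12.5), (1700000060, 14.1)],
        "web2": [(1700000002, 33.0), (1700000061, 35.2)],  # timestamps not exact
    }

    # Downsampled/bucketed data: multiple values per time point, per host.
    # This is the nested / 3D case.
    cpu_hourly = {
        "web1": [
            {"ts": 1700000000, "min": 8.2, "max": 21.0, "avg": 12.9},
            {"ts": 1700003600, "min": 9.1, "max": 19.4, "avg": 13.3},
        ],
    }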

We then want to perform mathematical operations on this matrix, which is where the fun begins.

First, we want to be able to perform multiple functions on this data, something Datadog is very good at. We want to do this by chaining functions together, JavaScript-style, such as abs().avg().max() and so on.

That means we must always be thinking about the shape of our data as inputs and outputs to/from each function, since this is a true chain. It flows from the data source, through scoping and ranging, and in/out of the first, second, etc. functions to a final result we pass back to our caller.

Callers themselves also need to be shape-aware as various consumers and visualizations depend on various data shapes (or need to do their own post-processing to get there). For example, a single number display on a dashboard needs a single scalar from the data function system, or something like a vertical vector it can min/max/avg() to get to a single number. Graphs and other consumers have different expectations.

Data shapes get interesting & really make you think.

Our simplest function takes a matrix and returns a scalar, such as avg(). The avg() function simply averages all the data in all the rows and columns of the data and returns a single number, the overall average. This can be used in a dashboard value or as input to alerting, etc.
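As a rough sketch (reusing the illustrative cpu_matrix structure from above, not our real API), this flavor of avg() might look like:

    # Hypothetical avg(): collapse an entire matrix (rows = hosts,
    # columns = samples) into one scalar.
    def avg(matrix):
        values = [v for row in matrix.values() for _, v in row]
        return sum(values) / len(values) if values else None

    overall_cpu = avg(cpu_matrix)  # feed a dashboard number, alert threshold, etc.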

But what if I want the average for each server, not a single number? This is a different type of average, as the output is a vertical vector (array) of values, one CPU average per server over the time-range.

Some people also want to group these averages, so the result is not a scalar, but also not a single value per server. Instead, they want an average for web servers, one for app servers, and one for DB servers. So this might be taking 20 servers and collapsing the results into three numbers. This is a key-based downsample of a large matrix into a smaller one.

Essentially, we can combine these two needs into an average_by() function that takes a grouping argument: ALL or * if we want an average per server, or a grouping key (such as a role tag) if we want to group. The output can have different shapes.
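A hedged sketch of such an average_by(), again using the illustrative matrix shape from above; the group_of mapping stands in for whatever tag or role lookup the real system would use:

    # Hypothetical average_by(): collapse the time axis, optionally grouping
    # hosts. group_of maps host -> group key; with group_of=None we return
    # one average per host (the ALL / * case).
    def average_by(matrix, group_of=None):
        sums, counts = {}, {}
        for host, row in matrix.items():
            key = group_of(host) if group_of else host
            for _, v in row:
                sums[key] = sums.get(key, 0.0) + v
                counts[key] = counts.get(key, 0) + 1
        return {k: sums[k] / counts[k] for k in sums}

    per_host = average_by(cpu_matrix)                  # vertical vector: one avg per host
    per_role = average_by(cpu_matrix, lambda h: h[:3]) # grouped, e.g. "web", "app", "db"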

Now, what if I want a moving average of various types, such as a simple moving average, weighted averages, or even more complex things like ARIMA? These are different from the above averages, as they output a matrix with the same number of time series and data points as the input. Of course, we can then chain them through more functions if we want.
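For example, a simple trailing moving average might look like the following sketch (illustrative only); note the output has the same rows and the same number of points as the input:

    # Hypothetical moving_avg(): same shape in, same shape out. Each row keeps
    # its timestamps; each value becomes a trailing-window average.
    def moving_avg(matrix, window=5):
        out = {}
        for host, row in matrix.items():
            values = [v for _, v in row]
            averaged = []
            for i, (ts, _) in enumerate(row):
                start = max(0, i - window + 1)
                averaged.append((ts, sum(values[start:i + 1]) / (i + 1 - start)))
            out[host] = averaged
        return out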

How about a bucketing average? This is essentially a down-sampling process that takes high-resolution metrics, such as every minute, and aggregates them into buckets of, say, one hour, with an average value per bucket. The result is still a matrix, but with fewer data points than the input. Note this type of bucketing is also done automatically by some consumers, such as graphing systems, to reduce resolution.
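A minimal bucketing sketch, under the same illustrative shapes; the bucket boundary logic is deliberately naive:

    # Hypothetical bucket_avg(): down-sample each row into fixed-size time
    # buckets (e.g. 3600 seconds), averaging the values in each bucket.
    def bucket_avg(matrix, bucket_seconds=3600):
        out = {}
        for host, row in matrix.items():
            buckets = {}
            for ts, v in row:
                buckets.setdefault(ts - ts % bucket_seconds, []).append(v)
            out[host] = [(b, sum(vs) / len(vs)) for b, vs in sorted(buckets.items())]
        return out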

That’s just for averages as the starting point. Perhaps you can see how this gets a little complex, though the above principles cover most of the various options for various functions such as standard deviations, normality tests, and more (discussed below).

So, what did we do and how did we do it?

In the end, we settled on chained functions that take and return a matrix of various shapes: matrix, horizontal vector, vertical vector, or scalar. The output matrices and vectors may be the same size as or smaller than the input matrices. Nested or 3D matrices are still being experimented with.

The simplest function spec is:

metric : scope(s) . range(r) . func(f) . func(f) …

Where:

  • metric — The common data metric that is the data source. Usually a time series (TS), but it can be a scalar such as status, counts, configs, etc.
  • scope — One or more key:value pairs, such as host, tag, or All/*
  • range — The time series time range, such as (-1d, now, real dates)
  • func — The function to apply such as avg() with various arguments

The functions are chained, and must check their input shapes to ensure they don’t get a matrix when they expect a vector, or vice-versa. Some vectors are vertical (cross-host/scope), while others are horizontal (usually time series).

Functions are pretty broadly defined, from simple avg() math to actual machine learning, anomaly detection, and AI models. Anything that can handle matrix/vector/scalar inputs & outputs.
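As a rough sketch of how such a chain could be evaluated (fetch_metric and the shape labels are illustrative, not the Siglos API): fetch the metric for a scope and range, then fold the data through each function, checking shapes along the way.

    # Hypothetical evaluator for metric : scope(s) . range(r) . func(f) ...
    def shape_of(data):
        if isinstance(data, (int, float)):
            return "scalar"
        if isinstance(data, dict):
            return "matrix"   # rows per host/scope
        return "vector"       # a single series or a cross-host column

    def evaluate(metric, scope, time_range, funcs, fetch_metric):
        data = fetch_metric(metric, scope, time_range)  # raw matrix from the source
        for fn, expected_shape in funcs:                # e.g. [(bucket_avg, "matrix"), (avg, "matrix")]
            got = shape_of(data)
            if got != expected_shape:
                raise ValueError(f"{fn.__name__} expects {expected_shape}, got {got}")
            data = fn(data)
        return data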

Note the fundamental metric, the first element of the above, is itself defined at a lower level and can itself be a function or have other custom processing. It may be a count of recent alerts, extracted log metrics, cloud disk sizes, nginx config values, etc., but these are all invisible to the above system.

However, this does provide a second level of processing that is closer to, and customized for, the actual data source. For some advanced data sources like ElasticSearch, we have a look-through system where the scopes, ranges, and some functions can be pushed down to the source since it can do a bunch of work for us, especially on millions of items we can’t handle in the upper stack.
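For illustration only, a pushed-down scope, range, and avg() against Elasticsearch might translate into a query roughly like this (field names such as host, @timestamp, and system.cpu.pct are assumptions, not our actual mappings):

    # Illustrative Elasticsearch request body, expressed as a Python dict:
    # the scope becomes a term filter, the range a date range, and the avg()
    # is pushed into an aggregation so Elasticsearch does the heavy lifting.
    es_query = {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"host": "web1"}},
                    {"range": {"@timestamp": {"gte": "now-1d", "lte": "now"}}},
                ]
            }
        },
        "aggs": {"cpu_avg": {"avg": {"field": "system.cpu.pct"}}},
    }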

All Single Metric so far …

Note this is all single-metric, e.g. CPU. To do cross-metric work, we use a function that itself takes a metric spec, such as:

cpu:scope(host:web1).range(-1d).correlation( ram:scope().range() )
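Under the hood, such a correlation() might do something like the following sketch (assuming the two series can be aligned on shared timestamps, which glosses over the semi-synchronized reality):

    import math

    # Hypothetical correlation(): align two single-host series on shared
    # timestamps and return Pearson's r as a scalar.
    def correlation(series_a, series_b):
        b_by_ts = dict(series_b)
        pairs = [(v, b_by_ts[ts]) for ts, v in series_a if ts in b_by_ts]
        if len(pairs) < 2:
            return None
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        cov = sum((x - mx) * (y - my) for x, y in pairs)
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy) if sx and sy else 0.0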

Future enhancements include handling multiple metrics at once, probably by combining them early on, similar to the correlation but adding in additional series, such as this, which results in a matrix with both CPU and RAM:

cpu:scope(host:web1).range(-1d).combine(ram:scope().range())

This can be expanded further to include parallel function processing. Otherwise, we have to call the system multiple times when we want the original CPU data, a moving average of the data, and an anomaly detection envelope for the data. Ideally those three vectors could be generated with one call, such as the following, which would result in three parallel vectors in the final matrix that we can then put in a graph:

cpu:scope(host:web1).range(-1d).combine(moving_avg()).combine(anom_sarima())
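The intended result of that kind of combine() chain is simply parallel rows in one matrix, roughly like this sketch (reusing the earlier illustrative structures; the anomaly envelope is just a placeholder):

    raw = {"web1": cpu_web1}
    combined = {
        "cpu":            raw["web1"],              # original series
        "cpu_moving_avg": moving_avg(raw)["web1"],  # derived series, same shape
        # "cpu_anom_envelope": would come from an anomaly model such as SARIMA
    }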

We continue to expand this system, recently including more cloud and service configurations, cloud billing, log metrics, and more.

Data processing systems are always interesting, especially making them easy, flexible, and powerful all at once. We’ve tried to cover all the bases and build something we can use and leverage across a very diverse set of data sources and consumers. It’s an on-going journey.

I’m Steve Mushero, CEO of Siglos.io - Siglos is a Unified Cloud & Operations Platform, built on 10+ years of large-scale global system management experience. Siglos includes Design, Build, Management, Monitoring, Governance, Billing, Automation, Troubleshooting, Tuning, Provisioning, Ticketing and much more. For end users and Managed Service Providers.

See www.Siglos.io for more information.
