  Case Study in Anomaly Detection Using an Autoencoder-based Spotfire Template


    Introduction

    Here’s a walkthrough of using the Spotfire anomaly detection template on the Case Western Reserve University ball bearing data. This dataset, which contains vibration signals from normal and anomalous bearings, is commonly used as a benchmark for bearing fault classification algorithms. It has been widely studied; see, for example: Lite and Efficient Deep Learning Model for Bearing Fault Diagnosis Using the CWRU Dataset - PMC - Yoo et al.

    Additional information is available at https://www.kaggle.com/code/brjapon/crwu-bearings-svm-fault-classification

    The data was originally oriented towards supervised classification of anomalies. We have repurposed it here for unsupervised anomaly detection and clustering.

    Part 1: Split The Data

    The data consists of 33 columns. The first column is a sequence number; the other 32 are measurements from various sensors. There was no timestamp, so we generated an artificial one to identify each row.

    There are 147,201 rows.

    Note - we have renamed the columns so that each column number is a two-digit identifier (01 through 32). The reason for this will become clear below.
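    As a minimal sketch of this preparation step outside Spotfire (the file and original column names are hypothetical), the artificial timestamp and the two-digit renaming could be done in pandas:

        import pandas as pd

        # Load the raw data: a sequence number followed by 32 sensor columns.
        df = pd.read_csv("cwru_bearings.csv")  # hypothetical file name

        # Generate an artificial timestamp at a nominal 1-second interval,
        # since the raw data provides only a sequence number.
        df["Timestamp"] = pd.date_range("2023-01-01", periods=len(df), freq="s")

        # Rename the 32 sensor columns with zero-padded two-digit identifiers
        # (Sensor01 ... Sensor32) so they sort correctly on chart axes.
        sensor_cols = df.columns[1:33]
        df = df.rename(columns={c: f"Sensor{i:02d}" for i, c in enumerate(sensor_cols, start=1)})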

    Here’s a snippet from Spotfire:

    [Screenshot: a snippet of the data table in Spotfire, showing the renamed columns]

    The data is available from https://www.kaggle.com/datasets/brjapon/cwru-bearing-datasets/download?datasetVersionNumber=1 (free registration at Kaggle.com required).

    Unlike a raw dataset, this one is organized into 10 sections, each representing measurements adjacent in time. The first 9 are each examples of a specific anomaly, while the last section is the golden batch, aka Normal Operating Conditions (NOC). The labels for these sections are available in a separate file, but were not used here, because part of our purpose was to test unsupervised anomaly detection.

    The Golden batch can be seen clearly in the visualizations of the raw data. For example:

    [Screenshot: raw sensor data over time, with the golden-batch (NOC) section clearly visible]

    Note that the Anomaly Detection dxp can also work with data that does not have a defined NOC subset, by using robust methods. Here, we select the Training and Validation samples from the NOC to get the best possible model of the NOC. The model can then be used to predict the reconstruction error on the test sample.

    These figures show how we did this:

    1. Mark a date range within the NOC portion of the sample:

    [Screenshot: marking a date range within the NOC portion of the data]

    2. Then press the “Training Sample” button to assign these points to the training sample.

    [Screenshot: assigning the marked points with the “Training Sample” button]

    3. Define the validation sample similarly, still within the NOC.

    [Screenshot: marking the validation date range within the NOC]

    4. Press the “Validation Sample” button.

    [Screenshot: assigning the marked points with the “Validation Sample” button]

    5. Define the Test Sample; this should be outside the NOC. Here I’m using all the rest of the data, but you can use a subset, which may help speed up the processing.

    [Screenshot: marking the remaining data as the test sample]

    6. Press the “Test Sample” button to register your choice.

    [Screenshot: assigning the marked points with the “Test Sample” button]

    This ensures that the model is trained and validated on NOC data, while scoring and reporting on a test sample known to contain anomalies.
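    As a rough sketch of what these button presses accomplish (the date boundaries below are hypothetical; in the dxp the assignment is done interactively by marking points), the sample assignment amounts to:

        # df is the prepared table from the earlier sketch.
        df["Sample"] = "Test"  # by default, everything is the test sample
        train_mask = df["Timestamp"].between("2023-01-02 00:00", "2023-01-02 06:00")
        valid_mask = df["Timestamp"].between("2023-01-02 06:00", "2023-01-02 09:00")
        df.loc[train_mask, "Sample"] = "Training"
        df.loc[valid_mask, "Sample"] = "Validation"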

    Part 2: Select and Configure Predictors

    The dxp is configured to allow selection of which variables to include in the analysis. We include all variables except the label and the timestamp. The table of variables looks like this:

    Top:

    [Screenshot: top of the variable selection table]

    Bottom:

    [Screenshot: bottom of the variable selection table]


    We then created 5 sets of parameters (covering both LSTM and autoencoder modeling techniques), which resulted in the fitting of 5 models:

    [Screenshot: the five parameter sets]

    Part 3: Compare Fitted Models

    [Screenshot: summary statistics and learning curves for the five models]

    The table above shows summary statistics for the 5 models. In addition, we show 5 learning curves, with trajectories for the Training and Validation samples. For the LSTM method, increasing the lookback parameter shows only a small reduction (improvement) in the validation error. The autoencoder with the same neuron structure as the LSTM performs better, which means introducing lookback does not add value: looking at individual time points, rather than also at the preceding points, is sufficient. Therefore, for our purposes, we will select Model1AE and see what we can learn from it.

    Remark: Further training may improve the fit, and additional improvements may be possible by exploring the hyperparameters.
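    For readers who want a concrete picture, here is a minimal sketch of a dense autoencoder of the kind fitted here, written in Keras. The layer sizes, epochs, and placeholder data are illustrative, not the template's actual configuration:

        import numpy as np
        from tensorflow import keras

        n_features = 32
        # Placeholder NOC data; in practice, standardized Training and
        # Validation sensor values would be used.
        X_train = np.random.randn(1000, n_features)
        X_valid = np.random.randn(200, n_features)

        model = keras.Sequential([
            keras.layers.Input(shape=(n_features,)),
            keras.layers.Dense(16, activation="relu"),    # encoder
            keras.layers.Dense(8, activation="relu"),     # bottleneck
            keras.layers.Dense(16, activation="relu"),    # decoder
            keras.layers.Dense(n_features, activation="linear"),
        ])
        model.compile(optimizer="adam", loss="mse")

        # The model learns to reconstruct its own input; the validation loss
        # trajectory corresponds to the learning curves shown above.
        model.fit(X_train, X_train, validation_data=(X_valid, X_valid),
                  epochs=50, batch_size=256)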

    Part 4: View Results

    Reconstruction Error Time Series

    [Screenshot: reconstruction error time series, colored by sample (Training, Validation, Test)]

    The colors show the sample (Train, Validation, Test) for each point; the training and validation samples are the points on the far right. Recall that these represent NOC. As such, the reconstruction errors are quite small since this is where the model was trained and validated. The majority of the points are from the test sample. These show higher reconstruction errors, indicating that they are anomalous versus the baseline.
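    For reference, the per-point reconstruction error plotted here is the mean squared difference between the original and reconstructed sensor values. A minimal sketch, with placeholder arrays:

        import numpy as np

        # X: standardized sensor values; X_hat: the autoencoder's reconstruction.
        X = np.random.randn(100, 32)               # placeholder data
        X_hat = X + 0.1 * np.random.randn(100, 32)

        # Mean squared error across the 32 sensors, one value per time point.
        recon_error = ((X - X_hat) ** 2).mean(axis=1)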

    Let us zoom in on the Training and Validation sets first. We can see that all reconstruction error values are comfortably below 4. Compare the scale of this graph with that of the graph above, which includes the test sample as well.

    [Screenshot: reconstruction error for the Training and Validation samples only; all values below 4]

    Now, let us focus on the anomalies section. There are 9 different anomaly situations, so let us display some of them (notice the scale is different for these):

    [Screenshots: reconstruction error time series for several anomaly types]

    For the different types of anomalies, we see that high reconstruction errors appear in separate incidents: the error rises above the threshold for a period of time and then subsides back to a lower level. Because of the nature of this dataset, where measurements were taken on known anomalies, these incidents are quite frequent; with more typical data, they would often be relatively rare. In addition, we can observe some periodicity and a characteristic profile for the incidents (by incident we mean several consecutive points above the threshold). This profile differs for different anomalies.

    Let’s mark the points in one of these incidents:

    [Screenshot: marked points within a single incident]

    What is causing these anomalies? Spotfire summarizes the components of these marked points:

    We see the sensor measurements that contribute the most to the overall reconstruction error for the 5 points in this incident. We also see that the largest contributions come from sensors with adjacent numbers (06 and 07), likely indicating physically nearby sensors.
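    A minimal sketch of this decomposition (Spotfire does it for you; the arrays below are placeholders): each sensor's contribution to an incident is its squared residual, summed over the marked points:

        import numpy as np

        # Marked points within the incident and their reconstructions.
        X_marked = np.random.randn(5, 32)                  # placeholder data
        X_hat_marked = X_marked + np.random.randn(5, 32)

        # Per-sensor contribution: squared residuals summed over the 5 points.
        contrib = ((X_marked - X_hat_marked) ** 2).sum(axis=0)
        top = np.argsort(contrib)[::-1][:5]
        print([f"Sensor{i + 1:02d}" for i in top])         # top contributors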

    If we mark all incidents for this type of anomaly, we get a somewhat different graph:

    [Screenshot: sensor contributions across all incidents of this anomaly type]

    This indicates that even for a single type of anomaly (which appears in the reconstruction error graph as a repeated pattern of incidents), different sensors are affected in different incidents. This is a very important finding for understanding the results that follow.

    This is useful in itself, but it may be important to understand the full range of incidents in the data. To do this, we automate the identification of incidents.

    First, we select only the Test data, where we would like to cluster the anomalies. Using the slider at left, we select a cutoff that identifies 70% of the observations as normal and 30% as anomalous. This results in a cutoff of 4.28, which seems reasonable given the ranges of the reconstruction errors for the Training and Validation data. When the mean squared reconstruction error is greater than 4.28, we identify the point as anomalous. (We already used this threshold as the reference line in the earlier screenshots.)

    [Screenshot: cutoff slider set to 4.28, splitting observations 70% normal / 30% anomalous]

    To reduce false positives, we further require that at least 5 consecutive points be above the cutoff. These runs are counted as incidents; in the figure, we have identified 12,461 incidents.
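    A minimal sketch of this incident logic (the error series below is a placeholder; in the dxp the cutoff comes from the slider):

        import numpy as np

        recon_error = np.abs(np.random.randn(10_000)) * 3  # placeholder series

        cutoff = np.quantile(recon_error, 0.70)  # ~70% normal / 30% anomalous
        above = recon_error > cutoff

        # Count incidents: runs of at least 5 consecutive points above the cutoff.
        incidents, run = 0, 0
        for flag in above:
            run = run + 1 if flag else 0
            if run == 5:        # count each qualifying run exactly once
                incidents += 1
        print(cutoff, incidents)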

    Clustering Analysis

    We next cluster these incidents to get a better idea of which features are driving each type of anomaly. We will use k-means, which requires a target number of clusters, so we create an elbow plot:

    [Screenshot: elbow plot of within-cluster sum of squares versus number of clusters]

    The elbow is at 2 clusters, suggesting that this is a reasonable choice.
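    A minimal sketch of producing such an elbow plot with scikit-learn (the incident features below are placeholders):

        import numpy as np
        import matplotlib.pyplot as plt
        from sklearn.cluster import KMeans

        # One row per incident, e.g. per-sensor error summaries (placeholder).
        incident_features = np.random.randn(500, 32)

        ks = range(1, 11)
        inertias = [KMeans(n_clusters=k, n_init=10, random_state=0)
                    .fit(incident_features).inertia_ for k in ks]

        plt.plot(list(ks), inertias, marker="o")
        plt.xlabel("Number of clusters")
        plt.ylabel("Within-cluster sum of squares")
        plt.show()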

    If we examine the resulting clusters, we can conclude that they are distinguished simply by low versus high error incidents.

    [Screenshots: the two resulting clusters, separated mainly by reconstruction error level]

    This does not distinguish anomalies by label as we would hope. (Note that we are still doing unsupervised analysis; however, if the labels were consistent with our clusters, we would gain a useful interpretation for the different clusters of anomalies.) Out of curiosity, let us increase the number of clusters to 9, the number of different anomalies in the data.

    Let’s examine the resulting clusters in a few ways:

    [Screenshot: the nine resulting clusters]

    We can see that, again, the clusters reflect the value of the reconstruction error rather than the different types of anomalies. We also have multiple clusters within the same levels of reconstruction error; this is a consequence of our earlier finding that, for the same type of anomaly (the same label), the same sensors are not always responsible.

    This is even clearer in the graph below, which shows the profiles of the clusters per sensor. Each cluster is characterized by the means of the 32 features, shown as a line chart:

    [Screenshot: line chart of cluster profiles across the 32 sensors]

    This is somewhat hard to read, but if we focus on one cluster, we can clearly see a specific structure over the set of sensors on the x-axis (this was the reason for introducing leading zeros in the naming convention, as we mentioned):

    [Screenshot: the profile of a single cluster across the 32 sensors]

    Note that the x-axis here represents the 32 sensors in sequence by the numbering system provided. We don’t know their positions or types, but we can assume that those nearby in the numbering system are similar in some way. The clustering validates this, since it tends to group together measurements with similar patterns in the values.
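    A minimal sketch of how such profiles are computed (placeholder data and hypothetical column names): group the incident rows by cluster and take the mean of each sensor column:

        import numpy as np
        import pandas as pd

        cols = [f"Sensor{i:02d}" for i in range(1, 33)]
        df_inc = pd.DataFrame(np.random.randn(500, 32), columns=cols)  # placeholder
        df_inc["Cluster"] = np.random.randint(0, 9, size=500)

        # Per-cluster means of the 32 features; the zero-padded names keep
        # the sensors ordered on the x-axis of the line chart.
        profiles = df_inc.groupby("Cluster")[cols].mean()
        profiles.T.plot()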

    Unfortunately, but not surprisingly, the clusters we discovered do not correspond to the labels, which are grouped into consecutive time sections. One reason (as already mentioned) is that anomalies can manifest on different sensors even when they share the same label. This may be because the problem occurs in different places in the ball bearing, or manifests in different time intervals (we do not know whether the sensors are related in space or in time).

    Another reason the clustering does not work in this application is that we expect an incident to be a set of consecutive anomalies of the same kind, whereas here the profile keeps changing over the duration of an incident.

    Conclusion

    Training the model on normal operating conditions produced a model that properly reveals that anomalous behavior is present in the test data, and that each anomaly manifests itself in time-separated incidents. This is true for all types of anomalous behavior. In that context, the model is doing exactly what we would expect from a good anomaly detector.

    Concerning the clustering of the detected anomalies (which could bring additional value to anomaly detection), the default approach does not work well and cannot be used to benefit the analysis. The reason lies in the special correlation structure of the sensor measurements, which clearly depend on each other in a periodic/cyclic way, together with the complication that an anomaly can manifest on different sensors.

    Now that we know the structure and behavior of the data better, we would suggest some alternative approaches to do a better job, mainly in the clustering of anomalies. We could investigate a time-frequency representation of the data; we could analyze aggregated time segments instead of raw sensor data; or we could use matrix profiling on the reconstruction error time series to classify the profiles of incidents.
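    As one illustration of that last idea, a matrix profile of the reconstruction error series could be computed with the stumpy library. This is only a sketch with placeholder data, and the subsequence length m would need tuning to the typical incident duration:

        import numpy as np
        import stumpy

        recon_error = np.abs(np.random.randn(5000))  # placeholder error series

        m = 50                             # assumed incident-profile length
        mp = stumpy.stump(recon_error, m)  # matrix profile of the series
        # Low values in column 0 indicate repeated subsequence shapes; motifs
        # found there could group incidents with similar profiles.
        motif_idx = int(np.argmin(mp[:, 0].astype(float)))

    Whether such motifs align better with the anomaly labels than the k-means clusters did would be a natural next experiment.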

