Training and Detection Data Requirements
To use the service, you must prepare appropriate training and testing data to build the model and test it.
Training and testing data can only contain timestamps and other numeric attributes that typically represent sensor or signal readings.
Data Format and Quality Requirements
The training and testing data must represent values from multiple attributes (such as signals or sensors) recorded at timestamps in chronological order, where:
- The first column contains the timestamp, followed by columns for the other numeric attributes, signals, and sensors.
- Each row represents one observation of those attributes, signals, and sensors at a specific timestamp.
These requirements ensure that training is successful, and that the trained model is of high quality:
- Timestamp
  - The timestamp column is optional. Either a timestamp is provided for every row of values, or no timestamps are specified at all.
  - If a timestamp column is provided, it must be the first column and must be named timestamp (all lowercase, without spaces).
  - The timestamps must be in increasing order and can't contain duplicates.
  - The timestamps can have different frequencies. For example, 50 observations in one hour and 200 observations in the next hour.
  - If no timestamp is specified, the data is assumed to be sorted chronologically.
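A quick way to validate these timestamp requirements before training is a short pandas check. The following is a minimal sketch; the file name is an illustrative assumption:

import pandas as pd

# Load the training data; the file name is illustrative
df = pd.read_csv("training_data.csv")

# Timestamps must parse (the service expects ISO 8601 values)
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

# Timestamps must be in increasing order with no duplicates
assert df["timestamp"].is_monotonic_increasing, "timestamps are not in increasing order"
assert not df["timestamp"].duplicated().any(), "duplicate timestamps found"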
- Attribute
  - Values can be missing, and missing values must be represented as null.
  - An attribute, signal, or sensor column can't consist entirely of missing values.
  - Attribute names must be unique. The total number of attributes can't be more than 300.
  - Signal and sensor names can't be MSET.
  - If sensors and signals aren't correlated, then the service builds all univariate models.
  - Window size is a valid concept only for univariate models.
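These attribute rules can also be checked up front. The following is a minimal sketch, assuming the data is already loaded into a pandas DataFrame named df whose first column is timestamp:

# Columns other than the timestamp are the attributes (signals or sensors)
attributes = df.columns.drop("timestamp")

# Attribute names must be unique, and there can be at most 300 of them
assert not attributes.duplicated().any(), "attribute names must be unique"
assert len(attributes) <= 300, "no more than 300 attributes are allowed"

# Signal and sensor names can't be MSET
assert "MSET" not in attributes, "MSET is a reserved name"

# No attribute column may consist entirely of missing values
assert not df[attributes].isna().all().any(), "a column contains only missing values"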
- Training
  - The number of observations (timestamped rows) in the training data must be at least 8 × the number of attributes, or 80, whichever is greater.
    For example, with 100 sensors the minimum number of rows required is max(8 × 100, 80) = 800 rows. With four sensors, the minimum number of rows required is max(8 × 4, 80) = 80 rows.
    Note: By default, model training happens using univariate algorithms.
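Checking this rule against a loaded DataFrame takes a single comparison. A minimal sketch, assuming df follows the format described earlier with timestamp as its first column:

# Minimum rows: 8 × number of attributes, or 80, whichever is greater
num_attributes = len(df.columns.drop("timestamp"))
min_rows = max(8 * num_attributes, 80)
assert len(df) >= min_rows, f"training data needs at least {min_rows} rows"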
- Detection
  - You can choose synchronous detection or asynchronous detection depending on the use case.
  - For univariate models, the first windowSize - 1 rows aren't checked for anomalies.
  - If you provide fewer than windowSize rows, then anomaly detection doesn't occur for univariate signals.
  - Data points are the number of signals multiplied by the number of rows.
- Synchronous Detection
  - Use when detection datasets are smaller than 30,000 data points and are time-sensitive.
  - For a batch detection call, the detection payload can contain at most 300 signals and at most 30,000 data points in any combination of signals and rows.
- Asynchronous Detection Using Jobs
  - Use when detection datasets are larger than 30,000 data points and aren't time-sensitive.
  - For asynchronous detection jobs, the maximum size of a detection payload varies depending on the type of request.
  - For inline request input, the maximum size of the request is limited to 11 MB and 500,000 data points.
  - For Object Storage input, the maximum size of the file is limited to 500 MB and 10 million data points.
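The data-point arithmetic determines which option applies. The following is a minimal sketch of that decision; the signal and row counts are illustrative assumptions:

# Illustrative payload size
num_signals = 50
num_rows = 1000

# Data points are the number of signals multiplied by the number of rows
data_points = num_signals * num_rows

if num_signals <= 300 and data_points <= 30_000:
    print("Synchronous detection fits within the limits")
else:
    print("Use an asynchronous detection job "
          "(inline: up to 500,000 data points; Object Storage: up to 10 million)")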
Data Preparation
Data preparation is essential for MSET to ensure that the data used for model training is clean, consistent, and appropriate for analysis. The following is a brief overview of some common techniques with corresponding examples using Python for use with Anomaly Detection.
- Interquartile Range (IQR)
IQR is a measure of statistical dispersion that is used to identify and remove outliers in time series data. The IQR is calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data. Outliers can be identified by comparing the data values with the upper and lower bounds, which are calculated as Q3 + 1.5 * IQR and Q1 - 1.5 * IQR, respectively. For example:
- Input: time_series_data - A NumPy array or pandas Series containing the time series data.
- Output: cleaned_data - A NumPy array or pandas Series containing the cleaned time series data with outliers removed.
Python example:
import numpy as np

# Generate some example time series data
time_series_data = np.random.normal(0, 1, 100)

# Calculate the IQR
Q1 = np.percentile(time_series_data, 25)
Q3 = np.percentile(time_series_data, 75)
IQR = Q3 - Q1

# Define the upper and lower bounds for outlier detection
upper_bound = Q3 + 1.5 * IQR
lower_bound = Q1 - 1.5 * IQR

# Identify and remove outliers
outliers = (time_series_data < lower_bound) | (time_series_data > upper_bound)
cleaned_data = time_series_data[~outliers]
- Outlier Detection
Various methods can be used for outlier detection in time series data, such as Z-score, moving average, and machine-learning-based approaches. The Isolation Forest algorithm is a machine-learning-based approach suitable for unsupervised anomaly detection. For example:
- Input: time_series_data - A NumPy array or pandas Series containing the time series data.
- Output: cleaned_data - A NumPy array or pandas Series containing the cleaned time series data with outliers removed.
Python example:
import numpy as np
from sklearn.ensemble import IsolationForest

# Generate some example time series data
time_series_data = np.random.normal(0, 1, 100)

# Train the Isolation Forest model
model = IsolationForest(contamination=0.05)  # Specify the contamination level (expected proportion of outliers)
model.fit(time_series_data.reshape(-1, 1))

# Predict outlier labels
outlier_labels = model.predict(time_series_data.reshape(-1, 1))

# Extract the clean data
cleaned_data = time_series_data[outlier_labels == 1]
- Handling Highly Correlated Signals
Data preparation for time series analysis with highly correlated signals involves several key steps to handle the correlation between variables. Here are some common techniques:
  - Data Normalization: It's important to normalize the time series data so that all variables have the same scale. This helps in comparing and analyzing highly correlated variables effectively. Normalization methods, such as min-max scaling or Z-score normalization, can be used to bring the variables to a similar range.
  - Feature Selection: Use the Pearson correlation coefficient to identify variables with high correlation and remove redundant features. For example, you could set a threshold for the correlation coefficient and keep only one variable from a pair of highly correlated variables. A combined sketch of normalization, feature selection, and dimensionality reduction follows this list.
  - Calculate Pearson Correlation Coefficient: One common technique for measuring the strength and direction of linear correlation between variables is the Pearson correlation coefficient. It measures the linear relationship between two variables, with values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation), and 0 indicating no correlation. You can use the pearsonr function from the scipy.stats module in Python to calculate the Pearson correlation coefficient. Following is an example use of the Pearson correlation coefficient for data preparation in Python:

    import numpy as np
    from scipy.stats import pearsonr

    # Generate some example time series data with two highly correlated signals A and B
    np.random.seed(0)
    A = np.random.normal(0, 1, 100)
    B = A + np.random.normal(0, 0.1, 100)

    # Calculate Pearson correlation coefficient between A and B
    correlation_coefficient, _ = pearsonr(A, B)
    print(f"Pearson correlation coefficient between A and B: {correlation_coefficient:.2f}")
  - Dimensionality Reduction: Techniques such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) can be used to reduce the dimensionality of the time series data and create a set of uncorrelated variables, known as principal components, while retaining most of the information from the original variables.
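The following is a minimal sketch that combines these steps: Z-score normalization, correlation-based feature selection with an illustrative threshold of 0.9, and PCA. The signal data is randomly generated for illustration:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Generate example data with three signals, two of which are highly correlated
np.random.seed(0)
base = np.random.normal(0, 1, 100)
signals = pd.DataFrame({
    "sensor1": base,
    "sensor2": base + np.random.normal(0, 0.1, 100),  # nearly a copy of sensor1
    "sensor3": np.random.normal(0, 1, 100),
})

# Z-score normalization so that all signals share the same scale
normalized = (signals - signals.mean()) / signals.std()

# Feature selection: drop one signal from each highly correlated pair
threshold = 0.9  # illustrative cutoff
corr = normalized.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
selected = normalized.drop(columns=to_drop)

# Dimensionality reduction: project onto uncorrelated principal components
pca = PCA(n_components=min(2, selected.shape[1]))
components = pca.fit_transform(selected)
print(f"Dropped: {to_drop}; explained variance ratio: {pca.explained_variance_ratio_}")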
Other Recommendations
- If a new attribute is added in the future, the model has to be retrained with the new attribute included so that it's considered during detection.
- If an attribute is detected to be a duplicate of another signal during training, it's automatically dropped.
- More data in the detection call is better, as long as it's within the limits of the maximum allowed data.
Data Format Requirements
The Anomaly Detection service supports the CSV and JSON file formats that contain data with timestamps and numeric attributes.
The service also supports data from ATP and InfluxDB, which have similar requirements in terms of number and format of timestamps, and number of numeric attributes.
Timestamps must follow the ISO 8601 format. We recommend that you use the precise time up to seconds or milliseconds as in the file format examples.
- CSV Format
Each column represents sensor data, and each row represents the values of all the sensors at a particular timestamp.
CSV-formatted data must consist of comma-separated lines, with the first line as the header and the remaining lines as data. The Anomaly Detection service requires that the first column is named timestamp when timestamps are specified. For example:
timestamp,sensor1,sensor2,sensor3,sensor4,sensor5
2020-07-13T14:03:46Z,,0.6459,-0.0016,-0.6792,0
2020-07-13T14:04:46Z,0.1756,-0.5364,-0.1524,-0.6792,1
2020-07-13T14:05:46Z,0.4132,-0.029,,0.679,0
Note
- Missing values are permitted (with null), data is sorted by timestamp, and Boolean flag values have to be converted to numeric values (0 or 1); see the conversion sketch after this note.
- The last line can't be an empty new line; it must be an observation like the other rows.
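For example, a Boolean column can be converted to 0/1 with pandas; the DataFrame and column name are illustrative assumptions:

# Convert a Boolean flag column to numeric 0/1 values (the column name is illustrative)
df["flag"] = df["flag"].astype(int)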
- JSON Format
Similarly, JSON-formatted data must also contain only timestamps and numeric attributes. Use the following keys:
{ "requestType": "INLINE", "signalNames": ["sensor1", "sensor2", "sensor3", "sensor4", "sensor5", "sensor6", "sensor7", "sensor8", "sensor9", "sensor10"], "data": [ { "timestamp" : "2012-01-01T08:01:01.000Z", "values" : [1, 2.2, 3, 1, 2.2, 3, 1, 2.2, null, 4] }, { "timestamp" : "2012-01-02T08:01:02.000Z", "values" : [1, 2.2, 3, 1, 2.2, 3, 1, 2.2, 3, null] } ] }
Note
Missing values are coded as null, without quotes.
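If the data is already in a pandas DataFrame, you can build this inline payload programmatically. The following is a minimal sketch, assuming df has a timestamp column containing ISO 8601 strings and numeric signal columns:

import json
import pandas as pd

# df is assumed to hold ISO 8601 strings in its "timestamp" column
# and numeric values in the remaining signal columns
signal_names = [c for c in df.columns if c != "timestamp"]

payload = {
    "requestType": "INLINE",
    "signalNames": signal_names,
    "data": [
        {
            "timestamp": row["timestamp"],
            # Missing values (NaN) become null in the JSON output
            "values": [None if pd.isna(v) else float(v) for v in row[signal_names]],
        }
        for _, row in df.iterrows()
    ],
}

print(json.dumps(payload, indent=2))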