autom8qc.qaqc.outlier

GeneralizedESDTest

Class

class autom8qc.qaqc.outlier.GeneralizedESDTest(max_outliers, alpha=0.05)

Bases: autom8qc.qaqc.base.QAQCTest

The generalized ESD (extreme Studentized deviate) test (Rosner 1983) detects one or more outliers in a univariate data set that follows an approximately normal distribution. It is especially useful when the number of outliers is not known: other outlier tests, such as the Grubbs test and the Tietjen-Moore test, require the number of outliers to be specified beforehand. For the ESD test, you only specify an upper bound for the number of outliers.

Important

The test ignores the time-index and considers only the values

Steps:

Iterate over the range of the given maximum number of outliers
Calculate the G-value and determine the related index
Calculate the critical value with the t-distribution
Check whether the G-value is greater than the critical value. If so, the data point is an outlier: remove it from the series and go to step 2. If not, there are no further outliers in the data set.
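
The steps above can be sketched in a few lines of NumPy and SciPy. This is a minimal illustration rather than the library's implementation: the helper names `grubbs_statistic`, `critical_value`, and `generalized_esd` are hypothetical, the critical value is assumed to follow Rosner's formula, and the loop stops at the first non-significant G-value, as the steps describe.

```python
import numpy as np
from scipy import stats


def grubbs_statistic(values):
    """Return (G-value, position) of the point farthest from the mean."""
    std = values.std(ddof=1)
    if std == 0:
        return 0.0, 0  # constant data: no point deviates from the mean
    deviations = np.abs(values - values.mean())
    pos = int(np.argmax(deviations))
    return float(deviations[pos] / std), pos


def critical_value(n_points, alpha):
    """Critical value for n_points remaining values (assumed: Rosner's formula)."""
    p = 1.0 - alpha / (2.0 * n_points)
    t = stats.t.ppf(p, n_points - 2)
    return (n_points - 1) * t / np.sqrt(n_points * (n_points - 2 + t**2))


def generalized_esd(values, max_outliers, alpha=0.05):
    """Return the positions of the detected outliers (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    remaining = np.arange(values.size)  # positions not yet removed
    outliers = []
    for _ in range(max_outliers):                      # step 1
        current = values[remaining]
        if current.size < 3:                           # t-distribution needs n - 2 >= 1
            break
        g_value, pos = grubbs_statistic(current)       # step 2
        if g_value <= critical_value(current.size, alpha):  # steps 3 and 4
            break
        outliers.append(int(remaining[pos]))
        remaining = np.delete(remaining, pos)          # remove the point and repeat
    return outliers


# Illustrative data: 200 normal values with two injected outliers.
rng = np.random.default_rng(42)
values = rng.normal(55, 3, 200)
values[10] = 100.0
values[20] = 0.0
print(sorted(generalized_esd(values, max_outliers=2)))
```

Because each detected point is removed before the next G-value is computed, a second outlier that was masked by the inflated standard deviation can still be found in the following iteration.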

Parameters

NAME (str) – Name of the test
DESCRIPTION (str) – Description of the test
CATEGORY (str) – Category of the test
SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)
parameters (ParameterList) – Supported parameters (default: None)

Supported parameters:

max_outliers (int): Maximum number of outliers
alpha (float): Significance level (default: 0.05)

SUPPORTED_STRUCTURES: alias of autom8qc.core.structures.Series

calculate_critial_value(n_points, alpha)

Calculates and returns the critical value.

Parameters

n_points (int) – Number of points
alpha (float) – Significance level

Returns

Critical value

Return type

float
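
For reference, the critical value of the generalized ESD test as formulated by Rosner (1983) can be written as follows for the `n` points that remain in the current iteration and the significance level `alpha`. Whether `calculate_critial_value` uses exactly this form is an assumption.

```latex
\lambda = \frac{(n - 1)\, t_{p,\,n-2}}{\sqrt{n \left(n - 2 + t_{p,\,n-2}^{2}\right)}},
\qquad p = 1 - \frac{\alpha}{2n}
```

Here, t_{p, n-2} denotes the 100p percentage point of the t-distribution with n - 2 degrees of freedom.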

grubbs_stat(series)

Calculates and returns the Grubbs statistic (G-value) together with the index of the related data point.

Parameters: series (pd.Series) – Time series
Returns: (G-value, index)
Return type: (float, int)

perform(data)

Performs the test and returns the probabilities.

Raises: InvalidType – If structure of the given data is not supported
Parameters: data (BaseStructure, pd.Series, pd.DataFrame) – Data points
Returns: Probabilities (1=Valid, 0=Invalid)
Return type: pd.Series

static supported_parameters()

Returns the supported parameters.

Returns: Supported parameters
Return type: ParameterList

Example

# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 55, 3
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)

# Perform test
from autom8qc.qaqc.outlier import GeneralizedESDTest
test = GeneralizedESDTest(max_outliers=2, alpha=0.05)
test.plot(series=series, series_name="Example")

Visualization

LOFTest

Class

class autom8qc.qaqc.outlier.LOFTest(neighbors, contamination=None, minkowski_p=2)

Bases: autom8qc.qaqc.base.QAQCTest

The Local Outlier Factor (LOF) algorithm compares the local density of a data point to the local densities of its neighbors. Since outliers lie in low-density areas, the ratio of the neighbors' density to the point's own density is higher for anomalous data points. As a rule of thumb, a normal data point has an LOF between 1 and 1.5, whereas anomalous observations have a much higher LOF. The higher the LOF, the more likely the point is an outlier. For example, if the LOF of point X is 5, the average density of X’s neighbors is 5 times higher than X’s local density. If the density of a point is much smaller than the densities of its neighbors (LOF ≫ 1), the point is far from dense areas and, hence, an outlier.

Steps:

Calculate the k-distance
Calculate the reachability distance
Calculate the local reachability density
Compare the densities and apply the contamination value for filtering
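
The four steps can be illustrated with a brute-force NumPy sketch for univariate values. It is not the library's implementation: the function names `lof_scores` and `flag_outliers` are hypothetical, ties at the k-distance are ignored, and the use of `contamination` as a quantile cut-off on the scores is an assumption.

```python
import numpy as np


def lof_scores(values, k):
    """Return the Local Outlier Factor of each value (brute force, O(n^2) memory)."""
    x = np.asarray(values, dtype=float)
    # Pairwise distances between the univariate points; a point is not its own neighbor.
    dist = np.abs(x[:, None] - x[None, :])
    np.fill_diagonal(dist, np.inf)

    # Step 1: the k nearest neighbors and the k-distance of every point.
    neighbors = np.argsort(dist, axis=1)[:, :k]
    neighbor_dist = np.take_along_axis(dist, neighbors, axis=1)
    k_distance = neighbor_dist[:, -1]

    # Step 2: reachability distance of p from neighbor o = max(k_distance(o), d(p, o)).
    reach = np.maximum(k_distance[neighbors], neighbor_dist)

    # Step 3: local reachability density = inverse of the mean reachability distance.
    # (Many duplicated values would make the mean zero; not handled in this sketch.)
    lrd = 1.0 / reach.mean(axis=1)

    # Step 4: LOF = mean density of the neighbors divided by the point's own density.
    return lrd[neighbors].mean(axis=1) / lrd


def flag_outliers(scores, contamination):
    """Flag the top `contamination` fraction of the scores (assumed filtering rule)."""
    return scores > np.quantile(scores, 1.0 - contamination)


# Illustrative data: 200 normal values with one injected outlier at position 10.
rng = np.random.default_rng(42)
values = rng.normal(55, 3, 200)
values[10] = 100.0
scores = lof_scores(values, k=20)
flags = flag_outliers(scores, contamination=0.005)
print(int(np.argmax(scores)), int(flags.sum()))
```

The injected value receives by far the highest score, because its 20 nearest neighbors sit in a much denser region than the point itself.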

Important

The test ignores the time-index and considers only the values

Parameters

NAME (str) – Name of the test
DESCRIPTION (str) – Description of the test
CATEGORY (str) – Category of the test
SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)
parameters (ParameterList) – Supported parameters (default: None)

Supported parameters:

neighbors (int): Number of neighbors to use
contamination (float): The amount of contamination of the data set, i.e., the proportion of outliers in the data set (range: [0, 0.5], default: None)
minkowski_p (float): Parameter for the Minkowski metric (default: 2). p=1 is equivalent to the Manhattan distance (l1) and p=2 to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (lp) is used.
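
As a short illustration of the role of `minkowski_p`, the Minkowski distance between two points can be computed as below. The function `minkowski_distance` is a hypothetical helper written for this example, not part of the library.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski_distance(a, b, p=1))  # Manhattan distance: 7.0
print(minkowski_distance(a, b, p=2))  # Euclidean distance: 5.0
```

For univariate values, every choice of p reduces to the absolute difference |a - b|.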

SUPPORTED_STRUCTURES: alias of autom8qc.core.structures.Series

perform(data)

Performs the test and returns the probabilities.

Raises: InvalidType – If structure of the given data is not supported
Parameters: data (BaseStructure, pd.Series, pd.DataFrame) – Data points
Returns: Probabilities (1=Valid, 0=Invalid)
Return type: pd.Series

static supported_parameters()

Returns the supported parameters.

Returns: Supported parameters
Return type: ParameterList

Example

# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 55, 3
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)

# Perform test
from autom8qc.qaqc.outlier import LOFTest
test = LOFTest(neighbors=100, contamination=0.002)
test.plot(series=series, series_name="Example")

Visualization

OutlierIQRTest

Class

class autom8qc.qaqc.outlier.OutlierIQRTest(scale=1.5)

Bases: autom8qc.qaqc.base.QAQCTest

This class implements a test to detect outliers using the IQR (interquartile range). In descriptive statistics, the interquartile range is a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles: IQR = Q3 − Q1. The algorithm calculates the lower bound (Q1 - scale * IQR) and the upper bound (Q3 + scale * IQR). Any data point below the lower bound or above the upper bound is considered an outlier.

Steps:

Arrange the data in ascending order
Calculate Q1 (the first quartile)
Calculate Q3 (the third quartile)
Find the IQR (Q3 - Q1)
Find the lower bound = Q1 - (scale * IQR)
Find the upper bound = Q3 + (scale * IQR)
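
The following pandas sketch walks through the steps on a small series. It is an illustration under assumptions: `iqr_bounds` is a hypothetical helper, the quartiles use the default linear interpolation of pandas, and the same scale factor is applied to both bounds.

```python
import pandas as pd


def iqr_bounds(series, scale=1.5):
    """Return (lower bound, upper bound) derived from the interquartile range."""
    q1 = series.quantile(0.25)  # first quartile
    q3 = series.quantile(0.75)  # third quartile
    iqr = q3 - q1
    return q1 - scale * iqr, q3 + scale * iqr


series = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 30.0, 12.0])
lower, upper = iqr_bounds(series)
print(lower, upper)  # 9.125 14.125

# 1 = valid, 0 = invalid, mirroring the return convention of perform()
probabilities = series.between(lower, upper).astype(float)
print(probabilities.tolist())  # only the value 30.0 is marked invalid
```

Here Q1 = 11 and Q3 = 12.25, so IQR = 1.25 and the bounds are 11 - 1.875 and 12.25 + 1.875.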

Important

The test ignores the time-index and considers only the values

Parameters

NAME (str) – Name of the test
DESCRIPTION (str) – Description of the test
CATEGORY (str) – Category of the test
SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)
parameters (ParameterList) – Supported parameters (default: None)

Supported parameters:

scale (float): Scale factor (default: 1.5)

SUPPORTED_STRUCTURES: alias of autom8qc.core.structures.Series

perform(data)

Performs the test and returns the probabilities.

Raises: InvalidType – If structure of the given data is not supported
Parameters: data (BaseStructure, pd.Series, pd.DataFrame) – Data points
Returns: Probabilities (1=Valid, 0=Invalid)
Return type: pd.Series

static supported_parameters()

Returns the supported parameters.

Returns: Supported parameters
Return type: ParameterList

Example

# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 50, 5
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)

# Perform test
from autom8qc.qaqc.outlier import OutlierIQRTest
test = OutlierIQRTest()
test.plot(series=series, series_name="Example")

Visualization

OutlierMADTest

Class

class autom8qc.qaqc.outlier.OutlierMADTest(threshold=3)

Bases: autom8qc.qaqc.base.QAQCTest

This class implements a test to detect outliers using the MAD (median absolute deviation). The MAD algorithm is commonly used for this type of anomaly detection because it is effective, efficient, and robust: unlike the mean and the standard deviation, the median is barely affected by the outliers themselves. The median, or “middle” value, of the series describes normal behavior. A large deviation of an individual data point from the median, measured relative to the MAD, indicates that the point is anomalous.

Steps:

Calculate the median of the values
Calculate the absolute differences (|points - median|)
Calculate the median of the absolute differences (MAD)
Mark data points whose MAD-based score exceeds the threshold (default: 3) as invalid
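
A minimal pandas sketch of these steps is shown below. The helper `mad_scores` is hypothetical, and the score is assumed to be the absolute deviation divided by the MAD without any consistency constant; the library may scale the score differently.

```python
import pandas as pd


def mad_scores(series):
    """Return |x - median| / MAD for every value (assumed scoring rule)."""
    median = series.median()
    abs_dev = (series - median).abs()  # absolute differences to the median
    mad = abs_dev.median()             # median of the absolute differences
    if mad == 0:
        raise ValueError("MAD is zero; the scores are undefined")
    return abs_dev / mad


series = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 30.0, 12.0])
scores = mad_scores(series)
print(scores.tolist())  # [2.0, 0.0, 1.0, 1.0, 0.0, 1.0, 18.0, 0.0]

threshold = 3
# 1 = valid, 0 = invalid, mirroring the return convention of perform()
probabilities = (scores <= threshold).astype(float)
print(probabilities.tolist())
```

The median is 12 and the MAD is 1, so the value 30 receives a score of 18 and is the only point above the threshold.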

Important

The test ignores the time-index and considers only the values

Parameters

NAME (str) – Name of the test
DESCRIPTION (str) – Description of the test
CATEGORY (str) – Category of the test
SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)
parameters (ParameterList) – Supported parameters (default: None)

Supported parameters:

threshold (float): Threshold value (default: 3)

SUPPORTED_STRUCTURES: alias of autom8qc.core.structures.Series

perform(data)

Performs the test and returns the probabilities.

Raises: InvalidType – If structure of the given data is not supported
Parameters: data (BaseStructure, pd.Series, pd.DataFrame) – Data points
Returns: Probabilities (1=Valid, 0=Invalid)
Return type: pd.Series

static supported_parameters()

Returns the supported parameters.

Returns: Supported parameters
Return type: ParameterList

Example

# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 50, 5
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)

# Perform test
from autom8qc.qaqc.outlier import OutlierMADTest
test = OutlierMADTest()
test.plot(series=series, series_name="Example")

Visualization

OutlierZTest

Class

class autom8qc.qaqc.outlier.OutlierZTest

Bases: autom8qc.qaqc.base.QAQCTest

This class implements a test to detect outliers using the Z-score. Z-scores quantify how unusual an observation is when the data follow a normal distribution. The Z-score of a value is the number of standard deviations by which it lies above or below the mean. For example, a Z-score of 2 indicates that an observation is two standard deviations above the mean, while a Z-score of -2 signifies that it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean. For the limits, the test uses three standard deviations (i.e., Z-scores between -3 and 3). Any value outside these limits is treated as an outlier.

Steps:

Calculate the mean and the standard deviation
Calculate the Z-scores
Mark data points with a Z-score smaller than -3 or greater than 3 as invalid
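
The steps can be sketched with pandas as follows. The helper `z_scores` is hypothetical, and the sample standard deviation (the pandas default, `ddof=1`) is an assumption about the normalization.

```python
import numpy as np
import pandas as pd


def z_scores(series):
    """Return the number of standard deviations each value lies from the mean."""
    return (series - series.mean()) / series.std()


# Illustrative data: 1000 normal values with two modified points.
rng = np.random.default_rng(42)
values = rng.normal(50, 5, 1000)
values[42] = 70.0   # about four standard deviations above the mean
values[666] = 40.0  # about two standard deviations below the mean
series = pd.Series(values)

z = z_scores(series)
# 1 = valid, 0 = invalid, mirroring the return convention of perform()
probabilities = (z.abs() <= 3).astype(float)
print(probabilities.iloc[42], probabilities.iloc[666])
```

With limits of -3 and 3, the value 70 is marked invalid, while the value 40 stays valid because it lies only about two standard deviations below the mean.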

Note

In statistics, the standard score is the number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean value of what is being observed or measured.

Important

The test ignores the time-index and considers only the values

Parameters

NAME (str) – Name of the test
DESCRIPTION (str) – Description of the test
CATEGORY (str) – Category of the test
SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)
parameters (ParameterList) – Supported parameters

SUPPORTED_STRUCTURES: alias of autom8qc.core.structures.Series

perform(data)

Performs the test and returns the probabilities.

Raises: InvalidType – If structure of the given data is not supported
Parameters: data (BaseStructure, pd.Series, pd.DataFrame) – Data points
Returns: Probabilities (1=Valid, 0=Invalid)
Return type: pd.Series

Example

# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 50, 5
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)

# Perform test
from autom8qc.qaqc.outlier import OutlierZTest
test = OutlierZTest()
test.plot(series=series, series_name="Example")

Visualization