autom8qc.qaqc.outlier

GeneralizedESDTest

Class

class autom8qc.qaqc.outlier.GeneralizedESDTest(max_outliers, alpha=0.05)

Bases: autom8qc.qaqc.base.QAQCTest

The generalized extreme Studentized deviate (ESD) test (Rosner 1983) detects one or more outliers in a univariate data set that follows an approximately normal distribution. It is especially useful when the number of outliers is not known in advance: other outlier tests, such as the Grubbs test and the Tietjen-Moore test, require the number of outliers to be specified beforehand. For the ESD test, you only specify an upper bound on the number of outliers.

Important

The test ignores the time index and considers only the values.

Steps:
  1. Iterate up to the given maximum number of outliers

  2. Calculate the G-value and determine the index of the corresponding data point

  3. Calculate the critical value using the t-distribution

  4. Check whether the G-value is greater than the critical value. If so, the data point is an outlier: remove it from the series and go to step 2. If not, there are no further outliers in the data set.
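
The steps above can be sketched as follows. This is a minimal illustration using NumPy, not the library's implementation: the function name `esd_outlier_positions` and the `critical_value` callback are hypothetical, and the critical value is supplied by the caller so that the sketch needs no statistics library. As in the steps, the loop stops at the first G-value that does not exceed the critical value.

```python
import numpy as np


def esd_outlier_positions(values, max_outliers, critical_value):
    """Return the positions flagged by the iterative procedure.

    `critical_value(n_points)` is a caller-supplied function
    (an assumption of this sketch, not part of the library API).
    """
    remaining = np.asarray(values, dtype=float)
    positions = np.arange(remaining.size)  # original positions of remaining points
    flagged = []
    for _ in range(max_outliers):
        if remaining.size < 3:
            break  # too few points for a meaningful statistic
        std = remaining.std(ddof=1)  # sample standard deviation
        if std == 0:
            break  # all remaining values are identical
        deviations = np.abs(remaining - remaining.mean())
        worst = int(np.argmax(deviations))  # most extreme remaining point
        g_value = deviations[worst] / std
        if g_value <= critical_value(remaining.size):
            break  # no further outliers
        flagged.append(int(positions[worst]))
        remaining = np.delete(remaining, worst)
        positions = np.delete(positions, worst)
    return flagged


values = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 25.0, 10.1, 9.9, 10.0]
# The constant 2.0 is an arbitrary placeholder, not a real critical value.
flagged = esd_outlier_positions(values, max_outliers=3, critical_value=lambda n: 2.0)
```

Here only the value 25.0 (position 6) is flagged; once it is removed, the largest remaining G-value falls below the placeholder limit and the loop stops.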

Parameters
  • NAME (str) – Name of the test

  • DESCRIPTION (str) – Description of the test

  • CATEGORY (str) – Category of the test

  • SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)

  • parameters (ParameterList) – Supported parameters (default: None)

Supported parameters:
  • max_outliers (int): Maximum number of outliers

  • alpha (float): Significance level (default: 0.05)

SUPPORTED_STRUCTURES

alias of autom8qc.core.structures.Series

calculate_critial_value(n_points, alpha)

Calculates and returns the critical value.

Parameters
  • n_points (int) – Number of points

  • alpha (float) – Significance level

Returns

Critical value

Return type

float
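
The formula itself is not given here. Assuming the method follows Rosner's (1983) definition applied to the n points currently in the series (an assumption, not confirmed by this page), the critical value would be:

```latex
\lambda = \frac{(n - 1)\, t_{p,\, n-2}}{\sqrt{n \left(n - 2 + t_{p,\, n-2}^{2}\right)}},
\qquad p = 1 - \frac{\alpha}{2n}
```

where t_{p, n-2} is the p-quantile of the t-distribution with n - 2 degrees of freedom, n corresponds to `n_points`, and α corresponds to `alpha`.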

grubbs_stat(series)

Calculates and returns the Grubbs' statistic (G-value) together with the index of the corresponding data point.

Parameters

series (pd.Series) – Time series

Returns

(G-value, index)

Return type

(float, int)

perform(data)

Performs the test and returns the probabilities.

Raises

InvalidType – If structure of the given data is not supported

Parameters

data (BaseStructure, pd.Series, pd.DataFrame) – Data points

Returns

Probabilities (1=Valid, 0=Invalid)

Return type

pd.Series

static supported_parameters()

Returns the supported parameters.

Returns

Supported parameters

Return type

ParameterList

Example

# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 55, 3
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)

# Perform test
from autom8qc.qaqc.outlier import GeneralizedESDTest
test = GeneralizedESDTest(max_outliers=2, alpha=0.05)
test.plot(series=series, series_name="Example")

Visualization

../_images/GeneralizedESDTest.svg

LOFTest

Class

class autom8qc.qaqc.outlier.LOFTest(neighbors, contamination=None, minkowski_p=2)

Bases: autom8qc.qaqc.base.QAQCTest

The Local Outlier Factor (LOF) algorithm compares the local density of a data point to the local densities of its neighbors. Because outliers lie in low-density areas, this ratio is higher for anomalous data points. As a rule of thumb, a normal data point has an LOF between 1 and 1.5, whereas anomalous observations have a much higher LOF; the higher the LOF, the more likely the point is an outlier. For example, if the LOF of point X is 5, the average density of X’s neighbors is 5 times higher than X’s local density. If the density of a point is much smaller than the densities of its neighbors (LOF ≫ 1), the point is far from dense areas and is therefore an outlier.

Steps:
  1. Calculate k-distance

  2. Calculate reachability distance

  3. Calculate local reachability density

  4. Compare densities and apply the contamination for filtering
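
The first three steps and the density comparison can be sketched for one-dimensional values as follows. This is a simplified illustration with NumPy, not the library's implementation: the function name `lof_scores` is hypothetical, exactly `k` neighbors are used per point (ties are not expanded), distinct values are assumed, and the contamination-based filtering of step 4 is left out.

```python
import numpy as np


def lof_scores(values, k):
    """Return the Local Outlier Factor of every value (1-D sketch)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    rows = np.arange(n)

    # Pairwise distances; for 1-D data this is simply |a - b|.
    dist = np.abs(x[:, None] - x[None, :])
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor

    # Step 1: k nearest neighbors and the k-distance of every point.
    neighbors = np.argsort(dist, axis=1)[:, :k]
    k_distance = dist[rows, neighbors[:, -1]]

    # Step 2: reachability distance of p from neighbor o is
    # max(k_distance(o), d(p, o)).
    reach = np.maximum(k_distance[neighbors], dist[rows[:, None], neighbors])

    # Step 3: local reachability density (inverse mean reachability distance).
    lrd = 1.0 / reach.mean(axis=1)

    # Step 4 (comparison only): mean neighbor density relative to own density.
    return lrd[neighbors].mean(axis=1) / lrd


scores = lof_scores([1.0, 1.1, 0.9, 1.2, 0.8, 1.05, 0.95, 5.0], k=3)
```

In this example the isolated value 5.0 receives a far higher score than the clustered values, which all stay close to 1.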

Important

The test ignores the time index and considers only the values.

Parameters
  • NAME (str) – Name of the test

  • DESCRIPTION (str) – Description of the test

  • CATEGORY (str) – Category of the test

  • SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)

  • parameters (ParameterList) – Supported parameters (default: None)

Supported parameters:
  • neighbors (int): Number of neighbors to use

  • contamination (float): Amount of contamination of the data set, i.e., the proportion of outliers in the data set, in the range [0, 0.5] (default: None)

  • minkowski_p (float): Parameter for the Minkowski metric (default: 2). p=1 is equivalent to the Manhattan distance (l1) and p=2 to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (lp) is used.
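
As a brief illustration of what `minkowski_p` controls, the sketch below computes the Minkowski distance between two points with NumPy. The helper `minkowski_distance` is hypothetical and only illustrates the metric; it is not part of the library.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


manhattan = minkowski_distance([0, 0], [3, 4], p=1)  # |3| + |4|
euclidean = minkowski_distance([0, 0], [3, 4], p=2)  # sqrt(3^2 + 4^2)
```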

SUPPORTED_STRUCTURES

alias of autom8qc.core.structures.Series

perform(data)

Performs the test and returns the probabilities.

Raises

InvalidType – If structure of the given data is not supported

Parameters

data (BaseStructure, pd.Series, pd.DataFrame) – Data points

Returns

Probabilities (1=Valid, 0=Invalid)

Return type

pd.Series

static supported_parameters()

Returns the supported parameters.

Returns

Supported parameters

Return type

ParameterList

Example

# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 55, 3
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)

# Perform test
from autom8qc.qaqc.outlier import LOFTest
test = LOFTest(neighbors=100, contamination=0.002)
test.plot(series=series, series_name="Example")

Visualization

../_images/LOFTest.svg

OutlierIQRTest

Class

class autom8qc.qaqc.outlier.OutlierIQRTest(scale=1.5)

Bases: autom8qc.qaqc.base.QAQCTest

This class implements a test to detect outliers using the interquartile range (IQR). In descriptive statistics, the interquartile range is a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles: IQR = Q3 − Q1. The algorithm calculates the lower bound (Q1 − scale * IQR) and the upper bound (Q3 + scale * IQR). Any data point below the lower bound or above the upper bound is considered an outlier.

Steps:
  1. Arrange your data in ascending order

  2. Calculate Q1 (the first quartile)

  3. Calculate Q3 (the third quartile)

  4. Find the IQR (Q3 - Q1)

  5. Find the lower bound = Q1 - (scale * IQR)

  6. Find the upper bound = Q3 + (scale * IQR)
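
The steps above can be sketched with NumPy as follows. This is an illustration rather than the library's code: the function name `iqr_flags` is hypothetical, the same scale factor is assumed for both bounds, and quartiles use NumPy's default linear interpolation (which also makes the explicit sorting step unnecessary).

```python
import numpy as np


def iqr_flags(values, scale=1.5):
    """Return 1 for valid points and 0 for outliers."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])  # first and third quartile
    iqr = q3 - q1
    lower = q1 - scale * iqr
    upper = q3 + scale * iqr
    return np.where((x < lower) | (x > upper), 0, 1)


flags = iqr_flags([10, 12, 11, 13, 12, 11, 10, 13, 12, 40])
```

For this data, Q1 is 11 and Q3 is 12.75, so the bounds are 8.375 and 15.375 and only the value 40 is flagged.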

Important

The test ignores the time index and considers only the values.

Parameters
  • NAME (str) – Name of the test

  • DESCRIPTION (str) – Description of the test

  • CATEGORY (str) – Category of the test

  • SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)

  • parameters (ParameterList) – Supported parameters (default: None)

Supported parameters:
  • scale (float): Scale factor (default: 1.5)

SUPPORTED_STRUCTURES

alias of autom8qc.core.structures.Series

perform(data)

Performs the test and returns the probabilities.

Raises

InvalidType – If structure of the given data is not supported

Parameters

data (BaseStructure, pd.Series, pd.DataFrame) – Data points

Returns

Probabilities (1=Valid, 0=Invalid)

Return type

pd.Series

static supported_parameters()

Returns the supported parameters.

Returns

Supported parameters

Return type

ParameterList

Example

# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 50, 5
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)

# Perform test
from autom8qc.qaqc.outlier import OutlierIQRTest
test = OutlierIQRTest()
test.plot(series=series, series_name="Example")

Visualization

../_images/OutlierIQRTest.svg

OutlierMADTest

Class

class autom8qc.qaqc.outlier.OutlierMADTest(threshold=3)

Bases: autom8qc.qaqc.base.QAQCTest

This class implements a test to detect outliers using the median absolute deviation (MAD). The MAD algorithm is commonly used for this type of anomaly detection because it is effective and efficient, and because the median is far less sensitive to extreme values than the mean. The median, or “middle” value, of the data describes normal behavior. Large deviations of individual data points from the median indicate that those points are anomalous.

Steps:
  1. Calculate median of values

  2. Calculate the absolute differences (|points - median|)

  3. Calculate the median of the absolute differences (the MAD)

  4. Mark data points whose deviation score is greater than the threshold (default: 3) as invalid
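
The steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the library's code: the function name `mad_flags` is hypothetical, and the score is assumed to be the absolute deviation divided by the MAD. The actual implementation may scale the score differently (for example with a consistency constant).

```python
import numpy as np


def mad_flags(values, threshold=3.0):
    """Return 1 for valid points and 0 for outliers."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    abs_dev = np.abs(x - median)  # absolute differences to the median
    mad = np.median(abs_dev)      # median absolute deviation
    if mad == 0:
        # More than half of the values are identical; this sketch
        # then treats every point as valid instead of dividing by zero.
        return np.ones(x.size, dtype=int)
    return np.where(abs_dev / mad > threshold, 0, 1)


flags = mad_flags([10, 12, 11, 13, 12, 11, 10, 13, 12, 40])
```

Here the median is 12 and the MAD is 1, so the value 40 has a score of 28 and is the only point marked invalid.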

Important

The test ignores the time index and considers only the values.

Parameters
  • NAME (str) – Name of the test

  • DESCRIPTION (str) – Description of the test

  • CATEGORY (str) – Category of the test

  • SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)

  • parameters (ParameterList) – Supported parameters (default: None)

Supported parameters:
  • threshold (float): Threshold value (default: 3)

SUPPORTED_STRUCTURES

alias of autom8qc.core.structures.Series

perform(data)

Performs the test and returns the probabilities.

Raises

InvalidType – If structure of the given data is not supported

Parameters

data (BaseStructure, pd.Series, pd.DataFrame) – Data points

Returns

Probabilities (1=Valid, 0=Invalid)

Return type

pd.Series

static supported_parameters()

Returns the supported parameters.

Returns

Supported parameters

Return type

ParameterList

Example

# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 50, 5
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)

# Perform test
from autom8qc.qaqc.outlier import OutlierMADTest
test = OutlierMADTest()
test.plot(series=series, series_name="Example")

Visualization

../_images/OutlierMADTest.svg

OutlierZTest

Class

class autom8qc.qaqc.outlier.OutlierZTest

Bases: autom8qc.qaqc.base.QAQCTest

This class implements a test to detect outliers using the Z-score. Z-scores quantify how unusual an observation is when the data follow a normal distribution. The Z-score of a value is the number of standard deviations by which it lies above or below the mean. For example, a Z-score of 2 indicates that an observation is two standard deviations above the mean, while a Z-score of -2 signifies that it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean. As limits, the test uses three standard deviations (i.e., Z-scores between -3 and 3). Any value outside this range is treated as an outlier.

Steps:
  1. Calculate mean and standard deviation

  2. Calculate Z-Scores

  3. Mark data points with a Z-score smaller than -3 or greater than 3 as invalid
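
The steps above can be sketched with NumPy as follows. This is an illustration, not the library's code: the function name `z_flags` and its `limit` argument are hypothetical, and the population standard deviation (NumPy's default, `ddof=0`) is an assumption.

```python
import numpy as np


def z_flags(values, limit=3.0):
    """Return 1 for valid points and 0 for outliers."""
    x = np.asarray(values, dtype=float)
    std = x.std()  # population standard deviation (assumption)
    if std == 0:
        return np.ones(x.size, dtype=int)  # all values identical
    z_scores = (x - x.mean()) / std
    return np.where(np.abs(z_scores) > limit, 0, 1)


values = [9.0, 10.0, 11.0] * 6 + [10.0, 40.0]
flags = z_flags(values)
```

With these 20 values the point 40.0 has a Z-score of about 4.3 and is marked invalid. Note that in very small samples a single extreme value inflates the standard deviation so much that its Z-score may never exceed 3.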

Note

In statistics, the standard score is the number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean value of what is being observed or measured.

Important

The test ignores the time index and considers only the values.

Parameters
  • NAME (str) – Name of the test

  • DESCRIPTION (str) – Description of the test

  • CATEGORY (str) – Category of the test

  • SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)

  • parameters (ParameterList) – Supported parameters

SUPPORTED_STRUCTURES

alias of autom8qc.core.structures.Series

perform(data)

Performs the test and returns the probabilities.

Raises

InvalidType – If structure of the given data is not supported

Parameters

data (BaseStructure, pd.Series, pd.DataFrame) – Data points

Returns

Probabilities (1=Valid, 0=Invalid)

Return type

pd.Series

Example

# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 50, 5
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)

# Perform test
from autom8qc.qaqc.outlier import OutlierZTest
test = OutlierZTest()
test.plot(series=series, series_name="Example")

Visualization

../_images/OutlierZTest.svg