autom8qc.qaqc.outlier
GeneralizedESDTest
Class
- class autom8qc.qaqc.outlier.GeneralizedESDTest(max_outliers, alpha=0.05)
Bases:
autom8qc.qaqc.base.QAQCTest
The generalized extreme Studentized deviate (ESD) test (Rosner 1983) detects one or more outliers in a univariate data set that follows an approximately normal distribution. It is especially useful when the number of outliers is not known: other outlier tests, such as the Grubbs test and the Tietjen-Moore test, require the number of outliers to be specified beforehand. For the ESD test, you only specify an upper bound for the number of outliers.
Important
The test ignores the time-index and considers only the values
- Steps:
Iterate over the range of the given maximum number of outliers
Calculate the G-value and determine the related index
Calculate the critical value with the t-distribution
Check if the G-value is greater than the critical value. If so, the data point is an outlier: remove it from the series and go to step 2. If not, there are no further outliers in the dataset.
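The steps above can be sketched in a few lines of NumPy and SciPy. This is a minimal illustration, not the library's implementation: the function name generalized_esd_sketch is hypothetical, the use of scipy.stats for the t-distribution is an assumption, and the loop stops at the first non-significant G-value, as the steps describe. Rosner's original procedure instead evaluates all candidates and keeps the largest significant one.

```python
import numpy as np
from scipy import stats


def generalized_esd_sketch(values, max_outliers, alpha=0.05):
    """Return the positions of the values flagged as outliers (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    remaining = np.arange(x.size)  # positions that have not been removed yet
    outliers = []
    for _ in range(max_outliers):  # step 1: at most max_outliers iterations
        sample = x[remaining]
        n = sample.size
        if n < 3:
            break  # too few points for the t-distribution (n - 2 degrees of freedom)
        std = sample.std(ddof=1)
        if std == 0:
            break  # all remaining values are identical
        # Step 2: G-value and position of the most extreme point
        deviations = np.abs(sample - sample.mean())
        k = int(np.argmax(deviations))
        g_value = deviations[k] / std
        # Step 3: critical value from the t-distribution
        t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        critical = (n - 1) * t / np.sqrt(n * (n - 2 + t**2))
        # Step 4: stop as soon as the most extreme point is not significant
        if g_value <= critical:
            break
        outliers.append(int(remaining[k]))
        remaining = np.delete(remaining, k)
    return outliers


# Usage: 200 normally distributed values with two injected outliers
rng = np.random.default_rng(42)
values = rng.normal(55, 3, 200)
values[10] = 90.0
values[20] = 10.0
flagged = generalized_esd_sketch(values, max_outliers=2)
print(sorted(flagged))
```

Working on positions rather than on the time index mirrors the note that the test considers only the values.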
- Parameters
NAME (str) – Name of the test
DESCRIPTION (str) – Description of the test
CATEGORY (str) – Category of the test
SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)
parameters (ParameterList) – Supported parameters (default: None)
- Supported parameters:
max_outliers (int): Maximum number of outliers
alpha (float): Significance level (default: 0.05)
- SUPPORTED_STRUCTURES
alias of
autom8qc.core.structures.Series
- calculate_critial_value(n_points, alpha)
Calculates and returns the critical value.
- Parameters
n_points (int) – Number of points
alpha (float) – Significance level
- Returns
Critical value
- Return type
float
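For reference, the critical value of the generalized ESD test for a sample of n points is usually written as shown below, where t is the percent point of the t-distribution with n − 2 degrees of freedom. Whether calculate_critial_value evaluates exactly this expression is an assumption; the symbols n, p, and t are introduced here for illustration.

```latex
G_{\mathrm{crit}}(n, \alpha)
  = \frac{(n - 1)\, t_{p,\,n-2}}{\sqrt{n \left(n - 2 + t_{p,\,n-2}^{2}\right)}},
\qquad
p = 1 - \frac{\alpha}{2n}
```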
- grubbs_stat(series)
Calculates and returns the Grubbs statistic (G-value) together with the index of the related data point.
- Parameters
series (pd.Series) – Time series
- Returns
(G-value, index)
- Return type
(float, int)
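As an illustration of what such a statistic looks like, the sketch below computes the largest absolute deviation from the mean in units of the sample standard deviation. The helper name grubbs_stat_sketch is hypothetical, and returning a positional index (rather than an index label) is an assumption based on the documented return type.

```python
import numpy as np
import pandas as pd


def grubbs_stat_sketch(series):
    """Return (G-value, position) of the most extreme point (hypothetical helper)."""
    values = series.to_numpy(dtype=float)
    deviations = np.abs(values - values.mean())
    position = int(np.argmax(deviations))
    # Sample standard deviation (ddof=1), as is usual for the Grubbs statistic
    g_value = float(deviations[position] / values.std(ddof=1))
    return g_value, position


series = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
g_value, position = grubbs_stat_sketch(series)
print(g_value, position)
```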
- perform(data)
Performs the test and returns the probabilities.
- Raises
InvalidType – If structure of the given data is not supported
- Parameters
data (BaseStructure, pd.Series, pd.DataFrame) – Data points
- Returns
Probabilities (1=Valid, 0=Invalid)
- Return type
pd.Series
- static supported_parameters()
Returns the supported parameters.
- Returns
Supported parameters
- Return type
ParameterList
Example
# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 55, 3
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)
# Perform test
from autom8qc.qaqc.outlier import GeneralizedESDTest
test = GeneralizedESDTest(max_outliers=2, alpha=0.05)
test.plot(series=series, series_name="Example")
Visualization
LOFTest
Class
- class autom8qc.qaqc.outlier.LOFTest(neighbors, contamination=None, minkowski_p=2)
Bases:
autom8qc.qaqc.base.QAQCTest
The Local Outlier Factor (LOF) algorithm compares the local density of a data point to the local densities of its neighbors. Since outliers lie in low-density areas, the ratio is higher for anomalous data points. As a rule of thumb, a normal data point has a LOF between 1 and 1.5, whereas anomalous observations have a much higher LOF. The higher the LOF, the more likely the point is an outlier. For example, if the LOF of point X is 5, the average density of X’s neighbors is 5 times higher than its local density. If the density of a point is much smaller than the densities of its neighbors (LOF ≫ 1), the point is far from dense areas and, hence, an outlier.
- Steps:
Calculate k-distance
Calculate reachability distance
Calculate reachability density
Compare densities and take the contamination into account for filtering
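The four steps can be sketched for univariate data with plain NumPy. This is a simplified illustration under stated assumptions, not the library's implementation: the function name lof_sketch is hypothetical, ties at the k-distance are ignored, and the contamination is applied as a quantile cut on the LOF scores.

```python
import numpy as np


def lof_sketch(values, k):
    """Return the LOF score of every value (hypothetical helper, 1-D data only)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    rows = np.arange(n)
    dist = np.abs(x[:, None] - x[None, :])  # pairwise distances
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]  # positions of the k nearest neighbors
    # Step 1: k-distance = distance to the k-th nearest neighbor
    k_distance = dist[rows, neighbors[:, -1]]
    # Step 2: reachability distance of p from o = max(k-distance(o), d(p, o))
    reach = np.maximum(k_distance[neighbors], dist[rows[:, None], neighbors])
    # Step 3: local reachability density = inverse mean reachability distance
    # (assumes the data are not dominated by duplicates, otherwise this divides by zero)
    lrd = 1.0 / reach.mean(axis=1)
    # Step 4: LOF = mean density of the neighbors relative to the point's own density
    return lrd[neighbors].mean(axis=1) / lrd


values = [1.0, 1.1, 0.9, 1.2, 0.8, 1.05, 0.95, 10.0]
lof = lof_sketch(values, k=3)

# Filtering with a contamination of 1/8: flag the top 12.5 % of the scores
contamination = 0.125
threshold = np.quantile(lof, 1 - contamination)
flagged = np.flatnonzero(lof > threshold).tolist()
print(flagged)
```

The isolated value 10.0 receives a LOF far above 1, while the points inside the cluster stay close to 1.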
Important
The test ignores the time-index and considers only the values
- Parameters
NAME (str) – Name of the test
DESCRIPTION (str) – Description of the test
CATEGORY (str) – Category of the test
SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)
parameters (ParameterList) – Supported parameters (default: None)
- Supported parameters:
neighbors (int): Number of neighbors to use
contamination (float): The amount of contamination of the data set, i.e., the proportion of outliers in the data set (range: [0, 0.5])
minkowski_p (float): Parameter for the Minkowski metric. p=1 is equivalent to the Manhattan distance (l1) and p=2 to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (lp) is used (default: 2).
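To make the role of minkowski_p concrete, the following sketch evaluates the Minkowski distance for p=1 and p=2. The helper minkowski is hypothetical and written only for illustration.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two points (hypothetical helper)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


manhattan = minkowski((0.0, 0.0), (3.0, 4.0), p=1)  # |3| + |4|
euclidean = minkowski((0.0, 0.0), (3.0, 4.0), p=2)  # sqrt(3^2 + 4^2)
one_dim_p1 = minkowski((2.0,), (5.0,), p=1)
one_dim_p3 = minkowski((2.0,), (5.0,), p=3)
print(manhattan, euclidean, one_dim_p1, one_dim_p3)
```

For one-dimensional points the distance reduces to |x − y| for every p, so the choice of p matters only when points have more than one coordinate.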
- SUPPORTED_STRUCTURES
alias of
autom8qc.core.structures.Series
- perform(data)
Performs the test and returns the probabilities.
- Raises
InvalidType – If structure of the given data is not supported
- Parameters
data (BaseStructure, pd.Series, pd.DataFrame) – Data points
- Returns
Probabilities (1=Valid, 0=Invalid)
- Return type
pd.Series
- static supported_parameters()
Returns the supported parameters.
- Returns
Supported parameters
- Return type
ParameterList
Example
# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 55, 3
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)
# Perform test
from autom8qc.qaqc.outlier import LOFTest
test = LOFTest(neighbors=100, contamination=0.002)
test.plot(series=series, series_name="Example")
Visualization
OutlierIQRTest
Class
- class autom8qc.qaqc.outlier.OutlierIQRTest(scale=1.5)
Bases:
autom8qc.qaqc.base.QAQCTest
This class implements a test to detect outliers using the interquartile range (IQR). In descriptive statistics, the interquartile range is a measure of statistical dispersion, equal to the difference between the 75th and 25th percentiles: IQR = Q3 − Q1. The algorithm calculates the lower bound (Q1 − scale * IQR) and the upper bound (Q3 + scale * IQR). Any data point below the lower bound or above the upper bound is considered an outlier.
- Steps:
Arrange your data in ascending order
Calculate Q1 (the first quartile)
Calculate Q3 (the third quartile)
Find the IQR (Q3 - Q1)
Find the lower bound = Q1 - (scale * IQR)
Find the upper bound = Q3 + (scale * IQR)
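A minimal pandas sketch of these steps is shown below. The function name iqr_flags is hypothetical, and applying the same scale factor to both bounds is an assumption. The result follows the convention of perform (1=Valid, 0=Invalid).

```python
import pandas as pd


def iqr_flags(series, scale=1.5):
    """Return 1 for values inside the IQR fences and 0 otherwise (hypothetical helper)."""
    q1 = series.quantile(0.25)  # first quartile
    q3 = series.quantile(0.75)  # third quartile
    iqr = q3 - q1
    lower = q1 - scale * iqr
    upper = q3 + scale * iqr
    # between() is inclusive, so values exactly on a fence count as valid
    return series.between(lower, upper).astype(int)


series = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 100.0])
flags = iqr_flags(series)
print(flags.tolist())
```

Sorting is not needed explicitly because quantile handles the ordering internally.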
Important
The test ignores the time-index and considers only the values
- Parameters
NAME (str) – Name of the test
DESCRIPTION (str) – Description of the test
CATEGORY (str) – Category of the test
SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)
parameters (ParameterList) – Supported parameters (default: None)
- Supported parameters:
scale (float): Scale factor (default: 1.5)
- SUPPORTED_STRUCTURES
alias of
autom8qc.core.structures.Series
- perform(data)
Performs the test and returns the probabilities.
- Raises
InvalidType – If structure of the given data is not supported
- Parameters
data (BaseStructure, pd.Series, pd.DataFrame) – Data points
- Returns
Probabilities (1=Valid, 0=Invalid)
- Return type
pd.Series
- static supported_parameters()
Returns the supported parameters.
- Returns
Supported parameters
- Return type
ParameterList
Example
# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 50, 5
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)
# Perform test
from autom8qc.qaqc.outlier import OutlierIQRTest
test = OutlierIQRTest()
test.plot(series=series, series_name="Example")
Visualization
OutlierMADTest
Class
- class autom8qc.qaqc.outlier.OutlierMADTest(threshold=3)
Bases:
autom8qc.qaqc.base.QAQCTest
This class implements a test to detect outliers using the median absolute deviation (MAD). The MAD algorithm is commonly used for this type of anomaly detection because it is highly effective and efficient. The median, or “middle” value, of all the time series at one point in time describes normal behavior for all of the time series at that timestamp. Large deviations between an individual time series and the median indicate that the series is anomalous.
- Steps:
Calculate the median of the values
Calculate the absolute differences (|points - median|)
Calculate the median of the absolute differences
Mark data points whose score is greater than the threshold (default: 3) as invalid
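The steps can be sketched with NumPy as follows. The function name mad_flags is hypothetical, and the score compared against the threshold is assumed to be the absolute deviation divided by the MAD. Other conventions, such as the modified Z-score with a factor of 0.6745, exist, and the library may use a different one.

```python
import numpy as np


def mad_flags(values, threshold=3.0):
    """Return 1 for valid and 0 for invalid values (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)  # step 1
    abs_dev = np.abs(x - median)  # step 2
    mad = np.median(abs_dev)  # step 3
    if mad == 0:
        # More than half of the values are identical: no usable scale
        return np.ones(x.size, dtype=int)
    scores = abs_dev / mad  # assumed score: deviation in units of the MAD
    return (scores <= threshold).astype(int)  # step 4


flags = mad_flags([10.0, 11.0, 12.0, 13.0, 14.0, 50.0])
print(flags.tolist())
```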
Important
The test ignores the time-index and considers only the values
- Parameters
NAME (str) – Name of the test
DESCRIPTION (str) – Description of the test
CATEGORY (str) – Category of the test
SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)
parameters (ParameterList) – Supported parameters (default: None)
- Supported parameters:
threshold (float): Threshold value (default: 3)
- SUPPORTED_STRUCTURES
alias of
autom8qc.core.structures.Series
- perform(data)
Performs the test and returns the probabilities.
- Raises
InvalidType – If structure of the given data is not supported
- Parameters
data (BaseStructure, pd.Series, pd.DataFrame) – Data points
- Returns
Probabilities (1=Valid, 0=Invalid)
- Return type
pd.Series
- static supported_parameters()
Returns the supported parameters.
- Returns
Supported parameters
- Return type
ParameterList
Example
# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 50, 5
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)
# Perform test
from autom8qc.qaqc.outlier import OutlierMADTest
test = OutlierMADTest()
test.plot(series=series, series_name="Example")
Visualization
OutlierZTest
Class
- class autom8qc.qaqc.outlier.OutlierZTest
Bases:
autom8qc.qaqc.base.QAQCTest
This class implements a test to detect outliers using the Z-score. Z-scores quantify how unusual an observation is when the data follow a normal distribution. A Z-score is the number of standard deviations by which a value lies above or below the mean. For example, a Z-score of 2 indicates that an observation is two standard deviations above the mean, while a Z-score of -2 signifies that it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean. As limits, the test uses three standard deviations (i.e., Z-scores between -3 and 3). Every value outside this range is treated as an outlier.
- Steps:
Calculate mean and standard deviation
Calculate Z-Scores
Mark data points with a Z-score smaller than -3 or greater than 3 as invalid
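These steps translate into a short NumPy sketch. The function name z_flags and its limit argument are hypothetical, and the use of the population standard deviation (ddof=0) is an assumption.

```python
import numpy as np


def z_flags(values, limit=3.0):
    """Return 1 for valid and 0 for invalid values (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()  # step 1
    std = x.std(ddof=0)  # step 1 (assumption: population standard deviation)
    z_scores = (x - mean) / std  # step 2
    return (np.abs(z_scores) <= limit).astype(int)  # step 3


# 19 values between 50 and 54 plus one extreme value
values = [50.0 + (i % 5) for i in range(19)] + [500.0]
flags = z_flags(values)
print(flags.tolist())
```

Because the mean and standard deviation are themselves pulled by extreme values, a single outlier in a very small sample may never reach a Z-score of 3.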
Note
In statistics, the standard score is the number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean value of what is being observed or measured.
Important
The test ignores the time-index and considers only the values
- Parameters
NAME (str) – Name of the test
DESCRIPTION (str) – Description of the test
CATEGORY (str) – Category of the test
SUPPORTED_STRUCTURES (tuple or BaseStructure) – Supported data structures (e.g., Series)
parameters (ParameterList) – Supported parameters
- SUPPORTED_STRUCTURES
alias of
autom8qc.core.structures.Series
- perform(data)
Performs the test and returns the probabilities.
- Raises
InvalidType – If structure of the given data is not supported
- Parameters
data (BaseStructure, pd.Series, pd.DataFrame) – Data points
- Returns
Probabilities (1=Valid, 0=Invalid)
- Return type
pd.Series
Example
# Generate sample data
import numpy as np
import pandas as pd
np.random.seed(42)
mu, sigma = 50, 5
values = np.random.normal(mu, sigma, 1000)
values[42] = 70
values[666] = 40
index = pd.date_range(start="1/1/2021", periods=1000, freq="min")
series = pd.Series(values, index=index)
# Perform test
from autom8qc.qaqc.outlier import OutlierZTest
test = OutlierZTest()
test.plot(series=series, series_name="Example")