feature_extraction
FeatureExtractor
absolute_energy()
Compute the absolute energy of a time series.
Returns:
Type | Description |
---|---|
An expression of the output | |
absolute_maximum()
Compute the absolute maximum of a time series.
Returns:
Type | Description |
---|---|
An expression of the output | |
absolute_sum_of_changes()
Compute the absolute sum of changes of a time series.
Returns:
Type | Description |
---|---|
An expression of the output | |
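The three features above reduce to short sums over the series. The following is a minimal plain-Python sketch, not the library's implementation; the `*_sketch` helper names are hypothetical, and it assumes "absolute energy" means the sum of squared values.

```python
def absolute_energy_sketch(x):
    # Assumed definition: sum of squared values.
    return sum(v * v for v in x)


def absolute_maximum_sketch(x):
    # Largest absolute value in the series.
    return max(abs(v) for v in x)


def absolute_sum_of_changes_sketch(x):
    # Sum of |x_t+1 - x_t| over consecutive pairs.
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


series = [1, -3, 2]
print(absolute_energy_sketch(series))
print(absolute_maximum_sketch(series))
print(absolute_sum_of_changes_sketch(series))
```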
autocorrelation(n_lags)
Calculate the autocorrelation for a specified lag. The autocorrelation measures the linear dependence between a time-series and a lagged version of itself.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_lags | int | The lag at which to calculate the autocorrelation. Must be a non-negative integer. | required |
Returns:
Type | Description |
---|---|
An expression of the output | |
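To make "linear dependence between a time-series and a lagged version of itself" concrete, here is a minimal sketch in plain Python. It assumes one common estimator (lagged covariance averaged over the overlapping pairs, divided by the population variance); the library's normalization may differ, and `autocorrelation_sketch` is a hypothetical name.

```python
def autocorrelation_sketch(x, n_lags):
    # Hypothetical helper; a negative lag is treated as invalid.
    if n_lags < 0:
        return None
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if n_lags == 0:
        return 1.0
    # Average product of deviations over the n - n_lags overlapping pairs.
    cov = sum(
        (x[i] - mean) * (x[i + n_lags] - mean) for i in range(n - n_lags)
    ) / (n - n_lags)
    return cov / var


print(autocorrelation_sketch([1, 2, 3, 4, 5], 1))
```

A constant series has zero variance, so this sketch would raise a division error; real code needs a guard for that case.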
benford_correlation()
Returns the correlation between the first digit distribution of the input time series and the Newcomb-Benford's Law distribution.
Returns:
Type | Description |
---|---|
An expression of the output | |
binned_entropy(bin_count=10)
Calculates the entropy of a binned histogram for a given time series. It is highly recommended that you impute the time series before calling this.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bin_count | int | The number of bins to use in the histogram. Default is 10. | 10 |
Returns:
Type | Description |
---|---|
An expression of the output | |
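As an illustration of the binned entropy idea, the sketch below builds an equal-width histogram and computes its Shannon entropy with the natural logarithm. The equal-width binning, the log base, and the helper name `binned_entropy_sketch` are assumptions, not details confirmed by this page.

```python
import math


def binned_entropy_sketch(x, bin_count=10):
    lo, hi = min(x), max(x)
    width = (hi - lo) / bin_count
    counts = [0] * bin_count
    for v in x:
        # The maximum value falls into the last bin; a constant series uses bin 0.
        idx = min(int((v - lo) / width), bin_count - 1) if width > 0 else 0
        counts[idx] += 1
    n = len(x)
    # Shannon entropy over the non-empty bins.
    return -sum((c / n) * math.log(c / n) for c in counts if c)


print(binned_entropy_sketch([0.0, 1.0], bin_count=2))
```

Missing values are not handled here, which is one reason to impute the series first.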
c3(n_lags)
Measure of non-linearity in the time series using c3 statistics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_lags | int | The lag that should be used in the calculation of the feature. | required |
Returns:
Type | Description |
---|---|
An expression of the output | |
change_quantiles(q_low, q_high, is_abs=True)
First fixes a corridor given by the quantiles ql and qh of the distribution of x. Then calculates the average, absolute value of consecutive changes of the series x inside this corridor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
q_low | float | The lower quantile of the corridor. Must be less than | required |
q_high | float | The upper quantile of the corridor. Must be greater than | required |
is_abs | bool | If True, takes absolute difference. | True |
Returns:
Type | Description |
---|---|
An expression of the output | |
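The corridor logic can be sketched in plain Python as follows. This is an illustration under stated assumptions: quantiles use linear interpolation, a change counts only when both consecutive values lie inside the closed corridor, and an empty corridor yields 0.0. The helper names are hypothetical.

```python
def _quantile(sorted_x, q):
    # Linear-interpolation quantile (an assumption; other conventions exist).
    pos = q * (len(sorted_x) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    frac = pos - lo
    return sorted_x[lo] * (1 - frac) + sorted_x[hi] * frac


def change_quantiles_sketch(x, q_low, q_high, is_abs=True):
    if q_low >= q_high:
        return None
    s = sorted(x)
    lower, upper = _quantile(s, q_low), _quantile(s, q_high)
    changes = []
    for a, b in zip(x, x[1:]):
        # Keep the change only if both endpoints are inside the corridor.
        if lower <= a <= upper and lower <= b <= upper:
            d = b - a
            changes.append(abs(d) if is_abs else d)
    return sum(changes) / len(changes) if changes else 0.0


print(change_quantiles_sketch([0, 1, 2, 3, 4, 100], 0.0, 0.5))
```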
cid_ce(normalize=False)
Computes estimate of time-series complexity[^1].
A more complex time series has more peaks and valleys. This feature is calculated by:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
normalize | bool | If True, z-normalizes the time-series before computing the feature. Default is False. | False |
Returns:
Type | Description |
---|---|
An expression of the output | |
count_above(threshold=0.0)
Calculate the percentage of values above or equal to a threshold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | float | The threshold value for comparison. | 0.0 |
Returns:
Type | Description |
---|---|
An expression of the output | |
count_above_mean()
Count the number of values that are above the mean.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
An expression of the output | |
count_below(threshold=0.0)
Calculate the percentage of values below or equal to a threshold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | float | The threshold value for comparison. | 0.0 |
Returns:
Type | Description |
---|---|
An expression of the output | |
count_below_mean()
Count the number of values that are below the mean.
Returns:
Type | Description |
---|---|
An expression of the output | |
cusum(threshold, warmup_period, drift=0.0)
Cumulative sum (CUSUM) filter to detect abrupt changes in data.
The CUSUM filter is a quality control method, designed to detect a shift in the mean value of the measured quantity away from a target value.
The general formula for the CUSUM filter can be found here: https://en.wikipedia.org/wiki/CUSUM
And the original paper that introduces it can be found here: https://www.tandfonline.com/doi/abs/10.1080/00401706.1961.10489922
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | float | The threshold for the change (x_t+1 - x_t) to be counted | required |
warmup_period | int | The number of observations which are used to estimate the mean and standard deviation of the data. | required |
drift | float | The drift coefficient for the CUSUM filter. Default value is 0. | 0.0 |
Returns:
Type | Description |
---|---|
An expression of the output | |
detrend(method='linear')
Detrends the time series by either removing a fitted linear regression or by removing the mean. This assumes that data is in order.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
method | DetrendMethod | Either | 'linear' |
Returns:
Type | Description |
---|---|
An expression representing the detrended column | |
energy_ratios(n_chunks=10)
Calculates sum of squares over the whole series for n_chunks equally segmented parts of the time-series.
All ratios for all chunks will be returned at once.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_chunks | int | The number of equally segmented parts to divide the time-series into. Default is 10. | 10 |
Returns:
Type | Description |
---|---|
An expression of the output | |
first_location_of_maximum()
Returns the first location of the maximum value of x. The position is calculated relatively to the length of x.
Returns:
Type | Description |
---|---|
An expression of the output | |
first_location_of_minimum()
Returns the first location of the minimum value of x. The position is calculated relatively to the length of x.
Returns:
Type | Description |
---|---|
An expression of the output | |
frac_diff(d, min_weight=None, window_size=None)
Compute the fractional differential of a time series.
This particular functionality is referenced in Advances in Financial Machine Learning by Marcos Lopez de Prado (2018).
For feature creation purposes, it is suggested to use the minimum value of d that removes non-stationarity from the time series. This can be found by running the augmented Dickey-Fuller test on the time series for different values of d and selecting the minimum value that makes the time series stationary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
d | float | The fractional order of the differencing operator. | required |
min_weight | float | The minimum weight to use for calculations. If specified, the window size is computed from this value and not needed. | None |
window_size | int | The window size of the fractional differencing operator. If specified, the minimum weight is not needed. | None |
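A fixed-window fractional difference can be sketched with the standard binomial-expansion weights, w_0 = 1 and w_k = -w_(k-1) * (d - k + 1) / k. The code below is an illustration only: the helper names are hypothetical, it covers only the `window_size` path, and emitting `None` for positions without a full window is an assumption about edge handling.

```python
def frac_diff_weights_sketch(d, window_size):
    # Weights of (1 - B)^d truncated to window_size terms.
    weights = [1.0]
    for k in range(1, window_size):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def frac_diff_sketch(x, d, window_size):
    w = frac_diff_weights_sketch(d, window_size)
    out = []
    for t in range(len(x)):
        if t < window_size - 1:
            out.append(None)  # not enough history for a full window
        else:
            out.append(sum(w[k] * x[t - k] for k in range(window_size)))
    return out


print(frac_diff_weights_sketch(0.5, 3))
print(frac_diff_sketch([1, 3, 6], d=1, window_size=2))
```

With d = 1 and a window of 2 the weights are [1, -1], so the result is the ordinary first difference.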
harmonic_mean()
Returns the harmonic mean of the expression
Returns:
Type | Description |
---|---|
An expression of the output | |
has_duplicate()
Check if the time-series contains any duplicate values.
Returns:
Type | Description |
---|---|
An expression of the output | |
has_duplicate_max()
Check if the time-series contains any duplicate values equal to its maximum value.
Returns:
Type | Description |
---|---|
An expression of the output | |
has_duplicate_min()
Check if the time-series contains duplicate values equal to its minimum value.
Returns:
Type | Description |
---|---|
An expression of the output | |
index_mass_quantile(q)
Calculates the relative index i of time series x where q% of the mass of x lies left of i. For example for q = 50% this feature calculator will return the mass center of the time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
q | float | The quantile. | required |
Returns:
Type | Description |
---|---|
An expression of the output | |
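One way to read "q% of the mass lies left of i" is sketched below. It assumes mass means cumulative absolute value, that q is given as a fraction in [0, 1], that the relative index is (i + 1) / len(x), and that the series has nonzero total mass. The helper name is hypothetical and the library may use a different convention.

```python
def index_mass_quantile_sketch(x, q):
    abs_x = [abs(v) for v in x]
    total = sum(abs_x)
    running = 0.0
    for i, v in enumerate(abs_x):
        running += v
        # First position where the cumulative share of mass reaches q.
        if running / total >= q:
            return (i + 1) / len(x)
    return None


print(index_mass_quantile_sketch([1, 1, 1, 1], 0.5))
print(index_mass_quantile_sketch([0, 0, 0, 4], 0.5))
```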
large_standard_deviation(ratio=0.25)
Checks if the time-series has a large standard deviation: std(x) > r * (max(X)-min(X)).
As a heuristic, the standard deviation should be a fourth of the range of the values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ratio | float | The ratio of the interval to compare with. | 0.25 |
Returns:
Type | Description |
---|---|
An expression of the output | |
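The inequality std(x) > r * (max(X)-min(X)) translates directly into code. This sketch uses the population standard deviation from the standard library, which is an assumption (the library may use the sample standard deviation); the helper name is hypothetical.

```python
import statistics


def large_standard_deviation_sketch(x, ratio=0.25):
    # True when the spread is large relative to the range of the values.
    return statistics.pstdev(x) > ratio * (max(x) - min(x))


print(large_standard_deviation_sketch([0, 0, 10, 10]))
print(large_standard_deviation_sketch([0, 0, 0, 0, 0, 0, 0, 0, 0, 10], ratio=0.5))
```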
last_location_of_maximum()
Returns the last location of the maximum value of x. The position is calculated relatively to the length of x.
Returns:
Type | Description |
---|---|
An expression of the output | |
last_location_of_minimum()
Returns the last location of the minimum value of x. The position is calculated relatively to the length of x.
Returns:
Type | Description |
---|---|
An expression of the output | |
lempel_ziv_complexity(threshold, as_ratio=True)
Calculate a complexity estimate based on the Lempel-Ziv compression algorithm. The implementation here is currently a Rust rewrite of Lilian Besson's code. Instead of returning the complexity value, we return a ratio w.r.t. the length of the input series. If null is encountered, it will be interpreted as 0 in the bit sequence.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
threshold | Union[float, Expr] | Either a number, or an expression representing a comparable quantity. If x > threshold, then it will be binarized as 1 and 0 otherwise. | required |
as_ratio | bool | If true, return the complexity divided by length of sequence | True |
Returns:
Type | Description |
---|---|
Expr | |
Reference
https://github.com/Naereen/Lempel-Ziv_Complexity/tree/master
https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv_complexity
linear_trend()
Compute the slope, intercept, and RSS of the linear trend.
Returns:
Type | Description |
---|---|
An expression of the output | |
longest_losing_streak()
Returns the longest losing streak of the time series. A loss is counted when (x_t+1 - x_t) <= 0
Returns:
Type | Description |
---|---|
An expression of the output | |
longest_streak_above(threshold)
Returns the longest streak of changes >= threshold of the time series. A change is counted when (x_t+1 - x_t) >= threshold. Note that the streaks here are about the changes for consecutive values in the time series, not the individual values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | float | The threshold value for comparison. | required |
Returns:
Type | Description |
---|---|
An expression of the output | |
longest_streak_above_mean()
Returns the length of the longest consecutive subsequence in x that is greater than the mean of x.
Returns:
Type | Description |
---|---|
An expression of the output | |
longest_streak_below(threshold)
Returns the longest streak of changes <= threshold of the time series. A change is counted when (x_t+1 - x_t) <= threshold. Note that the streaks here are about the changes for consecutive values in the time series, not the individual values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | float | The threshold value for comparison. | required |
Returns:
Type | Description |
---|---|
An expression of the output | |
longest_streak_below_mean()
Returns the length of the longest consecutive subsequence in x that is smaller than the mean of x.
Returns:
Type | Description |
---|---|
An expression of the output | |
longest_winning_streak()
Returns the longest winning streak of the time series. A win is counted when (x_t+1 - x_t) >= 0
Returns:
Type | Description |
---|---|
An expression of the output | |
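The change-based streak features (longest_streak_above, longest_streak_below, and the winning and losing streaks, which correspond to a threshold of 0) share one pattern: difference the series, then track the longest run of qualifying changes. A minimal sketch, with the hypothetical helper `longest_streak_sketch`:

```python
def longest_streak_sketch(x, above=True, threshold=0.0):
    best = current = 0
    for a, b in zip(x, x[1:]):
        change = b - a
        hit = change >= threshold if above else change <= threshold
        # Extend the current run on a hit, otherwise reset it.
        current = current + 1 if hit else 0
        best = max(best, current)
    return best


series = [1, 2, 3, 2, 5, 6, 7, 8]
print(longest_streak_sketch(series, above=True))   # winning streak
print(longest_streak_sketch(series, above=False))  # losing streak
```

Note that the streak counts changes, so a series of n values can produce a streak of at most n - 1.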
max_abs_change()
Compute the maximum absolute change from X_t to X_t+1.
Returns:
Type | Description |
---|---|
An expression of the output | |
mean_abs_change()
Compute mean absolute change.
Returns:
Type | Description |
---|---|
An expression of the output | |
mean_change()
Compute mean change.
Returns:
Type | Description |
---|---|
An expression of the output | |
mean_n_absolute_max(n_maxima)
Calculates the arithmetic mean of the n absolute maximum values of the time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_maxima | int | The number of maxima to consider. | required |
Returns:
Type | Description |
---|---|
An expression of the output | |
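A minimal sketch of this feature, assuming "n absolute maximum values" means the n largest values after taking absolute values (the helper name is hypothetical):

```python
def mean_n_absolute_max_sketch(x, n_maxima):
    # Take absolute values, keep the n largest, and average them.
    top = sorted((abs(v) for v in x), reverse=True)[:n_maxima]
    return sum(top) / len(top)


print(mean_n_absolute_max_sketch([-5, 1, 3, -2], n_maxima=2))
```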
mean_second_derivative_central()
Returns the mean value of a central approximation of the second derivative.
Returns:
Type | Description |
---|---|
An expression of the output | |
number_crossings(crossing_value=0.0)
Calculates the number of crossings of x on m, where m is the crossing value.
A crossing is defined as two sequential values where the first value is lower than m and the next is greater, or vice-versa. If you set m to zero, you will get the number of zero crossings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
crossing_value | float | The crossing value. Defaults to 0.0. | 0.0 |
Returns:
Type | Description |
---|---|
An expression of the output | |
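The crossing definition above can be sketched by marking each value as above or not above m and counting sign flips between neighbours. Treating values exactly equal to m as "not above" is an assumption, and `number_crossings_sketch` is a hypothetical name.

```python
def number_crossings_sketch(x, crossing_value=0.0):
    above = [v > crossing_value for v in x]
    # A crossing occurs wherever two sequential values sit on different sides.
    return sum(a != b for a, b in zip(above, above[1:]))


print(number_crossings_sketch([-1, 1, -1, 1]))
print(number_crossings_sketch([1, 2, 3]))
```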
number_peaks(support)
Calculates the number of peaks of at least support n in the time series x. A peak of support n is defined as a subsequence of x where a value occurs that is bigger than its n neighbours to the left and to the right.
Hence in the sequence
x = [3, 0, 0, 4, 0, 0, 13]
4 is a peak of support 1 and 2 because in the subsequences
[0, 4, 0] and [0, 0, 4, 0, 0]
4 is still the highest value. Here, 4 is not a peak of support 3 because 13 is the 3rd neighbour to the right of 4 and it is bigger than 4.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
support | int | Support of the peak | required |
Returns:
Type | Description |
---|---|
An expression of the output | |
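The worked example with x = [3, 0, 0, 4, 0, 0, 13] can be checked with a short plain-Python sketch. It assumes that positions without a full window of `support` neighbours on both sides cannot be peaks; the helper name is hypothetical.

```python
def number_peaks_sketch(x, support):
    count = 0
    # Only positions with `support` neighbours on each side are candidates.
    for i in range(support, len(x) - support):
        left = x[i - support:i]
        right = x[i + 1:i + 1 + support]
        if all(x[i] > v for v in left) and all(x[i] > v for v in right):
            count += 1
    return count


x = [3, 0, 0, 4, 0, 0, 13]
for support in (1, 2, 3):
    print(support, number_peaks_sketch(x, support))
```

Under this reading, the trailing 13 is never counted because it has no neighbours to its right.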
percent_reoccurring_points()
Returns the percentage of non-unique data points in the time series. Non-unique data points are those that occur more than once in the time series.
The percentage is calculated as follows:
# of data points occurring more than once / # of all data points
This means the ratio is normalized to the number of data points in the time series, in contrast to the percent_reoccurring_values function.
Returns:
Type | Description |
---|---|
An expression of the output | |
percent_reoccurring_values()
Returns the percentage of values that are present in the time series more than once.
The percentage is calculated as follows:
len(different values occurring more than once) / len(different values)
This means the percentage is normalized to the number of unique values in the time series, in contrast to the percent_reoccurring_points function.
Returns:
Type | Description |
---|---|
An expression of the output | |
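The difference between the two normalizations is easiest to see side by side. The sketch below follows the two formulas given above, returning fractions rather than values scaled to 100 (an assumption); the helper names are hypothetical.

```python
from collections import Counter


def percent_reoccurring_points_sketch(x):
    counts = Counter(x)
    # Data points whose value occurs more than once / all data points.
    return sum(c for c in counts.values() if c > 1) / len(x)


def percent_reoccurring_values_sketch(x):
    counts = Counter(x)
    # Distinct values occurring more than once / all distinct values.
    return sum(1 for c in counts.values() if c > 1) / len(counts)


series = [2, 2, 2, 2, 1]
print(percent_reoccurring_points_sketch(series))
print(percent_reoccurring_values_sketch(series))
```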
permutation_entropy(tau=1, n_dims=3, base=math.e)
Computes permutation entropy. It is recommended that users should impute the time series before calling this.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tau | int | The embedding time delay which controls the number of time periods between elements of each of the new column vectors. The recommended value is 1. | 1 |
n_dims | int, > 1 | The embedding dimension which controls the length of each of the new column vectors. The recommended range is 3-7. | 3 |
base | float | The base for log in the entropy computation | e |
Returns:
Type | Description |
---|---|
An expression of the output | |
range_change(percentage=True)
Returns the range (max - min) over mean of the time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
percentage | bool | Compute the percentage if set to True | True |
Returns:
Type | Description |
---|---|
An expression of the output | |
range_count(lower, upper, closed='left')
Counts the values of the input expression that lie between lower (inclusive) and upper (exclusive).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lower | float | The lower bound, inclusive | required |
upper | float | The upper bound, exclusive | required |
closed | ClosedInterval | Whether or not the boundaries should be included/excluded | 'left' |
Returns:
Type | Description |
---|---|
An expression of the output | |
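The effect of `closed` can be sketched as follows. The accepted values 'left', 'right', 'both', and 'none' are an assumption modelled on the usual interval-closure options and are not confirmed by this page; `range_count_sketch` is a hypothetical name.

```python
def range_count_sketch(x, lower, upper, closed="left"):
    def inside(v):
        if closed == "left":
            return lower <= v < upper
        if closed == "right":
            return lower < v <= upper
        if closed == "both":
            return lower <= v <= upper
        if closed == "none":
            return lower < v < upper
        raise ValueError(f"unknown closed option: {closed!r}")

    return sum(1 for v in x if inside(v))


series = [1, 2, 3, 4]
for option in ("left", "right", "both", "none"):
    print(option, range_count_sketch(series, 2, 4, closed=option))
```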
range_over_mean()
Returns the range (max - min) over mean of the time series.
Returns:
Type | Description |
---|---|
An expression of the output | |
ratio_beyond_r_sigma(ratio=0.25)
Returns the ratio of values in the series that is beyond r*std from mean on both sides.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ratio | float | The scaling factor for std | 0.25 |
Returns:
Type | Description |
---|---|
An expression of the output | |
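A minimal sketch of the "beyond r*std from mean" ratio, assuming the population standard deviation and a strict inequality (both assumptions; the helper name is hypothetical):

```python
import statistics


def ratio_beyond_r_sigma_sketch(x, ratio=0.25):
    mean = statistics.mean(x)
    std = statistics.pstdev(x)
    # Share of values farther than ratio * std from the mean, on either side.
    return sum(abs(v - mean) > ratio * std for v in x) / len(x)


print(ratio_beyond_r_sigma_sketch([0, 0, 0, 0, 10], ratio=1.0))
```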
ratio_n_unique_to_length()
Calculate the ratio of the number of unique values to the length of the time-series.
Returns:
Type | Description |
---|---|
An expression of the output | |
root_mean_square()
Calculate the root mean square.
Returns:
Type | Description |
---|---|
An expression of the output | |
streak_length_stats(above, threshold)
Returns some statistics of the length of the streaks of the time series. Note that the streaks here are about the changes for consecutive values in the time series, not the individual values.
The statistics include: min length, max length, average length, std of length, 10-percentile length, median length, 90-percentile length, and mode of the length. If input is Series, a dictionary will be returned. If input is an expression, the expression will evaluate to a struct with the fields ordered by the statistics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
above | bool | Above (>=) or below (<=) the given threshold | required |
threshold | float | The threshold for the change (x_t+1 - x_t) to be counted | required |
Returns:
Type | Description |
---|---|
An expression of the output | |
sum_reoccurring_points()
Returns the sum of all data points that are present in the time series more than once.
For example, sum_reoccurring_points(pl.Series([2, 2, 2, 2, 1])) returns 8, as 2 is a reoccurring value, so all 2's are summed up.
This is in contrast to the sum_reoccurring_values function, where each reoccurring value is only counted once.
Returns:
Type | Description |
---|---|
An expression of the output | |
sum_reoccurring_values()
Returns the sum of all values that are present in the time series more than once.
For example, sum_reoccurring_values(pl.Series([2, 2, 2, 2, 1])) returns 2, as 2 is a reoccurring value, so it is summed up with all other reoccurring values (there are none), so the result is 2.
This is in contrast to the sum_reoccurring_points function, where each reoccurring value is counted as often as it is present in the data.
Returns:
Type | Description |
---|---|
An expression of the output | |
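The two documented examples (8 and 2 for the series [2, 2, 2, 2, 1]) can be reproduced with a plain-Python sketch. The helper names are hypothetical and operate on lists rather than on pl.Series.

```python
from collections import Counter


def sum_reoccurring_points_sketch(x):
    counts = Counter(x)
    # Every occurrence of a repeated value contributes to the sum.
    return sum(v * c for v, c in counts.items() if c > 1)


def sum_reoccurring_values_sketch(x):
    counts = Counter(x)
    # Each repeated value contributes exactly once.
    return sum(v for v, c in counts.items() if c > 1)


series = [2, 2, 2, 2, 1]
print(sum_reoccurring_points_sketch(series))
print(sum_reoccurring_values_sketch(series))
```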
symmetry_looking(ratio=0.25)
Check if the distribution of x looks symmetric.
A distribution is considered symmetric if: | mean(X)-median(X) | < ratio * (max(X)-min(X))
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ratio | float | Multiplier on distance between max and min. | 0.25 |
Returns:
Type | Description |
---|---|
An expression of the output | |
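The symmetry criterion | mean(X)-median(X) | < ratio * (max(X)-min(X)) maps directly onto the standard library. A minimal sketch with a hypothetical helper name:

```python
import statistics


def symmetry_looking_sketch(x, ratio=0.25):
    gap = abs(statistics.mean(x) - statistics.median(x))
    # Symmetric-looking when mean and median are close relative to the range.
    return gap < ratio * (max(x) - min(x))


print(symmetry_looking_sketch([1, 2, 3, 4, 5]))
print(symmetry_looking_sketch([0, 0, 0, 0, 100], ratio=0.1))
```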
time_reversal_asymmetry_statistic(n_lags)
Returns the time reversal asymmetry statistic.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_lags | int | The lag that should be used in the calculation of the feature. | required |
Returns:
Type | Description |
---|---|
An expression of the output | |
var_gt_std(ddof=1)
Is the variance >= std? In other words, is var >= 1?
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ddof | int | Delta Degrees of Freedom used when computing var/std. | 1 |
Returns:
Type | Description |
---|---|
An expression of the output | |
variation_coefficient()
Calculate the coefficient of variation (CV).
Returns:
Type | Description |
---|---|
An expression of the output | |
absolute_energy(x)
Compute the absolute energy of a time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
float \| Expr | |
absolute_maximum(x)
Compute the absolute maximum of a time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
float \| Expr | |
absolute_sum_of_changes(x)
Compute the absolute sum of changes of a time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
float \| Expr | |
approximate_entropy(x, run_length, filtering_level, scale_by_std=True)
Approximate sample entropies of a time series given the filtering level. This only works for Series input right now.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
run_length | int | Length of compared run of data. This is | required |
filtering_level | float | Filtering level, must be positive. This is | required |
scale_by_std | bool | Whether to scale filter level by std of data. In most applications, this is the default behavior, but not in some other cases. | True |
Returns:
Type | Description |
---|---|
float | |
augmented_dickey_fuller(x, n_lags)
Calculates the Augmented Dickey-Fuller (ADF) test statistic. This only works for Series input right now.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
n_lags | int | The number of lags to include in the test. | required |
Returns:
Type | Description |
---|---|
float | |
autocorrelation(x, n_lags)
Calculate the autocorrelation for a specified lag.
The autocorrelation measures the linear dependence between a time-series and a lagged version of itself.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
n_lags | int | The lag at which to calculate the autocorrelation. Must be a non-negative integer. | required |
Returns:
Type | Description |
---|---|
float \| Expr | Autocorrelation at the given lag. Returns None, if lag is less than 0. |
autoregressive_coefficients(x, n_lags)
Computes coefficients for an AR(n_lags) process. This only works for Series input right now. Caution: any null value in the Series will be replaced by 0!
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
n_lags | int | The number of lags in the autoregressive process. | required |
Returns:
Type | Description |
---|---|
list of float | |
benford_correlation(x)
Returns the correlation between the first digit distribution of the input time series and the Newcomb-Benford's Law distribution.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
float \| Expr | |
binned_entropy(x, bin_count=10)
Calculates the entropy of a binned histogram for a given time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
bin_count | int | The number of bins to use in the histogram. Default is 10. | 10 |
Returns:
Type | Description |
---|---|
float \| Expr | |
c3(x, n_lags)
Measure of non-linearity in the time series using c3 statistics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
n_lags | int | The lag that should be used in the calculation of the feature. | required |
Returns:
Type | Description |
---|---|
float \| Expr | |
change_quantiles(x, q_low, q_high, is_abs)
First fixes a corridor given by the quantiles ql and qh of the distribution of x. It will return a list of changes coming from consecutive values that both lie within the quantile range. The user may optionally get the absolute value of the changes, and compute stats from these changes. If q_low >= q_high, it will return null.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | A single time-series. | required |
q_low | float | The lower quantile of the corridor. Must be less than | required |
q_high | float | The upper quantile of the corridor. Must be greater than | required |
is_abs | bool | If True, takes absolute difference. | required |
Returns:
Type | Description |
---|---|
list of float \| Expr | |
cid_ce(x, normalize=False)
Computes estimate of time-series complexity[^1].
A more complex time series has more peaks and valleys. This feature is calculated by:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | A single time-series. | required |
normalize | bool | If True, z-normalizes the time-series before computing the feature. Default is False. | False |
Returns:
Type | Description |
---|---|
float \| Expr | |
count_above(x, threshold=0.0)
Calculate the percentage of values above or equal to a threshold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
threshold | float | The threshold value for comparison. | 0.0 |
Returns:
Type | Description |
---|---|
float \| Expr | |
count_above_mean(x)
Count the number of values that are above the mean.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
int \| Expr | |
count_below(x, threshold=0.0)
Calculate the percentage of values below or equal to a threshold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
threshold | float | The threshold value for comparison. | 0.0 |
Returns:
Type | Description |
---|---|
float \| Expr | |
count_below_mean(x)
Count the number of values that are below the mean.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
int \| Expr | |
cwt_coefficients(x, widths=(2, 5, 10, 20), n_coefficients=14)
Calculates a Continuous wavelet transform for the Ricker wavelet.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
widths | Sequence[int] | The widths of the Ricker wavelet to use for the CWT. Default is (2, 5, 10, 20). | (2, 5, 10, 20) |
n_coefficients | int | The number of CWT coefficients to return. Default is 14. | 14 |
Returns:
Type | Description |
---|---|
list of float | |
energy_ratios(x, n_chunks=10)
Calculates sum of squares over the whole series for n_chunks equally segmented parts of the time-series.
E.g. if n_chunks = 10, values are [0, 1, 2, 3, .. , 999], the first chunk will be [0, .. , 99].
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | list of float | The time-series to be segmented and analyzed. | required |
n_chunks | int | The number of equally segmented parts to divide the time-series into. Default is 10. | 10 |
Returns:
Type | Description |
---|---|
list of float \| Expr | |
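The chunking example above ([0, .. , 999] split into 10 chunks of 100) can be sketched as follows. This illustration assumes each ratio is a chunk's sum of squares divided by the sum of squares of the whole series, and that the chunk size is the length divided by n_chunks, rounded up; for lengths that do not divide evenly this can yield fewer than n_chunks chunks. The helper name is hypothetical.

```python
def energy_ratios_sketch(x, n_chunks=10):
    total = sum(v * v for v in x)
    size = -(-len(x) // n_chunks)  # ceiling division
    chunks = [x[i:i + size] for i in range(0, len(x), size)]
    # Share of the total energy carried by each chunk.
    return [sum(v * v for v in chunk) / total for chunk in chunks]


print(energy_ratios_sketch([1, 1, 1, 1], n_chunks=2))
print(len(energy_ratios_sketch(list(range(1000)), n_chunks=10)))
```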
fft_coefficients(x)
Calculates Fourier coefficients and phase angles of the 1-D discrete Fourier Transform. This only works for Series input right now.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time series. | required |
n_threads | int | Number of threads to use. If None, uses all threads available. Defaults to None. | required |
Returns:
Type | Description |
---|---|
dict of list of floats \| Expr | |
first_location_of_maximum(x)
Returns the first location of the maximum value of x. The position is calculated relatively to the length of x.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
float \| Expr | |
first_location_of_minimum(x)
Returns the first location of the minimum value of x. The position is calculated relatively to the length of x.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
float \| Expr | |
fourier_entropy(x, n_bins=10)
Calculate the Fourier entropy of a time series. This only works for Series input right now.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
n_bins | int | The number of bins to use for the entropy calculation. Default is 10. | 10 |
Returns:
Type | Description |
---|---|
float | |
friedrich_coefficients(x, polynomial_order=3, n_quantiles=30)
Calculate the Friedrich coefficients of a time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | TIME_SERIES_T | The time series to calculate the Friedrich coefficients of. | required |
polynomial_order | int | The order of the polynomial to fit to the quantile means. Default is 3. | 3 |
n_quantiles | int | The number of quantiles to use for the calculation. Default is 30. | 30 |
Returns:
Type | Description |
---|---|
list of float | |
harmonic_mean(x)
Returns the harmonic mean of the time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time series. | required |
Returns:
Type | Description |
---|---|
float \| Expr | |
has_duplicate(x)
Check if the time-series contains any duplicate values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
bool \| Expr | |
has_duplicate_max(x)
Check if the time-series contains any duplicate values equal to its maximum value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
bool \| Expr | |
has_duplicate_min(x)
Check if the time-series contains duplicate values equal to its minimum value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
bool \| Expr | |
index_mass_quantile(x, q)
Calculates the relative index i of time series x where q% of the mass of x lies left of i. For example for q = 50% this feature calculator will return the mass center of the time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
q | float | The quantile. | required |
Returns:
Type | Description |
---|---|
float \| Expr | |
large_standard_deviation(x, ratio=0.25)
Checks if the time-series has a large standard deviation: std(x) > r * (max(X)-min(X)).
As a heuristic, the standard deviation should be a fourth of the range of the values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
ratio | float | The ratio of the interval to compare with. | 0.25 |
Returns:
Type | Description |
---|---|
bool \| Expr | |
last_location_of_maximum(x)
Returns the last location of the maximum value of x. The position is calculated relatively to the length of x.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
float \| Expr | |
last_location_of_minimum(x)
Returns the last location of the minimum value of x. The position is calculated relatively to the length of x.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Expr \| Series | Input time-series. | required |
Returns:
Type | Description |
---|---|
float \| Expr | |
lempel_ziv_complexity(x, threshold, as_ratio=True)
Calculate a complexity estimate based on the Lempel-Ziv compression algorithm. The implementation here is currently a Rust rewrite of Lilian Besson's code. See the reference section below. Instead of returning the complexity value, we return a ratio w.r.t. the length of the input series. If null is encountered, it will be interpreted as 0 in the bit sequence.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
threshold
|
Union[float, Expr]
|
Either a number, or an expression representing a comparable quantity. Values with x > threshold are binarized as 1, and all others as 0. If x is eager, then threshold must also be eager. |
required |
as_ratio
|
bool
|
If true, return the complexity / length of sequence |
True
|
Returns:
Type | Description |
---|---|
float
|
|
Reference
https://github.com/Naereen/Lempel-Ziv_Complexity/tree/master https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv_complexity
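The binarize-then-count procedure described above can be sketched in pure Python. This follows the set-based phrase counting used in the referenced Python implementation; it is not the library's Rust code, and the function name is hypothetical. `None` stands in for null here.

```python
def lempel_ziv_complexity_sketch(x, threshold, as_ratio=True):
    """Count distinct phrases in the binarized series, scanning left to right."""
    # Binarize: 1 if the value exceeds the threshold, else 0 (nulls become 0).
    bits = tuple(1 if (v is not None and v > threshold) else 0 for v in x)
    seen = set()
    start, width = 0, 1
    while start + width <= len(bits):
        piece = bits[start:start + width]
        if piece in seen:
            width += 1       # phrase already known: extend it by one bit
        else:
            seen.add(piece)  # new phrase: record it and start after it
            start += width
            width = 1
    complexity = len(seen)
    return complexity / len(bits) if as_ratio else complexity


pattern = "1001111011000010"
values = [5.0 if c == "1" else 1.0 for c in pattern]
print(lempel_ziv_complexity_sketch(values, threshold=3.0, as_ratio=False))  # 8
print(lempel_ziv_complexity_sketch(values, threshold=3.0))                  # 0.5
```

The sixteen bits split into eight phrases (1, 0, 01, 11, 10, 110, 00, 010), so the ratio is 8 / 16.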
linear_trend(x)
Compute the slope, intercept, and RSS of the linear trend.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
Returns:
Type | Description |
---|---|
Mapping[str, float] | Expr
|
|
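For intuition, the three quantities can be obtained from an ordinary least-squares fit of the values against their positions 0, 1, ..., n-1. The sketch below is an assumption about the setup (in particular the choice of regressor and the key names `slope`, `intercept`, `rss`); the function name is hypothetical.

```python
def linear_trend_sketch(x):
    """Least-squares line through (0, x[0]), (1, x[1]), ... plus its RSS."""
    n = len(x)
    positions = range(n)  # assumption: the regressor is the row index
    t_mean = (n - 1) / 2
    x_mean = sum(x) / n
    sxx = sum((t - t_mean) ** 2 for t in positions)
    sxy = sum((t - t_mean) * (v - x_mean) for t, v in zip(positions, x))
    slope = sxy / sxx
    intercept = x_mean - slope * t_mean
    # Residual sum of squares of the fitted line.
    rss = sum((v - (intercept + slope * t)) ** 2 for t, v in zip(positions, x))
    return {"slope": slope, "intercept": intercept, "rss": rss}


print(linear_trend_sketch([1, 3, 5, 7]))
```

For the perfectly linear series `[1, 3, 5, 7]` the slope is 2, the intercept is 1, and the RSS is 0.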
longest_losing_streak(x)
Returns the longest losing streak of the time series. A loss is counted when (x_t+1 - x_t) <= 0
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time series. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
longest_streak_above(x, threshold)
Returns the longest streak of changes >= threshold of the time series. A change is counted when (x_t+1 - x_t) >= threshold. Note that the streaks here are about the changes for consecutive values in the time series, not the individual values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time series. |
required |
threshold
|
float
|
The threshold value for comparison. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
longest_streak_above_mean(x)
Returns the length of the longest consecutive subsequence in x that is > mean of x. If all values in x are null, 0 will be returned. Note: this does not measure consecutive changes in time series, only counts the streak based on the original time series, not the differences.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
Returns:
Type | Description |
---|---|
int | Expr
|
|
longest_streak_below(x, threshold)
Returns the longest streak of changes <= threshold of the time series. A change is counted when (x_t+1 - x_t) <= threshold. Note that the streaks here are about the changes for consecutive values in the time series, not the individual values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time series. |
required |
threshold
|
float
|
The threshold value for comparison. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
longest_streak_below_mean(x)
Returns the length of the longest consecutive subsequence in x that is < mean of x. If all values in x are null, 0 will be returned. Note: this does not measure consecutive changes in time series, only counts the streak based on the original time series, not the differences.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
Returns:
Type | Description |
---|---|
int | Expr
|
|
longest_winning_streak(x)
Returns the longest winning streak of the time series. A win is counted when (x_t+1 - x_t) >= 0
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time series. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
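The streak functions above fall into two families: those that look at consecutive changes (x_t+1 - x_t) and those that look at the values themselves relative to the mean. A minimal sketch of one function from each family makes the difference visible; all names here are hypothetical and the code is illustrative, not the library's implementation.

```python
def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def longest_streak_above_sketch(x, threshold):
    # Change-based: a streak counts consecutive differences >= threshold.
    changes = [b - a for a, b in zip(x, x[1:])]
    return longest_run(c >= threshold for c in changes)


def longest_streak_above_mean_sketch(x):
    # Value-based: a streak counts consecutive values > mean(x).
    mean = sum(x) / len(x)
    return longest_run(v > mean for v in x)


series = [5, 6, 7, 1, 2, 3, 4, 9]
print(longest_streak_above_sketch(series, threshold=1))  # 4
print(longest_streak_above_mean_sketch(series))          # 3
```

The same series gives different answers: the four rises from 1 to 9 form the longest change-based streak, while the longest run of values above the mean (4.625) is the opening 5, 6, 7. Under the definitions given above, a winning streak corresponds to the change-based version with a threshold of 0.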
max_abs_change(x)
Compute the maximum absolute change from x_t to x_t+1.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
A single time-series. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
mean_abs_change(x)
Compute mean absolute change.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
A single time-series. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
mean_change(x)
Compute mean change.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
A single time-series. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
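The three change-based features above all derive from the first differences x_t+1 - x_t. A compact sketch, with hypothetical function names:

```python
def first_differences(x):
    return [b - a for a, b in zip(x, x[1:])]


def max_abs_change_sketch(x):
    return max(abs(d) for d in first_differences(x))


def mean_abs_change_sketch(x):
    diffs = first_differences(x)
    return sum(abs(d) for d in diffs) / len(diffs)


def mean_change_sketch(x):
    diffs = first_differences(x)
    # The signed differences telescope, so this equals (x[-1] - x[0]) / (n - 1).
    return sum(diffs) / len(diffs)


series = [1, 4, 2, 7]            # differences: 3, -2, 5
print(max_abs_change_sketch(series))   # 5
print(mean_abs_change_sketch(series))  # 10 / 3
print(mean_change_sketch(series))      # 2.0
```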
mean_n_absolute_max(x, n_maxima)
Calculates the arithmetic mean of the n absolute maximum values of the time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
n_maxima
|
int
|
The number of maxima to consider. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
mean_second_derivative_central(x)
Returns the mean value of a central approximation of the second derivative.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Series
|
A time series to calculate the feature of. |
required |
Returns:
Type | Description |
---|---|
Series
|
|
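One common definition of this feature, used by tsfresh, averages the terms 0.5 * (x[i+2] - 2 * x[i+1] + x[i]). Whether this library uses exactly the same scaling is an assumption. The sketch below (hypothetical names) also shows that the sum telescopes into a closed form that only needs the first two and last two values.

```python
def mean_second_derivative_central_sketch(x):
    # assumption: the 1/2 factor follows the tsfresh convention
    terms = [0.5 * (x[i + 2] - 2 * x[i + 1] + x[i]) for i in range(len(x) - 2)]
    return sum(terms) / len(terms)


def mean_second_derivative_closed_form(x):
    # The second differences telescope, leaving only the boundary values.
    return (x[-1] - x[-2] - x[1] + x[0]) / (2 * (len(x) - 2))


squares = [0, 1, 4, 9, 16]
print(mean_second_derivative_central_sketch(squares))  # 1.0
print(mean_second_derivative_closed_form(squares))     # 1.0
```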
number_crossings(x, crossing_value=0.0)
Calculates the number of crossings of x on m, where m is the crossing value.
A crossing is defined as two sequential values where the first value is lower than m and the next is greater, or vice-versa. If you set m to zero, you will get the number of zero crossings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
A single time-series. |
required |
crossing_value
|
float
|
The crossing value. Defaults to 0.0. |
0.0
|
Returns:
Type | Description |
---|---|
float | Expr
|
|
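The crossing definition above translates directly into a pairwise scan. This sketch (hypothetical name) follows the text strictly, so a value exactly equal to m is neither below nor above it; the library's handling of that edge case may differ.

```python
def number_crossings_sketch(x, crossing_value=0.0):
    m = crossing_value
    # A crossing: one value strictly below m, the next strictly above, or vice versa.
    return sum(1 for a, b in zip(x, x[1:]) if (a < m < b) or (a > m > b))


series = [-1, 1, -1, 2, 3]
print(number_crossings_sketch(series))                      # 3
print(number_crossings_sketch(series, crossing_value=2.5))  # 1
```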
number_cwt_peaks(x, max_width=5)
Number of different peaks in x.
To estimate the number of peaks, x is smoothed by a Ricker wavelet for widths ranging from 1 to max_width. This feature calculator returns the number of peaks that occur at enough width scales and with a sufficiently high signal-to-noise ratio (SNR).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Series
|
A single time-series. |
required |
max_width
|
int
|
Maximum width to consider. |
5
|
Returns:
Type | Description |
---|---|
float
|
|
number_peaks(x, support)
Calculates the number of peaks of at least support n in the time series x. A peak of support n is defined as a subsequence of x where a value occurs, which is bigger than its n neighbours to the left and to the right.
Hence in the sequence
x = [3, 0, 0, 4, 0, 0, 13]
4 is a peak of support 1 and 2 because in the subsequences
[0, 4, 0] and [0, 0, 4, 0, 0]
4 is still the highest value. Here, 4 is not a peak of support 3 because 13 is the 3rd neighbour to the right of 4 and it is bigger than 4.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
support
|
int
|
Support of the peak |
required |
Returns:
Type | Description |
---|---|
int | Expr
|
|
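The worked example above can be reproduced with a short sketch. The function name is hypothetical, and one detail is an assumption: values that do not have `support` neighbours on both sides (such as the trailing 13) are never counted as peaks.

```python
def number_peaks_sketch(x, support):
    count = 0
    # assumption: only positions with `support` neighbours on each side qualify
    for i in range(support, len(x) - support):
        offsets = range(1, support + 1)
        if all(x[i] > x[i - k] and x[i] > x[i + k] for k in offsets):
            count += 1
    return count


x = [3, 0, 0, 4, 0, 0, 13]
print(number_peaks_sketch(x, 1))  # 1
print(number_peaks_sketch(x, 2))  # 1
print(number_peaks_sketch(x, 3))  # 0
```

The value 4 is counted for supports 1 and 2, and drops out at support 3 because of the 13 three positions to its right.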
percent_reoccurring_points(x)
Returns the percentage of non-unique data points in the time series. Non-unique data points are those that occur more than once in the time series.
The percentage is calculated as follows:
# of data points occurring more than once / # of all data points
This means the ratio is normalized to the number of data points in the time series, in contrast to the
percent_reoccurring_values
function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
Returns:
Type | Description |
---|---|
float
|
|
percent_reoccurring_values(x)
Returns the percentage of values that are present in the time series more than once.
The percentage is calculated as follows:
# (distinct values occurring more than once) / # of distinct values
This means the percentage is normalized to the number of unique values in the time series, in contrast to the
percent_reoccurring_points
function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
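The difference between the two normalizations above is easiest to see side by side. The following sketch uses hypothetical function names and `collections.Counter` to apply both formulas to the same series.

```python
from collections import Counter


def percent_reoccurring_points_sketch(x):
    counts = Counter(x)
    # Data points whose value occurs more than once, over all data points.
    repeated_points = sum(c for c in counts.values() if c > 1)
    return repeated_points / len(x)


def percent_reoccurring_values_sketch(x):
    counts = Counter(x)
    # Distinct values occurring more than once, over all distinct values.
    repeated_values = sum(1 for c in counts.values() if c > 1)
    return repeated_values / len(counts)


series = [1, 1, 2, 3, 4]
print(percent_reoccurring_points_sketch(series))  # 0.4
print(percent_reoccurring_values_sketch(series))  # 0.25
```

Two of the five points carry a repeated value (2 / 5), while only one of the four distinct values is repeated (1 / 4).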
permutation_entropy(x, tau=1, n_dims=3, base=math.e)
Computes permutation entropy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
tau
|
int
|
The embedding time delay which controls the number of time periods between elements of each of the new column vectors. |
1
|
n_dims
|
int, > 1
|
The embedding dimension which controls the length of each of the new column vectors |
3
|
base
|
float
|
The base for log in the entropy computation |
e
|
Returns:
Type | Description |
---|---|
float | Expr
|
|
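To show how tau, n_dims and base interact, here is a minimal sketch of the usual permutation-entropy construction: build delayed windows, reduce each to its ordinal pattern, and take the Shannon entropy of the pattern frequencies. The function name is hypothetical, and the sketch assumes no normalization and a simple stable tie-breaking rule; the library may differ on both.

```python
import math
from collections import Counter


def permutation_entropy_sketch(x, tau=1, n_dims=3, base=math.e):
    span = (n_dims - 1) * tau  # distance from first to last element of a window
    patterns = Counter()
    for start in range(len(x) - span):
        window = [x[start + k * tau] for k in range(n_dims)]
        # Ordinal pattern: the index order that sorts the window (ties keep order).
        pattern = tuple(sorted(range(n_dims), key=window.__getitem__))
        patterns[pattern] += 1
    total = sum(patterns.values())
    # assumption: plain Shannon entropy of pattern frequencies, not normalized
    return -sum((c / total) * math.log(c / total, base) for c in patterns.values())


print(permutation_entropy_sketch([1, 2, 3, 1, 2, 3], base=2))
```

The four windows of `[1, 2, 3, 1, 2, 3]` yield three distinct patterns with frequencies 1/2, 1/4 and 1/4, which gives 1.5 bits. A strictly increasing series produces a single pattern and therefore an entropy of zero.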
range_change(x, percentage=True)
Returns the maximum value range. If percentage is true, will compute (max - min) / min, which only makes sense when x is always positive.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time series. |
required |
percentage
|
bool
|
Compute the percentage if set to True. |
True
|
Returns:
Type | Description |
---|---|
float | Expr
|
|
range_count(x, lower, upper, closed='left')
Counts the values of x that lie between lower and upper. With the default closed='left', lower is inclusive and upper is exclusive.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
lower
|
float
|
The lower bound, inclusive |
required |
upper
|
float
|
The upper bound, exclusive |
required |
closed
|
ClosedInterval
|
Whether or not the boundaries should be included/excluded |
'left'
|
Returns:
Type | Description |
---|---|
int | Expr
|
|
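The effect of the closed parameter can be illustrated with a small sketch. The function name is hypothetical, and the accepted values 'left', 'right', 'both' and 'none' are an assumption based on the usual meaning of a closed-interval option.

```python
def range_count_sketch(x, lower, upper, closed="left"):
    # assumption: closed takes one of "left", "right", "both", "none"
    include_lower = closed in ("left", "both")
    include_upper = closed in ("right", "both")

    def inside(v):
        above = v >= lower if include_lower else v > lower
        below = v <= upper if include_upper else v < upper
        return above and below

    return sum(1 for v in x if inside(v))


series = [1, 2, 3, 4, 5]
print(range_count_sketch(series, 2, 4))                 # 2  (2 and 3)
print(range_count_sketch(series, 2, 4, closed="both"))  # 3  (2, 3 and 4)
print(range_count_sketch(series, 2, 4, closed="none"))  # 1  (3 only)
```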
range_over_mean(x)
Returns the range (max - min) over mean of the time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time series. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
ratio_beyond_r_sigma(x, ratio=0.25)
Returns the ratio of values in the series that lie more than ratio * std away from the mean, on either side.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
ratio
|
float
|
The scaling factor for std |
0.25
|
Returns:
Type | Description |
---|---|
float | Expr
|
|
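A minimal sketch of this feature, under the assumption that std is the sample standard deviation (ddof=1) and that "beyond" means strictly greater. The function name is hypothetical.

```python
import statistics


def ratio_beyond_r_sigma_sketch(x, ratio=0.25):
    mean = statistics.fmean(x)
    std = statistics.stdev(x)  # assumption: sample standard deviation (ddof=1)
    beyond = sum(1 for v in x if abs(v - mean) > ratio * std)
    return beyond / len(x)


series = [0, 0, 0, 0, 10]
print(ratio_beyond_r_sigma_sketch(series, ratio=1.0))  # 0.2
print(ratio_beyond_r_sigma_sketch(series))             # 1.0
```

With ratio=1.0 only the outlier 10 is more than one standard deviation from the mean; with the default 0.25 every point is.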
ratio_n_unique_to_length(x)
Calculate the ratio of the number of unique values to the length of the time-series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
root_mean_square(x)
Calculate the root mean square.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
sample_entropy(x, ratio=0.2, m=2)
Calculate the sample entropy of a time series. This only works for Series input right now.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
The input time series. |
required |
ratio
|
float
|
The tolerance parameter. Default is 0.2. |
0.2
|
m
|
int
|
Length of a run of data. Most common run length is 2. |
2
|
Returns:
Type | Description |
---|---|
float | Expr
|
|
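For readers unfamiliar with the roles of ratio and m, the following is a sketch of the textbook sample-entropy definition, not the library's implementation. Assumptions: the tolerance is ratio times the sample standard deviation, templates are compared with the Chebyshev (maximum) distance, and self-matches are excluded. The function name is hypothetical, and the sketch does not guard against the case where no matches exist.

```python
import math
import statistics


def sample_entropy_sketch(x, ratio=0.2, m=2):
    n = len(x)
    tolerance = ratio * statistics.stdev(x)  # assumption: tolerance = ratio * std

    def count_matches(length):
        # Use the same number of templates (n - m) for both lengths.
        templates = [x[i:i + length] for i in range(n - m)]
        count = 0
        for i, a in enumerate(templates):
            for j, b in enumerate(templates):
                if i != j and max(abs(p - q) for p, q in zip(a, b)) <= tolerance:
                    count += 1
        return count

    # Negative log of the ratio of (m + 1)-length matches to m-length matches.
    return -math.log(count_matches(m + 1) / count_matches(m))


periodic = [1, 2] * 5
print(abs(sample_entropy_sketch(periodic)))  # 0.0
```

A perfectly periodic series is fully predictable: every pair of templates that matches at length m still matches at length m + 1, so the entropy is zero. This quadratic-time loop is only meant for small inputs.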
spkt_welch_density(x, n_coeffs=None)
This estimates the cross power spectral density of the time series x at different frequencies. This only works for Series input right now.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
The input time series. |
required |
n_coeffs
|
Optional[int]
|
The number of coefficients you want to take. If none, will take all, which will be a list as long as the input time series. |
None
|
Returns:
Type | Description |
---|---|
list of floats
|
|
streak_length_stats(x, above, threshold)
Returns some statistics of the length of the streaks of the time series. Note that the streaks here are about the changes for consecutive values in the time series, not the individual values.
The statistics include: min length, max length, average length, std of length, 10-percentile length, median length, 90-percentile length, and mode of the length. If input is Series, a dictionary will be returned. If input is an expression, the expression will evaluate to a struct with the fields ordered by the statistics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time series. |
required |
above
|
bool
|
Above (>=) or below (<=) the given threshold |
required |
threshold
|
float
|
The threshold for the change (x_t+1 - x_t) to be counted |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
sum_reoccurring_points(x)
Returns the sum of all data points that are present in the time series more than once.
For example, sum_reoccurring_points(pl.Series([2, 2, 2, 2, 1]))
returns 8, as 2 is a reoccurring value, so all 2's
are summed up.
This is in contrast to the sum_reoccurring_values
function, where each reoccurring value is only counted once.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
sum_reoccurring_values(x)
Returns the sum of all values that are present in the time series more than once.
For example, sum_reoccurring_values(pl.Series([2, 2, 2, 2, 1]))
returns 2, as 2 is a reoccurring value, so it is
summed up with all other reoccurring values (there are none), so the result is 2.
This is in contrast to the sum_reoccurring_points
function, where each reoccurring value is counted as often as it is present in the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time-series. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
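The two examples above, for sum_reoccurring_points and sum_reoccurring_values, can be reproduced with plain Python lists. The function names below are hypothetical stand-ins, not the library functions.

```python
from collections import Counter


def sum_reoccurring_points_sketch(x):
    counts = Counter(x)
    # Every occurrence of a repeated value contributes to the sum.
    return sum(value * c for value, c in counts.items() if c > 1)


def sum_reoccurring_values_sketch(x):
    counts = Counter(x)
    # Each repeated value contributes exactly once.
    return sum(value for value, c in counts.items() if c > 1)


series = [2, 2, 2, 2, 1]
print(sum_reoccurring_points_sketch(series))  # 8
print(sum_reoccurring_values_sketch(series))  # 2
```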
symmetry_looking(x, ratio=0.25)
Check if the distribution of x looks symmetric.
A distribution is considered symmetric if: | mean(X)-median(X) | < ratio * (max(X)-min(X))
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Series
|
Input time-series. |
required |
ratio
|
float
|
Multiplier on distance between max and min. |
0.25
|
Returns:
Type | Description |
---|---|
bool | Expr
|
|
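The symmetry criterion above is a direct comparison of two numbers, as this short sketch with a hypothetical function name shows.

```python
import statistics


def symmetry_looking_sketch(x, ratio=0.25):
    gap = abs(statistics.fmean(x) - statistics.median(x))
    return gap < ratio * (max(x) - min(x))


print(symmetry_looking_sketch([1, 2, 3, 4, 5]))                # True
print(symmetry_looking_sketch([1, 1, 1, 1, 100], ratio=0.1))   # False
```

In the second series the mean (20.8) is pulled far from the median (1) by a single outlier, and the gap of 19.8 exceeds 0.1 times the range of 99.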
time_reversal_asymmetry_statistic(x, n_lags)
Returns the time reversal asymmetry statistic.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Series
|
Input time-series. |
required |
n_lags
|
int
|
The lag that should be used in the calculation of the feature. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|
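The description above does not spell out the formula. A widely used definition, the one in tsfresh, averages x[i + 2*lag]^2 * x[i + lag] - x[i + lag] * x[i]^2 over all valid i; treating this as the definition used here is an assumption. A sketch with a hypothetical function name:

```python
def time_reversal_asymmetry_statistic_sketch(x, n_lags):
    lag = n_lags
    n_terms = len(x) - 2 * lag
    # assumption: tsfresh-style definition of the statistic
    terms = [
        x[i + 2 * lag] ** 2 * x[i + lag] - x[i + lag] * x[i] ** 2
        for i in range(n_terms)
    ]
    return sum(terms) / n_terms


print(time_reversal_asymmetry_statistic_sketch([1, 2, 3, 4], n_lags=1))     # 26.0
print(time_reversal_asymmetry_statistic_sketch([3, 3, 3, 3, 3], n_lags=1))  # 0.0
```

A constant series has no preferred direction in time, so the statistic is zero. The rising series gives (16 + 36) / 2 = 26.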
var_gt_std(x, ddof=1)
Is the variance >= std? In other words, is var >= 1?
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time series. |
required |
ddof
|
int
|
Delta Degrees of Freedom used when computing var. |
1
|
Returns:
Type | Description |
---|---|
bool | Expr
|
|
variation_coefficient(x)
Calculate the coefficient of variation (CV).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x
|
Expr | Series
|
Input time series. |
required |
Returns:
Type | Description |
---|---|
float | Expr
|
|