pycvi.cluster

Generate clusterings and compute clustering-related information.

Main functions

The main functions of this module are:

pycvi.cluster.generate_all_clusterings(), which generates all clusterings for a given range of numbers of clusters \(k\).
pycvi.cluster.compute_center(), which computes the center of a cluster.
pycvi.cluster.compute_centers(), which computes the centers of all clusters.
pycvi.cluster.get_clustering(), which converts an array of predicted labels, one per datapoint (sklearn-style clustering encoding), to a list of datapoint indices for each cluster (PyCVI-style clustering encoding).

Functions

`compute_center`(cluster[, keepdims, dist_kwargs])	Compute the center of a cluster.
`compute_centers`(X[, clusters, keepdims, ...])	Compute the centers of all clusters.
`generate_all_clusterings`(data, model_class)	Generate all clusterings for the given data and clustering model.
`generate_uniform`(data[, zero_type, N_zero, rng])	Generate N_zero samples from a uniform distribution based on data.
`get_clustering`(y)	Get a list of clusters with indices based on labels.
`prepare_data`(X[, DTW, window, transformer, ...])	Data to be used for computing clusters and CVIs
`sliding_window`(T, w)	Compute information related to the sliding windows of time-series.

pycvi.cluster.compute_center(cluster: numpy.ndarray, keepdims: bool = False, dist_kwargs: dict = {}) → numpy.ndarray

Compute the center of a cluster.

For non-time-series data, this is simply the average of all datapoints in the given cluster. For time-series data with DTW as the distance measure, the cluster center is instead the DBA (DTW barycentric average) as defined by Petitjean et al. [DBA]. In this case, additional parameters for computing DTW can be passed in dist_kwargs, as described in aeon.distances.dtw_pairwise_distance. By default, {"window": 0.2} is used.

Note that the “N” dimension is not included in the result.

[DBA]

F. Petitjean, A. Ketterlin, and P. Gançarski, “A global averaging method for dynamic time warping, with applications to clustering,” Pattern Recognition, vol. 44, pp. 678–693, Mar. 2011.

Parameters:

cluster (np.ndarray, shape (N, d*w_t) or (N, w_t, d) if DTW is used) – Data values in this cluster.
keepdims (bool, optional) – Whether to keep the dimension N of the input cluster, by default False.
dist_kwargs (dict, optional) – Additional parameters for the distance function used to compute the cluster center, by default {}.

Returns:

np.ndarray, shape (d*w_t), (w_t, d), (1, d*w_t) or (1, w_t, d) – The cluster center.
- If keepdims=True then the shape is (1, d*w_t) or (1, w_t, d) if DTW is used.
- If keepdims=False then the shape is (d*w_t) or (w_t, d) if DTW is used.

Raises:

ShapeError – Raised if cluster doesn’t have the shape (N, d*w_t) or (N, w_t, d).
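The non-DTW case described above can be illustrated with plain NumPy. This is only a minimal sketch of the averaging and of the effect of keepdims, not the library's implementation: the helper name mean_center is hypothetical, ValueError stands in for ShapeError, and the DBA branch is omitted.

```python
import numpy as np

def mean_center(cluster: np.ndarray, keepdims: bool = False) -> np.ndarray:
    """Average the datapoints of one cluster along the N axis (non-DTW case)."""
    if cluster.ndim != 2:
        # Hypothetical stand-in for the ShapeError raised by the library.
        raise ValueError("expected an array of shape (N, d*w_t)")
    return cluster.mean(axis=0, keepdims=keepdims)

# N = 3 datapoints, d*w_t = 2 values each.
cluster = np.array([[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]])
center = mean_center(cluster)                      # shape (2,)
center_kept = mean_center(cluster, keepdims=True)  # shape (1, 2)
```

With keepdims=True the result keeps a leading axis of length 1, which can be convenient when the center is later broadcast against the (N, d*w_t) cluster array.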

pycvi.cluster.compute_centers(X: numpy.ndarray, clusters: List[List[int]] = [], keepdims: bool = False, dist_kwargs: dict = {}) → List[numpy.ndarray]

Compute the centers of all clusters.

Parameters:

X (np.ndarray, shape (N, d*w_t) or (N, w_t, d)) – The original data.
clusters (List[List[int]]) – A list of clusters with indices.
keepdims (bool, optional) – Whether to keep the dimension N of the input cluster, by default False.
dist_kwargs (dict, optional) – Additional parameters for the distance function used to compute the cluster center, by default {}.

Returns:

A list of cluster centers. For each center:

If keepdims=True then the shape is (1, d*w_t) or (1, w_t, d) if DTW is used.
If keepdims=False then the shape is (d*w_t) or (w_t, d) if DTW is used.

Return type:

List[np.ndarray]
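As a rough illustration of how the index-based clusters argument relates to X, the following sketch computes one mean per cluster for non-DTW data. The helper name mean_centers is hypothetical, and the code is an assumption about the behavior described here rather than the actual PyCVI code.

```python
import numpy as np

def mean_centers(X: np.ndarray, clusters, keepdims: bool = False):
    """Return one mean vector per cluster, where each cluster is a list of row indices of X."""
    return [X[indices].mean(axis=0, keepdims=keepdims) for indices in clusters]

X = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [12.0, 12.0]])
clusters = [[0, 1], [2, 3]]  # PyCVI-style encoding: datapoint indices per cluster
centers = mean_centers(X, clusters)
```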

pycvi.cluster.generate_uniform(data: numpy.ndarray, zero_type: str = 'bounds', N_zero: int = 10, rng=numpy.random.default_rng) → List[numpy.ndarray]

Generate N_zero samples from a uniform distribution based on data.

data and each element of the returned list l_data0 have the same shape: (N, T, d) if DTW is used, (N, T*d) otherwise.

Parameters:

data (np.ndarray) – The original dataset
zero_type (str, optional) –
Determines how to parametrize the uniform distribution to sample from in the case \(k=0\), by default “bounds”. Possible options:
- “variance”: the uniform distribution is defined such that it has the same variance and mean as the original data.
- “bounds”: the uniform distribution is defined such that it has the same bounds as the original data.
N_zero (int, optional) – Number of uniform distributions sampled, by default 10
rng (A numpy Random Generator, optional) – The numpy random generator to use to sample from the uniform distribution, by default np.random.default_rng(611)

Returns:

A list of samples from a uniform distribution, parametrized according to the original dataset given data

Return type:

List[np.ndarray]
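The two parametrizations can be sketched as follows. A uniform distribution on \([a, b]\) has mean \((a+b)/2\) and variance \((b-a)^2/12\), so matching a mean \(\mu\) and standard deviation \(\sigma\) gives \(a = \mu - \sqrt{3}\sigma\) and \(b = \mu + \sqrt{3}\sigma\). The helper name uniform_like is hypothetical, and computing the statistics independently per feature (along the N axis) is an assumption of this sketch, not something confirmed for the library.

```python
import numpy as np

def uniform_like(data: np.ndarray, zero_type: str = "bounds", n_zero: int = 10, seed: int = 611):
    """Draw n_zero uniform samples shaped like data (statistics taken along axis 0)."""
    rng = np.random.default_rng(seed)
    if zero_type == "bounds":
        low, high = data.min(axis=0), data.max(axis=0)
    elif zero_type == "variance":
        # Uniform on [a, b]: mean (a + b) / 2, variance (b - a)**2 / 12.
        mean, std = data.mean(axis=0), data.std(axis=0)
        half_width = np.sqrt(3.0) * std
        low, high = mean - half_width, mean + half_width
    else:
        raise ValueError(f"unknown zero_type: {zero_type!r}")
    # low and high broadcast against the full data shape.
    return [rng.uniform(low, high, size=data.shape) for _ in range(n_zero)]

data = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])
samples = uniform_like(data, zero_type="bounds")
var_samples = uniform_like(data, zero_type="variance", n_zero=2)
```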

pycvi.cluster.prepare_data(X: numpy.ndarray, DTW: bool = False, window: dict = None, transformer: callable = None, scaler=sklearn.preprocessing.StandardScaler) → List[numpy.ndarray] | numpy.ndarray

Data to be used for computing clusters and CVIs

Scaler has to be fit beforehand on the original data (even for the case \(k=0\)).

X_clus is:

a list of \(T\) (N, w_t, d) arrays if both a sliding window and DTW are used
a list of \(T\) (N, w_t*d) arrays if a sliding window is used but not DTW
a list of one (N, T, d) array if DTW is used but not a sliding window
a list of one (N, T*d) array if neither DTW nor a sliding window is used

This function is notably called in pycvi.cluster.generate_all_clusterings().

Parameters:

X (np.ndarray, shape (N, T, d)) – Original data.
DTW (bool, optional) – Determines whether DTW should be the distance used on the data (concerns only time series data). If so, the time dimension is kept, otherwise it is “merged” with the feature dimension. By default, False.
window (dict, optional) – Information related to the sliding windows of time-series. By default None, which means that no sliding window is done on the data. For more information, see pycvi.cluster.sliding_window().
transformer (callable, optional) – A potential additional preprocessing step, by default None. If None, no transformation is applied on the data
scaler (A sklearn-like scaler model, optional) – A data scaler, by default StandardScaler(). In the case of time series data (i.e. \(T > 1\)), all the time steps of all samples of a given feature are aggregated before fitting the scaler. If None, no scaling is applied on the data.

Returns:

The processed data, ready to be clustered.

Return type:

Union[List[np.ndarray], np.ndarray]
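To make the shapes concrete for the case without a sliding window, here is a minimal NumPy sketch. It standardizes each feature over all samples and time steps, then either keeps the time dimension (DTW) or merges it with the feature dimension. The helper name scale_and_flatten is hypothetical, the manual standardization stands in for a fitted StandardScaler, and the transformer and sliding-window steps are left out.

```python
import numpy as np

def scale_and_flatten(X: np.ndarray, DTW: bool = False):
    """Standardize each feature over all samples and time steps, then shape for clustering."""
    N, T, d = X.shape
    # Aggregate all time steps of all samples for each feature: (N*T, d).
    flat = X.reshape(N * T, d)
    mean, std = flat.mean(axis=0), flat.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant features
    X_scaled = ((flat - mean) / std).reshape(N, T, d)
    # Keep the time axis for DTW, otherwise merge it with the feature axis.
    return [X_scaled] if DTW else [X_scaled.reshape(N, T * d)]

X = np.arange(24, dtype=float).reshape(2, 3, 4)  # N=2, T=3, d=4
flat_out = scale_and_flatten(X)            # one (2, 12) array
dtw_out = scale_and_flatten(X, DTW=True)   # one (2, 3, 4) array
```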

pycvi.cluster.sliding_window(T: int, w: int) → dict

Compute information related to the sliding windows of time-series.

Assuming that we consider an array of length \(T\), and with indices \([0 \cdots T-1]\).

Windows extracted at the beginning and the end of the array are shorter, which means that padding is implicitly included.

When the time window \(w\) is an even number, future time steps are favored, i.e., when extracting a time window around the datapoint \(t\), the time window indices are \([t - (w-1)//2, ..., t, ..., t + w//2]\). When \(w\) is odd, the window is \([t - (w-1)/2, ..., t, ..., t + (w-1)/2]\).

Which means that the padding is as follows:

beginning: (w-1)//2
end: w//2

And that the original indices are as follows:

\([0, ..., t + w//2]\), until \(t = (w-1)//2\)
\([t - (w-1)//2, ..., t, ..., t + w//2]\) for datapoints in \([(w-1)//2, ..., T-1 - w//2]\)
\([t - (w-1)//2, ..., T-1]\) from \(t = T-1 - w//2\)
Note that for \(t = (w-1)//2\) or \(t = (T-1 - w//2)\), both formulas apply.
Note also that the left side of the end case is the same as the left side of the base case

Window sizes:

\((1 + w//2)\) at \(t=0\), then \(t + (1 + w//2)\) until \(t = (w-1)//2\)
All datapoints in \([(w-1)//2, ..., T-1 - w//2]\) have the full window size \(w\).
\((w+1)//2\) at \(t=T-1\), then \(T-1-t + (w+1)//2\) from \(t = T-1 - w//2\)
Note that for \(t = (w-1)//2\) or \(t = (T-1 - w//2)\), both formulas apply

Consider an extracted time window of length w_real (with \(w\_real \leq w\), if the window was extracted at the beginning or the end of the array). The midpoint of the extracted window (i.e. the index in that window that corresponds to the datapoint around which the time window was extracted in the original array) is:

\(0\) at \(t=0\), then \(t\), until \(t = pad\_left\), i.e. \(t = (w-1)//2\)
For all datapoints in between, i.e. in \([(w-1)//2, ..., (T-1 - w//2)]\), the midpoint is \((w-1)//2\) (so it is the same as the base case)
\(w\_real-1\) at \(t=T-1\), then \(w\_real - (T-t)\), from \(t=T-1-pad\_right\), i.e. from \(t = (T-1 - w//2)\)
Note that for \(t = (w-1)//2\) or \(t = (T-1 - w//2)\), both formulas apply.

The midpoint in the original array is actually simply \(t\).

Available keys:

“padding_left”: Padding to the left
“padding_right”: Padding to the right
“length”: Actual length of the time window
“midpoint_w”: Midpoint in the window reference
“midpoint_o”: Midpoint in the origin reference
“origin”: Original indices

This function is notably called in pycvi.cluster.generate_all_clusterings().

Parameters:

T (int) – Length of the time series.
w (int) – Length of the sliding window.

Returns:

The information related to the sliding windows of length \(w\) extracted from time-series of length \(T\).

Return type:

dict
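The index arithmetic above can be checked with a short pure-Python sketch. It clips each window to \([0, T-1]\) and records the documented keys. The helper name window_info is hypothetical, and storing per-\(t\) lists under "length", "midpoint_w", "midpoint_o" and "origin" is an assumption about the layout of the returned dict, not a confirmed detail.

```python
def window_info(T: int, w: int) -> dict:
    """Describe the window extracted around each index t of an array of length T."""
    pad_left = (w - 1) // 2   # time steps taken before t
    pad_right = w // 2        # time steps taken after t (future favored when w is even)
    origin, length, midpoint_w = [], [], []
    for t in range(T):
        start = max(0, t - pad_left)       # clipped at the beginning of the array
        stop = min(T - 1, t + pad_right)   # clipped at the end of the array
        indices = list(range(start, stop + 1))
        origin.append(indices)
        length.append(len(indices))
        midpoint_w.append(t - start)       # position of t inside its own window
    return {
        "padding_left": pad_left,
        "padding_right": pad_right,
        "length": length,
        "midpoint_w": midpoint_w,
        "midpoint_o": list(range(T)),
        "origin": origin,
    }

info = window_info(T=6, w=4)
# info["origin"][0] is [0, 1, 2] and info["length"] is [3, 4, 4, 4, 3, 2]
```

For T=6 and w=4 the first window has size \(1 + w//2 = 3\) and the last has size \((w+1)//2 = 2\), matching the formulas given for the window sizes.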

pycvi.cluster.get_clustering(y: numpy.ndarray) → List[List[int]]

Get a list of clusters with indices based on labels.

The labels can either be the true labels when loading the data, or the output of a sklearn-like `fit_predict` or `predict` method.

Parameters:

y (np.ndarray, shape (N,)) – The labels for each datapoint.

Returns:

`clusters`: a list of datapoint indices for each cluster. `clusters[i]` contains the indices of the datapoints that belong to the i-th cluster.

Return type:

List[List[int]]

Raises:

EmptyClusterError – Raised if the clustering algorithm didn’t find the expected number of clusters because it couldn’t converge.
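The conversion between the two encodings can be sketched in a few lines of pure Python. The helper name labels_to_clusters is hypothetical, ordering the clusters by sorted label value is an assumption of this sketch, and the EmptyClusterError check is not reproduced.

```python
def labels_to_clusters(y):
    """Group datapoint indices by label, with clusters ordered by label value."""
    groups = {}
    for index, label in enumerate(y):
        groups.setdefault(label, []).append(index)
    return [groups[label] for label in sorted(groups)]

y = [1, 0, 1, 2, 0]  # sklearn-style encoding: one label per datapoint
clusters = labels_to_clusters(y)
# clusters is [[1, 4], [0, 2], [3]]
```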

pycvi.cluster.generate_all_clusterings(data: numpy.ndarray, model_class, n_clusters_range: Sequence = None, DTW: bool = True, time_window: int = None, transformer: callable = None, scaler=sklearn.preprocessing.StandardScaler, model_kw: dict = {}, fit_predict_kw: dict = {}, model_class_kw: dict = {}, return_list: bool = False, verbose: int = 0) → List[Dict[int, List[List[int]]]] | Dict[int, List[List[int]]]

Generate all clusterings for the given data and clustering model.

If time_window is None: `clusterings_t_k[k][i]` is a list of datapoint indices contained in cluster \(i\) for the clustering that assumes \(k\) clusters.

If time_window is not None (concerns only time series with sliding window): `clusterings_t_k[t_w][k][i]` is a list of datapoint indices contained in cluster \(i\) for the clustering that assumes \(k\) clusters for the extracted time window \(t\_w\).

If some clusterings couldn’t be defined because the clustering algorithm didn’t converge (pycvi.exceptions.EmptyClusterError), then `clusterings_t_k[t_w][n_clusters] = None`.

For more information about the preprocessing steps done on the data before the clustering operation, see pycvi.cluster.prepare_data() and pycvi.cluster.sliding_window().

Parameters:

data (np.ndarray) –
Original data. Acceptable input shapes and their corresponding output shapes in the PyCVI package:
- (N,) -> (N, 1, 1)
- (N, d) -> (N, 1, d)
- (N, T, d) -> (N, T, d)
model_class (A sklearn-like clustering class) – A class implementing a clustering algorithm.
n_clusters_range (Sequence, optional) – Assumptions on the number of clusters to try out, by default None. If None, n_clusters_range=range(N+1).
DTW (bool, optional) – Determines if DTW should be used as the distance measure (concerns only time series data), by default True.
time_window (int, optional) – Length of the sliding window (concerns only time-series data), by default None. If None, no sliding window is used, and the time series is considered as a whole. If None, the output is of type Dict[int, List[List[int]]]; if not None, the output is of type List[Dict[int, List[List[int]]]].
transformer (callable, optional) – A potential additional preprocessing step, by default None. If None, no transformation is applied on the data
scaler (A sklearn-like scaler model, optional) – A data scaler, by default StandardScaler(). In the case of time series data (i.e. \(T > 1\)), all the time steps of all samples of a given feature are aggregated before fitting the scaler. If None, no scaling is applied on the data.
model_kw (dict, optional) – Specific kwargs to give to model_class init method, by default {}.
fit_predict_kw (dict, optional) – Specific kwargs to give to the fit_predict method of the model_class clustering model, by default {}.
model_class_kw (dict, optional) – Dictionary that contains the argument names of the number of clusters and the data to give to the clustering model, by default {}, which is then updated as follows: {“k_arg_name” : “n_clusters”, “X_arg_name” : “X” } to follow sklearn conventions.
return_list (bool, optional) – Determines whether the output should be forced to be a List[Dict], even when no sliding window is used, by default False.
verbose (int, optional) – Controls the verbosity of the function, by default 0, which means that the function will be quiet. Max level of verbosity: 2.

Returns:

All clusterings for the given range on the number of clusters and for the potential sliding windows if applicable.

Return type:

Union[List[Dict[int, List[List[int]]]], Dict[int, List[List[int]]]]
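The structure of the output when no sliding window is used can be illustrated with a small self-contained sketch. ToyModel is a hypothetical stand-in for a sklearn-like clustering class (it takes n_clusters at init and exposes fit_predict(X)), and all_clusterings is a hypothetical helper. Preprocessing, the \(k=0\) case and the sliding-window case are omitted, and storing None when a cluster ends up empty is this sketch's approximation of the EmptyClusterError convention described above.

```python
class ToyModel:
    """Hypothetical sklearn-like model that assigns labels in round-robin order."""

    def __init__(self, n_clusters: int):
        self.n_clusters = n_clusters

    def fit_predict(self, X):
        return [i % self.n_clusters for i in range(len(X))]

def all_clusterings(data, model_class, n_clusters_range):
    """Map each k to a list of clusters, each cluster being a list of datapoint indices."""
    clusterings = {}
    for k in n_clusters_range:
        labels = model_class(n_clusters=k).fit_predict(X=data)
        clusters = [[] for _ in range(k)]
        for index, label in enumerate(labels):
            clusters[label].append(index)
        # Store None when some cluster is empty, i.e. fewer than k clusters were found.
        clusterings[k] = clusters if all(clusters) else None
    return clusterings

data = [[0.0], [0.1], [5.0], [5.1]]
clusterings = all_clusterings(data, ToyModel, range(1, 6))
# clusterings[2] is [[0, 2], [1, 3]]; clusterings[5] is None (only 4 datapoints)
```

Here `clusterings[k][i]` plays the role of `clusterings_t_k[k][i]`: the indices of the datapoints in cluster \(i\) when \(k\) clusters are assumed.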