pycvi.cluster.generate_all_clusterings

pycvi.cluster.generate_all_clusterings(data: numpy.ndarray, model_class, n_clusters_range: Sequence = None, DTW: bool = True, time_window: int = None, transformer: callable = None, scaler=sklearn.preprocessing.StandardScaler, model_kw: dict = {}, fit_predict_kw: dict = {}, model_class_kw: dict = {}, return_list: bool = False, verbose: int = 0) → List[Dict[int, List[List[int]]]] | Dict[int, List[List[int]]]

Generate all clusterings for the given data and clustering model.

If time_window is None: `clusterings_t_k[k][i]` is a list of datapoint indices contained in cluster \(i\) for the clustering that assumes \(k\) clusters.

If time_window is not None (concerns only time series with sliding window): `clusterings_t_k[t_w][k][i]` is a list of datapoint indices contained in cluster \(i\) for the clustering that assumes \(k\) clusters for the extracted time window \(t\_w\).

If a clustering could not be defined because the clustering algorithm did not converge (pycvi.exceptions.EmptyClusterError), then `clusterings_t_k[t_w][n_clusters] = None`.
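To make the indexing concrete, the following minimal sketch builds a structure of the same shape by hand and reads it back. It does not call PyCVI: the helper names `labels_to_clusters` and `describe`, and the sample labels, are hypothetical and chosen only for illustration of the case where time_window is None.

```python
from typing import Dict, List, Optional

def labels_to_clusters(labels: List[int]) -> List[List[int]]:
    """Group datapoint indices by cluster label (hypothetical helper)."""
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    # Return the clusters ordered by increasing label value.
    return [clusters[label] for label in sorted(clusters)]

# Hand-made stand-in for the documented output when time_window is None.
clusterings_t_k: Dict[int, Optional[List[List[int]]]] = {
    1: labels_to_clusters([0, 0, 0, 0]),
    2: labels_to_clusters([0, 1, 0, 1]),
    3: None,  # stands for a clustering that could not be defined
}

def describe(clusterings: Dict[int, Optional[List[List[int]]]]) -> List[str]:
    """Summarise each cluster, skipping the k values stored as None."""
    lines = []
    for k, clusters in clusterings.items():
        if clusters is None:
            lines.append(f"k={k}: no clustering")
            continue
        for i, members in enumerate(clusters):
            lines.append(f"k={k}, cluster {i}: {members}")
    return lines

for line in describe(clusterings_t_k):
    print(line)
```

Checking for None before iterating matters because, as described above, a k value whose clustering failed is still present as a key.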

For more information about the preprocessing steps done on the data before the clustering operation, see pycvi.cluster.prepare_data() and pycvi.cluster.sliding_window().

Parameters:
  • data (np.ndarray) –

    Original data. Acceptable input shapes and their corresponding output shapes in the PyCVI package:

    • (N,) -> (N, 1, 1)

    • (N, d) -> (N, 1, d)

    • (N, T, d) -> (N, T, d)

  • model_class (A sklearn-like clustering class) – A class implementing a clustering algorithm.

  • n_clusters_range (Sequence, optional) – Assumptions on the number of clusters to try out, by default None. If None, n_clusters_range=range(N+1).

  • DTW (bool, optional) – Determines if DTW should be used as the distance measure (concerns only time series data), by default True.

  • time_window (int, optional) – Length of the sliding window (concerns only time-series data), by default None. If None, no sliding window is used, and the time series is considered as a whole. If None, the output is of type Dict[int, List[List[int]]]; otherwise, the output is of type List[Dict[int, List[List[int]]]].

  • transformer (callable, optional) – An optional additional preprocessing step, by default None. If None, no transformation is applied to the data.

  • scaler (A sklearn-like scaler model, optional) – A data scaler, by default StandardScaler(). In the case of time series data (i.e. \(T > 1\)), all the time steps of all samples of a given feature are aggregated before fitting the scaler. If None, no scaling is applied to the data.

  • model_kw (dict, optional) – Specific kwargs to give to the init method of model_class, by default {}.

  • fit_predict_kw (dict, optional) – Specific kwargs to give to the fit_predict method of the model_class clustering model, by default {}.

  • model_class_kw (dict, optional) – Dictionary that contains the argument names of the number of clusters and of the data to give to the clustering model, by default {}, which is then updated as follows: {"k_arg_name": "n_clusters", "X_arg_name": "X"} to follow sklearn conventions.

  • return_list (bool, optional) – Determines whether the output should be forced to be a List[Dict], even when no sliding window is used, by default False.

  • verbose (int, optional) – Controls the verbosity of the function, by default 0, which means that the function will be quiet. Max level of verbosity: 2.
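The roles of model_kw, fit_predict_kw and model_class_kw can be illustrated with a small sketch. This is not PyCVI's implementation: `run_model` and both toy classes are hypothetical, and the way the default argument names are merged and applied is an assumption based only on the parameter descriptions above.

```python
from typing import List

class ToyChunkClustering:
    """Hypothetical sklearn-like model: ranks 1-D values and cuts them into chunks."""

    def __init__(self, n_clusters: int = 2):
        self.n_clusters = n_clusters

    def fit_predict(self, X: List[float]) -> List[int]:
        # Sort indices by value, then assign equal-sized rank chunks to clusters.
        order = sorted(range(len(X)), key=lambda i: X[i])
        labels = [0] * len(X)
        for rank, idx in enumerate(order):
            labels[idx] = rank * self.n_clusters // len(X)
        return labels

class ToyCustomNames(ToyChunkClustering):
    """Same toy model, but with non-sklearn argument names (k, data)."""

    def __init__(self, k: int = 2):
        super().__init__(n_clusters=k)

    def fit_predict(self, data: List[float]) -> List[int]:
        return super().fit_predict(data)

def run_model(model_class, X, k, model_kw=None, fit_predict_kw=None,
              model_class_kw=None):
    # Assumed merge: sklearn-style defaults, overridden by user-supplied names.
    names = {"k_arg_name": "n_clusters", "X_arg_name": "X"}
    names.update(model_class_kw or {})
    # The number of clusters goes to the constructor under its configured name.
    init_kw = dict(model_kw or {})
    init_kw[names["k_arg_name"]] = k
    # The data goes to fit_predict under its configured name.
    call_kw = dict(fit_predict_kw or {})
    call_kw[names["X_arg_name"]] = X
    return model_class(**init_kw).fit_predict(**call_kw)

values = [0.1, 5.0, 0.2, 4.8]
default_labels = run_model(ToyChunkClustering, values, k=2)
custom_labels = run_model(
    ToyCustomNames, values, k=2,
    model_class_kw={"k_arg_name": "k", "X_arg_name": "data"},
)
```

Under this reading, model_class_kw only needs to be set when the clustering class does not use the names n_clusters and X.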

Returns:

All clusterings for the given range on the number of clusters and for the potential sliding windows if applicable.

Return type:

Union[List[Dict[int, List[List[int]]]], Dict[int, List[List[int]]]]
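Because the return type depends on time_window and return_list, downstream code may find it convenient to normalise both forms before iterating. The sketch below is one hedged way to do so on hand-made data. `as_window_list` and `valid_k_values` are hypothetical helpers, not part of PyCVI, and treating the no-window case as a one-element list is an assumption about what return_list=True produces.

```python
from typing import Dict, List, Optional, Union

Clustering = Optional[List[List[int]]]   # None when no clustering was defined
ClusteringsByK = Dict[int, Clustering]   # maps k to its clustering

def as_window_list(
    result: Union[ClusteringsByK, List[ClusteringsByK]]
) -> List[ClusteringsByK]:
    """Wrap a single dict into a one-element list; copy lists unchanged."""
    return [result] if isinstance(result, dict) else list(result)

def valid_k_values(
    result: Union[ClusteringsByK, List[ClusteringsByK]]
) -> List[List[int]]:
    """For each window, list the k values whose clustering is not None."""
    return [
        [k for k, clusters in by_k.items() if clusters is not None]
        for by_k in as_window_list(result)
    ]

# Hand-made examples of the two documented return shapes.
no_window: ClusteringsByK = {2: [[0, 2], [1, 3]], 3: None}
with_windows: List[ClusteringsByK] = [
    no_window,
    {2: [[0, 1], [2, 3]], 3: [[0], [1], [2, 3]]},
]

print(valid_k_values(no_window))
print(valid_k_values(with_windows))
```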