pycvi.compute_scores

Low and high level functions to compute CVI values.

Main function

pycvi.compute_scores.compute_all_scores(cvi, ...)

Computes all CVI values for the given clusterings.

Functions

compute_all_scores(cvi, data, clusterings[, ...])

Computes all CVI values for the given clusterings.

f_diameter(cluster[, dist_kwargs])

Diameter of a group of elements.

f_inertia(cluster[, dist_kwargs])

Inertia of a group of elements.

f_intra(cluster[, dist_kwargs])

Sum of pairwise distances within a group of elements.

pycvi.compute_scores.f_intra(cluster: numpy.ndarray, dist_kwargs: dict = {}) -> float

Sum of pairwise distances within a group of elements.

Parameters:
  • cluster (np.ndarray, shape (N, d) or (N, w, d) if DTW.) – A cluster of size N.

  • dist_kwargs (dict, optional) – kwargs for scipy.spatial.distance.pdist, by default {}.

Returns:

The sum of pairwise distances within the cluster.

Return type:

float
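
As a rough illustration of the quantity described above, the following sketch sums the Euclidean distance over every unordered pair of points, using NumPy only. It assumes non-DTW data of shape (N, d) and the Euclidean metric; `intra_sketch` is a hypothetical name, not part of PyCVI, and the library's own implementation may differ (for instance in how `dist_kwargs` or DTW are handled).

```python
import numpy as np


def intra_sketch(cluster: np.ndarray) -> float:
    """Sum of Euclidean distances over all unordered pairs of rows."""
    # diffs[i, j] = cluster[i] - cluster[j], shape (N, N, d)
    diffs = cluster[:, None, :] - cluster[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Count each pair once by keeping the strict upper triangle
    upper = np.triu_indices(len(cluster), k=1)
    return float(dists[upper].sum())


# Three collinear points: pair distances are 5, 10 and 5
cluster = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
total = intra_sketch(cluster)
```

Counting each pair once mirrors the condensed output of scipy.spatial.distance.pdist, which is why that convention is assumed here.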

pycvi.compute_scores.f_inertia(cluster: numpy.ndarray, dist_kwargs: dict = {}) -> float

Inertia of a group of elements.

The inertia is defined as the sum of (squared) distances between the datapoints in the cluster and its centroid.

Parameters:
  • cluster (np.ndarray, shape (N, d) or (N, w, d) if DTW.) – A cluster of size N.

  • dist_kwargs (dict, optional) – kwargs for scipy.spatial.distance.cdist, by default {}.

Returns:

The inertia of the cluster.

Return type:

float
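
The definition above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: non-DTW data of shape (N, d), the centroid taken as the arithmetic mean, and squared Euclidean distances. `inertia_sketch` is a hypothetical name, and the library may compute the centroid or the distances differently (for example under DTW).

```python
import numpy as np


def inertia_sketch(cluster: np.ndarray) -> float:
    """Sum of squared Euclidean distances from each row to the mean row."""
    centroid = cluster.mean(axis=0)                 # shape (d,)
    sq_dists = ((cluster - centroid) ** 2).sum(axis=1)  # shape (N,)
    return float(sq_dists.sum())


# Centroid is (2, 0); squared distances are 4, 0 and 4
cluster = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
inertia = inertia_sketch(cluster)
```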

pycvi.compute_scores.f_diameter(cluster: numpy.ndarray, dist_kwargs: dict = {}) -> float

Diameter of a group of elements.

Parameters:
  • cluster (np.ndarray, shape (N, d) or (N, w, d) if DTW.) – A cluster of size N.

  • dist_kwargs (dict, optional) – kwargs for scipy.spatial.distance.pdist, by default {}.

Returns:

The diameter of the cluster.

Return type:

float
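
The documentation does not spell out how the diameter is defined. A common convention, assumed in the sketch below, is the largest pairwise distance within the cluster. The example uses NumPy only, Euclidean distances, and non-DTW data of shape (N, d); `diameter_sketch` is a hypothetical name and is not guaranteed to match PyCVI's implementation.

```python
import numpy as np


def diameter_sketch(cluster: np.ndarray) -> float:
    """Largest Euclidean distance between any two rows (0 for one row)."""
    # diffs[i, j] = cluster[i] - cluster[j], shape (N, N, d)
    diffs = cluster[:, None, :] - cluster[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # The diagonal is zero, so a single-point cluster has diameter 0
    return float(dists.max())


# The two farthest points are (0, 0) and (6, 8)
cluster = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
diameter = diameter_sketch(cluster)
```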

pycvi.compute_scores.compute_all_scores(cvi, data: numpy.ndarray, clusterings: List[Dict[int, List[List[int]]]], transformer: callable = None, scaler=sklearn.preprocessing.StandardScaler, DTW: bool = True, time_window: int = None, N_zero: int = 10, zero_type: str = 'bounds', rng=numpy.random.default_rng, cvi_kwargs: dict = {}, return_list: bool = False) -> List[List[Dict[int, float]]] | List[Dict[int, float]] | Dict[int, float]

Computes all CVI values for the given clusterings.

If some scores couldn’t be computed, either because of the condition on \(k\) (pycvi.exceptions.InvalidKError) or because the clustering algorithm used previously didn’t converge (pycvi.exceptions.EmptyClusterError), then `scores[t_w][n_clusters] = None`.
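
Because missing scores are reported as None rather than raised, downstream code usually has to separate them from valid values before comparing clusterings. The sketch below shows one way to do that for a single time window; the score values are made up for illustration and do not come from any real CVI.

```python
from typing import Dict, List, Optional

# Hypothetical scores for one time window, keyed by number of clusters.
# None marks a clustering for which the CVI could not be computed.
scores: Dict[int, Optional[float]] = {
    1: None, 2: 0.41, 3: 0.57, 4: None, 5: 0.49,
}

# Keep only the values that were actually computed
valid: Dict[int, float] = {k: s for k, s in scores.items() if s is not None}

# Record which numbers of clusters were skipped
skipped: List[int] = sorted(k for k, s in scores.items() if s is None)
```

Testing with `is not None` rather than truthiness matters here, since a legitimate score of 0.0 would otherwise be dropped.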

Parameters:
  • cvi (an instance of a CVI class or a CVIAggregator.) – The CVI(s) to use to compute all the scores.

  • data (np.ndarray) –

    Original data. Acceptable input shapes and their corresponding output shapes in the PyCVI package:

    • (N,) -> (N, 1, 1)

    • (N, d) -> (N, 1, d)

    • (N, T, d) -> (N, T, d)

  • clusterings (List[Dict[int, List[List[int]]]]) –

    All clusterings for the given range on the number of clusters and for the potential sliding windows if applicable.

    `clusterings[t_w][k][i]` is a list of datapoint indices contained in cluster \(i\) for the clustering that assumes \(k\) clusters for the extracted time window \(t\_w\).

  • transformer (callable, optional) – An additional preprocessing step, by default None. If None, no transformation is applied to the data.

  • scaler (A sklearn-like scaler model, optional) – A data scaler, by default StandardScaler(). In the case of time series data (i.e. \(T > 1\)), all the time steps of all samples of a given feature are aggregated before fitting the scaler. If None, no scaling is applied to the data.

  • DTW (bool, optional) – Determines if DTW should be used as the distance measure (concerns only time series data), by default True.

  • time_window (int, optional) – Length of the sliding window (concerns only time-series data), by default None. If None, no sliding window is used, and the time series is considered as a whole.

  • N_zero (int, optional) – Number of uniform distributions sampled, by default 10.

  • zero_type (str, optional) –

    Determines how to parametrize the uniform distribution to sample from in the case \(k=0\), by default “bounds”. Possible options:

    • “variance”: the uniform distribution is defined such that it has the same variance and mean as the original data.

    • “bounds”: the uniform distribution is defined such that it has the same bounds as the original data.

  • rng (A numpy Random Generator, optional) – The numpy random generator to use to sample from the uniform distribution, by default np.random.default_rng(611).

  • cvi_kwargs (dict, optional) – Specific kwargs to give to the CVI, by default {}.

  • return_list (bool, optional) – Determines whether the output should be forced to be a List[Dict], even when no sliding window is used, by default False.
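
Three of the parameter conventions above (the shape mapping for `data`, the nesting of `clusterings`, and the two `zero_type` options) can be sketched with NumPy alone. The helpers `as_3d` and `uniform_params` are hypothetical names written for this illustration; they are not PyCVI functions, and the per-feature treatment in `uniform_params` is an assumption about how the uniform distribution might be parametrized.

```python
import numpy as np


def as_3d(data) -> np.ndarray:
    """Map (N,), (N, d) or (N, T, d) input to shape (N, T, d)."""
    data = np.asarray(data, dtype=float)
    if data.ndim == 1:
        return data.reshape(-1, 1, 1)   # (N,) -> (N, 1, 1)
    if data.ndim == 2:
        return data[:, None, :]         # (N, d) -> (N, 1, d)
    if data.ndim == 3:
        return data                     # (N, T, d) unchanged
    raise ValueError(f"Unsupported number of dimensions: {data.ndim}")


def uniform_params(data_2d: np.ndarray, zero_type: str = "bounds"):
    """Per-feature (low, high) of a uniform distribution (assumed scheme)."""
    if zero_type == "bounds":
        # Same minimum and maximum as the data
        return data_2d.min(axis=0), data_2d.max(axis=0)
    if zero_type == "variance":
        # Uniform on [m - h, m + h] has mean m and variance h**2 / 3,
        # so h = sqrt(3) * std matches the data's mean and variance
        mean = data_2d.mean(axis=0)
        half_width = np.sqrt(3.0) * data_2d.std(axis=0)
        return mean - half_width, mean + half_width
    raise ValueError(f"Unknown zero_type: {zero_type!r}")


data = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])

# clusterings[t_w][k][i]: indices in cluster i, for k clusters, window t_w.
# One entry in the outer list, i.e. no sliding window in this toy example.
clusterings = [
    {
        2: [[0, 1], [2, 3]],
        3: [[0], [1], [2, 3]],
    }
]

low_b, high_b = uniform_params(data, "bounds")
low_v, high_v = uniform_params(data, "variance")
```

In this toy layout every datapoint index appears in exactly one cluster for each value of \(k\), which is the property a hard clustering is expected to satisfy.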

Returns:

  • Union[List[List[Dict[int, float]]], List[Dict[int, float]], Dict[int, float]] – The computed CVI values for each of the clusterings given as input.

    The type is:

    • Dict[int, float]: only if a CVI class was used (not a CVIAggregator) and no time window was used.

    • List[List[Dict[int, float]]]: only if both a CVIAggregator and a time window were used.

    • List[Dict[int, float]]: otherwise, that is to say, if a CVIAggregator was used without time window, or if a CVI was used with a time window.
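
Since the return type depends on the inputs, calling code may want to bring all three cases to a single nesting depth before processing them. The sketch below is one hedged way to do so in plain Python; `nesting_depth` and `to_nested` are hypothetical helpers, not part of PyCVI. Note that wrapping only adds outer list levels: it does not tell you whether a given level corresponds to the aggregated CVIs or to the time windows, which the text above does not specify.

```python
from typing import Any, Dict, List


def nesting_depth(scores: Any) -> int:
    """Number of list levels wrapped around the innermost dict."""
    depth = 0
    # Stop on a non-list value or on an empty list
    while isinstance(scores, list) and scores:
        depth += 1
        scores = scores[0]
    return depth


def to_nested(scores: Any) -> List[List[Dict[int, float]]]:
    """Wrap the output in lists until it is two levels deep."""
    while nesting_depth(scores) < 2:
        scores = [scores]
    return scores


# The three documented shapes, with a made-up score value
as_dict = {2: 0.4}
as_list = [{2: 0.4}]
as_nested = [[{2: 0.4}]]

uniform = [to_nested(s) for s in (as_dict, as_list, as_nested)]
```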