pycvi.cvi_func

Functional API of all implemented CVIs.

These are the functions on which the CVI classes defined in pycvi.cvi are based.

Functions

CH(X, clusters[, k, X1, zero_type, rng, ...])

Compute the Calinski–Harabasz (CH) index for a given clustering.

MB(X, clusters[, k, p, dist_kwargs])

Compute the Maulik-Bandyopadhyay index for a given clustering.

SD_index(X, clusters[, alpha, dist_kwargs])

Compute the SD index for a given clustering.

SDbw_index(X, clusters[, dist_kwargs])

Compute the SDbw index for a given clustering.

davies_bouldin(X, clusters[, p, dist_kwargs])

Compute the Davies-Bouldin (DB) index for a given clustering.

dunn(X, clusters[, dist_kwargs])

Compute the Dunn index for a given clustering.

gap_statistic(X, clusters[, k, B, ...])

Compute the Gap statistic for a given clustering.

hartigan(X, clusters[, k, clusters_next, ...])

Compute the Hartigan index for a given clustering.

score_function(X, clusters[, k])

Compute the score function for a given clustering.

silhouette(X, clusters)

Compute the silhouette score for a given clustering.

xie_beni(X, clusters[, dist_kwargs])

Compute the Xie-Beni index for a given clustering.

xie_beni_star(X, clusters[, dist_kwargs])

Compute the Xie-Beni* (XB*) index for a given clustering.
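Every function listed above takes clusters as a list of datapoint indices per cluster rather than a flat label vector. The snippet below is a minimal sketch of converting one into the other; labels_to_clusters is a hypothetical helper introduced for illustration and is not part of pycvi.

```python
from collections import defaultdict
from typing import List, Sequence


def labels_to_clusters(labels: Sequence[int]) -> List[List[int]]:
    """Group datapoint indices by label (hypothetical helper, not part of pycvi)."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    # Sort by label so the cluster order is deterministic.
    return [groups[label] for label in sorted(groups)]


labels = [0, 1, 0, 2, 1]
clusters = labels_to_clusters(labels)
print(clusters)  # [[0, 2], [1, 4], [3]]
```

The resulting nested list has the List[List[int]] shape expected by the clusters argument.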

pycvi.cvi_func.gap_statistic(X: numpy.ndarray, clusters: List[List[int]], k: int = None, B: int = 10, zero_type: str = 'variance', rng=numpy.random.default_rng, return_s: bool = False) → Union[float, Tuple[float, float]]

Compute the Gap statistic for a given clustering.

Parameters:
  • X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset

  • clusters (List[List[int]]) – List of datapoint indices for each cluster.

  • k (int, optional) – Number of clusters.

  • B (int, optional) – Number of uniform samples drawn, defaults to 10.

  • zero_type (str, optional) –

    Determines how to parametrize the uniform distribution to sample from in the case \(k=0\), by default “variance”. Possible options:

    • “variance”: the uniform distribution is defined such that it has the same variance and mean as the original data.

    • “bounds”: the uniform distribution is defined such that it has the same bounds as the original data.

  • rng (A numpy Random Generator, optional) – The numpy random generator to use to sample from the uniform distribution, by default np.random.default_rng(611)

  • return_s (bool, optional) – If True, s is returned as well, defaults to False.

Returns:

The Gap statistic, together with s if return_s is True.

Return type:

Union[float, Tuple[float, float]]
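The two zero_type options can be illustrated on a single feature. This is a minimal sketch, not pycvi's code: uniform_bounds is a hypothetical name, and the use of the population variance and of one independent interval per feature are assumptions. It relies on the fact that U(a, b) has mean (a + b)/2 and variance (b − a)²/12.

```python
import math
import statistics
from typing import List, Tuple


def uniform_bounds(values: List[float], zero_type: str = "variance") -> Tuple[float, float]:
    """Return (low, high) of a 1-D uniform reference distribution.

    Hypothetical helper mirroring the documented options, not pycvi's code.
    """
    if zero_type == "bounds":
        # Same bounds as the data.
        return min(values), max(values)
    if zero_type == "variance":
        mean = statistics.fmean(values)
        var = statistics.pvariance(values)  # assumption: population variance
        # U(a, b) has variance (b - a)^2 / 12, so the half-width is sqrt(3 * var).
        half_width = math.sqrt(3.0 * var)
        return mean - half_width, mean + half_width
    raise ValueError(f"unknown zero_type: {zero_type!r}")


feature = [0.0, 1.0, 2.0, 3.0, 10.0]
print(uniform_bounds(feature, "bounds"))
print(uniform_bounds(feature, "variance"))
```

With an outlier such as 10.0, the "variance" interval is centred on the data mean and need not coincide with the observed minimum and maximum, whereas "bounds" always does.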

pycvi.cvi_func.score_function(X: numpy.ndarray, clusters: List[List[int]], k: int = None) → float

Compute the score function for a given clustering.

Parameters:
  • X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset

  • clusters (List[List[int]]) – List of datapoint indices for each cluster.

  • k (int, optional) – Ignored. Kept for compatibility with the other CVI functions.

Returns:

The score function index

Return type:

float

pycvi.cvi_func.hartigan(X: numpy.ndarray, clusters: List[List[int]], k: int = None, clusters_next: List[List[int]] = None, X1: numpy.ndarray = None, rng=numpy.random.default_rng) → float

Compute the Hartigan index for a given clustering.

Parameters:
  • X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset

  • clusters (List[List[int]]) – List of datapoint indices for each cluster.

  • k (int, optional) – Number of clusters.

  • clusters_next (List[List[int]], optional) – The next clustering, i.e. the one with k+1 clusters.

  • X1 (np.ndarray, shape: (N, d*w_t) or (N, w_t, d), optional) – Dataset. Providing it assumes that k=0, in which case X holds the values of all datapoints when sampled from a uniform distribution.

  • rng (A numpy Random Generator, optional) – The numpy random generator to use to sample from the uniform distribution, by default np.random.default_rng(611)

Returns:

The Hartigan index

Return type:

float
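The role of clusters_next is easiest to see in the textbook form of the index, H(k) = (W_k / W_{k+1} − 1)(N − k − 1), where W is the within-cluster sum of squared distances to the centroids. The sketch below implements that textbook formula in plain Python; hartigan_sketch and within_ss are hypothetical names, and pycvi's handling of k=0, X1 and time-series inputs is not reproduced.

```python
from typing import List, Sequence

Point = Sequence[float]


def within_ss(X: List[Point], clusters: List[List[int]]) -> float:
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for members in clusters:
        dim = len(X[members[0]])
        centroid = [sum(X[i][d] for i in members) / len(members) for d in range(dim)]
        total += sum((X[i][d] - centroid[d]) ** 2 for i in members for d in range(dim))
    return total


def hartigan_sketch(X: List[Point], clusters: List[List[int]],
                    clusters_next: List[List[int]]) -> float:
    """Textbook Hartigan index; not necessarily pycvi's exact computation."""
    n, k = len(X), len(clusters)
    ratio = within_ss(X, clusters) / within_ss(X, clusters_next)
    return (ratio - 1.0) * (n - k - 1)


X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0], [20.0, 0.0], [20.0, 2.0]]
two = [[0, 1, 2, 3], [4, 5]]       # clustering with k = 2
three = [[0, 1], [2, 3], [4, 5]]   # next clustering, k + 1 = 3
print(hartigan_sketch(X, two, three))  # approximately 50
```

A large value indicates that moving from k to k+1 clusters reduces the within-cluster dispersion substantially. The sketch divides by W_{k+1}, so it fails if the next clustering has zero dispersion.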

pycvi.cvi_func.silhouette(X: numpy.ndarray, clusters: List[List[int]]) → float

Compute the silhouette score for a given clustering.

Parameters:
  • X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset

  • clusters (List[List[int]]) – List of datapoint indices for each cluster.

Returns:

The silhouette score

Return type:

float
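For reference, the textbook silhouette of a point is (b − a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below averages that quantity over all points using Euclidean distance. silhouette_sketch is a hypothetical name, and pycvi's own distance handling may differ.

```python
import math
from typing import List, Sequence


def silhouette_sketch(X: List[Sequence[float]], clusters: List[List[int]]) -> float:
    """Mean silhouette width with Euclidean distance (needs at least 2 clusters)."""
    scores = []
    for c, members in enumerate(clusters):
        for i in members:
            if len(members) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(math.dist(X[i], X[j]) for j in members if j != i) / (len(members) - 1)
            # b: smallest mean distance to the members of any other cluster
            b = min(
                sum(math.dist(X[i], X[j]) for j in other) / len(other)
                for o, other in enumerate(clusters) if o != c
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


X = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_sketch(X, [[0, 1], [2, 3]])
bad = silhouette_sketch(X, [[0, 2], [1, 3]])
print(good, bad)  # the well-separated clustering scores higher
```

Values close to 1 indicate compact, well-separated clusters; negative values indicate points that sit closer to another cluster than to their own.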

pycvi.cvi_func.CH(X: numpy.ndarray, clusters: List[List[int]], k: int = None, X1: numpy.ndarray = None, zero_type: str = 'variance', rng=numpy.random.default_rng, dist_kwargs: dict = {}) → float

Compute the Calinski–Harabasz (CH) index for a given clustering.

Parameters:
  • X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset

  • clusters (List[List[int]]) – List of datapoint indices for each cluster.

  • k (int, optional) – Number of clusters.

  • X1 (np.ndarray, shape: (N, d*w_t) or (N, w_t, d), optional) – Dataset. Providing it assumes that k=0, in which case X holds the values of all datapoints when sampled from a uniform distribution.

  • zero_type (str, optional) –

    Determines how to parametrize the uniform distribution to sample from in the case \(k=0\), by default “variance”. Possible options:

    • “variance”: the uniform distribution is defined such that it has the same variance and mean as the original data.

    • “bounds”: the uniform distribution is defined such that it has the same bounds as the original data.

  • rng (A numpy Random Generator, optional) – The numpy random generator to use to sample from the uniform distribution, by default np.random.default_rng(611)

  • dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The CH index

Return type:

float
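In its textbook form, the CH index is the ratio of between-cluster to within-cluster dispersion, each normalised by its degrees of freedom: CH = [B / (k − 1)] / [W / (N − k)]. The sketch below covers only that standard case with 1 < k < N and Euclidean distance. ch_sketch is a hypothetical name, and the k=0 handling that X1, zero_type and rng refer to is not reproduced.

```python
from typing import List, Sequence


def ch_sketch(X: List[Sequence[float]], clusters: List[List[int]]) -> float:
    """Textbook Calinski-Harabasz index for 1 < k < N (not pycvi's code)."""
    n, k, dim = len(X), len(clusters), len(X[0])
    overall = [sum(point[d] for point in X) / n for d in range(dim)]
    between = within = 0.0
    for members in clusters:
        centroid = [sum(X[i][d] for i in members) / len(members) for d in range(dim)]
        # Between-cluster dispersion: size-weighted squared distance to the overall mean.
        between += len(members) * sum((centroid[d] - overall[d]) ** 2 for d in range(dim))
        # Within-cluster dispersion: squared distances of members to their centroid.
        within += sum((X[i][d] - centroid[d]) ** 2 for i in members for d in range(dim))
    return (between / (k - 1)) / (within / (n - k))


X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0], [20.0, 0.0], [20.0, 2.0]]
print(ch_sketch(X, [[0, 1], [2, 3], [4, 5]]))     # 100.0
print(ch_sketch(X, [[0, 1, 2, 3], [4, 5]]))       # lower: two groups are merged
```

Higher values indicate a better separation relative to cluster compactness.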

pycvi.cvi_func.MB(X: numpy.ndarray, clusters: List[List[int]], k: int = None, p: int = 2, dist_kwargs={}) → float

Compute the Maulik-Bandyopadhyay index for a given clustering.

Parameters:
  • X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset

  • clusters (List[List[int]]) – List of datapoint indices for each cluster.

  • k (int, optional) – Number of clusters.

  • p (int, optional) – Power used in the index formula, defaults to 2.

  • dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The Maulik-Bandyopadhyay index

Return type:

float

pycvi.cvi_func.SD_index(X: numpy.ndarray, clusters: List[List[int]], alpha: float = None, dist_kwargs={}) → float

Compute the SD index for a given clustering.

Parameters:
  • X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset

  • clusters (List[List[int]]) – List of datapoint indices for each cluster.

  • alpha (float) – The constant in the SD index formula (=Dis(k_max)).

  • dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The SD index

Return type:

float

pycvi.cvi_func.SDbw_index(X: numpy.ndarray, clusters: List[List[int]], dist_kwargs={}) → float

Compute the SDbw index for a given clustering.

Parameters:
  • X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset

  • clusters (List[List[int]]) – List of datapoint indices for each cluster.

  • dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The SDbw index

Return type:

float

pycvi.cvi_func.dunn(X: numpy.ndarray, clusters: List[List[int]], dist_kwargs={}) → float

Compute the Dunn index for a given clustering.

Parameters:
  • X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset

  • clusters (List[List[int]]) – List of datapoint indices for each cluster.

  • dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The Dunn index

Return type:

float
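Several variants of the Dunn index exist. The sketch below shows one common variant, assuming Euclidean distance: the smallest distance between points of different clusters divided by the largest distance between points of the same cluster. dunn_sketch is a hypothetical name, and which variant pycvi implements is not stated here.

```python
import math
from itertools import combinations
from typing import List, Sequence


def dunn_sketch(X: List[Sequence[float]], clusters: List[List[int]]) -> float:
    """One common Dunn variant: min inter-cluster distance / max cluster diameter."""
    # Largest pairwise distance inside any single cluster.
    diameter = max(
        math.dist(X[i], X[j])
        for members in clusters
        for i, j in combinations(members, 2)
    )
    # Smallest pairwise distance between points of two different clusters.
    separation = min(
        math.dist(X[i], X[j])
        for first, second in combinations(clusters, 2)
        for i in first
        for j in second
    )
    return separation / diameter


X = [[0.0], [1.0], [10.0], [11.0]]
print(dunn_sketch(X, [[0, 1], [2, 3]]))  # 9.0
print(dunn_sketch(X, [[0, 2], [1, 3]]))  # 0.1
```

Higher is better. The sketch needs at least two clusters and at least one cluster with two or more points, otherwise max or min receives an empty sequence.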

pycvi.cvi_func.xie_beni(X: numpy.ndarray, clusters: List[List[int]], dist_kwargs={}) → float

Compute the Xie-Beni index for a given clustering.

Parameters:
  • X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset

  • clusters (List[List[int]]) – List of datapoint indices for each cluster.

  • dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The Xie-Beni index

Return type:

float

pycvi.cvi_func.xie_beni_star(X: numpy.ndarray, clusters: List[List[int]], dist_kwargs={}) → float

Compute the Xie-Beni* (XB*) index for a given clustering.

Parameters:
  • X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset

  • clusters (List[List[int]]) – List of datapoint indices for each cluster.

  • dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The Xie-Beni* (XB*) index

Return type:

float

pycvi.cvi_func.davies_bouldin(X: numpy.ndarray, clusters: List[List[int]], p: int = 2, dist_kwargs={}) → float

Compute the Davies-Bouldin (DB) index for a given clustering.

Parameters:
  • X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset

  • clusters (List[List[int]]) – List of datapoint indices for each cluster.

  • dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The Davies-Bouldin (DB) index

Return type:

float
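As a point of comparison, the textbook DB index averages, over all clusters, the worst-case ratio (S_i + S_j) / d(c_i, c_j), where S_i is the mean distance of the members of cluster i to its centroid c_i. The sketch below uses Euclidean distance throughout. davies_bouldin_sketch is a hypothetical name, and the role of the p argument in pycvi's version is not modelled.

```python
import math
from typing import List, Sequence


def davies_bouldin_sketch(X: List[Sequence[float]], clusters: List[List[int]]) -> float:
    """Textbook Davies-Bouldin index with Euclidean distance (not pycvi's code)."""
    centroids, scatters = [], []
    for members in clusters:
        dim = len(X[members[0]])
        centroid = [sum(X[i][d] for i in members) / len(members) for d in range(dim)]
        centroids.append(centroid)
        # Scatter: mean distance of the cluster members to their centroid.
        scatters.append(sum(math.dist(X[i], centroid) for i in members) / len(members))
    k = len(clusters)
    # For each cluster, keep the worst (largest) similarity to any other cluster.
    worst = [
        max(
            (scatters[i] + scatters[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


X = [[0.0], [1.0], [10.0], [11.0]]
print(davies_bouldin_sketch(X, [[0, 1], [2, 3]]))  # approximately 0.1
print(davies_bouldin_sketch(X, [[0, 2], [1, 3]]))  # approximately 10
```

Unlike the Dunn, CH and silhouette indices, lower DB values indicate a better clustering.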