pycvi.cvi_func

Functional API of all implemented cluster validity indices (CVIs).

These functions are the ones on which the CVI classes defined in pycvi.cvi are based.

Functions

`CH`(X, clusters[, k, X1, zero_type, rng, ...])	Compute the Calinski–Harabasz (CH) index for a given clustering.
`MB`(X, clusters[, k, p, dist_kwargs])	Compute the Maulik-Bandyopadhyay index for a given clustering.
`SD_index`(X, clusters[, alpha, dist_kwargs])	Compute the SD index for a given clustering.
`SDbw_index`(X, clusters[, dist_kwargs])	Compute the SDbw index for a given clustering.
`davies_bouldin`(X, clusters[, p, dist_kwargs])	Compute the Davies-Bouldin (DB) index for a given clustering.
`dunn`(X, clusters[, dist_kwargs])	Compute the Dunn index for a given clustering.
`gap_statistic`(X, clusters[, k, B, ...])	Compute the Gap statistic for a given clustering.
`hartigan`(X, clusters[, k, clusters_next, ...])	Compute the Hartigan index for a given clustering.
`score_function`(X, clusters[, k])	Compute the score function for a given clustering.
`silhouette`(X, clusters)	Compute the silhouette score for a given clustering.
`xie_beni`(X, clusters[, dist_kwargs])	Compute the Xie-Beni index for a given clustering.
`xie_beni_star`(X, clusters[, dist_kwargs])	Compute the Xie-Beni* (XB*) index for a given clustering.
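
All of the functions listed here take `clusters` as a list with one entry per cluster, each entry holding the indices of the datapoints in that cluster. Many clustering tools return one flat label per datapoint instead, so a conversion step is often needed. The sketch below shows one way to do it; `labels_to_clusters` is a hypothetical helper written for illustration and is not part of pycvi.

```python
from collections import defaultdict
from typing import List, Sequence


def labels_to_clusters(labels: Sequence[int]) -> List[List[int]]:
    """Group datapoint indices by label (hypothetical helper, not in pycvi)."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    # Sort by label so the order of the clusters is deterministic.
    return [groups[label] for label in sorted(groups)]


labels = [0, 1, 0, 2, 1]
clusters = labels_to_clusters(labels)
print(clusters)  # [[0, 2], [1, 4], [3]]
```

The resulting list can then be passed as the `clusters` argument together with the dataset `X`.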

pycvi.cvi_func.gap_statistic(X: numpy.ndarray, clusters: List[List[int]], k: int = None, B: int = 10, zero_type: str = 'variance', rng=numpy.random.default_rng, return_s: bool = False) → Union[float, Tuple[float, float]]

Compute the Gap statistic for a given clustering.

Parameters:

X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset
clusters (List[List[int]]) – List of datapoint indices for each cluster.
k (int, optional) – Number of clusters.
B (int, optional) – Number of uniform samples drawn, defaults to 10.
zero_type (str, optional) –
Determines how to parametrize the uniform distribution to sample from in the case \(k=0\), by default “variance”. Possible options:
- “variance”: the uniform distribution is defined such that it has the same variance and mean as the original data.
- “bounds”: the uniform distribution is defined such that it has the same bounds as the original data.
rng (A numpy Random Generator, optional) – The numpy random generator to use to sample from the uniform distribution, by default np.random.default_rng(611)
return_s (bool, optional) – Whether to return s as well, defaults to False.

Returns:

The Gap statistic, together with s when return_s is True

Return type:

Union[float, Tuple[float, float]]
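
The two `zero_type` options differ only in how the bounds of the uniform reference distribution are chosen. The sketch below illustrates this for a single feature, using the fact that a uniform distribution on [a, b] has mean (a + b) / 2 and variance (b - a)^2 / 12. `uniform_bounds` is a hypothetical helper, and the use of the population variance is an assumption; pycvi's own implementation may differ in both respects.

```python
import math
import random
import statistics
from typing import List, Tuple


def uniform_bounds(values: List[float], zero_type: str = "variance") -> Tuple[float, float]:
    """Return (low, high) of a uniform distribution for one feature.

    Hypothetical helper for illustration, not part of pycvi.
    """
    if zero_type == "bounds":
        # Same bounds as the original data.
        return min(values), max(values)
    if zero_type == "variance":
        mean = statistics.fmean(values)
        # Assumption: population variance; another estimator could be used.
        var = statistics.pvariance(values)
        # U(a, b) has variance (b - a)^2 / 12, so the half-width is sqrt(3 * var).
        half_width = math.sqrt(3.0 * var)
        return mean - half_width, mean + half_width
    raise ValueError(f"unknown zero_type: {zero_type!r}")


values = [0.0, 1.0, 2.0, 3.0, 4.0]
low_v, high_v = uniform_bounds(values, "variance")
low_b, high_b = uniform_bounds(values, "bounds")

# Draw as many reference values as there are datapoints.
rng = random.Random(611)
sample = [rng.uniform(low_v, high_v) for _ in values]
```

With “variance” the sampled range can extend beyond the observed minimum and maximum, whereas “bounds” never does.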

pycvi.cvi_func.score_function(X: numpy.ndarray, clusters: List[List[int]], k: int = None) → float

Compute the score function for a given clustering.

Parameters:

X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset
clusters (List[List[int]]) – List of datapoint indices for each cluster.
k (int, optional) – Ignored. Kept for compatibility purposes.

Returns:

The score function index

Return type:

float

pycvi.cvi_func.hartigan(X: numpy.ndarray, clusters: List[List[int]], k: int = None, clusters_next: List[List[int]] = None, X1: numpy.ndarray = None, rng=numpy.random.default_rng) → float

Compute the Hartigan index for a given clustering.

Parameters:

X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset
clusters (List[List[int]]) – List of datapoint indices for each cluster.
k (int, optional) – Number of clusters.
clusters_next (List[List[int]], optional) – Clustering obtained with k+1 clusters.
X1 (np.ndarray, shape: (N, d*w_t) or (N, w_t, d), optional) – Dataset. Only relevant when k=0, in which case X is assumed to contain the values of all datapoints sampled from a uniform distribution.
rng (A numpy Random Generator, optional) – The numpy random generator to use to sample from the uniform distribution, by default np.random.default_rng(611)

Returns:

The Hartigan index

Return type:

float
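
To illustrate why `clusters_next` is needed, here is a minimal sketch of the textbook Hartigan index, (W_k / W_{k+1} - 1) * (N - k - 1), where W_k is the within-cluster sum of squares for k clusters. It uses plain Python lists and the squared Euclidean distance, ignores the k=0 case, and assumes non-empty clusters; `within_ss` and `hartigan_sketch` are hypothetical names, and pycvi's implementation may use a different variant or distance.

```python
from typing import List

Point = List[float]


def within_ss(X: List[Point], clusters: List[List[int]]) -> float:
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for members in clusters:
        dim = len(X[members[0]])
        center = [sum(X[i][d] for i in members) / len(members) for d in range(dim)]
        total += sum((X[i][d] - center[d]) ** 2 for i in members for d in range(dim))
    return total


def hartigan_sketch(
    X: List[Point], clusters: List[List[int]], clusters_next: List[List[int]]
) -> float:
    """Textbook Hartigan index: (W_k / W_{k+1} - 1) * (N - k - 1)."""
    n, k = len(X), len(clusters)
    ratio = within_ss(X, clusters) / within_ss(X, clusters_next)
    return (ratio - 1.0) * (n - k - 1)


X = [[0.0], [1.0], [10.0], [11.0]]
clusters = [[0, 1, 2, 3]]         # k = 1
clusters_next = [[0, 1], [2, 3]]  # k + 1 = 2
print(hartigan_sketch(X, clusters, clusters_next))  # 200.0
```

A large value indicates that moving from k to k+1 clusters reduces the within-cluster dispersion substantially.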

pycvi.cvi_func.silhouette(X: numpy.ndarray, clusters: List[List[int]]) → float

Compute the silhouette score for a given clustering.

Parameters:

X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset
clusters (List[List[int]]) – List of datapoint indices for each cluster.

Returns:

The silhouette score

Return type:

float
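
For reference, the textbook silhouette score can be sketched in a few lines. The version below uses the Euclidean distance on plain Python lists, requires at least two clusters, and assigns 0 to points in singleton clusters (a common convention). `silhouette_sketch` is a hypothetical name; this is not pycvi's implementation, which may rely on other distances.

```python
import math
from typing import List

Point = List[float]


def silhouette_sketch(X: List[Point], clusters: List[List[int]]) -> float:
    """Mean silhouette width with Euclidean distance (textbook definition)."""

    def mean_dist(i: int, members: List[int]) -> float:
        # Mean distance from point i to the members of a cluster, excluding i.
        others = [j for j in members if j != i]
        return sum(math.dist(X[i], X[j]) for j in others) / len(others)

    scores = []
    for c, members in enumerate(clusters):
        for i in members:
            if len(members) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            a = mean_dist(i, members)  # cohesion: own cluster
            b = min(  # separation: closest other cluster
                mean_dist(i, other)
                for c2, other in enumerate(clusters)
                if c2 != c
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


X = [[0.0], [1.0], [10.0], [11.0]]
clusters = [[0, 1], [2, 3]]
print(round(silhouette_sketch(X, clusters), 4))  # 0.8997
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned points.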

pycvi.cvi_func.CH(X: numpy.ndarray, clusters: List[List[int]], k: int = None, X1: numpy.ndarray = None, zero_type: str = 'variance', rng=numpy.random.default_rng, dist_kwargs: dict = {}) → float

Compute the Calinski–Harabasz (CH) index for a given clustering.

Parameters:

X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset
clusters (List[List[int]]) – List of datapoint indices for each cluster.
k (int, optional) – Number of clusters.
X1 (np.ndarray, shape: (N, d*w_t) or (N, w_t, d), optional) – Dataset. Only relevant when k=0, in which case X is assumed to contain the values of all datapoints sampled from a uniform distribution.
zero_type (str, optional) –
Determines how to parametrize the uniform distribution to sample from in the case \(k=0\), by default “variance”. Possible options:
- “variance”: the uniform distribution is defined such that it has the same variance and mean as the original data.
- “bounds”: the uniform distribution is defined such that it has the same bounds as the original data.
rng (A numpy Random Generator, optional) – The numpy random generator to use to sample from the uniform distribution, by default np.random.default_rng(611)
dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The CH index

Return type:

float
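
As a point of comparison, the textbook Calinski–Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: (B / (k - 1)) / (W / (N - k)). The sketch below implements only that formula with the squared Euclidean distance and needs 2 <= k < N; it does not cover the k=0 case handled through `X1`, `zero_type`, and `rng`. `mean_point`, `sq_dist`, and `ch_sketch` are hypothetical names, not pycvi functions.

```python
from typing import Iterable, List

Point = List[float]


def mean_point(X: List[Point], indices: Iterable[int]) -> Point:
    """Coordinate-wise mean of the selected datapoints."""
    idx = list(indices)
    dim = len(X[idx[0]])
    return [sum(X[i][d] for i in idx) / len(idx) for d in range(dim)]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def ch_sketch(X: List[Point], clusters: List[List[int]]) -> float:
    """Textbook CH index: (B / (k - 1)) / (W / (N - k))."""
    n, k = len(X), len(clusters)
    overall = mean_point(X, range(n))
    between = 0.0
    within = 0.0
    for members in clusters:
        center = mean_point(X, members)
        between += len(members) * sq_dist(center, overall)
        within += sum(sq_dist(X[i], center) for i in members)
    return (between / (k - 1)) / (within / (n - k))


X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
clusters = [[0, 1], [2, 3]]
print(ch_sketch(X, clusters))  # 50.0
```

Higher values correspond to clusterings whose centers are far apart relative to the spread inside each cluster.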

pycvi.cvi_func.MB(X: numpy.ndarray, clusters: List[List[int]], k: int = None, p: int = 2, dist_kwargs={}) → float

Compute the Maulik-Bandyopadhyay index for a given clustering.

Parameters:

X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset
clusters (List[List[int]]) – List of datapoint indices for each cluster.
k (int, optional) – Number of clusters.
p (int, optional) – Power used in the index formula, defaults to 2.
dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The Maulik-Bandyopadhyay index

Return type:

float

pycvi.cvi_func.SD_index(X: numpy.ndarray, clusters: List[List[int]], alpha: float = None, dist_kwargs={}) → float

Compute the SD index for a given clustering.

Parameters:

X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset
clusters (List[List[int]]) – List of datapoint indices for each cluster.
alpha (float, optional) – The constant in the SD index formula (=Dis(k_max)).
dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The SD index

Return type:

float

pycvi.cvi_func.SDbw_index(X: numpy.ndarray, clusters: List[List[int]], dist_kwargs={}) → float

Compute the SDbw index for a given clustering.

Parameters:

X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset
clusters (List[List[int]]) – List of datapoint indices for each cluster.
dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The SDbw index

Return type:

float

pycvi.cvi_func.dunn(X: numpy.ndarray, clusters: List[List[int]], dist_kwargs={}) → float

Compute the Dunn index for a given clustering.

Parameters:

X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset
clusters (List[List[int]]) – List of datapoint indices for each cluster.
dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The Dunn index

Return type:

float
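
The textbook Dunn index is the smallest distance between points of different clusters divided by the largest cluster diameter. The sketch below follows that definition with the Euclidean distance on plain Python lists; it needs at least two clusters and at least one cluster with two or more points. `dunn_sketch` is a hypothetical name, and pycvi's implementation may use other distances, configurable through `dist_kwargs`.

```python
import math
from itertools import combinations
from typing import List

Point = List[float]


def dunn_sketch(X: List[Point], clusters: List[List[int]]) -> float:
    """Textbook Dunn index: min inter-cluster distance / max cluster diameter."""
    # Largest distance between two points that share a cluster.
    max_diameter = max(
        math.dist(X[i], X[j])
        for members in clusters
        for i, j in combinations(members, 2)
    )
    # Smallest distance between two points from different clusters.
    min_separation = min(
        math.dist(X[i], X[j])
        for first, second in combinations(clusters, 2)
        for i in first
        for j in second
    )
    return min_separation / max_diameter


X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
clusters = [[0, 1], [2, 3]]
print(dunn_sketch(X, clusters))  # 5.0
```

Because it relies on a single minimum and a single maximum, this form of the index is sensitive to outliers.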

pycvi.cvi_func.xie_beni(X: numpy.ndarray, clusters: List[List[int]], dist_kwargs={}) → float

Compute the Xie-Beni index for a given clustering.

Parameters:

X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset
clusters (List[List[int]]) – List of datapoint indices for each cluster.
dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The Xie-Beni index

Return type:

float

pycvi.cvi_func.xie_beni_star(X: numpy.ndarray, clusters: List[List[int]], dist_kwargs={}) → float

Compute the Xie-Beni* (XB*) index for a given clustering.

Parameters:

X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset
clusters (List[List[int]]) – List of datapoint indices for each cluster.
dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The Xie-Beni* (XB*) index

Return type:

float

pycvi.cvi_func.davies_bouldin(X: numpy.ndarray, clusters: List[List[int]], p: int = 2, dist_kwargs={}) → float

Compute the Davies-Bouldin (DB) index for a given clustering.

Parameters:

X (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)) – Dataset
clusters (List[List[int]]) – List of datapoint indices for each cluster.
p (int, optional) – Integer parameter of the index (see the signature), defaults to 2.
dist_kwargs (dict, optional) – kwargs for the distance function, defaults to {}

Returns:

The Davies-Bouldin (DB) index

Return type:

float
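
A minimal sketch of the textbook Davies-Bouldin index follows: for each cluster, take the worst ratio of summed scatters to centroid distance over all other clusters, then average these ratios. It uses the Euclidean distance, requires at least two clusters with distinct centroids, and does not model the `p` argument. `centroid` and `davies_bouldin_sketch` are hypothetical names; pycvi's implementation may differ.

```python
import math
from typing import List

Point = List[float]


def centroid(X: List[Point], members: List[int]) -> Point:
    """Coordinate-wise mean of the datapoints in one cluster."""
    dim = len(X[members[0]])
    return [sum(X[i][d] for i in members) / len(members) for d in range(dim)]


def davies_bouldin_sketch(X: List[Point], clusters: List[List[int]]) -> float:
    """Textbook DB index with Euclidean distance (lower is better)."""
    centers = [centroid(X, members) for members in clusters]
    # Scatter: mean distance from the members of a cluster to its centroid.
    scatter = [
        sum(math.dist(X[i], center) for i in members) / len(members)
        for members, center in zip(clusters, centers)
    ]
    k = len(clusters)
    ratios = []
    for i in range(k):
        # Worst-case similarity between cluster i and any other cluster.
        ratios.append(
            max(
                (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                for j in range(k)
                if j != i
            )
        )
    return sum(ratios) / k


X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
clusters = [[0, 1], [2, 3]]
print(davies_bouldin_sketch(X, clusters))  # 0.2
```

Unlike the silhouette, Dunn, and CH scores, lower values are better here.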