pycvi.cvi
Python implementation of state-of-the-art internal CVIs.
Internal CVIs are used to select the best clustering among a set of pre-computed clusterings when neither the true clusters nor the number of clusters is known. To assess the quality of different clusterings, CVIs compute distances between datapoints, and most of them also rely on the concept of cluster center.
For static data, the distance function used to compute pairwise distances is usually the Euclidean distance, and the center of a group of datapoints is defined as the barycentric average. Time-series data, however, are usually compared using time-series-specific distances such as Dynamic Time Warping (DTW) [DTW], and the concept of average is non-trivial; it can for example be defined using the DTW Barycentric Average (DBA) [DBA].
PyCVI extends state-of-the-art internal CVIs to make them compatible with time-series data as well by using DTW and DBA when necessary.
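To make the difference between the two distances concrete, here is a minimal, standard-library sketch of the classic DTW recurrence on univariate series. It is only an illustration and not PyCVI's implementation; the names `dtw_distance` and `euclidean` are hypothetical helpers, and the choice of a squared local cost followed by a final square root is an assumption.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) DTW between two univariate series."""
    n, m = len(a), len(b)
    # cost[i][j] = cost of the best warping path aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return math.sqrt(cost[n][m])


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Point-by-point Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


s1 = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
s2 = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]  # same shape, shifted by one time step
print(euclidean(s1, s2), dtw_distance(s1, s2))
```

The two series have the same shape but are shifted in time: the Euclidean distance penalises the shift, whereas DTW can align the peaks and finds a zero-cost warping path.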
[DTW] D. J. Berndt and J. Clifford, “Using dynamic time warping to find patterns in time series,” in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, AAAIWS’94, pp. 359–370, AAAI Press, 1994.
[DBA] F. Petitjean, A. Ketterlin, and P. Gançarski, “A global averaging method for dynamic time warping, with applications to clustering,” Pattern Recognition, vol. 44, pp. 678–693, Mar. 2011.
[Hartigan] D. J. Strauss and J. A. Hartigan, “Clustering algorithms,” Biometrics, vol. 31, p. 793, Sep. 1975.
[CH] T. Calinski and J. Harabasz, “A dendrite method for cluster analysis,” Communications in Statistics - Theory and Methods, vol. 3, no. 1, pp. 1–27, 1974.
[Gap] R. Tibshirani, G. Walther, and T. Hastie, “Estimating the number of clusters in a data set via the gap statistic,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 63, pp. 411–423, July 2001.
[Silhouette] P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[Dunn] J. C. Dunn, “Well-separated clusters and optimal fuzzy partitions,” Journal of Cybernetics, vol. 4, pp. 95–104, Jan. 1974.
[DB] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, pp. 224–227, Apr. 1979.
[SD] M. Halkidi, M. Vazirgiannis, and Y. Batistakis, “Quality scheme assessment in the clustering process,” in Principles of Data Mining and Knowledge Discovery, pp. 265–276, Springer Berlin Heidelberg, 2000.
[SDbw] M. Halkidi and M. Vazirgiannis, “Clustering validity assessment: finding the optimal partitioning of a data set,” in Proceedings 2001 IEEE International Conference on Data Mining, pp. 187–194, IEEE Comput. Soc, 2001.
[XB] X. Xie and G. Beni, “A validity measure for fuzzy clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.
[SF] S. Saitta, B. Raphael, and I. F. C. Smith, “A bounded index for cluster validity,” in Machine Learning and Data Mining in Pattern Recognition, pp. 174–187, Springer Berlin Heidelberg, 2007.
[MB] U. Maulik and S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and validity indices,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 1650–1654, Dec. 2002.
Module Attributes
CVIs | List of available CVI indices in PyCVI, as pycvi.cvi classes.
Classes
CVI | Base class for all Cluster Validity Indices in PyCVI.
CVIAggregator | An aggregator of multiple CVIs.
CalinskiHarabasz | The Calinski-Harabasz index.
DB | The Davies-Bouldin index.
Diameter | The Diameter of a clustering.
Dunn | The Dunn index.
GapStatistic | The Gap statistic.
Hartigan | The Hartigan index.
Inertia | The inertia of a clustering.
MaulikBandyopadhyay | The Maulik-Bandyopadhyay index.
SD | The SD index.
SDbw | The SDbw index.
ScoreFunction | The Score function.
Silhouette | The Silhouette score.
XB | The Xie-Beni index.
XBStar | The Xie-Beni* index.
- class pycvi.cvi.CVI(cvi_function: callable = None, maximise: bool = True, improve: bool = True, cvi_type: str = 'monotonous', criterion_function: callable = None, k_condition: callable = None, ignore0: bool = False)
Base class for all Cluster Validity Indices in PyCVI.
To create a custom CVI class, inherit from this class with well defined parameters.
- Parameters:
cvi_function (callable, optional) – Function used to assess each clustering, by default None
maximise (bool, optional) – Determines whether higher values mean better clustering quality according to this CVI, by default True
improve (bool, optional) – Determines whether the quality of the clustering is expected to improve with increasing values of \(k\) (concerns only monotonous CVIs), by default True
cvi_type (str, optional) – Determines whether the CVI is to be interpreted as being “absolute”, “monotonous” or “original” (note that not all CVIs can have these 3 interpretations), by default “monotonous”
criterion_function (callable, optional) – Determines how the best clustering should be selected according to their corresponding CVI values, by default None
k_condition (callable, optional) – \(k\) values that are compatible with this CVI, by default None
ignore0 (bool, optional) – Determines how to treat the special case \(k=0\) (when available) when selecting the best clustering. This is for example used in the Hartigan index where we don’t use \(k=0\) as a reference score if \(k=0\) is more relevant than \(k=1\), by default False
- Raises:
ValueError – Raised if the cvi_type given is not among the available options for this CVI.
- cvi_types: List[str] = ['monotonous', 'absolute']
- reductions: List[str] = None
- get_cvi_kwargs(X_clus: numpy.ndarray = None, clusterings_t: Dict[int, List] = None, n_clusters: int = None, cvi_kwargs: dict = {}) dict
Get the kwargs parameters specific to the CVI.
Base method to override when defining a CVI whose function requires parameters in addition to the standard X and clusters, which represent respectively the data values (already processed) and the partition defining the clustering.
- Parameters:
X_clus (np.ndarray, shape (N, d*w_t) or (N, w_t, d), optional) – Dataset to cluster (already processed), by default None
clusterings_t (Dict[int, List], optional) – All the clusterings computed for the provided \(k\) range. Having an overview of the clusterings can be needed in some CVI such as the Hartigan index. By default None.
n_clusters (int, optional) – Current number of clusters considered, by default None
cvi_kwargs (dict, optional) – Pre-defined kwargs, typically the metric to use when computing the CVI values, by default {}
- Returns:
The dictionary of kwargs necessary to compute the CVI.
- Return type:
dict
- criterion(scores: Dict[int, float], cvi_type: str = None) Optional[int]
The default selection method for regular cases.
Regular cases include monotonous/absolute/pseudo-monotonous CVIs with maximise=True/False and improve=True/False.
Does not take into account rules that are specific to a CVI, such as those of the Gap statistic, Hartigan, etc.
Returns None if no clustering could be selected (probably because the CVI values were all None or NaN, or because all valid and relevant scores were equal).
- Parameters:
scores (Dict[int, float]) – The CVI values obtained for the provided \(k\) range.
cvi_type (str, optional) – The type of CVI to use in the selection scheme. Note that for most cases it is redundant with the attribute CVI.cvi_type but it may facilitate the selection for CVIs that use base cases with small adjustments, by default None.
- Returns:
The \(k\) value corresponding to the selected clustering. Returns None if no clustering could be selected.
- Return type:
Union[int, None]
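As an illustration of the behaviour described above for the absolute case only, the following standard-library sketch picks the best \(k\) from a dictionary of scores and returns None when nothing can be selected. It is a simplified assumption of how such a rule could look, not PyCVI's code: the function name `select_absolute` is hypothetical, the monotonous and pseudo-monotonous cases are not covered, and breaking ties in favour of the smallest \(k\) is a choice made for this sketch.

```python
import math
from typing import Dict, Optional


def select_absolute(
    scores: Dict[int, Optional[float]], maximise: bool = True
) -> Optional[int]:
    """Pick the k with the best score, or None if no choice is possible."""
    # Discard scores that carry no information.
    valid = {
        k: s for k, s in scores.items()
        if s is not None and not math.isnan(s)
    }
    # Nothing left, or every remaining score is identical: no selection.
    # (A single valid score is treated as "all equal" in this sketch.)
    if not valid or len(set(valid.values())) == 1:
        return None
    pick = max if maximise else min
    # Iterating over sorted keys makes ties resolve to the smallest k.
    return pick(sorted(valid), key=valid.get)


print(select_absolute({2: 0.3, 3: 0.7, 4: 0.5}))
print(select_absolute({2: 0.3, 3: 0.7, 4: 0.5}, maximise=False))
print(select_absolute({2: None, 3: float("nan")}))
```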
- is_relevant(score: float, k: int, score_prev: float, k_prev: int) bool
Determines if a score is relevant based on the CVI properties.
This is particularly useful for pseudo-monotonous CVIs, to know whether we should ignore a specific CVI value.
- Parameters:
score (float) – The current CVI value.
k (int) – The current \(k\) value.
score_prev (float) – The previous CVI value.
k_prev (int) – The previous \(k\) value (which must be smaller).
- Returns:
True if the current value is relevant, which means that it follows the expected scheme of CVI values given the properties of the CVI (its cvi_type, maximise and improve properties).
- Return type:
bool
- select(scores_t_k: Union[List[Dict[int, float]], Dict[int, float]], return_list: bool = False) Union[List[int], int]
Select the best clusterings according to the CVI values given.
Select the best \(k\) for each \(t\) according to the CVI values given. If the data is not time series or if the time series are clustered considering all time steps at once, then the returned list has only one element.
If no \(k\) could be selected given the scores_t_k values, a SelectionError is raised (check the values of scores_t_k for more information on why the error was raised).
- Parameters:
scores_t_k (Union[List[Dict[int, float]], Dict[int, float]]) – The CVI values for the provided \(k\) range and for the potential number \(t\) of iterations to consider in time.
return_list (bool, optional) – Determines whether the output should be forced to be a list, even when no sliding window is used, by default False.
- Returns:
The list of \(k\) values corresponding to the best clustering for each potential number \(t\) of iterations to consider in time. Some elements can be None if no clustering could be selected at a given iteration \(t\).
- Return type:
Union[List[int], int]
- Raises:
ValueError – If scores_t_k is empty or not of the right type (a list of dictionaries in the case of time series data clustered by sliding windows, or a dictionary otherwise).
SelectionError – If no clustering could be selected with the given CVI values (probably because the CVI values are None or NaN or all equal).
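The two accepted input shapes and the effect of return_list can be sketched as follows. This is a hedged, standard-library illustration rather than the library's implementation: `select_k`, `argmax_k` and the local `SelectionError` class are stand-ins defined here, the per-window rule is a plain argmax, and raising only when every window fails is an assumption.

```python
from typing import Callable, Dict, List, Optional, Union

Scores = Dict[int, float]


class SelectionError(Exception):
    """Local stand-in for the library's exception of the same name."""


def argmax_k(scores: Scores) -> Optional[int]:
    """Toy per-window rule: the k with the highest score."""
    return max(scores, key=scores.get) if scores else None


def select_k(
    scores_t_k: Union[List[Scores], Scores],
    pick: Callable[[Scores], Optional[int]] = argmax_k,
    return_list: bool = False,
) -> Union[List[Optional[int]], Optional[int]]:
    """Select one k per time window from a dict or a list of dicts."""
    if not isinstance(scores_t_k, (dict, list)) or not scores_t_k:
        raise ValueError("scores_t_k must be a non-empty dict or list of dicts")
    # A single dictionary is treated as a single time window.
    windows = [scores_t_k] if isinstance(scores_t_k, dict) else scores_t_k
    if not all(isinstance(w, dict) for w in windows):
        raise ValueError("every element of scores_t_k must be a dict")
    selected = [pick(w) for w in windows]
    if all(k is None for k in selected):
        raise SelectionError("no clustering could be selected")
    if len(selected) == 1 and not return_list:
        return selected[0]
    return selected


print(select_k({2: 0.1, 3: 0.4}))
print(select_k({2: 0.1, 3: 0.4}, return_list=True))
print(select_k([{2: 0.1, 3: 0.4}, {2: 0.9, 3: 0.4}]))
```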
- better_score(score1: float, score2: float, or_equal: bool = False) bool
Checks if a CVI value is better than another.
Takes into account the properties of the CVI (its cvi_type, maximise and improve properties).
- Parameters:
score1 (float) – The first CVI value
score2 (float) – The second CVI value
or_equal (bool, optional) – Determines whether 2 equal scores should yield True or not, by default False
- Returns:
True if score1 is better than score2.
- Return type:
bool
- argbest(scores: List[float], ignore_None: bool = False) int
Returns the index of the best score.
- Parameters:
scores (List[float]) – A list of CVI values
ignore_None (bool, optional) – If True, None values will be ignored, otherwise, a None value will be considered as the best score, by default False
- Returns:
The index of the best score
- Return type:
int
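The effect of ignore_None can be illustrated with a small standalone function. This is only a sketch under stated assumptions: it handles the maximise flag alone (not cvi_type or improve), the explicit `maximise` argument replaces what is an attribute of the CVI in the documented API, and returning the first index on ties is a choice made here.

```python
from typing import List, Optional


def argbest(
    scores: List[Optional[float]],
    maximise: bool = True,
    ignore_none: bool = False,
) -> int:
    """Index of the best score, with None optionally counted as best."""
    # Without ignore_none, a None entry wins outright.
    if not ignore_none and None in scores:
        return scores.index(None)
    candidates = [(i, s) for i, s in enumerate(scores) if s is not None]
    if not candidates:
        raise ValueError("no usable score")
    best_i, best_s = candidates[0]
    for i, s in candidates[1:]:
        is_better = s > best_s if maximise else s < best_s
        if is_better:
            best_i, best_s = i, s
    return best_i


print(argbest([0.2, None, 0.9]))
print(argbest([0.2, None, 0.9], ignore_none=True))
print(argbest([0.2, 0.1, 0.9], maximise=False))
```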
- best_score(scores: List[float], ignore_None: bool = False) float
Returns the best score.
- Parameters:
scores (List[float]) – A list of CVI values
ignore_None (bool, optional) – If True, None values will be ignored, otherwise, a None value will be considered as the best score, by default False
- Returns:
The best score
- Return type:
float
- argworst(scores: List[float]) int
Returns the index of the worst score.
- Parameters:
scores (List[float]) – A list of CVI values
- Returns:
The index of the worst score
- Return type:
int
- worst_score(scores: List[float]) float
Returns the worst score.
- Parameters:
scores (List[float]) – A list of CVI values
- Returns:
The worst score
- Return type:
float
- class pycvi.cvi.CVIAggregator(cvi_classes: Optional[List[CVI]] = None, cvi_kwargs: Optional[List[dict]] = None)
An aggregator of multiple CVIs.
- Parameters:
cvi_classes (Union[List[CVI], None], optional) – List of CVIs to aggregate to select the best clustering. Defaults to None, in which case all CVIs implemented in PyCVI are used with their default parameters (see pycvi.cvi.CVIs).
cvi_kwargs (Union[List[dict], None], optional) – List of CVI-specific kwargs to give to the corresponding CVI. Defaults to None, in which case each CVI uses its default parameters. If a list is given, its length must match the number of CVIs used in cvi_classes.
- Raises:
ValueError – Raised if the lengths of cvi_classes and cvi_kwargs are not consistent.
- select(scores_i_t_k: Union[List[List[Dict[int, float]]], List[Dict[int, float]]], return_list: bool = False) Union[List[int], int]
Select the best clusterings according to the CVI values given.
Select the best \(k\) for each \(t\) according to a majority vote among the clusterings selected by each CVI. Each CVI selects \(k\) based on its own CVI values and selection rule. In case of a tie for the best clustering, the clustering with fewer clusters is selected. If the data is not time series or if the time series are clustered considering all time steps at once, then the returned list has only one element.
If no \(k\) could be selected given the scores_i_t_k values because no CVI could select a clustering, a SelectionError is raised (check the values of scores_i_t_k for more information on why the error was raised).
After calling this function, all votes are available in the CVIAggregator.votes property and each individual selected \(k\) is available in the CVIAggregator.all_selected_k property.
- Parameters:
scores_i_t_k (Union[List[List[Dict[int, float]]], List[Dict[int, float]]]) – The CVI values for the provided \(k\) range and for the potential number \(t\) of iterations to consider in time and for each CVI aggregated.
return_list (bool, optional) – Determines whether the output should be forced to be a list, even when no sliding window is used, by default False.
- Returns:
The list of \(k\) values corresponding to the best clustering for each potential number \(t\) of iterations to consider in time. Some elements can be None if no clustering could be selected at a given iteration \(t\).
- Return type:
Union[List[int], int]
- Raises:
SelectionError – If no clustering could be selected with the given CVIs and their corresponding CVI values.
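The voting rule described for this method (majority vote, ties resolved in favour of fewer clusters) can be sketched in a few lines. The function `majority_vote` is a hypothetical helper written for illustration; it takes the \(k\) already selected by each CVI and assumes that CVIs that could not select anything report None and do not vote.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(selected_k: List[Optional[int]]) -> Optional[int]:
    """Return the most voted k; on a tie, the smallest k wins."""
    # CVIs that could not select a clustering do not vote.
    votes = Counter(k for k in selected_k if k is not None)
    if not votes:
        return None
    top = max(votes.values())
    return min(k for k, count in votes.items() if count == top)


print(majority_vote([3, 3, 4, 2, 4]))  # 3 and 4 tie with two votes each
print(majority_vote([5, 2, 2, None]))
```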
- class pycvi.cvi.Hartigan(cvi_type: str = 'monotonous')
The Hartigan index. [Hartigan]
Originally, this index is absolute and the selection criterion is as follows:
According to de Amorim and Hennig [2015], “In the original paper, the lowest \(k\) to yield Hartigan \(\leq 10\) was proposed as the optimal choice. However, if no \(k\) meets this criteria, choose the \(k\) whose difference in Hartigan for \(k\) and \(k+1\) is the smallest”. According to Tibshirani et al. [2001], “the estimated number of clusters is the smallest \(k\) such that Hartigan \(\leq 10\)”, and \(k=1\) could then be possible.
A monotonous approach can also be taken.
Possible cvi_type values: “monotonous” or “original”.
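The original selection rule quoted above can be written down as a short function. This is a sketch of the rule as stated, not of PyCVI's internals: `select_hartigan` is a hypothetical name, the "difference" is taken as an absolute difference, and consecutive available \(k\) values are used when \(k+1\) is missing; both are assumptions.

```python
from typing import Dict, Optional


def select_hartigan(
    scores: Dict[int, float], threshold: float = 10.0
) -> Optional[int]:
    """Smallest k with Hartigan <= threshold, else smallest consecutive gap."""
    ks = sorted(scores)
    for k in ks:
        if scores[k] <= threshold:
            return k
    # Fallback: the k whose score differs least from the next one.
    diffs = {
        k: abs(scores[k] - scores[k_next])
        for k, k_next in zip(ks, ks[1:])
    }
    if not diffs:
        return None
    return min(diffs, key=diffs.get)


print(select_hartigan({1: 50.0, 2: 30.0, 3: 8.0, 4: 2.0}))
print(select_hartigan({1: 50.0, 2: 30.0, 3: 28.0, 4: 15.0}))
```

In the first call, \(k=3\) is the first value at or below the threshold. In the second call no value passes, so the fallback picks \(k=2\), whose score is closest to that of \(k=3\).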
- Parameters:
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “monotonous”.
- cvi_types: List[str] = ['monotonous', 'original']
- get_cvi_kwargs(X_clus: numpy.ndarray = None, clusterings_t: Dict[int, List] = None, n_clusters: int = None, cvi_kwargs: dict = {}) dict
Get the kwargs parameters specific to Hartigan.
Hartigan has 3 additional parameters:
k (int): the current number of clusters.
clusters_next (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)): the clustering for the next \(k\) value considered.
X1 (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)): the dataset to cluster (already processed). This is needed for the case \(k=0\), and in that case X_clus is sampled from a uniform distribution with similar parameters as the original distribution.
- Parameters:
X_clus (np.ndarray, shape (N, d*w_t) or (N, w_t, d), optional) – Dataset to cluster (already processed), by default None
clusterings_t (Dict[int, List], optional) – All the clusterings computed for the provided \(k\) range. Having an overview of the clusterings can be needed in some CVI such as the Hartigan index. By default None.
n_clusters (int, optional) – Current number of clusters considered, by default None
cvi_kwargs (dict, optional) – Pre-defined kwargs, typically the metric to use when computing the CVI values, by default {}
- Returns:
The dictionary of kwargs necessary to compute the CVI.
- Return type:
dict
- class pycvi.cvi.CalinskiHarabasz(cvi_type: str = 'original')
The Calinski-Harabasz index. [CH]
Originally, this index is absolute and has to be maximised to find the best \(k\). A monotonous approach can also be taken, so that the case \(k=1\) can be selected, with CH(1) = 0 and CH(0) extended (see pycvi.cvi.CH).
Possible cvi_type values: “monotonous”, “absolute” or “original”.
- Parameters:
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “original”.
- cvi_types: List[str] = ['monotonous', 'original', 'absolute']
- get_cvi_kwargs(X_clus: numpy.ndarray = None, clusterings_t: Dict[int, List] = None, n_clusters: int = None, cvi_kwargs: dict = {}) dict
Get the kwargs parameters specific to Calinski-Harabasz.
Calinski-Harabasz has 3 additional parameters:
k (int): the current number of clusters.
X1 (np.ndarray, shape: (N, d*w_t) or (N, w_t, d)): the dataset to cluster (already processed). This is needed for the case \(k=0\), and in that case X_clus is sampled from a uniform distribution with similar parameters as the original distribution
zero_type (str): determines how to parametrize the uniform distribution to sample from in the case \(k=0\). Possible options:
“variance”: the uniform distribution is defined such that it has the same variance and mean as the original data.
“bounds”: the uniform distribution is defined such that it has the same bounds as the original data.
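The two zero_type options can be made concrete for a single feature. A uniform distribution on \([a, b]\) has mean \((a+b)/2\) and variance \((b-a)^2/12\), so matching a mean \(\mu\) and variance \(\sigma^2\) gives \(a = \mu - \sqrt{3}\sigma\) and \(b = \mu + \sqrt{3}\sigma\). The sketch below applies this to a one-dimensional list of values; the helper names are hypothetical, the use of the population variance is an assumption, and how PyCVI handles multivariate or time-series data is not covered.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def uniform_bounds(values: Sequence[float], zero_type: str) -> Tuple[float, float]:
    """Bounds (low, high) of the reference uniform distribution."""
    if zero_type == "bounds":
        # Same support as the data.
        return min(values), max(values)
    if zero_type == "variance":
        # Same mean and variance as the data: half-width = sqrt(3 * var).
        mean = statistics.fmean(values)
        half_width = math.sqrt(3.0 * statistics.pvariance(values))
        return mean - half_width, mean + half_width
    raise ValueError(f"unknown zero_type: {zero_type!r}")


def sample_reference(
    values: Sequence[float], zero_type: str = "bounds", seed: int = 0
) -> List[float]:
    """Draw as many uniform samples as there are data values."""
    low, high = uniform_bounds(values, zero_type)
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in values]


data = [0.0, 2.0, 4.0]
print(uniform_bounds(data, "bounds"))
print(uniform_bounds(data, "variance"))
```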
- Parameters:
X_clus (np.ndarray, shape (N, d*w_t) or (N, w_t, d), optional) – Dataset to cluster (already processed), by default None
clusterings_t (Dict[int, List], optional) – All the clusterings computed for the provided \(k\) range. Having an overview of the clusterings can be needed in some CVI such as the Hartigan index. By default None.
n_clusters (int, optional) – Current number of clusters considered, by default None
cvi_kwargs (dict, optional) – Pre-defined kwargs, typically the metric to use when computing the CVI values, by default {}
- Returns:
The dictionary of kwargs necessary to compute the CVI.
- Return type:
dict
- class pycvi.cvi.GapStatistic(cvi_type: str = 'original')
The Gap statistic. [Gap]
Originally, this index is absolute and the selection criterion is as follows:
Take the smallest \(k\) such that \(Gap(k) \geq Gap(k+1) - s(k+1)\).
A monotonous approach can also be taken.
Possible cvi_type values: “monotonous”, or “original”.
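The original criterion, the smallest \(k\) such that \(Gap(k) \geq Gap(k+1) - s(k+1)\), translates directly into a loop over increasing \(k\). The sketch below assumes the Gap and \(s\) values have already been computed and are given as dictionaries keyed by \(k\); `select_gap` is a hypothetical helper, and returning None when no \(k\) satisfies the inequality is an assumption.

```python
from typing import Dict, Optional


def select_gap(gap: Dict[int, float], s: Dict[int, float]) -> Optional[int]:
    """Smallest k such that Gap(k) >= Gap(k+1) - s(k+1)."""
    for k in sorted(gap):
        # The rule needs both Gap(k+1) and s(k+1) to be available.
        if k + 1 in gap and k + 1 in s:
            if gap[k] >= gap[k + 1] - s[k + 1]:
                return k
    return None


gap_values = {1: 0.10, 2: 0.35, 3: 0.60, 4: 0.62, 5: 0.55}
s_values = {k: 0.05 for k in gap_values}
print(select_gap(gap_values, s_values))
```

With these illustrative values, \(k=3\) is selected because the gain from 3 to 4 clusters (0.02) is smaller than \(s(4) = 0.05\).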
- Parameters:
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “original”.
- cvi_types: List[str] = ['monotonous', 'original']
- get_cvi_kwargs(X_clus: numpy.ndarray = None, clusterings_t: Dict[int, List] = None, n_clusters: int = None, cvi_kwargs: dict = {}) dict
Get the kwargs parameters specific to Gap statistic.
Gap statistic has 4 additional parameters:
k (int): the current number of clusters.
B (int): the number of uniform samples drawn.
zero_type (str): determines how to parametrize the uniform distribution to sample from in the case \(k=0\). Possible options:
“variance”: the uniform distribution is defined such that it has the same variance and mean as the original data.
“bounds”: the uniform distribution is defined such that it has the same bounds as the original data.
return_s (bool): determines whether the s value should also be returned
- Parameters:
X_clus (np.ndarray, shape (N, d*w_t) or (N, w_t, d), optional) – Dataset to cluster (already processed), by default None
clusterings_t (Dict[int, List], optional) – All the clusterings computed for the provided \(k\) range. Having an overview of the clusterings can be needed in some CVI such as the Hartigan index. By default None.
n_clusters (int, optional) – Current number of clusters considered, by default None
cvi_kwargs (dict, optional) – Pre-defined kwargs, typically the metric to use when computing the CVI values, by default {}
- Returns:
The dictionary of kwargs necessary to compute the CVI.
- Return type:
dict
- class pycvi.cvi.Silhouette(cvi_type: str = 'absolute')
The Silhouette score. [Silhouette]
This index is absolute, bounded in the range \([-1, 1]\), and has to be maximised.
Possible cvi_type values: “absolute”.
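For reference, the usual definition of the silhouette is \(s(i) = (b(i) - a(i)) / \max(a(i), b(i))\), where \(a(i)\) is the mean distance from point \(i\) to the other members of its cluster and \(b(i)\) is the smallest mean distance to another cluster. The sketch below computes the mean silhouette with Euclidean distances using only the standard library. It is an illustrative assumption, not PyCVI's implementation (which can also use DTW), and `silhouette_score` here is a local function.

```python
import math
from typing import List, Sequence


def silhouette_score(
    points: Sequence[Sequence[float]], labels: Sequence[int]
) -> float:
    """Mean silhouette over all points, with Euclidean distances."""
    cluster_ids = sorted(set(labels))
    if len(cluster_ids) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    members = {
        c: [i for i, lab in enumerate(labels) if lab == c]
        for c in cluster_ids
    }

    def mean_dist(i: int, others: List[int]) -> float:
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in members[label] if j != i]
        if not own:
            continue  # singleton cluster: s(i) = 0 by convention
        a = mean_dist(i, own)
        b = min(mean_dist(i, members[c]) for c in cluster_ids if c != label)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(pts, [0, 0, 1, 1])  # compact, well-separated groups
bad = silhouette_score(pts, [0, 1, 0, 1])   # groups mix near and far points
print(good, bad)
```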
- Parameters:
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “absolute”.
- cvi_types: List[str] = ['absolute']
- class pycvi.cvi.ScoreFunction(cvi_type: str = 'original')
The Score function. [SF]
This index has to be maximised to find the best clustering, but the original paper (Saitta et al. [2007]) adds special cases:
if the score always increases, then \(k = 1\) is chosen.
if a maximum is found, outside the extreme \(k\) values, then the argument of this maximum is chosen.
it is empirically decided that if \((SF_2 - SF_1) \times d \leq 0.2\), then \(k = 1\) is also chosen (\(d\) being the dimensionality of the data).
The absolute case can also be chosen (i.e. special cases are ignored).
Possible cvi_type values: “absolute”, or “original”.
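The special cases listed above can be combined into one selection function. The sketch below is one possible reading, not the library's code: `select_score_function` is a hypothetical name, the order in which the rules are checked is an assumption, and falling back to a plain argmax when no special case applies is also an assumption.

```python
from typing import Dict


def select_score_function(sf: Dict[int, float], d: int) -> int:
    """Select k from Score function values, with the original special cases."""
    ks = sorted(sf)
    # Empirical rule: a small gain from k=1 to k=2 means a single cluster.
    if 1 in sf and 2 in sf and (sf[2] - sf[1]) * d <= 0.2:
        return 1
    # A score that always increases also leads to k=1.
    values = [sf[k] for k in ks]
    if all(nxt > cur for cur, nxt in zip(values, values[1:])):
        return 1
    # Otherwise take the argument of the maximum.
    return max(ks, key=lambda k: sf[k])


print(select_score_function({1: 0.1, 2: 0.4, 3: 0.8, 4: 0.5}, d=2))
print(select_score_function({1: 0.1, 2: 0.15, 3: 0.8, 4: 0.5}, d=2))
print(select_score_function({1: 0.1, 2: 0.4, 3: 0.6, 4: 0.9}, d=2))
```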
- Parameters:
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “original”.
- cvi_types: List[str] = ['absolute', 'original']
- get_cvi_kwargs(X_clus: numpy.ndarray = None, clusterings_t: Dict[int, List] = None, n_clusters: int = None, cvi_kwargs: dict = {}) dict
Get the kwargs parameters specific to Score Function.
Score Function has no additional parameters, but \(k\) is used to distinguish between the case \(k=0\) and \(k=1\), to make sure that the case \(k=0\) is never computed.
- Parameters:
X_clus (np.ndarray, shape (N, d*w_t) or (N, w_t, d), optional) – Dataset to cluster (already processed), by default None
clusterings_t (Dict[int, List], optional) – All the clusterings computed for the provided \(k\) range. Having an overview of the clusterings can be needed in some CVI such as the Hartigan index. By default None.
n_clusters (int, optional) – Current number of clusters considered, by default None
cvi_kwargs (dict, optional) – Pre-defined kwargs, typically the metric to use when computing the CVI values, by default {}
- Returns:
The dictionary of kwargs necessary to compute the CVI.
- Return type:
dict
- class pycvi.cvi.MaulikBandyopadhyay(cvi_type: str = 'absolute')
The Maulik-Bandyopadhyay index. [MB]
Originally, this index is absolute and has to be maximised to find the best \(k\).
A monotonous approach can also be taken.
Possible cvi_type values: “monotonous”, or “absolute”.
Note that the case \(k=1\) always returns 0.
- Parameters:
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “absolute”.
- cvi_types: List[str] = ['absolute', 'monotonous']
- get_cvi_kwargs(X_clus: numpy.ndarray = None, clusterings_t: Dict[int, List] = None, n_clusters: int = None, cvi_kwargs: dict = {}) dict
Get the kwargs parameters specific to the CVI.
Base method to override when defining a CVI whose function requires parameters in addition to the standard X and clusters, which represent respectively the data values (already processed) and the partition defining the clustering.
- Parameters:
X_clus (np.ndarray, shape (N, d*w_t) or (N, w_t, d), optional) – Dataset to cluster (already processed), by default None
clusterings_t (Dict[int, List], optional) – All the clusterings computed for the provided \(k\) range. Having an overview of the clusterings can be needed in some CVI such as the Hartigan index. By default None.
n_clusters (int, optional) – Current number of clusters considered, by default None
cvi_kwargs (dict, optional) – Pre-defined kwargs, typically the metric to use when computing the CVI values, by default {}
- Returns:
The dictionary of kwargs necessary to compute the CVI.
- Return type:
dict
- class pycvi.cvi.SD(cvi_type: str = 'absolute')
The SD index. [SD]
This index is absolute and has to be minimised to find the best \(k\).
Note that if two clusters have equal centroids, then SD = inf which means that this clustering is irrelevant, which works as intended (even though two clusters could be well separated and still have equal centroids, as in the case of two concentric circles).
The case \(k=1\) is not possible.
Possible cvi_type values: “absolute”.
- Parameters:
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “absolute”.
- cvi_types: List[str] = ['absolute']
- get_cvi_kwargs(X_clus: numpy.ndarray = None, clusterings_t: Dict[int, List] = None, n_clusters: int = None, cvi_kwargs: dict = {}) dict
Get the kwargs parameters specific to the CVI.
Base method to override when defining a CVI whose function requires parameters in addition to the standard X and clusters, which represent respectively the data values (already processed) and the partition defining the clustering.
- Parameters:
X_clus (np.ndarray, shape (N, d*w_t) or (N, w_t, d), optional) – Dataset to cluster (already processed), by default None
clusterings_t (Dict[int, List], optional) – All the clusterings computed for the provided \(k\) range. Having an overview of the clusterings can be needed in some CVI such as the Hartigan index. By default None.
n_clusters (int, optional) – Current number of clusters considered, by default None
cvi_kwargs (dict, optional) – Pre-defined kwargs, typically the metric to use when computing the CVI values, by default {}
- Returns:
The dictionary of kwargs necessary to compute the CVI.
- Return type:
dict
- class pycvi.cvi.SDbw(cvi_type: str = 'absolute')
The SDbw index. [SDbw]
This index is absolute and has to be minimised to find the best \(k\).
Note that if two clusters have all their datapoints further away from their respective centroids than what the original paper calls “the average standard deviation of clusters”, then SDbw = inf, which means that this clustering is considered irrelevant; this works as intended.
The case \(k=1\) is not possible.
Possible cvi_type values: “absolute”.
- Parameters:
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “absolute”.
- cvi_types: List[str] = ['absolute']
- class pycvi.cvi.Dunn(cvi_type: str = 'absolute')
The Dunn index. [Dunn]
This index is absolute and has to be maximised to find the best \(k\).
The case \(k=1\) is not possible.
Possible cvi_type values: “absolute”.
- Parameters:
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “absolute”.
- cvi_types: List[str] = ['absolute']
- class pycvi.cvi.XB(cvi_type: str = 'absolute')
The Xie-Beni index. [XB]
This index is absolute and has to be minimised to find the best \(k\).
The case \(k=1\) is not possible.
Possible cvi_type values: “absolute”.
- Parameters:
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “absolute”.
- cvi_types: List[str] = ['absolute']
- class pycvi.cvi.XBStar(cvi_type: str = 'absolute')
The Xie-Beni* index. [XB*]
This index is absolute and has to be minimised to find the best \(k\).
The case \(k=1\) is not possible.
Possible cvi_type values: “absolute”.
- Parameters:
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “absolute”.
- cvi_types: List[str] = ['absolute']
- class pycvi.cvi.DB(cvi_type: str = 'absolute')
The Davies-Bouldin index. [DB]
This index is absolute and has to be minimised to find the best \(k\).
The case \(k=1\) is not possible.
Possible cvi_type values: “absolute”.
- Parameters:
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “absolute”.
- cvi_types: List[str] = ['absolute']
- class pycvi.cvi.Inertia(reduction: Union[str, callable] = 'sum', cvi_type: str = 'monotonous')
The inertia of a clustering.
This index is monotonous and smaller values are considered better.
Possible cvi_type values: “monotonous”.
- Parameters:
reduction (str, optional) – Determines how to combine the inertia values of each cluster to compute the inertia of the whole clustering, by default “sum”. Available options: “sum”, “mean”, “max”, “median”, “min”, “”, None or a callable. See pycvi.compute_scores.reduce for more information.
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “monotonous”.
- cvi_types: List[str] = ['monotonous']
- reductions: List[str] = ['sum', 'mean', 'max', 'median', 'min', '', None]
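The role of the reduction parameter can be illustrated with per-cluster inertia values combined in different ways. The sketch below is an assumption-laden illustration rather than PyCVI's `pycvi.compute_scores.reduce`: the helper names are hypothetical, inertia is taken as the sum of squared Euclidean distances to the centroid, and treating "" and None as "return the per-cluster values unreduced" is an assumption.

```python
import statistics
from typing import Callable, List, Optional, Sequence, Union

Point = Sequence[float]

REDUCERS = {
    "sum": sum,
    "mean": statistics.fmean,
    "max": max,
    "median": statistics.median,
    "min": min,
}


def cluster_inertia(points: Sequence[Point]) -> float:
    """Sum of squared Euclidean distances to the cluster centroid."""
    dim = len(points[0])
    centroid = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    return sum(
        sum((p[i] - centroid[i]) ** 2 for i in range(dim)) for p in points
    )


def reduce_values(
    values: List[float], reduction: Union[str, Callable, None] = "sum"
) -> Union[float, List[float]]:
    """Combine per-cluster values into one value for the whole clustering."""
    if reduction in ("", None):
        return list(values)  # assumption: no reduction, keep one value per cluster
    if callable(reduction):
        return reduction(values)
    return REDUCERS[reduction](values)


clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(5.0, 5.0), (5.0, 9.0)],
]
per_cluster = [cluster_inertia(c) for c in clusters]
print(per_cluster)
print(reduce_values(per_cluster, "sum"), reduce_values(per_cluster, "max"))
```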
- class pycvi.cvi.Diameter(reduction: Union[str, callable] = 'max', cvi_type: str = 'monotonous')
The Diameter of a clustering.
This index is monotonous and smaller values are considered better.
Possible cvi_type values: “monotonous”.
- Parameters:
reduction (str, optional) – Determines how to combine the diameter values of each cluster to compute the diameter of the whole clustering, by default “max”. Available options: “sum”, “mean”, “max”, “median”, “min”, “”, None or a callable. See pycvi.compute_scores.reduce for more information.
cvi_type (str, optional) – Determines how the index should be interpreted, when selecting the best clustering, by default “monotonous”.
- cvi_types: List[str] = ['monotonous']
- reductions: List[str] = ['sum', 'mean', 'max', 'median', 'min', '', None]
- pycvi.cvi.CVIs = [<class 'pycvi.cvi.Hartigan'>, <class 'pycvi.cvi.CalinskiHarabasz'>, <class 'pycvi.cvi.GapStatistic'>, <class 'pycvi.cvi.Silhouette'>, <class 'pycvi.cvi.ScoreFunction'>, <class 'pycvi.cvi.MaulikBandyopadhyay'>, <class 'pycvi.cvi.SD'>, <class 'pycvi.cvi.SDbw'>, <class 'pycvi.cvi.Dunn'>, <class 'pycvi.cvi.XB'>, <class 'pycvi.cvi.XBStar'>, <class 'pycvi.cvi.DB'>, <class 'pycvi.cvi.Inertia'>, <class 'pycvi.cvi.Diameter'>]
List of available CVI indices in PyCVI, as pycvi.cvi classes.
Hartigan: pycvi.cvi.Hartigan
CalinskiHarabasz: pycvi.cvi.CalinskiHarabasz
GapStatistic: pycvi.cvi.GapStatistic
Silhouette: pycvi.cvi.Silhouette
ScoreFunction: pycvi.cvi.ScoreFunction
MaulikBandyopadhyay: pycvi.cvi.MaulikBandyopadhyay
SD: pycvi.cvi.SD
SDbw: pycvi.cvi.SDbw
Dunn: pycvi.cvi.Dunn
XB: pycvi.cvi.XB
XBStar: pycvi.cvi.XBStar
DB: pycvi.cvi.DB
Inertia: pycvi.cvi.Inertia
Diameter: pycvi.cvi.Diameter