pycvi.vi
Variation of Information, to evaluate clusterings and internal CVIs.
Variation of Information (VI) [2] can be used to evaluate clusterings and internal CVIs when the true clustering is known.
The most naive approach to evaluating a given CVI is to count how many times the CVI selects a clustering with the right number of clusters on benchmark datasets where the number of clusters is known. This approach has several flaws. In particular, a CVI can only pick from a set of pre-computed clusterings, so if all of these clusterings are of poor quality (for example when KMeans is used on concentric circles), whether or not the CVI finds the right number of clusters gives no information on the quality of the CVI.
One solution is to use the VI to weight this count of correctly selected numbers of clusters. The weight can be defined from the VI between the true clustering and the selected clustering, from the VI between the true clustering and the clustering obtained with the given clustering method when the true number of clusters is given as a parameter, or from a combination of both. Thus, a CVI won’t be penalised for missing the right number of clusters when none of the clusterings faithfully represented the clusters. Similarly, a CVI won’t be rewarded for finding the right number of clusters when the corresponding clustering does not represent the original distribution of the data well.
Main function
The main function of this module is
pycvi.vi.variation_information() that computes the variation of
information between two clusterings.
Functions
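pycvi.vi.P_clusters() | List of the probabilities of an outcome being in each cluster i. |
pycvi.vi.contingency_matrix() | Contingency matrix between two clusterings. |
pycvi.vi.entropy() | Entropy of the given clustering. |
pycvi.vi.mutual_information() | Mutual information between two clusterings. |
pycvi.vi.variation_information() | Variation of information between two clusterings. |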
- pycvi.vi.P_clusters(clustering: List[List[int]]) List[float]
List of the probabilities of an outcome being in each cluster i.
- Parameters:
clustering (List[List[int]]) – A given clustering
- Returns:
List of the probabilities of an outcome being in each cluster i
- Return type:
List[float]
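As an illustration of what such a probability list looks like, here is a minimal sketch. It assumes that each inner list of a clustering holds the datapoint indices of one cluster, and that the probability of cluster i is its size divided by the total number of datapoints. The name `cluster_probabilities` is hypothetical and this is not the pycvi implementation.

```python
from typing import List


def cluster_probabilities(clustering: List[List[int]]) -> List[float]:
    """Hypothetical helper: relative size of each cluster.

    Assumes each inner list contains the indices of the datapoints
    that belong to one cluster.
    """
    n_total = sum(len(cluster) for cluster in clustering)
    return [len(cluster) / n_total for cluster in clustering]


# 8 datapoints split into clusters of sizes 3, 1 and 4
example = [[0, 1, 2], [3], [4, 5, 6, 7]]
probs = cluster_probabilities(example)
print(probs)  # [0.375, 0.125, 0.5]
```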
- pycvi.vi.entropy(clustering: List[List[int]]) float
Entropy of the given clustering
Conventions: see “Elements of Information Theory” by Cover and Thomas, section 2.3 [1].
\(\log\) is in base 2 and entropy is measured in bits.
\(0 \times \log(0/0) = 0\).
\(0 \times \log(0/q) = 0\).
\(p \times \log(p/0) = +\infty\).
- Parameters:
clustering (List[List[int]]) – A given clustering
- Returns:
Entropy of the given clustering
- Return type:
float
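The entropy of a clustering can be sketched from the cluster probabilities as \(H = -\sum_i p_i \log_2 p_i\), with empty clusters contributing nothing, in line with the conventions listed above. The sketch below assumes that each inner list holds the datapoint indices of one cluster. The name `entropy_bits` is hypothetical and this is not the pycvi implementation.

```python
import math
from typing import List


def entropy_bits(clustering: List[List[int]]) -> float:
    """Hypothetical helper: Shannon entropy of a clustering, in bits."""
    n_total = sum(len(cluster) for cluster in clustering)
    h = 0.0
    for cluster in clustering:
        p = len(cluster) / n_total
        if p > 0:  # an empty cluster contributes 0 by convention
            h -= p * math.log2(p)
    return h


# Two equally sized clusters carry exactly one bit of information
print(entropy_bits([[0, 1], [2, 3]]))  # 1.0
```

A clustering made of a single cluster has zero entropy, since the cluster of any datapoint is known in advance.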
- pycvi.vi.contingency_matrix(clustering1: List[List[int]], clustering2: List[List[int]]) numpy.ndarray
Contingency matrix between two clusterings.
- Parameters:
clustering1 (List[List[int]]) – First clustering
clustering2 (List[List[int]]) – Second clustering
- Returns:
Contingency matrix between the two clusterings.
- Return type:
np.ndarray
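To make the notion concrete, the sketch below counts, for every pair of clusters (one from each clustering), how many datapoints they share. It assumes that each inner list holds the datapoint indices of one cluster and that both clusterings cover the same datapoints. To stay dependency-free it returns a nested list rather than a `numpy.ndarray`. The name `contingency_counts` is hypothetical and this is not the pycvi implementation.

```python
from typing import List


def contingency_counts(
    clustering1: List[List[int]], clustering2: List[List[int]]
) -> List[List[int]]:
    """Hypothetical helper: entry [i][j] is the number of datapoints
    shared by cluster i of clustering1 and cluster j of clustering2."""
    sets2 = [set(cluster) for cluster in clustering2]
    return [
        [sum(1 for x in cluster1 if x in s2) for s2 in sets2]
        for cluster1 in clustering1
    ]


c1 = [[0, 1], [2, 3]]
c2 = [[0], [1, 2, 3]]
table = contingency_counts(c1, c2)
print(table)  # [[1, 1], [0, 2]]
```

Row sums give the cluster sizes of the first clustering and column sums those of the second.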
- pycvi.vi.mutual_information(clustering1: List[List[int]], clustering2: List[List[int]]) float
Mutual information between two clusterings.
Conventions: see “Elements of Information Theory” by Cover and Thomas, section 2.3 [1].
\(\log\) is in base 2 and mutual information is measured in bits.
\(0 \times \log(0/0) = 0\).
\(0 \times \log(0/q) = 0\).
\(p \times \log(p/0) = +\infty\).
- Parameters:
clustering1 (List[List[int]]) – First clustering
clustering2 (List[List[int]]) – Second clustering
- Returns:
Mutual information between two clusterings.
- Return type:
float
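Mutual information can be sketched as \(I = \sum_{i,j} p_{ij} \log_2 \frac{p_{ij}}{p_i \, p_j}\), where \(p_{ij}\) is the fraction of datapoints shared by cluster i of the first clustering and cluster j of the second. Pairs with no shared datapoint are skipped, following the \(0 \times \log(0/q) = 0\) convention. The sketch assumes that each inner list holds the datapoint indices of one cluster and that both clusterings cover the same datapoints. The name `mutual_information_bits` is hypothetical and this is not the pycvi implementation.

```python
import math
from typing import List


def mutual_information_bits(
    clustering1: List[List[int]], clustering2: List[List[int]]
) -> float:
    """Hypothetical helper: mutual information of two clusterings, in bits."""
    n_total = sum(len(cluster) for cluster in clustering1)
    sets2 = [set(cluster) for cluster in clustering2]
    mi = 0.0
    for cluster1 in clustering1:
        p_i = len(cluster1) / n_total
        for s2 in sets2:
            p_j = len(s2) / n_total
            p_ij = sum(1 for x in cluster1 if x in s2) / n_total
            if p_ij > 0:  # empty intersections contribute 0 by convention
                mi += p_ij * math.log2(p_ij / (p_i * p_j))
    return mi


same = [[0, 1], [2, 3]]
crossed = [[0, 2], [1, 3]]
print(mutual_information_bits(same, same))     # 1.0
print(mutual_information_bits(same, crossed))  # 0.0
```

Identical clusterings share all their information (here one bit), while the two "crossed" clusterings are independent and share none.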
- pycvi.vi.variation_information(clustering1: List[List[int]], clustering2: List[List[int]]) float
Variation of information between two clusterings. [VI]
[VI] M. Meilă, Comparing Clusterings by the Variation of Information, pp. 173–187. Springer Berlin Heidelberg, 2003.
- Parameters:
clustering1 (List[List[int]]) – First clustering
clustering2 (List[List[int]]) – Second clustering
- Returns:
Variation of information between two clusterings.
- Return type:
float
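The variation of information can be written as \(VI(C_1, C_2) = H(C_1) + H(C_2) - 2 I(C_1, C_2)\), which equals the sum of the two conditional entropies \(H(C_1 | C_2) + H(C_2 | C_1)\). The sketch below uses the second form so that it stays self-contained. It assumes that each inner list holds the datapoint indices of one cluster, that both clusterings cover the same datapoints, and that logarithms are in base 2. The name `variation_of_information_bits` is hypothetical and this is not the pycvi implementation.

```python
import math
from typing import List


def variation_of_information_bits(
    clustering1: List[List[int]], clustering2: List[List[int]]
) -> float:
    """Hypothetical helper: VI = H(C1|C2) + H(C2|C1), in bits."""
    n_total = sum(len(cluster) for cluster in clustering1)
    sets2 = [set(cluster) for cluster in clustering2]
    vi = 0.0
    for cluster1 in clustering1:
        for s2 in sets2:
            n_ij = sum(1 for x in cluster1 if x in s2)
            if n_ij > 0:  # empty intersections contribute 0 by convention
                p_ij = n_ij / n_total
                # log2(p_ij / p_i) + log2(p_ij / p_j), written with counts
                vi -= p_ij * (
                    math.log2(n_ij / len(cluster1)) + math.log2(n_ij / len(s2))
                )
    return vi


truth = [[0, 1], [2, 3]]
crossed = [[0, 2], [1, 3]]
print(variation_of_information_bits(truth, truth))    # 0.0
print(variation_of_information_bits(truth, crossed))  # 2.0
```

A VI of 0 means the two clusterings are identical up to the order of the clusters, and larger values indicate clusterings that are further apart.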