pycvi.vi

Variation of Information, to evaluate clusterings and internal CVIs.

Variation of Information (VI) [2] can be used to evaluate clusterings and internal CVIs when the true clustering is known.

The most naive approach to evaluating a given CVI is to count how many times the CVI selects a clustering with the right number of clusters on benchmark datasets where the number of clusters is known. This approach has several flaws. In particular, a CVI can only pick from a set of pre-computed clusterings. If all of these clusterings are of poor quality (for example when KMeans is used on concentric circles), then whether or not the CVI finds the right number of clusters gives no information about the quality of the CVI.

One solution is to use the VI to weight this count of correctly selected numbers of clusters. The weight can be defined by computing the VI between the true clustering and the selected clustering, or between the true clustering and the clustering obtained with the given clustering method when the true number of clusters is given as a parameter, or a combination of both. Thus, a CVI won’t be penalised for missing the right number of clusters when none of the clusterings faithfully represents the true clusters. Similarly, a CVI won’t be rewarded for finding the right number of clusters when the corresponding clustering does not represent the original distribution of the data well.

Main function

The main function of this module is pycvi.vi.variation_information(), which computes the variation of information between two clusterings.

Functions

P_clusters(clustering)

List of the probabilities of an outcome being in each cluster i.

contingency_matrix(clustering1, clustering2)

Contingency matrix between two clusterings.

entropy(clustering)

Entropy of the given clustering.

mutual_information(clustering1, clustering2)

Mutual information between two clusterings.

variation_information(clustering1, clustering2)

Variation of information between two clusterings.

pycvi.vi.P_clusters(clustering: List[List[int]]) → List[float]

List of the probabilities of an outcome being in each cluster i.

Parameters:

clustering (List[List[int]]) – A given clustering

Returns:

List of the probabilities of an outcome being in each cluster i

Return type:

List[float]
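As an illustration of what such a list contains, here is a minimal sketch that assumes a clustering is a list of clusters, each holding datapoint indices, and that the probability of cluster i is its share of the datapoints. The helper name `cluster_probabilities` is hypothetical and is not part of pycvi.

```python
from typing import List


def cluster_probabilities(clustering: List[List[int]]) -> List[float]:
    """Hypothetical helper: P(i) = |C_i| / N for each cluster C_i."""
    # Total number of datapoints, assuming each index is in exactly one cluster
    n_total = sum(len(cluster) for cluster in clustering)
    return [len(cluster) / n_total for cluster in clustering]


# Three clusters over 8 datapoints
clustering = [[0, 1], [2, 3, 4, 5], [6, 7]]
probs = cluster_probabilities(clustering)
print(probs)  # [0.25, 0.5, 0.25]
```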

pycvi.vi.entropy(clustering: List[List[int]]) → float

Entropy of the given clustering.

Conventions: see “Elements of Information Theory” by Cover and Thomas, section 2.3 [1].

  • \(\log\) is in base 2 and entropy is measured in bits.

  • \(0 \times \log(0/0) = 0\).

  • \(0 \times \log(0/q) = 0\).

  • \(p \times \log(p/0) = +\infty\).

Parameters:

clustering (List[List[int]]) – A given clustering

Returns:

Entropy of the given clustering

Return type:

float
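With the conventions above, the entropy of a clustering can be written as \(H(C) = -\sum_i P(i) \log_2 P(i)\). The following is a minimal sketch of that formula, not the library's implementation; the helper name `entropy_bits` is hypothetical, and P(i) is assumed to be the share of datapoints in cluster i.

```python
import math
from typing import List


def entropy_bits(clustering: List[List[int]]) -> float:
    """Hypothetical helper: entropy in bits of the cluster-size distribution."""
    n_total = sum(len(cluster) for cluster in clustering)
    h = 0.0
    for cluster in clustering:
        p = len(cluster) / n_total
        if p > 0:  # convention: 0 * log(0) contributes 0
            h -= p * math.log2(p)
    return h


print(entropy_bits([[0, 1], [2, 3, 4, 5], [6, 7]]))  # 1.5
print(entropy_bits([[0, 1, 2]]))  # 0.0 (a single cluster carries no information)
```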

pycvi.vi.contingency_matrix(clustering1: List[List[int]], clustering2: List[List[int]]) → numpy.ndarray

Contingency matrix between two clusterings.

Parameters:
  • clustering1 (List[List[int]]) – First clustering

  • clustering2 (List[List[int]]) – Second clustering

Returns:

Contingency matrix between the two clusterings.

Return type:

np.ndarray
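To make the idea concrete, here is a sketch under the assumption that entry (i, j) counts the datapoints shared by cluster i of the first clustering and cluster j of the second. Whether the library stores counts or normalised values is not stated here, so treat this as illustrative only. The helper name `contingency_counts` is hypothetical, and plain nested lists are used to keep the example dependency-free, whereas the documented function returns a numpy.ndarray.

```python
from typing import List


def contingency_counts(
    clustering1: List[List[int]], clustering2: List[List[int]]
) -> List[List[int]]:
    """Hypothetical helper: entry [i][j] = |C1_i ∩ C2_j| (assumed convention)."""
    sets2 = [set(cluster) for cluster in clustering2]
    matrix = []
    for cluster1 in clustering1:
        set1 = set(cluster1)
        # One row per cluster of clustering1, one column per cluster of clustering2
        matrix.append([len(set1 & set2) for set2 in sets2])
    return matrix


a = [[0, 1, 2], [3, 4, 5]]
b = [[0, 1], [2, 3], [4, 5]]
print(contingency_counts(a, b))  # [[2, 1, 0], [0, 1, 2]]
```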

pycvi.vi.mutual_information(clustering1: List[List[int]], clustering2: List[List[int]]) → float

Mutual information between two clusterings.

Conventions: see “Elements of Information Theory” by Cover and Thomas, section 2.3 [1].

  • \(\log\) is in base 2 and entropy is measured in bits.

  • \(0 \times \log(0/0) = 0\).

  • \(0 \times \log(0/q) = 0\).

  • \(p \times \log(p/0) = +\infty\).

Parameters:
  • clustering1 (List[List[int]]) – First clustering

  • clustering2 (List[List[int]]) – Second clustering

Returns:

Mutual information between two clusterings.

Return type:

float
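The standard definition is \(I(C, C') = \sum_{i,j} P(i,j) \log_2 \frac{P(i,j)}{P(i)\,P'(j)}\), where \(P(i,j)\) is the share of datapoints lying in both cluster i of the first clustering and cluster j of the second. The sketch below implements this formula directly; the helper name `mutual_information_bits` is hypothetical and the code is not taken from pycvi.

```python
import math
from typing import List


def mutual_information_bits(
    clustering1: List[List[int]], clustering2: List[List[int]]
) -> float:
    """Hypothetical helper: mutual information in bits between two clusterings."""
    n_total = sum(len(cluster) for cluster in clustering1)
    sets2 = [set(cluster) for cluster in clustering2]
    mi = 0.0
    for cluster1 in clustering1:
        set1 = set(cluster1)
        p_i = len(set1) / n_total
        for set2 in sets2:
            p_ij = len(set1 & set2) / n_total
            if p_ij > 0:  # convention: 0 * log(0 / q) contributes 0
                p_j = len(set2) / n_total
                mi += p_ij * math.log2(p_ij / (p_i * p_j))
    return mi


same = [[0, 1], [2, 3]]
crossed = [[0, 2], [1, 3]]
print(mutual_information_bits(same, same))     # 1.0 (identical clusterings)
print(mutual_information_bits(same, crossed))  # 0.0 (independent clusterings)
```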

pycvi.vi.variation_information(clustering1: List[List[int]], clustering2: List[List[int]]) → float

Variation of information between two clusterings. [VI]

[VI]

M. Meilă, Comparing Clusterings by the Variation of Information, p. 173–187. Springer Berlin Heidelberg, 2003.

Parameters:
  • clustering1 (List[List[int]]) – First clustering

  • clustering2 (List[List[int]]) – Second clustering

Returns:

Variation of information between two clusterings.

Return type:

float
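In Meilă's definition, the variation of information is \(VI(C, C') = H(C) + H(C') - 2\,I(C, C')\). The following self-contained sketch computes it from scratch under the same assumptions as the module's type hints (a clustering is a list of clusters of datapoint indices). All helper names are hypothetical and the library's internals may differ.

```python
import math
from typing import List

Clustering = List[List[int]]


def _entropy(clustering: Clustering, n_total: int) -> float:
    # H(C) in bits, skipping empty clusters (0 * log 0 = 0)
    return -sum(
        (len(c) / n_total) * math.log2(len(c) / n_total)
        for c in clustering
        if len(c) > 0
    )


def _mutual_information(c1: Clustering, c2: Clustering, n_total: int) -> float:
    # I(C, C') in bits from the joint and marginal cluster probabilities
    mi = 0.0
    for a in c1:
        set_a = set(a)
        for b in c2:
            n_ab = len(set_a & set(b))
            if n_ab > 0:
                mi += (n_ab / n_total) * math.log2(
                    n_ab * n_total / (len(a) * len(b))
                )
    return mi


def vi_bits(c1: Clustering, c2: Clustering) -> float:
    """Hypothetical helper: VI = H(C) + H(C') - 2 I(C, C')."""
    n_total = sum(len(c) for c in c1)
    return (
        _entropy(c1, n_total)
        + _entropy(c2, n_total)
        - 2 * _mutual_information(c1, c2, n_total)
    )


same = [[0, 1], [2, 3]]
crossed = [[0, 2], [1, 3]]
print(vi_bits(same, same))     # 0.0 (identical clusterings)
print(vi_bits(same, crossed))  # 2.0 (independent clusterings)
```

A VI of 0 means the two clusterings are identical, and larger values mean they share less information, which is why a low VI to the true clustering indicates a faithful clustering.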