Using the Variation of Information

In this example, we compute the variation of information (VI) between the true clustering and the clustering predicted when the correct number of clusters is assumed. Some clustering methods are poorly suited to some datasets, and this shows up as a high VI between the predicted and the true clustering.
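As a rough illustration of what the variation of information measures, the sketch below computes VI(A, B) = H(A|B) + H(B|A) from two flat label sequences using only the standard library. The function name `vi_from_labels` is hypothetical, the natural logarithm is an assumption, and this is not PyCVI's implementation: `variation_information` in the script works on clusterings built by `get_clustering`, not on raw labels.

```python
from collections import Counter
from math import log


def vi_from_labels(labels_a, labels_b):
    """Variation of information H(A|B) + H(B|A), in nats (assumed base).

    Hypothetical helper for illustration only, not part of PyCVI.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same datapoints")
    n = len(labels_a)
    count_a = Counter(labels_a)                  # cluster sizes in A
    count_b = Counter(labels_b)                  # cluster sizes in B
    count_ab = Counter(zip(labels_a, labels_b))  # sizes of the overlaps

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # Each overlap contributes -p_ab * [log p(a|b) + log p(b|a)].
        vi -= p_ab * (log(n_ab / count_b[b]) + log(n_ab / count_a[a]))
    return vi


# Same partition with different label names: VI is 0.
same = vi_from_labels([0, 0, 1, 1], [1, 1, 0, 0])
# Two partitions that cut the data in "orthogonal" ways: VI is 2 * log(2).
split = vi_from_labels([0, 0, 1, 1], [0, 1, 0, 1])
print(same, split)
```

VI does not depend on how clusters are named, only on how the datapoints are grouped, which is why the first call returns 0 even though the labels are swapped.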

If you wish to run the example scripts on your own computer, please first follow the instructions detailed in Running example scripts on your computer.

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler
from pycvi.datasets.benchmark import load_data
from pycvi.cluster import get_clustering
from pycvi.vi import variation_information

from pycvi_examples_utils import plot_true_selected

# Load data
datasets = ["xclara", "zelnik1"]
for dataset in datasets:
    print(f" ============= {dataset} =============")
    data, labels = load_data(dataset, "barton")

    # Data pre-processing
    scaler = StandardScaler()
    X = scaler.fit_transform(data)

    # --- Generate clusters assuming the correct number of clusters ----

    # From the true cluster label of each datapoint to a list of
    # datapoints for each cluster.
    clustering_true = get_clustering(labels)
    k_true = len(clustering_true)

    # Generate the clusters assuming the right number of clusters.
    # Clustering model to use; could be any sklearn-like clustering class,
    # e.g. AgglomerativeClustering(n_clusters=k_true)
    model = KMeans(n_clusters=k_true)
    labels_pred = model.fit_predict(X)
    clustering_pred = get_clustering(labels_pred)

    # ------ Variation of information between true and predicted -------

    # Compute the variation of information between the true clustering and
    # the clustering obtained with the method on the dataset.
    vi = variation_information(clustering_true, clustering_pred)
    print(f"Variation of information: {vi}")

    # ---------------------- Summary fig -------------------------------
    ax_titles = [
        f"True clustering, k={k_true}",
        f"Clustering assuming k={k_true} | VI={vi:.4f}",
    ]
    fig = plot_true_selected(data, clustering_true, clustering_pred, ax_titles)
    fig_title = f"{dataset} - KMeans clustering"
    fig_name = f"variation_information_KMeans_{dataset}.png"
    fig.suptitle(fig_title)
    fig.savefig(fig_name)

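The script converts one label per datapoint into one list of datapoints per cluster before computing the VI. A minimal sketch of that conversion is shown here, assuming that a clustering is represented as a list of index lists. The helper `labels_to_clusters` is hypothetical, and the return type and cluster ordering of PyCVI's `get_clustering` may differ.

```python
from collections import defaultdict


def labels_to_clusters(labels):
    """Group datapoint indices by cluster label.

    Hypothetical stand-in for illustration; not PyCVI's get_clustering.
    """
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    # Sort by label so the result does not depend on first appearance.
    return [clusters[label] for label in sorted(clusters)]


clusters = labels_to_clusters([1, 0, 1, 2, 0])
print(clusters)       # [[1, 4], [0, 2], [3]]
print(len(clusters))  # 3, the number of clusters (k_true in the script)
```

Under this representation, the number of clusters is simply the length of the list, which matches how the script derives `k_true`.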
Figures: ../_images/variation_information_KMeans_xclara.png and ../_images/variation_information_KMeans_zelnik1.png (true clustering next to the KMeans clustering for each dataset).
 ============= xclara =============
Variation of information: 0.030363784741793243
 ============= zelnik1 =============
Variation of information: 2.5176943544098846
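
To put the two values above on a scale, it can help to recall that VI is 0 exactly when the two clusterings are identical, that it never exceeds log(n) for n datapoints, and that it never exceeds 2 log(K) when both clusterings have at most K clusters and K is at most sqrt(n). The sketch below computes that upper bound. The helper name `vi_upper_bound` is hypothetical, the logarithm base used by `variation_information` is not stated in this example (so `base` is left as a parameter), and the sizes in the usage lines are made-up numbers, not the sizes of xclara or zelnik1.

```python
from math import log, sqrt


def vi_upper_bound(n_points, k_max=None, base=2.0):
    """Upper bound on VI for n_points datapoints.

    Hypothetical helper: `base` must match the logarithm used to compute
    the VI being compared, which is an assumption left to the caller.
    """
    bound = log(n_points, base)
    # The tighter 2 * log(K) bound only holds when K <= sqrt(n).
    if k_max is not None and k_max <= sqrt(n_points):
        bound = min(bound, 2 * log(k_max, base))
    return bound


# Made-up sizes, for illustration only.
loose = vi_upper_bound(1024)            # about 10 bits
tight = vi_upper_bound(1024, k_max=4)   # about 4 bits
print(loose, tight)
```

Dividing a VI value by such a bound gives a number between 0 and 1, which makes results easier to compare across datasets of different sizes, provided the same logarithm base is used on both sides.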