Computing distances: cdist and pdist
In this example, we use the PyCVI counterparts of scipy’s pdist and cdist, namely pycvi.dist.f_pdist() and pycvi.dist.f_cdist(), to compute distance matrices with time-series data in the same way these functions are used with non-time-series data. They behave like scipy’s functions, except that for time-series data, DTW (dynamic time warping) is used as the distance function (aeon’s implementation is used).
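To give an intuition for what DTW measures, here is a minimal textbook sketch of the DTW recurrence for two univariate series. It is not aeon's implementation and may differ from it in details such as the pointwise cost, windowing, or normalisation; the function name `dtw_distance` and the squared pointwise cost are assumptions made for illustration only.

```python
import numpy as np


def dtw_distance(x, y):
    """Textbook DTW between two 1-D series (illustrative, not aeon's code).

    Assumption: the pointwise cost is the squared difference and no
    warping window is applied.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # cost[i, j] = minimal accumulated cost of aligning x[:i] with y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor alignments
            cost[i, j] = d + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return cost[n, m]


a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same bump, delayed by one time step

dtw_ab = dtw_distance(a, b)
# Point-by-point squared difference on the first 5 samples, for comparison
lockstep_ab = float(np.sum((np.array(a) - np.array(b[:5])) ** 2))
print(dtw_ab, lockstep_ab)
```

Because DTW may align one sample of a series with several samples of the other, the delayed bump is matched at no cost, whereas a point-by-point comparison penalises the shift. DTW also accepts series of different lengths, which a point-by-point distance does not.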
Note that in the case of pycvi.dist.f_pdist(), a condensed distance matrix is returned (as in scipy).
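As a reminder of what "condensed" means, the following sketch uses scipy's own `pdist`, `cdist` and `squareform` on small random data (the arrays `X` and `Y` are made up for illustration). It only demonstrates scipy's convention; that `f_pdist` follows the same layout is taken from the statement above.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # 5 points in 2-D (illustrative data)
Y = rng.normal(size=(3, 2))  # 3 points in 2-D (illustrative data)

# pdist returns only the upper triangle (i < j) as a flat vector
condensed = pdist(X)
# squareform expands it into the full symmetric matrix with a zero diagonal
square = squareform(condensed)
# cdist always returns a full rectangular matrix
cross = cdist(X, Y)

print(condensed.shape, square.shape, cross.shape)

# Position of the pair (i, j), i < j, in the condensed vector
n, i, j = len(X), 1, 3
k = n * i - i * (i + 1) // 2 + (j - i - 1)
```

For `n` points the condensed vector has `n * (n - 1) / 2` entries, one per unordered pair, so statistics computed on it are not diluted by the zero diagonal or by duplicated pairs.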
If you wish to run the example scripts on your own computer, please first follow the instructions detailed in Running example scripts on your computer.
import numpy as np
from pycvi.datasets.benchmark import load_data
from pycvi.dist import f_cdist, f_pdist
from pycvi.cluster import get_clustering
# ===================== Non-time-series data ===========================
dataset_name = "xclara"
data, labels = load_data(dataset_name, "barton")
# From predicted cluster-label for each datapoint to a list of
# datapoints for each cluster.
clustering = get_clustering(labels)
n_clusters = len(clustering)
print(f"\n{n_clusters} clusters in dataset {dataset_name} (Non time-series)")
# -------------------------- pdist -------------------------------------
# Distance matrices between the datapoints of each cluster
distances = [ f_pdist(data[cluster]) for cluster in clustering ]
for i, d in enumerate(distances):
    print(f"\nDistance matrix between datapoints in cluster {i+1}, shape {d.shape}")
    print(f" Mean distance between datapoints in cluster {i+1}: {np.mean(d):.4f}")
# -------------------------- cdist -------------------------------------
# Distance matrices between the first cluster and the others
clusterA = clustering[0]
distances = [
    f_cdist(data[clusterA], data[cluster]) for cluster in clustering[1:]
]
for i, d in enumerate(distances):
    print(f"\nDistance matrix between cluster 1 and {i+2}, shape {d.shape}")
    print(f" Mean distance between cluster 1 and {i+2}: {np.mean(d):.4f}")
# ======================= Time-series data =============================
dataset_name = "Trace"
data, labels = load_data(dataset_name, "ucr")
clustering = get_clustering(labels)
n_clusters = len(clustering)
print(f"\n{n_clusters} clusters in dataset {dataset_name} (Time-series)")
# -------------------------- pdist -------------------------------------
# Distance matrices between the datapoints of each cluster
distances = [ f_pdist(data[cluster]) for cluster in clustering ]
for i, d in enumerate(distances):
    print(f"\nDistance matrix between datapoints in cluster {i+1}, shape {d.shape}")
    print(f" Mean distance between datapoints in cluster {i+1}: {np.mean(d):.4f}")
# -------------------------- cdist -------------------------------------
# Distance matrices between the first cluster and the others
clusterA = clustering[0]
distances = [
    f_cdist(data[clusterA], data[cluster]) for cluster in clustering[1:]
]
for i, d in enumerate(distances):
    print(f"\nDistance matrix between cluster 1 and {i+2}, shape {d.shape}")
    print(f" Mean distance between cluster 1 and {i+2}: {np.mean(d):.4f}")
3 clusters in dataset xclara (Non time-series)
Distance matrix between datapoints in cluster 1, shape (452676,)
Mean distance between datapoints in cluster 1: 18.4354
Distance matrix between datapoints in cluster 2, shape (397386,)
Mean distance between datapoints in cluster 2: 17.5734
Distance matrix between datapoints in cluster 3, shape (667590,)
Mean distance between datapoints in cluster 3: 17.8209
Distance matrix between cluster 1 and 2, shape (952, 892)
Mean distance between cluster 1 and 2: 65.6305
Distance matrix between cluster 1 and 3, shape (952, 1156)
Mean distance between cluster 1 and 3: 76.9694
4 clusters in dataset Trace (Time-series)
Distance matrix between datapoints in cluster 1, shape (300,)
Mean distance between datapoints in cluster 1: 8.9790
Distance matrix between datapoints in cluster 2, shape (210,)
Mean distance between datapoints in cluster 2: 0.6607
Distance matrix between datapoints in cluster 3, shape (231,)
Mean distance between datapoints in cluster 3: 5.4999
Distance matrix between datapoints in cluster 4, shape (465,)
Mean distance between datapoints in cluster 4: 8.5036
Distance matrix between cluster 1 and 2, shape (25, 21)
Mean distance between cluster 1 and 2: 35.4798
Distance matrix between cluster 1 and 3, shape (25, 22)
Mean distance between cluster 1 and 3: 507.1584
Distance matrix between cluster 1 and 4, shape (25, 31)
Mean distance between cluster 1 and 4: 532.0727
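The shapes printed above can be cross-checked with a few lines of arithmetic. In the sketch below, the cluster sizes are read off the cdist shapes in the output (for instance, `(952, 892)` gives the sizes of clusters 1 and 2 of xclara); the helper name `condensed_length` is hypothetical.

```python
def condensed_length(n):
    """Number of unordered pairs among n datapoints."""
    return n * (n - 1) // 2


# Cluster sizes read off the cdist shapes in the output above
xclara_sizes = [952, 892, 1156]
trace_sizes = [25, 21, 22, 31]

xclara_condensed = [condensed_length(n) for n in xclara_sizes]
trace_condensed = [condensed_length(n) for n in trace_sizes]

print(xclara_condensed)  # [452676, 397386, 667590]
print(trace_condensed)   # [300, 210, 231, 465]
```

These values match the pdist shapes reported for each cluster, which is consistent with one entry per unordered pair of datapoints. It also means that the mean of a pdist result averages over distinct pairs only, without the zero self-distances of a full square matrix.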