Computing distances: cdist and pdist

In this example, we use PyCVI's counterparts of SciPy's pdist and cdist, namely pycvi.dist.f_pdist() and pycvi.dist.f_cdist(), to compute distance matrices for time-series data in the same way as for non-time-series data. They behave like the SciPy functions, but for time-series data they use distance functions designed specifically for time series, as implemented in aeon.
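To illustrate why a dedicated time-series distance can matter, here is a minimal sketch comparing the Euclidean distance with a plain dynamic time warping (DTW) distance on two time-shifted sine waves. This is only an illustration: toy_dtw is a hypothetical helper written for this sketch, it is not aeon's implementation, and this page does not state which aeon distance PyCVI actually selects, so treating DTW as the relevant distance is an assumption.

```python
import numpy as np


def toy_dtw(a, b):
    """Unconstrained DTW with squared point-wise cost (hypothetical helper,
    not aeon's implementation). Runs in O(len(a) * len(b))."""
    n, m = len(a), len(b)
    # cost[i, j] = cheapest cumulative cost aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(
                cost[i - 1, j],      # a[i-1] matched again with an earlier b
                cost[i, j - 1],      # b[j-1] matched again with an earlier a
                cost[i - 1, j - 1],  # both series advance together
            )
    return float(np.sqrt(cost[n, m]))


t = np.linspace(0, 2 * np.pi, 50)
x = np.sin(t)
y = np.sin(t - 0.5)  # same shape as x, shifted in time

euclidean = float(np.linalg.norm(x - y))
dtw = toy_dtw(x, y)
print(f"Euclidean: {euclidean:.4f}, DTW: {dtw:.4f}")
```

For series of equal length, the diagonal alignment is one of the paths DTW considers, so this DTW value can never exceed the Euclidean distance, and it is typically much smaller when the two series differ mainly by a shift in time.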

Note that pycvi.dist.f_pdist() returns a condensed distance matrix, as SciPy's pdist does.
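As a reminder of what "condensed" means, the sketch below uses SciPy's own pdist, cdist and squareform on a small random array (the data and variable names are made up for illustration; the PyCVI functions are not called here). A condensed matrix for n points is a 1-D array of length n(n-1)/2 holding only the pairs (i, j) with i < j.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # 5 illustrative points in 2 dimensions
n = X.shape[0]

condensed = pdist(X)            # 1-D array, one entry per pair i < j
square = squareform(condensed)  # full symmetric (n, n) matrix

# Position of the pair (i, j), with i < j, inside the condensed array
i, j = 1, 3
k = n * i + j - ((i + 2) * (i + 1)) // 2

print(condensed.shape, square.shape)
print(np.isclose(condensed[k], square[i, j]))
print(np.allclose(square, cdist(X, X)))
```

One practical consequence: the mean of a condensed matrix averages over distinct pairs only, whereas the mean of the square form also includes the zeros on the diagonal and counts every pair twice.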

If you wish to run the example scripts on your own computer, please first follow the instructions detailed in Running example scripts on your computer.

import numpy as np
from pycvi.datasets.benchmark import load_data
from pycvi.dist import f_cdist, f_pdist
from pycvi.cluster import get_clustering

# ===================== Non-time-series data ===========================

dataset_name = "xclara"
data, labels = load_data(dataset_name, "barton")

# From predicted cluster-label for each datapoint to a list of
# datapoints for each cluster.
clustering = get_clustering(labels)
n_clusters = len(clustering)
print(f"\n{n_clusters} clusters in dataset {dataset_name} (Non time-series)")

# -------------------------- pdist -------------------------------------
# Distance matrices between the datapoints of each cluster
distances = [f_pdist(data[cluster]) for cluster in clustering]
for i, d in enumerate(distances):
    print(f"\nDistance matrix between datapoints in cluster {i+1}, shape {d.shape}")
    print(f"  Mean distance between datapoints in cluster {i+1}: {np.mean(d):.4f}")

# -------------------------- cdist -------------------------------------
# Distance matrices between the first cluster and the others
clusterA = clustering[0]
distances = [
    f_cdist(data[clusterA], data[cluster]) for cluster in clustering[1:]
]
for i, d in enumerate(distances):
    print(f"\nDistance matrix between cluster 1 and {i+2}, shape {d.shape}")
    print(f"  Mean distance between cluster 1 and {i+2}: {np.mean(d):.4f}")


# ======================= Time-series data =============================

dataset_name = "Trace"
data, labels = load_data(dataset_name, "ucr")

clustering = get_clustering(labels)
n_clusters = len(clustering)
print(f"\n{n_clusters} clusters in dataset {dataset_name} (Time-series)")

# -------------------------- pdist -------------------------------------
# Distance matrices between the datapoints of each cluster
distances = [f_pdist(data[cluster]) for cluster in clustering]
for i, d in enumerate(distances):
    print(f"\nDistance matrix between datapoints in cluster {i+1}, shape {d.shape}")
    print(f"  Mean distance between datapoints in cluster {i+1}: {np.mean(d):.4f}")

# -------------------------- cdist -------------------------------------
# Distance matrices between the first cluster and the others
clusterA = clustering[0]
distances = [
    f_cdist(data[clusterA], data[cluster]) for cluster in clustering[1:]
]
for i, d in enumerate(distances):
    print(f"\nDistance matrix between cluster 1 and {i+2}, shape {d.shape}")
    print(f"  Mean distance between cluster 1 and {i+2}: {np.mean(d):.4f}")

3 clusters in dataset xclara (Non time-series)

Distance matrix between datapoints in cluster 1, shape (452676,)
  Mean distance between datapoints in cluster 1: 18.4354

Distance matrix between datapoints in cluster 2, shape (397386,)
  Mean distance between datapoints in cluster 2: 17.5734

Distance matrix between datapoints in cluster 3, shape (667590,)
  Mean distance between datapoints in cluster 3: 17.8209

Distance matrix between cluster 1 and 2, shape (952, 892)
  Mean distance between cluster 1 and 2: 65.6305

Distance matrix between cluster 1 and 3, shape (952, 1156)
  Mean distance between cluster 1 and 3: 76.9694

4 clusters in dataset Trace (Time-series)

Distance matrix between datapoints in cluster 1, shape (300,)
  Mean distance between datapoints in cluster 1: 64.6368

Distance matrix between datapoints in cluster 2, shape (210,)
  Mean distance between datapoints in cluster 2: 55.6207

Distance matrix between datapoints in cluster 3, shape (231,)
  Mean distance between datapoints in cluster 3: 65.8920

Distance matrix between datapoints in cluster 4, shape (465,)
  Mean distance between datapoints in cluster 4: 71.8390

Distance matrix between cluster 1 and 2, shape (25, 21)
  Mean distance between cluster 1 and 2: 81.2241

Distance matrix between cluster 1 and 3, shape (25, 22)
  Mean distance between cluster 1 and 3: 206.1871

Distance matrix between cluster 1 and 4, shape (25, 31)
  Mean distance between cluster 1 and 4: 203.9528
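
The shapes printed above are consistent with each other: a condensed matrix for n datapoints has n(n-1)/2 entries, so the cluster sizes can be recovered from the pdist shapes and compared with the cdist shapes. The following sketch does this with the lengths copied from the output; n_points_from_condensed is a hypothetical helper written for this illustration, not a PyCVI function.

```python
import math


def n_points_from_condensed(length):
    """Solve n * (n - 1) / 2 == length for n (hypothetical helper)."""
    n = (1 + math.isqrt(1 + 8 * length)) // 2
    if n * (n - 1) // 2 != length:
        raise ValueError(f"{length} is not a valid condensed length")
    return n


# Lengths of the condensed matrices reported in the output above
pdist_lengths = {
    "xclara": [452676, 397386, 667590],
    "Trace": [300, 210, 231, 465],
}

cluster_sizes = {
    name: [n_points_from_condensed(length) for length in lengths]
    for name, lengths in pdist_lengths.items()
}
print(cluster_sizes)
```

The recovered sizes (952, 892 and 1156 for xclara; 25, 21, 22 and 31 for Trace) match the row and column counts of the cdist matrices, for example (952, 892) and (25, 31).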