Welcome to PyCVI’s documentation!
PyCVI is a Python package specialized in internal Clustering Validity Indices (CVI). Internal CVIs are used to select the best clustering among a set of pre-computed clusterings when no external information is available such as the labels of the datapoints.
Although internal CVIs are fundamental to clustering tasks and remain an active research topic, very few of them are implemented in standard Python libraries (only 3 in scikit-learn; more were available in R, but few were maintained and kept on CRAN). This is despite the well-known limitations of all existing CVIs and the need to use the right one(s) for the specific dataset at hand.
In addition, all CVIs rely on the definition of a distance between datapoints and most of them on the notion of cluster center.
For non-time-series data, the distance used is usually the Euclidean distance and the cluster center is defined as the usual average. Libraries such as scipy, numpy, scikit-learn, etc. offer a large selection of distance measures that are compatible with all their functions.
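To make the roles of the distance and of the cluster center concrete, here is a minimal sketch of one internal CVI, the Calinski-Harabasz index, written with NumPy only. It uses the Euclidean distance and the usual average as cluster center, then picks the best of two pre-computed clusterings. The function name, the toy data and the candidate names are assumptions made for illustration; this is not PyCVI's implementation.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """Calinski-Harabasz index with Euclidean distance and mean centers."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    n, k = len(X), len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the index needs 2 <= k < n clusters")
    global_center = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)  # cluster center = usual average
        # Squared Euclidean distance between centers, weighted by cluster size
        between += len(members) * np.sum((center - global_center) ** 2)
        # Squared Euclidean distances of the members to their own center
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two well-separated groups of three points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Pre-computed candidate clusterings (hypothetical names)
candidates = {
    "two_blobs": [0, 0, 0, 1, 1, 1],
    "mixed": [0, 1, 0, 1, 0, 1],
}
scores = {name: calinski_harabasz(X, labels) for name, labels in candidates.items()}
best = max(scores, key=scores.get)  # higher is better for this index
```

Replacing the Euclidean distance and the mean in this sketch with other notions of distance and center is exactly what becomes necessary for time-series data.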
For time-series data, however, common distances are Dynamic Time Warping (DTW) and Move-Split-Merge (MSM), and the barycenter of a group of time series is then not defined as the usual mean but as the DTW Barycentric Average (DBA) or the MSM Barycentric Average (MBA). Unfortunately, DTW, MSM, DBA and MBA are not compatible with the libraries mentioned above, which, among other reasons, made additional machine learning libraries specialized in time-series data, such as aeon, sktime and tslearn, necessary.
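As an illustration of why time series call for a dedicated distance, the following sketch implements the classic DTW recurrence in pure Python and compares it with the Euclidean distance on two series that contain the same bump, shifted by one time step. It assumes univariate series, a squared pointwise cost, no warping window, and a final square root (libraries differ on that last convention). It is not the aeon implementation that PyCVI relies on.

```python
import math


def dtw_distance(a, b):
    """Classic DTW between two univariate series (no warping window)."""
    n, m = len(a), len(b)
    # cost[i][j] = cost of the best warping path aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def euclidean_distance(a, b):
    """Point-by-point Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


series = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 2, 1, 0, 0]  # same bump, one time step later

d_dtw = dtw_distance(series, shifted)        # the warping absorbs the shift
d_euc = euclidean_distance(series, shifted)  # the shift is fully penalized
```

On this pair the DTW distance is 0 while the Euclidean distance is 2, which is why a point-by-point mean is also a poor summary of a group of such series.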
PyCVI tries to fill that gap by implementing 12 state-of-the-art internal CVIs and by making them compatible with DTW and DBA (as well as with non-time-series data). PyCVI is entirely compatible with scikit-learn, scikit-learn-extra, aeon and sktime, in order to be easily integrated into any clustering pipeline in Python.
To compute DTW, MSM, DBA, MBA, etc., PyCVI relies on the aeon library.
Main features
12 internal CVIs implemented: Hartigan, Calinski-Harabasz, GapStatistic, Silhouette, ScoreFunction, Maulik-Bandyopadhyay, SD, SDbw, Dunn, Xie-Beni, XB* and Davies-Bouldin.
Compute CVI values and select the best clustering based on the results.
Compatible with time series and their distance and average functions, such as Dynamic Time Warping (DTW), Move-Split-Merge (MSM), DTW Barycentric Average (DBA), MSM Barycentric Average (MBA), etc.
Compatible with scikit-learn, scikit-learn-extra, aeon and sktime, for easy integration into any clustering pipeline in Python.
Can compute the clusterings beforehand if provided with a sklearn-like clustering class.
Enables users to define custom CVIs.
Multiple CVIs can easily be combined to select the best clustering based on a majority vote.
Variation of Information implemented (distance between clusterings).
Facilitates the use of time-series distances directly in some of the models implemented in scikit-learn such as AgglomerativeClustering.
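The majority vote mentioned in the feature list can be pictured with the short sketch below: each CVI selects its preferred number of clusters k on its own, and the k selected most often wins. The per-CVI selections are made-up values, and the tie-breaking rule (smallest k) is an assumption of this sketch; PyCVI's own aggregation may differ.

```python
from collections import Counter


def majority_vote(selected_k):
    """Return the k chosen by the most CVIs, plus the full vote count.

    Ties are broken in favor of the smallest k (an assumption of this sketch).
    """
    votes = Counter(selected_k.values())
    top = max(votes.values())
    winner = min(k for k, count in votes.items() if count == top)
    return winner, votes


# Hypothetical selections: the k that each index would pick on its own
selected_k = {
    "Silhouette": 3,
    "Calinski-Harabasz": 3,
    "Davies-Bouldin": 4,
    "Dunn": 3,
    "Xie-Beni": 2,
}
best_k, votes = majority_vote(selected_k)
```

Keeping the full vote count alongside the winner makes it easy to see how strongly the indices agree.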
Install
With uv:
# From PyPI
uv add pycvi-lib
# Alternatively, from github directly
uv add "pycvi-lib @ git+https://github.com/nglm/pycvi.git"
With poetry:
# From PyPI
poetry add pycvi-lib
# Alternatively, from github directly
poetry add git+https://github.com/nglm/pycvi.git
With pip:
# From PyPI
pip install pycvi-lib
# Alternatively, from github directly
pip install git+https://github.com/nglm/pycvi.git
With anaconda:
# activate your environment (replace myEnv with your environment name)
conda activate myEnv
# install pip first in your environment
conda install pip
# install pycvi on your anaconda environment with pip
pip install pycvi-lib
Extra dependencies
In order to run the example scripts, extra dependencies are necessary. The install command is then:
# For uv (the quotes prevent shells such as zsh from expanding the brackets)
uv add "pycvi-lib[examples]"
# For poetry
poetry add "pycvi-lib[examples]"
# For pip and anaconda
pip install "pycvi-lib[examples]"
Alternatively, you can manually install in your environment the packages needed to run the example scripts (matplotlib).
Important note: As of now (June 2026), the latest version of scikit-learn-extra (0.3.0) is not compatible with numpy >= 2.0.0. Users who wish to combine scikit-learn-extra with PyCVI must make sure that they are using a compatible version of numpy.
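One possible way to check for this situation in a given environment is sketched below, using only the standard library. The helper names are hypothetical, and the check treats any installed scikit-learn-extra as affected, which is an assumption: the note above only mentions version 0.3.0.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def major(version_string):
    """Leading integer of a version string such as '2.1.3'."""
    return int(version_string.split(".")[0])


def numpy_conflict(numpy_version, sklearn_extra_version):
    """True when scikit-learn-extra is installed alongside numpy >= 2.0.0."""
    if numpy_version is None or sklearn_extra_version is None:
        return False
    return major(numpy_version) >= 2


np_version = installed_version("numpy")
extra_version = installed_version("scikit-learn-extra")
if numpy_conflict(np_version, extra_version):
    print(f"numpy {np_version} may be incompatible with "
          f"scikit-learn-extra {extra_version}")
```

If the message is printed, pinning numpy below 2.0.0 in that environment is the kind of fix the note refers to.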
If you wish to run the example scripts on your own computer, please follow the instructions detailed in the documentation first: Running example scripts on your computer.
Examples
- CVI - Basic usage
- CVI - Basic usage with time-series
- Computing cluster centers
- Using the Variation of Information
- Computing distances: cdist and pdist
- Selecting the number of clusters k
- Full PyCVI pipeline
- CVIAggregator: Combining CVIs
- Functional and Object-oriented APIs
- Time-Series metric with Sklearn
If you wish to run the example scripts on your own computer, please first follow the instructions detailed in Running example scripts on your computer.
Main Modules
All implemented CVIs, as well as selection methods, are available in the pycvi.cvi module:
- Python implementation of state-of-the-art internal CVIs.
High-level functions are defined to compute clusterings, compare clusterings and evaluate clusterings:
- Generate all clusterings for the given data and clustering model.
- Variation of information between two clusterings.
- Computes all CVI values for the given clusterings.
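For readers unfamiliar with the variation of information (VI), the sketch below computes it from two label lists in pure Python, using VI(A, B) = H(A) + H(B) - 2 I(A; B) with natural logarithms. The function name is hypothetical and the code assumes hard cluster assignments given as flat label sequences; it is not PyCVI's implementation, which may handle other clustering representations.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Variation of information between two hard clusterings (natural log)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same datapoints")
    n = len(labels_a)
    size_a = Counter(labels_a)              # cluster sizes in clustering A
    size_b = Counter(labels_b)              # cluster sizes in clustering B
    joint = Counter(zip(labels_a, labels_b))  # sizes of the intersections
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = size_a[a] / n
        p_b = size_b[b] / n
        # Equivalent to H(A) + H(B) - 2 I(A; B), summed over non-empty overlaps
        vi -= p_ab * (math.log(p_ab / p_a) + math.log(p_ab / p_b))
    return vi


# Same partition with swapped label names: distance 0
vi_same = variation_of_information([0, 0, 1, 1], [1, 1, 0, 0])
# Two partitions that cut the data in unrelated ways: distance 2 * ln(2)
vi_diff = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Because VI compares partitions rather than label values, renaming the clusters does not change the result.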
Lower-level functions are defined to perform common operations while handling time-series distances such as DTW and MSM, as well as time-series averaging methods such as DBA and MBA:
- Pairwise distances within a group of elements.
- Distances between two (groups of) elements.
- Allows the use of time-series metrics with (some) sklearn models.
- Inertia of a group of elements.
- Compute the center of a cluster.
- Compute the centers of all clusters.
Finally, you can browse the full API here:
Full API
- Internal Cluster Validity Indices (CVIs), compatible with time-series
Contribute
Issue Tracker: github.com/nglm/pycvi/issues.
Source Code: github.com/nglm/pycvi.
Support
If you are having issues, please let me know or create an issue.
How to cite PyCVI
If you are using PyCVI in your work, please cite us by using one of the following entries referring to the JOSS paper “PyCVI: A Python package for internal Cluster Validity Indices, compatible with time-series data” by N. Galmiche:
BibTeX
@article{Galmiche2024,
author = {Natacha Galmiche},
title = {PyCVI: A Python package for internal Cluster Validity Indices, compatible with time-series data},
doi = {10.21105/joss.06841},
url = {https://doi.org/10.21105/joss.06841},
year = {2024},
publisher = {The Open Journal},
volume = {9},
number = {102},
pages = {6841},
journal = {Journal of Open Source Software}
}
Plain text
Galmiche, N., (2024). PyCVI: A Python package for internal Cluster Validity Indices, compatible with time-series data. Journal of Open Source Software, 9(102), 6841, https://doi.org/10.21105/joss.06841