Analytics

fit_cluster_hdbscan(df, input_columns=['umap_x', 'umap_y'], output_columns=['cluster', 'cluster_probability'], min_cluster_size=5, min_samples=None, cluster_selection_epsilon=0.0, metric='euclidean', cluster_selection_method='eom', allow_single_cluster=False)

Apply HDBSCAN clustering to the specified columns of the DataFrame.

Parameters:

df (pandas.DataFrame) – The input DataFrame.
input_columns (list, optional) – List of column names to use for clustering, by default ['umap_x', 'umap_y']
output_columns (list, optional) – Names for the output columns, by default ['cluster', 'cluster_probability']
min_cluster_size (int, optional) – The minimum size of clusters, by default 5
min_samples (int, optional) – The number of samples in a neighborhood for a point to be considered a core point. When None, HDBSCAN uses the value of min_cluster_size, by default None
cluster_selection_epsilon (float, optional) – A distance threshold: clusters separated by less than this distance are merged, so a higher epsilon yields fewer, larger clusters, by default 0.0
metric (str, optional) – The metric to use for distance computation, by default 'euclidean'
cluster_selection_method (str, optional) – The method used to select clusters, either 'eom' or 'leaf', by default 'eom'
allow_single_cluster (bool, optional) – Whether to accept a result in which all non-noise points form a single cluster, by default False

Returns:

The input DataFrame with additional columns containing cluster labels and probabilities.

Return type:

pandas.DataFrame
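
HDBSCAN conventionally labels noise points -1 and reports a membership strength between 0 and 1. Assuming this function writes those values unchanged into the cluster and cluster_probability columns (the documentation above does not confirm this), the output could be summarized roughly as sketched below. The helper summarize_clusters and the sample rows are hypothetical; the rows stand in for records taken from the returned DataFrame.

```python
from collections import defaultdict


def summarize_clusters(rows, label_key="cluster",
                       prob_key="cluster_probability", noise_label=-1):
    """Return ({label: (size, mean_probability)}, noise_count).

    Assumes the HDBSCAN convention that noise points carry the label -1.
    """
    sizes = defaultdict(int)
    prob_sums = defaultdict(float)
    noise_count = 0
    for row in rows:
        label = row[label_key]
        if label == noise_label:
            # Noise points belong to no cluster, so count them separately.
            noise_count += 1
            continue
        sizes[label] += 1
        prob_sums[label] += row[prob_key]
    summary = {
        label: (sizes[label], prob_sums[label] / sizes[label])
        for label in sizes
    }
    return summary, noise_count


# Hypothetical records shaped like the two default output columns.
rows = [
    {"cluster": 0, "cluster_probability": 1.0},
    {"cluster": 0, "cluster_probability": 0.5},
    {"cluster": 1, "cluster_probability": 0.75},
    {"cluster": -1, "cluster_probability": 0.0},
]
summary, noise_count = summarize_clusters(rows)
```

A large noise count relative to the cluster sizes may be a hint to revisit min_cluster_size or min_samples.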

fit_umap(df, input_columns, output_columns=['umap_x', 'umap_y'], target_y=None, embeddings_in_list=False, **kwargs)

Apply UMAP to the columns in the dataframe.

This function applies UMAP dimensionality reduction to the specified columns and appends the x and y coordinates to the dataframe as new columns.

Parameters:

df (pandas.DataFrame) – The input dataframe to transform.
input_columns (Union[List[str], str]) – Column name(s) containing the data to reduce.
output_columns (list, optional) – Names for the output coordinate columns, by default ['umap_x', 'umap_y']
target_y (str, optional) – Name of a column to use as the target variable for supervised UMAP, by default None
embeddings_in_list (bool, optional) – Set to True if embeddings are a list of values in a single column, False if each column is a separate dimension, by default False
**kwargs – Additional keyword arguments passed to UMAP. The most important is usually n_neighbors (UMAP's default is 15).
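
The two input layouts that embeddings_in_list distinguishes can be illustrated with plain Python records. This is only a sketch of the idea, not the function's actual implementation: the helper to_matrix and the column names dim_0, dim_1, dim_2 and embedding are hypothetical.

```python
def to_matrix(rows, input_columns, embeddings_in_list=False):
    """Build a list-of-lists feature matrix from either input layout."""
    # Accept a single column name as well as a list of names.
    if isinstance(input_columns, str):
        columns = [input_columns]
    else:
        columns = list(input_columns)
    if embeddings_in_list:
        # One column holds the whole embedding vector for each row.
        return [list(row[columns[0]]) for row in rows]
    # Each named column contributes one dimension.
    return [[row[name] for name in columns] for row in rows]


# Layout for embeddings_in_list=False: one column per dimension.
wide_rows = [
    {"dim_0": 0.1, "dim_1": 0.2, "dim_2": 0.3},
    {"dim_0": 0.4, "dim_1": 0.5, "dim_2": 0.6},
]

# Layout for embeddings_in_list=True: one column holding a list per row.
list_rows = [
    {"embedding": [0.1, 0.2, 0.3]},
    {"embedding": [0.4, 0.5, 0.6]},
]

matrix_from_wide = to_matrix(wide_rows, ["dim_0", "dim_1", "dim_2"])
matrix_from_list = to_matrix(list_rows, "embedding", embeddings_in_list=True)
```

Both calls produce the same two-row, three-column matrix, which is the kind of input a dimensionality reducer expects.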

Returns:

The input dataframe with added UMAP coordinate columns.

Return type:

pandas.DataFrame

Raises:

KeyError – If the specified target_y is not a column in the dataframe.
ValueError – If embeddings_in_list is True but multiple input columns are provided.
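
The two error conditions listed above could be checked along the following lines. This is a minimal sketch under stated assumptions: the helper check_umap_inputs is hypothetical, the error messages are invented for illustration, and the real function may validate its arguments differently.

```python
def check_umap_inputs(df_columns, input_columns, target_y=None,
                      embeddings_in_list=False):
    """Validate arguments and return input_columns as a list of names."""
    # input_columns may be a single name or a list of names.
    if isinstance(input_columns, str):
        input_columns = [input_columns]
    else:
        input_columns = list(input_columns)

    # A supervised target must name an existing column.
    if target_y is not None and target_y not in df_columns:
        raise KeyError(f"target_y column not found: {target_y!r}")

    # A list-valued embedding column must be the only input column.
    if embeddings_in_list and len(input_columns) > 1:
        raise ValueError(
            "embeddings_in_list=True expects exactly one input column"
        )
    return input_columns


# Hypothetical column names for illustration.
available = ["embedding", "label", "dim_0", "dim_1"]
normalized = check_umap_inputs(available, "embedding",
                               target_y="label", embeddings_in_list=True)
```

Normalizing a single string to a one-element list early keeps the later checks uniform for both accepted input types.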