Cluster Free Text Comments

[1]:
# 02_cluster_comments.ipynb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas_survey_toolkit import nlp
from pandas_survey_toolkit.vis import cluster_heatmap_plot

# Create sample survey data with open-ended comments about a product
data = {
    'respondent_id': range(1, 21),
    'comments': [
        "Battery life is excellent, lasts all day",
        "The battery doesn't last long enough for me",
        "Battery performance is outstanding, very impressed",
        "Screen resolution is incredible, so sharp and clear",
        "Love the high-resolution display, colors are vibrant",
        "The screen is too reflective in bright light",
        "Camera quality is excellent for the price range",
        "Photos taken in low light are grainy and poor quality",
        "Camera autofocus is slow and often misses the shot",
        "The software is intuitive and easy to use",
        "User interface is confusing and not user-friendly",
        "Software keeps crashing when I open multiple apps",
        "Build quality feels premium and solid",
        "The device feels flimsy and cheaply made",
        "Very durable, survived several drops without damage",
        "Excellent value for money considering the features",
        "Overpriced for what you get compared to competitors",
        "Worth every penny, exceeded my expectations",
        "Customer service was unhelpful when I had issues",
        "Great customer support, quick and helpful responses"
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Display the original data
print("Original data:")
display(df)

Original data:
respondent_id comments
0 1 Battery life is excellent, lasts all day
1 2 The battery doesn't last long enough for me
2 3 Battery performance is outstanding, very impre...
3 4 Screen resolution is incredible, so sharp and ...
4 5 Love the high-resolution display, colors are v...
5 6 The screen is too reflective in bright light
6 7 Camera quality is excellent for the price range
7 8 Photos taken in low light are grainy and poor ...
8 9 Camera autofocus is slow and often misses the ...
9 10 The software is intuitive and easy to use
10 11 User interface is confusing and not user-friendly
11 12 Software keeps crashing when I open multiple apps
12 13 Build quality feels premium and solid
13 14 The device feels flimsy and cheaply made
14 15 Very durable, survived several drops without d...
15 16 Excellent value for money considering the feat...
16 17 Overpriced for what you get compared to compet...
17 18 Worth every penny, exceeded my expectations
18 19 Customer service was unhelpful when I had issues
19 20 Great customer support, quick and helpful resp...
[2]:

# Cluster the comments
df_clustered = df.cluster_comments(
    input_column='comments',
    min_cluster_size=3,
    n_neighbors=5,
    cluster_selection_epsilon=0.5,
)

# Examine the clusters
print("\nComment clusters:")
display(df_clustered[['comments', 'cluster', 'cluster_probability']].sort_values('cluster'))

# Count comments per cluster
cluster_counts = df_clustered['cluster'].value_counts().reset_index()
cluster_counts.columns = ['cluster', 'count']
print("\nComments per cluster:")
display(cluster_counts)

Comment clusters:
comments cluster cluster_probability
19 Great customer support, quick and helpful resp... -1.0 0.000000
18 Customer service was unhelpful when I had issues 0.0 0.667263
0 Battery life is excellent, lasts all day 0.0 1.000000
2 Battery performance is outstanding, very impre... 0.0 1.000000
1 The battery doesn't last long enough for me 0.0 1.000000
11 Software keeps crashing when I open multiple apps 1.0 1.000000
10 User interface is confusing and not user-friendly 1.0 1.000000
9 The software is intuitive and easy to use 1.0 1.000000
12 Build quality feels premium and solid 2.0 1.000000
13 The device feels flimsy and cheaply made 2.0 0.830560
14 Very durable, survived several drops without d... 2.0 0.906771
15 Excellent value for money considering the feat... 2.0 1.000000
16 Overpriced for what you get compared to compet... 2.0 1.000000
17 Worth every penny, exceeded my expectations 2.0 1.000000
8 Camera autofocus is slow and often misses the ... 3.0 0.700035
6 Camera quality is excellent for the price range 3.0 0.700035
5 The screen is too reflective in bright light 3.0 1.000000
4 Love the high-resolution display, colors are v... 3.0 1.000000
7 Photos taken in low light are grainy and poor ... 3.0 0.847425
3 Screen resolution is incredible, so sharp and ... 3.0 1.000000

Comments per cluster:
cluster count
0 3.0 6
1 2.0 6
2 0.0 4
3 1.0 3
4 -1.0 1
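
The cluster table above can be summarised per cluster. The sketch below is a minimal illustration using only pandas, with the labels and probabilities copied by hand from the "Comment clusters" output. It assumes that the label -1 marks comments left unassigned (noise), which is the HDBSCAN convention; this notebook does not state that explicitly, so treat it as an assumption. The names `clusters`, `summary`, and `noise_count` are illustrative and not part of the toolkit.

```python
import pandas as pd

# Labels and probabilities copied from the "Comment clusters" output above.
clusters = pd.DataFrame({
    "cluster": [-1.0] + [0.0] * 4 + [1.0] * 3 + [2.0] * 6 + [3.0] * 6,
    "cluster_probability": [
        0.000000,                                                # label -1
        0.667263, 1.000000, 1.000000, 1.000000,                  # cluster 0
        1.000000, 1.000000, 1.000000,                            # cluster 1
        1.000000, 0.830560, 0.906771, 1.000000, 1.000000, 1.000000,  # cluster 2
        0.700035, 0.700035, 1.000000, 1.000000, 0.847425, 1.000000,  # cluster 3
    ],
})

# Assumption: -1 means "not assigned to any cluster" (HDBSCAN-style noise).
is_noise = clusters["cluster"] == -1
noise_count = int(is_noise.sum())

# Size and membership strength of each real cluster.
summary = (
    clusters[~is_noise]
    .astype({"cluster": int})  # labels are displayed as floats; cast for readability
    .groupby("cluster")["cluster_probability"]
    .agg(count="size", mean_probability="mean", min_probability="min")
)

print(summary)
print(f"Unassigned comments: {noise_count}")
```

A low `min_probability` flags clusters that contain borderline members, such as the customer service comment that landed in the battery cluster.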
[3]:
df_clustered.head()
[3]:
respondent_id comments sentence_embedding umap_x umap_y cluster cluster_probability
0 1 Battery life is excellent, lasts all day [-0.038631026, 0.044625234, -0.028667396, -0.0... 11.356366 4.052678 0.0 1.0
1 2 The battery doesn't last long enough for me [-0.0007719228, -0.0042446144, 0.011075384, -0... 11.582358 3.586939 0.0 1.0
2 3 Battery performance is outstanding, very impre... [-0.008022247, 0.09049879, -0.0867905, -0.0022... 11.752824 4.191682 0.0 1.0
3 4 Screen resolution is incredible, so sharp and ... [-0.014808243, -0.03135826, 0.035538964, -0.05... 13.497684 2.565261 3.0 1.0
4 5 Love the high-resolution display, colors are v... [-0.029058423, 0.026945723, 0.040125024, -0.05... 13.295995 2.067326 3.0 1.0

In the interactive datamapplot, similar comments appear closer together. Varying cluster_selection_epsilon changes the number of clusters. Clustering works better on much larger datasets than this 20-comment sample.
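
The claim that similar comments sit closer together can also be checked numerically. The sketch below is a rough illustration rather than part of the toolkit: it takes the five `umap_x` / `umap_y` rows shown in the `df_clustered.head()` output, computes a centroid per cluster, and compares the spread inside each cluster with the distance between the two centroids. The names `points`, `centroids`, `max_within`, and `between` are made up for this example.

```python
import numpy as np
import pandas as pd

# UMAP coordinates and cluster labels copied from the df_clustered.head() output.
points = pd.DataFrame({
    "umap_x": [11.356366, 11.582358, 11.752824, 13.497684, 13.295995],
    "umap_y": [4.052678, 3.586939, 4.191682, 2.565261, 2.067326],
    "cluster": [0, 0, 0, 3, 3],
})

# Centroid of each cluster in the 2-D UMAP space.
centroids = points.groupby("cluster")[["umap_x", "umap_y"]].mean()

# Distance from every point to the centroid of its own cluster.
own_centroid = centroids.loc[points["cluster"].to_numpy()].to_numpy()
offsets = points[["umap_x", "umap_y"]].to_numpy() - own_centroid
points["dist_to_centroid"] = np.linalg.norm(offsets, axis=1)

# Largest within-cluster distance versus the gap between the two centroids.
max_within = float(points["dist_to_centroid"].max())
between = float(
    np.linalg.norm(centroids.loc[0].to_numpy() - centroids.loc[3].to_numpy())
)

print(points)
print(f"Largest distance to own centroid: {max_within:.3f}")
print(f"Distance between centroids 0 and 3: {between:.3f}")
```

For these five rows, the battery comments (cluster 0) and the screen comments (cluster 3) each sit much closer to their own centroid than the two centroids sit to each other. UMAP distances are only a loose guide to semantic similarity, so this is a sanity check, not a quality metric.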

[4]:
import datamapplot

datamapplot.create_interactive_plot(
    df_clustered[['umap_x', 'umap_y']].values,
    df_clustered['cluster'].astype(str).values,
    hover_text=df_clustered['comments'],
)