NLP

cluster_comments(df, input_column, output_columns=['cluster', 'cluster_probability'], min_cluster_size=5, cluster_selection_epsilon=0.2, n_neighbors=15)

Apply a pipeline for clustering text comments.

The pipeline has three stages:

  1. Vector embeddings
  2. Dimensionality reduction (UMAP)
  3. Clustering (HDBSCAN)

Each row is assigned a cluster ID so that similar free-text comments in input_column can be grouped together.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • input_column (str) – Name of the column containing text to cluster.

  • output_columns (list, optional) – Names for the output columns. Default is [“cluster”, “cluster_probability”].

  • min_cluster_size (int, optional) – The minimum size of clusters for HDBSCAN. Default is 5.

  • cluster_selection_epsilon (float, optional) – Distance threshold for HDBSCAN. Higher epsilon means fewer, larger clusters. Default is 0.2.

  • n_neighbors (int, optional) – The size of local neighborhood for UMAP. Default is 15.

Returns:

The input DataFrame with additional columns for cluster IDs and probabilities.

Return type:

pandas.DataFrame
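
Once each row carries a cluster ID and a probability, the comments can be grouped for review. The sketch below works on plain dictionaries rather than a DataFrame so that it has no dependencies. The example rows and the `group_by_cluster` helper are hypothetical, and it assumes that the HDBSCAN noise label (-1) is passed through unchanged in the cluster column, which this page does not confirm.

```python
from collections import defaultdict

# Made-up rows shaped like the documented output; column names follow the
# default output_columns, the values are purely illustrative.
rows = [
    {"comment": "Parking is too expensive", "cluster": 0, "cluster_probability": 0.91},
    {"comment": "Car park fees keep rising", "cluster": 0, "cluster_probability": 0.84},
    {"comment": "Great canteen food", "cluster": 1, "cluster_probability": 0.77},
    {"comment": "n/a", "cluster": -1, "cluster_probability": 0.0},
]


def group_by_cluster(rows, min_probability=0.0):
    """Collect comments per cluster ID, skipping rows labelled as noise."""
    groups = defaultdict(list)
    for row in rows:
        if row["cluster"] == -1:  # HDBSCAN's label for unclustered points
            continue
        if row["cluster_probability"] >= min_probability:
            groups[row["cluster"]].append(row["comment"])
    return dict(groups)


# Keep only confident assignments.
grouped = group_by_cluster(rows, min_probability=0.8)
```

Filtering on the probability column is one way to review only the comments that sit firmly inside a cluster.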

cluster_questions(df, columns=None, pattern=None, likert_mapping=None, umap_n_neighbors=15, umap_min_dist=0.1, hdbscan_min_cluster_size=20, hdbscan_min_samples=None, cluster_selection_epsilon=0.4)

Cluster Likert scale questions based on response patterns.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • columns (list, optional) – List of column names to cluster. If None, all columns matching the pattern will be used.

  • pattern (str, optional) – Regex pattern to match column names. Used if columns is None.

  • likert_mapping (dict, optional) – Custom mapping for Likert scale responses. If None, default mapping is used.

  • umap_n_neighbors (int, optional) – The size of local neighborhood for UMAP. Default is 15.

  • umap_min_dist (float, optional) – The minimum distance between points in UMAP. Default is 0.1.

  • hdbscan_min_cluster_size (int, optional) – The minimum size of clusters for HDBSCAN. Default is 20.

  • hdbscan_min_samples (int, optional) – The number of samples in a neighborhood for a core point in HDBSCAN. Default is None.

  • cluster_selection_epsilon (float, optional) – Distance threshold for HDBSCAN. Clusters separated by less than this distance are merged, so a higher epsilon means fewer, larger clusters. Default is 0.4.

Returns:

The input DataFrame with additional columns for encoded Likert responses, UMAP coordinates, and cluster IDs.

Return type:

pandas.DataFrame

Raises:

ValueError – If neither ‘columns’ nor ‘pattern’ is provided.
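
The interplay between columns and pattern can be sketched as follows. This is an illustration of the documented behaviour, not the library's code: the `select_columns` helper and the column names are hypothetical, and matching with `re.search` (rather than a full match) is an assumption.

```python
import re


def select_columns(all_columns, columns=None, pattern=None):
    """Resolve the columns to cluster from an explicit list or a regex."""
    if columns is not None:
        return list(columns)  # an explicit list takes precedence
    if pattern is not None:
        regex = re.compile(pattern)
        return [name for name in all_columns if regex.search(name)]
    # Mirrors the documented error when neither argument is given.
    raise ValueError("Provide either 'columns' or 'pattern'.")


survey_columns = ["respondent_id", "q1_workload", "q2_support", "free_text"]
matched = select_columns(survey_columns, pattern=r"^q\d+_")
```

Anchoring the pattern with `^` keeps it from matching columns that merely contain the prefix somewhere in their name.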

encode_likert(df, likert_columns, output_prefix='likert_encoded_', custom_mapping=None, debug=True)

Encode Likert scale responses to numeric values.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • likert_columns (list) – List of column names containing Likert scale responses.

  • output_prefix (str, optional) – Prefix for the new encoded columns. Default is ‘likert_encoded_’.

  • custom_mapping (dict, optional) – Optional custom mapping for Likert scale responses.

  • debug (bool, optional) – If True, prints out the mappings. Default is True.

Returns:

The input DataFrame with additional columns for encoded Likert responses.

Return type:

pandas.DataFrame

Notes

Default mapping:

  • -1: Phrases containing ‘disagree’, ‘do not agree’, etc.

  • 0: Phrases containing ‘neutral’, ‘neither’, ‘unsure’, etc.

  • +1: Phrases containing ‘agree’ (but not ‘disagree’ or ‘not agree’)

  • NaN: NaN values are preserved
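
A minimal sketch of how a phrase-based mapping like this can be written. The marker lists contain only the phrases named above (the full list behind "etc." is not documented here), and the `encode_response` helper is hypothetical. The order of the checks is an assumption: neutral phrases are tested first because "Neither agree nor disagree" also contains "disagree", and negative phrases are tested before "agree" for the same reason. Returning NaN for unrecognised text is also an assumption.

```python
import math

NEUTRAL_MARKERS = ("neutral", "neither", "unsure")
NEGATIVE_MARKERS = ("disagree", "do not agree", "not agree")


def encode_response(value):
    """Map one Likert response to -1, 0, +1, or NaN."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return math.nan  # missing responses stay missing
    text = value.strip().lower()
    if any(marker in text for marker in NEUTRAL_MARKERS):
        return 0
    if any(marker in text for marker in NEGATIVE_MARKERS):
        return -1
    if "agree" in text:
        return 1
    return math.nan  # unrecognised phrase (assumed behaviour)


responses = ["Strongly agree", "Disagree", "Neither agree nor disagree", None]
encoded = [encode_response(r) for r in responses]
```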

extract_keywords(df, input_column, output_column='keywords', preprocessed_column='preprocessed_text', spacy_column='spacy_output', lemma_column='lemmatized_text', top_n=3, threshold=0.4, ngram_range=(1, 1), min_df=5, min_count=None, min_proportion_with_keywords=0.95, **kwargs)

Apply a pipeline of text preprocessing, spaCy processing, lemmatization, and TF-IDF to extract keywords from the specified column.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • input_column (str) – Name of the column containing text to process.

  • output_column (str, optional) – Name of the column to store the extracted keywords. Default is ‘keywords’.

  • preprocessed_column (str, optional) – Name of the column to store preprocessed text. Default is ‘preprocessed_text’.

  • spacy_column (str, optional) – Name of the column to store spaCy output. Default is ‘spacy_output’.

  • lemma_column (str, optional) – Name of the column to store lemmatized text. Default is ‘lemmatized_text’.

  • top_n (int, optional) – Number of top keywords to extract for each document. Default is 3.

  • threshold (float, optional) – Minimum TF-IDF score for a keyword to be included. Default is 0.4.

  • ngram_range (tuple, optional) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. Default is (1, 1) which means only unigrams.

  • min_df (int, optional) – Minimum document frequency for TF-IDF. Default is 5.

  • min_count (int, optional) – Minimum count for a keyword to be considered common in refinement. Default is None.

  • min_proportion_with_keywords (float, optional) – Minimum proportion of rows that should have keywords after refinement. Default is 0.95.

  • **kwargs – Additional keyword arguments to pass to the preprocessing, spaCy, lemmatization, or TF-IDF functions.

Returns:

The input DataFrame with additional columns for preprocessed text, spaCy output, lemmatized text, and extracted keywords.

Return type:

pandas.DataFrame
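
To make the ngram_range parameter concrete, the sketch below enumerates the word n-grams that a given range produces. The `word_ngrams` helper is hypothetical and only illustrates the meaning of the lower and upper bounds; it is not how the library builds its vocabulary.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams with n between the two bounds, inclusive."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["flexible", "working", "hours"]
unigrams = word_ngrams(tokens, (1, 1))         # single words only
uni_and_bigrams = word_ngrams(tokens, (1, 2))  # single words plus adjacent pairs
```

With (1, 2), multi-word phrases such as "flexible working" become candidate keywords alongside the individual words.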

extract_sentiment(df, input_column, output_columns=['positive', 'neutral', 'negative', 'sentiment'])

Extract sentiment from text using the cardiffnlp/twitter-roberta-base-sentiment model.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • input_column (str) – Name of the column containing text to analyze.

  • output_columns (list, optional) – List of column names for the output. Default is [“positive”, “neutral”, “negative”, “sentiment”].

Returns:

The input DataFrame with additional columns for sentiment scores and labels.

Return type:

pandas.DataFrame
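
The four output columns can be read as three class scores plus a label. The sketch below shows one common way to derive them from raw model outputs: a softmax over the three logits, with the label taken from the highest score. Whether this function computes its columns exactly this way is an assumption. The `scores_to_row` helper and the logits are made up, and the label order (negative, neutral, positive) is the one published for the cardiffnlp/twitter-roberta-base-sentiment model.

```python
import math

# Output order of the model's three classes.
LABELS = ("negative", "neutral", "positive")


def scores_to_row(logits):
    """Turn three raw logits into per-class probabilities plus a label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    row = {label: e / total for label, e in zip(LABELS, exps)}
    # The label is the class with the highest probability.
    row["sentiment"] = max(LABELS, key=lambda label: row[label])
    return row


row = scores_to_row([-1.2, 0.3, 2.4])  # illustrative logits, not real output
```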

fit_sentence_transformer(df, input_column, model_name='all-MiniLM-L6-v2', output_column='sentence_embedding')

Add vector embeddings for each string in the input column.

Creates sentence embeddings that can be used for downstream tasks like clustering.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • input_column (str) – Name of the column containing text to embed.

  • model_name (str, optional) – Name of the sentence transformer model to use. Default is ‘all-MiniLM-L6-v2’.

  • output_column (str, optional) – Name of the column to store embeddings. Default is ‘sentence_embedding’.

Returns:

The input DataFrame with an additional column containing sentence embeddings.

Return type:

pandas.DataFrame

fit_spacy(df, input_column, output_column='spacy_output')

Apply the en_core_web_md spaCy model to the specified column.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • input_column (str) – Name of the column containing text to analyze.

  • output_column (str, optional) – Name of the output column. Default is “spacy_output”.

Returns:

The input DataFrame with an additional column containing spaCy doc objects.

Return type:

pandas.DataFrame

Notes

If the spaCy model is not already downloaded, this function will attempt to download it automatically.

fit_tfidf(df, input_column, output_column='keywords', top_n=3, threshold=0.6, append_features=False, ngram_range=(1, 1), **tfidf_kwargs)

Apply TF-IDF vectorization to extract top keywords from text.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • input_column (str) – Name of the column containing text to vectorize.

  • output_column (str, optional) – Name of the column to store the extracted keywords. Default is ‘keywords’.

  • top_n (int, optional) – Number of top keywords to extract for each document. Default is 3.

  • threshold (float, optional) – Minimum TF-IDF score for a keyword to be included. Default is 0.6.

  • append_features (bool, optional) – If True, append all TF-IDF features to the DataFrame (useful for downstream machine learning tasks). Default is False.

  • ngram_range (tuple, optional) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. Default is (1, 1) which means only unigrams. Set to (1, 2) for unigrams and bigrams, and so on.

  • **tfidf_kwargs – Additional keyword arguments to pass to TfidfVectorizer.

Returns:

The input DataFrame with an additional column containing the top keywords.

Return type:

pandas.DataFrame
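
The way top_n and threshold interact can be sketched without any dependencies. The example below computes TF-IDF by hand using the smoothed IDF and L2 row normalisation that scikit-learn's TfidfVectorizer applies by default, then keeps at most top_n terms per document whose score reaches the threshold. The `tfidf_keywords` helper is hypothetical, whitespace tokenisation is a simplification, and treating the threshold as inclusive is an assumption.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_n=3, threshold=0.6):
    """Pick up to top_n terms per document with TF-IDF score >= threshold."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw count times smoothed IDF: ln((1 + n) / (1 + df)) + 1.
        raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        scored = {t: v / norm for t, v in raw.items()}  # L2-normalised row
        # Highest score first; ties broken alphabetically.
        ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append([t for t, s in ranked[:top_n] if s >= threshold])
    return results


docs = ["parking parking cost", "canteen food cost", "parking permit"]
keywords = tfidf_keywords(docs, top_n=3, threshold=0.6)
```

Because rows are normalised to unit length, a high threshold such as 0.6 only admits terms that dominate their document, so short or repetitive texts tend to yield keywords more easily than long, varied ones.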

get_lemma(df, input_column='spacy_output', output_column='lemmatized_text', text_pos=['PRON'], remove_punct=True, remove_space=True, remove_stop=True, keep_tokens=None, keep_pos=None, keep_dep=['neg'], join_tokens=True)

Extract lemmatized text from spaCy doc objects.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • input_column (str, optional) – Name of the column containing spaCy doc objects. Default is ‘spacy_output’.

  • output_column (str, optional) – Name of the output column for lemmatized text. Default is ‘lemmatized_text’.

  • text_pos (List[str], optional) – List of POS tags whose tokens are returned as their original text instead of being lemmatized. Default is [‘PRON’].

  • remove_punct (bool, optional) – Whether to remove punctuation. Default is True.

  • remove_space (bool, optional) – Whether to remove whitespace tokens. Default is True.

  • remove_stop (bool, optional) – Whether to remove stop words. Default is True.

  • keep_tokens (List[str], optional) – List of token texts to always keep. Default is None.

  • keep_pos (List[str], optional) – List of POS tags to always keep. Default is None.

  • keep_dep (List[str], optional) – List of dependency labels to always keep. Default is [“neg”].

  • join_tokens (bool, optional) – Whether to join tokens into a string. If False, returns a list of tokens. Default is True.

Returns:

The input DataFrame with an additional column containing lemmatized text or token list.

Return type:

pandas.DataFrame
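
The filtering options above can be illustrated with a small stand-in for spaCy tokens, so that the example runs without a model. The `Token` dataclass only mimics the spaCy attribute names, and the `lemmatize` helper and the precedence it gives to keep_dep over the removal flags are assumptions about how the options combine.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Stand-in for a spaCy token; attribute names mirror spaCy's."""
    text: str
    lemma_: str
    pos_: str
    dep_: str = ""
    is_punct: bool = False
    is_space: bool = False
    is_stop: bool = False


def lemmatize(tokens, text_pos=("PRON",), remove_stop=True,
              keep_dep=("neg",), join_tokens=True):
    kept = []
    for tok in tokens:
        dropped = tok.is_punct or tok.is_space or (remove_stop and tok.is_stop)
        # Tokens with a protected dependency label survive every filter.
        if dropped and tok.dep_ not in keep_dep:
            continue
        # Tokens whose POS is in text_pos keep their surface form.
        kept.append(tok.text if tok.pos_ in text_pos else tok.lemma_)
    return " ".join(kept) if join_tokens else kept


doc = [
    Token("They", "they", "PRON", "nsubj", is_stop=True),
    Token("did", "do", "AUX", "aux", is_stop=True),
    Token("not", "not", "PART", "neg", is_stop=True),
    Token("answer", "answer", "VERB", "ROOT"),
    Token("emails", "email", "NOUN", "dobj"),
    Token(".", ".", "PUNCT", "punct", is_punct=True),
]
default_output = lemmatize(doc)
with_stop_words = lemmatize(doc, remove_stop=False)
```

Keeping the "neg" dependency by default preserves negation: "not" is a stop word, yet dropping it would reverse the meaning of the comment.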

preprocess_text(df, input_column, output_column=None, remove_html=True, lower_case=False, normalize_whitespace=True, remove_numbers=False, remove_stopwords=False, flag_short_comments=False, min_comment_length=5, max_comment_length=None, remove_punctuation=True, keep_sentence_punctuation=True, comment_length_column=None)

Preprocess text data in the specified column, tailored for survey responses.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • input_column (str) – Name of the column containing text to preprocess.

  • output_column (str, optional) – Name of the output column. If None, overwrites the input column.

  • remove_html (bool, optional) – Whether to remove unexpected HTML tags. Default is True.

  • lower_case (bool, optional) – Whether to lowercase all words. Default is False.

  • normalize_whitespace (bool, optional) – Whether to normalize whitespace. Default is True.

  • remove_numbers (bool, optional) – Whether to remove numbers. Default is False.

  • remove_stopwords (bool, optional) – Whether to remove stop words. Default is False.

  • flag_short_comments (bool, optional) – Whether to flag very short comments. Default is False.

  • min_comment_length (int, optional) – Minimum length of comment to not be flagged as short. Default is 5.

  • max_comment_length (int, optional) – Maximum length of comment to keep. If None, keeps full length. Default is None.

  • remove_punctuation (bool, optional) – Whether to remove punctuation. Default is True.

  • keep_sentence_punctuation (bool, optional) – Whether to keep sentence-level punctuation. Default is True.

  • comment_length_column (str, optional) – Name of the column to store comment lengths. If None, no column is added. Default is None.

Returns:

The input DataFrame with preprocessed text and optionally new columns for short comments, truncation info, and comment length.

Return type:

pandas.DataFrame
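
A rough sketch of three of these cleaning steps applied to a single string. The `clean_comment` helper and the regular expressions are hypothetical, and the set of characters treated as sentence-level punctuation (`. , ! ?`) is an assumption; the library may use a different set.

```python
import re


def clean_comment(text, remove_html=True, normalize_whitespace=True,
                  remove_punctuation=True, keep_sentence_punctuation=True):
    """Apply HTML, punctuation, and whitespace cleaning to one comment."""
    if remove_html:
        text = re.sub(r"<[^>]+>", " ", text)  # drop tags, keep their content
    if remove_punctuation:
        keep = ".,!?" if keep_sentence_punctuation else ""
        # Remove anything that is not a word character, whitespace, or kept mark.
        text = re.sub(rf"[^\w\s{re.escape(keep)}]", "", text)
    if normalize_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text


raw = "<p>Great   team,  but the  #workload is too high!</p>"
cleaned = clean_comment(raw)
stripped = clean_comment(raw, keep_sentence_punctuation=False)
```

Retaining sentence punctuation helps later steps that rely on sentence boundaries, such as a spaCy parse.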

refine_keywords(df, keyword_column='keywords', text_column='lemmatized_text', min_count=None, min_proportion=0.95, output_column=None, debug=True)

Refine keywords by replacing rare keywords with more common ones based on the text content.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • keyword_column (str, optional) – Name of the column containing keyword lists. Default is ‘keywords’.

  • text_column (str, optional) – Name of the column containing the original text. Default is ‘lemmatized_text’.

  • min_count (int, optional) – Minimum count for a keyword to be considered common. If None, it will be determined automatically. Default is None.

  • min_proportion (float, optional) – Minimum proportion of rows that should have keywords after refinement. Used only if min_count is None. Default is 0.95.

  • output_column (str, optional) – Column name for the refined keyword output. If None, the keyword_column is overwritten. Default is None.

  • debug (bool, optional) – If True, print detailed statistics about the refinement process. Default is True.

Returns:

The input DataFrame with refined keywords.

Return type:

pandas.DataFrame
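
One plausible reading of "replacing rare keywords with more common ones" is sketched below for a fixed min_count. The exact replacement rule is not documented on this page, so the `refine` helper, its matching on whole words, and the example data are all assumptions. The automatic choice of min_count from min_proportion is not covered.

```python
from collections import Counter


def refine(keyword_lists, texts, min_count=2):
    """Drop rare keywords and substitute common ones found in the row's text."""
    counts = Counter(k for kws in keyword_lists for k in kws)
    common = [k for k, c in counts.most_common() if c >= min_count]
    refined = []
    for kws, text in zip(keyword_lists, texts):
        words = set(text.split())
        kept = [k for k in kws if counts[k] >= min_count]
        if len(kept) < len(kws):  # at least one rare keyword was dropped
            kept += [k for k in common if k in words and k not in kept]
        refined.append(kept)
    return refined


keyword_lists = [["parking"], ["parking", "permit"], ["tarmac"], ["canteen"]]
texts = [
    "parking cost high",
    "parking permit expensive",
    "parking tarmac crack",
    "canteen close early",
]
refined = refine(keyword_lists, texts, min_count=2)
```

In this sketch a row ends up without keywords when none of the common keywords occur in its text, which is the situation the min_proportion parameter is meant to limit.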

remove_short_comments(df, input_column, min_comment_length=5)

Replace comments shorter than the specified minimum length with NaN.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame.

  • input_column (str) – Name of the column containing text to process.

  • min_comment_length (int, optional) – Minimum length of comment to keep. Default is 5.

Returns:

The input DataFrame with short comments replaced by NaN.

Return type:

pandas.DataFrame
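
The documented behaviour can be approximated in a few lines of pandas. This sketch assumes that length is measured in characters and that the input DataFrame is left unmodified; neither is stated above. The `blank_short_comments` helper and the example data are hypothetical.

```python
import pandas as pd


def blank_short_comments(df, input_column, min_comment_length=5):
    """Return a copy of df with too-short comments replaced by NaN."""
    out = df.copy()
    # Missing values count as length 0, so they stay missing.
    lengths = out[input_column].str.len().fillna(0)
    # Series.where keeps values where the condition holds, NaN elsewhere.
    out[input_column] = out[input_column].where(lengths >= min_comment_length)
    return out


df = pd.DataFrame({"comment": ["ok", "The onboarding was well organised", None]})
result = blank_short_comments(df, "comment", min_comment_length=5)
```

Replacing short comments with NaN rather than dropping rows keeps the DataFrame aligned with any other columns from the same survey.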