NLP
- cluster_comments(df, input_column, output_columns=['cluster', 'cluster_probability'], min_cluster_size=5, cluster_selection_epsilon=0.2, n_neighbors=15)
Apply a pipeline for clustering text comments.
The pipeline has three stages: 1) vector embeddings, 2) dimensionality reduction (UMAP), and 3) clustering (HDBSCAN).
This assigns each row a cluster ID so that similar free text comments (found in the input_column) can be grouped together.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
input_column (str) – Name of the column containing text to cluster.
output_columns (list, optional) – Names for the output columns. Default is [“cluster”, “cluster_probability”].
min_cluster_size (int, optional) – The minimum size of clusters for HDBSCAN. Default is 5.
cluster_selection_epsilon (float, optional) – Distance threshold for HDBSCAN. Higher epsilon means fewer, larger clusters. Default is 0.2.
n_neighbors (int, optional) – The size of local neighborhood for UMAP. Default is 15.
- Returns:
The input DataFrame with additional columns for cluster IDs and probabilities.
- Return type:
pandas.DataFrame
- cluster_questions(df, columns=None, pattern=None, likert_mapping=None, umap_n_neighbors=15, umap_min_dist=0.1, hdbscan_min_cluster_size=20, hdbscan_min_samples=None, cluster_selection_epsilon=0.4)
Cluster Likert scale questions based on response patterns.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
columns (list, optional) – List of column names to cluster. If None, all columns matching the pattern will be used.
pattern (str, optional) – Regex pattern to match column names. Used if columns is None.
likert_mapping (dict, optional) – Custom mapping for Likert scale responses. If None, default mapping is used.
umap_n_neighbors (int, optional) – The size of local neighborhood for UMAP. Default is 15.
umap_min_dist (float, optional) – The minimum distance between points in UMAP. Default is 0.1.
hdbscan_min_cluster_size (int, optional) – The minimum size of clusters for HDBSCAN. Default is 20.
hdbscan_min_samples (int, optional) – The number of samples in a neighborhood for a core point in HDBSCAN. Default is None.
cluster_selection_epsilon (float, optional) – Distance threshold for HDBSCAN. Clusters separated by less than this value are merged, so a higher epsilon means fewer, larger clusters. Default is 0.4.
- Returns:
The input DataFrame with additional columns for encoded Likert responses, UMAP coordinates, and cluster IDs.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If neither ‘columns’ nor ‘pattern’ is provided.
- encode_likert(df, likert_columns, output_prefix='likert_encoded_', custom_mapping=None, debug=True)
Encode Likert scale responses to numeric values.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
likert_columns (list) – List of column names containing Likert scale responses.
output_prefix (str, optional) – Prefix for the new encoded columns. Default is ‘likert_encoded_’.
custom_mapping (dict, optional) – Optional custom mapping for Likert scale responses.
debug (bool, optional) – If True, prints out the mappings. Default is True.
- Returns:
The input DataFrame with additional columns for encoded Likert responses.
- Return type:
pandas.DataFrame
Notes
Default mapping:
  - -1: Phrases containing ‘disagree’, ‘do not agree’, etc.
  - 0: Phrases containing ‘neutral’, ‘neither’, ‘unsure’, etc.
  - +1: Phrases containing ‘agree’ (but not ‘disagree’ or ‘not agree’)
  - NaN: NaN values are preserved
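The default mapping in the Notes above can be illustrated with a small rule-based encoder. This is a minimal sketch, not the library's code: the function name encode_response and the phrase lists are hypothetical, and two behaviours are assumptions made here, namely that neutral phrases are checked first (so that ‘Neither agree nor disagree’ encodes as 0) and that unrecognised responses become NaN.

```python
import math

# Hypothetical phrase lists based on the examples given in the Notes
NEUTRAL_PHRASES = ("neutral", "neither", "unsure")
NEGATIVE_PHRASES = ("disagree", "do not agree", "not agree")


def encode_response(value):
    """Map one Likert response to -1, 0, +1, or NaN."""
    # NaN (or None) input is preserved as NaN
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return math.nan
    text = str(value).lower()
    # Assumption: neutral wins over the 'agree'/'disagree' substrings it may contain
    if any(phrase in text for phrase in NEUTRAL_PHRASES):
        return 0
    if any(phrase in text for phrase in NEGATIVE_PHRASES):
        return -1
    if "agree" in text:
        return 1
    # Assumption: anything unrecognised is treated as missing
    return math.nan


responses = ["Strongly agree", "Disagree", "Neither agree nor disagree", float("nan")]
encoded = [encode_response(r) for r in responses]
```

Checking the negative phrases before ‘agree’ matters because ‘disagree’ contains ‘agree’ as a substring.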
- extract_keywords(df, input_column, output_column='keywords', preprocessed_column='preprocessed_text', spacy_column='spacy_output', lemma_column='lemmatized_text', top_n=3, threshold=0.4, ngram_range=(1, 1), min_df=5, min_count=None, min_proportion_with_keywords=0.95, **kwargs)
Apply a pipeline of text preprocessing, spaCy processing, lemmatization, and TF-IDF to extract keywords from the specified column.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
input_column (str) – Name of the column containing text to process.
output_column (str, optional) – Name of the column to store the extracted keywords. Default is ‘keywords’.
preprocessed_column (str, optional) – Name of the column to store preprocessed text. Default is ‘preprocessed_text’.
spacy_column (str, optional) – Name of the column to store spaCy output. Default is ‘spacy_output’.
lemma_column (str, optional) – Name of the column to store lemmatized text. Default is ‘lemmatized_text’.
top_n (int, optional) – Number of top keywords to extract for each document. Default is 3.
threshold (float, optional) – Minimum TF-IDF score for a keyword to be included. Default is 0.4.
ngram_range (tuple, optional) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. Default is (1, 1) which means only unigrams.
min_df (int, optional) – Minimum document frequency for TF-IDF. Default is 5.
min_count (int, optional) – Minimum count for a keyword to be considered common in refinement. Default is None.
min_proportion_with_keywords (float, optional) – Minimum proportion of rows that should have keywords after refinement. Default is 0.95.
**kwargs – Additional keyword arguments to pass to the preprocessing, spaCy, lemmatization, or TF-IDF functions.
- Returns:
The input DataFrame with additional columns for preprocessed text, spaCy output, lemmatized text, and extracted keywords.
- Return type:
pandas.DataFrame
- extract_sentiment(df, input_column, output_columns=['positive', 'neutral', 'negative', 'sentiment'])
Extract sentiment from text using the cardiffnlp/twitter-roberta-base-sentiment model.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
input_column (str) – Name of the column containing text to analyze.
output_columns (list, optional) – List of column names for the output. Default is [“positive”, “neutral”, “negative”, “sentiment”].
- Returns:
The input DataFrame with additional columns for sentiment scores and labels.
- Return type:
pandas.DataFrame
- fit_sentence_transformer(df, input_column, model_name='all-MiniLM-L6-v2', output_column='sentence_embedding')
Add vector embeddings for each string in the input column.
Creates sentence embeddings that can be used for downstream tasks like clustering.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
input_column (str) – Name of the column containing text to embed.
model_name (str, optional) – Name of the sentence transformer model to use. Default is ‘all-MiniLM-L6-v2’.
output_column (str, optional) – Name of the column to store embeddings. Default is ‘sentence_embedding’.
- Returns:
The input DataFrame with an additional column containing sentence embeddings.
- Return type:
pandas.DataFrame
- fit_spacy(df, input_column, output_column='spacy_output')
Apply the en_core_web_md spaCy model to the specified column.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
input_column (str) – Name of the column containing text to analyze.
output_column (str, optional) – Name of the output column. Default is “spacy_output”.
- Returns:
The input DataFrame with an additional column containing spaCy doc objects.
- Return type:
pandas.DataFrame
Notes
If the spaCy model is not already downloaded, this function will attempt to download it automatically.
- fit_tfidf(df, input_column, output_column='keywords', top_n=3, threshold=0.6, append_features=False, ngram_range=(1, 1), **tfidf_kwargs)
Apply TF-IDF vectorization to extract top keywords from text.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
input_column (str) – Name of the column containing text to vectorize.
output_column (str, optional) – Name of the column to store the extracted keywords. Default is ‘keywords’.
top_n (int, optional) – Number of top keywords to extract for each document. Default is 3.
threshold (float, optional) – Minimum TF-IDF score for a keyword to be included. Default is 0.6.
append_features (bool, optional) – If True, append all TF-IDF features to the DataFrame (useful for downstream machine learning tasks). Default is False.
ngram_range (tuple, optional) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. Default is (1, 1) which means only unigrams. Set to (1, 2) for unigrams and bigrams, and so on.
**tfidf_kwargs – Additional keyword arguments to pass to TfidfVectorizer.
- Returns:
The input DataFrame with an additional column containing the top keywords.
- Return type:
pandas.DataFrame
- get_lemma(df, input_column='spacy_output', output_column='lemmatized_text', text_pos=['PRON'], remove_punct=True, remove_space=True, remove_stop=True, keep_tokens=None, keep_pos=None, keep_dep=['neg'], join_tokens=True)
Extract lemmatized text from spaCy doc objects.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
input_column (str, optional) – Name of the column containing spaCy doc objects. Default is ‘spacy_output’.
output_column (str, optional) – Name of the output column for lemmatized text. Default is ‘lemmatized_text’.
text_pos (List[str], optional) – List of POS tags whose tokens are returned as their original text instead of being lemmatized. Default is [‘PRON’].
remove_punct (bool, optional) – Whether to remove punctuation. Default is True.
remove_space (bool, optional) – Whether to remove whitespace tokens. Default is True.
remove_stop (bool, optional) – Whether to remove stop words. Default is True.
keep_tokens (List[str], optional) – List of token texts to always keep. Default is None.
keep_pos (List[str], optional) – List of POS tags to always keep. Default is None.
keep_dep (List[str], optional) – List of dependency labels to always keep. Default is [“neg”].
join_tokens (bool, optional) – Whether to join tokens into a string. If False, returns a list of tokens. Default is True.
- Returns:
The input DataFrame with an additional column containing lemmatized text or token list.
- Return type:
pandas.DataFrame
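The way the remove_* and keep_* options could combine is sketched below on hand-built tokens, so that no spaCy model is needed. Tok is a hypothetical stand-in for a spaCy token (whose corresponding attributes are text, lemma_, pos_, dep_, is_stop, is_punct, and is_space). The precedence shown, where any keep_* match overrides removal, is an assumption based on the phrase ‘always keep’.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a spaCy token."""
    text: str
    lemma: str
    pos: str
    dep: str
    is_stop: bool = False
    is_punct: bool = False
    is_space: bool = False


def lemmatize(tokens, text_pos=("PRON",), remove_punct=True, remove_space=True,
              remove_stop=True, keep_tokens=(), keep_pos=(), keep_dep=("neg",),
              join_tokens=True):
    kept = []
    for tok in tokens:
        always_keep = tok.text in keep_tokens or tok.pos in keep_pos or tok.dep in keep_dep
        dropped = ((remove_punct and tok.is_punct)
                   or (remove_space and tok.is_space)
                   or (remove_stop and tok.is_stop))
        # Assumption: a keep_* match wins over every remove_* rule
        if dropped and not always_keep:
            continue
        # Tokens with a POS tag in text_pos keep their surface text
        kept.append(tok.text if tok.pos in text_pos else tok.lemma)
    return " ".join(kept) if join_tokens else kept


# "I did not like the lectures." with hand-assigned attributes
sentence = [
    Tok("I", "I", "PRON", "nsubj", is_stop=True),
    Tok("did", "do", "AUX", "aux", is_stop=True),
    Tok("not", "not", "PART", "neg", is_stop=True),
    Tok("like", "like", "VERB", "ROOT"),
    Tok("the", "the", "DET", "det", is_stop=True),
    Tok("lectures", "lecture", "NOUN", "dobj"),
    Tok(".", ".", "PUNCT", "punct", is_punct=True),
]
lemmatized = lemmatize(sentence)
```

Keeping the ‘neg’ dependency by default preserves negation: without it, ‘not’ would be removed as a stop word and the comment would read as positive.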
- preprocess_text(df, input_column, output_column=None, remove_html=True, lower_case=False, normalize_whitespace=True, remove_numbers=False, remove_stopwords=False, flag_short_comments=False, min_comment_length=5, max_comment_length=None, remove_punctuation=True, keep_sentence_punctuation=True, comment_length_column=None)
Preprocess text data in the specified column, tailored for survey responses.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
input_column (str) – Name of the column containing text to preprocess.
output_column (str, optional) – Name of the output column. If None, overwrites the input column.
remove_html (bool, optional) – Whether to remove unexpected HTML tags. Default is True.
lower_case (bool, optional) – Whether to lowercase all words. Default is False.
normalize_whitespace (bool, optional) – Whether to normalize whitespace. Default is True.
remove_numbers (bool, optional) – Whether to remove numbers. Default is False.
remove_stopwords (bool, optional) – Whether to remove stop words. Default is False.
flag_short_comments (bool, optional) – Whether to flag very short comments. Default is False.
min_comment_length (int, optional) – Minimum length of comment to not be flagged as short. Default is 5.
max_comment_length (int, optional) – Maximum length of comment to keep. If None, keeps full length. Default is None.
remove_punctuation (bool, optional) – Whether to remove punctuation. Default is True.
keep_sentence_punctuation (bool, optional) – Whether to keep sentence-level punctuation. Default is True.
comment_length_column (str, optional) – Name of the column to store comment lengths. If None, no column is added. Default is None.
- Returns:
The input DataFrame with preprocessed text and optionally new columns for short comments, truncation info, and comment length.
- Return type:
pandas.DataFrame
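A few of these options can be illustrated on a single string with the standard library. The helper clean_comment is hypothetical and covers only four of the flags. The regular expressions and the order in which the steps run are assumptions, not the library's behaviour.

```python
import re


def clean_comment(text, remove_html=True, lower_case=False,
                  normalize_whitespace=True, remove_numbers=False):
    """Apply a subset of the preprocessing steps to one comment."""
    if remove_html:
        # Replace anything that looks like a tag with a space
        text = re.sub(r"<[^>]+>", " ", text)
    if remove_numbers:
        text = re.sub(r"\d+", " ", text)
    if lower_case:
        text = text.lower()
    if normalize_whitespace:
        # Collapse runs of whitespace and trim both ends
        text = re.sub(r"\s+", " ", text).strip()
    return text


cleaned = clean_comment("<p>Great   course!</p>\n")
cleaned_lower = clean_comment("Room 101 was <b>cold</b>", lower_case=True, remove_numbers=True)
```

Running whitespace normalisation last cleans up the gaps left behind by the tag and number removal.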
- refine_keywords(df, keyword_column='keywords', text_column='lemmatized_text', min_count=None, min_proportion=0.95, output_column=None, debug=True)
Refine keywords by replacing rare keywords with more common ones based on the text content.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
keyword_column (str, optional) – Name of the column containing keyword lists. Default is ‘keywords’.
text_column (str, optional) – Name of the column containing the original text. Default is ‘lemmatized_text’.
min_count (int, optional) – Minimum count for a keyword to be considered common. If None, it will be determined automatically. Default is None.
min_proportion (float, optional) – Minimum proportion of rows that should have keywords after refinement. Used only if min_count is None. Default is 0.95.
output_column (str, optional) – Column name for the refined keyword output. If None, the keyword_column is overwritten. Default is None.
debug (bool, optional) – If True, print detailed statistics about the refinement process. Default is True.
- Returns:
The input DataFrame with refined keywords.
- Return type:
pandas.DataFrame
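One possible reading of ‘replacing rare keywords with more common ones based on the text content’ is sketched below on plain lists. The helper refine and its fallback rule are hypothetical: keywords seen fewer than min_count times are dropped, and a row left without keywords receives any common keyword that occurs in its text. The automatic choice of min_count from min_proportion is not shown.

```python
from collections import Counter


def refine(keyword_lists, texts, min_count=2):
    """Keep common keywords; fall back to common keywords found in the row's text."""
    counts = Counter(kw for kws in keyword_lists for kw in kws)
    common = {kw for kw, n in counts.items() if n >= min_count}
    refined = []
    for kws, text in zip(keyword_lists, texts):
        kept = [kw for kw in kws if kw in common]
        if not kept:
            tokens = set(text.split())
            # Assumption: prefer the most frequent common keyword, ties broken alphabetically
            kept = sorted((kw for kw in common if kw in tokens),
                          key=lambda kw: (-counts[kw], kw))
        refined.append(kept)
    return refined


keyword_lists = [["parking"], ["parking", "cost"], ["fee"]]
texts = ["parking be full", "parking cost too much", "parking fee be high"]
refined = refine(keyword_lists, texts, min_count=2)
```

In this example only ‘parking’ appears at least twice, so the rare keywords ‘cost’ and ‘fee’ are dropped and the third row falls back to ‘parking’ because that word occurs in its text.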
- remove_short_comments(df, input_column, min_comment_length=5)
Replace comments shorter than the specified minimum length with NaN.
- Parameters:
df (pandas.DataFrame) – The input DataFrame.
input_column (str) – Name of the column containing text to process.
min_comment_length (int, optional) – Minimum length of comment to keep. Default is 5.
- Returns:
The input DataFrame with short comments replaced by NaN.
- Return type:
pandas.DataFrame
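The replacement of short comments with NaN can be sketched in pandas as follows. The helper drop_short is hypothetical, and measuring length in characters rather than words is an assumption, since the description does not say which unit is used.

```python
import numpy as np
import pandas as pd


def drop_short(df, input_column, min_comment_length=5):
    """Return a copy of df with short comments in input_column replaced by NaN."""
    out = df.copy()
    # Assumption: length is counted in characters; missing values count as length 0
    lengths = out[input_column].str.len().fillna(0)
    out[input_column] = out[input_column].where(lengths >= min_comment_length, np.nan)
    return out


survey = pd.DataFrame({"comment": ["ok", "Great course overall", None]})
result = drop_short(survey, "comment", min_comment_length=5)
```

Rows are kept rather than dropped, so the DataFrame retains its original length and index.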