Reproducible notebooks for text analytics¶

Whoami¶

  • Research Software Engineer
  • Part of the Research Computing team in Central IT
  • Background in Python/R/data science with a sprinkle of NLP

Research Computing Leeds logo

What are notebooks?¶

Notebooks are JSON text documents that combine code, markup text and graphical elements (images, videos, plots, widgets).

Notebooks are composed of:

  • Code
  • Generated outputs
  • Metadata
  • Data
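
To make the JSON structure concrete, here is a minimal sketch of a notebook built as a plain Python dictionary. It assumes the nbformat 4 layout (a top-level `cells` list plus `metadata`); the cell contents are invented for illustration.

```python
import json

# A hypothetical two-cell notebook; the source and output text are made up.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}
    },
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Reproducible notebooks"],
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print('hello')"],
            # Generated outputs are stored next to the code that produced them.
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hello\n"]}
            ],
        },
    ],
}

# Round-trip through JSON text, which is what ends up in the .ipynb file.
serialised = json.dumps(notebook, indent=1)
reloaded = json.loads(serialised)
cell_types = [cell["cell_type"] for cell in reloaded["cells"]]
print(cell_types)
```

Because code, outputs and metadata all live in one file, a diff of a notebook under version control tends to mix changes to all three.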

Jupyter notebooks¶

Screenshot of jupyter notebook

JupyterLab¶

Screenshot of Jupyter Lab

Google Colab¶

Screenshot of google colab

Notebooks are great for text analytics¶

Example of text preprocessing for topic modelling with consumer complaints data.

In [1]:
%%bash 

if [ -d data/ ]; then
    echo "Data directory exists"
else
    mkdir data
fi

if test -f data/complaints.csv; then
    echo "Data file exists"
else 
    curl -LO http://files.consumerfinance.gov/ccdb/complaints.csv.zip
    mv complaints.csv.zip data/
    unzip data/complaints.csv.zip -d data/
fi
Data directory exists
Data file exists
In [2]:
# import the dataset
import pandas as pd

ticket_data = pd.read_csv('data/complaints.csv')

ticket_data.dropna(subset=["Consumer complaint narrative"], inplace=True)

print(ticket_data.shape)

ticket_data.head()
/tmp/ipykernel_8714/1604837686.py:4: DtypeWarning: Columns (9,16) have mixed types. Specify dtype option on import or set low_memory=False.
  ticket_data = pd.read_csv('data/complaints.csv')
(1158384, 18)
Out[2]:
Date received Product Sub-product Issue Sub-issue Consumer complaint narrative Company public response Company State ZIP code Tags Consumer consent provided? Submitted via Date sent to company Company response to consumer Timely response? Consumer disputed? Complaint ID
4 2022-12-29 Debt collection I do not know Attempts to collect debt not owed Debt is not yours I declare under penalty of perjury ( under the... Company has responded to the consumer and the ... Convergent Resources, Inc. HI 96818.0 Servicemember Consent provided Web 2022-12-29 Closed with explanation Yes NaN 6375521
10 2022-12-24 Checking or savings account Checking account Managing an account Deposits and withdrawals I opened up a account online XXXX weeks ago an... Company has responded to the consumer and the ... BMO HARRIS BANK NATIONAL ASSOCIATION AZ 85301.0 Servicemember Consent provided Web 2022-12-24 Closed with explanation Yes NaN 6358144
13 2022-12-16 Credit card or prepaid card Store credit card Fees or interest Unexpected increase in interest rate When signing up with the card they never tell ... Company has responded to the consumer and the ... SYNCHRONY FINANCIAL NY 11421.0 NaN Consent provided Web 2022-12-16 Closed with explanation Yes NaN 6329064
15 2022-12-20 Credit card or prepaid card General-purpose credit card or charge card Problem with a purchase shown on your statement Credit card company isn't resolving a dispute ... Received credit card statement dated XX/XX/22 ... Company has responded to the consumer and the ... U.S. BANCORP OH 45377.0 NaN Consent provided Web 2022-12-20 Closed with non-monetary relief Yes NaN 6338237
17 2022-12-16 Credit reporting, credit repair services, or o... Credit reporting Problem with a credit reporting company's inve... Their investigation did not fix an error on yo... I reviewed my Consumer Reports and noticed tha... Company has responded to the consumer and the ... Experian Information Solutions Inc. CA 93727.0 NaN Consent provided Web 2022-12-16 Closed with explanation Yes NaN 6323220
In [3]:
import numpy as np
# a quick look at the average length (in characters) of the complaints in each category
ticket_data.groupby('Product')['Consumer complaint narrative'].apply(lambda x: np.mean([len(narrative) for narrative in x]))
Out[3]:
Product
Bank account or service                                                         1243.543769
Checking or savings account                                                     1326.154371
Consumer Loan                                                                   1109.715945
Credit card                                                                     1127.125438
Credit card or prepaid card                                                     1260.682314
Credit reporting                                                                 750.135087
Credit reporting, credit repair services, or other personal consumer reports     846.862221
Debt collection                                                                  957.419525
Money transfer, virtual currency, or money service                              1217.202314
Money transfers                                                                 1153.176353
Mortgage                                                                        1651.228443
Other financial service                                                         1233.157534
Payday loan                                                                      747.893471
Payday loan, title loan, or personal loan                                       1151.683676
Prepaid card                                                                     963.180000
Student loan                                                                    1283.149795
Vehicle loan or lease                                                           1360.902611
Virtual currency                                                                 940.187500
Name: Consumer complaint narrative, dtype: float64
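
The cell above applies `len` to each narrative string, so the figures are character counts. As a minimal sketch of the difference between characters and words, the following uses a few invented records and a hypothetical helper, `mean_length_by_product`, rather than the real dataframe.

```python
from collections import defaultdict
from statistics import mean

# Toy records standing in for (Product, narrative) pairs; not real complaints.
records = [
    ("Credit card", "I was charged a late fee"),
    ("Credit card", "My card was closed"),
    ("Mortgage", "The escrow payment was wrong again"),
]

def mean_length_by_product(rows, measure):
    """Group narratives by product and average measure(narrative) per group."""
    grouped = defaultdict(list)
    for product, narrative in rows:
        grouped[product].append(measure(narrative))
    return {product: mean(values) for product, values in grouped.items()}

# len counts characters; splitting on whitespace first counts words.
mean_chars = mean_length_by_product(records, len)
mean_words = mean_length_by_product(records, lambda text: len(text.split()))

print(mean_chars)
print(mean_words)
```

With pandas, the same idea would be to measure `len(narrative.split())` instead of `len(narrative)` inside the `apply`.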
In [4]:
ticket_data = ticket_data[ticket_data['Product'] == 'Credit card']

# let's peek at what this looks like
 
ticket_data['Consumer complaint narrative'].iloc[:3].tolist()
Out[4]:
['Last month I started receiving calls from unknown numbers. They did leave a voicemail to call a number back or log on to citicards.com and they could help me. I XXXX the numbers and there were multiple people suspecting the number of fraud. So I logged on to citicards.com and there were no alerts. I sent a secure message about the calls and they gave me this reply, " Dear XXXX, Thank you for contacting us. We appreciate each and every opportunity to serve you. \n\nOur records do not show that we have called you regarding your account. \n\nIf you think that the card information is at risk, please call Customer Service immediately. Once your closure request is processed, the current card is closed and a new card number is established. \nXXXX. \nIf there is any way we can be of further assistance, please feel free to contact us. \n\nSincerely, Account Specialist South Dakota \'\' So I assumed it was fraud, but the calls continued. I finally was able to answer XXXX and it said that it was a citicards account that was past due. Because of the additional time that lapsed, they reported my account as delinquent. This seems unfair since I took the exact step from their phone call and was told by Citi that they were n\'t trying to get a hold of me. Now I have a late payment on my credit report from Citi because Citi told me they were n\'t trying to contact me.',
 "I was misled into thinking that I could have a XXXX alternative from XXXX in XXXX GA. They set me up with a Comenity Bank account for a debt of over {$5500.00}. After XXXX weeks I let XXXX know that I had zero results. They had me wait until a full 3 months had passed to see me, all the while I was making payments to the account. \n\nAfter 12 weeks XXXX got me in, took my XXXX, and took after pictures. I almost cried, I had gained XXXX and there were no differences in my pictures. I was clearly upset with the results and wanted to speak to someone. The nurse said she was sorry, suggested some natural ideas help with XXXX and said a manager would call me. \n\nOnce I finally heard from a manager they tried to sell me on a second round of the procedure for another $ XXXX. I refused and contacted Comenity Bank for help. \n\nThe service from XXXX was fraud. The pictures on the sales ads were deceptive, and it should be illegal to take advantage of consumers like this. They advertise XXXX alternative and show unrealistic before and after pictures. I had zero results and informed the company after 4 weeks to let them know. \n\nAfter I filed the dispute, Comenity Bank instructed me to work my issues out with XXXX. I contacted the XXXX corporate office and was told they would offer me a credit and that a manager in GA would reach out to me. \n\nThe manager in Georgia did reach out to me. She agreed that the service was NOT successful and offered me a store credit for future services of {$2800.00}, still wanting me to pay Comenity $ XXXX. I refused this offer and went back to Comenity bank to let them know we could not resolve this ourselves. \n\nI was told by the XXXX customer service that the dispute would be refiled on XX/XX/2017. I have written on the company XXXX and client portal for updates and help with no reply other than call customer service. 
Today I called customer service and they said they had no record of me contacting XXXX to resolve and thought they had sent a letter to inform me. The rep said she was n't sure why the letter was n't mailed and would reopen the case. \n\nI feel like I am getting the run around from the bank and have definitely been scammed by XXXX. PLEASE HELP ME! I have two other accounts with Comenity funded ( XXXX and XXXX ) that are paid on time and XXXX is actually paid in full. \n\nI have no problem paying my bills but do have a problem being taken advantage of. \nWhen I complained on the Comenity XXXX page a consumer referred me to contact your site for help. I hope that you can help me and shut companies like this down. I wish I would have done more research before using XXXX. I see that many thousands of other people are going through this too. \n\nKind regards, XXXX",
 'I once had a credit card with Fifth Third bank, which was closed bank in XX/XX/XXXX. In XX/XX/XXXX a charge that was stored on an online account automatically tried charging the card for the renewed subscription. Instead of Fifth Third bank declining the transaction they reopened my closed credit card without my permission. Finally, after three months of not contacting me via phone, email, or mail, I received a letter in the mail saying I owed the charge of {$77.00} ( for the service ) and an extra {$100.00} for late fees. After countless hours on the phone with them, they acknowledged my credit card was closed and they reopened it. I offered to pay the initial {$77.00} fee if they would waive the late fee for not informing me, or without reopening it without my permission. They said they would look into the dispute into a better solution. After about another month they sent back a letter saying the dispute was denied and I owed the entire fee. When I called back to ask how it could have been denied, since they reopened a closed account, they said the 120 day period has passed and there is nothing they could do about it. I have since discarded the credit card because I had no use for it after I had closed the account in XX/XX/XXXX.']

Preprocessing¶

In [5]:
from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, strip_numeric, remove_stopwords, strip_short, stem_text

def basic_preprocess(list_of_strings):
    """
    A basic function that takes a list of strings and runs some basic
    gensim preprocessing to tokenise each string.
    
    Operations:
        - convert to lowercase
        - remove html tags
        - remove punctuation
        - remove numbers
        - remove stopwords
        - remove short tokens (less than 3 characters)
    
    Outputs a list of lists
    """
    
    CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation, strip_numeric, remove_stopwords, strip_short]

    preproc_text = [preprocess_string(doc, CUSTOM_FILTERS) for doc in list_of_strings]
    
    return preproc_text
In [6]:
import re

def remove_twitterisms(list_of_strings):
    """
    Some regular expression statements to remove twitter-isms
    
    Operations:
        - remove links
        - remove @tag
        - remove #tag
        
    Returns list of strings with the above removed
    """
    
    # removing some standard twitter-isms

    list_of_strings = [re.sub(r"http\S+", "", doc) for doc in list_of_strings]

    list_of_strings = [re.sub(r"@\S+", "", doc) for doc in list_of_strings]

    list_of_strings = [re.sub(r"#\S+", "", doc) for doc in list_of_strings]
    
    return list_of_strings
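
A quick way to see what these three substitutions do is to run them on a single made-up string. The sketch below uses a hypothetical single-string variant, `strip_twitterisms`, with the same regular expressions; note that the `@\S+` pattern also removes the domain part of an email address, which may matter for complaint text.

```python
import re

def strip_twitterisms(doc):
    """Apply the same three substitutions as remove_twitterisms to one string."""
    doc = re.sub(r"http\S+", "", doc)  # links
    doc = re.sub(r"@\S+", "", doc)     # @tags (and anything else starting with @)
    doc = re.sub(r"#\S+", "", doc)     # #tags
    return doc

# Invented sample text, not taken from the complaints data.
sample = "contact me@example.com #help http://example.com/form now"
cleaned = strip_twitterisms(sample)

# Split on whitespace to ignore the gaps left behind by the removals.
print(cleaned.split())
```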
In [7]:
# removing emojis
# taken from https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b#gistcomment-3315605

def remove_emoji(string):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d" 
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # variation selector-16
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)
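
One thing to be aware of with this pattern is how wide some of its ranges are. The sketch below isolates the widest one and compares it with a narrower, hypothetical alternative (`NARROW` is my own selection of blocks, not part of the gist) on an invented string containing non-Latin text.

```python
import re

# The widest single range from remove_emoji, on its own.
BROAD = re.compile("[\U000024C2-\U0001F251]+")

# A narrower, hypothetical alternative restricted to a handful of emoji blocks.
NARROW = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u200d\ufe0f"           # zero-width joiner, variation selector-16
    "]+"
)

# Invented sample: an English word, two Chinese characters and one emoji.
sample = "refund \u4f60\u597d \U0001F600"

broad_result = BROAD.sub("", sample)
narrow_result = NARROW.sub("", sample)

print(repr(broad_result))
print(repr(narrow_result))
```

The broad range removes the Chinese characters, while the narrow pattern keeps them and removes only the emoji. For English-only complaint text this makes little difference, but it is worth checking before reusing the function on multilingual data.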
In [8]:
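def remove_redacted(list_of_strings):
    """
    Remove redaction markers (runs of two or more x/X characters) from each string.

    Returns list of strings with the above removed
    """
    list_of_strings = [re.sub(r"(x|X){2,}", "", doc) for doc in list_of_strings]

    return list_of_strings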
In [9]:
from gensim.models.phrases import Phrases

def n_gram(tokens):
    """Identifies common two/three word phrases using gensim module."""
    # Add bigrams and trigrams to docs.
    # Bigrams must appear 10 times or more (min_count); the trigram model uses gensim's default min_count.
    # threshold is the score a phrase must reach to be accepted
    bigram = Phrases(tokens, min_count=10, threshold=100)
    trigram = Phrases(bigram[tokens], threshold=100)

    for idx, val in enumerate(tokens):
        for token in bigram[tokens[idx]]:
            if '_' in token:
                if token not in tokens[idx]:
                    # Token is a bigram, add to document.
                    tokens[idx].append(token)
        for token in trigram[tokens[idx]]:
            if '_' in token:
                if token not in tokens[idx]:
                    # Token is a trigram, add to document.
                    tokens[idx].append(token)
    return tokens
In [10]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

def lemmatise(words):
    """
    Convert words to their lemma using the WordNet lemmatizer.

    Every token is lemmatised as a verb (pos 'v'), so plural nouns
    such as 'months' are left unchanged.
    """
    lemma = WordNetLemmatizer()
    # this function takes a list of lists of tokens
    return [[lemma.lemmatize(token,'v') for token in tokens] for tokens in words]
[nltk_data] Downloading package wordnet to /home/medacola/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
In [11]:
# let's slice out the text data from our dataframe
subsample_text = ticket_data['Consumer complaint narrative'].tolist()
In [12]:
# next we implement the preprocessing functions on our data

preprocessed_corpus = remove_twitterisms(subsample_text)

preprocessed_corpus = remove_redacted(preprocessed_corpus)

preprocessed_corpus = [remove_emoji(doc) for doc in preprocessed_corpus]

preprocessed_corpus = basic_preprocess(preprocessed_corpus)

preprocessed_corpus = lemmatise(preprocessed_corpus)
In [13]:
# let's compare the original strings to the preprocessed strings

print(subsample_text[0])
print("-------------------------")
print(preprocessed_corpus[0])
Last month I started receiving calls from unknown numbers. They did leave a voicemail to call a number back or log on to citicards.com and they could help me. I XXXX the numbers and there were multiple people suspecting the number of fraud. So I logged on to citicards.com and there were no alerts. I sent a secure message about the calls and they gave me this reply, " Dear XXXX, Thank you for contacting us. We appreciate each and every opportunity to serve you. 

Our records do not show that we have called you regarding your account. 

If you think that the card information is at risk, please call Customer Service immediately. Once your closure request is processed, the current card is closed and a new card number is established. 
XXXX. 
If there is any way we can be of further assistance, please feel free to contact us. 

Sincerely, Account Specialist South Dakota '' So I assumed it was fraud, but the calls continued. I finally was able to answer XXXX and it said that it was a citicards account that was past due. Because of the additional time that lapsed, they reported my account as delinquent. This seems unfair since I took the exact step from their phone call and was told by Citi that they were n't trying to get a hold of me. Now I have a late payment on my credit report from Citi because Citi told me they were n't trying to contact me.
-------------------------
['month', 'start', 'receive', 'call', 'unknown', 'number', 'leave', 'voicemail', 'number', 'log', 'citicards', 'com', 'help', 'number', 'multiple', 'people', 'suspect', 'number', 'fraud', 'log', 'citicards', 'com', 'alert', 'send', 'secure', 'message', 'call', 'give', 'reply', 'dear', 'thank', 'contact', 'appreciate', 'opportunity', 'serve', 'record', 'call', 'account', 'think', 'card', 'information', 'risk', 'customer', 'service', 'immediately', 'closure', 'request', 'process', 'current', 'card', 'close', 'new', 'card', 'number', 'establish', 'way', 'assistance', 'feel', 'free', 'contact', 'sincerely', 'account', 'specialist', 'south', 'dakota', 'assume', 'fraud', 'call', 'continue', 'finally', 'able', 'answer', 'say', 'citicards', 'account', 'past', 'additional', 'time', 'lapse', 'report', 'account', 'delinquent', 'unfair', 'take', 'exact', 'step', 'phone', 'tell', 'citi', 'try', 'hold', 'late', 'payment', 'credit', 'report', 'citi', 'citi', 'tell', 'try', 'contact']
In [14]:
print(subsample_text[2])
print("-------------------------")
print(preprocessed_corpus[2])
I once had a credit card with Fifth Third bank, which was closed bank in XX/XX/XXXX. In XX/XX/XXXX a charge that was stored on an online account automatically tried charging the card for the renewed subscription. Instead of Fifth Third bank declining the transaction they reopened my closed credit card without my permission. Finally, after three months of not contacting me via phone, email, or mail, I received a letter in the mail saying I owed the charge of {$77.00} ( for the service ) and an extra {$100.00} for late fees. After countless hours on the phone with them, they acknowledged my credit card was closed and they reopened it. I offered to pay the initial {$77.00} fee if they would waive the late fee for not informing me, or without reopening it without my permission. They said they would look into the dispute into a better solution. After about another month they sent back a letter saying the dispute was denied and I owed the entire fee. When I called back to ask how it could have been denied, since they reopened a closed account, they said the 120 day period has passed and there is nothing they could do about it. I have since discarded the credit card because I had no use for it after I had closed the account in XX/XX/XXXX.
-------------------------
['credit', 'card', 'fifth', 'bank', 'close', 'bank', 'charge', 'store', 'online', 'account', 'automatically', 'try', 'charge', 'card', 'renew', 'subscription', 'instead', 'fifth', 'bank', 'decline', 'transaction', 'reopen', 'close', 'credit', 'card', 'permission', 'finally', 'months', 'contact', 'phone', 'email', 'mail', 'receive', 'letter', 'mail', 'say', 'owe', 'charge', 'service', 'extra', 'late', 'fee', 'countless', 'hours', 'phone', 'acknowledge', 'credit', 'card', 'close', 'reopen', 'offer', 'pay', 'initial', 'fee', 'waive', 'late', 'fee', 'inform', 'reopen', 'permission', 'say', 'look', 'dispute', 'better', 'solution', 'month', 'send', 'letter', 'say', 'dispute', 'deny', 'owe', 'entire', 'fee', 'call', 'ask', 'deny', 'reopen', 'close', 'account', 'say', 'day', 'period', 'pass', 'discard', 'credit', 'card', 'use', 'close', 'account']
In [15]:
import nltk
import matplotlib.pyplot as plt
flat_list = [item for sublist in preprocessed_corpus for item in sublist]
text = nltk.Text(flat_list)
fdist = nltk.FreqDist(text)
plt.figure(figsize=(10,6))
fdist.plot(50)
Out[15]:
<AxesSubplot: xlabel='Samples', ylabel='Counts'>
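
If you only need the numbers behind a plot like this, the same counts can be sketched with the standard library. The documents below are invented stand-ins for `preprocessed_corpus`.

```python
from collections import Counter

# Hypothetical tokenised documents standing in for preprocessed_corpus.
toy_corpus = [
    ["card", "close", "fee"],
    ["card", "fee", "dispute"],
    ["card", "payment"],
]

# Flatten the list of lists and count each token.
counts = Counter(token for doc in toy_corpus for token in doc)
top_two = counts.most_common(2)
print(top_two)
```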

But notebooks can be hard to reproduce¶

Hidden state¶

In [19]:
def sum(x): return x + x
In [20]:
xx = sum(2)
In [22]:
xx == 4
Out[22]:
False
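
The gap in the prompt numbers (19, 20, 22) hints at what happened: something ran in between that is no longer visible. The sketch below imitates this by executing cell sources in a shared namespace; the content of the missing cell is a guess made up for illustration, not what was actually run.

```python
visible_cells = [
    "def sum(x): return x + x",  # In [19]; note that it shadows the built-in sum
    "xx = sum(2)",               # In [20]
]

# Hypothetical: a cell that ran as In [21] and was later deleted or edited away.
hidden_cell = "xx = xx + 1"

def run_cells(cells):
    """Execute cell sources in order in one shared namespace, as a kernel does."""
    namespace = {}
    for source in cells:
        exec(source, namespace)
    return namespace

# The live session has seen the hidden cell; a fresh run of the saved notebook has not.
live_session = run_cells(visible_cells + [hidden_cell])
fresh_run = run_cells(visible_cells)

print(live_session["xx"] == 4)
print(fresh_run["xx"] == 4)
```

The two runs disagree, which is why a notebook that "works on my machine" in a long-lived session can give different results when someone else runs it from the top.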

Naming notebooks¶

Jupyter notebook with untitled notebooks

Not sharing required dependencies¶

Module not found error
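
One low-effort way to avoid this is to record the versions of the packages a notebook imports. The sketch below uses the standard library's `importlib.metadata`; the helper name `pin_requirements` is hypothetical, and the package list is assumed from the imports used in this notebook.

```python
from importlib import metadata

def pin_requirements(package_names):
    """Return requirements-style lines for the installed version of each package."""
    lines = []
    for name in package_names:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible note rather than failing on a missing package.
            lines.append(f"# {name} is not installed in this environment")
    return lines

# Packages imported earlier in this notebook (an assumed list, adjust to taste).
for line in pin_requirements(["pandas", "gensim", "nltk", "matplotlib"]):
    print(line)
```

The printed lines could be saved to a `requirements.txt` alongside the notebook so that others can recreate a similar environment.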

Missing tests¶

Pytest run showing no tests run
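
Small helper functions like the preprocessing steps above are easy to test once they are written as plain functions. As a sketch, assuming a copy of `remove_redacted` is placed in a file that pytest can collect, a few tests might look like this; the test names are my own.

```python
import re

def remove_redacted(list_of_strings):
    """Copy of the notebook's helper: drop runs of two or more x/X characters."""
    return [re.sub(r"(x|X){2,}", "", doc) for doc in list_of_strings]

def test_strips_redaction_markers():
    assert remove_redacted(["I XXXX the numbers"]) == ["I  the numbers"]

def test_keeps_single_x():
    assert remove_redacted(["x-ray"]) == ["x-ray"]

def test_also_strips_double_x_inside_words():
    # Documents current behaviour: any "xx" run is removed, even mid-word.
    assert remove_redacted(["Exxon"]) == ["Eon"]
```

The last test records a side effect of the pattern rather than a desired feature; writing it down makes the behaviour visible to whoever reuses the function.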

Not sharing data with notebooks¶

Data Star Trek Next Generation

But there are steps you can take to improve reproducibility¶

Should this be a notebook? Or should it be a library?¶

Tom Hanks thinking face

Validate your notebooks¶

Example of running nbval

Julynter¶

JupyterLab interface with Julynter prompts

Binder¶

BinderHub logo

repo2docker¶

  • Transform your notebook repository into a Jupyter-ready container

repo2docker logo

For your next notebook why not try...¶

  • Naming it before you start
  • Clearing the outputs when you commit it to version control
  • Making sure it runs from top to bottom without errors in a fresh kernel (restart the kernel, then use "Run All")
  • Specifying your dependencies (data and packages) in the same repository
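
Clearing outputs can be done from the notebook interface, but since a notebook is JSON it can also be sketched in a few lines. The function below assumes the nbformat 4 layout, where code cells carry `outputs` and `execution_count` fields; `clear_outputs` and the example notebook are hypothetical.

```python
import copy

def clear_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs and counters removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned

# Hypothetical one-cell notebook used only to exercise the function.
example = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 22,
            "metadata": {},
            "source": ["xx == 4"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 22,
                    "data": {"text/plain": ["False"]},
                    "metadata": {},
                }
            ],
        }
    ],
}

cleared = clear_outputs(example)
print(cleared["cells"][0])
```

To apply this to a file, the dictionary would be read with `json.load` and written back with `json.dump`. The original dictionary is left untouched because the function works on a deep copy.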

Thanks for listening!¶

Looney Tunes That's all folks gif

Summary of tools¶

  • This presentation was made with RISE
  • nbval, a pytest plugin for validating your notebooks
  • BinderHub, share notebook repositories and run them in the cloud
  • repo2docker, build Jupyter-ready Docker images from notebook repositories
  • Julynter, an experimental linter plugin for JupyterLab