Chatbots Aren't as Difficult to Make as You Think

Creating our StackOverflow ChatBot

OK, so we are finally at a stage where we can do something we love: use data science to power our application/chatbot.

Let us start with creating a rough architecture of what we are going to do next.

The Architecture of our StackOverflow Chatbot

We will need to create two classifiers and save them as .pkl files.

  1. Intent Classifier: This classifier predicts whether a question is a Stack Overflow question or not. If it is not, we let ChatterBot handle it.
  2. Programming-Language (Tag) Classifier: If the question is a Stack Overflow question, this classifier predicts which programming language it belongs to. We do this so we can search only among the questions for that language in our database.

To keep things simple, we will stick to plain TFIDF models. We will need to save these TFIDF vectorizers as well.

We will also need to store word vectors for every question for similarity calculations later.
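
Before writing any code, here is the answering flow we are building towards, as an informal comment sketch (the concrete pieces follow in the steps below):

# The answering flow we are building towards (informal sketch):
# user question
#   -> intent classifier (TFIDF + Logistic Regression): 'dialogue' vs 'stackoverflow'
#   -> if 'dialogue': hand the message to ChatterBot
#   -> if 'stackoverflow': tag classifier predicts the programming language
#   -> similarity search over the stored embeddings for that tag
#   -> reply with a link to the most similar Stack Overflow question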

Let us go through the process step by step. You can get the full code in this Jupyter notebook in my project repository.
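
The snippets below assume roughly the following imports; this is a sketch, and the exact list in the notebook may differ slightly:

# Imports assumed by the snippets below (sketch; the notebook may differ slightly)
import re
import pickle
import numpy as np
import pandas as pd
import gensim
from nltk.corpus import stopwords          # may need nltk.download('stopwords') once
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import pairwise_distances_argmin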

Step 1. Reading and Visualizing the Data

dialogues = pd.read_csv("data/dialogues.tsv", sep="\t")
posts = pd.read_csv("data/tagged_posts.tsv", sep="\t")
dialogues.head()
Dialogues Data
posts.head()
StackOverflow Posts data
print("Num Posts:",len(posts))
print("Num Dialogues:",len(dialogues))

Num Posts: 2171575
Num Dialogues: 218609
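
For reference, the columns we rely on later are text in the dialogues data, and title, tag, and post_id in the posts data. A quick sanity check (a sketch):

# Columns used later in this post
print(dialogues.columns.tolist())   # should include 'text'
print(posts.columns.tolist())       # should include 'title', 'tag', 'post_id'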

Step 2: Create training data for Intent classifier — Chitchat/StackOverflow Question

We will use a TFIDF model with Logistic Regression to do this. If you want to know more about TFIDF, you can read about it here.

We could also have used a deep learning model or a transfer learning approach here, but since the main objective of this post is to get a chatbot up and running, and not to chase accuracy, we stick with the TFIDF-based model.
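
If TFIDF is new to you, here is a tiny toy example of what a TfidfVectorizer produces; this is just an illustration and not part of the pipeline:

# Toy illustration of TFIDF features (not part of the pipeline)
toy_docs = ["how to sort a list in python",
            "python list comprehension example",
            "hi how are you doing today"]
toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs)
print(toy_matrix.shape)          # (3 documents, vocabulary size)
print(toy_matrix.toarray()[0])   # TFIDF weights of the first document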

texts = list(dialogues[:200000].text.values) + list(posts[:200000].title.values)
labels = ['dialogue']*200000 + ['stackoverflow']*200000
data = pd.DataFrame({'text': texts, 'target': labels})

def text_prepare(text):
    """Performs tokenization and simple preprocessing."""
    replace_by_space_re = re.compile('[/(){}\[\]\|@,;]')
    bad_symbols_re = re.compile('[^0-9a-z #+_]')
    stopwords_set = set(stopwords.words('english'))

    text = text.lower()
    text = replace_by_space_re.sub(' ', text)
    text = bad_symbols_re.sub('', text)
    text = ' '.join([x for x in text.split() if x and x not in stopwords_set])
    return text.strip()

# Doing some data cleaning
data['text'] = data['text'].apply(lambda x: text_prepare(x))

X_train, X_test, y_train, y_test = train_test_split(data['text'], data['target'], test_size=.1, random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))

Train size = 360000, test size = 40000

Step 3. Create Intent classifier

Here we create a TFIDF vectorizer to build features and train a Logistic Regression model as the intent classifier. Note how we save the TFIDF vectorizer to resources/tfidf.pkl and the intent classifier to resources/intent_clf.pkl.

We will need these files when we write the SimpleDialogueManager class for our final chatbot.

# We will keep our models and vectorizers in this folder
!mkdir resources

def tfidf_features(X_train, X_test, vectorizer_path):
    """Performs TF-IDF transformation and dumps the model."""
    tfv = TfidfVectorizer(dtype=np.float32, min_df=3, max_features=None,
                          strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                          ngram_range=(1, 3), use_idf=1, smooth_idf=1, sublinear_tf=1,
                          stop_words='english')

    X_train = tfv.fit_transform(X_train)
    X_test = tfv.transform(X_test)

    pickle.dump(tfv, vectorizer_path)
    return X_train, X_test

X_train_tfidf, X_test_tfidf = tfidf_features(X_train, X_test, open("resources/tfidf.pkl", 'wb'))
intent_recognizer = LogisticRegression(C=10,random_state=0)
intent_recognizer.fit(X_train_tfidf,y_train)
pickle.dump(intent_recognizer, open("resources/intent_clf.pkl" , 'wb'))
# Check test accuracy.
y_test_pred = intent_recognizer.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))

Test accuracy = 0.989825

The intent classifier reaches a pretty good test accuracy of about 99%. TFIDF is not so bad.
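
To get a feel for it, we can try the classifier on a couple of made-up inputs; a quick sketch, where the exact predictions depend on the trained model:

# Try the intent classifier on made-up inputs (predictions depend on the trained model)
tfidf_vectorizer = pickle.load(open("resources/tfidf.pkl", 'rb'))
examples = ["hey how are you doing", "how to merge two dictionaries in python"]
example_features = tfidf_vectorizer.transform([text_prepare(t) for t in examples])
print(intent_recognizer.predict(example_features))   # expect something like ['dialogue' 'stackoverflow']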

Step 4: Create Programming Language classifier

Let us first create the data for the programming language classifier and then train a Logistic Regression model using TFIDF features. We save this tag classifier at resources/tag_clf.pkl.

We do this step mostly because we don't want to run similarity calculations over the whole database of questions, but only over the subset of questions with the predicted language tag.

# creating the data for Programming Language classifier 
X = posts['title'].values
y = posts['tag'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))

Train size = 1737260, test size = 434315

vectorizer = pickle.load(open("resources/tfidf.pkl", 'rb'))
X_train_tfidf, X_test_tfidf = vectorizer.transform(X_train), vectorizer.transform(X_test)
tag_classifier = OneVsRestClassifier(LogisticRegression(C=5,random_state=0))
tag_classifier.fit(X_train_tfidf,y_train)
pickle.dump(tag_classifier, open("resources/tag_clf.pkl", 'wb'))
# Check test accuracy.
y_test_pred = tag_classifier.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))

Test accuracy = 0.8043816124241622

Not bad again.
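
A quick sanity check on a made-up question title (the prediction depends on the trained model):

# Try the tag classifier on a made-up question title
sample_features = vectorizer.transform(["how to read a csv file into a dataframe"])
print(tag_classifier.predict(sample_features))   # expect something like ['python']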

Step 5: Store Question database Embeddings

One can use pre-trained word vectors from Google, or get better results by training embeddings on one's own data.

Since accuracy and precision are again not the primary goals of this post, we will use pretrained vectors.

# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
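
Each word in the model maps to a 300-dimensional vector, which is all we need for the next step. A quick check:

# Each word in the vocabulary maps to a 300-dimensional vector
print(model['python'].shape)   # (300,)
print('python' in model)       # True if the word is in the vocabulary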

We want to convert every question to an embedding and store the embeddings, so we don't have to recompute them for the whole dataset every time.

In essence, whenever the user asks a Stack Overflow question, we want to use some distance similarity measure to get the most similar question.

def question_to_vec(question, embeddings, dim=300):
    """
    question: a string
    embeddings: dict where the key is a word and the value is its embedding
    dim: size of the representation

    result: vector representation for the question
    """
    word_tokens = question.split(" ")
    question_len = len(word_tokens)
    question_mat = np.zeros((question_len, dim), dtype=np.float32)

    for idx, word in enumerate(word_tokens):
        if word in embeddings:
            question_mat[idx, :] = embeddings[word]

    # remove zero-rows which stand for OOV words
    question_mat = question_mat[~np.all(question_mat == 0, axis=1)]

    # Compute the mean of each word along the sentence
    if question_mat.shape[0] > 0:
        vec = np.array(np.mean(question_mat, axis=0), dtype=np.float32).reshape((1, dim))
    else:
        vec = np.zeros((1, dim), dtype=np.float32)

    return vec
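
A quick check that the function behaves as expected (exact numbers depend on the embeddings):

# question_to_vec returns one 300-dimensional vector per question
vec = question_to_vec("how to sort a list in python", model, 300)
print(vec.shape)   # (1, 300)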

counts_by_tag = posts.groupby(by=['tag'])["tag"].count().reset_index(name = 'count').sort_values(['count'], ascending = False)
counts_by_tag = list(zip(counts_by_tag['tag'],counts_by_tag['count']))
print(counts_by_tag)

[('c#', 394451), ('java', 383456), ('javascript', 375867), ('php', 321752), ('c_cpp', 281300), ('python', 208607), ('ruby', 99930), ('r', 36359), ('vb', 35044), ('swift', 34809)]

We save the embeddings in a folder aptly named resources/embeddings_folder.

This folder will contain a .pkl file for every tag. For example, one of the files will be python.pkl.

!mkdir resources/embeddings_folder

for tag, count in counts_by_tag:
    tag_posts = posts[posts['tag'] == tag]
    tag_post_ids = tag_posts['post_id'].values
    tag_vectors = np.zeros((count, 300), dtype=np.float32)
    for i, title in enumerate(tag_posts['title']):
        tag_vectors[i, :] = question_to_vec(title, model, 300)
    # Dump post ids and vectors to a file.
    filename = 'resources/embeddings_folder/' + tag + '.pkl'
    pickle.dump((tag_post_ids, tag_vectors), open(filename, 'wb'))
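
We can load one of the dumps back to verify it looks right (a quick check):

# Load one tag's dump back to verify the contents
ids, vectors = pickle.load(open('resources/embeddings_folder/python.pkl', 'rb'))
print(len(ids), vectors.shape)   # number of python posts, and (number of posts, 300)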

We are nearing the end now. We need a function that, given a question and its programming language, returns the post ID of the most similar question in the dataset. Here it is:

def get_similar_question(question, tag):
    # get the path where all question embeddings are kept and load the post ids and embeddings
    embeddings_path = 'resources/embeddings_folder/' + tag + ".pkl"
    post_ids, post_embeddings = pickle.load(open(embeddings_path, 'rb'))
    # Get the embedding for the question
    question_vec = question_to_vec(question, model, 300)
    # find index of most similar post
    best_post_index = pairwise_distances_argmin(question_vec, post_embeddings)
    # return best post id
    return post_ids[best_post_index]

get_similar_question("how to use list comprehension in python?", 'python')

array([5947137])

We can use this post ID and find the question at https://stackoverflow.com/questions/5947137

The question the similarity checker suggested has the actual text: “How can I use a list comprehension to extend a list in python? [duplicate]”

Not too bad, but we could have done better if we trained our own embeddings or used StarSpace embeddings.