A chatbot is a piece of software that communicates with people in natural language, and it is commonly used in customer service. We have all heard of Siri, Alexa and Cortana, all great assistants, and we are about to create one ourselves. Let's call our assistant Jack (no idea why assistants usually get female names). We will teach Jack to answer simple questions. First we create templates:
jack_template = "Jack : {0}"
user_template = "USER : {0}"
def user_query(q):
    print(user_template.format(q))
    response = respond(q)
    print(jack_template.format(response))
We start simple. First we create a dictionary (responses) that maps frequent questions to their replies. Each question can have several replies, which adds some variety to Jack's answers.
import random
name = "Jack"
bday = "March 1st"
responses = {'Hi': ['Hello there'],
             "what's your name?": ['My name is {0}'.format(name), 'They call me {0}'.format(name), 'I go by {0}'.format(name), name],
             'How are you?': ['Excellent', 'Great'],
             'When is your birthday?': [bday, 'March', 'My birthday is on {0}'.format(bday), 'I was born on {0}'.format(bday)],
             'default': ["I don't understand. Sorry! I am still learning."]}
def respond(q):
    if q in responses:
        jack_reply = random.choice(responses[q])
    else:
        jack_reply = random.choice(responses['default'])
    return jack_reply
Jack answers only if we ask him the exact question; otherwise we get the default message. Let's test him:
user_query('Hi')
user_query("what's your name?")
user_query('How old are you?')
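Exact matching is strict: 'hi' or 'Hi ' with a trailing space would already fall through to the default reply. Here is a minimal sketch of one way to loosen this, assuming we normalize both the stored questions and the incoming one. The names normalize, faq, default_replies and lookup are hypothetical and are not part of Jack's code above.

```python
import random

def normalize(text):
    # Lowercase and trim whitespace so trivially different spellings compare equal
    return text.strip().lower()

# Hypothetical mini version of the responses dictionary, keyed by normalized questions
faq = {normalize(k): v for k, v in {
    'Hi': ['Hello there'],
    'How are you?': ['Excellent', 'Great'],
}.items()}
default_replies = ["I don't understand. Sorry! I am still learning."]

def lookup(q):
    # Fall back to the default replies when the normalized question is unknown
    return random.choice(faq.get(normalize(q), default_replies))

print(lookup('  hi '))
print(lookup('HOW ARE YOU?'))
print(lookup('How old are you?'))
```

The trade-off is that questions still have to match word for word; only case and surrounding whitespace are forgiven.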
We can teach Jack to answer more complex questions with the help of regular expressions. The questions can be more sophisticated and do not have to be asked in an exact form. It is enough that they contain a common phrase such as 'I want…' or 'do you think…', and Jack will have a few answers to choose from.
res = {'I want (.*)': ['and I want a million dollars', 'Why do you want {0}', "What's stopping you from getting {0}"],
       'do you think (.*)': ['if {0}? Of course.', 'Nope', 'A bit...yes'],
       'if (.*)': ["Do you really think it's likely that {0}", 'Do you wish that {0}', 'What do you think about {0}']}
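To see what the (.*) group in these patterns does, here is a small sketch using only the standard re module. It takes one pattern and one reply template from the dictionary above; the variable names found and reply are illustrative.

```python
import re

pattern = 'I want (.*)'
found = re.search(pattern, 'I want a strawberry')

# group(0) is the whole match, group(1) is what (.*) captured
print(found.group(0))  # I want a strawberry
print(found.group(1))  # a strawberry

# The captured text can then be dropped into a reply template
reply = "What's stopping you from getting {0}".format(found.group(1))
print(reply)

# re.search returns None when the pattern does not occur in the text
print(re.search(pattern, 'Could you help me?'))
```

Because re.search scans the whole string, the phrase may appear anywhere in the question, not only at the start.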
To make it more personal we teach Jack to replace pronouns:
import re
def swap_pron(q):
    # Swap first- and second-person pronouns in a single pass,
    # so a word that was already swapped is never swapped back
    swaps = {'i': 'you', 'me': 'you', 'my': 'your', 'your': 'my', 'you': 'me'}
    # \b limits the match to whole words ("me" but not the "me" in "some")
    pattern = r'\b(' + '|'.join(swaps) + r')\b'
    return re.sub(pattern, lambda m: swaps[m.group(1)], q.lower())
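One detail worth illustrating: plain substring replacement also rewrites letters inside longer words. Here is a small sketch, assuming we want whole-word swaps only, that compares it with the \b word-boundary anchor from the re module. The names text, naive and whole_word are illustrative.

```python
import re

text = 'tell me something'

# Plain substring replacement also hits the "me" inside "something"
naive = re.sub('me', 'you', text)
print(naive)       # tell you soyouthing

# \b only matches at word boundaries, so just the standalone word is replaced
whole_word = re.sub(r'\bme\b', 'you', text)
print(whole_word)  # tell you something
```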
Now we can match a question against our regex patterns and form a reply:
def match(res, q):
    # Default reply when no pattern matches
    response, phrase = "I don't understand. Still learning my friend", None
    for pattern, replies in res.items():
        found = re.search(pattern, q)
        if found is not None:
            response = random.choice(replies)
            if '{0}' in response:
                phrase = found.group(1)
    # Return the raw template and the captured phrase; respond() fills it in
    return response, phrase
And we can finally modify our respond function:
def respond(q):
    response, phrase = match(res, q)
    if '{0}' in response:
        # Swap pronouns in the captured phrase before inserting it into the reply
        response = response.format(swap_pron(phrase))
    return response
Let’s talk to Jack:
user_query('I want a strawberry')
user_query('do you think this is possible?')
user_query('Could you help me solve this problem?')
So far we have used a rule-based approach, where Jack answers questions according to rules we defined. Another approach is self-learning, where Jack uses machine learning to answer more complex questions. Self-learning bots divide further into two kinds: retrieval-based and generative. We are going to build a retrieval-based bot with a repository of pre-defined responses for Jack to choose from, unlike generative models, which can produce responses never seen before. Jack's repository will be the text of the Wikipedia article on Poland.
import warnings
warnings.filterwarnings("ignore")
import wikipedia as wp
repository = wp.page('Poland').content; repository[:100]
We are going to use nltk, a library for working with human language. First we need to process the text so that Python can work with it. This takes a few steps: standardize the text to either lowercase or uppercase, tokenize it (split it into sentences and words), and then normalize it (stemming or lemmatization):
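import nltk
# First run only: the tokenizers and WordNetLemmatizer need NLTK data,
# e.g. nltk.download('punkt') and nltk.download('wordnet')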
import string
repository=repository.lower()
sent_tokens = nltk.sent_tokenize(repository)
word_tokens = nltk.word_tokenize(repository)
sent_tokens[:2]
lemmer = nltk.stem.WordNetLemmatizer()
def lem_tokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def lem_norm(text):
    return lem_tokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
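To see what the punctuation table does on its own, here is a small sketch that uses only the standard library. It assumes a plain whitespace split in place of nltk.word_tokenize and skips lemmatization, so it approximates lem_norm rather than reproducing it. The names remove_punct and simple_norm are hypothetical.

```python
import string

# Map every punctuation character's code point to None so translate() deletes it
remove_punct = {ord(ch): None for ch in string.punctuation}

def simple_norm(text):
    # Lowercase, strip punctuation, then split on whitespace (no lemmatization here)
    return text.lower().translate(remove_punct).split()

print(simple_norm("Poland's capital is Warsaw, isn't it?"))
# ['polands', 'capital', 'is', 'warsaw', 'isnt', 'it']
```

Note that apostrophes are deleted as well, so "isn't" becomes "isnt".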
As before, we define a set of greeting responses:
user_greeting = ("hello", "hi", "greetings", "what's up","hey")
jack_greeting = ["hi", "hey", "hi there", "hello", "I am glad! You are talking to me", 'long time no see']
def greeting(sentence):
    for word in sentence.split():
        if word.lower() in user_greeting:
            return random.choice(jack_greeting)
Now we need to transform the text into a more meaningful vector of numbers. To do that we use bag of words and TF-IDF, techniques that convert sentences into numeric vectors. Then we can compute cosine similarity, a metric that measures how similar two documents are irrespective of their size: it is the cosine of the angle between the two vectors in a multi-dimensional space.
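To make the "irrespective of their size" point concrete, here is a small sketch of cosine similarity on plain word counts, written with the standard library only. It is an illustration rather than what scikit-learn does internally; the cosine function and the sample sentences are made up for this example.

```python
import math
from collections import Counter

def cosine(a, b):
    # a and b are word-count dictionaries (bag-of-words vectors)
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

short = Counter('warsaw is the capital'.split())
longer = Counter('warsaw is the capital warsaw is the capital'.split())
other = Counter('bears live in forests'.split())

print(cosine(short, longer))  # 1.0: same direction, different length
print(cosine(short, other))   # 0.0: no shared words
```

Repeating a sentence doubles the vector's length but not its direction, so the similarity stays at 1.0.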
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
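def respond(user):
    jack_response = ''
    # Temporarily add the question so it is vectorized together with the repository
    sent_tokens.append(user)
    TfidfVec = TfidfVectorizer(tokenizer=lem_norm, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # Similarity of the question (last row) to every sentence, itself included
    cos = cosine_similarity(tfidf[-1], tfidf)
    # The best match is the question itself, so take the second-best index
    idx = cos.argsort()[0][-2]
    flat = cos.flatten()
    flat.sort()
    r_tfidf = flat[-2]
    if r_tfidf == 0:
        jack_response = jack_response + "I am sorry! I don't know the answer"
        return jack_response
    else:
        jack_response = jack_response + sent_tokens[idx]
        return jack_response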
flag=True
print("Jack: Hi, My name is Jack. I will try to answer your queries about Poland. If you don't wish to continue, type Bye!")
while flag:
    user = input()
    user = user.lower()
    if user != 'bye':
        if user == 'thanks' or user == 'thank you':
            flag = False
            print("Jack: You are very welcome..")
        else:
            if greeting(user) is not None:
                print("Jack: " + greeting(user))
            else:
                print("Jack: ", end="")
                print(respond(user))
                sent_tokens.remove(user)
    else:
        flag = False
        print("Jack: See you soon")
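For a feel of why TF-IDF helps Jack pick a sentence, here is a small standard-library sketch of the inverse document frequency part. It uses the smoothed formula ln((1 + n) / (1 + df)) + 1, which as far as I know is the default in scikit-learn (smooth_idf=True). The three-sentence corpus and the smoothed_idf helper are made up for illustration.

```python
import math

docs = [
    'poland is in europe',
    'warsaw is in poland',
    'the vistula is a river',
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def smoothed_idf(term):
    # Document frequency: in how many documents the term appears
    df = sum(1 for doc in tokenized if term in doc)
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1

for term in ['is', 'poland', 'river']:
    print(term, round(smoothed_idf(term), 3))
# is 1.0
# poland 1.288
# river 1.693
```

A word that appears in every sentence gets the lowest weight, while a rare word such as 'river' counts for more when sentences are compared.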
Jack is getting better and can answer some queries. His answers are hit or miss, so there is still room for improvement. We will keep on teaching Jack, as I can see great potential. TBC