Sentiment analysis is a type of data mining that measures people's opinions by means of natural language processing (NLP), computational linguistics, and text analysis, which are used to extract and analyze subjective information from the web, mostly social media and similar sources. The analyzed data quantify the moods and reactions of the general public to certain products, people, or ideas, and reveal the contextual polarity of the information. Sentiment analysis is also known as opinion mining.
There are two approaches to sentiment analysis. The first is purely statistical: such algorithms treat a text as a bag of words (BoW), where word order, and therefore context, is ignored. The original text is only filtered down to the words considered to convey sentiment. Such models do not exploit any understanding of the language and use only statistical measures to classify the text. The other approach combines statistics and linguistics: the algorithms incorporate grammar rules, various natural language processing techniques, and statistics to train the machine to truly "understand" the language.
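To make the bag-of-words idea concrete, here is a minimal sketch (the sentence and counts are illustrative only, not from the dataset used later):
from collections import Counter

# Bag of words: a document becomes a multiset of tokens;
# word order, and with it context, is discarded.
doc = "great product, great price, not a great manual"
bow = Counter(doc.lower().replace(',', '').split())
print(bow)  # Counter({'great': 3, 'product': 1, 'price': 1, ...})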
Sentiment analysis can also be divided into two types, depending on the kind of result the analysis produces. Categorical / polarity: was this piece of text "positive", "neutral", or "negative"? Scalar / degree: a score on a predefined scale ranging from very positive to very negative. The latter is valence-based analysis, where the intensity of the sentiment is taken into account.
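A toy illustration of the two result types (the cut-off values here are arbitrary, chosen only for this example; later in the post we will define our own):
def polarity_label(score):
    # Categorical / polarity view: collapse a scalar score into classes.
    if score > 0.05:
        return 'positive'
    if score < -0.05:
        return 'negative'
    return 'neutral'

score = 0.85                  # scalar / degree: a point on a -1..+1 scale
print(polarity_label(score))  # categorical / polarity: 'positive'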
Sentiment analysis is applicable to many industries; it is great for anything where you can get unstructured feedback data about a service or product. One common use is for companies with Twitter or other social media accounts to analyze the feedback they receive.
In this post, we will look at the last words spoken by inmates on death row (data from Texas, where the death penalty is still administered) and analyze them with VADER. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon-based type of sentiment analysis: each word in the lexicon is rated as positive or negative, and in many cases also how positive or negative. This approach works best with shorter text forms such as tweets or single sentences.
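As a quick peek under the hood, the vaderSentiment package exposes its lexicon as a plain dictionary mapping a token to its mean valence rating (the words below are illustrative picks of mine):
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()
# Each lexicon entry is a human-rated valence on a roughly -4..+4 scale.
for word in ['love', 'hate', 'good', 'terrible']:
    print(word, analyser.lexicon.get(word))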
Import libraries:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
The data comes from Kaggle:
with open(r'D:\machinelearning\deathrow.csv', 'rt', encoding='utf-8', errors='ignore') as f:
    data = pd.read_csv(f)
data.head()
We will use only a few columns:
sen = data[['LastName', 'FirstName', 'Age', 'AgeWhenReceived', 'LastStatement']]
sen.head()
Now we build the analyzer:
analyser = SentimentIntensityAnalyzer()

def sentiment_analyzer_scores(sentence):
    # Print the sentence along with its VADER polarity scores.
    score = analyser.polarity_scores(sentence)
    print("{:-<40} {}".format(sentence, str(score)))
Let’s see how it performs on one example:
sentiment_analyzer_scores(sen['LastStatement'].iloc[2])
The positive, negative, and neutral results represent the proportions of the text that fall into these categories; they always add up to 1. Here the statement was assessed as 72% neutral, 24% positive, and 4% negative. The compound score is a metric that sums the valence ratings of all lexicon words found in the text and normalizes the sum to a range between -1 (most negative) and +1 (most positive). A compound of 0.85 indicates a very strongly positive sentiment.
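To see the structure of the returned dictionary, here is a quick check on a made-up sentence of my own (its exact numbers will differ from the example above):
score = analyser.polarity_scores("I am at peace and I love you all.")
print(score)                                       # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
print(score['neg'] + score['neu'] + score['pos'])  # the proportions sum to ~1.0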
Let’s compute the compound score for all statements:
i = 0
compval1 = []
while i < len(sen):
    # Keep only the normalized compound score for each statement.
    k = analyser.polarity_scores(sen.iloc[i]['LastStatement'])
    compval1.append(k['compound'])
    i = i + 1

compval1 = np.array(compval1)
sen['Vader Score'] = compval1
sen.head(10)
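As a side note, the same column could be built more idiomatically with pandas' apply; a minimal equivalent sketch:
# Equivalent to the loop above: map each statement to its compound score.
sen['Vader Score'] = sen['LastStatement'].apply(
    lambda s: analyser.polarity_scores(s)['compound'])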
Based on these scores, we can decide which statements count as positive and which as negative. Where we draw the lines is up to us; let's assume that all scores of 0.7 or above are positive, a score of exactly 0 is null (there was no last statement), scores between 0 and 0.7 are neutral, and scores below 0 are negative:
i = 0
predicted_value = []
while i < len(sen):
    score = sen.iloc[i]['Vader Score']
    if score >= 0.7:
        predicted_value.append('positive')
    elif 0 < score < 0.7:
        predicted_value.append('neutral')
    elif score == 0:
        predicted_value.append('null')
    else:  # score < 0
        predicted_value.append('negative')
    i = i + 1
sen['predicted sentiment'] = predicted_value
sen.head(10)
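The same binning can also be written as a small helper function and applied directly to the score column; a sketch equivalent to the loop above:
def label_sentiment(score):
    # Map a compound score to one of the four labels defined above.
    if score >= 0.7:
        return 'positive'
    elif score > 0:
        return 'neutral'
    elif score == 0:
        return 'null'
    return 'negative'

sen['predicted sentiment'] = sen['Vader Score'].apply(label_sentiment)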
Let’s see how sentiments break down:
sen['predicted sentiment'].value_counts()
We can conclude that the vast majority of the last statements were positive, i.e. expressing remorse for the actions committed. Negative and neutral statements together account for around 16%.
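That figure can be read directly off the normalized counts; a small sketch:
# Fraction of statements per label; negative + neutral gives the ~16% above.
shares = sen['predicted sentiment'].value_counts(normalize=True)
print(shares)
print((shares.get('negative', 0) + shares.get('neutral', 0)) * 100)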
Finally, let’s see what the most common words were:
tex = []
for row in sen['LastStatement']:
    tex.append(row)

from collections import Counter

# Join all statements and split on whitespace to get a flat word list.
joined = ','.join(tex)
words = joined.split()
counter = Counter(words)  # avoid shadowing the Counter class itself
most_occur = counter.most_common(20)
print(most_occur)
Words like "I", "to", "a", "the", etc. (so-called stop words) could also be excluded; we'll cover that in more detail later, but a quick preview follows below. This was just an introduction to sentiment analysis, and we will definitely come back to it.
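As a preview, a minimal sketch of filtering out stop words with a small hand-rolled list (a real pipeline would use a fuller list, e.g. the stopwords corpus shipped with NLTK):
from collections import Counter

# A tiny, hand-picked stop-word list, for illustration only.
stop_words = {'i', 'to', 'a', 'the', 'and', 'you', 'my', 'of', 'that', 'is'}

cleaned = [w.lower().strip('.,') for w in ' '.join(tex).split()]
filtered = [w for w in cleaned if w not in stop_words]
print(Counter(filtered).most_common(20))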