Sentiment analysis, also known as ‘opinion mining’, is a sub-field of Natural Language Processing that tries to identify and extract opinions from text such as blogs, reviews, social media posts, forum threads, and news articles.
A typical Sentiment Analysis model takes in a huge corpus of data, such as user reviews, identifies a pattern, and infers a conclusion based on real evidence rather than assumptions made on a small sample of data.
Sentiment analysis solves a number of genuine business problems. It can predict customer behavior for a particular product, help to test the adaptability of a product, automate the preparation of customer preference reports, and show how well a product is performing by analysing the sentiment behind reviews from a number of platforms.
This tutorial will demonstrate how to perform sentiment analysis on tweets to determine whether they are of positive sentiment or negative sentiment. We will be making use of Python’s NLTK (Natural Language Toolkit) library, which is a very commonly used library in the analysis of textual data.
The entire process can be divided into four sections:
- Preparing the data for analysis
- Cleaning the data
- Normalisation of the data
- Building and evaluating the model
Preparing the data
First, install the necessary packages at the terminal.
Open a new (Jupyter) notebook, and import the nltk library.
Download the “twitter_samples” package from nltk, from which we will build our dataset of sample tweets.
We will now combine these positive and negative tweets into a single Pandas dataframe to make data preprocessing easier.
The result is a single dataframe with 10,000 rows (5,000 positive tweets and 5,000 negative tweets). The rows are shuffled so that positive and negative tweets are interleaved.
The dataframe will have two columns — ‘Tweet’ and ‘Sentiment’. Here, sentiment will be a binary value — 0 for a negative sentiment and 1 for positive sentiment.
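A minimal sketch of this step, assuming pandas is installed. The helper name `build_tweet_frame` and its `seed` parameter are illustrative and not part of the original tutorial; the two input lists are assumed to hold the positive and negative tweets as plain strings.

```python
import pandas as pd


def build_tweet_frame(positive, negative, seed=42):
    """Combine two lists of tweets into one shuffled, labelled dataframe."""
    frame = pd.DataFrame(
        {
            "Tweet": list(positive) + list(negative),
            # 1 marks a positive tweet, 0 a negative one.
            "Sentiment": [1] * len(positive) + [0] * len(negative),
        }
    )
    # sample(frac=1) returns every row in random order; the seed keeps
    # the shuffle reproducible between runs.
    return frame.sample(frac=1, random_state=seed).reset_index(drop=True)


# Tiny stand-in lists; the real inputs would hold 5,000 tweets each.
df = build_tweet_frame(["love this", "great day"], ["worst day ever"])
print(df)
```

With NLTK, the two lists would typically come from `twitter_samples.strings('positive_tweets.json')` and `twitter_samples.strings('negative_tweets.json')`, after running `nltk.download('twitter_samples')`.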
We can now clean this data to prepare it for training the model. The existing data is unclean because it may contain slang, abbreviations, emojis, or a number of “stop words” that do not add significant meaning to text. We will demonstrate how to eliminate these problems from the data.
Let’s start by converting all the words to lowercase.
Now, we will remove all URLs from the data, as they do not add any meaning and do not aid in detecting sentiment. To do this, write a function that uses a regular expression to match URLs in the text and strip them out.
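One possible version of such a function is sketched here. The pattern is a deliberately simple assumption (anything starting with `http://`, `https://`, or `www.` up to the next whitespace) and the name `remove_urls` is illustrative; real-world URL matching may need a stricter expression.

```python
import re

# Assumed pattern: a scheme or "www." followed by non-whitespace characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_urls(text):
    """Replace every URL with a space, then collapse repeated whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())


print(remove_urls("check this https://t.co/abc123 now"))
```

On a dataframe, this could be applied to each row with `df['Tweet'].apply(remove_urls)`.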
Slang and abbreviations are common on social media. You can choose to convert abbreviations to their full forms to extract more meaning from them. To do this, you need a file of common abbreviations and their full forms, with one pair per line and the two fields separated by a tab. Replace <path to CSV file> in the code with the path to your file.
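The lookup could be sketched as follows, assuming each line of the file holds an abbreviation, a tab, and its full form. The function names and the two sample entries are hypothetical; `load_abbreviations` accepts any iterable of lines, so an open file handle works as well as the in-memory list used here.

```python
import csv


def load_abbreviations(lines):
    """Build an abbreviation -> full form lookup from tab-separated lines."""
    mapping = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank or malformed rows that lack both fields.
        if len(row) >= 2:
            mapping[row[0].strip().lower()] = row[1].strip().lower()
    return mapping


def expand_abbreviations(text, mapping):
    """Swap each whitespace-separated word for its full form, if known."""
    return " ".join(mapping.get(word, word) for word in text.split())


# Illustrative entries standing in for the contents of the real file.
abbreviations = load_abbreviations(["brb\tbe right back", "idk\tI do not know"])
print(expand_abbreviations("idk brb", abbreviations))
```

To read from disk instead, open the file with `open(path, newline='', encoding='utf-8')` and pass the handle to `load_abbreviations`. The lookup expects lowercased text, which matches the earlier lowercasing step.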
We will now process emoticons. These are important in expressing sentiment and thus, it would not be advisable to completely remove them from the text. It would be better to replace them with the actual emotion that they are conveying.
To deal with emoticons, we can load a Python dictionary of common emoticons. For emojis, we will be making use of the Python package “emoji” and its method “demojize”. In doing so, we will replace all the occurrences of emoticons (or “smilies”) and emojis with their actual meaning.
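A small sketch of the emoticon half of this step. The `EMOTICONS` dictionary below is a tiny hypothetical sample, not the dictionary used in the original tutorial, and the keys are lowercase because the text is assumed to have been lowercased already (so ":D" appears as ":d").

```python
# Hypothetical sample; a real dictionary would cover many more emoticons.
EMOTICONS = {
    ":)": "happy",
    ":-)": "happy",
    ":(": "sad",
    ":-(": "sad",
    ":d": "laughing",
    ";)": "wink",
}


def replace_emoticons(text, emoticons=EMOTICONS):
    """Replace each whitespace-separated emoticon with the emotion it conveys."""
    return " ".join(emoticons.get(token, token) for token in text.split())


print(replace_emoticons("great day :)"))
```

Emojis can be handled in the same pass by calling `emoji.demojize(text)`, which replaces each emoji with a colon-delimited textual name. This sketch only matches emoticons that stand alone as a token; one glued to a word (for example "day:)") would need a regular-expression approach instead.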
We can now remove any remaining noise, which includes Twitter handles, punctuation, numbers, and special characters. None of these add value in terms of detecting sentiment.
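This clean-up can be sketched with two regular expressions. The patterns and the name `remove_noise` are assumptions for illustration: a handle is taken to be "@" followed by word characters, and everything that is not an ASCII letter or whitespace is treated as noise. The sketch assumes emoticons have already been converted to words, since it would otherwise strip them.

```python
import re

HANDLE_PATTERN = re.compile(r"@\w+")              # e.g. @some_user
NON_LETTER_PATTERN = re.compile(r"[^a-zA-Z\s]")   # digits, punctuation, symbols


def remove_noise(text):
    """Drop handles, then every character that is not a letter or whitespace."""
    text = HANDLE_PATTERN.sub(" ", text)
    text = NON_LETTER_PATTERN.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return " ".join(text.split())


print(remove_noise("@user thanks!!! 100% agree #happy"))
```

Note that this also splits contractions at the apostrophe ("don't" becomes "don t"), so contractions are best expanded beforehand if they matter for the analysis.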
As the last step in cleaning the text, we need to remove all stopwords from the text. Stopwords are the most common words used in any natural language. For the purpose of analysing text data and building NLP models, these stopwords do not generally add value to the meaning of the document. However, in our use case, negation words that appear in standard stopword lists (such as “not” and “no”) could be very important in the detection of negative sentiment; hence, we can edit the list of stopwords before applying it to the text.
Before removing the stopwords, the text must be tokenised or split into smaller parts, which are called ‘tokens’: a token is a sequence of characters in text which serves as a single unit.
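The two steps can be sketched in plain Python. The stopword list and the `NEGATIONS` set here are short illustrative samples, and whitespace splitting stands in for a proper tokeniser; the function names are likewise hypothetical.

```python
# Negation words we want to keep even though stopword lists usually include them.
NEGATIONS = {"no", "not", "nor", "never"}


def build_stopword_set(stopwords, keep=NEGATIONS):
    """Return the stopwords to remove, excluding the words we want to keep."""
    return set(stopwords) - set(keep)


def tokenise(text):
    """Very simple tokeniser: split the text on whitespace."""
    return text.split()


def remove_stopwords(tokens, stopword_set):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopword_set]


# Illustrative sample of a stopword list.
stopword_set = build_stopword_set(["the", "is", "a", "not", "no"])
tokens = tokenise("the movie is not good")
print(remove_stopwords(tokens, stopword_set))
```

With NLTK, the full English list is available as `nltk.corpus.stopwords.words('english')` (after downloading the `stopwords` resource), and `nltk.tokenize.word_tokenize` offers a more careful tokeniser than `str.split`.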
Normalisation in NLP is the process of converting a word to its canonical form. It helps to group together words with the same meaning but different forms. Without normalisation, “clean”, “cleans”, and “cleaning” would be treated as different words, even though you might want them to be treated as the same word.
Two popular normalisation techniques are stemming and lemmatization. Both reduce inflected words to a root form. The difference is that a stem might not be an actual word, whereas lemmatization results in an actual word.
For example, consider the word “cries”. Stemming would reduce the word to “cri”, which is not an actual word in the English language, whereas lemmatization would reduce the word to “cry”.
In this project, we will make use of lemmatization.
Some word forms are spelled the same but differ in context or meaning. To make the lemmatization context-dependent, we need to find the POS (Part of Speech) tag of each token and pass it to the lemmatizer. We first find the POS tag for each token and then lemmatize the token based on that tag.
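A sketch of how the tag can be passed along, assuming NLTK's `pos_tag` (which returns Penn Treebank tags) and `WordNetLemmatizer` (which expects the WordNet codes "n", "v", "a", or "r"). The helper names are illustrative, and treating every unrecognised tag as a noun is a simplifying assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the WordNet part-of-speech code."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # default to noun


def lemmatize_tokens(tokens):
    """Tag each token, then lemmatize it using the mapped POS code."""
    # Imported here so the mapping above can be used without NLTK installed.
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [
        lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
        for token, tag in pos_tag(tokens)
    ]


print(penn_to_wordnet("VBD"))
```

Calling `lemmatize_tokens` requires the WordNet corpus and the perceptron POS tagger to be downloaded with `nltk.download`; the exact resource names depend on the NLTK version.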
Machine learning models (e.g. linear regression, logistic regression, decision trees, or neural networks) can only take numeric inputs. How do we convert our text into numerical data that a computer will understand? CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms each text into a vector based on the frequency (count) of each vocabulary word that occurs in it.
Some of the arguments that we can pass to the CountVectorizer are as follows:
min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
max_features: If not ‘None’, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
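The effect of these arguments can be seen on a toy corpus, sketched here under the assumption that scikit-learn is installed. The four sample texts and the chosen values (`min_df=2`, `max_features=1000`) are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: "bad" and "awful" each occur in only one document.
texts = ["good movie", "good day", "bad movie", "awful day"]

# min_df=2 drops terms that appear in fewer than two documents;
# max_features caps the vocabulary size (no effect on a corpus this small).
vectorizer = CountVectorizer(min_df=2, max_features=1000)
counts = vectorizer.fit_transform(texts)

print(sorted(vectorizer.vocabulary_))  # terms that survived min_df
print(counts.toarray())                # one row of counts per document
```

`fit_transform` returns a sparse matrix with one row per document and one column per vocabulary term; `toarray()` converts it to a dense array, which is only practical when the vocabulary is small.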
To build the model, we will be making use of Gaussian Naive Bayes, a member of the Naive Bayes family of algorithms, which is popular for classifying text. Although it is fairly simple, it often performs as well as much more complicated solutions.
This model applies Bayes' theorem with the ‘naive’ assumption that the features are independent of one another given the class. According to Bayes' theorem:
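For a class $c$ (here, positive or negative) and features $x_1, \dots, x_n$ (here, word counts), the theorem and the naive independence assumption can be written as below. The symbols are chosen for illustration and do not come from the original tutorial.

```latex
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)},
\qquad
P(x_1, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)
```

The classifier predicts the class with the largest posterior $P(c \mid x_1, \dots, x_n)$. In the Gaussian variant, each $P(x_i \mid c)$ is modelled as a normal distribution whose mean and variance are estimated from the training data.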
Before fitting, we will split the dataset into train and test subsets.
We will use the metrics module from the sklearn library to evaluate the predictions.
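The split, fit, and evaluation steps can be sketched end to end, assuming scikit-learn is installed. The synthetic two-word "tweets" below are a stand-in for the cleaned dataset, and the split ratio and random seed are arbitrary choices. `GaussianNB` needs a dense array, hence the `toarray()` call.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: 16 positive and 16 negative two-word texts.
positive_words = ["great", "love", "happy", "awesome"]
negative_words = ["bad", "hate", "sad", "awful"]
texts = [f"{a} {b}" for a in positive_words for b in positive_words]
texts += [f"{a} {b}" for a in negative_words for b in negative_words]
labels = [1] * 16 + [0] * 16

# Convert the texts to dense count vectors.
features = CountVectorizer().fit_transform(texts).toarray()

# Hold out 25% of the rows for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels
)

model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

accuracy = metrics.accuracy_score(y_test, predictions)
confusion = metrics.confusion_matrix(y_test, predictions, labels=[0, 1])
print("Accuracy:", accuracy)
print("Confusion matrix:\n", confusion)
```

`metrics.classification_report(y_test, predictions)` gives per-class precision, recall, and F1 if a more detailed breakdown is needed.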
By building a basic sentiment analysis model using the nltk library in Python 3, you can now classify tweets as positive or negative. However, a supervised learning model is only as good as its training data. To further strengthen the model, you could gather a larger and more representative dataset. You could also consider adding more categories of sentiment, such as ‘excitement’ and ‘anger’. As a result of this tutorial, you can now conduct sentiment analysis using Natural Language Processing and draw useful insights from a relatively small dataset.