Top 5 Tips on Building a Sentiment Analysis Application

Posted by Ivan Smith in Data Science & Analytics

Building your own sentiment analyzer can be tricky, especially given all the different text pre-processing approaches and technologies required to make the analysis and scoring accurate. Sentiment bias (where word associations are incorrect or inaccurate due to subtleties in language) is one area where many machine sentiment analyzers fall short. Google's Natural Language API is a classic example: it scored terms like 'homosexual' as negative, leaving consumers with serious discrepancies in their data interpretations.

Similarly, you've probably heard about Microsoft's AI chatbot Tay, which was quickly pulled from Twitter after users taught it to use racist and demeaning language, hardly ideal for Microsoft's customers. Worse still, the company's follow-up bot, Zo, learned similarly offensive language from humans due to problems with its learning algorithm and had to be shut down as well.

So the question becomes: why does this happen, and what can you do to avoid similar problems with your own sentiment analysis engine? Here are the top 5 tips to consider when building your own sentiment analysis application:


Tip #1: Go Beyond Simple Dictionary and Bag-of-Words Approaches

The first mistake most newcomers to language processing make is assuming you can get accurate results using a simple dictionary of terms. Dictionary approaches to sentiment analysis are well documented throughout the web and provide an easy way to get up and running with your analyzer app. Bag-of-words approaches (where you use specific keywords to interpret sentiment) can be tempting, but beware: you will never get accurate results using dictionary approaches by themselves.

To understand why, you first have to understand what sentiment bias is all about. Consider the following phrase for interpretation:

“My iPhone would be amazing if it recorded videos in HD, instead it’s just a beautiful coaster.”

If you use the bag-of-words approach to break down and analyze this sentence, you would conclude the phrase is extremely positive. Why? Because bag-of-words considers only the keywords present, in this case “amazing” and “beautiful”:

“My iPhone would be amazing if it recorded videos in HD, instead it’s just a beautiful coaster.”

Here we have a clear example of a bias problem. We’re assuming the sentence is positive because it contains specific positive keywords, when semantically it’s actually negative: the conditional “would be amazing if” tells us the HD recording doesn’t exist, and “just a beautiful coaster” is sarcasm, not praise.
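To see the failure concretely, here is a minimal sketch of a keyword scorer; the word lists are illustrative stand-ins for a real sentiment lexicon, not something from a published dictionary:

```python
# A toy bag-of-words scorer: counts positive keywords minus negative ones.
# The word lists are hypothetical stand-ins for a real lexicon.
POSITIVE = {"amazing", "beautiful", "great", "love"}
NEGATIVE = {"terrible", "broken", "hate", "useless"}

def bag_of_words_score(text: str) -> int:
    """Score = (# positive keywords) - (# negative keywords)."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

phrase = ("My iPhone would be amazing if it recorded videos in HD, "
          "instead it's just a beautiful coaster.")
print(bag_of_words_score(phrase))  # -> 2: strongly positive, despite the sarcasm
```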

So the question becomes: how do we avoid this? To get accurate results from our analysis we need to go beyond individual words and look at what the author is actually saying. Semantic approaches to language have been shown to yield the best results to date. By breaking the sentence down with a technique known as part-of-speech (POS) tagging, we can better understand what the author is really saying.
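As a sketch of what POS tagging looks like in practice (using NLTK here, which is an assumption; the article doesn’t name a library):

```python
# Part-of-speech tagging with NLTK; model names can vary slightly
# across NLTK versions, so adjust the download calls if needed.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

phrase = ("My iPhone would be amazing if it recorded videos in HD, "
          "instead it's just a beautiful coaster.")
print(nltk.pos_tag(nltk.word_tokenize(phrase)))
# [('My', 'PRP$'), ('iPhone', 'NN'), ('would', 'MD'), ('be', 'VB'),
#  ('amazing', 'JJ'), ('if', 'IN'), ...]
# The modal 'would' plus the 'if' clause mark 'amazing' as hypothetical,
# so it shouldn't count as realized positive sentiment.
```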


Tip #2: Don’t Try to Make it Perfect

Don’t look for perfection with your sentiment analyzer; you’ll probably never get there. Beyond the bias problem discussed briefly above, which is a well-known and well-researched area of sentiment analysis, research has shown that humans themselves only agree on sentiment interpretation about 80% of the time. This means that no matter how accurate your results appear to be, they will only ever get you about 80% of the way there, because someone can always interpret complex sentences and language differently.

Instead, try to get your analyzer to agree with how a domain expert (a human researcher) would interpret the results. This is readily accomplished using a training set to compare how a human interprets language against how the machine algorithm performed. There are several techniques available to get your analysis engine as close to the 80% mark as possible.

Training sets scored manually by humans are a great way to do this. On average, you’ll need at least 3,000 to 5,000 pre-scored documents (collectively referred to as a corpus) to get started. Once you’ve built your sentiment analyzer, go back and analyze your training set with your algorithm to gauge how closely the model agrees with human interpretation, as in the sketch below. The closer you get to 80% agreement, the more accurate your analyses will be in a real-world scenario.
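A minimal sketch of that comparison, assuming you have parallel lists of human and machine labels (the labels below are made up for illustration):

```python
# Measure how often the model agrees with the human-scored corpus.
def agreement(human_labels, model_labels):
    """Fraction of documents where the model matches the human rater."""
    matches = sum(h == m for h, m in zip(human_labels, model_labels))
    return matches / len(human_labels)

# Hypothetical labels for four documents from a pre-scored corpus:
human = ["pos", "neg", "neu", "pos"]
model = ["pos", "neg", "pos", "pos"]
print(f"{agreement(human, model):.0%}")  # -> 75%, still short of the ~80% target
```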


Tip #3: Spend More Time Pre-Processing Your Text

It may seem obvious, but you’d be surprised how often this step gets overlooked by data scientists. Pre-processing is the way in which we clean our textual content to get it ready for analysis by our sentiment algorithm. Advertising in web content, slang, metatags and embedded characters are all examples of pre-processing challenges you’ll need to overcome to make your sentiment score as accurate as possible.

The phrase "garbage in, garbage out" is a cautionary tale for any data mining project, and sentiment analysis is particularly vulnerable to garbage data due to the noisy nature of language in general. The more time you spend cleaning up the text the better your result is going to be when you analyze it. Many sites make use of special tagging systems which are also subject to interpretation by your analyzer, so think about how you want to addresses those challenges before you get started.

Many pre-processing techniques and tools exist to help you with this. HTML parsers and custom-built regular expressions will go a long way toward making your analysis work easier.
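As one hedged example, here is a small cleaning pass built on Python’s standard-library HTMLParser plus a couple of regular expressions; real pipelines will need site-specific rules on top of this:

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes, dropping tags and attributes."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def clean(html: str) -> str:
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(extractor.chunks)
    text = re.sub(r"https?://\S+", " ", text)  # drop bare URLs
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(clean("<p>Great phone!<br>See https://example.com <b>today</b></p>"))
# -> "Great phone! See today"
```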


Tip #4: If the Words Don’t Fit You Should Omit

If your analyzer comes up against something it doesn’t understand, log the item and omit it from your analysis. Don’t be afraid to leave something out if your analyzer can’t interpret it. You’re much better off having small gaps in your data than getting the meaning of a word or phrase wrong and skewing your analysis as a result. Build in mechanisms for dealing with interpretation problems in advance. A common technique is to attach a confidence score or probability to each sentiment score and use it to determine whether the item should be included in your results.

Remember, the goal initially isn’t to have a perfect solution out of the gate, but to have one that is as accurate as possible (which means agreeing with our domain expert at least 80% of the time). Automate the process by capturing and logging low-confidence interpretations so you can tweak your algorithm and training sets later.
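A minimal sketch of that gating step; the threshold value and the (doc_id, sentiment, confidence) record format are assumptions for illustration:

```python
import logging

logging.basicConfig(level=logging.INFO)
CONFIDENCE_THRESHOLD = 0.7  # tune this against your training set

def filter_results(scored_docs):
    """Keep confident scores; log the rest for later review."""
    kept = []
    for doc_id, sentiment, confidence in scored_docs:
        if confidence >= CONFIDENCE_THRESHOLD:
            kept.append((doc_id, sentiment))
        else:
            logging.info("Omitting %s (confidence %.2f) for later review",
                         doc_id, confidence)
    return kept

print(filter_results([("d1", "pos", 0.92), ("d2", "neg", 0.41)]))
# -> [('d1', 'pos')]; d2 is logged and omitted
```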


Tip #5: Use Machine Learning and Classification

While lexical and semantic methods of sentiment analysis are great, don’t be afraid to leverage machine learning techniques to take your solution to the next level. You can use machine learning to classify sentences, phrases and entire documents to make your sentiment analysis more accurate. You can also use classification engines to build and apply scores to previously unknown words and phrases.

Building a machine learning solution, however, typically requires some level of expertise in semantics, and you’ll also need to build out a comprehensive training data set your system can use to interpret new documents.

Below is a definition of each methodology to clarify the difference between the two:

Lexical Methods: Employ dictionaries of words annotated with their semantic polarity and sentiment strength (often also with their position in the text), then calculate a score for the polarity and/or sentiment of the document. A toy example follows these definitions.

Machine Learning: Involves training a classifier with labeled examples. This means you must first gather a dataset with examples of positive, negative and neutral classes, extract features from those examples, and then train the algorithm on them.
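To make the lexical definition concrete, here is a toy polarity lexicon and scorer; real lexicons such as SentiWordNet contain tens of thousands of entries, and these four values are made up:

```python
# A toy polarity lexicon: word -> sentiment strength in [-1, 1].
LEXICON = {"amazing": 0.8, "beautiful": 0.6, "broken": -0.7, "awful": -0.9}

def lexical_score(text: str) -> float:
    """Sum the polarity of every known word in the text."""
    return sum(LEXICON.get(t, 0.0) for t in text.lower().split())

score = lexical_score("the screen is beautiful but the speaker is broken")
print(round(score, 2))  # -> -0.1, slightly negative overall
```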

In order to leverage machine learning, you’ll need to build a classifier. Here are a few good classification algorithms to consider when building your sentiment analyzer, along with the advantages of each:

Naive Bayes

Naive Bayes is capable of making useful predictions from small training sets with a high degree of accuracy. If you only have a small training set to work with, you’re more likely to get an accurate fit and improved classification results with Naive Bayes, because it tends to outperform KNN on limited datasets. This makes it popular among beginners, since it doesn’t require as much data upfront. As your training data grows, however, its relative advantage diminishes. Naive Bayes is a supervised approach to machine learning.
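A minimal sketch of a Naive Bayes sentiment classifier using scikit-learn (an assumption; the article doesn’t prescribe a library), with a deliberately tiny toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus -- a real training set needs thousands of documents (see Tip #2).
train_docs = ["love this phone", "what a beautiful screen",
              "battery life is awful", "screen arrived broken"]
train_labels = ["pos", "pos", "neg", "neg"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)
print(model.predict(["what a beautiful phone"]))  # -> ['pos']
```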

K-Nearest Neighbour (KNN)

KNN is much more accurate with large training sets, even those built from noisy data, which makes it well suited to sentiment analysis if you have a large dataset to work from. Accuracy improves as more documents and features are added to your training set, but KNN is generally computationally expensive, so make sure you have the hardware and plenty of CPU cycles to support it. KNN is also a supervised approach to machine learning.
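Swapping KNN into the same kind of pipeline is straightforward; TF-IDF weighting is a common companion for KNN on text, and the value of k should be tuned on your own data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Same toy corpus as the Naive Bayes sketch above.
train_docs = ["love this phone", "what a beautiful screen",
              "battery life is awful", "screen arrived broken"]
train_labels = ["pos", "pos", "neg", "neg"]

knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
knn.fit(train_docs, train_labels)
print(knn.predict(["what a beautiful phone"]))  # -> ['pos'] on this toy data
```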

Neural Networks & Deep Learning

Popularized by their use in Google’s deep learning systems, neural networks attempt to emulate the way the human brain works in order to classify and learn from data. While neural networks provide an excellent way of detecting relationships between variables that might otherwise be overlooked (which is useful in semantic analysis), their success in practice varies because the results are difficult for humans to audit: the learning happens inside a ‘black box,’ making errors hard to trace. Neural networks are also computationally expensive to implement.
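As a hedged sketch, scikit-learn’s small MLPClassifier stands in here for a full deep learning stack (a real neural sentiment model would normally use word embeddings and far more data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Same toy corpus as the earlier sketches.
train_docs = ["love this phone", "what a beautiful screen",
              "battery life is awful", "screen arrived broken"]
train_labels = ["pos", "pos", "neg", "neg"]

nn = make_pipeline(
    CountVectorizer(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
nn.fit(train_docs, train_labels)
print(nn.predict(["what a beautiful phone"]))  # -> ['pos'] on this toy data
```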

Remember to combine supervised learning with unsupervised techniques to get the best results from your classifier. Supervised techniques require a training set that has been vetted by a human, whereas unsupervised learning clusters related items together to infer an interpretation. Supervised learning can be laborious to set up and maintain, while unsupervised learning can easily be misled, causing problems with your data (as seen with Google’s Natural Language API and Microsoft’s chatbot Tay, discussed earlier).

Clarsentia avoids these pitfalls by combining both supervised and unsupervised learning in our approach to sentiment analysis. Take a look at Clarsentia’s demo analyzer for an example of how our sentiment analysis engine works. 
