Sometimes we only want an overall rating of the sentiment of the whole review. In other cases, we need a little more detail, and we want each negative or positive comment identified.
This kind of detailed detection can be quite challenging. Sometimes the aspect is explicit. An example is the opinion “very uninspired food”, where the criticized aspect is the food. In other cases, it is implicit: the sentence “too expensive” gives a negative opinion about the price without mentioning it.
In this post I will focus on detecting the overall polarity of a review, leaving for later the identification of individual opinions on concrete aspects of the restaurant. To compute the polarity of a review, I’m going to use an approach based on dictionaries and some basic algorithms.
A note about the dictionaries
A dictionary is no more than a list of words that share a category. For example, you can have a dictionary for positive expressions, and another one for stop words.
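As a minimal sketch of this idea, the dictionaries could be kept as a mapping from a category name to the set of words in it. The category names, the sample words, and the `categories_of` helper are all made up for illustration; real dictionaries would be much larger and tailored to the domain.

```python
# Hypothetical toy dictionaries: each category name maps to the words it contains.
dictionaries = {
    "positive": {"good", "delicious", "friendly"},
    "negative": {"uninspired", "expensive", "slow"},
    "stopword": {"the", "a", "very", "too"},
}


def categories_of(word, dictionaries):
    """Return the sorted list of categories whose dictionary contains the word."""
    word = word.lower()
    return sorted(
        category
        for category, words in dictionaries.items()
        if word in words
    )


print(categories_of("Expensive", dictionaries))
print(categories_of("pizza", dictionaries))
```

Using sets keeps membership checks cheap, and a word that belongs to no dictionary simply gets an empty list of categories.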
The design of the dictionaries depends heavily on the concrete topic you want to mine opinions about. Mining hotel opinions is quite different from mining laptop opinions: not only can the positive and negative expressions differ, but the context vocabulary is also quite distinct.
Defining a structure for the text
Before writing code, there is an important decision to make. Our code will have to interact with text, splitting, tagging, and extracting information from it.
But what should be the structure of our text?
This is a key decision because it will shape our algorithms. We should decide whether we want to differentiate sentences inside a paragraph. We could define a sentence as a list of tokens. But what is a token? A string? A more complex structure? Note that we will want to assign tags to our tokens. Should we allow only one tag per token, or an unlimited number?
There are countless options here. We could choose a very simple structure, for example defining the text simply as a list of words. Or we could define a more elaborate structure carrying every possible attribute of a processed text (word lemmas, word forms, multiple taggings, inflections…).
As usual, a compromise between these two extremes can be a good way to go.
For the examples of this post, I’m going to use the following structure:
- Each text is a list of sentences
- Each sentence is a list of tokens
- Each token is a tuple of three elements: a word form (the exact word that appeared in the text), a word lemma (a generalized version of the word), and a list of associated tags
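
To make the structure concrete, here is a small hand-written sketch of how a two-sentence text might look in Python. The type aliases, the tag names, and the `word_forms` helper are assumptions chosen for illustration, and the lemmas and tags were written by hand rather than produced by a real tagger.

```python
from typing import List, Tuple

# Assumed type aliases mirroring the three-level structure described above.
Token = Tuple[str, str, List[str]]  # (word form, word lemma, list of tags)
Sentence = List[Token]
Text = List[Sentence]

# Hand-written example data; the tag names are hypothetical.
text: Text = [
    [
        ("The", "the", ["determiner"]),
        ("dishes", "dish", ["noun"]),
        ("were", "be", ["verb"]),
        ("uninspired", "uninspired", ["adjective", "negative"]),
    ],
    [
        ("Too", "too", ["adverb"]),
        ("expensive", "expensive", ["adjective", "negative"]),
    ],
]


def word_forms(text: Text) -> List[List[str]]:
    """Recover the original words of each sentence, ignoring lemmas and tags."""
    return [[form for form, _lemma, _tags in sentence] for sentence in text]


print(word_forms(text))
```

Because every token carries a list of tags, a single word can hold several labels at once, for example a part of speech together with a polarity category.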