Sunday, October 30, 2011

Sentiment Analysis: Beyond Polarity [Problem Statement]

 



1.2 Research Background

The aim of this research is to develop machine learning techniques to recognise the emotions that are interwoven into the text of a document. This problem arises at the intersection of Machine Learning, Information Retrieval, Cognitive Science and Natural Language Processing. Combining these fields gives rise to the research problem of computationally identifying emotional expression within a given document set. In working towards a solution to this problem, the intention is to contribute knowledge to the creation of an intelligent, emotionally sensitive computer system, which could reason about problems such as the following one:

1.2.1 Example Problem

The National Health Service plays a crucial role in our lives. For some, it is a fantastic experience, which aids them tremendously, whilst for others, the situation is unpleasant. Fortunately, websites such as Patient Opinion enable patients to post feedback regarding their experiences with the UK health services. The people giving feedback do so because they feel inclined to express their emotions about parts of the service which they have felt strongly about, and so they contribute content to a valid emotional dataset. Countless people are treated every day, and a vast quantity of feedback is left for the health services to observe and act upon. Those processing this data could approach the task more efficiently by using sentiment analysis to sort the comments into emotional categories, so that they can be dealt with in an appropriate manner. Comments that express sadness could be prioritised over those expressing joy, as something has caused the patient to feel this way, and it could be corrected before another patient is affected in the same way. If surprise were expressed alongside sadness, then an even higher priority could be assigned to such comments. By using sentiment analysis, the health services could ensure that they do not miss comments that could make a difference to the running of their hospitals and, fundamentally, to the well-being of patients.
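The prioritisation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name triage_priority, the numeric priority levels and the comment identifiers are hypothetical, and the emotion labels are assumed to have been assigned already by some earlier sentiment analysis step.

```python
from typing import Set

def triage_priority(emotions: Set[str]) -> int:
    """Return a hypothetical priority level; higher means more urgent."""
    # Surprise alongside sadness is given the highest priority.
    if "sadness" in emotions and "surprise" in emotions:
        return 2
    # Sadness alone is prioritised over comments without sadness (e.g. joy).
    if "sadness" in emotions:
        return 1
    return 0

# Illustrative comments: (hypothetical id, emotion labels already assigned).
comments = [
    ("c1", {"joy"}),
    ("c2", {"sadness"}),
    ("c3", {"sadness", "surprise"}),
]

# Most urgent comments first.
ordered = sorted(comments, key=lambda c: triage_priority(c[1]), reverse=True)
ordered_ids = [comment_id for comment_id, _ in ordered]
```

The hard part of the problem is, of course, producing the emotion labels in the first place; the ordering step is trivial once they exist.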

The problem posed in sentiment analysis is the identification of emotional expressions, and of the emotion that is being expressed. For example, a comment giving feedback about the UK health services could say:

“I was not treated until late in the evening, and thought that the doctor would not come.”

The problem here is that no words explicitly denote an emotion, yet the emotions of fear and sadness can be attributed to this statement because of the situation it describes. The context can be identified by learning from similar sentences and the emotions that they expressed, and through this learning process a solution to the problem can be created. This thesis will discuss possible solutions to this problem, but before we cover those, sentiment analysis must be formally defined, and background concerning linguistic expression given.
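The idea of learning from similar sentences can be sketched with a nearest-neighbour lookup. This is a minimal sketch and not the method proposed in this thesis: the two labelled training sentences are invented for illustration, the helper names (tokens, jaccard, nearest_label) are hypothetical, and word overlap (Jaccard similarity) stands in for whatever notion of similarity a real system would use.

```python
import re

def tokens(text):
    """Lower-case a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy labelled sentences invented for illustration; not real patient feedback.
TRAINING = [
    ("I waited all evening and thought nobody would come.", {"fear", "sadness"}),
    ("The nurse was kind and I went home the same day.", {"joy"}),
]

def nearest_label(text):
    """Return the emotion labels of the most similar training sentence."""
    query = tokens(text)
    _, labels = max(TRAINING, key=lambda ex: jaccard(query, tokens(ex[0])))
    return labels

comment = ("I was not treated until late in the evening, "
           "and thought that the doctor would not come.")
predicted = nearest_label(comment)
```

With these toy examples the comment shares more words with the first training sentence, so it inherits the labels fear and sadness even though it contains no explicit emotion word.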


1.2.2 Definition of Sentiment Analysis

In the past there has been confusion surrounding the terminology of this field. Quite often the challenges of polarity recognition and emotion identification have been described using the same term, sentiment analysis. This thesis seeks to go beyond polarity-based identification, and focus on finer-grained emotional recognition. Therefore in this research, the term sentiment analysis will be used in a broader fashion.

The meaning of the term sentiment analysis is quite inclusive. From a non-computational viewpoint, reading a film review and deciding you want to see it because of what the reviewer has written is a form of sentiment analysis. However, for this work, it can be thought of in the following way:

Sentiment Analysis is the computational evaluation of documents to determine the fine-grained emotions that are expressed.

or more formally:

Given a document d from a document set D, computationally assign emotional labels, e, from a set of emotions E in such a way that e is reflective of d’s expressed emotion or emotions at the appropriate level of expression.
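One possible symbolic reading of this definition treats it as a multi-label assignment. The notation below is an assumption made for illustration: the function name f and the use of the power set of E are not part of the original definition, and the "appropriate level of expression" is not modelled.

```latex
f : D \to \mathcal{P}(E), \qquad
f(d) = \{\, e \in E : e \text{ is expressed in } d \,\}
```

Under this reading, polarity classification is the special case in which E contains only positive and negative (and occasionally neutral) and each document receives a single label.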

It will be of use to first define what is meant by a document in this context. For a general text classification problem, Lewis (1998) describes a document as simply being a single piece of text. This is stored on a machine as a character sequence, where these characters in combination embody the text of a written natural language expression. Sentiment analysis builds upon the problem of text classification, which makes the definition of a document given by Lewis (ibid.) relevant to this domain. However, it goes beyond naive text classification by seeking to determine the emotion expressed in a document, which can occur at multiple expressive levels.


Historically, sentiment analysis has been defined as recognising whether the subjects of a text are described in a positive or negative manner. This is often referred to as determining the polarity of a text. By limiting categorisation to a small, closed set of possible classes that a document can be assigned to, this definition intelligently restricts the set of categories to either positive or negative (Turney, 2002), with the occasional use of neutrality. This differs drastically from the definition which will be used in this work, which concentrates on textual emotion recognition. Here, the option of a variable set size is introduced, owing to the range of possible emotions that can be linked to the text of a document. With a limited set to work with, it could be argued that polarity identification is a simpler task than textual emotion recognition. However, both areas struggle with the challenge posed by the written language of emotion, in particular its expressiveness.

1.2.3 Linguistic Expression of Emotion
In any written text that seeks to convey an emotion, there are two significant modes of expression. The first is the explicit communication of emotion. Strapparava et al. (2006) refer to this as the direct affective words of a text. An example of this is:

“What a wonderful policy.”

Through use of the word ‘wonderful’, this sentence explicitly states that the speaker’s attitude towards the policy is positive. Therefore, in sentiment analysis, if this sentence were regarded as the whole document for classification, with no external documents affecting its context, it could be assigned the positive label (document annotation will be discussed further in Chapter 3). Whitelaw et al. (2005) demonstrate that identifying only the explicit features of a document yields favourable results. Doing so, however, assumes that direct affective words are more important than other forms of linguistic expression in developing intelligent systems, because of the favourable patterns of identification they yield. This should not be assumed, as implicit linguistic expressions can bear just as much emotional information, if not more:

“Jesus Christ!”

Previous approaches to sentiment analysis may have suggested that, due to the lack of emotional words, this sentence is inherently neutral. However, the sentence could refer to a number of scenarios, and is contextually ambiguous because of the range of emotions that the phrase can communicate when voiced. It could express a positive emotion, as it could be uttered in the context of a positive event. However, this is not the only emotion that could be associated with it, as further consideration of the sentence reveals. It could also have been uttered in a negative context, where a tragedy may have occurred, in which case the intended emotional connotation would be a negative one. Strapparava et al. (2006) refer to this type of expression as containing the indirect affective words of a text.

The above example displays the difficulty of deducing an implied set of emotions from text, as either a priori knowledge of the situation or a mechanism to understand the underlying semantics of the document is required. If we take the two words of the sentence independently, no emotional information can trivially be deduced, and a religious reference could be associated with the utterance. Yet, if we take the words in combination, they prompt the reader for background knowledge that is crucial in deducing the sentence’s emotional connotations. This phenomenon is common in natural language, especially English, where the emotional meaning of a document is subtler than it may first seem. This thesis must attempt to overcome the issue of identifying an implicit emotion, so research questions will be asked which aim to explore possible solutions to this problem.
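The contrast between the two examples above can be made concrete with a minimal lexicon lookup. This is a sketch under stated assumptions, not a reconstruction of any cited system: the one-entry lexicon DIRECT_AFFECT and the function label_by_lexicon are hypothetical, and a real affective lexicon would be far larger.

```python
import re

# Hypothetical one-entry lexicon of direct affective words.
DIRECT_AFFECT = {"wonderful": "positive"}

def label_by_lexicon(sentence):
    """Label a sentence from its direct affective words only."""
    words = re.findall(r"[a-z']+", sentence.lower())
    labels = {DIRECT_AFFECT[w] for w in words if w in DIRECT_AFFECT}
    # With no direct affective word present, the lookup falls back to neutral.
    return labels or {"neutral"}

explicit = label_by_lexicon("What a wonderful policy.")
implicit = label_by_lexicon("Jesus Christ!")
```

The first sentence is labelled positive because of ‘wonderful’, while the second falls through to neutral: the lookup has no access to the context that gives the exclamation its emotional meaning, which is exactly the limitation described above.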


1.3 Research Questions

In light of the overview of this work, my proposed research will aim to address three major questions:

RQ1 Which model of emotion is best suited to sentiment analysis?

(a) Are the emotions expressed in text suited to an ontology?

RQ2 How should documents be annotated with emotional information?

(a) What spans of text should be annotated?

(b) How will structural information be captured?

(c) How will the different forms of expression be captured?

RQ3 Which machine learning techniques are most suited to the task of textual emotion recognition?


The first question is the motivating question of my research. It is a high-level question, so a sub-question has been derived from it in order to produce a workable contribution to a solution. The first question (RQ1) must be asked because this thesis is not seeking to reinvent the wheel, given the literature already available on models of emotion. Instead, it aims to critically assess which currently proposed model, if any, would be suitable to define the emotions that are held in text. This deviates from much of the scientific literature on emotion research, which tends to focus on modelling emotion given facial expressions (Picard, 1995; Russell, 1994) or speech data (Cowie et al., 2001; Dellaert et al., 1996; Murray & Arnott, 1996).

RQ1(a) questions whether a structure can be imposed on the emotions exhibited in text. If this is the case, it will be of interest to investigate whether a combination of emotions, or a combination of the relationships between them, could lead to a further emotion being derived, and if so how this system works. A literature review can only reveal so much here, so experimentation is a vital part of RQ1.

The following two research questions, RQ2 and RQ3, branch from the main research question. They expand on the issue of the emotions that are typically expressed in a document, and in doing so angle the research in a computational direction. RQ3 considers the machine learning approaches to textual emotion recognition. There are two main classes of machine learning algorithms: supervised and unsupervised. To observe and thoroughly experiment with every approach in these two classes is beyond the scope of this research; however, a subset of the approaches will be considered in working towards a solution to this question.

RQ2 concerns the annotation framework which should be created in order to maximize the output of the algorithms. The question of how a document should be annotated with emotional information has been divided into specific questions where it is felt the literature does not provide a sufficient solution to the problem.

1.4 Hypotheses
The research questions raise the following hypotheses, which will form the basis for experimentation in this work:

Hypothesis 1 - (RQ1) Emotions can be structured as a tree, with valenced categories acting as the root node, and fine-grained emotional categories at the leaves.
Hypothesis 2 - (RQ2) Expressed emotion is not a sum of its parts, and therefore documents should be annotated across various levels to capture this expression.
Hypothesis 3 - (RQ3) Supervised machine learning techniques in combination with a dependency structure are most suited to sentiment analysis.

This introduction has set out a basis for the formation of these hypotheses, and the following chapters of this proposal will justify their inception.
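As an illustration of Hypothesis 1, a tree with valenced categories at the top and fine-grained emotions at the leaves can be sketched as a nested mapping. The structure below is an assumption made for illustration only: the name EMOTION_TREE, the helper valence_of and the placement of each emotion under a valence are hypothetical, and only emotions mentioned earlier in this chapter are used.

```python
# Hypothetical emotion tree: valenced categories at the top,
# fine-grained emotional categories as leaves.
EMOTION_TREE = {
    "positive": ["joy"],
    "negative": ["sadness", "fear"],
}

def valence_of(emotion):
    """Return the valenced parent of a fine-grained emotion, or None."""
    for valence, leaves in EMOTION_TREE.items():
        if emotion in leaves:
            return valence
    return None

# A fine-grained label can be backed off to its polarity.
sadness_valence = valence_of("sadness")
joy_valence = valence_of("joy")
```

One consequence of such a structure is that a fine-grained emotion label can always be reduced to a polarity label, so a system built on it remains comparable with polarity-based approaches. An emotion such as surprise, whose valence is not obvious, is left out here because where it belongs is precisely the kind of question RQ1(a) raises.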