Tuesday, August 30, 2016

A Framework and practical implementation for sentiment analysis and aspect exploration [problem statement]


 

.

1.2 Problem statement and research questions

.

The explosion of Web 2.0 has not only brought us a huge volume of opinionated data recorded in digital form, but has also given us a great opportunity to understand public sentiment by analysing these large-scale data. However, user-generated data is a double-edged sword: the larger the data, the more difficult it is to extract useful information. A survey shows that Facebook generates 250 million posts per hour, while Twitter users generate 21 million tweets per hour (George, 2015). The review website TripAdvisor now generates more than 255 reviews every minute, and nearly 2,600 new topics are posted every day. So far, TripAdvisor has over 385 million reviews and opinions from users around the world (TripAdvisor, 2016). Faced with data on this scale, studies have revealed that more than half of online customers encounter frustration during online shopping. The sheer volume makes it difficult for a potential customer to read the reviews and make an informed decision. Around 30% of online customers have felt confused and overwhelmed by the amount of information, since websites contain a large amount of spam or duplicate content (Horrigan, 2008; Niven, 2012).

.

Although we are in the era of Web 2.0, flooded with data every day, companies and organizations also struggle to deal with opinionated data effectively. A survey shows that three quarters of 2,100 organizations do not have a clear idea of what their most valuable customers think about them, and nearly 31% of them find it difficult to measure customers’ opinions (Michael, 2012). Clearly, they do not lack sources of customers’ opinions; rather, the overwhelming size of the opinionated data and the complexity of dealing with subjectivity make it difficult for organizations to extract useful information.

.

The need to deal with these unstructured opinionated data has naturally led to the rise of research in the field of sentiment analysis. Sentiment analysis has been one of the most active research areas in natural language processing (NLP) since 2002 (see Section 2.3). The main task of sentiment analysis is to automatically determine the semantic orientation (SO) of a given document (Turney, 2002; Pang and Lee, 2008). Semantic orientation refers to a measure of opinion and subjectivity that indicates the polarity (positive, negative or neutral) and strength of words, phrases, sentences or documents (Hatzivassiloglou and McKeown, 1997; Turney, 2002; Liu, 2010). Research on sentiment analysis is currently dominated by two basic approaches. The first is the machine learning approach, which aims to build text classifiers from labelled instances of text by selecting suitable text features and algorithms (see Section 2.5.2). The second is the semantic orientation approach, which calculates the overall polarity of a text from the semantic orientation of its words or phrases (see Section 2.5.1). Since the latter approach uses lexical resources such as lists of opinion-bearing words, lexicons and dictionaries, it is also referred to as the lexicon-based approach (Peng and Park, 2004; Ding et al., 2008; Na et al., 2009; Taboada et al., 2011; Molina-González et al., 2015). Thus, in this thesis, the terms ‘semantic orientation approach’ and ‘lexicon-based approach’ are used interchangeably.
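
The semantic orientation approach described above can be illustrated with a minimal sketch. The lexicon, its scores, and the function names are assumptions made for illustration only; they are not the lexicon or the scoring scheme used in the thesis. The sketch simply sums the scores of opinion-bearing words and maps the total to a polarity label.

```python
import re

# Toy lexicon: words and scores are illustrative assumptions, not from the thesis.
TOY_LEXICON = {
    "good": 1.0, "great": 2.0, "excellent": 3.0,
    "bad": -1.0, "poor": -2.0, "terrible": -3.0,
}


def semantic_orientation(text, lexicon=TOY_LEXICON):
    """Return the summed semantic orientation of all lexicon words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Words missing from the lexicon contribute nothing to the score.
    return sum(lexicon.get(token, 0.0) for token in tokens)


def classify(text, lexicon=TOY_LEXICON):
    """Map the summed score to a polarity label."""
    score = semantic_orientation(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Under this toy lexicon, `classify("Terrible service")` returns `"negative"`. A plain sum ignores negation and intensifiers, which is one reason such simple scoring struggles on real reviews.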

.

Many sentiment analysis tools and applications have been developed to mine the opinions in user-generated content on the Web. However, their performance remains poor due to the complexity of natural language (Sobkowicz et al., 2012; Mohammad et al., 2013; Maynard, 2016). In essence, sentiment analysis is still a natural language processing (NLP) problem: it deals with natural language documents, which are also called unstructured data (Liu, 2012). Prior research shows that sentiment analysis is more difficult than traditional topic-based text classification (Pang and Lee, 2008). Although various approaches have been proposed, it is still difficult to deal with some linguistic phenomena, such as negation and mixed-opinion text, which leads to low sentiment classification accuracy (Vinodhini and Chandrasekaran, 2012; Park et al., 2015; Khan et al., 2016). Moreover, determining only the polarity of opinions is insufficient, since an opinion without a target is of limited use. The task of extracting opinions and their targets simultaneously is called aspect-level sentiment analysis in the research literature and is more difficult to achieve (Liu, 2012). Current studies show that methods for aspect-level sentiment analysis are limited (see Section 2.3.3).

.

Given the existing real-world problems in dealing with big data and the current research gaps (see Section 2.4.4 for more details), the research presented in this thesis addresses the following two research questions:

1) How can online product reviews be automatically and accurately classified with respect to their sentiments?

2) How can the aspects of the sentiments expressed in online product reviews be detected effectively?

.

The first research question concerns the need to manage the large volume of online reviews automatically and to improve the performance of sentiment classification. The second research question underlines the importance of identifying the targets of opinions, with the goal of helping individuals make informed purchasing decisions and giving manufacturers insight into how to improve their products or services.

.

1.3 Aim and objectives 

.

The aim of this thesis is to explore an effective way to conduct fine-grained sentiment analysis by improving the performance of sentiment classification and extracting the aspects related to the sentiments. To meet this aim, the research pursues three objectives.

.

The first objective is to handle text that contains both positively and negatively oriented opinions, because most real-world data shows that positive and negative sentiments co-occur in the same document. In addition, the aspects of the opinions (the attributes of an entity that a review is about) can vary, and it is therefore essential to separate mixed-opinion reviews.
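
As a rough illustration of what separating a mixed-opinion review can mean, the sketch below splits a review into sentences and labels each one by the opinion words it contains. The word lists, the naive sentence splitter, and the function names are all assumptions for illustration; the thesis does not specify this procedure.

```python
import re

# Hypothetical seed word lists, chosen only for illustration.
POSITIVE_WORDS = {"good", "great", "clean", "friendly"}
NEGATIVE_WORDS = {"bad", "noisy", "dirty", "rude"}


def split_sentences(review):
    """Naive splitter on '.', '!' and '?'; real text needs a proper tokenizer."""
    return [s.strip() for s in re.split(r"[.!?]+", review) if s.strip()]


def sentence_polarities(review):
    """Label each sentence by the opinion words it contains."""
    labelled = []
    for sentence in split_sentences(review):
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        has_pos = bool(tokens & POSITIVE_WORDS)
        has_neg = bool(tokens & NEGATIVE_WORDS)
        if has_pos and has_neg:
            label = "mixed"
        elif has_pos:
            label = "positive"
        elif has_neg:
            label = "negative"
        else:
            label = "neutral"
        labelled.append((sentence, label))
    return labelled


def is_mixed_opinion(review):
    """A review is mixed if it carries both polarities anywhere in its sentences."""
    labels = {label for _, label in sentence_polarities(review)}
    return "mixed" in labels or {"positive", "negative"} <= labels
```

Working at sentence level keeps the positive and negative parts of a review apart, so each part can later be tied to the aspect it talks about.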

.

Secondly, following the semantic orientation approach to sentiment analysis (see Section 2.5.1), a domain sentiment lexicon needs to be constructed and used to determine the polarity of a document. The sentiment lexicon contains words together with their sentiment orientations. Because words can be used differently across domains, the same word may show opposite sentiment orientations in different domains. Thus the sentiment lexicon used for sentiment analysis is the key to obtaining more accurate results.
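
The domain dependence of sentiment words can be sketched with a lookup that prefers a domain-specific entry over a general one. The domains, words, and scores below are hypothetical examples and do not come from the thesis; "unpredictable" is used because it is a commonly cited case of a word whose polarity depends on the domain.

```python
# Illustrative general-purpose scores (assumptions, not from the thesis).
GENERAL_LEXICON = {"good": 1.0, "bad": -1.0, "unpredictable": -1.0}

# Hypothetical domain overrides: the same word can flip polarity by domain.
DOMAIN_LEXICONS = {
    "movies": {"unpredictable": 1.0},  # an unpredictable plot is usually praise
    "cars": {"unpredictable": -1.0},   # unpredictable steering is a complaint
}


def word_polarity(word, domain):
    """Look the word up in the domain lexicon first, then fall back to the general one."""
    word = word.lower()
    domain_lexicon = DOMAIN_LEXICONS.get(domain, {})
    if word in domain_lexicon:
        return domain_lexicon[word]
    # Unknown words and unknown domains default to a neutral score of 0.0.
    return GENERAL_LEXICON.get(word, 0.0)
```

With these toy entries, `word_polarity("unpredictable", "movies")` is positive while `word_polarity("unpredictable", "cars")` is negative, which is the kind of difference a single general-purpose lexicon cannot capture.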

.

Furthermore, online product reviews include a variety of aspects (see Section 2.3.3). Therefore, the third objective is to extract the aspects of the products within a review, instead of predefining them, and then identify the sentiments about them.
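
One simple way to extract aspects without predefining them is to treat frequently mentioned content words as candidate aspects and then attach nearby opinion words to them. The sketch below assumes this frequency-and-window heuristic; it is not the method of the thesis, and the stopword list, opinion words, thresholds, and function names are all hypothetical. Practical systems usually add part-of-speech tagging to restrict candidates to nouns, which this sketch omits.

```python
import re
from collections import Counter

# All word lists below are illustrative assumptions.
STOPWORDS = {"the", "a", "an", "is", "was", "were", "are", "and", "but",
             "very", "it", "this", "of", "in", "to", "had"}
OPINION_WORDS = {"great": 1, "good": 1, "sharp": 1,
                 "poor": -1, "short": -1, "bad": -1}


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def candidate_aspects(reviews, min_count=2):
    """Treat tokens that appear in at least `min_count` reviews as candidate aspects."""
    counts = Counter(
        token
        for review in reviews
        for token in set(tokenize(review))  # count each token once per review
        if token not in STOPWORDS and token not in OPINION_WORDS
    )
    return {token for token, n in counts.items() if n >= min_count}


def aspect_sentiments(review, aspects, window=2):
    """Sum the scores of opinion words within `window` tokens of each aspect mention."""
    tokens = tokenize(review)
    result = {}
    for i, token in enumerate(tokens):
        if token in aspects:
            nearby = tokens[max(0, i - window): i + window + 1]
            score = sum(OPINION_WORDS.get(t, 0) for t in nearby)
            result[token] = result.get(token, 0) + score
    return result
```

The fixed window is a crude stand-in for syntactic relations: it works on short clauses but can attach an opinion word to the wrong aspect when two aspects are mentioned close together.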

.

Achieving these three objectives should lead to the coherent sentiment analysis framework proposed in this research (see Chapter 3), which aims to improve the performance of sentiment classification and provide in-depth aspect-level analysis.

.

https://www.research.manchester.ac.uk/portal/files/55559300/FULL_TEXT.PDF

.