PoppaRazi: July 2020

1.3 Research problem

Inspired by the challenges discussed in the previous section, the overall research problem in this study is to design and develop a systematic and comprehensive framework for document-level Email sentiment analysis. The aim is to effectively analyse and classify sentiments from Email data according to the framework shown in Figure 1.2. The framework consists of four major phases—preprocessing, feature generation, document vectorisation and sentiment analysis—and contains four main functions:

• Noise handling

• Sentiment sequence discovery

• Sentiment classification

• Quantitative evaluation

The above four functions are associated with the unique features identified in Email data and with the general framework formulated for document-level Email sentiment analysis. In brief, the noise handling function is implemented in the preprocessing phase and addresses noisy and unstructured content through Email cleaning and text normalisation methods. Sentiment sequence features and multi-topic features are addressed in the feature generation phase as part of the sentiment sequencing and sentiment classification functions. The quantitative evaluation function is implemented in the sentiment analysis phase and aims to obtain reliable classification results from an adequate amount of data through appropriate data augmentation methods.
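As an illustration of what a noise handling step of this kind might look like, the following is a minimal Python sketch of Email cleaning and text normalisation. It is not the procedure used in this thesis: the function name clean_email, the header pattern, the treatment of quoted replies and the signature delimiter are all assumptions made for the example.

```python
import re

# Hypothetical cleaning rules chosen for illustration only; the thesis
# does not specify its exact Email cleaning or normalisation steps here.
HEADER_RE = re.compile(r"^(from|to|cc|bcc|subject|sent|date):", re.IGNORECASE)
URL_RE = re.compile(r"https?://\S+")
SIGNATURE_DELIMITER = "--"  # assumed plain-text signature separator


def clean_email(raw: str) -> str:
    """Return a normalised body text for a raw plain-text Email."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if stripped == SIGNATURE_DELIMITER:
            break  # drop the signature block and everything after it
        if stripped.startswith(">"):
            continue  # drop quoted text from earlier messages
        if HEADER_RE.match(stripped):
            continue  # drop header-like lines
        kept.append(stripped)
    text = " ".join(kept)
    text = URL_RE.sub(" ", text)      # remove URLs
    text = re.sub(r"\s+", " ", text)  # collapse repeated whitespace
    return text.strip().lower()       # simple case normalisation
```

Lower-casing is included only as the simplest form of normalisation; in practice it can discard sentiment cues such as emphatic capitals, so whether to apply it is a design choice that depends on the downstream features.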

1.4 Research aims and questions

To break down the aforementioned framework into more specific tasks, four research aims are defined according to the main components of the framework:

• Preprocessing: To investigate preprocessing methods that reduce the impact of unstructured and noisy data, and data scarcity.

• Feature generation: To investigate the effectiveness of sentiment sequence and multi-topic features for Email sentiment determination, and to identify effective methods for generating these features.

• Document vectorisation: To investigate document vectorisation methods that capture sentiment sequence and multi-topic features, so that Email documents can be effectively modelled and represented as numeric vectors.

• Sentiment analysis: To investigate effective sentiment sequence discovery and sentiment classification methods.

The high-level research question derived from the main research problem is formulated as: how to incorporate the special characteristics of Email, including noise, sentiment sequence and multi-topic, into the sentiment analysis process and build a robust and effective framework for Email sentiment classification? Several sub-questions are identified that should lead to concrete technical approaches to achieving each aim:

1. What preprocessing methods are essential for addressing unstructured and noisy content in Email data, and which methods can solve the issues of data scarcity and imbalanced class distributions in labelled Emails?

2. How to effectively capture sentiment sequence features and discover sentiment sequence patterns within Email data?

3. How to encode sentiment sequence features in a neural network model for robust and accurate sentiment polarity classification?

4. How to capture multi-topic features and model documents with multi-topic segments for effective sentiment polarity classification?

Briefly, Research Question 2 is addressed through a study on sentiment sequence clustering, discussed in detail in Chapter 4. Research Question 3 is addressed through a study on sequence-encoded neural sentiment classification, discussed in detail in Chapter 5. Research Question 4 is addressed through a study on multi-topic neural sentiment classification (Chapter 6). Research Question 1 is addressed by conducting experiments that compare the preprocessed and original data in the second and third studies (Chapters 5 and 6). Research hypotheses associated with the research aims and questions are discussed in Section 2.5, following a thorough review of the literature and a summary of existing research gaps.

1.5 Thesis significance

The main significance of the research is the design and development of a systematic and comprehensive framework for document-level sentiment analysis of Email data. The framework fulfils four tasks, namely noise handling, sentiment sequence discovery, sentiment polarity classification and quantitative evaluation, through three studies on 1) sentiment sequence clustering, 2) sequence-encoded neural sentiment classification and 3) multi-topic neural sentiment classification. This research further contributes to the literature on Email sentiment analysis by investigating the effectiveness of Email data preprocessing and augmentation methods in addressing the issues of data scarcity and imbalanced class distributions.

https://researchonline.jcu.edu.au/65310/1/JCU_65310_liu_s_thesis_2020.pdf