Friday, November 29, 2019

Sentiment Analysis in Scandinavian Languages: Systematic Review and Evaluation

 


Abstract

Natural language processing has seen a tremendous boost in popularity following the widespread use of the World Wide Web and the emergence of machine learning tools.

The specific problem of sentiment analysis has become a popular topic with the availability of user-generated content from micro-blogs and the like.

These data-dependent problems, however, have seen a larger jump in popularity in international research than in research on low-resource languages, owing to differences in the availability of language-specific data.

This thesis examines sentiment analysis research for some of these low-resource languages, specifically the closely related languages of mainland Scandinavia.

We perform a literature review to uncover popular research topics within this language-specific field and to identify practical and theoretical work as well as available resources.

Furthermore, we perform experiments adapting international tools to these low-resource languages and compare our results to those reported in the research literature, in order to contribute further to the language-specific research field.


https://bora.uib.no/bora-xmlui/bitstream/handle/1956/21345/Thesis.pdf?sequence=1