Abstract
Natural language processing has seen a tremendous boost in popularity following the widespread use of the World Wide Web and the emergence of machine learning tools.
Sentiment analysis in particular has become a popular topic with the availability of user-generated content from micro-blogs and similar sources.
However, such data-dependent problems have seen a larger jump in popularity in international research than in research on low-resource languages, owing to differences in the availability of language-specific data.
This thesis examines sentiment analysis research for some of these low-resource languages, specifically the closely related languages of mainland Scandinavia.
We perform a literature review to uncover popular research topics within this language-specific field and to identify practical and theoretical work, as well as available resources.
Furthermore, we perform experiments adapting international tools to these low-resource languages and compare our results with those reported in the literature, in order to contribute further to the language-specific research field.
https://bora.uib.no/bora-xmlui/bitstream/handle/1956/21345/Thesis.pdf?sequence=1