UNSUPERVISED MACHINE LEARNING APPROACH FOR TIGRIGNA WORD SENSE DISAMBIGUATION

dc.contributor.author Mebrahtu, Meresa
dc.date.accessioned 2017-07-31T06:06:51Z
dc.date.available 2017-07-31T06:06:51Z
dc.date.issued 2017-03-05
dc.identifier.uri http://hdl.handle.net/123456789/916
dc.description.abstract All human languages have words that can mean different things in different contexts. Word sense disambiguation (WSD) is an open problem in natural language processing: the task of identifying which sense (i.e. meaning) of a word is used in a sentence when the word has multiple meanings (polysemy). In this paper, we are concerned with a corpus-based approach to word sense disambiguation for Tigrigna texts that requires only information that can be automatically extracted from untagged text. We use unsupervised techniques to decide automatically the correct sense of an ambiguous word based on its surrounding context. Because of the lack of sufficient training data, we report experiments on four selected ambiguous Tigrigna words: መደብ, read as “medeb”, which has three meanings (program, traditional bed and grouping); ሓለፈ, read as “halefe”, which has four meanings (pass, promote, boss and pass away); ሃደመ, read as “hademe”, which has two meanings (running and building a house); and ከበረ, read as “kebere”, which has two meanings (respecting and expensive). An unsupervised machine learning technique was applied to a corpus of Tigrigna sentences to acquire disambiguation information automatically. A total of 631 sense examples for the four ambiguous words, transcribed to Latin script, were collected from various online Tigrigna websites and newspapers. We tested five clustering algorithms (simple k-means; hierarchical agglomerative clustering with single, average and complete link; and Expectation Maximization) using the existing implementations in the Weka 3.8.1 package. The “Use training set” evaluation mode was selected to train the chosen algorithms on the preprocessed dataset.
We evaluated the algorithms on the four ambiguous words. The best accuracies were in the range of 52 to 77.5% for simple k-means, 67 to 83.3% for Expectation Maximization (EM), 45.6 to 74.1% for single link, 65 to 73.3% for average link (AL) and 65 to 73.3% for complete link (CL), which is an encouraging result. The EM algorithm achieved the best accuracy overall, 67 to 83.3%. However, we faced challenges in collecting datasets, stemming words properly and transliterating the sentences to the SERA system, all of which would need to be addressed to reach higher accuracy. Further experiments on other ambiguous words and with different approaches are therefore needed to improve natural language understanding for Tigrigna. en_US
dc.description.sponsorship UoG en_US
dc.language.iso en_US en_US
dc.subject Computer Science en_US
dc.title UNSUPERVISED MACHINE LEARNING APPROACH FOR TIGRIGNA WORD SENSE DISAMBIGUATION en_US
dc.type Thesis en_US
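
The clustering step described in the abstract can be illustrated with a small sketch. This is not the Weka 3.8.1 setup used in the thesis: it is a standard-library Python illustration under several assumptions. The toy contexts are invented English glosses around the transliterated word "medeb" and do not come from the thesis corpus; the helper names (bag_of_words, k_means, cluster_accuracy) are hypothetical; the farthest-first seeding is chosen only to make the run deterministic; and scoring each cluster by its majority sense is one possible way to compare clusters with known senses, not necessarily the evaluation the author used.

```python
from collections import Counter


def bag_of_words(contexts, target):
    """Turn each context into a word-count vector, skipping the target word."""
    tokenised = [[w for w in c.lower().split() if w != target] for c in contexts]
    vocab = sorted({w for toks in tokenised for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for toks in tokenised:
        vec = [0.0] * len(vocab)
        for w in toks:
            vec[index[w]] += 1.0
        vectors.append(vec)
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, max_iter=100):
    """Plain k-means with deterministic farthest-first seeding (an assumption)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_accuracy(labels, senses):
    """Map each cluster to its majority sense; return the fraction matched."""
    correct = 0
    for j in set(labels):
        members = [s for lab, s in zip(labels, senses) if lab == j]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(senses)


# Invented toy contexts (NOT thesis data): two of the senses of "medeb".
contexts = [
    "radio medeb news evening",
    "radio medeb music evening",
    "television medeb news morning",
    "sleep medeb house night",
    "sleep medeb mud house",
    "rest medeb house night",
]
senses = ["program"] * 3 + ["bed"] * 3

labels = k_means(bag_of_words(contexts, "medeb"), k=2)
accuracy = cluster_accuracy(labels, senses)
print(labels, accuracy)
```

The toy contexts are deliberately well separated, so the two clusters line up with the two senses. Real contexts share far more vocabulary across senses, which is consistent with the lower accuracy ranges reported in the abstract.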

