The order has two parts: a one-page report that answers all the questions listed, and the Java documentation.
Please submit both on time.
First, make sure you understand: What is the problem of Word Sense Disambiguation (WSD)? Why is it important from a language technology engineering point of view? How is it a classification problem? How do you use SVM learning and classification to solve classification problems generally and in this WSD case? (No detailed understanding of the internal mechanisms of SVMs and SVM learning algorithms is required.) What is n-fold crossvalidation? How do you compute the precision, recall and F1 scores, and what do they show?
The course literature, the links given here, and other net resources give plenty of useful information about the matters important to this assignment.
A Java implementation for WSD experiments provides the point of departure for the assignment [package wsd (tar), doc].
The sense keys are from WordNet and you can see how they are explained there. Classification applies to tokens belonging to a certain lemma, defined by the lemma and pos attributes, and is binary, predicting that a token has a certain sense (defined by a WordNet sense key) or that it doesn’t have that sense (has another sense).
You (and the Java implementation) will use the following resources (placed as specified in the FileLocations class fields, which you’re expected to modify according to your own preferences):
Data: The largest publicly available sense-tagged corpus: semcor3.0. (Download semcor3.0.) Put the semcor3.0 contents in the FileLocations.semCorLoc directory. (This is already done for our own Linux system.)
We’ll use Thorsten Joachims’ SVMlight implementation of SVM learning (svm_learn) and classification (svm_classify). (Other SVM implementations can be used, of course.) Put the two programmes in the FileLocations.progLoc directory. (This is already done for our own Linux system.)
We’ll use the classification tasks in the following text file for our experiments: senseDistinctions.txt. It lists those senses which have at least 100 positive instances and which apply to between 40 and 60 percent of all the instances of their lemma. The 32 entries are like “add VB 2:30:00::”, i.e. lemma, pos tag, and sense key separated by blanks. Put senseDistinctions.txt in the FileLocations.wsdData directory.
The FileLocations.wsdData directory should have the subdirectories examples, model, and predictions, in which input and output from SVMlight will reside.
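As an illustration of the entry format of senseDistinctions.txt described above, here is a minimal sketch of how such lines could be parsed. The class name SenseDistinction and its methods are hypothetical and are not taken from the provided wsd package, which may read the file differently.

```java
import java.util.ArrayList;
import java.util.List;

/** One binary task: does a token of this lemma/pos have this sense or not? (Hypothetical class.) */
final class SenseDistinction {
    final String lemma;
    final String pos;
    final String senseKey;

    SenseDistinction(String lemma, String pos, String senseKey) {
        this.lemma = lemma;
        this.pos = pos;
        this.senseKey = senseKey;
    }

    /** Parses one entry such as "add VB 2:30:00::" (three fields separated by blanks). */
    static SenseDistinction parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 'lemma pos senseKey', got: " + line);
        }
        return new SenseDistinction(fields[0], fields[1], fields[2]);
    }

    /** Parses all lines of the file content, skipping empty lines. */
    static List<SenseDistinction> parseAll(List<String> lines) {
        List<SenseDistinction> result = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                result.add(parse(line));
            }
        }
        return result;
    }
}
```

The lines themselves could be obtained with, for example, java.nio.file.Files.readAllLines on the file in your FileLocations.wsdData directory.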
The code is compiled and executed in this way on our Linux system from the src directory (but you might want to use some other environment):
To describe the experiment briefly, it performs WSD SVM training and classification on each of the sense distinctions in “senseDistinctions.txt”. First, the example data are extracted from the SemCor corpus, i.e. positive and negative instances are located and features extracted. After that, training and validation sets are created for a 10-fold crossvalidation. They are used for SVM training and classification.
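The creation of training and validation sets for an n-fold crossvalidation can be sketched as follows. This is only an illustration under assumptions: the class CrossValidationFolds and its method names are hypothetical, and the provided implementation may partition the data in another way.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper illustrating n-fold crossvalidation partitioning. */
final class CrossValidationFolds {

    /** Shuffles the instances and deals them into n folds whose sizes differ by at most one. */
    static <T> List<List<T>> partition(List<T> instances, int n, Random random) {
        List<T> shuffled = new ArrayList<>(instances);
        Collections.shuffle(shuffled, random);
        List<List<T>> folds = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < shuffled.size(); i++) {
            folds.get(i % n).add(shuffled.get(i));
        }
        return folds;
    }

    /** The training set for round heldOut: every fold except the validation fold. */
    static <T> List<T> trainingSet(List<List<T>> folds, int heldOut) {
        List<T> training = new ArrayList<>();
        for (int i = 0; i < folds.size(); i++) {
            if (i != heldOut) {
                training.addAll(folds.get(i));
            }
        }
        return training;
    }
}
```

In round k, fold k serves as the validation set and the remaining folds form the training set, so every instance is used for validation exactly once.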
The Main.main method runs the experiment and reports precision, recall and F1 score for each sense distinction crossvalidation experiment, as well as the averages of these scores. An example of such a run: output.
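For reference, the three scores can be computed from the counts of true positives (tp), false positives (fp) and false negatives (fn). The sketch below is not the code of the provided package; the class name BinaryScores is hypothetical, and returning 0 when a denominator is zero is an assumed convention.

```java
/** Hypothetical helper: precision, recall and F1 for one binary classification outcome. */
final class BinaryScores {
    final int tp; // predicted positive, actually positive
    final int fp; // predicted positive, actually negative
    final int fn; // predicted negative, actually positive

    BinaryScores(int tp, int fp, int fn) {
        this.tp = tp;
        this.fp = fp;
        this.fn = fn;
    }

    /** Share of the predicted positives that are correct. Assumption: 0 if nothing was predicted positive. */
    double precision() {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    /** Share of the actual positives that were found. Assumption: 0 if there are no actual positives. */
    double recall() {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    /** Harmonic mean of precision and recall. */
    double f1() {
        double p = precision();
        double r = recall();
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }
}
```

For example, with tp = 8, fp = 2 and fn = 8, precision is 8/10 = 0.8, recall is 8/16 = 0.5, and F1 is 2 · 0.8 · 0.5 / 1.3 ≈ 0.615.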
The Java code that is provided gives us the following:
The ten-fold crossvalidation setup is for each sense distinction based on all available instances (i.e. the training set sizes vary). It computes precision, recall and F1 score for each sense distinction crossvalidation experiment.
An extremely parsimonious feature extraction class – FeatureExtractorLetterLeft – is provided. It only extracts the letter immediately to the left as a feature for a token. FeatureExtractorPosLetterLeft in addition extracts the pos tag for the token immediately to the left. (Of course, a lot of useful information escapes these schemes.) FeatureExtractorLetterLeft gives this result. As the result depends on a random partitioning of the data into ten folds, each run will give slightly different results.
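Whatever a feature extractor produces must end up in SVMlight's example file format: one line per instance, consisting of the target (+1 or -1) followed by feature:value pairs with feature numbers starting at 1 and in increasing order. The sketch below shows one way to map symbolic features to such lines. The class SvmLightExampleWriter and the feature names (e.g. "left=e", "posLeft=DT") are hypothetical; the provided feature extraction classes may be organised differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: turns symbolic binary features into SVMlight example lines. */
final class SvmLightExampleWriter {
    // Maps each feature name to a positive integer id, assigned on first use.
    private final Map<String, Integer> featureIds = new LinkedHashMap<>();

    int idOf(String featureName) {
        Integer id = featureIds.get(featureName);
        if (id == null) {
            id = featureIds.size() + 1; // SVMlight feature numbers start at 1
            featureIds.put(featureName, id);
        }
        return id;
    }

    /** Formats one instance; every feature that is present gets the value 1. */
    String formatLine(boolean positive, Set<String> featureNames) {
        TreeSet<Integer> ids = new TreeSet<>(); // sorted, as SVMlight requires increasing feature numbers
        for (String name : featureNames) {
            ids.add(idOf(name));
        }
        StringBuilder line = new StringBuilder(positive ? "+1" : "-1");
        for (int id : ids) {
            line.append(' ').append(id).append(":1");
        }
        return line.toString();
    }
}
```

The same writer instance has to be used for the training and the validation examples of an experiment, so that a given feature receives the same number in both files.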
The assignment consists in doing the following:
Evaluation may also be based on training sets of equal size. Overall evaluation metrics may also be based on the totality of token level outcomes. Contrast these evaluation choices with the ones made in the existing implementation! Which evaluation metrics are most interesting (in what context)? Furthermore, modify the implementation to allow evaluation metrics to be computed also in these ways.
Try to find and implement the best pos-tag-based collocational (pos at a certain relative position) feature extraction scheme.
Try to find and implement a better collocational (something at a certain relative position) feature extraction scheme.
In relation to these extraction schemes, are there any sense distinctions that are unusually easy or difficult to predict? Is it possible to explain this with the help of a linguistics-based analysis of what is reasonable to expect from the behaviour of the lemma?
Try to find and implement the best bag-of-lemmas and bag-of-word-forms feature extraction scheme and compare the two.
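Regarding the first task above, the difference between averaging the scores of the individual sense distinctions (often called macro-averaging) and computing one score from the totality of token level outcomes (often called micro-averaging) can be sketched like this. The class AveragedF1 and the representation of the counts as {tp, fp, fn} arrays are assumptions made for the illustration only.

```java
/** Hypothetical helper contrasting two ways of computing an overall F1 score. */
final class AveragedF1 {

    /** F1 from counts, using the identity F1 = 2tp / (2tp + fp + fn). Assumption: 0 if undefined. */
    static double f1(long tp, long fp, long fn) {
        long denominator = 2 * tp + fp + fn;
        return denominator == 0 ? 0.0 : 2.0 * tp / denominator;
    }

    /** Mean of the per-distinction F1 scores; counts[i] = {tp, fp, fn} for distinction i. */
    static double macroF1(int[][] counts) {
        double sum = 0.0;
        for (int[] c : counts) {
            sum += f1(c[0], c[1], c[2]);
        }
        return counts.length == 0 ? 0.0 : sum / counts.length;
    }

    /** F1 computed once from the counts summed over all distinctions. */
    static double microF1(int[][] counts) {
        long tp = 0, fp = 0, fn = 0;
        for (int[] c : counts) {
            tp += c[0];
            fp += c[1];
            fn += c[2];
        }
        return f1(tp, fp, fn);
    }
}
```

With one large, easy distinction {90, 10, 10} and one small, hard distinction {1, 9, 9}, the per-distinction F1 scores are 0.9 and 0.1, so the average over distinctions is 0.5, whereas the score based on all token level outcomes is 182/220 ≈ 0.83. The former weights every sense distinction equally, the latter every token.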