The problem of Word Sense Disambiguation (WSD) has been studied extensively in the light of word definitions. One of the earliest works on WSD, by Lesk [0], used the Oxford dictionary to select the best sense. The approach was later extended to WordNet, and many subsequent methods make extensive use of the WordNet hierarchy and the glosses of senses to assign the most apt sense to a word.
Lesk [0]:
- The idea is that the senses of nearby words are related: the more words the definitions of two senses have in common, the better those senses are as choices for the respective words.
- The paper is well known for its pine cone / ice cream cone example.
- The motivation for studying this algorithm is that it is a simple unsupervised algorithm that performs reasonably well.
The adapted Lesk algorithm of [1] proceeds as follows. Fix a context window, say 2n+1 words around the target word. Each window word that is present in WordNet forms part of the context. We evaluate all possible combinations of the senses of these words and, for each candidate combination, assign a score to the sense s it gives to the target word.
For a word sense, we can obtain 7 different gloss descriptions (synset definition, glosses of holonyms, hypernyms, etc.). Hence there are at most 49 possible relation pairs to consider for a particular pair of words.
An overlap between two glosses is defined as a longest sequence of one or more consecutive words that occurs in both glosses. Each overlap contributes a score equal to the square of the number of words in the overlap.
Once all the gloss comparisons have been made for all pairs of words in the context, over all relation pairs, the individual comparison scores are added up to obtain the score of the candidate combination. This process is repeated until all candidate combinations have been scored. The candidate combination with the highest score wins, and the target word is assigned the sense given in that combination. Ties are resolved in favour of more common senses.
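To make the scoring concrete, here is a minimal Python sketch of the squared-overlap measure just described (overlap_score is an illustrative name; the filtering of overlaps made purely of function words used in [1] is omitted for brevity):

# Repeatedly find the longest run of consecutive words shared by the two glosses,
# add the square of its length, and remove it so no word is counted twice.
def overlap_score(gloss_a, gloss_b):
    a, b = gloss_a.lower().split(), gloss_b.lower().split()
    score = 0
    while True:
        best = None  # (length, start index in a, start index in b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                    k += 1
                if k > 0 and (best is None or k > best[0]):
                    best = (k, i, j)
        if best is None:
            break
        k, i, j = best
        score += k * k          # each overlap contributes (its length) squared
        del a[i:i + k]          # remove the matched words from both glosses
        del b[j:j + k]
    return score

# e.g. overlap_score("a gymnastic horse", "a padded gymnastic horse used for vaulting")
# returns 5: the 2-word overlap "gymnastic horse" (4) plus the 1-word overlap "a" (1).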
Remarks on [1]:
- The approach is computationally expensive, since the number of candidate sense combinations grows rapidly with the size of the context window.
- It gives equal weightage to all relation pairs. This is an important step in the process and deserves more careful study.
- The scoring mechanism is based on surface word overlap (essentially a bag-of-words approach), which can be improved upon.
FindSense(word, sentence) returns mostAptSense
    For each sense s of word:
        set weight(s) to zero.
    Identify the context of the word: the set W of unique words in the surrounding sentence.
    For each word w in W:
        for each sense s:
            description(s) = definition or example sentences of s
            if w occurs in description(s), add weight(w) to weight(s).
    Choose the sense with the greatest weight(s).
    Ties can be resolved in favour of the more commonly used sense.
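A minimal, runnable rendering of this pseudocode using NLTK's WordNet interface (a sketch, assuming the WordNet corpus is installed; tokenization is naive whitespace splitting):

from nltk.corpus import wordnet as wn

# weight maps a context word to its weight; by default every word counts 1 (binary presence).
def find_sense(word, sentence, weight=None):
    context = set(sentence.lower().split()) - {word.lower()}
    best_sense, best_score = None, -1.0
    for sense in wn.synsets(word):
        # description(s): gloss plus example sentences of the sense
        description = sense.definition() + " " + " ".join(sense.examples())
        desc_words = set(description.lower().split())
        score = sum(weight(w) if weight else 1.0 for w in context if w in desc_words)
        # strict '>' keeps the earlier sense on ties; WordNet lists more frequent
        # senses first, so ties go to the more common sense
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# e.g. find_sense("bank", "I deposited the cheque at the bank near the river")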
Possible Variants of above algorithm :
- Weights : [2] suggested using the inverse document frequency (IDF) of the word w over the definitions and example sentences in the dictionary as weight(w). The IDF of a word w is computed as -log(p(w)), where p(w) is estimated as the fraction of dictionary "documents" (definitions or examples) that contain the word w. A computationally cheaper option is to use binary presence as weight(w), i.e. simply count the context words that appear in the definition/examples. (A small sketch of the IDF variant follows this list.)
- Use of tagged corpora : As done in [2], the above method can also use a sense-tagged corpus as the set of examples, so that the algorithm exploits the distribution of senses in real data.
- Context : Another parameter one can vary is the definition of the context of a word. One could experiment with contexts derived from dependency graphs or shallow semantic parsing.
- Word description : Instead of using only the definition and examples, one can generalize using the WordNet hierarchy, i.e. by including the descriptions of hypernyms, holonyms, etc.
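A small sketch of the IDF-weight variant from the first bullet above (the function name is illustrative; it plugs into the find_sense sketch via its weight argument, and iterating over all of WordNet is slow but straightforward):

import math
from collections import Counter
from nltk.corpus import wordnet as wn

# IDF over dictionary "documents": here each synset's gloss plus examples is one
# document, and idf(w) = -log p(w) with p(w) the fraction of documents containing w.
def build_idf():
    doc_freq, n_docs = Counter(), 0
    for synset in wn.all_synsets():
        n_docs += 1
        words = set((synset.definition() + " " + " ".join(synset.examples())).lower().split())
        doc_freq.update(words)
    return {w: -math.log(df / n_docs) for w, df in doc_freq.items()}

# idf = build_idf()
# find_sense("bank", "...", weight=lambda w: idf.get(w, 0.0))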
[3] explores many of the variants proposed above.
Algorithm :
Description : In [2], Description(w) was simply {w} itself. In [3], Description(w) is the union of Description(word_sense) over all senses of w, where Description(word_sense) is one of the following (a small sketch of DEF and REL follows this list):
- DEF : a bag of words (BOW) containing the words from the gloss and the example sentences.
- REL : the union of the synsets visited while following synonymy and hypernymy links in WordNet.
- DEF union REL : the union of the two.
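A sketch of how DEF and REL bags could be built with NLTK's WordNet interface (the exact link traversal in [3] may differ; the depth limit here is an assumption):

from nltk.corpus import wordnet as wn

# DEF: bag of words from the gloss and the example sentences of the sense.
def description_def(synset):
    return set((synset.definition() + " " + " ".join(synset.examples())).lower().split())

# REL (approximation): lemma names of the synsets reached by following synonymy
# and hypernymy links, truncated at a fixed depth for brevity.
def description_rel(synset, depth=2):
    words, frontier = set(), {synset}
    for _ in range(depth + 1):
        for s in frontier:
            words.update(name.lower() for name in s.lemma_names())
        frontier = {h for s in frontier for h in s.hypernyms()}
    return words

# Description(word): union over the descriptions of all its senses.
def description_word(word, desc=description_def):
    out = set()
    for synset in wn.synsets(word):
        out |= desc(synset)
    return out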
Context :
- the set of words centered around the target word t : +/- 2, 3, 8, 10 & 25 words
- Audibert et al. [4] showed that a symmetrical context is not optimal for disambiguating verbs (e.g. -2/+4).
- Crestan et al. [5] showed that automatic context selection leads to improvements for some words.
- Words in the lexical chain of t, a term borrowed from Hirst and St-Onge [6]: given the target word t and a word w from its context, Set(t) is the set of synonyms and hypernyms of all senses of t according to the WordNet hierarchy, and Set(w) is the corresponding set of synonyms and hypernyms of all senses of w. w belongs to the lexical chain of t if the Jaccard score computed for Set(t) and Set(w) is greater than an experimentally chosen threshold (see the sketch below).
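A sketch of the lexical-chain membership test (the hypernym depth and the threshold value are assumptions; [6] and [3] do not necessarily use these exact choices):

from nltk.corpus import wordnet as wn

# Set(x): synonyms and direct hypernym lemmas over all senses of x.
def syn_hyper_set(word):
    out = set()
    for synset in wn.synsets(word):
        out.update(name.lower() for name in synset.lemma_names())
        for hyper in synset.hypernyms():
            out.update(name.lower() for name in hyper.lemma_names())
    return out

# w belongs to the lexical chain of t if the Jaccard score of the two sets
# exceeds an experimentally chosen threshold.
def in_lexical_chain(t, w, threshold=0.05):
    set_t, set_w = syn_hyper_set(t), syn_hyper_set(w)
    if not set_t or not set_w:
        return False
    return len(set_t & set_w) / len(set_t | set_w) > threshold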
Scoring Function :
- Lesk : Each intersection scores 1
- Weighted : as noted in [1], long descriptions can produce more overlaps than short ones and thus dominate the decision process; [3] multiplies the number of overlaps for a given candidate sense by the inverse of the logarithm of the description length for that sense (a small sketch follows).
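As a concrete reading of the weighted variant (a sketch; the exact normalization in [3] may differ slightly):

import math

# Dampen long descriptions: multiply the raw overlap count for a candidate sense
# by the inverse of the log of its description length (in words).
def weighted_overlap(n_overlaps, description_length_in_words):
    if description_length_in_words <= 1:
        return float(n_overlaps)
    return n_overlaps / math.log(description_length_in_words)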
A probabilistic (naive Bayes style) scoring alternative chooses, among the senses s of the target word t, the one that maximizes p(s | Context(t)), under the assumption that the words in the context of t are independent of one another given the sense:

score(s) = log p(s) + sum over w in Context(t) of log( λ·p(w|s) + (1 − λ)·p(w) )

The distributions p(s), p(w|s) and p(w) are computed by relative frequency over the SemCor corpus. The smoothing of p(w|s) by the unigram model p(w) is controlled by a single parameter λ, set to 0.95 in [3].
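A minimal sketch of this scoring, assuming the distributions p(s), p(w|s) and p(w) have already been estimated as described above (the dictionary lookups and the small floor on p(w) are illustrative assumptions to avoid log(0)):

import math

def nb_score(sense, context_words, p_s, p_w_given_s, p_w, lam=0.95):
    # score(s) = log p(s) + sum over context words of log( lam*p(w|s) + (1-lam)*p(w) )
    score = math.log(p_s[sense])
    for w in context_words:
        score += math.log(lam * p_w_given_s.get((w, sense), 0.0)
                          + (1 - lam) * max(p_w.get(w, 0.0), 1e-9))
    return score

def best_sense(senses, context_words, p_s, p_w_given_s, p_w):
    return max(senses, key=lambda s: nb_score(s, context_words, p_s, p_w_given_s, p_w))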
Conclusions by [3] :
- Performance decreases with larger contexts; the best performance is observed for contexts of 4 to 6 plain words.
- Part-of-speech information (known or estimated) is worth using when applied as a filter, since it reduces the number of senses to consider.
- Combining variants might bring clear improvements, e.g. via boosting, as in Escudero et al. [7].
- For short contexts (+/-2), DEF is better; for larger contexts, REL seems more appropriate.
References:
[0] Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In SIGDOC '86: Proceedings of the 5th annual international conference on Systems documentation, pages 24-26, New York, NY, USA. ACM.
[1] Satanjeev Banerjee and Ted Pedersen. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet, Lecture Notes In Computer Science; Vol. 2276, Pages: 136 - 145, 2002.
[2] A. Kilgarriff and J. Rosenzweig. 2000. English SENSEVAL: Report and Results. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC), Athens, Greece.
[3] Florentina Vasilescu, Philippe Langlais, and Guy Lapalme. 2004. Evaluating Variants of the Lesk Approach for Disambiguating Words. LREC, Portugal.
[4] Audibert et al. 2003. Study of automatic semantic disambiguation criteria: results on co-occurrences.
[5] Crestan et al. 2003. Can we find the size of the optimal semantic context for disambiguation?
[6] Hirst, G. and St-Onge, D. 1998. Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms. In WordNet: An Electronic Lexical Database, MIT Press, pp. 305-331.
[7] G. Escudero, L. Marquez, and G. Rigau. 2000. Boosting applied to word sense disambiguation. In Proceedings of the 12th European Conference on Machine Learning, Barcelona, Spain.