Saturday, March 31, 2012

Lesk and its variants

The problem of Word Sense Disambiguation (WSD) has been studied extensively in light of word definitions. One of the earliest works on WSD, Lesk [0], used the Oxford Advanced Learner's Dictionary to select the best sense. The approach was later extended to WordNet, and many subsequent methods make extensive use of the WordNet hierarchy and the glosses of senses to assign the most apt sense.


Lesk [0]:
  • The idea is that the senses of nearby words are related: the more words the definitions of two senses have in common, the better those senses are as choices for their respective words.
  • The paper is known for its pine-cone example.
  • The motivation for studying this algorithm is that it is a simple unsupervised algorithm and performs reasonably well.
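
As a rough illustration of the idea (a toy sketch, not Lesk's original implementation; the dictionary here is a hypothetical structure of the form {word: {sense_id: gloss_string}}): pick the sense whose gloss shares the most words with the glosses of the surrounding words.

def gloss_overlap(gloss_a, gloss_b):
    # Number of distinct words the two glosses have in common.
    return len(set(gloss_a.lower().split()) & set(gloss_b.lower().split()))

def original_lesk(target, context_words, dictionary):
    best_sense, best_score = None, -1
    for sense_id, gloss in dictionary.get(target, {}).items():
        score = 0
        for w in context_words:
            if w == target:
                continue
            # Compare against every sense gloss of the neighbouring word.
            for other_gloss in dictionary.get(w, {}).values():
                score += gloss_overlap(gloss, other_gloss)
        if score > best_score:
            best_sense, best_score = sense_id, score
    return best_sense
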
Banerjee and Pedersen [1] :
Fix a context window, say of size 2n+1, around the target word. Each word in the window that is present in WordNet forms part of the context. We evaluate all possible combinations of the senses of these words and, for each candidate combination, assign a score to the sense s of the target word.


For a word sense, we can get 7 different gloss descriptions [synset definition, glosses of hypernyms, holonyms, etc.]. Hence, there are at most 49 possible relation pairs to consider for a particular pair of words.
The overlap between two glosses is defined as the longest sequence of one or more consecutive words that occurs in both glosses. Each overlap contributes a score equal to the square of the number of words in the overlap.


Once all the gloss comparisons have been made for all pairs of words in the context, over all relation pairs, the individual comparison scores are added up to give the score of the candidate combination. This process is repeated until all candidate combinations have been scored. The candidate combination with the highest score wins, and the target word is assigned the sense it carries in that combination. Ties are resolved in favour of more common senses.
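
A rough sketch of the scoring for a single pair of glosses (not the authors' implementation): repeatedly find the longest common consecutive word sequence, add the square of its length, and mask out the matched phrase so it cannot be reused.

def longest_common_phrase(a, b):
    # a, b are lists of words; returns (start_a, start_b, length) of the
    # longest common contiguous word sequence.
    best = (0, 0, 0)
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best[2]:
                best = (i, j, k)
    return best

def overlap_score(gloss_a, gloss_b):
    a, b = gloss_a.lower().split(), gloss_b.lower().split()
    score = 0
    while True:
        i, j, k = longest_common_phrase(a, b)
        if k == 0:
            break
        score += k * k                       # each overlap scores (length)^2
        # Replace the matched phrase with unique sentinels so that the words
        # on either side cannot form a new, spurious overlap.
        a = a[:i] + [object()] + a[i + k:]
        b = b[:j] + [object()] + b[j + k:]
    return score

The combination score is then the sum of such pairwise scores over all word pairs in the context and over all of the (at most 49) relation pairs.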


Remarks on [1]:
  • Computationally expensive approach.
  • Gives equal weightage to all the relation pairs. This is an important step in the process and should be studied more carefully.
  • The scoring mechanism uses a bag-of-words approach, which can be improved upon.
Simplified Lesk Algorithm [2] :
FindSense(word, sentence) returns mostAptSense
For each sense s of the word,
         set weight(s) to zero.
Identify the context of the word = the set W of unique words in the surrounding sentence.
For each word w in W,
         for each sense s,
                 description(s) = definition and example sentences of s
                 if w occurs in description(s), add weight(w) to weight(s).
Choose the sense with the greatest weight(s).
Ties can be resolved in favour of the more commonly used sense.
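
A minimal Python rendering of this pseudocode (assuming a hypothetical senses(word) accessor that returns (sense_id, description_text) pairs in most-common-first order; weight(w) defaults to 1, i.e. binary presence):

def find_sense(word, sentence, senses, weight=lambda w: 1.0):
    # Context = unique words of the surrounding sentence, minus the target.
    context = set(sentence.lower().split()) - {word}
    best_sense, best_score = None, float("-inf")
    for sense_id, description in senses(word):
        desc_words = set(description.lower().split())
        score = sum(weight(w) for w in context if w in desc_words)
        # Strict '>' keeps the earlier (more common) sense on ties.
        if score > best_score:
            best_sense, best_score = sense_id, score
    return best_sense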


Possible variants of the above algorithm :
  1. Weights : [2] suggested using the inverse document frequency (IDF) of the word w over the definitions and example sentences in the dictionary as weight(w). The IDF of a word w is computed as -log(p(w)), where p(w) is estimated as the fraction of dictionary "documents" (definitions or examples) that contain the word w. A computationally easier option is to use binary presence as weight(w), i.e. just count the context words that occur in the definition/examples (see the sketch after this list).
  2. Use of tagged corpora : As done in [2], the method can also use a sense-tagged corpus as the set of examples, so that the algorithm exploits the distribution of senses in real data.
  3. Context : Another idea one can play with is the definition of the context of a word. One can experiment with contexts derived from dependency graphs or shallow semantic parsing.
  4. Word Description : Instead of just using the definition and examples, one can generalize using the WordNet hierarchy, i.e. by including descriptions from hypernyms, holonyms, etc.
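
A small sketch of the IDF weighting from variant 1, treating each definition/example in the dictionary as a "document" (all_descriptions is a hypothetical list of such strings):

import math

def idf_weights(all_descriptions):
    n_docs = len(all_descriptions)
    doc_freq = {}
    for desc in all_descriptions:
        for w in set(desc.lower().split()):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    # weight(w) = -log p(w), where p(w) is the fraction of "documents"
    # (definitions or examples) containing w.
    return {w: -math.log(df / n_docs) for w, df in doc_freq.items()}

The resulting table can be plugged into the simplified Lesk sketch above, e.g. weight=lambda w: weights.get(w, 0.0).
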
Vasilescu et al. [3] :
[3] explores many of the variants proposed above.
Algorithm : 




Description : In [2], Description(w) was set to {w} itself. In [3], Description(w) is defined as the union of the descriptions of all its senses, where Description(word_sense) is one of :
  • DEF : a bag of words (BOW) containing the words from the gloss and examples.
  • REL : the union of the synsets visited while following synonymy and hypernymy links in WordNet.
  • DEF ∪ REL : the union of the two.
Context :
  • the set of words centered around the target word t : +/- 2, 3, 8, 10 and 25 words
  • Audibert et al. [4] showed that a symmetrical context is not optimal for disambiguating verbs : an asymmetric -2/+4 window works better.
  • Crestan et al. [5] showed that automatic context selection leads to improvements for some words.
  • the words in the lexical chain of t, a term borrowed from Hirst and St-Onge [6] : given the target word t and a word w from its context, Set(t) is the set of synonyms and hypernyms of all senses of t according to the WordNet hierarchy, and Set(w) is the corresponding set for w. w belongs to the lexical chain of t if the Jaccard score of Set(t) and Set(w) exceeds an experimentally set threshold (see the sketch below).
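
A small sketch of that lexical-chain test (expand(word) is a hypothetical helper returning the synonyms and hypernyms of all senses of the word, collected from the WordNet hierarchy; the threshold value here is illustrative, [3] set it experimentally):

def jaccard(set_a, set_b):
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def in_lexical_chain(t, w, expand, threshold=0.1):
    # w joins the lexical chain of t if their expanded sets are similar enough.
    return jaccard(expand(t), expand(w)) > threshold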


Scoring Function :
  • Lesk : each intersection scores 1.
  • Weighted : as noted in [1], long descriptions can produce more overlaps than short ones and thus dominate the decision. [3] therefore multiplies the number of overlaps for a candidate sense by the inverse of the logarithm of the description length of that sense (a sketch of both scoring functions follows).
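
A sketch of the two scoring functions, under the simplifying assumption that both the description and the context are plain sets of words:

import math

def lesk_score(description, context):
    # Each word in the intersection scores 1.
    return len(description & context)

def weighted_score(description, context):
    overlaps = len(description & context)
    # Damp long descriptions by the inverse log of their length.
    return overlaps / math.log(len(description)) if len(description) > 1 else overlaps
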
Naive Bayes Approach, proposed in [3]:


Choose, from the senses s of the word t, the one that maximizes p(s | Context(t)), under the assumption that the words in the context of t are independent of one another given the sense:
score(s) = log(p(s)) + \sum_{w \in Context(t)} log( \lambda p(w|s) + (1 - \lambda) p(w) )
The distributions p(s), p(w|s) and p(w) are computed by relative frequency over the SemCor corpus. The smoothing of p(w|s) by the unigram model p(w) is controlled by a single parameter \lambda, set to 0.95 in [3].
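
A sketch of this scorer (p_s, p_w_given_s and p_w are hypothetical relative-frequency tables, e.g. estimated from SemCor; the tiny floor on p(w) is only an implementation convenience to guard against log(0) for unseen words, not part of [3]):

import math

def nb_score(sense, context, p_s, p_w_given_s, p_w, lam=0.95):
    score = math.log(p_s[sense])
    for w in context:
        # Smooth the per-sense model p(w|s) with the unigram model p(w).
        score += math.log(lam * p_w_given_s.get((w, sense), 0.0)
                          + (1.0 - lam) * p_w.get(w, 1e-9))
    return score

def disambiguate(senses, context, p_s, p_w_given_s, p_w):
    return max(senses, key=lambda s: nb_score(s, context, p_s, p_w_given_s, p_w))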


Conclusions by [3] :
  • Performance decreases with larger contexts; the best performance is observed for contexts of 4 to 6 plain words.
  • POS information (known or estimated) is worth using when applied as a filter, since it reduces the number of senses to consider.
  • Combining variants might bring clear improvements, e.g. via boosting, as in Escudero et al. [7].
  • For short contexts (+/-2), DEF is better; for larger contexts, REL seems more appropriate.


References:
[0] Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In SIGDOC '86: Proceedings of the 5th annual international conference on Systems documentation, pages 24-26, New York, NY, USA. ACM.
[1] Satanjeev Banerjee and Ted Pedersen. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet, Lecture Notes In Computer Science; Vol. 2276, Pages: 136 - 145, 2002.
[2] A. Kilgarriff and J. Rosenzweig. 2000. English SENSEVAL: Report and Results. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, LREC, Athens, Greece.
[3] Florentina Vasilescu, Philippe Langlais, and Guy Lapalme. 2004. Evaluating Variants of the Lesk Approach for Disambiguating Words. LREC, Portugal.
[4] Audibert et al. 2003. Study of automatic semantic disambiguation criteria: results on co-occurrences.
[5] Crestan et al. 2003. Can we find the size of the optimal semantic context for disambiguation?
[6] Hirst, G. and St-Onge, D. 1998. Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms. In WordNet: An Electronic Lexical Database, MIT Press, pp. 305-331.
[7] G. Escudero, L. Marquez, and G. Rigau. 2000. Boosting applied to word sense disambiguation. In 12th European Conference on Machine Learning, Barcelona, Spain.

Monday, March 26, 2012

WSD using Term to Term Similarity in a Context Space

The paper [1] proposes to represent terms as bags of contexts and defines a similarity measure between terms. The idea is similar to the standard term-document matrices used for document similarity. The main challenge lies in representing words, lemmas and senses in the same context space, for which they use a very simple idea.

Training Methodology
The training corpus is represented as a matrix A of size LxE, with rows corresponding to the different lemmas/words encountered (essentially the dictionary) and columns to the different training examples. For the entries w_{i,j} of the matrix, the authors experiment with binary presence/absence, frequency and tf-idf weighting. The weights in q (defined below) are set to presence/absence.

Testing Methodology
Representing the sentence in the context space :
A new instance q can be represented as a vector of weights of size 1xL, which is then transformed into a vector in the context space by the usual inner product q · A, of size 1xE.

Representing senses in the context space :
Let sen_ik be the representation of the k-th candidate sense of the ambiguous lemma lem_i. It is of size 1xE, with sen_ik[j] = 1 if lemma lem_i is used with sense sen_ik in training context j, and 0 otherwise.

Assigning a sense to the ambiguous lemma :
For a new context of the ambiguous lemma lem_i, the candidate sense with the highest similarity is selected.


Similarity Measures [sim(sen, q)]
Two similarity measures are compared. The first (maximum) is the similarity of q, as a bag of words, with the training contexts of sense sen. The second (cosine) is the similarity of sense sen with q in the context space.

  • Maximum : max_{j=1:N} ( sen_j · q_j )
  • Cosine : (sen · q) / ( ||sen|| ||q|| )
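
A compact NumPy sketch of the pipeline (a sketch, not the authors' code; all names are illustrative): rows of A are lemmas, columns are training contexts; a new instance q (length L) is projected into the context space as q · A (length E); each candidate sense is a binary vector of length E marking the contexts in which its lemma carries that sense.

import numpy as np

def project(q, A):
    # q: length-L weight vector, A: L x E matrix -> length-E context vector.
    return q @ A

def sim_maximum(sen, q_ctx):
    # Best match of the instance against any single training context of the sense.
    return float(np.max(sen * q_ctx))

def sim_cosine(sen, q_ctx):
    denom = np.linalg.norm(sen) * np.linalg.norm(q_ctx)
    return float(sen @ q_ctx) / denom if denom else 0.0

def assign_sense(q, A, sense_vectors, sim=sim_cosine):
    q_ctx = project(q, A)
    # Pick the candidate sense most similar to the new context.
    return max(sense_vectors, key=lambda s: sim(sense_vectors[s], q_ctx))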

Observations/Suggestions from the above experiments :
  • almost all results improve when the cosine similarity is applied in the context space. The exception is the use of co-occurrences to disambiguate nouns.
  • if sense sen_1 has two training contexts with the highest number of co-occurrences and sense sen_2 has only one with the same number of co-occurrences, then sen_1 should receive a higher score than sen_2.
Using the above ideas, they propose :
  • Artificially reducing the number of co-occurrences : if c1 and c2 are the contexts with the highest and second-highest number of co-occurrences with q, then assign to the first context c1 the number of co-occurrences of context c2.
  • Modified similarity : \sum_{j=1}^{N} sen_j N^{q_j}

Conclusions:
  • The idea of using SemCor for an exemplar-based approach via the term-document matrix analogy is interesting and intuitive.
  • The paper ignores the WordNet semantic structure and depends entirely on annotated data. The main concern is that it does not generalize to unseen text/ambiguous words.

References:
[1] Artiles et al. Word Sense Disambiguation based on Term to Term Similarity in a Context Space.

Monday, March 19, 2012

SenseLearner v1.0

I am currently working on WSD systems in the all-words disambiguation setting. SenseLearner [1] participated in the SENSEVAL-3 (2004) English all-words task and achieved an average accuracy of 64.6%. It claims to be a minimally supervised sense tagger that attempts to disambiguate all content words in a text using the senses from WordNet. It uses two datasets: WordNet itself and the SemCor corpus.

Semantic Language Model
The paper observes that the single noun/verb/adjective just before and after an ambiguous word is, most of the time, sufficient to disambiguate the sense of that word. They develop a Semantic Language Model on this hypothesis. The problem with this model is that it cannot disambiguate words it has not seen in the training corpus, and hence another module is required to handle unseen words.

Noun Model : The first noun, verb, or adjective before the target noun, within a window of at most five words to the left, and its POS.

Verb Model: The first word before and the first word after the target verb, and its POS.

Adjective Model 1 : relies on the first noun after the target adjective, within a window of at most five words.

Adjective Model 2 : relies on the first word before and the first word after the target adjective, and its POS.

The label of each such feature vector consists of the target word and its sense, in the form word#sense. For learning, they use TiMBL, a memory-based learner [2]. During sense prediction, each content word is labelled with a predicted word and its sense. If the predicted word matches the word to be disambiguated, we assign the predicted sense; otherwise the word is passed to the next module.
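
A rough sketch of the noun-model feature extraction described above (not the authors' code, and TiMBL itself is not invoked here): tagged is a hypothetical list of (word, POS) pairs for the sentence, and the training label is built as word#sense.

# Simplified POS-tag prefixes standing in for noun/verb/adjective.
CONTENT_TAGS = ("NN", "VB", "JJ")

def noun_model_features(tagged, target_index):
    # First noun/verb/adjective within at most five words to the left, plus its POS.
    for word, pos in reversed(tagged[max(0, target_index - 5):target_index]):
        if pos.startswith(CONTENT_TAGS):
            return {"left_word": word, "left_pos": pos}
    return {"left_word": None, "left_pos": None}

def training_instance(tagged, target_index, sense):
    features = noun_model_features(tagged, target_index)
    label = "%s#%s" % (tagged[target_index][0], sense)   # e.g. "plant#2"
    return features, label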

Semantic Generalization using Syntactic Dependencies and a Conceptual Network
Since the first module does not work on unseen words, we need a way to generalize. For this, the authors propose another memory-based learning algorithm.

Training Phase
1. Find all the dependencies in the sentence using a dependency parser (the authors use the Link parser). Add POS and sense information to the dependency pair where applicable.
2. For each noun and verb in the dependency pair, obtain the hypernym tree of the word. Build a vector consisting of the words themselves, their POS, their WordNet sense and a reference to all their hypernym synsets in WordNet.
3. For each dependency pair, generate positive feature vectors for the senses that appear in the training set and negative feature vectors for the remaining possible senses.

Testing Phase
1. Do the same as above and obtain a feature vector from each dependency pair. Then enumerate all possible feature vectors using all combinations of candidate senses.
2. Label each feature vector as positive or negative using TiMBL.
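
A rough sketch of how the candidate feature vectors for a dependency pair could look (not the authors' code; hypernym_synsets(word, sense) is a hypothetical WordNet accessor, and the TiMBL classification step is omitted):

from itertools import product

def pair_vector(head, head_pos, head_sense, dep, dep_pos, dep_sense, hypernym_synsets):
    # Words, POS, a sense assignment, and references to the hypernym synsets.
    return {
        "head": head, "head_pos": head_pos, "head_sense": head_sense,
        "head_hypernyms": hypernym_synsets(head, head_sense),
        "dep": dep, "dep_pos": dep_pos, "dep_sense": dep_sense,
        "dep_hypernyms": hypernym_synsets(dep, dep_sense),
    }

def candidate_vectors(head, head_pos, head_senses, dep, dep_pos, dep_senses, hypernym_synsets):
    # At test time: one vector per combination of candidate senses.
    # At training time: the attested combination is labelled positive,
    # the remaining combinations negative.
    return [pair_vector(head, head_pos, hs, dep, dep_pos, ds, hypernym_synsets)
            for hs, ds in product(head_senses, dep_senses)]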

Discussion and Conclusions
1. The memory-based learning algorithm is an interesting choice compared to plain k-NN approaches.
2. The feature set in the first model is minimal, providing insight into which features are actually useful. One could experiment with these feature sets and obtain further results.
3. Using dependencies to capture semantic structure has always been interesting/complicated. The paper shows a simple way in which it can be done.

References:
[1] R. Mihalcea and E. Faruque. 2004. SenseLearner: Minimally supervised word sense disambiguation for all words in open text. In Proceedings of ACL/SIGLEX Senseval-3, Barcelona, Spain, July.

[2] W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch. 2001. TiMBL: Tilburg Memory Based Learner.