The paper [1] proposes representing terms as bags of contexts and defining a similarity measure between terms. The idea is similar to the standard term-document matrices used for document similarity. The main challenge lies in representing words, lemmas, and senses in the same context space, for which they use a very simple idea.
Training Methodology
The training corpus is represented as a matrix A of size L×E, where the rows are the different lemmas/words encountered (essentially the dictionary) and the columns are the different training examples (contexts). The entries w_{i,j} of the matrix are experimented with as binary presence/absence, frequency, and tf-idf weights. The weights in a query vector q are set to presence/absence.
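To make the construction concrete, here is a minimal sketch in Python/NumPy, assuming a hypothetical toy corpus of three training contexts; the names (contexts, vocab, A) are illustrative, not from the paper.

```python
# Minimal sketch: building the L x E term-by-context matrix A.
# The toy corpus below is hypothetical; the paper builds A from
# sense-annotated training examples.
import numpy as np

contexts = [
    ["bank", "river", "water"],      # training example (context) 1
    ["bank", "money", "loan"],       # training example (context) 2
    ["money", "loan", "interest"],   # training example (context) 3
]

vocab = sorted({w for ctx in contexts for w in ctx})  # the dictionary (L lemmas)
word_idx = {w: i for i, w in enumerate(vocab)}

L, E = len(vocab), len(contexts)
A = np.zeros((L, E))
for j, ctx in enumerate(contexts):
    for w in ctx:
        A[word_idx[w], j] += 1       # frequency weighting; (A > 0) gives binary

# tf-idf variant: down-weight lemmas that appear in many contexts
df = (A > 0).sum(axis=1)             # number of contexts containing each lemma
A_tfidf = A * np.log(E / df)[:, None]
```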
Testing Methodology
Representing the sentence in the context space:
A new instance q can be represented by a vector of weights of size 1×L, which is subsequently transformed into a vector in the context space by the usual inner product q · A, of size 1×E.
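A sketch of this projection, continuing the toy example above (to_context_space is my name, not the paper's):

```python
# Project a new instance into the context space: q (1 x L, presence/absence)
# times A (L x E) yields a 1 x E vector whose j-th entry measures the
# overlap between the instance and training context j.
def to_context_space(tokens, word_idx, A):
    q = np.zeros(A.shape[0])
    for w in tokens:
        if w in word_idx:            # out-of-vocabulary words are dropped
            q[word_idx[w]] = 1.0     # presence/absence weights, as in the paper
    return q @ A

qc = to_context_space(["river", "water", "fish"], word_idx, A)  # 1 x E
```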
Representing senses in the context space:
Let sen_{ik} be the representation of the k-th candidate sense of the ambiguous lemma lem_i. It is of size 1×E, with sen_{ik}[j] = 1 if lemma lem_i is used with sense sen_{ik} in training context j, and 0 otherwise.
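Continuing the sketch, the binary sense vectors can be read off the sense annotations of the training contexts (the labels below are hypothetical):

```python
# One (possibly empty) sense label per training context for the lemma "bank".
sense_labels = ["bank_river", "bank_finance", None]

def sense_vector(sense, sense_labels):
    # 1 x E binary vector: 1 where the context is annotated with this sense
    return np.array([1.0 if lab == sense else 0.0 for lab in sense_labels])

sen_river = sense_vector("bank_river", sense_labels)      # [1, 0, 0]
sen_finance = sense_vector("bank_finance", sense_labels)  # [0, 1, 0]
```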
Assigning a sense to the ambiguous lemma:
For a new context of the ambiguous lemma lem_i, the candidate sense with the highest similarity is selected.
Similarity Measures [sim(sen, q)]
Two similarity measures are compared (a code sketch follows the list below). The first (maximum) is the similarity of q, as a bag of words, with the training contexts of sense sen. The second (cosine) is the similarity of sense sen with q in the context space.
- Maximum: $\max_{j=1}^{N} \left( sen_j \cdot q_j \right)$
- Cosine: $\frac{sen \cdot q}{\lVert sen \rVert \, \lVert q \rVert}$
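A sketch of both measures and the resulting sense assignment, applied to the projected instance qc from the example above:

```python
def sim_maximum(sen, qc):
    # score of the best single training context of this sense
    return float(np.max(sen * qc))

def sim_cosine(sen, qc):
    denom = np.linalg.norm(sen) * np.linalg.norm(qc)
    return float(sen @ qc) / denom if denom > 0 else 0.0

candidates = {"bank_river": sen_river, "bank_finance": sen_finance}
best = max(candidates, key=lambda s: sim_cosine(candidates[s], qc))
print(best)  # "bank_river" for the toy instance above
```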
Observations/Suggestions from the above experiments:
- Almost all the results improve when the (cosine) similarity measure is applied in the context space. The exception is the use of co-occurrences to disambiguate nouns.
- If sense sen_1 has two training contexts with the highest number of co-occurrences and sense sen_2 has only one with the same number of co-occurrences, sen_1 should receive a higher similarity score.
Using the above ideas, they propose the following (sketched in code after the list):
- Artificially reducing the number of co-occurrences: if c_1 and c_2 are the contexts with the highest and second-highest number of co-occurrences with q, then assign to the first context c_1 the number of co-occurrences of context c_2.
- Modified similarity: $\sum_{j=1}^{N} sen_j \cdot N^{q_j}$
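A sketch of one reading of these two proposals; reduce_top and sim_modified are my names, and N is taken to be the number of training contexts:

```python
def reduce_top(qc):
    # cap the highest co-occurrence count at the second-highest value
    qc = qc.copy()
    order = np.argsort(qc)
    qc[order[-1]] = qc[order[-2]]
    return qc

def sim_modified(sen, qc, N):
    # sum_j sen_j * N**q_j: base-N weighting, so two contexts tied at the
    # top co-occurrence count outscore any single context with that count
    return float(np.sum(sen * (N ** qc)))

score = sim_modified(sen_river, reduce_top(qc), N=len(qc))
```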
Conclusions:
- The idea of using SemCor for an exemplar-based approach via the term-document matrix is interesting and intuitive.
- The paper ignores the WordNet semantic structure and depends entirely on annotated data. The key point is that it does not generalize to unseen text/ambiguous words.
References:
[1] Artiles et al., Word Sense Disambiguation based on Term to Term Similarity in a Context Space.