The Similarity Library aims to provide developers with a library for assessing similarity between both words and sentences. This library is an extension of the JWSL (Java WordNet Similarity Library). The current implementation offers two categories of similarity measures between words:
- measures exploiting ontologies such as WordNet, MeSH or the Gene Ontology
- measures exploiting search engines.
In the future, we aim to integrate measures that exploit Wikipedia; anyone who wants to contribute to this is welcome to participate. Measures based on concept glosses also remain to be integrated.
For WordNet, MeSH and the Gene Ontology, the library implements the following measures:
- Rada et al.
- Wu & Palmer
- Leacock & Chodorow
- Li et al.
Information Content based:
- Jiang & Conrath
Features and Information Content based:
- Pirrò & Seco
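To illustrate what an ontology-based measure from the lists above computes, here is a minimal sketch of the Wu & Palmer formula, 2 * depth(lcs) / (depth(a) + depth(b)), on a toy taxonomy. The class ToyTaxonomy and its methods are hypothetical and are not the library's API. The sketch assumes each concept has a single parent and that depth is counted in nodes, so the root has depth 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy is-a taxonomy (hypothetical, not the library's API): each concept has one parent. */
class ToyTaxonomy {
    private final Map<String, String> parentOf = new HashMap<>();

    /** Registers the relation "child is-a parent". */
    void addIsA(String child, String parent) {
        parentOf.put(child, parent);
    }

    /** Path from the concept up to the root, with the concept itself first. */
    List<String> pathToRoot(String concept) {
        List<String> path = new ArrayList<>();
        for (String c = concept; c != null; c = parentOf.get(c)) {
            path.add(c);
        }
        return path;
    }

    /** Depth counted in nodes (assumption): the root has depth 1. */
    int depth(String concept) {
        return pathToRoot(concept).size();
    }

    /** Deepest concept that subsumes both arguments, or null if there is none. */
    String leastCommonSubsumer(String a, String b) {
        List<String> ancestorsOfB = pathToRoot(b);
        for (String c : pathToRoot(a)) {
            if (ancestorsOfB.contains(c)) {
                return c; // first hit while walking upward is the deepest shared ancestor
            }
        }
        return null;
    }

    /** Wu & Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b)). */
    double wuPalmer(String a, String b) {
        String lcs = leastCommonSubsumer(a, b);
        if (lcs == null) {
            return 0.0;
        }
        return 2.0 * depth(lcs) / (depth(a) + depth(b));
    }
}
```

With the relations animal is-a entity, vehicle is-a entity, dog is-a animal, cat is-a animal and car is-a vehicle, dog and cat score 4/6 (about 0.67), while dog and car score 2/6 (about 0.33). Real ontologies such as WordNet allow multiple parents, which this sketch does not handle.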
A note about Information Content
The last two categories of measures exploit the intrinsic information content (Seco et al., ECAI 2004), which makes it possible to obtain information content values directly from the ontology structure. Another formulation, called extended information content (Pirrò & Euzenat, ISWC 2010), which takes into account relations beyond specialization, is also implemented.
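As a rough illustration of how information content can be read off the ontology structure, the sketch below computes the intrinsic information content in the formulation of Seco et al., IC(c) = 1 - log(hypo(c) + 1) / log(N), where hypo(c) is the number of hyponyms of c and N is the total number of concepts. The class IntrinsicIC and its methods are hypothetical names, not the library's API, and the taxonomy is assumed to contain at least two concepts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Intrinsic information content on a toy taxonomy (hypothetical, not the library's API). */
class IntrinsicIC {
    private final Map<String, List<String>> childrenOf = new HashMap<>();
    private final Set<String> concepts = new HashSet<>();

    /** Registers the relation "child is-a parent". */
    void addIsA(String child, String parent) {
        childrenOf.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
        concepts.add(child);
        concepts.add(parent);
    }

    /** Number of distinct concepts below the given one (its hyponyms). */
    int hyponymCount(String concept) {
        Set<String> seen = new HashSet<>();
        collect(concept, seen);
        return seen.size();
    }

    /** Depth-first walk that records every descendant exactly once. */
    private void collect(String concept, Set<String> seen) {
        List<String> children =
                childrenOf.getOrDefault(concept, Collections.<String>emptyList());
        for (String child : children) {
            if (seen.add(child)) {
                collect(child, seen);
            }
        }
    }

    /** IC(c) = 1 - log(hypo(c) + 1) / log(N); assumes N >= 2. */
    double ic(String concept) {
        return 1.0 - Math.log(hyponymCount(concept) + 1) / Math.log(concepts.size());
    }
}
```

Under this formulation, leaf concepts get the maximum value 1 and the root gets 0, so more specific concepts are treated as more informative. No corpus frequencies are needed.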
For sentence similarity, the library implements an approach inspired by the maximum weighted matching problem in a bipartite graph. It consists of comparing all the words in the two sentences to find their best coupling. This approach can use any of the implemented word similarity measures.
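The coupling idea can be sketched as follows. Note that this is a greedy approximation, not an optimal maximum weighted matching, and it is not the library's implementation: it scores every word pair, repeatedly keeps the best pair whose words are both still free, and averages the kept scores. The class name GreedySentenceSimilarity, the pluggable wordSim function and the normalization by the longer sentence's length are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Greedy word coupling between two sentences (illustrative sketch, not the library's API). */
class GreedySentenceSimilarity {

    /** Candidate coupling of word i of the first sentence with word j of the second. */
    private static final class Pair {
        final int i;
        final int j;
        final double score;

        Pair(int i, int j, double score) {
            this.i = i;
            this.j = j;
            this.score = score;
        }
    }

    /**
     * Couples each word with at most one word of the other sentence, best pairs first,
     * and normalizes the summed scores by the longer sentence's length (assumption).
     */
    static double similarity(List<String> s1, List<String> s2,
                             ToDoubleBiFunction<String, String> wordSim) {
        if (s1.isEmpty() || s2.isEmpty()) {
            return 0.0;
        }
        // Score every edge of the bipartite graph between the two word sets.
        List<Pair> pairs = new ArrayList<>();
        for (int i = 0; i < s1.size(); i++) {
            for (int j = 0; j < s2.size(); j++) {
                pairs.add(new Pair(i, j, wordSim.applyAsDouble(s1.get(i), s2.get(j))));
            }
        }
        // Highest-scoring pairs first.
        pairs.sort((a, b) -> Double.compare(b.score, a.score));

        boolean[] usedLeft = new boolean[s1.size()];
        boolean[] usedRight = new boolean[s2.size()];
        double total = 0.0;
        for (Pair p : pairs) {
            if (!usedLeft[p.i] && !usedRight[p.j]) {
                usedLeft[p.i] = true;
                usedRight[p.j] = true;
                total += p.score;
            }
        }
        return total / Math.max(s1.size(), s2.size());
    }
}
```

Any word-level measure returning a score for two words can be passed as wordSim, which mirrors the statement above that the sentence approach is independent of the chosen word measure. An exact solution would replace the greedy loop with an assignment algorithm such as the Hungarian method.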