Using String Similarity Metrics for Terminology Recognition Jonathan Butters March 2008 LREC 2008 – Marrakech, Morocco
Introduction - Terms • Objects are discrete; the terms people use to describe them usually are not! • Different groups of people tend to use different terms to refer to identical objects – sublanguage (Harris, 1968) • Terms can differ due to orthographic differences, abbreviations, acronyms and synonyms • Example terms for one object: Icosahedron, Football, Foot-ball, Soccerball, Footy • Example groups: Design, Personnel, Maintenance, Other Countries!
Introduction – Relating Terms • There are many applications where the ability to relate different terms would be useful, e.g. matching component concepts, predictive text suggestions, reducing lists of options • String similarity metrics can be used to relate terms • String similarity metrics inherently take into consideration aspects such as: word order, acronyms, abbreviations
An Example Application • We have a list of locations • Some are similar • Most are dissimilar (irrelevant) • How do we choose the most similar? • Top 10? Top 100? Top 1000? Top 10%?
Introduction – Selecting Terms • Background in aerospace engineering, specialising in avionics • Electronic noise is a problem – but it can be filtered! • Can dissimilar string matches be identified as noise? • Can this noise be removed automatically?
Introduction – Similarity Metrics • String metrics automatically calculate how similar (or dissimilar) two strings are: • Two strings are identical if they have the same characters in the same order • Each similarity metric assigns a numeric value based upon the relative similarity of the two strings • Metrics are broadly vector-based or cost-based
Metric Selection - Examples • Query String = “language resources and evaluation conference 2008” • String A = “language resources and evaluation conference 2009” • String B = “lrec 2008”
Metric Selection - SimMetrics • SimMetrics – Java library of 23 string similarity metrics • Developed at the University of Sheffield (Chapman, 2004) • Outputs a normalised similarity score!
Metric Selection - Investigation • Investigation focused on Aerospace domain terms • Reduce list of components presented to user • 298 automatically extracted sublanguage engine component terms • 513 official component terms • The similarity of each combination of 298 terms was calculated... 298C2 = 44253 comparisons • Carried out for each of the 23 metrics in SimMetrics
Metric Selection - Investigation • For each metric, every string pair (and its score) was ordered by decreasing similarity • A few string pairs scored highly, spread across a wide similarity band • The vast majority scored low • Similarity scores were divided into bands, and the number of string pairs falling within each band was totalled • The resulting distributions were Gaussian or Dirac-like, depending on the scoring mechanism of the similarity metric
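The banding procedure can be mimicked end-to-end on synthetic data – random strings standing in for the 298 extracted terms, and `difflib`'s ratio standing in for one of the 23 SimMetrics metrics. Everything here is an illustrative stand-in, not the original experiment.

```python
import difflib
import itertools
import random
import string

random.seed(0)
# Stand-in "terms": random short strings play the role of extracted component terms
terms = ["".join(random.choices(string.ascii_lowercase, k=random.randint(5, 15)))
         for _ in range(60)]

# Score every combination of terms: nC2 comparisons
scores = [difflib.SequenceMatcher(None, a, b).ratio()
          for a, b in itertools.combinations(terms, 2)]

# Band the scores into ten 0.1-wide bins and total the pairs in each band
bands = [0] * 10
for s in scores:
    bands[min(int(s * 10), 9)] += 1
print(bands)  # most mass sits in the low-similarity bands
```

With 60 terms this already performs 60C2 = 1770 comparisons, which is why the full 298-term study needed 44253 per metric.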
Metric Selection - Results • Dirac distributions • Gaussian distributions
Metric Selection - Levenshtein • Levenshtein was chosen because: • Jaro-Winkler gave consistently high scores to unrelated strings • Levenshtein grouped dissimilar strings towards the lower end of the scale, leaving the more similar strings spread over a wider range
Metric Selection - Example “Starter Valve” & “Tail Bearing Oil Feed Tube ASSY.” “Air Oil Heat Exchanger” & “Air/Oil Heat Exchanger”
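A minimal Levenshtein implementation makes the slide's two example pairs concrete. Normalising the edit distance by the longer string's length is one common choice and an assumption here; SimMetrics may normalise differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert/delete/substitute, each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution / match
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalise to [0, 1]: 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

# One edit (space → "/") across 22 characters → 21/22 ≈ 0.9545
print(similarity("Air Oil Heat Exchanger", "Air/Oil Heat Exchanger"))
# Unrelated component names score far lower
print(similarity("Starter Valve", "Tail Bearing Oil Feed Tube ASSY."))
```

The dissimilar pair can never score above 1 − 19/32 ≈ 0.41 here, simply because the strings differ in length by 19 characters.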
Noise Detection & Removal • The peak is formed by the strings that are dissimilar • If two random strings are compared, they will have a random similarity score • As there are many randomly similar string pairs, their scores form a Gaussian noise pattern... • Virtually all (≈99.997%) of a normally distributed variable falls below four standard deviations above the mean
Noise Detection & Removal • String pairs that scored outside the randomly distributed scores were, by definition, not randomly distributed! • Such pairs tended to contain terms that were genuinely relevant to one another • The noise peak can therefore be located and removed by disregarding all similarity scores below four standard deviations above the mean:
Noise Detection & Removal A standard Gaussian (normal) distribution
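The four-standard-deviation cut-off can be sketched end-to-end on synthetic data: random strings act as the noise, two planted near-matches act as relevant terms, and `difflib`'s ratio stands in for Levenshtein. All names and data here are illustrative assumptions, not the original dataset.

```python
import difflib
import random
import statistics
import string

random.seed(1)

def sim(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a, b).ratio()

def rand_term(k: int) -> str:
    return "".join(random.choices(string.ascii_lowercase, k=k))

query = "air oil heat exchanger"
# 500 unrelated candidates (the noise) plus two genuinely similar terms
candidates = [rand_term(random.randint(8, 25)) for _ in range(500)]
candidates += ["air/oil heat exchanger", "air oil heat exchanger assy"]

scores = {c: sim(query, c) for c in candidates}
mean = statistics.mean(scores.values())
sd = statistics.pstdev(scores.values())
cutoff = mean + 4 * sd  # noise threshold: random matches fall below this

kept = [c for c, s in scores.items() if s > cutoff]
print(cutoff, kept)
```

Because the threshold is derived from the data itself, it scales with the candidate list: a longer or noisier list simply produces a different mean and deviation, with no hand-tuned "top n" required.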
Shorter Terms • Although the dataset contained mostly long strings, the noise-removal method remains effective for the shorter strings within it • A shorter term matches the longer, more typical strings only by chance, so those scores join the random noise distribution • The mean similarity therefore tends to be lower, and the cut-off automatically reduces with it, so genuinely similar shorter strings still fall above the automatic cut-off
Noise Detection & Removal • Advantages of this automatic method: • Scales with the size of the source data – a fixed top-10 may include irrelevant results or exclude relevant ones! • Can be used to pick out strings that are more similar than, or stand out from, the rest of the strings
Results • The 298 extracted terms were compared against each of the 513 official terms • After the noise was automatically removed, more than one relevant result was sometimes suggested; in those cases the first n results were considered
Example – List Reduction • List of 10028 unique UK locations • Query checked against list • Noise removed
Conclusions • Dissimilar string matches can be modelled as a noise pattern • The noise pattern can be removed! • Methodology is applicable to any set of strings • Not only for Aerospace domain terms! • Method is scalable • Can be used to automatically remove obviously incorrect matches • Provides users with fewer options – faster selection! • Can be used to extract strings that are more similar than, or stand out from the rest
Future Work • Integrate the approach into further applications • Form filling • Improve the similarity metrics • Domain-specific datasets (aerospace) • Stop words, mutually exclusive words • Combine metrics to break ties
References • Butters, Jonathan (2007). A Terminology Recognizer for the Aerospace Domain. Master's Thesis, The University of Sheffield. http://www.dcs.shef.ac.uk/~jdb/papers.html • Harris, Z. (1968). Mathematical Structures of Language. John Wiley & Sons, New York. • Chapman, Sam (2004). SimMetrics. http://www.dcs.shef.ac.uk/~sam/simmetrics.html