- java.lang.Object
-
- com.pervasive.datarush.matching.functions.Similarity
-
public class Similarity extends Object
A collection of functions for computing the similarity of strings. These functions are used to perform "fuzzy" matches between values, attempting to account for common entry errors or other equivalences. These functions return a floating point value normalized to the range [0,1], with 0 representing no similarity at all and 1 representing an exact match. Null-valued inputs are considered totally dissimilar to any other string (including the null value) and will always return 0.
-
-
Field Summary
Fields
| Modifier and Type | Field | Description |
| --- | --- | --- |
| static int | DEFAULT_MAX_DISTANCE | The default maximum distance for the positional q-gram measure. |
| static int | DEFAULT_PREFIX_LENGTH | The default prefix length for the Jaro-Winkler measure. |
| static int | DEFAULT_Q | The default q-gram size for the q-gram and positional q-gram measures. |
| static float | DEFAULT_SCALING_FACTOR | The default scaling factor for the Jaro-Winkler measure. |
| static String | PROP_MAX_DISTANCE | Name of the property specifying maximum distance for the positional q-gram measure. |
| static String | PROP_PREFIX_LENGTH | Name of the property specifying prefix length for the Jaro-Winkler measure. |
| static String | PROP_Q | Name of the property specifying the q-gram size for the q-gram and positional q-gram measures. |
| static String | PROP_SCALING_FACTOR | Name of the property specifying scaling factor for the Jaro-Winkler measure. |
-
Constructor Summary
Constructors
| Constructor | Description |
| --- | --- |
| Similarity() | |
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| static ScalarValuedFunction | contains(ScalarValuedFunction left, ScalarValuedFunction right) | Builds a function testing whether either of the specified string valued expressions contains the other. |
| static ScalarValuedFunction | contains(String left, String right) | Builds a function testing whether either of the specified fields has a string value containing the other. |
| static ScalarValuedFunction | damerauLevenshtein(ScalarValuedFunction left, ScalarValuedFunction right) | Builds a function computing the Damerau-Levenshtein distance between the specified string valued expressions. |
| static ScalarValuedFunction | damerauLevenshtein(String left, String right) | Builds a function computing the Damerau-Levenshtein distance between the string values of the specified fields. |
| static ScalarValuedFunction | exact(ScalarValuedFunction left, ScalarValuedFunction right) | Builds a function testing whether the specified string valued expressions are equal. |
| static ScalarValuedFunction | exact(String left, String right) | Builds a function testing exact equality between the string values of the specified fields. |
| static ScalarValuedFunction | jaro(ScalarValuedFunction left, ScalarValuedFunction right) | Builds a function computing the Jaro distance between the specified string valued expressions. |
| static ScalarValuedFunction | jaro(String left, String right) | Builds a function computing the Jaro distance between the string values of the specified fields. |
| static ScalarValuedFunction | jaroWinkler(ScalarValuedFunction left, ScalarValuedFunction right, int prefixLen, float scaling) | Builds a function computing the Jaro-Winkler distance between the specified string valued expressions. |
| static ScalarValuedFunction | jaroWinkler(String left, String right, int prefixLen, float scaling) | Builds a function computing the Jaro-Winkler distance between the string values of the specified fields. |
| static ScalarValuedFunction | levenshtein(ScalarValuedFunction left, ScalarValuedFunction right) | Builds a function computing the Levenshtein distance between the specified string valued expressions. |
| static ScalarValuedFunction | levenshtein(String left, String right) | Builds a function computing the Levenshtein distance between the string values of the specified fields. |
| static ScalarValuedFunction | positionalQgram(ScalarValuedFunction left, ScalarValuedFunction right, int q, int maxDist) | Builds a function computing the percentage of q-grams in common between the specified string valued expressions. |
| static ScalarValuedFunction | positionalQgram(String left, String right, int q, int maxDist) | Builds a function computing the percentage of q-grams in common between the string values of the specified fields. |
| static ScalarValuedFunction | proximity(ScalarValuedFunction left, ScalarValuedFunction right) | Builds a function computing an adjusted quotient of the numeric values of the specified expressions. |
| static ScalarValuedFunction | proximity(String left, String right) | Builds a function computing an adjusted quotient of the numeric values of the specified fields. |
| static ScalarValuedFunction | qgram(ScalarValuedFunction left, ScalarValuedFunction right, int q) | Builds a function computing the percentage of q-grams in common between the specified string valued expressions. |
| static ScalarValuedFunction | qgram(String left, String right, int q) | Builds a function computing the percentage of q-grams in common between the string values of the specified fields. |
| static ScalarValuedFunction | shorthand(ScalarValuedFunction left, ScalarValuedFunction right) | Builds a function testing shorthand equivalence between the specified string valued expressions. |
| static ScalarValuedFunction | shorthand(String left, String right) | Builds a function testing shorthand equivalence between the string values of the specified fields. |
-
-
-
Field Detail
-
PROP_Q
public static final String PROP_Q
Name of the property specifying the q-gram size for the q-gram and positional q-gram measures.
- See Also:
- Constant Field Values
-
DEFAULT_Q
public static final int DEFAULT_Q
The default q-gram size for the q-gram and positional q-gram measures.
- See Also:
- Constant Field Values
-
PROP_MAX_DISTANCE
public static final String PROP_MAX_DISTANCE
Name of the property specifying maximum distance for the positional q-gram measure.
- See Also:
- Constant Field Values
-
DEFAULT_MAX_DISTANCE
public static final int DEFAULT_MAX_DISTANCE
The default maximum distance for the positional q-gram measure.
- See Also:
- Constant Field Values
-
PROP_PREFIX_LENGTH
public static final String PROP_PREFIX_LENGTH
Name of the property specifying prefix length for the Jaro-Winkler measure.
- See Also:
- Constant Field Values
-
PROP_SCALING_FACTOR
public static final String PROP_SCALING_FACTOR
Name of the property specifying scaling factor for the Jaro-Winkler measure.
- See Also:
- Constant Field Values
-
DEFAULT_PREFIX_LENGTH
public static final int DEFAULT_PREFIX_LENGTH
The default prefix length for the Jaro-Winkler measure.
- See Also:
- Constant Field Values
-
DEFAULT_SCALING_FACTOR
public static final float DEFAULT_SCALING_FACTOR
The default scaling factor for the Jaro-Winkler measure.
- See Also:
- Constant Field Values
-
-
Method Detail
-
contains
public static ScalarValuedFunction contains(String left, String right)
Builds a function testing whether either of the specified fields has a string value containing the other. This function returns 1 if one value is contained in the other, 0 if not.
- Parameters:
- left - the first record field
- right - the second record field
- Returns:
- the specified function
-
contains
public static ScalarValuedFunction contains(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function testing whether either of the specified string valued expressions contains the other. This function returns 1 if one value is contained in the other, 0 if not.
- Parameters:
- left - the first string valued expression
- right - the second string valued expression
- Returns:
- the specified function
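The containment test described above can be illustrated in plain Java. This is a minimal sketch under stated assumptions, not the library's implementation: the class `ContainsSketch` and its `score` method are hypothetical names, and the treatment of empty strings simply follows `String.contains`, since the documentation does not say how the library handles them.

```java
// Hypothetical helper for illustration; not part of the Similarity API.
final class ContainsSketch {
    /** Returns 1 if either string contains the other, 0 otherwise. */
    static float score(String left, String right) {
        if (left == null || right == null) {
            return 0f; // nulls are totally dissimilar, per the class description
        }
        return (left.contains(right) || right.contains(left)) ? 1f : 0f;
    }
}
```

The check is symmetric, so the order of the two arguments does not matter.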
-
damerauLevenshtein
public static ScalarValuedFunction damerauLevenshtein(String left, String right)
Builds a function computing the Damerau-Levenshtein distance between the string values of the specified fields. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
- Parameters:
- left - the first record field
- right - the second record field
- Returns:
- the specified function
-
damerauLevenshtein
public static ScalarValuedFunction damerauLevenshtein(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function computing the Damerau-Levenshtein distance between the specified string valued expressions. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
- Parameters:
- left - the first string valued expression
- right - the second string valued expression
- Returns:
- the specified function
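To make the measure concrete, here is a minimal plain-Java sketch. It uses the optimal string alignment variant of Damerau-Levenshtein (insertions, deletions, substitutions, and transpositions of adjacent characters), which may differ from the variant the library uses. The documentation does not state how the normalized distance becomes a score where 1 means an exact match, so the sketch assumes `1 - distance / maxLength`. The class and method names are hypothetical.

```java
// Hypothetical helper for illustration; not part of the Similarity API.
final class DamerauLevenshteinSketch {
    /** Optimal string alignment distance between a and b. */
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                // deletion, insertion, substitution
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                // transposition of two adjacent characters
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    /** Assumed scoring: 1 - distance / length of the longer string. */
    static float similarity(String a, String b) {
        if (a == null || b == null) return 0f; // nulls never match
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return 1f; // assumption: two empty strings match exactly
        return 1f - (float) distance(a, b) / maxLen;
    }
}
```

Under these assumptions, "abcd" and "abdc" differ by a single transposition, giving a score of 0.75.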
-
exact
public static ScalarValuedFunction exact(String left, String right)
Builds a function testing exact equality between the string values of the specified fields. This function returns 1 if the values are equal, 0 if not.
- Parameters:
- left - the first record field
- right - the second record field
- Returns:
- the specified function
-
exact
public static ScalarValuedFunction exact(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function testing whether the specified string valued expressions are equal. This function returns 1 if the values are equal, 0 if not.
- Parameters:
- left - the first string valued expression
- right - the second string valued expression
- Returns:
- the specified function
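The equality test, combined with the null rule from the class description, can be sketched as follows. This is an illustration only: `ExactSketch` is a hypothetical name, and the sketch assumes a case-sensitive comparison via `String.equals`, which the documentation does not confirm.

```java
// Hypothetical helper for illustration; not part of the Similarity API.
final class ExactSketch {
    /** Returns 1 if both strings are non-null and equal, 0 otherwise. */
    static float score(String left, String right) {
        if (left == null || right == null) {
            return 0f; // even two nulls are considered dissimilar
        }
        return left.equals(right) ? 1f : 0f; // assumption: case-sensitive
    }
}
```

Note that two null inputs score 0 here, which is the one case where this differs from an ordinary null-safe equality check.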
-
jaro
public static ScalarValuedFunction jaro(String left, String right)
Builds a function computing the Jaro distance between the string values of the specified fields.
- Parameters:
- left - the first record field
- right - the second record field
- Returns:
- the specified function
-
jaro
public static ScalarValuedFunction jaro(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function computing the Jaro distance between the specified string valued expressions.
- Parameters:
- left - the first string valued expression
- right - the second string valued expression
- Returns:
- the specified function
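For readers unfamiliar with the Jaro measure, the following plain-Java sketch shows the textbook formulation: characters match if they are equal and no farther apart than half the longer length minus one, and the score averages the match ratios of both strings with the fraction of matches that are not transposed. It is an assumption that the library follows this formulation exactly; `JaroSketch` is a hypothetical name.

```java
// Hypothetical helper for illustration; not part of the Similarity API.
final class JaroSketch {
    static float similarity(String s, String t) {
        if (s == null || t == null) return 0f; // nulls never match
        if (s.equals(t)) return 1f;
        int sLen = s.length(), tLen = t.length();
        if (sLen == 0 || tLen == 0) return 0f;
        // Matching characters may be at most this far apart.
        int window = Math.max(0, Math.max(sLen, tLen) / 2 - 1);
        boolean[] sMatched = new boolean[sLen];
        boolean[] tMatched = new boolean[tLen];
        int matches = 0;
        for (int i = 0; i < sLen; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(tLen - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                    sMatched[i] = true;
                    tMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0f;
        // Count matched characters that appear in a different order.
        int k = 0, outOfOrder = 0;
        for (int i = 0; i < sLen; i++) {
            if (!sMatched[i]) continue;
            while (!tMatched[k]) k++;
            if (s.charAt(i) != t.charAt(k)) outOfOrder++;
            k++;
        }
        int transpositions = outOfOrder / 2;
        double m = matches;
        return (float) ((m / sLen + m / tLen + (m - transpositions) / m) / 3.0);
    }
}
```

With this formulation, "MARTHA" and "MARHTA" have six matches and one transposition, for a score of about 0.944.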
-
jaroWinkler
public static ScalarValuedFunction jaroWinkler(String left, String right, int prefixLen, float scaling)
Builds a function computing the Jaro-Winkler distance between the string values of the specified fields.
- Parameters:
- left - the first record field
- right - the second record field
- prefixLen - the maximum length of the common prefix for score adjustment purposes
- scaling - the scaling factor for scoring common prefixes
- Returns:
- the specified function
-
jaroWinkler
public static ScalarValuedFunction jaroWinkler(ScalarValuedFunction left, ScalarValuedFunction right, int prefixLen, float scaling)
Builds a function computing the Jaro-Winkler distance between the specified string valued expressions.
- Parameters:
- left - the first string valued expression
- right - the second string valued expression
- prefixLen - the maximum length of the common prefix for score adjustment purposes
- scaling - the scaling factor for scoring common prefixes
- Returns:
- the specified function
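The roles of `prefixLen` and `scaling` are easiest to see in the standard Winkler adjustment, which boosts a Jaro score according to the length of the shared prefix. The sketch below assumes that standard formula, `jaro + prefix * scaling * (1 - jaro)`, and takes an already computed Jaro score as input to stay short. `JaroWinklerSketch` and `adjust` are hypothetical names, and the values 4 and 0.1 used in the usage note are the conventional choices from Winkler's formulation, not necessarily the values of DEFAULT_PREFIX_LENGTH and DEFAULT_SCALING_FACTOR.

```java
// Hypothetical helper for illustration; not part of the Similarity API.
final class JaroWinklerSketch {
    /**
     * Applies the Winkler prefix adjustment to a precomputed Jaro score.
     * Both strings are assumed to be non-null.
     */
    static float adjust(float jaro, String s, String t, int prefixLen, float scaling) {
        // Only the first prefixLen characters can contribute to the boost.
        int limit = Math.min(prefixLen, Math.min(s.length(), t.length()));
        int common = 0;
        while (common < limit && s.charAt(common) == t.charAt(common)) {
            common++;
        }
        return jaro + common * scaling * (1f - jaro);
    }
}
```

For example, with a Jaro score of 0.9444 for "MARTHA" and "MARHTA", a prefix length of 4, and a scaling factor of 0.1, the shared prefix "MAR" raises the score to about 0.961. Strings with no common prefix keep their Jaro score unchanged.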
-
levenshtein
public static ScalarValuedFunction levenshtein(String left, String right)
Builds a function computing the Levenshtein distance between the string values of the specified fields. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
- Parameters:
- left - the first record field
- right - the second record field
- Returns:
- the specified function
-
levenshtein
public static ScalarValuedFunction levenshtein(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function computing the Levenshtein distance between the specified string valued expressions. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
- Parameters:
- left - the first string valued expression
- right - the second string valued expression
- Returns:
- the specified function
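The normalization described above can be sketched in plain Java with the classic dynamic-programming edit distance. The names `LevenshteinSketch`, `normalizedDistance`, and `similarity` are hypothetical. The `similarity` method assumes the score is `1 - normalizedDistance`, so that an exact match yields 1 as the class description requires; the documentation does not spell out this step.

```java
// Hypothetical helper for illustration; not part of the Similarity API.
final class LevenshteinSketch {
    /** Edit distance counting insertions, deletions, and substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Distance divided by the length of the longer string. */
    static float normalizedDistance(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 0f : (float) distance(a, b) / maxLen;
    }

    /** Assumed scoring: 1 means identical, 0 means nothing in common. */
    static float similarity(String a, String b) {
        if (a == null || b == null) return 0f; // nulls never match
        return 1f - normalizedDistance(a, b);
    }
}
```

Unlike the Damerau-Levenshtein measure, a swap of two adjacent characters costs two edits here, so "ca" and "ac" have distance 2.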
-
qgram
public static ScalarValuedFunction qgram(String left, String right, int q)
Builds a function computing the percentage of q-grams in common between the string values of the specified fields. This is the count of q-grams in common divided by the number of possible q-grams in the longer string.
- Parameters:
- left - the first record field
- right - the second record field
- q - the size of q-grams to compare
- Returns:
- the specified function
-
qgram
public static ScalarValuedFunction qgram(ScalarValuedFunction left, ScalarValuedFunction right, int q)
Builds a function computing the percentage of q-grams in common between the specified string valued expressions. This is the count of q-grams in common divided by the number of possible q-grams in the longer string.
- Parameters:
- left - the first string valued expression
- right - the second string valued expression
- q - the size of q-grams to compare
- Returns:
- the specified function
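As an illustration of the ratio described above, the following plain-Java sketch counts shared q-grams as a multiset intersection and divides by the number of q-grams in the longer string. It assumes no padding, so a string of length n has n - q + 1 q-grams; the library may pad its inputs or count differently. `QgramSketch` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Similarity API.
final class QgramSketch {
    static float similarity(String a, String b, int q) {
        if (a == null || b == null) return 0f; // nulls never match
        if (q < 1) throw new IllegalArgumentException("q must be at least 1");
        String longer = a.length() >= b.length() ? a : b;
        String shorter = a.length() >= b.length() ? b : a;
        int possible = longer.length() - q + 1; // assumption: no padding
        if (possible <= 0) return 0f; // assumption: too short to form a q-gram
        // Count each q-gram of the longer string.
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + q <= longer.length(); i++) {
            counts.merge(longer.substring(i, i + q), 1, Integer::sum);
        }
        // Each q-gram of the shorter string consumes one matching occurrence.
        int common = 0;
        for (int i = 0; i + q <= shorter.length(); i++) {
            String gram = shorter.substring(i, i + q);
            Integer remaining = counts.get(gram);
            if (remaining != null && remaining > 0) {
                counts.put(gram, remaining - 1);
                common++;
            }
        }
        return (float) common / possible;
    }
}
```

With q = 2, "night" and "nacht" share only the bigram "ht" out of four possible, giving 0.25.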
-
positionalQgram
public static ScalarValuedFunction positionalQgram(String left, String right, int q, int maxDist)
Builds a function computing the percentage of q-grams in common between the string values of the specified fields. Q-grams are only counted as common if they appear near the position at which they appear in the longer string. The resulting count is then divided by the number of possible q-grams in the longer string.
- Parameters:
- left - the first record field
- right - the second record field
- q - the size of q-grams to compare
- maxDist - the maximum distance, in characters, a q-gram can be from its original position
- Returns:
- the specified function
-
positionalQgram
public static ScalarValuedFunction positionalQgram(ScalarValuedFunction left, ScalarValuedFunction right, int q, int maxDist)
Builds a function computing the percentage of q-grams in common between the specified string valued expressions. Q-grams are only counted as common if they appear near the position at which they appear in the longer string. The resulting count is then divided by the number of possible q-grams in the longer string.
- Parameters:
- left - the first string valued expression
- right - the second string valued expression
- q - the size of q-grams to compare
- maxDist - the maximum distance, in characters, a q-gram can be from its original position
- Returns:
- the specified function
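The effect of `maxDist` can be seen in a small plain-Java sketch. Each q-gram of the shorter string is matched greedily against an unused, identical q-gram of the longer string whose start position differs by at most `maxDist`. The greedy matching, the lack of padding, and the name `PositionalQgramSketch` are all assumptions made for illustration; the library's actual matching strategy is not documented here.

```java
// Hypothetical helper for illustration; not part of the Similarity API.
final class PositionalQgramSketch {
    static float similarity(String a, String b, int q, int maxDist) {
        if (a == null || b == null) return 0f; // nulls never match
        if (q < 1) throw new IllegalArgumentException("q must be at least 1");
        String longer = a.length() >= b.length() ? a : b;
        String shorter = a.length() >= b.length() ? b : a;
        int possible = longer.length() - q + 1; // assumption: no padding
        if (possible <= 0) return 0f;
        boolean[] used = new boolean[possible]; // q-grams of longer already matched
        int common = 0;
        for (int i = 0; i + q <= shorter.length(); i++) {
            // Only positions within maxDist of i are eligible.
            int lo = Math.max(0, i - maxDist);
            int hi = Math.min(possible - 1, i + maxDist);
            for (int j = lo; j <= hi; j++) {
                if (!used[j] && longer.regionMatches(j, shorter, i, q)) {
                    used[j] = true;
                    common++;
                    break;
                }
            }
        }
        return (float) common / possible;
    }
}
```

For "abcdef" and "defabc" with q = 2, a `maxDist` of 1 finds no common q-grams because every shared bigram has moved three positions, while a `maxDist` of 3 matches four of the five possible bigrams.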
-
shorthand
public static ScalarValuedFunction shorthand(String left, String right)
Builds a function testing shorthand equivalence between the string values of the specified fields. This function returns 1 if the shorter value is a shorthand representation of the longer value, 0 if not. To be a shorthand representation, all characters in the shorter string must appear in the longer string in the same order. Additionally, if the longer string contains multiple words, a letter can only match if at least the first letter of its word was also matched. For example, "ABC" would not match "Acme SoftBall Corporation": the 'S' starting "SoftBall" is not matched, so the 'B' does not count as a match. However, "ABC" does match "Acme Soft Ball Corporation", as the 'B' starts a word.
- Parameters:
- left - the first record field
- right - the second record field
- Returns:
- the specified function
-
shorthand
public static ScalarValuedFunction shorthand(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function testing shorthand equivalence between the specified string valued expressions. This function returns 1 if the shorter value is a shorthand representation of the longer value, 0 if not. To be a shorthand representation, all characters in the shorter string must appear in the longer string in the same order. Additionally, if the longer string contains multiple words, a letter can only match if at least the first letter of its word was also matched. For example, "ABC" would not match "Acme SoftBall Corporation": the 'S' starting "SoftBall" is not matched, so the 'B' does not count as a match. However, "ABC" does match "Acme Soft Ball Corporation", as the 'B' starts a word.
- Parameters:
- left - the first string valued expression
- right - the second string valued expression
- Returns:
- the specified function
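The shorthand rule can be sketched as a small recursive search in plain Java. This is one possible reading of the rule, not the library's implementation: it assumes letters are compared case-insensitively, that words are separated by whitespace, and that the word-start restriction applies only when the longer string has more than one word. `ShorthandSketch` is a hypothetical name. The search backtracks, so it is suitable for short strings only.

```java
// Hypothetical helper for illustration; not part of the Similarity API.
final class ShorthandSketch {
    static float score(String left, String right) {
        if (left == null || right == null) return 0f; // nulls never match
        String shorter = left.length() <= right.length() ? left : right;
        String longer = left.length() <= right.length() ? right : left;
        boolean multiWord = longer.trim().chars().anyMatch(Character::isWhitespace);
        return matches(shorter, 0, longer, 0, false, multiWord) ? 1f : 0f;
    }

    /**
     * Tries to match s[si..] as an ordered subsequence of l[li..].
     * wordOpen is true if the first letter of the current word was matched.
     */
    private static boolean matches(String s, int si, String l, int li,
                                   boolean wordOpen, boolean multiWord) {
        if (si == s.length()) return true;  // every shorthand letter placed
        if (li == l.length()) return false; // ran out of longer string
        char c = l.charAt(li);
        boolean startsWord = !Character.isWhitespace(c)
                && (li == 0 || Character.isWhitespace(l.charAt(li - 1)));
        if (startsWord) wordOpen = false; // a new word begins unmatched
        boolean sameLetter =
                Character.toLowerCase(c) == Character.toLowerCase(s.charAt(si));
        boolean allowed = !multiWord || startsWord || wordOpen;
        // Option 1: use this character of the longer string.
        if (sameLetter && allowed && matches(s, si + 1, l, li + 1, true, multiWord)) {
            return true;
        }
        // Option 2: skip it and keep looking.
        return matches(s, si, l, li + 1, wordOpen, multiWord);
    }
}
```

Backtracking matters for inputs such as "abc" against "ab bc": taking the 'b' from the first word leaves the 'c' unreachable, whereas taking the 'b' that starts the second word succeeds.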
-
proximity
public static ScalarValuedFunction proximity(String left, String right)
Builds a function computing an adjusted quotient of the numeric values of the specified fields. The result is the smaller number divided by the larger. If exactly one of the values is zero, or if the signs differ, 0 is returned.
- Parameters:
- left - the first record field
- right - the second record field
- Returns:
- the specified function
-
proximity
public static ScalarValuedFunction proximity(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function computing an adjusted quotient of the numeric values of the specified expressions. The result is the smaller number divided by the larger. If exactly one of the values is zero, or if the signs differ, 0 is returned.
- Parameters:
- left - the first numeric valued expression
- right - the second numeric valued expression
- Returns:
- the specified function
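The adjusted quotient can be sketched in a few lines of plain Java. Two points are assumptions, since the documentation leaves them open: when both values are zero the sketch returns 1 (treating them as an exact match), and for two negative values it divides the smaller magnitude by the larger so the result stays in the range 0 to 1. `ProximitySketch` is a hypothetical name.

```java
// Hypothetical helper for illustration; not part of the Similarity API.
final class ProximitySketch {
    static double score(double left, double right) {
        if (left == 0.0 && right == 0.0) return 1.0; // assumption: both zero is a match
        if (left == 0.0 || right == 0.0) return 0.0; // exactly one value is zero
        if ((left < 0) != (right < 0)) return 0.0;   // signs differ
        double a = Math.abs(left);
        double b = Math.abs(right);
        return Math.min(a, b) / Math.max(a, b);      // smaller over larger magnitude
    }
}
```

For example, 50 and 100 score 0.5, as do -2 and -4, while 3 and -3 score 0 because their signs differ.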
-
-