java.lang.Object
com.pervasive.datarush.matching.functions.Similarity
A collection of functions for computing similarity of strings.
These functions are used to perform "fuzzy" matches between values,
attempting to account for common entry errors or other equivalences.
These functions return a floating-point value normalized to the range [0,1], where 0 represents no similarity at all and 1 represents an exact match. A null input is considered totally dissimilar to any other value (including another null), so any comparison involving a null always returns 0.
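The scoring convention described above can be illustrated with a small standalone sketch. It does not use the DataRush API; the class and method names are hypothetical, and it only models the stated rules: scores lie in [0,1] and a null on either side yields 0, even when both inputs are null.

```java
// Hypothetical illustration of the scoring convention; not part of the library.
final class NullConventionSketch {
    /** Exact-match score: 1.0 for equal strings, 0.0 otherwise or if either input is null. */
    static double exactScore(String left, String right) {
        if (left == null || right == null) {
            return 0.0; // null is dissimilar to everything, including null
        }
        return left.equals(right) ? 1.0 : 0.0;
    }
}
```

Note that two nulls score 0 here, which differs from the usual null-safe equality found in `java.util.Objects.equals`.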
-
Field Summary
Fields
Modifier and Type, Field, Description:
static final int DEFAULT_MAX_DISTANCE - The default maximum distance for the positional q-gram measure.
static final int DEFAULT_PREFIX_LENGTH - The default prefix length for the Jaro-Winkler measure.
static final int DEFAULT_Q - The default q-gram size for the q-gram and positional q-gram measures.
static final float DEFAULT_SCALING_FACTOR - The default scaling factor for the Jaro-Winkler measure.
static final String PROP_MAX_DISTANCE - Name of the property specifying maximum distance for the positional q-gram measure.
static final String PROP_PREFIX_LENGTH - Name of the property specifying prefix length for the Jaro-Winkler measure.
static final String PROP_Q - Name of the property specifying the q-gram size for the q-gram and positional q-gram measures.
static final String PROP_SCALING_FACTOR - Name of the property specifying scaling factor for the Jaro-Winkler measure.
-
Constructor Summary
Constructors
Similarity()
-
Method Summary
Modifier and Type, Method, Description:
static ScalarValuedFunction contains(ScalarValuedFunction left, ScalarValuedFunction right) - Builds a function testing whether either of the specified string valued expressions contains the other.
static ScalarValuedFunction contains(String left, String right) - Builds a function testing whether either of the specified fields has a string value containing the other.
static ScalarValuedFunction damerauLevenshtein(ScalarValuedFunction left, ScalarValuedFunction right) - Builds a function computing the Damerau-Levenshtein distance between the specified string valued expressions.
static ScalarValuedFunction damerauLevenshtein(String left, String right) - Builds a function computing the Damerau-Levenshtein distance between the string values of the specified fields.
static ScalarValuedFunction exact(ScalarValuedFunction left, ScalarValuedFunction right) - Builds a function testing whether the specified string valued expressions are equal.
static ScalarValuedFunction exact(String left, String right) - Builds a function testing exact equality between the string values of the specified fields.
static ScalarValuedFunction jaro(ScalarValuedFunction left, ScalarValuedFunction right) - Builds a function computing the Jaro distance between the specified string valued expressions.
static ScalarValuedFunction jaro(String left, String right) - Builds a function computing the Jaro distance between the string values of the specified fields.
static ScalarValuedFunction jaroWinkler(ScalarValuedFunction left, ScalarValuedFunction right, int prefixLen, float scaling) - Builds a function computing the Jaro-Winkler distance between the specified string valued expressions.
static ScalarValuedFunction jaroWinkler(String left, String right, int prefixLen, float scaling) - Builds a function computing the Jaro-Winkler distance between the string values of the specified fields.
static ScalarValuedFunction levenshtein(ScalarValuedFunction left, ScalarValuedFunction right) - Builds a function computing the Levenshtein distance between the specified string valued expressions.
static ScalarValuedFunction levenshtein(String left, String right) - Builds a function computing the Levenshtein distance between the string values of the specified fields.
static ScalarValuedFunction positionalQgram(ScalarValuedFunction left, ScalarValuedFunction right, int q, int maxDist) - Builds a function computing the percentage of q-grams in common between the specified string valued expressions.
static ScalarValuedFunction positionalQgram(String left, String right, int q, int maxDist) - Builds a function computing the percentage of q-grams in common between the string values of the specified fields.
static ScalarValuedFunction proximity(ScalarValuedFunction left, ScalarValuedFunction right) - Builds a function computing an adjusted quotient of the values of the specified numeric valued expressions.
static ScalarValuedFunction proximity(String left, String right) - Builds a function computing an adjusted quotient of the numeric values of the specified fields.
static ScalarValuedFunction qgram(ScalarValuedFunction left, ScalarValuedFunction right, int q) - Builds a function computing the percentage of q-grams in common between the specified string valued expressions.
static ScalarValuedFunction qgram(String left, String right, int q) - Builds a function computing the percentage of q-grams in common between the string values of the specified fields.
static ScalarValuedFunction shorthand(ScalarValuedFunction left, ScalarValuedFunction right) - Builds a function testing shorthand equivalence between the specified string valued expressions.
static ScalarValuedFunction shorthand(String left, String right) - Builds a function testing shorthand equivalence between the string values of the specified fields.
-
Field Details
-
PROP_Q
public static final String PROP_Q
Name of the property specifying the q-gram size for the q-gram and positional q-gram measures.
-
DEFAULT_Q
public static final int DEFAULT_Q
The default q-gram size for the q-gram and positional q-gram measures.
-
PROP_MAX_DISTANCE
public static final String PROP_MAX_DISTANCE
Name of the property specifying the maximum distance for the positional q-gram measure.
-
DEFAULT_MAX_DISTANCE
public static final int DEFAULT_MAX_DISTANCE
The default maximum distance for the positional q-gram measure.
-
PROP_PREFIX_LENGTH
public static final String PROP_PREFIX_LENGTH
Name of the property specifying the prefix length for the Jaro-Winkler measure.
-
PROP_SCALING_FACTOR
public static final String PROP_SCALING_FACTOR
Name of the property specifying the scaling factor for the Jaro-Winkler measure.
-
DEFAULT_PREFIX_LENGTH
public static final int DEFAULT_PREFIX_LENGTH
The default prefix length for the Jaro-Winkler measure.
-
DEFAULT_SCALING_FACTOR
public static final float DEFAULT_SCALING_FACTOR
The default scaling factor for the Jaro-Winkler measure.
-
-
Constructor Details
-
Similarity
public Similarity()
-
-
Method Details
-
contains
public static ScalarValuedFunction contains(String left, String right)
Builds a function testing whether either of the specified fields has a string value containing the other. This function returns 1 if one value is contained in the other, 0 if not.
Parameters:
left - the first record field
right - the second record field
Returns:
the specified function
-
contains
public static ScalarValuedFunction contains(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function testing whether either of the specified string valued expressions contains the other. This function returns 1 if one value is contained in the other, 0 if not.
Parameters:
left - the first string valued expression
right - the second string valued expression
Returns:
the specified function
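The containment test described above is symmetric: it succeeds if either value contains the other. A minimal standalone sketch of that rule follows. The class and method names are hypothetical, it works on plain strings rather than on ScalarValuedFunction expressions, and case-sensitive comparison is an assumption.

```java
// Hypothetical illustration of the symmetric containment rule; not the library code.
final class ContainsSketch {
    /** Returns 1.0 if either string contains the other, 0.0 otherwise or on null input. */
    static double containsScore(String left, String right) {
        if (left == null || right == null) {
            return 0.0; // class-level convention: null never matches
        }
        // Assumption: comparison is case-sensitive.
        return (left.contains(right) || right.contains(left)) ? 1.0 : 0.0;
    }
}
```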
-
damerauLevenshtein
public static ScalarValuedFunction damerauLevenshtein(String left, String right)
Builds a function computing the Damerau-Levenshtein distance between the string values of the specified fields. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
Parameters:
left - the first record field
right - the second record field
Returns:
the specified function
-
damerauLevenshtein
public static ScalarValuedFunction damerauLevenshtein(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function computing the Damerau-Levenshtein distance between the specified string valued expressions. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
Parameters:
left - the first string valued expression
right - the second string valued expression
Returns:
the specified function
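To make the normalization concrete, here is a standalone sketch of a Damerau-Levenshtein computation. Several points are assumptions rather than documented behavior: the sketch uses the optimal-string-alignment variant (adjacent transpositions only), and it turns the normalized distance into a similarity as 1 minus the normalized distance so that identical strings score 1, in line with the class-level convention. All names are hypothetical.

```java
// Hypothetical sketch; the library's actual algorithm variant is not documented here.
final class DamerauLevenshteinSketch {
    /** Edit distance with insert, delete, substitute, and adjacent transposition. */
    static int distance(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                // Adjacent transposition, e.g. "ab" <-> "ba".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[n][m];
    }

    /** Distance divided by the length of the longer string, in [0,1]. */
    static double normalizedDistance(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 0.0 : (double) distance(a, b) / longer;
    }

    /** Assumption: similarity is 1 minus the normalized distance; null scores 0. */
    static double similarity(String a, String b) {
        if (a == null || b == null) return 0.0;
        return 1.0 - normalizedDistance(a, b);
    }
}
```

For example, "abcd" and "abdc" differ by one transposition, giving a normalized distance of 1/4.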
-
exact
public static ScalarValuedFunction exact(String left, String right)
Builds a function testing exact equality between the string values of the specified fields. This function returns 1 if the values are equal, 0 if not.
Parameters:
left - the first record field
right - the second record field
Returns:
the specified function
-
exact
public static ScalarValuedFunction exact(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function testing whether the specified string valued expressions are equal. This function returns 1 if the values are equal, 0 if not.
Parameters:
left - the first string valued expression
right - the second string valued expression
Returns:
the specified function
-
jaro
public static ScalarValuedFunction jaro(String left, String right)
Builds a function computing the Jaro distance between the string values of the specified fields.
Parameters:
left - the first record field
right - the second record field
Returns:
the specified function
-
jaro
public static ScalarValuedFunction jaro(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function computing the Jaro distance between the specified string valued expressions.
Parameters:
left - the first string valued expression
right - the second string valued expression
Returns:
the specified function
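For readers unfamiliar with the Jaro measure, the following standalone sketch shows the textbook formulation, which already yields 1 for identical strings and 0 for strings with no matching characters. It is not the library implementation; the class name is hypothetical and the library's handling of edge cases such as empty strings may differ.

```java
// Hypothetical sketch of the textbook Jaro measure; not the library code.
final class JaroSketch {
    static double jaro(String a, String b) {
        if (a == null || b == null) return 0.0;
        // Characters match only if equal and no farther apart than this window.
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order in the two strings.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) continue;
            while (!bMatched[k]) k++;
            if (a.charAt(i) != b.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / a.length() + m / b.length() + (m - transpositions) / m) / 3.0;
    }
}
```

With the classic pair "MARTHA" and "MARHTA", all six characters match and one transposition is counted, giving (1 + 1 + 5/6) / 3.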
-
jaroWinkler
public static ScalarValuedFunction jaroWinkler(String left, String right, int prefixLen, float scaling)
Builds a function computing the Jaro-Winkler distance between the string values of the specified fields.
Parameters:
left - the first record field
right - the second record field
prefixLen - the maximum length of the common prefix for score adjustment purposes
scaling - the scaling factor for scoring common prefixes
Returns:
the specified function
-
jaroWinkler
public static ScalarValuedFunction jaroWinkler(ScalarValuedFunction left, ScalarValuedFunction right, int prefixLen, float scaling)
Builds a function computing the Jaro-Winkler distance between the specified string valued expressions.
Parameters:
left - the first string valued expression
right - the second string valued expression
prefixLen - the maximum length of the common prefix for score adjustment purposes
scaling - the scaling factor for scoring common prefixes
Returns:
the specified function
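The roles of prefixLen and scaling can be seen in the standard Winkler adjustment, sketched below as a standalone helper that takes an already computed Jaro score. This is the textbook formula, assumed rather than confirmed to be what the library uses, and the names are hypothetical. The values 4 and 0.1 in the usage note are common choices from the literature, not necessarily the values of DEFAULT_PREFIX_LENGTH and DEFAULT_SCALING_FACTOR.

```java
// Hypothetical sketch of the textbook Winkler prefix adjustment; not the library code.
final class JaroWinklerSketch {
    /**
     * Boosts a Jaro score for strings sharing a common prefix.
     * Only the first prefixLen characters are considered.
     */
    static double adjust(double jaro, String a, String b, int prefixLen, double scaling) {
        int limit = Math.min(prefixLen, Math.min(a.length(), b.length()));
        int common = 0;
        while (common < limit && a.charAt(common) == b.charAt(common)) {
            common++;
        }
        // The boost is proportional to the prefix length and the remaining headroom.
        return jaro + common * scaling * (1.0 - jaro);
    }
}
```

For "MARTHA" and "MARHTA" (Jaro score 17/18, common prefix "MAR"), `adjust(17.0 / 18.0, "MARTHA", "MARHTA", 4, 0.1)` adds 3 * 0.1 * (1/18) to the score. Keeping prefixLen * scaling at or below 1 keeps the result within [0,1].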
-
levenshtein
public static ScalarValuedFunction levenshtein(String left, String right)
Builds a function computing the Levenshtein distance between the string values of the specified fields. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
Parameters:
left - the first record field
right - the second record field
Returns:
the specified function
-
levenshtein
public static ScalarValuedFunction levenshtein(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function computing the Levenshtein distance between the specified string valued expressions. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
Parameters:
left - the first string valued expression
right - the second string valued expression
Returns:
the specified function
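The division by the longer string's length can be sketched with a plain Levenshtein computation. This standalone example is illustrative only, with hypothetical names; it shows the normalized distance as described and does not claim to reproduce how the library maps that value onto its similarity scale.

```java
// Hypothetical sketch of normalized Levenshtein distance; not the library code.
final class LevenshteinSketch {
    /** Classic edit distance (insert, delete, substitute) using two rolling rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Distance divided by the length of the longer input, in [0,1]. */
    static double normalizedDistance(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 0.0 : (double) distance(a, b) / longer;
    }
}
```

Unlike Damerau-Levenshtein, swapping two adjacent characters costs two edits here, so "ca" and "ac" are at distance 2.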
-
qgram
public static ScalarValuedFunction qgram(String left, String right, int q)
Builds a function computing the percentage of q-grams in common between the string values of the specified fields. This is the count of q-grams in common divided by the number of possible q-grams in the longer string.
Parameters:
left - the first record field
right - the second record field
q - the size of q-grams to compare
Returns:
the specified function
-
qgram
public static ScalarValuedFunction qgram(ScalarValuedFunction left, ScalarValuedFunction right, int q)
Builds a function computing the percentage of q-grams in common between the specified string valued expressions. This is the count of q-grams in common divided by the number of possible q-grams in the longer string.
Parameters:
left - the first string valued expression
right - the second string valued expression
q - the size of q-grams to compare
Returns:
the specified function
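The ratio described above can be sketched as follows. This standalone example makes two assumptions that the documentation does not state: q-grams are taken without padding, so a string of length n has n - q + 1 of them, and repeated q-grams are matched one-for-one. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the q-gram ratio; not the library code.
final class QgramSketch {
    static double similarity(String a, String b, int q) {
        if (a == null || b == null) return 0.0;
        // Assumption: no padding, so the longer string has (length - q + 1) q-grams.
        int possible = Math.max(a.length(), b.length()) - q + 1;
        if (possible <= 0) return 0.0;

        // Count each q-gram of the first string.
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + q <= a.length(); i++) {
            counts.merge(a.substring(i, i + q), 1, Integer::sum);
        }
        // Consume one counted q-gram for each matching q-gram of the second string.
        int common = 0;
        for (int i = 0; i + q <= b.length(); i++) {
            String gram = b.substring(i, i + q);
            Integer remaining = counts.get(gram);
            if (remaining != null && remaining > 0) {
                counts.put(gram, remaining - 1);
                common++;
            }
        }
        return (double) common / possible;
    }
}
```

With q = 2, "night" and "nacht" share only the q-gram "ht" out of four possible, giving 0.25.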
-
positionalQgram
public static ScalarValuedFunction positionalQgram(String left, String right, int q, int maxDist)
Builds a function computing the percentage of q-grams in common between the string values of the specified fields. Q-grams are only counted as common if they appear near the position they appear in the longer string. The resulting count is then divided by the number of possible q-grams in the longer string.
Parameters:
left - the first record field
right - the second record field
q - the size of q-grams to compare
maxDist - the maximum distance, in characters, a q-gram can be from its original position
Returns:
the specified function
-
positionalQgram
public static ScalarValuedFunction positionalQgram(ScalarValuedFunction left, ScalarValuedFunction right, int q, int maxDist)
Builds a function computing the percentage of q-grams in common between the specified string valued expressions. Q-grams are only counted as common if they appear near the position they appear in the longer string. The resulting count is then divided by the number of possible q-grams in the longer string.
Parameters:
left - the first string valued expression
right - the second string valued expression
q - the size of q-grams to compare
maxDist - the maximum distance, in characters, a q-gram can be from its original position
Returns:
the specified function
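One way to read the positional rule is sketched below: a q-gram at position i in the shorter string counts only if an identical, not yet used q-gram starts within maxDist characters of i in the longer string. This interpretation, the greedy left-to-right matching, the absence of padding, and all names are assumptions made for illustration.

```java
// Hypothetical sketch of a positional q-gram ratio; not the library code.
final class PositionalQgramSketch {
    static double similarity(String a, String b, int q, int maxDist) {
        if (a == null || b == null) return 0.0;
        String longer = a.length() >= b.length() ? a : b;
        String shorter = a.length() >= b.length() ? b : a;
        // Assumption: no padding, so the longer string has (length - q + 1) q-grams.
        int possible = longer.length() - q + 1;
        if (possible <= 0) return 0.0;

        boolean[] used = new boolean[possible]; // q-grams of the longer string already matched
        int common = 0;
        for (int i = 0; i + q <= shorter.length(); i++) {
            int lo = Math.max(0, i - maxDist);
            int hi = Math.min(possible - 1, i + maxDist);
            for (int j = lo; j <= hi; j++) {
                // Compare the q-gram at longer[j..j+q) with shorter[i..i+q).
                if (!used[j] && longer.regionMatches(j, shorter, i, q)) {
                    used[j] = true;
                    common++;
                    break;
                }
            }
        }
        return (double) common / possible;
    }
}
```

For "abxx" and "xxab" with q = 2, the shared q-grams "ab" and "xx" sit two positions apart, so they count with maxDist = 2 but not with maxDist = 1.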
-
shorthand
public static ScalarValuedFunction shorthand(String left, String right)
Builds a function testing shorthand equivalence between the string values of the specified fields. This function returns 1 if the shorter value is a shorthand representation of the longer value, 0 if not. To be a shorthand representation, all characters in the shorter string must appear in the longer string in the same order. Additionally, if the longer string contains multiple words, a letter within a word can match only if at least the first letter of that word was also matched. For example, "ABC" would not match "Acme SoftBall Corporation": the 'S' starting "SoftBall" is not matched, so the 'B' inside that word does not count as a match. However, "Acme Soft Ball Corporation" does match, as the 'B' starts a word.
Parameters:
left - the first record field
right - the second record field
Returns:
the specified function
-
shorthand
public static ScalarValuedFunction shorthand(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function testing shorthand equivalence between the specified string valued expressions. This function returns 1 if the shorter value is a shorthand representation of the longer value, 0 if not. To be a shorthand representation, all characters in the shorter string must appear in the longer string in the same order. Additionally, if the longer string contains multiple words, a letter within a word can match only if at least the first letter of that word was also matched. For example, "ABC" would not match "Acme SoftBall Corporation": the 'S' starting "SoftBall" is not matched, so the 'B' inside that word does not count as a match. However, "Acme Soft Ball Corporation" does match, as the 'B' starts a word.
Parameters:
left - the first string valued expression
right - the second string valued expression
Returns:
the specified function
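The shorthand rule can be expressed as a subsequence match with a per-word constraint. The standalone sketch below is one possible reading of the description, not the library's algorithm. It assumes case-sensitive comparison, words separated by single spaces, and plain subsequence matching when the longer string is a single word. It uses simple backtracking and is not optimized; all names are hypothetical.

```java
// Hypothetical sketch of the shorthand rule; not the library code.
final class ShorthandSketch {
    static double shorthandScore(String a, String b) {
        if (a == null || b == null) return 0.0;
        String longer = a.length() >= b.length() ? a : b;
        String shorter = a.length() >= b.length() ? b : a;
        boolean multiWord = longer.trim().contains(" ");
        return matches(shorter, 0, longer, 0, false, multiWord) ? 1.0 : 0.0;
    }

    // Tries to consume s[i..] as a subsequence of l[j..].
    // wordStarted is true once the first letter of the current word has been matched.
    private static boolean matches(String s, int i, String l, int j,
                                   boolean wordStarted, boolean multiWord) {
        if (i == s.length()) return true;   // every shorthand character was placed
        if (j == l.length()) return false;  // ran out of text to match against
        char c = l.charAt(j);
        if (c == ' ') {
            return matches(s, i, l, j + 1, false, multiWord); // a new word begins
        }
        boolean atWordStart = j == 0 || l.charAt(j - 1) == ' ';
        // In multi-word text, an inner letter may match only if its word's first letter did.
        boolean allowed = !multiWord || atWordStart || wordStarted;
        if (allowed && s.charAt(i) == c
                && matches(s, i + 1, l, j + 1, wordStarted || atWordStart, multiWord)) {
            return true;
        }
        // Otherwise skip this character of the longer string.
        return matches(s, i, l, j + 1, wordStarted, multiWord);
    }
}
```

This reproduces the two examples from the description: "ABC" fails against "Acme SoftBall Corporation" and succeeds against "Acme Soft Ball Corporation".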
-
proximity
public static ScalarValuedFunction proximity(String left, String right)
Builds a function computing an adjusted quotient of the numeric values of the specified fields. The result is the smaller number divided by the larger. If either value is zero (unless both are zero) or if the signs differ, 0 is returned.
Parameters:
left - the first record field
right - the second record field
Returns:
the specified function
-
proximity
public static ScalarValuedFunction proximity(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function computing an adjusted quotient of the values of the specified numeric valued expressions. The result is the smaller number divided by the larger. If either value is zero (unless both are zero) or if the signs differ, 0 is returned.
Parameters:
left - the first numeric valued expression
right - the second numeric valued expression
Returns:
the specified function
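The quotient rule can be sketched on plain doubles as follows. Two details are assumptions: two zeros are treated as an exact match and score 1, and for two negative values the quotient is taken over magnitudes so that the result stays in [0,1]. The class and method names are hypothetical.

```java
// Hypothetical sketch of the adjusted quotient; not the library code.
final class ProximitySketch {
    static double proximity(double left, double right) {
        if (left == 0.0 && right == 0.0) {
            return 1.0; // assumption: two zeros count as an exact match
        }
        if (left == 0.0 || right == 0.0) {
            return 0.0; // exactly one value is zero
        }
        if ((left < 0) != (right < 0)) {
            return 0.0; // signs differ
        }
        // Assumption: compare magnitudes so negative pairs also land in [0,1].
        double a = Math.abs(left);
        double b = Math.abs(right);
        return Math.min(a, b) / Math.max(a, b);
    }
}
```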
-