Class Similarity


  • public class Similarity
    extends Object
    A collection of functions for computing similarity of strings. These functions are used to perform "fuzzy" matches between values, attempting to account for common entry errors or other equivalences.

    These functions return a floating-point value normalized to the range [0,1], with 0 representing no similarity at all and 1 representing an exact match. Null-valued inputs are considered totally dissimilar to any other string (including the null value) and will always return 0.

    • Field Detail

      • PROP_Q

        public static final String PROP_Q
        Name of the property specifying the q-gram size for the q-gram and positional q-gram measures.
        See Also:
        Constant Field Values
      • DEFAULT_Q

        public static final int DEFAULT_Q
        The default q-gram size for the q-gram and positional q-gram measures.
        See Also:
        Constant Field Values
      • PROP_MAX_DISTANCE

        public static final String PROP_MAX_DISTANCE
        Name of the property specifying maximum distance for the positional q-gram measure.
        See Also:
        Constant Field Values
      • DEFAULT_MAX_DISTANCE

        public static final int DEFAULT_MAX_DISTANCE
        The default maximum distance for the positional q-gram measure.
        See Also:
        Constant Field Values
      • PROP_PREFIX_LENGTH

        public static final String PROP_PREFIX_LENGTH
        Name of the property specifying prefix length for the Jaro-Winkler measure.
        See Also:
        Constant Field Values
      • PROP_SCALING_FACTOR

        public static final String PROP_SCALING_FACTOR
        Name of the property specifying scaling factor for the Jaro-Winkler measure.
        See Also:
        Constant Field Values
      • DEFAULT_PREFIX_LENGTH

        public static final int DEFAULT_PREFIX_LENGTH
        The default prefix length for the Jaro-Winkler measure.
        See Also:
        Constant Field Values
      • DEFAULT_SCALING_FACTOR

        public static final float DEFAULT_SCALING_FACTOR
        The default scaling factor for the Jaro-Winkler measure.
        See Also:
        Constant Field Values
    • Constructor Detail

      • Similarity

        public Similarity()
    • Method Detail

      • contains

        public static ScalarValuedFunction contains​(String left,
                                                    String right)
        Builds a function testing whether either of the specified fields has a string value containing the other. This function returns 1 if one value is contained in the other, 0 if not.
        Parameters:
        left - the first record field
        right - the second record field
        Returns:
        the specified function
      • contains

        public static ScalarValuedFunction contains​(ScalarValuedFunction left,
                                                    ScalarValuedFunction right)
        Builds a function testing whether either of the specified string valued expressions contains the other. This function returns 1 if one value is contained in the other, 0 if not.
        Parameters:
        left - the first string valued expression
        right - the second string valued expression
        Returns:
        the specified function
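
        The symmetric containment test described above can be illustrated on plain Java strings. This is a minimal sketch, not the library's implementation: the class name ContainsSketch is hypothetical, and it assumes a case-sensitive comparison and the null handling described in the class overview.

```java
// Illustrative sketch only; ContainsSketch is a hypothetical name,
// not part of the documented API.
final class ContainsSketch {
    /** Returns 1 if either string contains the other, 0 otherwise. */
    static double contains(String left, String right) {
        // Null inputs are treated as totally dissimilar (see class overview).
        if (left == null || right == null) {
            return 0.0;
        }
        // The test is symmetric: the order of the arguments does not matter.
        return (left.contains(right) || right.contains(left)) ? 1.0 : 0.0;
    }
}
```

        Note that in this sketch an empty string is contained in every string, so it scores 1 against any non-null value; whether the library behaves the same way is not stated here.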
      • damerauLevenshtein

        public static ScalarValuedFunction damerauLevenshtein​(String left,
                                                              String right)
        Builds a function computing the Damerau-Levenshtein distance between the string values of the specified fields. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
        Parameters:
        left - the first record field
        right - the second record field
        Returns:
        the specified function
      • damerauLevenshtein

        public static ScalarValuedFunction damerauLevenshtein​(ScalarValuedFunction left,
                                                              ScalarValuedFunction right)
        Builds a function computing the Damerau-Levenshtein distance between the specified string valued expressions. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
        Parameters:
        left - the first string valued expression
        right - the second string valued expression
        Returns:
        the specified function
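
        As a rough illustration of the measure, the sketch below computes the optimal string alignment variant of the Damerau-Levenshtein distance (each substring is edited at most once) and normalizes it. The class name is hypothetical, and the mapping to a similarity as 1 - distance / (length of the longer string) is an assumption made so that identical strings score 1, as the class overview requires; the library's exact formula is not shown in this document.

```java
// Illustrative sketch only; not the library's implementation.
final class DamerauLevenshteinSketch {
    /** Edit distance counting insert, delete, substitute and adjacent transposition. */
    static int distance(String a, String b) {
        int n = a.length();
        int m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                    d[i - 1][j - 1] + cost);
                // Adjacent transposition, e.g. "ab" -> "ba", costs one edit.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }

    /** Assumed normalization: 1 - distance / longer length; null inputs give 0. */
    static double similarity(String a, String b) {
        if (a == null || b == null) return 0.0;
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return 1.0; // assumption: two empty strings match exactly
        return 1.0 - (double) distance(a, b) / maxLen;
    }
}
```

        For "abcd" and "abdc" the sketch counts a single transposition, giving a distance of 1 and a similarity of 0.75, whereas plain Levenshtein would count two substitutions.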
      • exact

        public static ScalarValuedFunction exact​(String left,
                                                 String right)
        Builds a function testing exact equality between the string values of the specified fields. This function returns 1 if the values are equal, 0 if not.
        Parameters:
        left - the first record field
        right - the second record field
        Returns:
        the specified function
      • exact

        public static ScalarValuedFunction exact​(ScalarValuedFunction left,
                                                 ScalarValuedFunction right)
        Builds a function testing whether the specified string valued expressions are equal. This function returns 1 if the values are equal, 0 if not.
        Parameters:
        left - the first string valued expression
        right - the second string valued expression
        Returns:
        the specified function
      • jaro

        public static ScalarValuedFunction jaro​(String left,
                                                String right)
        Builds a function computing the Jaro distance between the string values of the specified fields.
        Parameters:
        left - the first record field
        right - the second record field
        Returns:
        the specified function
      • jaro

        public static ScalarValuedFunction jaro​(ScalarValuedFunction left,
                                                ScalarValuedFunction right)
        Builds a function computing the Jaro distance between the specified string valued expressions.
        Parameters:
        left - the first string valued expression
        right - the second string valued expression
        Returns:
        the specified function
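
        For readers unfamiliar with the measure, the following sketch implements the textbook Jaro score on plain strings. It is an illustration under stated assumptions, not the library's code: the class name is hypothetical, the match window is the usual floor(max length / 2) - 1, and null inputs return 0 as described in the class overview.

```java
// Illustrative sketch of the textbook Jaro score; not the library's implementation.
final class JaroSketch {
    static double jaro(String a, String b) {
        if (a == null || b == null) return 0.0;
        if (a.isEmpty() && b.isEmpty()) return 1.0; // assumption: empty strings match
        // Characters match only if they are equal and at most 'window' positions apart.
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order in the two strings.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) continue;
            while (!bMatched[j]) j++;
            if (a.charAt(i) != b.charAt(j)) outOfOrder++;
            j++;
        }
        double m = matches;
        double t = outOfOrder / 2.0; // transpositions are half the out-of-order matches
        return (m / a.length() + m / b.length() + (m - t) / m) / 3.0;
    }
}
```

        With the classic pair "MARTHA" and "MARHTA", all six characters match and one transposition is found, so the score is (1 + 1 + 5/6) / 3, roughly 0.944.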
      • jaroWinkler

        public static ScalarValuedFunction jaroWinkler​(String left,
                                                       String right,
                                                       int prefixLen,
                                                       float scaling)
        Builds a function computing the Jaro-Winkler distance between the string values of the specified fields.
        Parameters:
        left - the first record field
        right - the second record field
        prefixLen - the maximum length of the common prefix for score adjustment purposes
        scaling - the scaling factor for scoring common prefixes
        Returns:
        the specified function
      • jaroWinkler

        public static ScalarValuedFunction jaroWinkler​(ScalarValuedFunction left,
                                                       ScalarValuedFunction right,
                                                       int prefixLen,
                                                       float scaling)
        Builds a function computing the Jaro-Winkler distance between the specified string valued expressions.
        Parameters:
        left - the first string valued expression
        right - the second string valued expression
        prefixLen - the maximum length of the common prefix for score adjustment purposes
        scaling - the scaling factor for scoring common prefixes
        Returns:
        the specified function
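
        The role of prefixLen and scaling can be seen in the standard Winkler adjustment, sketched below. This is an assumption about how the two parameters are applied, based on the textbook formula jaro + l * p * (1 - jaro), where l is the length of the common prefix capped at prefixLen and p is the scaling factor. The class and method names are hypothetical, and the Jaro score is passed in to keep the example short.

```java
// Illustrative sketch of the textbook Winkler prefix adjustment;
// not the library's implementation.
final class JaroWinklerSketch {
    /**
     * Boosts a Jaro score for strings that share a common prefix.
     * Assumes both strings are non-null.
     */
    static double adjust(double jaro, String a, String b, int prefixLen, float scaling) {
        // Only the first prefixLen characters can contribute to the boost.
        int limit = Math.min(prefixLen, Math.min(a.length(), b.length()));
        int common = 0;
        while (common < limit && a.charAt(common) == b.charAt(common)) {
            common++;
        }
        return jaro + common * (double) scaling * (1.0 - jaro);
    }
}
```

        In the literature a prefix length of 4 and a scaling factor of 0.1 are common choices; with scaling above 1 / prefixLen the adjusted score can exceed 1. The values of DEFAULT_PREFIX_LENGTH and DEFAULT_SCALING_FACTOR are not given in this document.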
      • levenshtein

        public static ScalarValuedFunction levenshtein​(String left,
                                                       String right)
        Builds a function computing the Levenshtein distance between the string values of the specified fields. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
        Parameters:
        left - the first record field
        right - the second record field
        Returns:
        the specified function
      • levenshtein

        public static ScalarValuedFunction levenshtein​(ScalarValuedFunction left,
                                                       ScalarValuedFunction right)
        Builds a function computing the Levenshtein distance between the specified string valued expressions. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
        Parameters:
        left - the first string valued expression
        right - the second string valued expression
        Returns:
        the specified function
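
        The normalization described above can be sketched on plain strings as follows. This is not the library's implementation: the class name is hypothetical, and converting the normalized distance into a similarity as 1 - distance / (length of the longer string) is an assumption made so that identical strings score 1, consistent with the class overview.

```java
// Illustrative sketch only; not the library's implementation.
final class LevenshteinSketch {
    /** Classic edit distance: insertions, deletions and substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1 - distance / longer length; null inputs give 0. */
    static double similarity(String a, String b) {
        if (a == null || b == null) return 0.0;
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return 1.0; // assumption: two empty strings match exactly
        return 1.0 - (double) distance(a, b) / maxLen;
    }
}
```

        For "kitten" and "sitting" the distance is 3 and the longer string has 7 characters, so this sketch yields 1 - 3/7, roughly 0.571.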
      • qgram

        public static ScalarValuedFunction qgram​(String left,
                                                 String right,
                                                 int q)
        Builds a function computing the fraction of q-grams in common between the string values of the specified fields. This is the count of q-grams in common divided by the number of possible q-grams in the longer string.
        Parameters:
        left - the first record field
        right - the second record field
        q - the size of q-grams to compare
        Returns:
        the specified function
      • qgram

        public static ScalarValuedFunction qgram​(ScalarValuedFunction left,
                                                 ScalarValuedFunction right,
                                                 int q)
        Builds a function computing the fraction of q-grams in common between the specified string valued expressions. This is the count of q-grams in common divided by the number of possible q-grams in the longer string.
        Parameters:
        left - the first string valued expression
        right - the second string valued expression
        q - the size of q-grams to compare
        Returns:
        the specified function
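
        A minimal sketch of the q-gram ratio follows. Several details are assumptions rather than documented behavior: the class name is hypothetical, the strings are not padded (so a string of length n has n - q + 1 q-grams), repeated q-grams are matched one-for-one, and strings shorter than q are compared for equality.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not the library's implementation.
final class QgramSketch {
    static double similarity(String a, String b, int q) {
        if (a == null || b == null) return 0.0;
        if (q <= 0) throw new IllegalArgumentException("q must be positive");
        String longer = a.length() >= b.length() ? a : b;
        String shorter = a.length() >= b.length() ? b : a;
        int possible = longer.length() - q + 1;
        // Assumption: when no q-gram fits, fall back to an equality test.
        if (possible <= 0) return a.equals(b) ? 1.0 : 0.0;

        // Count each q-gram of the longer string.
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + q <= longer.length(); i++) {
            counts.merge(longer.substring(i, i + q), 1, Integer::sum);
        }
        // Each q-gram of the shorter string consumes one matching occurrence.
        int common = 0;
        for (int i = 0; i + q <= shorter.length(); i++) {
            String gram = shorter.substring(i, i + q);
            Integer remaining = counts.get(gram);
            if (remaining != null && remaining > 0) {
                counts.put(gram, remaining - 1);
                common++;
            }
        }
        return (double) common / possible;
    }
}
```

        With q = 2, "night" and "nacht" share only the q-gram "ht" out of four possible, so the sketch returns 0.25.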
      • positionalQgram

        public static ScalarValuedFunction positionalQgram​(String left,
                                                           String right,
                                                           int q,
                                                           int maxDist)
        Builds a function computing the fraction of q-grams in common between the string values of the specified fields. A q-gram is counted as common only if it appears within maxDist characters of its position in the longer string. The resulting count is then divided by the number of possible q-grams in the longer string.
        Parameters:
        left - the first record field
        right - the second record field
        q - the size of q-grams to compare
        maxDist - the maximum distance, in characters, a q-gram can be from its original position
        Returns:
        the specified function
      • positionalQgram

        public static ScalarValuedFunction positionalQgram​(ScalarValuedFunction left,
                                                           ScalarValuedFunction right,
                                                           int q,
                                                           int maxDist)
        Builds a function computing the fraction of q-grams in common between the specified string valued expressions. A q-gram is counted as common only if it appears within maxDist characters of its position in the longer string. The resulting count is then divided by the number of possible q-grams in the longer string.
        Parameters:
        left - the first string valued expression
        right - the second string valued expression
        q - the size of q-grams to compare
        maxDist - the maximum distance, in characters, a q-gram can be from its original position
        Returns:
        the specified function
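
        The effect of maxDist can be illustrated with the sketch below, which only counts a q-gram as common when an unused identical q-gram lies within maxDist positions in the longer string. As before, this is a sketch under assumptions (no padding, one-for-one matching, hypothetical class name), not the library's implementation.

```java
// Illustrative sketch only; not the library's implementation.
final class PositionalQgramSketch {
    static double similarity(String a, String b, int q, int maxDist) {
        if (a == null || b == null) return 0.0;
        if (q <= 0) throw new IllegalArgumentException("q must be positive");
        String longer = a.length() >= b.length() ? a : b;
        String shorter = a.length() >= b.length() ? b : a;
        int possible = longer.length() - q + 1;
        // Assumption: when no q-gram fits, fall back to an equality test.
        if (possible <= 0) return a.equals(b) ? 1.0 : 0.0;

        boolean[] used = new boolean[possible]; // q-grams of 'longer' already matched
        int common = 0;
        for (int i = 0; i + q <= shorter.length(); i++) {
            // Search only positions within maxDist of i in the longer string.
            int lo = Math.max(0, i - maxDist);
            int hi = Math.min(possible - 1, i + maxDist);
            for (int j = lo; j <= hi; j++) {
                if (!used[j] && longer.regionMatches(j, shorter, i, q)) {
                    used[j] = true;
                    common++;
                    break;
                }
            }
        }
        return (double) common / possible;
    }
}
```

        For "abcdef" and "defabc" with q = 2, a maxDist of 1 finds no common q-grams because every shared q-gram has moved three positions, while a maxDist of 3 finds four of the five possible, giving 0.8.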
      • shorthand

        public static ScalarValuedFunction shorthand​(String left,
                                                     String right)
        Builds a function testing shorthand equivalence between the string values of the specified fields. This function returns 1 if the shorter value is a shorthand representation of the longer value, 0 if not. To be a shorthand representation, all characters in the shorter string must appear in the longer string in the same order. Additionally, if the longer string contains multiple words, a letter can only match if the first letter of its word was also matched. For example, "ABC" would not match "Acme SoftBall Corporation": the 'S' starting "SoftBall" is not matched, so the 'B' inside that word does not count as a match. However, "ABC" does match "Acme Soft Ball Corporation", as the 'B' starts a word.
        Parameters:
        left - the first record field
        right - the second record field
        Returns:
        the specified function
      • shorthand

        public static ScalarValuedFunction shorthand​(ScalarValuedFunction left,
                                                     ScalarValuedFunction right)
        Builds a function testing shorthand equivalence between the specified string valued expressions. This function returns 1 if the shorter value is a shorthand representation of the longer value, 0 if not. To be a shorthand representation, all characters in the shorter string must appear in the longer string in the same order. Additionally, if the longer string contains multiple words, a letter can only match if the first letter of its word was also matched. For example, "ABC" would not match "Acme SoftBall Corporation": the 'S' starting "SoftBall" is not matched, so the 'B' inside that word does not count as a match. However, "ABC" does match "Acme Soft Ball Corporation", as the 'B' starts a word.
        Parameters:
        left - the first string valued expression
        right - the second string valued expression
        Returns:
        the specified function
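
        The shorthand rules above can be sketched as a small dynamic program over the longer string. This is one possible reading of the rules, not the library's implementation: the class name is hypothetical, and the sketch assumes case-sensitive comparison, words separated by single space characters, and a shorter string without spaces.

```java
// Illustrative sketch only; not the library's implementation.
final class ShorthandSketch {
    static double similarity(String a, String b) {
        if (a == null || b == null) return 0.0;
        String longer = a.length() >= b.length() ? a : b;
        String shorter = a.length() >= b.length() ? b : a;
        return isShorthand(shorter, longer) ? 1.0 : 0.0;
    }

    static boolean isShorthand(String shorter, String longer) {
        boolean multiWord = longer.trim().indexOf(' ') >= 0;
        int n = shorter.length();
        // state[k][f]: the first k characters of 'shorter' are matched;
        // f == 1 means the first letter of the current word was matched.
        boolean[][] state = new boolean[n + 1][2];
        state[0][0] = true;
        for (int i = 0; i < longer.length(); i++) {
            char c = longer.charAt(i);
            boolean wordStart = i == 0 || longer.charAt(i - 1) == ' ';
            boolean[][] next = new boolean[n + 1][2];
            for (int k = 0; k <= n; k++) {
                for (int f = 0; f < 2; f++) {
                    if (!state[k][f]) continue;
                    if (c == ' ') {          // a space ends the current word
                        next[k][0] = true;
                        continue;
                    }
                    int flag = wordStart ? 0 : f;
                    next[k][flag] = true;    // option 1: skip this character
                    // option 2: match it, if the word rule allows
                    boolean allowed = !multiWord || wordStart || flag == 1;
                    if (k < n && allowed && shorter.charAt(k) == c) {
                        next[k + 1][1] = true;
                    }
                }
            }
            state = next;
        }
        return state[n][0] || state[n][1];
    }
}
```

        Tracking all reachable states, rather than greedily taking the first matching character, matters for inputs such as "abc" against "ab bc": the 'b' must be matched at the start of the second word for the final 'c' to count.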
      • proximity

        public static ScalarValuedFunction proximity​(String left,
                                                     String right)
        Builds a function computing an adjusted quotient of the numeric values of the specified fields. The result is the smaller number divided by the larger. If either value is zero (unless both are zero) or if the signs differ, 0 is returned.
        Parameters:
        left - the first record field
        right - the second record field
        Returns:
        the specified function
      • proximity

        public static ScalarValuedFunction proximity​(ScalarValuedFunction left,
                                                     ScalarValuedFunction right)
        Builds a function computing an adjusted quotient of the numeric values of the specified expressions. The result is the smaller number divided by the larger. If either value is zero (unless both are zero) or if the signs differ, 0 is returned.
        Parameters:
        left - the first string valued expression
        right - the second string valued expression
        Returns:
        the specified function
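
        The adjusted quotient can be sketched on plain doubles as follows. The class name is hypothetical, and two points are assumptions not stated in this document: "smaller" and "larger" are taken by magnitude so that two negative values also yield a result in [0,1], and two zero values are treated as an exact match.

```java
// Illustrative sketch only; not the library's implementation.
final class ProximitySketch {
    static double proximity(double x, double y) {
        // Assumption: two zeros are identical, so they score 1.
        if (x == 0.0 && y == 0.0) return 1.0;
        // A single zero, or values of opposite sign, are totally dissimilar.
        if (x == 0.0 || y == 0.0) return 0.0;
        if ((x < 0) != (y < 0)) return 0.0;
        // Assumption: compare magnitudes so the quotient stays in [0,1].
        double ax = Math.abs(x);
        double ay = Math.abs(y);
        return Math.min(ax, ay) / Math.max(ax, ay);
    }
}
```

        Under these assumptions, 2 and 4 score 0.5, as do -2 and -4, while -1 and 1 score 0.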