java.lang.Object
com.pervasive.datarush.matching.functions.Similarity

public class Similarity extends Object
A collection of functions for computing similarity of strings. These functions are used to perform "fuzzy" matches between values, attempting to account for common entry errors or other equivalences.

These functions return a floating-point value normalized to the range [0,1], with 0 representing no similarity at all and 1 representing an exact match. A null input is considered totally dissimilar to any other value (including another null), so the result is always 0.

  • Field Details

    • PROP_Q

      public static final String PROP_Q
      Name of the property specifying the q-gram size for the q-gram and positional q-gram measures.
    • DEFAULT_Q

      public static final int DEFAULT_Q
      The default q-gram size for the q-gram and positional q-gram measures.
    • PROP_MAX_DISTANCE

      public static final String PROP_MAX_DISTANCE
      Name of the property specifying maximum distance for the positional q-gram measure.
    • DEFAULT_MAX_DISTANCE

      public static final int DEFAULT_MAX_DISTANCE
      The default maximum distance for the positional q-gram measure.
    • PROP_PREFIX_LENGTH

      public static final String PROP_PREFIX_LENGTH
      Name of the property specifying prefix length for the Jaro-Winkler measure.
    • PROP_SCALING_FACTOR

      public static final String PROP_SCALING_FACTOR
      Name of the property specifying scaling factor for the Jaro-Winkler measure.
    • DEFAULT_PREFIX_LENGTH

      public static final int DEFAULT_PREFIX_LENGTH
      The default prefix length for the Jaro-Winkler measure.
    • DEFAULT_SCALING_FACTOR

      public static final float DEFAULT_SCALING_FACTOR
      The default scaling factor for the Jaro-Winkler measure.
  • Constructor Details

    • Similarity

      public Similarity()
  • Method Details

    • contains

      public static ScalarValuedFunction contains(String left, String right)
      Builds a function testing whether either of the specified fields has a string value containing the other. This function returns 1 if one value is contained in the other, 0 if not.
      Parameters:
      left - the first record field
      right - the second record field
      Returns:
      the specified function
    • contains

      public static ScalarValuedFunction contains(ScalarValuedFunction left, ScalarValuedFunction right)
      Builds a function testing whether either of the specified string valued expressions contains the other. This function returns 1 if one value is contained in the other, 0 if not.
      Parameters:
      left - the first string valued expression
      right - the second string valued expression
      Returns:
      the specified function
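
      The containment test described above can be sketched on plain strings. This is a minimal illustration, not the library's implementation: the class ContainsSketch and its method are hypothetical, and the null handling follows the class-level rule that null inputs score 0.

```java
// Hypothetical stand-alone sketch; not part of the Similarity class.
final class ContainsSketch {
    // Returns 1 if either string contains the other, 0 otherwise.
    // Null inputs score 0, following the class-level description.
    static double contains(String left, String right) {
        if (left == null || right == null) {
            return 0.0;
        }
        return (left.contains(right) || right.contains(left)) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(contains("Acme Corporation", "Acme"));
        System.out.println(contains("Acme", "Apex"));
    }
}
```

      The check is symmetric, so the argument order does not matter.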
    • damerauLevenshtein

      public static ScalarValuedFunction damerauLevenshtein(String left, String right)
      Builds a function computing the Damerau-Levenshtein distance between the string values of the specified fields. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
      Parameters:
      left - the first record field
      right - the second record field
      Returns:
      the specified function
    • damerauLevenshtein

      public static ScalarValuedFunction damerauLevenshtein(ScalarValuedFunction left, ScalarValuedFunction right)
      Builds a function computing the Damerau-Levenshtein distance between the specified string valued expressions. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
      Parameters:
      left - the first string valued expression
      right - the second string valued expression
      Returns:
      the specified function
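
      As an illustration of the normalization described above, the following sketch computes an edit distance that also allows swapping two adjacent characters, then divides by the length of the longer string. It uses the optimal string alignment variant, which is an assumption; the source does not say which variant the library uses. The class DamerauLevenshteinSketch and its methods are hypothetical, and treating two empty strings as distance 0 is also an assumption.

```java
// Hypothetical stand-alone sketch; not part of the Similarity class.
final class DamerauLevenshteinSketch {
    // Optimal string alignment distance: insertions, deletions,
    // substitutions, and swaps of two adjacent characters each cost 1.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                // Adjacent transposition, e.g. "ca" versus "ac".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    // Distance divided by the length of the longer input.
    // Assumption: two empty strings have normalized distance 0.
    static double normalized(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 0.0 : (double) distance(a, b) / longer;
    }

    public static void main(String[] args) {
        System.out.println(normalized("abcd", "acbd"));
    }
}
```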
    • exact

      public static ScalarValuedFunction exact(String left, String right)
      Builds a function testing exact equality between the string values of the specified fields. This function returns 1 if the values are equal, 0 if not.
      Parameters:
      left - the first record field
      right - the second record field
      Returns:
      the specified function
    • exact

      public static ScalarValuedFunction exact(ScalarValuedFunction left, ScalarValuedFunction right)
      Builds a function testing whether the specified string valued expressions are equal. This function returns 1 if the values are equal, 0 if not.
      Parameters:
      left - the first string valued expression
      right - the second string valued expression
      Returns:
      the specified function
    • jaro

      public static ScalarValuedFunction jaro(String left, String right)
      Builds a function computing the Jaro distance between the string values of the specified fields.
      Parameters:
      left - the first record field
      right - the second record field
      Returns:
      the specified function
    • jaro

      public static ScalarValuedFunction jaro(ScalarValuedFunction left, ScalarValuedFunction right)
      Builds a function computing the Jaro distance between the specified string valued expressions.
      Parameters:
      left - the first string valued expression
      right - the second string valued expression
      Returns:
      the specified function
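
      For readers unfamiliar with the Jaro measure, here is a minimal sketch of the textbook formulation on plain strings. It is not the library's implementation; the class JaroSketch is hypothetical, and scoring two empty strings as 1 is an assumption. Characters match when they are equal and no farther apart than half the longer length minus one; half of the out-of-order matches count as transpositions.

```java
// Hypothetical stand-alone sketch; not part of the Similarity class.
final class JaroSketch {
    static double jaro(String a, String b) {
        if (a == null || b == null) return 0.0;          // class-level null rule
        if (a.isEmpty() && b.isEmpty()) return 1.0;      // assumption
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        // Pair each character of a with the first unused equal character
        // of b that lies inside the match window.
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) continue;
            while (!bMatched[k]) k++;
            if (a.charAt(i) != b.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / a.length() + m / b.length() + (m - outOfOrder / 2) / m) / 3.0;
    }

    public static void main(String[] args) {
        System.out.println(jaro("MARTHA", "MARHTA"));
    }
}
```

      For "MARTHA" and "MARHTA" all six characters match with one transposition, giving (1 + 1 + 5/6) / 3 = 17/18.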
    • jaroWinkler

      public static ScalarValuedFunction jaroWinkler(String left, String right, int prefixLen, float scaling)
      Builds a function computing the Jaro-Winkler distance between the string values of the specified fields.
      Parameters:
      left - the first record field
      right - the second record field
      prefixLen - the maximum length of the common prefix for score adjustment purposes
      scaling - the scaling factor for scoring common prefixes
      Returns:
      the specified function
    • jaroWinkler

      public static ScalarValuedFunction jaroWinkler(ScalarValuedFunction left, ScalarValuedFunction right, int prefixLen, float scaling)
      Builds a function computing the Jaro-Winkler distance between the specified string valued expressions.
      Parameters:
      left - the first string valued expression
      right - the second string valued expression
      prefixLen - the maximum length of the common prefix for score adjustment purposes
      scaling - the scaling factor for scoring common prefixes
      Returns:
      the specified function
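
      The roles of prefixLen and scaling can be illustrated with the textbook Jaro-Winkler adjustment: the Jaro score is raised by scaling times the common prefix length (capped at prefixLen) times the remaining gap to 1. The sketch below is not the library's implementation; the class JaroWinklerSketch is hypothetical, and the values 4 and 0.1 in main are the ones commonly quoted in the literature, not values read from DEFAULT_PREFIX_LENGTH or DEFAULT_SCALING_FACTOR.

```java
// Hypothetical stand-alone sketch; not part of the Similarity class.
final class JaroWinklerSketch {
    // Textbook Jaro similarity (two empty strings score 1 by assumption).
    static double jaro(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = Math.max(0, i - window); j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) continue;
            while (!bMatched[k]) k++;
            if (a.charAt(i) != b.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / a.length() + m / b.length() + (m - outOfOrder / 2) / m) / 3.0;
    }

    // Boosts the Jaro score for strings sharing a common prefix.
    static double jaroWinkler(String a, String b, int prefixLen, float scaling) {
        if (a == null || b == null) return 0.0;          // class-level null rule
        double base = jaro(a, b);
        int limit = Math.min(prefixLen, Math.min(a.length(), b.length()));
        int common = 0;
        while (common < limit && a.charAt(common) == b.charAt(common)) common++;
        return base + common * scaling * (1.0 - base);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("MARTHA", "MARHTA", 4, 0.1f));
    }
}
```

      In this formulation the result stays within [0,1] only while prefixLen times scaling does not exceed 1.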
    • levenshtein

      public static ScalarValuedFunction levenshtein(String left, String right)
      Builds a function computing the Levenshtein distance between the string values of the specified fields. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
      Parameters:
      left - the first record field
      right - the second record field
      Returns:
      the specified function
    • levenshtein

      public static ScalarValuedFunction levenshtein(ScalarValuedFunction left, ScalarValuedFunction right)
      Builds a function computing the Levenshtein distance between the specified string valued expressions. This value is normalized to the range 0 to 1 (inclusive) by dividing by the length of the longer input string.
      Parameters:
      left - the first string valued expression
      right - the second string valued expression
      Returns:
      the specified function
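
      The normalization described above can be sketched on plain strings as follows. This is a minimal illustration, not the library's implementation: the class LevenshteinSketch and its methods are hypothetical. The similarity method, which subtracts the normalized distance from 1 so that identical strings score 1, is an assumption made to line up with the class-level convention; the source does not state how the two are reconciled.

```java
// Hypothetical stand-alone sketch; not part of the Similarity class.
final class LevenshteinSketch {
    // Classic dynamic-programming edit distance using two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Distance divided by the length of the longer input.
    // Assumption: two empty strings have normalized distance 0.
    static double normalizedDistance(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 0.0 : (double) distance(a, b) / longer;
    }

    // Assumption: similarity is 1 minus the normalized distance; null scores 0.
    static double similarity(String a, String b) {
        if (a == null || b == null) return 0.0;
        return 1.0 - normalizedDistance(a, b);
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));
        System.out.println(similarity("kitten", "sitting"));
    }
}
```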
    • qgram

      public static ScalarValuedFunction qgram(String left, String right, int q)
      Builds a function computing the percentage of q-grams in common between the string values of the specified fields. This is the count of q-grams in common divided by the number of possible q-grams in the longer string.
      Parameters:
      left - the first record field
      right - the second record field
      q - the size of q-grams to compare
      Returns:
      the specified function
    • qgram

      public static ScalarValuedFunction qgram(ScalarValuedFunction left, ScalarValuedFunction right, int q)
      Builds a function computing the percentage of q-grams in common between the specified string valued expressions. This is the count of q-grams in common divided by the number of possible q-grams in the longer string.
      Parameters:
      left - the first string valued expression
      right - the second string valued expression
      q - the size of q-grams to compare
      Returns:
      the specified function
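
      A minimal sketch of the ratio described above, on plain strings. It is not the library's implementation: the class QgramSketch is hypothetical, and it assumes no padding, so a string of length n has n - q + 1 possible q-grams, and repeated q-grams are matched one-for-one. Returning 0 when the longer string is shorter than q is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the Similarity class.
final class QgramSketch {
    static double qgram(String left, String right, int q) {
        if (left == null || right == null) return 0.0;   // class-level null rule
        String longer = left.length() >= right.length() ? left : right;
        String shorter = left.length() >= right.length() ? right : left;
        int possible = longer.length() - q + 1;
        if (possible <= 0) return 0.0;                   // assumption
        // Count every q-gram of the longer string.
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < possible; i++) {
            counts.merge(longer.substring(i, i + q), 1, Integer::sum);
        }
        // Each q-gram of the shorter string consumes one occurrence.
        int common = 0;
        for (int i = 0; i + q <= shorter.length(); i++) {
            String gram = shorter.substring(i, i + q);
            Integer remaining = counts.get(gram);
            if (remaining != null && remaining > 0) {
                counts.put(gram, remaining - 1);
                common++;
            }
        }
        return (double) common / possible;
    }

    public static void main(String[] args) {
        System.out.println(qgram("night", "nacht", 2));
    }
}
```

      With q = 2, "night" and "nacht" share only "ht" out of four possible bigrams, so the score is 0.25.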
    • positionalQgram

      public static ScalarValuedFunction positionalQgram(String left, String right, int q, int maxDist)
      Builds a function computing the percentage of q-grams in common between the string values of the specified fields. A q-gram is counted as common only if its positions in the two strings differ by at most maxDist characters. The resulting count is then divided by the number of possible q-grams in the longer string.
      Parameters:
      left - the first record field
      right - the second record field
      q - the size of q-grams to compare
      maxDist - the maximum distance, in characters, a q-gram can be from its original position
      Returns:
      the specified function
    • positionalQgram

      public static ScalarValuedFunction positionalQgram(ScalarValuedFunction left, ScalarValuedFunction right, int q, int maxDist)
      Builds a function computing the percentage of q-grams in common between the specified string valued expressions. A q-gram is counted as common only if its positions in the two strings differ by at most maxDist characters. The resulting count is then divided by the number of possible q-grams in the longer string.
      Parameters:
      left - the first string valued expression
      right - the second string valued expression
      q - the size of q-grams to compare
      maxDist - the maximum distance, in characters, a q-gram can be from its original position
      Returns:
      the specified function
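
      The effect of maxDist can be illustrated with a small sketch on plain strings. It is not the library's implementation: the class PositionalQgramSketch is hypothetical, it assumes no padding, and it pairs q-grams greedily from left to right, which is one possible matching strategy among several.

```java
// Hypothetical stand-alone sketch; not part of the Similarity class.
final class PositionalQgramSketch {
    static double positionalQgram(String left, String right, int q, int maxDist) {
        if (left == null || right == null) return 0.0;   // class-level null rule
        String longer = left.length() >= right.length() ? left : right;
        String shorter = left.length() >= right.length() ? right : left;
        int possible = longer.length() - q + 1;
        if (possible <= 0) return 0.0;                   // assumption
        boolean[] used = new boolean[possible];
        int common = 0;
        for (int i = 0; i + q <= shorter.length(); i++) {
            // Only positions within maxDist of i in the longer string qualify.
            int lo = Math.max(0, i - maxDist);
            int hi = Math.min(possible - 1, i + maxDist);
            for (int j = lo; j <= hi; j++) {
                if (!used[j] && longer.regionMatches(j, shorter, i, q)) {
                    used[j] = true;
                    common++;
                    break;
                }
            }
        }
        return (double) common / possible;
    }

    public static void main(String[] args) {
        System.out.println(positionalQgram("abcxyz", "xyzabc", 2, 1));
        System.out.println(positionalQgram("abcxyz", "xyzabc", 2, 3));
    }
}
```

      For "abcxyz" and "xyzabc" every shared bigram is displaced by three positions, so nothing matches with maxDist = 1, while maxDist = 3 recovers four of the five possible bigrams.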
    • shorthand

      public static ScalarValuedFunction shorthand(String left, String right)
      Builds a function testing shorthand equivalence between the string values of the specified fields. This function returns 1 if the shorter value is a shorthand representation of the longer value, 0 if not. To be a shorthand representation, all characters in the shorter string must appear in the longer string in the same order. Additionally, if the longer string contains multiple words, a letter inside a word can match only if the first letter of that word was also matched. For example, "ABC" would not match "Acme SoftBall Corporation": the 'S' starting "SoftBall" is not matched, so the 'B' does not count as a match. However, "ABC" does match "Acme Soft Ball Corporation", as the 'B' starts a word.
      Parameters:
      left - the first record field
      right - the second record field
      Returns:
      the specified function
    • shorthand

      public static ScalarValuedFunction shorthand(ScalarValuedFunction left, ScalarValuedFunction right)
      Builds a function testing shorthand equivalence between the specified string valued expressions. This function returns 1 if the shorter value is a shorthand representation of the longer value, 0 if not. To be a shorthand representation, all characters in the shorter string must appear in the longer string in the same order. Additionally, if the longer string contains multiple words, a letter inside a word can match only if the first letter of that word was also matched. For example, "ABC" would not match "Acme SoftBall Corporation": the 'S' starting "SoftBall" is not matched, so the 'B' does not count as a match. However, "ABC" does match "Acme Soft Ball Corporation", as the 'B' starts a word.
      Parameters:
      left - the first string valued expression
      right - the second string valued expression
      Returns:
      the specified function
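
      The shorthand rule can be sketched as an ordered-subsequence search with a per-word gate. The code below is one reading of the description, not the library's implementation: the class ShorthandSketch is hypothetical, words are assumed to be separated by whitespace, and comparison is assumed to be case-insensitive (the source does not say). The search backtracks, which is fine for short names but can be slow on long inputs.

```java
// Hypothetical stand-alone sketch; not part of the Similarity class.
final class ShorthandSketch {
    static double shorthand(String left, String right) {
        if (left == null || right == null) return 0.0;   // class-level null rule
        String shorter = left.length() <= right.length() ? left : right;
        String longer = left.length() <= right.length() ? right : left;
        // The first-letter rule applies only when the longer value has several words.
        boolean multiWord = longer.trim().split("\\s+").length > 1;
        return search(shorter, 0, longer, 0, false, multiWord) ? 1.0 : 0.0;
    }

    // si and li are the current positions in the shorter and longer strings;
    // wordOpen is true once the first letter of the current word was matched.
    private static boolean search(String s, int si, String l, int li,
                                  boolean wordOpen, boolean multiWord) {
        if (si == s.length()) return true;    // every shorthand character placed
        if (li == l.length()) return false;   // ran out of longer string
        char c = l.charAt(li);
        if (Character.isWhitespace(c)) {
            return search(s, si, l, li + 1, false, multiWord);  // new word begins
        }
        boolean firstOfWord = li == 0 || Character.isWhitespace(l.charAt(li - 1));
        boolean same = Character.toLowerCase(c) == Character.toLowerCase(s.charAt(si));
        boolean allowed = !multiWord || firstOfWord || wordOpen;
        // Option 1: use this character of the longer string.
        if (same && allowed && search(s, si + 1, l, li + 1, true, multiWord)) {
            return true;
        }
        // Option 2: skip it; a skipped first letter leaves the word closed.
        return search(s, si, l, li + 1, wordOpen, multiWord);
    }

    public static void main(String[] args) {
        System.out.println(shorthand("ABC", "Acme SoftBall Corporation"));
        System.out.println(shorthand("ABC", "Acme Soft Ball Corporation"));
    }
}
```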
    • proximity

      public static ScalarValuedFunction proximity(String left, String right)
      Builds a function computing an adjusted quotient of the numeric values of the specified fields. The result is the smaller number divided by the larger. If exactly one value is zero, or if the signs differ, 0 is returned.
      Parameters:
      left - the first record field
      right - the second record field
      Returns:
      the specified function
    • proximity

      public static ScalarValuedFunction proximity(ScalarValuedFunction left, ScalarValuedFunction right)
      Builds a function computing an adjusted quotient of the numeric values of the specified expressions. The result is the smaller number divided by the larger. If exactly one value is zero, or if the signs differ, 0 is returned.
      Parameters:
      left - the first string valued expression
      right - the second string valued expression
      Returns:
      the specified function
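
      The adjusted quotient can be sketched on plain doubles. This is an illustration, not the library's implementation: the class ProximitySketch is hypothetical. Two points are assumptions because the source leaves them open: two zeros are scored as 1, and "smaller" and "larger" are taken by magnitude so that two negative values also yield a result in [0,1].

```java
// Hypothetical stand-alone sketch; not part of the Similarity class.
final class ProximitySketch {
    static double proximity(double left, double right) {
        if (left == 0.0 && right == 0.0) return 1.0;     // assumption: equal values score 1
        if (left == 0.0 || right == 0.0) return 0.0;     // exactly one value is zero
        if ((left < 0.0) != (right < 0.0)) return 0.0;   // signs differ
        double a = Math.abs(left);
        double b = Math.abs(right);
        return Math.min(a, b) / Math.max(a, b);          // assumption: compare magnitudes
    }

    public static void main(String[] args) {
        System.out.println(proximity(50.0, 100.0));
        System.out.println(proximity(3.0, -3.0));
    }
}
```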