public class Similarity extends Object
These functions return a floating-point value normalized to the range [0,1], with 0 representing no similarity at all and 1 representing an exact match. Null-valued inputs are considered totally dissimilar to any other string (including the null value) and always yield 0.
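As an illustration of the [0,1] convention and the null rule, the following is a minimal sketch of a normalized Levenshtein score. The class name `NormalizedLevenshteinSketch`, the normalization `1 - distance / max(length)`, and the treatment of two empty strings as an exact match are assumptions made for this example; the formula this class actually uses is not stated here.

```java
// Hypothetical sketch, not the library's implementation.
final class NormalizedLevenshteinSketch {

    // Classic dynamic-programming edit distance (insert, delete, substitute).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b[0..j) from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Maps the distance to [0,1]; a null input always scores 0, even against null.
    static double similarity(String a, String b) {
        if (a == null || b == null) {
            return 0.0;
        }
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // assumption: two empty strings count as an exact match
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(similarity("kitten", "sitting"));
        System.out.println(similarity(null, null));
    }
}
```

Dividing by the longer length keeps the score in range because the edit distance can never exceed the length of the longer string.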
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_MAX_DISTANCE: The default maximum distance for the positional q-gram measure. |
static int | DEFAULT_PREFIX_LENGTH: The default prefix length for the Jaro-Winkler measure. |
static int | DEFAULT_Q: The default q-gram size for the q-gram and positional q-gram measures. |
static float | DEFAULT_SCALING_FACTOR: The default scaling factor for the Jaro-Winkler measure. |
static String | PROP_MAX_DISTANCE: Name of the property specifying maximum distance for the positional q-gram measure. |
static String | PROP_PREFIX_LENGTH: Name of the property specifying prefix length for the Jaro-Winkler measure. |
static String | PROP_Q: Name of the property specifying the q-gram size for the q-gram and positional q-gram measures. |
static String | PROP_SCALING_FACTOR: Name of the property specifying scaling factor for the Jaro-Winkler measure. |
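To make the role of the q-gram size concrete, here is a rough sketch of a q-gram overlap score. The class `QgramSketch`, the absence of padding, the multiset counting, and the choice of the larger q-gram count as the denominator are all assumptions for illustration; the exact definition used by this class is not given here, and the positional variant (which also limits how far a q-gram may move) is not covered.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the library's implementation.
final class QgramSketch {

    // Counts every substring of length q (no padding is applied; an assumption).
    static Map<String, Integer> qgrams(String s, int q) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + q <= s.length(); i++) {
            counts.merge(s.substring(i, i + q), 1, Integer::sum);
        }
        return counts;
    }

    // Number of q-grams a string of this length produces.
    static int count(String s, int q) {
        return Math.max(0, s.length() - q + 1);
    }

    // Fraction of q-grams in common, relative to the string with more q-grams.
    static double similarity(String a, String b, int q) {
        if (a == null || b == null) {
            return 0.0; // null inputs are totally dissimilar
        }
        if (q < 1) {
            throw new IllegalArgumentException("q must be positive");
        }
        if (a.equals(b)) {
            return 1.0; // exact match, even when both are shorter than q
        }
        int total = Math.max(count(a, q), count(b, q));
        if (total == 0) {
            return 0.0;
        }
        Map<String, Integer> left = qgrams(a, q);
        Map<String, Integer> right = qgrams(b, q);
        int common = 0;
        for (Map.Entry<String, Integer> e : left.entrySet()) {
            common += Math.min(e.getValue(), right.getOrDefault(e.getKey(), 0));
        }
        return (double) common / total;
    }

    public static void main(String[] args) {
        System.out.println(similarity("night", "nacht", 2));
    }
}
```

A larger q makes the measure stricter, since longer substrings must agree before they count as shared.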
Constructor and Description |
---|
Similarity() |
Modifier and Type | Method and Description |
---|---|
static ScalarValuedFunction | contains(ScalarValuedFunction left, ScalarValuedFunction right): Builds a function testing whether either of the specified string-valued expressions contains the other. |
static ScalarValuedFunction | contains(String left, String right): Builds a function testing whether either of the specified fields has a string value containing the other. |
static ScalarValuedFunction | damerauLevenshtein(ScalarValuedFunction left, ScalarValuedFunction right): Builds a function computing the Damerau-Levenshtein distance between the specified string-valued expressions. |
static ScalarValuedFunction | damerauLevenshtein(String left, String right): Builds a function computing the Damerau-Levenshtein distance between the string values of the specified fields. |
static ScalarValuedFunction | exact(ScalarValuedFunction left, ScalarValuedFunction right): Builds a function testing whether the specified string-valued expressions are equal. |
static ScalarValuedFunction | exact(String left, String right): Builds a function testing exact equality between the string values of the specified fields. |
static ScalarValuedFunction | jaro(ScalarValuedFunction left, ScalarValuedFunction right): Builds a function computing the Jaro distance between the specified string-valued expressions. |
static ScalarValuedFunction | jaro(String left, String right): Builds a function computing the Jaro distance between the string values of the specified fields. |
static ScalarValuedFunction | jaroWinkler(ScalarValuedFunction left, ScalarValuedFunction right, int prefixLen, float scaling): Builds a function computing the Jaro-Winkler distance between the specified string-valued expressions. |
static ScalarValuedFunction | jaroWinkler(String left, String right, int prefixLen, float scaling): Builds a function computing the Jaro-Winkler distance between the string values of the specified fields. |
static ScalarValuedFunction | levenshtein(ScalarValuedFunction left, ScalarValuedFunction right): Builds a function computing the Levenshtein distance between the specified string-valued expressions. |
static ScalarValuedFunction | levenshtein(String left, String right): Builds a function computing the Levenshtein distance between the string values of the specified fields. |
static ScalarValuedFunction | positionalQgram(ScalarValuedFunction left, ScalarValuedFunction right, int q, int maxDist): Builds a function computing the percentage of q-grams in common between the specified string-valued expressions. |
static ScalarValuedFunction | positionalQgram(String left, String right, int q, int maxDist): Builds a function computing the percentage of q-grams in common between the string values of the specified fields. |
static ScalarValuedFunction | proximity(ScalarValuedFunction left, ScalarValuedFunction right): Builds a function computing an adjusted quotient of the numeric values of the specified expressions. |
static ScalarValuedFunction | proximity(String left, String right): Builds a function computing an adjusted quotient of the numeric values of the specified fields. |
static ScalarValuedFunction | qgram(ScalarValuedFunction left, ScalarValuedFunction right, int q): Builds a function computing the percentage of q-grams in common between the specified string-valued expressions. |
static ScalarValuedFunction | qgram(String left, String right, int q): Builds a function computing the percentage of q-grams in common between the string values of the specified fields. |
static ScalarValuedFunction | shorthand(ScalarValuedFunction left, ScalarValuedFunction right): Builds a function testing shorthand equivalence between the specified string-valued expressions. |
static ScalarValuedFunction | shorthand(String left, String right): Builds a function testing shorthand equivalence between the string values of the specified fields. |
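The shorthand test can be read as a constrained subsequence match: every character of the shorter value must appear in the longer value in the same order and, when the longer value has several words, a character inside a word counts only if that word's first letter was matched too. The sketch below is one way to express that rule. The class `ShorthandSketch`, the case-insensitive comparison, whitespace as the word separator, and the rejection of an empty shorter value are assumptions, not documented behavior.

```java
// Hypothetical sketch, not the library's implementation.
final class ShorthandSketch {

    // Returns 1 if the shorter value is a shorthand of the longer one, else 0.
    static int shorthand(String left, String right) {
        if (left == null || right == null) {
            return 0; // null inputs never match
        }
        boolean leftShorter = left.length() <= right.length();
        String shorter = (leftShorter ? left : right).trim();
        String longer = (leftShorter ? right : left).trim();
        return matches(shorter, longer) ? 1 : 0;
    }

    static boolean matches(String s, String l) {
        int m = s.length();
        int n = l.length();
        if (m == 0) {
            return false; // assumption: an empty value is not a shorthand
        }
        boolean multiWord = l.chars().anyMatch(Character::isWhitespace);
        // ok[i][j][o]: can s[i..] be matched inside l[j..], where o == 1 means the
        // first letter of the word containing l[j] has already been matched.
        boolean[][][] ok = new boolean[m + 1][n + 1][2];
        for (int j = 0; j <= n; j++) {
            ok[m][j][0] = true;
            ok[m][j][1] = true;
        }
        for (int j = n - 1; j >= 0; j--) {
            char c = l.charAt(j);
            boolean space = Character.isWhitespace(c);
            boolean wordStart = !space && (j == 0 || Character.isWhitespace(l.charAt(j - 1)));
            for (int i = m - 1; i >= 0; i--) {
                for (int o = 0; o <= 1; o++) {
                    if (space) {
                        ok[i][j][o] = ok[i][j + 1][0]; // a new word starts unmatched
                        continue;
                    }
                    boolean usable = !multiWord || wordStart || o == 1;
                    // Option 1: skip this character of the longer value.
                    boolean result = ok[i][j + 1][wordStart ? 0 : o];
                    // Option 2: consume it, if the word rule allows and the letters agree.
                    if (!result && usable && sameLetter(s.charAt(i), c)) {
                        result = ok[i + 1][j + 1][wordStart ? 1 : o];
                    }
                    ok[i][j][o] = result;
                }
            }
        }
        return ok[0][0][0];
    }

    // Assumption: letters are compared without regard to case.
    static boolean sameLetter(char a, char b) {
        return Character.toLowerCase(a) == Character.toLowerCase(b);
    }

    public static void main(String[] args) {
        System.out.println(shorthand("ABC", "Acme SoftBall Corporation"));
        System.out.println(shorthand("ABC", "Acme Soft Ball Corporation"));
    }
}
```

The table-based search is used instead of a greedy scan because consuming a letter early inside one word can block a match that needs the same letter as the first letter of a later word.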
public static final String PROP_Q
public static final int DEFAULT_Q
public static final String PROP_MAX_DISTANCE
public static final int DEFAULT_MAX_DISTANCE
public static final String PROP_PREFIX_LENGTH
public static final String PROP_SCALING_FACTOR
public static final int DEFAULT_PREFIX_LENGTH
public static final float DEFAULT_SCALING_FACTOR
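The prefix length and scaling factor parameterize the Winkler adjustment applied on top of a Jaro score. In its standard form the adjusted score is `jaro + l * p * (1 - jaro)`, where `l` is the length of the common prefix capped at the prefix length and `p` is the scaling factor. The sketch below applies that standard formula to an already computed Jaro score. The class `WinklerAdjustmentSketch` is hypothetical, and the values 4 and 0.1 used in `main` are the conventional choices from the literature, not confirmed values of DEFAULT_PREFIX_LENGTH or DEFAULT_SCALING_FACTOR.

```java
// Hypothetical sketch, not the library's implementation.
final class WinklerAdjustmentSketch {

    // Boosts a Jaro score for strings that share a common prefix.
    static double adjust(double jaro, String a, String b, int prefixLen, float scaling) {
        if (a == null || b == null) {
            return 0.0; // null inputs are totally dissimilar
        }
        int limit = Math.min(prefixLen, Math.min(a.length(), b.length()));
        int common = 0;
        while (common < limit && a.charAt(common) == b.charAt(common)) {
            common++; // count matching leading characters, up to prefixLen
        }
        return jaro + common * scaling * (1.0 - jaro);
    }

    public static void main(String[] args) {
        // 4 and 0.1f are conventional values, used here only as an assumption.
        System.out.println(adjust(0.5, "abcd", "abxx", 4, 0.1f));
    }
}
```

The result stays within [0,1] as long as the product of the prefix length and the scaling factor does not exceed 1.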
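public static ScalarValuedFunction contains(String left, String right)
Builds a function testing whether either of the specified fields has a string value containing the other. The result is 1 if one value is contained in the other, 0 if not.
Parameters:
left - the first record field
right - the second record field

public static ScalarValuedFunction contains(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function testing whether either of the specified string-valued expressions contains the other. The result is 1 if one value is contained in the other, 0 if not.
Parameters:
left - the first string-valued expression
right - the second string-valued expression

public static ScalarValuedFunction damerauLevenshtein(String left, String right)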
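Builds a function computing the Damerau-Levenshtein distance between the string values of the specified fields.
Parameters:
left - the first record field
right - the second record field

public static ScalarValuedFunction damerauLevenshtein(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function computing the Damerau-Levenshtein distance between the specified string-valued expressions.
Parameters:
left - the first string-valued expression
right - the second string-valued expression

public static ScalarValuedFunction exact(String left, String right)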
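Builds a function testing exact equality between the string values of the specified fields. The result is 1 if the values are equal, 0 if not.
Parameters:
left - the first record field
right - the second record field

public static ScalarValuedFunction exact(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function testing whether the specified string-valued expressions are equal. The result is 1 if the values are equal, 0 if not.
Parameters:
left - the first string-valued expression
right - the second string-valued expression

public static ScalarValuedFunction jaro(String left, String right)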
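Builds a function computing the Jaro distance between the string values of the specified fields.
Parameters:
left - the first record field
right - the second record field

public static ScalarValuedFunction jaro(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function computing the Jaro distance between the specified string-valued expressions.
Parameters:
left - the first string-valued expression
right - the second string-valued expression

public static ScalarValuedFunction jaroWinkler(String left, String right, int prefixLen, float scaling)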
Builds a function computing the Jaro-Winkler distance between the string values of the specified fields.
Parameters:
left - the first record field
right - the second record field
prefixLen - the maximum length of the common prefix for score adjustment purposes
scaling - the scaling factor for scoring common prefixes

public static ScalarValuedFunction jaroWinkler(ScalarValuedFunction left, ScalarValuedFunction right, int prefixLen, float scaling)
Builds a function computing the Jaro-Winkler distance between the specified string-valued expressions.
Parameters:
left - the first string-valued expression
right - the second string-valued expression
prefixLen - the maximum length of the common prefix for score adjustment purposes
scaling - the scaling factor for scoring common prefixes

public static ScalarValuedFunction levenshtein(String left, String right)
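Builds a function computing the Levenshtein distance between the string values of the specified fields.
Parameters:
left - the first record field
right - the second record field

public static ScalarValuedFunction levenshtein(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function computing the Levenshtein distance between the specified string-valued expressions.
Parameters:
left - the first string-valued expression
right - the second string-valued expression

public static ScalarValuedFunction qgram(String left, String right, int q)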
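Builds a function computing the percentage of q-grams in common between the string values of the specified fields.
Parameters:
left - the first record field
right - the second record field
q - the size of q-grams to compare

public static ScalarValuedFunction qgram(ScalarValuedFunction left, ScalarValuedFunction right, int q)
Builds a function computing the percentage of q-grams in common between the specified string-valued expressions.
Parameters:
left - the first string-valued expression
right - the second string-valued expression
q - the size of q-grams to compare

public static ScalarValuedFunction positionalQgram(String left, String right, int q, int maxDist)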
Builds a function computing the percentage of q-grams in common between the string values of the specified fields.
Parameters:
left - the first record field
right - the second record field
q - the size of q-grams to compare
maxDist - the maximum distance, in characters, a q-gram can be from its original position

public static ScalarValuedFunction positionalQgram(ScalarValuedFunction left, ScalarValuedFunction right, int q, int maxDist)
Builds a function computing the percentage of q-grams in common between the specified string-valued expressions.
Parameters:
left - the first string-valued expression
right - the second string-valued expression
q - the size of q-grams to compare
maxDist - the maximum distance, in characters, a q-gram can be from its original position

public static ScalarValuedFunction shorthand(String left, String right)
Builds a function testing shorthand equivalence between the string values of the specified fields. The result is 1 if the shorter value is a shorthand representation of the longer value, 0 if not. To be a shorthand representation, all the characters in the shorter string must appear in the longer string in the same order. Additionally, if the longer string contains multiple words, a letter within a word can match only if the first letter of that word was also matched. For example, "ABC" would not match "Acme SoftBall Corporation": the 'S' starting "SoftBall" is not matched, so the 'B' does not count as a match. However, "ABC" does match "Acme Soft Ball Corporation", as the 'B' starts a word.
Parameters:
left - the first record field
right - the second record field

public static ScalarValuedFunction shorthand(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function testing shorthand equivalence between the specified string-valued expressions. The result is 1 if the shorter value is a shorthand representation of the longer value, 0 if not. To be a shorthand representation, all the characters in the shorter string must appear in the longer string in the same order. Additionally, if the longer string contains multiple words, a letter within a word can match only if the first letter of that word was also matched. For example, "ABC" would not match "Acme SoftBall Corporation": the 'S' starting "SoftBall" is not matched, so the 'B' does not count as a match. However, "ABC" does match "Acme Soft Ball Corporation", as the 'B' starts a word.
Parameters:
left - the first string-valued expression
right - the second string-valued expression

public static ScalarValuedFunction proximity(String left, String right)
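Builds a function computing an adjusted quotient of the numeric values of the specified fields. [...] 0 is returned.
Parameters:
left - the first record field
right - the second record field

public static ScalarValuedFunction proximity(ScalarValuedFunction left, ScalarValuedFunction right)
Builds a function computing an adjusted quotient of the numeric values of the specified expressions. [...] 0 is returned.
Parameters:
left - the first string-valued expression
right - the second string-valued expression

Copyright © 2021 Actian Corporation. All rights reserved.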