java.lang.Object
    com.pervasive.datarush.analytics.text.TextTokenUtil
-
public class TextTokenUtil extends Object
Utility methods for operating on TextContainer objects.
-
Constructor Summary
Constructor    Description
TextTokenUtil()
-
Method Summary
Modifier and Type    Method    Description
static NGramMap calcNGramFreq(TextContainer text, int n)
    Creates an n-gram frequency model based on the contents of the TextContainer.
static NGramMap calcNGramFreq(TextContainer text, int n, Set<NGram> nGramSet)
    Creates an n-gram frequency model containing the specified set of terms based on the contents of the TextContainer.
static WordMap calcWordFreq(TextContainer text)
    Creates a term frequency model based on the contents of the TextContainer.
static WordMap calcWordFreq(TextContainer text, Set<String> wordSet)
    Creates a term frequency model containing the specified set of terms based on the contents of the TextContainer.
static int countElementType(TextContainer text, TextElementType type)
    Counts the number of elements of a specific type in the TextContainer.
static TextContainer createTreeFromList(List<TextContainer> nodes)
    Creates a TextContainer from a list of TextContainer nodes.
static TextContainer createTreeFromString(String textTokens)
static Set<String> genBagOfWords(TextContainer text)
    Creates a bag of words based on the contents of the TextContainer.
static List<NGram> generateNGramList(TextContainer text, int n)
    Lists the unique n-grams contained in the TextContainer.
static List<String> generateWordList(TextContainer text)
    Lists the unique words contained in the TextContainer.
static <K,V extends Comparable<V>> List<Map.Entry<K,V>> sortMapByValue(Map<K,V> map)
    Sorts a map by the values associated with each key and returns a list of the sorted entries.
-
Method Detail
-
generateWordList
public static List<String> generateWordList(TextContainer text)
Lists the unique words contained in the TextContainer.
Parameters:
text - the container of tokenized text
Returns:
a list of unique word strings
-
generateNGramList
public static List<NGram> generateNGramList(TextContainer text, int n)
Lists the unique n-grams contained in the TextContainer.
Parameters:
text - the container of tokenized text
n - the degree of the n-grams
Returns:
a list of unique n-grams
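As an illustration of what "unique n-grams of degree n" means, the following standalone sketch slides a window of n tokens over a plain list of strings and keeps each distinct window once. It does not use TextContainer or NGram; the class NGramListSketch, the method uniqueNGrams, the space-joined string representation, and the first-appearance ordering are all assumptions made for this example, not the library's behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NGramListSketch {
    // Hypothetical helper: returns the distinct n-grams of a token list in
    // order of first appearance. Each n-gram is n tokens joined by spaces.
    static List<String> uniqueNGrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        Set<String> seen = new LinkedHashSet<>();
        // Slide a window of n tokens across the list; stop when fewer than n remain.
        for (int i = 0; i + n <= tokens.size(); i++) {
            seen.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("to", "be", "or", "not", "to", "be");
        // prints [to be, be or, or not, not to]
        System.out.println(uniqueNGrams(tokens, 2));
    }
}
```

The bigram "to be" occurs twice in the input but appears once in the result, which is the sense of "unique" assumed here.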
-
countElementType
public static int countElementType(TextContainer text, TextElementType type)
Counts the number of elements of a specific type in the TextContainer.
Parameters:
text - the container of tokenized text
type - the type of text element to count
Returns:
the count of the specific text elements
-
genBagOfWords
public static Set<String> genBagOfWords(TextContainer text)
Creates a bag of words based on the contents of the TextContainer.
Parameters:
text - the container of tokenized text
Returns:
the bag of words
-
calcWordFreq
public static WordMap calcWordFreq(TextContainer text)
Creates a term frequency model based on the contents of the TextContainer.
Parameters:
text - the container of tokenized text
Returns:
the term frequency model
-
calcWordFreq
public static WordMap calcWordFreq(TextContainer text, Set<String> wordSet)
Creates a term frequency model containing the specified set of terms based on the contents of the TextContainer.
Parameters:
text - the container of tokenized text
wordSet - the set of terms to include in the model
Returns:
the term frequency model
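The idea behind a term frequency model restricted to a given set of terms can be sketched without the library as a count over plain strings. The class WordFreqSketch and the method wordFreq are hypothetical. Whether WordMap stores raw counts or relative frequencies, and whether terms from the set that never occur in the text are reported, is not stated here; this sketch assumes raw counts and omits absent terms.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class WordFreqSketch {
    // Hypothetical helper: counts how often each token occurs. When
    // vocabulary is non-null, only tokens contained in it are counted.
    static Map<String, Integer> wordFreq(List<String> tokens, Set<String> vocabulary) {
        // TreeMap keeps keys sorted so the printed result is deterministic.
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            if (vocabulary == null || vocabulary.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "cat", "saw", "the", "dog");
        // Unrestricted count, analogous to the one-argument overload.
        System.out.println(wordFreq(tokens, null));
        // Restricted count: "bird" never occurs, so it is absent from the result.
        // prints {dog=1, the=2}
        System.out.println(wordFreq(tokens, Set.of("the", "dog", "bird")));
    }
}
```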
-
calcNGramFreq
public static NGramMap calcNGramFreq(TextContainer text, int n)
Creates an n-gram frequency model based on the contents of the TextContainer.
Parameters:
text - the container of tokenized text
n - the degree of the n-grams
Returns:
the n-gram frequency model
-
calcNGramFreq
public static NGramMap calcNGramFreq(TextContainer text, int n, Set<NGram> nGramSet)
Creates an n-gram frequency model containing the specified set of terms based on the contents of the TextContainer.
Parameters:
text - the container of tokenized text
n - the degree of the n-grams
nGramSet - the set of n-grams to include in the model
Returns:
the n-gram frequency model
-
sortMapByValue
public static <K,V extends Comparable<V>> List<Map.Entry<K,V>> sortMapByValue(Map<K,V> map)
Sorts a map by the values associated with each key and returns a list of the sorted entries.
Parameters:
map - the map of entries that will be sorted
Returns:
a list of the sorted map entries
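A method with this signature can be written with the standard library alone. The sketch below is one possible way to do it, not the library's source; the class SortByValueSketch and the method sortByValue are hypothetical names, and ascending order is an assumption, since the sort direction is not stated above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SortByValueSketch {
    // Copies the entries into a list and sorts them by value, ascending
    // (assumed direction). The input map itself is not reordered.
    static <K, V extends Comparable<V>> List<Map.Entry<K, V>> sortByValue(Map<K, V> map) {
        List<Map.Entry<K, V>> entries = new ArrayList<>(map.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("b", 3);
        counts.put("a", 1);
        counts.put("c", 2);
        // prints [a=1, c=2, b=3]
        System.out.println(sortByValue(counts));
    }
}
```

Returning a list of entries rather than a map keeps the sorted order explicit, because a general Map makes no ordering guarantee.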
-
createTreeFromList
public static TextContainer createTreeFromList(List<TextContainer> nodes)
Creates a TextContainer from a list of TextContainer nodes. The list must be a pre-order traversal of the original tree.
Parameters:
nodes - a list of TextContainers representing a pre-order traversal
Returns:
the tokenized text tree
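To show why a pre-order listing is enough to rebuild a tree, here is a standalone sketch on a toy node type. It assumes each flattened node records how many direct children it has; how TextContainer nodes actually carry that structural information is not described here, so FlatNode, TreeNode, and rebuild are hypothetical and do not reflect the library's internals.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class PreOrderTreeSketch {
    // Hypothetical flattened node: a label plus its number of direct children.
    static final class FlatNode {
        final String label;
        final int childCount;
        FlatNode(String label, int childCount) {
            this.label = label;
            this.childCount = childCount;
        }
    }

    // Hypothetical tree node produced by the rebuild.
    static final class TreeNode {
        final String label;
        final List<TreeNode> children = new ArrayList<>();
        TreeNode(String label) {
            this.label = label;
        }
    }

    // Rebuilds the tree from a pre-order list. A parent always precedes its
    // children in pre-order, so consuming the list front to back suffices.
    static TreeNode rebuild(List<FlatNode> preOrder) {
        Iterator<FlatNode> it = preOrder.iterator();
        if (!it.hasNext()) {
            throw new IllegalArgumentException("list is empty");
        }
        TreeNode root = build(it);
        if (it.hasNext()) {
            throw new IllegalArgumentException("nodes left over after the root subtree");
        }
        return root;
    }

    // Consumes one node, then recursively consumes each of its subtrees.
    // it.next() throws NoSuchElementException if the list ends too early.
    private static TreeNode build(Iterator<FlatNode> it) {
        FlatNode flat = it.next();
        TreeNode node = new TreeNode(flat.label);
        for (int i = 0; i < flat.childCount; i++) {
            node.children.add(build(it));
        }
        return node;
    }

    public static void main(String[] args) {
        // A document with two sentences: the first has two words, the second one.
        List<FlatNode> flat = List.of(
                new FlatNode("doc", 2),
                new FlatNode("s1", 2), new FlatNode("w1", 0), new FlatNode("w2", 0),
                new FlatNode("s2", 1), new FlatNode("w3", 0));
        TreeNode root = rebuild(flat);
        System.out.println(root.label + " has " + root.children.size() + " children");
    }
}
```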
-
createTreeFromString
public static TextContainer createTreeFromString(String textTokens)
-