java.lang.Object
com.pervasive.datarush.analytics.text.TextTokenUtil
Utility methods for operating on TextContainer objects.
-
Constructor Summary
Constructors
TextTokenUtil()
Method Summary
Modifier and Type   Method   Description
static NGramMap calcNGramFreq(TextContainer text, int n)
    Creates an n-gram frequency model based on the contents of the TextContainer.
static NGramMap calcNGramFreq(TextContainer text, int n, Set<NGram> nGramSet)
    Creates an n-gram frequency model containing the specified set of terms based on the contents of the TextContainer.
static WordMap calcWordFreq(TextContainer text)
    Creates a term frequency model based on the contents of the TextContainer.
static WordMap calcWordFreq(TextContainer text, Set<String> wordSet)
    Creates a term frequency model containing the specified set of terms based on the contents of the TextContainer.
static int countElementType(TextContainer text, TextElementType type)
    Counts the number of elements of a specific type in the TextContainer.
static TextContainer createTreeFromList(List<TextContainer> nodes)
    Creates a TextContainer from a list of TextContainer nodes.
static TextContainer createTreeFromString(String textTokens)
genBagOfWords(TextContainer text)
    Creates a bag of words based on the contents of the TextContainer.
generateNGramList(TextContainer text, int n)
    Lists the unique n-grams contained in the TextContainer.
generateWordList(TextContainer text)
    Lists the unique words contained in the TextContainer.
static <K,V extends Comparable<V>> List<Map.Entry<K,V>> sortMapByValue(Map<K,V> map)
    Sorts a map by the values associated with each key and returns a list of the sorted entries.
-
Constructor Details
-
TextTokenUtil
public TextTokenUtil()
-
Method Details
-
generateWordList
Lists the unique words contained in the TextContainer.
Parameters:
text - the container of tokenized text
Returns:
a list of unique word strings
-
generateNGramList
Lists the unique n-grams contained in the TextContainer.
Parameters:
text - the container of tokenized text
n - the degree of the n-grams
Returns:
a list of unique n-grams
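To illustrate what "unique n-grams of degree n" means, here is a minimal sketch that works on a plain list of token strings. It is not the TextTokenUtil implementation and does not use TextContainer or NGram; the class NGramListSketch, the method uniqueNGrams, and the choice to represent an n-gram as a List<String> are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not the library's implementation.
final class NGramListSketch {

    // Returns the distinct n-grams of a token sequence, in order of first appearance.
    // Each n-gram is an immutable list of n consecutive tokens (an assumed representation).
    static List<List<String>> uniqueNGrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        Set<List<String>> seen = new LinkedHashSet<>();
        // Slide a window of width n across the tokens; the set drops repeats.
        for (int i = 0; i + n <= tokens.size(); i++) {
            seen.add(List.copyOf(tokens.subList(i, i + n)));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("to", "be", "or", "not", "to", "be");
        // The bigram (to, be) occurs twice in the input but is listed once.
        System.out.println(uniqueNGrams(tokens, 2));
    }
}
```

If the token list is shorter than n, the loop body never runs and the result is an empty list.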
-
countElementType
Counts the number of elements of a specific type in the TextContainer.
Parameters:
text - the container of tokenized text
type - the type of text element to count
Returns:
the count of the specific text elements
-
genBagOfWords
Creates a bag of words based on the contents of the TextContainer.
Parameters:
text - the container of tokenized text
Returns:
the bag of words
-
calcWordFreq
Creates a term frequency model based on the contents of the TextContainer.
Parameters:
text - the container of tokenized text
Returns:
the term frequency model
-
calcWordFreq
Creates a term frequency model containing the specified set of terms based on the contents of the TextContainer.
Parameters:
text - the container of tokenized text
wordSet - the set of terms to include in the model
Returns:
the term frequency model
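The difference between the two calcWordFreq overloads can be sketched with plain JDK collections. The following is an assumption-laden illustration, not the library's code: WordFreqSketch and wordFreq are hypothetical names, a Map<String, Integer> stands in for WordMap, and the sketch omits terms from wordSet that never occur, which may differ from what the real method does.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; WordMap and TextContainer are not used here.
final class WordFreqSketch {

    // Counts how often each token occurs.
    static Map<String, Integer> wordFreq(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Counts only tokens that appear in wordSet.
    // Assumption: terms in wordSet that never occur are left out of the result.
    static Map<String, Integer> wordFreq(List<String> tokens, Set<String> wordSet) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            if (wordSet.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "cat", "saw", "the", "dog");
        System.out.println(wordFreq(tokens));
        System.out.println(wordFreq(tokens, Set.of("the", "dog")));
    }
}
```

Restricting the model to a fixed term set is useful when several documents must be compared over the same vocabulary.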
-
calcNGramFreq
Creates an n-gram frequency model based on the contents of the TextContainer.
Parameters:
text - the container of tokenized text
n - the degree of the n-grams
Returns:
the n-gram frequency model
-
calcNGramFreq
Creates an n-gram frequency model containing the specified set of terms based on the contents of the TextContainer.
Parameters:
text - the container of tokenized text
n - the degree of the n-grams
nGramSet - the set of n-grams to include in the model
Returns:
the n-gram frequency model
-
sortMapByValue
Sorts a map by the values associated with each key and returns a list of the sorted entries.
Parameters:
map - the map of entries that will be sorted
Returns:
a list of the sorted map entries
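A method with this signature can be written in a few lines against the JDK. The sketch below is one plausible shape, not the library's source; the class name SortByValueSketch is hypothetical, and ascending order is an assumption because the documentation does not state the sort direction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; the sort direction (ascending) is an assumption.
final class SortByValueSketch {

    static <K, V extends Comparable<V>> List<Map.Entry<K, V>> sortMapByValue(Map<K, V> map) {
        // Copy the entry set into a list so it can be ordered; the map itself is not modified.
        List<Map.Entry<K, V>> entries = new ArrayList<>(map.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 3);
        counts.put("cat", 1);
        counts.put("dog", 2);
        for (Map.Entry<String, Integer> e : sortMapByValue(counts)) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

For a most-frequent-first ranking, the list can be reversed with Collections.reverse after sorting.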
-
createTreeFromList
Creates a TextContainer from a list of TextContainer nodes. The list must be a pre-order traversal of the original tree.
Parameters:
nodes - a list of TextContainers representing a pre-order traversal
Returns:
the tokenized text tree
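The pre-order requirement matters because a flat list can only be turned back into a tree if each parent appears before its descendants. The sketch below shows the general technique on hypothetical types; it does not use TextContainer, and the assumption that each flattened node records its number of direct children is made for the example only. How the library recovers parent-child structure is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; FlatNode and TreeNode are hypothetical stand-ins.
final class PreOrderTreeSketch {

    // A flattened node: a label plus the number of direct children (assumed encoding).
    static final class FlatNode {
        final String label;
        final int childCount;
        FlatNode(String label, int childCount) {
            this.label = label;
            this.childCount = childCount;
        }
    }

    // A rebuilt tree node.
    static final class TreeNode {
        final String label;
        final List<TreeNode> children = new ArrayList<>();
        TreeNode(String label) {
            this.label = label;
        }
    }

    // Rebuilds the tree from a pre-order list; the first element is the root.
    static TreeNode build(List<FlatNode> preOrder) {
        if (preOrder.isEmpty()) {
            throw new IllegalArgumentException("list must not be empty");
        }
        int[] cursor = {0};
        TreeNode root = buildSubtree(preOrder, cursor);
        if (cursor[0] != preOrder.size()) {
            throw new IllegalArgumentException("list contains nodes outside the root's subtree");
        }
        return root;
    }

    // Consumes one node and then, recursively, each of its children in order.
    private static TreeNode buildSubtree(List<FlatNode> preOrder, int[] cursor) {
        if (cursor[0] >= preOrder.size()) {
            throw new IllegalArgumentException("list ended before the tree was complete");
        }
        FlatNode flat = preOrder.get(cursor[0]++);
        TreeNode node = new TreeNode(flat.label);
        for (int i = 0; i < flat.childCount; i++) {
            node.children.add(buildSubtree(preOrder, cursor));
        }
        return node;
    }
}
```

Because every subtree occupies a contiguous run of the list, a single left-to-right pass with one cursor is enough to rebuild the whole tree.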
-
createTreeFromString
-