Class TextTokenUtil

java.lang.Object
com.pervasive.datarush.analytics.text.TextTokenUtil

public class TextTokenUtil extends Object
Utility methods for operating on TextContainer objects.
  • Constructor Details

    • TextTokenUtil

      public TextTokenUtil()
  • Method Details

    • generateWordList

      public static List<String> generateWordList(TextContainer text)
      Lists the unique words contained in the TextContainer.
      Parameters:
      text - the container of tokenized text
      Returns:
      a list of unique word strings
    • generateNGramList

      public static List<NGram> generateNGramList(TextContainer text, int n)
      Lists the unique n-grams contained in the TextContainer.
      Parameters:
      text - the container of tokenized text
      n - the degree of the n-grams (the number of consecutive tokens in each n-gram)
      Returns:
      a list of unique n-grams
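
To illustrate what a list of unique n-grams looks like, here is a minimal sketch that uses only the JDK. It works on a plain list of token strings rather than a TextContainer, and the class `NGramSketch` and method `uniqueNGrams` are hypothetical names chosen for this example. It is not the library's implementation, and the first-seen ordering is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the documented API.
final class NGramSketch {

    /** Returns the distinct n-grams of the token sequence, in first-seen order. */
    static List<List<String>> uniqueNGrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        // LinkedHashSet drops duplicates while keeping insertion order.
        Set<List<String>> seen = new LinkedHashSet<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            // Copy the window so the n-gram does not depend on the source list.
            seen.add(new ArrayList<>(tokens.subList(i, i + n)));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("to", "be", "or", "not", "to", "be");
        // prints [[to, be], [be, or], [or, not], [not, to]]
        System.out.println(uniqueNGrams(tokens, 2));
    }
}
```

The repeated bigram "to be" appears only once in the result, which is the behavior the description above implies.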
    • countElementType

      public static int countElementType(TextContainer text, TextElementType type)
      Counts the number of elements of a specific type in the TextContainer.
      Parameters:
      text - the container of tokenized text
      type - the type of text element to count
      Returns:
      the count of the specific text elements
    • genBagOfWords

      public static Set<String> genBagOfWords(TextContainer text)
      Creates a bag of words based on the contents of the TextContainer. The result is a Set, so each word appears at most once.
      Parameters:
      text - the container of tokenized text
      Returns:
      the bag of words
    • calcWordFreq

      public static WordMap calcWordFreq(TextContainer text)
      Creates a term frequency model based on the contents of the TextContainer.
      Parameters:
      text - the container of tokenized text
      Returns:
      the term frequency model
    • calcWordFreq

      public static WordMap calcWordFreq(TextContainer text, Set<String> wordSet)
      Creates a term frequency model for the specified set of terms, based on the contents of the TextContainer.
      Parameters:
      text - the container of tokenized text
      wordSet - the set of terms to include in the model
      Returns:
      the term frequency model
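
The difference between the two calcWordFreq overloads can be sketched with plain JDK collections. The example below counts tokens in a list of strings, first without a filter and then for a given word set. The names `WordFreqSketch` and `wordFreq` are hypothetical, a `Map<String, Integer>` stands in for WordMap, and the choice to report a zero count for requested terms that never occur is an assumption rather than documented behavior.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the documented API.
final class WordFreqSketch {

    /** Counts the occurrences of every token. */
    static Map<String, Integer> wordFreq(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Counts only the tokens found in wordSet; absent terms keep a count of 0 (assumption). */
    static Map<String, Integer> wordFreq(List<String> tokens, Set<String> wordSet) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : wordSet) {
            counts.put(word, 0);
        }
        for (String token : tokens) {
            if (wordSet.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "cat", "saw", "the", "dog");
        System.out.println(wordFreq(tokens));
        System.out.println(wordFreq(tokens, new HashSet<>(Arrays.asList("the", "bird"))));
    }
}
```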
    • calcNGramFreq

      public static NGramMap calcNGramFreq(TextContainer text, int n)
      Creates an n-gram frequency model based on the contents of the TextContainer.
      Parameters:
      text - the container of tokenized text
      n - the degree of the n-grams (the number of consecutive tokens in each n-gram)
      Returns:
      the n-gram frequency model
    • calcNGramFreq

      public static NGramMap calcNGramFreq(TextContainer text, int n, Set<NGram> nGramSet)
      Creates an n-gram frequency model for the specified set of n-grams, based on the contents of the TextContainer.
      Parameters:
      text - the container of tokenized text
      n - the degree of the n-grams (the number of consecutive tokens in each n-gram)
      nGramSet - the set of n-grams to include in the model
      Returns:
      the n-gram frequency model
    • sortMapByValue

      public static <K, V extends Comparable<V>> List<Map.Entry<K,V>> sortMapByValue(Map<K,V> map)
      Sorts the entries of a map by their values and returns the sorted entries as a list.
      Parameters:
      map - the map of entries that will be sorted
      Returns:
      a list of the sorted map entries
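
A method with this generic signature can be sketched with the JDK's `Map.Entry.comparingByValue()`. The class `SortByValueSketch` and method `sortByValue` below are hypothetical names. The documentation above does not state the sort direction, so the ascending order used here is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the documented API.
final class SortByValueSketch {

    /** Returns the map's entries sorted by value in ascending order (assumed direction). */
    static <K, V extends Comparable<V>> List<Map.Entry<K, V>> sortByValue(Map<K, V> map) {
        // Copy the entry set into a list so the map itself is left untouched.
        List<Map.Entry<K, V>> entries = new ArrayList<>(map.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("a", 3);
        counts.put("b", 1);
        counts.put("c", 2);
        // prints [b=1, c=2, a=3]
        System.out.println(sortByValue(counts));
    }
}
```

For a descending order, the comparator could be wrapped with `Collections.reverseOrder(...)`.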
    • createTreeFromList

      public static TextContainer createTreeFromList(List<TextContainer> nodes)
      Creates a TextContainer from a list of TextContainer nodes. The list must be a pre-order traversal of the original tree.
      Parameters:
      nodes - a list of TextContainers representing a pre-order traversal
      Returns:
      the tokenized text tree
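
Rebuilding a tree from a pre-order traversal requires each node to say how many children follow it. The sketch below shows the general technique with hypothetical `Flat` and `Node` types, where every flat record carries its own child count. How TextContainer actually stores this information is not stated here, so the types and the `rebuild` method are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical types, not part of the documented API.
final class PreOrderSketch {

    /** One entry of the flattened tree: a label plus its number of direct children. */
    static final class Flat {
        final String label;
        final int childCount;

        Flat(String label, int childCount) {
            this.label = label;
            this.childCount = childCount;
        }
    }

    /** A rebuilt tree node. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }
    }

    /** Rebuilds the tree whose pre-order traversal is given. */
    static Node rebuild(List<Flat> preOrder) {
        if (preOrder.isEmpty()) {
            throw new IllegalArgumentException("empty traversal");
        }
        int[] cursor = {0}; // shared read position across recursive calls
        Node root = build(preOrder, cursor);
        if (cursor[0] != preOrder.size()) {
            throw new IllegalArgumentException("trailing nodes after the root subtree");
        }
        return root;
    }

    private static Node build(List<Flat> preOrder, int[] cursor) {
        if (cursor[0] >= preOrder.size()) {
            throw new IllegalArgumentException("traversal ended early");
        }
        Flat flat = preOrder.get(cursor[0]++);
        Node node = new Node(flat.label);
        // In pre-order, each child's whole subtree directly follows its parent.
        for (int i = 0; i < flat.childCount; i++) {
            node.children.add(build(preOrder, cursor));
        }
        return node;
    }

    public static void main(String[] args) {
        List<Flat> flat = Arrays.asList(
                new Flat("document", 2),
                new Flat("sentence-1", 1),
                new Flat("word", 0),
                new Flat("sentence-2", 0));
        Node root = rebuild(flat);
        System.out.println(root.label + " has " + root.children.size() + " children");
    }
}
```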
    • createTreeFromString

      public static TextContainer createTreeFromString(String textTokens)