Class TextTokenUtil


  • public class TextTokenUtil
    extends Object
    Utility methods for operating on TextContainer objects.
    • Constructor Detail

      • TextTokenUtil

        public TextTokenUtil()
    • Method Detail

      • generateWordList

        public static List<String> generateWordList(TextContainer text)
        Lists the unique words contained in the TextContainer.
        Parameters:
        text - the container of tokenized text
        Returns:
        a list of unique word strings
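To illustrate what "unique words" means here, the sketch below deduplicates a plain list of word strings. It does not use TextContainer. The class name UniqueWordsSketch and the first-occurrence ordering are assumptions, since the description above does not state the order of the returned list.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

class UniqueWordsSketch {
    // Returns each distinct word once, in order of first occurrence (an assumption).
    static List<String> uniqueWords(List<String> words) {
        return new ArrayList<>(new LinkedHashSet<>(words));
    }

    public static void main(String[] args) {
        System.out.println(uniqueWords(List.of("the", "cat", "saw", "the", "dog")));
    }
}
```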
      • generateNGramList

        public static List<NGram> generateNGramList(TextContainer text,
                                                    int n)
        Lists the unique n-grams contained in the TextContainer.
        Parameters:
        text - the container of tokenized text
        n - the degree of the n-grams
        Returns:
        a list of unique n-grams
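As a rough sketch of how unique n-grams can be extracted, the code below slides a window of n tokens over a plain list of strings and keeps each distinct window once. The NGram type is not described above, so an n-gram is represented here as a List<String>; that representation and the class name NGramListSketch are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NGramListSketch {
    // Slides a window of size n over the tokens and keeps each distinct window once.
    static List<List<String>> uniqueNGrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        Set<List<String>> seen = new LinkedHashSet<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            // Copy the window so the n-gram does not depend on the backing list.
            seen.add(List.copyOf(tokens.subList(i, i + n)));
        }
        return new ArrayList<>(seen);
    }
}
```

If n is larger than the number of tokens, the loop body never runs and the result is an empty list.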
      • countElementType

        public static int countElementType(TextContainer text,
                                           TextElementType type)
        Counts the number of elements of a specific type in the TextContainer.
        Parameters:
        text - the container of tokenized text
        type - the type of text element to count
        Returns:
        the number of elements of the specified type
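Counting elements of one type can be pictured as a recursive walk over a tree of typed nodes. The sketch below uses a hypothetical Node class and a hypothetical Kind enum in place of TextContainer and TextElementType, whose members are not listed above. Whether the root itself is counted is also an assumption.

```java
import java.util.List;

class CountTypeSketch {
    // Hypothetical element kinds; the real TextElementType values are not listed above.
    enum Kind { SENTENCE, WORD, PUNCTUATION }

    // Hypothetical tree node standing in for TextContainer.
    static final class Node {
        final Kind kind;
        final List<Node> children;

        Node(Kind kind, List<Node> children) {
            this.kind = kind;
            this.children = children;
        }
    }

    // Counts the nodes of the given kind in the subtree rooted at node, including node itself.
    static int count(Node node, Kind kind) {
        int total = (node.kind == kind) ? 1 : 0;
        for (Node child : node.children) {
            total += count(child, kind);
        }
        return total;
    }
}
```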
      • genBagOfWords

        public static Set<String> genBagOfWords(TextContainer text)
        Creates a bag of words, returned as a set of distinct word strings, based on the contents of the TextContainer.
        Parameters:
        text - the container of tokenized text
        Returns:
        the bag of words
      • calcWordFreq

        public static WordMap calcWordFreq(TextContainer text)
        Creates a term frequency model based on the contents of the TextContainer.
        Parameters:
        text - the container of tokenized text
        Returns:
        the term frequency model
      • calcWordFreq

        public static WordMap calcWordFreq(TextContainer text,
                                           Set<String> wordSet)
        Creates a term frequency model for the specified set of terms, based on the contents of the TextContainer.
        Parameters:
        text - the container of tokenized text
        wordSet - the set of terms to include in the model
        Returns:
        the term frequency model
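The two calcWordFreq overloads can be illustrated with a minimal sketch over a plain list of words. WordMap is not described above, so a Map<String, Integer> of raw counts stands in for it. Using raw counts rather than normalized frequencies, and reporting zero for vocabulary terms that never occur, are both assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WordFreqSketch {
    // Counts every word in the input.
    static Map<String, Integer> wordFreq(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Counts only words found in the vocabulary; absent vocabulary terms map to 0 (an assumption).
    static Map<String, Integer> wordFreq(List<String> words, Set<String> vocabulary) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : vocabulary) {
            counts.put(term, 0);
        }
        for (String word : words) {
            if (vocabulary.contains(word)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Fixing the vocabulary up front gives every text a model with the same keys, which is convenient when models from different texts are compared.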
      • calcNGramFreq

        public static NGramMap calcNGramFreq(TextContainer text,
                                             int n)
        Creates an n-gram frequency model based on the contents of the TextContainer.
        Parameters:
        text - the container of tokenized text
        n - the degree of the n-grams
        Returns:
        the n-gram frequency model
      • calcNGramFreq

        public static NGramMap calcNGramFreq(TextContainer text,
                                             int n,
                                             Set<NGram> nGramSet)
        Creates an n-gram frequency model for the specified set of n-grams, based on the contents of the TextContainer.
        Parameters:
        text - the container of tokenized text
        n - the degree of the n-grams
        nGramSet - the set of n-grams to include in the model
        Returns:
        the n-gram frequency model
      • sortMapByValue

        public static <K,V extends Comparable<V>> List<Map.Entry<K,V>> sortMapByValue(Map<K,V> map)
        Sorts the entries of a map by their values and returns the sorted entries as a list.
        Parameters:
        map - the map of entries that will be sorted
        Returns:
        a list of the sorted map entries
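A method with this generic signature can be sketched using only java.util. The version below sorts in ascending order of value; the description above does not state the direction, so that choice is an assumption, as is the class name SortByValueSketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SortByValueSketch {
    // Copies the entries into a list and sorts them by value, ascending (an assumption).
    static <K, V extends Comparable<V>> List<Map.Entry<K, V>> sortByValue(Map<K, V> map) {
        List<Map.Entry<K, V>> entries = new ArrayList<>(map.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        return entries;
    }
}
```

The input map is left unchanged; only the copied list of entries is reordered. For a descending sort, the comparator passed to sort can be reversed.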
      • createTreeFromList

        public static TextContainer createTreeFromList(List<TextContainer> nodes)
        Creates a TextContainer from a list of TextContainer nodes. The list must be a pre-order traversal of the original tree.
        Parameters:
        nodes - a list of TextContainers representing a pre-order traversal
        Returns:
        the tokenized text tree
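A pre-order list alone does not fix the shape of a tree; each node must also carry some structural hint. The sketch below assumes every node records its depth in the original tree (root at depth 0) and rebuilds the parent/child links with a stack. The Node class, its depth field, and the class name PreOrderTreeSketch are hypothetical; how TextContainer actually stores this information is not stated above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class PreOrderTreeSketch {
    // Hypothetical node: a label, its depth in the original tree (root = 0), and its children.
    static final class Node {
        final String label;
        final int depth;
        final List<Node> children = new ArrayList<>();

        Node(String label, int depth) {
            this.label = label;
            this.depth = depth;
        }
    }

    // Rebuilds the links from a pre-order list; returns the root, or null for an empty list.
    // Assumes a single root and depths that grow by at most one from a node to its first child.
    static Node build(List<Node> preOrder) {
        Deque<Node> path = new ArrayDeque<>(); // nodes on the path from the root to the last node placed
        Node root = null;
        for (Node node : preOrder) {
            // Pop until the top of the stack is this node's parent.
            while (path.size() > node.depth) {
                path.pop();
            }
            if (path.isEmpty()) {
                root = node;
            } else {
                path.peek().children.add(node);
            }
            path.push(node);
        }
        return root;
    }
}
```

Because a pre-order traversal visits a parent before all of its descendants, the parent of each node is always still on the stack when that node is reached.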
      • createTreeFromString

        public static TextContainer createTreeFromString(String textTokens)