Class HyphenationTree

  • All Implemented Interfaces:
    PatternConsumer, java.io.Serializable, java.lang.Cloneable

    public class HyphenationTree
    extends TernaryTree
    implements PatternConsumer
    This tree structure stores the hyphenation patterns in an efficient way for fast lookup. It provides the method to hyphenate a word.
    See Also:
    Serialized Form
    • Constructor Summary

      Constructors 
      Constructor Description
      HyphenationTree()  
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      void addClass​(java.lang.String chargroup)
      Add a character class to the tree.
      void addException​(java.lang.String word, java.util.ArrayList<java.lang.Object> hyphenatedword)
      Add an exception to the tree.
      void addPattern​(java.lang.String pattern, java.lang.String ivalue)
      Add a pattern to the tree.
      java.lang.String findPattern​(java.lang.String pat)  
      protected byte[] getValues​(int k)  
      protected int hstrcmp​(char[] s, int si, char[] t, int ti)
      Compares two strings; returns 0 if they are equal or if t is a substring of s.
      Hyphenation hyphenate​(char[] w, int offset, int len, int remainCharCount, int pushCharCount)
      Hyphenate word and return a Hyphenation object holding the hyphenation points.
      Hyphenation hyphenate​(java.lang.String word, int remainCharCount, int pushCharCount)
      Hyphenate word and return a Hyphenation object.
      void loadSimplePatterns​(java.io.InputStream stream)  
      protected int packValues​(java.lang.String values)
      Packs the values by storing each one in 4 bits, two values per byte. Values range from 0 to 9.
      void printStats()  
      protected void searchPatterns​(char[] word, int index, byte[] il)
      Search for all possible partial matches of word starting at index and update interletter values.
      protected java.lang.String unpackValues​(int k)  
      • Methods inherited from class java.lang.Object

        equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • vspace

        protected ByteVector vspace
        value space: stores the interletter values
      • stoplist

        protected java.util.HashMap<java.lang.String,​java.util.ArrayList<java.lang.Object>> stoplist
        This map stores hyphenation exceptions
      • classmap

        protected TernaryTree classmap
        This map stores the character classes
      • ivalues

        private transient TernaryTree ivalues
        Temporary map to store interletter values on pattern loading.
    • Constructor Detail

      • HyphenationTree

        public HyphenationTree()
    • Method Detail

      • packValues

        protected int packValues​(java.lang.String values)
        Packs the values by storing each one in 4 bits, two values per byte. Values range from 0 to 9. Zero is used as the terminator, so 1 is added to each value before it is stored.
        Parameters:
        values - a string of digits from '0' to '9' representing the interletter values.
        Returns:
        the index into the vspace array where the packed values are stored.
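
        The packing scheme described above can be sketched as follows. This is a minimal illustration, not the class's own code: it packs into a fresh byte[] instead of vspace, the class name PackValuesSketch and the methods pack and unpack are hypothetical, and placing the first value of each pair in the high nibble is an assumption.

```java
public class PackValuesSketch {
    // Packs a string of digits '0'..'9' into nibbles, two per byte.
    // Each digit is stored as digit + 1 so that a zero nibble can act as terminator.
    public static byte[] pack(String values) {
        int n = values.length();
        // n / 2 + 1 bytes always leaves at least one zero nibble at the end.
        byte[] packed = new byte[n / 2 + 1];
        for (int i = 0; i < n; i++) {
            int v = values.charAt(i) - '0' + 1; // 1..10, fits in 4 bits
            if ((i & 1) == 0) {
                packed[i / 2] = (byte) (v << 4); // assumed: first value in the high nibble
            } else {
                packed[i / 2] |= (byte) v;       // second value in the low nibble
            }
        }
        return packed;
    }

    // Reverses pack(): reads nibbles until the zero terminator is found.
    public static String unpack(byte[] packed) {
        StringBuilder sb = new StringBuilder();
        for (byte b : packed) {
            int hi = (b >> 4) & 0x0f;
            if (hi == 0) {
                break;
            }
            sb.append((char) ('0' + hi - 1));
            int lo = b & 0x0f;
            if (lo == 0) {
                break;
            }
            sb.append((char) ('0' + lo - 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] packed = pack("0102");
        System.out.println(packed.length + " bytes, round trip: " + unpack(packed));
    }
}
```

        Adding 1 before storing is what keeps a genuine value of 0 distinguishable from the terminator.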
      • unpackValues

        protected java.lang.String unpackValues​(int k)
      • loadSimplePatterns

        public void loadSimplePatterns​(java.io.InputStream stream)
      • findPattern

        public java.lang.String findPattern​(java.lang.String pat)
      • hstrcmp

        protected int hstrcmp​(char[] s,
                              int si,
                              char[] t,
                              int ti)
        Compares two strings; returns 0 if they are equal or if t is a substring of s.
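
        One possible reading of this contract is sketched below. It assumes both arrays are terminated by a 0 character and treats "substring" as "prefix of s starting at si"; the class name HstrcmpSketch is hypothetical, and the actual implementation may differ.

```java
public class HstrcmpSketch {
    // Compares the 0-terminated strings s[si..] and t[ti..].
    // Returns 0 when they are equal or when t ends first (t is a prefix of s);
    // otherwise returns the difference of the first mismatching characters.
    public static int hstrcmp(char[] s, int si, char[] t, int ti) {
        while (s[si] == t[ti]) {
            if (s[si] == 0) {
                return 0; // both strings ended together: equal
            }
            si++;
            ti++;
        }
        if (t[ti] == 0) {
            return 0; // t is exhausted, so t is a prefix of s[si..]
        }
        return s[si] - t[ti];
    }

    public static void main(String[] args) {
        char[] s = "hyphen\0".toCharArray();
        char[] t = "hyp\0".toCharArray();
        System.out.println(hstrcmp(s, 0, t, 0));     // t is a prefix of s
        System.out.println(hstrcmp(t, 0, s, 0) < 0); // roles swapped: no longer a match
    }
}
```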
      • getValues

        protected byte[] getValues​(int k)
      • searchPatterns

        protected void searchPatterns​(char[] word,
                                      int index,
                                      byte[] il)

        Search for all possible partial matches of word starting at index and update interletter values. In other words, it does something like:

        for (i = 0; i < patterns.length; i++) {
            if (word.substring(index).startsWith(patterns[i]))
                update_interletter_values(patterns[i]);
        }

        But it is done in an efficient way, since the patterns are stored in a ternary tree. In fact, this is the whole purpose of having the tree: doing this search without having to test every single pattern. The number of patterns for languages such as English ranges from 4000 to 10000, so doing thousands of string comparisons for each word to hyphenate would be really slow without the tree. The tradeoff is memory, but using a ternary tree instead of a trie almost halves the memory used by Lout or TeX. It is also faster than using a hash table.

        Parameters:
        word - null-terminated word to match
        index - start index from word
        il - interletter values array to update
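
        For comparison, the naive search that the tree avoids can be written out as a runnable sketch. The class NaiveSearchSketch, the map-based pattern store and the sample patterns are all hypothetical; keeping the larger value at each position is assumed here as the update rule (it is the rule used in TeX-style pattern hyphenation).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class NaiveSearchSketch {
    // For every pattern that is a prefix of word[index..], merge its interletter
    // values into il, keeping the larger value at each position (assumed rule).
    // This tests every pattern, which is exactly what the ternary tree avoids.
    public static void searchPatterns(String word, int index, byte[] il,
                                      Map<String, String> patterns) {
        String rest = word.substring(index);
        for (Map.Entry<String, String> e : patterns.entrySet()) {
            if (!rest.startsWith(e.getKey())) {
                continue;
            }
            String values = e.getValue();
            for (int k = 0; k < values.length() && index + k < il.length; k++) {
                byte v = (byte) (values.charAt(k) - '0');
                if (v > il[index + k]) {
                    il[index + k] = v;
                }
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical patterns (letters only) mapped to their interletter values.
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("hy", "003");
        patterns.put("hyph", "00120");
        patterns.put("en", "010");

        String word = "hyphen";
        byte[] il = new byte[word.length() + 1];
        searchPatterns(word, 0, il, patterns);
        System.out.println(Arrays.toString(il));
    }
}
```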
      • hyphenate

        public Hyphenation hyphenate​(java.lang.String word,
                                     int remainCharCount,
                                     int pushCharCount)
        Hyphenate word and return a Hyphenation object.
        Parameters:
        word - the word to be hyphenated
        remainCharCount - Minimum number of characters allowed before the hyphenation point.
        pushCharCount - Minimum number of characters allowed after the hyphenation point.
        Returns:
        a Hyphenation object representing the hyphenated word or null if word is not hyphenated.
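
        The effect of the two minimum counts can be illustrated with a small, self-contained sketch. It does not call this class; MinCharsSketch, filter and the candidate positions are hypothetical, and a position is taken to mean the number of characters before the break.

```java
import java.util.ArrayList;
import java.util.List;

public class MinCharsSketch {
    // Keeps only the candidate break positions that leave at least
    // remainCharCount characters before the break and at least
    // pushCharCount characters after it.
    public static List<Integer> filter(int wordLength, int[] candidates,
                                       int remainCharCount, int pushCharCount) {
        List<Integer> kept = new ArrayList<>();
        for (int pos : candidates) {
            if (pos >= remainCharCount && wordLength - pos >= pushCharCount) {
                kept.add(pos);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String word = "hyphenation";
        int[] candidates = {2, 6, 7}; // hypothetical points: hy-phen-a-tion
        System.out.println(filter(word.length(), candidates, 3, 3));
    }
}
```

        Under these assumptions, a word in which no candidate survives the filter would correspond to the case where the method returns null.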
      • hyphenate

        public Hyphenation hyphenate​(char[] w,
                                     int offset,
                                     int len,
                                     int remainCharCount,
                                     int pushCharCount)
        Hyphenate word and return a Hyphenation object holding the hyphenation points.
        Parameters:
        w - char array that contains the word
        offset - Offset to first character in word
        len - Length of word
        remainCharCount - Minimum number of characters allowed before the hyphenation point.
        pushCharCount - Minimum number of characters allowed after the hyphenation point.
        Returns:
        a Hyphenation object representing the hyphenated word or null if word is not hyphenated.
      • addClass

        public void addClass​(java.lang.String chargroup)
        Add a character class to the tree. It is used by SimplePatternParser as a callback to add character classes. Character classes define the valid word characters for hyphenation. If a word contains a character not defined in any of the classes, it is not hyphenated. A class also defines how to normalize characters so that they can be compared with the stored patterns. Pattern files usually use only lowercase characters; in that case, a class for the letter 'a', for example, should be defined as "aA", with the first character being the normalization character.
        Specified by:
        addClass in interface PatternConsumer
        Parameters:
        chargroup - character group
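
        The normalization role of a character class can be sketched as below. This is an illustration only: it uses a HashMap where this class uses a TernaryTree (classmap), and the names CharClassSketch and normalize are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class CharClassSketch {
    // Maps every character of a class to the class's first character.
    private final Map<Character, Character> classmap = new HashMap<>();

    // Registers a character group such as "aA"; the first character
    // is the normalization character for the whole group.
    public void addClass(String chargroup) {
        if (chargroup.isEmpty()) {
            return;
        }
        char norm = chargroup.charAt(0);
        for (char c : chargroup.toCharArray()) {
            classmap.put(c, norm);
        }
    }

    // Returns the normalized word, or null if the word contains a character
    // that belongs to no class (such a word would not be hyphenated).
    public String normalize(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (char c : word.toCharArray()) {
            Character norm = classmap.get(c);
            if (norm == null) {
                return null;
            }
            sb.append(norm.charValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CharClassSketch classes = new CharClassSketch();
        classes.addClass("aA");
        classes.addClass("bB");
        System.out.println(classes.normalize("AbBa"));
        System.out.println(classes.normalize("a1"));
    }
}
```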
      • addException

        public void addException​(java.lang.String word,
                                 java.util.ArrayList<java.lang.Object> hyphenatedword)
        Add an exception to the tree. It is used by the SimplePatternParser class as a callback to store the hyphenation exceptions.
        Specified by:
        addException in interface PatternConsumer
        Parameters:
        word - normalized word
        hyphenatedword - a list of alternating strings and hyphen objects.
      • addPattern

        public void addPattern​(java.lang.String pattern,
                               java.lang.String ivalue)
        Add a pattern to the tree. It is mainly used by the SimplePatternParser class as a callback to add a pattern to the tree.
        Specified by:
        addPattern in interface PatternConsumer
        Parameters:
        pattern - the hyphenation pattern
        ivalue - interletter weight values indicating the desirability and priority of hyphenating at a given point within the pattern. It should contain only digit characters (i.e. '0' to '9').