Class LocationTextExtractionStrategy

  • All Implemented Interfaces:
    RenderListener, TextExtractionStrategy

    public class LocationTextExtractionStrategy
    extends java.lang.Object
    implements TextExtractionStrategy
    Development preview - this class (and all of the parser classes) is still under heavy development, and both its behavior and its interface are subject to change.
    A text extraction renderer that keeps track of the relative position of text on the page. The resultant text will be relatively consistent with the physical layout that most PDF files have on screen.
    This renderer keeps track of the orientation of each piece of text and of its distance (both perpendicular and parallel) relative to the unit vector of that orientation. Text is ordered by orientation, then by perpendicular distance, then by parallel distance. Text with the same perpendicular distance but a different parallel distance is treated as being on the same line.
    This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be inserted into the output.
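    The ordering described above can be illustrated with a minimal, self-contained sketch. It does not use the iText classes: SketchChunk, ChunkOrderSketch, and their fields are hypothetical stand-ins, and the sketch assumes that orientation and perpendicular distance are already quantized to integers and that smaller values sort first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a located piece of text; not an iText class. */
final class SketchChunk {
    final String text;
    final int orientation;       // assumed: quantized direction of the text baseline
    final int distPerpendicular; // assumed: quantized distance across lines
    final float distParallel;    // position along the line

    SketchChunk(String text, int orientation, int distPerpendicular, float distParallel) {
        this.text = text;
        this.orientation = orientation;
        this.distPerpendicular = distPerpendicular;
        this.distParallel = distParallel;
    }
}

final class ChunkOrderSketch {
    // Orientation first, then perpendicular distance, then parallel distance.
    static final Comparator<SketchChunk> READING_ORDER =
            Comparator.comparingInt((SketchChunk c) -> c.orientation)
                    .thenComparingInt(c -> c.distPerpendicular)
                    .thenComparingDouble(c -> c.distParallel);

    // Chunks sharing orientation and perpendicular distance are on the same line.
    static boolean sameLine(SketchChunk a, SketchChunk b) {
        return a.orientation == b.orientation
                && a.distPerpendicular == b.distPerpendicular;
    }

    // Sorts a copy of the chunks and concatenates them, one output line per text line.
    static String layout(List<SketchChunk> chunks) {
        List<SketchChunk> sorted = new ArrayList<>(chunks);
        sorted.sort(READING_ORDER);
        StringBuilder sb = new StringBuilder();
        SketchChunk previous = null;
        for (SketchChunk c : sorted) {
            if (previous != null && !sameLine(previous, c)) {
                sb.append('\n'); // orientation or perpendicular distance changed
            }
            sb.append(c.text);
            previous = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<SketchChunk> chunks = Arrays.asList(
                new SketchChunk("world", 0, 10, 50f),
                new SketchChunk("Second line", 0, 20, 0f),
                new SketchChunk("Hello ", 0, 10, 0f));
        System.out.println(layout(chunks));
    }
}
```

    The sketch only covers ordering and line grouping; the decision to insert a space between chunks on the same line is a separate step.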
    Since:
    5.0.2
    • Constructor Detail

      • LocationTextExtractionStrategy

        public LocationTextExtractionStrategy()
        Creates a new text extraction renderer.
      • LocationTextExtractionStrategy

        public LocationTextExtractionStrategy(LocationTextExtractionStrategy.TextChunkLocationStrategy strat)
        Creates a new text extraction renderer, with a custom strategy for creating new TextChunkLocation objects based on the input of the TextRenderInfo.
        Parameters:
        strat - the custom strategy
    • Method Detail

      • startsWithSpace

        private boolean startsWithSpace(java.lang.String str)
        Parameters:
        str - the string to test
        Returns:
        true if the string starts with a space character, false if the string is empty or starts with a non-space character
      • endsWithSpace

        private boolean endsWithSpace(java.lang.String str)
        Parameters:
        str - the string to test
        Returns:
        true if the string ends with a space character, false if the string is empty or ends with a non-space character
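        The documented return values of startsWithSpace and endsWithSpace, including the empty-string case, can be reproduced with a minimal sketch. SpaceCheckSketch is a hypothetical class; the actual private methods may be written differently.

```java
/** Hypothetical helper mirroring the documented return values; not the iText code. */
final class SpaceCheckSketch {
    // false for an empty string, otherwise true only if the first character is a space
    static boolean startsWithSpace(String str) {
        return !str.isEmpty() && str.charAt(0) == ' ';
    }

    // false for an empty string, otherwise true only if the last character is a space
    static boolean endsWithSpace(String str) {
        return !str.isEmpty() && str.charAt(str.length() - 1) == ' ';
    }

    public static void main(String[] args) {
        System.out.println(startsWithSpace(" text"));
        System.out.println(endsWithSpace("text"));
        System.out.println(startsWithSpace(""));
    }
}
```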
      • isChunkAtWordBoundary

        protected boolean isChunkAtWordBoundary(LocationTextExtractionStrategy.TextChunk chunk,
                                                LocationTextExtractionStrategy.TextChunk previousChunk)
        Determines if a space character should be inserted between a previous chunk and the current chunk. This method is exposed as a callback so subclasses can fine-tune the algorithm for determining whether a space should be inserted or not. By default, this method will insert a space if there is a gap of more than half the font's space character width between the end of the previous chunk and the beginning of the current chunk. It will also indicate that a space is needed if the starting point of the new chunk appears *before* the end of the previous chunk (i.e. overlapping text).
        Parameters:
        chunk - the new chunk being evaluated
        previousChunk - the chunk that appeared immediately before the current chunk
        Returns:
        true if the two chunks represent different words (i.e. should have a space between them), false otherwise
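        The default rule described for isChunkAtWordBoundary can be written as a small self-contained sketch. It follows the prose literally and reduces each chunk to positions along the line; WordBoundarySketch and its parameters are hypothetical, and the thresholds in the actual implementation may differ.

```java
/** Hypothetical reduction of the documented default rule; not the iText code. */
final class WordBoundarySketch {
    /**
     * @param previousEnd  parallel position where the previous chunk ends
     * @param currentStart parallel position where the current chunk starts
     * @param spaceWidth   width of the font's space character, in the same units
     * @return true if a space should separate the two chunks
     */
    static boolean isAtWordBoundary(float previousEnd, float currentStart, float spaceWidth) {
        float gap = currentStart - previousEnd;
        // A negative gap means the new chunk starts before the previous one ends (overlap).
        // A gap wider than half a space character is treated as a word break.
        return gap < 0 || gap > spaceWidth / 2.0f;
    }

    public static void main(String[] args) {
        System.out.println(isAtWordBoundary(10f, 10.5f, 4f)); // small gap
        System.out.println(isAtWordBoundary(10f, 13f, 4f));   // wide gap
        System.out.println(isAtWordBoundary(10f, 8f, 4f));    // overlap
    }
}
```

        A subclass that overrides the callback could apply a different threshold, for example a larger fraction of the space width for loosely tracked fonts.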
      • getResultantText

        public java.lang.String getResultantText(LocationTextExtractionStrategy.TextChunkFilter chunkFilter)
        Gets the text that passes the specified filter. If multiple text extractions will be performed for the same page (i.e. for different physical regions of the page), filtering at this level is more efficient than filtering using FilteredRenderListener, but not nearly as powerful, because most of the RenderInfo state is not captured in LocationTextExtractionStrategy.TextChunk.
        Parameters:
        chunkFilter - the filter to apply
        Returns:
        the text results so far, filtered using the specified filter
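        The efficiency argument above (collect the chunks once, then filter the stored chunks once per region) can be sketched without iText. FilterSketch, Piece, and the use of java.util.function.Predicate are assumptions made for illustration; they stand in for the strategy, TextChunk, and TextChunkFilter respectively.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical illustration of filtering already-collected chunks; not the iText code. */
final class FilterSketch {
    /** Minimal stand-in for a stored chunk: text plus a position on the page. */
    static final class Piece {
        final String text;
        final float x;
        final float y;

        Piece(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Concatenates, in list order, the text of every piece accepted by the filter.
    static String textMatching(List<Piece> pieces, Predicate<Piece> filter) {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            if (filter.test(p)) {
                sb.append(p.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The page is "parsed" once; both regions reuse the same stored pieces.
        List<Piece> pieces = Arrays.asList(
                new Piece("Header", 50f, 700f),
                new Piece("Body", 50f, 400f));
        System.out.println(textMatching(pieces, p -> p.y > 600f));
        System.out.println(textMatching(pieces, p -> p.y <= 600f));
    }
}
```

        Because only the text and position are stored here, a filter cannot inspect anything else, which mirrors the limitation noted above for TextChunk.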
      • getResultantText

        public java.lang.String getResultantText()
        Returns the result so far.
        Specified by:
        getResultantText in interface TextExtractionStrategy
        Returns:
        a String with the resulting text.
      • dumpState

        private void dumpState()
        Used for debugging only
      • compareInts

        private static int compareInts(int int1,
                                       int int2)
        Parameters:
        int1 - the first integer to compare
        int2 - the second integer to compare
        Returns:
        comparison of the two integers