Class SimpleTextExtractionStrategy

  • All Implemented Interfaces:
    RenderListener, TextExtractionStrategy

    public class SimpleTextExtractionStrategy
    extends java.lang.Object
    implements TextExtractionStrategy
    A simple text extraction renderer. This renderer keeps track of the current Y position of each string. If it detects that the Y position has changed, it inserts a line break into the output. If the PDF does not render its text from top to bottom, the extracted text will not follow the order in which it appears on the page. This renderer also uses a simple strategy based on the font metrics to decide whether a blank space should be inserted into the output.
    Since:
    2.1.5
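
The heuristic described above can be modelled without the PDF library at all. The following is a minimal, self-contained sketch and not the class's actual source: `Chunk`, `LineBreakSketch`, and the "half a space width" gap threshold are hypothetical names and values chosen for illustration, standing in for the `Vector` start/end points and font metrics the real renderer works with.

```java
/** Hypothetical stand-in for one rendered string on a horizontal baseline. */
final class Chunk {
    final String text;
    final float startX;     // x of the baseline start
    final float endX;       // x of the baseline end
    final float y;          // baseline Y position
    final float spaceWidth; // width of a space in the current font (assumed known)

    Chunk(String text, float startX, float endX, float y, float spaceWidth) {
        this.text = text;
        this.startX = startX;
        this.endX = endX;
        this.y = y;
        this.spaceWidth = spaceWidth;
    }
}

/** Illustrative model of the "new Y means new line" strategy. */
final class LineBreakSketch {
    private final StringBuilder result = new StringBuilder();
    private Chunk last; // previously rendered chunk, null before the first one

    void render(Chunk c) {
        if (last != null) {
            if (c.y != last.y) {
                // The Y position changed, so start a new line in the output.
                result.append('\n');
            } else if (needsSpace(c)) {
                // Same line, but the horizontal gap looks like a word break.
                result.append(' ');
            }
        }
        result.append(c.text);
        last = c;
    }

    private boolean needsSpace(Chunk c) {
        boolean alreadySpaced = c.text.startsWith(" ")
                || (result.length() > 0 && result.charAt(result.length() - 1) == ' ');
        // Assumed threshold: a gap wider than half a space counts as a space.
        return !alreadySpaced && (c.startX - last.endX) > c.spaceWidth / 2f;
    }

    String getResultantText() {
        return result.toString();
    }
}
```

Because chunks are appended strictly in the order they are rendered, feeding this model the lower line before the upper one yields the lines in that same reversed order, which is the limitation the description warns about.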
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private Vector lastEnd  
      private Vector lastStart  
      private java.lang.StringBuffer result
      used to store the resulting String.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      protected void appendTextChunk(java.lang.CharSequence text)
      Used to actually append text to the text results.
      void beginTextBlock()
      Called when a new text block is beginning (i.e. BT).
      void endTextBlock()
      Called when a text block has ended (i.e. ET).
      java.lang.String getResultantText()
      Returns the result so far.
      void renderImage(ImageRenderInfo renderInfo)
      No-op method; this renderer isn't interested in image events.
      void renderText(TextRenderInfo renderInfo)
      Captures text using a simplified algorithm for inserting hard returns and spaces.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • lastStart

        private Vector lastStart
      • lastEnd

        private Vector lastEnd
      • result

        private final java.lang.StringBuffer result
        used to store the resulting String.
    • Constructor Detail

      • SimpleTextExtractionStrategy

        public SimpleTextExtractionStrategy()
        Creates a new text extraction renderer.
    • Method Detail

      • beginTextBlock

        public void beginTextBlock()
        Description copied from interface: RenderListener
        Called when a new text block is beginning (i.e. BT).
        Specified by:
        beginTextBlock in interface RenderListener
        Since:
        5.0.1
      • endTextBlock

        public void endTextBlock()
        Description copied from interface: RenderListener
        Called when a text block has ended (i.e. ET).
        Specified by:
        endTextBlock in interface RenderListener
        Since:
        5.0.1
      • getResultantText

        public java.lang.String getResultantText()
        Returns the result so far.
        Specified by:
        getResultantText in interface TextExtractionStrategy
        Returns:
        a String with the resulting text.
      • appendTextChunk

        protected final void appendTextChunk(java.lang.CharSequence text)
        Used to actually append text to the text results. Subclasses can use this to insert text that wouldn't normally be included in text parsing (e.g. the result of OCR performed against image content).
        Parameters:
        text - the text to append to the text results accumulated so far
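
As an illustration of the subclassing pattern mentioned above, the sketch below overrides the image callback and pushes recognized text into the results through `appendTextChunk`. It is deliberately self-contained: `StrategyStandIn` is a hypothetical stand-in that mirrors only the members needed here, `OcrAwareStrategy` and the `ocr` function are invented for the example, and the image is passed as a plain `Object` rather than an `ImageRenderInfo`. In real use the subclass would presumably extend `SimpleTextExtractionStrategy` directly.

```java
import java.util.function.Function;

/** Hypothetical stand-in mirroring only the members this sketch needs. */
class StrategyStandIn {
    private final StringBuilder result = new StringBuilder();

    /** Appends text to the accumulated results; final, so subclasses call it rather than override it. */
    protected final void appendTextChunk(CharSequence text) {
        result.append(text);
    }

    /** No-op by default, like the image callback documented for this class. */
    public void renderImage(Object imageInfo) {
    }

    public String getResultantText() {
        return result.toString();
    }
}

/** Subclass that contributes OCR output for images to the extracted text. */
class OcrAwareStrategy extends StrategyStandIn {
    private final Function<Object, String> ocr; // assumed OCR hook supplied by the caller

    OcrAwareStrategy(Function<Object, String> ocr) {
        this.ocr = ocr;
    }

    @Override
    public void renderImage(Object imageInfo) {
        String recognized = ocr.apply(imageInfo);
        if (recognized != null && !recognized.isEmpty()) {
            // Insert text that normal text parsing would never see.
            appendTextChunk(recognized);
        }
    }
}
```

Keeping the OCR step behind a function argument means the sketch makes no assumption about which OCR engine is used.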
      • renderText

        public void renderText(TextRenderInfo renderInfo)
        Captures text using a simplified algorithm for inserting hard returns and spaces.
        Specified by:
        renderText in interface RenderListener
        Parameters:
        renderInfo - information about the text being rendered