Class PdfTextExtractor


  • public final class PdfTextExtractor
    extends java.lang.Object
    Extracts text from a PDF file.
    Since:
    2.1.4
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      private PdfTextExtractor()
      This class only contains static methods.
    • Method Summary

      Modifier and Type Method Description
      static java.lang.String getTextFromPage(PdfReader reader, int pageNumber)
      Extract text from a specified page using the default strategy.
      static java.lang.String getTextFromPage(PdfReader reader, int pageNumber, TextExtractionStrategy strategy)
      Extract text from a specified page using an extraction strategy.
      static java.lang.String getTextFromPage(PdfReader reader, int pageNumber, TextExtractionStrategy strategy, java.util.Map<java.lang.String,ContentOperator> additionalContentOperators)
      Extract text from a specified page using an extraction strategy and optional custom ContentOperators.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • PdfTextExtractor

        private PdfTextExtractor()
        This class only contains static methods.
    • Method Detail

      • getTextFromPage

        public static java.lang.String getTextFromPage(PdfReader reader,
                                                       int pageNumber,
                                                       TextExtractionStrategy strategy,
                                                       java.util.Map<java.lang.String,ContentOperator> additionalContentOperators)
                                                throws java.io.IOException
        Extract text from a specified page using an extraction strategy. Also allows registration of custom ContentOperators.
        Parameters:
        reader - the reader to extract text from
        pageNumber - the page to extract text from
        strategy - the strategy to use for extracting text
        additionalContentOperators - an optional map of custom ContentOperators for rendering instructions
        Returns:
        the extracted text
        Throws:
        java.io.IOException - if any operation fails while reading from the provided PdfReader
      • getTextFromPage

        public static java.lang.String getTextFromPage(PdfReader reader,
                                                       int pageNumber,
                                                       TextExtractionStrategy strategy)
                                                throws java.io.IOException
        Extract text from a specified page using an extraction strategy.
        Parameters:
        reader - the reader to extract text from
        pageNumber - the page to extract text from
        strategy - the strategy to use for extracting text
        Returns:
        the extracted text
        Throws:
        java.io.IOException - if any operation fails while reading from the provided PdfReader
        Since:
        5.0.2
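
        A minimal sketch of how this overload might be driven across several pages. It rests on assumptions not stated on this page: that page numbers start at 1, and that a strategy object may keep text from earlier calls, so each page gets a fresh instance. PerPageExtraction, StrategyCall, and extractPerPage are hypothetical names. The library call is passed in as a lambda so that the helper compiles with the JDK alone.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper; not part of the documented API.
final class PerPageExtraction {

    /** Stand-in for a call such as getTextFromPage(reader, pageNumber, strategy). */
    interface StrategyCall<S> {
        String extract(int pageNumber, S strategy) throws IOException;
    }

    /**
     * Runs the call once per page, numbering pages from 1 (assumption),
     * and passes each call a strategy newly created by the supplier.
     */
    static <S> List<String> extractPerPage(int pageCount,
                                           Supplier<S> newStrategy,
                                           StrategyCall<S> call) {
        List<String> pages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            try {
                pages.add(call.extract(page, newStrategy.get()));
            } catch (IOException e) {
                // Rethrow unchecked so callers can use the helper in lambdas and streams.
                throw new UncheckedIOException("Extraction failed on page " + page, e);
            }
        }
        return pages;
    }
}
```

        With the library on the classpath, the wiring would look roughly like extractPerPage(pageCount, MyStrategy::new, (p, s) -> PdfTextExtractor.getTextFromPage(reader, p, s)), where MyStrategy is a placeholder for whichever TextExtractionStrategy implementation is chosen and pageCount is obtained from the reader.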
      • getTextFromPage

        public static java.lang.String getTextFromPage(PdfReader reader,
                                                       int pageNumber)
                                                throws java.io.IOException
        Extract text from a specified page using the default strategy.

        Note: the default strategy is subject to change. If your code depends on a specific strategy, pass it explicitly with getTextFromPage(PdfReader, int, TextExtractionStrategy).

        Parameters:
        reader - the reader to extract text from
        pageNumber - the page to extract text from
        Returns:
        the extracted text
        Throws:
        java.io.IOException - if any operation fails while reading from the provided PdfReader
        Since:
        5.0.2
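
        Because this method handles one page per call, extracting a whole document means looping over the pages. The sketch below shows one way to do that, assuming page numbers start at 1 and that the page count is available from the reader (for example through a getNumberOfPages() method, which is not documented on this page). AllPagesText, PageCall, and extractAll are hypothetical names, and the library call is injected as a lambda so the helper compiles with the JDK alone.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper; not part of the documented API.
final class AllPagesText {

    /** Stand-in for a call such as getTextFromPage(reader, pageNumber). */
    interface PageCall {
        String extract(int pageNumber) throws IOException;
    }

    /** Joins the text of pages 1..pageCount (1-based numbering is an assumption) with newlines. */
    static String extractAll(int pageCount, PageCall call) {
        StringBuilder text = new StringBuilder();
        for (int page = 1; page <= pageCount; page++) {
            try {
                if (page > 1) {
                    text.append('\n'); // separator between pages, not after the last one
                }
                text.append(call.extract(page));
            } catch (IOException e) {
                throw new UncheckedIOException("Extraction failed on page " + page, e);
            }
        }
        return text.toString();
    }
}
```

        With the library on the classpath, a call might read extractAll(pageCount, p -> PdfTextExtractor.getTextFromPage(reader, p)). Since the default strategy may change between versions, output from this form should not be assumed stable across upgrades.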