Apache Tika Per Page Content Extraction

  • Post published: July 24, 2015

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more. That is the official description, and it is true, but a tool that offers a single interface for such a wide variety of documents is necessarily limited when it comes to features that are specific to one type of content. We ran into exactly this limitation with Apache Tika in a project.

Tika extracts the content of all pages of a document at once, but we needed to extract the content one page at a time so that we could index each page separately and search by page. Apache Tika does not support this out of the box, so we had to write a custom content handler.

With respect to pages, documents can be divided into two types:
1- Page-based documents
2- Section-based documents

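PDF is a page-based format, whereas DOCX is a section-based format. A PDF stores its content page by page, while a DOCX stores a flow of sections whose page breaks are only determined when the document is laid out.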

Apache Tika has a content handler called ToXMLContentHandler, which converts the content of any document into XML. When a page-based document is converted, each page is marked in the XML by a tag with a special attribute (a div with class page), so we extended ToXMLContentHandler to collect the text of each page separately. The implementation below is in JRuby:

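# Assumes the Tika jars are already on the JRuby classpath.
require 'java'
java_import 'org.apache.tika.sax.ToXMLContentHandler'
java_import 'java.lang.StringBuilder'

class PageContentHandler < ToXMLContentHandler
  attr_accessor :page_tag, :page_number, :page_class, :page_map

  def initialize
    super() # run the Java superclass constructor
    @page_number = 0 # 0 means no page has started yet
    @page_tag = 'div'
    @page_class = 'page'
    @page_map = {} # page number => page text
  end

  # A new page starts at every <div class="page">.
  def startElement(uri, local_name, q_name, atts)
    start_page if @page_tag == q_name && atts.getValue('class') == @page_class
  end

  def endElement(uri, local_name, q_name)
    end_page if @page_tag == q_name
  end

  # Append only ch[start, length]; the rest of the buffer may hold unrelated data.
  def characters(ch, start, length)
    return unless length > 0 && @page_number > 0

    builder = StringBuilder.new(length)
    builder.append(ch, start, length)
    @page_map[@page_number] << builder.to_s
  end

  def start_page
    @page_number += 1
    @page_map[@page_number] = String.new
  end

  # Nothing to do here: the next page div simply starts a new entry.
  def end_page
  end
end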
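
To see the page-splitting logic in isolation, here is a minimal pure-Ruby sketch that runs without Tika or Java. `PageSplitter` and the hand-fed event sequence are hypothetical stand-ins, not part of Tika: the class mirrors the callbacks of the handler above but takes a plain hash of attributes and a plain string, and the events imitate what a two-page PDF might produce.

```ruby
# Hypothetical stand-in for PageContentHandler with no Java or Tika dependency.
class PageSplitter
  attr_reader :page_map

  def initialize(page_tag = 'div', page_class = 'page')
    @page_tag = page_tag
    @page_class = page_class
    @page_number = 0 # 0 means no page has started yet
    @page_map = {}   # page number => page text
  end

  # A <div class="page"> opens a new page; every other element is ignored.
  def start_element(name, attrs = {})
    return unless name == @page_tag && attrs['class'] == @page_class

    @page_number += 1
    @page_map[@page_number] = String.new
  end

  # Text goes to the current page; text seen before the first page is dropped.
  def characters(text)
    @page_map[@page_number] << text if @page_number > 0
  end
end

# Events imitating two <div class="page"> blocks inside a <body>,
# each holding one <p> with a line of text.
splitter = PageSplitter.new
splitter.characters('text before any page')
splitter.start_element('body')
splitter.start_element('div', { 'class' => 'page' })
splitter.start_element('p')
splitter.characters('First page')
splitter.start_element('div', { 'class' => 'page' })
splitter.start_element('p')
splitter.characters('Second page')

p splitter.page_map
```

The map ends up with one entry per page div, keyed by a page number that starts at 1. Text that arrives before the first page div is discarded, which is the same behaviour as the `@page_number > 0` check in the JRuby handler.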

The content handler is used like this:

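java_import 'org.apache.tika.parser.AutoDetectParser'
java_import 'org.apache.tika.parser.ParseContext'
# input_stream: a java.io.InputStream for the document
# @metadata_java: an org.apache.tika.metadata.Metadata instance
parser = AutoDetectParser.new
handler = PageContentHandler.new
parser.parse(input_stream, handler, @metadata_java, ParseContext.new)
puts handler.page_map # page number => extracted text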
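
Since the goal is to index and search by page, the page map can be turned into one record per page. The following is only a sketch under assumptions: `page_records` and `pages_matching` are hypothetical helpers, the record fields are made up for illustration, and the substring lookup merely stands in for whatever search engine is actually used. The sample hash has the same shape as `handler.page_map`.

```ruby
# Hypothetical helper: builds one record per page, ready to hand to an indexer.
def page_records(doc_id, page_map)
  page_map.sort.map do |page_number, text|
    { id: "#{doc_id}-p#{page_number}", doc_id: doc_id, page: page_number, text: text.strip }
  end
end

# Hypothetical helper: naive case-insensitive lookup standing in for a real search engine.
def pages_matching(records, term)
  needle = term.downcase
  records.select { |r| r[:text].downcase.include?(needle) }.map { |r| r[:page] }
end

# Sample data in the shape of handler.page_map (page number => text).
sample_page_map = {
  1 => "Introduction to Tika\n",
  2 => "Parsing PDF files\n",
  3 => "Tika and DOCX\n"
}

records = page_records('report', sample_page_map)
puts records.map { |r| r[:id] }.inspect
puts pages_matching(records, 'tika').inspect
```

Keeping the page number in each record is what lets a search hit point back to a specific page rather than to the document as a whole.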

We tested it with a variety of PDF documents and it worked correctly on all of them. Unfortunately, it does not work well with the DOCX format, since that is a section-based format. We checked the XML produced for a DOCX file, and it does not contain any div with class page. Instead, a page is identified by a <footer> tag, so as an alternative you can convert the DOCX to PDF and then extract its content per page.
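
One possible way to do the DOCX to PDF conversion is LibreOffice in headless mode. This is not something we describe above, so treat it as a sketch under assumptions: LibreOffice is installed with its `soffice` binary on the PATH, and the helper names and file paths below are hypothetical. The example only builds and prints the command; the actual call is left as a comment.

```ruby
require 'shellwords'

# Hypothetical helper: LibreOffice command line for a headless DOCX -> PDF conversion.
def docx_to_pdf_command(docx_path, out_dir)
  ['soffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, docx_path]
end

# Hypothetical helper: where the converted file is expected to land
# (same base name with a .pdf extension, inside out_dir).
def converted_pdf_path(docx_path, out_dir)
  File.join(out_dir, File.basename(docx_path, '.*') + '.pdf')
end

cmd = docx_to_pdf_command('reports/summary.docx', 'tmp/pdf')
puts cmd.shelljoin
puts converted_pdf_path('reports/summary.docx', 'tmp/pdf')

# To run the conversion for real:
#   system(*cmd) or raise 'conversion failed'
```

The resulting PDF can then be passed through the page content handler like any other PDF.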