The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more. That is the official story, and it is true, but offering a single interface for such a wide variety of documents also limits features that are specific to one type of content. We ran into exactly this limitation with Apache Tika in a project.
Tika extracts the content of a whole document at once, but we needed to extract the content one page at a time so that we could index each page properly and search by page. Apache Tika does not support this out of the box, so we had to write a custom content handler.
With respect to pages, we can divide documents into two types:
1- Page-based documents
2- Section-based documents
PDF is a page-based document, whereas DOCX is a section-based document.
Apache Tika has a content handler called ToXMLContentHandler, which converts the content of any document into XML. When a page-based document is converted, the resulting XML carries a special attribute on the tag that marks each page (a div with class page), so we extended ToXMLContentHandler to achieve our goal. The implementation here is in JRuby:
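    require 'java'
    java_import 'org.apache.tika.sax.ToXMLContentHandler'
    java_import 'java.lang.StringBuilder'

    class PageContentHandler < ToXMLContentHandler
      attr_accessor :page_tag, :page_number, :page_class, :page_map

      def initialize
        super() # initialize the underlying Java handler
        @page_number = 0
        @page_tag = 'div'
        @page_class = 'page'
        @page_map = Hash.new # page number => text of that page
      end

      # A new page starts at every <div class="page">.
      def startElement(uri, local_name, q_name, atts)
        start_page if @page_tag == q_name && atts.getValue('class') == @page_class
      end

      def endElement(uri, local_name, q_name)
        end_page if @page_tag == q_name
      end

      # Only ch[start, length] belongs to this event; the rest of the
      # buffer may hold unrelated data, so append just that slice.
      def characters(ch, start, length)
        if length > 0
          builder = StringBuilder.new(length)
          builder.append(ch, start, length)
          @page_map[@page_number] << builder.to_s if @page_number > 0
        end
      end

      def start_page
        @page_number = @page_number + 1
        @page_map[@page_number] = String.new
      end

      # Nothing to do when a page closes; kept as a hook.
      def end_page
        return
      end
    end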
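To make the event flow concrete, here is a plain-Ruby sketch that needs neither JRuby nor Tika. It assumes, as the handler above does, that each page arrives wrapped in a div with class page. The PageSplitter class and the hand-written event list are hypothetical stand-ins for the real handler and for the SAX events a parser would fire; end-of-element events are left out because end_page does nothing.

```ruby
# Plain-Ruby stand-in for PageContentHandler (hypothetical, illustration only).
class PageSplitter
  attr_reader :page_map

  def initialize(page_tag = 'div', page_class = 'page')
    @page_tag = page_tag
    @page_class = page_class
    @page_number = 0
    @page_map = {}
  end

  # A new page starts at every <div class="page">.
  def start_element(name, attrs)
    return unless name == @page_tag && attrs['class'] == @page_class

    @page_number += 1
    @page_map[@page_number] = String.new
  end

  # Text seen before the first page marker is ignored.
  def characters(text)
    @page_map[@page_number] << text if @page_number > 0
  end
end

# Simulated events for this (assumed) XML shape:
#   <div class="page"><p>First page</p></div>
#   <div class="page"><p>Second page</p></div>
events = [
  [:start, 'div', { 'class' => 'page' }],
  [:start, 'p', {}],
  [:chars, 'First page'],
  [:start, 'div', { 'class' => 'page' }],
  [:start, 'p', {}],
  [:chars, 'Second page']
]

splitter = PageSplitter.new
events.each do |type, *args|
  case type
  when :start then splitter.start_element(*args)
  when :chars then splitter.characters(*args)
  end
end

# Prints a hash with one entry per page.
p splitter.page_map
```

The p element does not start a new page because only the configured tag and class together count as a page marker, so its text lands in whichever page is currently open.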
The content handler can then be used like this:
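    java_import 'org.apache.tika.parser.AutoDetectParser'
    java_import 'org.apache.tika.parser.ParseContext'

    # input_stream: java.io.InputStream; @metadata_java: org.apache.tika.metadata.Metadata
    parser = AutoDetectParser.new
    handler = PageContentHandler.new
    parser.parse(input_stream, handler, @metadata_java, ParseContext.new)
    puts handler.page_map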
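Once the page map is available, searching by page is straightforward. The sketch below is plain Ruby and uses a hand-written hash in place of handler.page_map. The sample text and the pages_matching helper are hypothetical. A real project would more likely send each page to a search engine as its own document.

```ruby
# Hypothetical stand-in for handler.page_map: page number => page text.
page_map = {
  1 => 'Apache Tika extracts text and metadata.',
  2 => 'Each page can be indexed separately.',
  3 => 'Tika works well with page based documents.'
}

# Return the page numbers whose text contains the term (case-insensitive).
def pages_matching(page_map, term)
  needle = term.downcase
  page_map.select { |_number, text| text.downcase.include?(needle) }.keys
end

p pages_matching(page_map, 'tika')    # => [1, 3]
p pages_matching(page_map, 'indexed') # => [2]
```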
We tested it with several different PDF documents, and it extracted the pages correctly in every case. Unfortunately, it does not work well with the DOCX format, since DOCX is a section-based document. We checked the XML format of a DOCX file, and it does not have any div with class page. Instead, the page is identified by a <footer> tag, so as an alternative you can convert the DOCX to PDF and then extract its content per page.