Apache Tika Per Page Content Extraction

  • July 24, 2015

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more. That is the official description, and it is true, but a single interface for such a wide variety of documents also limits features that are specific to one type of content. We ran into exactly this limitation with Apache Tika on a project.

Tika extracts the content of a whole document at once, but we needed to extract the content one page at a time so that we could index each page separately and return search results by page. Apache Tika does not support this out of the box, so we had to write a custom content handler.

With respect to pages, documents fall into two types:
1- Page-based documents
2- Section-based documents

PDF is a page-based document, whereas DOCX is a section-based document.

Apache Tika has a content handler called ToXMLContentHandler, which converts a document's content to XML. When a page-based document is converted to XML, its tags carry special attributes that mark the pages, so we subclassed ToXMLContentHandler to collect the content page by page. The implementation below is written in JRuby:

Ruby
require 'java'

# The Tika jars must be on the classpath.
java_import 'org.apache.tika.sax.ToXMLContentHandler'

# Collects the text of each page into page_map (page number => text).
class PageContentHandler < ToXMLContentHandler
  attr_accessor :page_tag, :page_number, :page_class, :page_map

  def initialize
    super() # run the Java superclass constructor
    @page_number = 0
    @page_tag = 'div'
    @page_class = 'page'
    @page_map = {}
  end

  # A new page starts at each <div class="page"> element.
  def startElement(uri, local_name, q_name, atts)
    start_page if @page_tag == q_name && atts.getValue('class') == @page_class
  end

  def endElement(uri, local_name, q_name)
    end_page if @page_tag == q_name
  end

  # Append only the ch[start, length] slice, not the whole buffer.
  def characters(ch, start, length)
    return unless length > 0 && @page_number > 0
    @page_map[@page_number] << java.lang.String.new(ch, start, length).to_s
  end

  def start_page
    @page_number += 1
    @page_map[@page_number] = String.new
  end

  # Hook for per-page post-processing; intentionally a no-op.
  def end_page
  end
end
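
The same state machine can be tried without a JVM. The following is a minimal sketch in plain Ruby that uses REXML's stream parser in place of Tika. It is a variation on the handler above, not the project's code: the PageSplitter class, the depth counter, and the sample markup are all assumptions made for illustration. The depth counter is one possible refinement, so that a nested div does not end a page and text outside any page div is not attributed to the previous page.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Collects text per page from XHTML in which each page is wrapped in
# <div class="page">. Hypothetical class, for illustration only.
class PageSplitter
  include REXML::StreamListener

  attr_reader :page_map

  def initialize(page_tag = 'div', page_class = 'page')
    @page_tag = page_tag
    @page_class = page_class
    @page_map = {}
    @page_number = 0
    @depth = 0 # 0 means "not inside a page element"
  end

  def tag_start(name, attrs)
    if @depth.zero?
      # Outside a page: only a page element changes state.
      return unless name == @page_tag && attrs['class'] == @page_class
      @page_number += 1
      @page_map[@page_number] = String.new
      @depth = 1
    else
      # Inside a page: track nesting so inner tags do not close the page.
      @depth += 1
    end
  end

  def tag_end(_name)
    @depth -= 1 if @depth.positive?
  end

  def text(value)
    @page_map[@page_number] << value if @depth.positive?
  end
end

# Hand-written sample markup (an assumption, not real Tika output).
SAMPLE_XHTML = <<~XML
  <html><body>
  <div class="page"><div class="box">First page,</div><p> still page one</p></div>
  <p>between pages</p>
  <div class="page"><p>Second page</p></div>
  </body></html>
XML

splitter = PageSplitter.new
REXML::Document.parse_stream(SAMPLE_XHTML, splitter)
pages = splitter.page_map.transform_values(&:strip)
puts pages
```

Here the text "between pages" is dropped because it sits outside every page div, while the nested box div stays part of page 1.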

To use PageContentHandler, parse a document with it:

Ruby
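java_import 'org.apache.tika.parser.AutoDetectParser'
java_import 'org.apache.tika.parser.ParseContext'
# input_stream is a java.io.InputStream for the document;
# @metadata_java is an org.apache.tika.metadata.Metadata instance.
parser = AutoDetectParser.new
handler = PageContentHandler.new
parser.parse(input_stream, handler, @metadata_java, ParseContext.new)
puts handler.page_map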
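
Since the goal is to search by page, here is a rough sketch of how a page map of this shape (page number => text) could feed a per-page index. The helper names build_page_index and search_pages and the sample data are hypothetical, and a real project would more likely send each page to a search engine. This only illustrates the idea in plain Ruby.

```ruby
# Build a tiny inverted index (word => page numbers) from a page map
# shaped like { page_number => text }.
def build_page_index(page_map)
  index = Hash.new { |hash, word| hash[word] = [] }
  page_map.each do |page_number, text|
    text.downcase.scan(/[[:alnum:]]+/).uniq.each do |word|
      index[word] << page_number
    end
  end
  index
end

# Return the pages that contain every word of the query.
def search_pages(index, query)
  words = query.downcase.scan(/[[:alnum:]]+/)
  return [] if words.empty?
  words.map { |word| index.fetch(word, []) }.reduce(:&)
end

# Hypothetical sample data standing in for handler.page_map.
sample_page_map = {
  1 => 'Apache Tika extracts text',
  2 => 'Tika handles PDF pages',
  3 => 'Indexing pages for search'
}

index = build_page_index(sample_page_map)
tika_pages = search_pages(index, 'tika')       # => [1, 2]
both_pages = search_pages(index, 'tika pages') # => [2]
puts tika_pages.inspect
puts both_pages.inspect
```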

We tested it with a range of PDF documents and it worked for all of them. Unfortunately, it does not work well with the DOCX format, since DOCX is a section-based document. We checked the XML for a DOCX file: it has no div with the class page. Instead, a page is identified by the <footer> tag, so as an alternative you can convert the DOCX to PDF and then extract its content per page.
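
The post does not say which converter to use for the DOCX-to-PDF step. As one possibility, the sketch below assumes LibreOffice is installed and its soffice binary is on the PATH. The helper names are hypothetical. Note that the page breaks in the converted PDF come from the converter's layout and may differ from what Word shows.

```ruby
# Argument list for a headless LibreOffice conversion (assumes the
# soffice binary is on the PATH).
def docx_to_pdf_command(docx_path, out_dir)
  ['soffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, docx_path]
end

# Path LibreOffice writes to: same base name, .pdf extension, in out_dir.
def pdf_path_for(docx_path, out_dir)
  File.join(out_dir, "#{File.basename(docx_path, '.*')}.pdf")
end

# Run the conversion and return the path of the resulting PDF.
def convert_docx_to_pdf(docx_path, out_dir)
  ok = system(*docx_to_pdf_command(docx_path, out_dir))
  raise "conversion failed for #{docx_path}" unless ok
  pdf_path_for(docx_path, out_dir)
end
```

The returned PDF path can then be opened as an input stream and parsed with PageContentHandler like any other PDF.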
