The Apache Software Foundation has announced Apache Tika v1.0, an embeddable toolkit for content detection and analysis that was five years in the making.
What is Tika? The announcement describes it as a one-stop shop for identifying, retrieving and parsing text and metadata from more than 1,200 file formats, such as HTML, PDF, images, OpenOffice, Microsoft Office, email and more:
"Users and software applications use Apache Tika to explore the information landscape through flexible interfaces in Java, from the command line, REST-ful Web services, and also by consuming its functionality from a multitude of programming languages directly, including Python, .NET and C++. Tika defines a standard application programming interface (API) and makes use of existing libraries such as Apache POI and PDFBox to detect and extract metadata and structured text content from various documents using existing parser libraries."
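To make "content detection" a little more concrete, here is a minimal sketch in plain Java of one common approach: comparing a file's leading bytes against known signatures. This is not Tika's API or its implementation. The class name MagicSniffer, the detect method and the four-entry signature table are all hypothetical, chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a toy detector, not part of Apache Tika.
public class MagicSniffer {
    // Hypothetical signature table mapping a media type to its leading bytes.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        SIGNATURES.put("image/gif", "GIF8".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("application/zip", new byte[] {'P', 'K', 3, 4});
    }

    // Returns the first media type whose signature matches the start of the
    // content, or a generic binary type when nothing matches.
    static String detect(byte[] content) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(content, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample));
    }
}
```

A sketch like this stops at the container level: OpenOffice documents and the newer Microsoft Office formats are ZIP archives, so it would report all of them as application/zip. Telling such formats apart, and then extracting their text and metadata, is the kind of work a toolkit like Tika hands to dedicated parser libraries.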
Tika Takes Off
Tika, which became an ASF Top-Level Project in April 2010, has been tested in repositories with more than 500 million documents. Dan Crichton, Program Manager and Principal Computer Scientist at the NASA Jet Propulsion Laboratory, says that NASA uses Tika on several Earth science data system projects to help process hundreds of terabytes of scientific data in a variety of formats. Arjé Cahn, CTO of Hippo, explains that Hippo is exploring ways to use Tika to enhance access to metadata in the Hippo CMS's navigation feature.
Tika is released under the Apache License v2.0, and the source code and documentation are available from the Apache website.
Tika in Action
This week, ApacheCon attendees can see Apache Tika v1.0 at the Content Technologies track on November 10. Chris Mattmann, a senior computer scientist at the NASA Jet Propulsion Laboratory, will lead the talk.
Mattmann, who co-wrote Tika in Action with Jukka L. Zitting, will no doubt be plugging his new book, too. The Manning Publications title is set to drop some time this month.