OutWit Hub: New Semantic Search Tool for Web Harvesting
OutWit Technologies has released a new Firefox 3.0 extension designed to outwit the web with smart web harvesting and semantic search. Built on a recognition technology, the tool is billed as the “first step towards a full-blown semantic browser.” With an open API, OutWit Hub allows users to harvest data elements, documents or media from virtually any public source of content.

Semantic Search is Gaining Popularity

With COGITO Monitor and COGITO Focus already on the market, semantic search and web services are certainly starting to gain more and more attention. Performance issues aside, OutWit came out with this new application at the right time. As the semantic search space heats up with the likes of Quintura, semantic search promises to add value beyond Google’s primary focus on the “most-linked” algorithm. As we've previously discussed, semantic search is not an easy undertaking, so it will be interesting to see where OutWit will go with its semantic browsing dreams and hopes.

OutWit Hub In a Nutshell

OutWit Hub is a web collection engine - running on Windows, Mac or Linux - that allows users to browse and collect data, images, contacts or files from the web in a few clicks. Originally developed for researchers and data managers, the application can also be used for business or professional web scraping. As you browse the web, OutWit scans the pages and formats the recognized data into tables, which you can then export to files, spreadsheets or databases for later use. J.C. Combaz, CEO of OutWit Technologies, puts it this way: "If you are looking for photos of sport cars, search engines will give you thumbnails with links to pages containing the images; OutWit will directly put the high resolution pictures in a folder on your hard disk. If you want stock quotes, search engines will tell you where to find the figures; OutWit will put them into a spreadsheet on your desktop."

How It Works

The OutWit Platform is composed of a kernel that contains a large library of data recognition and extraction functions. Using the kernel, original extensions -- or outfits -- can be developed for specific applications. An outfit is an .xul extension with its own user interface, features, scripts and directory of web sources. OutWit Hub is one of the outfits developed based on the OutWit platform’s kernel.
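To make the kernel/outfit split concrete, here is a minimal, hypothetical sketch in Python (not OutWit's actual code, which is a Firefox/.xul extension): a kernel that holds generic recognition functions, and an "outfit" that wires one of them to a specific purpose.

```python
import re

class Kernel:
    """Hypothetical stand-in for the platform kernel: a library of
    generic data-recognition and extraction functions."""
    EXTRACTORS = {
        "links": re.compile(r'href="([^"]+)"'),
        "images": re.compile(r'<img[^>]+src="([^"]+)"'),
    }

    def extract(self, kind, html):
        # Every outfit reuses these shared extractors instead of
        # reimplementing recognition logic itself.
        return self.EXTRACTORS[kind].findall(html)

class ImageOutfit:
    """Hypothetical 'outfit': a purpose-specific application
    (its own UI and sources) built on top of the kernel."""
    def __init__(self, kernel):
        self.kernel = kernel

    def harvest(self, html):
        return self.kernel.extract("images", html)

outfit = ImageOutfit(Kernel())
print(outfit.harvest('<img src="photo.jpg"> <a href="/x">x</a>'))
```

The point of the design is that new outfits (images, jobs, etc.) only add interface and source lists; the recognition machinery stays in one shared kernel.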

Concepts and Features

The OutWit software is based on three concepts:

1. Dissecting a web page into data elements, enabling users to sift through the data and see only what they are looking for (images, links, e-mail addresses, etc.);
2. Offering a universal collection basket, “the Catch”, into which users can drag and drop any data as they surf the Web;
3. Harvesting the web with one click.
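The first concept - dissecting a page into element buckets - can be sketched in a few lines. Here is an illustrative example (my own, not OutWit's implementation) that splits a page's HTML into links, images and e-mail addresses using only the Python standard library:

```python
import re
from html.parser import HTMLParser

class PageDissector(HTMLParser):
    """Sort a page's HTML into buckets of data elements."""
    def __init__(self):
        super().__init__()
        self.links, self.images = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def dissect(html):
    parser = PageDissector()
    parser.feed(html)
    return {
        "links": parser.links,
        "images": parser.images,
        "emails": EMAIL_RE.findall(html),
    }

sample = """<html><body>
<a href="http://example.com/page">a page</a>
<img src="photo.jpg">
Contact: someone@example.com
</body></html>"""

print(dissect(sample))
# {'links': ['http://example.com/page'], 'images': ['photo.jpg'],
#  'emails': ['someone@example.com']}
```

Each bucket then maps naturally onto a sidebar view ("images", "links", "e-mails") of the kind the Hub presents.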


* Data structure recognition
* Automatic multi-page browsing
* Full-screen browsing
* Automatic slide show on image searches
* Page and image link extraction
* E-mail extraction (automatic extraction limited to 100 addresses)
* Table and list extraction
* Syntax colored page source
* Scrapers
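The table-extraction feature above is easy to picture: pull the cells out of an HTML table and write them to a spreadsheet-friendly format. A minimal sketch (again my own illustration, not OutWit's code), using only Python's standard library:

```python
import csv
import io
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the rows of every table on a page."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def table_to_csv(html):
    """Extract table rows and render them as CSV, ready for a spreadsheet."""
    extractor = TableExtractor()
    extractor.feed(html)
    buf = io.StringIO()
    csv.writer(buf).writerows(extractor.rows)
    return buf.getvalue()

page = """<table>
<tr><th>Ticker</th><th>Quote</th></tr>
<tr><td>XYZ</td><td>42.50</td></tr>
</table>"""

print(table_to_csv(page))
```

The resulting CSV opens directly in any spreadsheet, which is the "quotes into a spreadsheet on your desktop" workflow the CEO describes.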

Final Words

Be careful: the API is not yet stabilized, and there is virtually no documentation or user guide on how to install and use the application. Instead, the company provides video previews to give users a “feeling of the environment capabilities.” One of them - a video tutorial on how to use OutWit Hub to harvest and download images - can be found here.

The installation file is provided as an .xpi file, Mozilla’s cross-platform installer package, which can typically be executed by dragging and dropping it into a Firefox window. Remember, as of now, the Hub is only compatible with Firefox 3.0.

In the future, there are plans to develop specific applications for specific purposes, such as OutWit Images, OutWit Jobs, etc. As the Hub evolves, it may actually become useful for developing scripts, scrapers and the like; but that time doesn’t appear to be now.

Editor's Note: OutWit heard us and helped us find the "a little bit hidden" OutWit Hub tutorials on their blog. While there is still no documentation in the classic sense, the tutorials on how to get started and how to extract data are available here.