CMS News, Reviews and Resources
Content Management Matters ™
 
 

OutWit Hub: New Semantic Search Tool for Web Harvesting

OutWit Technologies has released a new Firefox 3.0 extension designed to "outwit" the web by performing smart web harvesting and semantic search. Built on data recognition technology, the tool is billed as the “first step towards a full-blown semantic browser.”

Based on an open API, OutWit Hub allows users to harvest data elements, documents or media from virtually any public source of content.

Semantic Search is Gaining Popularity

With COGITO Monitor and COGITO Focus on the market, semantic search and web services are certainly starting to gain more attention.

Performance issues aside, OutWit came out with this new application at the right time. The semantic search space is heating up with the likes of Quintura, and semantic search aims to add value beyond Google’s primary focus on its “most-linked” ranking algorithm.

As we've previously discussed, semantic search is not an easy undertaking, so it will be interesting to see where OutWit will go with its semantic browsing dreams and hopes.

OutWit Hub In a Nutshell

OutWit Hub is a web collection engine - running on Windows, Mac or Linux - that allows users to browse and collect data, images, contacts or files from the web with a few clicks.

Originally developed for researchers and data managers, the application can also be used for business or professional web scraping. While you browse the web, OutWit scans the pages and formats their contents into tables, allowing you to export the data into files, spreadsheets or databases for later use.

J.C. Combaz, CEO of OutWit Technologies, puts it this way: "If you are looking for photos of sport cars, search engines will give you thumbnails with links to pages containing the images; OutWit will directly put the high resolution pictures in a folder on your hard disk. If you want stock quotes, search engines will tell you where to find the figures; OutWit will put them into a spreadsheet on your desktop."

How It Works

The OutWit Platform is composed of a kernel that contains a large library of data recognition and extraction functions. Using the kernel, original extensions -- or outfits -- can be developed for specific applications.

An outfit is a XUL-based extension with its own user interface, features, scripts and directory of web sources. OutWit Hub is one of the outfits built on the OutWit platform’s kernel.

Concepts and Features

The OutWit software is based on three concepts:

  1. Dissecting a web page into data elements to enable the users to sift through data and see only what they are looking for (images, links, e-mail addresses, etc.),
  2. Offering a universal collection basket, “the Catch”, into which the users can drag and drop any data, as they surf the Web,
  3. Harvesting the web with one click.
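The first concept, dissecting a page into typed data elements, can be illustrated with a minimal Python sketch. This is not OutWit's kernel or API (which is undocumented); the `PageDissector` class, the `dissect` function, the e-mail pattern and the sample page are all hypothetical, and the code relies only on the Python standard library.

```python
import re
from html.parser import HTMLParser

# Simplified e-mail pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class PageDissector(HTMLParser):
    """Hypothetical helper that sorts page contents into typed buckets."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []
        self.emails = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            href = attrs["href"]
            if href.startswith("mailto:"):
                # Drop any "?subject=..." suffix after the address.
                self._add_email(href[len("mailto:"):].split("?")[0])
            else:
                self.links.append(href)
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_data(self, data):
        # Also pick up addresses that appear in plain text.
        for address in EMAIL_RE.findall(data):
            self._add_email(address)

    def _add_email(self, address):
        if address not in self.emails:
            self.emails.append(address)


def dissect(html):
    """Return the links, images and e-mail addresses found in an HTML string."""
    parser = PageDissector()
    parser.feed(html)
    parser.close()
    return {"links": parser.links, "images": parser.images, "emails": parser.emails}


# Made-up sample page, not taken from any real site.
SAMPLE_PAGE = """
<html><body>
  <a href="http://example.com/cars">Sports cars</a>
  <img src="http://example.com/img/car1.jpg">
  <p>Contact: <a href="mailto:info@example.com">info@example.com</a></p>
</body></html>
"""

if __name__ == "__main__":
    for kind, items in dissect(SAMPLE_PAGE).items():
        print(kind, items)
```

Each bucket then plays the role of a filtered view: a user interested only in images would look at `images` and ignore the rest.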

Features

  • Data structure recognition
  • Automatic multi-page browsing
  • Full-screen browsing
  • Automatic slide show on image searches
  • Page and image link extraction
  • E-mail extraction (automatic extraction limited to 100 addresses)
  • Table and list extraction
  • Syntax colored page source
  • Scrapers
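To give a feel for what table extraction followed by a spreadsheet export involves, here is a rough Python sketch using only the standard library. It is an assumption about the general technique, not a description of how OutWit Hub implements it; `TableExtractor`, `table_to_csv` and the sample quotes are invented for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Hypothetical helper that collects the cell text of HTML table rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Convert the table rows found in an HTML string to CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up figures, not real stock quotes.
SAMPLE_TABLE = """
<table>
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>ABC</td><td>12.50</td></tr>
  <tr><td>XYZ</td><td>7.25</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_csv(SAMPLE_TABLE))
```

The resulting CSV text can be saved to a file and opened in any spreadsheet application. The sketch ignores nested tables and cells that span several rows or columns.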

Final Words

Be careful: the API is not yet stable, and there is virtually no documentation or user guide explaining how to install and use the application. Instead, the company offers video previews to give users a “feeling of the environment capabilities.” One of them, a video tutorial on how to use OutWit Hub to harvest and download images, can be found here.

The installer is provided as an .xpi file, Mozilla's cross-platform install (XPInstall) package format for extensions. It can typically be installed by dragging and dropping it into a Firefox window. Remember, as of now, the Hub is only compatible with Firefox 3.0.
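Since an .xpi package is an ordinary ZIP archive, curious users can inspect one before installing it. The Python sketch below assumes nothing about what the OutWit Hub package actually contains: it builds a tiny stand-in archive in memory with placeholder file names (`install.rdf` is the usual manifest name for Firefox extensions of this era, `chrome/content/example.xul` is invented) and lists its entries. The function names are hypothetical.

```python
import io
import zipfile


def list_xpi_contents(xpi_file):
    """Return the sorted entry names of an .xpi (ZIP) archive.

    xpi_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(xpi_file) as archive:
        return sorted(archive.namelist())


def build_sample_xpi():
    """Build a stand-in archive in memory; the entries are placeholders."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("install.rdf", "<!-- placeholder manifest -->")
        archive.writestr("chrome/content/example.xul", "<!-- placeholder UI -->")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name in list_xpi_contents(build_sample_xpi()):
        print(name)
```

To inspect a downloaded package, pass its path to `list_xpi_contents` instead of the in-memory sample.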

In the future, the company plans to develop applications for specific purposes, such as OutWit Images and OutWit Jobs. As the Hub evolves, it may become genuinely useful for developing scripts and scrapers, but it doesn’t look like that time is now.

Editor's Note: OutWit heard us and helped us find the "a little bit hidden" OutWit Hub tutorials on their blog. While there is still no documentation in the classic sense, tutorials on how to get started and how to extract data are available here.


3 Reader Comments

1 | john conroy — August 19, 2008 7:27 PM

all very interesting. I wonder what other kind of stuff would make it to a dedicated semantic browser?? I wonder if Mozilla and MS are working on features to utilize semantic structured data??

2 | Anny Dean — August 20, 2008 11:29 AM

I don't know if it can really be called a semantic browser yet, but it's definitely a tool of a new kind. They do say on their site ( outwit.com ) that they'll release the documented API soon, to build our own tools. That should be interesting. In the meantime, it is already possible to create scrapers if the automatic features are not enough. It's already an extremely useful data extraction tool.
Pretty promising.

3 | Emmanuel — August 20, 2008 1:02 PM

Sounds great. The tool is really exciting but lacks good documentation. The scraper looks really promising and I would like to test the scripting functionalities asap. Keep moving forward guys ;-)
