Diffbot Deconstructs, Repackages Structured Web Data

One of the greatest revolutions in computing has been the success of the Web browser. Browsers help us navigate the Internet and show us the things we love, such as celebrity news and cat videos.

But as powerful as modern browsers have become, displaying a page is only one way to use the information on it. While Web browsers are great at reading the markup languages (such as HTML and XML) that Web pages are built from, they need helper apps to do lots of other things. That's one reason there are several popular browsers out there. Some play nicer with those helper apps, and some are just really good at doing the things most people expect them to do.

The Structured Web

What we're talking about here is all the stuff that's out there on the Internet. Most people are happy to use their browser to display Web pages and to send messages and images through Web-based email and IM systems. Browsers are great at these tasks because they can take the numbers and text in a Web site's data and display them properly on our laptops, smartphones and tablets so we can understand what they mean. However, the online world is still a disorganized place, and technologists are now focusing more on organizing online data in even more useful ways.

The exploding amount of online data, brought on by increasing numbers of Internet and smartphone users, has made it more important than ever to structure data so that it can be linked more efficiently across the Web. Using metadata, Web page information can be made easier to search, index and repackage for uses other than display in a Web browser. This information, embedded in Web sites, can help databases and search engines find things faster and return results that are more useful to us. Startup Diffbot is one company that is taking this approach in a pretty unusual way.
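To make the idea of embedded metadata a bit more concrete, here is a minimal sketch in Python that pulls the meta tags out of a page using only the standard library. The sample page, the MetaTagCollector class and the extract_metadata function are made up for illustration; this is a simple example of reading metadata, not a description of how Diffbot or any search engine actually works.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects name/property -> content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Plain meta tags use "name"; Open Graph style tags use "property".
        key = attr_map.get("name") or attr_map.get("property")
        content = attr_map.get("content")
        if key and content is not None:
            self.metadata[key] = content


def extract_metadata(html):
    """Return a dict of the metadata embedded in an HTML string."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.metadata


# Hypothetical page used only for this example.
SAMPLE_PAGE = """
<html><head>
<title>Cat Video Roundup</title>
<meta name="description" content="This week's best cat videos.">
<meta name="keywords" content="cats, videos">
<meta property="og:type" content="article">
</head><body><p>Hello</p></body></html>
"""

if __name__ == "__main__":
    for key, value in extract_metadata(SAMPLE_PAGE).items():
        print(f"{key}: {value}")
```

A program that has this dictionary in hand can index or repackage the page without ever rendering it, which is the kind of non-browser use the paragraph above describes.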

Diffbot is less than a year old, and as it grows, it will be able to read more types of Web pages.

Diffbot for Machine Readable Content

Diffbot has created a system that can read the contents of a Web page and break it into component parts such as text, images and headlines. It uses machine learning and natural language processing (not unlike Apple's Siri voice recognition) to do this, but so far it works only for home pages and article pages. The company has determined there are at least 18 types of Web pages out there, and it will be working on making the rest readable as it grows.
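As a rough picture of what "breaking a page into component parts" means, here is a small Python sketch that splits an article page into a headline, body text and images. It is a deliberately naive, rule-based stand-in that just looks for h1, p and img tags. Diffbot's system is described as using machine learning, and nothing here reflects its actual method or API. The PageComponents class, the split_page function and the sample article are all hypothetical.

```python
from html.parser import HTMLParser


class PageComponents(HTMLParser):
    """Naive splitter: first <h1> is the headline, <p> tags are text, <img> tags are images."""

    def __init__(self):
        super().__init__()
        self.headline = ""
        self.paragraphs = []
        self.images = []
        self._current = None  # the h1 or p tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)
        elif tag in ("h1", "p") and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace so the text is clean for reuse.
        text = " ".join("".join(self._buffer).split())
        if tag == "h1" and not self.headline:
            self.headline = text
        elif tag == "p" and text:
            self.paragraphs.append(text)
        self._current = None


def split_page(html):
    """Return the page's components as a plain dictionary."""
    parser = PageComponents()
    parser.feed(html)
    return {
        "headline": parser.headline,
        "text": parser.paragraphs,
        "images": parser.images,
    }


# Hypothetical article used only for this example.
SAMPLE_ARTICLE = """
<html><body>
<h1>Startup Reads the Web</h1>
<img src="robot.png" alt="A robot">
<p>First paragraph of   the story.</p>
<p>Second paragraph with <em>emphasis</em>.</p>
</body></html>
"""

if __name__ == "__main__":
    print(split_page(SAMPLE_ARTICLE))
```

Fixed rules like these fall apart quickly on real pages, where headlines and body text are not reliably marked up. That gap is presumably why a learning-based approach is attractive.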

Once Diffbot breaks those pages into their component parts, the content can be repackaged into a magazine-style layout for the iPad, for example. In fact, this is exactly what AOL is using Diffbot for with its Editions by AOL iPad app.

Without Diffbot, that content would have to be ported over from the existing CMS, a much slower process. Businesses that have a large amount of content they want to aggregate would benefit the most from this system. The service is free if you're only making 10,000 API calls per month, and Diffbot scales for enterprise users. Let us know in the comments if you've ever had to do a ton of content aggregation and would have loved to have Diffbot around to help you.
