Screen Scraping Becomes an Industry After $13M Buy-In

The screen scraper has a history as long as the web.

Ever since the first web designers used the <TABLE> element in HTML to construct long lists of records in response to database queries, there have been automated tools to capture the data from web pages into some form of separate database.
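As a rough illustration of what such a tool does, the Python sketch below pulls the rows of a <TABLE> into a list using only the standard library. The sample markup, the TableScraper class and the scrape_table function are hypothetical names invented for this example. Real pages are messier and would need more care.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TABLE> and <table> look alike.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page, in the style of a 1990s database report.
SAMPLE = """
<TABLE>
  <TR><TH>Flight</TH><TH>Status</TH></TR>
  <TR><TD>XY 101</TD><TD>On time</TD></TR>
  <TR><TD>XY 202</TD><TD>Delayed</TD></TR>
</TABLE>
"""

rows = scrape_table(SAMPLE)
```

From there, the rows can be written to a CSV file or inserted into a database, which is essentially the whole job of a basic scraper.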

While perhaps not yet the crux of an empire, scrapers certainly were a business. The output of scraped screens dating back at least 13 years is still, to this day, put on display by scraping services as examples of a skill that hasn’t changed since Clay Aiken was popular.

And for script developers using PHP, Ruby or Python, screen scraping has evolved into an art form.

Now there's word that it’s an industry worthy of serious financial investment, at least in someone’s eyes.

London-based Import.io — a web-based service for harvesting data from web pages — announced yesterday that it closed on a $13 million Series A funding round, led by London-based venture capital firm Imperial Innovations.

That’s an astonishing deal, given Imperial’s typical penchant for funding pharmaceutical firms with heavy intellectual property portfolios, and analytics firms with innovative value propositions.

Not a Scraper: An ‘Extractor’

What’s Import.io’s innovative value proposition? It presents itself as a kind of integration system, effectively extracting the layout patterns from web pages and generating scripts called crawlers (not to be confused with the tools used by search engines) that automate the process of capturing data from those same URLs going forward.

This way, APIs can be generated for any kind of app that produces the same data under its own format. You can build your own web app around someone else’s database.

“A crawler is, in effect, an automated extractor,” explained Alex Gimson, a community evangelist for Import.io, in a company webcast last April.

“You simply give it five pages, and then using those five pages, it will learn where, exactly, on the website your data sits, and it will crawl through that whole website — hence the name ‘crawler’ — and it will extract all of the data from pages that match the initial training you’ve given it.”

Indeed, Import.io has produced multiple tutorial videos, including some for developers, showing how they can pull together data from multiple sources into a single app. One example shows how data generated by technology retailers’ websites may be gathered together by a single app, to produce a handy price comparison tool.
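The price comparison idea is simple enough to sketch in a few lines of Python. This is not Import.io’s code or API. The retailer names, the record layout and the cheapest_offers function are all assumptions made for illustration, and the sketch presumes the data has already been extracted from each site.

```python
def cheapest_offers(sources):
    """Map each product to the (retailer, price) pair with the lowest price.

    `sources` maps a retailer name to a list of records, each a dict with
    "product" and "price" keys (a hypothetical layout for this sketch).
    """
    best = {}
    for retailer, rows in sources.items():
        for row in rows:
            product, price = row["product"], row["price"]
            # Keep the first offer seen, then replace it only if undercut.
            if product not in best or price < best[product][1]:
                best[product] = (retailer, price)
    return best


# Made-up data standing in for what two scraped retail sites might yield.
SOURCES = {
    "ShopA": [
        {"product": "USB cable", "price": 4.99},
        {"product": "Mouse", "price": 19.50},
    ],
    "ShopB": [
        {"product": "USB cable", "price": 3.75},
    ],
}

offers = cheapest_offers(SOURCES)
```

The harder part in practice is matching products across sites, since two retailers rarely describe the same item with the same string.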

The legitimate applications of such a tool are certainly compelling. Arguably, the two industries that have the most difficult time coping with the abundance of data they produce for themselves are healthcare and pharmaceuticals.

Knowledge management systems dating back to the 1990s, coupled with completely separate and non-standardized approaches to complying with federal data management guidelines, have led to potentially life-saving data being locked down in databases whose only method of output, quite literally, is 1990s-era HTML.

Every Which Way on a One-Way Street

A 2013 paper [PDF] by University of Massachusetts Lowell Professor Edward T. Chen tells the story of how, in 2004, then-President Bush spearheaded an effort to standardize healthcare IT practices, by appointing a federal IT coordinator specifically for that industry. The goal for that coordinator was to come up with a set of universal guidelines for storing, managing, and reporting data.

Nine years later, Professor Chen noted, it had become impossible for any universal guidelines to be adopted, ironically because it was left to hospitals’ independent owners and operators to determine for themselves how to comply with newly imposed HIPAA guidelines. Everybody had his own way of complying, and nobody wanted to change that now.

As a result, Chen wrote, “The availability or flow of information may become a bottleneck of the deployment of a healthcare system.” This was before the government’s first attempt at deploying Healthcare.gov.

The urgent need for a single methodology for gathering, in one place, the data that healthcare organizations publish for themselves may have been one factor inspiring Imperial Innovations, which has funded both analytics and healthcare firms, to lead an investment in Import.io.

On the other hand, data publishers whose life’s blood is web-based revenue are already investing heavily in systems that can thwart the very tasks that Import.io and similar services perform.

Defensive Posture

Portland, Oregon-based FlightStats is a publisher of airline transport data, both for consumers and for the industries that depend upon flight itineraries — hotels, transportation and tourism. It’s this latter group that uses FlightStats data in bulk, and which may often be compelled to acquire that bulk data through a back door – or whatever door can be carved out.

In a recent interview with FlightStats CEO Chad Berkeley, we learned that he and his team are actively exploring the use of application performance monitoring tools, including from New Relic, to ascertain when the company’s data is being scraped by services such as Import.io, and to shut off the data flow in those cases.

“We made this assumption that a lot of the people who were using our mobile apps were travelers. It turns out, that wasn’t a good assumption,” Berkeley said, “and we busted that assumption very quickly.”

A thorough assessment of FlightStats’ web traffic patterns revealed, Berkeley said, that as much as two-thirds of its traffic comes from what the firm now calls “watchers,” effectively second and third parties other than the people actually flying. A majority of this group uses FlightStats data for commercial purposes.

One way this group could be identified was through the traffic patterns it generated: repetitive, fast, and contiguous, all true signs of automation. APM services such as New Relic and Dynatrace are working to incorporate more intelligent traffic pattern detection, so that customers can not only stop the bleeding but also identify potential new sources of monetization and revenue, just as FlightStats did.
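To make the idea of a “repetitive, fast, and contiguous” pattern concrete, here is a minimal Python sketch of one possible heuristic. It is not how FlightStats, New Relic or Dynatrace detect scrapers. The function name, the input format and every threshold are assumptions chosen for illustration only.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_automated_clients(requests, min_requests=5,
                           max_mean_gap=2.0, max_gap_stdev=0.5):
    """Return the set of client IDs whose request timing looks scripted.

    `requests` is an iterable of (client_id, timestamp_in_seconds) pairs.
    A client is flagged when it makes many requests (repetitive), with a
    short average gap between them (fast) and with very regular spacing
    (contiguous). All three thresholds are arbitrary assumptions.
    """
    by_client = defaultdict(list)
    for client, stamp in requests:
        by_client[client].append(stamp)

    flagged = set()
    for client, stamps in by_client.items():
        if len(stamps) < min_requests:
            continue  # too few requests to judge
        stamps.sort()
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        if mean(gaps) <= max_mean_gap and pstdev(gaps) <= max_gap_stdev:
            flagged.add(client)
    return flagged


# Made-up log: one metronomic client, one irregular one, one barely active.
LOG = (
    [("bot", float(t)) for t in range(6)]
    + [("traveler", t) for t in (0.0, 40.0, 45.0, 300.0, 310.0)]
    + [("visitor", 10.0), ("visitor", 500.0)]
)

suspects = flag_automated_clients(LOG)
```

A scraper that randomizes its delays would slip past a check this simple, which is presumably why vendors are investing in more intelligent detection.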

So here are two ends of the technology spectrum, both with legitimate use cases, yet each actively working against the other. Import.io has not replied to CMSWire’s request for comment.
