There is some new data on the war to protect enterprise data. Neither of these data points is the type of "big data" that you are used to, but that does not make them any less important. The new data refers to an important set of user-behavior statistics that is set apart from the fashionable analytic statistics driving the current data-scientist hype. In fact, most analytics departments try to remove this specific set of usage data from any and all reports they generate. This hated chunk of analytic data refers to the "non-human traffic" created by "scrapers" who come to web pages not for the feature-functionality of a site, but for the gold within the site (i.e., the data and content).

Big Data Part 1

The Jury Is In. Many of us knew it all along. Science has caught up to the enlightened few and shown that reasonably regulated marketplaces are the best way to combat piracy of data and content. I've chosen not to say "the only way", because you can try the old-fashioned ways of protection like detecting automated scripts, content honey-pots and digital watermarking. The funny thing about these other ways is this -- they don't actually work. That's right folks! Newton's third law of motion wins again. "When one body exerts a force on a second body, the second body simultaneously exerts a force equal in magnitude and opposite in direction to that of the first body."

The corporate IT departments, security teams, lawyers and external security vendors have met their match: scrapers. Scrapers are people who make a living writing automated scripts that go to websites, scrape data and content out of the HTML, and store it for use in another website. The scrapers are so brazen that they sell their services in the open marketplace -- because after all, writing a script is not illegal; using it in violation of a website's terms of service is, and that's the problem of the guy who hired the scraper to harvest the data. Scrapers are so good at what they do, they are better than LeBron James. You not only can't stop them, you can't even hope to contain them, because of basic supply-and-demand behavior. Attempts to suppress the supply without meaningful attempts to satisfy or stem demand create black markets, because the profit incentive is too good to refuse.

  • A recent study commissioned by Spotify found that efforts to reduce piracy by controlling distribution (called 'artist holdout') had the reverse of the intended effect. In an admittedly small sample, Spotify notes that the artists who engaged in 'artist holdouts' sold one song for every song illegally downloaded. The artists who released on Spotify at the same time as iTunes sold four tracks for every one illegally downloaded.
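
To make those two ratios easier to compare, here is a minimal sketch that turns each one into the share of tracks that were actually paid for. The function name `legal_share` is made up for illustration, and the only inputs are the sold-to-pirated ratios quoted above; it assumes every track was either sold or pirated, which the study itself may not.

```python
def legal_share(sold: float, pirated: float) -> float:
    """Fraction of all acquired tracks (sold + pirated) that were paid for."""
    total = sold + pirated
    if total <= 0:
        raise ValueError("need at least one sold or pirated track")
    return sold / total


# Ratios quoted from the Spotify-commissioned study (tracks sold : tracks pirated).
holdout = legal_share(sold=1, pirated=1)
simultaneous = legal_share(sold=4, pirated=1)

print(f"artist holdout:       {holdout:.0%} of tracks paid for")
print(f"simultaneous release: {simultaneous:.0%} of tracks paid for")
```

Read this way, the holdout artists were paid for about half of the tracks people acquired, while the artists who released everywhere at once were paid for about four out of five.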

There is hope. The big data shows us that there is another way: