Data Scientists: From Zero to Done, Faster

By Virginia Backaitis

Stories are legion of the shortage of data scientists, so you can’t blame me for being more than a little surprised when I found out that these wise men and women sometimes get put on hold when they want to run sophisticated queries. 

The reason? IT often needs to get involved and, more likely than not, the IT team is short on time and resources, not to mention busy with other responsibilities.

Easier, Faster

While one way to ease the frustrations of the data scientists and analytics professionals who work at your company is to quit hiring so that they don’t have to fight for resources (kidding, kind of), other options are to make queries run faster, to leverage technologies like machine learning and to make access easier via new tools.

We already told you about Cloudera’s Navigator Optimizer, which is specific to the Hadoop vendor’s big data crunching solution. But there’s also IBM’s SystemML, which was accepted as an Apache open source project yesterday, and Microsoft’s Data Science Virtual Machine, which data scientists, aspiring data scientists and developers can use without needing help from IT.

Going Virtual

The latter, the Microsoft Data Science Virtual Machine, is a Windows Server 2012-based custom virtual machine image on the Azure marketplace containing several popular tools that data scientists and developers can use for advanced analytics. It is preloaded with tools such as Revolution R Open, the Anaconda Python distribution (including a Jupyter notebook server), Visual Studio Community Edition, Power BI Desktop, SQL Server Express edition and the Azure SDK. Users can get busy in seconds and pay for only their time on Azure, according to Herain Oberoi, director of product management at Microsoft.

Community Access and Development

IBM, for its part, announced that its SystemML has been accepted as an Apache open source project. SystemML eases the challenges encountered when porting algorithms to production environments. It does this by dynamically compiling and optimizing machine learning algorithms in the environments familiar to the data scientist, and automatically porting these algorithms to production environments.

IBM says it is helping data scientists iterate faster and that SystemML eliminates the need for data engineers to rewrite algorithms for varying environments. The result? More app developers will be able to build deep intelligence into everything from mobile applications to large mainframe processes.

Big Data Brings a New Posture

While it’s still too early to say that data science has been democratized, we’re a lot further along than we were just one year ago. And when it comes to open source, we’re seeing some of the world’s biggest IT companies weave it into their strategies, which would have been hard to imagine just five years ago.