The misunderstood relationship between big data and machine-learning

Moving big data from "it's complicated" to the honeymoon phase.

July 19, 2017

3 min read

By Markus Noga, head of machine learning at SAP, and Dan Wellers, global lead for digital futures at SAP

It’s no secret that machine-learning and big data have emerged as a “power couple” for enterprises looking to leverage new automation technologies. Machine-learning trains itself on data, and for a time, that data was scarce. This is no longer a problem. By 2025, the world will create 180 zettabytes of data per year (up from 4.4 zettabytes in 2013), according to IDC.


Big data and machine-learning may seem to be a perfect match, coming together at just the right time. Their relationship is one that’s understood in terms of a simple equation: large amounts of data mined = actionable insights that were previously unknown or invisible.

But it’s not that simple. Without a thorough understanding of both the strengths and limitations of the data at hand, having more of it can actually increase the likelihood of making spurious connections.
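To make the point about spurious connections concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not anything taken from the authors' work: the function names, the 30-row sample size, and the 0.4 correlation threshold are all hypothetical choices. Every column is pure random noise, so any "strong" correlation with the target is spurious by construction, and the number of such hits can only grow as more columns are added.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_spurious(n_features, n_rows=30, threshold=0.4, seed=0):
    """Count random features whose |correlation| with a random target
    reaches the threshold. All values are independent noise, so every
    hit is a spurious connection (hypothetical setup for illustration)."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        if abs(pearson(feature, target)) >= threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    # More columns of noise means more chances for a coincidental match.
    for n in (10, 100, 1000):
        print(n, "random features ->", count_spurious(n), "spurious hits")
```

Because the seed is fixed, the larger runs contain the smaller runs' features as a prefix, so the hit count never decreases as features are added. Nothing about the data improved; there were simply more opportunities for coincidence.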

Historically, most of the data that businesses analyzed for the purpose of decision-making has been of the structured variety: easily entered, stored and queried. In the digital age, however, the connected world enables the capture and storage of more, and more diverse, data sets than ever before. Nearly 5,000 devices are being connected to the internet every minute today; within ten years, there will be 80 billion devices collecting and transmitting data around the world. As a recent McKinsey Global Institute report noted: “Much of this newly available data is in the form of clicks, images, text, or signals of various sorts, which is very different than the structured data that can be cleanly placed in rows and columns.”


This creates a data-management challenge that recalls the age-old computing axiom: garbage in, garbage out. To quote UC Berkeley professor and machine-learning expert Michael I. Jordan, data variety leads to a decline in data quality: “It’s like having billions of monkeys typing.”

So how do we move big data and machine-learning out of the “it’s complicated” zone and into the honeymoon phase? For machine-learning tools to work, they need to be fed high-quality data, and they must also be guided by highly skilled humans. Preparing data can be heavy lifting, but it can also be the most important part of a data scientist’s job, one that accounts for as much as 50 percent of his or her time, according to some estimates. In fact, it took one bank 150 people and two years to achieve the data quality necessary to build an enterprise-wide data lake from which advanced analytics tools might drink.
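As a rough idea of what the first step of that preparation work can look like, the following Python sketch profiles a batch of records for missing values and exact duplicates. It is a simplified, hypothetical example: the function name `profile_quality`, the record layout, and the field names are assumptions made for illustration, and real data preparation involves far more than these two checks.

```python
def profile_quality(records, required_fields):
    """Summarize basic data-quality problems in a list of dict records.

    Counts, per required field, how many records have a missing or blank
    value, and how many records exactly repeat an earlier one on the
    required fields. Assumes field values are hashable (hypothetical
    helper for illustration only).
    """
    missing = {field: 0 for field in required_fields}
    seen = set()
    duplicates = 0
    for record in records:
        for field in required_fields:
            value = record.get(field)
            # Treat absent keys, None, and whitespace-only strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1
        key = tuple(record.get(field) for field in required_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"rows": len(records), "missing": missing, "duplicates": duplicates}


if __name__ == "__main__":
    # Made-up sample records, not real data.
    sample = [
        {"id": 1, "name": "Ada"},
        {"id": 1, "name": "Ada"},   # exact duplicate
        {"id": 2, "name": "  "},    # blank name
        {"id": 3},                  # name missing entirely
    ]
    print(profile_quality(sample, ["id", "name"]))
```

A report like this tells a team how much "garbage" is in the input before any model sees it; deciding what to do about each problem still takes human judgment.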

These challenges are not insurmountable, but they reinforce that big data and machine-learning will be a perfect match only when paired with the necessary human intelligence. Demand for data scientists has reached a critical level and is predicted to grow at double-digit rates for the foreseeable future. When done properly with the right human workforce, the benefits of pairing big data and machine-learning will almost certainly be huge for enterprises. Over time, companies must work through complications and drawbacks to reap the long-term benefits of this oft-hyped couple.

This blog is based on a piece which was published in The Digitalist Magazine, online edition. Dr. Markus Noga and Dan Wellers