In this post, guest blogger Leo Zancani, CTO of Ontology Systems, looks at the challenge of data variety. He argues that to benefit from Big Data opportunities, a new generation of data variety tools is required to handle the data integration and data alignment challenges that CSPs are facing.
Image: A Variety of California Grapes in a Vase, William J. McCloskey, 1921
Back in 2001, Gartner (then META Group) analyst Doug Laney famously defined three 'dimensions' for thinking about data management challenges: volume, velocity and variety (See: Laney, 3D Data Management: Controlling Data Volume, Velocity and Variety, 2001). In 2012, Laney went on to use these as the basis for Gartner’s definition of big data (See: Laney, The Importance of 'Big Data': A Definition, 2012).
An interesting thing about the '3 Vs' is that only two of them are easily quantified: volume in bytes and velocity in seconds. Variety? Well, that’s a much more slippery notion – is it the number of different formats? Different subject matter domains? Different originating systems?
Human beings and the technology market being what they are, once the demand for tools to handle big data was established, technical endeavour focussed eagerly on the easily measurable. The argument for investing in and buying a new data store that can store more terabytes or handle more updates per millisecond is pretty straightforward: competitive positioning is based on a single, well-defined and easily understood metric. Variety? Once again, not so much.
All of which perhaps begins to explain why it is that – in spite of the fact that analysts consider it to be by far the biggest and baddest V (Leopold, 2014) – variety has received so little attention in terms of tools and technology. Volume? Hadoop, MongoDB, Cassandra to name just a few. Velocity? Storm, Hibari and any number of commercial products. Variety? Um…
As well as the basic difficulty of dealing with a wide variety of data (whatever that means), a less obvious issue is starting to come to light, caused by the growing availability of cheap, high-volume storage: the 'data attic' effect.
Before the advent of big data, when retaining data was an expensive thing to do, the default disposition of organisations was to discard data; retaining it was very much an opt-in choice. With the notable exception of data covered by retention regulation, the existence of big data stores encourages people to hoard data – 'just in case'. Retention has become an opt-out.
This is apt to create data attics – big, dusty, rarely visited rooms of data, isolated from the rest of the enterprise data estate, just waiting around for a garage sale data scientist to magic value out of them. Why? Because the technology to easily join them up, between themselves and to the rest of an organisation’s systems – that is to say, to handle the variety problem – just isn’t available.
For communications service providers, a combination of the sudden technology diversification brought about by the boom times, and under-investment from more recent lean times, means that the issue of siloed systems and data is already very acute. Data attics certainly don’t help improve that.
So we find ourselves in a situation where the expectation is being set that it's possible to get value from any amount of any data at any time. In fact, we are in very real danger of creating more – and even less penetrable – data stores, divorced from the day-to-day operations and concerns of the business, too hard to see as a unified whole, and therefore of little strategic value either (Kelly, 2013).
There are two sides to this problem: one is a technology challenge, the other a business challenge.
Big data, in spite of its bigness, is still data, and expenditure made to record, retain and analyse it needs to have a coherent and meaningful purpose and justification, just like any other technology initiative in a business. The advice from Deloitte (amongst other commentators) is that simply collecting a lot of data and expecting 'insights' to materialise isn’t going to work, industry hype notwithstanding (Sharwood, 2014). The business challenge is clear and should be easy to resolve: big data projects need a use-case and a business-case in order to be successful.
The technology challenge though is more subtle.
The Register’s Paul Kunert aptly sums this up by saying that “big data is like teenage sex: everyone is talking about it and nobody is doing it correctly” (Kunert, 2014).
As with any industry trend that emerges very quickly, the pressure on organisations to communicate what they are 'doing about it' is immense – and has become yet greater as the communications cycle has accelerated in recent years. This type of undirected pressure has led technical organisations to take the path of most immediately executable action: the tools available focus on volume and velocity, so they have stored data quickly, on the – apparently reasonable – assumption that since the data is being retained, once the business's requirements are articulated, they can be implemented retrospectively on the retained corpus.
What this approach didn’t consider was the lurking demon of data variety.
A curious aspect of this oversight is that while the data resided in its originating systems, the obstacle was very visible: projects to use data from multiple systems anticipated (although usually vastly underestimated) the substantial effort required to link that data together in order to make use of it as a single body.
In combination with the lack of a clear motivation from the business, these effects will lead to a colossal data integration deficit.
Previously, the business wanted to make a specific use of some data – so it figured out where that data would come from and then went about costing the activity of joining it up to make it usable for that specific use. It often found this cost to be much higher than expected, as the sorry state of data integration projects shows.
Now, the (often implicit) requirement is to be able to ask any question of any data, and so that data must all be joined up in a way that enables it to be used in any way. The cost of this is likely to be tremendous – it grows with the square of the number of sources, since every pair of sources may need joining up – but it hasn't been accounted for at all!
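A back-of-the-envelope sketch makes the combinatorial effect concrete. The snippet below (an illustration, not anything from a real integration toolchain) counts the distinct source pairs that would need joining as the number of siloed sources grows – the count rises roughly with the square of the source count:

```python
from itertools import combinations


def pairwise_links(n_sources):
    """Number of distinct source pairs that may each need a join."""
    return len(list(combinations(range(n_sources), 2)))


# Joining every pair of silos grows quadratically with the source count.
for n in (5, 10, 20, 40):
    print(f"{n} sources -> {pairwise_links(n)} pairwise joins")
```

Doubling the number of sources roughly quadruples the number of pairwise joins – which is why integration costs that looked manageable per project balloon when the implicit requirement becomes 'join everything to everything'.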
In order to move forward, the answer is clearly to move the focus of big data tooling onto the variety dimension.
Just as the recent frenzy of innovation in high-volume and high-velocity data tools brought down the cost of acquisition, storage and query by orders of magnitude, so a new generation of data variety tools is now urgently required before the hidden treasure in the data attics can be rescued from an impenetrable layer of data integration dust and data alignment cobwebs.
About the author
Leo Zancani co-founded Ontology Systems in 2005 and is now leading the application of semantic technologies to management systems for IT, Data Center, and Network environments.