Tuesday 7 August 2012

Big…Bigger…Biggest! - making sense of world data.

There have been numerous articles and publications on the volume of data, across analog and digital forms, that exists in the world and how it could be put to constructive use. The estimates are that the total volume of data stored by the end of this year will be in the 2.7 zettabyte range, growing at a rate of just under 50% year-on-year. For the uninitiated, a zettabyte is a trillion gigabytes (GB), or a billion terabytes (TB). Now that sure sounds like the next biggest business opportunity that everyone should latch on to, right?
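
As a quick sanity check on those units and that growth rate, here is a back-of-envelope sketch in Python; it simply uses the figures quoted above, with the growth rounded to an even 50%:

```python
# Back-of-envelope check of the units and growth quoted above (decimal units).
GB = 10**9    # bytes in a gigabyte
TB = 10**12   # bytes in a terabyte
ZB = 10**21   # bytes in a zettabyte

print(ZB // GB)   # 1,000,000,000,000 -> a zettabyte is a trillion gigabytes
print(ZB // TB)   # 1,000,000,000     -> ... and a billion terabytes

stored_zb = 2.7   # estimated total stored data by end of 2012, in zettabytes
growth = 0.50     # "just under 50%" year-on-year, rounded to 50% for simplicity
for year in range(2013, 2016):
    stored_zb *= 1 + growth
    print(year, round(stored_zb, 1), "ZB")
```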

The answer is yes, but with a note of caution. Here is why.

Over 90% of all the data that exists (and over 99% of the new data that is being created or will be created in the future) is unstructured media data, including video, audio and images. Take this out of the equation and we are still talking about around 220-230 exabytes of existing data, with another 170 exabytes being created through to 2017. Analog data, in one form or another, made up around 6% of the 220-230 exabytes available in 2007. However, it is growing only infinitesimally, so we will take analog data as around 3.5% of all data, on average, and take that piece out of the calculation too!

The new denominator is therefore in the 375 exabyte ballpark!
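
Here is the rough arithmetic behind that ballpark, as a small Python sketch; the 225-exabyte midpoint and the 3.5% analog share are the assumptions stated above:

```python
# Rough arithmetic behind the ~375 exabyte denominator (figures in exabytes).
existing_non_media = 225        # midpoint of the 220-230 EB of existing data
additional_to_2017 = 170        # further non-media data expected through 2017
analog_share = 0.035            # analog assumed at ~3.5% of all data, on average

total = existing_non_media + additional_to_2017       # ~395 EB
digital_only = total * (1 - analog_share)
print(round(digital_only))      # ~381 EB, i.e. in the 375 EB ballpark
```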

A closer look at this data reveals a whole bunch of mind-boggling realities and helps put the numbers in the right perspective. We will, over the course of this discussion, make some assumptions to help the math, some of which may be incorrect (though no one would be able to prove it one way or the other, despite big data!), but which do not thematically challenge the hypothesis. Here’s one: the total amount of storage for non-user data (system and application software, for example) is assumed to be around a third of the total data. That seems a fair assumption when you take the Forrester view that we will have around 2 billion computers in the world by 2015 and that each such device would need a minimum of around 50 GB just for OS, office, security and networking tools. That throws a further 125 exabytes out of the window (no pun intended!), leaving us with around 250 exabytes of user data!
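
In Python, the same step looks roughly like this; the one-third share and the Forrester figures are the assumptions quoted above, not independent measurements:

```python
# Taking non-user data (assumed at a third of the total) out of the equation.
denominator_eb = 375
non_user_eb = denominator_eb / 3          # ~125 EB of system/application software

# Sanity check against the Forrester-style numbers quoted above:
computers = 2_000_000_000                 # ~2 billion computers by 2015
per_device_gb = 50                        # minimum per device for OS, office, tools
print(computers * per_device_gb / 10**9)  # 100 EB -> same order of magnitude

user_data_eb = denominator_eb - non_user_eb
print(round(user_data_eb))                # ~250 EB of user data left
```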

Now that media files and system/application software are out of the equation, we will assume that 90% of the total remaining data is corporate in nature (the other 10%, or 25 exabytes, is not necessarily personal data; it could even be, for instance, data generated in Office tools, mail data etc.). And I cannot think of a corporation that does not back up its data, so the minimum redundancy at the transaction data level itself would be 50%. Take that out of the equation, because the same data analyzed twice over would not yield any greater intelligence than analyzing it once! That takes out 45% of the total data we started with at the beginning of this paragraph. Also, most organizations of a size where the data volume matters for this calculation would have a data warehouse, an operational data store or a bunch of denormalized data stores at the least, introducing a further 50-67% redundancy. That translates to another ~30% of the data not to be considered in the equation. This leaves us with 25% of the 250 exabytes of data, or ~60 exabytes.
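
The redundancy arithmetic for this step, again as a rough sketch using the stated assumptions:

```python
# Stripping out corporate redundancy (all figures in exabytes).
user_data = 250
corporate = 0.9 * user_data              # 225 EB assumed corporate
non_corporate = 0.1 * user_data          # 25 EB of everything else

backup_copies = 0.5 * corporate          # 50% backup redundancy -> 112.5 EB out
print(backup_copies / user_data)         # 0.45 -> the 45% mentioned above

warehouse_copies = 0.30 * user_data      # warehouses/ODS: another ~30% of the total
corporate_unique = corporate - backup_copies - warehouse_copies   # ~37.5 EB

remaining = corporate_unique + non_corporate
print(remaining, remaining / user_data)  # ~62.5 EB, i.e. the ~25% / ~60 EB above
```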

A significant part of the non-corporate data (the 25 exabytes) is essentially generated and consumed with a view to sharing and communicating with other data/information stakeholders. I cannot think of someone creating a file locally, sending it to someone else for consumption, and then diligently removing the redundancy of storing it both in their file system and again in their mail folder. The same happens at the receiver’s end. And not all communication is one-on-one; I cannot think of communication that ends with the first receiver either! Assuming a two-step average communication chain and a storage redundancy of 50% at each of the 3 players, we are talking of a mere 16% of the data that is unique. It would be significantly less for personal data, given how little of what you send in your mails is stuff you actually generated – think about the number of forwarded mails and messages in all your personal mailboxes, accounts etc.! Giving the benefit of the doubt and taking a conservative 20% of the non-corporate data as being unique, this part contributes only 5 exabytes, taking the total data that we can derive intelligence from down to 40 exabytes.
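
One way to read that 16% is that three players each keep two copies (file system plus mail folder) of the same content; that reading of the 50% redundancy assumption is my own interpretation, and the sketch below follows it:

```python
# Redundancy from sharing: sender, receiver and one further hop, each assumed to
# keep two copies (file system + mail folder) of the same content.
players = 3
copies_per_player = 2
unique_share = 1 / (players * copies_per_player)
print(round(unique_share, 2))             # ~0.17 -> the "mere 16%" above

non_corporate_eb = 25
unique_non_corporate = 0.20 * non_corporate_eb    # conservative 20% -> 5 EB
corporate_unique_eb = 37.5                        # carried over from the last step
print(unique_non_corporate + corporate_unique_eb) # ~42.5 EB, rounded down to ~40
```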

Assuming there is not much information worthy of analysis in data that is over 5 years old, the amount of data we are likely to deal with over the next few years, with a view to extracting intelligence and driving decision making, is effectively closer to 30 exabytes in magnitude. Now, to put this into perspective, half of this data would be residing in transaction or operational systems and the rest would already be in one form of data warehouse or another. That would mean around 15 exabytes, or around 15 million terabytes, of data. Imagine that in the context of your take on the number of existing relational and dimensional data stores, and on the average size of an installation.
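
Putting the last steps into numbers (the 75% "recent" share below is simply the ratio implied by the 40 and 30 exabyte figures above):

```python
# Narrowing down to the data likely to be analysed over the next few years.
analysable_eb = 40
recent_share = 0.75                   # share under ~5 years old, implied by 30/40
recent_eb = analysable_eb * recent_share
print(recent_eb)                      # ~30 EB

warehouse_half_eb = recent_eb / 2     # half already sits in warehouses/dimensional stores
print(warehouse_half_eb)              # ~15 EB
print(warehouse_half_eb * 1_000_000)  # ...or ~15 million terabytes (1 EB = 1,000,000 TB)
```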

That surely should help put Big, Bigger and Biggest data into perspective. The need is to be pragmatic about the real volume of data we are dealing with, about how effectively we can derive intelligence from it and put it to profitable use, and, therefore, about how far it really pushes the current data paradigm!


Note: The views expressed here and in any of my posts are my personal views and not to be construed as being shared by any organization or group that I am or have been associated with presently or in the past.
