Enterprises already face crippling data storage bills; the last thing they need is to waste money storing multiple copies of the same data

A ground-breaking study by Veritas Technologies, published last month, found that on average 41% of an enterprise’s data is “stale”, that is, it has not been modified in the past three years. The study estimates that this translates into as much as USD 20.5 million per enterprise in additional data management costs that could otherwise be avoided.

Orphans on the rise

For some industries, banking for example, a portion of this data is stored to meet regulatory requirements. But according to Veritas, the majority is simply the result of a passive approach to data storage. “Orphaned data”, i.e. data left without an owner after personnel changes, is a particular culprit. Not only is such data on the rise, but it also comprises presentations and images that take up a disproportionate share of disk space, and it goes unattended for even longer than other stale data.

Veritas’s report highlights that data growth, estimated at as much as 39% per annum in the US, is driven both by the growing number of files stored and by the doubling of average file sizes over the past decade. Veritas stresses the need for enterprises to prioritise how they manage their data, that is, to decide what to store, what to archive and what to delete.

Needless duplication

What the report fails to mention, however, is how much of this stale data is actually made up of duplicate copies. In large organisations with several functional departments, analytics teams and data warehouses, it is common for one team to copy already-copied data, which is then copied again for yet another purpose. Further copies are made wherever back-ups are required, exacerbating storage costs, the risk of errors and the risk of contravening data security laws.

Ironically, many big data management solutions are actually contributing to the problem. For example, in order to combine data stored in a traditional enterprise data warehouse (EDW) with data moved to a Hadoop data lake, the data is often re-copied and re-stored on expensive, solution-specific servers. It is also not unusual for whole databases to be copied in order to query only a handful of datasets.
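
To make the alternative concrete, here is a minimal sketch, assuming a hypothetical PostgreSQL sales database and using Apache Spark’s standard JDBC reader, of how only the rows that are actually needed can be queried in place rather than copying the entire database into a separate analytics store. The connection details, table and column names are illustrative placeholders, not taken from the report.

```python
# Sketch: query only the data you need, in place, instead of copying the
# whole source database into a separate analytics store.
# Assumes a PostgreSQL JDBC driver is on the Spark classpath; the URL,
# credentials, table and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-in-place").getOrCreate()

# The filter sits inside the JDBC query, so only the matching rows ever
# leave the source system -- no wholesale copy of the database is made.
recent_orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://warehouse.example.com:5432/sales")
    .option("dbtable", "(SELECT * FROM orders "
                       " WHERE order_date >= '2015-01-01') AS recent")
    .option("user", "analyst")
    .option("password", "secret")
    .load()
)

recent_orders.groupBy("region").count().show()
```

Because the selection happens at the source, the analytics cluster holds only the working subset, not yet another full copy of the database.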

Painful lessons

In fact, there could be as many as 30 copies of the same data within a single organisation, according to Iain Chidley, GM at software company Delphix. He warns that enterprises can end up storing ten times more data than they had originally anticipated. And while cloud storage is now a cheaper alternative to traditional data warehouses, substantial amounts of money are still wasted if all this duplicated data is moved wholesale into the cloud.

Clearly, major clean-ups are called for in many large companies. But avoiding the copying process wherever possible can help drastically limit further storage challenges. And for those not yet facing these challenges, prevention is of course better than cure.

RAMp it up

The solution lies in in-memory processing, which eliminates the need to write data to physical disks, track the resulting copies and delete them when they are no longer required. Instead, data is processed in a computer’s memory, or RAM, whose price-performance has improved by over 200% in the past three years. Meanwhile, newer 64-bit operating systems offer one terabyte or more of addressable memory, potentially allowing an entire data warehouse to be cached in memory.

Conducting analytics in-memory also reduces the need for certain ETL processes, such as indexing data and storing pre-aggregated results in aggregate tables, thereby further reducing IT costs. Apache Spark, the revolutionary open-source framework that supports in-memory computing, is yet another step towards low-cost big data processing and analytics.
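
As a rough illustration of the idea rather than a prescribed implementation, the PySpark sketch below caches a dataset in memory once and then answers several aggregate queries directly from RAM, so no separate aggregate tables need to be stored, indexed or kept in sync. The file path and column names are hypothetical.

```python
# Minimal in-memory analytics sketch with Apache Spark (PySpark).
# The Parquet path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("in-memory-analytics").getOrCreate()

transactions = spark.read.parquet("hdfs:///data/lake/transactions")
transactions.cache()  # keep the working set in RAM across queries

# Aggregates are computed on demand, in memory -- there are no
# pre-aggregated tables to store on disk or keep in sync.
transactions.groupBy("department") \
    .agg(F.sum("amount").alias("total_spend")).show()

transactions.groupBy("year", "month") \
    .agg(F.count("*").alias("txn_count")).show()
```

The first query pulls the data into memory; subsequent queries are served from the cache, which is the cost saving the article describes.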

These technological advances make it possible for enterprises to handle a vast increase in the amount of data that can be accessed and analysed, while at the same time making substantial savings on storage costs.

Who says you can’t have your cake and eat it too?