In the past few years the term dark data has been used more and more frequently, but what does it mean? The consulting and market research company Gartner Inc. describes dark data as "information assets that organizations collect, process and store in the course of their regular business activity, but generally fail to use for other purposes." Put simply, this is data that no one is accessing, or that no one has accessed for a number of years. Some vendors go further and state that dark data is files that have no owner or have been dormant for a number of years. The consensus, though, is that dark data consists of old, potentially unused files.
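To make the "dormant for a number of years" idea concrete, here is a minimal Python sketch that walks a directory tree and flags files that have been neither accessed nor modified within a chosen period. The function name find_dormant_files and the three-year default are assumptions made for illustration, not a vendor's definition. Note also that last-access times are not always reliable: many systems are configured to update them lazily or not at all, so treat the output as a rough indicator only.

```python
import os
import time
from pathlib import Path

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def find_dormant_files(root, years=3, now=None):
    """Yield (path, idle_years) for files under `root` that have been
    neither accessed nor modified for at least `years` years.

    The threshold is an illustrative assumption; pick one that matches
    your own retention policy.
    """
    now = time.time() if now is None else now
    cutoff = now - years * SECONDS_PER_YEAR
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.stat()
            except OSError:
                continue  # unreadable or vanished file: skip it
            # Use the most recent of "last accessed" and "last modified".
            last_used = max(st.st_atime, st.st_mtime)
            if last_used < cutoff:
                yield path, (now - last_used) / SECONDS_PER_YEAR
```

Calling list(find_dormant_files("/some/share", years=3)) on a file share (the path here is a placeholder) would give a first candidate list for review. It says nothing about ownership or content, which a real audit would also need to consider.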
A number of organisations have analysed customer data, and their results agree to within a percent or two. Veritas performed one of the largest reviews in 2016, analysing 9.5 billion unstructured files. Unstructured files were the target because these are the files stored on file servers and user workstations, where multiple versions and copies are made and distributed in an uncontrolled manner. Structured data such as databases is generally better managed, although database dump files can still be a problem: they are often never deleted, and because they sit on a file server they would form part of an unstructured file audit.
The Veritas research showed that 52% of all these unstructured files were dark data. A further 33% was classed as ROT (Redundant, Obsolete or Trivial) data, leaving only 15% of an organisation's data that is actually of any use! As mentioned earlier, this is just one study, but the other organisations running their own research have reported very similar results.
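To put those percentages into absolute terms, the short Python sketch below applies them to the 9.5 billion files in the Veritas sample. It assumes the percentages apply to file counts, as described above; the function name breakdown is purely illustrative.

```python
TOTAL_FILES = 9_500_000_000          # files analysed in the Veritas review
SHARES = {"dark": 52, "rot": 33}     # percentages quoted in the research


def breakdown(total, shares_percent):
    """Split `total` into the given percentage shares plus a 'useful' remainder."""
    counts = {label: total * pct // 100 for label, pct in shares_percent.items()}
    counts["useful"] = total - sum(counts.values())
    return counts


counts = breakdown(TOTAL_FILES, SHARES)
for label, count in counts.items():
    print(f"{label:>6}: {count:,}")
# prints:
#   dark: 4,940,000,000
#    rot: 3,135,000,000
# useful: 1,425,000,000
```

In other words, on these figures fewer than 1.5 billion of the 9.5 billion files would be of any real use.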
Why are dark data and ROT an issue, then? Storing this 85% of data that is potentially not required has physical costs, such as the disks and servers it is stored on, but it also has a lot of hidden costs. To store the additional data, more disk arrays or servers are required. These in turn need more rack space, and potentially more racks. Every rack, and every device in it, consumes power, and the rental of rack space includes the cost of that power consumption.
Hardware has a habit of failing. With additional, unnecessary hardware, the number of hardware failures and replacements increases. To protect against these failures, systems are configured to be highly available (HA) and a disaster recovery (DR) plan is usually deployed. The more data you have, the larger the HA and DR environments need to be, further adding to the cost. Just think of the recent British Airways (BA) IT failure and the cost of not having a suitable HA or DR system in place. With all this extra hardware comes the responsibility of an individual or team to maintain, monitor, patch and upgrade the systems. This might require out-of-hours work, with employee overtime costs. The more hardware you have, the larger the team needed to maintain it.
There are also the legal implications of holding all this dark data. If you are being prosecuted, or you are taking someone else to court, you will want to find that key piece of evidence to help your defence or prosecution. If 85% of your data is potentially unnecessary, and you do not know what it contains, a search for that key piece of information becomes much harder, slower and therefore more costly. The General Data Protection Regulation (GDPR) was adopted in April 2016, although it does not apply, and penalties cannot be imposed, until 25th May 2018. Part of this regulation requires an organisation to be able to find all the personal data it holds on an individual and to amend or delete it on request, generally within one month. Failure to comply with this law can result in very serious financial penalties.