Data is growing. Exponentially – faster than we can grasp, faster than we can imagine. Many have tried to estimate the pace: Forbes summarizes IDC’s forecast that 180 zettabytes of data (180 trillion gigabytes) will be produced in 2025, up from less than 10 zettabytes in 2015. But the world is changing so fast that within the next five years those estimates may be revised to foresee an even more dramatic increase.
That is why we keep looking for new ways of storing, analyzing and using data, trying to find the right approach to make the most of it. There is no arguing that data is gold; we just have to find the right tools to extract it.
LAKES OR HOUSES?
There are two quite different approaches to taking control of data in an organization. One is to build a Data Warehouse – a system for reporting and analysis, created in a systematic way, with rules defined upfront. The other is to maintain a Data Lake – a data repository with almost no rules defined upfront.
As you can imagine, both have their pros and cons. A Data Warehouse is easier to manage and easier to use. But it takes a lot of time to build, and the requirements change along the way. It may happen that by the time the Data Warehouse is set up, it has already become obsolete. It is also very expensive: you have to invest in advanced storage technology, which becomes even more costly when it needs an upgrade.
A Data Lake is a repository of all kinds of data – structured, semi-structured and unstructured – kept ‘as is’, together with metadata, in raw, native form, and stored using distributed technology (Hadoop). It offers far richer analytical possibilities, is quick to set up, less expensive and highly agile in use. But if it is not managed properly, you may end up maintaining a Data Swamp – a repository full of garbage with long response times.
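To make the idea of data kept as it is, together with metadata, more concrete, here is a minimal Python sketch. It is an illustration only and says nothing about how Hadoop works internally: the `RawLake` class, its `ingest` and `read` methods, the metadata fields and the sample payloads are all hypothetical names invented for this example. Raw payloads are stored untouched next to a small metadata record, and structure is applied only when the data is read.

```python
import json
from datetime import datetime, timezone


class RawLake:
    """Toy in-memory stand-in for a data lake (hypothetical, for illustration)."""

    def __init__(self):
        self._objects = []  # list of (metadata, raw_bytes) pairs

    def ingest(self, raw: bytes, source: str, fmt: str) -> dict:
        """Store the payload exactly as received, plus descriptive metadata."""
        metadata = {
            "id": len(self._objects),
            "source": source,
            "format": fmt,
            "size_bytes": len(raw),
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        }
        self._objects.append((metadata, raw))
        return metadata

    def read(self, fmt: str, parser):
        """Apply structure at read time: parse only objects of the given format."""
        for metadata, raw in self._objects:
            if metadata["format"] == fmt:
                yield metadata, parser(raw)


lake = RawLake()
lake.ingest(b'{"customer": 1, "amount": 25.5}', source="crm", fmt="json")
lake.ingest(b"free-text complaint from a call center", source="helpdesk", fmt="text")
lake.ingest(b'{"customer": 2, "amount": 10.0}', source="webshop", fmt="json")

# The consumer decides which structure it needs only when reading.
total = sum(record["amount"] for _, record in lake.read("json", json.loads))
print(total)
```

Nothing is rejected or reshaped at ingestion, which is what makes such a repository quick to set up. The same property is what lets it drift into a Data Swamp when the metadata is missing or unreliable.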
DATA LAKE – CLEAR WATERS WITH NO RESIDUE
A Data Lake can be a highly productive way of handling Big Data, but only if it is done right. Big Data Governance is the starting point. You have to formulate policies on the optimization, privacy and monetization of the data you keep. These policies have to be aligned with the objectives of the multiple functions in your organization that will use the data. One very important part of Data Governance is implementing Data Lineage: keeping track of your data's lifecycle, its origins and its processing, all based on metadata.
Some organizations, TUATARA included, see great value in setting up a Big Data Competence Center – specialized employees who act as advocates, help data users and, above all, make sure the Data Governance policies are followed.
These two – Big Data Governance with Data Lineage and a Big Data Competence Center – will make your Data Lake waters clear and pleasant to use.
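The Data Lineage idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of any particular governance tool: the `LineageLog` class, its methods and the dataset names are all hypothetical. Each processing step records which datasets it read and which one it produced, so the origins of any dataset can be traced back from metadata alone.

```python
from collections import defaultdict


class LineageLog:
    """Hypothetical metadata store recording how each dataset was derived."""

    def __init__(self):
        self._parents = defaultdict(set)  # dataset -> datasets it was derived from
        self._steps = {}                  # dataset -> step that produced it

    def record(self, output: str, inputs: list, step: str) -> None:
        """Register that `step` produced `output` from `inputs`."""
        self._parents[output].update(inputs)
        self._steps[output] = step

    def produced_by(self, dataset: str):
        """Return the step that produced `dataset`, or None for a raw source."""
        return self._steps.get(dataset)

    def origins(self, dataset: str) -> set:
        """Return the raw sources behind `dataset`.

        A raw source is a dataset with no recorded parents.
        Assumes the recorded lineage contains no cycles.
        """
        parents = self._parents.get(dataset)
        if not parents:
            return {dataset}
        found = set()
        for parent in parents:
            found |= self.origins(parent)
        return found


log = LineageLog()
log.record("clean_orders", ["raw_orders"], step="deduplicate")
log.record("customer_360", ["clean_orders", "raw_crm_export"], step="join")

print(sorted(log.origins("customer_360")))  # ['raw_crm_export', 'raw_orders']
print(log.produced_by("customer_360"))      # join
```

With records like these, a question such as "where did this report's numbers come from?" can be answered from the metadata, without reading the data itself.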
MAKE IT FIT YOUR NEEDS
So which is better, Data Lakes or Data Warehouses? There is no simple answer to this question. In modern organizations there is room for both. The most important factor in determining your approach to Big Data should be the business objective of the data processing. If you need reporting to the stock exchange, you should probably choose a Data Warehouse for that part of your operations. If you are looking for added value from understanding your customers better, by adding insights from various sources or discovering relations between your customers, you should probably start building a Data Lake of your own to dive into.
And we will be there to help you determine your Big Data needs and propose an approach that fits them and optimizes the end result.