We live in a time of abundant data. Every second, sensors capture information, surveillance cameras record footage, GPS trackers report coordinates, social media platforms burst with user-generated content, and cookies track internet surfers’ every movement. Such data is too diverse to be stored in a traditional tabular, indexed form. How can companies make sense of all this information and turn it into actionable insights? Welcome to the data lake era.
Data lake vs data warehouse
The best analogy for data lakes and data warehouses comes from James Dixon, who introduced the former term back in 2010. He compares a data mart to bottled water, which is clean, packaged, and ready to consume, while a data lake, like its natural counterpart, can be used in various ways: to drink from, to dive into, or to take samples from.
This difference comes from the fact that a data lake records data as it is generated, and that data is cleaned, filtered, and assessed only as needed. The advantage of this approach is that the same raw data can be used for multiple purposes.
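To make the idea of cleaning data only as needed more concrete, here is a minimal sketch in Python. The event format, the field names, and the functions read_events, revenue_per_user, and page_views are all hypothetical and chosen for illustration; a real data lake would read from distributed storage rather than from a string.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (one JSON object per line).
# The last line is deliberately malformed to show that nothing is cleaned on write.
RAW_EVENTS = """\
{"user": "u1", "type": "click", "page": "/home", "ts": 1}
{"user": "u2", "type": "purchase", "amount": 30.0, "ts": 2}
{"user": "u1", "type": "purchase", "amount": 12.5, "ts": 3}
not-json debris kept as-is
"""


def read_events(raw):
    """Parse the raw lines on read, skipping any line that is not valid JSON."""
    for line in raw.splitlines():
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def revenue_per_user(raw):
    """One use of the raw data: total purchase amount per user."""
    totals = {}
    for event in read_events(raw):
        if event.get("type") == "purchase":
            totals[event["user"]] = totals.get(event["user"], 0.0) + event["amount"]
    return totals


def page_views(raw):
    """A second, unrelated use of the same raw data: clicks per page."""
    counts = {}
    for event in read_events(raw):
        if event.get("type") == "click":
            counts[event["page"]] = counts.get(event["page"], 0) + 1
    return counts
```

Both functions read the same untouched records and apply their own filtering, which is the flexibility the bottled-water analogy points to.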
Although this seems like a more flexible and useful tool for the future, there is no one-size-fits-all solution in data management. Here are a few questions to ask before deciding upon data lake implementation.
1. What kind of data do you have, and where does it come from?
If you are working with traditional, table-structured data that comes from surveys, reports, or sales and is included in a CRM system, you don’t need a data lake. A data warehouse is a more appropriate and cost-effective tool in this case.
However, if you use unstructured or semi-structured information such as web-scraped comments or image analysis, you could benefit more from a data lake. A data lake is also useful for data that is generated continuously, such as stock market data or the data produced by users constantly interacting with content, for example on social media.
Keep in mind that data lakes and data warehouses are not mutually exclusive. Most of the time, an organization benefits more from a combination of these technologies, as they are complementary and serve very different purposes.
2. How do you plan to use the data?
Think about the journey of the data in your organization before deciding on a solution. If data comes in tables at regulated intervals and is only used for a single purpose, you are better off with a traditional approach.
However, if you gather data of varying complexity from various sources and for different purposes, you might consider a data lake. Remember the 3 V’s of big data (Variety, Velocity, and Volume) to assess whether this is the right solution for you.
In a data lake, you first store data as it is recorded, and you can clean, organize and analyze it later as you need it. Be careful not to turn this into a data swamp, where you know you have the needed information somewhere but you can’t find it.
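One common way to keep a lake from becoming a swamp is to record some metadata about every dataset at the moment it is stored. The sketch below is only an illustration of that idea: the Catalog and CatalogEntry classes, their fields, and the example paths are hypothetical, not part of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    path: str          # where the raw data lives in the lake
    source: str        # which system produced it
    description: str   # a human-readable note for future users
    tags: set = field(default_factory=set)


class Catalog:
    """A tiny in-memory catalog: register datasets, then find them by tag."""

    def __init__(self):
        self._entries = []

    def register(self, path, source, description, tags):
        entry = CatalogEntry(path, source, description, set(tags))
        self._entries.append(entry)
        return entry

    def find(self, tag):
        """Return the paths of all datasets labeled with the given tag."""
        return [entry.path for entry in self._entries if tag in entry.tags]


catalog = Catalog()
catalog.register("raw/web/comments.json", "web scraper",
                 "Scraped product comments", {"text", "customers"})
catalog.register("raw/gps/tracks.csv", "fleet trackers",
                 "Vehicle coordinates", {"geo"})
print(catalog.find("customers"))
```

Even a lightweight record like this answers the question "where is the data I need?" without anyone having to open the files themselves.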
Keep in mind that while data scientists prefer raw data, which offers more intel about the structure of a particular phenomenon, business users are keener on cleaned data, which provides faster answers to their questions.
3. What technologies and skills do you currently have?
Working with a data lake requires some Hadoop know-how, which is not as common as Excel or SPSS skills. The typical developer of data lake technologies has a background in big data engineering, a rare and expensive skill set. If you don’t have such people in your organization, it is best to look for SaaS data lake solutions, which a growing number of companies are rolling out.
The best idea is to start small and implement a data lake as just a pilot project that is marginal to your organization, and use it as a sandbox for training.
When doing so, take your integration tools into consideration. These need to have ETL (extract, transform, load) capabilities, which means that data is transformed before it is loaded into its destination.
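The extract, transform, load sequence can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the CSV layout, the column names name and amount, and the list used as a destination are all hypothetical stand-ins for a real source system and storage layer.

```python
import csv
import io


def extract(csv_text):
    """Extract: read rows from a CSV export into dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: trim and normalize names, convert amounts, drop rows with no name."""
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        if not name:
            continue
        cleaned.append({"name": name.title(), "amount": float(row["amount"])})
    return cleaned


def load(rows, store):
    """Load: append the transformed rows to the destination (a list in this sketch)."""
    store.extend(rows)
    return len(rows)


# Hypothetical export with inconsistent formatting and one unusable row.
SAMPLE = "name,amount\n alice ,10\n,5\nBOB,2.5\n"
destination = []
loaded = load(transform(extract(SAMPLE)), destination)
```

The order of the three calls is the point: the data reaches its destination only after the transform step has run.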
4. How do you manage data?
Adding a data lake to your organization also comes with some governance problems. Since by its nature it is less structured, you need to change your policies regarding information management to avoid security challenges.
If you store sensitive data such as that generated by customers, even in the form of their browsing behavior, you need a way to retrieve it and quickly delete it at their request. Since a data lake lacks the logical structure of a data warehouse, this seemingly simple task needs to be addressed before implementation.
A sub-question here could be how you get your data. How often do you add new data sources to your repository? If this rarely happens, a data lake might not be the best option. However, it is a good fit if you expect to diversify your data sources in the coming years but don’t yet know the type of data you’ll need or its size.
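One possible way to prepare for such deletion requests is to maintain an index, built at ingestion time, of which files contain records for which customer. The sketch below assumes every sensitive record carries a customer_id field; the TinyLake class and its methods are hypothetical and keep everything in memory for the sake of brevity.

```python
from collections import defaultdict


class TinyLake:
    """An in-memory stand-in for a lake that tracks where each customer's data lives."""

    def __init__(self):
        self.files = {}                 # file name -> list of raw records
        self.index = defaultdict(set)   # customer id -> names of files holding their data

    def ingest(self, file_name, records):
        """Store records unchanged, and note which customers appear in the file."""
        stored = list(records)
        self.files[file_name] = stored
        for record in stored:
            customer_id = record.get("customer_id")
            if customer_id is not None:
                self.index[customer_id].add(file_name)

    def erase(self, customer_id):
        """Delete every record of one customer; return how many were removed."""
        removed = 0
        for file_name in self.index.pop(customer_id, set()):
            before = self.files[file_name]
            kept = [r for r in before if r.get("customer_id") != customer_id]
            removed += len(before) - len(kept)
            self.files[file_name] = kept
        return removed


lake = TinyLake()
lake.ingest("clicks.json", [{"customer_id": "c1", "page": "/home"},
                            {"customer_id": "c2", "page": "/cart"}])
lake.ingest("orders.json", [{"customer_id": "c1", "total": 20}])
removed = lake.erase("c1")
```

Without such an index, honoring a single request would mean scanning every file in the lake, which is why the question is worth settling before implementation.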
5. How does all this go with your company’s culture?
Most people are reluctant to change. That is a fact you need to take into consideration every time you think about bringing new technology into your office. It has to be embraced by people, so the learning curve should be smooth enough to encourage participation and collaboration.
The new technology should also improve the synergy between your departments, reduce the workload, and offer new insights. For a successful implementation, define the role of each department and the limits of data governance, as most IT departments tend to be overly protective of their access rights.
So, is your organization ready for a data lake?