Is My Data Lake Actually a Data Swamp?

Author:

Bryce Pilcher

Date Published:
March 28, 2023

Let’s talk about Data Lakes!  Maybe you are here because you want to know more about them, or maybe you suspect you have a Data Swamp and want to learn what to do about it.  Either way, we will cover some strategies and questions to think about when designing your Data Lake.

What is a Data Lake?

A Data Lake is a scalable, organized, blob-oriented data storage paradigm.  It is a place to store data of all sizes and kinds.  You can have structured data in Parquet files, say, alongside unstructured text in raw txt files.  It is a great place for storing raw data prior to processing, for example, or old reporting datasets that don’t need to be accessed on demand.

Those features sound nice and Data Lakes have been recommended as a great tool for organizations of all sizes, so what is the catch?  Well, a Data Lake without proper planning and maintenance can turn into a Data Swamp.

What is a Data Swamp?

A Data Swamp is a collection of data in which analysts struggle to find what they need, whether they are seeking the correct dataset or the most up-to-date data.  Often a group has versions of a dataset stored in multiple places with slightly different properties or dates.

 

If you find yourself there, have no fear: we can engineer a way out!

How to tell the difference

Organization is the key factor in whether a collection of data is a lake or a swamp.  I place the organization of a Data Lake on three pillars:

Structure

Structure provides a clear path to access the data you need.

Intuitiveness

Intuitiveness means it is easy to find the path to your data.

Consistency

Consistency means that you can expect the path to be the same wherever you look.

Obviously, the path cannot be literally identical for every dataset, so we’ll look at how to make paths follow a consistent pattern.

STRUCTURE

The high-level structure in a Data Lake typically follows a zone pattern where each zone serves a different purpose for your organization.  A common model is to have Bronze, Silver, and Gold zones that hold your data at different stages of refinement.

The Bronze Zone

The Bronze Zone is a copy of data from the source system.  It is intentionally redundant so that any failure in the source system will not affect your ability to rebuild your Data Lake.

The Silver Zone

The Silver Zone contains data that has been cleaned and standardized en route from the Bronze Zone.  Typically, your Silver zone will closely mirror your Bronze zone in low-level structure.

The Gold Zone

The Gold Zone is where data is curated into actionable datasets that can be used directly for analysis or reporting.

You can have other zones too; maybe one for Data Science experiments and/or one for employee training.

Underneath this high-level outline, each zone needs further structure and governance.  A tough question at this stage is:

“Do I organize by data function or by data source?”

The answer may look different for each zone in your Data Lake and can change from company to company.  The important part is to set the governance to make sense for your company and abide by it.
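
As a rough illustration, the governance decision for each zone can be written down as data rather than left as tribal knowledge.  The Python sketch below assumes a hypothetical layout in which the Bronze and Silver zones are organized by source and the Gold zone by business function; the names `ZONE_LAYOUTS` and `build_path`, and all example values, are invented for this illustration and do not come from any particular tool.

```python
from pathlib import PurePosixPath

# Hypothetical governance rules: which path segments each zone requires,
# and in what order. Adjust to whatever makes sense for your company.
ZONE_LAYOUTS = {
    "bronze": ("source", "dataset"),            # organized by data source
    "silver": ("source", "dataset"),            # mirrors Bronze
    "gold": ("business_function", "dataset"),   # organized by data function
}


def build_path(zone: str, **parts: str) -> str:
    """Build a lake-relative path that follows the layout rules for `zone`."""
    if zone not in ZONE_LAYOUTS:
        raise ValueError(f"Unknown zone: {zone!r}")
    layout = ZONE_LAYOUTS[zone]
    missing = [name for name in layout if name not in parts]
    if missing:
        raise ValueError(f"Zone {zone!r} requires: {missing}")
    return str(PurePosixPath(zone, *(parts[name] for name in layout)))


print(build_path("bronze", source="crm", dataset="customers"))
print(build_path("gold", business_function="marketing", dataset="customers"))
```

Because every path is produced by the same function, changing the rule for a zone means editing one entry instead of hunting through pipeline code.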


INTUITIVENESS

The second pillar is *intuitiveness*, and it closely follows structure.  Picture a maze: although it is a structure, finding your way from start to finish is not intuitive.  We want our data to be findable, not forgotten in some muddy corner of the Data Lake.  This task takes time and patience to get right for each piece of the structure that you establish.  And yes, it can look different in each zone if that makes sense!  In the Bronze zone you might organize by source because it is a copy of the source.  In the Gold zone, you might switch and organize by dashboard, project, or business function.  This allows end users to find, for example, all the reviews in one place, whichever source they came from.


CONSISTENCY

Consistency, the third pillar of Data Lake organization, reminds us that paths need to follow the same pattern within a zone.  As we discussed earlier, this does not mean there is just one path, but that every path is built in a set way.  You can go to any one of many restaurants, order a cheeseburger, and you are guaranteed to get a hamburger patty, a slice of cheese, and a bun at the least.  Each of those items might be different (Brioche vs. Potato bun, Swiss vs. Pepper Jack, etc.), but the idea is the same.  In each zone we want that same consistency so that we can easily find the information we need.  If folder nesting levels or naming conventions vary, information can get lost or become harder to find.
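
A consistency rule like this is easy to check automatically.  The following is a minimal Python sketch, assuming a hypothetical convention in which every path in a zone has the same number of folder levels and every folder name is lowercase snake_case; the function name `find_inconsistencies` and the sample paths are made up for illustration.

```python
import re

# Assumed naming convention: lowercase letters and digits, words joined by "_".
SEGMENT_PATTERN = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")


def find_inconsistencies(paths, expected_depth):
    """Return (path, reason) pairs for paths that break the zone's convention."""
    problems = []
    for path in paths:
        segments = path.strip("/").split("/")
        if len(segments) != expected_depth:
            problems.append(
                (path, f"expected {expected_depth} levels, found {len(segments)}")
            )
        bad_names = [s for s in segments if not SEGMENT_PATTERN.match(s)]
        if bad_names:
            problems.append((path, f"non-conforming names: {bad_names}"))
    return problems


bronze_paths = [
    "bronze/crm/customers",      # follows the convention
    "bronze/CRM/orders",         # inconsistent capitalization
    "bronze/web/clicks/2023",    # extra level of nesting
]
for path, reason in find_inconsistencies(bronze_paths, expected_depth=3):
    print(f"{path}: {reason}")
```

Running a check like this on a schedule, or before new data lands, is one way to catch drift toward a swamp early.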

What about the users?

The users of the Data Lake are an additional consideration when deciding on the best organization.  Some users are consumers, others are producers, and some are both.  Producers include Data Engineers, Data Scientists, ML Engineers, etc., who add data and value to the Data Lake.  Consumers are often operations, marketing, Data Scientists, or even robots that communicate decisions based on data piped into the Data Lake.  Your producers will likely care more about the source of the data and view the Data Lake from that perspective.  Consumers will want data grouped by type so they can ask for all the social media reviews, for example.  It may sound like a nightmare to keep both groups happy when their organizational needs clash.

The right tools

The good news is that tools exist to help enrich the metadata of your Data Lake.  These tools offer different lenses through which to view the Data Lake and thereby find the data you are interested in using.  Azure has Purview, AWS has Glue Data Catalog, and GCP has Data Catalog.  There are also non-cloud offerings if you have an on-premises Data Lake.
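
To make the idea of "different lenses" concrete, here is a toy in-memory catalog in Python.  It is only a sketch of the concept and does not reflect the API of Purview, Glue Data Catalog, or any other product; the `CatalogEntry` fields, the `find` helper, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    path: str        # where the data physically lives in the lake
    source: str      # the lens producers tend to care about
    data_type: str   # the lens consumers tend to care about


# Hypothetical entries; the physical layout here is organized by source.
CATALOG = [
    CatalogEntry("bronze/app_store/reviews", "app_store", "social_media_review"),
    CatalogEntry("bronze/web_forum/reviews", "web_forum", "social_media_review"),
    CatalogEntry("bronze/crm/customers", "crm", "customer_record"),
]


def find(catalog, **criteria):
    """Return paths of entries whose metadata matches every given criterion."""
    return [
        entry.path
        for entry in catalog
        if all(getattr(entry, key) == value for key, value in criteria.items())
    ]


# Producer lens: everything that came from one source.
print(find(CATALOG, source="crm"))
# Consumer lens: all social media reviews, regardless of source.
print(find(CATALOG, data_type="social_media_review"))
```

The folders stay organized one way, while the metadata lets each group search the way it thinks about the data.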

Data Lake technology (and technology in general) is changing at a faster and faster pace, which can make setting up a Data Lake seem daunting.  The good news is that you don’t have to get it right the first time!  Predicting the needs of the future is the trickiest challenge of all, but if you allow room to grow, your Data Lake will be productive.

Here are a few tricks that will help promote flexibility in your Data Lake so that it can grow with your company.

I strongly advocate for tooling and abstraction that pulls the implementation details of the Data Lake one layer away from analysis or pipeline code.  The extra step is an investment which will pay dividends when, not if, you change structure, cloud provider, etc.

Truly understand where you are today and where you are headed.  Then you can evaluate new cases methodically and determine where they fall in line with your current practices.  If the new case is truly novel, then establish a new zone in your Data Lake.

Realize that a Data Lake is not meant for all your data or reporting.  There is no need to pass everything through the Data Lake or try to make the Data Lake hold all things.  Other tools, like a Data Warehouse, should be used alongside a Data Lake.
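
The first trick above, keeping implementation details one layer away from analysis or pipeline code, might be sketched as follows.  This is one possible shape under stated assumptions, not a prescribed design: the `DataLake` class, its `path_for` method, the storage roots, and the dataset names are all hypothetical.

```python
class DataLake:
    """Resolve logical dataset names to physical locations in the lake."""

    def __init__(self, root, datasets):
        # `root` is a storage prefix; `datasets` maps logical names to
        # lake-relative paths. Both would normally come from configuration.
        self._root = root.rstrip("/")
        self._datasets = dict(datasets)

    def path_for(self, name):
        """Return the full location of the dataset registered under `name`."""
        if name not in self._datasets:
            raise KeyError(f"Unknown dataset: {name!r}")
        return f"{self._root}/{self._datasets[name]}"


# Pipeline code asks for data by logical name only.
lake = DataLake(
    root="s3://example-lake",
    datasets={"customer_reviews": "gold/marketing/customer_reviews"},
)
print(lake.path_for("customer_reviews"))

# After a restructure or a provider change, only the configuration changes;
# callers of path_for("customer_reviews") stay the same.
moved = DataLake(
    root="gs://example-lake-v2",
    datasets={"customer_reviews": "gold/reviews/customers"},
)
print(moved.path_for("customer_reviews"))
```

The extra indirection costs a little up front, but it confines a future move to one place instead of every script that hard-codes a path.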

Conclusion

Data Lakes can turn into Data Swamps, but also Data Swamps can be remediated into beautiful Data Lakes. Following the three pillars of Data Lake organization will keep your lake from becoming a swamp. Structure provides the framing, *intuitiveness* puts the microwave in the kitchen, and consistency means the rightmost handle on the faucet always turns on the cold water.