Data lakes, centralized repositories of raw data, emerged in the late 2000s as a simple and cost-effective answer to Big Data. As monolithic storage centers, they can hold everything from CSVs and emails to images and JSON, with little to no upfront preparation at the time of ingestion. As an alternative to data warehousing, lakes promise cost savings and a user-friendly interface. For those drowning in Big Data, a cool, calm lake can seem like the perfect solution.
But according to a panel composed of two CIOs, three senior data scientists, and one former AWS architect, data lakes are typically more akin to a swamp than a clear body of water. Despite the hype surrounding them, in practice the vast majority fall victim to the “garbage in, garbage out” axiom. While it can be appealing to dump data from disparate domains in its native format with little to no indexing, the upfront benefits of this strategy typically give way to long-term costs: analyzing and surfacing information is difficult and requires employees who are fluent in data flow technologies like Spark and Flume.
Despite their potential, data lakes in practice tend to be cost-ineffective, difficult to manage, and ultimately a poor solution to the problems of Big Data.
What’s a Library Without Genres and Alphabetization?
Data lakes are similar to libraries: if the contents are classified correctly, surfacing knowledge is simple. You enter a series of identifiers (author name, genre, etc.) and get the precise location and stock level for a book. Most data lakes in enterprise settings today, though, are more like a large pile of books, with no genre organization, alphabetization, or structure. Because data doesn’t have to be altered at the point of ingestion, many companies fail to categorize it, letting the data stream into the lake without any index.
Indeed, it can be tempting for many leaders to turn to data lakes in moments of Big Data panic. With more and more data streaming in from IoT devices, social sites, sales and marketing systems, and, of course, internal document systems, data lakes present a quick and agile way to ensure that all of that data is kept in its original form and that nothing is lost.
The problems arise, though, when we try to use data that has not been indexed. Unlike in other storage systems, such as data warehouses, data is not profiled, excluded, or simplified before it is stored in a data lake. Because of this, two things often occur:
- Companies end up storing more data than they otherwise would have, making the data lake ultimately cost-ineffective.
- Surfacing and analyzing critical business information is time-consuming and requires niche data flow knowledge.
The former AWS architect noted that, from a performance perspective, data lakes are a scalable solution to Big Data: scaling up to terabytes or petabytes with a data lake is significantly cheaper than with other storage methods, such as data warehousing or fog computing. The problem is not with the lake itself, but with how we use it. “I’ve seen a lot of data lakes turn into data graveyards,” noted a NewtonX data scientist. “Companies struggle with integration, with setting rules for use, and with analysis, especially if they’ve been using data lakes without a clear intention. At a certain point, they realize they’re paying to store a bunch of information that they will never use, or be able to use without significant investment.”
Swim or Drown: How to Do Data Lakes the Smart Way
Data lakes themselves are not to blame for the swamps that they often devolve into. With the right team and the right intentions, they can be an affordable, secure, and simple repository for analysis. When done correctly, each element in the data lake has a unique identifier and metadata tags that indicate the data’s history and reliability. This allows businesses to query the data lake and easily surface the relevant data for analysis. To extend the library metaphor, this version of a data lake is like a well-organized and indexed library: when you search for the “Political Science” section, all books within this genre are easy to find, and every book within the genre is organized according to a known logic.
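To make the idea of identifiers and metadata tags concrete, here is a minimal sketch of a catalog that sits alongside the raw files. It is an illustration only, not how any particular data lake product works: the `Catalog` and `CatalogEntry` names, the file paths, the source system names, and the "verified"/"unverified" reliability labels are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Metadata recorded for one raw object in the lake (hypothetical schema)."""
    path: str          # where the untouched raw object is stored
    source: str        # system the data came from (its history)
    reliability: str   # assumed labels: "verified" or "unverified"
    tags: set = field(default_factory=set)
    # Unique identifier assigned at ingestion time.
    entry_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    ingested_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class Catalog:
    """In-memory index over the lake; the raw data itself is never altered."""

    def __init__(self):
        self._entries = {}

    def register(self, path, source, reliability, tags):
        """Record metadata for a newly ingested object and return its ID."""
        entry = CatalogEntry(
            path=path, source=source, reliability=reliability, tags=set(tags)
        )
        self._entries[entry.entry_id] = entry
        return entry.entry_id

    def find(self, tag, reliability=None):
        """Return entries carrying `tag`, optionally filtered by reliability."""
        return [
            e for e in self._entries.values()
            if tag in e.tags
            and (reliability is None or e.reliability == reliability)
        ]


# Example: index three raw objects at ingestion, then query by "genre".
catalog = Catalog()
catalog.register("raw/sales/2019-q1.csv", "sales-system", "verified",
                 ["sales", "quarterly"])
catalog.register("raw/iot/sensor-dump.json", "iot-gateway", "unverified",
                 ["iot"])
catalog.register("raw/sales/leads.json", "marketing-system", "unverified",
                 ["sales"])

verified_sales = catalog.find("sales", reliability="verified")
for entry in verified_sales:
    print(entry.entry_id, entry.path, entry.source)
```

The point of the sketch is that tagging happens at ingestion, when the origin of the data is still known. The files stay in their native format; only the catalog carries structure, which is what lets a later query behave like a search of the “Political Science” shelf rather than a dig through a pile.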
According to the data scientists on the NewtonX panel, data lakes come with an accompanying premise that often leads to their demise: that end users can manipulate and analyze the data themselves, thereby democratizing access within an organization. In reality, though, the assumption that an end user has the technical know-how to do this effectively is dubious at best, which is why data lakes so often fail in practice. “It’s a false promise,” declared one data scientist. “When you take an IT challenge and give it to business-facing employees, it’s not surprising that many of them handle the information incorrectly.”
Data lakes are not a quick and easy solution for the technically wary. If they are to be used effectively to deliver business insights, they need a robust indexing system and a dedicated, technically skilled manager (if not several). The CIOs on the NewtonX panel emphasized that, all things considered, if you do data lakes the right way, the cost savings tend to be nil, and the true benefit lies in the wealth of unaltered data they hold. As the former AWS architect put it, “As long as you have someone who knows how to surface and use that data, you can swim just fine in a data lake.”