Lakehouse And Data Lakes - What Are They?
Databases have traditionally been very structured and ordered.
Data had to be defined in terms of records and fields. Each field had to be numeric or alphanumeric, and often its length also had to be determined in advance. In short, a database was highly restrictive and could only store and analyze data that fit a fixed, predefined shape.
For many years this was good enough. Managing stock control in a warehouse, where each item has a specific item number, arrival date, expiry date, and so on, is a good example. With these fixed and predetermined fields, a database serves the purpose well, but more recently companies have explored the possibility of including unstructured data in the same pool.
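To make the rigidity concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, the column names, the eight-character item number, and the customer_review field are all assumptions made up for illustration; they do not come from any particular stock control system.

```python
import sqlite3

# Hypothetical stock table: every field and its constraints are fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE stock (
        item_number  TEXT    NOT NULL CHECK (length(item_number) = 8),
        arrival_date TEXT    NOT NULL,
        expiry_date  TEXT    NOT NULL,
        quantity     INTEGER NOT NULL CHECK (typeof(quantity) = 'integer')
    )
    """
)


def try_insert(sql, params):
    """Run one INSERT and return 'ok' or the name of the sqlite3 error raised."""
    try:
        conn.execute(sql, params)
        return "ok"
    except sqlite3.Error as exc:
        return type(exc).__name__


# A record that matches the predefined shape is accepted.
valid = try_insert(
    "INSERT INTO stock VALUES (?, ?, ?, ?)",
    ("AB123456", "2024-01-10", "2025-01-10", 40),
)

# An item number of the wrong length violates the CHECK constraint.
too_long = try_insert(
    "INSERT INTO stock VALUES (?, ?, ?, ?)",
    ("AB1234567890", "2024-01-10", "2025-01-10", 40),
)

# Free text has nowhere to go: the column was never defined.
free_text = try_insert(
    "INSERT INTO stock (item_number, arrival_date, expiry_date, quantity, customer_review) "
    "VALUES (?, ?, ?, ?, ?)",
    ("CD654321", "2024-02-01", "2025-02-01", 5, "Arrived damaged, box was wet"),
)

print(valid, too_long, free_text)
```

Only the first record is stored. The other two are rejected because they do not fit the shape that was decided when the table was created, which is exactly the limitation that unstructured data runs into.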
Data warehouses emerged to focus on decision support and business intelligence applications, though they were still not suited for handling unstructured data, semi-structured data, and data with high variety, velocity, and volume. This is where the arrival of data lakes became significant.
Data lakes could handle raw data in many different formats for data science and machine learning.
This was also facilitated by the price of storage declining dramatically. Data lakes have significant faults, though. The quality of the data is not enforced, and because there is no control over the consistency of the data, it’s very difficult to manage batch or streaming tasks without a high rate of errors.
So data specialists have stitched these worlds together. The usual answer is a two-tier architecture in which a data lake is used alongside a data warehouse. At a basic level this works, but it can result in duplicate data, which requires more storage and creates greater security risks.
The Extract, Transform, and Load (ETL) process is important here. The data is extracted from the various operational databases, transformed, and loaded into a larger data lake. Often the data is not well organized, but it will be transformed into a format that is compatible with machine learning tools. It’s likely that another ETL process will be required to pull data from the lake so it can be analyzed for business intelligence or analytics.
It’s clear that this is an improvement on the basic data warehouse, but the multiple ETL processes can result in stale or duplicate data, and this type of data requires regular maintenance.
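The two ETL hops, and the way they leave copies behind, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the in-memory lists stand in for real storage systems, and the record fields and function names are hypothetical.

```python
import copy

# Hypothetical operational database: values are stored as loosely typed strings.
operational_db = [
    {"order_id": 1, "amount": "19.90", "country": "de"},
    {"order_id": 2, "amount": "5.00", "country": "US"},
]


def etl_to_lake(source_rows):
    """First hop: copy the raw records into the lake, largely untouched."""
    return copy.deepcopy(source_rows)


def etl_to_warehouse(lake_rows):
    """Second hop: clean and type the records so BI tools can query them."""
    return [
        {
            "order_id": row["order_id"],
            "amount": float(row["amount"]),
            "country": row["country"].upper(),
        }
        for row in lake_rows
    ]


lake = etl_to_lake(operational_db)
warehouse = etl_to_warehouse(lake)

# The source changes after both jobs have run...
operational_db[0]["amount"] = "24.90"

# ...so the downstream copies no longer match it until the jobs run again.
stale = warehouse[0]["amount"] != float(operational_db[0]["amount"])
print("warehouse is stale:", stale)
```

The same order now exists in three places, and two of them are out of date until both jobs are rerun. That is the maintenance burden the paragraph above describes.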
A lakehouse is intended to address the shortcomings and limitations of data lakes by combining the best attributes of data lakes and data warehouses. Lakehouses are enabled by a new open and standardized system design: implementing similar data structures and data management features to those in a data warehouse, directly on the kind of low-cost storage used for data lakes.
I found a useful definition of the key features of a lakehouse here. Summarizing the key points, you should expect to find all of these attributes:
- Transaction support: support for transactions ensures consistency as multiple parties concurrently read or write data, typically using SQL
- Schema enforcement and governance: the lakehouse should have a way to support schema enforcement and evolution, supporting data warehouse schema architectures such as star and snowflake schemas
- BI support: lakehouses allow BI tools to be used directly on the source data
- Storage is decoupled from compute: in practice, this means storage and compute use separate clusters, thus these systems are able to scale to many more concurrent users and larger data sizes
- Openness: the storage formats they use are open and standardized, such as Parquet
- Support for diverse data types ranging from unstructured to structured data: the lakehouse can be used to store, refine, analyze, and access data types needed for many new data applications
- Support for diverse workloads: including data science, machine learning, and SQL and analytics
- End-to-end streaming: real-time reports are the norm in many enterprises, so the lakehouse should support streaming without needing a separate system for real-time data
This is not a comprehensive list of attributes. Larger enterprise systems may require more features, but this list demonstrates just how far the lakehouse concept has evolved from the structured database or even the data warehouse.
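Two of the attributes above, schema enforcement and transaction support, are easier to picture with a toy example. The sketch below is not the API of any real lakehouse product or table format. The Table class, the SchemaError exception, and the sensor fields are hypothetical names invented for illustration, and a real system would enforce the same idea on files in low-cost storage rather than on a Python list.

```python
from dataclasses import dataclass, field

# Hypothetical schema: column name -> required Python type.
SCHEMA = {"sensor_id": str, "reading": float}


class SchemaError(ValueError):
    """Raised when a row does not match the table schema."""


@dataclass
class Table:
    schema: dict
    rows: list = field(default_factory=list)

    def append_batch(self, batch):
        """Validate every row first, then commit the whole batch or nothing."""
        for row in batch:
            if set(row) != set(self.schema):
                raise SchemaError(f"unexpected columns: {sorted(row)}")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise SchemaError(f"{column} should be {expected.__name__}")
        # Reached only if every row passed, so a bad batch leaves no trace.
        self.rows.extend(batch)


table = Table(SCHEMA)
table.append_batch([{"sensor_id": "a1", "reading": 20.5}])

try:
    table.append_batch(
        [
            {"sensor_id": "a2", "reading": 21.0},
            {"sensor_id": "a3", "reading": "n/a"},  # wrong type
        ]
    )
    rejected = False
except SchemaError:
    rejected = True

print("batch rejected:", rejected, "| rows stored:", len(table.rows))
```

The second batch contains one valid row and one invalid row, and neither is stored. A plain data lake would typically accept both, which is how inconsistent data accumulates.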