Generally, organizations require a SQL warehouse to manage their analytical workspace and BI workloads, necessitating a consistent SQL endpoint for extracting data to create dashboards. To enhance performance efficiency, especially for time-sensitive BI tasks, internal tables are sometimes created by copying data from a Data Lake or storage directly into the database.
In both scenarios, tables, whether external or internal, need to be maintained within the database. These tables can be created directly in a SQL database by implementing transformation logic in SQL stored procedures or in PySpark with the assistance of a Spark pool. Typically, for simple transformations, it's advisable to implement them in SQL, as future logic modifications can be easily accommodated with minimal effort, ensuring the accurate updating of internal tables.
However, complexities arise with transformations requiring PySpark. In such cases, a Spark pool needs to be created, mapped with the Data Lake, and the raw data transformed into the desired format within the Data Lake. Subsequently, an SQL external table in dedicated SQL database is created on top of this transformed data which is in data lake for BI workloads.
Issues arise when multiple experiments are conducted on raw or transformed data, for instance, when the same raw data is utilized for both ML and Analytical/BI purposes. If ML processes identify the need to correct transformation logic, and these corrections are made in the Data Lake and stored in a location not mapped to the external table in the dedicated SQL database, data silos are created. Additionally, BI processes may struggle to ascertain the latest data version they need to reference, especially if the individual responsible for updating the transformation logic fails to update all external tables with the latest logic.
For instance, suppose we have a "TransformedData" table in Spark pointing to a specific location in the Data Lake, and an external table is created at the same path within the dedicated SQL warehouse.
Now some data engineer, modified the transformation Logic and stored it to different path :
After a data engineer modifies the transformation logic and stores it in a different path, the data in the dedicated SQL database still points to the old location. This requires intervention from someone to update the database to reflect the new data location in order for it to function properly.
Furthermore, if someone creates an internal table, it means that the data needs to be reloaded to ensure that the BI processes can function correctly. This problem persists until the introduction of a new concept known as the DataLake house, which helps to address these issues and streamline data management processes.
Despite the challenges mentioned, this approach can still be viable in certain scenarios:
1. Structured Transformations: When transformations are relatively simple and can be efficiently executed using SQL, it makes sense to leverage SQL stored procedures. This approach can provide better performance and easier maintenance compared to complex PySpark transformations.
2. SQL Database Integration: If your organization heavily relies on SQL databases for BI workloads and analytics, maintaining internal tables within the database can streamline data access and improve query performance.
3. Organization in BI aligned: Dedicated SQL databases and SQL-based transformations can be tailored specifically for BI workloads, optimizing performance and query response times. This ensures that BI analysts and decision-makers have access to reliable and timely data for their analytical needs. Means primary focus is to align Business Intelligence (BI) processes separately from Machine Learning (ML) initiatives while ensuring efficient BI workloads.
So, what is a lakehouse ?
It's essentially having a consistent SQL endpoint directly for your transformed data stored as Spark catalog-based tables in the data lake. This eliminates the need to create external or internal tables in a dedicated SQL database for the transformed data sourced from the data lake. Metadata layers, like the open source Delta Lake, sit on top of open file formats (e.g. Parquet files) and track which files are part of different table versions to offer rich management features like ACID-compliant transactions. The metadata layers enable other features common in data lakehouses, time travel to old table versions, schema enforcement and evolution, as well as data validation. Performance is key for data lakehouses to become the predominant data architecture used by businesses today as it's one of the key reasons that data warehouses exist in the two-tier architecture. While data lakes using low-cost object stores have been slow to access in the past, new query engine designs enable high-performance SQL analysis. So basically you will have structured,semi-scructured & unstructured data all at same place.
How typical overall Microsoft architecture could looks like:
Now that you grasp the concept, we'll be utilizing Spark tables. Consequently, the same level of access control previously implemented in the dedicated SQL pool will now need to be applied to these Spark tables. Presently, on the market, there are products such as Databricks and MS Fabric that have already adopted the lakehouse paradigm, incorporating delta lake concepts to ensure tables are ACID compliant. Each of these technology stacks employs its own method for access control, with Databricks utilizing Unity catalog and MS Fabric employing access control similar to Dedicated SQL Databases.
In the end I have tried to bring upon small comparison based on above scenario, that may help you to choose required tech stack for your unified data architecture.
|
SYNAPSE
|
DATABRICKS
|
MS FABRIC
|
Storage
|
Data warehouse and
Datalake
|
Lakehouse
|
Lakehouse
|
Purpose
|
More efficient in SQL
workloads and moderate in PySpark
|
More efficient in
Pyspark workloads
|
More efficient in SQL
workloads and moderate in PySpark
|
Cost
|
Processing is cost
effective, Storage is costly
|
Processing is costly
& Storage is cost effective
|
Processing &
Storage is cost effective
|
Query
|
Super Fast
|
Fast
|
Fast
|
When to use ?
|
Data lie on-prem or
cloud, your need is more BI aligned and need faster query results with decent
support to ML/Data science,
Need everything on Azure stack
|
Data lie on cloud or
Data lake, your need no silos and more collaboration between ML/Data science
and BI
NOTE: you can still use
tools like ADF together with Databricks to pull on-prem data
|
Data lie on-prem or
cloud, your need no silos and more collaboration between ML/Data science and
BI,
Need everything on Azure stack
|
Comments
Post a Comment