In the ever-evolving landscape of data engineering and analytics, several trends are reshaping how organizations handle their information assets. Key shifts include recognizing data as a strategic asset at the C-level and emphasizing cost-efficient systems. The concept of a lakehouse, which combines the best of data lakes and data warehouses, has gained market validation. It offers scalability, flexibility, and analytics capabilities while maintaining data governance.
However, the real transformation comes with the shift to Databricks, a platform that has recently empowered organizations with serverless SQL capabilities. Databricks Serverless SQL warehouses and DBT (Data Build Tool) play pivotal roles here, and we will look at how you can harness their power to create efficient, scalable, and cost-effective data pipelines.
Transformations used to rely heavily on data engineers building data pipelines or workflows across several notebooks. DBT changed that: it allows data analysts and engineers with SQL skills to apply software engineering best practices, from developing, documenting, and testing to deploying to the destination, with ease.
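To make that concrete, here is a minimal sketch of a dbt model and its accompanying schema file; the model, source, and column names are hypothetical:

```sql
-- models/staging/stg_orders.sql (hypothetical model)
-- dbt compiles this SELECT into the DDL for a view or table on the
-- warehouse; {{ source(...) }} resolves to a table declared in the project.
select
    order_id,
    customer_id,
    order_date,
    amount
from {{ source('raw', 'orders') }}
where order_id is not null
```

```yaml
# models/staging/schema.yml (hypothetical)
version: 2
models:
  - name: stg_orders
    description: "Orders cleaned and filtered for downstream marts."
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```

With this in place, dbt run builds the model, dbt test runs the declared tests, and dbt docs generate produces a browsable documentation site.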
The relationship between DBT and Databricks has a history dating back to the 2020 Data and AI Summit (Europe), where discussions revolved around integrating DBT with Databricks and Delta. Recent improvements have made the integration simpler and more user-friendly, resulting in faster production-grade outcomes and better scalability.
Combining the power of DBT with a Serverless SQL warehouse for near real-time streaming
Ever since SQL warehouses were introduced in Databricks, queries that DBT pushes down have shown clear performance gains, because SQL warehouses are optimized specifically for SQL workloads compared to regular all-purpose compute in Databricks. With the recent growth of Databricks Serverless SQL, there are further significant improvements to leverage, such as built-in Predictive I/O and intelligent autoscaling.
To know more about the different types of SQL warehouses, see the Databricks documentation.
Key differences between the Serverless SQL warehouse and the Classic & Pro SQL warehouses
Serverless SQL spins up much faster, within 5-10 seconds at most, and guarantees consistent execution times, whereas both classic and pro compute take a few minutes to start up before execution can begin. That makes serverless a fit for many use cases, such as streaming data and near real-time BI insights, without the additional cost of keeping clusters running continuously.
Cost-effective - If each cluster type is already up, run times are almost comparable, but classic and pro carry the additional cost of keeping clusters running while idle.
Also, classic and pro compute run on the selected cloud provider, so two kinds of charges apply: one for Databricks Unit (DBU) usage and another for the instances running on the cloud provider. The serverless option is managed entirely by Databricks, and its pricing is attractive considering there are no separate cloud infrastructure costs.
For the pricing details of the Databricks SQL warehouse, see the Databricks pricing documentation.
Let’s delve into the specifics of using DBT with Databricks. If you are working in the Databricks interface, DBT Cloud is available on Databricks Partner Connect and is very easy to integrate. If you work with the DBT CLI, there is the dbt-databricks adapter, which builds upon the foundation of dbt-spark. Recent updates to this package leverage Databricks’ latest features, such as the Photon compute engine, instant scalability via Serverless SQL warehouses, and the Unity Catalog governance model. Additionally, features like streaming tables and materialized views enhance the overall capabilities of this integration, as sketched below.
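For the CLI route, a minimal profiles.yml sketch like the following points dbt at a Serverless SQL warehouse through the dbt-databricks adapter; the project name, catalog, schema, host, and warehouse ID are placeholders you would replace with your own values:

```yaml
# ~/.dbt/profiles.yml -- sketch only; all identifiers are placeholders.
databricks_demo:
  target: dev
  outputs:
    dev:
      type: databricks                                # dbt-databricks adapter
      catalog: main                                   # Unity Catalog catalog
      schema: analytics                               # target schema for models
      host: dbc-xxxxxxxx-xxxx.cloud.databricks.com    # workspace hostname
      http_path: /sql/1.0/warehouses/<warehouse_id>   # serverless SQL warehouse endpoint
      token: "{{ env_var('DATABRICKS_TOKEN') }}"      # personal access token
```

And as an illustration of the streaming tables mentioned above, a model can opt into the streaming_table materialization; the source table here is hypothetical, and the syntax assumes a recent dbt-databricks version:

```sql
-- models/streaming/orders_stream.sql (hypothetical)
-- Materialized as a Databricks streaming table, which incrementally
-- processes new rows from the streaming source on each refresh.
{{ config(materialized='streaming_table') }}

select
    order_id,
    customer_id,
    amount
from stream(main.raw.orders)  -- STREAM() reads the Delta table as a stream
```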