Understanding Databricks Cluster Types in Unity Catalog

Data Governance

Date: 10/03/2023

Dive into the cluster types available in Unity Catalog-enabled workspaces. Essential reading for data engineers: practical uses, implementation challenges, and workarounds.

Maulik Divakar Dixit
Director, Data Engineering, Databricks Champion

Table of contents

Understanding Databricks Cluster Types in Unity Catalog

A Unity Catalog-enabled workspace offers three types of clusters for running data pipelines in Databricks:

  • Single User: This is the most secure way of accessing data in a Unity Catalog-enabled workspace. A single user cluster is designed for exclusive use by one user, and the permissions that user holds on external locations and files apply on this cluster type.
  • Shared: This cluster type is shared across users and works with Unity Catalog. It has some limitations, which are explained later in this blog.
  • No Isolation Shared: This cluster type works only with the legacy Hive metastore, enabling legacy data access and processing for objects in the local Hive metastore. Permissions set on Unity Catalog objects do not apply in this case.

Please see details in the link -

Create a cluster - Azure Databricks | Microsoft Learn
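The access mode is simply a property of the cluster definition. As a minimal sketch (the workspace URL, token, node type, and user below are hypothetical placeholders), a single user cluster can be created through the Clusters REST API; switching data_security_mode to USER_ISOLATION or NONE gives the shared and no isolation shared modes respectively:

    import requests

    # Hypothetical workspace URL and personal access token -- substitute your own.
    WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
    TOKEN = "dapiXXXXXXXXXXXXXXXX"

    # The access mode is driven by data_security_mode:
    #   SINGLE_USER    -> single user cluster
    #   USER_ISOLATION -> shared cluster
    #   NONE           -> no isolation shared cluster
    cluster_spec = {
        "cluster_name": "uc-single-user-demo",
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1,
        "data_security_mode": "SINGLE_USER",
        "single_user_name": "data.engineer@example.com",  # the only user allowed on this cluster
    }

    resp = requests.post(
        f"{WORKSPACE_URL}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=cluster_spec,
    )
    resp.raise_for_status()
    print(resp.json()["cluster_id"])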

From a Unity Catalog point of view, single user and shared are the only cluster modes that can be used.

The shared cluster mode has some limitations that are important to understand, as well as cases where a single user cluster helps work around them.

Below are a few limitations that we faced with the shared cluster mode (a short code sketch follows the list).

  1. Creating a dataframe over an external location in Unity Catalog and then trying to create a view on it fails with an error that the user does not have SELECT permission on the file.
  2. Trying to save files to an external location using dataframe.write.partitionBy(*partition_by).format("delta").save(delta_file_path) causes failures.
  3. Functions like input_file_name(), which tags the source filename against each record, returned null results.
  4. Storing JSON files to the data lake using json.dumps() did not work.
  5. UDFs and Fernet encryption were not working with the shared cluster mode.
  6. File system commands like %fs ls do not work.
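To make the first three failure modes concrete, here is a minimal PySpark sketch of the kind of code that failed for us on a shared cluster but runs on a single user cluster (the paths and the partition column are hypothetical placeholders; spark is the ambient notebook session):

    from pyspark.sql.functions import input_file_name

    delta_file_path = "abfss://bronze@mydatalake.dfs.core.windows.net/sales"
    partition_by = ["ingest_date"]

    # (1) Dataframe over an external location, then a view on top of it:
    # on a shared cluster this raised a missing SELECT permission error on the file.
    df = spark.read.format("json").load("abfss://landing@mydatalake.dfs.core.windows.net/sales/")
    df.createOrReplaceTempView("sales_raw")

    # (3) Tagging each record with its source file returned nulls on a shared cluster.
    df_tagged = df.withColumn("source_file", input_file_name())

    # (2) A partitioned Delta write to the external location failed on a shared cluster.
    df_tagged.write.partitionBy(*partition_by).format("delta").save(delta_file_path)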

These limitations were found when interacting with external locations and the data lake. To overcome them, we initially switched to single user cluster mode to create the external Delta tables. Once the Delta (Bronze) tables existed, we were able to use shared cluster mode for building the Silver and Gold layers.
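As a sketch of that workaround (the catalog, schema, table, and path are hypothetical), the Bronze table is registered once from a single user cluster; after that, shared clusters read it through Unity Catalog table permissions rather than direct file access:

    # Run once on a single user cluster: register the external Bronze Delta table.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS main.bronze.sales
        USING DELTA
        LOCATION 'abfss://bronze@mydatalake.dfs.core.windows.net/sales'
    """)

    # From a shared cluster, reads now go through Unity Catalog table permissions
    # instead of file-level permissions on the external location.
    silver_df = spark.table("main.bronze.sales").where("ingest_date = '2023-10-01'")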

So for the project we did the following:

  1. For the data ingestion team, we created a personal cluster with very small compute for each member and let them use that cluster.
  2. We also used a single user cluster to orchestrate our jobs from Azure Data Factory and Databricks Workflows, by creating an application user and making that user part of the metastore admin group (a sketch of the job configuration follows this list).
  3. To meet SLAs for small jobs, we used interactive clusters. Since an interactive cluster does not support a service principal as its user, we created a generic user and used it to run the jobs in single user mode.
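As an illustration (the job name, notebook path, and application user are hypothetical), a Databricks Workflows job can pin its cluster to single user mode and run as the application user. A minimal sketch of a Jobs API 2.1 payload:

    # Hypothetical Jobs API 2.1 payload: the job's cluster runs in single user
    # mode as a dedicated application user.
    job_spec = {
        "name": "bronze-ingest",
        "run_as": {"user_name": "app.pipelines@example.com"},
        "tasks": [
            {
                "task_key": "ingest",
                "notebook_task": {"notebook_path": "/Pipelines/bronze_ingest"},
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",
                    "node_type_id": "Standard_DS3_v2",
                    "num_workers": 2,
                    "data_security_mode": "SINGLE_USER",
                    "single_user_name": "app.pipelines@example.com",
                },
            }
        ],
    }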

We understand that the future Databricks roadmap includes new cluster capabilities that remove the limitations imposed by the shared cluster mode. Until then, single user clusters running orchestrated data pipelines in production work without any issues.

This is a short blog, but we thought it was important to explain the cluster types, what works in which scenario, and how to work around the limitations.

Watch this space to learn about Unity Catalog's data governance capabilities in Chapter 5 of this blog series.




Next Topic

Setting up Unity Catalog and Object Organization

