Databricks Cluster Types Explained in Unity Catalog

Data Governance

Date : 10/03/2023

Explore Databricks cluster types within Unity Catalog for practical uses and overcoming implementation challenges.

Maulik Divakar Dixit

Director, Data Engineering,
Databricks Champion
Databricks MVP

Databricks Cluster Types

Table of contents

Databricks Cluster Types Explained in Unity Catalog

What is a Databricks cluster?
A Unity Catalog workspace offers three types of clusters to run data pipelines in Databricks
Challenges and Limitations with Shared Clusters

What is a Databricks cluster?

A Databricks cluster is a set of cloud compute resources used to process large datasets. It uses Apache Spark to perform tasks like data cleaning, analysis, and machine learning. Clusters can be customized with varying compute power, making them well suited to handling diverse data engineering and data science workloads efficiently.

A Unity Catalog workspace offers three types of clusters to run data pipelines in Databricks

Single User : This is the most secure way of accessing data in a Unity Catalog enabled workspace. A single user cluster is assigned to and used by one user. The permissions that the user holds on external locations and files apply when working on a single user cluster.
Shared : This cluster type is shared across multiple users and works with Unity Catalog. It has some limitations, which are explained later in this blog.
No Isolation Shared : This cluster type works only with the legacy Hive metastore, enabling legacy data access and processing for objects in the local Hive metastore. Permissions set on Unity Catalog objects do not apply in this mode.

For details, see "Create a cluster - Azure Databricks" on Microsoft Learn.
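When a cluster is created programmatically rather than through the UI, the three modes above are, as far as we understand the Databricks Clusters API, selected through the data_security_mode field. The sketch below shows that mapping under this assumption. The helper name access_mode_settings and the sample user are hypothetical and only for illustration.

```python
# UI access mode -> value of the Clusters API `data_security_mode` field
# (assumed mapping; verify against the API reference for your workspace).
ACCESS_MODES = {
    "Single User": "SINGLE_USER",
    "Shared": "USER_ISOLATION",
    "No Isolation Shared": "NONE",
}

# Only these two modes work with Unity Catalog permissions.
UNITY_CATALOG_MODES = {"Single User", "Shared"}


def access_mode_settings(mode, user=None):
    """Return the access-mode part of a cluster spec (hypothetical helper)."""
    if mode not in ACCESS_MODES:
        raise ValueError(f"Unknown access mode: {mode}")
    settings = {"data_security_mode": ACCESS_MODES[mode]}
    if mode == "Single User":
        # A single user cluster must name the user it is assigned to.
        if not user:
            raise ValueError("A single user cluster needs an assigned user")
        settings["single_user_name"] = user
    return settings


print(access_mode_settings("Single User", "ingest.user@example.com"))
print(access_mode_settings("Shared"))
print(access_mode_settings("No Isolation Shared"))
```

The returned dictionary is meant to be merged into a full cluster definition (name, runtime version, node type, and so on) before it is sent to the API.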

Challenges and Limitations with Shared Clusters

From a Unity Catalog point of view, single user and shared are the only cluster modes that can be used.

The shared cluster mode has some limitations that are important to understand, and in several cases a single user cluster helps work around them.

Below are a few limitations that we faced with shared cluster mode:

Creating a DataFrame on an external location in Unity Catalog and then trying to create a view from it gave an error stating that the user does not have SELECT permission on the file.
Saving files to an external location using dataframe.write.partitionBy(*partition_by).format("delta").save(delta_file_path) failed.
Functions like input_file_name(), which tags each record with its source file name, returned null results.
Storing JSON files to the data lake using json.dumps() did not work.
UDFs and Fernet encryption did not work in shared cluster mode.
File system commands like %fs ls did not work.

These limitations appeared when interacting with external locations and the data lake. To overcome them, we initially switched to single user cluster mode to create the external Delta tables. Once the Delta tables (Bronze tables) were created, we were able to use shared cluster mode to build the Silver and Gold layers.
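The split described above can be written down as a small routing rule. This is only a sketch of the convention we followed on this project, not a Databricks feature. The function name cluster_mode_for_layer is hypothetical.

```python
def cluster_mode_for_layer(layer):
    """Pick a cluster mode for a medallion layer (project convention, hypothetical helper)."""
    normalized = layer.strip().lower()
    if normalized == "bronze":
        # Bronze reads from external locations and the data lake,
        # where shared clusters hit the limitations listed earlier.
        return "Single User"
    if normalized in ("silver", "gold"):
        # Silver and Gold are built from Delta tables already registered
        # in Unity Catalog, so a shared cluster is sufficient.
        return "Shared"
    raise ValueError(f"Unknown layer: {layer}")


for layer in ("Bronze", "Silver", "Gold"):
    print(layer, "->", cluster_mode_for_layer(layer))
```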

So for the project we did the following:

For the data ingestion team, we created a personal cluster with very small compute for each member.
We also used a single user cluster to orchestrate our jobs from Azure Data Factory and Databricks Workflows, by creating an application user and adding that user to the metastore admin group.
To meet SLAs for small jobs, we used interactive clusters. Since an interactive cluster does not support a service principal as its user, we created a generic user and used it to run the jobs in single user mode.
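As an illustration of the first point, the sketch below generates one small single user cluster definition per team member. The field names follow the Databricks Clusters API as we understand it. The runtime version, node type, auto-termination value, and email addresses are placeholder assumptions, and personal_cluster_spec is a hypothetical helper, not the exact configuration used on the project.

```python
import json


def personal_cluster_spec(user_email,
                          spark_version="13.3.x-scala2.12",   # placeholder runtime
                          node_type_id="Standard_DS3_v2"):    # placeholder small Azure node
    """Build a small single user cluster definition for one team member."""
    short_name = user_email.split("@")[0]
    return {
        "cluster_name": f"personal-{short_name}",
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": 1,                      # keep compute very small
        "autotermination_minutes": 30,         # assumed value to limit idle cost
        "data_security_mode": "SINGLE_USER",   # single user access mode
        "single_user_name": user_email,        # the only user allowed on the cluster
    }


# Hypothetical ingestion team members.
team = ["asha@example.com", "ravi@example.com"]
specs = [personal_cluster_spec(member) for member in team]
print(json.dumps(specs, indent=2))
```

Each dictionary could then be submitted as the body of a cluster creation request, or adapted for whichever deployment tooling the workspace already uses.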

Our understanding of the Databricks roadmap is that new cluster types will be introduced to remove the limitations imposed by the shared cluster mode. In the meantime, a single user cluster running orchestrated data pipelines in production works without any issues.

This is a short blog, but we thought it was important to explain the cluster types, what works in which scenario, and how to work around the limitations.

Watch this space to learn about Unity Catalog's data governance capabilities in Chapter 5 of this blog series.
