Setting up Unity Catalog and Object Organization

Data Governance

Date : 09/26/2023

Explore a step-by-step guide to implement Unity Catalog in Databricks, integrate with Azure for centralized governance, and understand the role of metastore admin.

Maulik Divakar Dixit

Director, Data Engineering, Databricks Champion

To set up Unity Catalog for Databricks, first set up a storage account that will host the Unity Catalog metastore. The metastore is like a Hive metastore, except that it is not tied to a single workspace: every workspace enabled for Unity Catalog and attached to the metastore can access it.

The Databricks documentation covers each step in detail. The list below gives a high-level view of the steps on Azure Databricks. Setup may vary slightly on AWS and GCP.

Step 1 – Set up a storage account, preferably in your production subscription, since this is where objects in the metastore will be created

Step 2 – Create the Databricks access connector from the cloud marketplace and assign it a managed identity

Step 3 – Grant the access connector the Storage Blob Data Contributor role on the storage account

Step 4 – Log in to the Databricks account console (accounts.azuredatabricks.net) as a global administrator

Step 5 – Assign the Databricks account administrator role to another user to manage Databricks

Step 6 – Create a metastore in the region and assign the metastore administrator role to users/groups

Step 7 – Attach workspaces to the metastore to make them Unity Catalog enabled workspaces

Step 8 – Set up sync of users, groups and service principals to the account console through the SCIM connector

Step 9 – Assign users/groups to the workspaces from the account console

Step 10 – Log in to the workspace to create external locations in case data needs to be stored as external tables
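To make Step 10 more concrete, here is a minimal sketch that builds the SQL statement for an external location. It assumes a storage credential backed by the access connector from Step 2 already exists. The container, storage account, location and credential names are hypothetical and only there for illustration.

```python
def abfss_url(container: str, storage_account: str, path: str = "") -> str:
    """Build an ADLS Gen2 URL: abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    base = f"abfss://{container}@{storage_account}.dfs.core.windows.net"
    cleaned = path.strip("/")
    return f"{base}/{cleaned}" if cleaned else base


def create_external_location_sql(name: str, url: str, credential: str) -> str:
    """Return a CREATE EXTERNAL LOCATION statement that points at the given URL."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS `{name}` "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL `{credential}`)"
    )


# Hypothetical names, for illustration only.
url = abfss_url("raw", "contosodatalake", "sales/landing")
statement = create_external_location_sql(
    "sales_landing", url, "uc_access_connector_cred"
)
print(statement)
```

In a Unity Catalog enabled notebook, a statement like this could be passed to `spark.sql(statement)` by a user who has the privilege to create external locations.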

Again, all of these steps are detailed in the documentation listed here. The important points to note follow under Important Considerations.

Get started using Unity Catalog - Azure Databricks | Microsoft Learn

Get started using Unity Catalog | Databricks on AWS

Important Considerations

There is only one metastore per region.

This means that all workspaces in a region can connect only to that region's metastore. Where Databricks workspaces are spread across regions, it is recommended to migrate the workspaces to the metastore region so that they can share the metastore and its objects. If that is not feasible, a separate metastore needs to be created in the other region, and objects should be shared across regions through the Lakehouse Platform's Delta Sharing. This is set up by a Databricks administrator who has the rights to create recipients and shares.

The Databricks account administrator manages the metastores: setting them up in different regions and assigning users and groups to the account console and workspaces.
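The region rule can be checked before any workspace is attached. The sketch below is only a planning aid, not a Databricks API: it takes a hypothetical mapping of workspace names to regions and separates the workspaces that can attach to the metastore from those that would need migration or a separate metastore with Delta Sharing.

```python
from collections import defaultdict


def plan_metastore_attachment(
    workspaces: dict[str, str], metastore_region: str
) -> tuple[list[str], dict[str, list[str]]]:
    """Split workspaces by whether they sit in the metastore's region.

    workspaces maps a workspace name to its region.
    Returns (workspaces that can attach, other regions -> workspaces there).
    """
    can_attach = []
    other_regions = defaultdict(list)
    for name, region in sorted(workspaces.items()):
        if region == metastore_region:
            can_attach.append(name)
        else:
            other_regions[region].append(name)
    return can_attach, dict(other_regions)


# Hypothetical workspaces and regions, for illustration only.
workspaces = {
    "ws-dev": "eastus",
    "ws-qa": "eastus",
    "ws-prod": "eastus",
    "ws-emea": "westeurope",
}
can_attach, other_regions = plan_metastore_attachment(workspaces, "eastus")
print("Attach to the eastus metastore:", can_attach)
print("Migrate, or create a metastore and share via Delta Sharing:", other_regions)
```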

There is a new account console to manage workspaces, users, groups and service principals

The new account console is powerful because it lets organizations manage all workspaces, users, groups and service principals in one place. Organizations can enable SCIM provisioning to create groups in their identity management solution and sync them to the account console, from where they can be assigned to workspaces.
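To give a feel for what a SCIM sync carries, the sketch below builds a group resource in the shape defined by the SCIM 2.0 core schema. In practice the identity provider generates and sends these requests, so this is purely illustrative. The group name and member IDs are hypothetical, and the endpoint that receives the payload is deliberately left out.

```python
import json

# Schema URN for a group resource in the SCIM 2.0 core schema.
SCIM_GROUP_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:Group"


def scim_group_payload(display_name: str, member_ids: list[str]) -> dict:
    """Build a SCIM 2.0 group resource with the given members."""
    return {
        "schemas": [SCIM_GROUP_SCHEMA],
        "displayName": display_name,
        "members": [{"value": member_id} for member_id in member_ids],
    }


# Hypothetical group and user IDs, for illustration only.
payload = scim_group_payload("sales-data-engineers", ["1001", "1002"])
print(json.dumps(payload, indent=2))
```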

A metastore admin manages objects in the metastore

The metastore administrator role can be assigned to a group. The metastore administrator can do the initial setup of catalogs and schemas and then grant application teams the access they need to create objects under a catalog. Ownership can be centralized, with metastore administrators managing all objects, or decentralized, with the administrators creating an initial catalog and delegating its management to application/domain teams.
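The two ownership models can be sketched as SQL that a metastore administrator might run. This is a minimal illustration under our own assumptions, not a prescribed setup: the catalog and group names are hypothetical, and the set of privileges granted in the centralized case is just one reasonable choice.

```python
def quote(identifier: str) -> str:
    """Wrap an identifier in backticks, escaping any embedded backtick."""
    return "`" + identifier.replace("`", "``") + "`"


def delegate_catalog_sql(catalog: str, team_group: str, centralized: bool) -> list[str]:
    """Return statements that create a catalog and hand it to a team.

    centralized=True: admins keep ownership, the team can use the catalog
    and create schemas in it.
    centralized=False: ownership of the catalog is transferred to the team.
    """
    statements = [f"CREATE CATALOG IF NOT EXISTS {quote(catalog)}"]
    if centralized:
        statements.append(
            f"GRANT USE CATALOG, CREATE SCHEMA ON CATALOG {quote(catalog)} "
            f"TO {quote(team_group)}"
        )
    else:
        statements.append(
            f"ALTER CATALOG {quote(catalog)} OWNER TO {quote(team_group)}"
        )
    return statements


# Hypothetical catalog and group names, for illustration only.
for statement in delegate_catalog_sql("sales", "sales-engineers", centralized=True):
    print(statement)
for statement in delegate_catalog_sql("finance", "finance-engineers", centralized=False):
    print(statement)
```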

Now that the metastore is set up, it is important to understand how objects are organized in Unity Catalog. The object organization is particularly important because there is one metastore per region, so the DEV, QA and PROD objects for all workspaces reside in the same metastore.

Below is a sample organization for a domain. This is only an example, and objects can also be organized by environment, team, sandbox and so on. However, based on Tredence's experience, we suggest organizing objects by domain/application as the preferred approach.

This structure gives organizations the flexibility to build objects centrally and assign permissions at the catalog, database and object levels. This is a new and powerful capability added to the already scalable Databricks Lakehouse platform. We will cover it in the governance part of this Unity Catalog series.

Below are a few options for creating the SDLC environments

Below is an example of how we can organize the catalogs by domain, which enables governing data access by domain

Below is a snapshot of how it appears in Databricks
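One way to picture a domain-based layout is to generate it. The sketch below assumes a naming convention of one catalog per domain and environment (`<domain>_<env>`) with a fixed set of schemas in each. The convention, the domain names and the schema names are our own assumptions for illustration, and other layouts are equally valid.

```python
from itertools import product


def catalog_layout_sql(
    domains: list[str], environments: list[str], schemas: list[str]
) -> list[str]:
    """Return CREATE statements for one catalog per (domain, environment) pair."""
    statements = []
    for domain, env in product(domains, environments):
        catalog = f"{domain}_{env}"
        statements.append(f"CREATE CATALOG IF NOT EXISTS {catalog}")
        for schema in schemas:
            statements.append(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
    return statements


# Hypothetical domains and schemas, for illustration only.
statements = catalog_layout_sql(
    domains=["sales", "finance"],
    environments=["dev", "qa", "prod"],
    schemas=["bronze", "silver", "gold"],
)
for statement in statements:
    print(statement)
```

With this convention, a table is addressed by its three-level name, for example `sales_prod.gold.orders`, and access can be granted per domain at the catalog level.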

To summarize, Unity Catalog is a centralized solution for organizing and governing objects. It lets an organization understand which workspaces are used by which users and groups, and govern access to workspaces and to the objects and data in the metastore.

Now that you have a fair idea of the setup and object organization, stay tuned for the next chapter, where we discuss different cluster types.

 
