To set up Unity Catalog for Databricks, first set up a storage account that will host the Unity Catalog metastore. The metastore is like a Hive metastore, except that it is global in nature: it is shared rather than workspace-local, so every workspace enabled for Unity Catalog can access the same metastore.
How To Set Up Unity Catalog?
The Databricks documentation covers the steps in detail; the list below gives a high-level view of the steps on Azure Databricks. Setup may vary slightly on AWS and GCP.
Step 1 – Set up a storage account, preferably in your production subscription, since this is where objects in the metastore will be created
Step 2 – Create a Databricks access connector from the cloud marketplace and assign it a managed identity
Step 3 – Grant the access connector access to the storage account with the Storage Blob Data Contributor role
Step 4 – Log in to the Databricks account console (accounts.azuredatabricks.net) as a global administrator
Step 5 – Assign the Databricks account administrator role to another user to manage Databricks
Step 6 – Create a metastore in the region and assign the metastore administrator role to users/groups
Step 7 – Attach workspaces to the metastore to enable them for Unity Catalog
Step 8 – Set up sync of users, groups and service principals to the account console through a SCIM connector
Step 9 – Assign users/groups to the workspaces from the account console
Step 10 – Log in to the workspace and create external locations if data needs to be stored as external tables
Again, all these steps are detailed in the documentation. Please see:
Get started using Unity Catalog - Azure Databricks | Microsoft Learn
Get started using Unity Catalog | Databricks on AWS
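As an illustration of Step 10, the sketch below builds a CREATE EXTERNAL LOCATION statement as a plain string, which could then be run from a notebook or the SQL editor. It assumes a storage credential backed by the access connector already exists; the location name, storage account, container and credential name are all hypothetical placeholders, not values from the steps above.

```python
def quote_ident(name: str) -> str:
    """Backtick-quote a SQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def external_location_sql(name: str, url: str, credential: str) -> str:
    """Build a CREATE EXTERNAL LOCATION statement for an ADLS Gen2 path."""
    if not url.startswith("abfss://"):
        raise ValueError("expected an abfss:// URL")
    if "'" in url:
        raise ValueError("URL must not contain single quotes")
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {quote_ident(name)} "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL {quote_ident(credential)})"
    )


# Hypothetical names, for illustration only.
statement = external_location_sql(
    name="sales_landing",
    url="abfss://landing@examplestorage.dfs.core.windows.net/sales",
    credential="uc_access_connector_cred",
)
print(statement)
```

Generating the statement from a function keeps the naming convention in one place when many external locations are needed.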
Important Considerations While Setting Up Databricks Unity Catalog
There is only one metastore per region.
This means that all workspaces in the same region must connect to the same metastore. Where Databricks workspaces span regions, it is recommended to migrate the workspaces to the metastore region so that they can share the metastore and its objects. If this is not feasible, a separate metastore needs to be created in the other region, and objects should be shared across regions through the Lakehouse Platform’s Delta Sharing. This is set up by a Databricks administrator who has the rights to create recipients and shares.
The Databricks account administrator manages the metastores: setting them up in different regions and assigning users and groups to the account console and workspaces.
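The cross-region sharing described above can be sketched as a short sequence of Delta Sharing SQL statements: create a share, add tables to it, create a recipient for the other region's metastore, and grant the recipient access. The Python below only assembles the statements as strings. The share, table and recipient names are hypothetical, and the sharing identifier is a placeholder for the value the receiving metastore's administrator would supply.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Backtick-quote a SQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def quote_table(full_name: str) -> str:
    """Quote each part of a catalog.schema.table name."""
    parts = full_name.split(".")
    if len(parts) != 3:
        raise ValueError("expected catalog.schema.table")
    return ".".join(quote_ident(part) for part in parts)


def delta_sharing_sql(
    share: str, tables: List[str], recipient: str, sharing_id: str
) -> List[str]:
    """Return the statements that share tables with another metastore."""
    if "'" in sharing_id:
        raise ValueError("sharing identifier must not contain single quotes")
    statements = [f"CREATE SHARE IF NOT EXISTS {quote_ident(share)}"]
    for table in tables:
        statements.append(
            f"ALTER SHARE {quote_ident(share)} ADD TABLE {quote_table(table)}"
        )
    statements.append(
        f"CREATE RECIPIENT IF NOT EXISTS {quote_ident(recipient)} "
        f"USING ID '{sharing_id}'"
    )
    statements.append(
        f"GRANT SELECT ON SHARE {quote_ident(share)} "
        f"TO RECIPIENT {quote_ident(recipient)}"
    )
    return statements


# Hypothetical names and a placeholder sharing identifier.
statements = delta_sharing_sql(
    share="sales_share",
    tables=["sales_prod.gold.orders"],
    recipient="emea_metastore",
    sharing_id="azure:westeurope:00000000-0000-0000-0000-000000000000",
)
for statement in statements:
    print(statement)
```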
There is a new account console to manage workspaces, users, groups and principals
The new account console is powerful because it lets organizations manage all workspaces, users, groups and service principals in one place. Organizations can enable SCIM provisioning to create groups in their identity management solution and sync them to the account console, from where they can be assigned to workspaces.
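In practice the identity provider's SCIM connector does the syncing, so no hand-written code is needed. Purely to illustrate what such a sync sends, the sketch below builds a generic SCIM 2.0 group resource using the core Group schema from RFC 7643. The group name and member ids are hypothetical, and the exact endpoint and supported attributes should be checked against the Databricks documentation.

```python
import json
from typing import Dict, List

# Core Group schema URN defined by the SCIM 2.0 specification (RFC 7643).
SCIM_GROUP_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:Group"


def scim_group_payload(display_name: str, member_ids: List[str]) -> Dict[str, object]:
    """Build a generic SCIM 2.0 group resource with the given members."""
    return {
        "schemas": [SCIM_GROUP_SCHEMA],
        "displayName": display_name,
        "members": [{"value": member_id} for member_id in member_ids],
    }


# Hypothetical group and member ids, for illustration only.
payload = scim_group_payload("data-engineers", ["user-id-1", "user-id-2"])
print(json.dumps(payload, indent=2))
```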
Metastore admin to manage objects in metastore
A metastore administrator role can be assigned to a group. The metastore administrator can do the initial setup of catalogs and schemas, and then grant application teams the access they need to create objects under a catalog. Ownership can be centralized and managed by the metastore administrators, or decentralized: an initial catalog is created and its management is delegated to application/domain teams.
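The two ownership models can be sketched in SQL terms. In the centralized model the metastore administrators keep ownership of the catalog and grant the team only what it needs to create schemas; in the decentralized model ownership of the catalog is transferred to the team's group. The Python below assembles the statements as strings; the catalog and group names are hypothetical, and the choice of privileges is one reasonable reading of the paragraph above rather than a prescribed setup.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Backtick-quote a SQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def catalog_setup_sql(catalog: str, team_group: str, decentralized: bool) -> List[str]:
    """Return statements to create a catalog and hand it to a team."""
    statements = [f"CREATE CATALOG IF NOT EXISTS {quote_ident(catalog)}"]
    if decentralized:
        # Delegate management: the team's group becomes the catalog owner.
        statements.append(
            f"ALTER CATALOG {quote_ident(catalog)} "
            f"OWNER TO {quote_ident(team_group)}"
        )
    else:
        # Keep ownership central: the team may only use the catalog
        # and create schemas in it.
        statements.append(
            f"GRANT USE CATALOG, CREATE SCHEMA ON CATALOG {quote_ident(catalog)} "
            f"TO {quote_ident(team_group)}"
        )
    return statements


# Hypothetical catalog and group names.
centralized = catalog_setup_sql("sales_dev", "sales-engineers", decentralized=False)
delegated = catalog_setup_sql("sales_dev", "sales-engineers", decentralized=True)
for statement in centralized + delegated:
    print(statement)
```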
Now that the metastore is set up, it is important to understand object organization in Unity Catalog. Object organization is particularly important because there is one metastore per region, so the DEV, QA and PROD objects for all workspaces will reside in the same metastore.
Below is a sample organization for a domain. This is only an example; objects can also be organized by environment, team, sandbox and so on. However, based on Tredence's experience, we suggest organizing objects by domain/application as the preferred approach.
This structure gives organizations the flexibility to build objects centrally and assign permissions at the catalog, database (schema) and object levels. This is a new and powerful capability in the already powerful and scalable Databricks Lakehouse Platform. We will cover this in the data governance part on Unity Catalog.
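To make the three permission levels concrete: for a group to read a single table, it needs USE CATALOG on the catalog, USE SCHEMA on the schema and SELECT on the table. The sketch below derives those three grants from a fully qualified table name. The table and group names are hypothetical.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Backtick-quote a SQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def read_grants_sql(table_full_name: str, principal: str) -> List[str]:
    """Return the catalog, schema and table grants needed to read one table."""
    parts = table_full_name.split(".")
    if len(parts) != 3:
        raise ValueError("expected catalog.schema.table")
    catalog, schema, table = (quote_ident(part) for part in parts)
    grantee = quote_ident(principal)
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {grantee}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {grantee}",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO {grantee}",
    ]


# Hypothetical table and group names.
grants = read_grants_sql("sales_prod.gold.orders", "sales-analysts")
for statement in grants:
    print(statement)
```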
Below are a few options for creating the environments under the SDLC
Below is an example of how catalogs can be organized by domain, which makes it possible to govern data access by domain
Below is a snapshot of how it appears in Databricks
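One way to picture a domain-based layout is as a naming convention over Unity Catalog's three-level namespace (catalog.schema.table). The sketch below assumes one catalog per domain and environment, named domain_environment. This convention, and the domain, schema and table names, are hypothetical illustrations, not the exact layout referred to above.

```python
from typing import List, Sequence

ENVIRONMENTS = ("dev", "qa", "prod")


def catalogs_by_domain(
    domains: Sequence[str], environments: Sequence[str] = ENVIRONMENTS
) -> List[str]:
    """List one catalog name per domain and environment."""
    return [f"{domain}_{env}" for domain in domains for env in environments]


def table_full_name(domain: str, env: str, schema: str, table: str) -> str:
    """Compose the three-level name of a table under this convention."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    return f"{domain}_{env}.{schema}.{table}"


# Hypothetical domains, schema and table.
catalogs = catalogs_by_domain(["sales", "finance"])
orders_prod = table_full_name("sales", "prod", "gold", "orders")
print(catalogs)
print(orders_prod)
```

Because DEV, QA and PROD objects share one metastore, encoding the environment in the catalog name keeps them apart while still letting access be granted per domain.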
To summarize, Unity Catalog is a centralized object storage and governance solution that empowers organizations to understand which workspaces are used by which users/groups, and to govern access to workspaces and to the objects and data in the metastore.
Now that you have a fair idea about the set up and object organization, stay tuned for the next chapter, where we discuss different cluster types.
AUTHOR
Maulik Divakar Dixit
Director, Data Engineering, Databricks Champion