Building a Zero Trust Architecture with System Tables on Unity Catalog

Data Engineering

Date: 06/25/2024

Learn how to build a Zero Trust Architecture using System Tables on Unity Catalog. Includes operational auditing, tracking sensitive data, and monitoring billing.

Priyanshu Kumar
Senior Manager, Data Engineering


Around May 2023, Databricks launched a much-awaited centralized operational data store that quickly garnered attention for all the right reasons. While everyone was excited to finally capture end-to-end entity relationships for all objects (tables, notebooks, workflows, DLT pipelines, dashboards, and models), the auditing capabilities truly made it a backend store for monitoring all operational aspects. We can now track every action a user performs on those objects, whether from inside the workspace or outside. That is the bigger breakthrough for creating a near-perfect Zero Trust Architecture (ZTA).

Before we delve deep into ZTA, let us first understand the overall capabilities of system tables in some depth. At the time of writing, the following tables[1] are available in the “system” catalog:

| Category | Description | Table Name |
| --- | --- | --- |
| Audit logs | Captures all audit events from workspaces. | system.access.audit |
| Clusters | Captures the full history of configuration changes over time for any cluster. | system.compute.clusters |
| SQL warehouse events | Captures events related to SQL warehouses. | system.compute.warehouse_events |
| Billable usage | Captures DBU consumption across your account. | system.billing.usage |
| Pricing | Captures a historical log of SKU pricing. | system.billing.list_prices |
| Table lineage | Captures read or write events on a Unity Catalog table or path. | system.access.table_lineage |
| Column lineage | Captures read or write events on a column. | system.access.column_lineage |
| Node types | Captures the currently available node types with their basic hardware information. | system.compute.node_types |
| Marketplace funnel events | Includes consumer impression and funnel data for your listings. | system.marketplace.listing_funnel_events |
| Marketplace listing access | Includes consumer info for completed request data or get data events on your listings. | system.marketplace.listing_access_events |
| Predictive optimization | Tracks the operation history of the predictive optimization feature. | system.storage.predictive_optimization_operations_history |

For this blog, we will restrict our analysis to the five tables below, which serve as the base for operational auditing.

  1. system.access.audit
  2. system.compute.clusters
  3. system.compute.warehouse_events
  4. system.billing.usage
  5. system.billing.list_prices

Operational Auditing

The “audit” table in the “access” schema holds the entire universe of operations logged at the workspace level; any interaction with any object is tracked in this table. As of today, Databricks tracks 409 distinct actions (such as login, downloads, and command execution) across thirty-five unique services (such as Unity Catalog, DBSQL, notebooks, and jobs). Let us look at some interesting analyses this table can help with.
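To see which services and actions your own account currently logs (the exact counts will vary by cloud, region, and release), a quick aggregation over the audit table works:

```sql
-- List audited services with the number of distinct actions each emits
SELECT service_name,
       count(DISTINCT action_name) AS distinct_actions
FROM system.access.audit
GROUP BY service_name
ORDER BY distinct_actions DESC
```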

Tracking sensitive data usage in the past 90 days using table name

1. From Job


select event_time, user_identity.email, action_name, "Jobs" as Entity, request_params["jobId"] as job_id
from system.access.audit
where event_date > now() - interval '90 day'
  and service_name = 'jobs'
  and action_name = "runCommand"
  and request_params["commandText"] like '%<>%'

2. From Notebook


select event_time, user_identity.email, action_name, "Notebook" as Entity, request_params["notebookId"] as notebook_id
from system.access.audit
where event_date > now() - interval '90 day'
  and service_name = 'notebook'
  and action_name = "runCommand"
  and request_params["commandText"] like '%<>%'

3. From DBSQL


select event_time, user_identity.email, action_name, "Databricks SQL" as Entity, request_params["warehouseId"] as warehouse_id
from system.access.audit
where event_date > now() - interval '90 day'
  and service_name = 'databrickssql'
  and action_name = "commandSubmit"
  and request_params["commandText"] like '%<>%'

4. From Catalog Explorer


select event_time, user_identity.email, action_name, "Unity Catalog Explorer" as Entity, request_params["full_name_arg"] as table_name
from system.access.audit
where event_date > now() - interval '90 day'
  and action_name IN ('createTable', 'getTable', 'deleteTable')
  and request_params["full_name_arg"] = '<>'


Tracking sensitive data usage in the past 90 days using tags

If tables or columns carry tags that classify them as sensitive, monitoring your entire lakehouse becomes even simpler.

Example: if a table has been marked with the tag “sensitive:yes”, or a column with the tag “pii:yes”, this tag information can be used to see who has accessed all such tables by slightly modifying the previous queries, as shown below.


select action_name, event_time, user_identity.email
from system.access.audit
where event_date > now() - interval '90 day'
  and action_name IN ('createTable', 'getTable', 'deleteTable')
  and request_params["full_name_arg"] IN (
    select concat_ws('.', catalog_name, schema_name, table_name)
    from system.information_schema.column_tags
    where tag_name = "pii" and tag_value = "yes"
    UNION
    select concat_ws('.', catalog_name, schema_name, table_name)
    from system.information_schema.table_tags
    where tag_name = "sensitive" and tag_value = "yes"
  )

Currently, there is no built-in way to automatically tag tables or columns by scanning sample data, as some governance tools in the market do. However, open-source libraries like Presidio, or the Databricks AI function ai_classify, can automate this programmatically.
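As an illustration, here is a minimal sketch using ai_classify to flag likely PII in a sampled column; the table name, column name, and label set are hypothetical placeholders:

```sql
-- Classify a sample of values from a candidate column as pii / not_pii.
-- main.sales.customers and customer_note are illustrative placeholders.
SELECT value,
       ai_classify(value, ARRAY('pii', 'not_pii')) AS classification
FROM (
  SELECT customer_note AS value
  FROM main.sales.customers
  LIMIT 100
)
```

The results of such a scan could then feed an ALTER TABLE ... SET TAGS statement to apply the “pii” tag programmatically.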

Tracking privileged users/groups

More often than not, we grant users or groups elevated access to support a POC or an interim use case. However, that access often goes untracked in the long run, leaving some users with more privileges than they need. There is a very easy way to track this at all securable object levels: metastore, catalog, schema, table, external location, storage credential, volume, and Delta share.

Sample query to track privileges across all catalogs and schemas, sorted by the users/groups holding the most privileges:


SELECT grantee, 'catalog' AS securable_type, count(*) AS total
FROM system.information_schema.catalog_privileges
GROUP BY 1, 2
UNION ALL
SELECT grantee, 'schema' AS securable_type, count(*) AS total
FROM system.information_schema.schema_privileges
GROUP BY 1, 2
ORDER BY 2, 3 DESC

[This can be further drilled down to exact privileges using the ‘privilege_type’ column.]

Tracking unauthorized logins per user


SELECT event_date,
       ifnull(user_identity.email, request_params.user) AS email,
       collect_set(action_name) AS action_names,
       collect_set(response.error_message) AS error_messages,
       collect_set(response.status_code) AS response_codes,
       count(*) AS total
FROM system.access.audit
WHERE action_name IN ('aadBrowserLogin', 'aadTokenLogin', 'certLogin', 'jwtLogin', 'login', 'oidcBrowserLogin', 'samlLogin', 'tokenLogin')
  AND response.status_code IN (401, 403)
GROUP BY 1, 2
ORDER BY total DESC

Tracking unauthorized data access requests


select event_date, user_identity.email, source_ip_address, response.error_message, audit_level, action_name
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND response.status_code IN (401, 403)

Tracking hard-coded passwords written in notebooks using a regex pattern


select event_time, user_identity.email, service_name, request_params.commandLanguage as Language, request_params.commandText as command, request_params.notebookId as Notebook_id
from system.access.audit
where action_name = "runCommand"
  and request_params["commandText"] rlike '(?i)(password|pswd|psswd)'

Tracking downloads from the workspace


SELECT event_date, user_identity.email,
       collect_set(service_name) AS service_names,
       collect_set(action_name) AS action_names,
       count(*) AS total
FROM system.access.audit
WHERE action_name IN ('downloadPreviewResults', 'downloadLargeResults', 'filesGet',
                      'getModelVersionDownloadUri', 'getModelVersionSignedDownloadUri',
                      'workspaceExport', 'downloadQueryResult')
GROUP BY 1, 2

Billing Monitoring

The “billing” schema currently has two tables: “usage” and “list_prices.”

The usage table provides vital information for tracking your compute consumption patterns. Illustrative examples are shown below.

Track scheduled jobs cost


select YearMonth, sku_name,
       sum(usage_quantity) as Total_DBU_Consumed,
       sum(list_cost) as Total_List_Cost
from (
  select
    usage.sku_name,
    date_format(usage_date, 'yyyy-MM') as YearMonth,
    usage_quantity,
    list_prices.pricing.default * usage_quantity as list_cost
  from system.billing.usage
  left join system.billing.list_prices on usage.sku_name = list_prices.sku_name
  where usage_metadata.job_id is not null
)
group by YearMonth, sku_name


Track scheduled jobs cost on all-purpose compute

Interactive (All-Purpose) compute is meant to be used during the development cycle. Once a solution is developed, it should be moved to job clusters as a best practice to reduce costs. The query below can help estimate the cost incurred by jobs running on interactive clusters.


with created_jobs as (
  select get_json_object(response.result, '$.job_id') as job_id
  from system.access.audit
  where action_name = 'create'
    and service_name = 'jobs'
    and response.status_code = 200
),
deleted_jobs as (
  select request_params.job_id
  from system.access.audit
  where action_name = 'delete'
    and service_name = 'jobs'
    and response.status_code = 200
)
select sku_name, sum(dbus) as total_dbus, sum(list_cost) as total_list_cost
from (
  select a.sku_name,
         d.pricing.default,
         a.usage_quantity as dbus,
         a.usage_quantity * d.pricing.default as list_cost
  from system.billing.usage a
  left join (
    select * from created_jobs b
    where b.job_id not in (select job_id from deleted_jobs)
  ) active_jobs
    on a.usage_metadata.job_id = active_jobs.job_id
  left join system.billing.list_prices d on a.sku_name = d.sku_name
  where contains(a.sku_name, 'ALL_PURPOSE')
    and a.usage_metadata.job_id is not null -- restrict to job runs only
)
group by all


DBU pricing 

To view the standard rates offered across all regions and for all SKUs (all-purpose clusters, job clusters, SQL warehouses, DLT, and model training), use the query below and substitute the SKU you are interested in.


SELECT price_start_time, sku_name, currency_code, pricing.default as cost
FROM system.billing.list_prices
where sku_name = "PREMIUM_JOBS_COMPUTE" and price_end_time IS NULL

You can also track the historical price changes using the price_start_time and price_end_time columns.
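For example, here is a sketch of such a history query for a single SKU (the SKU name is illustrative; the current price is the row where price_end_time is NULL):

```sql
-- Pricing history for one SKU, oldest first
SELECT sku_name, currency_code, pricing.default AS cost,
       price_start_time, price_end_time
FROM system.billing.list_prices
WHERE sku_name = 'PREMIUM_JOBS_COMPUTE'
ORDER BY price_start_time
```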

Be aware that this data does not reflect the discounts applied to your account. We hope Databricks comes up with a way of addressing this in the future.

Features to look out for

While system tables have built good momentum for auditing, monitoring, and lineage capabilities, we are curious to see whether Databricks will add features such as:

  1. Tracking lineage information for hive_metastore-based tables
  2. An automated classifier for sensitive data at the table and column level
  3. Tracking pricing per the discounts applied to a customer’s subscription
  4. Tracking Delta Live Tables clusters

References:

  1. https://docs.databricks.com/en/admin/system-tables/index.html