
In 2026, Databricks compute is no longer something users build and manage themselves. Instead of users creating clusters and picking machine sizes, compute works as a smart, governed engine built into the platform. It automatically starts, scales, and optimizes resources while applying security and data access rules through Unity Catalog. For example, when a data analyst runs a SQL query on a sales table, Databricks automatically selects the right compute, ensures the analyst sees only permitted data, and scales performance as needed, all without the analyst worrying about infrastructure settings.

In the early 2020s, teams spent a lot of time deciding how big a cluster should be, choosing instance types, and tuning performance to control costs. Now the focus has shifted to governance and safety. In 2026, organizations ask whether compute is secure, cost-controlled, and compliant by default. For instance, a data engineering job processing customer data will automatically follow company policies for data access, logging, and cost limits. This unified approach brings data, compute, and identity under one control plane, making everyday work easier for users while giving administrators clear control and visibility from the start.

This blog covers:

  • Modern cluster access modes (Dedicated, Standard, Serverless)
  • Unity Catalog storage abstractions (Volumes, External Locations)
  • Next-gen optimization (Liquid Clustering, Predictive Optimization)
  • Real-world limitations encountered in production at Tredence
  • Practical migration guidance

1. From Clusters to Governed Clusters

The Old World (Pre‑2024)

Earlier, a Databricks admin’s daily work was heavily focused on cluster management. This included selecting the right instance types, configuring autoscaling, setting idle timeouts, and frequently debugging Spark configurations to keep jobs stable and performant. Security was not built into the platform by default and instead was added later using cluster‑level ACLs. Cost control was largely reactive, meaning overspending was usually discovered only after cloud bills were generated.

In 2023, cluster setup typically involved manually defining every aspect of a cluster, as shown below:

# 2023: Manual cluster configuration
cluster_config = {
    "cluster_name": "analytics-team-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,
    "azure_attributes": {"availability": "ON_DEMAND_AZURE"},
}

In 2023, a Unity Catalog-enabled workspace offered three cluster access modes:

  • Single User: The most secure way of accessing data in a Unity Catalog-enabled workspace. The cluster is designed to be used by a single user, and that user's permissions on external locations and files apply on the cluster.
  • Shared: A cluster shared across multiple users that works with Unity Catalog. It has some limitations, which are explained later in this blog.
  • No Isolation Shared: Works only with the legacy Hive metastore, to enable legacy data access and processing for objects in the local Hive metastore. Permissions set on Unity Catalog objects are not enforced in this mode.

Challenges with This Approach

This model created several ongoing challenges. Every team typically requested its own dedicated cluster, which led to large amounts of idle compute and rapidly escalating costs.

Security became fragmented, with a mix of cluster ACLs and table ACLs that did not work together cleanly across environments.

Finally, there was no centralized audit trail to answer basic governance questions such as who accessed which data, using which cluster, and under what permissions—making compliance and accountability difficult at scale.

The New World (2025–2026)

Between 2025 and 2026, Databricks compute became much simpler for users because the platform now manages it automatically. Instead of people configuring clusters, choosing machines, or worrying about scaling, Unity Catalog acts as a central control system for data, compute, and user access. Administrators define clear policies in advance to control who can run what, how secure it is, and how much it can cost, while serverless compute runs by default. Users just run their queries or jobs, and Databricks takes care of provisioning, scaling, security, and cost limits in the background in a predictable and governed way. For example, in 2026, an admin defines a standard analytics compute policy along these lines, without any cluster configuration exposed to users:

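-- 2026: Compute is policy-defined, not user-configured
-- Serverless is the default - no cluster configuration needed
-- Illustrative pseudo-SQL only: Databricks compute policies are defined as JSON
-- (through the UI, REST API, or Terraform), not through a SQL statement
CREATE COMPUTE POLICY analytics_standard
  DEFINITION (
    "access_mode" = "STANDARD",
    "serverless" = true,
    "max_dbu_per_hour" = 100
  );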
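Compute policies are normally expressed as a JSON definition rather than SQL, so the intent above can be sketched roughly as follows. This is a minimal illustration, not the author's actual policy: the attribute choices (`data_security_mode`, `dbus_per_hour`, `autotermination_minutes`) and their values are assumptions picked to mirror the access mode and DBU cap described above.

```python
import json

# Hypothetical policy definition mirroring the intent described above.
# The attributes and limits are assumptions chosen for illustration.
policy_definition = {
    # Pin the access mode to Standard (formerly Shared)
    "data_security_mode": {"type": "fixed", "value": "USER_ISOLATION"},
    # Cap the hourly DBU burn of any compute created under this policy
    "dbus_per_hour": {"type": "range", "maxValue": 100},
    # Force idle compute to shut down and hide the setting from users
    "autotermination_minutes": {"type": "fixed", "value": 30, "hidden": True},
}

# The policy is submitted as a name plus the definition serialized to a JSON string
policy_payload = {
    "name": "analytics_standard",
    "definition": json.dumps(policy_definition),
}

print(json.dumps(policy_payload, indent=2))
```

The payload could then be submitted through whichever admin tooling the workspace already uses; the point is only that limits live in a reviewed policy document instead of in each user's cluster settings.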

 

 

Figure: Cluster-First vs. Governed-First

 

For the configuration process, see the Databricks documentation on Compute Configuration.

 

2. The Three Compute Access Modes

A Unity Catalog–enabled workspace offers three compute access modes that control who can use compute, which data they can access, what languages they are allowed to run, and how isolation is enforced. These modes are not just an administrative detail; they directly affect how data engineers and ML engineers work in production. If the wrong access mode is chosen, teams often face confusing permission errors, unexpected data access behavior, silent data quality problems, or pipeline failures that are difficult to troubleshoot without understanding how isolation and governance really work.

A. Dedicated Access Mode (formerly Single User)

Dedicated access mode is like giving one person or one team their own locked room to work in, but with the building’s security rules still fully enforced. For example, imagine a data science team running GPU‑heavy model training: they get a dedicated Databricks cluster just for their group, so no one else can interfere and they can freely use Scala, R, custom Spark configs, or GPUs. However, when the job reads a customer table or writes results to storage, Databricks uses each user’s real identity (or the group’s identity) and checks permissions through Unity Catalog every time, just like a shared environment. This is different from old single‑user clusters, which often ran under one generic service account and couldn’t truly track or restrict who accessed what. The compute is private, the security is still centralized and strict—you get maximum flexibility without losing governance or auditability.

Below is an example of how an admin might configure dedicated compute for an ML engineering team:

# Dedicated compute for ML team (admin-configured)
dedicated_config = {
    "cluster_name": "ml-engineering-gpu",
    "spark_version": "16.4.x-gpu-ml-scala2.12",
    "node_type_id": "Standard_NC6s_v3",
    "data_security_mode": "SINGLE_USER",
    "single_user_name": "ml-engineering-group",
    "num_workers": 4,
}

Key behavior: when a cluster is assigned to a group, Databricks only allows what both the user and the group are allowed to access—not everything either one can access. Think of it like airport security: even if you have a boarding pass for 50 destinations, if you enter through a gate meant for a specific tour group, you can only go where that tour group is allowed. So if your personal account has access to 50 tables, but the group is allowed only 10 of them, you’ll see only those 10 while using that cluster. Many teams expect the permissions to be combined (user + group), but Unity Catalog intentionally uses this stricter “common-only” rule to prevent someone from secretly gaining extra access just by using a shared group or compute.
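The "common-only" rule described above can be pictured as a plain set intersection. The sketch below uses made-up table names and a hypothetical helper, `effective_access`; it models the behavior as this post describes it and is not a call into any Databricks API.

```python
# Hypothetical grants, purely for illustration
user_tables = {f"sales.table_{i}" for i in range(50)}   # user can read 50 tables
group_tables = {f"sales.table_{i}" for i in range(10)}  # group can read 10 of them


def effective_access(user_grants, group_grants):
    """Tables visible on a group-assigned Dedicated cluster, per the rule above."""
    # Intersection, not union: only tables both identities may read
    return user_grants & group_grants


visible = effective_access(user_tables, group_tables)
print(len(visible))  # 10
```

If permissions were combined instead, the result would be the union (50 tables here), which is exactly the escalation the stricter rule is meant to avoid.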

Where we used it:

We started by using Standard (shared) compute for everything, thinking it was the default and safest choice. But very quickly, our data ingestion pipelines failed with access errors.

The reason: Standard mode is meant for working with governed tables, not raw storage files. It blocks direct access to cloud storage paths like ADLS containers.

The big takeaway was this: any part of a pipeline that directly reads from or writes to cloud storage needs Dedicated mode, because it requires full filesystem access. But once the data is written into Unity Catalog–registered tables, everything downstream can safely and smoothly run on Standard mode. In short, use Dedicated mode for raw files in cloud storage, and switch to Standard mode once the data is governed and inside Unity Catalog.

Below are a few limitations that we faced with Dedicated Access Mode:

Cost hit us immediately:

Dedicated clusters are isolated by design, so each one incurs full compute cost even when idle. Creating many small Dedicated clusters is often more expensive than sharing a few well‑governed ones.

Our fix: We consolidated to 2 shared Dedicated clusters — one for interactive development (auto-terminates in 15 min) and one for orchestrated jobs (spins up on demand via Lakeflow Jobs).

Group permission intersection:

When a Dedicated cluster is assigned to a group, Unity Catalog enforces the intersection of user and group permissions. This prevents accidental privilege escalation but can feel restrictive if not understood upfront.

Our fix: We documented the intersection behavior explicitly in our onboarding guide. For engineers who need broader access for exploration, we let them spin up personal Dedicated clusters (small compute, strict auto-termination).

Service principal limitation for interactive clusters

Interactive Dedicated clusters require a human identity because they support notebooks and manual work. Service principals are designed for automation and therefore work only with ephemeral job clusters.

Our fix: For automated jobs, we use job clusters (ephemeral, created per run), which do support service principals. For interactive testing, we created a generic human user account (ingestion-test@tredence.com) as a workaround: not ideal, but pragmatic.

Cluster startup time broke our SLAs

Dedicated clusters offer flexibility but take several minutes to start due to provisioning and initialization. For strict SLAs, this startup cost must be planned for or avoided with always‑on or pooled compute.

Our fix: For time-sensitive small jobs, we kept an always-on interactive cluster with minimal compute. For larger jobs, we accepted the startup cost and built it into SLA calculations. We also explored Databricks Cluster Pools to keep warm instances ready, reducing startup to ~90 seconds.

No cluster sharing means duplicated warm-up costs

Each Dedicated cluster loads libraries, compiles code, and builds cache independently. While this improves isolation, it makes Dedicated mode inefficient for exploratory or collaborative day‑to‑day work.

Our lesson: Dedicated mode is for production pipelines and specific use cases, not for general team development. Day-to-day exploration should use Standard or Serverless.

B. Standard Access Mode (formerly Shared) — Recommended Default

Standard mode is the Databricks-recommended default for 80%+ of workloads. Multiple users share compute, and Unity Catalog enforces table-level permissions transparently. Each user sees only data they are authorized to access.

How it works internally: Standard mode runs a process-level isolation layer within the Spark executors. Each user's code executes in a sandboxed environment where the JVM intercepts all data access calls and validates them against UC grants. This is why certain low-level operations (direct file system access, custom JVM libraries, arbitrary system calls) are restricted — they could bypass the isolation boundary.

-- Standard mode in action: same compute, different access
-- Marketing analyst:
SELECT * FROM tredence_test.gold_demo.campaign_results;  -- Works (has grant)
-- Same analyst tries finance data:
SELECT * FROM tredence_test.gold_demo.revenue_actuals;   -- PERMISSION_DENIED
-- Enforced by UC even on shared compute!

How we landed on Standard mode as our workhorse:

After the early experiment of using Dedicated mode for almost everything, we quickly ran into two problems: confusion and unexpected costs. While Dedicated clusters gave us flexibility, running one per team—or worse, per person—wasn’t sustainable. With more than 25 engineers and analysts to support, we needed a way to let everyone work together without watching costs spiral out of control. That’s when we shifted our focus to Standard (shared) mode. It wasn’t a magic fix, and it took some time to understand its boundaries, but it ultimately became the practical foundation for scaling our platform efficiently.

The turning point came when we realized the Bronze-to-Silver boundary was where Standard mode shines. Our Bronze tables were already UC-registered Delta tables (created by Dedicated mode ingestion pipelines). From that point forward, every transformation — joins, aggregations, window functions, data quality checks — was purely table-to-table. No filesystem access needed. No custom JVM libraries. Just SQL and PySpark operating on governed tables.

We moved our entire Silver and Gold layer processing to a single Standard cluster shared across:

  • 8 data engineers building Silver transformations
  • 6 analytics engineers building Gold aggregation tables
  • 11 business analysts running ad-hoc queries and building dashboards

All 25 users on one cluster, each seeing only the data their UC grants allowed. One analyst in marketing couldn't accidentally query the HR salary table. One engineer in finance couldn't see the marketing campaign data. Zero manual ACL management — all controlled through UC grants at the catalog/schema/table level.

# What our Silver layer pipeline looks like on Standard mode
# Pure table-to-table - no filesystem paths, no external locations
from pyspark.sql.functions import col, when

df_bronze = spark.read.table("tredence_test.bronze_demo.raw_internet_sales")

df_silver = (
    df_bronze
    .filter(col("order_date").isNotNull())
    .withColumn("profit_amount", col("sales_amount") - col("total_product_cost"))
    .withColumn(
        "revenue_tier",
        when(col("sales_amount") > 1000, "High")
        .when(col("sales_amount") > 200, "Medium")
        .otherwise("Low"),
    )
)

df_silver.write.mode("overwrite").saveAsTable(
    "tredence_test.silver_demo.fact_internet_sales_enriched"
)
# Works perfectly on Standard mode - governed table in, governed table out

 

Below are a few limitations that we faced and how we resolved them:

No R or Scala support killed two workstreams initially:

The lack of R and Scala support in Standard mode caused real problems for us early on. Our statistical modeling team had built more than 40 R scripts over three years for forecasting and expected to run them on the shared cluster like before. Once we moved to Unity Catalog, none of them worked—R simply isn’t supported on Standard compute. This forced two weeks of urgent work to rewrite the most important models in Python using tools like statsmodels and prophet. A few lower‑priority R scripts are still waiting to be migrated. The lesson was clear: check your language dependencies before moving to UC. If your teams rely on R or Scala, either plan for Dedicated clusters or budget time upfront for rewriting the code.

Geospatial Libraries Installed but Failed at Runtime

Our team needed GDAL, GeoPandas, and Shapely for route optimization, but on Standard mode the installs looked successful and later failed during imports with confusing missing‑library errors. These libraries depend on system‑level components, which Standard compute does not allow—so the failure appeared only at runtime, not during installation.

Fix: We moved the team to a Dedicated cluster and used an init script to install the required system libraries before installing Python packages. If a Python package depends on system libraries, expect silent failures on Standard mode and plan Dedicated compute upfront.

File Listing Failed on Standard Mode (%fs ls, dbutils.fs.ls)

Our monitoring notebooks were designed to check landing zones every 15 minutes—verifying whether files arrived, how many were present, and their sizes—using %fs ls and dbutils.fs.ls("abfss://..."). After moving to Standard mode, all of these checks started failing with permission errors because shared compute does not allow direct access to external cloud storage paths.

Fix: Instead of rewriting the notebook logic, we restructured our landing zones as Unity Catalog Volumes and updated the file paths to /Volumes/.... Once we did that, the same file‑listing commands worked perfectly. The key lesson was simple: on Standard mode, operational checks must go through UC‑governed objects, not raw storage paths.
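Because Volumes are exposed as ordinary paths, a landing-zone check like the one described above can also be written with standard Python file APIs. The following is a minimal sketch, assuming a hypothetical helper named `summarize_landing_zone`; it is not the monitoring notebook itself, and the Volume path in the trailing comment is only an example.

```python
import os


def summarize_landing_zone(directory):
    """Return (file_count, total_bytes, sorted_names) for files directly under directory."""
    names = []
    total_bytes = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            # Count regular files only; subdirectories are skipped
            if entry.is_file():
                names.append(entry.name)
                total_bytes += entry.stat().st_size
    return len(names), total_bytes, sorted(names)


# Example call on a Volume path (commented out because it needs a Databricks workspace):
# file_count, total_bytes, names = summarize_landing_zone(
#     "/Volumes/tredence_test/bronze_demo/landing_files"
# )
```

A scheduled check can compare the returned count and size against expectations and alert when files are missing or suspiciously small.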

The end result: Once these issues were fixed, it was a big win. Standard mode now runs about 80% of our daily workloads, supporting 25 users on just two shared clusters, and our monthly compute cost dropped by around 60% compared to the earlier Dedicated-per-team setup. The key insight was that Standard mode isn't restrictive by itself; it works extremely well once data is inside Unity Catalog-governed tables. Most limitations only show up at the boundary between raw external storage and UC-managed data, not in day-to-day analytics.

For more on Standard compute and column masking options, see the following link:

https://learn.microsoft.com/en-us/azure/databricks/compute/standard-overview/

C. Serverless Compute — The Implicit Default

Serverless lets you run code without managing any clusters: you just write code and execute it.
Since 2025, it has offered two execution modes. Performance-optimized mode starts in seconds using warm resources; it is best for interactive workloads but more expensive. Standard performance mode (not to be confused with Standard access mode) takes about 4–6 minutes to start, is used for scheduled jobs, and can be up to 70% cheaper.

How it works internally: Serverless compute runs on Databricks-managed infrastructure. You have zero control over node types, instance sizes, or worker counts. The platform decides everything based on workload characteristics. It observes your query patterns and pre-warms resources accordingly. The trade-off: you get instant startup and zero management, but lose the ability to tune performance for edge cases.

# Serverless - NO cluster configuration needed
from pyspark.sql import functions as F

df = spark.read.table("tredence_test.gold_demo.customer_sales_summary")
result = df.groupBy("customer_segment").agg(
    F.sum("total_revenue").alias("segment_revenue"),
    F.count("*").alias("customer_count"),
)
display(result)
# Compute provisioned, scaled, and terminated automatically

 

How we ended up adopting Serverless for our clients

Serverless wasn't part of our original architecture plan. We had Dedicated for ingestion, Standard for transformation — the system was working. Then our finance director asked the question that changed everything: "Why are we paying for a cluster that sits idle 18 hours a day when analysts only query it during business hours?"

The math was brutal. Our Standard analytics cluster ran 24/7 (auto-termination kept getting disabled because analysts complained about 5-minute restarts during meetings). At Standard_E8s_v3 with 4 workers, that was ~$2,800/month in compute alone — even though actual query activity was concentrated into 6 hours per day. The cluster sat idle the other 18 hours.

We piloted Serverless for our 11 business analysts first. The experience was transformative:

  • Zero startup wait. Analysts open a notebook or dashboard, run a query, get results in seconds. No "waiting for cluster..." spinner. No Slack messages asking "is the cluster up?"
  • True pay-per-use. We went from $2,800/month (always-on cluster) to ~$900/month (actual query compute only). 68% cost reduction for the same workload.
  • No admin overhead. Nobody managing autoscaling configs, idle timeouts, or node type selection. The platform handles everything.
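The savings figure above is simple arithmetic on the two monthly costs, and the idle share follows from the 6 active hours per day mentioned earlier. A short worked check, using only the numbers quoted in this section:

```python
def percent_reduction(before, after):
    """Percentage saved when a cost drops from before to after."""
    return (before - after) / before * 100


always_on_monthly = 2800   # always-on Standard cluster, USD per month
serverless_monthly = 900   # usage-based Serverless, USD per month
idle_hours, hours_per_day = 18, 24

print(round(percent_reduction(always_on_monthly, serverless_monthly)))  # 68
print(idle_hours / hours_per_day * 100)  # 75.0 (percent of the day spent idle)
```

The cluster was idle for three quarters of each day, which is why moving to pay-per-use recovered roughly two thirds of the spend.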

# What our analysts' daily workflow looks like now
# They don't even think about "compute" - they just write queries

# Morning revenue check (runs in <3 seconds on serverless)
df = spark.sql("""
    SELECT order_date,
           SUM(sales_amount) AS daily_revenue,
           COUNT(DISTINCT CustomerKey) AS unique_customers
    FROM tredence_test.gold_demo.customer_sales_summary
    WHERE order_date >= current_date() - INTERVAL 7 DAYS
    GROUP BY order_date
    ORDER BY order_date DESC
""")
display(df)

# Ad-hoc deep dive (runs in ~12 seconds - platform auto-scales)
df_segment = spark.sql("""
    SELECT customer_segment, revenue_tier,
           AVG(total_revenue) AS avg_revenue,
           PERCENTILE_APPROX(total_revenue, 0.5) AS median_revenue
    FROM tredence_test.gold_demo.customer_sales_summary
    GROUP BY customer_segment, revenue_tier
""")
display(df_segment)

 

After the analyst pilot succeeded, we extended Serverless to:

  • All SQL warehouse queries — dashboards, scheduled reports, Genie spaces
  • Lightweight ETL jobs — Gold layer aggregations under 50GB that don't need custom configs
  • CI/CD validation runs — unit tests and data quality checks on PR merge

 

Below are the limitations we faced in serverless:

Performance variability nearly cost us a client SLA:

We moved a key executive dashboard to Serverless to save costs, and it worked well at first.
But during peak hours, the same query suddenly slowed down from ~10 seconds to almost a minute, nearly breaking an SLA during a board meeting. Nothing changed in code or data—the slowdown came from shared infrastructure contention in Serverless at busy times.

Fix: We pre‑computed the metrics into a Gold table using a scheduled Lakeflow Job.
Now the dashboard reads a small, ready-made table and runs consistently in under 2 seconds, even during peak load.

# Instead of: dashboard queries a large table in real-time
# We now: pre-compute the aggregation on schedule

# Lakeflow Job (runs at 6 AM daily, serverless standard mode)
df_summary = spark.sql("""
    SELECT business_unit, report_date,
           SUM(revenue) AS total_revenue, COUNT(*) AS txn_count
    FROM tredence_test.silver_demo.fact_transactions
    WHERE report_date >= current_date() - INTERVAL 90 DAYS
    GROUP BY business_unit, report_date
""")
df_summary.write.mode("overwrite").saveAsTable(
    "tredence_test.gold_demo.exec_dashboard_summary"
)

# Dashboard reads the small pre-computed table (<2s, always)
# SELECT * FROM tredence_test.gold_demo.exec_dashboard_summary

Our ML team couldn't use Serverless at all:

Serverless simply didn't work for our ML team, and this wasn't something we could tweak or optimize around. They needed GPUs, CUDA-enabled libraries like XGBoost, custom conda environments, and full MLflow tracking with ADLS artifacts.
What we realized: Serverless supports none of these today, which made it a complete blocker for model training work. We learned to draw a clear line: Serverless is great for data analysis and light transformations.
Anything involving real ML training, custom libraries, or specialized hardware stays on Dedicated clusters.

Spark configs silently ignored — a 2-day mystery:

One of our data engineers migrated a complex pipeline from Standard to Serverless (for the cost savings on scheduled jobs). The pipeline processed 4TB of daily transaction data with a high-cardinality GROUP BY. On Standard mode, it ran in 18 minutes with spark.sql.shuffle.partitions=4000. On Serverless, same code: 55 minutes. No error — it completed successfully, just 3x slower. He spent two days profiling before discovering that spark.conf.set("spark.sql.shuffle.partitions", "4000") was silently ignored on Serverless. The platform used its own adaptive determination (which chose 200 partitions — wildly insufficient for 4TB). 

Fix: That pipeline stayed on Standard mode. Our broader lesson: Any pipeline that depends on specific Spark tuning parameters should NOT be moved to Serverless. If your pipeline only works because of a spark.conf.set() call, Serverless will silently degrade it.
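One way to act on this lesson is a pre-migration scan that flags pipelines containing explicit Spark config overrides before anyone moves them. The helper below is a hypothetical sketch, not an existing Databricks tool; it only does a text search for `spark.conf.set(...)` calls with a literal key, so configs set through variables or cluster settings would be missed.

```python
import re

# Matches spark.conf.set("some.key", ...) and captures the literal key
SPARK_CONF_SET = re.compile(r"""spark\.conf\.set\(\s*["']([^"']+)["']""")


def find_spark_conf_overrides(source_code):
    """Return the Spark config keys that a notebook or script sets explicitly."""
    return SPARK_CONF_SET.findall(source_code)


sample_source = (
    'spark.conf.set("spark.sql.shuffle.partitions", "4000")\n'
    'df = spark.read.table("tredence_test.silver_demo.fact_transactions")\n'
)
print(find_spark_conf_overrides(sample_source))  # ['spark.sql.shuffle.partitions']
```

Any pipeline that returns a non-empty list is a candidate for staying on Standard or Dedicated compute, or at least for a timed comparison run before switching.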

Network isolation broke our Oracle source connector:

We had a pipeline that connected to an on‑prem Oracle database through a private endpoint in our Azure VNet, and it worked perfectly on Dedicated and Standard clusters using VNet injection.
When we tried the same pipeline on Serverless, it failed with connection timeouts. The issue wasn’t the code—it was networking. Serverless runs in Databricks‑managed networks with no access to private endpoints, VPNs, or VNet peering, so on‑prem and firewall‑protected sources are unreachable.
Fix: Our solution was to keep Oracle ingestion on Dedicated clusters and use Serverless only for sources that are publicly accessible, like ADLS or Azure SQL with Azure AD auth.

Morning Job Delays Due to Cold Starts in Serverless Standard Mode

We scheduled 15 Gold-layer refresh jobs on Serverless (standard mode) to run at 6 AM, expecting everything to finish before business users logged in. In reality, each job needed a cold start of 4–6 minutes, and when all 15 started at once, several got stuck waiting for compute. As a result, dashboards weren’t fully updated until around 7:15 AM, leading to user complaints about stale data.
Fix: We staggered job start times to reduce the load spike and avoid queueing delays.
For the three most critical jobs, we switched to performance‑optimized Serverless for instant startup, while keeping the rest on standard mode to control costs.
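Staggering can be as simple as spacing the start times by a fixed gap. The sketch below is illustrative only: the job names, the 5:00 AM first start, and the 3-minute gap are assumptions, not the schedule we actually deployed.

```python
from datetime import datetime, timedelta


def staggered_start_times(first_start, job_names, gap_minutes):
    """Assign each job a start time, spaced gap_minutes apart in list order."""
    return {
        name: first_start + timedelta(minutes=index * gap_minutes)
        for index, name in enumerate(job_names)
    }


# Hypothetical names for the 15 Gold-layer refresh jobs
jobs = [f"gold_refresh_{n:02d}" for n in range(1, 16)]
schedule = staggered_start_times(datetime(2026, 1, 5, 5, 0), jobs, gap_minutes=3)

print(schedule["gold_refresh_01"].strftime("%H:%M"))  # 05:00
print(schedule["gold_refresh_15"].strftime("%H:%M"))  # 05:42
```

Ordering the list by business priority means the most important refreshes start first, and the resulting times can be translated into each job's cron schedule.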

Debugging failed queries was nearly impossible:

On Dedicated or Standard clusters, troubleshooting slow or failed queries is straightforward—you can dive into the Spark UI, inspect executor logs, and analyze task‑level metrics like spills and skew.
In Serverless, most of that visibility is missing. You only get high‑level query profiles and execution time, with no access to executor or stage‑level details.
We faced this when a customer lifetime value aggregation failed with out‑of‑memory errors in about 10% of runs. Without visibility, we couldn’t identify which stage or partition was causing the issue.
Workaround: Instead of data‑driven debugging, we had to rely on intuition, adding skew hints and salting join keys.
The fix worked, but the lack of observability made diagnosing the problem far more frustrating than on traditional clusters.
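For readers unfamiliar with salting, the idea is to split one oversized key into several smaller ones so that no single task has to hold all of its rows. The following is a toy, pure-Python illustration with invented data; it shows only the key-splitting step, not our actual Spark job.

```python
from collections import Counter


def salt_key(key, row_index, salt_buckets):
    """Append a rotating suffix so one hot key becomes several smaller keys."""
    return f"{key}_{row_index % salt_buckets}"


# Hypothetical skewed input: one customer owns 800 of 1,000 rows
keys = ["customer_hot"] * 800 + ["customer_a"] * 100 + ["customer_b"] * 100

rows_per_key_before = Counter(keys)
rows_per_key_after = Counter(
    salt_key(key, index, salt_buckets=8) for index, key in enumerate(keys)
)

print(max(rows_per_key_before.values()))  # 800
print(max(rows_per_key_after.values()))   # 100
```

In a real aggregation the salted partial results must be combined again by stripping the suffix, and in a join the other side has to be replicated once per salt value, so salting trades some extra work for a more even distribution.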

The end result of Serverless: It now handles 100% of our analyst-facing interactive workloads and ~40% of scheduled batch jobs (the lightweight ones). Monthly cost for those workloads dropped from ~$4,200 (always-on clusters) to ~$1,500 (pure usage-based). But we learned the hard way that Serverless is not a universal replacement. It is excellent for consumption-heavy, read-mostly, governance-compatible workloads. Anything requiring custom infrastructure, network access, performance guarantees, or deep debugging must stay on managed clusters.

 

For serverless best practices, see the following links:

Serverless Computing | Best Practices | System tables for Monitoring

The Layered Compute Strategy

After encountering these challenges across multiple sprints, we adopted a layered compute strategy that aligned each pipeline stage with the appropriate access mode:

 

Design your pipelines so that:

  1. Ingestion (Dedicated): Handles the messy real-world of raw files, custom formats, encryption, and external paths
  2. Transformation (Standard): Works exclusively with UC-governed tables — no filesystem access needed
  3. Consumption (Serverless): Queries governed Gold tables with full audit trail and per-user access control
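The three rules above can be written down as a small decision helper, which is handy in design reviews. This is a simplified sketch under our own assumptions: the `Workload` fields and the `choose_access_mode` function are hypothetical, and real decisions involve more factors (networking, SLAs, cost) than three flags.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    reads_raw_storage: bool = False      # touches external paths or raw files
    needs_custom_runtime: bool = False   # e.g. R, GPUs, init scripts, Spark tuning
    is_consumption: bool = False         # dashboards, ad-hoc queries, light reads


def choose_access_mode(workload):
    """Map a workload to an access mode following the layered strategy above."""
    if workload.reads_raw_storage or workload.needs_custom_runtime:
        return "Dedicated"
    if workload.is_consumption:
        return "Serverless"
    # Default: table-to-table transformation on governed tables
    return "Standard"


print(choose_access_mode(Workload("oracle_ingest", reads_raw_storage=True)))  # Dedicated
print(choose_access_mode(Workload("silver_enrichment")))                      # Standard
print(choose_access_mode(Workload("exec_dashboard", is_consumption=True)))    # Serverless
```

Checking the Dedicated conditions first reflects the lessons above: a single hard requirement such as raw file access or custom tuning overrides any cost argument for the other modes.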

 

 

3. Unity Catalog Storage: Volumes vs. External Locations

Unity Catalog doesn’t just control access to tables—it governs files, storage paths, and how data is accessed end to end. This marks a big shift away from the old DBFS and mount‑point approach, where storage access was loosely managed.

We learned quickly that this change isn’t a simple path replacement exercise. Moving to UC‑governed storage forces you to rethink how data is organized, accessed, and secured across teams.
In fact, storage migration turned out to be the hardest and most time‑consuming part of our Unity Catalog adoption. Most of our challenges—and lessons learned—came from reworking storage access models, not from changing compute or permissions.

 

The Old Way: Mount Points (No Governance)

# 2023: Mount points - EVERY user gets access to ALL data
dbutils.fs.mount(
    source="wasbs://raw-data@storageaccount.blob.core.windows.net",
    mount_point="/mnt/raw-data",
    extra_configs={
        "fs.azure.account.key.storageaccount.blob.core.windows.net": dbutils.secrets.get("scope", "key")
    },
)
# Any user reads anything - no audit, no permission check
df = spark.read.csv("/mnt/raw-data/sensitive/payroll/salaries.csv")

Why this was a problem – Our Experience

Six months before migrating to Unity Catalog, we had a serious compliance scare that exposed major gaps in our storage governance. A junior analyst was able to browse a mounted data lake path and accidentally access sensitive HR payroll data, including executive salaries. While there was no malicious intent, the incident escalated quickly and reached security leadership the same day.

The root cause was our old mount‑point model: one flat namespace with no directory‑level restrictions, no file‑level auditability, and poorly understood credentials embedded long ago. We couldn’t prove what data was accessed, who else might have seen it, or even fully revoke access without removing the analyst from the workspace entirely. The compliance investigation took weeks, and we were forced to assume worst‑case exposure.

That incident became the turning point. Leadership mandated eliminating mount points in production and pushed us to adopt Unity Catalog to gain proper storage isolation, auditing, and access control.

The New Way: Unity Catalog Volumes

Volumes provide a fully governed way to work with files under Unity Catalog. They appear as regular file systems, FUSE‑mounted at paths like /Volumes/<catalog>/<schema>/<volume>/, so existing code can read and write files naturally. The key difference is governance: access to Volumes is granted using SQL permissions, not hidden credentials or mount points. Every file operation—reads, writes, deletes—is audited and tied to a user or service principal. This gives teams the flexibility of filesystem access while enforcing strong security, visibility, and compliance through Unity Catalog.

How it works internally:

When you access a path like /Volumes/tredence_test/bronze_demo/landing_files/, Databricks doesn’t simply read from storage directly. Under the hood, it routes the request through a FUSE layer that is fully governed by Unity Catalog. Your identity is first authenticated, and then Unity Catalog checks whether you have explicit permissions on that specific volume.

If access is allowed, the storage request is executed using short‑lived credentials issued by Unity Catalog, not long‑lived keys. At the same time, every operation—whether it’s a spark.read, dbutils.fs.ls, or even a standard file open—is logged with the user identity, timestamp, and action.

This means unauthorized exploration is blocked immediately, and even failed attempts are recorded. If the same junior analyst tried accessing sensitive files today, they would see a PERMISSION_DENIED error, and the access attempt would be captured in the audit logs for full traceability.


# 2026: Governed file access
df = spark.read.csv("/Volumes/tredence_test/bronze_demo/landing_files/sales_data.csv")
df.write.parquet("/Volumes/tredence_test/silver_demo/staging/enriched_sales/")

-- Grant granular access
CREATE VOLUME tredence_test.bronze_demo.landing_files;
GRANT READ VOLUME ON VOLUME tredence_test.bronze_demo.landing_files TO `data-engineering`;
GRANT WRITE VOLUME ON VOLUME tredence_test.bronze_demo.landing_files TO `ingestion-sp`;

How we migrated to Volumes — challenges we faced:

Our migration plan seemed straightforward: create Volumes for each landing zone, update paths in notebooks, test, deploy. We estimated 2 weeks. It took 6 weeks. Here's why:

Challenge 1: Legacy Python libraries couldn't resolve Volume paths:

We ran into an unexpected issue after moving file paths from old DBFS mounts to Unity Catalog Volumes. Our data engineering team had a custom XML parser that relied on an external system tool (xmlstarlet) and expected standard file paths like /dbfs/mnt/.... When we updated the paths to /Volumes/..., the parser suddenly stopped working—the external command couldn’t find the files.

The root cause was subtle: Volumes are exposed through a special filesystem layer that Databricks manages. That layer works fine for Spark and normal Python code, but external system binaries launched through subprocess don’t automatically see it. From their point of view, the files simply weren’t there.

Our workaround was straightforward. Before invoking the external tool, we copied the files from the Volume into a local temporary directory and ran the parser there. The key lesson for us was that Volumes work best with Spark‑native and Python‑level file access, while older tools that depend on system binaries may need small adjustments to keep working.
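The copy-then-run workaround above can be sketched roughly as follows. This is a minimal illustration, not our production code: the helper name run_on_local_copy is hypothetical, and it assumes the external tool accepts the input file as its last argument.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_on_local_copy(source_file, command):
    """Copy source_file to a local temp dir, then run command on the copy.

    `command` is a list of arguments; the local file path is appended last.
    Returns the tool's standard output as text.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_copy = Path(tmp_dir) / Path(source_file).name
        # Plain Python file I/O: the access path that worked for us on Volumes.
        shutil.copy(source_file, local_copy)
        result = subprocess.run(
            command + [str(local_copy)],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError if the tool exits non-zero
        )
        return result.stdout
```

In a notebook this might be called as run_on_local_copy("/Volumes/tredence_test/bronze_demo/landing_files/orders.xml", ["xmlstarlet", "sel", "-t", "-v", "//order/@id"]), where the file name and XPath expression are placeholders. The temporary directory is removed when the call returns, so anything the tool writes next to its input has to be copied back before then.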

Challenge 2: Write performance was noticeably slower on Volumes

When we tested writing data to Unity Catalog Volumes, we noticed a clear performance difference. Writing about 500 MB of Parquet data to a Volume was roughly 40% slower than writing directly to ADLS using abfss://. The reason wasn't inefficiency; it was governance. Every file write to a Volume goes through permission checks, credential vending, and audit logging, which adds overhead.

This overhead became very noticeable in our IoT pipeline, where we were generating tens of thousands of small files every hour. For that kind of high‑throughput, file‑heavy ingestion, Volumes simply weren’t the right fit. Our solution was to write directly to external tables backed by ADLS, which avoids the per‑file governance cost.

The takeaway for us was clear: Volumes are excellent for governed access, exploration, and moderate file landing use cases. But for large‑scale, high‑frequency ingestion pipelines, external tables offer better performance and scalability.

Challenge 3: The 7-day deletion retention on managed Volumes surprised us:

We had a daily cleanup job that removed files from the landing zone once ingestion finished. In the old mount‑point setup, deleting files immediately freed up storage, so costs stayed predictable. After moving to managed Volumes, we assumed the behavior would be the same—but it wasn’t.

Managed Volumes keep deleted files for a fixed 7‑day retention period to protect against accidental deletions. Even though the files looked “deleted,” they were still consuming storage in the background. Within two weeks, this caused our storage costs to jump by about 40%. There was no way to shorten or disable this retention.

Our solution was to change how we used Volumes. We moved high‑churn landing zones to External Volumes, where deletions are immediate and fully under our control. Managed Volumes are now reserved for staging or shared areas where short‑term retention isn’t an issue. The key lesson: managed safety features are powerful, but you need to align them carefully with data lifecycle and cost expectations.
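To see why a daily cleanup job stops keeping storage flat, it helps to do the arithmetic. The sketch below is a simplified model that assumes every deleted byte is billed for the full retention window; the numbers are made up for illustration and are not measurements from our pipeline.

```python
def steady_state_storage_gb(live_gb, daily_deleted_gb, retention_days=7):
    """Estimate billed storage once deleted files are retained for retention_days."""
    # Each day's deletions linger for the whole window, so they stack up.
    retained_gb = daily_deleted_gb * retention_days
    return live_gb + retained_gb


# Illustrative inputs only: 700 GB of live data, 40 GB landed and deleted per day.
baseline = steady_state_storage_gb(live_gb=700, daily_deleted_gb=40, retention_days=0)
with_retention = steady_state_storage_gb(live_gb=700, daily_deleted_gb=40)
increase_pct = 100 * (with_retention - baseline) / baseline
print(f"{baseline} GB -> {with_retention} GB (+{increase_pct:.0f}%)")
# prints: 700 GB -> 980 GB (+40%)
```

The point of the model is that the overhead scales with churn, not with the size of the live data, which is why high-churn landing zones were the ones we moved.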

Challenge 4: Volume path limit of 1024 characters broke nested folder structures:

We ran into a subtle but serious issue with file paths when our source system delivered data using very deep folder hierarchies. The directory structure kept nesting metadata like region, date, and batch ID into folders, and some paths became extremely long. In a few cases, the path length crossed system limits, and files simply failed to appear—without any errors or warnings during the write.

This made the problem hard to detect and risky in production. After investigating, we realized the path depth itself was the culprit. Our fix was to simplify the structure by flattening the folders and moving most of the metadata into the filename instead. Once we did that, file writes became reliable again.

The lesson for us was simple: while Volumes support normal filesystem paths, overly deep directory structures can silently break writes. Keeping folder hierarchies shallow and encoding context into filenames is safer and easier to operate at scale.
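The flattening fix can be sketched as a small path helper. This is an illustration under assumptions: the function name, the "__" separator, and the key=value folder naming are hypothetical, and the 1024-character budget is simply the limit we ran into.

```python
from pathlib import PurePosixPath

MAX_PATH_CHARS = 1024  # the limit we hit; verify it for your own workspace


def flatten_landing_path(volume_root, nested_relative_path, sep="__"):
    """Collapse nested folders into the filename and enforce a length budget."""
    parts = PurePosixPath(nested_relative_path).parts
    flat_name = sep.join(parts)  # folder metadata now lives in the filename
    flat_path = str(PurePosixPath(volume_root) / flat_name)
    if len(flat_path) > MAX_PATH_CHARS:
        # Fail loudly instead of letting the write disappear silently.
        raise ValueError(
            f"path is {len(flat_path)} chars, limit is {MAX_PATH_CHARS}"
        )
    return flat_path
```

Checking the length before the write is the important part: it turns a silent failure into an exception the pipeline can alert on.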

To learn more about Unity Catalog Volumes, see: Catalog Volume

External Locations: Governance Layer for Cloud Storage

External locations are how Unity Catalog connects to cloud storage that you own and manage. They register a specific storage path with Unity Catalog so that access to the data can be controlled, audited, and secured using UC permissions. Any time you want Unity Catalog to govern data that lives outside Databricks‑managed storage, whether you're creating external tables or external volumes, you must first define an external location. In practice, external locations are the foundation that allows strong governance without giving up control of your underlying storage.

The Governance Chain

The relationship between these UC objects forms a governance chain:

Storage Credential (Azure Managed Identity / Service Principal)

    └── External Location (registered path in cloud storage)

            ├── External Table (structured data at sub-path)

            └── External Volume (file access at sub-path)

Each level adds governance. The Storage Credential holds the authentication secret. The External Location binds a path to that credential. Tables and Volumes inherit governance from their parent external location. This layered architecture means:

  • Rotating a storage key = update one Storage Credential, all dependent objects continue working
  • Restricting a team = revoke their grant on the External Location, all child tables/volumes become inaccessible
  • Auditing access = UC traces the chain from table → external location → credential for every access

Setting up external locations, and where it went wrong:

By the time we started moving to Unity Catalog, our Azure Data Lake had already been running for about three years. During that time, the folder structure grew naturally as new teams, projects, and deadlines came along—there was no single plan behind it. It worked well enough in the early days, but once we began introducing Unity Catalog and its external location boundaries, the cracks started to show. What we thought would be a straightforward migration quickly turned into a deeper exercise in understanding, cleaning up, and sometimes rethinking how our storage was organized. That’s when the real migration story began.

Week 1: The naive attempt: Our platform admin tried to register each team's folder as a separate external location, and each attempt was rejected with an overlap error.

It turned out a colleague had registered abfss://datalake@tredencewestus2.dfs.core.windows.net/ (the root container) as an external location two days earlier for testing. That single root registration blocked all child paths from being registered as separate locations. We didn't discover this for 3 days because the error message didn't clearly state which existing location was causing the overlap.
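A pre-flight check would have saved us those three days. The sketch below is a hypothetical helper, not a Databricks API: it assumes you have already collected the registered locations into a name-to-URL mapping (for example from the output of SHOW EXTERNAL LOCATIONS) and flags any that contain, or sit inside, the path you are about to register.

```python
def _normalize(url):
    """Ensure a trailing slash so prefix checks respect folder boundaries."""
    return url if url.endswith("/") else url + "/"


def find_overlaps(candidate_url, registered):
    """Return names of registered locations that overlap candidate_url.

    `registered` maps external location name -> URL. Two paths overlap
    when one is a parent folder of the other (or they are identical).
    """
    cand = _normalize(candidate_url)
    conflicts = []
    for name, url in registered.items():
        existing = _normalize(url)
        if cand.startswith(existing) or existing.startswith(cand):
            conflicts.append(name)
    return conflicts
```

Run against our situation, a candidate such as the finance folder would have returned the root location's name, which is exactly the detail the error message left out.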

Week 2: The cleanup disaster:

In week two of the migration, we tried to clean things up by dropping the root external location so teams could register their own, more targeted locations. What we didn’t expect was the immediate fallout. Four production external tables broke instantly, and pipelines began failing with EXTERNAL_LOCATION_NOT_FOUND errors.

The problem wasn’t obvious at first because the tables themselves were still there. What we broke was the access chain. External tables rely on the external location that covers their storage path for credentials and permissions. Once we dropped that external location, the tables became unreadable even though no one had explicitly deleted them. That incident was a hard lesson: external locations are not just configuration objects—they are critical dependencies, and removing them can silently break production workloads.

-- What we SHOULDN'T have done

DROP EXTERNAL LOCATION datalake_root;

-- Result: 4 external tables immediately became inaccessible

-- Error: "Cannot access storage path: no valid external location found"

 

-- Emergency rollback (re-create immediately)

CREATE EXTERNAL LOCATION datalake_root

  URL 'abfss://datalake@tredencewestus2.dfs.core.windows.net/'

  WITH (STORAGE CREDENTIAL tredence_azure_credential);

-- Tables came back online. Crisis averted after 20 minutes of downtime.

Our Takeaway:

We learned the hard way that external locations are not just setup objects—you can’t safely remove them without understanding what depends on them. External tables and volumes rely entirely on the external location that covers their storage path for credentials and access. Dropping an external location doesn’t delete those objects, but it instantly makes them unusable by breaking the access chain.

Now, before deleting or modifying any external location, we always audit dependencies first. This simple check has saved us from multiple production outages:

-- Find external tables that may depend on an external location

SELECT

  table_catalog,

  table_schema,

  table_name,

  table_type,

  data_source_format

FROM system.information_schema.tables

WHERE table_type = 'EXTERNAL';

We then cross‑check each table's storage path against the external location we plan to change. The rule is simple: no dependency audit, no deletion. This discipline became a core part of our Unity Catalog operating practices.
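The cross-check itself is a simple prefix match. The sketch below uses hypothetical helper names and assumes each table's storage location has already been collected into a dictionary (for instance from the location column that DESCRIBE DETAIL returns); it is an illustration of the rule, not a complete audit tool.

```python
def tables_under_location(location_url, table_paths):
    """Return names of tables whose storage path sits under location_url.

    `table_paths` maps fully qualified table name -> storage path.
    """
    root = location_url.rstrip("/") + "/"
    return sorted(
        name
        for name, path in table_paths.items()
        if (path.rstrip("/") + "/").startswith(root)
    )


def safe_to_drop(location_url, table_paths):
    """Apply the rule: drop only when no table depends on the location."""
    return not tables_under_location(location_url, table_paths)
```

External volumes can be checked the same way by adding their storage paths to the same mapping.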

 

The architectural decision we made: 

After 2 weeks of attempting workarounds, we accepted that our storage layout was incompatible with granular external locations. We chose the pragmatic path:

  1. Register ONE external location at the container root: abfss://datalake@tredencewestus2.dfs.core.windows.net/
  2. Create external tables at specific sub-paths, granting access per-table
  3. Create external Volumes for file-level access at specific sub-paths, granting access per-Volume
  4. Accept that the external location itself is a broad authentication bridge, and UC tables/volumes provide the authorization granularity

 

 

-- Our final architecture:

-- One broad external location (authentication bridge)

CREATE EXTERNAL LOCATION tredence_datalake

  URL 'abfss://datalake@tredencewestus2.dfs.core.windows.net/'

  WITH (STORAGE CREDENTIAL tredence_azure_credential);

 

-- Fine-grained access via tables (authorization layer)

CREATE TABLE tredence_test.bronze_demo.finance_transactions

  LOCATION 'abfss://datalake@tredencewestus2.dfs.core.windows.net/data/raw/finance/transactions/';

GRANT SELECT ON TABLE tredence_test.bronze_demo.finance_transactions TO `finance-team`;

 

CREATE TABLE tredence_test.bronze_demo.marketing_campaigns 

  LOCATION 'abfss://datalake@tredencewestus2.dfs.core.windows.net/data/raw/marketing/campaigns/';

GRANT SELECT ON TABLE tredence_test.bronze_demo.marketing_campaigns TO `marketing-team`;

-- Finance team CANNOT see marketing data even though both tables

-- are under the same external location. UC table grants enforce isolation.

 

The final outcome of storage migration:

The migration took 6 weeks instead of the planned 2. Along the way we had one compliance scare (the catalyst), one 20-minute production outage (dropped external location), one 4-hour outage (credential rotation), and a completely restructured mental model of how file access works. But today every file access is audited, every path is governed, and when auditors ask "who accessed this data?" we can answer in 30 seconds via the UC audit log. The pain was worth it.

4. From Z-Order to Liquid Clustering

Traditionally, teams used scheduled OPTIMIZE … ZORDER BY jobs to manually reorganize data and keep query performance predictable. This approach worked, but it required constant tuning, regular job runs, and a good understanding of access patterns, all of which added operational overhead and cost. Liquid Clustering replaces this legacy model with a lower‑maintenance approach: you declare clustering keys, and the platform clusters data incrementally around them instead of rewriting a fixed layout on a schedule. With CLUSTER BY AUTO, Databricks can also choose and adjust the keys based on how the table is queried. The shift moves performance optimization from manual maintenance toward built‑in intelligence, reducing the need for repetitive OPTIMIZE jobs while delivering more consistent performance with less administrative effort.

The Old Way: Static Partitioning + Z-Order

-- 2023: Fixed partitions + manual Z-Order scheduling

CREATE TABLE sales.fact_orders (

    order_id BIGINT, customer_id INT, order_date DATE, amount DECIMAL(10,2)

) USING DELTA

PARTITIONED BY (order_year INT);

 

-- Had to run on schedule - forget it and performance degrades silently

OPTIMIZE sales.fact_orders ZORDER BY (customer_id, order_date);

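Problems: Choosing partition columns meant predicting future queries. Changing them meant a full table rewrite. Z-Order only applied to files that existed when OPTIMIZE ran, so newly written data stayed unclustered until the next run. The entire process was manual and error-prone.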

 

The New Way: Liquid Clustering

Liquid Clustering introduces a more flexible and modern approach to data layout by replacing both static partitioning and manual Z‑Ordering with a single, adaptive mechanism. Instead of locking in partition choices upfront, you define clustering keys that the platform uses to organize data incrementally as it is written and optimized. The key advantage is flexibility: clustering keys can be added or changed at any time without rewriting the entire table. Clustering still happens through OPTIMIZE, but the work is incremental, and where predictive optimization is enabled the platform schedules it for you, so hand‑maintained OPTIMIZE jobs largely disappear. The result is less maintenance overhead and consistent query performance as workloads evolve.

-- 2026: Liquid Clustering (no partitioning needed)

CREATE TABLE tredence_test.silver_demo.fact_internet_sales_enriched (

    business_key STRING,

    CustomerKey INT,

    order_date DATE,

    profit_amount DECIMAL(38,2),

    revenue_tier STRING

) USING DELTA

CLUSTER BY (CustomerKey, order_date);

 

-- Query patterns changed? Just alter - no table rewrite!

ALTER TABLE tredence_test.silver_demo.fact_internet_sales_enriched

CLUSTER BY (order_date, revenue_tier);

 

-- Or let Databricks choose keys automatically (DBR 15.4+)

ALTER TABLE tredence_test.silver_demo.fact_internet_sales_enriched

CLUSTER BY AUTO;

Critical Limitations of Liquid Clustering:

Liquid Clustering comes with some important constraints that teams need to plan for.

First, it can’t be enabled on already partitioned tables—you must migrate the data using a CREATE TABLE AS SELECT or a DEEP CLONE, which can be time‑consuming for large datasets.

Second, enabling Liquid Clustering requires a Delta protocol upgrade, which may break compatibility with older readers like Trino, Presto, or open‑source Spark versions.

Lastly, existing historical data is not automatically reclustered; only new data follows the new layout unless you explicitly run a full OPTIMIZE to reorganize past data. These limitations don’t negate the benefits, but they do require careful migration planning.

 

-- Migration path (cannot do ALTER on existing partitioned table)

CREATE TABLE new_table CLUSTER BY (customer_id) AS SELECT * FROM existing_partitioned_table;

-- Recluster historical data (expensive for large tables)

OPTIMIZE tredence_test.silver_demo.fact_internet_sales_enriched FULL;

Our Approach at Tredence:

Following the pattern described in Databricks Cluster Types Explained, we split workloads across access modes:

  1. Data ingestion team — Dedicated (single user) cluster per member with small compute for Bronze table creation from external locations
  2. Orchestrated production jobs — Dedicated cluster with service principal for Azure Data Factory / Lakeflow Jobs pipelines
  3. Silver/Gold processing — Standard (shared) mode once data is in UC-governed tables
  4. Interactive analytics — Serverless compute for all ad-hoc work

For Unity Catalog best practices, see: Unity Catalog Best Practices

Final Summary:

Databricks in 2026 is no longer a "Spark platform with clusters." It is a governed data execution environment where:

Compute is abstracted
Users no longer think in terms of clusters or nodes. They choose intent (interactive, batch, scheduled), and Databricks automatically provisions the right compute behind the scenes.

Storage is regulated
Data access is no longer based on open mounts or hidden credentials. Unity Catalog governs tables, files, paths, and volumes with fine‑grained permissions and full auditability.

Optimization is predictive
Performance tuning has shifted from manual actions like partitioning and Z‑Order jobs to built‑in, adaptive mechanisms that respond to real usage patterns over time.

Flexibility is intentional, not implicit
Different workloads deliberately run on different execution modes (Dedicated, Standard, Serverless), each chosen for clear reasons around cost, performance, and governance—not accidental configuration drift.

Together, this makes Databricks less of a “Spark platform” and more of a governed data execution environment designed for scale, security, and operational simplicity.

Teams that fight the platform — retaining legacy cluster-level control, mount-point access, and manual optimization — will swim against an increasingly strong current. Teams that design with Unity Catalog will find Databricks simpler, safer, and more scalable than ever.

The trade-off is real: you exchange fine-grained control for consistency, manual optimization for platform intelligence, and configuration flexibility for governance correctness. For 95% of enterprise workloads, this is an excellent trade. For the remaining 5% (distributed ML training, custom query engines), Dedicated mode remains the escape hatch.

 

