Databricks Lakeflow Salesforce Connector: Automate Data Ingestion at Scale

If you've ever tried to bring Salesforce data into a modern data platform, you know the challenges: complex authentication, schema drift, table naming constraints, and the constant need to maintain parity between source and target systems. Traditional ETL approaches require custom code, constant maintenance, and struggle to keep up with changes in the source system.

That's exactly the problem we set out to solve using Databricks' native Salesforce ingestion pipeline with an API-driven approach.

The Problem: Hidden Complexity in Salesforce Data Integration

At a high level, the problem sounds simple, but implementing it is a different story:

Connect to Salesforce environments securely
Ingest multiple tables and views reliably
Handle schema changes automatically
Maintain latest replica of source data
Deliver governed data in Unity Catalog

Organizations rely on Salesforce as their source of truth for sales, marketing, and customer data. But getting that data into a Data Lakeflow for analytics and AI is surprisingly difficult. Data teams face challenges with:

Challenge	Impact
Manual connection setup	Days lost to configuration and troubleshooting
Schema drift in source tables	Broken pipelines and missing data
Table naming restrictions	Cannot rename tables for dimensional modeling
Complex authentication	Security risks and access management overhead
No automatic schema evolution	Pipeline failures when source adds columns
Inconsistent data freshness	Stale data for downstream analytics

We wanted a way to bridge that gap, automating Salesforce ingestion with zero custom code for schema management, while maintaining full governance and control over the destination tables.

Phase 1: Native Salesforce Connector with UI-Based Pipeline

Databricks provides a built-in ingestion pipeline feature with connectors for multiple sources including Salesforce, SQL Server, Google, and others. The UI-based approach is the fastest way to get started.

Setting Up the Connection

First, you need a connection to your Salesforce environment. The database admin team creates this connection using:

Login credentials
Server details
Authentication URL
Client ID

Note: Connection details are sensitive and should be managed securely with appropriate access controls.

Creating the Pipeline

Once the connection is ready, you create a pipeline:

Select Connection: Choose the Salesforce connection from available options in your workspace
Configure Target: Specify the target catalog in Unity Catalog (must be a managed table, not external storage)
Define Source: Select tables or views from Salesforce that your connection has access to

The three-namespace structure in Databricks:

Catalog → Schema → Table

Selecting Columns and Handling Schema Evolution

A critical decision point is column selection:

Selection Type	Behavior	Best For
Select All Columns	Automatically adds new source columns to target; no pipeline modification needed	Production pipelines where source schema changes frequently
Select Specific Columns	New source columns require pipeline modification; manual intervention needed	Stable sources with well-defined, unchanging schemas

Key Insight: If you select all columns, schema merge happens automatically. When a new column is added in Salesforce, it will be automatically populated to your target without any code changes or pipeline modifications.

Scheduling and Orchestration

The UI allows you to set up schedules:

Simple daily runs
Specific time triggers
Notification configurations for pipeline status

Critical Limitation: Table Name Constraints

The UI limitation: Whatever table name exists in Salesforce becomes the table name in your destination. You cannot rename tables to:

Segregate facts and dimensions
Apply consistent naming conventions
Create business-friendly table names

Example: A Salesforce table named "Account" cannot be renamed to "dim_account" in the target.

Phase 2: Streaming Tables and Change Data Capture

What Gets Created?

The pipeline creates a managed streaming table, not a standard managed Delta table.

Why streaming table?

You must specify key columns in the dataset
Based on these key columns, the pipeline detects changes
It always brings the latest record, with no history maintenance

How Change Data Capture Works

The pipeline implements CDC Type 1 (overwrite):

Scenario	Behavior
Record Update	Existing record in target (e.g., ID=1) is updated with latest source values
Record Delete	The record is removed from the target if deleted in the source
New Record	Record is inserted into target

Important: This maintains only the latest replica of source data. It does not maintain historical changes (CDC Type 2). For audit trails or slowly changing dimensions, additional logic would be needed.

Data Flow

Salesforce Source → Ingestion Pipeline (Serverless) → Managed Streaming Table (Unity Catalog)

Critical technical constraint: All Salesforce ingestion pipelines run on serverless compute only. You cannot use job clusters for this connector.

Phase 3: API-Driven Pipeline Creation for Advanced Control

The UI approach works for basic use cases, but when you need advanced capabilities like renaming tables, adding filters, and custom configurations, you need the Databricks API approach.

Why API-Driven?

Capability	UI Mode	API Mode
Rename destination tables	❌ Not supported	✅ Supported
Add custom filters	❌ Limited	✅ Full support
Modify table names for dimensional modeling	❌ Not possible	✅ Fully configurable
Custom notifications	✅ Basic	✅ Advanced
JSON-based pipeline definition	❌ Not available	✅ Native

The API Implementation Pattern

We implemented the API approach using a notebook with these components:

1. Pipeline Definition as JSON

The pipeline is defined entirely as a JSON payload:

json

{

"name": "salesforce_ingestion_pipeline",

"serverless": true,

"ingestion_definition": {

"connection_name": "salesforce_connection",

"source": {

"schema": "Salesforce_Schema",

"object": "Account"

"destination": {

"catalog": "ingestion_platform",

"schema": "merchandising",

"table_name": "dim_account" // Renamed from source!

}

"notifications": [...],

"schedule": {...}

}

2. Key Configuration Elements

Element	Description
serverless: true	Mandatory for Salesforce ingestion
connection_name	Reference to pre-configured Salesforce connection
source.object	Salesforce table or view to ingest
destination.table_name	Custom name in target (API-only feature)
notifications	Alert configurations for pipeline status

3. API Deployment

Once the JSON payload is ready, we post it to the Databricks API:

python

# POST request to Databricks API

# Headers include authentication tokens

# Body contains pipeline definition JSON

# Endpoint: /api/2.0/pipelines

response = requests.post(

f"{databricks_workspace_url}/api/2.0/pipelines",

headers=authentication_headers,

json=pipeline_definition

)

The API authenticates using appropriate tokens and creates the pipeline based on the definition. This approach is fully programmable and can be integrated into CI/CD pipelines.

Integration with External Schedulers

Rather than using Databricks' built-in scheduler, we integrated with Control-M (or any enterprise scheduler) to:

Maintain consistent orchestration across all data pipelines

Leverage existing monitoring and alerting infrastructure

Provide unified observability across ingestion sources

Where to Use — Ideal Use Cases

Salesforce Data Lakeflow Foundation

Ingest all Salesforce objects (Account, Contact, Opportunity, Lead, Case) into Unity Catalog as the foundation for enterprise analytics and AI.

Real-Time Customer 360

Maintain latest replica of customer data from Salesforce for real-time personalization, segmentation, and customer analytics.

Sales Performance Analytics

Ingest opportunity and pipeline data for sales forecasting, territory analysis, and quota tracking with governed data in Delta Lake.

Marketing Attribution

Combine Salesforce campaign data with other sources for multi-touch attribution and ROI analysis.

Compliance and Auditing

Create governed, versioned copies of Salesforce data for regulatory compliance, audit trails, and historical analysis.

AI-Powered Sales Intelligence

Feed fresh Salesforce data into AI models for lead scoring, churn prediction, and next-best-action recommendations.

Where NOT to Use — Limitations & Boundaries

CDC Type 1 Only

This pipeline maintains only the latest replica, with no history. For slowly changing dimensions (SCD Type 2) or audit history, you need additional change data capture and history tables.

Serverless Only Constraint

Salesforce ingestion pipelines run exclusively on serverless compute. You cannot use job clusters, which may impact cost modeling and compute customization.

No UI Table Renaming

The UI does not support renaming tables. This is a significant limitation for teams with dimensional modeling standards. API is the only workaround.

Schema Evolution with Selective Columns

If you select specific columns (not all), new source columns require manual pipeline modification. Automatic schema merge works only with "select all columns."

Delete Propagation

Deletes in the source propagate to the target. This may be undesirable for compliance or audit use cases where you need to retain records even after source deletion.

Latency Considerations

Batch-based ingestion with scheduling (not real-time streaming). For sub-second latency requirements, consider Salesforce Streaming API instead.

Aspect	Suitable ✅	Not Suitable ❌
Ingestion Pattern	Batch, scheduled	Real-time streaming (< 1 sec)
Change Capture	CDC Type 1 (latest replica)	CDC Type 2 (history tracking)
Table Naming	API mode supports renaming	UI mode restricts naming
Compute	Serverless only	Job clusters
Schema Evolution	Auto with "select all"	Manual with column selection
Deletion Handling	Propagates to target	Retain deleted records

Results & Implementation Insights

What We Built

API-based pipeline creation for Salesforce ingestion with full control over table naming
Serverless streaming tables maintaining latest replica of source data
Integration with Control-M for enterprise scheduling and monitoring
Unity Catalog governance with managed tables and proper access controls

Key Capabilities Delivered

Capability	Implementation
Table Renaming	API mode with custom destination table names
Automatic Schema Evolution	"Select all columns" approach
Schedule Management	External scheduler (Control-M) integration
Authentication	Secure connection managed by DB admin team
Data Freshness	Configurable based on schedule

Pipeline Characteristics

Compute: Serverless (mandatory for Salesforce connector)
Storage: Managed streaming tables in Unity Catalog

Change Capture: CDC Type 1 (latest replica only)

Scale: Supports multiple tables per single pipeline

What We Learned: Challenges and Solutions

Issue	Root Cause	Fix
No destination table renaming	UI limitation in native connector	Switched to API-based pipeline creation with JSON payload
No automatic schema merge with selective columns	Design choice in connector	Used "select all columns" approach for automatic evolution
Serverless compute confusion	Unclear documentation	Documented as mandatory for Salesforce ingestion
Delete propagation concerns	CDC Type 1 behavior	Added downstream history tables for audit requirements
Scheduling flexibility limitations	Built-in scheduler constraints	Integrated with Control-M enterprise scheduler
Authentication complexity	Multiple authentication methods	Centralized connection management by DB admin team

Why This Pattern Matters for Any Organization

By combining Databricks' native Salesforce connector with API-driven pipeline creation, we've built a scalable, governed integration pattern that goes beyond simply moving data:

From Siloed CRM to Unified Lakeflow

Salesforce data becomes part of the governed data estate, joining other enterprise data sources for comprehensive analytics.

From Manual Schema Management to Automated Evolution

Schema changes in Salesforce no longer break pipelines. Automatic evolution handles new columns seamlessly.

From Rigid Naming to Flexible Modeling

API mode unlocks the ability to rename tables for dimensional modeling, fact/dimension segregation, and consistent naming conventions.

From Point-to-Point Integration to Platform Capability

Salesforce ingestion becomes a reusable service that multiple teams can leverage, with proper governance and observability.

Key Takeaways & Best Practices

Architecture Decisions

Always use "select all columns" for production pipelines to benefit from automatic schema evolution
Use API mode when you need table renaming - UI mode cannot satisfy dimensional modeling requirements
Plan for CDC Type 1 behaviour - add history tables separately if audit trails are required
Integrate with enterprise schedulers for consistent orchestration across all data pipelines
Centralize connection management - let DB admin team handle credentials and access

When to Choose UI vs. API

Choose UI Mode When...	Choose API Mode When...
Source table names are acceptable as-is	You need to rename tables for modeling
Quick prototyping and testing	Production pipelines with standards
Simple single-table ingestion	Complex multi-table with custom filters
Limited engineering resources	Full CI/CD integration required
No dimensional modeling requirements	Fact/dimension segregation needed

Production Recommendations

Validate CDC behavior: ensure delete propagation aligns with your use case
Monitor pipeline performance - serverless compute has different characteristics than job clusters
Implement data quality checks on target tables after each ingestion run
Document table naming conventions for API-created pipelines
Test schema evolution with "select all columns" before production deployment
Set up notifications for pipeline failures and schema changes

Conclusion

The Lakeflow - Salesforce Connector pattern demonstrates that Databricks' native ingestion capabilities combined with API-driven pipeline creation provide a powerful path from CRM data to governed Lakeflow tables. What traditionally required custom Python scripts, complex change data capture logic, and constant schema management can now be accomplished with configuration-driven pipelines.

The key innovation is the combination of automatic schema evolution (with "select all columns"), serverless execution, and the flexibility of API-based deployment for advanced use cases like table renaming. This pattern is directly applicable to any organization needing to bring Salesforce data into Databricks for analytics and AI.

However, this approach has clear boundaries: CDC Type 1 only (latest replica), serverless compute only, and UI limitations around table naming. Understanding these limitations is critical for setting the right expectations and designing appropriate downstream patterns for history tracking and audit compliance.

The API-based workaround for table renaming using JSON payloads and Databricks API represents a reusable pattern for extending native connectors beyond UI limitations. This same pattern can be applied to other connectors (SQL Server, Google, etc.) where similar constraints exist.

The complete solution: connection management, pipeline definition, API deployment, and scheduler integration is production-ready and can be adapted for any Salesforce data ingestion use case.

It all starts with one powerful idea:

"Let the platform handle the ingestion—and let the Lakeflow handle the governance."

On This Page

Lakeflow Salesforce Connector: Automating Enterprise Data Ingestion at Scale