On This Page

If you've ever tried to bring Salesforce data into a modern data platform, you know the challenges: complex authentication, schema drift, table naming constraints, and the constant need to maintain parity between source and target systems. Traditional ETL approaches require custom code, constant maintenance, and struggle to keep up with changes in the source system. 

That's exactly the problem we set out to solve using Databricks' native Salesforce ingestion pipeline with an API-driven approach. 

The Problem: Hidden Complexity in Salesforce Data Integration 

At a high level, the problem sounds simplebut implementing it is a different story: 

  • Connect to Salesforce environments securely
  • Ingest multiple tables and views reliably
  • Handle schema changes automatically
  • Maintain latest replica of source data
  • Deliver governed data in Unity Catalog 

Organizations rely on Salesforce as their source of truth for sales, marketing, and customer data. But getting that data into a Data Lakeflow for analytics and AI is surprisingly difficult. Data teams face challenges with: 

Challenge 

Impact 

Manual connection setup 

Days lost to configuration and troubleshooting 

Schema drift in source tables 

Broken pipelines and missing data 

Table naming restrictions 

Cannot rename tables for dimensional modeling 

Complex authentication 

Security risks and access management overhead 

No automatic schema evolution 

Pipeline failures when source adds columns 

Inconsistent data freshness 

Stale data for downstream analytics 

We wanted a way to bridge that gapautomating Salesforce ingestion with zero custom code for schema managementwhile maintaining full governance and control over the destination tables. 

ShapePhase 1: Native Salesforce Connector with UI-Based Pipeline 

Databricks provides a built-in ingestion pipeline feature with connectors for multiple sources including Salesforce, SQL Server, Google, and others. The UI-based approach is the fastest way to get started. 

 

 

Setting Up the Connection 

First, you need a connection to your Salesforce environment. The database admin team creates this connection using: 

  • Login credentials
  • Server details
  • Authentication URL
  • Client ID 

Note: Connection details are sensitive and should be managed securely with appropriate access controls. 

 

Creating the Pipeline 

Once the connection is ready, you create a pipeline: 

  1. Select Connection: Choose the Salesforce connection from available options in your workspace
  2. Configure Target: Specify the target catalog in Unity Catalog (must be a managed table, not external storage)
  3. Define Source: Select tables or views from Salesforce that your connection has access to 

The three-namespace structure in Databricks: 

Catalog → Schema → Table 

 

Selecting Columns and Handling Schema Evolution 

A critical decision point is column selection: 

Selection Type 

Behavior 

Best For 

Select All Columns 

Automatically adds new source columns to target; no pipeline modification needed 

Production pipelines where source 

 schema changes frequently 

Select Specific Columns 

New source columns require pipeline modification; manual intervention needed 

Stable sources with well-defined,  

unchanging schemas 

Key Insight:If you select all columns, schema merge happens automatically. When a new column is added in Salesforce, it will be automatically populated to your target without any code changes or pipeline modifications. 

Scheduling and Orchestration 

The UI allows you to set up schedules: 

  • Simple daily runs
  • Specific time triggers
  • Notification configurations for pipeline status 

Critical Limitation: Table Name Constraints 

The UI limitation: Whatever table name exists in Salesforce becomes the table name in your destination. You cannot rename tables to: 

  • Segregate facts and dimensions
  • Apply consistent naming conventions
  • Create business-friendly table names 

Example: A Salesforce table named "Account" cannot be renamed to "dim_account" in the target. 

Phase 2: Streaming Tables and Change Data Capture 

What Gets Created? 

The pipeline creates a managed streaming table, not a standard managed Delta table. 

Why streaming table? 

  • You must specify key columns in the dataset
  • Based on these key columns, the pipeline detects changes
  • It always brings the latest record, with no history maintenance  

How Change Data Capture Works 

The pipeline implements CDC Type 1 (overwrite): 

Scenario 

Behavior 

Record Update 

Existing record in target (e.g., ID=1) is updated with latest source values 

Record Delete 

The record is removed from the target if deleted in the source 

New Record 

Record is inserted into target 

Important: This maintains only the latest replica of source data. It does not maintain historical changes (CDC Type 2). For audit trails or slowly changing dimensions, additional logic would be needed. 

Data Flow 

Salesforce Source → Ingestion Pipeline (Serverless) → Managed Streaming Table (Unity Catalog) 

 

Critical technical constraint: All Salesforce ingestion pipelines run on serverless compute only. You cannot use job clusters for this connector. 

Phase 3: API-Driven Pipeline Creation for Advanced Control 

The UI approach works for basic use cases, but when you need advanced capabilities like renaming tables, adding filters, and custom configurations, you need the Databricks API approach. 

Why API-Driven? 

Capability 

UI Mode 

API Mode 

Rename destination tables 

 Not supported 

 Supported 

Add custom filters 

 Limited 

 Full support 

Modify table names for dimensional modeling 

 Not possible 

 Fully configurable 

Custom notifications 

 Basic 

 Advanced 

JSON-based pipeline definition 

 Not available 

 Native 

The API Implementation Pattern 

We implemented the API approach using a notebook with these components: 

1. Pipeline Definition as JSON 

The pipeline is defined entirely as a JSON payload: 

json 

{ 

  "name": "salesforce_ingestion_pipeline", 

  "serverless": true, 

  "ingestion_definition": { 

    "connection_name": "salesforce_connection", 

    "source": { 

      "schema": "Salesforce_Schema", 

      "object": "Account" 

    }, 

    "destination": { 

      "catalog": "ingestion_platform", 

      "schema": "merchandising", 

      "table_name": "dim_account"  // Renamed from source! 

    } 

  }, 

  "notifications": [...], 

  "schedule": {...} 

} 

2. Key Configuration Elements 

Element 

Description 

serverless: true 

Mandatory for Salesforce ingestion 

connection_name 

Reference to pre-configured Salesforce connection 

source.object 

Salesforce table or view to ingest 

destination.table_name 

Custom name in target (API-only feature) 

notifications 

Alert configurations for pipeline status 

3. API Deployment 

Once the JSON payload is ready, we post it to the Databricks API: 

python 

# POST request to Databricks API 

# Headers include authentication tokens 

# Body contains pipeline definition JSON 

# Endpoint: /api/2.0/pipelines 

 

response = requests.post( 

    f"{databricks_workspace_url}/api/2.0/pipelines", 

    headers=authentication_headers, 

    json=pipeline_definition 

) 

The API authenticates using appropriate tokens and creates the pipeline based on the definition. This approach is fully programmable and can be integrated into CI/CD pipelines. 

Integration with External Schedulers 

Rather than using Databricks' built-in scheduler, we integrated with Control-M (or any enterprise scheduler) to: 

  • Maintain consistent orchestration across all data pipelines 
  • Leverage existing monitoring and alerting infrastructure 

  • Provide unified observability across ingestion sources 

Where to Use — Ideal Use Cases 

Salesforce Data Lakeflow Foundation 

Ingest all Salesforce objects (Account, Contact, Opportunity, Lead, Case) into Unity Catalog as the foundation for enterprise analytics and AI. 

Real-Time Customer 360 

Maintain latest replica of customer data from Salesforce for real-time personalization, segmentation, and customer analytics. 

Sales Performance Analytics 

Ingest opportunity and pipeline data for sales forecasting, territory analysis, and quota tracking with governed data in Delta Lake. 

Marketing Attribution 

Combine Salesforce campaign data with other sources for multi-touch attribution and ROI analysis. 

Compliance and Auditing 

Create governed, versioned copies of Salesforce data for regulatory compliance, audit trails, and historical analysis. 

AI-Powered Sales Intelligence 

Feed fresh Salesforce data into AI models for lead scoring, churn prediction, and next-best-action recommendations. 

Where NOT to Use — Limitations & Boundaries 

CDC Type 1 Only 

This pipeline maintains only the latest replica, with no history. For slowly changing dimensions (SCD Type 2) or audit history, you need additional change data capture and history tables. 

Serverless Only Constraint 

Salesforce ingestion pipelines run exclusively on serverless compute. You cannot use job clusters, which may impact cost modeling and compute customization. 

No UI Table Renaming 

The UI does not support renaming tables. This is a significant limitation for teams with dimensional modeling standards. API is the only workaround. 

Schema Evolution with Selective Columns 

If you select specific columns (not all), new source columns require manual pipeline modification. Automatic schema merge works only with "select all columns." 

Delete Propagation 

Deletes in the source propagate to the target. This may be undesirable for compliance or audit use cases where you need to retain records even after source deletion. 

Latency Considerations 

Batch-based ingestion with scheduling (not real-time streaming). For sub-second latency requirements, consider Salesforce Streaming API instead. 

Aspect 

Suitable  

Not Suitable  

Ingestion Pattern 

Batch, scheduled 

Real-time streaming (< 1 sec) 

Change Capture 

CDC Type 1 (latest replica) 

CDC Type 2 (history tracking) 

Table Naming 

API mode supports renaming 

UI mode restricts naming 

Compute 

Serverless only 

Job clusters 

Schema Evolution 

Auto with "select all" 

Manual with column selection 

Deletion Handling 

Propagates to target 

Retain deleted records 

 

Results & Implementation Insights 

What We Built 

  • API-based pipeline creation for Salesforce ingestion with full control over table naming
  • Serverless streaming tablesmaintaining latest replica of source data
  • Integration with Control-M for enterprise scheduling and monitoring
  • Unity Catalog governance with managed tables and proper access controls 

Key Capabilities Delivered 

Capability 

Implementation 

Table Renaming 

API mode with custom destination table names 

Automatic Schema Evolution 

"Select all columns" approach 

Schedule Management 

External scheduler (Control-M) integration 

Authentication 

Secure connection managed by DB admin team 

Data Freshness 

Configurable based on schedule 

Pipeline Characteristics 

  • Compute: Serverless (mandatory for Salesforce connector) 

  • Storage: Managed streaming tables in Unity Catalog 

  • Change Capture: CDC Type 1 (latest replica only) 

  • Scale:Supports multiple tables per single pipeline 

What We Learned: Challenges and Solutions 

Issue 

Root Cause 

Fix 

No destination table renaming 

UI limitation in native connector 

Switched to API-based pipeline creation with JSON payload 

No automatic schema merge with selective columns 

Design choice in connector 

Used "select all columns" approach for automatic evolution 

Serverless compute confusion 

Unclear documentation 

Documented as mandatory for Salesforce ingestion 

Delete propagation concerns 

CDC Type 1 behavior 

Added downstream history tables for audit requirements 

Scheduling flexibility limitations 

Built-in scheduler constraints 

Integrated with Control-M enterprise scheduler 

Authentication complexity 

Multiple authentication methods 

Centralized connection management by DB admin team 

 

Why This Pattern Matters for Any Organization 

By combining Databricks' native Salesforce connector with API-driven pipeline creation, we've built a scalable, governed integration pattern that goes beyond simply moving data: 

From Siloed CRM to Unified Lakeflow 

Salesforce data becomes part of the governed data estate, joining other enterprise data sources for comprehensive analytics. 

From Manual Schema Management to Automated Evolution 

Schema changes in Salesforce no longer break pipelines. Automatic evolution handles new columns seamlessly. 

From Rigid Naming to Flexible Modeling 

API mode unlocks the ability to rename tables for dimensional modeling, fact/dimension segregation, and consistent naming conventions. 

From Point-to-Point Integration to Platform Capability 

Salesforce ingestion becomes a reusable service that multiple teams can leverage, with proper governance and observability. 

Key Takeaways & Best Practices 

Architecture Decisions 

  • Always use "select all columns" for production pipelines to benefit from automatic schema evolution
  • Use API mode when you need table renaming - UI mode cannot satisfy dimensional modeling requirements
  • Plan for CDC Type 1 behaviour add history tables separately if audit trails are required
  • Integrate with enterprise schedulers for consistent orchestration across all data pipelines
  • Centralize connection management - let DB admin team handle credentials and access 

When to Choose UI vs. API 

Choose UI Mode When... 

Choose API Mode When... 

Source table names are acceptable as-is 

You need to rename tables for modeling 

Quick prototyping and testing 

Production pipelines with standards 

Simple single-table ingestion 

Complex multi-table with custom filters 

Limited engineering resources 

Full CI/CD integration required 

No dimensional modeling requirements 

Fact/dimension segregation needed 

Production Recommendations 

  • Validate CDC behavior: ensure delete propagation aligns with your use case
  • Monitor pipeline performance - serverless compute has different characteristics than job clusters
  • Implement data quality checkson target tables after each ingestion run
  • Document table naming conventions for API-created pipelines
  • Test schema evolution with "select all columns" before production deployment
  • Set up notificationsfor pipeline failures and schema changes 

Conclusion 

The Lakeflow - Salesforce Connector pattern demonstrates that Databricks' native ingestion capabilities combined with API-driven pipeline creation provide a powerful path from CRM data to governed Lakeflow tables. What traditionally required custom Python scripts, complex change data capture logic, and constant schema management can now be accomplished with configuration-driven pipelines. 

The key innovation is the combination of automatic schema evolution (with "select all columns"), serverless execution, and the flexibility of API-based deployment for advanced use cases like table renaming. This pattern is directly applicable to any organization needing to bring Salesforce data into Databricks for analytics and AI. 

However, this approach has clear boundaries: CDC Type 1 only (latest replica), serverless compute only, and UI limitations around table naming. Understanding these limitations is critical for setting the right expectations and designing appropriate downstream patterns for history tracking and audit compliance. 

The API-based workaround for table renaming using JSON payloads and Databricks API represents a reusable pattern for extending native connectors beyond UI limitations. This same pattern can be applied to other connectors (SQL Server, Google, etc.) where similar constraints exist. 

The complete solutionconnection management, pipeline definition, API deployment, and scheduler integration is production-ready and can be adapted for any Salesforce data ingestion use case. 

It all starts with one powerful idea: 

"Let the platform handle the ingestion—and let the Lakeflow handle the governance."


Topics

Databricks Lakeflow Salesforce Connector Data Ingestion Data Engineering ETL Automation
LinkedIn X/Twitter Facebook
×

Start a Conversation

Our team will get back to you shortly.