If you've ever tried to bring Salesforce data into a modern data platform, you know the challenges: complex authentication, schema drift, table naming constraints, and the constant need to maintain parity between source and target systems. Traditional ETL approaches require custom code, constant maintenance, and struggle to keep up with changes in the source system.
That's exactly the problem we set out to solve using Databricks' native Salesforce ingestion pipeline with an API-driven approach.
The Problem: Hidden Complexity in Salesforce Data Integration
At a high level, the problem sounds simple, but implementing it is a different story:
- Connect to Salesforce environments securely
- Ingest multiple tables and views reliably
- Handle schema changes automatically
- Maintain latest replica of source data
- Deliver governed data in Unity Catalog
Organizations rely on Salesforce as their source of truth for sales, marketing, and customer data. But getting that data into a Data Lakeflow for analytics and AI is surprisingly difficult. Data teams face challenges with:
|
Challenge |
Impact |
|
Manual connection setup |
Days lost to configuration and troubleshooting |
|
Schema drift in source tables |
Broken pipelines and missing data |
|
Table naming restrictions |
Cannot rename tables for dimensional modeling |
|
Complex authentication |
Security risks and access management overhead |
|
No automatic schema evolution |
Pipeline failures when source adds columns |
|
Inconsistent data freshness |
Stale data for downstream analytics |
We wanted a way to bridge that gap, automating Salesforce ingestion with zero custom code for schema management, while maintaining full governance and control over the destination tables.
Phase 1: Native Salesforce Connector with UI-Based Pipeline
Databricks provides a built-in ingestion pipeline feature with connectors for multiple sources including Salesforce, SQL Server, Google, and others. The UI-based approach is the fastest way to get started.
Setting Up the Connection
First, you need a connection to your Salesforce environment. The database admin team creates this connection using:
- Login credentials
- Server details
- Authentication URL
- Client ID
Note: Connection details are sensitive and should be managed securely with appropriate access controls.
Creating the Pipeline
Once the connection is ready, you create a pipeline:
- Select Connection: Choose the Salesforce connection from available options in your workspace
- Configure Target: Specify the target catalog in Unity Catalog (must be a managed table, not external storage)
- Define Source: Select tables or views from Salesforce that your connection has access to
The three-namespace structure in Databricks:
Catalog → Schema → Table
Selecting Columns and Handling Schema Evolution
A critical decision point is column selection:
|
Selection Type |
Behavior |
Best For |
|
Select All Columns |
Automatically adds new source columns to target; no pipeline modification needed |
Production pipelines where source schema changes frequently |
|
Select Specific Columns |
New source columns require pipeline modification; manual intervention needed |
Stable sources with well-defined, unchanging schemas |
Key Insight: If you select all columns, schema merge happens automatically. When a new column is added in Salesforce, it will be automatically populated to your target without any code changes or pipeline modifications.
Scheduling and Orchestration
The UI allows you to set up schedules:
- Simple daily runs
- Specific time triggers
- Notification configurations for pipeline status
Critical Limitation: Table Name Constraints
The UI limitation: Whatever table name exists in Salesforce becomes the table name in your destination. You cannot rename tables to:
- Segregate facts and dimensions
- Apply consistent naming conventions
- Create business-friendly table names
Example: A Salesforce table named "Account" cannot be renamed to "dim_account" in the target.
Phase 2: Streaming Tables and Change Data Capture
What Gets Created?
The pipeline creates a managed streaming table, not a standard managed Delta table.
Why streaming table?
- You must specify key columns in the dataset
- Based on these key columns, the pipeline detects changes
- It always brings the latest record, with no history maintenance
How Change Data Capture Works
The pipeline implements CDC Type 1 (overwrite):
|
Scenario |
Behavior |
|
Record Update |
Existing record in target (e.g., ID=1) is updated with latest source values |
|
Record Delete |
The record is removed from the target if deleted in the source |
|
New Record |
Record is inserted into target |
Important: This maintains only the latest replica of source data. It does not maintain historical changes (CDC Type 2). For audit trails or slowly changing dimensions, additional logic would be needed.
Data Flow
Salesforce Source → Ingestion Pipeline (Serverless) → Managed Streaming Table (Unity Catalog)
Critical technical constraint: All Salesforce ingestion pipelines run on serverless compute only. You cannot use job clusters for this connector.
Phase 3: API-Driven Pipeline Creation for Advanced Control
The UI approach works for basic use cases, but when you need advanced capabilities like renaming tables, adding filters, and custom configurations, you need the Databricks API approach.
Why API-Driven?
|
Capability |
UI Mode |
API Mode |
|
Rename destination tables |
❌ Not supported |
✅ Supported |
|
Add custom filters |
❌ Limited |
✅ Full support |
|
Modify table names for dimensional modeling |
❌ Not possible |
✅ Fully configurable |
|
Custom notifications |
✅ Basic |
✅ Advanced |
|
JSON-based pipeline definition |
❌ Not available |
✅ Native |
The API Implementation Pattern
We implemented the API approach using a notebook with these components:
1. Pipeline Definition as JSON
The pipeline is defined entirely as a JSON payload:
json
{
"name": "salesforce_ingestion_pipeline",
"serverless": true,
"ingestion_definition": {
"connection_name": "salesforce_connection",
"source": {
"schema": "Salesforce_Schema",
"object": "Account"
},
"destination": {
"catalog": "ingestion_platform",
"schema": "merchandising",
"table_name": "dim_account" // Renamed from source!
}
},
"notifications": [...],
"schedule": {...}
}
2. Key Configuration Elements
|
Element |
Description |
|
serverless: true |
Mandatory for Salesforce ingestion |
|
connection_name |
Reference to pre-configured Salesforce connection |
|
source.object |
Salesforce table or view to ingest |
|
destination.table_name |
Custom name in target (API-only feature) |
|
notifications |
Alert configurations for pipeline status |
3. API Deployment
Once the JSON payload is ready, we post it to the Databricks API:
python
# POST request to Databricks API
# Headers include authentication tokens
# Body contains pipeline definition JSON
# Endpoint: /api/2.0/pipelines
response = requests.post(
f"{databricks_workspace_url}/api/2.0/pipelines",
headers=authentication_headers,
json=pipeline_definition
)
The API authenticates using appropriate tokens and creates the pipeline based on the definition. This approach is fully programmable and can be integrated into CI/CD pipelines.
Integration with External Schedulers
Rather than using Databricks' built-in scheduler, we integrated with Control-M (or any enterprise scheduler) to:
- Maintain consistent orchestration across all data pipelines
-
Leverage existing monitoring and alerting infrastructure
-
Provide unified observability across ingestion sources
Where to Use — Ideal Use Cases
Salesforce Data Lakeflow Foundation
Ingest all Salesforce objects (Account, Contact, Opportunity, Lead, Case) into Unity Catalog as the foundation for enterprise analytics and AI.
Real-Time Customer 360
Maintain latest replica of customer data from Salesforce for real-time personalization, segmentation, and customer analytics.
Sales Performance Analytics
Ingest opportunity and pipeline data for sales forecasting, territory analysis, and quota tracking with governed data in Delta Lake.
Marketing Attribution
Combine Salesforce campaign data with other sources for multi-touch attribution and ROI analysis.
Compliance and Auditing
Create governed, versioned copies of Salesforce data for regulatory compliance, audit trails, and historical analysis.
AI-Powered Sales Intelligence
Feed fresh Salesforce data into AI models for lead scoring, churn prediction, and next-best-action recommendations.
Where NOT to Use — Limitations & Boundaries
CDC Type 1 Only
This pipeline maintains only the latest replica, with no history. For slowly changing dimensions (SCD Type 2) or audit history, you need additional change data capture and history tables.
Serverless Only Constraint
Salesforce ingestion pipelines run exclusively on serverless compute. You cannot use job clusters, which may impact cost modeling and compute customization.
No UI Table Renaming
The UI does not support renaming tables. This is a significant limitation for teams with dimensional modeling standards. API is the only workaround.
Schema Evolution with Selective Columns
If you select specific columns (not all), new source columns require manual pipeline modification. Automatic schema merge works only with "select all columns."
Delete Propagation
Deletes in the source propagate to the target. This may be undesirable for compliance or audit use cases where you need to retain records even after source deletion.
Latency Considerations
Batch-based ingestion with scheduling (not real-time streaming). For sub-second latency requirements, consider Salesforce Streaming API instead.
|
Aspect |
Suitable ✅ |
Not Suitable ❌ |
|
Ingestion Pattern |
Batch, scheduled |
Real-time streaming (< 1 sec) |
|
Change Capture |
CDC Type 1 (latest replica) |
CDC Type 2 (history tracking) |
|
Table Naming |
API mode supports renaming |
UI mode restricts naming |
|
Compute |
Serverless only |
Job clusters |
|
Schema Evolution |
Auto with "select all" |
Manual with column selection |
|
Deletion Handling |
Propagates to target |
Retain deleted records |
Results & Implementation Insights
What We Built
- API-based pipeline creation for Salesforce ingestion with full control over table naming
- Serverless streaming tables maintaining latest replica of source data
- Integration with Control-M for enterprise scheduling and monitoring
- Unity Catalog governance with managed tables and proper access controls
Key Capabilities Delivered
|
Capability |
Implementation |
|
Table Renaming |
API mode with custom destination table names |
|
Automatic Schema Evolution |
"Select all columns" approach |
|
Schedule Management |
External scheduler (Control-M) integration |
|
Authentication |
Secure connection managed by DB admin team |
|
Data Freshness |
Configurable based on schedule |
Pipeline Characteristics
-
Compute: Serverless (mandatory for Salesforce connector)
-
Storage: Managed streaming tables in Unity Catalog
-
Change Capture: CDC Type 1 (latest replica only)
-
Scale: Supports multiple tables per single pipeline
What We Learned: Challenges and Solutions
|
Issue |
Root Cause |
Fix |
|
No destination table renaming |
UI limitation in native connector |
Switched to API-based pipeline creation with JSON payload |
|
No automatic schema merge with selective columns |
Design choice in connector |
Used "select all columns" approach for automatic evolution |
|
Serverless compute confusion |
Unclear documentation |
Documented as mandatory for Salesforce ingestion |
|
Delete propagation concerns |
CDC Type 1 behavior |
Added downstream history tables for audit requirements |
|
Scheduling flexibility limitations |
Built-in scheduler constraints |
Integrated with Control-M enterprise scheduler |
|
Authentication complexity |
Multiple authentication methods |
Centralized connection management by DB admin team |
Why This Pattern Matters for Any Organization
By combining Databricks' native Salesforce connector with API-driven pipeline creation, we've built a scalable, governed integration pattern that goes beyond simply moving data:
From Siloed CRM to Unified Lakeflow
Salesforce data becomes part of the governed data estate, joining other enterprise data sources for comprehensive analytics.
From Manual Schema Management to Automated Evolution
Schema changes in Salesforce no longer break pipelines. Automatic evolution handles new columns seamlessly.
From Rigid Naming to Flexible Modeling
API mode unlocks the ability to rename tables for dimensional modeling, fact/dimension segregation, and consistent naming conventions.
From Point-to-Point Integration to Platform Capability
Salesforce ingestion becomes a reusable service that multiple teams can leverage, with proper governance and observability.
Key Takeaways & Best Practices
Architecture Decisions
- Always use "select all columns" for production pipelines to benefit from automatic schema evolution
- Use API mode when you need table renaming - UI mode cannot satisfy dimensional modeling requirements
- Plan for CDC Type 1 behaviour - add history tables separately if audit trails are required
- Integrate with enterprise schedulers for consistent orchestration across all data pipelines
- Centralize connection management - let DB admin team handle credentials and access
When to Choose UI vs. API
|
Choose UI Mode When... |
Choose API Mode When... |
|
Source table names are acceptable as-is |
You need to rename tables for modeling |
|
Quick prototyping and testing |
Production pipelines with standards |
|
Simple single-table ingestion |
Complex multi-table with custom filters |
|
Limited engineering resources |
Full CI/CD integration required |
|
No dimensional modeling requirements |
Fact/dimension segregation needed |
Production Recommendations
- Validate CDC behavior: ensure delete propagation aligns with your use case
- Monitor pipeline performance - serverless compute has different characteristics than job clusters
- Implement data quality checks on target tables after each ingestion run
- Document table naming conventions for API-created pipelines
- Test schema evolution with "select all columns" before production deployment
- Set up notifications for pipeline failures and schema changes
Conclusion
The Lakeflow - Salesforce Connector pattern demonstrates that Databricks' native ingestion capabilities combined with API-driven pipeline creation provide a powerful path from CRM data to governed Lakeflow tables. What traditionally required custom Python scripts, complex change data capture logic, and constant schema management can now be accomplished with configuration-driven pipelines.
The key innovation is the combination of automatic schema evolution (with "select all columns"), serverless execution, and the flexibility of API-based deployment for advanced use cases like table renaming. This pattern is directly applicable to any organization needing to bring Salesforce data into Databricks for analytics and AI.
However, this approach has clear boundaries: CDC Type 1 only (latest replica), serverless compute only, and UI limitations around table naming. Understanding these limitations is critical for setting the right expectations and designing appropriate downstream patterns for history tracking and audit compliance.
The API-based workaround for table renaming using JSON payloads and Databricks API represents a reusable pattern for extending native connectors beyond UI limitations. This same pattern can be applied to other connectors (SQL Server, Google, etc.) where similar constraints exist.
The complete solution: connection management, pipeline definition, API deployment, and scheduler integration is production-ready and can be adapted for any Salesforce data ingestion use case.
It all starts with one powerful idea:
"Let the platform handle the ingestion—and let the Lakeflow handle the governance."
LinkedIn