Transformation & Data Modeling
Data Catalog
Structure of raw ingestion tables, primarily loaded via Airbyte and custom loaders.
- Airbyte Google Ads: Raw Google Ads data.
- Airbyte Criteo: Raw Criteo advertising data.
- Airbyte Bing Ads: Raw Bing Ads data.
- Airbyte Marketing Budget: Marketing budget data.
- Airbyte theCRM: Raw CRM data ingestion.
- Web Lead Loader: Ingestion of web leads.
- GA4 Loader: Google Analytics 4 event data.
- Facebook Ads Loader: Raw Facebook Ads data.
- Inventory Loader: Vehicle inventory data.
- Airbyte Trusted Feed: Trusted feed data ingestion.
- IDS Loader: IDS data ingestion.
- Blueshift Loader: Blueshift marketing data.
- Blueshift User Import: User import data from Blueshift.
- Rudderstack Loader: Event stream data from Rudderstack.
- Braze Loader: Braze engagement data.
- Airbyte GSC: Google Search Console data.
cleaned
Definitions of core transformation tables.
- analytics_attribution: Rolls up raw GA4 events into sessions, devices, conversions, and vehicles.
- the_crm: Rolls up CRM activities into customers, leads, and opportunities.
- web_inventory_import: Adds location data to website inventory for import into Braze.
- trusted_feed_for_braze_import: Prepares the trusted feed for import into Braze.
- lead_funnel: Combines website data with ad spend and CRM data.
- ids_cleanse: Cleans IDS data and adds location information.
- blueshift_cleanse: Cleans Blueshift data, adds location information, and consolidates duplicate fields.
- braze_cleanse: Cleans Braze current exports.
- google_search_console: Cleans GSC data and adds location information.
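The "consolidates duplicate fields" step in blueshift_cleanse can be pictured as a coalesce across redundant columns. A minimal pure-Python sketch, with invented field names (the real transformation runs in the warehouse):

```python
def coalesce(*values):
    """Return the first non-empty value, mirroring SQL COALESCE."""
    for v in values:
        if v not in (None, ""):
            return v
    return None

# Invented example record with duplicate email fields, as might arrive
# from overlapping Blueshift exports.
record = {"email": "", "email_address": "user@example.com", "user_email": None}

cleaned_email = coalesce(record["email"], record["email_address"], record["user_email"])
# cleaned_email == "user@example.com"
```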
rudderstack_prod
Production database for RudderStack identity resolution and event storage.
- user_profiles: The primary output of the RudderStack Profiles project. This schema contains:
  - Identity Graph: Tables resolving `rudder_id` to a canonical `USER_MAIN_ID`, linking users across devices and sessions.
  - Feature Views: Computed user attributes and behavioral features defined in the project configuration.
  - Cohorts: Segments of users based on defined criteria (e.g., `_ALL`, `_COHORT_NAME`).

  For more details, see the RudderStack Warehouse Output Documentation.
- website_events: Raw event stream data ingested from website sources. Used as input for identity resolution. For more details on the schema, see the RudderStack Warehouse Schema Documentation.
- rvcomplete_mobile_app: Raw event stream data ingested from the RV Complete mobile application.
- _rudderstack: Internal system tables and metadata used by RudderStack for state management and processing.
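To illustrate how the identity graph stitches events together, here is a minimal pure-Python sketch (the real resolution runs in the warehouse; all rows and IDs below are invented examples):

```python
# Identity graph: per-device rudder_id values mapped to a canonical
# USER_MAIN_ID. All rows here are invented.
identity_graph = {
    "rudder-aaa": "USER_1",   # phone
    "rudder-bbb": "USER_1",   # laptop, same person
    "rudder-ccc": "USER_2",
}

website_events = [
    {"rudder_id": "rudder-aaa", "event": "page_view"},
    {"rudder_id": "rudder-bbb", "event": "lead_submit"},
    {"rudder_id": "rudder-ccc", "event": "page_view"},
]

def resolve(events, graph):
    """Attach the canonical USER_MAIN_ID to each raw event."""
    return [{**e, "USER_MAIN_ID": graph.get(e["rudder_id"])} for e in events]

resolved = resolve(website_events, identity_graph)
# Events from both of USER_1's devices now share one canonical ID.
user_1_events = [e["event"] for e in resolved if e["USER_MAIN_ID"] == "USER_1"]
```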
Workflows & Pipelines
List of Scheduled Jobs
You can see the daily timing of transformations here: Transformation Timing Spreadsheet
Into Databricks
| Job | Time (PST) | Average TTF |
|---|---|---|
| Airbyte Google Ads | 1:15 AM | 20 mins |
| Airbyte Criteo | 1:45 AM | 3 mins |
| Airbyte Bing Ads | 2:00 AM | 4 mins |
| Airbyte Marketing Budget | 2:10 AM | 4 mins |
| Airbyte theCRM | 2:20 AM | 20 mins |
| Web Lead Loader | 3:00 AM | 2 mins |
| GA4 Loader | 11:30 AM | 12 mins |
| Facebook Ads Loader | 11:50 AM | 2 mins |
| Inventory Loader | 12:30 AM | 10 mins |
| Airbyte Trusted Feed | 12:00 AM | 4 mins |
| IDS Loader | 3:30 AM | 10 mins |
| Blueshift Loader | 5:00 AM | 31 mins |
| Blueshift User Import | 5:45 AM | 12 mins |
| Rudderstack Loader | 12:30 AM, 6:30 AM, 12:30 PM, 6:30 PM | 2 mins |
| Braze Loader | 11:00 AM | 6 mins |
| Airbyte GSC | 7:00 AM | 10 mins |
Pipelines
| Job | Time (PST) | Average TTF |
|---|---|---|
| analytics_attribution | 12:00 PM | 20 mins |
| the_crm | 6:30 AM | 10 mins |
| web_inventory_import | 12:50 AM | 2 mins |
| trusted_feed_for_braze_import | 12:10 AM | 2 mins |
| ids_cleanse | 4:00 AM | 10 mins |
| blueshift_cleanse | 6:15 AM | 5 mins |
| braze_cleanse | 11:15 AM | - |
| google_search_console | 7:30 AM | 4 mins |
| lead_funnel | 1:00 PM | 6 mins |
| purchase_events_to_braze | 2:30 AM | 2 mins |
External Jobs
| Job | Time (PST) | Average TTF |
|---|---|---|
| Rudderstack Profiles Project | 12:45 PM | 40 mins |
| Rudderstack Audience to Braze | 1:30 PM | - |
| Braze: Inventory Import | 3:00 AM | 5 mins |
| Braze: Locations Import | 12:20 PM | 5 mins |
| Braze: Purchase Events Import | 2:45 PM | 5 mins |
Delta Live Tables (DLT)
Delta Live Tables (DLT), also known as Lakeflow Spark Declarative Pipelines, is a framework for building reliable, maintainable, and testable data processing pipelines. It automatically orchestrates your data processing tasks and manages your infrastructure.
Key Concepts
- Flows: The foundational data processing unit. A flow reads data from a source, applies logic, and writes to a target. It supports both streaming and batch semantics.
- Streaming Tables: Managed tables that serve as streaming targets. They can accept append-only data and are ideal for ingestion.
- Materialized Views: Managed tables that are precomputed based on a query. They are updated incrementally where possible, making them efficient for transformations and aggregations.
- Pipelines: The unit of deployment. A pipeline groups flows, tables, and views together and manages their execution order and dependencies.
For more information, see the Lakeflow Spark Declarative Pipelines concepts documentation.
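As a sketch of the concepts above, a streaming table defined in a Python DLT pipeline might look like the following. The table name and storage path are hypothetical examples, and the code only runs inside a Databricks DLT pipeline (where `spark` is predefined):

```python
import dlt

# Streaming table: an append-only ingestion target.
# "raw_web_events" and the volume path are hypothetical.
@dlt.table(comment="Raw web events ingested incrementally from cloud storage.")
def raw_web_events():
    return (
        spark.readStream
        .format("cloudFiles")                 # Auto Loader: incremental file discovery
        .option("cloudFiles.format", "json")
        .load("/Volumes/raw/web_events/")
    )
```

DLT infers the dependency graph from these function definitions, so the pipeline (the unit of deployment) orders and executes its flows automatically.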
Materialized Views in Detail
A materialized view is a declarative pipeline object that includes a query defining the view, a flow that updates it, and cached results.
Benefits:
- Incremental Processing: Only processes changed data when possible, reducing cost and latency.
- Automatic Maintenance: Databricks manages the underlying tables and ensures consistency.
- Performance: Queries against materialized views are faster as results are precomputed.
For further reading, see the Materialized views documentation.
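A materialized view in a Python DLT pipeline is simply a batch query wrapped in the same decorator; DLT refreshes it incrementally where possible. The table names below are hypothetical, and this snippet only runs inside a Databricks DLT pipeline:

```python
import dlt
from pyspark.sql.functions import count

# Materialized view: precomputed from a batch query over an upstream
# pipeline table. "daily_event_counts" and "raw_web_events" are hypothetical.
@dlt.table(comment="Daily event counts, maintained automatically by DLT.")
def daily_event_counts():
    return (
        dlt.read("raw_web_events")            # batch read of an upstream table
        .groupBy("event_date")
        .agg(count("*").alias("events"))
    )
```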