
Transformation & Data Modeling

Structure of the raw ingestion tables, loaded primarily via Airbyte and custom loaders.

  • Airbyte Google Ads: Raw Google Ads data.
  • Airbyte Criteo: Raw Criteo advertising data.
  • Airbyte Bing Ads: Raw Bing Ads data.
  • Airbyte Marketing Budget: Marketing budget data.
  • Airbyte theCRM: Raw CRM data ingestion.
  • Web Lead Loader: Ingestion of web leads.
  • GA4 Loader: Google Analytics 4 event data.
  • Facebook Ads Loader: Raw Facebook Ads data.
  • Inventory Loader: Vehicle inventory data.
  • Airbyte Trusted Feed: Trusted feed data ingestion.
  • IDS Loader: IDS data ingestion.
  • Blueshift Loader: Blueshift marketing data.
  • Blueshift User Import: User import data from Blueshift.
  • Rudderstack Loader: Event stream data from Rudderstack.
  • Braze Loader: Braze engagement data.
  • Airbyte GSC: Google Search Console data.

Definitions of core transformation tables.

  • analytics_attribution Rolls up raw GA4 events into sessions, devices, conversions, and vehicles.

  • the_crm Rolls up CRM activities into customers, leads, and opportunities.

  • web_inventory_import Adds location data to website inventory for import to Braze.

  • trusted_feed_for_braze_import Prepares the trusted feed for import into Braze.

  • lead_funnel Combines website data with ad spend and CRM data.

  • ids_cleanse Cleans IDS data and adds location information (a sketch of this cleanse pattern follows the list).

  • blueshift_cleanse Cleans Blueshift data, adds location information, and consolidates duplicate fields.

  • braze_cleanse Cleans Braze current exports.

  • google_search_console Cleans GSC data and adds location information.
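Most of these transformations share a "cleanse and enrich" shape: deduplicate the raw feed, normalize key fields, and join in location data. Below is a minimal PySpark sketch of that pattern, modeled loosely on ids_cleanse; every table name, key, and column here is an assumption for illustration, not the production schema.

```python
# Hypothetical "cleanse and enrich" transform in the style of ids_cleanse.
# All table names, keys, and columns are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw_ids = spark.table("raw.ids")            # assumed raw IDS ingestion table
locations = spark.table("dim.locations")    # assumed location dimension

ids_cleanse = (
    raw_ids
    .dropDuplicates(["record_id"])                   # drop duplicate records
    .withColumn("email", F.lower(F.trim("email")))   # normalize a key field
    .join(locations, on="location_id", how="left")   # attach location info
)

ids_cleanse.write.mode("overwrite").saveAsTable("transform.ids_cleanse")
```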

Production database for RudderStack identity resolution and event storage.

  • user_profiles The primary output of the RudderStack Profiles project. This schema contains:

    • Identity Graph: Tables resolving rudder_id to a canonical USER_MAIN_ID, linking users across devices and sessions (a query sketch follows this list).
    • Feature Views: Computed user attributes and behavioral features defined in the project configuration.
    • Cohorts: Segments of users based on defined criteria (e.g., _ALL, _COHORT_NAME).

    For more details, see the RudderStack Warehouse Output Documentation.

  • website_events Raw event stream data ingested from website sources. Used as input for identity resolution.

    For more details on the schema, see the RudderStack Warehouse Schema Documentation.

  • rvcomplete_mobile_app Raw event stream data ingested from the RV Complete mobile application.

  • _rudderstack Internal system tables and metadata used by RudderStack for state management and processing.
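
The identity graph is what makes cross-device joins possible: any table keyed by rudder_id can be mapped to the canonical USER_MAIN_ID. A hedged PySpark sketch of that lookup follows; the graph table and its columns (user_id_graph, other_id, other_id_type, user_main_id) are assumptions based on typical RudderStack Profiles output, not the confirmed schema.

```python
# Hypothetical resolution of raw event identities to canonical user IDs.
# Table and column names are assumptions, not the confirmed schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = spark.table("website_events.tracks")        # assumed event table
graph = spark.table("user_profiles.user_id_graph")   # assumed identity graph

# Keep only the graph edges that map a rudder_id to a USER_MAIN_ID.
rudder_edges = graph.where(graph.other_id_type == "rudder_id")

resolved = (
    events.join(rudder_edges, events.rudder_id == rudder_edges.other_id, "left")
          .select(rudder_edges.user_main_id, *[events[c] for c in events.columns])
)
```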

You can see the daily timing of transformations here: Transformation Timing Spreadsheet

Ingestion jobs:

Job                      | Time (PST)                            | Average TTF
Airbyte Google Ads       | 1:15 AM                               | 20 mins
Airbyte Criteo           | 1:45 AM                               | 3 mins
Airbyte Bing Ads         | 2:00 AM                               | 4 mins
Airbyte Marketing Budget | 2:10 AM                               | 4 mins
Airbyte theCRM           | 2:20 AM                               | 20 mins
Web Lead Loader          | 3:00 AM                               | 2 mins
GA4 Loader               | 11:30 AM                              | 12 mins
Facebook Ads Loader      | 11:50 AM                              | 2 mins
Inventory Loader         | 12:30 AM                              | 10 mins
Airbyte Trusted Feed     | 12:00 AM                              | 4 mins
IDS Loader               | 3:30 AM                               | 10 mins
Blueshift Loader         | 5:00 AM                               | 31 mins
Blueshift User Import    | 5:45 AM                               | 12 mins
Rudderstack Loader       | 12:30 AM, 6:30 AM, 12:30 PM, 6:30 PM  | 2 mins
Braze Loader             | 11:00 AM                              | 6 mins
Airbyte GSC              | 7:00 AM                               | 10 mins
Transformation jobs:

Job                           | Time (PST) | Average TTF
analytics_attribution         | 12:00 PM   | 20 mins
the_crm                       | 6:30 AM    | 10 mins
web_inventory_import          | 12:50 AM   | 2 mins
trusted_feed_for_braze_import | 12:10 AM   | 2 mins
ids_cleanse                   | 4:00 AM    | 10 mins
blueshift_cleanse             | 6:15 AM    | 5 mins
braze_cleanse                 | 11:15 AM   | -
google_search_console         | 7:30 AM    | 4 mins
lead_funnel                   | 1:00 PM    | 6 mins
purchase_events_to_braze      | 2:30 AM    | 2 mins
RudderStack and Braze jobs:

Job                           | Time (PST) | Average TTF
Rudderstack Profiles Project  | 12:45 PM   | 40 mins
Rudderstack Audience to Braze | 1:30 PM    | -
Braze: Inventory Import       | 3:00 AM    | 5 mins
Braze: Locations Import       | 12:20 PM   | 5 mins
Braze: Purchase Events Import | 2:45 PM    | 5 mins

Delta Live Tables (DLT), also known as Lakeflow Spark Declarative Pipelines, is a framework for building reliable, maintainable, and testable data processing pipelines. It automatically orchestrates your data processing tasks and manages your infrastructure.

  • Flows: The foundational data processing unit. A flow reads data from a source, applies logic, and writes to a target. It supports both streaming and batch semantics.
  • Streaming Tables: Managed tables that serve as streaming targets. They can accept append-only data and are ideal for ingestion.
  • Materialized Views: Managed tables that are precomputed based on a query. They are updated incrementally where possible, making them efficient for transformations and aggregations.
  • Pipelines: The unit of deployment. A pipeline groups flows, tables, and views together and manages their execution order and dependencies.

For more information, see the Lakeflow Spark Declarative Pipelines concepts documentation.
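
As a concrete illustration, here is a minimal pipeline sketch using the DLT Python API: one streaming table for ingestion and one materialized view derived from it. The source path and event columns are assumptions for illustration; `spark` is provided by the pipeline runtime.

```python
# Minimal declarative pipeline: a streaming table feeding a materialized view.
# The landing path and event columns are illustrative assumptions.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Streaming table: append-only ingestion of raw JSON events.")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")   # Auto Loader picks up new files
        .option("cloudFiles.format", "json")
        .load("/Volumes/raw/events/")           # assumed landing location
    )

@dlt.table(comment="Materialized view: daily event counts over raw_events.")
def daily_event_counts():
    return (
        dlt.read("raw_events")                  # batch read => materialized view
        .groupBy(F.to_date("event_ts").alias("event_date"))
        .count()
    )
```

Because raw_events returns a streaming query, the pipeline materializes it as a streaming table; daily_event_counts reads it as a batch source, so it becomes a materialized view refreshed on each pipeline update.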

A materialized view is a declarative pipeline object that consists of a query defining the view, a flow that keeps it up to date, and cached results.

Benefits:

  • Incremental Processing: Only processes changed data when possible, reducing cost and latency.
  • Automatic Maintenance: Databricks manages the underlying tables and ensures consistency.
  • Performance: Queries against materialized views are faster as results are precomputed.

For further reading, see the Materialized views documentation.
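
Materialized views can also be defined outside a pipeline with Databricks SQL. Below is a hedged sketch, assuming a Unity Catalog workspace where materialized views are supported; the view and source table names are illustrative.

```python
# Hypothetical standalone materialized view over the lead_funnel table.
# Assumes an environment where CREATE MATERIALIZED VIEW is available.
spark.sql("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.daily_lead_counts AS
    SELECT CAST(lead_created_at AS DATE) AS lead_date,
           COUNT(*)                      AS leads
    FROM analytics.lead_funnel
    GROUP BY CAST(lead_created_at AS DATE)
""")

# Trigger a refresh; Databricks applies an incremental update when it can.
spark.sql("REFRESH MATERIALIZED VIEW analytics.daily_lead_counts")
```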