Transformation & Data Modeling
Data Catalog
Structure of raw ingestion tables, primarily loaded via Airbyte and custom loaders.
- Airbyte Google Ads: Raw Google Ads data.
- Airbyte Criteo: Raw Criteo advertising data.
- Airbyte Bing Ads: Raw Bing Ads data.
- Airbyte Marketing Budget: Marketing budget data.
- Airbyte theCRM: Raw CRM data ingestion.
- Web Lead Loader: Ingestion of web leads.
- GA4 Loader: Google Analytics 4 event data.
- Facebook Ads Loader: Raw Facebook Ads data.
- Inventory Loader: Vehicle inventory data.
- Airbyte Trusted Feed: Trusted feed data ingestion.
- IDS Loader: IDS data ingestion.
- Blueshift Loader: Blueshift marketing data.
- Blueshift User Import: User import data from Blueshift.
- Rudderstack Loader: Event stream data from Rudderstack.
- Braze Loader: Braze engagement data.
- Airbyte GSC: Google Search Console data.
cleaned
Definitions of core transformation tables.
- analytics_attribution: Rolls up raw GA4 events into sessions, devices, conversions, and vehicles.
- the_crm: Rolls up CRM activities into customers, leads, and opportunities.
- web_inventory_import: Adds location data to website inventory for import into Braze.
- trusted_feed_for_braze_import: Prepares the trusted feed for import into Braze.
- lead_funnel: Combines website data with ad spend and CRM data.
- ids_cleanse: Cleans IDS data and adds location information.
- blueshift_cleanse: Cleans Blueshift data, adds location information, and consolidates duplicate fields.
- braze_cleanse: Cleans Braze current exports.
- google_search_console: Cleans GSC data and adds location information.
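The "consolidates duplicate fields" step in blueshift_cleanse can be pictured as a coalesce across redundant columns. A minimal pure-Python sketch, with invented field names (the real transformation runs in the warehouse):

```python
def coalesce(*values):
    """Return the first non-empty value, mirroring SQL COALESCE."""
    for v in values:
        if v not in (None, ""):
            return v
    return None

# Invented example record with duplicate email fields, as might arrive
# from overlapping Blueshift exports.
record = {"email": "", "email_address": "user@example.com", "user_email": None}

cleaned_email = coalesce(record["email"], record["email_address"], record["user_email"])
# cleaned_email == "user@example.com"
```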
rudderstack_prod
Production database for RudderStack identity resolution and event storage.
- user_profiles: The primary output of the RudderStack Profiles project. This schema contains:
  - Identity Graph: Tables resolving `rudder_id` to a canonical `USER_MAIN_ID`, linking users across devices and sessions.
  - Feature Views: Computed user attributes and behavioral features defined in the project configuration.
  - Cohorts: Segments of users based on defined criteria (e.g., `_ALL`, `_COHORT_NAME`).

  For more details, see the RudderStack Warehouse Output Documentation.
- website_events: Raw event stream data ingested from website sources. Used as input for identity resolution. For more details on the schema, see the RudderStack Warehouse Schema Documentation.
- rvcomplete_mobile_app: Raw event stream data ingested from the RV Complete mobile application.
- _rudderstack: Internal system tables and metadata used by RudderStack for state management and processing.
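To illustrate how the identity graph stitches events together, here is a minimal pure-Python sketch (the real resolution runs in the warehouse; all rows and IDs below are invented examples):

```python
# Identity graph: per-device rudder_id values mapped to a canonical
# USER_MAIN_ID. All rows here are invented.
identity_graph = {
    "rudder-aaa": "USER_1",   # phone
    "rudder-bbb": "USER_1",   # laptop, same person
    "rudder-ccc": "USER_2",
}

website_events = [
    {"rudder_id": "rudder-aaa", "event": "page_view"},
    {"rudder_id": "rudder-bbb", "event": "lead_submit"},
    {"rudder_id": "rudder-ccc", "event": "page_view"},
]

def resolve(events, graph):
    """Attach the canonical USER_MAIN_ID to each raw event."""
    return [{**e, "USER_MAIN_ID": graph.get(e["rudder_id"])} for e in events]

resolved = resolve(website_events, identity_graph)
# Events from both of USER_1's devices now share one canonical ID.
user_1_events = [e["event"] for e in resolved if e["USER_MAIN_ID"] == "USER_1"]
```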
Workflows & Pipelines
List of Scheduled Jobs
You can see the daily timing of transformations here: Transformation Timing Spreadsheet
Into Databricks
| Job | Time (PST) | Average TTF |
|---|---|---|
| Airbyte Google Ads | 1:15 AM | 20 mins |
| Airbyte Criteo | 1:45 AM | 3 mins |
| Airbyte Bing Ads | 2:00 AM | 4 mins |
| Airbyte Marketing Budget | 2:10 AM | 4 mins |
| Airbyte theCRM | 2:20 AM | 20 mins |
| Web Lead Loader | 3:00 AM | 2 mins |
| GA4 Loader | 11:30 AM | 12 mins |
| Facebook Ads Loader | 11:50 AM | 2 mins |
| Inventory Loader | 12:30 AM | 10 mins |
| Airbyte Trusted Feed | 12:00 AM | 4 mins |
| IDS Loader | 3:30 AM | 10 mins |
| Blueshift Loader | 5:00 AM | 31 mins |
| Blueshift User Import | 5:45 AM | 12 mins |
| Rudderstack Loader | 12:30 AM, 6:30 AM, 12:30 PM, 6:30 PM | 2 mins |
| Braze Loader | 11:00 AM | 6 mins |
| Airbyte GSC | 7:00 AM | 10 mins |
Pipelines
| Job | Time (PST) | Average TTF |
|---|---|---|
| analytics_attribution | 12:00 PM | 20 mins |
| the_crm | 6:30 AM | 10 mins |
| web_inventory_import | 12:50 AM | 2 mins |
| trusted_feed_for_braze_import | 12:10 AM | 2 mins |
| ids_cleanse | 4:00 AM | 10 mins |
| blueshift_cleanse | 6:15 AM | 5 mins |
| braze_cleanse | 11:15 AM | - |
| google_search_console | 7:30 AM | 4 mins |
| lead_funnel | 1:00 PM | 6 mins |
| purchase_events_to_braze | 2:30 AM | 2 mins |
External Jobs
| Job | Time (PST) | Average TTF |
|---|---|---|
| Rudderstack Profiles Project | 12:45 PM | 40 mins |
| Rudderstack Audience to Braze | 1:30 PM | - |
| Braze: Inventory Import | 3:00 AM | 5 mins |
| Braze: Locations Import | 12:20 PM | 5 mins |
| Braze: Purchase Events Import | 2:45 PM | 5 mins |
Delta Live Tables (DLT)
Delta Live Tables (DLT), also known as Lakeflow Spark Declarative Pipelines, is a framework for building reliable, maintainable, and testable data processing pipelines. It automatically orchestrates your data processing tasks and manages your infrastructure.
Key Concepts
- Flows: The foundational data processing unit. A flow reads data from a source, applies logic, and writes to a target. It supports both streaming and batch semantics.
- Streaming Tables: Managed tables that serve as streaming targets. They can accept append-only data and are ideal for ingestion.
- Materialized Views: Managed tables that are precomputed based on a query. They are updated incrementally where possible, making them efficient for transformations and aggregations.
- Pipelines: The unit of deployment. A pipeline groups flows, tables, and views together and manages their execution order and dependencies.
For more information, see the Lakeflow Spark Declarative Pipelines concepts documentation.
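As a sketch of the concepts above, a streaming table defined in a Python DLT pipeline might look like the following. The table name and storage path are hypothetical examples, and the code only runs inside a Databricks DLT pipeline (where `spark` is predefined):

```python
import dlt

# Streaming table: an append-only ingestion target.
# "raw_web_events" and the volume path are hypothetical.
@dlt.table(comment="Raw web events ingested incrementally from cloud storage.")
def raw_web_events():
    return (
        spark.readStream
        .format("cloudFiles")                 # Auto Loader: incremental file discovery
        .option("cloudFiles.format", "json")
        .load("/Volumes/raw/web_events/")
    )
```

DLT infers the dependency graph from these function definitions, so the pipeline (the unit of deployment) orders and executes its flows automatically.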
Materialized Views in Detail
A materialized view is a declarative pipeline object that includes a query defining the view, a flow that updates it, and cached results.
Benefits:
- Incremental Processing: Only processes changed data when possible, reducing cost and latency.
- Automatic Maintenance: Databricks manages the underlying tables and ensures consistency.
- Performance: Queries against materialized views are faster as results are precomputed.
For further reading, see the Materialized views documentation.
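A materialized view in a Python DLT pipeline is simply a batch query wrapped in the same decorator; DLT refreshes it incrementally where possible. The table names below are hypothetical, and this snippet only runs inside a Databricks DLT pipeline:

```python
import dlt
from pyspark.sql.functions import count

# Materialized view: precomputed from a batch query over an upstream
# pipeline table. "daily_event_counts" and "raw_web_events" are hypothetical.
@dlt.table(comment="Daily event counts, maintained automatically by DLT.")
def daily_event_counts():
    return (
        dlt.read("raw_web_events")            # batch read of an upstream table
        .groupBy("event_date")
        .agg(count("*").alias("events"))
    )
```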