Identity & Activation (Rudderstack)
Overview
The Identity and Activation pipeline leverages RudderStack Profiles to create a unified view of the customer. The process involves three main stages:
- Collection: Ingesting raw events and data from various sources (Website, Mobile App, Databricks tables).
- Stitching & Profiling: Unifying user identities into a comprehensive graph and computing user features.
- Activation: Sending the enriched user profiles to downstream destinations like Braze and Databricks via Reverse ETL.
Identity Resolution
RudderStack Profiles solves the problem of fragmented customer data by stitching together known and unknown identifiers into a single canonical identity (`rudder_id`).
How the Identity Graph Works
Customer data often exists in silos:
- Unknown Users: A user browsing anonymously on the website (tracked via `anon_id`).
- Known Users: A user who has signed up or purchased, identified by `email`, `phone`, or `user_id`.
The Identity Graph links these identifiers across devices and sessions. For example:
- User visits website (Anonymous ID: `A1`).
- User signs up with email (Email: `user@example.com`, linked to `A1`).
- User logs in on mobile (Device ID: `D1`, linked to `user@example.com`).
RudderStack unifies `A1`, `user@example.com`, and `D1` into a single Identity Graph, revealing that these are all the same person.
Impact: This stitching process significantly consolidates user records. For example, a dataset might go from 9.48M raw identifiers to 2.64M unique profiles after stitching.
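As a rough illustration, you can verify this consolidation directly in Databricks. The query below assumes the ID stitcher output is materialized as a table with one row per (identifier, canonical ID) pair; the table and column names (`profiles.user_id_stitcher`, `other_id`, `rudder_id`) are placeholders for whatever your Profiles project actually generates.

```sql
-- Compare raw identifier volume to stitched profile volume.
-- Table and column names are assumptions; check your Profiles output schema.
SELECT
  COUNT(DISTINCT other_id)  AS raw_identifiers, -- e.g., ~9.48M
  COUNT(DISTINCT rudder_id) AS unique_profiles  -- e.g., ~2.64M
FROM profiles.user_id_stitcher;
```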
For more details on how profiles work, see the RudderStack Profiles documentation.
Configuration Guide (Profiles)
The Profiles project is defined by a collection of YAML files that specify how data is ingested, how features are defined, and how the project is structured. These files allow you to test logic locally and deploy to production.
Project Structure
- `inputs.yaml`: Identifies the tables in Databricks to include in the profile and specifies the columns containing identifying information.
- `profiles.yaml`: Defines the features (attributes) to compute for each profile using aggregate SQL statements.
- `pb_project.yaml`: The main configuration file that ties everything together, defining entities and ID types.
For a comprehensive guide on the project structure, see the RudderStack Project Structure documentation.
1. inputs.yaml
This file maps your warehouse tables to the identity graph. For each table, you must define:
- The path to the table in Databricks.
- The timestamp column (for ordering events).
- The columns that represent identifiers (e.g., email, phone, external IDs).
Example Configuration:
```yaml
- name: crm_contacts
  app_defaults:
    table: cleaned.the_crm.mv_crm_customers
    occurred_at_col: activity_date_created
    ids:
      - select: "lower(trim(email))"
        type: email
        entity: user
      - select: "phone::STRING"
        type: phone
        entity: user

- name: blueshift_users
  app_defaults:
    table: cleaned.blueshift.mv_users
    occurred_at_col: blueshift_join_date
    ids:
      - select: "blueshift_uuid::STRING"
        type: blueshift_user_id
        entity: user
      - select: "lower(trim(email))"
        type: email
        entity: user
```
2. profiles.yaml
This file defines the Features (traits) of a user. Features are computed by aggregating data from the input tables. You can use SQL logic to define these traits.
Example Configuration:
```yaml
- entity_var:
    name: braze_email_subscribe
    from: inputs/braze_subscriptions
    select: last_value(email_subscribe)
    where: email_subscribe IS NOT NULL AND email_address = {{user.main_email}}
    description: If the user has subscribed to receive email communications

- entity_var:
    name: email_subscribe
    select: coalesce({{user.braze_email_subscribe}}, {{user.blueshift_email_subscribe}})
    description: If the user has subscribed to receive email communications (Unified)
```
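Conceptually, each entity_var compiles to an aggregation grouped by the stitched canonical ID. The sketch below is a rough SQL analogue of `braze_email_subscribe`, not the SQL Profiles actually generates; the table names (`cleaned.braze.mv_subscriptions`, `profiles.user_id_stitcher`) and join logic are assumptions for illustration.

```sql
-- Rough analogue of the braze_email_subscribe entity_var above.
-- In the real model, row ordering comes from the input's occurred_at_col.
SELECT
  s.rudder_id,
  last_value(b.email_subscribe) AS braze_email_subscribe
FROM cleaned.braze.mv_subscriptions b
JOIN profiles.user_id_stitcher s
  ON lower(trim(b.email_address)) = s.other_id
WHERE b.email_subscribe IS NOT NULL
GROUP BY s.rudder_id;
```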
3. pb_project.yaml
This file configures the high-level entity definitions and rules for identifiers. It allows you to:
- Define the main entity (e.g., `user`).
- List all supported ID types.
- Filter/Exclude IDs: Use regex to exclude junk data (e.g., test emails, spam, placeholder values).
Example Configuration:
```yaml
entities:
  - name: user
    id_stitcher: models/user_id_stitcher
    id_types:
      - rs_anon_id
      - ga_user_id
      - email
      - phone

id_types:
  - name: email
    filters:
      - type: exclude
        regex: '.*(test|fake|spam|none|noone|noemail|needemail).*'
      - type: exclude
        regex: '^null|na|noreply|email|no|123@.*'
```
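Before deploying new exclusion patterns, it can help to preview what they would drop. The sketch below applies the same regexes with Spark SQL's RLIKE against one of the input tables from the inputs.yaml example; treat the query as illustrative only.

```sql
-- Preview which emails the pb_project.yaml filters would exclude.
SELECT
  lower(trim(email)) AS email,
  lower(trim(email)) RLIKE '.*(test|fake|spam|none|noone|noemail|needemail).*'
    OR lower(trim(email)) RLIKE '^null|na|noreply|email|no|123@.*' AS would_be_excluded
FROM cleaned.the_crm.mv_crm_customers
LIMIT 100;
```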
Activation (Reverse ETL)
Activation is the process of sending the unified customer profiles and enriched data back to downstream tools for action. We primarily use RudderStack Reverse ETL and Braze Cloud Data Ingestion (CDI).
RudderStack Reverse ETL
This pipeline syncs computed user profiles from the Databricks warehouse to destinations like Braze.
Workflow
- Create Audience: Define an audience by selecting a target table and applying filters (e.g., `main_email is set AND email_subscribe is set`); a warehouse-side sketch of this filter follows the list.
- Connect Destination: Link the audience to a destination (e.g., Braze Prod).
- Map Properties: Map the columns from your audience (Warehouse columns) to the fields in the destination (External ID, Custom Attributes).
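For intuition, the audience filter above amounts to a simple predicate over the computed profile table. This sketch assumes the profiles are materialized with `main_email` and `email_subscribe` columns; the table name `profiles.user_features` is a placeholder.

```sql
-- Rows this audience would sync: users with an email and a subscription value set.
SELECT main_email, email_subscribe
FROM profiles.user_features
WHERE main_email IS NOT NULL
  AND email_subscribe IS NOT NULL;
```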
For more details on activations, see the RudderStack Activation documentation.
Real-time Transformations
You can use JavaScript transformations to modify events in real-time before they reach the destination. This is useful for:
- Data Masking: Removing PII (e.g., phone numbers) before sending to analytics tools.
- Enrichment: Calling external APIs to add data to the event.
- Formatting: Restructuring the event payload to match destination requirements.
Example Transformation:
```javascript
export function transformEvent(event) {
  // Mask phone number
  if (event.context && event.context.traits && event.context.traits.phone) {
    delete event.context.traits.phone;
  }
  return event;
}
```
For more information, see the RudderStack Transformations overview.
Braze Cloud Data Ingestion (CDI)
For high-volume, non-user data objects (like Vehicles or Purchase History), we use Braze CDI to sync data directly from Databricks to Braze catalogs.
Use Cases:
- Vehicles: Syncing inventory data.
- Locations: Syncing dealership locations.
- Purchases: Syncing historical purchase data.
Table Formatting Requirements: To successfully ingest data via CDI, the source table in Databricks must be formatted with specific columns:
- `unique_id`: A unique identifier for the item.
- `updated_at`: A timestamp used to track changes (CDI only syncs items where `updated_at` > last job start time).
- `payload`: A JSON column containing all the data fields to be synced.
- `deleted`: (Optional) Boolean flag to indicate if the item should be removed.
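As a minimal sketch, the view below shapes a hypothetical vehicle inventory table into this layout using Databricks SQL. The source table and its fields (`cleaned.inventory.dim_vehicles`, `vin`, and so on) are assumptions; only the four output columns matter to CDI.

```sql
-- Hypothetical example: expose vehicle inventory in the shape CDI expects.
CREATE OR REPLACE VIEW braze_cdi.vehicles_catalog AS
SELECT
  vin                                       AS unique_id,  -- unique item key
  last_modified_at                          AS updated_at, -- drives incremental syncs
  to_json(struct(make, model, year, price)) AS payload,    -- all synced fields as JSON
  is_retired                                AS deleted     -- optional removal flag
FROM cleaned.inventory.dim_vehicles;
```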
For more details, see the Braze Cloud Data Ingestion documentation.
Braze Duplicate User Management
A common challenge is handling users who are tracked anonymously before they are identified.
- Scenario: A user browses the site (generating events) but hasn’t logged in. We don’t have their `external_id` (RudderStack ID) yet.
- Solution: We send these events to Braze as Aliased Users. An aliased user is identified by an `alias_name` and `alias_label`.
- Merging: Braze runs a daily merge job. When the user eventually identifies (e.g., signs up), Braze merges the Aliased User profile with the new External ID profile, preserving history and attributes.