
Identity & Activation (RudderStack)

The Identity and Activation pipeline leverages RudderStack Profiles to create a unified view of the customer. The process involves three main stages:

  1. Collection: Ingesting raw events and data from various sources (Website, Mobile App, Databricks tables).
  2. Stitching & Profiling: Unifying user identities into a comprehensive graph and computing user features.
  3. Activation: Sending the enriched user profiles to downstream destinations like Braze and Databricks via Reverse ETL.

RudderStack Profiles solves the problem of fragmented customer data by stitching together known and unknown identifiers into a single canonical identity (rudder_id).

Customer data often exists in silos:

  • Unknown Users: A user browsing anonymously on the website (tracked via anon_id).
  • Known Users: A user who has signed up or purchased, identified by email, phone, or user_id.

The Identity Graph links these identifiers across devices and sessions. For example:

  1. User visits website (Anonymous ID: A1).
  2. User signs up with email (Email: user@example.com, linked to A1).
  3. User logs in on mobile (Device ID: D1, linked to user@example.com).

RudderStack unifies A1, user@example.com, and D1 into a single Identity Graph, revealing that these are all the same person.
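The linking behavior above can be illustrated with a minimal union-find sketch. This is illustrative only: RudderStack Profiles performs stitching at scale via SQL in the warehouse, not with code like this.

```javascript
// Minimal union-find sketch of identity stitching (illustrative only;
// RudderStack Profiles does this with SQL models in the warehouse).
class IdentityGraph {
  constructor() {
    this.parent = new Map();
  }
  // Follow parent pointers to the canonical root of an identifier.
  find(id) {
    if (!this.parent.has(id)) this.parent.set(id, id);
    while (this.parent.get(id) !== id) {
      id = this.parent.get(id);
    }
    return id;
  }
  // Link two identifiers observed together on the same event.
  link(a, b) {
    this.parent.set(this.find(a), this.find(b));
  }
  // Identifiers sharing a canonical root belong to one profile.
  sameProfile(a, b) {
    return this.find(a) === this.find(b);
  }
}

const graph = new IdentityGraph();
graph.link("A1", "user@example.com"); // signup on web links anon ID to email
graph.link("D1", "user@example.com"); // mobile login links device ID to email
console.log(graph.sameProfile("A1", "D1")); // → true: one person
```

The canonical root plays the role of the rudder_id: every identifier that transitively links to the same root resolves to the same profile.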

Impact: This stitching process significantly consolidates user records. For example, a dataset might go from 9.48M raw identifiers to 2.64M unique profiles after stitching.

For more details on how profiles work, see the RudderStack Profiles documentation.

The Profiles project is defined by a collection of YAML files that specify how data is ingested, which features are computed, and how the project is structured. These files allow you to test logic locally and deploy to production.

  • inputs.yaml: Identifies the tables in Databricks to include in the profile and specifies the columns containing identifying information.
  • profiles.yaml: Defines the features (attributes) to compute for each profile using aggregate SQL statements.
  • pb_project.yaml: The main configuration file that ties everything together, defining entities and ID types.

For a comprehensive guide on the project structure, see the RudderStack Project Structure documentation.

inputs.yaml maps your warehouse tables to the identity graph. For each table, you must define:

  • The path to the table in Databricks.
  • The timestamp column (for ordering events).
  • The columns that represent identifiers (e.g., email, phone, external IDs).

Example Configuration:

- name: crm_contacts
  app_defaults:
    table: cleaned.the_crm.mv_crm_customers
    occurred_at_col: activity_date_created
    ids:
      - select: "lower(trim(email))"
        type: email
        entity: user
      - select: "phone::STRING"
        type: phone
        entity: user
- name: blueshift_users
  app_defaults:
    table: cleaned.blueshift.mv_users
    occurred_at_col: blueshift_join_date
    ids:
      - select: "blueshift_uuid::STRING"
        type: blueshift_user_id
        entity: user
      - select: "lower(trim(email))"
        type: email
        entity: user

profiles.yaml defines the Features (traits) of a user. Features are computed by aggregating data from the input tables. You can use SQL logic to define these traits.

Example Configuration:

- entity_var:
    name: braze_email_subscribe
    from: inputs/braze_subscriptions
    select: last_value(email_subscribe)
    where: email_subscribe IS NOT NULL AND email_address = {{user.main_email}}
    description: If the user has subscribed to receive email communications
- entity_var:
    name: email_subscribe
    select: coalesce({{user.braze_email_subscribe}}, {{user.blueshift_email_subscribe}})
    description: If the user has subscribed to receive email communications (Unified)

pb_project.yaml configures the high-level entity definitions and rules for identifiers. It allows you to:

  • Define the main entity (e.g., user).
  • List all supported ID types.
  • Filter/Exclude IDs: Use regex to exclude junk data (e.g., test emails, spam, placeholder values).

Example Configuration:

entities:
  - name: user
    id_stitcher: models/user_id_stitcher
    id_types:
      - rs_anon_id
      - ga_user_id
      - email
      - phone

id_types:
  - name: email
    filters:
      - type: exclude
        regex: '.*(test|fake|spam|none|noone|noemail|needemail).*'
      - type: exclude
        regex: '^null|na|noreply|email|no|123@.*'
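The exclude patterns above can be sanity-checked outside the warehouse. The sketch below assumes partial-match semantics; verify how your Profiles version anchors these patterns before relying on it.

```javascript
// Quick check of the exclude patterns from pb_project.yaml.
// Assumes partial-match semantics (an email is excluded if any
// pattern matches anywhere in the string).
const excludePatterns = [
  /.*(test|fake|spam|none|noone|noemail|needemail).*/,
  /^null|na|noreply|email|no|123@.*/,
];

function isExcluded(email) {
  return excludePatterns.some((re) => re.test(email));
}

console.log(isExcluded("test@example.com")); // true: contains "test"
console.log(isExcluded("noreply@shop.com")); // true: contains "noreply"
console.log(isExcluded("jane@example.com")); // false: passes the filters
```

Note that in the second pattern the `^` anchor binds only to the `null` alternative; the remaining alternatives match anywhere in the string, which is worth double-checking against your intent.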

Activation is the process of sending the unified customer profiles and enriched data back to downstream tools for action. We primarily use RudderStack Reverse ETL and Braze Cloud Data Ingestion (CDI).

This pipeline syncs computed user profiles from the Databricks warehouse to destinations like Braze.

  1. Create Audience: Define an audience by selecting a target table and applying filters (e.g., main_email is set AND email_subscribe is set).
  2. Connect Destination: Link the audience to a destination (e.g., Braze Prod).
  3. Map Properties: Map the columns from your audience (Warehouse columns) to the fields in the destination (External ID, Custom Attributes).
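The three steps can be pictured as a filter plus a column mapping. The sketch below is purely illustrative (the real audience and mapping are configured in the RudderStack UI, not in code, and the column and field names mirror the examples in this document):

```javascript
// Hypothetical illustration of a Reverse ETL audience filter and
// property mapping. None of this is a real RudderStack API.
const audienceRow = {
  rudder_id: "r-123", // canonical stitched ID (illustrative value)
  main_email: "user@example.com",
  email_subscribe: "subscribed",
  first_name: "Ada",
};

// Step 1 — filter: main_email is set AND email_subscribe is set.
function inAudience(row) {
  return row.main_email != null && row.email_subscribe != null;
}

// Step 3 — mapping: warehouse columns → destination fields.
function toBrazePayload(row) {
  return {
    external_id: row.rudder_id, // maps to Braze External ID
    email: row.main_email,
    custom_attributes: {
      email_subscribe: row.email_subscribe,
      first_name: row.first_name,
    },
  };
}

if (inAudience(audienceRow)) {
  console.log(toBrazePayload(audienceRow));
}
```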

For more details on activations, see the RudderStack Activation documentation.

You can use JavaScript transformations to modify events in real-time before they reach the destination. This is useful for:

  • Data Masking: Removing PII (e.g., phone numbers) before sending to analytics tools.
  • Enrichment: Calling external APIs to add data to the event.
  • Formatting: Restructuring the event payload to match destination requirements.

Example Transformation:

export function transformEvent(event) {
  // Mask phone number before the event reaches the destination
  if (event.context && event.context.traits && event.context.traits.phone) {
    delete event.context.traits.phone;
  }
  return event;
}
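The same pattern covers the formatting use case. Here is a sketch that reshapes a property to match a hypothetical destination schema (the property names are illustrative, not a real destination's requirements):

```javascript
export function transformEvent(event) {
  // Formatting: rename a snake_case property to the camelCase field a
  // hypothetical destination expects, coercing it to a string.
  if (event.properties && event.properties.vehicle_id != null) {
    event.properties.vehicleId = String(event.properties.vehicle_id);
    delete event.properties.vehicle_id;
  }
  return event;
}
```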

For more information, see the RudderStack Transformations overview.

For high-volume, non-user data objects (like Vehicles or Purchase History), we use Braze CDI to sync data directly from Databricks to Braze catalogs.

Use Cases:

  • Vehicles: Syncing inventory data.
  • Locations: Syncing dealership locations.
  • Purchases: Syncing historical purchase data.

Table Formatting Requirements: To successfully ingest data via CDI, the source table in Databricks must be formatted with specific columns:

  • unique_id: A unique identifier for the item.
  • updated_at: A timestamp used to track changes (CDI only syncs items where updated_at > last job start time).
  • payload: A JSON column containing all the data fields to be synced.
  • deleted: (Optional) Boolean flag to indicate if the item should be removed.
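A source table matching this contract might be built in Databricks roughly as follows. This is a hypothetical sketch: the schema, table, and column names are illustrative, and only the four output columns reflect the CDI requirement above.

```sql
-- Hypothetical Databricks source table for a Braze CDI vehicle catalog sync.
-- Source schema/table/column names are illustrative.
CREATE OR REPLACE TABLE cleaned.braze_cdi.vehicles_catalog AS
SELECT
  vin                                       AS unique_id,  -- unique item identifier
  last_modified_ts                          AS updated_at, -- CDI syncs rows newer than the last job start
  to_json(struct(make, model, year, price)) AS payload,    -- all synced fields packed into one JSON column
  is_removed                                AS deleted     -- optional removal flag
FROM cleaned.inventory.vehicles;
```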

For more details, see the Braze Cloud Data Ingestion documentation.

A common challenge is handling users who are tracked anonymously before they are identified.

  • Scenario: A user browses the site (generating events) but hasn’t logged in. We don’t have their external_id (RudderStack ID) yet.
  • Solution: We send these events to Braze as Aliased Users. An aliased user is identified by an alias_name and alias_label.
  • Merging: Braze runs a daily merge job. When the user eventually identifies (e.g., signs up), Braze merges the Aliased User profile with the new External ID profile, preserving history and attributes.
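The aliased-user shape can be sketched as the body of a Braze /users/track request; consult the Braze API documentation for the authoritative format, and note that the alias_label value and trait names below are illustrative conventions, not requirements.

```javascript
// Sketch of a Braze /users/track request body for an anonymous
// (aliased) user. The alias_label and trait names are illustrative.
function buildAliasedUserAttributes(anonymousId, traits) {
  return {
    attributes: [
      {
        user_alias: {
          alias_name: anonymousId,        // e.g. the anonymous/device ID
          alias_label: "rudder_anon_id",  // hypothetical label; pick one convention and keep it
        },
        _update_existing_only: false,     // create the aliased profile if it doesn't exist
        ...traits,
      },
    ],
  };
}

const body = buildAliasedUserAttributes("A1", { last_page_viewed: "/inventory" });
console.log(JSON.stringify(body, null, 2));
```

Once the user identifies, the daily merge job folds this aliased profile into the profile keyed by the External ID, so the pre-login history is retained.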