Data Activation (Reverse ETL)
RudderStack
Updating a Transformation
Transformations in RudderStack allow you to intercept events as they move from a Source to a Destination. Using JavaScript, you can modify these events on the fly to filter data, parse information, or enrich events with external data before they reach their final destination.
1. Accessing Transformations
There are two ways to access the transformation editor:
- Via the Sidebar: Click the Transformations tab in the left-hand navigation menu to see a list of all existing transformations.
- Via the Connections View: When viewing your visual data pipeline, click the Transformation node located on the line connecting your Source to your Destination.
2. Writing Transformation Logic
The transformation editor uses standard JavaScript. You can write logic to manipulate the event object however you need. Common use cases include:
- Filtering (Allow/Block Lists): You can create logic to only allow specific events to pass through (see the sketch after this list). This is useful for controlling costs in downstream tools (e.g., Braze) that charge based on event volume.
- Enrichment: You can fetch data from external APIs or add validation data to the event payload.
- Parsing: Restructuring the data format to match the requirements of the destination.
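To make the allow-list case concrete, here is a minimal sketch of that filtering logic. It is written in Python for consistency with the other sketches in this document; the RudderStack editor itself takes JavaScript as noted above, and the event names and field access below follow the standard event payload but should be treated as placeholders.

```python
# Minimal allow-list sketch (illustrative only; the editor uses JavaScript).
# Event names are hypothetical; replace them with events from your tracking plan.
ALLOWED_EVENTS = {"Order Completed", "Subscription Started"}

def transform_event(event):
    # Pass non-track calls (identify, page, etc.) through untouched
    if event.get("type") != "track":
        return event

    # Forward only allow-listed track events; returning None drops the event
    if event.get("event") in ALLOWED_EVENTS:
        return event
    return None
```

Returning the event forwards it to the destination; dropping everything else is what keeps low-value events out of volume-billed tools such as Braze.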
3. Testing the Transformation
Before deploying your code, you should validate it using the built-in testing tools located at the bottom of the editor.
Step 1: Import Test Data
You do not need to wait for live data to test your code.
- Click the Import Event button.
- Select a sample event type from the list (e.g., Track Event, Identify Event, Page Event).
- This will populate the input window with a sample JSON payload.
Step 2: Run the Test
- Once your code is written and your test event is imported, click the Run Test button.
- The editor will process the sample event through your JavaScript code.
Step 3: Analyze the Results
- Review the output window to ensure the JSON is formatted correctly.
- Click the Difference button (if available) or toggle between the input and output views to see exactly what changed between the original event and the transformed event.
4. Saving and Deploying
Once you have verified that the transformation logic produces the desired output:
- Click the Save Transformation button.
- The transformation will immediately begin applying to data flowing through that connection.
Adding a New Profile Attribute
This process involves updating the configuration code, testing locally, deploying the changes to the RudderStack platform, and mapping the new field for activation.
Phase 1: Code Configuration
You must edit the YAML files in your local rs-profiles repository to define where the data comes from and how it should be calculated.
1. Define the Input Source (inputs.yaml)
This file tells RudderStack which raw tables in your data warehouse (e.g., Databricks) to look at.
- Open inputs.yaml in your code editor.
- Check for existing sources: If the table containing your new data point is already defined, skip to step 2.
- Add a new source (if necessary): If you are pulling from a new table, add a new block under inputs::

```yaml
- name: source_table_name
  table: catalog.schema.table_name
  occurred_at_col: timestamp_column_name
  ids:
    - select: "lower(email_column)"
      type: email
      entity: user
    - select: "phone_column"
      type: phone
      entity: user
```

Ensure you define the occurred_at_col (for timeline stitching) and the unique identifiers (email/phone) to link this data to the Identity Graph.
2. Define the Feature Logic (profiles.yaml)
This file defines the specific attribute you want to attach to the user profile.
- Open profiles.yaml.
- Add a new entity_var: specify the logic for the new attribute. Common logic includes selecting the first or last value seen.

```yaml
- entity_var:
    name: new_attribute_name # The name of the column in the final table
    select: last(source_column_name)
    from: inputs/source_table_name
    description: Description of what this attribute represents
```
Phase 2: Local Testing & Verification
Before deploying, verify that the configuration compiles and produces the expected output.
- Compile the Project: Run the following command in your terminal to check for syntax errors in your YAML:

```bash
pb compile
```

Ensure this completes without error.
- Run the Project (Optional but Recommended): Run the pipeline locally to generate a test table in your development environment:

```bash
pb run
```

Note: This process can take significant time depending on data volume.
- Verify in Warehouse: Navigate to your data warehouse (e.g., Databricks) and check the user_feature_view table (usually in a test schema like profiles_test). Confirm your new_attribute_name column exists and is populated, for example with a query like the sketch below.
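As a hedged sketch, the check could look like this in a Databricks notebook (spark is the notebook's built-in SparkSession; the schema, table, and column names are the placeholders used above):

```python
# Spot-check the new attribute in the locally generated feature view.
# Names are placeholders; adjust them to your project's test schema and column.
df = spark.table("profiles_test.user_feature_view")

total = df.count()
populated = df.filter(df["new_attribute_name"].isNotNull()).count()
print(f"{populated} of {total} rows have new_attribute_name populated")

# Peek at a few non-null sample values
df.select("new_attribute_name").where(df["new_attribute_name"].isNotNull()).show(5)
```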
Phase 3: Deployment
Once the code is verified, you must push it to the remote repository and trigger a run in RudderStack.
- Commit and Push: Commit your changes to inputs.yaml and profiles.yaml and push them to your Git repository.

```bash
git add .
git commit -m "Added new user attribute"
git push
```

- Fetch & Run in RudderStack:
- Log in to the RudderStack dashboard.
- Navigate to Unify -> Profiles -> [Your Project] -> Settings.
- Click Fetch Latest to pull the new code from Git.
- Navigate to the History tab.
- Click Run.
- Wait for the run status to show a green checkmark.
Phase 4: Activation (Sync to Destination)
After the profile run completes, the data is available in the warehouse but needs to be mapped to the destination (Braze).
- Navigate to the Audience: Go to Activate -> Audiences and select the relevant audience (e.g., “Eligible for Messaging”).
- Update the Schema:
  - Click on the Schema tab.
  - Click Update.
  - Scroll to the bottom and click Map another field.
- Map the New Field:
  - Warehouse Column: Select your new_attribute_name from the dropdown.
  - Destination Field: Type the name of the custom attribute as it should appear in Braze.
  - Click Save.
- Trigger Sync:
  - Go to the Syncs tab.
  - Click Sync now.
Once the sync completes, the new custom attribute will be available on the user profiles in Braze.
Braze Integration
Updating a Catalog (Braze CDI)
1. Overview
Cloud Data Ingestion (CDI) allows Braze to sync data directly from a cloud data warehouse (e.g., Databricks, Snowflake). This is typically used to sync Catalogs (e.g., product inventory, store locations) or Events (e.g., purchase events).
Location in Braze: Data Settings > Cloud Data Ingestion
2. Braze Connection Configuration
To view or edit an existing sync:
- Navigate to the Cloud Data Ingestion page.
- Select the specific sync you wish to investigate.
- Review the Connection Details.
Key Configuration Fields:
- Catalog/Source: The database catalog (e.g., cleaned).
- Schema: The specific schema within the database (e.g., braze_import).
- Table: The table or view containing the prepared data (e.g., vehicles or locations).
3. Required Data Structure (Source Table)
For CDI to function correctly, the source table in your data warehouse must follow a specific schema. Braze does not ingest raw columns directly; it requires a packed JSON payload (see the example after the table below).
Standard Columns
| Column Name | Type | Description |
|---|---|---|
| id | String/Int | The unique identifier for the catalog item or user. |
| updated_at | Timestamp | Critical. Used for watermarking. Braze checks this timestamp to determine if the row has changed since the last sync. |
| payload | String (JSON) | A JSON object containing all item attributes (e.g., description, price, image URL, category). |
| deleted | Boolean | (Optional) If true, the item is removed from the Braze Catalog. |
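To make the packed payload concrete, here is a small sketch of how one row of the source table might look; the attribute names and values are hypothetical:

```python
import json
from datetime import datetime, timezone

# Hypothetical catalog row illustrating the column shape described above.
row = {
    "id": "vehicle-1042",
    "updated_at": datetime.now(timezone.utc).isoformat(),
    # Every item attribute is packed into the single JSON-encoded payload column
    "payload": json.dumps({
        "description": "2022 compact SUV",
        "price": 27950,
        "image_url": "https://example.com/img/vehicle-1042.png",
        "category": "suv",
    }),
    "deleted": False,
}
print(row["payload"])
```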
The “Deleted” Logic
- Do not simply remove a row from the source table to delete it from Braze; the sync will simply ignore it.
- To delete an item, the row must exist in the table with deleted = true and a new updated_at timestamp (see the sketch below).
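A minimal sketch of flagging an item as deleted in Databricks, assuming a Delta table and the example names used in this document (spark is the notebook's built-in SparkSession):

```python
# Mark one catalog item as deleted and refresh its watermark so the next sync picks it up.
# Catalog, schema, table, and id values are placeholders; adjust to your environment.
spark.sql("""
    UPDATE cleaned.braze_import.vehicles
    SET deleted = true,
        updated_at = current_timestamp()
    WHERE id = 'vehicle-1042'
""")
```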
4. Sync Logic (How it works)
The integration uses an incremental sync strategy:
- The system records the Last Updated At timestamp of the previous successful run.
- On the next run, it queries the source table.
- It ingests only rows where the updated_at value is greater than the previous Last Updated At value (conceptually, the query sketched below).
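Conceptually, each run boils down to a watermark filter like the hedged sketch below; the actual query Braze issues is not exposed, and the table and column names are the placeholders used in this document:

```python
# Conceptual sketch of the incremental pull; not the literal query Braze runs.
last_updated_at = "2024-01-01 00:00:00"  # watermark from the previous successful run (placeholder)

changed_rows = spark.sql(f"""
    SELECT id, updated_at, payload, deleted
    FROM cleaned.braze_import.vehicles
    WHERE updated_at > '{last_updated_at}'
""")
changed_rows.show()
```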
5. Updating Data Transformation Logic
If you need to add new attributes to the catalog or change how data is formatted, you must update the upstream ETL process (usually a Databricks Notebook).
Step-by-Step Update Process:
- Locate the Table: In your data warehouse (e.g., Databricks), find the table referenced in the Braze Connection Details.
- Trace Lineage: Use the “Lineage” tab to find the upstream Job or Notebook that writes to this table.
- Edit the Code: Open the notebook to modify the transformation logic.
Common Code Pattern (PySpark Example)
The standard pattern involves selecting your raw columns and packing them into the payload column while setting the updated_at timestamp.
```python
# Pseudo-code example based on video instructions
from pyspark.sql.functions import col, current_timestamp, to_json, struct

# 1. Prepare the dataframe with the necessary logic
df_transformed = source_df.select(
    col("unique_id_column").alias("id"),
    current_timestamp().alias("updated_at"),  # Updates the timestamp to now
    # 2. Pack all attribute columns into a JSON string
    to_json(struct(
        col("attribute_1"),
        col("attribute_2"),
        col("price"),
        col("description")
    )).alias("payload")
)

# 3. Write to the table targeted by Braze
df_transformed.write.mode("overwrite").saveAsTable("braze_import.target_table")
```

6. Execution and Troubleshooting
Once the data table is updated, the sync needs to run to reflect changes in Braze.
Scheduled Syncs
The job will run automatically based on the frequency defined in the Braze settings (e.g., every 15 minutes, hourly).
Manual Sync (Force Update)
If you have made changes and do not want to wait for the schedule:
- Go to Data Settings > Cloud Data Ingestion in Braze.
- Find your sync job.
- Click the Sync Now button (refresh icon).
- Monitor the “Sync Logs” at the bottom of the page for Success or Error statuses.