Data Engineering 9/10/2024 5 min read

Building a Server-Side Event Data Lake: Capturing Raw GTM SC Events in BigQuery with Cloud Run for Audit & Custom Analytics


You've mastered the art of building a sophisticated server-side Google Analytics 4 (GA4) pipeline. Your Google Tag Manager (GTM) Server Container, hosted on Cloud Run, is enriching data, enforcing quality, managing granular consent, and routing events to multiple platforms. This robust architecture empowers you with control, accuracy, and compliance for your analytics.

However, a fundamental question often arises for data-driven organizations: Are you truly capturing and owning all of your raw event data, independent of specific analytics tool processing?

While your server-side setup diligently transforms and dispatches events to GA4 (and potentially other marketing platforms), the data sent to these tools is often a processed, transformed, and filtered version of the original client-side event. This optimized data is perfect for reporting and activation, but it might not be sufficient for:

  • Comprehensive Audit Trails: Needing a complete, immutable log of every incoming server-side event for compliance, debugging, or dispute resolution.
  • Custom Analytics & Data Science: Building bespoke dashboards, machine learning models, or deep dives that require the raw, untransformed event schema.
  • Vendor Agnosticism & Future-Proofing: Decoupling your core event data from any single analytics platform's schema or processing logic.
  • Debugging the Pipeline: Having the ultimate source of truth for comparing what was received vs. what was sent to downstream tools.

The problem, then, is the absence of a raw, immutable event stream stored in your own data warehouse before any GA4-specific transformations or dispatches occur. Relying solely on the GA4 BigQuery export, for instance, means you're getting data after GA4 has processed it, which might include sampling, schema changes, and PII scrubbing. You need the event as your GTM Server Container first sees it.

The Solution: A Server-Side Event Data Lake with GTM SC, Cloud Run, and BigQuery

Our solution introduces an essential foundational layer: immediately capturing every raw server-side event received by your GTM Server Container and persisting it into a dedicated BigQuery Data Lake. This "raw event stream" acts as your single source of truth, untouched by subsequent transformations for GA4 or other platforms.

This is achieved by deploying a lightweight Python service on Cloud Run that acts as a BigQuery streaming inserter. Your GTM Server Container will then utilize a custom tag template to call this Cloud Run service, sending the raw event payload the moment it's received. This custom tag will be configured to fire very early in the GTM SC event lifecycle, ensuring it captures the event in its most pristine state.

Architecture: Adding a Raw Event Ingestion Layer

We'll augment our existing server-side architecture by introducing an initial, high-priority step within the GTM Server Container to divert a copy of the raw incoming event to BigQuery.

graph TD
    A[Browser/Client-Side] -->|"1. Raw Event (HTTP Request)"| B(GTM Web Container);
    B -->|"2. HTTP Request to GTM Server Container Endpoint"| C(GTM Server Container on Cloud Run);

    subgraph "GTM Server Container Initial Processing"
        C --> D{"3. GTM SC Client Processes Event"};
        D --> F["4. Custom Tag: Send Raw Event to BigQuery Service (First Priority)"];
        F -->|"5. HTTP Request with Raw Event JSON"| G["Raw Event Ingestion Service (Python on Cloud Run)"];
        G -->|"6. Stream Insert Event"| H[BigQuery Raw Event Data Lake];
    end

    F --> I[7. Continue Normal GTM SC Processing];
    I --> J["8. Data Quality, PII Scrubbing, Consent Evaluation, Enrichment"];
    J --> K[9. Dispatch to GA4/Facebook CAPI/Google Ads];
    K --> L[Analytics/Ad Platforms];

Key Steps in the GTM Server Container:

  1. Ingest Raw Event: The GTM SC receives the client-side event.
  2. Capture and Forward (Early): A custom tag/variable captures the entire incoming event payload and immediately sends it to your Raw Event Ingestion Service (Cloud Run). This happens before any subsequent GA4-specific processing, PII scrubbing, or consent checks.
  3. BigQuery Persistence: The Cloud Run service receives the raw JSON payload and streams it into a designated BigQuery table.
  4. Continue Processing: The original event continues its journey through the GTM SC, undergoing transformations, consent checks, and dispatches to GA4, etc.

Core Components & Implementation Steps

1. BigQuery Setup: Your Raw Event Data Lake

Create a dedicated BigQuery dataset and table to store your raw events. A flexible schema is key, as incoming events might vary.

a. Create BigQuery Dataset:

bq --location=YOUR_BQ_REGION mk --dataset YOUR_GCP_PROJECT_ID:raw_events_data_lake

b. Create BigQuery Table raw_incoming_events: The most flexible approach is to store the entire raw event payload in a JSON column (or as a STRING that you parse at query time). You should also capture key metadata like timestamp, event_name, and client_id for easier querying.

CREATE TABLE `your_gcp_project.raw_events_data_lake.raw_incoming_events` (
    event_timestamp TIMESTAMP NOT NULL,
    event_name STRING,
    client_id STRING,
    payload JSON, -- Use JSON type if available and preferred
    -- If JSON type is not available or you prefer string for older BigQuery versions:
    -- payload STRING,
    -- request_headers JSON, -- To store full request headers as JSON
    gcp_insert_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY event_name, client_id
OPTIONS (
    description = 'Raw incoming server-side events from GTM Server Container before any transformations.'
);

Note: With a payload column of type JSON, the Python service can insert the json.dumps() string and BigQuery will parse it on ingestion. With a STRING column, the same serialized JSON is simply stored as text and parsed at query time.
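
Once events are flowing, BigQuery's JSON functions make it straightforward to pull individual fields back out of the stored payload. The following is a minimal sketch, assuming the dataset, table, and JSON payload column defined above; the $.page_location path is purely illustrative and should be adjusted to whatever your container actually receives.

from google.cloud import bigquery

# Minimal sketch: query the raw event lake with BigQuery JSON functions.
# Assumes the dataset/table created above; '$.page_location' is illustrative.
client = bigquery.Client(project="your_gcp_project")

query = """
SELECT
  event_name,
  JSON_VALUE(payload, '$.page_location') AS page_location,
  COUNT(*) AS events
FROM `your_gcp_project.raw_events_data_lake.raw_incoming_events`
WHERE DATE(event_timestamp) = CURRENT_DATE()
GROUP BY event_name, page_location
ORDER BY events DESC
"""

for row in client.query(query).result():
    print(row.event_name, row.page_location, row.events)

The same JSON_VALUE approach works if you opted for a STRING payload column, since the function also accepts JSON stored as text.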

2. Python Raw Event Ingestion Service (Cloud Run)

This Flask application will receive the raw event payload from GTM SC and insert it into BigQuery.

main.py example:

import os
import json
from flask import Flask, request, jsonify
from google.cloud import bigquery
import logging
import datetime

app = Flask(__name__)
# Configure basic logging to stdout, which Cloud Run automatically captures
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# --- BigQuery Configuration ---
BIGQUERY_PROJECT_ID = os.environ.get('BIGQUERY_PROJECT_ID', 'your_gcp_project')
BIGQUERY_DATASET_ID = os.environ.get('BIGQUERY_DATASET_ID', 'raw_events_data_lake')
BIGQUERY_TABLE_ID = os.environ.get('BIGQUERY_TABLE_ID', 'raw_incoming_events')
TABLE_FULL_ID = f"{BIGQUERY_PROJECT_ID}.{BIGQUERY_DATASET_ID}.{BIGQUERY_TABLE_ID}"

# Initialize BigQuery client
try:
    client = bigquery.Client(project=BIGQUERY_PROJECT_ID)
    logger.info(f"BigQuery client initialized for project: {BIGQUERY_PROJECT_ID}")
except Exception as e:
    logger.error(f"Error initializing BigQuery client: {e}")
    # Handle this more gracefully in production, potentially crash if essential

@app.route('/ingest-raw-event', methods=['POST'])
def ingest_raw_event():
    """
    Receives raw event data from GTM Server Container and streams it to BigQuery.
    """
    if not request.is_json:
        logger.warning("Request is not JSON. Content-Type: %s", request.headers.get('Content-Type'))
        return jsonify({'error': 'Request must be JSON'}), 400

    try:
        raw_event_data = request.get_json()
        logger.debug("Received raw event data: %s", json.dumps(raw_event_data, indent=2))

        # Extract key fields for easier querying in BigQuery
        event_name = raw_event_data.get('event_name', 'unknown_event')
        client_id = raw_event_data.get('_event_metadata', {}).get('client_id', None)
        # GTM SC 'gtm.start' is in milliseconds, BigQuery TIMESTAMP expects seconds or ISO format
        event_timestamp_ms = raw_event_data.get('gtm.start')
        if event_timestamp_ms:
            event_timestamp = datetime.datetime.fromtimestamp(event_timestamp_ms / 1000.0, tz=datetime.timezone.utc).isoformat(timespec='microseconds')
        else:
            event_timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat(timespec='microseconds')

        # Construct the row to insert
        row_to_insert = {
            "event_timestamp": event_timestamp,
            "event_name": event_name,
            "client_id": client_id,
            # Store the entire raw JSON payload as a string or JSON type
            "payload": json.dumps(raw_event_data) # BigQuery JSON type can also take a string
            # If you want to store request headers:
            # "request_headers": json.dumps(dict(request.headers)) 
        }

        # Stream insert into BigQuery
        # For high-volume, consider batching inserts or using Pub/Sub -> Dataflow
        errors = client.insert_rows_json(TABLE_FULL_ID, [row_to_insert])

        if errors:
            logger.error(f"BigQuery insert errors: {errors}")
            # Consider more robust error handling, e.g., dead-letter queue
            return jsonify({'message': 'Partial success, some rows failed to insert', 'errors': errors}), 200
        else:
            logger.info(f"Successfully ingested event '{event_name}' (Client ID: {client_id[:10] if client_id else 'N/A'}) into BigQuery.")
            return jsonify({'message': 'Event successfully ingested'}), 200

    except Exception as e:
        logger.error(f"Error during raw event ingestion: {e}", exc_info=True)
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    # Cloud Run provides the PORT environment variable
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))

requirements.txt:

Flask
google-cloud-bigquery

Deploy the Python service to Cloud Run:

gcloud run deploy raw-event-ingestion-service \
    --source . \
    --platform managed \
    --region YOUR_GCP_REGION \
    --allow-unauthenticated \
    --set-env-vars BIGQUERY_PROJECT_ID="YOUR_GCP_PROJECT_ID",BIGQUERY_DATASET_ID="raw_events_data_lake",BIGQUERY_TABLE_ID="raw_incoming_events" \
    --memory 512Mi \
    --cpu 1 \
    --timeout 30s # Allow enough time for BigQuery insertion

Important: Note down the URL of this deployed Cloud Run service. The --allow-unauthenticated flag keeps this example simple; because the service is only ever called by your GTM Server Container (which already sits behind your first-party domain), unauthenticated invocation is a common pattern for internal services. For production, consider tightening access: validate a shared secret header in the Python service, or create a service account with the roles/run.invoker permission on the Cloud Run service and pass an identity token with the sendHttpRequest call from GTM.

You'll also need to ensure the Cloud Run service identity has the roles/bigquery.dataEditor role on your BigQuery dataset to be able to insert rows.
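
Before wiring up GTM, it is worth a quick end-to-end check that the service actually writes to BigQuery. The snippet below is a minimal sketch using the requests library and a hypothetical service URL; it posts a representative payload containing the fields main.py extracts (event_name, _event_metadata.client_id, gtm.start), after which a row should appear in raw_incoming_events.

import time
import requests  # for local testing only; not a dependency of the service itself

# Hypothetical URL of the deployed Cloud Run service (include the route path)
SERVICE_URL = "https://raw-event-ingestion-service-xxxxx-uc.a.run.app/ingest-raw-event"

# Representative payload containing the fields main.py extracts; everything
# else is passed through untouched into the payload column.
sample_event = {
    "event_name": "page_view",
    "_event_metadata": {"client_id": "test-client-123"},
    "gtm.start": int(time.time() * 1000),  # milliseconds, as the web container sends it
    "page_location": "https://www.example.com/",
}

response = requests.post(SERVICE_URL, json=sample_event, timeout=10)
print(response.status_code, response.json())
# Expected: 200 {'message': 'Event successfully ingested'}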

3. GTM Server Container Custom Tag Template

Create a custom tag template in your GTM Server Container that fires early to send the raw event to your Cloud Run service.

Example Custom Tag Template (e.g., "Send Raw Event to BigQuery")

const sendHttpRequest = require('sendHttpRequest');
const JSON = require('JSON');
const log = require('logToConsole');
const getAllEventData = require('getAllEventData');
const getRequestHeader = require('getRequestHeader'); // To include selected incoming request headers if needed

// Configuration fields for the template:
//   - ingestionServiceUrl: Text input for your Cloud Run service URL (e.g., 'https://raw-event-ingestion-service-xxxxx-uc.a.run.app/ingest-raw-event')
//   - sendFullRequestHeaders: Boolean checkbox to include selected incoming HTTP request headers

const ingestionServiceUrl = data.ingestionServiceUrl;
const sendFullRequestHeaders = data.sendFullRequestHeaders;

if (!ingestionServiceUrl) {
    log('Raw Event Ingestion Service URL is not configured.');
    data.gtmOnSuccess(); // Do not block other tags from firing
    return;
}

// Get a copy of all event data available at this point
const rawEventPayload = getAllEventData();

// Add selected incoming HTTP request headers if configured.
// getRequestHeader takes a single header name, so list the headers you want;
// avoid sensitive ones such as Authorization or Cookie.
if (sendFullRequestHeaders) {
    rawEventPayload._request_headers = {
        'user-agent': getRequestHeader('user-agent'),
        'referer': getRequestHeader('referer'),
        'x-forwarded-for': getRequestHeader('x-forwarded-for')
    };
}

sendHttpRequest(ingestionServiceUrl, (statusCode, headers, body) => {
    if (statusCode >= 200 && statusCode < 300) {
        log('Raw event sent to BigQuery service successfully.');
        data.gtmOnSuccess();
    } else {
        log('Raw event failed to send to BigQuery service: ' + statusCode + ' ' + body);
        data.gtmOnFailure(); // Optionally fail, but usually you want other tags to fire
    }
}, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    timeout: 5000 // 5 seconds timeout
}, JSON.stringify(rawEventPayload));

GTM SC Configuration:

  1. Create this as a Custom Tag Template (e.g., Send Raw Event to BigQuery).
  2. Grant necessary permissions: Access event data, Send HTTP requests, Access request headers.
  3. Create a Custom Tag (e.g., Raw Event Logger) using this template.
  4. Configure ingestionServiceUrl with the full URL of your Cloud Run service, including the /ingest-raw-event path.
  5. Set sendFullRequestHeaders to true if you want selected request headers captured alongside the event (extend the header list in the template as needed).
  6. Crucially, give this tag an All Events trigger and the highest tag firing priority in your container (in GTM, higher priority numbers fire earlier). This guarantees it runs as early as possible, before other tags can modify the event. Because the tag fires after your GTM SC client has parsed the incoming request, client-populated fields such as _event_metadata.client_id are already available, while GA4-specific transformations have not yet run.

Benefits of This Server-Side Event Data Lake

  • Complete Raw Data Audit Trail: Every event reaching your server-side endpoint is logged, providing an immutable record for compliance, fraud detection, or detailed debugging.
  • Independent Analytics Layer: Build custom reports, dashboards (e.g., in Looker Studio), or perform advanced SQL analysis in BigQuery, completely independent of GA4's data model or limitations.
  • Data Ownership and Vendor Agnosticism: You own your raw data stream, giving you ultimate control and flexibility to switch or integrate with other analytics/marketing platforms in the future without data loss.
  • Enhanced Debugging and Reconciliation: Compare your BigQuery raw events to what actually landed in GA4 (via its BigQuery export) to pinpoint discrepancies or errors in your GTM SC transformations; a reconciliation sketch follows this list.
  • Foundation for Data Science & Machine Learning: This raw, granular data is a rich source for building predictive models, LTV predictions, or customer segmentation using BigQuery ML or other tools.
  • Backfill Capabilities: If a downstream system (like GA4) has an outage or a misconfiguration, you have the raw data to potentially backfill or reprocess events later.
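
As a concrete illustration of the reconciliation point above, the sketch below compares yesterday's event counts in the raw lake against the GA4 BigQuery export. The analytics_123456789 dataset name is a placeholder for your GA4 export dataset, and both datasets are assumed to live in the same project; adapt to your setup.

from google.cloud import bigquery

# Sketch: compare yesterday's event counts in the raw lake against the GA4
# BigQuery export. 'analytics_123456789' is a placeholder for your GA4 export
# dataset; both datasets are assumed to live in the same project.
client = bigquery.Client(project="your_gcp_project")

query = """
WITH raw AS (
  SELECT event_name, COUNT(*) AS raw_count
  FROM `your_gcp_project.raw_events_data_lake.raw_incoming_events`
  WHERE DATE(event_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY event_name
),
ga4 AS (
  SELECT event_name, COUNT(*) AS ga4_count
  FROM `your_gcp_project.analytics_123456789.events_*`
  WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
  GROUP BY event_name
)
SELECT
  COALESCE(raw.event_name, ga4.event_name) AS event_name,
  IFNULL(raw.raw_count, 0) AS raw_count,
  IFNULL(ga4.ga4_count, 0) AS ga4_count,
  IFNULL(raw.raw_count, 0) - IFNULL(ga4.ga4_count, 0) AS difference
FROM raw
FULL OUTER JOIN ga4 ON raw.event_name = ga4.event_name
ORDER BY ABS(IFNULL(raw.raw_count, 0) - IFNULL(ga4.ga4_count, 0)) DESC
"""

for row in client.query(query).result():
    print(row.event_name, row.raw_count, row.ga4_count, row.difference)

Differences are expected wherever consent checks or data-quality filters intentionally drop events before dispatch; the value of the comparison is in spotting unexpected gaps.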

Important Considerations

  • PII (Personally Identifiable Information): If your client-side implementation sends raw PII to your GTM Server Container, this raw event data lake will contain that PII.
    • Access Control: Implement strict IAM access controls on your BigQuery dataset/table. Only authorized personnel should have access.
    • Retention Policies: Define data retention policies for this raw data, especially if it contains sensitive information.
    • Separate vs. Scrubbed: This raw data lake is designed to be raw. If your organizational policy dictates that no PII should ever reach BigQuery even for audit, then PII scrubbing (as discussed in previous blogs) would need to occur before this raw ingestion step. This is a crucial design decision. For a true "raw audit" log, it should contain what the GTM SC receives.
  • Cost: BigQuery storage and streaming inserts, along with Cloud Run invocations, will incur costs. Monitor usage and optimize your queries to manage expenses. For extremely high-volume sites, consider Pub/Sub as an intermediary buffer before BigQuery to handle bursts and potential backpressure (a minimal publishing sketch follows this list).
  • Schema Evolution: Raw events can change. Design your BigQuery schema to be flexible (e.g., using JSON or STRING for payload) to accommodate new parameters or nested structures without constant table updates.
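
If direct streaming inserts ever become a bottleneck at high volume, the ingestion service can publish each raw payload to a Pub/Sub topic and let a downstream consumer handle the BigQuery writes. Below is a minimal sketch of the publishing side, assuming a hypothetical raw-events topic and the google-cloud-pubsub client library.

import json
import os

from google.cloud import pubsub_v1

# Sketch: publish the raw payload to a Pub/Sub topic instead of streaming it
# straight into BigQuery. Project and topic names are illustrative defaults.
PROJECT_ID = os.environ.get('BIGQUERY_PROJECT_ID', 'your_gcp_project')
TOPIC_ID = os.environ.get('RAW_EVENTS_TOPIC_ID', 'raw-events')

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

def publish_raw_event(raw_event_data):
    """Publish one raw event; returns the Pub/Sub message ID."""
    future = publisher.publish(
        topic_path,
        data=json.dumps(raw_event_data).encode('utf-8'),
        event_name=raw_event_data.get('event_name', 'unknown_event'),  # message attribute
    )
    return future.result(timeout=10)

A Pub/Sub BigQuery subscription (or a small Dataflow job) can then deliver messages into the raw table, absorbing bursts without slowing down the ingestion endpoint.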

Conclusion

Building a server-side event data lake by capturing raw GTM Server Container events into BigQuery is a strategic move for any data-mature organization. It transcends the immediate needs of analytics tools, providing a robust foundation for data ownership, auditability, custom analytics, and future innovation. By integrating GTM SC, Cloud Run, and BigQuery, you create an indispensable backbone for your data engineering efforts, ensuring you always have the most granular and complete record of your digital interactions. Embrace this approach to unlock unparalleled control and value from your server-side data.