Google BigQuery: Enterprise Data Analytics Guide
BigQuery is Google Cloud's fully managed, serverless data warehouse that runs fast SQL queries over very large datasets using the processing power of Google's infrastructure. This guide covers everything from basic queries to advanced analytics features.
What Makes BigQuery Special?
BigQuery stands out with:
- Serverless Architecture: No infrastructure to manage
- Petabyte Scale: Query massive datasets in seconds
- Real-time Analytics: Stream millions of rows per second
- Built-in ML: Train and deploy ML models using SQL
- Cost Effective: Pay only for data processed (see the dry-run sketch below)
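Because queries are billed by the bytes they scan, it helps to estimate cost before running anything expensive. The following is a minimal sketch using the Python client; it assumes the sales table created later in this guide (my_project.analytics_dataset.sales_data). A dry run validates the query and reports how much data it would process without executing it.
from google.cloud import bigquery
client = bigquery.Client()
# Dry run: validate the query and estimate bytes scanned without running it
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT product_name, SUM(revenue) AS total_revenue "
    "FROM `my_project.analytics_dataset.sales_data` "
    "GROUP BY product_name",
    job_config=job_config,
)
print(f"This query would process {job.total_bytes_processed:,} bytes")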
Getting Started with BigQuery
Setting Up BigQuery
# Enable BigQuery API
gcloud services enable bigquery.googleapis.com
# Create a dataset
bq mk --dataset \
--location=US \
--description="My analytics dataset" \
my_project:analytics_dataset
# Create a table from CSV
bq load \
--source_format=CSV \
--autodetect \
analytics_dataset.sales_data \
gs://my-bucket/sales_data.csv
# Query the table
bq query --use_legacy_sql=false \
'SELECT product_name, SUM(revenue) as total_revenue
FROM `my_project.analytics_dataset.sales_data`
GROUP BY product_name
ORDER BY total_revenue DESC
LIMIT 10'
Basic SQL Queries
-- Create a table
CREATE TABLE `project.dataset.customers` (
customer_id INT64,
name STRING,
email STRING,
created_at TIMESTAMP,
total_purchases NUMERIC,
is_active BOOL
);
-- Insert data
INSERT INTO `project.dataset.customers`
VALUES
(1, 'John Doe', 'john@example.com', CURRENT_TIMESTAMP(), 1500.50, true),
(2, 'Jane Smith', 'jane@example.com', CURRENT_TIMESTAMP(), 2300.75, true);
-- Basic aggregations
SELECT
DATE(created_at) as signup_date,
COUNT(*) as new_customers,
SUM(total_purchases) as revenue
FROM `project.dataset.customers`
WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY signup_date
ORDER BY signup_date DESC;
Advanced BigQuery Features
Partitioned Tables
-- Create partitioned table by date
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
AS
SELECT * FROM `project.dataset.raw_events`;
-- Query specific partition
SELECT *
FROM `project.dataset.events`
WHERE DATE(event_timestamp) = '2024-01-15'
AND event_type = 'purchase';
-- Create partitioned table with expiration
CREATE TABLE `project.dataset.daily_stats`
PARTITION BY date
OPTIONS(
partition_expiration_days=90,
description="Daily statistics with 90-day retention"
)
AS
SELECT
DATE(event_timestamp) as date,
COUNT(*) as events,
COUNT(DISTINCT user_id) as unique_users
FROM `project.dataset.events`
GROUP BY date;
Window Functions
-- Calculate running totals and rankings
WITH sales_data AS (
SELECT
date,
product_id,
revenue,
SUM(revenue) OVER (
PARTITION BY product_id
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as running_total,
RANK() OVER (
PARTITION BY date
ORDER BY revenue DESC
) as daily_rank
FROM `project.dataset.daily_sales`
)
SELECT * FROM sales_data
WHERE daily_rank <= 10;
-- Moving averages
SELECT
date,
daily_revenue,
AVG(daily_revenue) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as seven_day_avg,
AVG(daily_revenue) OVER (
ORDER BY date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) as thirty_day_avg
FROM `project.dataset.revenue_summary`;
User-Defined Functions (UDFs)
-- JavaScript UDF
CREATE TEMP FUNCTION parseUserAgent(ua STRING)
RETURNS STRUCT<browser STRING, os STRING, device STRING>
LANGUAGE js AS """
var parser = new UAParser(ua);
return {
browser: parser.getBrowser().name || 'Unknown',
os: parser.getOS().name || 'Unknown',
device: parser.getDevice().type || 'desktop'
};
""";
-- SQL UDF
CREATE FUNCTION `project.dataset.classify_customer`(total_purchases NUMERIC)
RETURNS STRING
AS (
CASE
WHEN total_purchases >= 10000 THEN 'Premium'
WHEN total_purchases >= 5000 THEN 'Gold'
WHEN total_purchases >= 1000 THEN 'Silver'
ELSE 'Bronze'
END
);
-- Using UDFs
SELECT
user_id,
parseUserAgent(user_agent) as device_info,
`project.dataset.classify_customer`(total_purchases) as customer_tier
FROM `project.dataset.user_activity`;
BigQuery ML: Machine Learning in SQL
Training Models
-- Linear regression for sales forecasting
CREATE OR REPLACE MODEL `project.dataset.sales_forecast_model`
OPTIONS(
model_type='linear_reg',
input_label_cols=['sales_amount']
) AS
SELECT
day_of_week,
month,
is_holiday,
temperature,
promotion_active,
sales_amount
FROM `project.dataset.historical_sales`
WHERE date < '2024-01-01';
-- Logistic regression for churn prediction
CREATE OR REPLACE MODEL `project.dataset.churn_model`
OPTIONS(
model_type='logistic_reg',
auto_class_weights=TRUE,
input_label_cols=['churned']
) AS
SELECT
days_since_last_purchase,
total_purchases,
avg_order_value,
support_tickets,
churned
FROM `project.dataset.customer_features`;
-- K-means clustering for customer segmentation
CREATE OR REPLACE MODEL `project.dataset.customer_segments`
OPTIONS(
model_type='kmeans',
num_clusters=5,
standardize_features=TRUE
) AS
SELECT
total_purchases,
purchase_frequency,
avg_order_value,
days_as_customer
FROM `project.dataset.customer_metrics`;
Model Evaluation and Prediction
-- Evaluate model performance
SELECT *
FROM ML.EVALUATE(MODEL `project.dataset.sales_forecast_model`,
(SELECT * FROM `project.dataset.test_data`));
-- Make predictions
SELECT
customer_id,
predicted_churned,
predicted_churned_probs[OFFSET(1)].prob as churn_probability
FROM ML.PREDICT(MODEL `project.dataset.churn_model`,
(SELECT * FROM `project.dataset.current_customers`))
WHERE predicted_churned_probs[OFFSET(1)].prob > 0.8;
-- Inspect feature statistics (ML.FEATURE_INFO does not report importance)
SELECT *
FROM ML.FEATURE_INFO(MODEL `project.dataset.churn_model`);
-- For a linear or logistic model, the learned weights indicate feature influence
SELECT *
FROM ML.WEIGHTS(MODEL `project.dataset.churn_model`)
ORDER BY ABS(weight) DESC;
Real-time Streaming Analytics
Setting Up Streaming
# Python streaming client
from google.cloud import bigquery
client = bigquery.Client()
table_id = "project.dataset.realtime_events"
# Define schema
schema = [
    bigquery.SchemaField("event_id", "STRING"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("event_type", "STRING"),
    bigquery.SchemaField("timestamp", "TIMESTAMP"),
    bigquery.SchemaField("properties", "JSON"),
]
# Create the table if it does not already exist
table = bigquery.Table(table_id, schema=schema)
client.create_table(table, exists_ok=True)
# Stream data (values for JSON columns are passed as serialized strings)
rows_to_insert = [
    {
        "event_id": "evt_123",
        "user_id": "user_456",
        "event_type": "page_view",
        "timestamp": "2024-01-31 10:30:00",
        "properties": '{"page": "/products", "referrer": "google.com"}',
    }
]
errors = client.insert_rows_json(table_id, rows_to_insert)
if errors:
    print(f"Errors: {errors}")
Real-time Dashboards
-- Create materialized view for real-time metrics
-- (materialized views support only a limited set of aggregate functions and
-- cannot contain window functions or non-deterministic filters)
CREATE MATERIALIZED VIEW `project.dataset.realtime_metrics`
CLUSTER BY event_type
AS
SELECT
TIMESTAMP_TRUNC(timestamp, MINUTE) as minute,
event_type,
COUNT(*) as event_count,
APPROX_COUNT_DISTINCT(user_id) as unique_users
FROM `project.dataset.realtime_events`
GROUP BY minute, event_type;
-- Median time between events (window functions are not allowed in
-- materialized views, so compute this with a regular query)
WITH event_gaps AS (
SELECT
event_type,
TIMESTAMP_DIFF(
timestamp,
LAG(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp),
SECOND
) as seconds_since_prev_event
FROM `project.dataset.realtime_events`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
)
SELECT
event_type,
APPROX_QUANTILES(seconds_since_prev_event, 100)[OFFSET(50)] as median_time_between_events
FROM event_gaps
GROUP BY event_type;
-- Query real-time metrics
SELECT *
FROM `project.dataset.realtime_metrics`
WHERE minute >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY minute DESC;
Cost Optimization Strategies
Query Optimization
-- Use approximate functions for large datasets
SELECT
APPROX_COUNT_DISTINCT(user_id) as unique_users,
APPROX_QUANTILES(revenue, 100)[OFFSET(50)] as median_revenue,
APPROX_TOP_COUNT(product_id, 10) as top_products
FROM `project.dataset.transactions`
WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
-- Optimize with WHERE clauses on partitioned columns
SELECT *
FROM `project.dataset.events`
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
AND user_id = 'specific_user'; -- Clustered column
-- Use TABLESAMPLE for exploratory analysis
SELECT *
FROM `project.dataset.large_table` TABLESAMPLE SYSTEM (1 PERCENT);
Storage Optimization
-- Create clustered tables
CREATE TABLE `project.dataset.optimized_events`
PARTITION BY DATE(timestamp)
CLUSTER BY user_id, event_type
AS
SELECT * FROM `project.dataset.raw_events`;
-- Set table expiration
ALTER TABLE `project.dataset.temp_analysis`
SET OPTIONS (
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
);
-- Archive old data
CREATE TABLE `project.dataset.archived_events`
OPTIONS(
description="Archived events older than 1 year"
)
AS
SELECT *
FROM `project.dataset.events`
WHERE DATE(event_timestamp) < DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR);
Integration with Other GCP Services
Dataflow Integration
-- Create external table from Dataflow output
CREATE EXTERNAL TABLE `project.dataset.streaming_data`
OPTIONS (
format = 'AVRO',
uris = ['gs://my-bucket/dataflow-output/*.avro']
);
-- Query streaming data with batch data
WITH combined_data AS (
SELECT * FROM `project.dataset.batch_data`
UNION ALL
SELECT * FROM `project.dataset.streaming_data`
)
SELECT
DATE(timestamp) as date,
COUNT(*) as total_events
FROM combined_data
GROUP BY date;
Pub/Sub Integration
# Stream from Pub/Sub to BigQuery
from google.cloud import pubsub_v1, bigquery
import json
subscriber = pubsub_v1.SubscriberClient()
bq_client = bigquery.Client()
def process_message(message):
    data = json.loads(message.data.decode('utf-8'))
    # Transform and insert into BigQuery
    rows = [{
        'event_id': data['id'],
        'timestamp': data['timestamp'],
        'user_id': data['user_id'],
        'event_data': json.dumps(data['properties'])
    }]
    errors = bq_client.insert_rows_json(
        'project.dataset.events',
        rows
    )
    if not errors:
        message.ack()
    else:
        print(f"Insert errors: {errors}")
        message.nack()
subscription_path = subscriber.subscription_path(
    'project-id', 'subscription-name'
)
flow_control = pubsub_v1.types.FlowControl(max_messages=100)
# subscribe() returns a streaming pull future; block on it to keep listening
streaming_pull_future = subscriber.subscribe(
    subscription_path,
    callback=process_message,
    flow_control=flow_control
)
with subscriber:
    try:
        streaming_pull_future.result()
    except KeyboardInterrupt:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
Security and Governance
Data Access Control
-- Create authorized view
CREATE VIEW `project.dataset.customer_view`
OPTIONS(
description="Authorized view for customer data",
labels=[("team", "analytics"), ("pii", "true")]
)
AS
SELECT
customer_id,
-- Mask sensitive data
REGEXP_REPLACE(email, r'(.{2}).*(@.*)', r'\1****\2') as masked_email,
created_at,
total_purchases,
`project.dataset.classify_customer`(total_purchases) as customer_tier
FROM `project.dataset.customers`
WHERE is_active = true;
-- Grant access to view
GRANT `roles/bigquery.dataViewer`
ON TABLE `project.dataset.customer_view`
TO "group:analysts@company.com";
Column-level and Row-level Security
-- Policy tags are defined in a Data Catalog taxonomy (via the console,
-- gcloud, or the API) rather than with SQL DDL. Once a policy tag exists,
-- attach it to a column by updating the table schema, for example:
--
--   bq update --schema schema.json project:dataset.customers
--
-- where schema.json sets "policyTags" on the email column to
-- "projects/project-id/locations/us/taxonomies/123/policyTags/456".
-- Row-level security
CREATE ROW ACCESS POLICY sales_team_policy
ON `project.dataset.sales_data`
GRANT TO ("group:sales@company.com")
FILTER USING (region = 'US');
Best Practices
Query Performance
- Partition and Cluster: Partition large tables (usually by date) and cluster by frequently filtered columns
- Avoid SELECT *: Specify only the columns you need to reduce the data scanned
- Use appropriate data types: Use INT64 instead of STRING for numeric values
- Leverage caching: Identical queries are served from cache for up to 24 hours at no additional cost (see the sketch after this list)
- Optimize JOIN operations: Filter and aggregate data before joining
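One lightweight way to verify these practices is to inspect job statistics from the Python client. The following is a minimal sketch (the query and table names are illustrative) that runs the same query twice and prints the bytes processed, plus whether the second run was answered from the 24-hour result cache.
from google.cloud import bigquery
client = bigquery.Client()
# Select only the columns you need and filter on the partitioning column
query = """
SELECT user_id, event_type
FROM `project.dataset.events`
WHERE DATE(event_timestamp) = '2024-01-15'
"""
first_run = client.query(query)
first_run.result()  # wait for the job to finish
print(f"Bytes processed: {first_run.total_bytes_processed}")
print(f"Bytes billed: {first_run.total_bytes_billed}")
second_run = client.query(query)
second_run.result()
# cache_hit is True when the result came from the query cache
print(f"Served from cache: {second_run.cache_hit}")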
Data Modeling
-- Denormalized design for analytics
CREATE TABLE `project.dataset.sales_fact` AS
SELECT
s.sale_id,
s.sale_date,
s.amount,
c.customer_name,
c.customer_tier,
p.product_name,
p.category,
st.store_name,
st.region
FROM `project.dataset.sales` s
JOIN `project.dataset.customers` c ON s.customer_id = c.customer_id
JOIN `project.dataset.products` p ON s.product_id = p.product_id
JOIN `project.dataset.stores` st ON s.store_id = st.store_id;
-- Create aggregate tables for common queries
CREATE TABLE `project.dataset.daily_sales_summary`
PARTITION BY sale_date
AS
SELECT
sale_date,
region,
category,
COUNT(*) as transaction_count,
SUM(amount) as total_sales,
AVG(amount) as avg_sale_amount
FROM `project.dataset.sales_fact`
GROUP BY sale_date, region, category;
Conclusion
BigQuery is a powerful tool for data analytics at any scale. Its serverless architecture, SQL interface, and integration with machine learning make it ideal for modern data workflows. Start with simple queries and gradually explore advanced features like streaming analytics and BigQuery ML.
Next Steps
- Explore BigQuery Omni for multi-cloud analytics
- Learn about BigQuery BI Engine for interactive dashboards
- Implement automated data pipelines with Dataflow
- Study advanced SQL optimization techniques
- Get certified as a Google Cloud Data Engineer
Remember: BigQuery's strength lies in its ability to analyze massive datasets quickly and cost-effectively. Use partitioning, clustering, and query optimization to maximize performance while minimizing costs.