Serverless Event-Driven Document Processing Pipeline on GCP

This project implements a fully serverless, event-driven document processing pipeline on Google Cloud Platform (GCP).

Architecture

When a file is uploaded to the Cloud Storage ingestion bucket, Cloud Storage publishes an object-created notification to a Pub/Sub topic. A push subscription then delivers the event payload, authenticated with an OIDC token, to a Python FastAPI service running on Cloud Run. The processor service downloads the file, performs content-aware text parsing (simulating OCR), extracts metadata (word count, keyword tags, and file details), and streams that metadata into a BigQuery table.

graph TD
    User([User / Client]) -->|Upload File| GCS[Cloud Storage Bucket]
    GCS -->|Object Created Notification| PubSubTopic[Pub/Sub Topic]
    PubSubTopic -->|Push Subscription + OIDC Token| CloudRun[Cloud Run: FastAPI Processor]
    CloudRun -->|1. Download File| GCS
    CloudRun -->|2. Process Simulated OCR| CloudRun
    CloudRun -->|3. Stream Metadata| BQ[BigQuery Dataset & Table]
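
To make the hand-off between Pub/Sub and the processor concrete, here is a minimal sketch of how a push request body could be decoded into a bucket and object name. It is not taken from `processor/main.py`; the function name `parse_push_envelope` is hypothetical. It assumes the standard Pub/Sub push envelope (a `message` object with optional `attributes` and base64-encoded `data`) and the Cloud Storage notification conventions (`bucketId` and `objectId` attributes, plus a JSON object resource in `data` when the JSON payload format is used).

```python
import base64
import json
from typing import Any


def parse_push_envelope(envelope: dict[str, Any]) -> tuple[str, str]:
    """Return (bucket, object_name) from a Pub/Sub push request body.

    Hypothetical helper for illustration; the real service may differ.
    """
    message = envelope.get("message")
    if not isinstance(message, dict):
        raise ValueError("request body has no Pub/Sub 'message' field")

    # Cloud Storage notifications carry the bucket and object name as attributes.
    attributes = message.get("attributes") or {}
    bucket = attributes.get("bucketId")
    name = attributes.get("objectId")

    # Fall back to the payload: base64-encoded JSON describing the object.
    if (not bucket or not name) and message.get("data"):
        resource = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
        bucket = bucket or resource.get("bucket")
        name = name or resource.get("name")

    if not bucket or not name:
        raise ValueError("could not determine bucket and object name")
    return bucket, name
```

In a FastAPI handler, a `ValueError` from this function would typically be turned into a 4xx response so that malformed messages are easy to spot in the logs.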

Project Directory Structure

├── README.md
├── processor/
│   ├── Dockerfile
│   ├── main.py
│   ├── requirements.txt
│   └── test_processor.py
└── terraform/
    ├── main.tf
    ├── outputs.tf
    └── variables.tf

Local Development & Testing

You can run the FastAPI processor locally and execute unit tests (which mock GCS and BigQuery APIs) to verify the application logic.

1. Set up Virtual Environment

cd processor
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

2. Run Tests

pytest
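
As an illustration of what mocking the GCS API can look like, the sketch below tests a small download helper against a `unittest.mock.MagicMock` instead of a real `google.cloud.storage.Client`. The helper `download_text` and the test name are hypothetical and are not copied from `test_processor.py`; only the client call chain (`bucket(...)`, `blob(...)`, `download_as_bytes()`) mirrors the real library.

```python
from unittest.mock import MagicMock


def download_text(storage_client, bucket_name: str, object_name: str) -> str:
    """Hypothetical helper: fetch an object and decode it as UTF-8 text."""
    blob = storage_client.bucket(bucket_name).blob(object_name)
    return blob.download_as_bytes().decode("utf-8", errors="replace")


def test_download_text_uses_mocked_client():
    # Stand-in for the storage client; no network access or credentials needed.
    client = MagicMock()
    client.bucket.return_value.blob.return_value.download_as_bytes.return_value = (
        b"hello world"
    )

    assert download_text(client, "my-bucket", "docs/a.txt") == "hello world"
    # Verify the helper asked for the right bucket and object.
    client.bucket.assert_called_once_with("my-bucket")
    client.bucket.return_value.blob.assert_called_once_with("docs/a.txt")
```

Passing the client in as an argument keeps the helper easy to test; the same pattern works for a BigQuery client.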

GCP Deployment Guide

Follow these steps to deploy the entire pipeline to Google Cloud.

Prerequisites

Google Cloud SDK (gcloud CLI) installed and authenticated.
Terraform CLI installed.
An active GCP Project with billing enabled.

Step 1: Authenticate with Google Cloud

Authenticate your local shell with GCP and set your project context:

gcloud auth login
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

Step 2: Create Artifact Registry & Push Container Image

Create a Docker registry in your project, build the processor container image, and push it:

# Define variables
PROJECT_ID="YOUR_PROJECT_ID"
REGION="us-central1"
REPO_NAME="document-processing-repo"
IMAGE_NAME="processor"
TAG="latest"

# 1. Create Artifact Registry repository
gcloud artifacts repositories create $REPO_NAME \
    --repository-format=docker \
    --location=$REGION \
    --description="Docker repository for document processor service"

# 2. Configure Docker authentication for Artifact Registry
gcloud auth configure-docker $REGION-docker.pkg.dev

# 3. Build the container image using Google Cloud Build (runs remotely without local Docker daemon)
cd processor
gcloud builds submit --tag $REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$IMAGE_NAME:$TAG
cd ..

Step 3: Deploy Infrastructure with Terraform

Now, deploy the GCP resources (GCS bucket, Pub/Sub notifications, BigQuery, IAM roles, and Cloud Run):

cd terraform

# Initialize Terraform
terraform init

# Plan and preview the deployment
# Replace the variables below with your project details
terraform plan \
  -var="project_id=YOUR_PROJECT_ID" \
  -var="bucket_name=YOUR_GLOBALLY_UNIQUE_BUCKET_NAME" \
  -var="image_url=us-central1-docker.pkg.dev/YOUR_PROJECT_ID/document-processing-repo/processor:latest"

# Apply changes to GCP
terraform apply \
  -var="project_id=YOUR_PROJECT_ID" \
  -var="bucket_name=YOUR_GLOBALLY_UNIQUE_BUCKET_NAME" \
  -var="image_url=us-central1-docker.pkg.dev/YOUR_PROJECT_ID/document-processing-repo/processor:latest"

Note: The Terraform configuration enables the required APIs automatically, which may take a moment on the first run.

Step 4: Verification and Testing

To verify that everything is working properly:

Upload a sample document to the ingestion bucket:

# Upload a text file with keywords
echo "This project report outlines the financial invoices and agreement contracts." > test-document.txt
gcloud storage cp test-document.txt gs://YOUR_GLOBALLY_UNIQUE_BUCKET_NAME/test-document.txt

Check Cloud Run Logs: Open the Cloud Run console or run:
```
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=document-processor" --limit 20
```
You should see entries indicating:
- Event received from Pub/Sub
- File downloaded successfully
- Word count processed (10 words) and tags identified (project, report, financial, invoice, contract, agreement)
- Row streamed successfully to BigQuery.
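
For reference, the word count and tags in the log entries above can be reproduced with a sketch like the following. It is an assumption about how the extraction might work, not the code in `processor/main.py`: the `KEYWORDS` list and the `extract_metadata` function are hypothetical, and the prefix matching is a deliberately simple way to map plurals such as "invoices" onto their keyword.

```python
import re

# Hypothetical keyword list, chosen to match the tags in the sample log output.
KEYWORDS = ("project", "report", "financial", "invoice", "contract", "agreement")


def extract_metadata(text: str) -> dict:
    """Count whitespace-separated words and tag keywords found in the text."""
    words = text.split()
    # Lowercase and strip punctuation so "contracts." still matches "contract".
    tokens = [re.sub(r"[^a-z0-9]", "", word.lower()) for word in words]
    # Prefix matching so simple plurals ("invoices") map to their keyword.
    tags = [k for k in KEYWORDS if any(t.startswith(k) for t in tokens)]
    return {"word_count": len(words), "tags": tags}


sample = "This project report outlines the financial invoices and agreement contracts."
print(extract_metadata(sample))
```

For the sample sentence this yields a word count of 10 and all six tags, in the order they appear in `KEYWORDS`.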

Query BigQuery Table: Run a query to inspect the streamed metadata:

bq query --use_legacy_sql=false \
  'SELECT filename, word_count, tags, file_type, processed_at FROM `YOUR_PROJECT_ID.document_processing.metadata` LIMIT 10'
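
The columns in this query suggest the shape of each streamed row. The sketch below shows one way such a row could be built and sent. It assumes that `tags` is a repeated string column and that `processed_at` accepts an ISO 8601 UTC timestamp string; the helpers `build_row` and `stream_row` are hypothetical. The only library call relied on is `insert_rows_json` from the `google-cloud-bigquery` client, which returns a list of per-row errors (empty on success).

```python
from datetime import datetime, timezone


def build_row(filename: str, word_count: int, tags: list[str], file_type: str) -> dict:
    """Assemble one metadata row; column names follow the query above."""
    return {
        "filename": filename,
        "word_count": word_count,
        "tags": tags,  # assumed to be a REPEATED STRING column
        "file_type": file_type,
        "processed_at": datetime.now(timezone.utc).isoformat(),  # assumed TIMESTAMP
    }


def stream_row(bq_client, table_id: str, row: dict) -> None:
    """Stream a single row and fail loudly if BigQuery rejects it."""
    # insert_rows_json returns a list of per-row errors; empty means success.
    errors = bq_client.insert_rows_json(table_id, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```

With a real client this would be called as `stream_row(bigquery.Client(), "YOUR_PROJECT_ID.document_processing.metadata", row)`. Raising on errors lets the service return a non-2xx response, which causes Pub/Sub to redeliver the message.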

Step 5: Clean Up Resources

To tear down the deployed resources and avoid ongoing charges:

terraform destroy \
  -var="project_id=YOUR_PROJECT_ID" \
  -var="bucket_name=YOUR_GLOBALLY_UNIQUE_BUCKET_NAME" \
  -var="image_url=us-central1-docker.pkg.dev/YOUR_PROJECT_ID/document-processing-repo/processor:latest"

You can also delete the Artifact Registry image repository:

gcloud artifacts repositories delete document-processing-repo --location=us-central1 --quiet
