This project implements a fully serverless, event-driven document processing pipeline on Google Cloud Platform (GCP).
Whenever a file is uploaded to the Cloud Storage ingestion bucket, it triggers a Pub/Sub notification. Pub/Sub pushes the event payload to a Python FastAPI service running on Cloud Run. The processor service downloads the file, performs content-aware text parsing (simulating OCR), extracts metadata (word count, keyword tags, and file details), and streams the metadata directly into a BigQuery table.
graph TD
User([User / Client]) -->|Upload File| GCS[Cloud Storage Bucket]
GCS -->|Object Created Notification| PubSubTopic[Pub/Sub Topic]
PubSubTopic -->|Push Subscription + OIDC Token| CloudRun[Cloud Run: FastAPI Processor]
CloudRun -->|1. Download File| GCS
CloudRun -->|2. Process Simulated OCR| CloudRun
CloudRun -->|3. Stream Metadata| BQ[BigQuery Dataset & Table]
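The Pub/Sub-to-Cloud Run hop in the diagram can be illustrated with a minimal sketch of how a push request body might be decoded. It assumes the bucket notification uses the `JSON_API_V1` payload format, so the base64-encoded `data` field holds the object resource with `bucket` and `name` keys. The function name `parse_push_envelope` is hypothetical, and the handler in `processor/main.py` may be structured differently.

```python
import base64
import json
from typing import Any, Dict, Tuple


def parse_push_envelope(envelope: Dict[str, Any]) -> Tuple[str, str]:
    """Return (bucket, object_name) from a Pub/Sub push request body.

    Hypothetical helper: assumes a Cloud Storage notification whose
    payload is the JSON object resource (JSON_API_V1 format).
    """
    message = envelope.get("message")
    if not isinstance(message, dict) or "data" not in message:
        raise ValueError("not a valid Pub/Sub push envelope")

    # Pub/Sub push delivers the message body base64-encoded.
    payload = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
    return payload["bucket"], payload["name"]


if __name__ == "__main__":
    # Build a fake envelope shaped like a push delivery (illustrative values).
    resource = {"bucket": "example-bucket", "name": "test-document.txt"}
    fake_envelope = {
        "message": {
            "data": base64.b64encode(json.dumps(resource).encode()).decode(),
            "messageId": "1",
        },
        "subscription": "projects/example/subscriptions/example-sub",
    }
    print(parse_push_envelope(fake_envelope))
```

Raising `ValueError` on a malformed body lets the HTTP handler decide which status code to return, which in turn controls whether Pub/Sub retries the delivery.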
├── README.md
├── processor/
│   ├── Dockerfile
│   ├── main.py
│   ├── requirements.txt
│   └── test_processor.py
└── terraform/
    ├── main.tf
    ├── outputs.tf
    └── variables.tf
You can run the FastAPI processor locally and execute unit tests (which mock GCS and BigQuery APIs) to verify the application logic.
cd processor
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pytest
Follow these steps to deploy the entire pipeline to Google Cloud.
- Google Cloud SDK (gcloud CLI) installed and authenticated.
- Terraform CLI installed.
- An active GCP Project with billing enabled.
Authenticate your local shell with GCP and set your project context:
gcloud auth login
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
Create an Artifact Registry Docker repository in your project, then build the processor container image and push it:
# Define variables
PROJECT_ID="YOUR_PROJECT_ID"
REGION="us-central1"
REPO_NAME="document-processing-repo"
IMAGE_NAME="processor"
TAG="latest"
# 1. Create Artifact Registry repository
gcloud artifacts repositories create $REPO_NAME \
--repository-format=docker \
--location=$REGION \
--description="Docker repository for document processor service"
# 2. Configure Docker authentication for Artifact Registry
gcloud auth configure-docker $REGION-docker.pkg.dev
# 3. Build the container image using Google Cloud Build (runs remotely without local Docker daemon)
cd processor
gcloud builds submit --tag $REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$IMAGE_NAME:$TAG
cd ..
Now, deploy the GCP resources (GCS bucket, Pub/Sub notifications, BigQuery, IAM roles, and Cloud Run):
cd terraform
# Initialize Terraform
terraform init
# Plan and preview the deployment
# Replace the variables below with your project details
terraform plan \
-var="project_id=YOUR_PROJECT_ID" \
-var="bucket_name=YOUR_GLOBALLY_UNIQUE_BUCKET_NAME" \
-var="image_url=us-central1-docker.pkg.dev/YOUR_PROJECT_ID/document-processing-repo/processor:latest"
# Apply changes to GCP
terraform apply \
-var="project_id=YOUR_PROJECT_ID" \
-var="bucket_name=YOUR_GLOBALLY_UNIQUE_BUCKET_NAME" \
-var="image_url=us-central1-docker.pkg.dev/YOUR_PROJECT_ID/document-processing-repo/processor:latest"
Note: The Terraform configuration enables the necessary APIs automatically (which may take a moment on the first run).
To verify that everything is working properly:
- Upload a sample document to the ingestion bucket:
# Upload a text file with keywords
echo "This project report outlines the financial invoices and agreement contracts." > test-document.txt
gcloud storage cp test-document.txt gs://YOUR_GLOBALLY_UNIQUE_BUCKET_NAME/test-document.txt
- Check Cloud Run Logs: Open the Cloud Run console or run:
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=document-processor" --limit 20
You should see entries indicating:
- Event received from Pub/Sub
- File downloaded successfully
- Word count processed (10 words) and tags identified (project, report, financial, invoice, contract, agreement)
- Row streamed successfully to BigQuery
- Query BigQuery Table: Run a query to inspect the uploaded metadata:
bq query --use_legacy_sql=false \
'SELECT filename, word_count, tags, file_type, processed_at FROM `YOUR_PROJECT_ID.document_processing.metadata` LIMIT 10'
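The word count and tags expected for the sample document can be reproduced locally with a small sketch like the one below. It is an illustration under stated assumptions, not the logic in `processor/main.py`: the `KEYWORDS` list is hypothetical (taken from the tags listed for the sample document), tagging is assumed to be a case-insensitive substring match (so "invoices" yields `invoice`), and `file_type` is assumed to be the file extension. Only the field names come from the query above.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Any, Dict

# Hypothetical keyword list, taken from the tags shown for the sample document.
KEYWORDS = ["project", "report", "financial", "invoice", "contract", "agreement"]


def extract_metadata(filename: str, text: str) -> Dict[str, Any]:
    """Build one metadata row using the column names from the query above."""
    lowered = text.lower()
    # Assumed matching rule: a keyword is tagged if it appears as a substring.
    tags = [kw for kw in KEYWORDS if kw in lowered]
    return {
        "filename": filename,
        # Whitespace-separated tokens; punctuation stays attached to words.
        "word_count": len(text.split()),
        "tags": tags,
        # Assumed representation: extension without the dot.
        "file_type": PurePosixPath(filename).suffix.lstrip(".") or "unknown",
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    sample = (
        "This project report outlines the financial invoices "
        "and agreement contracts.\n"
    )
    row = extract_metadata("test-document.txt", sample)
    print(row["word_count"], row["tags"])
    # prints: 10 ['project', 'report', 'financial', 'invoice', 'contract', 'agreement']
```

If the real table stores `tags` as a comma-separated string rather than a repeated field, the list would need to be joined before the row is streamed.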
To tear down the deployed resources and avoid ongoing charges:
terraform destroy \
-var="project_id=YOUR_PROJECT_ID" \
-var="bucket_name=YOUR_GLOBALLY_UNIQUE_BUCKET_NAME" \
-var="image_url=us-central1-docker.pkg.dev/YOUR_PROJECT_ID/document-processing-repo/processor:latest"
You can also delete the Artifact Registry image repository:
gcloud artifacts repositories delete document-processing-repo --location=us-central1 --quiet