iPablo26/data-sci-ai-cloud-project

Serverless Event-Driven Document Processing Pipeline on GCP

This project implements a fully serverless, event-driven document processing pipeline on Google Cloud Platform (GCP).

Architecture

When a file is uploaded to the Cloud Storage ingestion bucket, Cloud Storage publishes an object-created notification to a Pub/Sub topic. A push subscription delivers the event payload, authenticated with an OIDC token, to a Python FastAPI service running on Cloud Run. The processor service then downloads the file, performs content-aware text parsing (a stand-in for OCR), extracts metadata (word count, keyword tags, and file details), and streams the metadata into a BigQuery table.

graph TD
    User([User / Client]) -->|Upload File| GCS[Cloud Storage Bucket]
    GCS -->|Object Created Notification| PubSubTopic[Pub/Sub Topic]
    PubSubTopic -->|Push Subscription + OIDC Token| CloudRun[Cloud Run: FastAPI Processor]
    CloudRun -->|1. Download File| GCS
    CloudRun -->|2. Process Simulated OCR| CloudRun
    CloudRun -->|3. Stream Metadata| BQ[BigQuery Dataset & Table]
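
The push delivery step in the diagram can be sketched as follows. This is a minimal illustration, not the repository's main.py: it assumes the standard Pub/Sub push envelope (a JSON body whose message.data field is base64-encoded) and a Cloud Storage notification in JSON payload format, where the decoded data carries the object's bucket and name. The names parse_gcs_event and GcsEvent are hypothetical.

```python
import base64
import json
from dataclasses import dataclass


@dataclass
class GcsEvent:
    """Hypothetical container for the fields the processor needs."""
    bucket: str
    name: str
    event_type: str


def parse_gcs_event(envelope: dict) -> GcsEvent:
    """Decode a Pub/Sub push envelope carrying a Cloud Storage notification."""
    message = envelope.get("message")
    if not isinstance(message, dict) or "data" not in message:
        raise ValueError("not a Pub/Sub push envelope")

    # Pub/Sub base64-encodes the message payload in push deliveries.
    payload = json.loads(base64.b64decode(message["data"]).decode("utf-8"))

    # Cloud Storage notifications also set attributes such as eventType.
    attributes = message.get("attributes") or {}
    return GcsEvent(
        bucket=payload["bucket"],
        name=payload["name"],
        event_type=attributes.get("eventType", ""),
    )


# Example envelope shaped like a push delivery for an uploaded object.
_object = {"bucket": "my-ingestion-bucket", "name": "test-document.txt"}
example_envelope = {
    "message": {
        "data": base64.b64encode(json.dumps(_object).encode("utf-8")).decode("ascii"),
        "attributes": {"eventType": "OBJECT_FINALIZE"},
        "messageId": "1",
    },
    "subscription": "projects/YOUR_PROJECT_ID/subscriptions/example",
}
event = parse_gcs_event(example_envelope)
```

In a FastAPI handler, the envelope would be the parsed JSON request body. Returning a success status acknowledges the message, while an error status causes Pub/Sub to redeliver it.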

Project Directory Structure

├── README.md
├── processor/
│   ├── Dockerfile
│   ├── main.py
│   ├── requirements.txt
│   └── test_processor.py
└── terraform/
    ├── main.tf
    ├── outputs.tf
    └── variables.tf

Local Development & Testing

You can run the FastAPI processor locally and execute unit tests (which mock GCS and BigQuery APIs) to verify the application logic.

1. Set up Virtual Environment

cd processor
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

2. Run Tests

pytest
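
As a rough idea of what a mocked test can look like, the sketch below replaces the GCS and BigQuery clients with unittest.mock objects. It is not taken from test_processor.py: process_object is a hypothetical stand-in for the processing logic, and the only real client methods it assumes are bucket(), blob(), download_as_text() and insert_rows_json().

```python
from unittest.mock import MagicMock


def process_object(storage_client, bq_client, bucket_name, object_name, table_id):
    """Hypothetical processing step: download text, count words, insert one row."""
    text = storage_client.bucket(bucket_name).blob(object_name).download_as_text()
    row = {"filename": object_name, "word_count": len(text.split())}
    # insert_rows_json returns a list of per-row errors; empty means success.
    errors = bq_client.insert_rows_json(table_id, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
    return row


def test_process_object_streams_row():
    # Stand-ins for google.cloud.storage.Client and google.cloud.bigquery.Client.
    storage_client = MagicMock()
    blob = storage_client.bucket.return_value.blob.return_value
    blob.download_as_text.return_value = "three short words"

    bq_client = MagicMock()
    bq_client.insert_rows_json.return_value = []

    row = process_object(storage_client, bq_client, "bucket", "a.txt", "p.d.t")

    assert row == {"filename": "a.txt", "word_count": 3}
    bq_client.insert_rows_json.assert_called_once_with("p.d.t", [row])
```

Because the clients are passed in as arguments, the test needs no credentials and makes no network calls.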

GCP Deployment Guide

Follow these steps to deploy the entire pipeline to Google Cloud.

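Prerequisites

The steps below assume a GCP project you can deploy to (shown as YOUR_PROJECT_ID), the gcloud CLI (including the bq tool used in Step 4), and Terraform installed locally.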

Step 1: Authenticate with Google Cloud

Authenticate your local shell with GCP and set your project context:

gcloud auth login
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

Step 2: Create Artifact Registry & Push Container Image

Create a Docker registry in your project, build the processor container image, and push it:

# Define variables
PROJECT_ID="YOUR_PROJECT_ID"
REGION="us-central1"
REPO_NAME="document-processing-repo"
IMAGE_NAME="processor"
TAG="latest"

# 1. Create Artifact Registry repository
gcloud artifacts repositories create $REPO_NAME \
    --repository-format=docker \
    --location=$REGION \
    --description="Docker repository for document processor service"

# 2. Configure Docker authentication for Artifact Registry
gcloud auth configure-docker $REGION-docker.pkg.dev

# 3. Build the container image using Google Cloud Build (runs remotely without local Docker daemon)
cd processor
gcloud builds submit --tag $REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$IMAGE_NAME:$TAG
cd ..
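
The image path passed to --tag above follows a fixed pattern, and the same string is later supplied to Terraform as image_url. The small sketch below shows how the pieces fit together; the helper name artifact_registry_image_url is hypothetical and only mirrors the shell variables defined in this step.

```python
def artifact_registry_image_url(project_id, region, repo_name, image_name, tag="latest"):
    """Compose the image path in the same shape as the --tag argument above."""
    return f"{region}-docker.pkg.dev/{project_id}/{repo_name}/{image_name}:{tag}"


image_url = artifact_registry_image_url(
    project_id="YOUR_PROJECT_ID",
    region="us-central1",
    repo_name="document-processing-repo",
    image_name="processor",
)
print(image_url)
```

If you change the region, repository name, or tag in this step, the image_url value used in Step 3 has to change with it.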

Step 3: Deploy Infrastructure with Terraform

Now, deploy the GCP resources (GCS bucket, Pub/Sub notifications, BigQuery, IAM roles, and Cloud Run):

cd terraform

# Initialize Terraform
terraform init

# Plan and preview the deployment
# Replace the variables below with your project details
terraform plan \
  -var="project_id=YOUR_PROJECT_ID" \
  -var="bucket_name=YOUR_GLOBALLY_UNIQUE_BUCKET_NAME" \
  -var="image_url=us-central1-docker.pkg.dev/YOUR_PROJECT_ID/document-processing-repo/processor:latest"

# Apply changes to GCP
terraform apply \
  -var="project_id=YOUR_PROJECT_ID" \
  -var="bucket_name=YOUR_GLOBALLY_UNIQUE_BUCKET_NAME" \
  -var="image_url=us-central1-docker.pkg.dev/YOUR_PROJECT_ID/document-processing-repo/processor:latest"

Note: The Terraform configuration enables the required APIs automatically, which may take a moment on the first run.


Step 4: Verification and Testing

To verify that everything is working properly:

  1. Upload a sample document to the ingestion bucket:

    # Upload a text file with keywords
    echo "This project report outlines the financial invoices and agreement contracts." > test-document.txt
    gcloud storage cp test-document.txt gs://YOUR_GLOBALLY_UNIQUE_BUCKET_NAME/test-document.txt
  2. Check Cloud Run Logs: Open the Cloud Run console or run:

    gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=document-processor" --limit 20

    You should see entries indicating:

    • Event received from Pub/Sub
    • File downloaded successfully
    • Word count processed (10 words) and tags identified (project, report, financial, invoice, contract, agreement)
    • Row streamed successfully to BigQuery.
  3. Query BigQuery Table: Run a query to inspect the uploaded metadata:

    bq query --use_legacy_sql=false \
      'SELECT filename, word_count, tags, file_type, processed_at FROM `YOUR_PROJECT_ID.document_processing.metadata` LIMIT 10'
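
To sanity-check the values expected in the logs, the sketch below recomputes the word count and tags for the sample sentence. It is an illustration under assumptions, not the processor's actual logic: the keyword list and the substring matching rule are guesses chosen to reproduce the tags listed in step 2.

```python
# Hypothetical keyword list, chosen to match the tags listed in step 2.
KEYWORDS = ("project", "report", "financial", "invoice", "contract", "agreement")


def extract_metadata(text):
    """Count whitespace-separated words and collect matching keyword tags."""
    lowered = text.lower()
    return {
        "word_count": len(text.split()),
        # Substring matching, so "invoices" yields the tag "invoice".
        "tags": [keyword for keyword in KEYWORDS if keyword in lowered],
    }


sample = "This project report outlines the financial invoices and agreement contracts."
metadata = extract_metadata(sample)
print(metadata["word_count"], metadata["tags"])
```

With this rule the sample sentence yields 10 words and all six tags, which matches the log entries described above.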

Step 5: Clean Up Resources

To tear down the deployed resources and avoid ongoing charges, run the following from the terraform directory with the same variable values used in Step 3:

terraform destroy \
  -var="project_id=YOUR_PROJECT_ID" \
  -var="bucket_name=YOUR_GLOBALLY_UNIQUE_BUCKET_NAME" \
  -var="image_url=us-central1-docker.pkg.dev/YOUR_PROJECT_ID/document-processing-repo/processor:latest"

You can also delete the Artifact Registry image repository:

gcloud artifacts repositories delete document-processing-repo --location=us-central1 --quiet
