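The current Action for submitting Batch jobs to AWS has a few issues:
- ... `build-and-run-batch-job` to use GitHub webhooks to push job state #13.
- ... (`workflow_dispatch`) never run the cleanup job, since a PR doesn't get closed to trigger it.
After a bunch of research, I propose the following changes: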
Switch to Deployments
Rather than polling the Batch job with a running Actions job, we should instead take advantage of GitHub Deployments. This would keep the current strategy for submitting jobs, but would offload status reporting to the Batch job itself.
This would involve adding an entrypoint script to all jobs, which would authenticate with GitHub (via a GitHub App JWT) and then update the deployment status as the Batch job runs.
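A rough sketch of what such an entrypoint could look like, in Python with only the standard library. It assumes the GitHub App JWT has already been exchanged for an installation access token (that exchange needs an RS256 signing library and is left out here). The function names and the way `repo`, `deployment_id`, and `token` reach the container are hypothetical. Only the endpoint (`POST /repos/{owner}/{repo}/deployments/{deployment_id}/statuses`) and the `state` values come from the GitHub REST API.

```python
import json
import subprocess
import urllib.request

GITHUB_API = "https://api.github.com"


def state_for_exit_code(exit_code):
    """Map the wrapped command's exit code to a deployment status state."""
    return "success" if exit_code == 0 else "failure"


def build_status_request(repo, deployment_id, state, token):
    """Build (but do not send) a 'create deployment status' request."""
    url = f"{GITHUB_API}/repos/{repo}/deployments/{deployment_id}/statuses"
    body = json.dumps({"state": state}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            # Assumption: token is an installation access token for the App.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def send_status(repo, deployment_id, state, token):
    """Send the status update; raises urllib.error.HTTPError on failure."""
    request = build_status_request(repo, deployment_id, state, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run_job(command, repo, deployment_id, token,
            send=send_status, run=subprocess.call):
    """Report in_progress, run the real job command, then report the result."""
    send(repo, deployment_id, "in_progress", token)
    exit_code = run(command)
    send(repo, deployment_id, state_for_exit_code(exit_code), token)
    return exit_code
```

The container's entrypoint would call `run_job` with the job's real command and exit with the returned code, so Batch still sees the job's own success or failure.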
This completely obviates the need for the polling logic and solves the 6-hour problem, since no Actions job has to stay running while the Batch job runs. It also allows us to...
Use different deploy environments to manage job types
Instead of controlling job resources by editing the workflow YAML, we can set up different deploy environments, each with an associated job size. This has several advantages:
- It makes it much easier to change the job resources on the fly
- It means we can add a dropdown on the workflow dispatch that lets you pick the deploy env (and therefore resources)
- We can gate the largest, most costly jobs behind deploy protections
I envision this as basically a dropdown with the following environments:
- Small (Fargate, 4 vCPU, 8 GB RAM)
- Medium (Fargate, 16 vCPU, 32 GB RAM)
- Large (EC2, 32 vCPU, 64 GB RAM)
- LargeGPU (EC2, same as large + GPU)
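To make the mapping concrete, here is a minimal sketch of how an environment name could translate into AWS Batch `resourceRequirements` entries (string values, with `MEMORY` in MiB). The dictionary layout and function name are hypothetical, and the GPU count of 1 for LargeGPU is an assumption, since the list above only says "+ GPU".

```python
# Sizes taken from the environment list above.
DEPLOY_ENVIRONMENTS = {
    "Small": {"platform": "FARGATE", "vcpu": 4, "memory_gb": 8, "gpu": 0},
    "Medium": {"platform": "FARGATE", "vcpu": 16, "memory_gb": 32, "gpu": 0},
    "Large": {"platform": "EC2", "vcpu": 32, "memory_gb": 64, "gpu": 0},
    # Assumption: a single GPU; the proposal does not state a count.
    "LargeGPU": {"platform": "EC2", "vcpu": 32, "memory_gb": 64, "gpu": 1},
}


def resource_requirements(environment):
    """Return Batch-style resourceRequirements for a deploy environment."""
    spec = DEPLOY_ENVIRONMENTS[environment]
    requirements = [
        {"type": "VCPU", "value": str(spec["vcpu"])},
        # Batch expects memory in MiB.
        {"type": "MEMORY", "value": str(spec["memory_gb"] * 1024)},
    ]
    if spec["gpu"]:
        requirements.append({"type": "GPU", "value": str(spec["gpu"])})
    return requirements
```

Keeping the sizes in one table like this means changing a job's resources only requires picking a different environment, not editing the workflow.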
Switch to static job queue and compute envs
We currently instantiate the job queue and compute environment for each PR. However, this seems unnecessarily complicated and error-prone. Instead, I propose we create 4 permanent job queues/compute environments, one for each of the environments above.
This way, the only thing we need to Terraform for each workflow is the job definition, which is based on the built container and the deploy env. We can further simplify the cleanup step to run after receiving a deployment status update from AWS. It would then only need to delete the job definition, since the other resources are static.
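As an illustration of how small the cleanup becomes, here is a sketch that deregisters the job definition once a deployment reaches a finished state. It is not the Terraform-based cleanup itself. It assumes a boto3 Batch client is passed in (e.g. `boto3.client("batch")`) and uses its `deregister_job_definition` call. The queue names and function names are hypothetical.

```python
# Hypothetical names for the four permanent job queues.
STATIC_JOB_QUEUES = {
    "Small": "batch-small",
    "Medium": "batch-medium",
    "Large": "batch-large",
    "LargeGPU": "batch-large-gpu",
}

# Deployment status states after which nothing is left running.
FINISHED_STATES = {"success", "failure", "error"}


def queue_for(environment):
    """Look up the static job queue for a deploy environment."""
    return STATIC_JOB_QUEUES[environment]


def cleanup_if_finished(batch_client, deployment_state, job_definition):
    """Deregister the per-workflow job definition once the job has finished.

    job_definition is a "name:revision" string or an ARN. Queues and
    compute environments are static, so they are left untouched.
    Returns True if cleanup ran, False otherwise.
    """
    if deployment_state not in FINISHED_STATES:
        return False
    batch_client.deregister_job_definition(jobDefinition=job_definition)
    return True
```

Because the trigger is the deployment status rather than a PR closing, the same cleanup would also run for jobs started via `workflow_dispatch`.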