
Add first-class Kubernetes safe-to-evict support #250

@nateGeorge

Summary

Kubernetes launchers should expose a first-class, policy-controlled way for a task to request selected pod-level metadata, especially autoscaler eviction protection for long-running jobs.

Problem

Some long-running Kubernetes workloads need pod metadata such as:

cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

to avoid disruption during cluster scale-down. Today, task annotations are launcher/control-plane inputs, not Kubernetes pod annotations. That boundary is good: blindly copying every task annotation onto pod metadata would leak internal launcher controls and could accidentally trigger Kubernetes/webhook behavior.

The gap is that there is no supported way for a task to request such an annotation, so downstream launcher wrappers end up adding one-off allowlists to forward specific pod annotations. That works as a tactical fix, but it is not an ideal upstream API.

Proposal

Add an explicit Kubernetes-launcher API for eviction protection, for example:

tangleml.com/launchers/kubernetes/safe_to_evict: "false"

The Kubernetes launcher would map this to pod metadata:

cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

This should apply to both direct pod launches and Job/multi-node launches by setting the pod template metadata annotations.
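The mapping could look roughly like the sketch below. It is only an illustration under stated assumptions: manifests are represented as plain dicts, and `apply_safe_to_evict` is a hypothetical helper name, not an existing launcher function. The two annotation keys are the ones named in this proposal.

```python
import copy

# Annotation keys taken from the proposal text.
TASK_ANNOTATION = "tangleml.com/launchers/kubernetes/safe_to_evict"
POD_ANNOTATION = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def apply_safe_to_evict(manifest: dict, task_annotations: dict) -> dict:
    """Return a copy of a Pod or Job manifest with the eviction annotation set.

    Hypothetical helper: the real launcher may build typed client objects
    rather than plain dicts.
    """
    result = copy.deepcopy(manifest)
    value = task_annotations.get(TASK_ANNOTATION)
    if value is None:
        # The task did not opt in, so pod metadata is left untouched.
        return result

    if result.get("kind") == "Job":
        # For Jobs, pod metadata lives on the pod template, not the Job itself.
        target = result.setdefault("spec", {}).setdefault("template", {})
    else:
        target = result

    annotations = target.setdefault("metadata", {}).setdefault("annotations", {})
    # Only the one supported key is written; other task annotations are ignored.
    annotations[POD_ANNOTATION] = value
    return result
```

Returning a copy keeps the caller's manifest unchanged, which makes the "no annotation requested" case easy to test.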

A more general follow-up option would be a validated/allowlisted pod-annotation passthrough API, but the autoscaler case can be solved with a narrower first-class knob.
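For comparison, the more general passthrough might be sketched as follows. Everything here is an assumption for illustration: the allowlist contents, the `select_pod_annotations` name, and the choice to reject unknown keys rather than silently drop them.

```python
# Hypothetical operator-configured allowlist of pod annotation keys.
ALLOWED_POD_ANNOTATIONS = frozenset(
    {"cluster-autoscaler.kubernetes.io/safe-to-evict"}
)


def select_pod_annotations(requested: dict, allowed=ALLOWED_POD_ANNOTATIONS) -> dict:
    """Return the requested pod annotations if every key is allowlisted.

    Raises ValueError for any key outside the allowlist, so a task cannot
    mutate arbitrary pod metadata by accident.
    """
    rejected = sorted(key for key in requested if key not in allowed)
    if rejected:
        raise ValueError(f"pod annotations not allowlisted: {rejected}")
    return dict(requested)
```

Failing loudly on unknown keys is one possible policy; dropping them with a warning would be another.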

Non-goals

  • Do not pass through all task annotations to Kubernetes pod metadata.
  • Do not make arbitrary Kubernetes metadata mutation the default behavior.
  • Do not require task authors to know every internal launcher annotation.

Acceptance criteria

  • Kubernetes pod launcher supports an explicit safe-to-evict task annotation.
  • Kubernetes Job / multi-node launcher sets the annotation on spec.template.metadata.annotations.
  • Boolean and string values are validated or normalized to the string "true" or "false", since Kubernetes annotation values must be strings.
  • Tests cover both forwarding the supported annotation and not forwarding arbitrary task annotations.
  • Documentation explains when to use this for long-running jobs.
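The value-validation criterion above could be met with something like this minimal sketch. The function name `normalize_safe_to_evict` and the exact set of accepted inputs are assumptions, not settled behavior.

```python
def normalize_safe_to_evict(value) -> str:
    """Normalize a task-supplied value to the string "true" or "false".

    Accepts Python booleans and case-insensitive "true"/"false" strings;
    anything else raises ValueError.
    """
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ("true", "false"):
            return lowered
    raise ValueError(f"unsupported safe_to_evict value: {value!r}")
```

For example, `normalize_safe_to_evict(False)` and `normalize_safe_to_evict(" False ")` both yield `"false"`, while `"maybe"` or `0` is rejected.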

Migration note

Downstream deployments may carry a narrow allowlisted passthrough as an immediate operational fix. Once this exists upstream and is released, those wrappers can adopt the Tangle version and remove the local one-off mapping.
