A production-ready, fully self-hosted Git service and CI/CD platform built on a 6-node K3s cluster. Infrastructure-as-code all the way down — every component declared, versioned, and reproducible.
- Overview
- Architecture
- Cluster Topology
- IP Addressing
- Component Stack
- Design Decisions
- Network Policy & Security
- CI/CD Autoscaling
- Deployment
- Repository Structure
This repository contains all Kubernetes manifests needed to deploy and operate a self-hosted GitOps platform. The platform is built on a 6-node K3s cluster (3 control-plane + 3 workers) and provides:
| Capability | Technology |
|---|---|
| Source control & Actions | Gitea |
| CI/CD execution | Act Runner (GitHub Actions-compatible) |
| Autoscaling | KEDA |
| Ingress + TCP routing | Traefik v3 |
| TLS automation | cert-manager + Let's Encrypt |
| Load balancing | MetalLB (L2/ARP mode) |
| Control-plane HA | kube-vip (ARP) |
| Automated node maintenance | kured |
Everything is managed through Kustomize with Helm used solely for the Gitea application chart. There is no GitOps controller (Flux/ArgoCD) — deployments are driven by idempotent shell scripts that wrap kubectl and helm.

      ┌─────────────────────────────────────────────────┐
      │                 LAN / Internet                  │
      └───────────────────────┬─────────────────────────┘
                              │
           ┌──────────────────▼──────────────────┐
           │      MetalLB VIP 172.16.69.60       │
           │     (L2/ARP, announced on LAN)      │
           └──────────────────┬──────────────────┘
                              │
           ┌──────────────────▼──────────────────┐
           │        Traefik (2 replicas)         │
           │      :80 (HTTP)   :443 (HTTPS)      │
           │        :2222 (TCP, Git SSH)         │
           └──────┬──────────────────┬───────────┘
                  │ HTTP(S)          │ TCP/SSH
       ┌──────────▼──────┐   ┌───────▼──────────┐
       │  IngressRoute   │   │ IngressRouteTCP  │
       │ git.open-ict.hu │   │    port 2222     │
       └──────────┬──────┘   └───────┬──────────┘
                  │                  │
       ┌──────────▼──────────────────▼──────────┐
       │           Gitea (1 replica)            │
       │        rootless · port 3000/22         │
       └──────────┬──────────────────┬──────────┘
                  │                  │
    ┌─────────────▼──────┐ ┌─────────▼───────────────┐
    │   PostgreSQL HA    │ │   Valkey Cluster        │
    │   2 replicas       │ │   6 nodes (3M + 3R)     │
    │   + 2 pgpool       │ │   hard anti-affinity    │
    └────────────────────┘ └─────────────────────────┘

    ┌──────────────────────────────────────────────────────────────────────┐
    │                           Control Plane HA                           │
    │                                                                      │
    │                 kube-vip VIP 172.16.69.50:6443 (ARP)                 │
    │               ┌──────────┐  ┌──────────┐  ┌──────────┐               │
    │               │ master1  │  │ master2  │  │ master3  │               │
    │               └──────────┘  └──────────┘  └──────────┘               │
    └──────────────────────────────────────────────────────────────────────┘

    ┌──────────────────────────────────────────────────────────────────────┐
    │                       CI/CD Autoscaling (KEDA)                       │
    │                                                                      │
    │  Gitea Job Queue ──► KEDA ScaledObject ──► Act Runner pods           │
    │  (poll every 15s)    min=0 / max=10        (dind sidecar)            │
    └──────────────────────────────────────────────────────────────────────┘

| Node | Role | Description |
|---|---|---|
| master1 | Control Plane | Cluster init node, bootstraps kube-vip + platform components |
| master2 | Control Plane | Joins via VIP 172.16.69.50:6443 |
| master3 | Control Plane | Joins via VIP 172.16.69.50:6443 |
| worker1 | Worker | Runs application workloads |
| worker2 | Worker | Runs application workloads |
| worker3 | Worker | Runs application workloads |
The three control-plane nodes provide etcd quorum — the cluster tolerates the loss of one control-plane node without interruption. kube-vip floats the API VIP across whichever control-plane node is the current leader.
| Address | Role | Component |
|---|---|---|
| 172.16.69.50 | Control-plane VIP | kube-vip |
| 172.16.69.60 | Application LoadBalancer VIP | MetalLB |
Both VIPs are announced via ARP (Layer 2), which works well on a flat LAN (e.g. Proxmox virtual network). Upstream routing is not required.
In-cluster resolution of git.open-ict.hu is handled with hostAliases injected directly into the cert-manager and KEDA operator pods, pointing the hostname at the MetalLB VIP. This avoids a dependency on split-horizon DNS while keeping the Let's Encrypt HTTP-01 challenge and the KEDA runner-queue API working from inside the cluster.
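As an illustration of the hostAliases mechanism, here is a minimal sketch that builds the pod-template patch by hand. The helper name `host_alias_patch` is hypothetical, and the repository itself applies the alias through Kustomize patches rather than `kubectl patch`; the deployment and namespace names in the usage comment are assumptions.

```shell
# Build a strategic-merge patch that adds one hostAliases entry
# to a Deployment's pod template.
# Arguments: <ip> <hostname>
host_alias_patch() {
  printf '{"spec":{"template":{"spec":{"hostAliases":[{"ip":"%s","hostnames":["%s"]}]}}}}\n' "$1" "$2"
}

# Example (needs cluster access; "keda" and "keda-operator" are assumed names):
#   kubectl -n keda patch deployment keda-operator --type=strategic \
#     -p "$(host_alias_patch 172.16.69.60 git.open-ict.hu)"
```

With such an entry in place, the kubelet writes the mapping into the pod's /etc/hosts, so the hostname resolves to the VIP without any change to cluster DNS.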
The platform components below are deployed during cluster bootstrap via scripts/01-bootstrap-first-master.sh. They are prerequisites for everything else.
| Component | Version | Namespace | Purpose |
|---|---|---|---|
| K3s | v1.32.3+k3s1 | n/a | Lightweight Kubernetes distribution |
| kube-vip | v0.8.7 | kube-system | Floating VIP for the Kubernetes API server |
| MetalLB | v0.14.9 | metallb-system | L2 load balancer for application services |
| kured | v1.15.0 | kube-system | Automated rolling node reboot (weekdays 02:00–05:00) |
| KEDA | v2.15.1 | keda | Event-driven pod autoscaling |
The application components below are deployed via scripts/04-deploy-apps.sh. Order matters: cert-manager must be ready before Gitea, and Traefik must have its LoadBalancer IP before routes are created.
| Component | Version | Namespace | Purpose |
|---|---|---|---|
| Traefik | v3.3.4 | traefik | Ingress controller + TCP proxy |
| cert-manager | v1.15.3 | cert-manager | Automated TLS certificates via Let's Encrypt |
| Gitea | 1.23.8 | gitea | Self-hosted Git service with Actions support |
| PostgreSQL HA | chart | gitea | Highly available database for Gitea |
| Valkey Cluster | chart | gitea | Distributed cache and session store |
| Act Runner | 0.4.1 | gitea-runners | GitHub Actions-compatible CI/CD executor |
K3s ships with its own ServiceLB (formerly Klipper), which satisfies LoadBalancer services by running a hostPort DaemonSet on every node. It is functional but has a critical limitation: it cannot guarantee a stable, single IP address across the cluster. Different nodes can advertise the service at their own node IPs, which creates ambiguity and breaks DNS-based routing.
MetalLB in L2 mode solves this cleanly:
- A single virtual IP (172.16.69.60) is announced via ARP.
- The speaker pod that wins leader election holds the VIP; if that node goes down, a new speaker takes over and announces the VIP within seconds.
- Traefik's LoadBalancer service always resolves to one predictable IP, which is what DNS records and Let's Encrypt HTTP-01 challenges depend on.
K3s is launched with --disable=servicelb to remove the conflict.
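A minimal sketch of how the first server could be started with ServiceLB disabled. The helper name `k3s_first_server_args` is hypothetical, and `--cluster-init` and `--tls-san` are assumptions about what the bootstrap script passes; only `--disable=servicelb` is stated above, and the real script may pass further flags.

```shell
# Print the K3s server arguments for the first control-plane node.
# Arguments: <api-vip>
k3s_first_server_args() {
  # --cluster-init       : start embedded etcd (assumed, needed for 3 servers)
  # --disable=servicelb  : leave LoadBalancer services to MetalLB
  # --tls-san            : put the API VIP into the server certificate (assumed)
  printf '%s\n' "server --cluster-init --disable=servicelb --tls-san=$1"
}

# Example (installs K3s on the current machine, so it is left commented out):
#   curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.32.3+k3s1 \
#     sh -s - $(k3s_first_server_args 172.16.69.50)
```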
kube-vip runs as a static pod on each control-plane node (placed directly into /etc/kubernetes/manifests/ before K3s starts). The kube-vip instances elect a leader among themselves, and the leader announces the VIP 172.16.69.50 via ARP, so the address always sits on a control-plane node that is currently healthy.
This is kept completely separate from MetalLB by design:
- kube-vip is responsible for API server access only — it never touches application traffic.
- MetalLB is responsible for application LoadBalancer services only.
Mixing the two responsibilities into one tool would make troubleshooting harder and couple the availability of kubectl access to the availability of your application IP pool.
Gitea supports Git-over-SSH. Rather than exposing an additional LoadBalancer service (which would consume a second IP from the MetalLB pool), SSH traffic is routed through Traefik via a dedicated TCP entrypoint on port 2222.
The IngressRouteTCP resource in apps/gitea/ingressroute-tcp.yaml matches all traffic on that entrypoint using HostSNI('*') (TCP passthrough — no TLS inspection) and forwards it to the Gitea SSH service. From the user's perspective, their Git remote is simply ssh://git.open-ict.hu:2222.
This keeps the entire platform reachable through a single IP address.
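For illustration, a small helper that builds the clone URL a user would configure. The function name `gitea_ssh_url` is hypothetical, and the `git` SSH user and the owner/repository path layout are assumptions based on common Gitea setups rather than something stated above.

```shell
# Build the SSH remote URL for a repository behind the Traefik TCP entrypoint.
# Arguments: <owner> <repo>
gitea_ssh_url() {
  printf 'ssh://git@git.open-ict.hu:2222/%s/%s.git\n' "$1" "$2"
}

# Example:
#   git remote add origin "$(gitea_ssh_url alice demo)"
```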
Linux restricts binding to ports below 1024 to processes running as root. Traefik runs as UID 65532 (non-root). Rather than granting the NET_BIND_SERVICE capability, Traefik listens on high ports (8000, 8443, 2222) internally. The LoadBalancer service maps the standard external ports (80, 443, 2222) to these high internal ports via targetPort. No capabilities needed, no root required.
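The port mapping can be sketched as a lookup plus a privilege check. Both function names are hypothetical, and the 1024 threshold assumes the Linux default for the unprivileged port range.

```shell
# External Service port -> internal Traefik listen port (values from the text above).
traefik_target_port() {
  case "$1" in
    80)   echo 8000 ;;
    443)  echo 8443 ;;
    2222) echo 2222 ;;
    *)    return 1 ;;
  esac
}

# Succeeds when binding the port would need root or CAP_NET_BIND_SERVICE.
is_privileged_port() {
  [ "$1" -lt 1024 ]
}

# Report whether each internal port can be bound by a non-root process.
for external in 80 443 2222; do
  internal=$(traefik_target_port "$external")
  if is_privileged_port "$internal"; then
    echo "$external -> $internal (would need NET_BIND_SERVICE)"
  else
    echo "$external -> $internal (unprivileged)"
  fi
done
```

All three internal ports are above the threshold, which is why the container can drop every capability.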
Gitea's data lives in PostgreSQL. A single-instance database is a hard availability boundary — if the pod restarts or the node is drained for maintenance, Gitea becomes unavailable until it recovers.
The PostgreSQL HA chart deploys:
- 2 PostgreSQL replicas — one primary, one hot standby with streaming replication.
- 2 pgpool replicas — connection pool and query router. pgpool handles failover promotion transparently; Gitea only ever connects to pgpool, never directly to a Postgres pod.
This means a PostgreSQL primary failure causes a brief pause while pgpool promotes the standby, after which Gitea automatically reconnects — rather than a full outage until a pod is rescheduled.
Valkey (a Redis-compatible fork) is deployed as a 6-node cluster: 3 shards, each with a primary and a replica. All 6 pods have hard pod anti-affinity on kubernetes.io/hostname, meaning each must land on a different node.
With 6 Valkey pods on a 6-node cluster (3 control-plane + 3 workers), this guarantees:
- No two Valkey pods share a node.
- The cluster can survive the loss of one shard's primary and still serve cache traffic: the other two shards (4 pods) are unaffected, and the failed shard's replica takes over.
- A full node failure affects at most one shard (one of its two pods), not the entire cache.
This is deliberately over-provisioned for a platform of this scale — the goal is to demonstrate cluster-aware placement and HA patterns.
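The placement guarantee can be checked after deployment with a small filter. This is a sketch: the function name `check_one_pod_per_node` is hypothetical, and the label selector in the usage comment is an assumption about how the chart labels its pods.

```shell
# Read "pod node" pairs on stdin and fail if any node hosts more than one pod.
check_one_pod_per_node() {
  awk '
    { if (seen[$2]++) { printf "node %s hosts more than one pod\n", $2; bad = 1 } }
    END { exit bad + 0 }
  '
}

# Example (needs cluster access; the label selector is an assumption):
#   kubectl -n gitea get pods -l app.kubernetes.io/name=valkey-cluster \
#     -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers \
#     | check_one_pod_per_node
```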
Each Act Runner pod registers itself with Gitea on startup using a one-time registration token and deregisters on graceful shutdown. This means:
- Crashed or deleted pods do not leave zombie runner registrations behind in Gitea.
- New pods are always registered with a fresh identity — no stale state from previous runs.
- The runner registration token is generated by the bootstrap script via the Gitea API and stored as a Kubernetes Secret; it is never committed to this repository.
The termination grace period is set to 3660 seconds (one hour plus one minute). This gives a running CI job a full hour to complete before the pod is force-killed during a rolling update or scale-down event.
The runner Deployment starts at 0 replicas. KEDA watches the Gitea Actions job queue (via the github-runner trigger, which is Gitea-compatible) and scales the deployment based on queued job count:
| Condition | Replicas |
|---|---|
| No jobs queued | 0 (or 2 if Gitea API is unreachable for 3 consecutive polls) |
| Jobs queued | 1 runner per queued job, up to 10 |
| Post-job cooldown | Scales back down after 120 seconds |
The fallback minimum of 2 exists as a safety net: if the Gitea API is temporarily unreachable, KEDA switches to fallback mode and maintains a minimum of 2 runners rather than scaling to zero, so jobs do not get stuck with no runner available.
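The scaling rule in the table can be written down as a small function. This is a simplified model of the behaviour described above, not KEDA's implementation; the function name and its two arguments are hypothetical, and the cooldown is not modelled.

```shell
# Desired runner count for a given queue length and poll-failure streak.
# Arguments: <queued_jobs> <consecutive_failed_polls>
desired_runners() {
  queued=$1
  failed_polls=$2
  max=10               # upper bound on runner pods
  fallback=2           # replicas kept while the Gitea API is unreachable
  failure_threshold=3  # failed polls before fallback applies

  if [ "$failed_polls" -ge "$failure_threshold" ]; then
    echo "$fallback"
  elif [ "$queued" -gt "$max" ]; then
    echo "$max"
  else
    echo "$queued"
  fi
}

desired_runners 0 0    # prints 0
desired_runners 4 0    # prints 4
desired_runners 25 0   # prints 10
desired_runners 0 3    # prints 2
```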
All namespaces with application workloads have explicit NetworkPolicy resources. The default posture is deny-all ingress and egress, with specific allow rules for each required communication path.
| Direction | Peer | Ports | Purpose |
|---|---|---|---|
| Ingress | traefik namespace | 3000, 22 | HTTP and SSH from ingress controller |
| Ingress | gitea-runners namespace | 3000 | Runner API calls |
| Ingress | keda namespace | 3000 | KEDA job-queue polling |
| Egress | gitea namespace (pgpool) | 5432 | Database connections |
| Egress | gitea namespace (valkey) | 6379 | Cache and session store |
| Egress | External | 443, 25 | HTTPS outbound + SMTP for notifications |
| Component | UID | Read-only rootfs | Seccomp | Capabilities |
|---|---|---|---|---|
| Traefik | 65532 | Yes | RuntimeDefault | drop ALL |
| cert-manager | 1000 | Yes | RuntimeDefault | drop ALL |
| Gitea | 1000 | No (writable app dir) | RuntimeDefault | drop ALL |
| Act Runner | 1000 | No | RuntimeDefault | drop ALL |
| dind sidecar | root | No | Unconfined | SYS_ADMIN (required for mount/overlayfs) |
The dind sidecar is the only privileged workload and is unavoidable for Docker-in-Docker CI execution. It is isolated to the gitea-runners namespace and cannot reach the Gitea or platform namespaces except through the allowed network policy rules.

    Gitea Actions job pushed
            │
            ▼
    KEDA polls Gitea API (every 15s)
      GET /api/v1/repos/.../actions/runners?status=queued
            │
            ▼
    ScaledObject computes desired replicas
      (1 runner per queued job, 0–10 range)
            │
            ▼
    Kubernetes scales the runner Deployment
            │
    Each new pod:
      init → register with Gitea API (gets runner token)
      main → act_runner daemon picks up jobs
      dind → Docker daemon on 127.0.0.1:2375
            │
    On scale-down (SIGTERM):
      act_runner drains current job (up to 3660s grace period)
      init → deregister from Gitea API

Supported job labels: ubuntu-latest, ubuntu-24.04, ubuntu-22.04
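The register, run, deregister ordering from the flow above can be sketched as a wrapper function. The three helpers are hypothetical stubs standing in for the real registration call, the act_runner daemon, and the deregistration call; signal forwarding and the grace period are left out of this sketch.

```shell
# Hypothetical stubs: replace with the real registration, daemon, and API calls.
register_runner()   { echo "registered"; }
run_daemon()        { echo "daemon exited"; }
deregister_runner() { echo "deregistered"; }

# Register first, run until the daemon exits, then always deregister.
# The daemon's exit status is passed on to the caller.
runner_lifecycle() {
  register_runner || return 1
  status=0
  run_daemon || status=$?
  deregister_runner
  return "$status"
}
```

Deregistering even when the daemon fails is what keeps stale runner entries out of Gitea in this model.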
Full command reference is in COMMANDS.md.
- 6 Linux nodes reachable over SSH
- IP range 172.16.69.50–172.16.69.60 available on the LAN
- DNS record: git.open-ict.hu → 172.16.69.60
- Internet access for pulling images and Let's Encrypt challenges

    # 1. Initialize the first control-plane node
    bash scripts/01-bootstrap-first-master.sh
    # 2. Join the remaining control-plane nodes (run on master2, master3)
    bash scripts/02-join-control-plane.sh
    # 3. Join worker nodes (run on worker1–3)
    bash scripts/03-join-worker.sh
    # 4. Deploy the application stack (run on master1)
    bash scripts/04-deploy-apps.sh

Each script is idempotent. Re-running it will not duplicate resources. To tear down the application layer:

    bash scripts/05-reset-apps.sh

The bootstrap scripts generate and store the following as Kubernetes Secrets at deploy time:
| Secret | Namespace | Contents |
|---|---|---|
| gitea-admin | gitea | Gitea admin username + password |
| gitea-runner-token | gitea-runners | Act Runner registration token |
| keda-gitea-token | keda / gitea-runners | Gitea API token for KEDA |
| gitea-postgresql | gitea | Database credentials |
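One way to generate a credential at deploy time and store it without committing it is sketched below. The function names and the secret key name in the usage comment are hypothetical, and the actual scripts may generate or reuse values differently.

```shell
# Generate a random alphanumeric password of the requested length (default 32).
random_password() {
  LC_ALL=C tr -dc 'A-Za-z0-9' </dev/urandom | head -c "${1:-32}"
}

# Create or update a generic Secret; does not fail if it already exists.
# Arguments: <namespace> <name> <key> <value>
apply_secret() {
  kubectl -n "$1" create secret generic "$2" \
    --from-literal="$3=$4" --dry-run=client -o yaml | kubectl apply -f -
}

# Example (needs cluster access; the key name "password" is an assumption):
#   apply_secret gitea gitea-admin password "$(random_password)"
```

Calling `random_password` on every run would rotate the value, so a script that must keep an existing credential would first check whether the Secret is already present.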
The repository separates platform infrastructure, application workloads, and operational scripts. The root level contains:
- install.sh: Entry point script for bootstrapping the first control-plane node. It sources the main bootstrap script and should be run on the initial master node.
- COMMANDS.md: Comprehensive command reference for all deployment and operational tasks.
- README.md: This documentation file.
The scripts/ directory contains all automation scripts for cluster lifecycle management:
- 01-bootstrap-first-master.sh: Initializes the first control-plane node and deploys core platform components.
- 02-join-control-plane.sh: Used to join additional control-plane nodes to the cluster.
- 03-join-worker.sh: Used to join worker nodes.
- 04-deploy-apps.sh: Deploys the full application stack (Traefik, cert-manager, Gitea, etc.).
- 05-reset-apps.sh: Removes all application workloads from the cluster.
- lib-functions.sh: Shared Bash functions used by other scripts.
Holds all platform-level Kubernetes manifests and Kustomize overlays:
- metallb/: Configures MetalLB for L2 load balancing and IP address pool management.
- system/: Contains kube-vip static pod manifests for control-plane HA and kured for automated node reboots.
- rbac/: RBAC policies for system daemons and controllers.
- keda/: KEDA operator deployment and host alias patches for autoscaling.
- configs/: Additional platform configuration overlays.
The apps/ directory contains application-specific Kubernetes manifests and Kustomize overlays:
- traefik/: Ingress controller configuration. The base/ subdirectory includes deployment, service, RBAC, and IngressClass resources.
- cert-manager/: PKI automation for TLS certificates. Includes base/ for the upstream release and issuers/ for Let's Encrypt ClusterIssuers.
- gitea/: Self-hosted Git service. Contains:
  - values.yaml: Helm chart values for Gitea deployment.
  - ingressroute-tcp.yaml: Traefik TCP route for SSH (port 2222).
  - middleware.yaml: Rate limiting and HTTPS redirect policies.
  - networkpolicy*.yaml: Network isolation for Gitea, PostgreSQL, and Valkey.
- gitea-runner/: CI/CD runner deployment. The base/ subdirectory includes the runner Deployment, KEDA ScaledObject for autoscaling, and NetworkPolicy for isolation.
- anubis/: Example application with its own namespace, certificate, deployment, service, ingress, middleware, and network policies.
- namespace.yaml: Defines the Kubernetes namespace for the component.
- deployment.yaml: Describes the Deployment resource for running pods.
- service.yaml: Exposes the application internally or externally.
- ingressroute.yaml / ingressroute-tcp.yaml: Traefik-specific routing for HTTP(S) and TCP (SSH) traffic.
- middleware.yaml: Traefik middleware for rate limiting, redirects, etc.
- networkpolicy.yaml: Enforces network segmentation and security.
- certificate.yaml: Requests TLS certificates via cert-manager.
- policy-configmap.yaml: Stores policy configuration for apps.
- kustomization.yaml: Kustomize manifest for composing resources.
- values.yaml: Helm values for templated deployments (Gitea).
Middleware resources are defined in middleware.yaml files found in various application directories (e.g., apps/gitea/middleware.yaml, apps/anubis/middleware.yaml). These files configure Traefik middleware components such as:
- Rate limiting: Protects backend services from excessive requests.
- HTTPS redirection: Ensures all HTTP traffic is redirected to HTTPS.
- Header manipulation: Adds or modifies HTTP headers for security or compliance.
Each service can have its own middleware configuration, referenced by its IngressRoute or IngressRouteTCP resource. This modular approach allows for fine-grained traffic management and security policies per application.
- Kubernetes: Container orchestration and workload management.
- Kustomize: Native Kubernetes configuration management and overlays.
- Helm: Used only for Gitea application deployment.
- Traefik: Ingress controller and TCP proxy for HTTP(S) and SSH traffic. Traefik's middleware system is used to implement rate limiting, HTTPS redirection, and header manipulation for enhanced security and traffic management.
- cert-manager: Automated TLS certificate management with Let's Encrypt.
- MetalLB: L2 load balancer for exposing services with stable IPs.
- KEDA: Event-driven autoscaling for CI/CD runners.
- kube-vip: Floating VIP for control-plane HA.
- kured: Automated node reboots for security updates.
All ingress traffic is routed through Traefik, which leverages its middleware system to enforce security and operational policies. Middleware components are attached to IngressRoute and IngressRouteTCP resources to provide:
- Rate limiting to protect backend services from abuse
- Automatic HTTP to HTTPS redirection for secure access
- Custom header injection and manipulation for compliance and security
This approach ensures consistent, centralized traffic management across all applications and services deployed in the cluster.
- Clone the repository to all nodes.
- Run install.sh on the first control-plane node to bootstrap the cluster and deploy platform components.
- Join additional control-plane and worker nodes using the provided scripts in the scripts/ directory.
- Deploy the application stack with scripts/04-deploy-apps.sh after the platform is ready.
- Reset or tear down applications with scripts/05-reset-apps.sh as needed.
All scripts are idempotent and safe to re-run. Secrets are generated at deploy time and stored as Kubernetes Secrets (never committed to the repo).
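The idempotency claim rests on a few well-known patterns, sketched below. The function names are hypothetical and are not taken from lib-functions.sh; the chart reference and file paths in the usage comments are assumptions.

```shell
# Create a namespace only when it does not exist yet, so re-runs are no-ops.
ensure_namespace() {
  kubectl get namespace "$1" >/dev/null 2>&1 || kubectl create namespace "$1"
}

# Apply a Kustomize directory; kubectl apply is declarative, so repeats converge.
apply_kustomize() {
  kubectl apply -k "$1"
}

# Install the chart on the first run and upgrade it on later runs.
# Arguments: <release> <chart> <namespace> <values-file>
install_or_upgrade() {
  helm upgrade --install "$1" "$2" --namespace "$3" -f "$4"
}

# Example (needs cluster access; chart reference and paths are assumptions):
#   ensure_namespace gitea
#   apply_kustomize apps/traefik/base
#   install_or_upgrade gitea gitea-charts/gitea gitea apps/gitea/values.yaml
```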
Maintained by the Open ICT platform team — platform-ops@open-ict.hu