# Infrayard by Solvia Lab — Features

A complete overview of Infrayard by Solvia Lab as an OCI-native Internal Developer Platform (IDP) for governed OKE lifecycle management: self-service provisioning, BYON networking, access delivery, approvals, Activity history, and FinOps visibility. Detailed deployment and integration runbooks are available during evaluation/POC.

---

## Self-service cluster provisioning

- **One-click deploy** — engineers provision fully managed OKE clusters through a clean web portal, no CLI, Terraform, or OCI Console knowledge required
- **Live Terraform streaming** — `terraform init`, `plan`, and `apply` output streams in real time to the browser during deploy, scale, upgrade, and destroy operations
- **Template-based or custom** — choose from admin-defined cluster templates or configure everything manually
- **Automatic resource creation** — each cluster gets its own compartment, VCN, subnet, internet gateway, route table, security list, and node pools
- **VPN-first Kubernetes API access** — admins can enable public OKE API endpoints restricted to runner/VPN/corporate CIDRs. Infrayard creates a dedicated public API endpoint subnet and TCP/6443 allowlist rules, avoiding DRG cost and LPG scaling limits while keeping access unavailable from the open internet.
- **Bring your own infrastructure** — optionally supply existing VCN, compartment, or subnet OCIDs via the Advanced tab to skip resource creation and wire into existing networks. Supplied BYO resources are referenced read-only and never edited by Terraform; if only an existing VCN is supplied, Infrayard may still create cluster subnets/security lists inside it. Existing subnet overrides require an existing VCN override and are validated against OCI before Terraform starts. Infrayard-created VCN/subnet/security list resources are fully managed and manual OCI Console edits to them will be overwritten on the next apply. When using existing VCN/subnet overrides, route tables are not managed by Infrayard, so equivalent private-subnet egress must already exist for OKE workers: route `0.0.0.0/0` to NAT Gateway (or equivalent corporate egress path) and route `all-<region>-services-in-oracle-services-network` to Service Gateway. This is routing, not an open ingress security rule.
- **BYON scope behavior** - `Existing Compartment OCID` alone reuses only the compartment. Infrayard does not auto-discover existing VCN/subnet objects inside that compartment; leave VCN/subnet blank only when you want Infrayard to create a fresh network stack there.
- **Advanced override guardrails** - Advanced OCID fields validate resource-type prefixes and provide OCI-backed autocomplete for compartments, VCNs, and subnets to reduce typo/copy-paste errors before deploy.
- **Shared compartment pattern** — for multiple clusters in one compartment, use a dedicated BYO compartment OCID in Advanced for every cluster in that shared domain; avoid reusing an auto-created per-cluster compartment as a shared target.
- **Compartment subtree enforcement** — `Existing Compartment OCID` is validated against the install-time anchor compartment (set by the customer at install — can be the tenancy root, an existing org compartment, or a dedicated one) before Terraform runs. Descendants of the anchor at any depth are accepted; siblings, ancestors, or compartments in unrelated subtrees are rejected with a clear 400. This catches the misconfiguration upfront instead of letting it surface as a confusing mid-`terraform apply` IAM failure, since the runner's IAM `manage` policies are scoped to the anchor subtree. Customers who want a flat "anywhere goes" model anchor at the tenancy root; customers who want narrower blast radius pick a smaller anchor. Non-existent OCIDs return "compartment not found"; transient OCI API errors return 503 so the deploy can be retried.
- **CIDR pool management** — admins pre-populate a pool of /24 ranges; each cluster allocates one on deploy and releases it on destroy
- **Multi-pool support** — configure 1–N node pools per cluster, each with independent node count, OCPU, RAM, and storage sizing
- **Shape sync from OCI** - Admin Configuration includes **Sync from OCI** for VM shapes, pulling OKE-compatible shapes for the current region/tenancy and merging them into `allowed_shapes` while preserving existing labels/toggles
- **K8s version sync from OCI** - Admin Configuration can refresh available OKE versions for the configured region so deploy options and upgrade recommendation pills stay aligned with newly published patch versions
- **Automatic K8s version auto-sync** - the platform refreshes OCI's available OKE versions on a recurring interval (default daily) so newly-released patch versions appear in admin config without manual *Sync from OCI*. Newly-fetched versions land disabled by default; admins still curate enablement
- **Automatic upgrade-recommendation notifications** - a separate sweeper (default hourly) walks running clusters and, when an enabled K8s version newer than the cluster's current one appears, emits one durable Activity entry plus one email to the cluster owner per (cluster, target version) pair. Idempotent so steady-state runs never spam; auto-resets after the cluster is upgraded, ready to fire again on the next bump
- **Sync fallback behavior** - if OCI shape sync is unavailable (credentials/policy/network), existing `allowed_shapes` remain unchanged and admins can continue with manual shape curation from `oci ce node-pool-options get --node-pool-option-id all`
- **Shape-aware K8s picker** - deploy form Kubernetes options are filtered by the selected VM shape and region, and only include admin-enabled versions with OKE node-image compatibility for that shape. If compatibility lookup fails, the picker is fail-closed (no permissive fallback list), preventing invalid shape/version combinations before apply
- **Node image selection** - admins configure allowed OCI compute images; users select an image on the deploy form or leave it as auto-select (latest OKE-compatible image). Templates can lock a specific image. Recommended onboarding flow: sync shapes from OCI first, then curate images from `oci ce node-pool-options get --node-pool-option-id all`
- **Architecture-aware auto-select** — when no image is set, Terraform filters OKE image options by the requested shape's architecture: ARM shapes (`VM.Standard.A*`) resolve to `aarch64` images, GPU shapes to `Gen2-GPU` variants, and all others to plain x86_64 — preventing shape/image mismatches that would otherwise block node launches

---

## Cluster templates

- **Pre-configured profiles** — admins create templates that encode K8s version, VM shape, node image, pool layout, tier, TTL, and destroy protection
- **Deploy form pre-fill + lock** — selecting a template pre-fills and locks resource fields (pools, nodes, CPU, RAM, storage, pool names, and add/remove pool controls). Users can still set cluster name, CIDR, compartment, and advanced overrides. Select "Custom" to unlock all fields and configure manually with limit enforcement
- **Template values can exceed user limits** — templates represent admin-pre-approved configurations, so template-defined values are not clamped to the user's personal limits. The lock prevents users from editing these values
- **Requests workflow** — protected-cluster destroy approvals and per-user limit-increase requests share a dedicated admin Requests queue. Users submit a reason, admins approve/deny with an optional note, and outcomes are visible in Activity. When SMTP is configured, submit/approve/deny events also send info-only email pings to admins and users with requester, action, comments, and request parameters. Destroy approvals still require review of the Terraform destroy plan before force-destroy; limit approvals apply granted overrides to the user's account
- **Time-to-live (TTL)** — optional expiry in hours, enforced at deploy time; when TTL is reached, Infrayard automatically triggers destroy/cleanup. Also available on custom deploys without a template
- **Live cost preview** — add/edit modal shows estimated monthly and hourly cost that updates as you change pools, shape, or tier
- **Template shape/K8s guardrail** — template modal uses the same shape-aware compatibility filtering as deploy; save/update re-validates selected shape+K8s and blocks incompatible combinations
- **Role-based access** — restrict templates to users with a specific Keycloak realm role. This enables environment-tier gating across your organisation:

  | Template | Required role | Who sees it |
  |---|---|---|
  | DEV — Small | *(none)* | All users |
  | TEST — Medium | `testing` | QA engineers and testers |
  | UAT — Large | `uat` | Release managers and senior engineers |
  | PROD — HA | `production` | Production team only |

  Create the roles in Keycloak under **Realm roles**, assign them to the relevant users, and set the `required_role` field on each template. Templates without a required role are visible to everyone. Users only see templates they have access to on the deploy form — no error messages, the restricted templates simply don't appear.
- **Sort order** — controls card display position on the deploy form
- **Enable/disable** — disabled templates disappear from the deploy form but remain referenced by existing clusters
- **Permanent delete** — removes a template from the admin panel entirely

---

## Cost visibility (FinOps)

Infrayard provides live cost estimation across the entire platform using OCI Pay-As-You-Go rates as defaults, with optional admin overrides for custom contracts.

| Surface | What's shown |
|---|---|
| Deploy form — Deployment Summary | Estimated monthly and hourly cost, updates live as you configure pools |
| Deploy plan confirm modal | Estimated monthly cost in the resource plan |
| eashboard cluster cards | Estimated monthly cost per cluster |
| Cluster detail page | Monthly cost + full breakdown (per-pool cost, control plane cost, total with hourly rate) |
| Admin — All Clusters table | Monthly + hourly cost per cluster |
| Admin — Stats bar | Lifecycle status cards (online, provisioning, upgrading, destroying, failed, total) plus CIDRs used and monthly spend |
| Admin — Cluster Templates table | Monthly + hourly cost per template |
| Admin — Template add/edit modal | Live cost preview that updates on pool/shape/tier changes |

**Pricing model:**
- Compute: `nodes x (OCPU x $0.025/hr + RAM GB x $0.0015/hr)`
- Storage: `nodes x storage GB x $0.0255/mo`
- Enhanced control plane: `$0.10/hr` (Basic is free)
- Shape-specific rate overrides supported for custom OCI contracts
- Both server-side (Python) and client-side (JS) implementations produce identical results

---

## Scaling

- **Full resource scaling** — adjust nodes, OCPU, RAM, and storage per pool from the portal
- **Scale to zero** — node counts accept `0` on deploy and scale, letting users park a cluster configuration without running compute. Useful for pausing charges on idle Basic clusters where the control plane is free
- **Pool add/remove** — add new node pools or remove existing ones directly from the scale modal, no redeployment needed
- **Optional new-pool naming** — new pools added during scale default their name to the cluster's bare name and accept any value matching the standard pool-name rules (lowercase letters/digits/hyphens, 32 chars max). Existing pool names remain immutable so Terraform/OKE state stays aligned. The form rejects collisions with sibling pools upfront
- **Per-pool control** — scale each pool independently
- **Change preview** — review all changes before applying (current vs. new values, new pools highlighted, removed pools shown)
- **Separate K8s upgrade flow** — Kubernetes version upgrades use a dedicated Upgrade action/modal (not the scale flow)
- **Enhanced tier** — full in-place scaling via OKE API, no manual node cycling
- **Basic tier** — adding/removing nodes or node pools is automatic; changing shape/OCPU/RAM/storage requires rolling node cycling (`N -> 2N -> N` after new nodes are `Ready/Active`)
- **Both-tier admin policy** — admins can allow Basic, Enhanced, or both tiers platform-wide. When both are allowed the deploy form shows a per-cluster tier picker; when only one is allowed the choice is implicit. Per-user overrides remain force-only (pin a user to Basic or Enhanced regardless of the global policy)

## Kubernetes upgrades

- **Dedicated action** — available from cluster detail and dashboard actions
- **Shape-aware + allowlist-aware** — upgrade options are filtered to OCI-compatible versions enabled by admin config
- **Sequential minors only** — direct upgrades allow same/next minor only; skip-minor upgrades are blocked with explicit error
- **Control plane + node-pool target update** — upgrade action updates cluster Kubernetes and node-pool target version/image
- **Basic upgrade behavior** — perform rolling worker refresh per pool (`N -> 2N -> N`) after new nodes are `Ready/Active`
- **Enhanced upgrade behavior** — fully automated rollout (control plane + workers), no manual cycling required
- **Tier guidance in UI** — Basic and Enhanced show upgrade-specific notes with rollout guidance
- **Dynamic Basic rollout guidance** — Basic upgrade modal reads live pool/node counts and renders exact per-pool steps (for example `1 -> 2 -> 1` or `3 -> 6 -> 3`) instead of a static example

---

## Cluster lifecycle

- **Status tracking** — real-time status across all views: provisioning, scaling, upgrading, destroying, running, error, destroyed
- **TTL visibility** — dashboard cards show color-coded countdown badges (green >24h, orange <24h, red <4h) for clusters with TTL. Detail page shows full expiry timestamp and remaining time
- **Destroy protection + approval queue** — protected clusters show a red "Protected" badge on dashboard cards, the admin All Clusters table, and the detail page. Non-admin users clicking "Destroy" open a "Request destroy" modal (optional reason) which creates a pending approval ticket. The admin nav shows a live-count "Requests (N)" badge, refreshed every 5s. On the admin Requests page, admins approve, review the destroy plan, then confirm force-destroy, or deny with a note — the user's cluster card then displays a "Destroy pending" (amber) or "Destroy denied" (red, note in tooltip) pill. At most one pending request per cluster. Every submit/approve/deny is audit-logged and can emit info-only request emails when SMTP is enabled. Admins can still force-destroy directly via `?force=true`.
- **Activity inbox** — user nav includes a persistent Activity dropdown with unread counts, last events, and mark-read controls. Destroy-request approve/deny events, limit request submit/review events, TTL warnings, deploy/scale/upgrade/destroy lifecycle events, and admin-driven account limit changes emit inbox rows.
- **Destroy with cleanup** — `terraform destroy` removes cluster-scoped OCI resources and returns CIDR to pool; Infrayard-managed child compartments may be deleted, external compartment overrides are retained, and only the cluster `.tfstate` object is deleted while the user prefix remains
- **Error recovery** — failed deployments show troubleshooting tips and a "Clean up" button to remove partial resources
- **Kubeconfig download** — universal kubeconfig with embedded per-user ServiceAccount token; no OCI CLI or local OCI config required. Infrayard and the user must still reach the OKE API endpoint; the supported no-DRG/no-LPG path is the restricted public API endpoint allowlist. Explicit OCI-exec fallback remains available via `/kubeconfig-oci`.
- **SSH key download** — Terraform-generated private key available on the detail page

---

## Identity and access

- **Any OIDC provider** — works with Keycloak, Azure AD, Okta, Google Workspace, or any OIDC-compliant IdP
- **No user directory** — Infrayard auto-provisions users on first login from the JWT `sub` claim
- **PKCE authentication** — Authorization Code + PKCE flow; no client secrets stored in the frontend
- **Role-based access** — `admin` role (from IdP) grants access to the admin panel; custom realm roles can restrict cluster template visibility (e.g. `production`, `staging`); all other users are regular users
- **Session management** — automatic token refresh, silent re-auth, secure logout via IdP end-session endpoint
- **Cached OIDC discovery** — well-known config cached in sessionStorage (1-hour TTL) and at the nginx proxy layer, eliminating network round-trips on page load, sign-in, and sign-out

---

## Resource limits

- **Two-tier limit system** — global platform defaults + per-user overrides
- **Granular control** — limits on clusters per user, pools per cluster, nodes per pool, OCPU, RAM, storage, and cluster tier
- **OCI minimums enforced** — storage per node has a hard floor of 50 GB (OCI compute boot volume minimum), enforced in admin config, per-user overrides, and deploy validation
- **Per-user overrides** — admins can raise or lower any limit for individual users without affecting others
- **Visual limit feedback** — stepper inputs gray out when values reach the configured maximum, giving a clear visual cue that the limit has been reached
- **Live enforcement** — deploy form constraints update on page load; server-side validation on every deploy request
- **Override visibility** — admin Users table shows which users have overrides and their effective resolved limits

---

## Admin panel

Six dedicated admin pages accessible to users with the `admin` role:

### All Clusters
- Every cluster across all users with status, owner, CIDR, K8s version, tier, resources, cost, and age
- Stats bar: online, provisioning (includes scaling), upgrading k8s, destroying, failed, total, CIDRs used, monthly spend
- Actions: details, scale, upgrade k8s, destroy, and state-based view logs; plus new cluster (bypasses user quotas)

### Users & Limits
- All users with cluster count, limit, and per-user override badges
- Edit Limits modal: set per-user overrides for any combination of limits and tier
- Reset to global defaults with one click; direct edits and resets notify the affected user in Activity

### Configuration
- Platform-wide settings: region, compartment, cluster tier, state bucket, namespace
- CIDR pool: add/remove /24 ranges with allocation status
- VM shapes: sync from OCI, then optionally add custom shapes with display labels
- K8s versions: sync from OCI and manage available versions (deploy form shows only versions compatible with the currently selected shape/region; upgrade recommendations use the latest enabled version)
- Node images: add OCI compute images with display labels, enable/disable, auto-select fallback when no images configured
- Global resource limits: cluster limit, pool max, node max, OCPU, RAM, storage
- All changes take effect immediately — no restart or redeployment needed

### Cluster Templates
- Template table with name, pools, shape, image, K8s version, cost, TTL, protection status, required role, active state
- Add/edit modal with live cost preview
- Enable/disable toggle and permanent delete

### Requests
- Approval queue for user-submitted destroy requests on protection-enabled clusters
- Live count badge in the admin nav (`Requests (N)`) refreshed every 5 seconds; hidden when zero pending
- Filters: Pending / Approved / Denied / All
- Columns: requested-at (relative time), cluster, user, reason, status, reviewer, actions
- **Review** action opens a modal: optional admin note, then **Approve** opens the destroy plan for a second confirmation, or **Deny** (note surfaces on user's cluster card as "Destroy denied")
- Limit-request review lets admins grant requested or adjusted values for cluster count, pools, nodes, OCPU, RAM, and storage; approve/deny results notify the user in Activity
- Optional request emails notify admins on submit and users on submit/approve/deny; email content includes requester, action, request id, comments, cluster details for destroy requests, and requested/granted values for limit requests
- **Admin-action emails to cluster owner** — when an admin scales, upgrades, or force-destroys a cluster owned by another user, the owner is emailed at every lifecycle state (started, succeeded, failed) with the operation, the admin who initiated it, cluster details, and a link back to Infrayard. Failure emails explicitly note that admins with OCI Console access can intervene/clean up directly. Owner-on-own actions stay silent
- Row-locked approval prevents double-approval when two admins click simultaneously
- Every submit / approve / deny is audit-logged
- Bypass path: admins can still force-destroy directly via `?force=true` for incident response

### Audit Log
- Append-only record of every deploy, scale, upgrade, and destroy operation, plus `destroy-request:*` and `limit-request:*` events
- Columns: timestamp, user, operation, cluster name, status, duration
- Filterable by user, operation type, and status

---

## Architecture

- **No build step** — vanilla HTML/CSS/JS frontend, served from any static host
- **FastAPI backend** — async Python, SSE streaming, JWT validation via JWKS
- **PostgreSQL** — clusters, jobs, users, config, templates, audit log
- **Terraform execution** — per-job runner with isolated state in OCI Object Storage
- **Helm deployment** — single chart with configurable values for any k3s/K8s environment
- **Single-node ready** — runs on a single OCI ARM VM (Always Free tier compatible)

---

## Testing & CI

- **160 automated tests** — business logic, validation rules, API contracts, access control, lifecycle notifications, request emails, admin-action owner emails, K8s upgrade recommendations + auto-sync, and cost estimation
- **Zero external dependencies** — in-memory SQLite with mocked auth; no database server, IdP, or OCI access needed to run tests
- **Coverage areas** — user provisioning, limit resolution, admin config CRUD, cluster templates, cost engine (basic/enhanced tiers, multi-pool, custom pricing)
- **Reference CI pipeline (maintainer-owned)** — GitHub Actions and GitLab CI run validation and publish images for maintainer release flow; customer operators consume published image tags/digests

---

## Deployment options

Two first-class deployment paths, each tuned to its target environment. Both use the same Helm chart and the same upstream container images — the difference is the values file, which configures the right ingress controller, storage class, and scheduling for each platform.

| | **Existing OKE cluster** | **Single-node k3s on OCI VM** |
|---|---|---|
| **Best for** | Production, enterprise OKE users | Dev/test, demos, Always Free tier |
| **Values file** | `values-oke.yaml` | `values-k3s.yaml` |
| **Ingress** | ingress-nginx with OCI flexible Load Balancer (installed separately) | Traefik (bundled with k3s, `className: traefik`, no extra install) |
| **Storage class** | `oci-bv` (OCI Block Volume) | `local-path` (k3s default) |
| **PostgreSQL volume floor** | 50 GB (OCI BV minimum) | 20 GB (local disk) |
| **Image pull policy** | `IfNotPresent` | `Always` |
| **Scheduling** | No tolerations (dedicated workers) | Control-plane tolerations for single-node |
| **Setup time** | ~30 min (cluster exists) | ~15 min |
| **End-to-end guide** | Available during evaluation/POC | Available during evaluation/POC |
| **Validation playbook** | Available during evaluation/POC | Available during evaluation/POC |

- **Any Kubernetes** — the Helm chart also works on any K8s cluster that has an ingress controller and a default StorageClass; use `values.yaml` as a starting point and override for your environment
- **Image delivery with any container registry** — release images are published to GHCR as the canonical registry and mirrored 1:1 to OCIR and GitLab Container Registry from the single release pipeline. Customers can also mirror into Harbor, Docker Hub, or another approved registry. Helm chart supports `imagePullSecrets` for private registries.
- **OCI Marketplace (planned)** — one-click "Launch Stack" deployment from OCI Console is planned for a future release
- **In-cluster access agent (planned)** — outbound agent tunnel for private-by-default environments that do not allow public OKE API endpoints, even CIDR-restricted ones. This removes DRG/LPG pressure after bootstrap and gives Rancher-like kubeconfig behavior.

---

## Marketplace features (planned)

> These features are prepared for a future OCI Marketplace listing. The same deployment capabilities are available today via the Helm chart.

- **Guided deployment form** — OCI Resource Manager schema with dynamic dropdowns for compartment, VCN, subnet, shapes, images, and existing clusters
- **Conditional OKE creation** — create a new OKE cluster with VCN, subnets, and security lists, or deploy into an existing cluster
- **Bring your own identity** — deploy bundled Keycloak or connect to an existing OIDC provider (Keycloak, Azure AD, Okta, Google Workspace)
- **OCI credential validation** — schema validates API key fingerprint format, requires private key PEM, and auto-injects tenancy/user OCID from Resource Manager context
- **Auto-generated passwords** — database and Keycloak passwords auto-generated when not provided, with retrieval instructions in stack outputs
- **Ingress-NGINX with OCI LB** — automatically deploys ingress controller with OCI flexible load balancer annotations
- **Post-deployment guide** — stack outputs include step-by-step instructions for eNS setup, Keycloak config, and first login

---

Built by [Solvia Lab s.r.o.](https://solvialab.tech)
