🤖 AI & MACHINE LEARNING

AI at the Edge

Run large language models on your own hardware. NVIDIA DGX Spark with Blackwell GPU, AIBrix inference platform, and vLLM serving — fully integrated into the Kubernetes cluster.

NVIDIA DGX SPARK SPECIFICATIONS

Superchip

Grace Blackwell GB10

AI Performance

1 PFLOP FP4 · 1000 TOPS

CPU Architecture

ARM64 Grace CPU

Unified Memory

128GB LPDDR5x · 273 GB/s

Network

ConnectX-7 · 100/200 GbE

CUDA

13.0 · DGX OS (Ubuntu 24.04)

Inference Architecture

Dual-gateway architecture: the Cilium Gateway terminates TLS, then hands requests to Envoy Gateway, which routes them by model to vLLM pods on the DGX Spark node.

graph TD
  CLIENT["Client / App"]:::client
  DNS["CoreDNS<br/>llm.apps.edgeprime.io"]:::dns
  CGW["Cilium Gateway<br/>192.168.0.200<br/>TLS termination"]:::cgw
  EGW["Envoy Gateway<br/>192.168.0.201<br/>Model routing"]:::egw
  VLLM["vLLM Pod<br/>DGX Spark (gx10)<br/>Blackwell GPU"]:::vllm
  HF["HuggingFace Hub<br/>Model weights"]:::hf

  CLIENT --> DNS --> CGW --> EGW --> VLLM
  HF -.->|"download"| VLLM

  classDef client fill:#1e293b,stroke:#e2e8f0,color:#e2e8f0,stroke-width:2px
  classDef dns fill:#0e3a3a,stroke:#06b6d4,color:#67e8f9,stroke-width:2px
  classDef cgw fill:#14332a,stroke:#4ade80,color:#86efac,stroke-width:2px
  classDef egw fill:#2e2a0e,stroke:#facc15,color:#fde68a,stroke-width:2px
  classDef vllm fill:#1a2e1a,stroke:#10b981,color:#6ee7b7,stroke-width:2px
  classDef hf fill:#2e1a47,stroke:#a78bfa,color:#c4b5fd,stroke-width:2px
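In Gateway API terms, the Envoy hop in the diagram can be sketched as an HTTPRoute that matches the LLM hostname and forwards to the vLLM Service. This is a minimal sketch with assumed resource names (`llm-route`, `vllm`, namespace `aibrix`, port 8000) — the cluster's actual manifests aren't shown here:

```yaml
# Hypothetical HTTPRoute sketch — names, namespace, and port are assumptions.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route          # assumed name
  namespace: aibrix        # assumed namespace
spec:
  parentRefs:
    - name: envoy-gateway  # the Envoy Gateway at 192.168.0.201
  hostnames:
    - llm.apps.edgeprime.io
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1     # OpenAI-compatible API prefix
      backendRefs:
        - name: vllm       # assumed vLLM Service name
          port: 8000       # vLLM's default serving port
```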

SUPPORTED MODELS

Serve any HuggingFace model that fits in 128GB unified memory. One GPU, one model at a time.

🟢

Qwen 2.5 (1.5B / 7B / 32B)

Alibaba Cloud

🟣

Llama 3.1 (8B)

Meta AI

🔵

Mistral (7B)

Mistral AI
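As a rough sanity check on what "fits in 128GB" means: a dense model's weight footprint is approximately parameters × bytes-per-parameter, before KV cache and runtime overhead. A quick back-of-the-envelope sketch (2 bytes/param assumes FP16/BF16 weights; quantized formats shrink this further):

```python
def weights_gib(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GiB for a dense model."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# FP16/BF16 weights: 2 bytes per parameter.
for name, b in [("Qwen2.5-32B", 32), ("Llama-3.1-8B", 8), ("Mistral-7B", 7)]:
    print(f"{name}: ~{weights_gib(b, 2):.0f} GiB weights (before KV cache)")
```

Even the 32B model at FP16 (~60 GiB of weights) leaves ample unified memory for KV cache on a 128GB machine, which is why all three families above serve comfortably one at a time.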

Platform Integration

The AI stack is not a sidecar — it's fully woven into the Kubernetes platform.

🔄

GitOps

3-wave ArgoCD sync: CRDs → Controllers → Workloads. Fully declarative.

🔐

Secrets

HuggingFace token from Vault via ESO. Zero credentials in Git.

📊

Monitoring

DCGM exporter + ServiceMonitor. GPU metrics in Grafana dashboards.

🌐

Networking

Cilium Gateway → Envoy Gateway → vLLM. Dual-gateway with TLS.
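The 3-wave GitOps ordering (CRDs → Controllers → Workloads) is typically expressed with ArgoCD's sync-wave annotation. A fragment-level sketch — the grouping shown is illustrative of the pattern, not copied from the actual manifests:

```yaml
# Wave ordering via ArgoCD annotations (resource grouping is illustrative).
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"   # wave 0: CRDs
---
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "1"   # wave 1: controllers
---
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "2"   # wave 2: workloads (vLLM, routes)
```

ArgoCD applies lower waves first and waits for them to become healthy before proceeding, which is what makes the CRD-before-controller-before-workload ordering declarative rather than scripted.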

API Endpoint

OpenAI-compatible API available at:

https://llm.apps.edgeprime.io/v1/chat/completions
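Since the endpoint is OpenAI-compatible, any OpenAI-style client can talk to it. A minimal stdlib-only sketch — the model name is an assumption, and whether an auth header is required depends on the deployment:

```python
import json
import urllib.request

API_URL = "https://llm.apps.edgeprime.io/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def chat(model: str, prompt: str) -> str:
    """POST the payload and return the first choice's message text."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:  # requires access to the cluster
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Usage (model name is an assumption about what's currently served):
# chat("qwen2.5-7b", "Hello")
```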

All Components

NVIDIA DGX Spark

production

Desktop AI supercomputer powered by Grace Blackwell GB10 Superchip — 1 PFLOP FP4, 128GB unified LPDDR5x memory, ARM64 architecture.

Role: Dedicated GPU worker node (gx10) with Blackwell GPU, CUDA 13.0, and ConnectX-7 networking

AIBrix

production

Open-source Kubernetes-native AI inference platform with prefix-cache-aware routing, LLM-specific autoscaling, and distributed KV cache.

Role: LLM model serving control plane — 3-wave ArgoCD deployment with Envoy Gateway routing

vLLM

production

High-throughput LLM inference engine with PagedAttention, continuous batching, and OpenAI-compatible API.

Role: Inference runtime serving Qwen, Llama, and Mistral models via NVIDIA NGC images on ARM64

NVIDIA GPU Operator

production

Kubernetes operator automating GPU driver, container toolkit, device plugin, and DCGM exporter lifecycle.

Role: GPU resource management with driver-less mode for DGX OS — exposes nvidia.com/gpu to scheduler
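The `nvidia.com/gpu` resource the operator exposes is what workloads request in order to land on the gx10 node. A minimal pod sketch — the pod name and image tag are illustrative, not taken from the cluster:

```yaml
# Hypothetical GPU smoke-test pod — name and image tag are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04  # tag is an assumption
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # schedules onto the DGX Spark node
```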