🤖 AI & MACHINE LEARNING
AI at the Edge
Run large language models on your own hardware. NVIDIA DGX Spark with Blackwell GPU, AIBrix inference platform, and vLLM serving — fully integrated into the Kubernetes cluster.
NVIDIA DGX SPARK SPECIFICATIONS
Superchip
Grace Blackwell GB10
AI Performance
1 PFLOP FP4 · 1000 TOPS
CPU Architecture
ARM64 Grace CPU
Unified Memory
128GB LPDDR5x · 273 GB/s
Network
ConnectX-7 · 100/200 GbE
CUDA
13.0 · DGX OS (Ubuntu 24.04)
Inference Architecture
Dual-gateway architecture with Cilium handling TLS termination and Envoy Gateway routing to vLLM pods on the DGX Spark node.
```mermaid
graph TD
    CLIENT["Client / App"]:::client
    DNS["CoreDNS llm.apps.edgeprime.io"]:::dns
    CGW["Cilium Gateway 192.168.0.200 TLS termination"]:::cgw
    EGW["Envoy Gateway 192.168.0.201 Model routing"]:::egw
    VLLM["vLLM Pod DGX Spark (gx10) Blackwell GPU"]:::vllm
    HF["HuggingFace Hub Model weights"]:::hf

    CLIENT --> DNS --> CGW --> EGW --> VLLM
    HF -.->|"download"| VLLM

    classDef client fill:#1e293b,stroke:#e2e8f0,color:#e2e8f0,stroke-width:2px
    classDef dns fill:#0e3a3a,stroke:#06b6d4,color:#67e8f9,stroke-width:2px
    classDef cgw fill:#14332a,stroke:#4ade80,color:#86efac,stroke-width:2px
    classDef egw fill:#2e2a0e,stroke:#facc15,color:#fde68a,stroke-width:2px
    classDef vllm fill:#1a2e1a,stroke:#10b981,color:#6ee7b7,stroke-width:2px
    classDef hf fill:#2e1a47,stroke:#a78bfa,color:#c4b5fd,stroke-width:2px
```
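The final routing hop shown in the diagram (Envoy Gateway to the vLLM pod) is typically expressed as a Gateway API `HTTPRoute`. A minimal sketch as a Python dict, where the Gateway name, namespace, Service name, and port are illustrative assumptions rather than the platform's actual manifests:

```python
# Illustrative Gateway API HTTPRoute: attach to an assumed "envoy-gateway"
# Gateway and route llm.apps.edgeprime.io traffic to a hypothetical vLLM
# Service on port 8000.
httproute = {
    "apiVersion": "gateway.networking.k8s.io/v1",
    "kind": "HTTPRoute",
    "metadata": {"name": "vllm-route", "namespace": "aibrix-system"},  # hypothetical
    "spec": {
        "parentRefs": [{"name": "envoy-gateway"}],  # hypothetical Gateway name
        "hostnames": ["llm.apps.edgeprime.io"],
        "rules": [
            {"backendRefs": [{"name": "vllm", "port": 8000}]}  # hypothetical Service
        ],
    },
}
```

With TLS already terminated at the Cilium Gateway in front, this route only has to match the hostname and forward plain HTTP to the inference backend.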
SUPPORTED MODELS
Serve any HuggingFace model that fits in 128GB unified memory. One GPU, one model at a time.
Qwen 2.5 (1.5B / 7B / 32B)
Alibaba Cloud
Llama 3.1 (8B)
Meta AI
Mistral (7B)
Mistral AI
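Whether a model "fits in 128GB" can be sanity-checked from its parameter count alone: weight memory is roughly parameters times bytes per parameter, with headroom left over for the KV cache. A rough rule-of-thumb sketch (the function and the specific dtypes are illustrative, not part of the platform):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough weight footprint: parameter count x bytes per parameter, in GB."""
    return params_billion * bytes_per_param

# Qwen 2.5 32B: BF16 uses 2 bytes/param, FP8 uses 1 byte/param.
bf16 = weight_memory_gb(32, 2.0)  # 64.0 GB of weights — fits in 128GB unified memory
fp8 = weight_memory_gb(32, 1.0)   # 32.0 GB, leaving more headroom for KV cache
```

This ignores activation and KV-cache memory, which grow with context length and concurrency, so treat it as a lower bound rather than a capacity guarantee.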
Platform Integration
The AI stack is not a sidecar — it's fully woven into the Kubernetes platform.
GitOps
3-wave ArgoCD sync: CRDs → Controllers → Workloads. Fully declarative.
Secrets
HuggingFace token from Vault via ESO. Zero credentials in Git.
Monitoring
DCGM exporter + ServiceMonitor. GPU metrics in Grafana dashboards.
Networking
Cilium Gateway → Envoy Gateway → vLLM. Dual-gateway with TLS.
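The 3-wave ordering (CRDs → Controllers → Workloads) is typically expressed with ArgoCD's `argocd.argoproj.io/sync-wave` annotation, which syncs lower-numbered waves first. A minimal sketch, where the resource names are illustrative stand-ins for the platform's actual GitOps manifests:

```python
# Hypothetical metadata fragments showing the 3-wave ordering.
# ArgoCD applies resources in ascending sync-wave order.
WAVES = {
    "aibrix-crds": {"annotations": {"argocd.argoproj.io/sync-wave": "0"}},
    "aibrix-controllers": {"annotations": {"argocd.argoproj.io/sync-wave": "1"}},
    "vllm-workloads": {"annotations": {"argocd.argoproj.io/sync-wave": "2"}},
}

def sync_order(waves: dict) -> list:
    """Sort resources the way ArgoCD would: ascending sync-wave."""
    key = "argocd.argoproj.io/sync-wave"
    return sorted(waves, key=lambda name: int(waves[name]["annotations"][key]))
```

Ordering this way guarantees the AIBrix CRDs exist before the controllers that watch them, and the controllers are running before any model-serving workloads are applied.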
API Endpoint
OpenAI-compatible API available at:
https://llm.apps.edgeprime.io/v1/chat/completions
All Components
NVIDIA DGX Spark
production · Desktop AI supercomputer powered by Grace Blackwell GB10 Superchip — 1 PFLOP FP4, 128GB unified LPDDR5x memory, ARM64 architecture.
Role: Dedicated GPU worker node (gx10) with Blackwell GPU, CUDA 13.0, and ConnectX-7 networking
AIBrix
production · Open-source Kubernetes-native AI inference platform with prefix-cache-aware routing, LLM-specific autoscaling, and distributed KV cache.
Role: LLM model serving control plane — 3-wave ArgoCD deployment with Envoy Gateway routing
vLLM
production · High-throughput LLM inference engine with PagedAttention, continuous batching, and an OpenAI-compatible API.
Role: Inference runtime serving Qwen, Llama, and Mistral models via NVIDIA NGC images on ARM64
NVIDIA GPU Operator
production · Kubernetes operator automating GPU driver, container toolkit, device plugin, and DCGM exporter lifecycle.
Role: GPU resource management with driver-less mode for DGX OS — exposes nvidia.com/gpu to scheduler
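Because the endpoint above is OpenAI-compatible, any OpenAI client can talk to it with no special SDK. A minimal stdlib sketch that builds a chat-completions request; the model name is an illustrative assumption (use whichever model vLLM is currently serving), and actually sending requires network access to the cluster:

```python
import json
import urllib.request

ENDPOINT = "https://llm.apps.edgeprime.io/v1/chat/completions"

def build_request(model: str, prompt: str):
    """Build an OpenAI-compatible chat-completions request (URL + JSON body)."""
    body = {
        "model": model,  # illustrative — must match the model vLLM is serving
        "messages": [{"role": "user", "content": prompt}],
    }
    return ENDPOINT, json.dumps(body).encode()

url, payload = build_request("qwen2.5-7b", "Hello from the edge!")

# Sending (requires reachability to the cluster gateway):
# req = urllib.request.Request(
#     url, data=payload, headers={"Content-Type": "application/json"})
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```

The same request works with the official OpenAI client libraries by pointing their base URL at `https://llm.apps.edgeprime.io/v1`.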