🤖 AI & MACHINE LEARNING

L'IA en périphérie

Exécutez des grands modèles de langage sur votre propre matériel. NVIDIA DGX Spark avec GPU Blackwell, plateforme d'inférence AIBrix et service vLLM — entièrement intégré au cluster Kubernetes.

SPÉCIFICATIONS NVIDIA DGX SPARK

Superpuce

Grace Blackwell GB10

Performance IA

1 PFLOP FP4 · 1000 TOPS

Architecture CPU

CPU ARM64 Grace

Mémoire unifiée

128 Go LPDDR5x · 273 Go/s

Réseau

ConnectX-7 · 100/200 GbE

CUDA

13.0 · DGX OS (Ubuntu 24.04)

Architecture d'inférence

Architecture à double passerelle avec Cilium pour la terminaison TLS et Envoy Gateway pour le routage vers les pods vLLM sur le nœud DGX Spark.

graph TD
  CLIENT["Client / App"]:::client
  DNS["CoreDNS
llm.exitthecloud.eu"]:::dns
  CGW["Cilium Gateway
192.168.0.200
TLS termination"]:::cgw
  EGW["Envoy Gateway
192.168.0.201
Model routing"]:::egw
  VLLM["vLLM Pod
DGX Spark (gx10)
Blackwell GPU"]:::vllm
  HF["HuggingFace Hub
Model weights"]:::hf

  CLIENT --> DNS --> CGW --> EGW --> VLLM
  HF -.->|"download"| VLLM

  classDef client fill:#1e293b,stroke:#e2e8f0,color:#e2e8f0,stroke-width:2px
  classDef dns fill:#0e3a3a,stroke:#06b6d4,color:#67e8f9,stroke-width:2px
  classDef cgw fill:#14332a,stroke:#4ade80,color:#86efac,stroke-width:2px
  classDef egw fill:#2e2a0e,stroke:#facc15,color:#fde68a,stroke-width:2px
  classDef vllm fill:#1a2e1a,stroke:#10b981,color:#6ee7b7,stroke-width:2px
  classDef hf fill:#2e1a47,stroke:#a78bfa,color:#c4b5fd,stroke-width:2px

MODÈLES SUPPORTÉS

Servez n'importe quel modèle HuggingFace qui tient dans 128 Go de mémoire unifiée. Un GPU, un modèle à la fois.

🟢

Qwen 2.5 (1.5B / 7B / 32B)

Alibaba Cloud

🟣

Llama 3.1 (8B)

Meta AI

🔵

Mistral (7B)

Mistral AI

Intégration à la plateforme

La stack IA n'est pas un sidecar — elle est entièrement tissée dans la plateforme Kubernetes.

🔄

GitOps

Sync ArgoCD en 3 vagues : CRDs → Contrôleurs → Workloads. Entièrement déclaratif.

🔐

Secrets

Token HuggingFace depuis Vault via ESO. Zéro identifiant dans Git.

📊

Monitoring

Exporteur DCGM + ServiceMonitor. Métriques GPU dans les dashboards Grafana.

🌐

Réseau

Cilium Gateway → Envoy Gateway → vLLM. Double passerelle avec TLS.

Point d'accès API

API compatible OpenAI disponible à :

https://llm.exitthecloud.eu/v1/chat/completions

Tous les composants

NVIDIA DGX Spark

production

Desktop AI supercomputer powered by Grace Blackwell GB10 Superchip — 1 PFLOP FP4, 128GB unified LPDDR5x memory, ARM64 architecture.

Rôle : Dedicated GPU worker node (gx10) with Blackwell GPU, CUDA 13.0, and ConnectX-7 networking

AIBrix

production

Open-source Kubernetes-native AI inference platform with prefix-cache-aware routing, LLM-specific autoscaling, and distributed KV cache.

Rôle : LLM model serving control plane — 3-wave ArgoCD deployment with Envoy Gateway routing

vLLM

production

High-throughput LLM inference engine with PagedAttention, continuous batching, and OpenAI-compatible API.

Rôle : Inference runtime serving Qwen, Llama, and Mistral models via NVIDIA NGC images on ARM64

NVIDIA GPU Operator

production

Kubernetes operator automating GPU driver, container toolkit, device plugin, and DCGM exporter lifecycle.

Rôle : GPU resource management with driver-less mode for DGX OS — exposes nvidia.com/gpu to scheduler

Hindsight

production

Temporal semantic memory system for AI agents — retain, recall, and reflect operations backed by pgvector similarity search.

Rôle : Agent memory layer with GPU-accelerated local embeddings and reranking, powered by minimax LLM

← RETOUR AU STACK COMPLET