I Turned My Gaming Rig + DGX Spark Into One AI Cluster
Joining a gaming laptop with an RTX 4090 and an NVIDIA DGX Spark into a single Kubernetes cluster so they work as one pool of compute - and a from-the-ground-up Kubernetes lesson using that real cluster as the running example.
Watch (8:44)
Full transcript (from the video)
If you run AI locally, you know the feeling. You start with one graphics card, then you add a second machine, and soon you have a gaming rig over here and a dedicated AI box over there, both sitting idle half the time. I had exactly that: a gaming laptop with an RTX 4090 and an NVIDIA DGX Spark. So I joined them into a single Kubernetes cluster, and now they work as one pool of compute.
In this video, I'll teach you Kubernetes from the ground up using that real cluster as the running example. Here is the problem local AI runs into. The models are hungry. Text-to-speech wants a graphics card. Image generation, video rendering, batch inference: all of it competes for the same scarce GPUs, and your machines sit there as islands.
You secure shell into one to start a render, into another to run a model, and you babysit each job by hand. It does not scale, and the hardware sits idle half the time. Kubernetes solves exactly that. It takes a pile of separate machines and turns them into one pool of compute that you schedule work onto.
Before Kubernetes, start with one foundational idea: a container packs your program, say, a text-to-speech model, with everything it needs to run. Its libraries, its Python version, everything gets sealed into one image. Because that image is self-contained, it runs the same way on the gaming rig, on the DGX, or in the cloud. No extra setup on any machine.
Build the image, then any node in the cluster can run it. Containers are the cargo. Kubernetes, which we will meet next, is the shipping company that decides where each container runs. So, what is Kubernetes? It is an orchestrator.
The mental shift is this: you stop telling a specific machine what to do, and you start declaring what you want to exist. You say, "Run this model. It needs one graphics card." Kubernetes then decides which machine should do it, starts it there, watches it, and if it dies or a node drops off, it brings the work back somewhere healthy. You describe the desired state, and the cluster constantly works to make reality match.
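The "describe the desired state, let the cluster close the gap" idea can be sketched as a toy reconciliation step in Python. This is only an illustration of the pattern, not Kubernetes code: the function name, the workload names, and the replica counts are all made up for the example.

```python
# Toy reconciliation step: compare desired state with observed state
# and list the actions that would remove the drift.
# All names here are illustrative; none of this is a Kubernetes API.

def reconcile(desired: dict[str, int], observed: dict[str, int]) -> list[str]:
    """Return the actions needed to make observed counts match desired counts."""
    actions = []
    for name in sorted(desired.keys() | observed.keys()):
        want = desired.get(name, 0)
        have = observed.get(name, 0)
        if have < want:
            actions.append(f"start {want - have} x {name}")
        elif have > want:
            actions.append(f"stop {have - want} x {name}")
    return actions

# What we declared versus what is actually running (hypothetical workloads).
desired = {"tts-model": 1, "render-worker": 2}
observed = {"render-worker": 1, "old-experiment": 1}

actions = reconcile(desired, observed)
print(actions)
```

A real controller runs a step like this continuously, so a crashed container or a node that drops off simply shows up as new drift to fix.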
Think of Kubernetes as air traffic control for every container in your home lab. Every Kubernetes cluster splits into two kinds of machines. The control plane is the brain. It holds the desired state of the whole cluster and decides what runs where.
The worker nodes are the muscle. They actually run your containers. In my home lab, the DGX Spark plays the brain, the control plane, because it is always on and steady. The gaming rig, with its big graphics card, joined as a worker node.
And here is the nice part. The brain does not need to do the heavy lifting itself. It just directs traffic while the muscle does the work. Let's open up the brain, because this is where Kubernetes earns its reputation.
The control plane has a few key parts. The API server is the front door. Every command you give, and every other component, talks through it. etcd is the cluster's memory, a reliable database that stores the desired state of everything, including the queue of work waiting to run.
The scheduler watches for new work and decides which node should take it, and the controllers run a constant loop comparing what you ask for against what actually exists and fixing any drift. On my DGX, those parts run together and quietly keep the whole cluster honest. Down on each worker node runs a small but crucial agent called the kubelet. When my gaming rig joined the cluster, the kubelet introduced it to the control plane.
It reported the machine's processors, its memory, and, crucially, its graphics card. From then on, the kubelet is the local foreman. When the scheduler assigns a container to this node, the kubelet pulls the image and starts it. It watches the container's health and reports back to the API server.
The brain decides; the kubelet on each node carries it out. Here is a detail that trips up everyone doing local AI. Out of the box, Kubernetes counts processors and memory, but it has no idea your gaming rig has a graphics card. You have to teach it.
You run a small component called the NVIDIA device plugin on the GPU node. It detects the card and advertises it to the cluster as a schedulable resource with a name like nvidia.com/gpu. Once that's running, the magic works. A job can literally request one GPU as a resource, and the scheduler will only ever place it on a node that has advertised one.
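To make the request concrete, here is a minimal sketch in Python that builds a pod manifest as a plain dictionary and prints it as JSON, which kubectl accepts alongside YAML. The pod name and image reference are placeholders, not from the video; the only piece taken from the narration is the nvidia.com/gpu resource name.

```python
import json

# Hypothetical pod: the name and image are placeholders for illustration.
gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "tts-once"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [
            {
                "name": "tts",
                "image": "registry.example/tts:latest",
                # Ask for one GPU. The scheduler only considers nodes where
                # the device plugin has advertised this resource.
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ],
    },
}

print(json.dumps(gpu_pod, indent=2))
```

Saved to a file, the printed JSON could be submitted with kubectl apply -f, assuming the device plugin is already running on the GPU node.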
Now you want control over where things land, and Kubernetes gives you three tools. Labels are simple. I labeled the gaming rig as a GPU worker. Taints are the opposite of a magnet.
A taint on a node repels every pod unless that pod carries a matching toleration. So I taint the gaming rig so that ordinary pods stay off it, and I give only my AI jobs the toleration to land there. The result is exactly what I want. Heavy GPU work runs on the gaming rig, and the DGX control plane stays clear to keep directing.
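The taint and toleration rule can be sketched as a small Python check. This is a simplified model of the matching logic, not the scheduler's actual implementation: it treats every taint as blocking and ignores details such as the PreferNoSchedule effect. The taint key and value ("dedicated" and "gpu") are assumptions chosen for the example.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified check of whether one toleration matches one taint."""
    # An effect on the toleration must match the taint's effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # A key on the toleration must match the taint's key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    # "Exists" ignores the value; the default "Equal" compares values.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def can_land(pod_tolerations: list[dict], node_taints: list[dict]) -> bool:
    """A pod may land only if every taint on the node is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
    )


# Hypothetical taint on the gaming rig and toleration on the AI jobs.
rig_taints = [{"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}]
ai_job_tolerations = [
    {"key": "dedicated", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}
]

print(can_land([], rig_taints))                  # ordinary pod
print(can_land(ai_job_tolerations, rig_taints))  # AI job
```

An ordinary pod with no tolerations is repelled by the tainted node, while the AI job that carries the matching toleration is allowed to land.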
Now the work itself. The smallest thing Kubernetes runs is a pod, a wrapper around one container and the resources it asks for. Some pods run forever like a model server waiting for a request. But a lot of local AI work runs once and then finishes.
Render this animation, transcribe this audio. For that, Kubernetes has the Job. A Job spins up a pod, runs your task to completion, and records that it succeeded. That run-to-completion shape is perfect for bursty GPU jobs.
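As a rough illustration of that run-to-completion shape, the Python sketch below builds a Job manifest and prints it as JSON. The helper name gpu_job, the job name, the image reference, and the scene file are all hypothetical; the video does not show the actual manifests.

```python
import json


def gpu_job(name: str, image: str, command: list[str], gpus: int = 1) -> dict:
    """Build a run-once Job manifest that requests the given number of GPUs."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Do not retry automatically if the task fails.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    # Jobs require Never or OnFailure here.
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "command": command,
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


# Placeholder image and scene file, for illustration only.
job = gpu_job(
    "render-intro",
    "registry.example/blender:latest",
    ["blender", "-b", "scene.blend", "-a"],
)
print(json.dumps(job, indent=2))
```

Unlike a bare pod, the Job object stays around after the container exits, so the cluster keeps a record of whether the task completed.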
Run-to-completion fits exactly what a home lab actually runs. Here is a very local AI problem. You have one or two graphics cards, but 10 jobs you'd like to run. If you launch them all at once, they fight over the card and everything slows to a crawl or crashes. The answer is a queue.
I run Kueue, a queuing layer for Kubernetes. Instead of everything launching at once, each GPU job is admitted only when there is real capacity for it. The rest wait their turn. So my renders, my speech jobs, and my image generation line up politely and run one after another instead of thrashing a single overloaded graphics card.
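The admit-only-when-there-is-capacity idea can be illustrated with a toy first-in, first-out queue in Python. This is not the queuing layer's API or its real admission algorithm, just a sketch of the behavior described here; the job names and the simplification that admitted jobs run together in rounds are assumptions.

```python
from collections import deque


def run_queue(jobs: list[tuple[str, int]], total_gpus: int) -> list[list[str]]:
    """Admit jobs in FIFO order, round by round, never exceeding total_gpus.

    Each round models a set of jobs that run together until all finish.
    """
    pending = deque(jobs)
    rounds = []
    while pending:
        free = total_gpus
        admitted = []
        # Strict FIFO: stop at the first job that does not fit.
        while pending and pending[0][1] <= free:
            name, gpus = pending.popleft()
            free -= gpus
            admitted.append(name)
        if not admitted:
            raise ValueError(f"{pending[0][0]} needs more GPUs than the cluster has")
        rounds.append(admitted)
    return rounds


# Hypothetical jobs, each asking for one GPU.
jobs = [("render-intro", 1), ("tts-narration", 1), ("image-batch", 1)]
print(run_queue(jobs, total_gpus=1))
```

With a single GPU the three jobs run strictly one after another; with two GPUs the first two would share a round and the third would wait.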
This is the heart of Kubernetes, the part that makes the whole thing feel like magic. When a job is admitted, it carries its requirements. I need one graphics card and ideally a node labeled for rendering. The scheduler looks across every node, filters down to the ones that advertise a free GPU and match the labels, scores what's left, and places the job on the best fit in my cluster.
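The filter-then-score flow can be sketched as a toy scheduler in Python. It is a deliberately simplified model: the labels are treated as hard requirements rather than preferences, the scoring rule (tightest fit, then name) is invented for the example, and the node labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    free_gpus: int
    labels: dict[str, str] = field(default_factory=dict)


def schedule(gpus_needed: int, required_labels: dict[str, str],
             nodes: list[Node]) -> Optional[str]:
    """Pick a node name for the job, or None if no node qualifies."""
    # Filter: keep nodes with enough free GPUs and all required labels.
    feasible = [
        n for n in nodes
        if n.free_gpus >= gpus_needed
        and all(n.labels.get(k) == v for k, v in required_labels.items())
    ]
    if not feasible:
        return None
    # Score: toy rule that prefers the tightest fit, breaking ties by name.
    best = min(feasible, key=lambda n: (n.free_gpus - gpus_needed, n.name))
    return best.name


# Hypothetical view of the two-machine cluster.
nodes = [
    Node("dgx-spark", free_gpus=1, labels={"role": "control-plane"}),
    Node("gaming-rig", free_gpus=1, labels={"workload": "render"}),
]

print(schedule(1, {"workload": "render"}, nodes))
```

Because only the gaming rig carries the render label in this sketch, a render job that asks for that label can never be placed anywhere else.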
That means a Blender render always lands on the gaming rig with the 4090, never on a machine without the right hardware. I never hand pick the box. I describe the need and the scheduler does the matching. There's a practical hurdle in a home lab cluster.
Your machines might not even be on the same network, and pods on different nodes still need to reach each other. I solved the first half with Tailscale, a private mesh that gives every machine a stable address regardless of where it physically sits. So, the gaming rig and the DGX can always find each other. Then a cluster network layer, the container network interface, gives every pod a flat address space that spans both machines.
The upshot is that a container on the gaming rig can talk to one on the DGX as if they were neighbors. So, what does this cluster actually do day to day? It runs my local AI. The voice narrating this video was synthesized by a speech model running as a job on the DGX.
The animations you have been watching this whole time were rendered by Blender running as jobs on the gaming rig. Image generation, video synthesis, and batch inference all enter the cluster the same way. I submit a job, the queue admits it, and the scheduler routes it to whichever GPU node is the right fit.
Two very different machines, one cluster, each doing what it is best at. Here's the takeaway. You do not need a data center to run serious local AI. A gaming rig and one spare GPU box are enough to start.
Install Kubernetes, label your nodes, add the NVIDIA device plugin so the cluster sees your cards, and put a queue in front. From that point on, you stop secure shelling into machines and babysitting jobs, and you start simply declaring the work you want done. The cluster schedules it onto the right hardware, queues it when things are busy, and keeps it running as machines come and go. That is Kubernetes, and you just watched a home lab cluster build this