What is Thunder Compute?

Thunder Compute is a cloud GPU provider (often called a 'neo-cloud') that uses GPU-over-TCP virtualization to attach high-performance GPUs over the network to lightweight, low-cost CPU instances, and passes the resulting hardware savings on to developers.

Can you change GPUs on a running instance on Thunder Compute?

Yes. Thunder Compute supports 'hot-swapping', that is, changing an instance's hardware configuration with the `tnr modify` command. You can swap GPU types, scale CPU/RAM, or expand storage without losing your persistent workspace disk. In the walkthrough below, we stop the instance before modifying it.

What is the difference between Development and Production modes on Thunder Compute?

Development mode uses aggressive CUDA-level virtualization optimizations to provide low-cost, per-minute billing ideal for testing and prototyping. Production mode bypasses virtualization layers to offer bare-metal-like performance stability, higher uptime SLAs, and support for multi-GPU configurations up to 8 GPUs.

The $10 Ollama Experiment: Benchmarking Thunder Compute’s Neo-Cloud

The Challenge: Hosting Ollama Under $10

Running large models like Llama 3.3 (70B) or DeepSeek-R1-Distill-Llama-70B requires a large amount of VRAM. If you do not own a dual RTX 3090/4090 desktop or a high-end Mac Studio, you are forced into the cloud. However, major cloud providers (AWS, GCP, Azure) require complex VPC networking, setup fees, and expensive hourly commitments.

Thunder Compute represents a new class of "neo-cloud" providers targeting AI developers. By using a virtualized, network-attached GPU layer (GPU-over-TCP), they decouple the GPUs from the CPU nodes that host your instance. This lets them offer on-demand instances at a fraction of traditional hyperscaler costs.

We set a strict budget constraint: exactly $10.00, prepaid. Here is how that budget divides across their active GPU offerings:

GPU Model	VRAM	Hourly Rate	Max Runtime per $10	Optimal Ollama Target
NVIDIA RTX A6000	48 GB	$0.35/hr	28.57 hours	Llama-3.1-70B-Q4_K_M (43GB)
NVIDIA A100 80GB	80 GB	$0.78/hr	12.82 hours	DeepSeek-R1-Distill-70B-Q8_0 (75GB)
NVIDIA L40	48 GB	$0.89/hr	11.24 hours	Stable Diffusion XL / vLLM batches
NVIDIA L40S	48 GB	$0.99/hr	10.10 hours	Fine-tuning smaller 8B/14B models
NVIDIA H100 80GB	80 GB	$1.38/hr	7.25 hours	Llama-3.3-70B-Q8_0 (high throughput)
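The "Max Runtime per $10" column is simply the budget divided by the hourly rate. As a minimal sketch, assuming flat hourly billing with no storage or idle charges (a simplification on our part), the arithmetic looks like this. The function name `max_runtime_hours` is our own, not part of any Thunder Compute tooling.

```python
# Hourly rates copied from the pricing table above (USD per hour).
HOURLY_RATES = {
    "RTX A6000": 0.35,
    "A100 80GB": 0.78,
    "L40": 0.89,
    "L40S": 0.99,
    "H100 80GB": 1.38,
}


def max_runtime_hours(budget_usd: float, hourly_rate_usd: float) -> float:
    """Return how many hours a budget buys at a flat hourly rate."""
    if hourly_rate_usd <= 0:
        raise ValueError("hourly rate must be positive")
    return budget_usd / hourly_rate_usd


if __name__ == "__main__":
    # Print the runtime each GPU gets from a $10.00 budget.
    for gpu, rate in HOURLY_RATES.items():
        print(f"{gpu}: {max_runtime_hours(10.00, rate):.2f} hours")
```

Real invoices may differ if storage, stopped-instance time, or per-minute rounding is billed separately, so treat these figures as upper bounds.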

Step-by-Step Execution Workflow

The installation and configuration process on Thunder Compute is built to be developer-friendly. Below is the exact command sequence we executed to get our test environment live.

1. CLI Installation and Login

First, install the `tnr` CLI tool locally. This tool handles authentication, instance provisioning, and port-forwarding tunnels:

curl -fsSL https://raw.githubusercontent.com/Thunder-Compute/thunder-cli/main/scripts/install.sh | bash
tnr login

2. Provisioning with the Ollama Template

We started our run on a low-cost NVIDIA RTX A6000 (48GB). Instead of setting up CUDA and downloading Ollama manually, we used Thunder Compute's pre-built template, which handles the workspace preparation automatically:

tnr create --gpu "a6000" --template "ollama"

In our run, the node was provisioned in under 30 seconds. To verify the instance is running and grab its unique ID, check the status:

tnr status

3. Connecting and Launching the Service

Open an SSH session to the instance. Thunder Compute manages your SSH keys automatically if you've added them with `tnr ssh-keys add`:

tnr connect <instance_id>

Once inside the shell, launch the Ollama runtime and its web interface. The preconfigured template includes a helper command to start the background servers:

start-ollama

Ollama is now listening inside the instance on port `11434`.

4. Exposing Ollama to Your Local Machine

To route traffic from your local terminal or codebases directly to the cloud GPU instance, tunnel the Ollama port to your local environment. Run this command on your local machine:

tnr connect <instance_id> -t 11434

Now, you can query the remote GPU as if it were running on your local machine:

curl http://localhost:11434/api/tags
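The same check can be scripted. The sketch below assumes the tunnel from this step is open on `localhost:11434` and uses only the Python standard library against Ollama's `/api/tags` endpoint, which returns a JSON object with a `models` list. The helper names `parse_model_names` and `list_remote_models` are our own.

```python
import json
import urllib.request

# Local end of the tunnel opened with `tnr connect <instance_id> -t 11434`.
OLLAMA_URL = "http://localhost:11434"


def parse_model_names(tags_payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_remote_models(base_url: str = OLLAMA_URL, timeout: float = 10.0) -> list[str]:
    """Fetch /api/tags through the tunnel and return the installed model names."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.load(resp))
```

Calling `list_remote_models()` while the tunnel is open should return the models pulled on the remote instance; an empty list means Ollama is reachable but nothing has been pulled yet.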

The Secret Weapon: Hot-Swapping GPUs

Normally, scaling resources in the cloud requires stopping your VM, detaching volumes, creating a new VM, and re-attaching disks. Under Thunder Compute, you can reconfigure hardware without rebuilding the workspace.

Suppose you began your session benchmarking a 70B model quantized at Q4 (which fits inside the 48GB VRAM of the $0.35/hr RTX A6000). You now want to test the Q8_0 weights of the same 70B model, which take about 75GB and therefore require an 80GB GPU. (Fully unquantized FP16 weights for a 70B model come to roughly 140GB and will not fit on a single 80GB card.)

Instead of destroying your environment, you can **hot-swap the GPU** using the modify command:

# Stop the instance first to safely release the GPU allocation
tnr stop <instance_id>

# Modify the instance configuration to upgrade the GPU to an A100 80GB
tnr modify <instance_id> --gpu "a100"

# Spin the instance back up, then reconnect
tnr start <instance_id>
tnr connect <instance_id>

Your home folder, downloaded model files, and custom Modelfiles are preserved because the persistent disk storage is independent of the dynamic GPU assignment. This matters with large model weights: re-downloading a 50GB model from Hugging Face after every hardware change wastes both time and budget.
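Swapping to a pricier GPU mid-session changes how far the remaining credit stretches. Here is a rough sketch of that arithmetic, assuming flat hourly billing on both GPUs and ignoring any storage or stopped-instance charges (our simplification). The function name `remaining_hours` is hypothetical.

```python
def remaining_hours(budget_usd: float, spent_hours: float,
                    spent_rate_usd: float, next_rate_usd: float) -> float:
    """Hours left on the next GPU after spending some hours on the first one."""
    remaining_usd = budget_usd - spent_hours * spent_rate_usd
    if remaining_usd < 0:
        raise ValueError("the first phase already exceeds the budget")
    return remaining_usd / next_rate_usd


if __name__ == "__main__":
    # Example: 8 hours on the A6000 ($0.35/hr), then switch to the A100 ($0.78/hr).
    hours_left = remaining_hours(10.00, 8, 0.35, 0.78)
    print(f"A100 hours remaining: {hours_left:.2f}")
```

In this example, eight hours on the A6000 cost $2.80, which leaves $7.20 for the A100.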

Development Mode vs. Production Mode

When customizing your instance, you will encounter the option to run in either Development Mode or Production Mode. Selecting the correct mode is critical for maximizing your $10 budget:

Development Mode (Pay-per-minute Prototyping): This is optimized for R&D. It runs on a shared infrastructure layer that uses virtualization optimizations to slice GPU cores. You only pay for GPU compute when CUDA tasks are active. If your workspace sits idle while you read documentation, you aren't paying full GPU rates. This is the mode to use for the $10 budget challenge.
Production Mode (Dedicated SLAs): This allocates physical GPUs directly to your instance without shared virtualization. This eliminates network overhead and latency, making it ideal for low-latency production inference or intense multi-GPU training scripts (scaling up to 8 GPUs).

Model Fit & Quant Guide for 48GB vs 80GB

If you are running Ollama on these instances, here is how to size your models to get the best performance and avoid Out-Of-Memory (OOM) crashes:

Model Name	Size	Quantization	RAM/VRAM Required	Best Match GPU
DeepSeek-R1-Distill-Llama-70B	70B	Q4_K_M	~43 GB	RTX A6000 (48GB)
Llama 3.3	70B	Q4_K_M	~43 GB	RTX A6000 (48GB)
DeepSeek-R1-Distill-Llama-70B	70B	Q8_0	~75 GB	A100 80GB / H100 80GB
Llama 3.3	70B	Q8_0	~75 GB	A100 80GB / H100 80GB
Command R+	104B	Q4_K_M	~65 GB	A100 80GB
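To sanity-check a model that is not in the table, you can estimate the weight size from the parameter count and the average bits per weight of the quantization. The sketch below is an approximation, not a guarantee: the bits-per-weight figure for Q4_K_M is a rough average (Q8_0 works out to 8.5 bits per weight by construction), and `headroom_gb` is a hypothetical allowance for the KV cache and runtime buffers that you should tune to your context length.

```python
# Approximate average bits per weight for common GGUF quantizations.
# Q4_K_M varies slightly by model architecture; 4.85 is a rough estimate.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}


def estimate_weights_gb(params_billions: float, quant: str) -> float:
    """Estimate weight size in GB (10^9 bytes) for a parameter count in billions."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def fits(params_billions: float, quant: str, vram_gb: float,
         headroom_gb: float = 4.0) -> bool:
    """Check whether weights plus an assumed headroom fit in the given VRAM."""
    return estimate_weights_gb(params_billions, quant) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for quant in ("Q4_K_M", "Q8_0", "F16"):
        size = estimate_weights_gb(70, quant)
        print(f"70B {quant}: ~{size:.0f} GB, "
              f"48GB: {fits(70, quant, 48)}, 80GB: {fits(70, quant, 80)}")
```

The estimates land within a gigabyte or two of the table's figures for the 70B models. Long context windows can need considerably more headroom than the 4 GB default assumed here.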

The Verdict: Is It Worth It?

For a $10 prepaid budget, Thunder Compute offers exceptional utility. We managed to run a complete benchmark suite testing multiple quantized 70B models, upgrade our GPU on the fly to test Q8 quants, and still had credit left over.
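If you want to reproduce this kind of throughput comparison, Ollama's `/api/generate` endpoint reports `eval_count` (tokens generated) and `eval_duration` (in nanoseconds) in its final response. The following is a minimal sketch, not the script used for the runs above. It assumes the tunnel is open on `localhost:11434`, and the model tag in the usage note is only an example.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Compute generation speed from an Ollama /api/generate response."""
    # eval_duration is reported in nanoseconds.
    return response["eval_count"] * 1e9 / response["eval_duration"]


def generate(model: str, prompt: str,
             base_url: str = "http://localhost:11434",
             timeout: float = 600.0) -> dict:
    """Send one non-streaming generation request and return the parsed JSON."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `tokens_per_second(generate("llama3.3:70b", "Explain KV caching."))` gives one data point. Run the same prompt several times on each GPU before comparing, since the first request also pays the model load time.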

What We Liked:

Cost-effectiveness: At $0.35/hr for an A6000, leaving your development instance running for a full 24 hours costs about $8.40.
Workspace persistence: Swapping GPU models worked without issue in our run and saved hours of model download time.
Ollama Template: Pre-built environments eliminate the usual CUDA/Docker driver configuration headaches.

What to Keep in Mind:

Storage constraints: Disks can be expanded but cannot be shrunk. Be conservative when setting your initial storage size.
GPU-over-TCP Overhead: In Development mode, latency-sensitive applications may experience minor network overhead due to the virtualization layer. For low-latency production applications, opt for Production mode.