The Challenge: Hosting Ollama Under $10
Running large models like Llama 3.3 (70B) or DeepSeek-R1-Distill-Llama-70B demands a lot of VRAM: roughly 43 GB at Q4_K_M and 75 GB at Q8_0. If you do not own a dual-RTX 3090/4090 desktop or a high-end Mac Studio, you are pushed into the cloud. However, the major cloud providers (AWS, GCP, Azure) typically involve complex VPC networking, significant setup overhead, and expensive hourly rates.
Thunder Compute represents a new class of "neo-cloud" providers targeting AI developers. By utilizing a virtualized network-attached GPU layer (GPU-over-TCP), they separate raw GPU cores from underlying CPU nodes. This lets them offer on-demand instances at a fraction of traditional hyperscaler costs.
We set a strict budget: exactly $10.00, prepaid. Here is how that budget divides across their active GPU offerings:
| GPU Model | VRAM | Hourly Rate | Max Runtime per $10 | Optimal Ollama Target |
|---|---|---|---|---|
| NVIDIA RTX A6000 | 48 GB | $0.35/hr | 28.57 hours | Llama-3.1-70B-Q4_K_M (43GB) |
| NVIDIA A100 80GB | 80 GB | $0.78/hr | 12.82 hours | DeepSeek-R1-Distill-70B-Q8_0 (75GB) |
| NVIDIA L40 | 48 GB | $0.89/hr | 11.23 hours | Stable Diffusion XL / vLLM batches |
| NVIDIA L40S | 48 GB | $0.99/hr | 10.10 hours | Fine-tuning smaller 8B/14B models |
| NVIDIA H100 80GB | 80 GB | $1.38/hr | 7.24 hours | Llama-3.3-70B-Q8_0 (high throughput) |
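The "Max Runtime per $10" column is simply the budget divided by the hourly rate, truncated to two decimals. As a minimal sketch, the arithmetic can be reproduced with a small shell helper; `max_hours` is a hypothetical name introduced here, and the rates are copied from the table above:

```shell
# max_hours BUDGET RATE -> hours of runtime the budget buys,
# truncated (not rounded) to two decimals so the result never overstates it.
max_hours() {
  awk -v budget="$1" -v rate="$2" \
    'BEGIN { printf "%.2f\n", int(budget / rate * 100) / 100 }'
}

# Print the runtime for each GPU in the table (rates in USD per hour).
while read -r gpu rate; do
  printf '%-6s %s hours\n' "$gpu" "$(max_hours 10 "$rate")"
done <<'EOF'
A6000 0.35
A100 0.78
L40 0.89
L40S 0.99
H100 1.38
EOF
```

Swapping the `10` for your own balance gives a quick estimate before you provision anything.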
Step-by-Step Execution Workflow
The installation and configuration process on Thunder Compute is built to be developer-friendly. Below is the exact command sequence we executed to get our test environment live.
1. CLI Installation and Login
First, install the `tnr` CLI tool locally. This tool handles authentication, instance provisioning, and port-forwarding tunnels:
curl -fsSL https://raw.githubusercontent.com/Thunder-Compute/thunder-cli/main/scripts/install.sh | bash
tnr login
2. Provisioning with the Ollama Template
We started our run on a low-cost NVIDIA RTX A6000 (48GB). Instead of setting up CUDA and downloading Ollama manually, we used Thunder Compute's pre-built template, which handles the workspace preparation automatically:
tnr create --gpu "a6000" --template "ollama"
This provisions the node in under 30 seconds. To verify the instance is running and grab its unique ID, check the status:
tnr status
3. Connecting and Launching the Service
Establish an SSH link to the container. Thunder Compute manages your SSH keys automatically if you've added them with `tnr ssh-keys add`:
tnr connect <instance_id>
Once inside the shell, launch the Ollama runtime and its web interface. The preconfigured template includes a helper command to start the background servers:
start-ollama
Ollama is now listening inside the instance on port `11434`.
4. Exposing Ollama to Your Local Machine
To route traffic from your local terminal or codebases directly to the cloud GPU instance, tunnel the Ollama port to your local environment. Run this command on your local machine:
tnr connect <instance_id> -t 11434
Now, you can query the remote GPU as if it were running on your local machine:
curl http://localhost:11434/api/tags
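Beyond listing models, the same tunnel can serve completions. The sketch below wraps Ollama's `/api/generate` endpoint in two small helpers. The helper names, the `OLLAMA_URL` variable, and the model tag in the example are assumptions made for illustration; use whichever tag `/api/tags` reports on your instance.

```shell
# Base URL of the tunnel opened with `tnr connect <instance_id> -t 11434`.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# generate_payload MODEL PROMPT -> JSON body for POST /api/generate.
# Simplification: MODEL and PROMPT must not contain double quotes or backslashes.
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$1" "$2"
}

# ask_ollama MODEL PROMPT -> prints the raw JSON reply from the remote GPU.
ask_ollama() {
  curl -s "$OLLAMA_URL/api/generate" \
    -H 'Content-Type: application/json' \
    -d "$(generate_payload "$1" "$2")"
}

# Example call (hypothetical model tag, assumed to be pulled already):
# ask_ollama "llama3.3:70b" "Say hello in five words."
```

Setting `"stream": false` returns one JSON object instead of a stream of chunks, which is easier to handle in scripts. For prompts with quotes or newlines, a JSON-aware tool is a safer way to build the body.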
The Secret Weapon: Hot-Swapping GPUs
Normally, scaling resources in the cloud requires stopping your VM, detaching volumes, creating a new VM, and re-attaching disks. Under Thunder Compute, you can reconfigure hardware without rebuilding the workspace.
Suppose you began your session benchmarking a 70B model quantized at Q4 (which fits comfortably inside the 48GB VRAM of the $0.35/hr RTX A6000). You now want to test the Q8_0 weights of a 70B model, which need roughly 75 GB and therefore an 80GB GPU. (Fully unquantized FP16 weights, at about 140 GB for 70B parameters, would not fit on a single 80GB card.)
Instead of destroying your environment, you can **hot-swap the GPU** using the modify command:
# Stop the instance first to safely release the GPU allocation
tnr stop <instance_id>
# Modify the instance configuration to upgrade the GPU to an A100 80GB
tnr modify <instance_id> --gpu "a100"
# Spin the instance back up
tnr connect <instance_id>
Your home folder, downloaded model files, and custom Modelfiles are preserved because the persistent disk storage is independent of the dynamic GPU assignment. This is an immense benefit when dealing with huge model weights; downloading a 50GB model from Hugging Face every time you change hardware is a massive waste of time and budget.
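If you switch GPUs often, the stop, modify, and connect sequence can be wrapped in one function. This is only a sketch: `swap_gpu`, `run`, and the `DRY_RUN` switch are hypothetical names added here, and the function uses nothing beyond the three `tnr` commands already shown.

```shell
# run CMD... -> execute the command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# swap_gpu INSTANCE_ID GPU -> stop the instance, change its GPU, reconnect.
# Each step runs only if the previous one succeeded.
swap_gpu() {
  local id="$1" gpu="$2"
  run tnr stop "$id" &&
    run tnr modify "$id" --gpu "$gpu" &&
    run tnr connect "$id"
}
```

Running `DRY_RUN=1 swap_gpu <instance_id> a100` prints the three commands without touching the instance, which is a cheap way to check the sequence before spending credit.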
Development Mode vs. Production Mode
When customizing your instance, you will encounter the option to run in either Development Mode or Production Mode. Selecting the correct mode is critical for maximizing your $10 budget:
- Development Mode (Pay-per-minute Prototyping): This is optimized for R&D. It runs on a shared infrastructure layer that uses virtualization optimizations to slice GPU cores. You only pay for GPU compute when CUDA tasks are active. If your workspace sits idle while you read documentation, you aren't paying full GPU rates. This is the mode to use for the $10 budget challenge.
- Production Mode (Dedicated SLAs): This allocates physical GPUs directly to your instance without shared virtualization. This eliminates network overhead and latency, making it ideal for low-latency production inference or intense multi-GPU training scripts (scaling up to 8 GPUs).
Model Fit & Quant Guide for 48GB vs 80GB
If you are running Ollama on these instances, here is how to size your models to get the best performance and avoid Out-Of-Memory (OOM) crashes:
| Model Name | Size | Quantization | RAM/VRAM Required | Best Match GPU |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Llama-70B | 70B | Q4_K_M | ~43 GB | RTX A6000 (48GB) |
| Llama 3.3 | 70B | Q4_K_M | ~43 GB | RTX A6000 (48GB) |
| DeepSeek-R1-Distill-Llama-70B | 70B | Q8_0 | ~75 GB | A100 80GB / H100 80GB |
| Llama 3.3 | 70B | Q8_0 | ~75 GB | A100 80GB / H100 80GB |
| Command R+ | 104B | Q4_K_M | ~65 GB | A100 80GB |
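The sizes in this table can be approximated as parameters times bits per weight, divided by 8. The sketch below assumes about 4.85 bits per weight for Q4_K_M (an approximation that varies slightly by model), 8.5 for Q8_0, and 16 for unquantized FP16. The 4 GB default headroom for KV cache and runtime buffers is a guess, not a measured value, and both function names are hypothetical.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT -> approximate weight size in GB.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# fits PARAMS_BILLIONS BITS_PER_WEIGHT VRAM_GB [HEADROOM_GB]
# Exits 0 when weights plus headroom fit in VRAM, 1 otherwise.
# HEADROOM_GB defaults to 4 (assumed allowance for KV cache and buffers).
fits() {
  awk -v p="$1" -v b="$2" -v v="$3" -v h="${4:-4}" \
    'BEGIN { exit (p * b / 8 + h <= v) ? 0 : 1 }'
}

# Example: check a 70B model at Q8_0 against a 48 GB card.
if fits 70 8.5 48; then
  echo "70B Q8_0 fits in 48 GB"
else
  echo "70B Q8_0 needs a larger GPU than 48 GB"
fi
```

Long context windows enlarge the KV cache, so raise the headroom argument when you plan to use them.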
The Verdict: Is It Worth It?
For a $10 prepaid budget, Thunder Compute offers exceptional utility. We managed to run a complete benchmark suite testing multiple quantized 70B models, upgrade our GPU on the fly to test Q8 quants, and still had credit left over.
What We Liked:
- Cost-effectiveness: At $0.35/hr for an A6000, a full 24 hours of runtime costs $8.40, so you can leave your development instance running all day and still stay inside the $10 budget.
- Workspace persistence: Swapping GPU models on the fly works perfectly and saves hours of model download times.
- Ollama Template: Pre-built environments eliminate the usual CUDA/Docker driver configuration headaches.
What to Keep in Mind:
- Storage constraints: Disks can be expanded but cannot be shrunk. Be conservative when setting your initial storage size.
- GPU-over-TCP overhead: In Development Mode, latency-sensitive applications may experience minor network overhead due to the virtualization layer. For low-latency production applications, opt for Production Mode.