DeepSeek is one of the strongest open-weight large language model families you can self-host, and running it on your own GPU means your prompts and data never leave infrastructure you control. In this guide we will provision a Cloud GPU instance on Momo Cloud, connect over SSH, verify the GPU, and get DeepSeek answering requests — first the quick way with Ollama, then a brief look at the high-performance path with vLLM.
This is a practical, copy-and-paste walkthrough aimed at engineers. The commands target Ubuntu 22.04/24.04 LTS, which is the default OS for server tutorials at Momo Cloud.
1. Order a Cloud GPU instance
Log in to your client area at cloud.momo.tz with your account email and password, then open Cloud GPU and click Create. The single most important decision here is VRAM. Model weights have to fit in GPU memory (plus some headroom for the KV cache that holds the conversation context), so pick a plan accordingly.
| DeepSeek variant | Rough VRAM (4-bit / quantised) | Suitable GPU class |
|---|---|---|
| deepseek-r1:1.5b (distilled) | ~2–3 GB | Entry / small GPU |
| deepseek-r1:7b (distilled) | ~5–6 GB | Single mid-range GPU |
| deepseek-r1:8b / 14b (distilled) | ~6–10 GB | Mid-range GPU |
| deepseek-r1:32b (distilled) | ~20–22 GB | High-end single GPU |
| deepseek-r1:70b (distilled) / full 671b | ~43 GB for 70b; 400 GB and up for 671b | Large or multi-GPU |
Tip: The smaller distilled DeepSeek-R1 variants (1.5B–8B) run comfortably on a single mid-range GPU and are the sensible starting point. Only step up to the larger models once you have confirmed your workload actually needs the extra quality: bigger models need bigger GPU plans, which cost meaningfully more per hour.
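As a rough cross-check on the table, weight memory can be approximated as parameters times bits per weight, divided by eight. The sketch below assumes a flat 20% overhead for the KV cache and runtime buffers. That factor and the function name `estimate_vram_gb` are illustrative choices, not measured values.

```shell
# Rough VRAM estimate in GB: params (billions) x bits per weight / 8 x overhead.
# The default 1.2 overhead factor is an assumption, not a measured value.
estimate_vram_gb() {
  local params_b="$1" bits="$2" overhead="${3:-1.2}"
  LC_ALL=C awk -v p="$params_b" -v b="$bits" -v o="$overhead" \
    'BEGIN { printf "%.1f\n", p * b / 8 * o }'
}

estimate_vram_gb 7 4     # 7B parameters at 4-bit
estimate_vram_gb 32 4    # 32B parameters at 4-bit
estimate_vram_gb 7 16    # 7B parameters at 16-bit (unquantised)
```

This prints 4.2, 19.2 and 16.8. Real quantised files use slightly more than 4 bits per weight and long contexts need more cache, which is why the table's figures run a little higher than the 4-bit estimates.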
Choose your GPU plan and billing cycle, select Ubuntu as the operating system in the order wizard, and complete the order. An invoice is generated in TZS; pay it from the client area using M-Pesa, Tigo Pesa, Airtel Money, Visa/Mastercard, or your wallet balance. Once payment clears, the instance is provisioned automatically — give it a few minutes.
2. Connect over SSH
When the instance is active, open it from your Cloud GPU list to find its IP address and login credentials (reveal the root password with the eye icon if one is shown). From your laptop's terminal, connect:
ssh root@YOUR_INSTANCE_IP
Accept the host key fingerprint on first connection and enter the password. Once in, take a moment to update the system:
apt update && apt upgrade -y
3. Verify the GPU
Before installing anything, confirm the GPU is visible to the operating system. The standard tool is nvidia-smi:
nvidia-smi
On a correctly imaged GPU instance you will see a table listing the GPU model, total/used memory, and the driver and CUDA versions. If the command runs and lists your GPU, skip ahead to step 4.
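For a scriptable version of the same check, nvidia-smi can print selected fields as CSV. The sketch below compares the first GPU's total memory against the roughly 6 GB that the table in step 1 lists for deepseek-r1:7b. The helper name `vram_ok` and the 6 GB threshold are illustrative.

```shell
# Succeeds when total_mib (nvidia-smi's memory.total in MiB) covers need_gb.
# Returns 2 if total_mib is not a plain integer, e.g. an error message.
vram_ok() {
  local total_mib="$1" need_gb="$2"
  case "$total_mib" in
    ''|*[!0-9]*) return 2 ;;
  esac
  [ "$total_mib" -ge $(( need_gb * 1024 )) ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # One CSV line per GPU: model name, total memory, driver version.
  nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
  total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
  if vram_ok "$total" 6; then
    echo "First GPU has enough memory for deepseek-r1:7b"
  else
    echo "First GPU looks too small for deepseek-r1:7b (or the query failed)"
  fi
else
  echo "nvidia-smi not found"
fi
```

Change the second argument of `vram_ok` to match the variant you plan to run.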
If nvidia-smi is missing
If the command is not found, the NVIDIA driver is not installed. Ubuntu ships a helper that picks the right driver for the detected card:
apt install -y ubuntu-drivers-common
ubuntu-drivers install
reboot
The instance will drop your SSH session while it reboots. Reconnect after a minute and run nvidia-smi again — it should now report the GPU. Ollama bundles the CUDA runtime it needs, so for the Ollama path you do not have to install the full CUDA toolkit separately.
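Warning: The loaded NVIDIA kernel module and the user-space driver libraries must be the same version. If a reboot leaves nvidia-smi reporting "Failed to initialize NVML: Driver/library version mismatch", the running kernel module is out of step with the libraries on disk; reboot once more so the system loads the newly installed kernel module.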
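To see whether a mismatch is in play before rebooting, you can compare the version of the module that is currently loaded with the one installed on disk. This is a minimal sketch: it assumes the module is named `nvidia` (as with Ubuntu's packaged drivers) and that the on-disk module matches the installed libraries. The helper `first_version` is illustrative.

```shell
# Pull the first dotted version number (e.g. 535.183.01) out of text on stdin.
first_version() {
  grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

if [ -r /proc/driver/nvidia/version ]; then
  # First line of this file names the kernel module that is loaded right now.
  loaded=$(head -n 1 /proc/driver/nvidia/version | first_version)
  # modinfo reads the module file on disk, i.e. what a reboot would load.
  on_disk=$(modinfo -F version nvidia 2>/dev/null)
  echo "loaded kernel module: ${loaded:-unknown}"
  echo "module on disk:       ${on_disk:-unknown}"
  if [ -n "$loaded" ] && [ -n "$on_disk" ] && [ "$loaded" != "$on_disk" ]; then
    echo "versions differ: reboot to load the new module"
  fi
else
  echo "NVIDIA kernel module is not loaded"
fi
```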
4. The easy path: Ollama
Ollama is the fastest way to get DeepSeek running. It downloads pre-quantised model files, handles GPU offload, and exposes a simple HTTP API. Install it with the official script:
curl -fsSL https://ollama.com/install.sh | sh
The installer sets up a systemd service called ollama and starts it automatically. Now pull and chat with a DeepSeek model in one command:
ollama run deepseek-r1:7b
The first run downloads the model (a few gigabytes), then drops you into an interactive prompt. Type a question, press Enter, and you should see DeepSeek respond using the GPU. Type /bye to exit. Swap 7b for 1.5b, 8b, 14b or 32b depending on the VRAM you provisioned.
While a model is running, open a second SSH session and confirm the GPU is actually being used — you should see the ollama process and non-zero memory in the output:
nvidia-smi
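Ollama can also report placement itself: `ollama ps` lists loaded models with a PROCESSOR column that reads `100% GPU` when the whole model fits in VRAM. The sketch below wraps that check; the helper name `fully_on_gpu` is illustrative.

```shell
# Reads `ollama ps` output on stdin; succeeds if a model is fully on the GPU.
fully_on_gpu() {
  grep -q '100% GPU'
}

if command -v ollama >/dev/null 2>&1; then
  ollama ps
  if ollama ps | fully_on_gpu; then
    echo "model is running entirely on the GPU"
  else
    echo "no model loaded, or part of it is running on the CPU"
  fi
fi

if command -v nvidia-smi >/dev/null 2>&1; then
  # Per-process GPU memory, one CSV line per compute process.
  nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
fi
```

A split such as `48%/52% CPU/GPU` in the PROCESSOR column usually means the model is too large for the card, so a smaller variant will run much faster.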
Expose the Ollama API
By default Ollama serves its HTTP API on 127.0.0.1:11434, reachable only from the instance itself. That is the safe default. If you need other machines on a trusted network to reach it, edit the service to listen on all interfaces:
systemctl edit ollama
Add the following override, then save:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Reload and restart the service:
systemctl daemon-reload
systemctl restart ollama
Warning: Ollama has no authentication. Binding it to 0.0.0.0 exposes an open API to anyone who can reach the port. Never do this without a firewall in front of it — see the security section below.
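After restarting the service it is worth confirming which address Ollama is actually listening on. The sketch below reads the listening sockets from `ss` and labels the one on port 11434. The helper name `bind_scope` and the 10.0.0.0/24 subnet in the closing comment are placeholders.

```shell
# Classify a listen address (as printed by `ss -ltn`) by how widely it is exposed.
bind_scope() {
  case "$1" in
    127.0.0.1:*|'[::1]:'*)     echo "loopback-only" ;;
    0.0.0.0:*|'*:'*|'[::]:'*)  echo "all-interfaces" ;;
    *)                         echo "specific-address" ;;
  esac
}

# Column 4 of `ss -ltn` is the local address:port of each listening socket.
ss -ltn 2>/dev/null | awk '$4 ~ /:11434$/ { print $4 }' | while read -r addr; do
  echo "$addr -> $(bind_scope "$addr")"
done

# If it reports all-interfaces, allow only a trusted subnet (example range):
#   ufw allow from 10.0.0.0/24 to any port 11434 proto tcp
```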
5. The performance path: vLLM
For production serving (high throughput, batched requests, and an OpenAI-compatible endpoint), vLLM is the standard choice. It is heavier to set up than Ollama and more sensitive to driver and CUDA versions, but delivers far better tokens-per-second under concurrent load. The quickest start is the official Docker image, which requires Docker and the NVIDIA Container Toolkit on the host for --gpus all to work:
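# Publish on loopback only: Docker-published ports bypass ufw rules.
docker run --gpus all --ipc=host \
  -p 127.0.0.1:8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B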
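Alternatively, install it into a Python virtual environment with pip install vllm and launch vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B. Either way, vLLM exposes an OpenAI-compatible API on port 8000, so existing OpenAI SDK code works by pointing the base URL at the /v1 path on that port (for example http://localhost:8000/v1). Note that vLLM loads the unquantised 16-bit weights by default, so this 7B model needs roughly 15 GB for the weights alone, well above the 4-bit figures in the table in step 1.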
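vLLM takes a while to download and load the weights before it accepts requests. A small polling loop against its `/health` endpoint saves guessing when it is ready. This is a sketch; the function name `wait_for_url` and the attempt counts are illustrative.

```shell
# Poll a URL about once per second until it answers with a 2xx status.
# Gives up and returns 1 after the given number of attempts (default 300).
wait_for_url() {
  local url="$1" attempts="${2:-300}" tried=0
  until curl -fsS -o /dev/null --max-time 2 "$url" 2>/dev/null; do
    tried=$(( tried + 1 ))
    [ "$tried" -ge "$attempts" ] && return 1
    sleep 1
  done
}
```

For example, `wait_for_url http://localhost:8000/health 600 && echo "vLLM is ready"` blocks until the server responds or 600 attempts have failed.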
6. Test the local API
With Ollama running, send a request to its API from the instance to confirm everything works end to end:
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "In one sentence, what is a GPU?",
  "stream": false
}'
You will get back a JSON object whose response field contains DeepSeek's answer. For vLLM, hit its OpenAI-compatible endpoint instead:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
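Both replies are JSON, so for scripting it helps to print only the answer text. The sketch below pipes the reply through the python3 interpreter that Ubuntu ships by default. The function name `answer_text` is illustrative, and it assumes a non-streaming reply in one of the two shapes described above.

```shell
# Print the answer text from a JSON reply on stdin.
# Handles Ollama's /api/generate shape and the OpenAI chat-completions shape.
answer_text() {
  python3 -c '
import json, sys
reply = json.load(sys.stdin)
if "response" in reply:                      # Ollama /api/generate
    print(reply["response"])
else:                                        # OpenAI-compatible (vLLM)
    print(reply["choices"][0]["message"]["content"])
'
}
```

Append `| answer_text` to either curl command (adding `-s` to curl hides its progress meter) to see just the model's reply.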
7. Secure it and control cost
An LLM endpoint with no authentication is a liability. Keep the API bound to localhost wherever possible and reach it through an SSH tunnel from your workstation:
ssh -L 11434:localhost:11434 root@YOUR_INSTANCE_IP
If the service genuinely must be reachable from the internet, do not expose it directly. Put a reverse proxy (Nginx or Caddy) in front to add TLS and authentication, and lock the host firewall down so only SSH and your proxy port are open:
ufw allow OpenSSH
ufw allow 443/tcp
ufw enable
Configure the proxy to require an API key or HTTP basic auth before forwarding to 127.0.0.1:11434 (or :8000 for vLLM), so the model itself is never directly exposed.
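As one possible shape for that proxy, the sketch below writes an Nginx site file that terminates TLS, requires HTTP basic auth, and forwards to Ollama on loopback. The domain llm.example.com, the certificate paths, the password-file location and the function name `write_proxy_conf` are all placeholders to replace with your own values.

```shell
# Write an example Nginx site file to the path given as $1.
# Domain, certificate paths and htpasswd location are placeholders.
write_proxy_conf() {
  cat > "$1" <<'EOF'
server {
    listen 443 ssl;
    server_name llm.example.com;

    ssl_certificate     /etc/letsencrypt/live/llm.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;

    location / {
        auth_basic           "DeepSeek API";
        auth_basic_user_file /etc/nginx/.htpasswd;

        proxy_pass         http://127.0.0.1:11434;
        proxy_set_header   Host localhost:11434;
        proxy_buffering    off;    # pass streamed tokens through immediately
        proxy_read_timeout 300s;   # allow long generations
    }
}
EOF
}

write_proxy_conf ./deepseek-proxy.conf
```

After editing the placeholders, copy the file into /etc/nginx/sites-available/ and link it into sites-enabled. Create the password file with `htpasswd -c /etc/nginx/.htpasswd USERNAME` (from the apache2-utils package), then run `nginx -t && systemctl reload nginx`. For vLLM, change the `proxy_pass` target to port 8000 and adjust the `Host` header to match.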
Tip: GPU instances are billed for the time they are powered on. When you are not actively using DeepSeek, stop the instance from your Cloud GPU panel — an idle GPU still costs money. A stopped instance keeps its disk and configuration, so you can start it again later and your model files will still be there.
Wrapping up
You now have a repeatable path to self-hosted DeepSeek: order a Cloud GPU with enough VRAM for your chosen variant, verify the GPU with nvidia-smi, run a model in minutes with Ollama (or scale up with vLLM), test it over the local API, and keep it locked down behind SSH or an authenticated proxy. The two habits that matter most are sizing VRAM to the model and stopping the instance when it is idle.
Ready to try it? Spin up a Cloud GPU instance from your Momo Cloud client area at cloud.momo.tz, and if you hit a snag our team is available 24/7 in English and Swahili — just open a support ticket.