
A hands-on, senior-engineer walkthrough for running the DeepSeek open-weight LLM on a Momo Cloud GPU instance: ordering, SSH access, serving with Ollama or vLLM, and security and cost tips.

DeepSeek is one of the strongest open-weight large language model families you can self-host, and running it on your own GPU means your prompts and data never leave infrastructure you control. In this guide we will provision a Cloud GPU instance on Momo Cloud, connect over SSH, verify the GPU, and get DeepSeek answering requests — first the quick way with Ollama, then a brief look at the high-performance path with vLLM.

This is a practical, copy-and-paste walkthrough aimed at engineers. The commands target Ubuntu 22.04/24.04 LTS, which is the default OS for server tutorials at Momo Cloud.

1. Order a Cloud GPU instance

Log in to your client area at cloud.momo.tz with your account email and password, then open Cloud GPU and click Create. The single most important decision here is VRAM. Model weights have to fit in GPU memory (plus some headroom for the KV cache that holds the conversation context), so pick a plan accordingly.

DeepSeek variant	Rough VRAM (4-bit / quantised)	Suitable GPU class
deepseek-r1:1.5b (distilled)	~2–3 GB	Entry / small GPU
deepseek-r1:7b (distilled)	~5–6 GB	Single mid-range GPU
deepseek-r1:8b / 14b (distilled)	~6–10 GB	Mid-range GPU
deepseek-r1:32b (distilled)	~20–22 GB	High-end single GPU
deepseek-r1:70b (distilled)	~40–48 GB	Large single GPU or multi-GPU
deepseek-r1:671b (full model)	~400 GB and up	Multi-GPU server

Tip: The smaller distilled DeepSeek-R1 variants (1.5B–8B) run comfortably on a single mid-range GPU and are the sensible starting point. Only step up to the larger models once you have confirmed your workload actually needs the extra quality — bigger models cost meaningfully more per hour.
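As a rough cross-check on the table, weight memory can be estimated as parameters (in billions) times bits per weight divided by 8, plus headroom for the KV cache. The sketch below is only a first-order estimate: the 4.5 bits-per-weight default and the 20% headroom are assumed rules of thumb, and estimate_vram_gb is a hypothetical helper name, not part of any tool.

```shell
# First-order VRAM estimate in GB.
# Assumptions (not vendor figures): 4-bit formats average ~4.5 bits per weight,
# and 20% extra covers KV cache and runtime buffers.
estimate_vram_gb() {
  params_b="$1"      # parameter count in billions, e.g. 7
  bits="${2:-4.5}"   # bits per weight; pass 16 for unquantised FP16
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.2 }'
}

estimate_vram_gb 7      # 4-bit 7B model
estimate_vram_gb 32     # 4-bit 32B model
estimate_vram_gb 7 16   # the same 7B model in FP16
```

Long contexts grow the KV cache well beyond this allowance, so treat the result as a lower bound when choosing a plan.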

Choose your GPU plan and billing cycle, select Ubuntu as the operating system in the order wizard, and complete the order. An invoice is generated in TZS; pay it from the client area using M-Pesa, Tigo Pesa, Airtel Money, Visa/Mastercard, or your wallet balance. Once payment clears, the instance is provisioned automatically — give it a few minutes.

2. Connect over SSH

When the instance is active, open it from your Cloud GPU list to find its IP address and login credentials (reveal the root password with the eye icon if one is shown). From your laptop's terminal, connect:

ssh root@YOUR_INSTANCE_IP

Accept the host key fingerprint on first connection and enter the password. Once in, take a moment to update the system:

apt update && apt upgrade -y

3. Verify the GPU

Before installing anything, confirm the GPU is visible to the operating system. The standard tool is nvidia-smi:

nvidia-smi

On a correctly imaged GPU instance you will see a table listing the GPU model, total/used memory, and the driver and CUDA versions. If the command runs, you are ready to move on to step 4.
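For scripting, nvidia-smi can also print selected fields as CSV through its --query-gpu and --format options. The sketch below uses that output to check whether any GPU has enough total memory for a chosen model. has_enough_vram is a hypothetical helper, and the sample line is made up for illustration.

```shell
# Succeeds if at least one GPU reports total memory >= the required GB.
# Expects lines on stdin as produced by:
#   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
# (memory.total is reported in MiB when nounits is set).
has_enough_vram() {
  required_gb="$1"
  awk -F', ' -v need="$required_gb" '
    { if ($2 / 1024 >= need) ok = 1 }
    END { exit (ok ? 0 : 1) }'
}

# Sample line in that format (made-up values).
echo "Example GPU, 24576" | has_enough_vram 20 && echo "fits"
```

On the instance itself, pipe the real command in: nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits | has_enough_vram 20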

If nvidia-smi is missing

If the command is not found, the NVIDIA driver is not installed. Ubuntu ships a helper that picks the right driver for the detected card:

apt install -y ubuntu-drivers-common
ubuntu-drivers install
reboot

The instance will drop your SSH session while it reboots. Reconnect after a minute and run nvidia-smi again — it should now report the GPU. Ollama bundles the CUDA runtime it needs, so for the Ollama path you do not have to install the full CUDA toolkit separately.

Warning: The loaded NVIDIA kernel module and the user-space driver libraries must be the same version. If a reboot leaves nvidia-smi reporting "Failed to initialize NVML: Driver/library version mismatch", reboot once more so the kernel loads the newly installed module.

4. The easy path: Ollama

Ollama is the fastest way to get DeepSeek running. It downloads ready-quantised model builds, offloads them to the GPU, and exposes a simple HTTP API. Install it with the official script:

curl -fsSL https://ollama.com/install.sh | sh

The installer sets up a systemd service called ollama and starts it automatically. Now pull and chat with a DeepSeek model in one command:

ollama run deepseek-r1:7b

The first run downloads the model (a few gigabytes), then drops you into an interactive prompt. Type a question, press Enter, and you should see DeepSeek respond using the GPU. Type /bye to exit. Swap 7b for 1.5b, 8b, 14b or 32b depending on the VRAM you provisioned.
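The choice of tag can be made mechanical. The sketch below maps a VRAM figure to the largest tag whose upper estimate in the sizing table still fits. pick_deepseek_tag is a hypothetical helper, the thresholds are simply the upper ends of the table's ranges, and 8b is left out because it shares a row with 14b.

```shell
# Print the largest deepseek-r1 tag that fits in the given VRAM (whole GB).
# Thresholds are the upper ends of the ranges in the sizing table.
pick_deepseek_tag() {
  vram_gb="$1"
  if   [ "$vram_gb" -ge 22 ]; then echo "deepseek-r1:32b"
  elif [ "$vram_gb" -ge 10 ]; then echo "deepseek-r1:14b"
  elif [ "$vram_gb" -ge 6 ];  then echo "deepseek-r1:7b"
  elif [ "$vram_gb" -ge 3 ];  then echo "deepseek-r1:1.5b"
  else echo "none"; return 1
  fi
}

pick_deepseek_tag 16   # a 16 GB card
pick_deepseek_tag 24   # a 24 GB card
```

The result can be passed straight to Ollama, for example: ollama run "$(pick_deepseek_tag 16)"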

While a model is running, open a second SSH session and confirm the GPU is actually being used — you should see the ollama process and non-zero memory in the output:

nvidia-smi

Expose the Ollama API

By default Ollama serves its HTTP API on 127.0.0.1:11434, reachable only from the instance itself. That is the safe default. If you need other machines on a trusted network to reach it, edit the service to listen on all interfaces:

systemctl edit ollama

Add the following override, then save:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Reload and restart the service:

systemctl daemon-reload
systemctl restart ollama

Warning: Ollama has no authentication. Binding it to 0.0.0.0 exposes an open API to anyone who can reach the port. Never do this without a firewall in front of it — see the security section below.
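After changing the bind address it is worth confirming what the service is actually listening on. The sketch below classifies the listening sockets reported by ss -ltn for port 11434. ollama_bind_scope is a hypothetical helper, and the sample lines are illustrative rather than captured from a real instance.

```shell
# Read `ss -ltn` output on stdin and report how port 11434 is bound.
# Column 4 of that output is "Local Address:Port".
ollama_bind_scope() {
  awk '
    $4 ~ /:11434$/ {
      seen = 1
      if ($4 ~ /^(127\.0\.0\.1|\[::1\]):/) print "loopback only"
      else print "reachable from other hosts: " $4
    }
    END { if (!seen) print "nothing listening on 11434" }'
}

# Illustrative sample lines.
printf 'LISTEN 0 4096 127.0.0.1:11434 0.0.0.0:*\n' | ollama_bind_scope
printf 'LISTEN 0 4096 *:11434 *:*\n' | ollama_bind_scope
```

On the instance, run ss -ltn | ollama_bind_scope after restarting the service.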

5. The performance path: vLLM

For production serving (high throughput, batched requests, and an OpenAI-compatible endpoint), vLLM is the standard choice. It is heavier to set up than Ollama: the Docker route needs Docker plus the NVIDIA Container Toolkit so that --gpus all works, and the pip route pulls in a large CUDA-enabled PyTorch build. In return it delivers far better tokens-per-second under concurrent load. The quickest start is the official Docker image:

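# Publish on loopback only; --ipc=host gives PyTorch the shared memory it needs
docker run --gpus all --ipc=host -p 127.0.0.1:8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B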

Alternatively, install it into a Python virtual environment with pip install vllm and launch vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B. Either way, vLLM exposes an OpenAI-compatible API on port 8000, so existing OpenAI SDK code works by simply pointing the base URL at your instance.

6. Test the local API

With Ollama running, send a request to its API from the instance to confirm everything works end to end:

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "In one sentence, what is a GPU?",
  "stream": false
}'

You will get back a JSON object whose response field contains DeepSeek's answer. For vLLM, hit its OpenAI-compatible endpoint instead:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
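Both replies are JSON, so in scripts you usually want only the answer text. The sketch below pulls it out with the python3 that ships with Ubuntu. ollama_answer and vllm_answer are hypothetical helper names, and the sample replies are abbreviated and made up for illustration.

```shell
# Print the answer text from an Ollama /api/generate reply read on stdin.
ollama_answer() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["response"])'
}

# Print the answer text from an OpenAI-style chat completion (vLLM) read on stdin.
vllm_answer() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Abbreviated, made-up sample replies.
echo '{"model":"deepseek-r1:7b","response":"A GPU is a parallel processor.","done":true}' | ollama_answer
echo '{"choices":[{"message":{"role":"assistant","content":"Hello!"}}]}' | vllm_answer
```

On the instance, add -s to the curl commands above and pipe their output into the matching helper.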

7. Secure it and control cost

An LLM endpoint with no authentication is a liability. Keep the API bound to localhost wherever possible and reach it through an SSH tunnel from your workstation:

ssh -L 11434:localhost:11434 root@YOUR_INSTANCE_IP

If the service genuinely must be reachable from the internet, do not expose it directly. Put a reverse proxy (Nginx or Caddy) in front to add TLS and authentication, and lock the host firewall down so only SSH and your proxy port are open:

ufw allow OpenSSH
ufw allow 443/tcp
ufw enable

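Configure the proxy to require an API key or HTTP basic auth before forwarding to 127.0.0.1:11434 (or :8000 for vLLM), so the model itself is never directly exposed. Be aware that ports published with Docker's -p flag bypass ufw rules, so publish the vLLM container on loopback only (for example -p 127.0.0.1:8000:8000).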

Tip: GPU instances are billed for the time they are powered on. When you are not actively using DeepSeek, stop the instance from your Cloud GPU panel — an idle GPU still costs money. A stopped instance keeps its disk and configuration, so you can start it again later and your model files will still be there.
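To see how much stopping an idle instance saves, multiply the hourly rate by the hours it is actually powered on. The sketch below does that arithmetic. The 2000 TZS per hour rate is a made-up placeholder, not a Momo Cloud price, and monthly_cost is a hypothetical helper.

```shell
# monthly_cost RATE_PER_HOUR HOURS_PER_DAY [DAYS]
# Billing is assumed to be purely per powered-on hour, as described above.
monthly_cost() {
  awk -v r="$1" -v h="$2" -v d="${3:-30}" 'BEGIN { printf "%d\n", r * h * d }'
}

monthly_cost 2000 24   # left running all month (placeholder rate)
monthly_cost 2000 8    # powered on for an 8-hour working day only
```

Substitute the hourly rate shown on your own plan to get real figures.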

Wrapping up

You now have a repeatable path to self-hosted DeepSeek: order a Cloud GPU with enough VRAM for your chosen variant, verify the GPU with nvidia-smi, run a model in minutes with Ollama (or scale up with vLLM), test it over the local API, and keep it locked down behind SSH or an authenticated proxy. The two habits that matter most are sizing VRAM to the model and stopping the instance when it is idle.

Ready to try it? Spin up a Cloud GPU instance from your Momo Cloud client area at cloud.momo.tz, and if you hit a snag our team is available 24/7 in English and Swahili — just open a support ticket.

How to Deploy DeepSeek on a Momo Cloud GPU Instance

Technical Admin
