Private LLM deployment: running AI models on your own infrastructure

Why deploy LLMs on your own infrastructure

Cloud-hosted AI services from OpenAI, Anthropic, Google, and others are the fastest way to access large language model capabilities. But for a growing number of South African businesses, sending data to external APIs is either unacceptable or uneconomical.

Data sovereignty and privacy

If your business handles privileged legal communications, patient health records, financial data, or classified government information, sending that data to a third-party API introduces risk - regardless of the provider’s security posture. Private deployment keeps every prompt and response within your own network boundary.

Under South Africa’s POPIA, you remain the responsible party for personal information processed by AI systems. Private deployment simplifies compliance by keeping data processing within infrastructure you directly control.

Compliance requirements

Some regulatory frameworks and client contracts explicitly prohibit sending data to third-party cloud services. Private deployment satisfies these requirements without giving up AI capabilities.

Cost at scale

Cloud API pricing (per token or per request) is economical for low to moderate usage. But when AI becomes embedded across multiple business processes - document analysis, customer service, internal search, code assistance - costs scale linearly with usage. At high volumes, dedicated hardware amortised over three to five years can be significantly cheaper per inference.

Latency and availability

On-premise deployment eliminates internet round-trip latency and dependency on external service availability. For applications that require real-time responses (interactive assistants, manufacturing quality inspection), local inference provides consistent, predictable performance.

Hardware requirements

Running large language models is computationally demanding. The hardware decisions you make directly determine which models you can run and at what speed.

GPUs

GPUs are the primary compute engine for LLM inference. The key specification is VRAM (video memory) - the model’s parameters must fit in GPU memory for efficient inference.

Model size	VRAM required (FP16)	Example GPUs
7B parameters	14 GB	NVIDIA RTX 4090, A4000
13B parameters	26 GB	NVIDIA A5000, RTX 6000 Ada
34-40B parameters	70-80 GB	NVIDIA A100 80GB, H100
70B+ parameters	140+ GB	Multi-GPU (2-4x A100/H100)

Quantisation techniques (reducing model precision from FP16 to INT8 or INT4) can halve or quarter memory requirements with modest quality trade-offs, making larger models accessible on smaller GPUs.

CPU and system memory

While inference primarily uses the GPU, the CPU handles tokenisation, scheduling, and data movement. A modern multi-core server CPU (AMD EPYC, Intel Xeon) with 128-256 GB of system RAM provides comfortable headroom.

Storage

Model weights for a 70B parameter model are approximately 140 GB at full precision. You also need space for multiple model versions, fine-tuned variants, and the inference framework. Fast NVMe storage (1-2 TB) ensures quick model loading.

Networking

If you deploy across multiple GPU servers, high-bandwidth, low-latency networking (InfiniBand or 100 GbE) is necessary for tensor parallelism. For single-server deployments, standard networking is sufficient.

Model selection

The open-source LLM ecosystem has matured rapidly. You no longer need to compromise heavily on quality to run models privately.

Leading open-source options

Llama family (Meta) - available in 7B to 405B parameter sizes, with strong general-purpose performance. The Llama 3 series is competitive with many commercial APIs.
Mistral / Mixtral (Mistral AI) - efficient models with strong reasoning, including mixture-of-experts architectures that provide large-model quality at lower inference cost.
Qwen (Alibaba) - strong multilingual performance and competitive benchmarks across sizes.
DeepSeek - particularly strong in coding and reasoning tasks.
Phi (Microsoft) - smaller models (3B-14B) optimised for efficiency, suitable for deployment on modest hardware.

Choosing the right model

Match model size to your use case:

Summarisation, classification, extraction - 7-13B models are often sufficient.
Complex reasoning, writing, analysis - 34-70B models provide a noticeable quality uplift.
Multi-turn conversation, coding - 70B+ models or specialised fine-tunes perform best.

Start with the smallest model that meets your quality requirements. You can always scale up, and smaller models are dramatically cheaper to run.

Deployment patterns

Single-server inference

The simplest pattern: one server with one or more GPUs, running an inference framework (vLLM, llama.cpp, TGI) that serves a REST API. Suitable for teams of up to a few hundred users with moderate concurrent usage.

Load-balanced multi-server

For higher concurrency or redundancy, deploy multiple inference servers behind a load balancer. This also enables serving different models for different use cases (a small fast model for classification, a large model for complex tasks).

Retrieval-augmented generation (RAG)

Most business applications benefit from RAG: the model answers questions using your organisation’s documents rather than relying solely on its training data. This requires a vector database (Milvus, Qdrant, pgvector) alongside the inference server, with a pipeline that indexes your documents as embeddings.

Fine-tuning pipeline

If general-purpose models don’t meet your quality requirements for specific tasks, fine-tuning on your own data can close the gap. This requires additional GPU capacity for training (typically more demanding than inference) and a data preparation pipeline.

Operational considerations

Running LLMs in production requires more than just starting a server.

Monitoring and observability

Track inference latency, throughput (tokens per second), GPU utilisation, memory usage, and error rates. Set alerts for degradation - a slow model response cascades into poor user experience across every application that depends on it.

Model updates and versioning

New model releases are frequent. Establish a process for evaluating, testing, and deploying updated models without disrupting service. Keep previous versions available for rollback.

Security

Secure the inference API with authentication and authorisation. Log all requests and responses for audit purposes. Implement input validation to prevent prompt injection attacks. ITHQ’s AI security, governance, and compliance practice can help establish these controls.

Capacity planning

Usage tends to grow rapidly once AI capabilities are available. Plan for 2-3x your initial projected usage within the first year, and design the architecture to scale by adding GPU servers.

Cost comparison

A realistic comparison for a 70B parameter model serving an organisation of 500 users:

Cost category	Cloud API (annual)	Private deployment (annual, amortised)
Inference compute	R800K - R2.5M+	R400K - R600K
Hardware (amortised 3yr)	-	R300K - R500K
Power and cooling	-	R50K - R100K
Staff (partial FTE)	-	R150K - R250K
Total	R800K - R2.5M+	R900K - R1.45M

The break-even point depends on usage volume. At low usage, cloud APIs win on cost. At moderate to high usage, private deployment becomes the more economical choice - with the added benefits of data control and predictable performance.

Getting started

Private LLM deployment is a meaningful infrastructure project, but it doesn’t need to be all-or-nothing.

Identify the use case - start with one high-value application, not a platform play.
Evaluate models - test two or three open-source models against your requirements.
Size the hardware - match GPU and memory to your chosen model and expected concurrency.
Deploy and iterate - start with a single-server deployment, gather usage data, and expand.

ITHQ’s private and on-premise AI solutions team specialises in helping businesses deploy LLMs on their own infrastructure - from hardware specification through production operations.