Running AI models in production does not require a $50,000 DGX station or a $10,000 monthly cloud GPU bill. For inference workloads that serve a business rather than train foundation models, refurbished enterprise hardware with a consumer GPU delivers remarkably capable performance at a fraction of the cost. We built the server that powers all of AIQSO's AI services for under $2,000, and it has been running production workloads continuously for over a year.
Key Takeaways
- Refurbished enterprise workstations like the HP Z840 provide server-grade reliability at consumer prices
- An RTX 3090 with 24GB VRAM handles 70B parameter models and concurrent inference for multiple services
- Ollama with LiteLLM proxy creates an OpenAI-compatible API layer for any application
- Self-hosted inference eliminates per-token costs, which compound rapidly at production volumes
- Cloudflare Tunnels provide secure remote access without exposing ports or managing VPNs
- The total build cost of approximately $1,800 replaces $2,000-4,000 in monthly cloud GPU spend
Why Self-Hosted AI Inference
Before covering the build, the economics deserve attention because they drive the decision. Cloud AI inference pricing follows a per-token model. For GPT-4 class models, input tokens cost $2.50-10.00 per million and output tokens cost $10.00-30.00 per million depending on the provider. At moderate production volumes of 50,000 requests per day with average context lengths, monthly costs reach $2,000-4,000.
Self-hosted inference has zero marginal cost per token. The hardware investment is fixed, electricity runs approximately $40-60 per month for a workstation-class server under moderate load, and model updates are free. The breakeven point against cloud inference typically arrives within 30-60 days of deployment.
Beyond cost, self-hosted inference provides data sovereignty. Every prompt, every document processed through RAG pipelines, every customer interaction stays on your infrastructure. For businesses handling sensitive client data or operating under compliance requirements, this eliminates an entire category of risk.
Model control is the third advantage. You choose which models to run, when to update them, and how to configure them. There is no deprecation notice forcing a migration, no provider deciding to change pricing, and no rate limits during peak usage.
The Hardware Build
The HP Z840 is a dual-socket workstation that HP designed for engineering and scientific computing. On the refurbished market, these machines represent exceptional value because enterprises refresh them on predictable cycles regardless of remaining useful life.
Base system: HP Z840 Workstation — $800 refurbished
The unit we sourced came with a single Xeon E5-2699 v3 processor (18 cores, 36 threads). We added a second identical processor for $120 from a server parts vendor. The dual-socket configuration provides 36 physical cores and 72 threads, which matters for CPU-bound preprocessing, embedding generation on CPU fallback, and handling concurrent requests while the GPU processes inference.
GPU: NVIDIA RTX 3090 — $700 used
The RTX 3090 remains the price-performance leader for inference workloads. Its 24GB of VRAM loads quantized 70B parameter models (Q4 precision) or runs multiple smaller models simultaneously. Used prices have stabilized around $600-800 as miners and early adopters sell off cards. For inference specifically, the 3090 matches or exceeds the RTX 4090 in throughput-per-dollar because model loading is memory-bound rather than compute-bound at inference time.
The Z840 accepts full-length, dual-slot GPUs without modification. The stock 1125W power supply handles the 3090's 350W TDP with headroom for the dual processors.
Memory: 192GB DDR4 ECC — $180 (incremental)
The refurbished unit came with 64GB. We added 128GB across the remaining DIMM slots using 16GB ECC registered modules sourced from server decommissions. ECC memory matters for a server running continuous workloads because silent bit errors in non-ECC memory accumulate over time and cause unpredictable model behavior.
The 192GB of system RAM serves as the staging area for model loading and handles the context windows for large models. When running LLaMA 3.3 70B at Q4 quantization, approximately 42GB loads into VRAM with overflow layers spilling to system RAM. The ample system memory ensures this spillover does not become a bottleneck.
Storage: 2TB NVMe — $120
Model files for 11 Ollama models total approximately 120GB. The NVMe storage ensures fast model swapping when switching between models that cannot all reside in VRAM simultaneously.
Total build cost: approximately $1,800 including the base system, second CPU, additional RAM, GPU, and storage upgrade.
The Software Stack
The server runs Ubuntu 24.04 LTS, chosen for its NVIDIA driver support and long-term update cycle. The AI stack layers on top of the base OS.
Ollama serves as the model runtime. It handles model file management, GPU memory allocation, context window management, and exposes a REST API for inference. Ollama's architecture automatically manages model loading and unloading based on request patterns, which is essential when running multiple models on a single GPU.
Our deployed model roster includes:
- qwen3:8b — Primary general-purpose model for chat and analysis tasks, fast enough for real-time interactions
- llama3.3:70b — Heavy reasoning tasks that need maximum capability, loaded on demand
- nomic-embed-text — 768-dimension embeddings for RAG pipelines, runs concurrently with inference models
- aiqsochat:latest — Fine-tuned 16GB model specifically trained on AIQSO service documentation
- cyberque-chat:latest — Domain-specific model for cybersecurity consulting interactions
- qwen2.5-coder:32b — Code generation and analysis tasks
LiteLLM proxy sits in front of Ollama and provides an OpenAI-compatible API endpoint. This means any application built against the OpenAI SDK can point at our server with a configuration change. LiteLLM handles request routing, load balancing across models, fallback chains, and usage logging.
ComfyUI runs alongside the LLM stack for image-to-video generation using Wan2.1 models. This is a separate workload that time-shares the GPU, typically running batch video generation jobs during off-peak hours.
Network Architecture and Remote Access
The server sits on the local network at a static IP address within our Proxmox infrastructure environment. Rather than exposing ports to the internet, we use Cloudflare Tunnels to provide secure remote access.
The tunnel configuration creates named routes that map public hostnames to local services. Inference requests from deployed applications route through Cloudflare's network to the tunnel endpoint, then to the Ollama API on the local network. This provides DDoS protection, TLS termination, and access control without opening any inbound firewall ports.
For applications running within the same network, such as services deployed on our Proxmox cluster, direct API access avoids the tunnel and provides lower latency. The LiteLLM proxy listens on the local network interface and accepts requests from trusted internal sources.
Performance Characteristics
Real-world inference performance depends on model size, quantization, context length, and concurrent request count. These are representative numbers from our production workload.
qwen3:8b at Q4 quantization generates approximately 45-55 tokens per second for single requests. This translates to a perceived response time of 2-4 seconds for typical chat completions. Under concurrent load of 5-10 simultaneous requests, per-request throughput drops to 15-25 tokens per second, which remains acceptable for production use.
llama3.3:70b at Q4 quantization generates approximately 8-12 tokens per second. This model loads on demand for tasks requiring maximum reasoning capability and is not suited for real-time chat at scale. Batch processing and complex analysis tasks are its primary use case.
nomic-embed-text processes embedding requests at approximately 200-300 documents per minute with typical document lengths of 500-1000 tokens. This throughput supports the RAG pipelines for our AI automation services without becoming a bottleneck.
The dual Xeon configuration handles preprocessing, tokenization, and request scheduling on CPU while the GPU focuses on inference. During peak loads, CPU utilization reaches 40-60% while GPU utilization sustains 85-95%, indicating well-balanced hardware selection.
Operational Lessons
Running a production AI server on workstation hardware has taught us several lessons that may save time for others considering a similar build.
Thermal management matters. The Z840 has enterprise-grade cooling, but adding a 350W GPU to a chassis designed for workstation GPUs required attention. We run the system in an air-conditioned server closet with supplemental airflow. GPU temperatures stabilize at 72-78 degrees Celsius under sustained inference load, well within the 3090's thermal envelope.
VRAM is the real constraint. The 24GB of VRAM on the 3090 is generous but finite. Running multiple models simultaneously requires careful memory management. Ollama's automatic model unloading helps, but for production workloads, we configured explicit model keep-alive times to ensure high-priority models remain loaded while lower-priority models swap in and out as needed.
Model quantization is a trade-off, not a compromise. Q4 quantized models show minimal quality degradation for most business tasks compared to full-precision versions. The 70B Q4 model fits in 42GB (VRAM plus RAM spillover) versus 140GB at full precision, making it practical on consumer hardware. For our use cases, which include document analysis, chat, code generation, and embeddings, quantized models perform indistinguishably from cloud-hosted full-precision alternatives.
Monitoring is non-negotiable. We run health checks every 5 minutes that verify Ollama responsiveness, GPU memory utilization, temperature, and API latency. These feed into our broader infrastructure monitoring stack. A hung model or VRAM leak that goes undetected for hours degrades service for all downstream applications.
When This Applies
This approach makes sense for businesses with consistent AI inference needs that currently spend $1,000 or more per month on cloud AI APIs. The breakeven math is straightforward: if your monthly cloud AI spend exceeds the amortized hardware cost plus electricity, self-hosting saves money from month one.
It also applies to organizations with data sensitivity requirements. Legal firms processing confidential documents through AI, healthcare organizations analyzing patient records, and defense contractors operating under CMMC requirements all benefit from keeping inference on-premises.
The approach is less suitable for businesses with sporadic AI usage, those requiring the absolute latest models on release day, or organizations without the technical capability or managed services partner to maintain the infrastructure. Cloud inference remains the better choice when usage is unpredictable or when the operational overhead of self-hosting would exceed the cost savings.
For businesses evaluating whether self-hosted AI inference fits their operations, the key question is not whether the hardware can handle it — workstation-class hardware is more than capable. The question is whether your inference volume justifies the fixed investment, and whether your engineering team or infrastructure partner can maintain the system reliably. In most cases where the monthly cloud AI bill exceeds $1,500, the answer to both questions is yes.