Introduction

Industries such as finance, healthcare, and government impose strict data security requirements that public large model APIs cannot meet. Private deployment of large models is a must for these sectors.

Drawing on our experience delivering private large model deployments for over 10 enterprises, this article systematically walks through the 7 key steps.

Step 1: Model Selection

1.1 Comparison of Mainstream Open-Source Models

Model	Parameters	Chinese Capability	Inference Speed	Open-Source License	Recommended Scenarios
Qwen2.5-72B	72B	★★★★★	Moderate	Apache 2.0	Top choice for general purpose
Qwen2.5-7B	7B	★★★★	Fast	Apache 2.0	Lightweight scenarios
DeepSeek-V3	671B MoE	★★★★★	Fast	MIT	When budget is ample
ChatGLM4-9B	9B	★★★★	Fast	Apache 2.0	Conversational scenarios
Llama3.1-70B	70B	★★★	Moderate	Llama3	English-centric use
Yi-1.5-34B	34B	★★★★	Relatively fast	Apache 2.0	Best cost-performance ratio

1.2 Selection Advice

Prioritize general capability: Qwen2.5-72B

Limited budget: Yi-1.5-34B or Qwen2.5-7B

Inference-heavy scenarios: DeepSeek-V3

Resource-constrained: Quantized version of Qwen2.5-7B

Step 2: Computing Resource Assessment

2.1 GPU Requirements Reference

Model	FP16	INT8	INT4
7B	1×A100 40G	1×A10 24G	1×RTX4090 24G
34B	2×A100 80G	1×A100 80G	1×A100 40G
72B	4×A100 80G	2×A100 80G	2×A100 40G

2.2 Cost Estimation

Configuration	Purchase Cost	Monthly Rental	Suitable Scenarios
1×RTX4090	¥15,000	¥3,000	7B model testing
1×A100 40G	¥80,000	¥15,000	7B-34B models
2×A100 80G	¥250,000	¥40,000	34B-72B models
4×A100 80G	¥500,000	¥80,000	72B+ models

Step 3: Inference Engine Selection

Engine	Throughput	Latency	Ease of Use	Recommended Scenarios
vLLM	★★★★★	★★★★	★★★★	Preferred for production
TGI	★★★★	★★★★	★★★★	When compatibility is key
TensorRT-LLM	★★★★	★★★★★	★★★	Latency-sensitive use
Ollama	★★★	★★★	★★★★★	Local development and testing

Our recommendation: Use vLLM for production (highest throughput, active community) and Ollama for development/testing (one-click deployment).

Step 4: Model Quantization

4.1 Quantization Method Comparison

Method	Accuracy Loss	Speed Increase	Model Size Reduction	Use Case
FP16→INT8 (AWQ)	<1%	2x	2x	General recommendation
FP16→INT4 (GPTQ)	1%-3%	3x	4x	Resource-constrained
FP16→INT4 (GGUF)	2%-5%	3x	4x	CPU inference

4.2 Quantization Effect Reference

Quantization results for Qwen2.5-72B on Chinese benchmarks:

Quantization Method	C-Eval	Inference Speed (Tokens/s)	GPU Memory Usage
FP16	83.5	25	144GB
AWQ-INT8	82.8	48	72GB
GPTQ-INT4	81.2	72	40GB

Step 5: Containerized Deployment

```yaml

docker-compose.yml example

services:

vllm:

image: vllm/vllm-openai:latest

deploy:

resources:

reservations:

devices:

capabilities: [gpu]

count: 2

command: >

--model Qwen/Qwen2.5-72B-Instruct-AWQ

--quantization awq

--tensor-parallel-size 2

--max-model-len 8192

--gpu-memory-utilization 0.9

ports:

"8000:8000"

```

Step 6: Performance Optimization

Optimization	Method	Effect
Continuous Batching	Dynamic batching	2-3x throughput increase
PagedAttention	Paged VRAM management	40% better VRAM utilization
Prefix Caching	System prompt caching	50% latency reduction for identical prefixes
Speculative Decoding	Small model drafts, large model verifies	2-3x inference speed boost

Step 7: Monitoring and Operations

7.1 Key Monitoring Metrics

Metric	Alert Threshold
GPU utilization	>95% sustained for 5 minutes
Inference latency P99	>5 seconds
Request failure rate	>1%
VRAM usage	>90%
Model service availability	<99.9%

7.2 Operations Strategies

Auto-scaling: Dynamically adjust inference instance count based on request volume

Blue-green deployment: Zero-downtime model updates

Canary releases: Route 5% of traffic to the new model first for validation

Log aggregation: Full-chain request tracing

Conclusion

Private deployment is not simply "buy a server and install a model." Selecting the right model, provisioning sufficient computing power, optimizing inference, and establishing solid operations are what make a private large model truly effective. We recommend starting with a 7B model to quickly validate your business use case, then scaling up to a 72B model once feasibility is confirmed.

Interested in a private large model deployment solution? Book a free computing resource assessment