Introduction
Industries such as finance, healthcare, and government impose strict data security requirements that public large model APIs cannot meet. Private deployment of large models is a must for these sectors.
Drawing on our experience delivering private large model deployments for over 10 enterprises, this article systematically walks through the 7 key steps.
Step 1: Model Selection
1.1 Comparison of Mainstream Open-Source Models
| Model | Parameters | Chinese Capability | Inference Speed | Open-Source License | Recommended Scenarios |
|---|---|---|---|---|---|
| Qwen2.5-72B | 72B | ★★★★★ | Moderate | Apache 2.0 | Top choice for general purpose |
| Qwen2.5-7B | 7B | ★★★★ | Fast | Apache 2.0 | Lightweight scenarios |
| DeepSeek-V3 | 671B MoE | ★★★★★ | Fast | MIT | When budget is ample |
| ChatGLM4-9B | 9B | ★★★★ | Fast | Apache 2.0 | Conversational scenarios |
| Llama3.1-70B | 70B | ★★★ | Moderate | Llama3 | English-centric use |
| Yi-1.5-34B | 34B | ★★★★ | Relatively fast | Apache 2.0 | Best cost-performance ratio |
1.2 Selection Advice
Step 2: Computing Resource Assessment
2.1 GPU Requirements Reference
| Model | FP16 | INT8 | INT4 |
|---|---|---|---|
| 7B | 1×A100 40G | 1×A10 24G | 1×RTX4090 24G |
| 34B | 2×A100 80G | 1×A100 80G | 1×A100 40G |
| 72B | 4×A100 80G | 2×A100 80G | 2×A100 40G |
2.2 Cost Estimation
| Configuration | Purchase Cost | Monthly Rental | Suitable Scenarios |
|---|---|---|---|
| 1×RTX4090 | ¥15,000 | ¥3,000 | 7B model testing |
| 1×A100 40G | ¥80,000 | ¥15,000 | 7B-34B models |
| 2×A100 80G | ¥250,000 | ¥40,000 | 34B-72B models |
| 4×A100 80G | ¥500,000 | ¥80,000 | 72B+ models |
Step 3: Inference Engine Selection
| Engine | Throughput | Latency | Ease of Use | Recommended Scenarios |
|---|---|---|---|---|
| vLLM | ★★★★★ | ★★★★ | ★★★★ | Preferred for production |
| TGI | ★★★★ | ★★★★ | ★★★★ | When compatibility is key |
| TensorRT-LLM | ★★★★ | ★★★★★ | ★★★ | Latency-sensitive use |
| Ollama | ★★★ | ★★★ | ★★★★★ | Local development and testing |
Our recommendation: Use vLLM for production (highest throughput, active community) and Ollama for development/testing (one-click deployment).
Step 4: Model Quantization
4.1 Quantization Method Comparison
| Method | Accuracy Loss | Speed Increase | Model Size Reduction | Use Case |
|---|---|---|---|---|
| FP16→INT8 (AWQ) | <1% | 2x | 2x | General recommendation |
| FP16→INT4 (GPTQ) | 1%-3% | 3x | 4x | Resource-constrained |
| FP16→INT4 (GGUF) | 2%-5% | 3x | 4x | CPU inference |
4.2 Quantization Effect Reference
Quantization results for Qwen2.5-72B on Chinese benchmarks:
| Quantization Method | C-Eval | Inference Speed (Tokens/s) | GPU Memory Usage |
|---|---|---|---|
| FP16 | 83.5 | 25 | 144GB |
| AWQ-INT8 | 82.8 | 48 | 72GB |
| GPTQ-INT4 | 81.2 | 72 | 40GB |
Step 5: Containerized Deployment
```yaml
docker-compose.yml example
services:
vllm:
image: vllm/vllm-openai:latest
deploy:
resources:
reservations:
devices:
count: 2
command: >
--model Qwen/Qwen2.5-72B-Instruct-AWQ
--quantization awq
--tensor-parallel-size 2
--max-model-len 8192
--gpu-memory-utilization 0.9
ports:
```
Step 6: Performance Optimization
| Optimization | Method | Effect |
|---|---|---|
| Continuous Batching | Dynamic batching | 2-3x throughput increase |
| PagedAttention | Paged VRAM management | 40% better VRAM utilization |
| Prefix Caching | System prompt caching | 50% latency reduction for identical prefixes |
| Speculative Decoding | Small model drafts, large model verifies | 2-3x inference speed boost |
Step 7: Monitoring and Operations
7.1 Key Monitoring Metrics
| Metric | Alert Threshold |
|---|---|
| GPU utilization | >95% sustained for 5 minutes |
| Inference latency P99 | >5 seconds |
| Request failure rate | >1% |
| VRAM usage | >90% |
| Model service availability | <99.9% |
7.2 Operations Strategies
Conclusion
Private deployment is not simply "buy a server and install a model." Selecting the right model, provisioning sufficient computing power, optimizing inference, and establishing solid operations are what make a private large model truly effective. We recommend starting with a 7B model to quickly validate your business use case, then scaling up to a 72B model once feasibility is confirmed.
Interested in a private large model deployment solution? Book a free computing resource assessment