LLM
2025-07-09
Intranet LLM Service Architecture
+-----------------------------+
|         Frontend UI         |  ←→ users
|        (Open WebUI)         |
+-----------------------------+
               │
               ▼
+-----------------------------+
|         API Gateway         |  ←→ auth / rate limiting / multi-model routing
|          (LiteLLM)          |
+-----------------------------+
               │
               ▼
+-----------------------------+
|     LLM Inference Layer     |  ←→ multiple instances, scales horizontally
|  (vLLM / TGI / Ollama ...)  |
+-----------------------------+
               │
               ▼
+-----------------------------+
|         Model Store         |  ←→ model files, on local disk or object storage
+-----------------------------+
UI:
https://github.com/open-webui/open-webui
GW:
https://github.com/BerriAI/litellm
LLM:
https://github.com/vllm-project/vllm
vllm
# List models in the local cache
hf cache scan
# Install vLLM (note: gpt-oss models require a fresh virtual environment; old environments have package conflicts)
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
--index-strategy unsafe-best-match
# Serve a Hugging Face model directly (downloads automatically if not cached)
vllm serve openai/gpt-oss-20b
# Download the model only
hf download openai/gpt-oss-20b
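By default, vllm serve exposes an OpenAI-compatible API on port 8000. A quick smoke test from another shell (host and port are the defaults; adjust if you changed them):
# Send a single chat completion request to the local vLLM server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Hello"}]}'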
HF environment variables:
# Local directory for the Hugging Face cache; all models, tokenizers, datasets, etc. are stored here instead of the default ~/.cache/huggingface
export HF_HOME=/mnt/jqsd4_dataware/LLM/model_llm/.cache/huggingface
# Enable accelerated downloads from the Hugging Face Hub (requires the hf_transfer package); speeds up large files such as model weights
export HF_HUB_ENABLE_HF_TRANSFER=1
# Offline mode: never contact the Hugging Face servers, use only the local cache
export HF_HUB_OFFLINE=1
# Default to CPU (prevents accidental GPU use)
export TRANSFORMERS_NO_CUDA=1
# Mirror endpoint for the Hugging Face API (e.g., hf-mirror.com or another accelerator), replacing the default https://huggingface.co
export HF_ENDPOINT=https://hf-mirror.com
# Disable Hugging Face Hub telemetry
export HF_HUB_DISABLE_TELEMETRY=1
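With these set, a fully offline launch from the pre-populated cache looks like this (a sketch; model name and paths follow the examples above):
# Serve strictly from the local cache; fails fast instead of reaching the network
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-20b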
ollama
https://ollama.com/library/gpt-oss
#!/bin/sh
export OLLAMA_MODELS="/data/ollama/models"
export OLLAMA_HOST=0.0.0.0
export OLLAMA_ORIGINS="*"
export http_proxy=http://127.0.0.1:10808
export https_proxy=http://127.0.0.1:10808
# Start the Ollama server (configure the proxy before starting)
ollama serve
# Pull and run gpt-oss-20b via the CLI (do not set the proxy here; download requests are issued by the server)
ollama pull gpt-oss:20b
ollama run gpt-oss:20b
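Since OLLAMA_HOST is 0.0.0.0, the API is also reachable from other machines on the default port 11434. A minimal check, assuming the pull above succeeded:
# Non-streaming chat request against the Ollama HTTP API
curl http://localhost:11434/api/chat \
  -d '{"model": "gpt-oss:20b", "messages": [{"role": "user", "content": "Hello"}], "stream": false}'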
litellm
Startup (note: the Claude Code and Continue examples below assume LiteLLM's default port 4000, so adjust --port accordingly):
litellm --config config.yaml --port 5000
config.yaml configuration:
# Example litellm config.yaml
litellm_settings:
  drop_params: true     # drop parameters the target model does not support
  enable_stream: true   # enable streaming output
model_list:
  - model_name: GLM-4-Flash-250414
    litellm_params:
      api_base: https://open.bigmodel.cn/api/paas/v4/
      api_key: your_api_key_here
      model: openai/GLM-4-Flash-250414
      temperature: 0.7  # model temperature
  - model_name: glm-4v-flash
    litellm_params:
      api_base: https://open.bigmodel.cn/api/paas/v4/
      api_key: your_api_key_here
      model: openai/glm-4v-flash
      max_tokens: 150   # maximum output tokens
  - model_name: claude-3-5-haiku-20241022
    litellm_params:
      api_base: https://api.siliconflow.cn/v1
      api_key: your_api_key_here
      model: openai/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  - model_name: DeepSeek-R1-0528-Qwen3-8B
    litellm_params:
      api_base: https://api.siliconflow.cn/v1
      api_key: your_api_key_here
      model: openai/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
  # - model_name: claude-3-5-haiku-20241022
  #   for claude-code, this model name can be used here
  - model_name: deepseek-v3-250324
    litellm_params:
      model: volcengine/deepseek-v3-250324
      api_key: your_api_key_here
  - model_name: deepseek-r1-250528
    litellm_params:
      model: volcengine/deepseek-r1-250528
      api_key: your_api_key_here
  - model_name: gemma-3-4b-it
    litellm_params:
      model: hosted_vllm/google/gemma-3-4b-it
      api_base: http://172.16.1.166:8000/v1
      provider: google
  - model_name: gemini-2.5-flash
    litellm_params:
      model: gemini/gemini-2.5-flash-preview-04-17
      api_key: your_api_key_here
  - model_name: kimi-dev-72b
    litellm_params:
      model: openrouter/moonshotai/kimi-dev-72b:free
      api_key: your_api_key_here
  - model_name: deepseek-r1t2-chimera
    litellm_params:
      model: openrouter/tngtech/deepseek-r1t2-chimera:free
      api_key: your_api_key_here
  - model_name: deepseek-r1
    litellm_params:
      model: openrouter/deepseek/deepseek-r1:free
      api_key: your_api_key_here
  - model_name: mistral-small-3.2-24b-instruct-2506:free
    litellm_params:
      model: openrouter/mistralai/mistral-small-3.2-24b-instruct-2506:free
      api_key: your_api_key_here
      api_base: https://openrouter.ai/api/v1
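With the proxy running, any OpenAI-compatible client can hit it using the model_name aliases above. A quick check (assuming port 4000; the Authorization header only matters if a master key is configured, and the static key here mirrors the Claude Code section below):
# Route a request through LiteLLM to whichever backend the alias maps to
curl http://127.0.0.1:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-litellm-static-key" \
  -d '{"model": "deepseek-v3-250324", "messages": [{"role": "user", "content": "ping"}]}'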
FIM (fill-in-the-middle code completion)
Continue, Copilot, Fitten Code, Amazon Q
Continue plugin
~/.continue/config.yaml
name: Local Assistant
version: 1.0.0
schema: v1
models:
  - name: GLM-4-Flash-250414
    provider: openai
    model: GLM-4-Flash-250414
    apiKey: your_api_key_here
    apiBase: https://open.bigmodel.cn/api/paas/v4/
  - name: a100/GLM-4-Flash-250414
    provider: openai
    model: GLM-4-Flash-250414
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:4000/v1
  - name: a100/gemma-3-4b-it
    provider: openai
    model: google/gemma-3-4b-it
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:8000/v1
    roles:
      - autocomplete
      - chat
      - embed
    max-tokens: 3000
  - name: a100/Volcengine/deepseek-v3-250324
    provider: deepseek
    model: deepseek-v3-250324
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:4000/v1
  - name: a100/Volcengine/DeepSeek-R1-0528-Qwen3-8B
    provider: openai
    model: DeepSeek-R1-0528-Qwen3-8B
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:4000/v1
  - name: SiliconFlow/DeepSeek-R1-Distill-Qwen-7B
    provider: siliconflow
    model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
    apiKey: your_api_key_here
    apiBase: https://api.siliconflow.cn/v1
  - name: FIM/SiliconFlow/Qwen2.5-Coder-7B-Instruct
    provider: siliconflow
    model: Qwen/Qwen2.5-Coder-7B-Instruct
    apiKey: your_api_key_here
    apiBase: https://api.siliconflow.cn/v1
    roles:
      - autocomplete
  - name: Gemini 2.5 Pro Experimental
    provider: gemini
    model: gemini-2.5-pro-exp-03-25
    apiKey: your_api_key_here
  - name: a100-vllm/gemma-3-4b-it
    provider: openai
    model: google/gemma-3-4b-it
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:8000/v1
    chatOptions:
      baseSystemMessage: >-
        <important_rules>
          You are in chat mode.
          If the user asks to make changes to files, offer that they can use the Apply Button on the code block, or switch to Agent Mode to make the suggested updates automatically.
          If needed, concisely explain to the user that they can switch to agent mode using the Mode Selector dropdown, and provide no other details.
          Always include the language and file name in the info string when you write code blocks.
          If you are editing "src/main.py" for example, your code block should start with '```python src/main.py'
          When addressing code modification requests, present a concise code snippet that
          emphasizes only the necessary changes and uses abbreviated placeholders for
          unmodified sections. For example:
          ```language /path/to/file
          // ... existing code ...
          // ... existing code ...
          // ... rest of code ...
          ```
          In existing files, you should always restate the function or class that the snippet belongs to:
          ```language /path/to/file
          // ... existing code ...
          function exampleFunction() {
            // ... existing code ...
            // ... rest of function ...
          }
          // ... rest of code ...
          ```
          Since users have access to their complete file, they prefer reading only the
          relevant modifications. It's perfectly acceptable to omit unmodified portions
          at the beginning, middle, or end of files using these "lazy" comments. Only
          provide the complete file when explicitly requested. Include a concise explanation
          of changes unless the user specifically asks for code only.
        </important_rules>
        You are an expert software developer. You give helpful and concise
        responses.
        Answer in Chinese.
    roles:
      - autocomplete
      - chat
      - embed
  - name: openrouter/deepseek-r1t2-chimera:free
    provider: openai
    model: tngtech/deepseek-r1t2-chimera:free
    apiBase: https://openrouter.ai/api/v1
    apiKey: your_api_key_here
  - name: openrouter/mistral-small-3.2-24b-instruct-2506:free
    provider: openai
    model: mistralai/mistral-small-3.2-24b-instruct-2506:free
    apiKey: your_api_key_here
    apiBase: https://openrouter.ai/api/v1
    roles:
      - chat
      - embed
    max-tokens: 3000
context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
code
claude code
https://docs.anthropic.com/zh-CN/docs/claude-code/settings
Switching to LiteLLM (the claude-3-5-haiku-20241022 alias defined in the LiteLLM config above is what actually serves these requests):
~/.claude/settings.json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:4000",
    "ANTHROPIC_AUTH_TOKEN": "sk-litellm-static-key",
    "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "16384"
  },
  "model": "claude-3-5-haiku-20241022"
}
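The same routing can be tried for a single session without editing settings.json by setting the variables inline before launch (a sketch; the token must match whatever key the LiteLLM proxy expects):
# Point one Claude Code session at the local LiteLLM proxy
ANTHROPIC_BASE_URL=http://127.0.0.1:4000 \
ANTHROPIC_AUTH_TOKEN=sk-litellm-static-key \
claude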
Using Claude Code on Windows
Claude Code 1.0.51 and later support Windows.
Add a user environment variable:
CLAUDE_CODE_GIT_BASH_PATH=C:\Program Files\Git\bin\bash.exe
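This can also be set from a terminal instead of the system dialog (setx persists it to the user environment; open a new terminal afterwards for it to take effect):
# Persist the variable for the current user
setx CLAUDE_CODE_GIT_BASH_PATH "C:\Program Files\Git\bin\bash.exe"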
Installation
Neo4j Desktop download link