LLM

2025-07-09

Building an LLM service architecture on an internal network

+-----------------------------+
|        Frontend UI          | ←→ users
|        (Open WebUI)         |
+-----------------------------+
               │
               ▼
+-----------------------------+
|         API Gateway         | ←→ auth / rate limiting / multi-model routing
|         (LiteLLM)           |
+-----------------------------+
               │
               ▼
+-----------------------------+
|      LLM Inference Layer    | ←→ multiple instances, horizontally scalable
| (vLLM / TGI / Ollama, etc.) |
+-----------------------------+
               │
               ▼
+-----------------------------+
|         Model Store         | ←→ model files, on local disk or object storage
+-----------------------------+
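Every layer below the UI speaks an OpenAI-compatible API, so the whole stack can be smoke-tested with one request against the gateway. A minimal sketch, assuming the LiteLLM proxy from the sections below is up on 127.0.0.1:4000 with the static key and a model alias from its config (both are placeholders for whatever your config.yaml defines):

# A chat request enters at the gateway; LiteLLM routes it to the matching
# backend from model_list (a local vLLM/Ollama instance or a remote provider)
curl http://127.0.0.1:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-litellm-static-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-v3-250324", "messages": [{"role": "user", "content": "ping"}]}'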

UI:

https://github.com/open-webui/open-webui
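A minimal sketch for bringing the UI up with Docker, following the Open WebUI README; the OPENAI_API_BASE_URL / OPENAI_API_KEY values are assumptions that point it at the LiteLLM gateway described below:

# Run Open WebUI on port 3000 and point it at the gateway's
# OpenAI-compatible endpoint (host and key are placeholders for your setup)
docker run -d --name open-webui -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://172.16.1.166:4000/v1 \
  -e OPENAI_API_KEY=sk-litellm-static-key \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main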

GW:

https://www.litellm.ai/

LLM:

https://github.com/vllm-project/vllm

https://ollama.com/download

https://lmstudio.ai/

vllm

# List models in the local Hugging Face cache
hf cache scan

# Install vLLM (gpt-oss models need a fresh virtual environment; older environments have package conflicts)
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match


# Serve a Hugging Face model (downloads automatically on first run)
vllm serve openai/gpt-oss-20b

# Download the model ahead of time
hf download openai/gpt-oss-20b
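Once the weights are in the local cache, the same serve command works without network access; a sketch, assuming vLLM's default port 8000 (it exposes an OpenAI-compatible API under /v1):

# Serve from the local HF cache only
HF_HUB_OFFLINE=1 vllm serve openai/gpt-oss-20b --port 8000

# Quick check against the OpenAI-compatible endpoint
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "hello"}]}'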


HF environment variables:

# Local directory for the Hugging Face cache; models, tokenizers, datasets,
# etc. are stored here instead of the default ~/.cache/huggingface
export HF_HOME=/mnt/jqsd4_dataware/LLM/model_llm/.cache/huggingface

# Enable accelerated downloads from the Hugging Face Hub (hf_transfer);
# speeds up large files such as model weights
export HF_HUB_ENABLE_HF_TRANSFER=1

# Offline mode: never contact the Hugging Face servers, only use the local cache
export HF_HUB_OFFLINE=1

# Default to CPU (prevents accidental GPU use)
export TRANSFORMERS_NO_CUDA=1

# Mirror endpoint for the Hugging Face API (e.g. hf-mirror.com),
# replacing the default https://huggingface.co
export HF_ENDPOINT=https://hf-mirror.com

# Disable Hugging Face Hub telemetry
export HF_HUB_DISABLE_TELEMETRY=1
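Note that these variables cover two different modes: HF_ENDPOINT and HF_HUB_ENABLE_HF_TRANSFER only matter while downloading, while HF_HUB_OFFLINE=1 disables downloads entirely. One way to keep that explicit is two small profiles to source as needed; a sketch (file names are arbitrary):

# hf-online.sh -- first-time downloads through the mirror
export HF_HOME=/mnt/jqsd4_dataware/LLM/model_llm/.cache/huggingface
export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_ENDPOINT=https://hf-mirror.com

# hf-offline.sh -- serve from the populated cache, no network access
export HF_HOME=/mnt/jqsd4_dataware/LLM/model_llm/.cache/huggingface
export HF_HUB_OFFLINE=1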

ollama

https://ollama.com/library/gpt-oss

#!/bin/sh
export OLLAMA_MODELS="/data/ollama/models"
export OLLAMA_HOST=0.0.0.0
export OLLAMA_ORIGINS=*
export http_proxy=http://127.0.0.1:10808
export https_proxy=http://127.0.0.1:10808
# Start the ollama server (configure the proxy before starting)
ollama serve


# Pull and run gpt-oss-20b via the CLI (do not set the proxy here; the download request is issued by the server)
ollama pull gpt-oss:20b
ollama run gpt-oss:20b
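With OLLAMA_HOST=0.0.0.0 the server is reachable from other intranet hosts; a sketch against Ollama's native API on its default port 11434 (an OpenAI-compatible /v1 endpoint is also available):

# Chat request to the ollama server
curl http://127.0.0.1:11434/api/chat -d '{
  "model": "gpt-oss:20b",
  "messages": [{"role": "user", "content": "hello"}]
}'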


litellm

Start:

litellm --config config.yaml --port 5000

config.yaml:

# Example litellm config.yaml
litellm_settings:
  drop_params: true  # drop request params the backend does not support
  enable_stream: true  # enable streaming responses

model_list:
  - model_name: GLM-4-Flash-250414
    litellm_params:
      api_base: https://open.bigmodel.cn/api/paas/v4/
      api_key: your_api_key_here
      model: openai/GLM-4-Flash-250414
      temperature: 0.7  # sampling temperature

  - model_name: glm-4v-flash
    litellm_params:
      api_base: https://open.bigmodel.cn/api/paas/v4/
      api_key: your_api_key_here
      model: openai/glm-4v-flash
      max_tokens: 150  # maximum output tokens

  - model_name: claude-3-5-haiku-20241022
    litellm_params:
      api_base: https://api.siliconflow.cn/v1
      api_key: your_api_key_here
      model: openai/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

  - model_name: DeepSeek-R1-0528-Qwen3-8B
    litellm_params:
      api_base: https://api.siliconflow.cn/v1
      api_key: your_api_key_here
      model: openai/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

  # - model_name: claude-3-5-haiku-20241022
  #   for claude-code, this model name can be used here (aliased to any backend)

  - model_name: deepseek-v3-250324
    litellm_params:
      model: volcengine/deepseek-v3-250324
      api_key: your_api_key_here

  - model_name: deepseek-r1-250528
    litellm_params:
      model: volcengine/deepseek-r1-250528
      api_key: your_api_key_here

  - model_name: gemma-3-4b-it
    litellm_params:
      model: hosted_vllm/google/gemma-3-4b-it
      api_base: http://172.16.1.166:8000/v1
      provider: google

  - model_name: gemini-2.5-flash
    litellm_params:
      model: gemini/gemini-2.5-flash-preview-04-17
      api_key: your_api_key_here

  - model_name: kimi-dev-72b
    litellm_params:
      model: openrouter/moonshotai/kimi-dev-72b:free
      api_key: your_api_key_here

  - model_name: deepseek-r1t2-chimera
    litellm_params:
      model: openrouter/tngtech/deepseek-r1t2-chimera:free
      api_key: your_api_key_here

  - model_name: deepseek-r1
    litellm_params:
      model: openrouter/deepseek/deepseek-r1:free
      api_key: your_api_key_here

  - model_name: mistral-small-3.2-24b-instruct-2506:free
    litellm_params:
      model: openrouter/mistralai/mistral-small-3.2-24b-instruct-2506:free
      api_key: your_api_key_here
      api_base: https://openrouter.ai/api/v1
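With the proxy running, every model_name above becomes a routable alias behind one endpoint; a sketch, assuming the --port 5000 from the start command (the Continue and Claude Code sections below assume LiteLLM's default port 4000):

# List the aliases the proxy exposes
curl http://127.0.0.1:5000/v1/models

# Call one of the configured aliases
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-v3-250324", "messages": [{"role": "user", "content": "hello"}]}'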

FIM

Continue, Copilot, Fitten Code, Amazon Q

Continue plugin

~/.continue/config.yaml

name: Local Assistant
version: 1.0.0
schema: v1
models:
  - name: GLM-4-Flash-250414
    provider: openai
    model: GLM-4-Flash-250414
    apiKey: your_api_key_here
    apiBase: https://open.bigmodel.cn/api/paas/v4/

  - name: a100/GLM-4-Flash-250414
    provider: openai
    model: GLM-4-Flash-250414
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:4000/v1

  - name: a100/gemma-3-4b-it
    provider: openai
    model: google/gemma-3-4b-it
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:8000/v1
    roles:
      - autocomplete
      - chat
      - embed
    max-tokens: 3000

  - name: a100/volcengine/deepseek-v3-250324
    provider: deepseek
    model: deepseek-v3-250324
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:4000/v1

  - name: a100/volcengine/DeepSeek-R1-0528-Qwen3-8B
    provider: openai
    model: DeepSeek-R1-0528-Qwen3-8B
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:4000/v1

  - name: siliconflow/DeepSeek-R1-Distill-Qwen-7B
    provider: siliconflow
    model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
    apiKey: your_api_key_here
    apiBase: https://api.siliconflow.cn/v1

  - name: FIM/siliconflow/Qwen2.5-Coder-7B-Instruct
    provider: siliconflow
    model: Qwen/Qwen2.5-Coder-7B-Instruct
    apiKey: your_api_key_here
    apiBase: https://api.siliconflow.cn/v1
    roles:
      - autocomplete

  - name: Gemini 2.5 Pro Experimental
    provider: gemini
    model: gemini-2.5-pro-exp-03-25
    apiKey: your_api_key_here

  - name: a100-vllm/gemma-3-4b-it
    provider: openai
    model: google/gemma-3-4b-it
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:8000/v1
    chatOptions:
      baseSystemMessage: >-
        <important_rules>
          You are in chat mode.

          If the user asks to make changes to files, offer that they can use the Apply Button on the code block, or switch to Agent Mode to make the suggested updates automatically.
          If needed, concisely explain to the user that they can switch to agent mode using the Mode Selector dropdown, and provide no other details.

          Always include the language and file name in the info string when you write code blocks.
          If you are editing "src/main.py" for example, your code block should start with '```python src/main.py'

          When addressing code modification requests, present a concise code snippet that
          emphasizes only the necessary changes and uses abbreviated placeholders for
          unmodified sections. For example:

          ```language /path/to/file
          // ... existing code ...

          

          // ... existing code ...

          

          // ... rest of code ...
          ```

          In existing files, you should always restate the function or class that the snippet belongs to:

          ```language /path/to/file
          // ... existing code ...

          function exampleFunction() {
            // ... existing code ...

            

            // ... rest of function ...
          }

          // ... rest of code ...
          ```

          Since users have access to their complete file, they prefer reading only the
          relevant modifications. It's perfectly acceptable to omit unmodified portions
          at the beginning, middle, or end of files using these "lazy" comments. Only
          provide the complete file when explicitly requested. Include a concise explanation
          of changes unless the user specifically asks for code only.

        </important_rules>

        You are an expert software developer. You give helpful and concise
        responses.
        Answer questions in Chinese.
    roles:
      - autocomplete
      - chat
      - embed

  - name: openrouter/deepseek-r1t2-chimera:free
    provider: openai
    model: tngtech/deepseek-r1t2-chimera:free
    apiBase: https://openrouter.ai/api/v1
    apiKey: your_api_key_here

  - name: openrouter/mistral-small-3.2-24b-instruct-2506:free
    provider: openai
    model: mistralai/mistral-small-3.2-24b-instruct-2506:free
    apiKey: your_api_key_here
    apiBase: https://openrouter.ai/api/v1
    roles:
      - chat
      - embed
    max-tokens: 3000

context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase

code

claude code

https://docs.anthropic.com/zh-CN/docs/claude-code/settings

Switching to litellm

~/.claude/settings.json

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:4000",
    "ANTHROPIC_AUTH_TOKEN": "sk-litellm-static-key",
    "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "16384"
  },
  "model": "claude-3-5-haiku-20241022"
}
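Claude Code speaks the Anthropic Messages API, so the proxy must answer on /v1/messages (recent LiteLLM versions expose this route). A hedged sketch for verifying the endpoint before switching, reusing the token and model alias from the settings above:

curl http://127.0.0.1:4000/v1/messages \
  -H "Authorization: Bearer sk-litellm-static-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "claude-3-5-haiku-20241022", "max_tokens": 64, "messages": [{"role": "user", "content": "ping"}]}'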

Using claude code on Windows

Claude Code 1.0.51 and later run natively on Windows.

Add a user environment variable:

CLAUDE_CODE_GIT_BASH_PATH=C:\Program Files\Git\bin\bash.exe
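This can be set from a terminal instead of the system dialog; a sketch using cmd's setx (writes to the user environment, takes effect in new shells):

setx CLAUDE_CODE_GIT_BASH_PATH "C:\Program Files\Git\bin\bash.exe"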

Install

Neo4j Desktop download:

https://neo4j.com/download/