LLM

2025-07-09

Intranet LLM service architecture

+-----------------------------+
|        Frontend UI          | ←→ users
|        (Open WebUI)         |
+-----------------------------+
               │
               ▼
+-----------------------------+
|         API Gateway         | ←→ auth / rate limiting / multi-model routing
|         (LiteLLM)           |
+-----------------------------+
               │
               ▼
+-----------------------------+
|      LLM Inference Layer    | ←→ multiple instances, horizontally scalable
| (vLLM / TGI / Ollama etc.)  |
+-----------------------------+
               │
               ▼
+-----------------------------+
|         Model Store         | ←→ model files; local disk or object storage
+-----------------------------+

https://github.com/open-webui/open-webui

https://www.litellm.ai/

https://github.com/vllm-project/vllm
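
The three self-hosted layers above map naturally onto containers. A minimal docker-compose sketch, assuming the projects' published images and default ports (the host `./models` volume plays the Model Store role; names and ports are placeholders to adjust):

```yaml
services:
  open-webui:                      # Frontend UI
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      # Point the UI at the gateway's OpenAI-compatible API
      - OPENAI_API_BASE_URL=http://litellm:4000/v1

  litellm:                         # API Gateway
    image: ghcr.io/berriai/litellm:main-latest
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    volumes:
      - ./config.yaml:/app/config.yaml

  vllm:                            # LLM Inference Layer
    image: vllm/vllm-openai:latest
    command: ["--model", "google/gemma-3-4b-it"]
    volumes:
      - ./models:/root/.cache/huggingface   # Model Store: local disk
```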

litellm

Startup

litellm --config config.yaml --port 5000
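
To keep the gateway running in the background, the launch command can be wrapped in a service unit. A sketch assuming `litellm` is on PATH and the config lives at `/opt/litellm/config.yaml` (both paths are placeholders):

```
[Unit]
Description=LiteLLM API gateway
After=network.target

[Service]
ExecStart=/usr/bin/env litellm --config /opt/litellm/config.yaml --port 5000
Restart=on-failure

[Install]
WantedBy=multi-user.target
```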

config.yaml configuration:

# Example litellm configuration file: config.yaml
litellm_settings:
  drop_params: true  # drop parameters the upstream provider does not support
  enable_stream: true  # enable streaming output

model_list:
  - model_name: GLM-4-Flash-250414
    litellm_params:
      api_base: https://open.bigmodel.cn/api/paas/v4/
      api_key: your_api_key_here
      model: openai/GLM-4-Flash-250414
      temperature: 0.7  # model temperature

  - model_name: glm-4v-flash
    litellm_params:
      api_base: https://open.bigmodel.cn/api/paas/v4/
      api_key: your_api_key_here
      model: openai/glm-4v-flash
      max_tokens: 150  # maximum output tokens

  - model_name: claude-3-5-haiku-20241022
    litellm_params:
      api_base: https://api.siliconflow.cn/v1
      api_key: your_api_key_here
      model: openai/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

  - model_name: DeepSeek-R1-0528-Qwen3-8B
    litellm_params:
      api_base: https://api.siliconflow.cn/v1
      api_key: your_api_key_here
      model: openai/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

  # - model_name: claude-3-5-haiku-20241022
  #   for claude-code, this model name can be used here

  - model_name: deepseek-v3-250324
    litellm_params:
      model: volcengine/deepseek-v3-250324
      api_key: your_api_key_here

  - model_name: deepseek-r1-250528
    litellm_params:
      model: volcengine/deepseek-r1-250528
      api_key: your_api_key_here

  - model_name: gemma-3-4b-it
    litellm_params:
      model: hosted_vllm/google/gemma-3-4b-it
      api_base: http://172.16.1.166:8000/v1
      provider: google

  - model_name: gemini-2.5-flash
    litellm_params:
      model: gemini/gemini-2.5-flash-preview-04-17
      api_key: your_api_key_here

  - model_name: kimi-dev-72b
    litellm_params:
      model: openrouter/moonshotai/kimi-dev-72b:free
      api_key: your_api_key_here

  - model_name: deepseek-r1t2-chimera
    litellm_params:
      model: openrouter/tngtech/deepseek-r1t2-chimera:free
      api_key: your_api_key_here

  - model_name: deepseek-r1
    litellm_params:
      model: openrouter/deepseek/deepseek-r1:free
      api_key: your_api_key_here

  - model_name: mistral-small-3.2-24b-instruct-2506:free
    litellm_params:
      model: openrouter/mistralai/mistral-small-3.2-24b-instruct-2506:free
      api_key: your_api_key_here
      api_base: https://openrouter.ai/api/v1
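
Each entry above publishes a `model_name` alias and forwards requests to the provider-prefixed upstream `model`. A toy sketch of that aliasing idea (the dict mirrors a few entries from the config; this is an illustration, not LiteLLM's internal code):

```python
# Toy illustration of model_name -> upstream routing (not LiteLLM internals).
ROUTES = {
    "GLM-4-Flash-250414": "openai/GLM-4-Flash-250414",
    # This alias lets a client that insists on an Anthropic model name
    # (e.g. claude code) be served by a different backend:
    "claude-3-5-haiku-20241022": "openai/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "deepseek-v3-250324": "volcengine/deepseek-v3-250324",
}

def resolve(model_name: str) -> str:
    """Return the upstream model string for a requested alias."""
    try:
        return ROUTES[model_name]
    except KeyError:
        raise ValueError(f"unknown model: {model_name}") from None

print(resolve("claude-3-5-haiku-20241022"))
# -> openai/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
```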

FIM (fill-in-the-middle)

continue, copilot, fitten code, Amazon Q
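
FIM completion sends the code before and after the cursor as separate segments rather than a plain prefix. Against a raw completion endpoint, the prompt is assembled with the model's FIM special tokens; a sketch using the token names published for Qwen2.5-Coder (the autocomplete model configured below):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a raw FIM prompt with Qwen2.5-Coder's special tokens.

    The model generates the text that belongs between prefix and
    suffix, stopping at its end-of-FIM marker.
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt(
    "def add(a, b):\n    return ",   # code before the cursor
    "\n\nprint(add(1, 2))",          # code after the cursor
)
print(prompt)
```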

continue plugin

~/.continue/config.yaml

name: Local Assistant
version: 1.0.0
schema: v1
models:
  - name: GLM-4-Flash-250414
    provider: openai
    model: GLM-4-Flash-250414
    apiKey: your_api_key_here
    apiBase: https://open.bigmodel.cn/api/paas/v4/

  - name: a100/GLM-4-Flash-250414
    provider: openai
    model: GLM-4-Flash-250414
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:4000/v1

  - name: a100/gemma-3-4b-it
    provider: openai
    model: google/gemma-3-4b-it
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:8000/v1
    roles:
      - autocomplete
      - chat
      - embed
    max-tokens: 3000

  - name: a100/火山引擎/deepseek-v3-250324
    provider: deepseek
    model: deepseek-v3-250324
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:4000/v1

  - name: a100/火山引擎/DeepSeek-R1-0528-Qwen3-8B
    provider: openai
    model: DeepSeek-R1-0528-Qwen3-8B
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:4000/v1

  - name: 硅基流动/DeepSeek-R1-Distill-Qwen-7B
    provider: siliconflow
    model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
    apiKey: your_api_key_here
    apiBase: https://api.siliconflow.cn/v1

  - name: FIM/硅基流动/Qwen2.5-Coder-7B-Instruct
    provider: siliconflow
    model: Qwen/Qwen2.5-Coder-7B-Instruct
    apiKey: your_api_key_here
    apiBase: https://api.siliconflow.cn/v1
    roles:
      - autocomplete

  - name: Gemini 2.5 Pro Experimental
    provider: gemini
    model: gemini-2.5-pro-exp-03-25
    apiKey: your_api_key_here

  - name: a100-vllm/gemma-3-4b-it
    provider: openai
    model: google/gemma-3-4b-it
    apiKey: your_api_key_here
    apiBase: http://172.16.1.166:8000/v1
    chatOptions:
      baseSystemMessage: >-
        <important_rules>
          You are in chat mode.

          If the user asks to make changes to files, offer that they can use the Apply Button on the code block, or switch to Agent Mode to make the suggested updates automatically.
          If needed, concisely explain to the user that they can switch to Agent Mode using the Mode Selector dropdown, and provide no other details.

          Always include the language and file name in the info string when you write code blocks.
          If you are editing "src/main.py" for example, your code block should start with '```python src/main.py'

          When addressing code modification requests, present a concise code snippet that
          emphasizes only the necessary changes and uses abbreviated placeholders for
          unmodified sections. For example:

          ```language /path/to/file
          // ... existing code ...

          

          // ... existing code ...

          

          // ... rest of code ...
          ```

          In existing files, you should always restate the function or class that the snippet belongs to:

          ```language /path/to/file
          // ... existing code ...

          function exampleFunction() {
            // ... existing code ...

            

            // ... rest of function ...
          }

          // ... rest of code ...
          ```

          Since users have access to their complete file, they prefer reading only the
          relevant modifications. It's perfectly acceptable to omit unmodified portions
          at the beginning, middle, or end of files using these "lazy" comments. Only
          provide the complete file when explicitly requested. Include a concise explanation
          of changes unless the user specifically asks for code only.

        </important_rules>

        You are an expert software developer. You give helpful and concise
        responses.
        Answer questions in Chinese.
    roles:
      - autocomplete
      - chat
      - embed

  - name: openrouter/deepseek-r1t2-chimera:free
    provider: openai
    model: tngtech/deepseek-r1t2-chimera:free
    apiBase: https://openrouter.ai/api/v1
    apiKey: your_api_key_here

  - name: openrouter/mistral-small-3.2-24b-instruct-2506:free
    provider: openai
    model: mistralai/mistral-small-3.2-24b-instruct-2506:free
    apiKey: your_api_key_here
    apiBase: https://openrouter.ai/api/v1
    roles:
      - chat
      - embed
    max-tokens: 3000

context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase

code

claude code

https://docs.anthropic.com/zh-CN/docs/claude-code/settings

Switching to litellm

~/.claude/settings.json

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:4000",
    "ANTHROPIC_AUTH_TOKEN": "sk-litellm-static-key",
    "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "16384"
  },
  "model": "claude-3-5-haiku-20241022"
}
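
For a one-off session the same values can be exported as environment variables instead of editing settings.json (the values mirror the JSON above):

```shell
# Point claude code at the local LiteLLM gateway for this shell only
export ANTHROPIC_BASE_URL="http://127.0.0.1:4000"
export ANTHROPIC_AUTH_TOKEN="sk-litellm-static-key"
export CLAUDE_CODE_MAX_OUTPUT_TOKENS="16384"
```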

Using claude code on Windows

Claude Code version 1.0.51 and later supports Windows.

Add a user environment variable:

CLAUDE_CODE_GIT_BASH_PATH=C:\Program Files\Git\bin\bash.exe
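
From a Windows command prompt this can also be set persistently with `setx` (adjust the path if Git is installed elsewhere):

```
setx CLAUDE_CODE_GIT_BASH_PATH "C:\Program Files\Git\bin\bash.exe"
```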