
A Beginner's Guide: Deploying Large Language Models Locally on a Mac with vLLM

Published: 2025-04-01 05:25:32

Author: 算法狗


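In this article, I explore vLLM, a widely used tool for running large language models (LLMs) on your own computer. I'll walk you through installing vLLM on a Mac and show how to run an LLM through a REST API.

What Is vLLM?

vLLM is an open-source library created by researchers at the University of California, Berkeley. It aims to make LLM inference fast, efficient, and easy to use. The name vLLM stands for "Virtual Large Language Model". Its main goal is to improve the serving and deployment of LLMs, especially where performance matters, such as real-time applications, APIs, and research experiments.

Although vLLM is optimized for GPUs (using CUDA), where features such as PagedAttention deliver the best performance, it also supports CPU-based inference and serving.

As of this writing, vLLM does not support Metal Performance Shaders (MPS) on macOS, so on a Mac it can only run on the CPU.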

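Installation

This article assumes you already have Anaconda installed on your Mac. First, create a virtual environment named usingvllm with Jupyter installed in it: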

$ conda create -n usingvllm jupyter

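You'll be prompted to confirm the packages to install. Type y and press Enter.

Once the virtual environment has been created, activate it: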

$ conda activate usingvllm

With the virtual environment active, clone the vLLM Git repository:

$ git clone https://github.com/vllm-project/vllm.git

You also need to install two packages:

$ pip install torch torchvision

Next, change into the vllm directory created by the clone and install vLLM by running:

$ cd vllm
$ pip install -e .

Testing vLLM

When the installation finishes, start Jupyter Notebook:

$ jupyter notebook

Now you can use vLLM to load a model, for example tiiuae/falcon-7b-instruct:

from vllm import LLM, SamplingParams

# Load the model (downloaded from Hugging Face on first use)
llm = LLM(model="tiiuae/falcon-7b-instruct")

# Sampling settings: temperature and a cap on generated tokens
sampling_params = SamplingParams(temperature=0.9,
                                 max_tokens=200)

prompt = "What is quantum computing?"
output = llm.generate(prompt, sampling_params)

print(output)                     # the full list of RequestOutput objects
print(output[0].outputs[0].text)  # only the generated text

On Apple Silicon Macs, vLLM uses the float16 data type.

The tiiuae/falcon-7b-instruct model will be downloaded to your local machine.

vLLM downloads Hugging Face models to the default ~/.cache/huggingface/hub folder.
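To check what has already been downloaded, you can list the cache folder. The following is a minimal sketch, assuming the usual Hugging Face hub layout in which each model sits in a directory named models--<organization>--<name>. The helper name cached_models is made up for this illustration.

```python
from pathlib import Path


def cached_models(cache_root: Path) -> list[str]:
    """Return the model IDs found in a Hugging Face hub cache folder.

    Assumes each model is stored in a directory named
    "models--<organization>--<name>".
    """
    if not cache_root.is_dir():
        return []
    prefix = "models--"
    model_ids = []
    for entry in sorted(cache_root.iterdir()):
        if entry.is_dir() and entry.name.startswith(prefix):
            # "models--tiiuae--falcon-7b-instruct" -> "tiiuae/falcon-7b-instruct"
            model_ids.append(entry.name[len(prefix):].replace("--", "/"))
    return model_ids


# Default cache location on macOS and Linux.
default_cache = Path.home() / ".cache" / "huggingface" / "hub"
print(cached_models(default_cache))
```

If the list is empty, nothing has been downloaded yet (or the cache lives somewhere else on your machine).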

Once the model has been downloaded and loaded, you'll see output similar to the following:

[RequestOutput(request_id=0, prompt='What is quantum computing?', prompt_token_ids=[1562, 304, 17235, 15260, 42], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\nQuantum computing is the practice of computing using quantum-mechanical phenomena, such as superposition and entanglement. Unlike classical computing, which relies on binary numbers and bits, quantum computing uses quantum-mechanical phenomena to generate information. This can offer incredible performance improvements in certain types of operations that were not possible before, such as cryptography.', token_ids=(193, 26847, 381, 15260, 304, 248, 3100, 275, 15260, 1241, 17235, 24, 1275, 38804, 25849, 23, 963, 345, 2014, 9073, 273, 833, 642, 1977, 25, 15752, 15613, 15260, 23, 585, 21408, 313, 16529, 4169, 273, 12344, 23, 17235, 15260, 4004, 17235, 24, 1275, 38804, 25849, 271, 7420, 1150, 25, 735, 418, 1880, 7309, 2644, 10421, 272, 1714, 3059, 275, 5342, 325, 646, 416, 1777, 996, 23, 963, 345, 6295, 3842, 25, 11), cumulative_logprob=None, logprobs=None, finish_reason=stop, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1742731247.0472178, last_token_time=1742731264.39699, first_scheduled_time=1742731247.054006, first_token_time=1742731251.790441, time_in_queue=0.0067882537841796875, finished_time=1742731264.3972821, scheduler_time=0.005645580822601914, model_forward_time=None, model_execute_time=None, spec_token_acceptance_counts=[0]), lora_request=None, num_cached_tokens=0, multi_modal_placeholders={})]

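Quantum computing is the practice of computing using quantum-mechanical phenomena, such as superposition and entanglement. Unlike classical computing, which relies on binary numbers and bits, quantum computing uses quantum-mechanical phenomena to generate information. This can offer incredible performance improvements in certain types of operations that were not possible before, such as cryptography.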
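The metrics field in the raw output holds Unix timestamps, so simple latency figures fall out by subtraction. Here is a rough sketch using the values printed above. The helper name summarize_timing is invented for this example, and the figure of 72 generated tokens comes from counting the entries in token_ids.

```python
def summarize_timing(arrival, first_token, finished, num_tokens):
    """Derive simple latency figures from RequestMetrics timestamps (seconds)."""
    time_to_first_token = first_token - arrival
    total_time = finished - arrival
    return {
        "time_to_first_token": time_to_first_token,
        "total_time": total_time,
        # Average rate over the whole request, including prompt processing.
        "tokens_per_second": num_tokens / total_time,
    }


# Timestamps copied from the RequestMetrics in the output above.
timing = summarize_timing(
    arrival=1742731247.0472178,
    first_token=1742731251.790441,
    finished=1742731264.3972821,
    num_tokens=72,
)
for name, value in timing.items():
    print(f"{name}: {value:.2f}")
```

For this run, that works out to roughly 4.7 seconds until the first token and about 17.4 seconds in total, or around 4 tokens per second on the CPU.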

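Running vLLM as a Server

You can also run vLLM as a server that accepts client connections through a REST API. To do so, go to the vllm folder and serve a specific model with the following commands: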

$ cd vllm
$ vllm serve tiiuae/falcon-7b-instruct

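Note that to serve several models you need to run the command once per model, each instance on its own port. By default, the server listens on port 8000. To change the port, use the --port option; for example, vllm serve tiiuae/falcon-7b-instruct --port 5002 makes the server listen on port 5002.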
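If you do serve several models, it helps to generate the commands so that no two share a port. This is only a sketch: the helper serve_commands is hypothetical, and the second model ID is a placeholder rather than a real model.

```python
import shlex


def serve_commands(models, first_port=8000):
    """Build one `vllm serve` command per model, each on its own port."""
    commands = []
    for offset, model in enumerate(models):
        args = ["vllm", "serve", model, "--port", str(first_port + offset)]
        commands.append(shlex.join(args))
    return commands


# The second model ID is only a placeholder for illustration.
for command in serve_commands(["tiiuae/falcon-7b-instruct", "your-org/another-model"]):
    print(command)
```

Run each printed command in its own terminal. Keep in mind that every instance loads a full model into memory.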

If you run into the error "RuntimeError: Failed to infer device type", add the --device cpu option to the command:

$ vllm serve tiiuae/falcon-7b-instruct --device cpu

Once the vLLM server is running, you can test it with curl from another terminal:

$ curl http://localhost:8000/docs

The command returns the server's documentation page as HTML:

<!DOCTYPE html>
<html>
<head>
<link type="text/css" rel="stylesheet" href="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui.css">
<link rel="shortcut icon" href="https://fastapi.tiangolo.com/img/favicon.png">
<title>FastAPI - Swagger UI</title>
</head>
<body>
<div id="swagger-ui">
</div>
<script src="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui-bundle.js"></script>
<!-- `SwaggerUIBundle` is now available on the page -->
<script>
const ui = SwaggerUIBundle({
    url: '/openapi.json',
    "dom_id": "#swagger-ui",
    "layout": "BaseLayout",
    "deepLinking": true,
    "showExtensions": true,
    "showCommonExtensions": true,
    oauth2RedirectUrl: window.location.origin + '/docs/oauth2-redirect',
    presets: [
        SwaggerUIBundle.presets.apis,
        SwaggerUIBundle.SwaggerUIStandalonePreset
    ],
})
</script>
</body>
</html>

You can also view the documentation in a browser at http://localhost:8000/docs.
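Loading a 7B model on the CPU can take a while, so the server may not answer right away. A script that talks to it can poll the /docs page first. The sketch below uses only the Python standard library; the function name wait_for_server and the timeout values are arbitrary choices for this example.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout=300.0, interval=2.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server is not accepting connections yet; retry after a pause.
            pass
        time.sleep(interval)
    return False
```

Calling wait_for_server("http://localhost:8000/docs") returns True as soon as the page is reachable, and False if the server has not come up within the timeout.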

To see details about the model being served through the REST API, use the following endpoint:

$ curl http://localhost:8000/v1/models

You'll see details like the following:

{
    "object": "list",
    "data": [
        {
            "id": "tiiuae/falcon-7b-instruct",
            "object": "model",
            "created": 1742723998,
            "owned_by": "vllm",
            "root": "tiiuae/falcon-7b-instruct",
            "parent": null,
            "max_model_len": 2048,
            "permission": [
                {
                    "id": "modelperm-dc669ee015d54b5497dd0ae2cb64fdad",
                    "object": "model_permission",
                    "created": 1742723998,
                    "allow_create_engine": false,
                    "allow_sampling": true,
                    "allow_logprobs": true,
                    "allow_search_indices": false,
                    "allow_view": true,
                    "allow_fine_tuning": false,
                    "organization": "*",
                    "group": null,
                    "is_blocking": false
                }
            ]
        }
    ]
}
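In a script you will usually want just a couple of fields from this response, such as the model ID and max_model_len (which appears to be the model's context length in tokens). Below is a small sketch that parses a trimmed copy of the JSON above; the helper name summarize_models is an assumption of this example.

```python
import json

# Trimmed copy of the /v1/models response shown above.
response_text = """
{
    "object": "list",
    "data": [
        {
            "id": "tiiuae/falcon-7b-instruct",
            "object": "model",
            "owned_by": "vllm",
            "max_model_len": 2048
        }
    ]
}
"""


def summarize_models(payload):
    """Map each served model ID to its max_model_len value."""
    return {model["id"]: model.get("max_model_len") for model in payload["data"]}


print(summarize_models(json.loads(response_text)))
```

The same function works on the live response once you have fetched it and decoded it with json.loads.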

To generate a response through the REST API, use the following curl command:

$ curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "tiiuae/falcon-7b-instruct",
        "prompt": "Where is Singapore located?",
        "max_tokens": 500,
        "temperature": 0.7
      }'

You should receive a response similar to the following:

{
    "id": "cmpl-ea189bc39565488386a48459b5695577",
    "object": "text_completion",
    "created": 1742724348,
    "model": "tiiuae/falcon-7b-instruct",
    "choices": [
        {
            "index": 0,
            "text": "\nSingapore is located in Southeast Asia, on the tip of the Malay Peninsula.",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 5,
        "total_tokens": 22,
        "completion_tokens": 17,
        "prompt_tokens_details": null
    }
}
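The same request can be sent from Python without any third-party packages. The following sketch mirrors the curl call above using urllib. The function names are invented for this example, and complete() only works while the vLLM server is running on port 8000.

```python
import json
import urllib.request


def build_completion_request(prompt, model="tiiuae/falcon-7b-instruct",
                             base_url="http://localhost:8000/v1",
                             max_tokens=500, temperature=0.7):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Return the generated text of the first choice, stripped of whitespace."""
    return response_body["choices"][0]["text"].strip()


def complete(prompt):
    """Send the request to a running vLLM server and return the generated text."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request) as response:
        return extract_text(json.load(response))
```

With the server up, complete("Where is Singapore located?") should return text along the lines of the response shown above.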

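Accessing the REST API with the OpenAI Class

To access the REST API exposed by vLLM, you can use the OpenAI class from the openai Python library, which provides a convenient interface to vLLM's OpenAI-compatible endpoints. Because the vLLM server mirrors the structure of the OpenAI API, you can use this class to send requests for text generation, embeddings, or other supported features.

The following example shows how to ask the tiiuae/falcon-7b-instruct model a question: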

from openai import OpenAI

openai_api_key = "anything here"   # you can put anything here
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="tiiuae/falcon-7b-instruct",
    prompt="Where is Singapore located?",
    max_tokens=500,
)

# Print only the generated text
print(completion.choices[0].text.strip())

You should receive a response similar to the following:

Singapore is located in Southeast Asia on the island of Singapore.

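Summary

In this article, I showed how to set up vLLM on a Mac and use it to host a large language model locally. Although vLLM does not currently support MPS, it still runs on the CPU. You also saw how to run vLLM as a server that exposes a REST API, so that developers can talk to it through the OpenAI class. What has your own experience with vLLM been?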