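[Text2SQL] Hrida-T2SQL-3B, a small but powerful LLM, is here! Hands-on deployment and demo: single-table queries, aggregation, and multi-table joins handled with ease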

Published: 2024-08-24 11:28:53

Author: z先生的备忘录

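Today's post introduces Hrida-T2SQL-3B, an open-source Text2SQL model that HridaAI fine-tuned on the Phi-3 architecture. It converts a natural-language question into the corresponding SQL, which can then be run against a range of databases, making data analysis and maintenance more efficient.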

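This post walks through deploying Hrida-T2SQL-3B hands-on and shows it at work: it generates SQL for single-table queries and aggregation, and in the cases tested here it also got a more complex multi-table join right. Let's get started.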

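Contents

Introducing Hrida-T2SQL-3B-V0.1

Hrida-T2SQL-3B-V0.1 performance benchmark
Prompt template for Hrida-T2SQL-3B-V0.1

Hands-on: deploying T2SQL-3B-V0.1 step by step and demonstrating text2sql

Preparing the table schemas
Setting up the runtime environment
Loading the T2SQL-3B-V0.1 weights and printing the network
Single-table SQL query from text
Single-table SQL aggregation from text
Multi-table join SQL from text

References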

Introducing Hrida-T2SQL-3B-V0.1

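HridaAI has released Hrida-T2SQL-3B-V0.1, its first open-source Text-to-SQL model built on the Phi-3 architecture. It converts natural-language questions into SQL queries, with the aim of making database interaction more accurate and efficient.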

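According to HridaAI, the highlights of Hrida-T2SQL-3B-V0.1 are:

Accuracy and efficiency: the model converts natural-language questions into SQL queries accurately and quickly.
A new way to interact with databases: it changes how users work with a database, making data analysis and exploration more intuitive.
Easy integration: it can be integrated into production environments to automate database queries and simplify data-processing workflows.
Closing the skills gap: in a data-driven world, being able to query a database efficiently matters. By letting users ask questions in natural language and get SQL back, the model compensates for limited SQL skills.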

Three versions of the model have been open-sourced so far:

Hrida-T2SQL-3B-V0.1 performance benchmark

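To check how Hrida-T2SQL-3B-V0.1 performs in real-world use, the HridaAI team ran a benchmark of 50 curated questions that were not part of the training data. The questions span a range of SQL query structures, complexity levels, and database interaction scenarios, so that each one poses a different challenge to the model.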

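To keep the evaluation fair and objective, the team had another large language model, Mistral 7B, judge the outputs and score each one on a scale of 1 to 10. In the reported comparison with other leading models, Hrida-T2SQL-3B-V0.1 averaged 9.26 and, by that measure, scored higher than the other models it was compared with.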
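The scoring step can be pictured with a short sketch. The source does not show how the judge's replies were parsed or averaged, so the function names (`parse_score`, `average_score`) and the reply format are my assumptions, and the replies below are made up for illustration; they are not benchmark data.

```python
import re


def parse_score(judge_reply: str) -> int:
    """Pull the first integer from 1 to 10 out of a judge model's reply."""
    match = re.search(r"\b(10|[1-9])\b", judge_reply)
    if match is None:
        raise ValueError(f"no 1-10 score found in: {judge_reply!r}")
    return int(match.group(1))


def average_score(judge_replies) -> float:
    """Average the parsed scores, rounded to two decimals."""
    scores = [parse_score(reply) for reply in judge_replies]
    return round(sum(scores) / len(scores), 2)


# Made-up judge replies, purely to exercise the functions.
toy_replies = ["Score: 9", "8/10", "I would rate this 10."]
print(average_score(toy_replies))
```

An average computed this way is only as reliable as the judge model, which is worth keeping in mind when reading a figure such as 9.26.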

Prompt template for Hrida-T2SQL-3B-V0.1

### Instruction: 
Provide the system prompt.
### Dialect:
Specify the SQL dialect (e.g., MySQL, PostgreSQL, SQL Server, etc.).
### Context: 
Provide the database schema including table names, column names, and data types.
### Input: 
User's query.
### Response:
Expected SQL query output based on the input and context.

The prompt is made up of five sections: Instruction, Dialect, Context, Input, and Response.
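To avoid retyping the template for every question, the five sections can be assembled by a small helper. This is a minimal sketch: `build_prompt` and its parameter names are my own, not part of any HridaAI API, and the example schema and question are placeholders.

```python
def build_prompt(instruction: str, dialect: str, context: str, question: str) -> str:
    """Assemble a prompt that follows the five-section template."""
    sections = [
        ("Instruction", instruction),
        ("Dialect", dialect),
        ("Context", context),
        ("Input", question),
    ]
    body = "\n".join(f"### {name}:\n{text.strip()}" for name, text in sections)
    # The Response section is left empty so that the model fills it in.
    return f"{body}\n### Response:\n"


prompt = build_prompt(
    instruction="Answer to the query will be in the form of an SQL query.",
    dialect="MySQL",
    context="CREATE TABLE Departments (DepartmentID INT PRIMARY KEY, DepartmentName VARCHAR(100));",
    question="Write a SQL query to list all department names.",
)
print(prompt)
```

The returned string can be used as the `content` of the user message in the inference code later in this post.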

So how well does it actually work? Next I deploy Hrida-T2SQL-3B-V0.1 hands-on so you can see its text2sql ability for yourself.

Hands-on: deploying T2SQL-3B-V0.1 step by step and demonstrating text2sql

Preparing the table schemas

Suppose we have two tables, an employee table and a department table, with the following columns:

-- Department table (created first so that the foreign key below can reference it)
CREATE TABLE Departments (
    DepartmentID INT PRIMARY KEY,
    DepartmentName VARCHAR(100),
    Location VARCHAR(100)
);
-- Employee table
CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    City VARCHAR(50),
    Age INT,
    DepartmentID INT,
    Salary DECIMAL(10, 2),
    DateHired DATE,
    Active BOOLEAN,
    FOREIGN KEY (DepartmentID) REFERENCES Departments(DepartmentID)
);
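To try generated SQL against real tables without setting up a database server, the two tables can be created in an in-memory SQLite database. This is a sketch under assumptions: SQLite stands in for whatever dialect you target, `make_test_db` is a name I made up, and the rows are invented sample data. Departments is created first because Employees references it.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Departments (
    DepartmentID INT PRIMARY KEY,
    DepartmentName VARCHAR(100),
    Location VARCHAR(100)
);
CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    City VARCHAR(50),
    Age INT,
    DepartmentID INT,
    Salary DECIMAL(10, 2),
    DateHired DATE,
    Active BOOLEAN,
    FOREIGN KEY (DepartmentID) REFERENCES Departments(DepartmentID)
);
"""


def make_test_db() -> sqlite3.Connection:
    """Create an in-memory database with both tables and a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO Departments VALUES (?, ?, ?)",
        [(1, "Engineering", "成都"), (2, "Sales", "北京")],
    )
    conn.executemany(
        "INSERT INTO Employees VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        [
            (1, "Li", "Wei", "成都", 30, 1, 15000.00, "2020-03-01", 1),
            (2, "Wang", "Fang", "重庆", 45, 2, 12000.00, "2018-07-15", 1),
            (3, "Zhang", "Min", "北京", 28, 2, 11000.00, "2021-01-10", 0),
        ],
    )
    conn.commit()
    return conn


conn = make_test_db()
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

Any SQL the model produces can then be passed to `conn.execute(...)` to see what it returns on this sample data.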

Setting up the runtime environment

import warnings

import accelerate
import torch
import transformers

# Silence library warnings to keep the output readable
warnings.filterwarnings('ignore')

# Print the library versions used in this walkthrough
print(transformers.__version__, torch.__version__, accelerate.__version__)
# 4.44.0 2.4.0 0.33.0
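If you only want to check which versions are installed without importing the heavy libraries themselves, the standard library's `importlib.metadata` can report them. A minimal sketch; `installed_versions` is a helper name I made up.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions(["transformers", "torch", "accelerate"]))
```

A `None` entry means the package still needs to be installed before the code in the next section will run.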

Loading the T2SQL-3B-V0.1 weights and printing the network

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define the model and tokenizer
model_id = "HridaAI/Hrida-T2SQL-3B-V0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Load the weights in half precision, move them to the GPU, and switch to inference mode
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, trust_remote_code=True).cuda().eval()
# Print the network structure
print(model)

Network structure of the model:

Single-table SQL query from text

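Here I ask for SQL that finds the employees in the employee table who are active, aged between 22 and 50, and living in '重庆' or '成都'. The request is passed in using the parts of the prompt template described earlier. The inference code is below.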

# Define the context and prompt
prompt = """
Answer to the query will be in the form of an SQL query.
### Context: CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    City VARCHAR(50),
    Age INT,
    DepartmentID INT,
    Salary DECIMAL(10, 2),
    DateHired DATE,
    Active BOOLEAN,
    FOREIGN KEY (DepartmentID) REFERENCES Departments(DepartmentID)
); 
### Input: Write a SQL query to select all active employees who are between 22 and 50 years old and are located in either '重庆' or '成都'.
### Response:
"""
# Prepare the input
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
# Generate the output
outputs = model.generate(inputs, max_length=300)
outputs = outputs[:, inputs.shape[1]:]
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The model's SQL output: the generated SQL is correct, and the model also understood the column types properly.
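When the decoded text is consumed by a script rather than read in a notebook, it may need a little cleanup before it can be executed. The sketch below assumes the reply might wrap the query in a Markdown code fence or follow it with an explanation; that is an assumption about typical LLM output, not documented behaviour of Hrida-T2SQL-3B-V0.1, and `extract_sql` is a name I made up.

```python
import re

# Three backticks, built at runtime so this listing contains no literal fence.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:[A-Za-z]+\n)?(.*?)" + FENCE, flags=re.DOTALL)


def extract_sql(model_output: str) -> str:
    """Return the first SQL statement found in a model reply."""
    text = model_output.strip()
    # If the reply wraps the query in a code fence, keep only the fenced part.
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1).strip()
    # Keep everything up to and including the first semicolon, if there is one.
    end = text.find(";")
    if end != -1:
        text = text[: end + 1]
    return text.strip()


reply = "SELECT City FROM Employees; This query returns every employee's city."
print(extract_sql(reply))
```

Cutting at the first semicolon is a deliberate simplification: it would truncate a query whose string literal contains a semicolon, so a real pipeline would want a proper SQL parser.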

Single-table SQL aggregation from text

Now I want to count the employees in each city, along with the maximum and minimum employee age in each. Here is the model run:

# Define the context and prompt
prompt = """
Answer to the query will be in the form of an SQL query.
### Context: CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    City VARCHAR(50),
    Age INT,
    DepartmentID INT,
    Salary DECIMAL(10, 2),
    DateHired DATE,
    Active BOOLEAN,
    FOREIGN KEY (DepartmentID) REFERENCES Departments(DepartmentID)
); 

### Input: Write a SQL query to find out how many employees are in each city, as well as the maximum and minimum employee ages.
### Response:
"""
# Prepare the input
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")

# Generate the output
outputs = model.generate(inputs, max_length=300)
outputs = outputs[:, inputs.shape[1]:]
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The model's SQL output:
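The generated SQL is not reproduced in this text. As a reference point, here is a hand-written query of the shape the request calls for, run on a trimmed in-memory SQLite table. Both the query and the rows are my own illustration, not the model's actual output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed version of the Employees table: only the columns this query touches.
conn.execute(
    "CREATE TABLE Employees (EmployeeID INT PRIMARY KEY, City VARCHAR(50), Age INT)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [(1, "成都", 30), (2, "成都", 41), (3, "重庆", 25)],  # invented rows
)

# Hand-written reference query: one row per city with count, max age, min age.
REFERENCE_SQL = """
SELECT City,
       COUNT(*) AS EmployeeCount,
       MAX(Age) AS MaxAge,
       MIN(Age) AS MinAge
FROM Employees
GROUP BY City;
"""

rows = conn.execute(REFERENCE_SQL).fetchall()
for row in rows:
    print(row)
```

A generated query can be judged correct for this request if it groups by `City` and returns the same three aggregates per group.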

Multi-table join SQL from text

How would we query all active employees aged 22 to 50 who are located in '重庆' or '成都' and whose departments are also located in '重庆' or '成都'? Let's have Hrida-T2SQL-3B-V0.1 write the SQL for us.

# Define the context and prompt
prompt = """
Answer to the query will be in the form of an SQL query.
### Context: CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    City VARCHAR(50),
    Age INT,
    DepartmentID INT,
    Salary DECIMAL(10, 2),
    DateHired DATE,
    Active BOOLEAN,
    FOREIGN KEY (DepartmentID) REFERENCES Departments(DepartmentID)
); 

CREATE TABLE Departments (
    DepartmentID INT PRIMARY KEY,
    DepartmentName VARCHAR(100),
    Location VARCHAR(100)
); 
### Input: Write a SQL query to select all active employees who are between 22 and 50 years old and are located in either '重庆' or '成都' and whose departments are located in '重庆' or '成都'.
### Response:
"""
# Prepare the input
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")

# Generate the output
outputs = model.generate(inputs, max_length=1024)
outputs = outputs[:, inputs.shape[1]:]
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

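The model's SQL output: correct again. Although Hrida-T2SQL-3B-V0.1 has only 3B parameters, in these tests it produced accurate SQL for single-table queries and aggregation, and it also got the multi-table join right.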
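Judging correctness by eye does not scale, so it can help to check generated SQL mechanically before running it. One option, sketched here under assumptions, is to have SQLite compile the statement with `EXPLAIN` against an empty copy of the schema, which surfaces syntax errors and unknown tables or columns without touching real data. `is_valid_sql` is a name I made up, the join query is hand-written rather than the model's output, and SQLite may reject syntax that is valid only in another dialect.

```python
import sqlite3
from typing import Tuple

SCHEMA = """
CREATE TABLE Departments (
    DepartmentID INT PRIMARY KEY,
    DepartmentName VARCHAR(100),
    Location VARCHAR(100)
);
CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    City VARCHAR(50),
    Age INT,
    DepartmentID INT,
    Salary DECIMAL(10, 2),
    DateHired DATE,
    Active BOOLEAN,
    FOREIGN KEY (DepartmentID) REFERENCES Departments(DepartmentID)
);
"""


def is_valid_sql(sql: str, schema: str = SCHEMA) -> Tuple[bool, str]:
    """Check that a single SQL statement compiles against the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        # EXPLAIN makes SQLite compile the statement, so syntax errors and
        # unknown tables or columns are reported here.
        conn.execute("EXPLAIN " + sql)
        return True, ""
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


# Hand-written join of the kind requested above (not the model's output).
JOIN_SQL = """
SELECT e.*
FROM Employees AS e
JOIN Departments AS d ON e.DepartmentID = d.DepartmentID
WHERE e.Active = 1
  AND e.Age BETWEEN 22 AND 50
  AND e.City IN ('重庆', '成都')
  AND d.Location IN ('重庆', '成都');
"""

print(is_valid_sql(JOIN_SQL))
print(is_valid_sql("SELECT e.Salry FROM Employees AS e;"))
```

A check like this catches malformed output, but it cannot tell whether a valid query answers the question that was asked, so it complements manual review rather than replacing it.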
