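RAG (retrieval-augmented generation) combines a retrieval model with a generation model: it retrieves relevant external knowledge to support text generation, which improves the accuracy and reliability of large language models (LLMs).
RAG is particularly well suited to knowledge-intensive scenarios that need continually updated knowledge and to domain-specific applications. By bringing in external information sources, it mitigates LLM weaknesses such as missing domain knowledge, inaccurate information, and fabricated content. This challenge aims to explore the limits of RAG and encourages developers, researchers, and enthusiasts to use it to solve practical problems and advance the field of artificial intelligence.
Participants must design and implement a RAG model that, given a question, retrieves relevant information from the knowledge base and then combines the retrieved information with the question itself to generate an accurate, comprehensive, and authoritative answer.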
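1. Data description
The dataset may also include some unlabeled text, so participants need to use retrieval augmentation to find the relevant information and generate answers. This requires not only strong retrieval but also the ability to generate accurate, coherent, context-appropriate text.
The test set consists of simulated user questions, which participants must answer using the question and the corpus. Note that some of the questions cannot be answered, so participants need to design a suitable strategy for declining to answer.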
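One possible refusal strategy, sketched here under assumptions the task description does not state, is to decline whenever the best retrieval or reranking score falls below a threshold. The names `should_refuse` and `REFUSAL_TEXT`, the `score` key, and the threshold value are all hypothetical and would need tuning against the actual retriever or reranker output.

```python
from typing import Dict, List

# Hypothetical refusal message ("cannot answer"); the wording the
# competition expects is not given in the source.
REFUSAL_TEXT = "无法回答"


def should_refuse(docs: List[Dict], threshold: float = 0.3) -> bool:
    """Return True when no retrieved document looks relevant enough.

    Assumes each document is a dict with a numeric 'score' key where a
    higher value means more relevant. The default threshold is only an
    illustrative value, not a tuned one.
    """
    if not docs:
        return True
    best_score = max(doc["score"] for doc in docs)
    return best_score < threshold


def answer_or_refuse(docs: List[Dict], generate, threshold: float = 0.3) -> str:
    """Call `generate(docs)` only when the evidence passes the threshold."""
    if should_refuse(docs, threshold):
        return REFUSAL_TEXT
    return generate(docs)
```

A score threshold is only one option; prompting the generator to say when the context lacks the answer is another, and the two can be combined.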
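• corpus.txt.zip: the corpus, one news article per line
• test_question.csv: the test questions
Answers to the test questions are scored by character overlap ratio, with a maximum score of 1.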
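The exact scoring formula is not given here. As a rough local check, one plausible reading of "character overlap ratio" is a character-level F1 over the multiset of characters in the prediction and the reference. The function below is a sketch under that assumption (the name `char_overlap_f1` is hypothetical) and may not match the official scorer.

```python
from collections import Counter


def char_overlap_f1(prediction: str, reference: str) -> float:
    """Character-level F1 between a predicted and a reference answer.

    This is an assumed stand-in for the competition metric, not the
    official implementation. The score lies in [0, 1].
    """
    if not prediction or not reference:
        # Two empty strings count as a perfect match, otherwise no overlap.
        return float(prediction == reference)
    # Multiset intersection: each character counts at most as often as it
    # appears in both strings.
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)
```

Because the measure ignores character order, it rewards answers that reuse wording from the corpus, which is worth keeping in mind when designing the prompt.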
Retrieval corpus
Text length
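Before choosing a chunk size, it can help to check how long the corpus entries are. The helper below is a minimal sketch; the name `length_stats` is hypothetical, and it measures length in characters, which is an assumption about what "text length" means here.

```python
import statistics
from typing import Dict, Iterable


def length_stats(lines: Iterable[str]) -> Dict[str, float]:
    """Summarize character lengths of non-empty lines (one article per line)."""
    lengths = [len(line.strip()) for line in lines if line.strip()]
    if not lengths:
        raise ValueError("no non-empty lines to summarize")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "mean": statistics.mean(lengths),
        "max": max(lengths),
    }
```

Passing an open file handle for the unzipped corpus (for example `open("corpus.txt", encoding="utf-8")`, with the path adjusted to the local setup) gives the statistics for the whole corpus.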
import pickle
import pandas as pd
from tqdm import tqdm
from gomate.modules.document.chunk import TextChunker
from gomate.modules.document.txt_parser import TextParser
from gomate.modules.document.utils import PROJECT_BASE
from gomate.modules.generator.llm import GLM4Chat
from gomate.modules.reranker.bge_reranker import BgeRerankerConfig, BgeReranker
from gomate.modules.retrieval.bm25s_retriever import BM25RetrieverConfig
from gomate.modules.retrieval.dense_retriever import DenseRetrieverConfig
from gomate.modules.retrieval.hybrid_retriever import HybridRetriever, HybridRetrieverConfig
def generate_chunks():
    """Parse the corpus, chunk each paragraph, and pickle the result."""
    tp = TextParser()
    tc = TextChunker()
    paragraphs = tp.parse(r'H:/2024-Xfyun-RAG/data/corpus.txt', encoding="utf-8")
    print(len(paragraphs))
    chunks = []
    for content in tqdm(paragraphs):
        # Chunk one paragraph at a time with chunk_size=1024
        chunk = tc.chunk_sentences([content], chunk_size=1024)
        chunks.append(chunk)
    with open(f'{PROJECT_BASE}/output/chunks.pkl', 'wb') as f:
        pickle.dump(chunks, f)
if __name__ == '__main__':
    # test_path="H:/2024-Xfyun-RAG/data/test_question.csv"
    # embedding_model_path="H:/pretrained_models/mteb/bge-m3"
    # llm_model_path="H:/pretrained_models/llm/Qwen2-1.5B-Instruct"
    test_path = "/data/users/searchgpt/yq/GoMate_dev/data/competitions/xunfei/test_question.csv"
    embedding_model_path = "/data/users/searchgpt/pretrained_models/bge-large-zh-v1.5"
    llm_model_path = "/data/users/searchgpt/pretrained_models/glm-4-9b-chat"

    # ==================== File parsing + chunking ====================
    generate_chunks()
    with open(f'{PROJECT_BASE}/output/chunks.pkl', 'rb') as f:
        chunks = pickle.load(f)
    # Flatten the per-paragraph chunk lists into one corpus list
    corpus = []
    for chunk in chunks:
        corpus.extend(chunk)

    # ==================== Retriever configuration ====================
    # BM25 and Dense Retriever configurations
    bm25_config = BM25RetrieverConfig(
        method='lucene',
        index_path='indexs/description_bm25.index',
        k1=1.6,
        b=0.7
    )
    bm25_config.validate()
    print(bm25_config.log_config())
    dense_config = DenseRetrieverConfig(
        model_name_or_path=embedding_model_path,
        dim=1024,
        index_path='indexs/dense_cache'
    )
    config_info = dense_config.log_config()
    print(config_info)

    # Hybrid Retriever configuration
    # BM25 and dense scores are not on the same scale, so combining the two is recommended
    hybrid_config = HybridRetrieverConfig(
        bm25_config=bm25_config,
        dense_config=dense_config,
        bm25_weight=0.7,  # weight of the BM25 retrieval results
        dense_weight=0.3  # weight of the dense retrieval results
    )
    hybrid_retriever = HybridRetriever(config=hybrid_config)
    # Build the index (uncomment on the first run)
    # hybrid_retriever.build_from_texts(corpus)
    # Save the index
    # hybrid_retriever.save_index()
    # Load the previously saved index
    hybrid_retriever.load_index()

    # ==================== Retrieval test ====================
    query = "新冠肺炎疫情"  # "COVID-19 pandemic"
    results = hybrid_retriever.retrieve(query, top_k=5)
    # Output results
    for result in results:
        print(f"Text: {result['text']}, Score: {result['score']}")

    # ==================== Reranker configuration ====================
    reranker_config = BgeRerankerConfig(
        model_name_or_path="/data/users/searchgpt/pretrained_models/bge-reranker-large"
    )
    bge_reranker = BgeReranker(reranker_config)

    # ==================== Generator configuration ====================
    # qwen_chat = QwenChat(llm_model_path)
    glm4_chat = GLM4Chat(llm_model_path)

    # ==================== Retrieval QA ====================
    test = pd.read_csv(test_path)
    answers = []
    for question in tqdm(test['question'], total=len(test)):
        search_docs = hybrid_retriever.retrieve(question)
        search_docs = bge_reranker.rerank(
            query=question,
            documents=[doc['text'] for doc in search_docs]
        )
        # print(search_docs)
        # Label each passage as 信息[idx] ("Information[idx]") and join with newlines
        content = '\n'.join([f'信息[{idx}]:' + doc['text'] for idx, doc in enumerate(search_docs)])
        answer = glm4_chat.chat(prompt=question, content=content)
        answers.append(answer[0])
        print(question)
        print(answer[0])
        print("************************************\n")
    test['answer'] = answers
    test[['answer']].to_csv(f'{PROJECT_BASE}/output/gomate_baseline.csv', index=False)
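The hybrid configuration above weights BM25 at 0.7 and dense retrieval at 0.3, and its comment notes that the two scores are not on the same scale. How `HybridRetriever` combines them internally is not shown in this article. The sketch below illustrates one common approach, min-max normalization followed by a weighted sum. The function names and the dict-of-scores input format are assumptions made for illustration and are not GoMate's API.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so different retrievers become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so treat every document as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    bm25: Dict[str, float],
    dense: Dict[str, float],
    bm25_weight: float = 0.7,
    dense_weight: float = 0.3,
) -> Dict[str, float]:
    """Weighted sum of normalized scores; a missing document contributes 0."""
    bm25_n = min_max_normalize(bm25)
    dense_n = min_max_normalize(dense)
    return {
        doc_id: bm25_weight * bm25_n.get(doc_id, 0.0)
        + dense_weight * dense_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(dense_n)
    }
```

Sorting the fused dict by value in descending order gives the merged ranking that would then be passed to the reranker.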