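In this article we introduce a text augmentation technique that uses additional question generation to improve document retrieval in a vector database. By generating questions related to each text fragment and adding them to the index alongside the original text, the technique augments the standard retrieval process and increases the chance of finding relevant documents, which can then serve as context for generative question answering.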
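The idea can be illustrated with a toy sketch that needs no vector database. It is not the article's implementation: word-set (Jaccard) overlap stands in for embedding similarity, the question is hand-written rather than generated by an LLM, and the names `Entry`, `build_index`, and `retrieve` are hypothetical. Each generated question is indexed next to the original fragment and points back to the same parent text, so a query phrased as a question can match the question entry and still return the full context.

```python
import re
from dataclasses import dataclass


@dataclass
class Entry:
    content: str      # text that gets matched against the query
    kind: str         # "ORIGINAL" fragment or "AUGMENTED" question
    parent_text: str  # text returned as context when this entry wins


def tokens(text: str) -> set:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"\w+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def build_index(fragment: str, questions: list) -> list:
    """Index the fragment plus its questions, all pointing to the same parent text."""
    entries = [Entry(fragment, "ORIGINAL", fragment)]
    entries += [Entry(q, "AUGMENTED", fragment) for q in questions]
    return entries


def retrieve(index: list, query: str) -> Entry:
    """Return the single best-matching entry (k = 1)."""
    return max(index, key=lambda e: similarity(e.content, query))


fragment = "The Amazon rainforest absorbs large amounts of carbon dioxide each year."
# Hand-written stand-in for an LLM-generated question (assumption for illustration)
questions = ["How much carbon dioxide does the Amazon rainforest absorb?"]

index = build_index(fragment, questions)
best = retrieve(index, "How much carbon dioxide does the rainforest absorb?")
print(best.kind)         # which entry matched
print(best.parent_text)  # context handed to the answering model
```

In this toy case the query shares more words with the question entry than with the fragment itself, so the question entry wins and its parent text is returned as context.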
Implementation steps
from enum import Enum


class QuestionGeneration(Enum):
    """
    Enum class to specify the level of question generation for document processing.

    Attributes:
        DOCUMENT_LEVEL (int): Represents question generation at the entire document level.
        FRAGMENT_LEVEL (int): Represents question generation at the individual text fragment level.
    """
    DOCUMENT_LEVEL = 1
    FRAGMENT_LEVEL = 2
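The code in this article refers to several module-level constants (`QUESTION_GENERATION`, `QUESTIONS_PER_DOCUMENT`, and the `*_MAX_TOKENS` / `*_OVERLAP_TOKENS` pairs) that are not shown. A minimal configuration sketch could look like the following; every value here is an illustrative assumption, not a setting confirmed by the article, and should be tuned to the corpus and model in use.

```python
from enum import Enum


class QuestionGeneration(Enum):
    DOCUMENT_LEVEL = 1
    FRAGMENT_LEVEL = 2


# All values below are illustrative assumptions, not taken from the article.
DOCUMENT_MAX_TOKENS = 4000       # size of each top-level document chunk
DOCUMENT_OVERLAP_TOKENS = 100    # tokens shared between adjacent documents
FRAGMENT_MAX_TOKENS = 128        # size of each indexed fragment
FRAGMENT_OVERLAP_TOKENS = 16     # tokens shared between adjacent fragments
QUESTIONS_PER_DOCUMENT = 40      # minimum number of questions requested per LLM call
QUESTION_GENERATION = QuestionGeneration.DOCUMENT_LEVEL

# Sanity checks: an overlap must be smaller than the chunk it belongs to
assert DOCUMENT_OVERLAP_TOKENS < DOCUMENT_MAX_TOKENS
assert FRAGMENT_OVERLAP_TOKENS < FRAGMENT_MAX_TOKENS <= DOCUMENT_MAX_TOKENS
```

Document-level generation makes one LLM call per document, while fragment-level generation makes one call per fragment, so the second option costs more but ties each question to a narrower piece of text.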
Implementation
Question generation
from typing import List

from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# QuestionList, clean_and_filter_questions and QUESTIONS_PER_DOCUMENT
# are defined elsewhere in the module.


def generate_questions(text: str) -> List[str]:
    """
    Generates a list of questions based on the provided text using OpenAI.

    Args:
        text (str): The context data from which questions are generated.

    Returns:
        List[str]: A list of unique, filtered questions.
    """
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = PromptTemplate(
        input_variables=["context", "num_questions"],
        template="Using the context data: {context}\n\nGenerate a list of at least {num_questions} "
                 "possible questions that can be asked about this context. Ensure the questions are "
                 "directly answerable within the context and do not include any answers or headers. "
                 "Separate the questions with a new line character."
    )
    # Ask the model to return its answer as a QuestionList object
    chain = prompt | llm.with_structured_output(QuestionList)
    input_data = {"context": text, "num_questions": QUESTIONS_PER_DOCUMENT}
    result = chain.invoke(input_data)

    # Extract the list of questions from the QuestionList object
    questions = result.question_list

    filtered_questions = clean_and_filter_questions(questions)
    # Deduplicate; note that a set does not preserve the original order
    return list(set(filtered_questions))
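`generate_questions` relies on a helper, `clean_and_filter_questions`, whose body is not shown in the article. The sketch below is one plausible version, written under the assumption that the helper strips list numbering and discards anything that is not phrased as a question; the actual helper may behave differently. `unique_in_order` is a hypothetical addition that deduplicates while keeping first-seen order, in case a stable order is preferred over `list(set(...))`.

```python
import re
from typing import List


def clean_and_filter_questions(questions: List[str]) -> List[str]:
    """Strip list numbering and keep only entries that end with a question mark."""
    cleaned = []
    for question in questions:
        # Remove leading numbering such as "1. " or "2) "
        text = re.sub(r"^\s*\d+[.)]\s*", "", question).strip()
        if text.endswith("?"):
            cleaned.append(text)
    return cleaned


def unique_in_order(items: List[str]) -> List[str]:
    """Deduplicate while keeping first-seen order (dicts preserve insertion order)."""
    return list(dict.fromkeys(items))


raw = ["1. What is RAG?", "Questions:", "2) What is FAISS?", "What is RAG?", "  "]
print(unique_in_order(clean_and_filter_questions(raw)))
# prints ['What is RAG?', 'What is FAISS?']
```

Filtering on the trailing question mark is a cheap way to drop headers or stray answers that the model may emit despite the prompt instructions.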
Main processing flow
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings


def process_documents(content: str, embedding_model: OpenAIEmbeddings):
    """
    Process the document content, split it into fragments, generate questions,
    create a FAISS vector store, and return a retriever.

    Args:
        content (str): The content of the document to process.
        embedding_model (OpenAIEmbeddings): The embedding model to use for vectorization.

    Returns:
        VectorStoreRetriever: A retriever for the most relevant FAISS document.
    """
    # Split the whole text content into text documents
    text_documents = split_document(content, DOCUMENT_MAX_TOKENS, DOCUMENT_OVERLAP_TOKENS)
    print(f'Text content split into: {len(text_documents)} documents')

    documents = []
    counter = 0
    for i, text_document in enumerate(text_documents):
        # Split each text document into smaller fragments
        text_fragments = split_document(text_document, FRAGMENT_MAX_TOKENS, FRAGMENT_OVERLAP_TOKENS)
        print(f'Text document {i} - split into: {len(text_fragments)} fragments')

        for j, text_fragment in enumerate(text_fragments):
            # Index the fragment itself; metadata["text"] keeps the parent document
            documents.append(Document(
                page_content=text_fragment,
                metadata={"type": "ORIGINAL", "index": counter, "text": text_document}
            ))
            counter += 1

            if QUESTION_GENERATION == QuestionGeneration.FRAGMENT_LEVEL:
                # One LLM call per fragment
                questions = generate_questions(text_fragment)
                documents.extend([
                    Document(page_content=question, metadata={"type": "AUGMENTED", "index": counter + idx, "text": text_document})
                    for idx, question in enumerate(questions)
                ])
                counter += len(questions)
                print(f'Text document {i} Text fragment {j} - generated: {len(questions)} questions')

        if QUESTION_GENERATION == QuestionGeneration.DOCUMENT_LEVEL:
            # One LLM call per document
            questions = generate_questions(text_document)
            documents.extend([
                Document(page_content=question, metadata={"type": "AUGMENTED", "index": counter + idx, "text": text_document})
                for idx, question in enumerate(questions)
            ])
            counter += len(questions)
            print(f'Text document {i} - generated: {len(questions)} questions')

    for document in documents:
        print_document("Dataset", document)

    print(f'Creating store, calculating embeddings for {len(documents)} FAISS documents')
    vectorstore = FAISS.from_documents(documents, embedding_model)

    print("Creating retriever returning the most relevant FAISS document")
    return vectorstore.as_retriever(search_kwargs={"k": 1})
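`process_documents` calls `split_document` twice, first to cut the content into documents and then to cut each document into fragments, but the helper itself is not shown. The following is a minimal sketch of an overlapping splitter, assuming whitespace-delimited words as a stand-in for tokens; the article's own helper may count tokens differently (for example with a model tokenizer), and the parameter names `chunk_size` and `chunk_overlap` are assumptions.

```python
import re
from typing import List


def split_document(document: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into chunks of at most chunk_size words, overlapping by chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = re.findall(r"\S+", document)  # whitespace-delimited words stand in for tokens
    step = chunk_size - chunk_overlap     # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(10))
for chunk in split_document(text, chunk_size=4, chunk_overlap=1):
    print(chunk)
# prints:
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of indexing some words twice.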