**Exploring the future of PDF parsing and retrieval: how combining RAG with LlamaParse can change the way we process information.** Key points: 1. How RAG works and its central role in data-driven generative AI. 2. The challenges PDF files pose for information extraction, and the strengths of LlamaParse. 3. The prospects for applying LlamaParse to complex documents containing tables, images, and more.
!pip install llama-index
!pip install llama-index-core
!pip install llama-index-embeddings-openai
!pip install llama-parse
!pip install llama-index-vector-stores-kdbai
!pip install pandas
!pip install llama-index-postprocessor-cohere-rerank
!pip install kdbai_client
from llama_parse import LlamaParse
from llama_index.core import Settings
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.kdbai import KDBAIVectorStore
from llama_index.postprocessor.cohere_rerank import CohereRerank
from getpass import getpass
import os
import kdbai_client as kdbai
# llama-parse is async-first, running the async code in a notebook requires the use of nest_asyncio
import nest_asyncio
nest_asyncio.apply()
# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = (
os.environ["LLAMA_CLOUD_API_KEY"]
if "LLAMA_CLOUD_API_KEY" in os.environ
else getpass("LLAMA CLOUD API key: ")
)
# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = (
os.environ["OPENAI_API_KEY"]
if "OPENAI_API_KEY" in os.environ
else getpass("OpenAI API Key: ")
)
# Set up KDB.AI endpoint and API key
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)
# Connect to KDB.AI
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)
schema = [
    dict(name="document_id", type="str"),
    dict(name="text", type="str"),
    dict(name="embeddings", type="float32s"),
]
# Flat (brute-force) index over the embeddings column;
# dims=1536 matches the output size of OpenAI's text-embedding-3-small
indexFlat = {
    "name": "flat",
    "type": "flat",
    "column": "embeddings",
    "params": {"dims": 1536, "metric": "L2"},
}
# Connect to the default KDB.AI database
db = session.database("default")
KDBAI_TABLE_NAME = "LlamaParse_Table"
# First ensure the table does not already exist
try:
    db.table(KDBAI_TABLE_NAME).drop()
except kdbai.KDBAIException:
    pass
# Create the table
table = db.create_table(KDBAI_TABLE_NAME, schema, indexes=[indexFlat])
!wget 'https://arxiv.org/pdf/2404.08865' -O './LLM_recall.pdf'
EMBEDDING_MODEL = "text-embedding-3-small"
GENERATION_MODEL = "gpt-4o"
llm = OpenAI(model=GENERATION_MODEL)
embed_model = OpenAIEmbedding(model=EMBEDDING_MODEL)
Settings.llm = llm
Settings.embed_model = embed_model
pdf_file_name = './LLM_recall.pdf'
parsing_instructions = '''The document titled "LLM In-Context Recall is Prompt Dependent" is an academic preprint from April 2024, authored by Daniel Machlab and Rick Battle from the VMware NLP Lab. It explores the in-context recall capabilities of Large Language Models (LLMs) using a method called "needle-in-a-haystack," where a specific factoid is embedded in a block of unrelated text. The study investigates how the recall performance of various LLMs is influenced by the content of prompts and the biases in their training data. The research involves testing multiple LLMs with varying context window sizes to assess their ability to recall information accurately when prompted differently. The paper includes detailed methodologies, results from numerous tests, discussions on the impact of prompt variations and training data, and conclusions on improving LLM utility in practical applications. It contains many tables. Answer questions using the information in this article and be precise.'''
documents = LlamaParse(
    result_type="markdown",
    parsing_instructions=parsing_instructions,
).load_data(pdf_file_name)
print(documents[0].text[:1000])
# Parse the documents using MarkdownElementNodeParser
node_parser = MarkdownElementNodeParser(llm=llm, num_workers=8)
# Retrieve nodes (text) and objects (tables)
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
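The code above parses the PDF but never writes the nodes into KDB.AI, so the table queried below would stay empty. A minimal indexing step, following the standard LlamaIndex vector-store pattern (the variable names here are illustrative):
# Wrap the KDB.AI table as a LlamaIndex vector store and persist the parsed nodes.
# VectorStoreIndex embeds the nodes with Settings.embed_model before inserting them.
vector_store = KDBAIVectorStore(table)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    nodes=base_nodes + objects,
    storage_context=storage_context,
)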
# Use the raw OpenAI client for query-time embedding and generation
# (note: this import shadows the llama_index OpenAI wrapper imported earlier)
from openai import OpenAI
client = OpenAI()
def embed_query(query):
    query_embedding = client.embeddings.create(
        input=query,
        model=EMBEDDING_MODEL,
    )
    return query_embedding.data[0].embedding
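A quick sanity check, assuming the setup above: the embedding length must equal the dims configured in indexFlat.
# Should print 1536, matching indexFlat's dims for text-embedding-3-small
print(len(embed_query("test query")))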
def retrieve_data(query):
    query_embedding = embed_query(query)
    # Nearest-neighbour search on the flat index, excluding one document_id
    results = table.search(
        vectors={"flat": [query_embedding]},
        n=5,
        filter=[("<>", "document_id", "4a9551df-5dec-4410-90bb-43d17d722918")],
    )
    retrieved_data_for_RAG = []
    for index, row in results[0].iterrows():
        retrieved_data_for_RAG.append(row["text"])
    return retrieved_data_for_RAG
def RAG(query):
    question = "You will answer this question based on the provided reference material: " + query
    messages = "Here is the provided context: " + "\n"
    results = retrieve_data(query)
    if results:
        for data in results:
            messages += data + "\n"
    response = client.chat.completions.create(
        model=GENERATION_MODEL,
        messages=[
            {"role": "system", "content": question},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": messages},
                ],
            },
        ],
        max_tokens=300,
    )
    content = response.choices[0].message.content
    return content
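With the pipeline assembled end to end, it can be exercised like this (the question is illustrative, drawn from the paper's topic):
# Ask about the parsed paper; retrieval pulls context from KDB.AI, gpt-4o answers
print(RAG("What is the needle-in-a-haystack method described in the paper?"))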