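I previously wrote a piece on building a knowledge base with Ollama and LangChain. Combining LangChain with large language models is still fairly new, and LangChain keeps adding integrations; the recently launched Groq is already supported. Today I came across a piece that uses OpenAI, LangChain, and chromadb to scrape the contents of an SEC filing and analyze a large company's financial position. It is not really an article, just a short introduction and a block of code. I am reposting it here for reference; see whether it gives you any ideas for your own learning.
An open-source tool for extracting data from public companies' financial filings. It uses Mistral-7B to extract the income statement from a 10-K, and the output is neatly formatted as JSON.
Workflow overview:
Follow-up to-dos:
The page URL is: https://www.sec.gov/Archives/edgar/data/1559720/000155972024000006/abnb-20231231.htm
This is a financial filing from Airbnb: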
pip install -U -q langchain openai chromadb unstructured==0.12.5 instructor tiktoken
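import os
import getpass
from langchain.document_loaders import UnstructuredURLLoader
# Load the filing; SEC EDGAR expects a descriptive User-Agent header
url = "https://www.sec.gov/Archives/edgar/data/1559720/000155972024000006/abnb-20231231.htm"
loader = UnstructuredURLLoader(urls=[url], headers={'User-Agent': 'your-org your@org.com'})
documents = loader.load()
# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = getpass.getpass()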
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import TokenTextSplitter
# Naively chunk the SEC filing by tokens
token_splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=20)
docs = token_splitter.split_documents(documents)
# Save the chunked docs in vector DB
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings(model="text-embedding-3-large"))
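To make the chunking step above concrete, here is a minimal sketch of fixed-size token windows with overlap. It is not LangChain's implementation: `chunk_tokens` is a hypothetical helper, and the placeholder tokens stand in for real tokenizer output.

```python
def chunk_tokens(tokens, chunk_size=256, chunk_overlap=20):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
        start += step
    return chunks


# Placeholder tokens (an assumption for illustration, not real tokenizer output).
tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

The overlap means the last 20 tokens of one chunk reappear at the start of the next, which reduces the chance that a table row or sentence is cut in half at a chunk boundary.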
query = "What was Airbnb's revenue, net income, and cost of revenue?"
# Get documents from the vector DB
k = 1
top_k_docs = vectorstore.similarity_search(query, k)
context = "\n".join([doc.page_content for doc in top_k_docs])
context
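Conceptually, the similarity search above ranks stored chunks by how close their embeddings are to the query embedding and returns the top k. The sketch below shows that idea with cosine similarity on toy three-dimensional vectors. The vectors, chunk names, and the `top_k` helper are all made up for illustration, and Chroma's actual distance metric and index may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the ids of the k documents most similar to the query."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings (hypothetical values; real embeddings have far more dimensions).
toy_embeddings = {
    "income_statement_chunk": [0.9, 0.1, 0.0],
    "risk_factors_chunk": [0.1, 0.9, 0.2],
    "legal_proceedings_chunk": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]
print(top_k(query_embedding, toy_embeddings, k=1))
```

With k = 1 the model only ever sees a single 256-token chunk, so if the income statement spans several chunks, a larger k may be worth trying.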
import instructor
from openai import OpenAI
from pydantic import BaseModel
from pydantic import Field
from enum import Enum
from typing import Optional, Union, List
class UnitSuffix(str, Enum):
    billion = 'Billion'
    million = 'Million'
    thousand = 'Thousand'
    unknown = ''

class FiscalPeriod(str, Enum):
    fy_2023 = 'FY2023'
    fy_2022 = 'FY2022'
    fy_2021 = 'FY2021'
    fy_2020 = 'FY2020'
    unknown = ''

# Define our income statement
class IncomeStatement(BaseModel):
    period: Optional[FiscalPeriod]
    revenue: Union[float, str] = Field(description="Revenue")
    revenue_unit: Optional[UnitSuffix]
    cost_of_revenue: Union[float, str] = Field(description="Cost of revenue")
    cost_of_revenue_unit: Optional[UnitSuffix]
    income_from_operations: Union[float, str] = Field(description="Income from operations")
    income_from_operations_unit: Optional[UnitSuffix]
    operations_and_support: Union[float, str] = Field(description="Operations and support")
    operations_and_support_unit: Optional[UnitSuffix]
    product_development: Union[float, str] = Field(description="Product development")
    product_development_unit: Optional[UnitSuffix]
    sales_and_marketing: Union[float, str] = Field(description="Sales and marketing")
    sales_and_marketing_unit: Optional[UnitSuffix]
    general_and_administrative: Union[float, str] = Field(description="General and administrative")
    general_and_administrative_unit: Optional[UnitSuffix]
    interest_income: Union[float, str] = Field(description="Interest income")
    interest_income_unit: Optional[UnitSuffix]
    interest_expense: Union[float, str] = Field(description="Interest expense")
    interest_expense_unit: Optional[UnitSuffix]
    other_income: Union[float, str] = Field(description="Other income")
    other_income_unit: Optional[UnitSuffix]
    net_income: Union[float, str] = Field(description="Net income")
    net_income_unit: Optional[UnitSuffix]

class Financials(BaseModel):
    ticker: str
    income_statements: List[IncomeStatement]
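Because the schema above stores each figure as a value plus a separate unit suffix, downstream code usually has to combine the two before comparing numbers. The following is a small standard-library sketch of that step. `to_dollars` and `MULTIPLIERS` are hypothetical names, and treating an unknown suffix as a multiplier of 1 is an assumption.

```python
from enum import Enum


class UnitSuffix(str, Enum):
    billion = "Billion"
    million = "Million"
    thousand = "Thousand"
    unknown = ""


# Usual meaning of each suffix; mapping "unknown" to 1 is an assumption.
MULTIPLIERS = {
    UnitSuffix.billion: 1_000_000_000,
    UnitSuffix.million: 1_000_000,
    UnitSuffix.thousand: 1_000,
    UnitSuffix.unknown: 1,
}


def to_dollars(value, unit):
    """Hypothetical helper: scale value by its unit suffix, or return None if value is not numeric."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None  # the schema allows strings, so non-numeric values can occur
    return number * MULTIPLIERS[UnitSuffix(unit or "")]


print(to_dollars(9917.0, "Million"))
print(to_dollars("n/a", "Million"))
```

Returning None for non-numeric values keeps the `Union[float, str]` fields from silently turning into wrong numbers when the model emits text instead of a figure.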
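IncomeStatement: a Pydantic BaseModel that represents the income statement. It contains the financial fields (revenue, costs, income from operations, and so on), and each field has a matching unit-suffix field. The period field gives the fiscal period of the statement.
Financials: another Pydantic BaseModel, representing the company's financial data.
!pip install -U -q llama-cpp-python huggingface-hub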
import llama_cpp
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding
import instructor
from pydantic import BaseModel
from typing import List
from rich.console import Console
from huggingface_hub import hf_hub_download
# mixtral_path = "TheBloke/Mixtral-8x7B-v0.1-GGUF"
# mixtral_q4_basename = "mixtral-8x7b-v0.1.Q4_K_M.gguf"
mistral_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
mistral_q4_basename = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
model_path = hf_hub_download(repo_id=mistral_path, filename=mistral_q4_basename)
llm = Llama(
    model_path=model_path,
    n_gpu_layers=-1,  # Number of layers to offload to the GPU; the rest stay on the CPU. Use -1 to offload all layers.
    n_batch=2048,  # Should be between 1 and n_ctx; consider the amount of VRAM in your GPU.
    n_ctx=2048,
    logits_all=False,
)
llm.verbose = False
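import time
# The source listing calls create() without defining it. The patch below follows
# instructor's llama-cpp-python integration and is an assumption about the omitted step.
create = instructor.patch(
    create=llm.create_chat_completion_openai_v1,
    mode=instructor.Mode.JSON_SCHEMA,
)
start = time.time()
response = create(
    response_model=instructor.Partial[Financials],
    messages=[
        {
            "role": "user",
            "content": f"Extract Airbnb's income statement from 2023, 2022, and 2021 from the following context: {context}",
        },
    ],
)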
print(f"Took {time.time() - start} seconds to complete!")
print(response.model_dump_json(indent=2))
Took 119.98298811912537 seconds to complete!
{
"income_statements": [
{
"period": "FY2021",
"revenue": 5992.0,
"revenue_unit": "Million",
"cost_of_revenue": 1156.0,
"cost_of_revenue_unit": "Million",
"income_from_operations": 429.0,
"income_from_operations_unit": "Million",
"operations_and_support": 847.0,
"operations_and_support_unit": "Million",
"product_development": 1425.0,
"product_development_unit": "Million",
"sales_and_marketing": 1186.0,
"sales_and_marketing_unit": "Million",
"general_and_administrative": 836.0,
"general_and_administrative_unit": "Million",
"interest_income": 13.0,
"interest_income_unit": "Million",
"interest_expense": -438.0,
"interest_expense_unit": "Million",
"other_income": -304.0,
"other_income_unit": "Million",
"net_income": -352.0,
"net_income_unit": "Million"
},
{
"period": "FY2022",
"revenue": 8399.0,
"revenue_unit": "Million",
"cost_of_revenue": 1499.0,
"cost_of_revenue_unit": "Million",
"income_from_operations": 1802.0,
"income_from_operations_unit": "Million",
"operations_and_support": 1041.0,
"operations_and_support_unit": "Million",
"product_development": 1502.0,
"product_development_unit": "Million",
"sales_and_marketing": 1516.0,
"sales_and_marketing_unit": "Million",
"general_and_administrative": 950.0,
"general_and_administrative_unit": "Million",
"interest_income": 186.0,
"interest_income_unit": "Million",
"interest_expense": -24.0,
"interest_expense_unit": "Million",
"other_income": 25.0,
"other_income_unit": "Million",
"net_income": 1893.0,
"net_income_unit": "Million"
},
{
"period": "FY2023",
"revenue": 9917.0,
"revenue_unit": "Million",
"cost_of_revenue": 1703.0,
"cost_of_revenue_unit": "Million",
"income_from_operations": 1518.0,
"income_from_operations_unit": "Million",
"operations_and_support": 1186.0,
"operations_and_support_unit": "Million",
"product_development": 1722.0,
"product_development_unit": "Million",
"sales_and_marketing": 1763.0,
"sales_and_marketing_unit": "Million",
"general_and_administrative": 2025.0,
"general_and_administrative_unit": "Million",
"interest_income": 721.0,
"interest_income_unit": "Million",
"interest_expense": -83.0,
"interest_expense_unit": "Million",
"other_income": -54.0,
"other_income_unit": "Million",
"net_income": 4792.0,
"net_income_unit": "Million"
}
]
}
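Once the output is structured JSON like the above, simple analysis becomes a few lines of code. As a sketch, the snippet below computes net margin (net income divided by revenue) per fiscal year, using the revenue and net income figures copied from the output above. `net_margins` is a hypothetical helper, not part of the original tool.

```python
import json

# Revenue and net income (in millions) copied from the extracted output above.
extracted = json.loads("""
{"income_statements": [
  {"period": "FY2021", "revenue": 5992.0, "net_income": -352.0},
  {"period": "FY2022", "revenue": 8399.0, "net_income": 1893.0},
  {"period": "FY2023", "revenue": 9917.0, "net_income": 4792.0}
]}
""")


def net_margins(financials):
    """Return {period: net margin in percent}, skipping non-numeric or zero revenue."""
    margins = {}
    for stmt in financials["income_statements"]:
        revenue, net_income = stmt["revenue"], stmt["net_income"]
        numeric = isinstance(revenue, (int, float)) and isinstance(net_income, (int, float))
        if numeric and revenue:
            margins[stmt["period"]] = round(100 * net_income / revenue, 1)
    return margins


margins = net_margins(extracted)
print(margins)
```

Both figures share the same unit here (Million), so the ratio needs no unit conversion; with mixed units, the values would have to be normalized first.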