
Using OpenAI and LangChain to Fetch SEC Financial Data for Public Companies and Automatically Generate Financial Analysis Reports

Published: 2024-04-02 22:43:10

Author: 峰哥Python笔记


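I previously wrote a piece on building a knowledge base with ollama and LangChain. Combining LangChain with large language models is still fairly new, and LangChain keeps adding support for more backends; the recently launched Groq is already supported. Today I came across a piece that uses OpenAI, LangChain, and chromadb to fetch the contents of an SEC filing and analyze the financial condition of a large company. It is not really an article, just a short introduction and a block of code. I am reposting it here for reference; see whether it gives you any ideas for your own learning.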

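Overview

An open-source tool for extracting data from the financial filings of public companies. It uses Mistral-7B to extract the income statement from a 10-K, and the output is cleanly formatted as JSON.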

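Workflow:

Download and chunk the SEC filing
Store the chunks in a vector database
Query the vector database for financial data
Use a large language model (LLM) to extract the financial data
Use instructor to output JSON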

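Planned to-do items:

Extract all financial statements from SEC filings.
Structure the financial statements as JSON.
Store the statements in a SQL database.
Build an API to serve the stored statements.
Call the API from an application and render the financial data as tables.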
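The storage item on this to-do list is not implemented in the original project. As a rough sketch of what it could look like, the snippet below writes a few extracted figures into an in-memory SQLite database and reads them back. The table name, column set, and the helpers `save_statements` and `load_statements` are hypothetical; the figures are copied from the results at the end of this post.

```python
import sqlite3

# Hypothetical table layout; the original project does not define one.
SCHEMA = """
CREATE TABLE IF NOT EXISTS income_statement (
    ticker TEXT NOT NULL,
    period TEXT NOT NULL,
    revenue REAL,
    net_income REAL,
    unit TEXT,
    PRIMARY KEY (ticker, period)
)
"""

def save_statements(conn, ticker, statements):
    """Insert or replace one row per fiscal period."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR REPLACE INTO income_statement VALUES (?, ?, ?, ?, ?)",
        [(ticker, s["period"], s["revenue"], s["net_income"], s["unit"])
         for s in statements],
    )
    conn.commit()

def load_statements(conn, ticker):
    """Return the stored rows for a ticker, ordered by period."""
    cur = conn.execute(
        "SELECT period, revenue, net_income, unit FROM income_statement "
        "WHERE ticker = ? ORDER BY period",
        (ticker,),
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
save_statements(conn, "ABNB", [
    {"period": "FY2023", "revenue": 9917.0, "net_income": 4792.0, "unit": "Million"},
    {"period": "FY2022", "revenue": 8399.0, "net_income": 1893.0, "unit": "Million"},
])
rows = load_statements(conn, "ABNB")
for row in rows:
    print(row)
```

Using `(ticker, period)` as the primary key means re-running the extraction for the same filing overwrites the earlier row instead of duplicating it.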

The filing used here is at: https://www.sec.gov/Archives/edgar/data/1559720/000155972024000006/abnb-20231231.htm

It is Airbnb's annual report (Form 10-K) for the fiscal year ended December 31, 2023.

Code

1. Install packages

pip install -U -q langchain openai chromadb unstructured==0.12.5 instructor tiktoken

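2. Load the SEC data

from langchain.document_loaders import UnstructuredURLLoader

url = "https://www.sec.gov/Archives/edgar/data/1559720/000155972024000006/abnb-20231231.htm"
# SEC asks automated clients to identify themselves; replace the placeholder with your own details
loader = UnstructuredURLLoader(urls=[url], headers={'User-Agent': 'your-org your@org.com'})
documents = loader.load()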
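The URL above follows a regular pattern, so other filings can be addressed the same way. The helper below is a small sketch of that pattern; `build_filing_url` is a hypothetical name, and the dashed form of the accession number is an assumption inferred from the 18-digit folder name in the URL.

```python
def build_filing_url(cik: int, accession_number: str, filename: str) -> str:
    """Build an EDGAR archive URL.

    The folder name is assumed to be the accession number with dashes removed.
    """
    folder = accession_number.replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{filename}"

# Values taken from the Airbnb filing used in this post.
url = build_filing_url(1559720, "0001559720-24-000006", "abnb-20231231.htm")
print(url)
```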

3. Prepare the data

import os
import getpass

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = getpass.getpass()

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import TokenTextSplitter

# Naively chunk the SEC filing by tokens
token_splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=20)
docs = token_splitter.split_documents(documents)

# Save the chunked docs in vector DB
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings(model="text-embedding-3-large"))
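To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is not the `TokenTextSplitter` implementation: it works on a plain list of placeholder tokens instead of tiktoken output, and `chunk_tokens` is a hypothetical helper.

```python
from typing import List

def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for c in chunks:
    print(c)
```

Each window repeats the last token of the previous one, so a figure that straddles a chunk boundary still appears whole in at least one chunk when the overlap is large enough.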

4. Query the vector store for context

query = "What was Airbnb's revenue, net income, and cost of revenue?"

# Get documents from the vector DB
k = 1
top_k_docs = vectorstore.similarity_search(query, k)
context = "\n".join([doc.page_content for doc in top_k_docs])

context
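Conceptually, `similarity_search` embeds the query, scores it against every stored chunk, and returns the `k` best matches. The sketch below imitates that with a toy word-count "embedding" and cosine similarity. It is only an illustration: the real pipeline uses OpenAI embeddings and Chroma's index, and `embed`, `cosine`, and `top_k` are hypothetical names.

```python
import math
from collections import Counter
from typing import List

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query: str, chunks: List[str], k: int) -> List[str]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "revenue and net income for the year",
    "risk factors related to regulation",
    "cost of revenue increased",
]
best = top_k("revenue net income cost of revenue", chunks, k=2)
print(best)
```

With `k = 1`, as in the code above, only the single best chunk reaches the model; raising `k` supplies more context at the cost of a longer prompt.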

5. Define the schema for structured output

import instructor
from openai import OpenAI
from pydantic import BaseModel
from pydantic import Field
from enum import Enum
from typing import Optional, Union, List

class UnitSuffix(str, Enum):
    billion = 'Billion'
    million = 'Million'
    thousand = 'Thousand'
    unknown = ''

class FiscalPeriod(str, Enum):
    fy_2023 = 'FY2023'
    fy_2022 = 'FY2022'
    fy_2021 = 'FY2021'
    fy_2020 = 'FY2020'
    unknown = ''

# Define our income statement
class IncomeStatement(BaseModel):
  period: Optional[FiscalPeriod]

  revenue: Union[float, str] = Field(description="Revenue")
  revenue_unit: Optional[UnitSuffix]

  cost_of_revenue: Union[float, str] = Field(description="Cost of revenue")
  cost_of_revenue_unit: Optional[UnitSuffix]

  income_from_operations: Union[float, str] = Field(description="Income from operations")
  income_from_operations_unit: Optional[UnitSuffix]

  operations_and_support: Union[float, str] = Field(description="Operations and support")
  operations_and_support_unit: Optional[UnitSuffix]

  product_development: Union[float, str] = Field(description="Product development")
  product_development_unit: Optional[UnitSuffix]

  sales_and_marketing: Union[float, str] = Field(description="Sales and marketing")
  sales_and_marketing_unit: Optional[UnitSuffix]

  general_and_administrative: Union[float, str] = Field(description="General and administrative")
  general_and_administrative_unit: Optional[UnitSuffix]

  interest_income: Union[float, str] = Field(description="Interest income")
  interest_income_unit: Optional[UnitSuffix]

  interest_expense: Union[float, str] = Field(description="Interest expense")
  interest_expense_unit: Optional[UnitSuffix]

  other_income: Union[float, str] = Field(description="Other income")
  other_income_unit: Optional[UnitSuffix]

  net_income: Union[float, str] = Field(description="Net income")
  net_income_unit: Optional[UnitSuffix]


class Financials(BaseModel):
  ticker: str
  income_statements: List[IncomeStatement]

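IncomeStatement: a Pydantic BaseModel that represents an income statement. It contains financial fields such as revenue, cost of revenue, and income from operations, and each value field is paired with a unit-suffix field. The period field records the fiscal period of the statement.
Financials: another Pydantic BaseModel that represents a company's financial data: a ticker plus a list of income statements.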
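Because each value is stored next to a unit suffix, downstream code usually has to combine the two before comparing numbers. The sketch below shows one way to do that with the standard library only. `MULTIPLIER` and `to_dollars` are hypothetical names, `UnitSuffix` is repeated so the snippet runs on its own, and treating an unknown unit as "already in dollars" is an assumption.

```python
from enum import Enum
from typing import Union

class UnitSuffix(str, Enum):
    billion = "Billion"
    million = "Million"
    thousand = "Thousand"
    unknown = ""

# Multiplier for each unit; an unknown unit is assumed to mean plain dollars.
MULTIPLIER = {
    UnitSuffix.billion: 1_000_000_000,
    UnitSuffix.million: 1_000_000,
    UnitSuffix.thousand: 1_000,
    UnitSuffix.unknown: 1,
}

def to_dollars(value: Union[float, str], unit: UnitSuffix) -> float:
    """Convert a reported value plus unit suffix to a plain dollar amount."""
    if isinstance(value, str):
        text = value.replace(",", "").replace("$", "").strip()
        # Accounting notation writes negatives in parentheses, e.g. "(352)".
        if text.startswith("(") and text.endswith(")"):
            text = "-" + text[1:-1]
        value = float(text)
    return value * MULTIPLIER[unit]

print(to_dollars(9917.0, UnitSuffix.million))
print(to_dollars("(352)", UnitSuffix.million))
```

The string branch exists because the schema types values as `Union[float, str]`, so the model may return either form.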

5. Download the Mistral-7B model

!pip install -U -q llama-cpp-python huggingface-hub

import llama_cpp
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

import instructor

from pydantic import BaseModel
from typing import List
from rich.console import Console
from huggingface_hub import hf_hub_download

# mixtral_path = "TheBloke/Mixtral-8x7B-v0.1-GGUF"
# mixtral_q4_basename = "mixtral-8x7b-v0.1.Q4_K_M.gguf"

mistral_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"

mistral_q4_basename = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"

model_path = hf_hub_download(repo_id=mistral_path, filename=mistral_q4_basename)

llm = Llama(
    model_path=model_path,
    n_gpu_layers=-1, # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all
    n_batch = 2048, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_ctx=2048,
    logits_all=False,
)
llm.verbose = False

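6. Analyze the financial data with Mistral-7B

import time

# The original snippet never defines `create`. Assumed fix: patch llama-cpp-python's
# OpenAI-compatible chat completion with instructor so it accepts `response_model`.
create = instructor.patch(
    create=llm.create_chat_completion_openai_v1,
    mode=instructor.Mode.JSON_SCHEMA,
)

start = time.time()

response = create(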
    response_model=instructor.Partial[Financials],
    messages=[
        {
            "role": "user",
            "content": f"Extract Airbnb's income statement from 2023, 2022, and 2021 from following context: {context}",
        },
    ],
)
print(f"Took {time.time() - start} seconds to complete!")
print(response.model_dump_json(indent=2))

7. Results

Took 119.98298811912537 seconds to complete!
{
  "income_statements": [
    {
      "period": "FY2021",
      "revenue": 5992.0,
      "revenue_unit": "Million",
      "cost_of_revenue": 1156.0,
      "cost_of_revenue_unit": "Million",
      "income_from_operations": 429.0,
      "income_from_operations_unit": "Million",
      "operations_and_support": 847.0,
      "operations_and_support_unit": "Million",
      "product_development": 1425.0,
      "product_development_unit": "Million",
      "sales_and_marketing": 1186.0,
      "sales_and_marketing_unit": "Million",
      "general_and_administrative": 836.0,
      "general_and_administrative_unit": "Million",
      "interest_income": 13.0,
      "interest_income_unit": "Million",
      "interest_expense": -438.0,
      "interest_expense_unit": "Million",
      "other_income": -304.0,
      "other_income_unit": "Million",
      "net_income": -352.0,
      "net_income_unit": "Million"
    },
    {
      "period": "FY2022",
      "revenue": 8399.0,
      "revenue_unit": "Million",
      "cost_of_revenue": 1499.0,
      "cost_of_revenue_unit": "Million",
      "income_from_operations": 1802.0,
      "income_from_operations_unit": "Million",
      "operations_and_support": 1041.0,
      "operations_and_support_unit": "Million",
      "product_development": 1502.0,
      "product_development_unit": "Million",
      "sales_and_marketing": 1516.0,
      "sales_and_marketing_unit": "Million",
      "general_and_administrative": 950.0,
      "general_and_administrative_unit": "Million",
      "interest_income": 186.0,
      "interest_income_unit": "Million",
      "interest_expense": -24.0,
      "interest_expense_unit": "Million",
      "other_income": 25.0,
      "other_income_unit": "Million",
      "net_income": 1893.0,
      "net_income_unit": "Million"
    },
    {
      "period": "FY2023",
      "revenue": 9917.0,
      "revenue_unit": "Million",
      "cost_of_revenue": 1703.0,
      "cost_of_revenue_unit": "Million",
      "income_from_operations": 1518.0,
      "income_from_operations_unit": "Million",
      "operations_and_support": 1186.0,
      "operations_and_support_unit": "Million",
      "product_development": 1722.0,
      "product_development_unit": "Million",
      "sales_and_marketing": 1763.0,
      "sales_and_marketing_unit": "Million",
      "general_and_administrative": 2025.0,
      "general_and_administrative_unit": "Million",
      "interest_income": 721.0,
      "interest_income_unit": "Million",
      "interest_expense": -83.0,
      "interest_expense_unit": "Million",
      "other_income": -54.0,
      "other_income_unit": "Million",
      "net_income": 4792.0,
      "net_income_unit": "Million"
    }
  ]
}
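LLM-extracted figures are worth a sanity check before they are used. One simple check, sketched below, is whether revenue minus the listed operating expenses equals the reported income from operations. The figures (in millions) are copied from the JSON output above; `EXPENSE_FIELDS` and `operating_residual` are hypothetical names, and the check assumes the schema's five expense fields are the only operating expenses.

```python
# Figures (in millions) copied from the JSON output above.
statements = [
    {"period": "FY2021", "revenue": 5992.0, "cost_of_revenue": 1156.0,
     "operations_and_support": 847.0, "product_development": 1425.0,
     "sales_and_marketing": 1186.0, "general_and_administrative": 836.0,
     "income_from_operations": 429.0},
    {"period": "FY2022", "revenue": 8399.0, "cost_of_revenue": 1499.0,
     "operations_and_support": 1041.0, "product_development": 1502.0,
     "sales_and_marketing": 1516.0, "general_and_administrative": 950.0,
     "income_from_operations": 1802.0},
    {"period": "FY2023", "revenue": 9917.0, "cost_of_revenue": 1703.0,
     "operations_and_support": 1186.0, "product_development": 1722.0,
     "sales_and_marketing": 1763.0, "general_and_administrative": 2025.0,
     "income_from_operations": 1518.0},
]

EXPENSE_FIELDS = [
    "cost_of_revenue",
    "operations_and_support",
    "product_development",
    "sales_and_marketing",
    "general_and_administrative",
]

def operating_residual(s: dict) -> float:
    """Revenue minus the listed expenses minus reported income from operations."""
    expenses = sum(s[f] for f in EXPENSE_FIELDS)
    return s["revenue"] - expenses - s["income_from_operations"]

residuals = {s["period"]: operating_residual(s) for s in statements}
for period, residual in residuals.items():
    print(period, residual)
```

FY2023 reconciles exactly, while FY2021 and FY2022 leave residuals of 113 and 89. A nonzero residual means either an extraction error or an expense line in the filing that the schema does not model, so those periods should be checked against the source document.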
