Image created with DALL-E on February 6, 2024
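A research graph is a structured representation of research objects: it captures information about entities such as researchers, organizations, publications, grants, and research data, together with the relationships between them. Today, publications mostly exist as PDF files, and because their content is free text, it is difficult to parse a PDF and extract structured information from it. In this article, we attempt to create a research graph by extracting the relevant information from PDF publications and using OpenAI to organize it into a graph structure.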
Pipeline for creating a graph from PDFs
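In this work, we use the OpenAI API and the new Assistants feature of GPT (currently in Beta) to convert PDF documents into a set of structured JSON files based on the research graph schema.
The Assistants API lets you build artificial intelligence (AI) assistants inside your own application. An assistant answers user questions by drawing on models, tools, and information. It is a Beta API under active development. With the Assistants API, we can use OpenAI-hosted tools such as Code Interpreter and Knowledge Retrieval. This article focuses on Knowledge Retrieval.
Sometimes we need the AI model to answer queries based on knowledge it has not seen, such as user-provided documents or sensitive information. The Knowledge Retrieval tool of the Assistants API lets us augment the model with that information. We upload files to the assistant, which automatically chunks the documents, then creates and stores embeddings so that the data can be searched by vector similarity.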
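To make the idea of chunking and vector search more concrete, here is a minimal sketch. It does not reflect how OpenAI implements retrieval internally: the word-count "embedding", the helper names `chunk_text`, `embed`, `cosine`, and `search`, and the chunk size are all illustrative assumptions.

```python
import math
from collections import Counter

def chunk_text(text, size=8):
    """Split text into chunks of at most `size` words (hypothetical chunk size)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding: bag-of-words counts, a stand-in for a real embedding model."""
    return Counter(w.lower().strip(".,?") for w in text.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

document = ("Dodes are organized hierarchically into Dode trees. "
            "The keywords of this paper are ontology and semantic web.")
chunks = chunk_text(document, size=8)
best = search("What are the keywords of the paper?", chunks)
print(best)
```

A real retrieval tool would use dense embeddings from a trained model and a vector index, but the flow is the same: split, embed, compare, return the closest chunks.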
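In our example, we upload a publication's PDF file to the OpenAI assistant with the Knowledge Retrieval tool and obtain JSON output that follows the graph schema for that publication. The publication we use is available at the following link [1].
Read the input path, where the publication PDFs are stored, and the output path, where the JSON output will be written.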
import configparser
import os

# Assumption: config.ini lives in the current working directory
current_path = os.getcwd()

config = configparser.ConfigParser()
config.read('{}/config.ini'.format(current_path))
input_path = config['DEFAULT']['Input-Path']
output_path = config['DEFAULT']['Output-Path']
debug = config['DEFAULT']['Debug']
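The code above expects a `config.ini` file with three keys in its `DEFAULT` section. The following sketch shows what such a file might contain, parsed from a string so that the example is self-contained. The paths and the `Debug` value are placeholders, not values from the original project.

```python
import configparser

# Hypothetical contents of config.ini; the paths are placeholders.
SAMPLE_CONFIG = """
[DEFAULT]
Input-Path = ./pdfs/
Output-Path = ./json/
Debug = False
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)

input_path = config['DEFAULT']['Input-Path']
output_path = config['DEFAULT']['Output-Path']
# Plain indexing returns the string "False", which is truthy in Python;
# getboolean() converts it to a real bool.
debug = config['DEFAULT'].getboolean('Debug')
print(input_path, output_path, debug)
```

Using `getboolean` is a design choice worth considering whenever `Debug` is later used in an `if` statement.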
Get all the PDF files from the input path.
onlyfiles = [f for f in os.listdir(input_path) if os.path.isfile(os.path.join(input_path, f))]
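The list comprehension above returns every regular file in the input folder. If the folder may also hold other files, filtering on the extension is one option. A minimal sketch; `list_pdfs` is a hypothetical helper name and the file names in the demonstration are made up.

```python
import os
import tempfile

def list_pdfs(folder):
    """Return the sorted names of regular files in `folder` ending in .pdf (case-insensitive)."""
    return sorted(
        name for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name)) and name.lower().endswith(".pdf")
    )

# Demonstration on a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("paper_a.pdf", "PAPER_B.PDF", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    demo_files = list_pdfs(tmp)
print(demo_files)
```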
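Next, we initialize the assistant so that it can use the Knowledge Retrieval tool. To do this, we specify a tool of type "retrieval" in the API call. We also specify the assistant's instructions and the OpenAI model to use.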
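import os
from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

# Upload the PDFs only if no files have been uploaded yet
my_file_ids = []
if client.files.list().data == []:
    for f in onlyfiles:
        file = client.files.create(
            file=open(os.path.join(input_path, f), "rb"),
            purpose='assistants'
        )
        my_file_ids.append(file.id)

# Create the assistant
assistant = client.beta.assistants.create(
    instructions="You are a publication database support chatbot. Use the uploaded pdf files to best respond to user queries, with output in JSON format.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    # Do not attach all files to the assistant: answers get mismatched
    # even when a file ID is specified in the query message.
    # We attach the file to each message individually instead.
)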
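We then specify the information to extract from the publication file and pass it to the assistant as user queries. In our experiments, asking for JSON output in every user message produced the most consistent results.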
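user_msgs = [
    "Print the title of this paper in JSON",
    "Print the authors of this paper in JSON",
    "Print the abstract section of this paper in JSON",
    "Print the keywords of this paper in JSON",
    "Print the DOI number of this paper in JSON",
    "Print the author affiliations of this paper in JSON",
    "Print the reference section of this paper in JSON",
]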
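The next step is to pass the queries to the assistant to generate the output. For each user query we create a separate thread object that contains the query as a user message. We then run the thread and retrieve the assistant's answer.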
import json
import logging
import time

all_results = []
for i in my_file_ids:
    print('\n#####')
    # JSON result for this file; hopefully each answer can be extracted and parsed
    file_result = {}
    for q in user_msgs:
        # Create a thread, a user message, and a run object for each query
        thread = client.beta.threads.create()
        msg = client.beta.threads.messages.create(
            thread_id=thread.id,
            role="user",
            content=q,
            file_ids=[i]  # the file/publication to extract from
        )
        print('\n', q)
        run = client.beta.threads.runs.create(
            thread_id=thread.id,
            assistant_id=assistant.id,
            additional_instructions="If the answer cannot be found, print 'False'"  # not very useful at the time of this demo
        )
        # Check the run status by retrieving the updated object each time
        while run.status in ["queued", 'in_progress']:
            print(run.status)
            time.sleep(5)
            run = client.beta.threads.runs.retrieve(
                thread_id=thread.id,
                run_id=run.id
            )
        # Usually a rate-limit error
        if run.status == 'failed':
            logging.info("Run failed: %s", run)
        if run.status == 'completed':
            print("<Complete>")
            # Fetch the updated message list, which also includes the user message
            messages = client.beta.threads.messages.list(
                thread_id=thread.id
            )
            for m in messages:
                if m.role == 'assistant':
                    value = m.content[0].text.value  # get the text response
                    if "json" not in value:
                        if value == 'False':
                            logging.info("No answer found for: %s", q)
                        else:
                            logging.info("Not JSON output; the answer may be missing from the file or the model may be outdated: %s", value)
                    else:
                        # Clean the response and try to parse it as JSON
                        value = value.split("```")[1].split('json')[-1].strip()
                        try:
                            d = json.loads(value)
                            file_result.update(d)
                            print(d)
                        except Exception as e:
                            logging.info("Query %s\nFailed to parse string as JSON: %s", q, e)
                            print(f"Query {q}\nFailed to parse string as JSON: ", str(e))
    all_results.append(file_result)
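The polling loop above keeps waiting for as long as the run stays queued or in progress. The same pattern can be given a timeout, as in this generic sketch. `wait_until_done`, its parameters, and the fake status source are illustrative assumptions and are not part of the OpenAI SDK.

```python
import time

def wait_until_done(get_status, pending=("queued", "in_progress"),
                    interval=5.0, timeout=300.0, sleep=time.sleep):
    """Call get_status() until it returns a non-pending status or the timeout elapses."""
    waited = 0.0
    status = get_status()
    while status in pending:
        if waited >= timeout:
            raise TimeoutError(f"still '{status}' after {waited:.0f}s")
        sleep(interval)
        waited += interval
        status = get_status()
    return status

# Demonstration with a fake status source instead of a real API call.
fake_statuses = iter(["queued", "in_progress", "in_progress", "completed"])
final_status = wait_until_done(lambda: next(fake_statuses), interval=0.0)
print(final_status)
```

With the real client, `get_status` could wrap the same `client.beta.threads.runs.retrieve(...)` call used in the loop and return its `.status`.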
The generated JSON output looks like this:
[{
  "title": "Dodes (diagnostic nodes) for Guideline Manipulation",
  "authors": [
    {"name": "PM Putora", "affiliation": "Department of Radiation-Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland"},
    {"name": "M Blattner", "affiliation": "Laboratory for Web Science, Zürich, Switzerland"},
    {"name": "A Papachristofilou", "affiliation": "Department of Radiation Oncology, University Hospital Basel, Basel, Switzerland"},
    {"name": "F Mariotti", "affiliation": "Laboratory for Web Science, Zürich, Switzerland"},
    {"name": "B Paoli", "affiliation": "Laboratory for Web Science, Zürich, Switzerland"},
    {"name": "L Plasswilma", "affiliation": "Department of Radiation-Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland"}
  ],
  "Abstract": {
    "Background": "Treatment recommendations (guidelines) are commonly represented in text form. Based on parameters (questions) recommendations are defined (answers).",
    "Objectives": "To improve handling, alternative forms of representation are required.",
    "Methods": "The concept of Dodes (diagnostic nodes) has been developed. Dodes contain answers and questions. Dodes are based on linked nodes and additionally contain descriptive information and recommendations. Dodes are organized hierarchically into Dode trees. Dode categories must be defined to prevent redundancy.",
    "Results": "A centralized and neutral Dode database can provide standardization which is a requirement for the comparison of recommendations. Centralized administration of Dode categories can provide information about diagnostic criteria (Dode categories) underutilized in existing recommendations (Dode trees).",
    "Conclusions": "Representing clinical recommendations in Dode trees improves their manageability handling and updateability."
  },
  "Keywords": ["dodes", "ontology", "semantic web", "guidelines", "recommendations", "linked nodes"],
  "DOI": "10.5166/jroi-2-1-6",
  "references": [
    {"ref_number": "[1]", "authors": "Mohler J Bahnson RR Boston B et al.", "title": "NCCN clinical practice guidelines in oncology: prostate cancer.", "source": "J Natl Compr Canc Netw.", "year": "2010 Feb", "volume_issue_pages": "8(2):162-200"},
    {"ref_number": "[2]", "authors": "Heidenreich A Aus G Bolla M et al.", "title": "EAU guidelines on prostate cancer.", "source": "Eur Urol.", "year": "2008 Jan", "volume_issue_pages": "53(1):68-80", "notes": "Epub 2007 Sep 19. Review."},
    {"ref_number": "[3]", "authors": "Fairchild A Barnes E Ghosh S et al.", "title": "International patterns of practice in palliative radiotherapy for painful bone metastases: evidence-based practice?", "source": "Int J Radiat Oncol Biol Phys.", "year": "2009 Dec 1", "volume_issue_pages": "75(5):1501-10", "notes": "Epub 2009 May 21."},
    {"ref_number": "[4]", "authors": "Lawrentschuk N Daljeet N Ma C et al.", "title": "Prostate-specific antigen test result interpretation when combined with risk factors for recommendation of biopsy: a survey of urologist's practice patterns.", "source": "Int Urol Nephrol.", "year": "2010 Jun 12", "notes": "Epub ahead of print"},
    {"ref_number": "[5]", "authors": "Parmelli E Papini D Moja L et al.", "title": "Updating clinical recommendations for breast colorectal and lung cancer treatments: an opportunity to improve methodology and clinical relevance.", "source": "Ann Oncol.", "year": "2010 Jul 19", "notes": "Epub ahead of print"},
    {"ref_number": "[6]", "authors": "Ahn HS Lee HJ Hahn S et al.", "title": "Evaluation of the Seventh American Joint Committee on Cancer/International Union Against Cancer Classification of gastric adenocarcinoma in comparison with the sixth classification.", "source": "Cancer.", "year": "2010 Aug 24", "notes": "Epub ahead of print"},
    {"ref_number": "[7]", "authors": "Rami-Porta R Goldstraw P.", "title": "Strength and weakness of the new TNM classification for lung cancer.", "source": "Eur Respir J.", "year": "2010 Aug", "volume_issue_pages": "36(2):237-9"},
    {"ref_number": "[8]", "authors": "Sinn HP Helmchen B Wittekind CH.", "title": "TNM classification of breast cancer: Changes and comments on the 7th edition.", "source": "Pathologe.", "year": "2010 Aug 15", "notes": "Epub ahead of print"},
    {"ref_number": "[9]", "authors": "Paleri V Mehanna H Wight RG.", "title": "TNM classification of malignant tumours 7th edition: what's new for head and neck?", "source": "Clin Otolaryngol.", "year": "2010 Aug", "volume_issue_pages": "35(4):270-2"},
    {"ref_number": "[10]", "authors": "Guarino N.", "title": "Formal Ontology and Information Systems", "source": "1998 IOS Press"},
    {"ref_number": "[11]", "authors": "Uschold M Gruniger M.", "title": "Ontologies: Principles Methods and Applications.", "source": "Knowledge Engineering Review", "year": "1996", "volume_issue_pages": "11(2)"},
    {"ref_number": "[12]", "authors": "Aho A Garey M Ullman J.", "title": "The Transitive Reduction of a Directed Graph.", "source": "SIAM Journal on Computing", "year": "1972", "volume_issue_pages": "1(2): 131–137"},
    {"ref_number": "[13]", "authors": "Tai K", "title": "The tree-to-tree correction problem.", "source": "Journal of the Association for Computing Machinery (JACM)", "year": "1979", "volume_issue_pages": "26(3):422-433"}
  ]
}]
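Output like this is parsed from the assistant's free-text reply, and the parsing code splits the reply on backtick fences, which assumes exactly one fenced block tagged `json`. A slightly more tolerant sketch uses a regular expression instead. `extract_json` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example itself stays readable

def extract_json(reply):
    """Parse JSON from the first fenced block in `reply`, or from the raw reply; return None on failure."""
    pattern = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)
    match = pattern.search(reply)
    candidate = match.group(1) if match else reply
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None

sample_reply = (
    "Here is the title:\n"
    + FENCE + "json\n"
    + '{"title": "Dodes (diagnostic nodes) for Guideline Manipulation"}\n'
    + FENCE
)
parsed = extract_json(sample_reply)
print(parsed)
```

Returning `None` rather than raising lets the caller decide whether to log the failure, retry the query, or skip the field.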
The file objects and the assistant object need to be cleaned up, because they incur costs in "retrieval" mode. It is also good coding practice.
# Delete all uploaded files
for f in client.files.list().data:
    client.files.delete(f.id)

# Retrieve and delete the running assistants
my_assistants = client.beta.assistants.list(order="desc")
for a in my_assistants.data:
    response = client.beta.assistants.delete(a.id)
    print(response)
The next step is to generate the graph visualization with the Python NetworkX [2] package.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()
node_colors = []

# Publication node
key = "jroi/" + all_results[0]['title']
G.add_nodes_from([(all_results[0]['title'], {'doi': all_results[0]['DOI'], 'title': all_results[0]['title'], 'source': 'jroi', 'key': key})])
node_colors.append('#4ba9dc')

# Author nodes, each linked from the publication
for author in all_results[0]['authors']:
    key = "jroi/" + author['name']
    G.add_nodes_from([(author['name'], {'key': key, 'local_id': author['name'], 'full_name': author['name'], 'source': 'jroi'})])
    G.add_edge(all_results[0]['title'], author['name'])
    node_colors.append('#63cc9e')

# Reference nodes, labeled with a shortened title and linked from the publication
for reference in all_results[0]['references']:
    key = "jroi/" + reference['title']
    G.add_nodes_from([(reference['title'].split('.')[0][:25] + '...', {'title': reference['title'], 'source': 'jroi', 'key': key})])
    G.add_edge(all_results[0]['title'], reference['title'].split('.')[0][:25] + '...')
    node_colors.append('#4ba9dc')

# Lay out, draw, and save the graph
pos = nx.spring_layout(G)
labels = nx.get_edge_attributes(G, 'label')
nx.draw(G, pos, with_labels=True, node_size=1000, node_color=node_colors, font_size=7, font_color='black')
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)
plt.savefig("graph_image.png")
plt.show()
The resulting graph visualization is shown below:
Graph visualization generated with NetworkX
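Independently of the plotting library, the same publication, author, and reference relationships can be held as plain node and edge lists, which are easy to write out as JSON. The sketch below assumes the result uses the keys `title`, `authors`, and `references` as in the output shown earlier. The `to_node_link` name and the node and edge field names are illustrative, not a fixed research graph schema.

```python
import json

def to_node_link(result, source="jroi"):
    """Convert one extraction result into {"nodes": [...], "edges": [...]}."""
    nodes, edges = {}, []
    pub_key = f"{source}/{result['title']}"
    nodes[pub_key] = {"key": pub_key, "type": "publication", "title": result["title"]}
    for author in result.get("authors", []):
        key = f"{source}/{author['name']}"
        # setdefault keeps the first node if the same author appears twice
        nodes.setdefault(key, {"key": key, "type": "researcher", "full_name": author["name"]})
        edges.append({"from": pub_key, "to": key, "label": "author"})
    for ref in result.get("references", []):
        key = f"{source}/{ref['title']}"
        nodes.setdefault(key, {"key": key, "type": "publication", "title": ref["title"]})
        edges.append({"from": pub_key, "to": key, "label": "cites"})
    return {"nodes": list(nodes.values()), "edges": edges}

sample = {
    "title": "Dodes (diagnostic nodes) for Guideline Manipulation",
    "authors": [{"name": "PM Putora"}, {"name": "M Blattner"}],
    "references": [{"title": "EAU guidelines on prostate cancer."}],
}
graph = to_node_link(sample)
print(json.dumps(graph, indent=2))
```

Keying nodes by a stable identifier avoids duplicate nodes and keeps any per-node attribute, such as a color, attached to the node itself.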
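Note: the structure of the output generated by OpenAI may vary from one run to the next, so you may need to adapt the code above to the structure you actually receive.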
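One simple way to absorb part of this variation is to normalize key names before using them; the output shown earlier, for example, mixes `title` with `Abstract`, `Keywords`, and `DOI`. A minimal sketch, in which `normalize_keys` is a hypothetical helper that lowercases keys recursively:

```python
def normalize_keys(value):
    """Recursively lowercase dictionary keys and replace spaces with underscores."""
    if isinstance(value, dict):
        return {str(k).strip().lower().replace(" ", "_"): normalize_keys(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_keys(v) for v in value]
    return value

raw = {
    "title": "Dodes (diagnostic nodes) for Guideline Manipulation",
    "DOI": "10.5166/jroi-2-1-6",
    "Keywords": ["dodes", "ontology"],
    "Abstract": {"Background": "Treatment recommendations (guidelines) are commonly represented in text form."},
}
clean = normalize_keys(raw)
print(sorted(clean))
```

This only handles differences in letter case and spacing. Renamed or restructured fields would still need explicit mapping.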
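In summary, using the GPT API to extract a research graph from PDF publications offers researchers and data analysts a powerful and efficient solution. The workflow simplifies the conversion of PDF publications into structured, accessible research graphs. We must nevertheless keep in mind that responses generated by large language models (LLMs) are not always consistent. Over time, accuracy and relevance can be improved further by regularly updating and refining the extraction model.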