[AI][Agent] Hands-On Practice
Published: 2024-06-07 06:07:16 | Views: 1779 | Source: 毛毛Post


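This time the focus is hands-on practice with Dify (open-source edition); the previous post was a high-level tour. It covers content scraping, the knowledge base, chart drawing, and calling workflows from outside Dify. Let's get into it!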

The result

Shown here is a visualization of a batch of websites grouped into predefined categories. The image was generated by Dify!

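How it works

The flow for this exercise: take a batch of websites, request the predefined category for each one (by calling Dify's website-classification workflow), then aggregate the results and call the visualization tool (Dify's visualization workflow).

Several tools are involved along the way; here is a quick summary.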

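1) Knowledge base

I uploaded a predefined category file containing just two columns, ID and Category.

After trying various settings, I ended up using the defaults for everything!

Creating the knowledge base

There are three retrieval modes: vector search, full-text search, and hybrid search!

In vector search, Top K is the number of most similar text chunks to return, and the score threshold filters chunks by similarity. Rerank reorders the retrieved chunks; Dify offers several rerank models, such as Cohere's.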
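To make Top K and the score threshold concrete, here is a minimal sketch of the idea in plain Python. It is not Dify's implementation: the `retrieve` and `cosine` helpers and the toy two-dimensional vectors are made up for illustration. The defaults `top_k=5` and `score_threshold=0.8` simply mirror the values that appear in the workflow DSL later in this post.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, chunks, top_k=5, score_threshold=0.8):
    """Return up to top_k (text, score) pairs scoring at least score_threshold, best first.

    chunks is a list of (text, vector) pairs; the vectors here are toy data.
    """
    scored = [(text, cosine(query_vec, vec)) for text, vec in chunks]
    kept = [(text, score) for text, score in scored if score >= score_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]

# Hypothetical chunks with hand-written embeddings.
chunks = [
    ("Shopping", [1.0, 0.0]),
    ("News", [0.0, 1.0]),
    ("Marketplace", [0.9, 0.1]),
]
results = retrieve([1.0, 0.0], chunks, top_k=2, score_threshold=0.8)
print([text for text, _ in results])
```

In this toy run, "News" is dropped by the threshold and the other two chunks come back ordered by similarity. A rerank model would then reorder whatever survives this step.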

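Full-text search finds matches by keyword over the full text. It makes up for the weak spots of vector search: very short queries, rarely used words, and specialized terminology are all things a vector database handles poorly. That is why hybrid search is the recommended option.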

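Retrieval mode at query time

As I understand it, the first mode (N-to-1) goes straight to a single text chunk after vector search, while the second (Multi-path) returns multiple text chunks and passes them, together with Rerank, to the next step for analysis. Logically, the second is less likely to miss information.

My actual experience:

Hybrid search returned inaccurate results, so I switched back to vector search.

Multi-path did not surface more accurate information either, so I switched back to N-to-1.

I first uploaded a txt file, but the LLM had trouble telling the fields apart, so I switched to uploading an Excel file.

PS: this is the experience from a single project and is for reference only; results may differ with other texts.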

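2) Content scraping

I tried jinareader and Web Scraper. With both, some pages yielded enough information, while others came back with a lot of irrelevant content.

My actual experience: I ended up using GoogleSearch, which worked relatively well.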

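3) Charts

I tried line, bar, and pie; this project uses the pie chart.

The reason is that the input-handling code for line and bar has a bug. Does it count as a bug? I went and read the source, which is Python, and I think it does.

Taking line as an example, this is the required input:

Input data: 2,3,4; 20,18,32          # groups separated by ASCII semicolons, values within a group by ASCII commas

Input axis: tag1;tag2;tag3;   # separated by ASCII semicolons

The code tries to convert the whole `2,3,4` group to a float directly, which raises an error; it should first split each group on the commas!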
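The fix described above can be sketched as follows. This is not Dify's actual source or the patch that was submitted; `parse_series` and `parse_axis` are hypothetical helper names, and the only thing taken from the post is the input format (semicolons between groups, commas within a group).

```python
def parse_series(data):
    """Split 'a,b,c; d,e,f' into groups first, then convert each value to float."""
    groups = [group.strip() for group in data.split(";") if group.strip()]
    return [
        [float(value) for value in group.split(",") if value.strip()]
        for group in groups
    ]

def parse_axis(axis):
    """Split 'tag1;tag2;tag3;' into labels, ignoring the trailing semicolon."""
    return [label.strip() for label in axis.split(";") if label.strip()]

# Calling float() on a whole group fails, which is the error described above.
try:
    float("2,3,4")
    whole_group_failed = False
except ValueError:
    whole_group_failed = True

series = parse_series("2,3,4; 20,18,32")
labels = parse_axis("tag1;tag2;tag3;")
print(whole_group_failed, series, labels)
```

Splitting on the semicolon and then on the comma keeps each group as its own series, so a line or bar chart can plot one series per group against the axis labels.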

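Suggestions:

I hope the Dify team can fix this, and also improve how the charts are presented. Right now they look so plain that I feel no urge to use them, haha.

4) Calling the API externally

This part is very convenient: once a workflow is published, it can be called right away. The logs are handy too; you can look at the trace to help troubleshoot problems.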
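As a rough illustration of what such a call involves, the sketch below builds the request and reads the result using the same shape as the local script later in this post (`inputs`, `response_mode`, `user`, and a `data.outputs` object in the response). The helper names `build_run_request` and `extract_output` are made up here, and the API key is a placeholder.

```python
def build_run_request(api_key, inputs, user="abc-123"):
    """Return (headers, body) for a blocking workflow run."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {"inputs": inputs, "response_mode": "blocking", "user": user}
    return headers, body

def extract_output(response_data, key="text"):
    """Read one workflow output, returning None if the run produced no outputs."""
    data = response_data.get("data") or {}
    outputs = data.get("outputs") or {}
    return outputs.get(key)

headers, body = build_run_request("xxxxxxxx", {"urls": "example.com"})

# The actual HTTP call is left as a comment so the sketch runs offline:
# response = requests.post("http://localhost/v1/workflows/run", headers=headers, json=body)
# category = extract_output(response.json())

# A hand-written response in the assumed shape, for illustration only.
fake_response = {"data": {"outputs": {"text": "Shopping"}}}
print(extract_output(fake_response))
```

Keeping the payload construction and the output lookup in small functions makes it easier to compare what was sent with what the trace in Dify's log shows.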

Hands-on

1. Workflow 1 DSL: classifying websites

app:
  description: websiteCat
  icon: "\U0001F916"
  icon_background: '#FFEAD5'
  mode: workflow
  name: websiteCat
workflow:
  features:
    file_upload:
      image:
        enabled: false
        number_limits: 3
        transfer_methods:
        - local_file
        - remote_url
    opening_statement: ''
    retriever_resource:
      enabled: false
    sensitive_word_avoidance:
      enabled: false
    speech_to_text:
      enabled: false
    suggested_questions: []
    suggested_questions_after_answer:
      enabled: false
    text_to_speech:
      enabled: false
      language: ''
      voice: ''
  graph:
    edges:
    - data:
        sourceType: start
        targetType: tool
      id: 1717578663813-1717579685305
      source: '1717578663813'
      sourceHandle: source
      target: '1717579685305'
      targetHandle: target
      type: custom
    - data:
        sourceType: tool
        targetType: llm
      id: 1717579685305-1717579853399
      source: '1717579685305'
      sourceHandle: source
      target: '1717579853399'
      targetHandle: target
      type: custom
    - data:
        sourceType: llm
        targetType: end
      id: 1717579967241-1717579117964
      source: '1717579967241'
      sourceHandle: source
      target: '1717579117964'
      targetHandle: target
      type: custom
    - data:
        sourceType: llm
        targetType: knowledge-retrieval
      id: 1717579853399-1717580066887
      source: '1717579853399'
      sourceHandle: source
      target: '1717580066887'
      targetHandle: target
      type: custom
    - data:
        sourceType: knowledge-retrieval
        targetType: llm
      id: 1717580066887-1717579967241
      source: '1717580066887'
      sourceHandle: source
      target: '1717579967241'
      targetHandle: target
      type: custom
    nodes:
    - data:
        desc: ''
        selected: false
        title: Start
        type: start
        variables:
        - label: urls
          max_length: 256
          options: []
          required: true
          type: text-input
          variable: urls
      height: 90
      id: '1717578663813'
      position:
        x: 112.4877286321854
        y: 197.44333138123648
      positionAbsolute:
        x: 112.4877286321854
        y: 197.44333138123648
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        outputs:
        - value_selector:
          - '1717579967241'
          - text
          variable: text
        selected: false
        title: End
        type: end
      height: 90
      id: '1717579117964'
      position:
        x: 799.2832195913106
        y: -74.46777660074163
      positionAbsolute:
        x: 799.2832195913106
        y: -74.46777660074163
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        # desc decodes to "Fetch website information"
        desc: "\u6293\u53D6\u7F51\u7AD9\u4FE1\u606F"
        provider_id: google
        provider_name: google
        provider_type: builtin
        selected: true
        title: GoogleSearch
        tool_configurations:
          result_type: link
        tool_label: GoogleSearch
        tool_name: google_search
        tool_parameters:
          query:
            type: mixed
            value: '{{#1717578663813.urls#}}'
        type: tool
      height: 120
      id: '1717579685305'
      position:
        x: 198.59871386335033
        y: -74.46777660074163
      positionAbsolute:
        x: 198.59871386335033
        y: -74.46777660074163
      selected: true
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        context:
          enabled: true
          variable_selector:
          - '1717579685305'
          - text
        # desc decodes to "Summarize website keywords"
        desc: "\u603B\u7ED3\u7F51\u7AD9\u5173\u952E\u8BCD"
        model:
          completion_params:
            temperature: 0
          mode: chat
          name: moonshot-v1-8k
          provider: moonshot
        prompt_template:
        - id: 4aa3aab4-11f5-41dc-a382-dd9831658f78
          role: system
          text: "Please extract the corresponding title or description keywords based\
            \ on the content of {{#1717579685305.text#}}\n\n1\uFF09Note. If this website\
            \ sells many different categories of things, it can be classified as shopping\n\
            2) Extract no more than 3 keywords, and the output format is:\n xx, xx,\
            \ xx\n"
        - id: dcd22bf7-6d91-4e6e-a843-ed266659d8d7
          role: user
          text: /
        selected: false
        title: LLM
        type: llm
        variables: []
        vision:
          enabled: false
      height: 128
      id: '1717579853399'
      position:
        x: 463.64815768773747
        y: 197.44333138123648
      positionAbsolute:
        x: 463.64815768773747
        y: 197.44333138123648
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        context:
          enabled: true
          variable_selector:
          - '1717580066887'
          - result
        # desc decodes to "Polish and output the final website label"
        desc: "\u6DA6\u8272\u7ED9\u51FA\u6700\u7EC8\u7684\u7F51\u7AD9\u6807\u7B7E"
        model:
          completion_params:
            temperature: 0
          mode: chat
          name: moonshot-v1-8k
          provider: moonshot
        prompt_template:
        - id: 0bb1c5a6-930e-4f74-be63-06474a77fe48
          role: system
          text: 'Please combine the website''s keywords{{#1717579853399.text#}} with
            the classification labels/proofreading retrieved from the knowledge base
            to provide the most suitable classification labels retrieved from the
            knowledge base{{#context#}},

            Attention:

            1) This tag should come from a tag retrieved from the knowledge base.
            Please select only the tag that best expresses this website and output
            only one category tag

            2) If you don''t know which category it is, just output Uncategorized

            3) Output only needs to output the website category without detailed
            explanation,without ID

            output format:xx'
        selected: false
        title: LLM 2
        type: llm
        variables: []
        vision:
          enabled: false
      height: 128
      id: '1717579967241'
      position:
        x: 799.2832195913106
        y: 197.44333138123648
      positionAbsolute:
        x: 799.2832195913106
        y: 197.44333138123648
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        dataset_ids:
        - 2307d992-5284-4c0b-b270-8436f497bfea
        # desc decodes to "Retrieve the existing website category labels"
        desc: "\u68C0\u7D22\u5DF2\u6709\u7684\u7F51\u7AD9\u5206\u7C7B\u6807\u7B7E"
        multiple_retrieval_config:
          reranking_model:
            model: rerank-english-v2.0
            provider: cohere
          score_threshold: 0.8
          top_k: 5
        query_variable_selector:
        - '1717579853399'
        - text
        retrieval_mode: single
        selected: false
        single_retrieval_config:
          model:
            completion_params: {}
            mode: chat
            name: moonshot-v1-8k
            provider: moonshot
        title: Knowledge Retrieval
        type: knowledge-retrieval
      height: 122
      id: '1717580066887'
      position:
        x: 503.45190546262575
        y: -74.46777660074163
      positionAbsolute:
        x: 503.45190546262575
        y: -74.46777660074163
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    viewport:
      x: 40.67810562789691
      y: 254.34055609621913
      zoom: 0.6607535491528895
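The edge list in this DSL is not written in execution order, so the flow is easier to see once the edges are walked from the start node. The sketch below does that with the node ids and titles copied from the DSL above; `linear_order` is a hypothetical helper and assumes the graph is a single chain with no branches, which holds for this workflow.

```python
# (source, target) pairs copied from the edges in the DSL above.
edges = [
    ("1717578663813", "1717579685305"),
    ("1717579685305", "1717579853399"),
    ("1717579967241", "1717579117964"),
    ("1717579853399", "1717580066887"),
    ("1717580066887", "1717579967241"),
]

# Node id -> title, copied from the nodes in the DSL above.
titles = {
    "1717578663813": "Start",
    "1717579685305": "GoogleSearch",
    "1717579853399": "LLM",
    "1717580066887": "Knowledge Retrieval",
    "1717579967241": "LLM 2",
    "1717579117964": "End",
}

def linear_order(edges):
    """Follow a branch-free chain of edges from its start node to its end node."""
    next_node = dict(edges)
    targets = set(next_node.values())
    # The start node is the only source that is never a target.
    node = next(source for source in next_node if source not in targets)
    order = [node]
    while node in next_node:
        node = next_node[node]
        order.append(node)
    return order

path = " -> ".join(titles[node] for node in linear_order(edges))
print(path)  # Start -> GoogleSearch -> LLM -> Knowledge Retrieval -> LLM 2 -> End
```

Read this way, the workflow searches for the site, has the first LLM node extract keywords, looks those keywords up in the knowledge base, and lets the second LLM node pick the final category.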

2. Workflow 2: counting categories and displaying them

app:
  description: Batch&Visusalization
  icon: "\U0001F916"
  icon_background: '#FFEAD5'
  mode: workflow
  name: Batch&Visusalization
workflow:
  features:
    file_upload:
      image:
        enabled: false
        number_limits: 3
        transfer_methods:
        - local_file
        - remote_url
    opening_statement: ''
    retriever_resource:
      enabled: false
    sensitive_word_avoidance:
      enabled: false
    speech_to_text:
      enabled: false
    suggested_questions: []
    suggested_questions_after_answer:
      enabled: false
    text_to_speech:
      enabled: false
      language: ''
      voice: ''
  graph:
    edges:
    - data:
        sourceType: start
        targetType: tool
      id: 1717587078340-1717596240088
      source: '1717587078340'
      sourceHandle: source
      target: '1717596240088'
      targetHandle: target
      type: custom
    - data:
        sourceType: tool
        targetType: end
      id: 1717596240088-1717587757492
      source: '1717596240088'
      sourceHandle: source
      target: '1717587757492'
      targetHandle: target
      type: custom
    nodes:
    - data:
        desc: ''
        selected: false
        title: Start
        type: start
        variables:
        - label: tags
          max_length: 33024
          options: []
          required: true
          type: paragraph
          variable: tags
        - label: count
          max_length: 256
          options: []
          required: true
          type: text-input
          variable: count
      height: 116
      id: '1717587078340'
      position:
        x: 92.06703286554381
        y: 80
      positionAbsolute:
        x: 92.06703286554381
        y: 80
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        provider_id: chart
        provider_name: chart
        provider_type: builtin
        selected: false
        title: Pie Chart
        tool_configurations: {}
        tool_label: Pie Chart
        tool_name: pie_chart
        tool_parameters:
          categories:
            type: mixed
            value: '{{#1717587078340.tags#}}'
          data:
            type: mixed
            value: '{{#1717587078340.count#}}'
        type: tool
      height: 54
      id: '1717596240088'
      position:
        x: 413.7520931625256
        y: 162.0976621428633
      positionAbsolute:
        x: 413.7520931625256
        y: 162.0976621428633
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        outputs:
        - value_selector:
          - '1717596240088'
          - files
          variable: text
        selected: false
        title: End
        type: end
      height: 90
      id: '1717587757492'
      position:
        x: 663.9837808989655
        y: 169.00026104872103
      positionAbsolute:
        x: 663.9837808989655
        y: 169.00026104872103
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    viewport:
      x: -76.8851566982803
      y: 181.9020021881131
      zoom: 1.0705699835378846

3. Local Python code

import requests
import csv
from collections import Counter
import webbrowser
import time

# Step 1: Read the sites.csv file (expects a 'site' column)
sites = []
with open('sites.csv', mode='r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for row in reader:
        sites.append(row['site'])

# Step 2: Request to get the category for each site
headers = {
    "Authorization": "Bearer xxxxxxxx",
    "Content-Type": "application/json"
}
categories = []
for site in sites:
    body = {
        "inputs": {"urls": site},
        "response_mode": "blocking",
        "user": "abc-123"
    }
    response = requests.post("http://localhost/v1/workflows/run", headers=headers, json=body)
    if response.status_code == 200:
        response_data = response.json()
        if response_data['data'] and response_data['data']['outputs']:
            category = response_data['data']['outputs']['text']
            categories.append(category)
        else:
            print(f"No 'outputs' in response for site: {site}")
            print(response_data)
    else:
        print(f"Failed to get category for site: {site}")
        print(response.text)
    #time.sleep(5)  # Wait for 5 seconds before the next request

# Step 3: Count the categories and sort them
category_counts = Counter(categories)
sorted_categories = sorted(category_counts.items(), key=lambda x: x[1], reverse=True)

# Form the required strings
sorted_sites = ";".join(site for site, count in sorted_categories)
sorted_counts = ";".join(str(count) for site, count in sorted_categories)

# Step 4: Call the visualization API
headers_vis = {
    "Authorization": "Bearer xxxxxxxxxxxx",
    "Content-Type": "application/json"
}
body_vis = {
    "inputs": {"tags": sorted_sites, "count": sorted_counts},
    "response_mode": "blocking",
    "user": "abc-123"
}
response_vis = requests.post("http://localhost/v1/workflows/run", headers=headers_vis, json=body_vis)
if response_vis.status_code == 200:
    response_data_vis = response_vis.json()
    if response_data_vis['data'] and response_data_vis['data']['outputs']:
        image_url = response_data_vis['data']['outputs']['text'][0]['url']
        print(f"Visualization URL: {image_url}")
        webbrowser.open(image_url)
    else:
        print("No 'outputs' in visualization response.")
        print(response_data_vis)
else:
    print("Failed to generate visualization.")
    print(response_vis.text)
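To show what Step 3 actually hands to the pie-chart workflow, here is a small offline example of the same counting and joining logic. The function name `to_pie_inputs` and the sample category labels are made up for illustration; the output keys `tags` and `count` match the Start variables of Workflow 2.

```python
from collections import Counter

def to_pie_inputs(categories):
    """Count category labels and build the semicolon-separated workflow inputs."""
    counts = Counter(categories)
    # Most frequent category first, as in the script above.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    tags = ";".join(tag for tag, _ in ranked)
    count = ";".join(str(n) for _, n in ranked)
    return {"tags": tags, "count": count}

# Hypothetical classification results for six sites.
sample = ["News", "Shopping", "News", "Video", "Shopping", "News"]
inputs = to_pie_inputs(sample)
print(inputs)  # {'tags': 'News;Shopping;Video', 'count': '3;2;1'}
```

The two strings stay aligned position by position, so the first tag pairs with the first count when the chart tool splits them on the semicolon.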

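Final thoughts

For this exercise, GoogleSearch's free API quota is 100 requests. Along the way I found Dify's line chart very hard to use; after digging through the Python handling I tracked down a small bug, and last night I submitted my first-ever PR on GitHub...

That made me a little happy. Having used it, I also hope Dify keeps getting better: workflows that can call each other, and maybe database support? A shining example of home-grown software. Looking forward to it!

Lately I have been reading philosophy and, surprisingly, I can actually get into it. I am a little impressed with myself...

The Dragon Boat Festival is almost here; wishing everyone a safe and healthy holiday!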

