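This time I focus on hands-on practice with Dify (open-source edition); last time was a high-level tour. It covers content scraping, the knowledge base, charting, and calling workflows from outside. Let's dig in!
The result
The result is a chart of a batch of websites grouped into predefined categories, and the image itself was generated through Dify!
How it works
The flow is: take a batch of websites, request each site's predefined category (by calling Dify's website classification workflow), aggregate the results, then call the visualization tool (Dify's chart workflow).
Several tools are involved along the way; here is a short summary of each.
1) Knowledge base
I uploaded a file of predefined categories with just two columns, ID and Category.
After trying a range of settings, I ended up using the defaults for everything!
Creating the knowledge base
There are three retrieval methods here: vector search, full-text search, and hybrid search.
In vector search, Top K is the number of most similar text chunks to retrieve, and the score threshold filters chunks by similarity. Rerank reorders the retrieved chunks; Dify offers several rerank models, for example Cohere's.
Full-text search finds chunks by keyword matching over the full text. It compensates for the weaknesses of vector search: very short queries, rare terms, and specialized terminology are all things a vector database handles poorly. That is why hybrid search is the recommended option.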
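To make the Top K, score threshold, and hybrid ideas concrete, here is a minimal sketch. It is not Dify's implementation: the bag-of-words cosine stands in for a real embedding model, and the function names and the `alpha` blending weight are my own assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for real text processing)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_score(query, chunk):
    """Fraction of query tokens that literally appear in the chunk."""
    q = set(tokenize(query))
    c = set(tokenize(chunk))
    return len(q & c) / len(q) if q else 0.0


def hybrid_search(query, chunks, top_k=2, score_threshold=0.1, alpha=0.5):
    """Blend a toy vector score with a keyword score, filter, then keep top_k."""
    query_vec = Counter(tokenize(query))
    scored = []
    for chunk in chunks:
        vec = cosine(query_vec, Counter(tokenize(chunk)))
        kw = keyword_score(query, chunk)
        score = alpha * vec + (1 - alpha) * kw
        if score >= score_threshold:  # the score threshold drops weak matches
            scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]  # Top K cut


rows = ["1 Shopping", "2 News", "3 Video streaming"]
print(hybrid_search("online shopping", rows))
```

With a short query like this, the keyword half of the score is what keeps an exact term match from being lost, which is the weakness of pure vector search described above.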
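Retrieval mode at query time
There are two modes here, N-to-1 and Multi-path. As I understand it from the Dify documentation, N-to-1 has the model pick the single best-matching knowledge base and retrieve from that one only, while Multi-path queries all attached knowledge bases, then reranks the returned text chunks before handing them to the next step for analysis. In principle, the second mode is less likely to miss information.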
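The difference between the two modes can be sketched as follows. This is only an illustration under my own assumptions: knowledge bases are modeled as plain search functions, `choose` stands in for the model that picks a knowledge base, and `overlap_rerank` stands in for a real rerank model. None of these names come from Dify.

```python
def make_kb(rows):
    """Build a toy search function returning rows that share a word with the query."""
    def search(query):
        words = set(query.lower().split())
        return [row for row in rows if words & set(row.lower().split())]
    return search


def overlap_rerank(query, hits):
    """Stand-in reranker: order hits by how many words they share with the query."""
    words = set(query.lower().split())
    return sorted(hits, key=lambda h: len(words & set(h.lower().split())), reverse=True)


def n_to_1(query, knowledge_bases, choose):
    """Query only the single knowledge base selected by `choose`."""
    name = choose(query, knowledge_bases)
    return knowledge_bases[name](query)


def multi_path(query, knowledge_bases, rerank, top_k=3):
    """Query every knowledge base, merge the hits, rerank them, keep top_k."""
    hits = []
    for search in knowledge_bases.values():
        hits.extend(search(query))
    return rerank(query, hits)[:top_k]


kbs = {
    "site_categories": make_kb(["Shopping", "News", "Video streaming"]),
    "product_faq": make_kb(["shopping cart help", "refund policy"]),
}
print(n_to_1("video shopping", kbs, lambda q, k: "site_categories"))
print(multi_path("video shopping", kbs, overlap_rerank))
```

With only one knowledge base attached, as in my project, the two modes have little room to differ, which may be part of why Multi-path did not help.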
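My hands-on experience:
Hybrid search returned inaccurate results, so I switched back to vector search.
Multi-path did not surface more accurate information either, so I switched back to N-to-1.
I first uploaded a txt file, but the LLM had trouble telling the fields apart, so I switched to uploading an Excel file.
PS: This is the experience from a single project and is for reference only; results may differ with other texts.
2) Content scraping
I tried jinareader and Web Scraper. In comparison, both sometimes captured enough information and sometimes pulled in a lot of irrelevant content.
My hands-on experience: I ended up using GoogleSearch, which worked comparatively well.
3) Charting
I tried line, bar, and pie; the local project uses pie (the pie chart).
That is because the input handling for line and bar has a bug. Does it count as a bug? I read the source, which is Python, and I think it does.
Taking line as an example, here is the required input:
Input data: 2,3,4; 20,18,32 # groups separated by ASCII semicolons, values within a group by ASCII commas
Input axis: tag1;tag2;tag3; # separated by ASCII semicolons
The code converts a whole group such as "2,3,4" to a float directly, which raises an error; it should first split each group on the commas!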
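For illustration, parsing that input format correctly could look like the sketch below. This is not Dify's source code nor the fix I submitted; `parse_series` and `parse_axis` are hypothetical helper names, written only to show the split order (semicolons first, then commas, then float conversion).

```python
def parse_series(data):
    """Parse '2,3,4; 20,18,32' into one list of floats per group."""
    series = []
    for group in data.split(";"):
        group = group.strip()
        if not group:  # tolerate a trailing semicolon
            continue
        series.append([float(v) for v in group.split(",") if v.strip()])
    return series


def parse_axis(axis):
    """Parse 'tag1;tag2;tag3;' into a list of labels, ignoring empty entries."""
    return [label.strip() for label in axis.split(";") if label.strip()]


print(parse_series("2,3,4; 20,18,32"))
print(parse_axis("tag1;tag2;tag3;"))

# Converting a whole group at once is what fails:
try:
    float("2,3,4")
except ValueError as exc:
    print("float('2,3,4') fails:", exc)
```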
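Suggestions from my experience:
I'd suggest the Dify team fix this, and also polish how the charts are rendered; right now they look so plain that there is little urge to use them, haha.
4) Calling the API externally
This part is very convenient: after each publish you can call the workflow right away. The logs are handy too, since you can view the trace to help troubleshoot.
Hands-on
1. Workflow 1 DSL: classifying websites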
app:
  description: websiteCat
  icon: "\U0001F916"
  icon_background: '#FFEAD5'
  mode: workflow
  name: websiteCat
workflow:
  features:
    file_upload:
      image:
        enabled: false
        number_limits: 3
        transfer_methods:
        - local_file
        - remote_url
    opening_statement: ''
    retriever_resource:
      enabled: false
    sensitive_word_avoidance:
      enabled: false
    speech_to_text:
      enabled: false
    suggested_questions: []
    suggested_questions_after_answer:
      enabled: false
    text_to_speech:
      enabled: false
      language: ''
      voice: ''
  graph:
    edges:
    - data:
        sourceType: start
        targetType: tool
      id: 1717578663813-1717579685305
      source: '1717578663813'
      sourceHandle: source
      target: '1717579685305'
      targetHandle: target
      type: custom
    - data:
        sourceType: tool
        targetType: llm
      id: 1717579685305-1717579853399
      source: '1717579685305'
      sourceHandle: source
      target: '1717579853399'
      targetHandle: target
      type: custom
    - data:
        sourceType: llm
        targetType: end
      id: 1717579967241-1717579117964
      source: '1717579967241'
      sourceHandle: source
      target: '1717579117964'
      targetHandle: target
      type: custom
    - data:
        sourceType: llm
        targetType: knowledge-retrieval
      id: 1717579853399-1717580066887
      source: '1717579853399'
      sourceHandle: source
      target: '1717580066887'
      targetHandle: target
      type: custom
    - data:
        sourceType: knowledge-retrieval
        targetType: llm
      id: 1717580066887-1717579967241
      source: '1717580066887'
      sourceHandle: source
      target: '1717579967241'
      targetHandle: target
      type: custom
    nodes:
    - data:
        desc: ''
        selected: false
        title: Start
        type: start
        variables:
        - label: urls
          max_length: 256
          options: []
          required: true
          type: text-input
          variable: urls
      height: 90
      id: '1717578663813'
      position:
        x: 112.4877286321854
        y: 197.44333138123648
      positionAbsolute:
        x: 112.4877286321854
        y: 197.44333138123648
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        outputs:
        - value_selector:
          - '1717579967241'
          - text
          variable: text
        selected: false
        title: End
        type: end
      height: 90
      id: '1717579117964'
      position:
        x: 799.2832195913106
        y: -74.46777660074163
      positionAbsolute:
        x: 799.2832195913106
        y: -74.46777660074163
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: "\u6293\u53D6\u7F51\u7AD9\u4FE1\u606F"
        provider_id: google
        provider_name: google
        provider_type: builtin
        selected: true
        title: GoogleSearch
        tool_configurations:
          result_type: link
        tool_label: GoogleSearch
        tool_name: google_search
        tool_parameters:
          query:
            type: mixed
            value: '{{#1717578663813.urls#}}'
        type: tool
      height: 120
      id: '1717579685305'
      position:
        x: 198.59871386335033
        y: -74.46777660074163
      positionAbsolute:
        x: 198.59871386335033
        y: -74.46777660074163
      selected: true
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        context:
          enabled: true
          variable_selector:
          - '1717579685305'
          - text
        desc: "\u603B\u7ED3\u7F51\u7AD9\u5173\u952E\u8BCD"
        model:
          completion_params:
            temperature: 0
          mode: chat
          name: moonshot-v1-8k
          provider: moonshot
        prompt_template:
        - id: 4aa3aab4-11f5-41dc-a382-dd9831658f78
          role: system
          text: "Please extract the corresponding title or description keywords based\
            \ on the content of {{#1717579685305.text#}}\n\n1\uFF09Note. If this website\
            \ sells many different categories of things, it can be classified as shopping\n\
            2) Extract no more than 3 keywords, and the output format is:\n xx, xx,\
            \ xx\n"
        - id: dcd22bf7-6d91-4e6e-a843-ed266659d8d7
          role: user
          text: /
        selected: false
        title: LLM
        type: llm
        variables: []
        vision:
          enabled: false
      height: 128
      id: '1717579853399'
      position:
        x: 463.64815768773747
        y: 197.44333138123648
      positionAbsolute:
        x: 463.64815768773747
        y: 197.44333138123648
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        context:
          enabled: true
          variable_selector:
          - '1717580066887'
          - result
        desc: "\u6DA6\u8272\u7ED9\u51FA\u6700\u7EC8\u7684\u7F51\u7AD9\u6807\u7B7E"
        model:
          completion_params:
            temperature: 0
          mode: chat
          name: moonshot-v1-8k
          provider: moonshot
        prompt_template:
        - id: 0bb1c5a6-930e-4f74-be63-06474a77fe48
          role: system
          text: 'Please combine the website''s keywords{{#1717579853399.text#}} with
            the classification labels/proofreading retrieved from the knowledge base
            to provide the most suitable classification labels retrieved from the
            knowledge base{{#context#}},
            Attention:
            This tag should come from a tag retrieved from the knowledge base.
            Please select only the tag that best expresses this website and output
            only one category tag
            If you don''t know which category it is, just output Uncategorized
            Output only needs to output the website category without detailed
            ID
            output format:
            xx'
        selected: false
        title: LLM 2
        type: llm
        variables: []
        vision:
          enabled: false
      height: 128
      id: '1717579967241'
      position:
        x: 799.2832195913106
        y: 197.44333138123648
      positionAbsolute:
        x: 799.2832195913106
        y: 197.44333138123648
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        dataset_ids:
        - 2307d992-5284-4c0b-b270-8436f497bfea
        desc: "\u68C0\u7D22\u5DF2\u6709\u7684\u7F51\u7AD9\u5206\u7C7B\u6807\u7B7E"
        multiple_retrieval_config:
          reranking_model:
            model: rerank-english-v2.0
            provider: cohere
          score_threshold: 0.8
          top_k: 5
        query_variable_selector:
        - '1717579853399'
        - text
        retrieval_mode: single
        selected: false
        single_retrieval_config:
          model:
            completion_params: {}
            mode: chat
            name: moonshot-v1-8k
            provider: moonshot
        title: Knowledge Retrieval
        type: knowledge-retrieval
      height: 122
      id: '1717580066887'
      position:
        x: 503.45190546262575
        y: -74.46777660074163
      positionAbsolute:
        x: 503.45190546262575
        y: -74.46777660074163
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    viewport:
      x: 40.67810562789691
      y: 254.34055609621913
      zoom: 0.6607535491528895
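The edges in the DSL are not listed in execution order, so it can help to walk them and print the chain. The sketch below copies the source/target ids and node titles from the DSL above into plain Python data (rather than parsing the YAML) and assumes the graph is a simple linear chain, which holds for this workflow; `execution_order` is a hypothetical helper name.

```python
# (source, target) pairs copied from the edges in the DSL above
edges = [
    ("1717578663813", "1717579685305"),
    ("1717579685305", "1717579853399"),
    ("1717579967241", "1717579117964"),
    ("1717579853399", "1717580066887"),
    ("1717580066887", "1717579967241"),
]

# Node id -> title, copied from the nodes in the DSL above
titles = {
    "1717578663813": "Start",
    "1717579685305": "GoogleSearch",
    "1717579853399": "LLM",
    "1717580066887": "Knowledge Retrieval",
    "1717579967241": "LLM 2",
    "1717579117964": "End",
}


def execution_order(edges, titles):
    """Follow a linear chain of edges, starting at the node with no incoming edge."""
    next_node = dict(edges)
    targets = set(next_node.values())
    current = next(node for node in titles if node not in targets)
    order = [titles[current]]
    while current in next_node:
        current = next_node[current]
        order.append(titles[current])
    return order


print(" -> ".join(execution_order(edges, titles)))
```

Read this way, the workflow searches for the site, has the first LLM extract keywords, retrieves candidate category labels from the knowledge base using those keywords, and lets the second LLM pick the final label.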
2. Workflow 2: counting categories and displaying them
app:
  description: Batch&Visusalization
  icon: "\U0001F916"
  icon_background: '#FFEAD5'
  mode: workflow
  name: Batch&Visusalization
workflow:
  features:
    file_upload:
      image:
        enabled: false
        number_limits: 3
        transfer_methods:
        - local_file
        - remote_url
    opening_statement: ''
    retriever_resource:
      enabled: false
    sensitive_word_avoidance:
      enabled: false
    speech_to_text:
      enabled: false
    suggested_questions: []
    suggested_questions_after_answer:
      enabled: false
    text_to_speech:
      enabled: false
      language: ''
      voice: ''
  graph:
    edges:
    - data:
        sourceType: start
        targetType: tool
      id: 1717587078340-1717596240088
      source: '1717587078340'
      sourceHandle: source
      target: '1717596240088'
      targetHandle: target
      type: custom
    - data:
        sourceType: tool
        targetType: end
      id: 1717596240088-1717587757492
      source: '1717596240088'
      sourceHandle: source
      target: '1717587757492'
      targetHandle: target
      type: custom
    nodes:
    - data:
        desc: ''
        selected: false
        title: Start
        type: start
        variables:
        - label: tags
          max_length: 33024
          options: []
          required: true
          type: paragraph
          variable: tags
        - label: count
          max_length: 256
          options: []
          required: true
          type: text-input
          variable: count
      height: 116
      id: '1717587078340'
      position:
        x: 92.06703286554381
        y: 80
      positionAbsolute:
        x: 92.06703286554381
        y: 80
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        provider_id: chart
        provider_name: chart
        provider_type: builtin
        selected: false
        title: Pie Chart
        tool_configurations: {}
        tool_label: Pie Chart
        tool_name: pie_chart
        tool_parameters:
          categories:
            type: mixed
            value: '{{#1717587078340.tags#}}'
          data:
            type: mixed
            value: '{{#1717587078340.count#}}'
        type: tool
      height: 54
      id: '1717596240088'
      position:
        x: 413.7520931625256
        y: 162.0976621428633
      positionAbsolute:
        x: 413.7520931625256
        y: 162.0976621428633
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        outputs:
        - value_selector:
          - '1717596240088'
          - files
          variable: text
        selected: false
        title: End
        type: end
      height: 90
      id: '1717587757492'
      position:
        x: 663.9837808989655
        y: 169.00026104872103
      positionAbsolute:
        x: 663.9837808989655
        y: 169.00026104872103
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    viewport:
      x: -76.8851566982803
      y: 181.9020021881131
      zoom: 1.0705699835378846
3. The local Python code
import csv
import time
import webbrowser
from collections import Counter

import requests

# Step 1: Read the sites.csv file (it must have a "site" column)
sites = []
with open('sites.csv', mode='r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for row in reader:
        sites.append(row['site'])

# Step 2: Request the category for each site (websiteCat workflow)
headers = {
    "Authorization": "Bearer xxxxxxxx",  # API key of the websiteCat app
    "Content-Type": "application/json"
}
categories = []
for site in sites:
    body = {
        "inputs": {"urls": site},
        "response_mode": "blocking",
        "user": "abc-123"
    }
    response = requests.post("http://localhost/v1/workflows/run", headers=headers, json=body)
    if response.status_code == 200:
        response_data = response.json()
        # Use .get so a response without 'data' or 'outputs' does not raise KeyError
        outputs = (response_data.get('data') or {}).get('outputs')
        if outputs:
            category = outputs['text'].strip()
            categories.append(category)
        else:
            print(f"No 'outputs' in response for site: {site}")
            print(response_data)
    else:
        print(f"Failed to get category for site: {site}")
        print(response.text)
    # time.sleep(5)  # Wait for 5 seconds before the next request

# Step 3: Count the categories and sort them, most frequent first
category_counts = Counter(categories)
sorted_categories = sorted(category_counts.items(), key=lambda x: x[1], reverse=True)
# Form the semicolon-separated strings that the pie chart tool expects
sorted_tags = ";".join(category for category, count in sorted_categories)
sorted_counts = ";".join(str(count) for category, count in sorted_categories)

# Step 4: Call the visualization API (Batch&Visusalization workflow)
headers_vis = {
    "Authorization": "Bearer xxxxxxxxxxxx",  # API key of the visualization app
    "Content-Type": "application/json"
}
body_vis = {
    "inputs": {"tags": sorted_tags, "count": sorted_counts},
    "response_mode": "blocking",
    "user": "abc-123"
}
response_vis = requests.post("http://localhost/v1/workflows/run", headers=headers_vis, json=body_vis)
if response_vis.status_code == 200:
    response_data_vis = response_vis.json()
    outputs_vis = (response_data_vis.get('data') or {}).get('outputs')
    if outputs_vis:
        # The End node exposes the chart tool's files under the variable name "text"
        image_url = outputs_vis['text'][0]['url']
        print(f"Visualization URL: {image_url}")
        webbrowser.open(image_url)
    else:
        print("No 'outputs' in visualization response.")
        print(response_data_vis)
else:
    print("Failed to generate visualization.")
    print(response_vis.text)
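Closing thoughts
A note on this exercise: GoogleSearch's free API quota is 100 requests, which the run draws on. Along the way I found Dify's line chart really hard to use, and after some digging through the Python handling I tracked down a small bug; last night I even opened my first PR on GitHub...
That was a small joy. Having used it, I hope Dify keeps getting more complete: nested workflow calls, and maybe database support? A homegrown standout; looking forward to it!
Lately I've been reading philosophy and, to my surprise, I can actually get into it; I'm a little impressed with myself...
The Dragon Boat Festival is almost here; wishing us all a peaceful festival!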
I have an idea to share with my wife this week, which is that collecting happiness can become a hobby, just like playing the piano.
Reference
https://docs.dify.ai/features/retrieval-augment/retrieval