
[AI] [Agent] Hands-on Practice
Published: 2024-06-07 06:07:16 Views: 1692


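This time I focus on hands-on practice with Dify (open-source edition); last time was only a high-level tour. It covers content scraping, the knowledge base, chart drawing, and calling workflows from outside Dify. Let's dig in!

The result

The result is a visualization of a batch of websites grouped into predefined categories. The image itself was generated by Dify.

How it works

The flow is: take a batch of websites, request each site's predefined category (by calling the Dify website-classification workflow), aggregate the results, then call the visualization tool (the Dify visualization workflow).

Several tools are involved; here is a short summary of each.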

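1) Knowledge base

I uploaded the predefined category file. It contains just two columns: ID and Category.

After trying various settings, I ended up using the defaults for everything.

Creating the knowledge base

There are three retrieval methods here: vector search, full-text search, and hybrid search.

In vector search, Top K is the number of most similar text chunks to return, and the score threshold filters out chunks whose similarity falls below it. Rerank reorders the retrieved chunks; Dify supports several rerank models, such as Cohere's.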
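To make Top K and the score threshold concrete, here is a minimal sketch in plain Python. It is not Dify's implementation: the `cosine` and `retrieve` helpers and the toy two-dimensional vectors are made up for illustration, and the defaults simply mirror the `top_k: 5` and `score_threshold: 0.8` values that appear in the workflow DSL later in this post.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, chunks, top_k=5, score_threshold=0.8):
    """Return up to top_k (score, text) pairs scoring at or above the threshold.

    chunks is a list of (text, vector) pairs; this stands in for a real
    vector store and is purely hypothetical.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]


if __name__ == "__main__":
    chunks = [
        ("Shopping", [1.0, 0.0]),
        ("News", [0.0, 1.0]),
        ("E-commerce", [0.9, 0.1]),
    ]
    for score, text in retrieve([1.0, 0.0], chunks, top_k=2):
        print(f"{text}: {score:.3f}")
```

The threshold is applied before the Top K cut here, so a chunk that is among the k closest but still below the threshold is dropped. That matches the description above, but the exact order inside Dify is an assumption.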

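Full-text search looks up chunks by keyword matching. It compensates for the weaknesses of vector search: very short queries, rare words, and specialized terms are all things vector databases handle poorly. That is why hybrid search is the recommended option.

Retrieval mode at query time

As I understand it, the first mode (N-to-1) settles on a single match after retrieval, while the second (Multi-path) returns multiple text chunks and passes them, reranked, to the next step for analysis. Logically, the second is less likely to miss information.

My experience in practice:

Hybrid search returned inaccurate results, so I switched back to vector search.

Multi-path did not surface more accurate information either, so I switched back to N-to-1.

I first uploaded a txt file, but the LLM had trouble telling the fields apart, so I re-uploaded the data as an Excel file.

PS: This is the experience from a single project, for reference only; results may differ with other texts.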

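2) Content scraping

I tried JinaReader and Web Scraper. In comparison, both sometimes fetched enough information and sometimes pulled in a lot of irrelevant content.

My experience in practice: I ended up using GoogleSearch, which worked comparatively well.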

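3) Charts

I tried the line, bar, and pie charts; this project uses the pie chart.

The reason is that the input handling for line and bar has a bug. Does it count as a bug? I read the source, which is Python, and I think it does.

Taking line as an example, here is the required input format:

Input data: 2,3,4; 20,18,32          # groups separated by semicolons, values within a group by commas

Input axis: tag1;tag2;tag3;   # separated by semicolons

The code converts a whole group such as 2,3,4 to a float directly, which raises an error; it should first split each group on its commas.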
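The parsing order described above can be sketched as follows. This is an illustration of the intended behavior, not Dify's source code or the actual patch: `parse_series` and `parse_axis` are hypothetical helper names, and the only assumption about the input is the format given above (semicolons between groups, commas within a group).

```python
def parse_series(data):
    """Split on semicolons first, then on commas, then convert to float."""
    groups = [group for group in data.split(";") if group.strip()]
    return [
        [float(value) for value in group.split(",") if value.strip()]
        for group in groups
    ]


def parse_axis(axis):
    """Split axis labels on semicolons, ignoring a trailing separator."""
    return [label.strip() for label in axis.split(";") if label.strip()]


if __name__ == "__main__":
    print(parse_series("2,3,4; 20,18,32"))
    print(parse_axis("tag1;tag2;tag3;"))
    try:
        float("2,3,4")  # what happens when a whole group is converted at once
    except ValueError as exc:
        print(f"ValueError: {exc}")
```

Calling `float()` on a whole group fails because Python does not accept comma-separated text as a number, which is the error described above.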

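My suggestion:

I'd suggest the Dify team fix this, and also polish how the charts are rendered. Right now they look so plain that I feel no urge to use them, haha.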

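4) Calling the API externally

This part is very convenient: after each publish you can call the workflow right away. The logs are handy too; you can inspect the trace to help troubleshoot.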
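For reference, a request to a published workflow can be assembled roughly like this. The endpoint path, the `inputs` / `response_mode` / `user` fields, and the `data.outputs` response shape are taken from the local script in this post; the helper names `build_workflow_request` and `extract_output` are hypothetical, and the sketch uses only the standard library so the request can be inspected without sending it.

```python
import json
import urllib.request


def build_workflow_request(base_url, api_key, inputs, user="abc-123"):
    """Build (but do not send) a POST request for a published workflow."""
    body = {"inputs": inputs, "response_mode": "blocking", "user": user}
    return urllib.request.Request(
        url=f"{base_url}/v1/workflows/run",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_output(response_data, key="text"):
    """Return data.outputs[key], or None when any level is missing or empty."""
    data = response_data.get("data") or {}
    outputs = data.get("outputs") or {}
    return outputs.get(key)


if __name__ == "__main__":
    request = build_workflow_request("http://localhost", "xxxxxxxx", {"urls": "example.com"})
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

`extract_output` returns `None` instead of raising when a run produces no outputs, which makes it easier to log the failing site and move on.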

Hands-on

1. Workflow 1 DSL: classifying websites

app:
  description: websiteCat
  icon: "\U0001F916"
  icon_background: '#FFEAD5'
  mode: workflow
  name: websiteCat
workflow:
  features:
    file_upload:
      image:
        enabled: false
        number_limits: 3
        transfer_methods:
        - local_file
        - remote_url
    opening_statement: ''
    retriever_resource:
      enabled: false
    sensitive_word_avoidance:
      enabled: false
    speech_to_text:
      enabled: false
    suggested_questions: []
    suggested_questions_after_answer:
      enabled: false
    text_to_speech:
      enabled: false
      language: ''
      voice: ''
  graph:
    edges:
    - data:
        sourceType: start
        targetType: tool
      id: 1717578663813-1717579685305
      source: '1717578663813'
      sourceHandle: source
      target: '1717579685305'
      targetHandle: target
      type: custom
    - data:
        sourceType: tool
        targetType: llm
      id: 1717579685305-1717579853399
      source: '1717579685305'
      sourceHandle: source
      target: '1717579853399'
      targetHandle: target
      type: custom
    - data:
        sourceType: llm
        targetType: end
      id: 1717579967241-1717579117964
      source: '1717579967241'
      sourceHandle: source
      target: '1717579117964'
      targetHandle: target
      type: custom
    - data:
        sourceType: llm
        targetType: knowledge-retrieval
      id: 1717579853399-1717580066887
      source: '1717579853399'
      sourceHandle: source
      target: '1717580066887'
      targetHandle: target
      type: custom
    - data:
        sourceType: knowledge-retrieval
        targetType: llm
      id: 1717580066887-1717579967241
      source: '1717580066887'
      sourceHandle: source
      target: '1717579967241'
      targetHandle: target
      type: custom
    nodes:
    - data:
        desc: ''
        selected: false
        title: Start
        type: start
        variables:
        - label: urls
          max_length: 256
          options: []
          required: true
          type: text-input
          variable: urls
      height: 90
      id: '1717578663813'
      position:
        x: 112.4877286321854
        y: 197.44333138123648
      positionAbsolute:
        x: 112.4877286321854
        y: 197.44333138123648
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        outputs:
        - value_selector:
          - '1717579967241'
          - text
          variable: text
        selected: false
        title: End
        type: end
      height: 90
      id: '1717579117964'
      position:
        x: 799.2832195913106
        y: -74.46777660074163
      positionAbsolute:
        x: 799.2832195913106
        y: -74.46777660074163
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: "\u6293\u53D6\u7F51\u7AD9\u4FE1\u606F"
        provider_id: google
        provider_name: google
        provider_type: builtin
        selected: true
        title: GoogleSearch
        tool_configurations:
          result_type: link
        tool_label: GoogleSearch
        tool_name: google_search
        tool_parameters:
          query:
            type: mixed
            value: '{{#1717578663813.urls#}}'
        type: tool
      height: 120
      id: '1717579685305'
      position:
        x: 198.59871386335033
        y: -74.46777660074163
      positionAbsolute:
        x: 198.59871386335033
        y: -74.46777660074163
      selected: true
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        context:
          enabled: true
          variable_selector:
          - '1717579685305'
          - text
        desc: "\u603B\u7ED3\u7F51\u7AD9\u5173\u952E\u8BCD"
        model:
          completion_params:
            temperature: 0
          mode: chat
          name: moonshot-v1-8k
          provider: moonshot
        prompt_template:
        - id: 4aa3aab4-11f5-41dc-a382-dd9831658f78
          role: system
          text: "Please extract the corresponding title or description keywords based\
            \ on the content of {{#1717579685305.text#}}\n\n1\uFF09Note. If this website\
            \ sells many different categories of things, it can be classified as shopping\n\
            2) Extract no more than 3 keywords, and the output format is:\n xx, xx,\
            \ xx\n"
        - id: dcd22bf7-6d91-4e6e-a843-ed266659d8d7
          role: user
          text: /
        selected: false
        title: LLM
        type: llm
        variables: []
        vision:
          enabled: false
      height: 128
      id: '1717579853399'
      position:
        x: 463.64815768773747
        y: 197.44333138123648
      positionAbsolute:
        x: 463.64815768773747
        y: 197.44333138123648
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        context:
          enabled: true
          variable_selector:
          - '1717580066887'
          - result
        desc: "\u6DA6\u8272\u7ED9\u51FA\u6700\u7EC8\u7684\u7F51\u7AD9\u6807\u7B7E"
        model:
          completion_params:
            temperature: 0
          mode: chat
          name: moonshot-v1-8k
          provider: moonshot
        prompt_template:
        - id: 0bb1c5a6-930e-4f74-be63-06474a77fe48
          role: system
          text: 'Please combine the website''s keywords{{#1717579853399.text#}} with
            the classification labels/proofreading retrieved from the knowledge base
            to provide the most suitable classification labels retrieved from the
            knowledge base{{#context#}},
            Attention:
            1) This tag should come from a tag retrieved from the knowledge base.
            Please select only the tag that best expresses this website and output
            only one category tag
            2) If you don''t know which category it is, just output Uncategorized
            3) Output only needs to output the website category  without detailed
            explanation,without ID
            output  format:
            xx'
        selected: false
        title: LLM 2
        type: llm
        variables: []
        vision:
          enabled: false
      height: 128
      id: '1717579967241'
      position:
        x: 799.2832195913106
        y: 197.44333138123648
      positionAbsolute:
        x: 799.2832195913106
        y: 197.44333138123648
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        dataset_ids:
        - 2307d992-5284-4c0b-b270-8436f497bfea
        desc: "\u68C0\u7D22\u5DF2\u6709\u7684\u7F51\u7AD9\u5206\u7C7B\u6807\u7B7E"
        multiple_retrieval_config:
          reranking_model:
            model: rerank-english-v2.0
            provider: cohere
          score_threshold: 0.8
          top_k: 5
        query_variable_selector:
        - '1717579853399'
        - text
        retrieval_mode: single
        selected: false
        single_retrieval_config:
          model:
            completion_params: {}
            mode: chat
            name: moonshot-v1-8k
            provider: moonshot
        title: Knowledge Retrieval
        type: knowledge-retrieval
      height: 122
      id: '1717580066887'
      position:
        x: 503.45190546262575
        y: -74.46777660074163
      positionAbsolute:
        x: 503.45190546262575
        y: -74.46777660074163
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    viewport:
      x: 40.67810562789691
      y: 254.34055609621913
      zoom: 0.6607535491528895

2. Workflow 2 DSL: visualizing category counts

app:
  description: Batch&Visusalization
  icon: "\U0001F916"
  icon_background: '#FFEAD5'
  mode: workflow
  name: Batch&Visusalization
workflow:
  features:
    file_upload:
      image:
        enabled: false
        number_limits: 3
        transfer_methods:
        - local_file
        - remote_url
    opening_statement: ''
    retriever_resource:
      enabled: false
    sensitive_word_avoidance:
      enabled: false
    speech_to_text:
      enabled: false
    suggested_questions: []
    suggested_questions_after_answer:
      enabled: false
    text_to_speech:
      enabled: false
      language: ''
      voice: ''
  graph:
    edges:
    - data:
        sourceType: start
        targetType: tool
      id: 1717587078340-1717596240088
      source: '1717587078340'
      sourceHandle: source
      target: '1717596240088'
      targetHandle: target
      type: custom
    - data:
        sourceType: tool
        targetType: end
      id: 1717596240088-1717587757492
      source: '1717596240088'
      sourceHandle: source
      target: '1717587757492'
      targetHandle: target
      type: custom
    nodes:
    - data:
        desc: ''
        selected: false
        title: Start
        type: start
        variables:
        - label: tags
          max_length: 33024
          options: []
          required: true
          type: paragraph
          variable: tags
        - label: count
          max_length: 256
          options: []
          required: true
          type: text-input
          variable: count
      height: 116
      id: '1717587078340'
      position:
        x: 92.06703286554381
        y: 80
      positionAbsolute:
        x: 92.06703286554381
        y: 80
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        provider_id: chart
        provider_name: chart
        provider_type: builtin
        selected: false
        title: Pie Chart
        tool_configurations: {}
        tool_label: Pie Chart
        tool_name: pie_chart
        tool_parameters:
          categories:
            type: mixed
            value: '{{#1717587078340.tags#}}'
          data:
            type: mixed
            value: '{{#1717587078340.count#}}'
        type: tool
      height: 54
      id: '1717596240088'
      position:
        x: 413.7520931625256
        y: 162.0976621428633
      positionAbsolute:
        x: 413.7520931625256
        y: 162.0976621428633
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        outputs:
        - value_selector:
          - '1717596240088'
          - files
          variable: text
        selected: false
        title: End
        type: end
      height: 90
      id: '1717587757492'
      position:
        x: 663.9837808989655
        y: 169.00026104872103
      positionAbsolute:
        x: 663.9837808989655
        y: 169.00026104872103
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    viewport:
      x: -76.8851566982803
      y: 181.9020021881131
      zoom: 1.0705699835378846

3. Local Python code

import csv
import time
import webbrowser
from collections import Counter

import requests

# Step 1: Read the site list from sites.csv (expects a "site" column)
sites = []
with open('sites.csv', mode='r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for row in reader:
        sites.append(row['site'])

# Step 2: Request the category for each site from the classification workflow
headers = {
    "Authorization": "Bearer xxxxxxxx",
    "Content-Type": "application/json"
}
categories = []
for site in sites:
    body = {
        "inputs": {"urls": site},
        "response_mode": "blocking",
        "user": "abc-123"
    }
    response = requests.post("http://localhost/v1/workflows/run", headers=headers, json=body)
    if response.status_code == 200:
        response_data = response.json()
        # Use .get so a missing "data" or "outputs" key does not raise KeyError
        outputs = (response_data.get('data') or {}).get('outputs')
        if outputs:
            category = outputs['text']
            categories.append(category)
        else:
            print(f"No 'outputs' in response for site: {site}")
            print(response_data)
    else:
        print(f"Failed to get category for site: {site}")
        print(response.text)
    # time.sleep(5)  # Uncomment to wait 5 seconds before the next request

# Step 3: Count the categories and sort them by count, descending
category_counts = Counter(categories)
sorted_categories = sorted(category_counts.items(), key=lambda x: x[1], reverse=True)

# Form the semicolon-separated strings the chart workflow expects
sorted_tags = ";".join(category for category, count in sorted_categories)
sorted_counts = ";".join(str(count) for category, count in sorted_categories)

# Step 4: Call the visualization workflow (it has its own API key)
headers_vis = {
    "Authorization": "Bearer xxxxxxxxxxxx",
    "Content-Type": "application/json"
}
body_vis = {
    "inputs": {"tags": sorted_tags, "count": sorted_counts},
    "response_mode": "blocking",
    "user": "abc-123"
}
response_vis = requests.post("http://localhost/v1/workflows/run", headers=headers_vis, json=body_vis)
if response_vis.status_code == 200:
    response_data_vis = response_vis.json()
    outputs_vis = (response_data_vis.get('data') or {}).get('outputs')
    if outputs_vis:
        # The End node maps the chart tool's "files" output to "text"
        image_url = outputs_vis['text'][0]['url']
        print(f"Visualization URL: {image_url}")
        webbrowser.open(image_url)
    else:
        print("No 'outputs' in visualization response.")
        print(response_data_vis)
else:
    print("Failed to generate visualization.")
    print(response_vis.text)
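To see what Step 3 sends to the pie chart workflow, here is the counting step isolated into a small function with made-up sample data. `build_chart_inputs` is a hypothetical name; the `tags` and `count` keys match the Start node variables in the Workflow 2 DSL.

```python
from collections import Counter


def build_chart_inputs(categories):
    """Count categories and join names and counts with semicolons."""
    counts = Counter(categories)
    # sorted() is stable, so ties keep their first-seen order
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {
        "tags": ";".join(name for name, _ in ordered),
        "count": ";".join(str(total) for _, total in ordered),
    }


if __name__ == "__main__":
    sample = ["Shopping", "News", "Shopping", "Tools", "News", "Shopping"]
    print(build_chart_inputs(sample))
    # → {'tags': 'Shopping;News;Tools', 'count': '3;2;1'}
```

Because the same sorted list produces both strings, the n-th tag always lines up with the n-th count.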

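Closing thoughts

For this exercise, the GoogleSearch free API quota is 100 requests. Along the way I found Dify's line chart hard to use; after digging through the Python code I tracked down a small bug, and last night I opened my first PR on GitHub...

That made me a bit happy. Having used it, I hope Dify keeps improving: nested workflow calls, maybe database support? It is a standout among Chinese-made tools, and I'm looking forward to what comes next.

Lately I've been reading philosophy and, surprisingly, actually getting into it. I'm a little impressed with myself...

The Dragon Boat Festival is almost here. Wishing everyone a safe and healthy festival!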

I have an idea to share with my wife this week, which is that collecting happiness can become a hobby, just like playing the piano.

Reference

https://docs.dify.ai/features/retrieval-augment/retrieval

