Multimodal Agents: Combining CrewAI, Groq, and Replicate AI

Published: 2024-09-29 07:24:44

Author: 大模型之路

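In artificial intelligence (AI), multimodal agents are attracting growing attention. These agents process and integrate information from different modalities (such as text, images, and speech) to carry out complex tasks. This article walks through building a multimodal AI agent with the CrewAI framework, Groq hardware acceleration, and models hosted on Replicate AI. The agent handles text-to-speech, text-to-image generation, image description, and web search.

Multimodal AI agents are designed to make AI systems more flexible and practical. By combining information across modalities, they can interpret user intent more accurately and produce responses that better fit the request. In this project, CrewAI organizes and manages several specialized agents, each with its own tools and capabilities.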

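System Architecture

CrewAI

CrewAI is an open-source framework for collaborating intelligent agents. It uses a multi-agent architecture that mirrors how a team of human experts works together on a complex problem (see "Multi-Agent Architecture: CrewAI Explained"). Users create and manage multiple agents with distinct skills and responsibilities, and CrewAI coordinates them toward a shared task goal.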

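Groq

Groq integrates its hardware and software tightly, most visibly in the LPU™ Inference Engine, which is built for fast and efficient AI inference. On large or compute-heavy workloads, the LPU™ can speed up AI computation considerably and so improve agent performance (see "Building a SQL Agent with CrewAI and Groq: Empowering the Future of Intelligent Data Analysis"). Groq also serves large language models (LLMs) such as Llama 3, which strengthens natural language processing and the interpretation of user intent.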

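Replicate AI

Replicate AI hosts a wide range of pretrained models, including text-to-speech, image generation, and image description models, that can be called directly when building a multimodal agent. Replicate AI also simplifies model deployment and scaling, which makes it easier to use the agent in real applications.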

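Tavily-Python

Tavily-Python is an open-source Python library for web search and information retrieval. In the multimodal agent, it runs web searches to fetch information relevant to the user's query.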

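Building the Multimodal Agent

Environment Setup

First, the agent needs a set of tools that let it interact safely and efficiently with its data sources. This means installing the required Python libraries, such as the client libraries for CrewAI, Groq, and Replicate AI, and then setting the API keys as environment variables so that the agent can reach each service.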
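Before any agent runs, it can help to confirm that the keys are actually present. The following is a minimal sketch using only the standard library. The helper name `missing_keys` is hypothetical, and the variable names are the ones used in the setup code later in this article.

```python
import os

# Environment variable names taken from the setup code in this article.
REQUIRED_KEYS = ("REPLICATE_API_TOKEN", "TAVILY_API_KEY", "GROQ_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not str(env.get(name, "")).strip()]


# Report on the current process environment.
missing = missing_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys are set.")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.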

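Architecture Design

The architecture should allow data from different modalities to be processed and integrated effectively. A multimodal agent typically contains several sub-agents, each responsible for one or more data types. For example, the following agents can be created:

Text processing agent: handles text data, including text-to-speech conversion and text analysis.
Image processing agent: handles image data, including image generation and image description.
Audio processing agent: handles audio data, including speech recognition and speech synthesis.
Web search agent: retrieves relevant information from the web to answer queries.

These agents work together under CrewAI's coordination (see "Multi-Agent Architecture: Exploring a New Era of AI Collaboration") to achieve complex task goals. For example, when a user enters a text description, the text processing agent can convert it to speech output, the image processing agent can generate a matching image from the description, and the audio processing agent can handle the user's spoken input to enable voice interaction.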
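The division of labor above can be written down as plain data before any framework is involved. The sketch below is illustrative only: `AgentSpec`, `SUB_AGENTS`, and `agents_for` are hypothetical names, not CrewAI classes, and the capability labels are assumptions derived from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Illustrative description of one sub-agent (not a CrewAI class)."""
    name: str
    modality: str
    capabilities: tuple


# One entry per sub-agent described in the architecture list.
SUB_AGENTS = (
    AgentSpec("text", "text", ("text_to_speech", "text_analysis")),
    AgentSpec("image", "image", ("image_generation", "image_description")),
    AgentSpec("audio", "audio", ("speech_recognition", "speech_synthesis")),
    AgentSpec("web_search", "text", ("web_search",)),
)


def agents_for(capability, agents=SUB_AGENTS):
    """Return the names of sub-agents that advertise a capability."""
    return [a.name for a in agents if capability in a.capabilities]


print(agents_for("image_generation"))  # ['image']
```

Keeping the capability list separate from the agent objects makes it straightforward to check, before wiring up a crew, that every task type has an owner.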

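Code Implementation

In the implementation, we first install the required dependencies, including CrewAI, Groq, Replicate AI, and Tavily-Python. We then set the API keys and create the tool functions.

Next, we define each agent's role and task. Every agent is given a specific role and a matching set of tools. For example, the text-to-speech agent uses a text-to-speech model on Replicate AI, while the image generation agent uses an image generation model on Replicate AI.

Finally, we set up the Router Agent, which analyzes the user's query and decides the next action based on its content. It assigns the task to the appropriate agent, collects the output, and produces the response the user asked for.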
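The route-then-dispatch flow described above can be sketched without any external service. In this sketch, keyword rules stand in for the LLM-based router and the handlers are stubs. The function names `route` and `answer` and the keyword rules are assumptions for illustration, not the article's implementation. The four labels match the ones the article's code returns.

```python
def route(question: str) -> str:
    """Stand-in router: keyword rules instead of an LLM call."""
    q = question.lower()
    if "generate an image" in q or "draw" in q:
        return "text2image"
    if "describe" in q and "image" in q:
        return "image2text"
    if "speech" in q or "read aloud" in q:
        return "text2speech"
    return "web_search"


# Stub handlers; in the real project each would call a Replicate model or Tavily.
HANDLERS = {
    "text2image": lambda q, url: f"<image for: {q}>",
    "image2text": lambda q, url: f"<description of: {url}>",
    "text2speech": lambda q, url: f"<audio for: {q}>",
    "web_search": lambda q, url: f"<search results for: {q}>",
}


def answer(question: str, image_url: str = "") -> str:
    """Route the question, then hand it to the matching handler."""
    label = route(question)
    handler = HANDLERS.get(label, HANDLERS["web_search"])
    return handler(question, image_url)


print(answer("Convert this text to speech: hello"))
```

A dictionary of handlers keeps the dispatch step separate from the routing decision, so either side can be replaced independently.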

Install the required dependencies

!pip install -qU langchain langchain_community tavily-python langchain-groq groq replicate
!pip install -qU crewai 'crewai[tools]'

Set the API keys

import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
os.environ['REPLICATE_API_TOKEN'] = userdata.get('REPLICATE_API_TOKEN')
os.environ['TAVILY_API_KEY'] = userdata.get('TAVILY_API_KEY')
os.environ['GROQ_API_KEY'] = userdata.get('GROQ_API_KEY')

Create the tools, agents, tasks, and helper functions (see the code comments for what each function does):

## Create the web search tool
from langchain_community.tools.tavily_search import TavilySearchResults

def web_search_tool(question: str) -> str:
    """This tool is useful when we want web search for current events."""
    # Step 1: Instantiate the Tavily search tool (it reads TAVILY_API_KEY from the environment)
    websearch = TavilySearchResults()
    # Step 2: Perform a search query
    response = websearch.invoke({"query": question})
    return response

## Create the text to speech tool
import replicate

def text2speech(text: str) -> str:
    """This tool is useful when we want to convert text to speech."""
    output = replicate.run(
        "cjwbw/seamless_communication:668a4fec05a887143e5fe8d45df25ec4c794dd43169b9a11562309b2d45873b0",
        input={
            "task_name": "T2ST (Text to Speech translation)",
            "input_text": text,
            "input_text_language": "English",
            "max_input_audio_length": 60,
            "target_language_text_only": "English",
            "target_language_with_speech": "English",
        },
    )
    return output["audio_output"]

## Create the text to image tool
def text2image(text: str) -> str:
    """This tool is useful when we want to generate images from textual descriptions."""
    output = replicate.run(
        "xlabs-ai/flux-dev-controlnet:f2c31c31d81278a91b2447a304dae654c64a5d5a70340fba811bb1cbd41019a2",
        input={
            "steps": 28,
            "prompt": text,
            "lora_url": "",
            "control_type": "depth",
            "control_image": "https://replicate.delivery/pbxt/LUSNInCegT0XwStCCJjXOojSBhPjpk2Pzj5VNjksiP9cER8A/ComfyUI_02172_.png",
            "lora_strength": 1,
            "output_format": "webp",
            "guidance_scale": 2.5,
            "output_quality": 100,
            "negative_prompt": "low quality, ugly, distorted, artefacts",
            "control_strength": 0.45,
            "depth_preprocessor": "DepthAnything",
            "soft_edge_preprocessor": "HED",
            "image_to_image_strength": 0,
            "return_preprocessed_image": False,
        },
    )
    print(output)
    return output[0]
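The `text2image` function returns `output[0]` and the run log in this article shows that item as a plain URL string. Depending on the installed version of the `replicate` client, the items may instead be file-like objects that expose a `url` attribute. That is an assumption to check against your installed version. A defensive sketch that accepts either shape, with hypothetical helper names `to_url` and `first_url`:

```python
def to_url(item) -> str:
    """Return a URL string from either a plain string or an object with `.url`."""
    url = getattr(item, "url", None)
    if callable(url):  # tolerate `url` being exposed as a method
        url = url()
    return str(url) if url else str(item)


def first_url(output) -> str:
    """Pick the first result from a model output and normalize it to a URL string."""
    if isinstance(output, (list, tuple)):
        if not output:
            raise ValueError("model returned no outputs")
        return to_url(output[0])
    return to_url(output)


print(first_url(["https://example.com/a.webp"]))  # https://example.com/a.webp
```

Normalizing at the tool boundary means the downstream display code always receives a string, whichever client version produced the result.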

## Create the image to text tool
def image2text(image_url: str, prompt: str) -> str:
    """This tool is useful when we want to generate textual descriptions from images."""
    output = replicate.run(
        "yorickvp/llava-13b:80537f9eead1a5bfa72d5ac6ea6414379be41d4d4f6679fd776e9535d1eb58bb",
        input={
            "image": image_url,
            "top_p": 1,
            "prompt": prompt,
            "max_tokens": 1024,
            "temperature": 0.2,
        },
    )
    # The model streams its answer in chunks; join them into one string
    return "".join(output)

from crewai_tools import tool

## Router Tool
@tool("router tool")
def router_tool(question: str) -> str:
    """Router Function"""
    prompt = f"""Based on the Question provide below determine the following:
1. Is the question directed at generating image ?
2. Is the question directed at describing the image ?
3. Is the question directed at converting text to speech?.
4. Is the question a generic one and needs to be answered searching the web?
Question: {question}

RESPONSE INSTRUCTIONS:
- Answer either 1 or 2 or 3 or 4.
- Answer should strictly be a string.
- Do not provide any preamble or explanations except for 1 or 2 or 3 or 4.

OUTPUT FORMAT:
1
"""
    # `llm` is the ChatGroq instance created in the LLM setup step;
    # strip whitespace so a reply such as "1\n" still matches
    response = llm.invoke(prompt).content.strip()
    if response == "1":
        return 'text2image'
    elif response == "3":
        return 'text2speech'
    elif response == "4":
        return 'web_search'
    else:
        return 'image2text'
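The router tool compares the model's reply to exact strings, so a reply such as "Answer: 3." would fall through to the final `else` branch and be treated as `image2text`. The sketch below shows one way to parse the reply more tolerantly. `parse_route` is a hypothetical helper, and defaulting to `web_search` is a design assumption that follows the Router agent's backstory ("Otherwise, use web-search") rather than the original `else` branch.

```python
import re

# Mapping between the numbered options in the router prompt and the route labels.
ROUTES = {"1": "text2image", "2": "image2text", "3": "text2speech", "4": "web_search"}


def parse_route(reply: str, default: str = "web_search") -> str:
    """Extract a standalone digit 1-4 from an LLM reply; fall back to `default`."""
    match = re.search(r"(?<!\d)[1-4](?!\d)", reply)
    return ROUTES[match.group()] if match else default


print(parse_route(" 3.\n"))  # text2speech
```

The lookarounds keep multi-digit numbers such as "10" from being read as option 1.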

## Retriever Tool
@tool("retriver tool")
def retriver_tool(router_response: str, question: str, image_url: str) -> str:
    """Retriver Function"""
    if router_response == 'text2image':
        return text2image(question)
    elif router_response == 'text2speech':
        return text2speech(question)
    elif router_response == 'image2text':
        return image2text(image_url, question)
    else:
        return web_search_tool(question)

## Set up the LLM
from langchain_groq import ChatGroq

llm = ChatGroq(
    model_name="llama-3.1-70b-versatile",
    temperature=0.1,
    max_tokens=1000,
)

## Create the Router agent
from crewai import Agent

Router_Agent = Agent(
    role='Router',
    goal='Route user question to a text to image or text to speech or web search',
    backstory=(
        "You are an expert at routing a user question to a text to image or text to speech or web search."
        "Use the text to image to generate images from textual descriptions."
        "Use the text to speech to convert text to speech."
        "Use the image to text to generate text describing the image based on the textual description."
        "Use the web search to search for current events."
        "You do not need to be stringent with the keywords in the question related to these topics. Otherwise, use web-search."
    ),
    verbose=True,
    allow_delegation=False,
    llm=llm,
    tools=[router_tool],
)

## Create the Retriever agent
Retriever_Agent = Agent(
    role="Retriever",
    goal="Use the information retrieved from the Router to answer the question and image url provided.",
    backstory=(
        "You are an assistant for directing tasks to respective agents based on the response from the Router."
        "Use the information from the Router to perform the respective task."
        "Do not provide any other explanation"
    ),
    verbose=True,
    allow_delegation=False,
    llm=llm,
    tools=[retriver_tool],
)

## Create the router task
from crewai import Task

router_task = Task(
    description=(
        "Analyse the keywords in the question {question}"
        "If the question {question} instructs to describe a image then use the image url {image_url} to generate a detailed and high quality images covering all the nuances secribed in the textual descriptions provided in the question {question}."
        "Based on the keywords decide whether it is eligible for a text to image or text to speech or web search."
        "Return a single word 'text2image' if it is eligible for generating images from textual description."
        "Return a single word 'text2speech' if it is eligible for converting text to speech."
        "Return a single word 'image2text' if it is eligible for describing the image based on the question {question} and iamge url{image_url}."
        "Return a single word 'web_search' if it is eligible for web search."
        "Do not provide any other premable or explaination."
    ),
    expected_output=(
        "Give a choice 'web_search' or 'text2image' or 'text2speech'or 'image2text' based on the question {question} and image url {image_url}"
        "Do not provide any preamble or explanations except for 'text2image' or 'text2speech' or 'web_search' or 'image2text'."
    ),
    agent=Router_Agent,
)
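Two details of the task definition are worth seeing in isolation. Adjacent Python string literals are joined with nothing between them, and the `{question}` placeholders are filled from the inputs passed at kickoff. The sketch below uses `str.format` as a stand-in for CrewAI's own interpolation, which is an assumption for illustration, and the strings are shortened examples rather than the article's prompts.

```python
# Adjacent string literals are concatenated with no separator.
fused = (
    "Analyse the keywords in the question {question}"
    "Return a single word."
)

# Ending each sentence with punctuation and a space keeps them apart.
spaced = (
    "Analyse the keywords in the question {question}. "
    "Return a single word."
)

inputs = {"question": "neon pink visor"}

print(fused.format(**inputs))
# Analyse the keywords in the question neon pink visorReturn a single word.
print(spaced.format(**inputs))
# Analyse the keywords in the question neon pink visor. Return a single word.
```

This is why the run log in this article shows fragments such as "visorIf the question": the first literal in the description has no trailing separator.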

## Create the retriever task
retriever_task = Task(
    description=(
        "Based on the response from the 'router_task' generate response for the question {question} with the help of the respective tool."
        "Use the web_serach_tool to retrieve information from the web in case the router task output is 'web_search'."
        "Use the text2speech tool to convert the test to speech in english in case the router task output is 'text2speech'."
        "Use the text2image tool to convert the test to speech in english in case the router task output is 'text2image'."
        "Use the image2text tool to describe the image provide in the image url in case the router task output is 'image2text'."
    ),
    expected_output=(
        "You should analyse the output of the 'router_task'"
        "If the response is 'web_search' then use the web_search_tool to retrieve information from the web."
        "If the response is 'text2image' then use the text2image tool to generate a detailed and high quality images covering all the nuances secribed in the textual descriptions provided in the question {question}."
        "If the response is 'text2speech' then use the text2speech tool to convert the text provided in the question {question} to speech"
        "If the response is 'image2text' then use the 'image2text' tool to describe the image based on the question {question} and {image_url}."
    ),
    agent=Retriever_Agent,
    # The router task's output is passed to this task as context
    context=[router_task],
)

## Set up the crew
from crewai import Crew, Process

crew = Crew(
    agents=[Router_Agent, Retriever_Agent],
    tasks=[router_task, retriever_task],
    verbose=True,
)

## Run the crew
inputs = {
    "question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor",
    "image_url": " ",
}
result = crew.kickoff(inputs=inputs)
######################Response#############################
[2024-08-25 04:14:22][DEBUG]: == Working Agent: Router
[2024-08-25 04:14:22][INFO]: == Starting Task: Analyse the keywords in the question Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visorIf the question Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor instructs to describe a image then use the image url to generate a detailed and high quality images covering all the nuances secribed in the textual descriptions provided in the question Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor.Based on the keywords decide whether it is eligible for a text to image or text to speech or web search.Return a single word 'text2image' if it is eligible for generating images from textual description.Return a single word 'text2speech' if it is eligible for converting text to speech.Return a single word 'image2text' if it is eligible for describing the image based on the question Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor and iamge url .Return a single word 'web_search' if it is eligible for web search.Do not provide any other premable or explaination.

> Entering new CrewAgentExecutor chain...
Thought: The question contains keywords like "Generate an image based upon this text" and a detailed description of the image, so it seems like the user wants to generate an image from the given text.
Action: router tool
Action Input: {"question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor"}
text2image
Thought: The question contains keywords like "Generate an image based upon this text" and a detailed description of the image, so it seems like the user wants to generate an image from the given text.
Action: router tool
Action Input: {"question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor"}
I tried reusing the same input, I must stop using this action input. I'll try something else instead.

Thought: The question contains keywords like "Generate an image based upon this text" and a detailed description of the image, so it seems like the user wants to generate an image from the given text.
Action: router tool
Action Input: {"question": "a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor"}
text2image
Thought: I now know the final answer
Final Answer: text2image
> Finished chain.
[2024-08-25 04:14:26][DEBUG]: == [Router] Task output: text2image

[2024-08-25 04:14:26][DEBUG]: == Working Agent: Retriever
[2024-08-25 04:14:26][INFO]: == Starting Task: Based on the response from the 'router_task' generate response for the question Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor with the help of the respective tool.Use the web_serach_tool to retrieve information from the web in case the router task output is 'web_search'.Use the text2speech tool to convert the test to speech in english in case the router task output is 'text2speech'.Use the text2image tool to convert the test to speech in english in case the router task output is 'text2image'.Use the image2text tool to describe the image provide in the image url in case the router task output is 'image2text'.

> Entering new CrewAgentExecutor chain...
Thought: I need to use the information from the Router to determine the task to perform.
Action: retriver tool
Action Input: {"router_response": "text2image", "question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor", "image_url": ""}
['https://replicate.delivery/yhqm/XjBShO4PSexSSaThOCnZoDl4rYeq1pNAZNaKIuvi3mvFHGWTA/R8_FLUX_XLABS_00001_.webp']
https://replicate.delivery/yhqm/XjBShO4PSexSSaThOCnZoDl4rYeq1pNAZNaKIuvi3mvFHGWTA/R8_FLUX_XLABS_00001_.webp
Thought: I need to use the information from the Router to determine the task to perform.
Action: retriver tool
Action Input: {"router_response": "text2image", "question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor", "image_url": ""}
I tried reusing the same input, I must stop using this action input. I'll try something else instead.

Thought: I need to use the information from the Router to determine the task to perform.
Action: retriver tool
Action Input: {"router_response": "text2image", "question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor", "image_url": ""}
I tried reusing the same input, I must stop using this action input. I'll try something else instead.

Thought: I need to use the information from the Router to determine the task to perform.
Action: retriver tool
Action Input: {"router_response": "text2image", "question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor", "image_url": ""}
I tried reusing the same input, I must stop using this action input. I'll try something else instead.

Thought: I now know the final answer
Final Answer: https://replicate.delivery/yhqm/XjBShO4PSexSSaThOCnZoDl4rYeq1pNAZNaKIuvi3mvFHGWTA/R8_FLUX_XLABS_00001_.webp
> Finished chain.
[2024-08-25 04:15:07][DEBUG]: == [Retriever] Task output: https://replicate.delivery/yhqm/XjBShO4PSexSSaThOCnZoDl4rYeq1pNAZNaKIuvi3mvFHGWTA/R8_FLUX_XLABS_00001_.webp
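The log shows the agent issuing the same tool call several times, with the framework refusing the repeats. If a tool function can be invoked again with identical arguments, memoizing it avoids paying for the same model call twice. The sketch below uses `functools.lru_cache` from the standard library. `expensive_generate` is a hypothetical stand-in for a paid call such as `text2image`, and whether this is needed on top of the framework's own handling is an assumption to verify.

```python
from functools import lru_cache

# Counts how many times the underlying (expensive) call really runs.
calls = {"count": 0}


def expensive_generate(prompt: str) -> str:
    """Hypothetical stand-in for a paid model call such as text2image."""
    calls["count"] += 1
    return f"generated:{prompt}"


@lru_cache(maxsize=128)
def cached_generate(prompt: str) -> str:
    """Return the cached result when the same prompt is requested again."""
    return expensive_generate(prompt)


first = cached_generate("neon pink visor")
second = cached_generate("neon pink visor")
print(first == second, calls["count"])  # True 1
```

`lru_cache` requires hashable arguments, which holds for the string parameters used by the tools in this article.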

import requests
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt

# URL of the image
image_url = result.raw

# Fetch the image
response = requests.get(image_url)

# Check if the request was successful
if response.status_code == 200:
    # Open the image using PIL
    img = Image.open(BytesIO(response.content))

    # Display the image using matplotlib
    plt.imshow(img)
    plt.axis('off')  # Hide the axis
    plt.show()
else:
    print("Failed to retrieve image. Status code:", response.status_code)
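The display code assumes that `result.raw` holds an image URL, which is only the case when the question was routed to `text2image`. For the other routes it would contain search results, a description, or an audio link. A small standard-library check before fetching can guard against that. `looks_like_image_url` is a hypothetical helper, and the suffix list is an assumption.

```python
from urllib.parse import urlparse

# File extensions treated as images in this sketch.
IMAGE_SUFFIXES = (".webp", ".png", ".jpg", ".jpeg", ".gif")


def looks_like_image_url(value: str) -> bool:
    """Return True for http(s) URLs whose path ends in a known image suffix."""
    parsed = urlparse(value.strip())
    return (
        parsed.scheme in ("http", "https")
        and bool(parsed.netloc)
        and parsed.path.lower().endswith(IMAGE_SUFFIXES)
    )


print(looks_like_image_url("https://example.com/out/image_00001.webp"))  # True
print(looks_like_image_url("text2image"))  # False
```

Calling this before `requests.get` lets the notebook print the raw answer instead of failing when the crew returned something other than an image.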

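Application Scenarios

Autonomous Driving

In autonomous driving, a multimodal agent can process several types of data from vehicle sensors (such as cameras, radar, and lidar) to support fuller environmental perception and decision-making. For example, an agent can process image and audio data together to identify pedestrians, vehicles, and obstacles on the road, and then use that information to decide on obstacle avoidance or lane changes.

Virtual Assistants

In virtual assistants, multimodal agents enable more natural and intelligent interaction. An agent can process both text and voice input, understand the user's intent and needs, and respond with answers and suggestions. It can also use image data such as facial expressions and gestures to better gauge the user's mood and needs and offer more personalized service.

By combining the CrewAI framework, Groq hardware acceleration, and Replicate AI models, we built a multimodal AI agent that handles several complex tasks: text-to-speech, text-to-image generation, image description, and web search. This design makes AI systems more flexible and practical and opens up a wide range of future AI applications (see "LLM Agents in Business: Exploring the New Frontier of Autonomous Intelligence").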
