Published: 2024-08-29 06:09:26

Author: 令日十条

From data to decisions: Leveraging Generative AI and data products

Define what a data product is
Propose a set of data product design principles
Present an overview of data product types
Provide examples of common data products in the pharmaceutical industry
Explain how data products can help activate Gen AI use cases
Present a modern, domain-driven data lake reference architecture
Outline how the data value chain may be revolutionized through Gen AI

Data products

Let’s start with defining what a data product is. It is a curated collection of data components that is organized and presented in a way that is easy to understand and use, building a better experience and enhancing trust for data consumers. It offers superior, consistent, and reliable data understanding and access, which allows consumers to get answers to their questions (or a chain of questions) to support business decisions and outcomes.

Figure 1 — Key characteristics of Data Products.

A data product can also be described using a set of key characteristics. These are presented in Figure 1. During the conference, data products were mentioned in several presentations and panel discussions. The analogy where data products are compared to dishes was especially popular, so we’ll keep using that here as well to illustrate the key characteristics — data products are the carrots and tomatoes that are used by a selection of chefs:

Inherent value. A data product in and of itself is valuable. If you have high-quality carrots and tomatoes, these have value of themselves, even if we don’t know exactly what we’d do with them. Stick them in front of a chef, and ideas for a dish will emerge instantaneously.
Business impact. We must have some idea about how the carrots and tomatoes are going to be used. Perhaps to garnish a broader dish, to be served as raw vegetable snacks, or to be added to a soup. We might not know the exact dishes, but we do have a reasonable idea about their most common uses and can estimate their impact through these applications.
Discoverable. They are easy to find and accessible for the intended users. For chefs experimenting with dishes, there is a register that shows what foodstuffs are available and where to find them, including carrots and tomatoes. You don’t want to have to drive an hour to get some — they should be located reasonably close to where they may be needed.
Understandable. They are clear, well-labeled, and unambiguous. The chef does not have to wonder what kind of carrots they are or where the tomatoes came from. If needed, one can take a look at the packaging to discover where they were grown and what the nutritional value is.
Addressable. If you are a chef running a professional kitchen, you want to know in what fridge you can find the carrots and tomatoes. This should not change overnight. A kitchen performing at a high pace needs reliable inputs — those carrots and tomatoes should be in the same fridge, every day, where they are expected to be.
Trusted and curated. Chefs lack the time to sort out imperfect carrots and tomatoes, such as those that are under- or oversized, or have mold, bugs, or discolorations. They expect that the rotten parts have been removed and that they can trust the quality of the ingredients given to them, so that they can focus on making the best possible dish.
Secure. Not everyone should have access to the fridge. If that were the case, the food could be used up or tampered with. At the same time, access should be provided to those who should have it: a fridge without a door is of no use.
Product orientation. The carrots and tomatoes are managed as a product with customers and a lifecycle. Some chefs might develop a liking for bigger carrots or tomatoes with a particular texture. They might need more or less of them. Whatever the demand, it is important that supply and preparation take into account the estimated and desired use.

Design principles

Figure 2 — Design principles behind the creation and maintenance of Data Products.

Having established what data products are and what they are supposed to be able to enable, a set of design principles emerges behind successful implementations. They are illustrated in Figure 2 and further explained below:

Autonomy and cohesion: Each data product functions as an autonomous, atomic unit that includes all necessary components such as code for data ingestion, transformation, sample data, unit tests, data quality tests, and infrastructure-as-code for provisioning. It also enforces access policies, ensuring that it remains a self-contained entity outputting a single denormalized dataset.
Common development framework: The central IT department supports domain teams by developing a specification language based on the Open Application Model (OAM) for declarative data product definitions. This allows teams to autonomously create and manage their data products using a shared platform that handles the CI/CD pipeline and capability registry.
Consistent metadata management: To enhance the searchability and interoperability of data products, a uniform cataloging process is established across domains. This includes standard metadata like unique names, descriptions, ownership, data sharing agreements, data classifications, and distribution rights.
Automated governance and access control: Data product teams can specify access policies programmatically using role-based or attribute-based control methods. The platform integrates corporate identity management with data storage solutions, automating the execution of access controls and ensuring secure data distribution.
Data sharing protocols: Data products support various sharing methods, prioritizing native mechanisms of the storage platform for similar producer-consumer environments (e.g., Redshift, Snowflake). When different storage platforms are used, data copying is considered a last resort, with strict adherence to governance and access controls to maintain security.
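
To make the metadata and access-control principles above more concrete, here is a minimal Python sketch of a declarative data product descriptor registered in a catalog. It is an illustration under stated assumptions, not the OAM-based specification language mentioned above: the class names (`DataProduct`, `AccessPolicy`, `Catalog`), the field names, and the role names are all hypothetical, and a real platform would delegate the role lookup to the corporate identity system.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    """Role-based policy (hypothetical): roles allowed to read the output dataset."""
    allowed_roles: frozenset[str]

    def permits(self, user_roles: set[str]) -> bool:
        # Access is granted when the user holds at least one allowed role.
        return bool(self.allowed_roles & user_roles)


@dataclass(frozen=True)
class DataProduct:
    """Declarative descriptor carrying the standard catalog metadata."""
    name: str                # unique name within the catalog
    description: str
    owner: str
    classification: str      # e.g. "confidential"
    sharing_agreement: str
    policy: AccessPolicy


class Catalog:
    """In-memory stand-in for a uniform, cross-domain catalog."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        # Enforce the unique-name rule at registration time.
        if product.name in self._products:
            raise ValueError(f"duplicate data product name: {product.name}")
        self._products[product.name] = product

    def search(self, term: str) -> list[str]:
        # Case-insensitive match on name or description.
        term = term.lower()
        return sorted(
            p.name
            for p in self._products.values()
            if term in p.name.lower() or term in p.description.lower()
        )

    def can_read(self, name: str, user_roles: set[str]) -> bool:
        # The policy travels with the product, so the check is product-local.
        return self._products[name].policy.permits(user_roles)


catalog = Catalog()
catalog.register(
    DataProduct(
        name="hcp_360",
        description="Integrated view of healthcare professionals",
        owner="commercial-domain-team",
        classification="confidential",
        sharing_agreement="internal-use-only",
        policy=AccessPolicy(frozenset({"field_analyst"})),
    )
)
print(catalog.search("healthcare"))
print(catalog.can_read("hcp_360", {"field_analyst"}))
```

Keeping the policy inside the descriptor mirrors the autonomy principle: the product declares who may read it, and the shared platform only executes that declaration.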

Types, levels and examples of data products

Having explored key characteristics of and recommended design principles behind data products, let’s consider that not all are created equal. Data products exist in different shapes and types, which is sometimes what complicates the concept in people’s eyes as to what they are and what they are not.

Data products may live in different “stages.” Sometimes, this is referred to as a medallion architecture, where a data product can be promoted from bronze, to silver, to gold.

Such classifications are fully in line with the one we use in this point of view, as presented in Figure 3. We’ve defined four successive levels:

Level 1 — Raw/Staged data: This initial level involves raw data from various sources, which is standardized and subjected to basic quality controls such as format standardization and null checks. It also includes the addition of audit columns like load ID and date, maintaining a comprehensive history by each load date to track data lineage.
Level 2 — Conformed data: At this level, raw data is processed and transformed into a normalized dimensional data model. This stage consolidates historical data and ensures data integrity and consistency through rigorous normalization and confirmation processes, facilitating easier access and analysis.
Level 3 — Analytics-ready data: Data at this stage is cross-functional, integrated with master identifiers, and organized into denormalized, flat datasets. This level focuses on ensuring data consistency across different subject areas, integrating common business rules, and precalculating key performance indicators (KPIs) to support analytics.
Level 4 — Fit for purpose data: The most refined level, designed to meet specific needs of consuming applications, often customized for particular business functions like marketing analytics, patient analytics, and return on investment (ROI) calculations in industries such as pharmaceuticals. This data is tailored to drive specific business actions and decisions.
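
The Level 1 step described above (format standardization, null checks, and audit columns) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `stage_rows`, the required column names, and the audit column names `load_id` and `load_date` are assumptions, not taken from a specific platform.

```python
import uuid
from datetime import date

# Hypothetical required columns for a sales feed.
REQUIRED_COLUMNS = ("hcp_id", "sale_date", "units")


def stage_rows(raw_rows, load_date, load_id=None):
    """Standardize raw rows, apply null checks, and append audit columns.

    Returns (staged, rejected): rows that passed the checks and rows that did not.
    """
    load_id = load_id or uuid.uuid4().hex
    staged, rejected = [], []
    for row in raw_rows:
        # Format standardization: trimmed, lower-case column names; trimmed strings.
        clean = {
            key.strip().lower(): value.strip() if isinstance(value, str) else value
            for key, value in row.items()
        }
        # Null check: every required column must be present and non-empty.
        if any(clean.get(col) in (None, "") for col in REQUIRED_COLUMNS):
            rejected.append(row)
            continue
        # Audit columns record which load produced the row and when.
        clean["load_id"] = load_id
        clean["load_date"] = load_date.isoformat()
        staged.append(clean)
    return staged, rejected


rows = [
    {" HCP_ID ": " H001 ", "Sale_Date": "2024-01-05", "Units": 3},
    {"hcp_id": "", "sale_date": "2024-01-06", "units": 1},
]
staged, rejected = stage_rows(rows, date(2024, 1, 31), load_id="load-001")
print(len(staged), len(rejected))
```

Because every staged row carries its load ID and load date, later levels can keep a history per load and trace any value back to the load that introduced it.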

Figure 3 — Subsequent levels of data products, increasingly tailored to meet specific business needs.

The first two levels are classified as source-oriented data products as the data continues to be structured mostly in line with how it was sourced. The last two levels are consumer-oriented as, indeed, the data products have been more substantially transformed for specific uses of the data. Let’s take a look at the pharmaceutical industry to investigate what possible source- and consumer-oriented data products might be.

Source-oriented data products

Source-oriented data products are pivotal for gathering and managing diverse sets of data relevant to business operations and patient care. For instance, master data products are crucial, encompassing databases like customer masters which detail information on healthcare professionals (HCPs), patients, consumers, and their affiliations. Master data may also include product masters that catalog all pertinent details about pharmaceutical products being developed or sold, and employee masters that maintain records on employees, their training, performance evaluations, and customer relationships.

Another example of a group of source-oriented data products comprises sales data. These compile sales figures across various frequencies, business lines, and regions, enhancing the understanding of market reach and performance. They may also track personal activity metrics such as the number of calls made, samples distributed, and involvement in speaker programs, which are essential for assessing the effectiveness of sales strategies.

Data products focused on claims and electronic medical records (EMR) are essential for a comprehensive view of healthcare interactions. These include data products for hospital claims, pharmacy claims, and payer claims from sources like Optum and Truven. Each dataset offers insights into billing and reimbursement patterns that are critical for financial planning and compliance. Specifically, EMR data products, such as those from Flatiron or Humedica, integrate clinical data like prescriptions (Rx) and diagnoses (Dx) from various healthcare providers, offering a rich source of real-world evidence that can support clinical studies and patient care strategies.

Consumer-oriented data products

Consumer-oriented data products are designed to support specific business functions and decision-making processes that directly interact with and influence customer relations and market strategies. For example, the HCP 360 data product provides a comprehensive view of healthcare professionals (HCPs), integrating data across multiple touchpoints to support use cases like field reporting, account profiling, segmentation, and omnichannel orchestration. This product helps pharmaceutical companies tailor their engagement strategies, optimize promotional responses, and enhance overall HCP relationship management.

Another essential data product may be Value Access & Pricing, which offers insights into the complex dynamics of drug pricing and market access. This product supports a range of analytical applications including contract analytics, copay analytics, and distribution channel analysis. It also aids in more strategic areas such as government affairs, health economics, outcomes research, and access strategy formulation. The data helps companies navigate the regulatory and competitive landscape, predict healthcare pathways, and develop protocols and policies that optimize product pricing and access.

Field Performance is a data product geared towards optimizing sales force activities and effectiveness. It provides metrics and analytics necessary for managing incentive compensation, setting sales goals, crediting sales activities, and reporting field performance. It supports the optimization of sample distribution, enhancing the effectiveness of the sales force. This data product is crucial for pharmaceutical companies looking to maximize the efficiency and impact of their sales teams, ensuring that resources are aligned with market opportunities and company objectives.

These are just examples — for a full list and more details, also for other sectors besides pharma, reach out to either of us.

The linkage to Generative AI

One of the driving forces behind the growing interest in data products is the emergence of generative AI, a type of artificial intelligence that learns from vast amounts of data to create content or generate new data that resembles the original input. This technology can produce anything from text and images to code and music, simulating human-like creativity.

However, the successful deployment of generative AI hinges critically on solid data foundations. Without access to high-quality data from reliable sources, these AI models can become inefficient and potentially biased, leading to outcomes that create harm rather than value. Ensuring the integrity and quality of the data is paramount; without it, you cannot effectively activate the intended use case. Moreover, the deployment of generative AI requires strict ethical and regulatory vigilance and strategic expertise to ensure accuracy, legal compliance, and alignment with business objectives. This is needed to mitigate risks of bias and operational errors, reinforcing the importance of quality data and thoughtful oversight in generative AI projects.

Figure 4 — Generative AI use cases require a minimum maturity across a select set of data management capabilities. Source: https://readmedium.com/navigating-the-data-management-landscape-in-the-age-of-gen-ai-82a5337a8c00

We can break this out along a few dimensions. To train and deploy models, Gen AI applications need access to enough data, and that data needs to be sufficiently diverse. They may require vast volumes of data if the expected output is complex and unstable. Stability here refers to the fact that identical models may produce different results when given an identical prompt, simply because of how Gen AI models work. In some cases that variability is unacceptable, and sufficient data is then required to train the models. The data also needs to be sufficiently diverse. Gen AI, even more so than most other modelling techniques, reflects the diversity of the data it has been trained on and is given. If your model is trained on social interactions of which 95% were with people aged 25 or younger, it might not perform as well later on when exposed to people older than 80.
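
As a rough illustration of the diversity point, a simple profiling step can flag groups that are barely represented before any training starts. The sketch below is an assumption-laden example: the band edges, the 5% threshold, and the function names are illustrative choices, not a recommended standard.

```python
from collections import Counter

# Illustrative age bands; real segmentations depend on the use case.
BANDS = ("25 or younger", "26 to 80", "over 80")


def age_band(age):
    """Map an age in years to one of the coarse bands above."""
    if age <= 25:
        return BANDS[0]
    if age <= 80:
        return BANDS[1]
    return BANDS[2]


def underrepresented_bands(ages, min_share=0.05):
    """Return {band: share} for every band whose share is below min_share."""
    counts = Counter(age_band(a) for a in ages)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to profile")
    shares = {band: counts[band] / total for band in BANDS}
    return {band: share for band, share in shares.items() if share < min_share}


# 95% of interactions come from people aged 25 or younger.
ages = [22] * 95 + [40] * 4 + [85] * 1
flagged = underrepresented_bands(ages)
print(sorted(flagged))
```

A check like this does not fix a skewed dataset, but it makes the skew visible early enough to source additional data or to limit where the model is used.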

For a similar reason, data quality is massively important. It is often the biggest problem, because garbage in remains garbage out. With Gen AI, even when given bad data, the responses still tend to be elegant and complete-sounding. In some cases, they turn out to be made up. The quality of the answers will reflect the quality of the data the model has been given. This also holds when the data it is fed is unstructured. In that case, data quality checks are not as easy to implement as they are on structured data, but it remains critical to verify that the right unstructured data is provided.

Beyond these more general foundations around data, depending on the use case in question, there may be more specific requirements. The model might require annotated data, for example for training purposes. It might need a sufficient amount of historical data, or separate data that can be used for validation and testing. And if your use cases require real-time data, for example in many of the use cases with live call center agents, then data needs to be made available real-time, quickly and reliably. This is not just about integrating data sources, but also about making sure that only the right data is shared, and only with people or applications that should have access to it.

Many of the points in the previous paragraphs are age-old data governance challenges, and they have not gone away. They are well-understood problems of how data can be managed appropriately and brought to the right use cases. Here we come back to data assets and data products, a concept that is gaining ever more traction; many companies have been able to activate various use cases based on a selected set of data products. The key thing to understand is that it’s about not governing all data everywhere to the same standard, but instead focusing on the data that is most strategic and most important. Once you know what data is the most critical, you can prioritize managing that exact data as an asset or product. This will maximize the ROI on your investments in foundational data capabilities.

How to quickly measure your organization’s data readiness for Gen AI

Our research has revealed distinct patterns and best practices among companies that have successfully built foundational maturity and achieved initial business impacts with their use of generative AI, compared to those that continue to struggle and lag behind. The following 13 capability areas were identified as critical for business success:

Strategy and vision: Establishes the foundational framework for Generative AI initiatives, including the creation of a strategic plan, setting AI goals, and allocating investments and budgets.
Organizational structure and operating model: Defines roles, responsibilities, and the centralization of operations. This includes setting up decision-making frameworks, implementing change management programs, and managing stakeholders.
Center of excellence (CoE): Focuses on building a specialized team to lead and support Generative AI efforts, including training on best practices, and deploying tools and accelerators to streamline processes.
Use cases and applications: Identifies potential Generative AI applications, links them with necessary data sources, assesses feasibility, and establishes business ownership for each use case.
Data: Ensures the availability of diverse and high-quality data, maintaining historical and annotated datasets, and providing real-time data access for ongoing validation and testing.
ROI and value generation: Develops methods to measure the benefits of Generative AI projects, defines relevant KPIs and metrics, and crafts detailed business cases to underscore the value.
Model building and training: Involves selecting appropriate foundational models, training these models with robust datasets, and continuously evaluating and monitoring their performance.
Deployment and operation: Reengineers processes to integrate Generative AI solutions, monitors performance and utilization, and automates workflows to enhance operational efficiency.
Talent and skills: Focuses on attracting and retaining skilled professionals, providing training and opportunities for reskilling or upskilling, and fostering interdisciplinary teams.
Governance, ethics and compliance: Addresses ethical considerations, ensures AI transparency, complies with regulatory standards, and sets policies for responsible AI usage.
Technology infrastructure: Equips organizations with the necessary Generative AI tools, robust data platforms, adequate computational resources, and supports system integration and exploration.
Data security: Implements stringent security measures such as encryption, strict access controls, safeguards against data leakage, and conducts regular security audits.
Innovation, ecosystem and partnerships: Encourages ongoing research, fosters external collaborations, and forms technology alliances to stay at the forefront of Generative AI development and application.

In response to the rising interest in generative AI, at ZS we have built an accelerator to swiftly evaluate and identify maturity levels and gaps in the 13 above-referenced foundational data capabilities. For more information on this, reach out to Shri Salem or Willem Koenders. For a more detailed discussion of the foundational data capabilities required for generative AI use cases, read further here.

Gen AI supporting data management

We have established that robust data management and governance are critical to enabling generative AI within organizations to effectively activate relevant use cases, especially with data products as key enablers. However, it’s also interesting and important to explore the reverse interaction — how generative AI can be integrated into and enhance the data management landscape.

Within a modern domain-driven data lake architecture, as depicted in Figure 5, the data mesh lies at the heart of the system. This mesh connects various data products, typically organized within specific domains. Elements such as an augmented data catalog and knowledge graphs play crucial roles in managing metadata and democratizing access to these data products, which are showcased in a data marketplace and made available for diverse applications including AI/ML, business intelligence, or integration into downstream business processes.

Figure 5 — An example of a domain-driven reference architecture for a data lake. © ZS Associates.

Now, such a data lake architecture can help to establish and operate a data value chain, of which Figure 6 presents a simplified view. This data value chain involves several key stages that transform raw data into valuable insights for decision-making. It starts with Data Acquisition, where data is collected from various sources such as sales transactions, sensors, or user feedback. This is followed by Data Transformation, where the gathered data is cleaned to remove errors, transformed to standardize formats, and organized for easy analysis. After the data is processed, it moves to the Consumption and Analytics stage, where it is analyzed to extract useful information, such as identifying trends or making predictions, which informs business decisions. Throughout these stages, Operations and Maintenance ensure that the data processes run smoothly, systems are updated, and issues are addressed promptly. This ongoing support enhances the efficiency and effectiveness of the data systems, ensuring the reliability and utility of data across the value chain. Generative AI has the potential to transform this data value chain in each of these four components, which is what we’ll explore next.

Data acquisition

Generative AI can significantly enhance the data acquisition process by analyzing and tagging existing data sources in relation to specific use cases. By evaluating data models and metadata details, it can auto-generate ontologies based on domain contexts provided as external inputs. This AI-driven approach acts as a prompt-driven catalog, storing intricate details such as data source specifics and KPI definitions, facilitating a deeper understanding and organization of data assets.

Additionally, generative AI can cross-reference available sources with those in the marketplace to identify gaps and create a prioritized list of suggestions. This not only streamlines the data acquisition strategy but also ensures that the data ecosystem is robust and aligned with organizational needs, making the process of integrating new data sources more efficient and targeted.

Data transformation

In the transformation stage, generative AI can revolutionize the way code is developed and maintained. By creating a cookbook of prompts, it enables the generation of a code library that can ingest industry-standard datasets, apply specific processes (such as those unique to the pharmaceutical industry), and produce a base orchestration code that is compatible across various cloud platforms. This capability also includes seamless migration of code from one programming language to another, such as from SAS to Python or Spark, by simply feeding the existing code library into the system.

Generative AI further enhances developer support by evaluating scripts, summarizing them, acting as a debugger during development, and automatically adding code comments. These features significantly reduce manual effort, minimize errors, and improve the efficiency of data transformation processes.

Consumption & analytics

Generative AI can transform the consumption and analytics stage by automating the configuration of business settings based on existing data points. This includes tasks such as product mastering, geo-tagging, and customer segmentation, which are typically resource-intensive.

By profiling external sources and mastered cross-references, generative AI can also suggest potential matches or merges with high accuracy, thus enhancing the quality of data integration.

Additionally, it contextualizes self-serve capabilities, enabling users to input natural language queries and receive automated insights. This augmented analytics approach reduces the burden of data interpretation and supports anomaly detection, making data more actionable and decision-making more informed.

Operations & maintenance

Generative AI can greatly improve operations and maintenance by automating routine activities and reducing the cost associated with “keep the lights on” (KTLO) operations. For example, it can provide detailed root cause analyses (RCA) of operational failures and share these insights with relevant stakeholders, enhancing transparency and accountability. Or, by analyzing historical data loads and comparing them with current run-timings, generative AI can predict potential SLA breaches and alert the necessary teams before issues become critical.
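
The comparison of historical loads with current run timings can be illustrated with a plain statistical baseline. Note the assumption: the sketch below is not generative AI itself; it is the kind of signal a generative model could take as input when drafting an alert or a root cause summary. The function name, its parameters, and the two-standard-deviation threshold are hypothetical.

```python
from statistics import mean, stdev


def predict_sla_breach(past_durations, elapsed_min, fraction_done, sla_min, z=2.0):
    """Flag a running load that is likely to exceed its SLA.

    past_durations: total run times (minutes) of previous successful loads
    elapsed_min:    minutes the current load has been running
    fraction_done:  share of the work completed so far, in (0, 1]
    sla_min:        agreed maximum run time in minutes
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    # Naive linear projection of the total run time from progress so far.
    projected = elapsed_min / fraction_done
    baseline = mean(past_durations)
    spread = stdev(past_durations) if len(past_durations) > 1 else 0.0
    return {
        "projected_min": projected,
        "breach_expected": projected > sla_min,
        "slower_than_usual": projected > baseline + z * spread,
    }


history = [50, 52, 48, 51, 49]
slow_run = predict_sla_breach(history, elapsed_min=40, fraction_done=0.5, sla_min=60)
normal_run = predict_sla_breach(history, elapsed_min=20, fraction_done=0.5, sla_min=60)
print(slow_run["breach_expected"], normal_run["breach_expected"])
```

Separating "slower than usual" from "breach expected" lets a team be warned about drift before the SLA itself is at risk.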

Additionally, generative AI can be used to govern access control and apply data restrictions based on user roles and personas, ensuring that data security and compliance are maintained across the board.

Closure

As we have explored in this article, the integration of generative AI within data management strategies is a transformative shift that offers advancements in how data is acquired, transformed, and utilized. As companies continue to navigate this terrain, the symbiotic relationship between data governance and AI technologies will become crucial for achieving long-term success.
