Stanford AI Index Report

Published: 2024-08-18 21:04:18



Introduction

A walkthrough of Stanford HAI's 2024 annual AI Index Report.



(1) About HAI


The Stanford Institute for Human-Centered Artificial Intelligence (Stanford HAI) publishes an AI Index Report every year.

The 2024 edition was released on April 15, 2024, reviewing how large models developed worldwide during 2023.

Sponsors behind the report (figure omitted).


(2) Report Summary


Report outline:

  • Chapter 1: Research and Development

  • Chapter 2: Technical Performance

  • Chapter 3: Responsible AI

  • Chapter 4: Economy

  • Chapter 5: Science and Medicine

  • Chapter 6: Education

  • Chapter 7: Policy and Governance

  • Chapter 8: Diversity

  • Chapter 9: Public Opinion


Key takeaways:

  • AI capabilities: AI surpasses humans only on some tasks; it is advancing scientific research, raising productivity, and creating some jobs, but the public is also growing more aware of, and more nervous about, AI

  • Training costs keep climbing; industry continues to dominate frontier AI research, with academia following and collaborating

  • Global AI power ranking: United States > China > EU + U.K.

  • AI investment remains hot

  • LLMs lack robust evaluation standards, and U.S. AI regulations have surged


(3) Notable Large Models of 2023

First, a quick review of the well-known large models released in 2023.


(4) Top Takeaways

(4.1) AI Beats Humans on Some Tasks, Not All


1. AI beats humans on some tasks, but not on all.

AI has surpassed human performance on several benchmarks, including some in image classification, visual reasoning, and English understanding. Yet it trails behind on more complex tasks like competition-level mathematics, visual commonsense reasoning, and planning.

Take text-to-image as an example: AI-generated images have become hard to tell from real photos.

One image shows Midjourney's improvement across versions (figure omitted).

Image segmentation: Segment Anything (SAM) lifted segmentation quality to a new level.

Image editing: ControlNet can adjust images to a given specification.

Instruct-NeRF2NeRF goes a step further and brings edited scenes to life, making digital-human livestreaming feasible.


(4.2) Industry Continues to Dominate Frontier AI Research


Industry continues to dominate frontier AI research. 

In 2023, industry produced 51 notable machine learning models, while academia contributed only 15. Industry-academia collaborations yielded another 21 notable models, a new high. In total, 149 foundation models were released in 2023, more than double the number released in 2022.

Leading institutions behind large models, by cumulative 2019-2023 counts:

  • Academia: Tsinghua, Stanford, and the Shanghai AI Laboratory

  • Nonprofits: EleutherAI and AI2

  • The rest are the big tech companies

Looking at 2023 alone, industry pulls further ahead:

  • Academia barely registers: just the Shanghai AI Laboratory, UC Berkeley, and Stanford

  • One nonprofit: AI2

  • Everything else comes from large companies


By country, the United States towers over the field, followed by China, then the U.K. and the countries of Europe.

The 80/20 rule applies here too: the strong get stronger, and the field is polarizing.


Among the 149 foundation models, the open-source share has surged: 33.3% in 2021, 44.4% in 2022, and 65.7% in 2023. Open models boomed in 2023 and performed well, yet evaluations still favored closed ones:

Closed LLMs significantly outperform open ones. On 10 select AI benchmarks, closed models outperformed open ones, with a median performance advantage of 24.2%. Differences in the performance of closed and open models carry important implications for AI policy debates.


(4.3) Training Costs Climb Sharply

Frontier models get way more expensive.

According to AI Index estimates, the training costs of state-of-the-art AI models have reached unprecedented levels. For example, OpenAI's GPT-4 used an estimated $78 million worth of compute to train, while Google's Gemini Ultra cost $191 million for compute.

Parameter Trends

LLMs are getting bigger and more expensive to train, yet also rapidly more efficient as AI compute chips improve; together with Nvidia's soaring stock price, this has helped inflate a generative-AI bubble, especially in the U.S. market.

Since neither OpenAI nor Google is transparent about the parameter counts of these models, some of this is speculation. The trend means that keeping up as a pure-play LLM foundation-model builder may soon require billions of dollars.
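For context on these dollar figures, training compute is often approximated with the ~6·N·D FLOPs rule of thumb (N parameters, D training tokens). The sketch below is a back-of-envelope estimator; the parameter count, token count, and price per petaFLOP/s-day are illustrative assumptions, since vendors disclose none of these numbers.

```python
# Back-of-envelope training-cost estimator using the common ~6 * N * D
# FLOPs approximation for dense transformers. All concrete numbers here
# are illustrative assumptions, not disclosed figures.

def training_flops(params, tokens):
    """Approximate total training FLOPs as 6 * N * D."""
    return 6.0 * params * tokens

def training_cost_usd(params, tokens, usd_per_pflops_day=100.0):
    """Convert FLOPs to dollars at an assumed price per petaFLOP/s-day."""
    pflops_days = training_flops(params, tokens) / (1e15 * 86_400)
    return pflops_days * usd_per_pflops_day

# Hypothetical 100B-parameter model trained on 2T tokens:
print(f"{training_flops(100e9, 2e12):.2e} FLOPs")
print(f"~${training_cost_usd(100e9, 2e12):,.0f}")
```

At these assumed prices the toy example lands around $1.4M; real frontier-model costs are far higher because N, D, and hardware pricing are much larger, and the 6·N·D rule ignores engineering overheads.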


(4.4) The U.S. Leads China, the EU, and the U.K.

The United States leads China, the EU, and the U.K. as the leading source of top AI models.

In 2023, 61 notable AI models originated from U.S.-based institutions, far outpacing the European Union's 21 and China's 15.

Among them, China's new generation of generative-AI model builders:

  1. Baichuan

  2. 01.AI

  3. MiniMax

  4. Moonshot AI

  5. Zhipu

Why these five? Through the investment drought of the past few years, each of them was amply funded, including by foreign investors.

China is narrowing the gap with companies like OpenAI.

On patents:

From 2021 to 2022, the number of AI patents granted worldwide jumped 62.7%. Since 2010, the number of AI patents granted has grown more than 31-fold.

China dominates AI patents. 

In 2022, China led global AI patent origins with 61.1%, significantly outpacing the United States, which accounted for 20.9% of AI patent origins. Since 2010, the U.S. share of AI patents has decreased from 54.1%.

China stands far ahead in patent volume. (Don't laugh.)

(Commentary: these patents carry limited weight and do not necessarily reflect real strength.)

Since 2011, the number of AI-related projects on GitHub has seen a consistent increase, growing from 845 in 2011 to approximately 1.8 million in 2023. Notably, there was a sharp 59.3% rise in the total number of GitHub AI projects in 2023 alone. The total number of stars for AI-related projects on GitHub also significantly increased in 2023, more than tripling from 4.0 million in 2022 to 12.2 million.

GitHub AI projects and AI publications show the same pattern of sustained growth.

Between 2010 and 2022, the total number of AI publications nearly tripled, from about 88,000 in 2010 to more than 240,000 in 2022, though last year's growth rate was just 1.1%.


Is China late to Generative AI?

According to the Index Report, for the first time since 2019, the European Union and the United Kingdom together have surpassed China in the number of notable AI models produced (Figure 1.3.3). Since 2003, the United States has produced more models than other major geographic regions such as the United Kingdom, China, and Canada (Figure 1.3.4).

The report is upbeat about China.

Before 2024, the consensus was that China had been left far behind, but China may come to dominate open-source models within a few years.

On foundational LLMs, China trails OpenAI and Google's Gemini, but it is closing the gap by building on open LLMs such as Meta's Llama 1, 2, and 3.

With surging interest in open-source LLMs and more efficient open-weight models, China is well placed to innovate in generative AI.

China's posture toward generative AI also looks steadier, framed as a long-term strategy rather than the hype-and-stock-price push seen at some Silicon Valley giants.

In 2024, China is trying to catch OpenAI's lead within a broader U.S. AI market shaped by Microsoft, Google, and Amazon, alongside well-funded startups including Anthropic, Mistral, Cohere, and Elon Musk's xAI, which is reportedly about to close a funding round of nearly $8 billion.


(4.5) LLM Safety Evaluation Is a Big Problem

Responsible AI badly lacks universal standards.

1. Robust and standardized evaluations for LLM responsibility are seriously lacking. New research from the AI Index reveals a significant lack of standardization in responsible AI reporting. Leading developers, including OpenAI, Google, and Anthropic, primarily test their models against different responsible AI benchmarks. This practice complicates efforts to systematically compare the risks and limitations of top AI models.


2. Political deepfakes are easy to generate and difficult to detect. Political deepfakes are already affecting elections across the world, with recent research suggesting that existing AI deepfake methods perform with varying levels of accuracy. In addition, new projects like CounterCloud demonstrate how easily AI can create and disseminate fake content. 


3. Researchers discover more complex vulnerabilities in LLMs. Previously, most efforts to red team AI models focused on testing adversarial prompts that intuitively made sense to humans. This year, researchers found less obvious strategies to get LLMs to exhibit harmful behavior, like asking the models to infinitely repeat random words. 


4. Risks from AI are becoming a concern for businesses across the globe. A global survey on responsible AI highlights that companies’ top AI-related concerns include privacy, data security, and reliability. The survey shows that organizations are beginning to take steps to mitigate these risks. Globally, however, most companies have so far only mitigated a small portion of these risks. 


5. LLMs can output copyrighted material. Multiple researchers have shown that the generative outputs of popular LLMs may contain copyrighted material, such as excerpts from The New York Times or scenes from movies. Whether such output constitutes copyright violations is becoming a central legal question. 

6. AI developers score low on transparency, with consequences for research. The newly introduced Foundation Model Transparency Index shows that AI developers lack transparency, especially regarding the disclosure of training data and methodologies. This lack of openness hinders efforts to further understand the robustness and safety of AI systems.

(4.6) Generative AI Investment Skyrockets

Generative AI investment skyrockets. 

Despite a decline in overall AI private investment last year, funding for generative AI surged, nearly octupling from 2022 to reach $25.2 billion. Major players in the generative AI space, including OpenAI, Anthropic, Hugging Face, and Inflection, reported substantial fundraising rounds. 

(4.7) AI Boosts Productivity

The data is in: AI makes workers more productive and leads to higher quality work. 

In 2023, several studies assessed AI’s impact on labor, suggesting that AI enables workers to complete tasks more quickly and to improve the quality of their output. These studies also demonstrated AI’s potential to bridge the skill gap between low- and high-skilled workers. Still other studies caution that using AI without proper oversight can lead to diminished performance. 


(4.8) AI Accelerates Scientific Progress

8. Scientific progress accelerates even further, thanks to AI. 

In 2022, AI began to advance scientific discovery. 2023, however, saw the launch of even more significant science-related AI applications—from AlphaDev, which makes algorithmic sorting more efficient, to GNoME, which facilitates the process of materials discovery. 


(4.9) U.S. AI Regulations Surge Sharply

9. The number of AI regulations in the United States sharply increases.

The number of AI-related regulations in the U.S. has risen significantly in the past year and over the last five years. In 2023, there were 25 AI-related regulations, up from just one in 2016. Last year alone, the total number of AI-related regulations grew by 56.3%. 

(4.10) The Public Is Growing More Nervous About AI

10. People across the globe are more cognizant of AI’s potential impact—and more nervous. 

A survey from Ipsos shows that, over the last year, the proportion of those who think AI will dramatically affect their lives in the next three to five years has increased from 60% to 66%. Moreover, 52% express nervousness toward AI products and services, marking a 13 percentage point rise from 2022. In America, Pew data suggests that 52% of Americans report feeling more concerned than excited about AI, rising from 38% in 2022.


(5) Frontier LLM Developments

Since AI still struggles with complex tasks, multimodal large models have taken the stage, such as Google's Gemini and OpenAI's GPT-4 (plus the recently released GPT-4o).

(5.1) How Evaluation Benchmarks Are Changing

To support complex multimodal tasks, the surrounding infrastructure needs upgrading too: higher-quality datasets and evaluation standards, both of which require heavy human involvement.

① Data production: since data is vital to technical progress, AI models such as SegmentAnything and Skoltech are entering the data-production pipeline itself, generating specialized data for tasks like image segmentation and 3D reconstruction.

New AI models such as SegmentAnything and Skoltech are being used to generate specialized data for tasks like image segmentation and 3D reconstruction. Data is vital for AI technical improvements. The use of AI to create more data enhances current capabilities and paves the way for future algorithmic improvements, especially on harder tasks.

② Benchmarks: moving beyond ImageNet, SQuAD, and SuperGLUE to SWE-bench (coding), HEIM (image generation), MMMU (general reasoning), MoCa (moral reasoning), AgentBench (agent behavior), and HaluEval (hallucination).

AI models have reached performance saturation on established benchmarks such as ImageNet, SQuAD, and SuperGLUE, prompting researchers to develop more challenging ones. In 2023, several challenging new benchmarks emerged, including SWE-bench for coding, HEIM for image generation, MMMU for general reasoning, MoCa for moral reasoning, AgentBench for agent-based behavior, and HaluEval for hallucinations

③ Human involvement: benchmarking is shifting away from computerized rankings like ImageNet and SQuAD toward human evaluations such as the purely human-judged Chatbot Arena Leaderboard.

With generative models producing high-quality text, images, and more, benchmarking has slowly started shifting toward incorporating human evaluations like the Chatbot Arena Leaderboard rather than computerized rankings like ImageNet or SQuAD. Public sentiment about AI is becoming an increasingly important consideration in tracking AI progress.


(5.2) RLHF

RLHF has spawned many improved variants, such as RLAIF, DPO, and ORPO.

How well do these methods work?

RLHF vs. RLAIF: RLAIF comes close to matching RLHF.

On harmlessness, RLAIF scores best and SFT worst.

Among the PPO-family variants, DPO beats both PPO and SFT.

  • Higher sampling temperature hurts performance, especially for PPO, which drops off sharply above 0.25
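As a concrete anchor for how DPO differs from PPO-style RLHF, here is a minimal scalar sketch of the published DPO objective for a single preference pair. The log-probability values fed in are made-up inputs, not real model outputs:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given log-probabilities of the
    chosen/rejected responses under the policy (pi_*) and the frozen
    reference model (ref_*)."""
    # Implicit reward margin: beta * [(chosen log-ratio) - (rejected log-ratio)]
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Loss is -log sigmoid(margin); it shrinks as the policy prefers the chosen answer
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up log-probs: policy already favors the chosen response -> loss < log 2
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
# Indifferent policy -> loss is exactly log 2
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
```

Unlike PPO, no reward model or sampling loop appears: the preference signal is folded directly into this supervised-style loss, which is why DPO is cheaper to run.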


(5.3) Emergent Abilities

What is actually behind the much-hyped emergent abilities?

A Stanford paper shows that so-called emergence is closely tied to the choice of evaluation metric:

  • With a nonlinear or discontinuous metric (e.g., multiple-choice accuracy), "emergence" can appear

  • With a linear, continuous metric, the "emergence" disappears

See the paper: Are Emergent Abilities of Large Language Models a Mirage?, https://arxiv.org/pdf/2304.15004.pdf
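The paper's argument can be reproduced with a toy calculation: if per-token accuracy improves smoothly with scale, an all-or-nothing metric such as exact match over a multi-token answer still jumps abruptly. The accuracy values below are made up purely for illustration:

```python
# Toy version of the metric-artifact argument: per-token accuracy improves
# smoothly, but exact match over a k-token answer (a nonlinear,
# all-or-nothing metric) jumps abruptly.

k = 10  # length of the answer in tokens
for per_token_acc in [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]:
    exact_match = per_token_acc ** k  # compounds nonlinearly -> looks "emergent"
    print(f"per-token acc {per_token_acc:.2f} -> exact match {exact_match:.4f}")
```

Per-token accuracy rising from 0.80 to 0.99 reads as steady progress, while exact match leaps from roughly 0.11 to 0.90, which on a benchmark curve looks like an "emergent" jump.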

(5.4) Has GPT-4 Gotten Worse?

On the widely reported decline in GPT-4's quality, a paper from Stanford and MIT confirmed the drift in behavior over time.

See: https://arxiv.org/pdf/2307.09009.pdf

(5.5) Can LLMs Self-Correct?

How well can large models correct their own mistakes?

LLMs are notoriously prone to hallucination, which is especially widespread in law and medicine.

Current research focuses mostly on why hallucinations arise; far less work identifies which domains are hallucination-prone or measures how severe the problem is.

HaluEval, a dataset of 35,000 samples released in 2023, is built specifically to probe LLM hallucination.

Among ChatGPT's generations, as much as 19.5% of responses could not be verified, spanning language, climate, and technology topics.

Experimental results: ChatGPT and Claude sit in the top tier, while many other LLMs struggle badly with hallucination.

One typical remedy is to have the LLM itself detect whether generated content is hallucinated and correct it as needed. Does such self-correction actually work?

A DeepMind paper, "Large Language Models Cannot Self-Correct Reasoning Yet" (https://arxiv.org/pdf/2310.01798.pdf), argues that LLMs cannot judge whether their own reasoning is correct; forcing them to "correct" themselves merely nudges them toward other options and lowers accuracy.

Experimental result: after repeated self-correction rounds, the accuracy of both GPT-4 and Llama-2 declines, and the more rounds, the steeper the drop (GPT-4 declines less than Llama-2).

So without external information to guide it, even GPT-4's self-correction is unreliable, and fixing intrinsic hallucination through self-correction looks like a dead end.
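The intrinsic self-correction setup the paper tests can be sketched as a loop with no external feedback. The `model` function below is a random stub standing in for a real LLM, and the 0.7 keep-probability is an arbitrary assumption chosen to mimic the reported answer-switching behavior:

```python
import random

# Sketch of intrinsic self-correction: answer once, then repeatedly prompt
# "review your answer" with no external feedback. `model` is a random stub,
# not a real LLM; its probabilities are illustrative assumptions.

def model(question, previous=None):
    if previous is None:
        return "A"  # initial (assume correct) answer
    # With no new information, a reconsider prompt just makes the model
    # switch options some of the time -- which is how accuracy degrades.
    return previous if random.random() < 0.7 else random.choice("BCD")

def self_correct(question, rounds):
    answer = model(question)
    for _ in range(rounds):
        answer = model(question, previous=answer)
    return answer

random.seed(0)
answers = [self_correct("q", rounds=3) for _ in range(1000)]
print("kept the initial answer:", answers.count("A") / len(answers))
```

With these made-up numbers the stub keeps its first answer only about a third of the time after three "correction" rounds, echoing the finding that more rounds mean steeper accuracy drops.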

(5.6) Agents


More technical research in agentic AI. Creating AI agents, systems capable of autonomous operation in specific environments, has long challenged computer scientists. However, emerging research suggests that the performance of autonomous AI agents is improving. Current agents can now master complex games like Minecraft and effectively tackle real-world tasks, such as online shopping and research assistance


(5.7) Robotics

Beyond spawning all kinds of virtual agents, large models are also advancing embodied intelligence (physical robots). Applied to robotics, models such as PaLM-E and RT-2 enable far more flexible manipulation.

The fusion of language modeling with robotics has given rise to more flexible robotic systems like PaLM-E and RT-2. Beyond their improved robotic capabilities, these models can ask questions, which marks a significant step toward robots that can interact more effectively with the real world.

RT-2's success rates across tasks (figure omitted).

