| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU (@Stanford University etc.) | https://arxiv.org/pdf/2303.06865.pdf | https://github.com/FMInference/FlexGen | ⭐️ |
| 2023.05 | [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification (@Peking University etc.) | https://arxiv.org/pdf/2305.09781.pdf | https://github.com/flexflow/FlexFlow/tree/inference | ⭐️ |
| 2023.05 | [FastServe] Fast Distributed Inference Serving for Large Language Models (@Peking University etc.) | https://arxiv.org/pdf/2305.05920.pdf | ⚠️ | ⭐️ |
| 2023.09 | 🔥[vLLM] Efficient Memory Management for Large Language Model Serving with PagedAttention (@UC Berkeley etc.) | https://arxiv.org/pdf/2309.06180.pdf | https://github.com/vllm-project/vllm | ⭐️⭐️ |
| 2023.09 | [StreamingLLM] Efficient Streaming Language Models with Attention Sinks (@Meta AI etc.) | https://arxiv.org/pdf/2309.17453.pdf | https://github.com/mit-han-lab/streaming-llm | ⭐️ |
| 2023.09 | [Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads (@Tianle Cai etc.) | https://sites.google.com/view/medusa-llm | https://github.com/FasterDecoding/Medusa | ⭐️ |
| 2023.10 | 🔥[TensorRT-LLM] NVIDIA TensorRT-LLM (@NVIDIA) | https://nvidia.github.io/TensorRT-LLM/ | https://github.com/NVIDIA/TensorRT-LLM | ⭐️⭐️ |
| 2023.11 | 🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (@Microsoft) | https://arxiv.org/pdf/2401.08671.pdf | https://github.com/microsoft/DeepSpeed | ⭐️⭐️ |
| 2023.12 | 🔥[PETALS] Distributed Inference and Fine-tuning of Large Language Models Over The Internet (@HSE University etc.) | https://arxiv.org/pdf/2312.08361.pdf | https://github.com/bigscience-workshop/petals | ⭐️⭐️ |
| 2023.12 | [PowerInfer] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (@SJTU) | https://ipads.se.sjtu.edu.cn/_media/publications/powerinfer-20231219.pdf | https://github.com/SJTU-IPADS/PowerInfer | ⭐️ |

| Date | Title | Paper | Code | Recom |
|:---:|:---|:---:|:---:|:---:|
| 2020.05 | 🔥[Megatron-LM] Training Multi-Billion Parameter Language Models Using Model Parallelism (@NVIDIA) | https://arxiv.org/pdf/1909.08053.pdf | https://github.com/NVIDIA/Megatron-LM | ⭐️⭐️ |
| 2023.10 | [LightSeq] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers (@UC Berkeley etc.) | https://arxiv.org/pdf/2310.03294.pdf | https://github.com/RulinShao/LightSeq | ⭐️ |
| 2023.10 | 🔥[TensorRT-LLM] NVIDIA TensorRT-LLM (@NVIDIA) | https://nvidia.github.io/TensorRT-LLM/ | https://github.com/NVIDIA/TensorRT-LLM | ⭐️⭐️ |
| 2023.12 | 🔥[PETALS] Distributed Inference and Fine-tuning of Large Language Models Over The Internet (@HSE University etc.) | https://arxiv.org/pdf/2312.08361.pdf | https://github.com/bigscience-workshop/petals | ⭐️⭐️ |
| 2024.01 | [Inferflow] Inferflow: An Efficient and Highly Configurable Inference Engine for Large Language Models (@Tencent AI Lab) | https://arxiv.org/pdf/2401.08294.pdf | https://github.com/inferflow/inferflow | ⭐️ |

• vLLM: Works out of the box, integrates the latest PagedAttention kernel, and supports both single-GPU and multi-GPU modes. It is easy to pick up and delivers high-performance inference with little setup (a minimal usage sketch follows this list).
• DeepSpeed-FastGen/DeepSpeed: Microsoft's latest framework, covering the full model lifecycle from training through inference. Where other optimization frameworks are hard to tune and may require reworking or retraining model code, DeepSpeed's optimization modules can be imported directly into PyTorch code, so high-performance inference needs only minor code changes (see the sketch after this list). Caveat: for small models, DeepSpeed's distributed training and inference can incur extra resource-allocation overhead. When I experimented with training small models (e.g., a simple fully connected network or ResNet-50), the time spent allocating resources was far from negligible, and DeepSpeed actually ended up slower overall.
• PowerInfer: Built on llama.cpp's core data structures, it introduces a novel hot/cold neuron classification technique: compute-heavy "hot" neurons are placed in GPU memory for fast inference, while "cold" neurons are kept on the CPU for low-power inference. It suits lower-spec machines such as non-professional personal hosts, but it currently supports only a subset of models that use ReLU activations, and it does not support Chinese (a toy sketch of the hot/cold split appears below).

Note that TensorRT-LLM and Megatron-LM, also from NVIDIA, are very popular frameworks as well. When I tried to deploy them, however, the software requirements turned out to be fairly involved (TensorRT) or the frameworks were hard to run on ordinary compute resources (Megatron-LM). They are still worth a look for the curious: coming from NVIDIA, the largest AI hardware vendor, their engineers presumably know the optimizations closest to the hardware (^*^).
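
Here is a minimal vLLM sketch for the out-of-the-box usage described in the first bullet. The model name and prompts are placeholders of my own choosing, not from the original post; any Hugging Face causal LM works the same way.

```python
# Minimal vLLM offline-batching sketch (model name is a placeholder).
from vllm import LLM, SamplingParams

prompts = ["Explain PagedAttention in one sentence."]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# tensor_parallel_size > 1 enables the multi-GPU mode mentioned above.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)

for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```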
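And a sketch of how DeepSpeed's optimization modules are pulled into existing PyTorch code, per the second bullet. The model choice is an assumption, and the argument names follow recent DeepSpeed releases (older versions take `mp_size` instead of the `tensor_parallel` dict); a CUDA device is assumed.

```python
# DeepSpeed-Inference sketch: inject optimized kernels into a stock HF model.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # assumption: any causal LM works similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# replace_with_kernel_inject swaps supported modules for fused inference kernels;
# tp_size > 1 shards the model across GPUs.
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 1},
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed makes inference", return_tensors="pt").to("cuda")
print(tokenizer.decode(engine.module.generate(**inputs, max_new_tokens=32)[0]))
```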
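Finally, a toy illustration of the hot/cold neuron idea behind PowerInfer. This is not PowerInfer's API (PowerInfer itself is a C++ engine built on llama.cpp); every name here is made up purely to show the concept: rank neurons by how often they activate, pin the hot rows of a weight matrix on the GPU, and evaluate the cold rows on the CPU. A CUDA device is assumed.

```python
# Toy hot/cold neuron split (illustrative only; not PowerInfer's actual code).
import torch

def split_hot_cold(weight, activation_counts, hot_fraction=0.2):
    """Partition neuron (row) indices by observed activation frequency."""
    n_hot = max(1, int(weight.shape[0] * hot_fraction))
    hot_idx = torch.topk(activation_counts, n_hot).indices
    cold_mask = torch.ones(weight.shape[0], dtype=torch.bool)
    cold_mask[hot_idx] = False
    # Hot (frequently firing) neurons live on the GPU; cold ones stay on the CPU.
    return hot_idx, weight[hot_idx].to("cuda"), cold_mask, weight[cold_mask]

def hybrid_forward(x, hot_idx, hot_w, cold_mask, cold_w):
    """One layer's output, computed partly on GPU and partly on CPU."""
    out = torch.empty(hot_w.shape[0] + cold_w.shape[0])
    out[hot_idx] = (hot_w @ x.to("cuda")).cpu()  # fast path for hot neurons
    out[cold_mask] = cold_w @ x                  # low-power path for cold neurons
    return torch.relu(out)  # PowerInfer targets ReLU models, whose sparsity it exploits
```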