An interesting open-source project: AirLLM


It optimizes inference memory usage so that a 70B model can run on a single GPU with 4GB of VRAM, without quantization, distillation, or pruning.
It can also run Llama 3.1 405B on 8GB of VRAM.
I want GLM 5.2 even more. Wouldn't that mean my 40+ GB of shared VRAM could also run a 700B+ model?
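
A quick back-of-the-envelope check on those numbers, as a rough sketch only. It assumes the weights are kept in fp16 (2 bytes per parameter) and that only one transformer layer has to sit in VRAM at a time, which is how I understand AirLLM's layer-by-layer loading. It also assumes all layers are the same size and ignores embeddings, activations, and the KV cache. The layer counts (80 for a Llama-style 70B, 126 for Llama 3.1 405B) are the published architecture figures, and `per_layer_gib` is just a helper name made up for this example.

```python
def per_layer_gib(total_params: float, num_layers: int, bytes_per_param: int = 2) -> float:
    """Approximate size of one layer's weights in GiB, assuming equal-sized layers."""
    total_bytes = total_params * bytes_per_param
    return total_bytes / num_layers / 2**30


# (parameter count, number of transformer layers); fp16 weights assumed
models = {
    "70B, 80 layers": (70e9, 80),
    "Llama 3.1 405B, 126 layers": (405e9, 126),
}

for name, (params, layers) in models.items():
    full_gib = params * 2 / 2**30  # all weights resident at once
    layer_gib = per_layer_gib(params, layers)  # only one layer resident
    print(f"{name}: full weights ~{full_gib:.0f} GiB, one layer ~{layer_gib:.2f} GiB")
```

Under these assumptions a single layer of the 70B model fits well inside 4GB and a single layer of the 405B model fits inside 8GB, so what matters for VRAM is the size of one layer, not the whole model. The cost is that every layer has to be read from disk on each forward pass, so expect it to be slow.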
GitHub stars: 🌟 21.3k