OpenAI Discovers New Method to Halve Inference Costs

robot
Abstract generation in progress
According to a source familiar with the discussions, there is previously undisclosed news: earlier this month, OpenAI engineers informed some colleagues that, relying on several newly developed optimization technologies, they have found a solution that can reduce model inference costs by more than half. After applying this new technology to scenarios where free/paid account visitors use ChatGPT, the number of required Nvidia graphics processing units (GPUs) was reduced to just a few hundred — a remarkably low figure. It is currently unclear what specific technical means OpenAI used to achieve this significant improvement in computational efficiency. Common optimization methods in the industry generally include: quantization compression, key-value caching, batch processing of user queries instead of computing them individually, and redirecting some requests to lower-power lightweight models or model shards for responses.
This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.
  • Reward
  • Comment
  • Repost
  • Share
Comment
Add a comment
Add a comment
No comments
  • Pinned