Gensyn Testnet Is Live: How to Make AI Training More Efficient and More Decentralized?

AI is currently the most closely watched niche in the crypto industry, and Gensyn, a distributed AI compute network that has raised a total of $50 million in funding led by a16z, is undoubtedly one of the most competitive projects in the space. Gensyn has now officially launched its Testnet; although it arrived more than a year later than originally planned, the project has finally entered a new phase.

As a custom Ethereum Rollup designed specifically for machine learning, the Gensyn Testnet integrates off-chain execution, validation, and communication frameworks. It aims to provide decentralized AI systems with key capabilities such as persistent identity, participation tracking, ownership maintenance, payments, remote execution coordination, trustless verification, training-process recording, and crowdfunding for large-scale training tasks.

The first phase of the Testnet focuses on tracking participation within RL Swarm. RL Swarm is an application for collaborative reinforcement-learning post-training in which nodes can be bound to on-chain identities, ensuring that each participant's contribution is accurately recorded.

RL Swarm: Core Functions and Collaborative Training

In the Gensyn Testnet, RL Swarm is the core application: a collaborative model-training system built on a decentralized network. Unlike the traditional approach of training a single model in isolation, RL Swarm allows multiple models to communicate with, critique, and improve one another within the network, raising overall performance together. Its core philosophy is "collective intelligence": achieving better training results through collaboration and feedback among node models.

Put simply, models such as DeepSeek-R1 can iteratively improve their reasoning performance through self-critique during training; RL Swarm extends this mechanism to a group of models, achieving a "many hands make light work" effect.

In RL Swarm, a model does not rely solely on its own feedback; it also identifies its shortcomings and optimizes itself by observing and evaluating the performance of other models. Each model node that joins the Swarm participates in a three-stage process: first, it independently solves the problem and outputs its reasoning and answer; second, it reviews the answers of other nodes and provides feedback; finally, the nodes vote to select the best solution and use it to revise their own outputs. This collaborative mechanism not only improves each individual model but also drives the evolution of the group as a whole. Models that join the Swarm can keep their improved local weights after leaving, so participation yields tangible benefits.
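
As a rough illustration of this three-stage loop, the hypothetical Python sketch below runs one Swarm round: every node solves the problem, critiques the others' answers, votes, and then updates its local weights from the winning answer. All class and function names are invented for illustration and are not Gensyn's actual RL Swarm code.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class Answer:
    node_id: str
    reasoning: str
    result: str

class SwarmNode:
    """One participating model node (placeholder logic throughout)."""

    def __init__(self, node_id: str):
        self.node_id = node_id

    def solve(self, problem: str) -> Answer:
        # Stage 1: solve the problem independently and output reasoning + answer.
        return Answer(self.node_id, f"reasoning by {self.node_id}", f"answer({problem})")

    def critique(self, others: list[Answer]) -> dict[str, float]:
        # Stage 2: review the other nodes' answers and score them (dummy scores here).
        return {a.node_id: float(len(a.result)) for a in others}

    def vote(self, scores: dict[str, float]) -> str:
        # Stage 3: vote for the answer this node rated highest.
        return max(scores, key=scores.get)

    def update_weights(self, best: Answer) -> None:
        # Use the winning answer to revise the local model (no-op in this sketch).
        pass

def swarm_round(nodes: list[SwarmNode], problem: str) -> str:
    answers = [n.solve(problem) for n in nodes]
    votes = []
    for n in nodes:
        others = [a for a in answers if a.node_id != n.node_id]
        votes.append(n.vote(n.critique(others)))
    winner_id = Counter(votes).most_common(1)[0][0]
    best = next(a for a in answers if a.node_id == winner_id)
    for n in nodes:
        n.update_weights(best)  # the winning reasoning becomes a training signal
    return winner_id

print(swarm_round([SwarmNode("a"), SwarmNode("b"), SwarmNode("c")], "2+2"))
```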

In addition, Gensyn has open-sourced the RL Swarm code, so anyone can run a node and start or join an existing Swarm without permission. The Swarm's underlying communication uses the gossip protocol provided by Hivemind, which supports decentralized message passing and the sharing of learning signals between models. Whether on a home laptop or a cloud GPU, anyone can join an RL Swarm node and take part in collaborative training.
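
For a sense of what the decentralized messaging layer looks like, the sketch below uses Hivemind's DHT (the library named above) to let two peers join the same swarm and exchange a value without any central server. It shows only Hivemind's generic primitives, not the RL Swarm protocol itself, and the key name is arbitrary.

```python
import hivemind

# Start a first peer; it bootstraps a new DHT swarm.
first_peer = hivemind.DHT(start=True)

# A second peer joins by dialing the first peer's visible multiaddresses.
second_peer = hivemind.DHT(initial_peers=first_peer.get_visible_maddrs(), start=True)

# Any peer can publish a value (e.g. node metadata or a learning signal)...
first_peer.store("round_7/best_answer", "node-42",
                 expiration_time=hivemind.get_dht_time() + 600)

# ...and any other peer can read it back through the decentralized store.
found = second_peer.get("round_7/best_answer")
print(found.value if found is not None else "not found")
```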

The Three Pillars of the Infrastructure: Execution, Communication, and Verification

At present, RL Swarm is still an experimental demonstration of a large-scale, scalable machine-learning method rather than a final product. Over the past four years, Gensyn's core work has in fact been building the underlying infrastructure, which entered its v0.1 phase with the release of the Testnet and is already operational. According to the official introduction, Gensyn's overall architecture is divided into three parts: execution, communication, and verification.

Execution: Consistency and Distributed Computing

Gensyn believes that the future of machine learning will no longer be limited to traditional monolithic models, but will instead consist of fragmented parameters distributed across devices around the world. To achieve this, the Gensyn team has developed an underlying execution architecture that ensures consistency across devices. The key technologies involved include:

  • Distributed Parameter Storage and Training: By partitioning large-scale models into multiple parameter blocks and distributing them across different devices, Gensyn achieves fragmented deployment of the model, reducing the memory requirements on a single node.
  • Reinforcement Learning Post-Training (RL Post-Training): Research shows that when models are trained collaboratively as a group, communicating and critiquing each other’s answers, the overall learning efficiency significantly improves. Gensyn demonstrates this concept using RL Swarm, allowing models to progress rapidly through collective discussion, further validating the effectiveness of distributed execution.
  • Reproducible Operators (RepOps): To ensure that different hardware (such as Nvidia A100 and H100) produces exactly the same computational results, Gensyn has developed the RepOps library, which achieves bitwise reproducibility across platforms by fixing the execution order of floating-point operations (see the sketch after this list).
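
The floating-point non-associativity that RepOps has to tame is easy to demonstrate. The toy sketch below (ordinary Python, not Gensyn's library) shows that summing the same numbers in a different order can change the final bits of the result, while fixing one canonical order makes the output exactly reproducible.

```python
import random

random.seed(0)
values = [random.uniform(-1e6, 1e6) for _ in range(100_000)]

def ordered_sum(xs, order):
    total = 0.0
    for i in order:
        total += xs[i]
    return total

# Same numbers, two different summation orders: mathematically equal,
# but floating-point addition is not associative, so the bits can differ.
forward = ordered_sum(values, range(len(values)))
shuffled_idx = list(range(len(values)))
random.shuffle(shuffled_idx)
shuffled = ordered_sum(values, shuffled_idx)
print(forward == shuffled)        # frequently False
print(forward.hex(), shuffled.hex())

# Fixing one canonical order (what a RepOps-style library enforces for every
# operator) yields bitwise-identical results on every run that follows it.
print(ordered_sum(values, range(len(values))).hex() == forward.hex())  # True
```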

Communication: Efficient Information Exchange

In large-scale distributed training, efficient communication between nodes is crucial. Traditional data-parallel methods can reduce communication overhead to some extent, but because they require every node to hold a complete copy of the model, they run into memory limits and scale poorly. To address this, Gensyn proposes a new set of solutions:

  • SkipPipe (Dynamic Skip Pipeline Parallelism): SkipPipe dynamically selects which computation layers each microbatch passes through, skipping certain stages of the traditional pipeline and thereby reducing unnecessary waiting. Its scheduling algorithm evaluates the availability of each path in real time, reducing node idle time and significantly shortening overall training duration. Test data show that in a decentralized environment SkipPipe can cut training time by roughly 55%, and that when some nodes fail, model performance drops by only about 7% (see the sketch after this list).
  • Communication Standards and Cross-Node Collaboration: Gensyn has built a communication protocol similar to TCP/IP, enabling participants from all over the world to transmit data and exchange information efficiently and seamlessly, regardless of the devices they use. This open standard provides a solid network foundation for decentralized collaborative training.
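
To make the "skip" idea concrete, here is a hypothetical toy scheduler that routes a microbatch only through pipeline stages that are currently available instead of waiting on busy or failed ones. The names and the availability model are invented for illustration; the real SkipPipe scheduler is considerably more sophisticated.

```python
import random

NUM_STAGES = 8   # pipeline stages spread across nodes
MIN_STAGES = 5   # minimum depth a microbatch must still traverse

def choose_path(available: list[bool]) -> list[int]:
    # Route the microbatch only through stages that are currently free;
    # a traditional pipeline would instead wait for every stage in order.
    path = [i for i, ok in enumerate(available) if ok]
    if len(path) < MIN_STAGES:
        # Too many stages are down: wait this tick instead of degrading further.
        return []
    return path

# Simulate one scheduling tick with roughly 20% of stages busy or failed.
availability = [random.random() > 0.2 for _ in range(NUM_STAGES)]
path = choose_path(availability)
print("stage availability:", availability)
print("microbatch path:   ", path if path else "wait for next tick")
```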

Verification: Ensuring Trust and Security

In a trustless distributed network, confirming that the computation results submitted by each participant are authentic and valid is a major challenge. Gensyn introduces a dedicated verification protocol designed to ensure, through a low-cost and efficient mechanism, that all compute providers deliver correct results:

  • Verde Verification Protocol: Verde is the first verification system designed specifically for modern machine learning. Its core is a lightweight dispute-resolution mechanism that quickly pinpoints the training step at which the model and the verifier diverge. Unlike traditional verification methods that require re-running the entire task, Verde only needs to recompute the disputed operations, dramatically reducing verification overhead.
  • Refereed Delegation: With this approach, if a compute provider's output is wrong, a validator can convince a neutral referee through an efficient dispute-resolution game; as long as at least one honest node is present, the correctness of the overall computation is guaranteed.
  • Storage and Hashing of Intermediate State: To support the verification process above, participants only need to store and hash selected intermediate training checkpoints rather than the full data, which reduces resource usage and improves the scalability and responsiveness of the system (see the sketch after this list).
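
As a hypothetical illustration of how hashed checkpoints keep dispute resolution cheap (in the spirit of Verde and refereed delegation described above, not Gensyn's actual protocol), the sketch below binary-searches two parties' checkpoint hash chains to find the first training step where they diverge; a referee would then recompute only that single step rather than the whole run.

```python
import hashlib

def checkpoint_hash(state: bytes) -> str:
    # Participants publish only hashes of intermediate checkpoints, not the data itself.
    return hashlib.sha256(state).hexdigest()

def first_divergent_step(hashes_a: list[str], hashes_b: list[str]) -> int:
    """Binary-search for the earliest step where the two hash chains differ.

    Assumes both parties agree on the initial state (step 0) and publish one
    hash per training step; returns the index of the first mismatching step.
    """
    lo, hi = 0, len(hashes_a) - 1  # invariant: lo agrees, hi disagrees
    assert hashes_a[lo] == hashes_b[lo] and hashes_a[hi] != hashes_b[hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if hashes_a[mid] == hashes_b[mid]:
            lo = mid
        else:
            hi = mid
    return hi  # the referee recomputes only this single step

# Tiny example: the provider's run silently diverges at step 6.
honest = [checkpoint_hash(f"state-{i}".encode()) for i in range(10)]
faulty = honest[:6] + [checkpoint_hash(f"bad-{i}".encode()) for i in range(6, 10)]
print("first disputed step:", first_divergent_step(honest, faulty))  # -> 6
```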