More than three years ago, when I was still using Sovits, training a voice model required source separation (removing background noise) to isolate the dry vocal first.


Then the dataset needed to be filtered to remove segments with heavy background noise, and only then could training begin.
Typically, training around 8,000 steps yields the best voice fidelity; if training exceeds 8,000 steps and the score is still below 25, the dataset and the run are basically a write-off.
If you insist on pushing on, all the way past 14,000 steps, so-called "divergence" sets in, and the output voice ends up either heavily electronic or barely recognizable as human.
Does this resemble the development process of quantitative trading?
Extracting the dry vocal is like handing the machine a clean dataset to learn from and build predictive models on.
Removing segments with heavy background noise is like filtering out invalid market data (such as 1-minute spikes and flash crashes).
Stopping at around 8,000 steps avoids severe overfitting; pushing past 14,000 steps leads to "divergence" (severe overfitting), making live results as random as a coin flip.
Although not in the same field, the underlying logic is the same.
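That shared recipe — clean the data, then stop before the model diverges — can be sketched as a toy example. Everything here (the thresholds, the sample numbers, the function names) is hypothetical for illustration; it is not taken from any real Sovits or trading pipeline:

```python
def filter_spikes(returns, threshold=0.10):
    """Drop 1-minute bars whose absolute return exceeds the threshold,
    analogous to removing heavily noisy segments from a voice dataset."""
    return [r for r in returns if abs(r) <= threshold]

def early_stop(val_losses, patience=3):
    """Return the step with the best validation loss, stopping once the
    loss has failed to improve for `patience` consecutive checks --
    the trading-side analogue of halting Sovits training before it
    diverges past the useful point."""
    best, best_step, waited = float("inf"), 0, 0
    for step, loss in enumerate(val_losses):
        if loss < best:
            best, best_step, waited = loss, step, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_step

# Hypothetical 1-minute returns: the 0.35 and -0.18 bars are "invalid" spikes.
clean = filter_spikes([0.01, -0.02, 0.35, 0.004, -0.18])

# Hypothetical validation losses: improvement stalls after the fourth check.
stop_at = early_stop([1.0, 0.8, 0.6, 0.55, 0.56, 0.57, 0.58])
```

The point of the sketch is the shape of the logic, not the numbers: both pipelines discard data the model should never see, and both stop optimizing the moment out-of-sample performance stops improving.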
In the future, the ones who beat us may well not be industry insiders at all, but people crossing over from other fields...