I'm trying to implement the CALM paper, and I have some questions. [P]
Hello, I'm trying to implement Pocket TTS by kyutai-labs, described in this paper. Since they haven't released the training/fine-tuning code, I'm trying to implement it on my own to learn. I have read the paper and tried to implement it with far fewer parameters and a smaller amount of data. I trained this text-to-speech model on single-speaker LJSpeech (1) and on the LibriSpeech clean subset (2), but it is failing badly.

For (1), since it's a single-speaker dataset, I didn't add voice cloning, just plain text and target latents. The flow matching loss went down to about 0.20 MSE, and the EOS loss became very low, around (x)e-(y) levels. But when I run inference with the model saved at the 2800th epoch, it barely generates anything meaningful, even for text from its training set. I tried different techniques, such as scheduled sampling to reduce exposure bias (the model was sometimes hallucinating and repeating the same phrase twice); it didn't work. I added standard Gaussian noise to the ground truths; that didn't work either. After struggling with lots of implementations, I decided to move on to the somewhat larger LibriSpeech dataset, because I thought the data scale was too small.

For (2), I read the paper again. No scheduled sampling, added the head multiplication, etc., and implemented the paper on the LibriSpeech dataset. I tried the sequence layout audio condition + text tokens + BOS + target latents, and also swapped the audio prompt with the text tokens. I observed a tradeoff in this setup: if I put the text tokens next to the target latents, the model generates better text but the voice is not even close to the audio prompt; if I put the audio condition tokens next to the target latents, I get better voice cloning but gibberish speech. I also found that the loss is very spiky and the grad norm is exploding, as you can see in the images below.

[Image: loss and lr values for setup 1 (LJSpeech)]

[Image: values for setup 2 (LibriSpeech)]

I used Pocket TTS's original Mimi audio encoder, extracted from the original model.

What are your suggestions? Should I read the paper over and over again? Should I increase the amount of data by collecting from different sources (the authors say they used 88,000 hours of publicly available data)? Is there a system design problem? Training was performed on an RTX 5080 desktop GPU. I want to move on to a bigger dataset, but I can't burn GPU credits on a run that may not give the expected result. When should I increase the dataset and start training on bigger clusters that could give me satisfactory results?
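To make the two quantities discussed in the post concrete (the flow matching MSE and the gradient norm), here is a minimal NumPy sketch. It assumes a linear noise-to-data interpolation path with a velocity target, which is a common flow matching formulation but may not match the paper's exact objective. All function names (`flow_matching_pair`, `fm_mse`, `clip_by_global_norm`) are hypothetical and are not taken from the paper or the Pocket TTS code.

```python
import numpy as np


def flow_matching_pair(x1, rng):
    """Build (x_t, t, v_target) for a linear path from noise x0 to data x1.

    Assumption: x_t = (1 - t) * x0 + t * x1 and the regression target is
    the constant velocity x1 - x0. The paper's objective may differ.
    """
    x0 = rng.standard_normal(x1.shape)  # Gaussian noise sample
    # One time value per example, broadcast over the remaining axes.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, t, v_target


def fm_mse(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping,
    which is the value worth logging when the loss is spiky.
    """
    total = float(np.sqrt(sum(float(np.sum(g ** 2)) for g in grads)))
    scale = min(1.0, max_norm / (total + eps))
    return [g * scale for g in grads], total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.standard_normal((4, 8, 32))  # stand-in for target latents
    x_t, t, v_target = flow_matching_pair(latents, rng)
    # Loss of a trivial predictor that always outputs zero velocity.
    print("zero-predictor MSE:", fm_mse(np.zeros_like(v_target), v_target))
    grads = [np.full(4, 3.0), np.full(4, 4.0)]
    clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
    print("grad norm before clipping:", norm_before)
```

Comparing the training loss with the loss of a trivial predictor on the same latents gives a rough sense of how much of the target the model has actually learned. Logging the pre-clipping norm at every step shows whether the spikes line up with particular batches.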