VoiceCraft - speech editing and zero-shot text-to-speech (TTS) on in-the-wild data including audiobooks, internet videos, and podcasts.
To clone an unseen voice or edit a recording, VoiceCraft needs only a few seconds of the voice.
Running locally on a 3080, it takes 8 seconds to create 13 seconds of high quality voice.
https://jasonppy.github.io/VoiceCraft_web/
It's nice but how would one implement it in a nice gui? I can only code with an LLM
Look into gradio, you can see how similar projects have used it. But I'm sure we don't have to wait long until we get a webui.
i got it working with jupyter but it's so inconvenient. i need a helpful autist to port it to gradio
you need to sign an agreement to download
>you need to sign an agreement to download
Hmm, it seems though you only need Gigaspeech if you're training a model yourself and it's not needed for inference, and SpeechColab appears unaffiliated with VoiceCraft's team. If you really want the audio dataset, just submit fake info like the madlad anon who used an .edu address to give us Llama, just as Prometheus gave humans fire.
how new are you? that's how huggingface works
I haven't used huggingface before
But is there a way to do this anonymously?
Literally just use a fake email
>Gigaspeech is downloaded through HuggingFace. Note that you need to sign an agreement in order to download the dataset (it needs your auth token)
How do I get this without any authentication or signing any agreements?
Anon if you get sued for abusing the weights there's no torrent exception.
https://huggingface.co/pyp1/VoiceCraft/tree/main
sorry I misread, yeah no way around that.
can this be downloaded in such a way that I dont need to create any kind of account anywhere so nobody ever knew it was me?
(except for CIA who knows my ISP, but I have no trouble with them so its ok)
you don't need an account to download these models. When you start the gay jupyter shit it downloads them itself
so please give me instructions how to do the whole thing? in 10 steps?
what do I need?
in addition to Linux laptop
on windows install wsl and then conda in wsl if you dont have it already. git clone the repo. then follow the github environment set up. then start jupyter and change os.environ["CUDA_VISIBLE_DEVICES"]="0"
and add
os.system("mfa model download dictionary english_us_arpa")
os.system("mfa model download acoustic english_us_arpa")
by the end of cell 3 the first time you run it then delete it
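Put together, the cell edit described above would look roughly like this (a sketch, assuming the conda env with `mfa` on PATH is active when the notebook kernel starts):

```python
import os

# pin inference to the first GPU; change the index on multi-GPU boxes
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# one-time download of the Montreal Forced Aligner dictionary and
# acoustic model -- delete these two lines after the first successful run
os.system("mfa model download dictionary english_us_arpa")
os.system("mfa model download acoustic english_us_arpa")
```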
or someone just opened a pull request for some docker shit i havent looked into it
https://github.com/jasonppy/VoiceCraft/pull/25
Are you dyslexic?
No, I’m not going to do that.
where can I download the exe?
Bumping because this is cool, but I should have just done it on WSL instead of trying to make it work on windows. I just wasted a couple of hours being a moron
went for WSL, got this moronic error picrel
>it was conda update that fricked it all
I hate python
your fault for not properly using environments
>HuggingFace Spaces demo coming
looking forward to that special thread. if it captures emma watson voice nicely, she will read good night stories for me every night
For me it's aurora aksnes, her voice is so cute :3
needs xformers, are AMD users out of luck?
>are AMD users out of luck?
In ML contexts: yes.
Werks on my end.
Did you replace all the dependencies etc by hand or does this work out of the box with environments set up?
The docker solution got mad because I have no nvidia interfaces for it.
it works on CPU anyway
How slow is it?
Whenever somebody says "it works on CPU" in ML, it works the same way as it worked for this guy
on CPU (7800x3d) it's faster than what the 3060 guy has
https://github.com/ROCm/xformers
How many petabytes of RAM does one's PC need to run this? I have been using Microsoft's free TTS API to read shit for me, but I think sending them all that data for free might not be good.
2.4e-5
/vsg/ bros... we're so back
https://vocaroo.com/1cuuHDbdemww
come to think of it, what happened to the /vsg/ threads?
jesus frick can you frick off back to le reddit with your stupid fricking halfwit questions? you have absolutely no idea what you're even talking about, read a fricking book you low IQ Black person tourist
looks like you hate anonymous, prolly a microsoft shill
>happened to the /vsg/ threads?
There were zero happenings, aside from 11 labs, so everyone dropped it. Literal dead general.
Now there's this, and suno.ai doing music-via-prompting-as-a-service. Shits wild.
So, maybe VSG is finally back. I'm excited, eagerly awaiting the day TTS one-shot can do sillytavern rp and work with LLMs, and get it right. Capability is there now, but piss poor quality.
gui when?
frick you eleven labs
>Puyuan Peng, Po-Yao Huang, Daniel Li
in light of recent events please audit the code VERY CLOSELY before running the software
Did chinese people frick with some other code recently or something? What recent events?
No, nothing to worry about gweilo
Some chinese guy by the name of Jia Tan pushed a really well obfuscated backdoor into a compression library used by some 50+% of linux packages and system binaries. This is probably the most significant CVE ever. The backdoor has lived in the git repo for several months and was even updated without anyone noticing. Evidence shows that he has been collaborating with other Chinese to push similar updates to other open source repos, including the Linux kernel source itself.
this only affected bleeding edge Black folk. no one sane is affected. and this is off-topic
>this backdoor being used in system libraries for the past several months and went unnoticed until now only because some autist saw his ssh logins were taking 500ms longer than before isn't really a big deal! nothing wrong here! also not on topic!
Are you Chinese? Are you the developer of VoiceCraft? Regardless, you can go ahead and play with your AI toys, I'm just answering the other anon's question which was prompted by a warning to be careful since open source != perfectly safe.
Welcome to the promised land of local, /vsg/. Try not to burn out your GPUs too fast while you're here.
>/sdg/ not a vertical cliff
https://vocaroo.com/12vTi6URKLNU
Eleven libs seething
Not bad! Use a better prompt like a copy pasta.
Oh shit, is voice craft actually the real deal?
>high quality
>16kHz
Into the trash it goes
i will never install conda
i will never install linux
i will never use windows subsystem for linux
i will run this on windows
bro I'll just get it running on windo-
>InterpolationResolutionError: KeyError raised while resolving interpolation: "Environment variable 'USER' not found
okay, I'll just set it mysel-
>AttributeError: module 'os' has no attribute 'uname'
what? why does that function not exist on windows? at least I can just replace os with platform and it'll wor-
>AttributeError: 'uname_result' object has no attribute 'sysname'
WSL time
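The three crashes above all come from Unix-only assumptions in the code (`USER` env var, `os.uname()`). A portable workaround sketch, just illustrating the idea rather than the repo's actual fix:

```python
import os
import platform

# USER is a Unix convention; Windows sets USERNAME instead
user = os.environ.get("USER") or os.environ.get("USERNAME", "unknown")

# os.uname() does not exist on Windows; the platform module covers both,
# and platform.system() stands in for the missing uname().sysname
sysname = platform.system()   # "Windows", "Linux", or "Darwin"
hostname = platform.node()
```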
ML Python "people" doing everything in their power to make sure their code isn't portable outside of their own specific machine. Every fricking time.
Is it English only?
>no web ui or one click installer
I'm honestly too lazy to give a frick
Where's a proper webui version?
didnt openai also just release another one? I just want the best dagoth ur, offline, runtime doesnt matter. Which of the plethora of xtts, bark, coqui, tortoise and whatevers is the best? Judging from the huggingface arena ill go with xtts
How much VRAM is the minimum?
>enter poorgaygus maximus
it's release day 1 homie, go harass /lmg/ a little and come back in a few weeks
I don't know
the repeat 3 batch 4 script I snagged from lmg took 16gb of VRAM + 31gb of shared ram
it's a shitty transformers-based TTS, so for 15 seconds of audio you need up to 24 gb vram, so it's not worth any time wasted on installing.
voicecraft vs gpt sovits?
https://vocaroo.com/1bRF3QW0bX2v
/g/entlemen, im trying to do the needful on windows. i got everything running, and then this line. it got stuck forever. espeak can be run from the command line from PATH. apparently triton is not an issue here. there is some memory usage though. using inference_speech_editing.ipynb
so it was because the encodec_fn model was symlinked, so it didnt work because muh file permission
after some editing and removing linux specific commands in the code, i was able to run it, fully on windows, without wsl
just an idea, brownpill the hispanics about pajeets using the latam voice of goku (mario castañeda)
it takes about 1 minute for my 3060 to run using the giga830M model
https://voca.ro/1c2vVpJtkprL
default demo voice
included is the fix for the linux anti-windows error. go to "src\audiocraft\audiocraft\utils\cluster.py" and comment out these lines
follow the instruction in the given jupyter notebook (double click on the markdown cell to edit mode to read the text. i forgot to do reddit spacing).
the edited notebook. it worked on my machine:
https://files.catbox.moe/fahsys.ipynb
just replace the inference_speech_editing.ipynb with this
Are you using conda and etc?
also you need to run vscode in admin mode to load the model, for some reason. also i cant get the mfa aligner to create the csv for me
the command is
mfa align -j 1 --output_format csv demo/temp --clean english_us_arpa english_us_arpa demo/temp/mfa_alignments --beam 1000 --retry_beam 2000
yes im using conda
>it takes about 1 minute for my 3060 to run using the giga830M model
1. are you sure it's using your GPU and not cpu?
2. even with ram fallback that's slow
it used 100% of my cuda and a little bit of vram. ram use is not much, cpu use was ~60%. im using my 3060 on a potato setup with PCI gen 3 so it is slow. also the inference time varies a lot, changing the output text a bit and it only takes ~20s
also the mfa issue seems to be because of some permission thing, i moved it to a different drive and it seems to run but it still doesnt generate the csv :'(
3060 guy here, it uses a lot of gpu and cpu so i dont know what is wrong lol. it suddenly stopped working and i had to do a clean install
install instruction minimum for inference +
https://rentry.org/3rdkmdth
it seems like the generation length is why it takes so long? my previous gens were ~15 seconds long
What if someone torrents the shit you need auth to download?
why would anyone want to help autist morons like you? just make a burn account ffs
ok so the deal is that, this whole thing gotta be run in admin mode.
to generate the mfa csv file, you need to run an admin cmd, activate conda, run the command there instead of inside the jupyter to see the progress, else you would just be waiting with no progress bar.
to generate the csv for the demo, it takes 142s (default 1 worker - 1 voice - i guess it is faster if you do a batch of a few workers at once?).
also you need to download the models. i only found that out after searching for it in the code
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa
it is pretty slow for a 7 second long audio. but this only needs to be done once
>To clone an unseen voice or edit a recording, VoiceCraft needs only a few seconds of the voice.
I'm moronic and don't know how computers werk but want to jerk off to cartoon characters saying lewd things, should I beat my head against this until I manage to get it to work or is there a significantly easier path available?
waitTM
are the results from this better than using xtts2 + rvc?
Funny how all of these guys start popping up a month after coqui dies
its nice but nobody can do anything with this shit until SOMEONE MAKES A FRICKING C++ LIBRARY FOR FRICK SAKE
>WhisperSpeech
>Vits
>Metavoice
>OpenVoice
>StyleTTS
>Tortoise TTS
Not a single fricking one has a decent C++ library.
This is just a guess, but AI built on C++ is virtually impossible to scale to different systems without the most cancerous form of containerization.
no, look at llama.cpp, look at whisper.cpp, look at stable-diffusion.cpp
yeah but it took python bindings for people to adopt it and find faults or improvements.
>C++
>not C
I'll take either. Just a single implementation so I can use it in a game or something
if you're too lazy to write your own library for it yourself then you're too lazy to make a meaningful game in the first place anon
the truth hurts but it's something you can change
be the change you want to see in the world
One is vastly more difficult than the other moron
moronic logic.
they would both be the same difficulty if you weren't some glorified rpgmaker drag and drop frickchuckle that doesn't deserve to call yourself a game developer
actually I use scratch
Can we get some decent examples in this thread? I gotta be honest, the ones here aren't the ELEVENLABS KILLER?!?!? shit I was expecting
>To clone an unseen voice or edit a recording, VoiceCraft needs only a few seconds of the voice.
This is nothing new, but it is still nice.
Is it better than XTTS2+RVC?
This is a human voice of persona 3:
https://voca.ro/1iyFPj4eF84W
This is a cloned voice using XTTS2+RVC:
https://voca.ro/12q8WElDmO7Q
I haven't tested voicecraft, but i wonder if it's better than this.
https://voca.ro/17dXFZXLrnTS
this one took 7.4 seconds + the 200 seconds needed to build the mfa
https://voca.ro/1hHudQpOUUpm
when the sentence is completely different
sounds more soulful
Interesting, but the cloning is polluted by room noise/reverb.
It's not really "clean", i don't know if it's on purpose or not but it affects the quality, specially on the second one.
Overall, it seems to be a great TTS and has a lot of potential if you mix it with RVC. i'll wait for a GUI to test it because i hate conda notebooks.
>This is a human voice of persona 3:
>https://voca.ro/1iyFPj4eF84W
For comparison's sake, I did this on VALLE-X, using the clip you provided. it took 6.3 seconds to generate (3060ti)
https://voca.ro/1gyaCkwzXIei (Little Tom Miiverse post)
Though to be fair, I did first turn the audio into an npz (the format VALLE-X uses, only took 3 seconds to create) and I had to regen 3 times (the other two had random erratic pronunciations, the usual shit)
Honestly, so far I'm kinda not that sold on VoiceCraft, though if it's less prone to erratic glitches, that's pretty good at least.
the results sound better than xtts2, but i've yet to find any clear documentation on how to configure it for better results. everything i've done with it is with just default parameters from this repository which allows you to use it with rvc in a webui.
https://github.com/Vali-98/XTTS-RVC-UI
I think this would provide better results IF you could have more control than just feeding it an input value and an index value, but as it stands, the base xtts2 output on its own is inferior to voicecraft, but with rvc it's way better while being faster. the length of the prompt doesn't affect the time it takes to compute. you can't really control the expressiveness, but i don't think you can with voicecraft either. the only options i've seen that offer expression control are openvoice, where you can use emojis or w/e to tell it what the emotional state is, or bark with emojis as well, or *sad* *upset* etc.
https://github.com/myshell-ai/OpenVoice
https://github.com/suno-ai/bark
my current opinion is that bark is inferior in terms of tonal reproduction, but the expressiveness is higher. someone correct me if i'm wrong.
>the results sound better than xtts2
Does it really?
It feels as if voicecraft is trying to mask the robotic voice by giving the effect of a person using a shitty mic and speaking far from it.
I don't think there is a single crisp, high-quality voicecraft example.
>single crisp, high-quality
thats because the audio sample is only 16k
wow that's cool! let me git clone this open-source repo and run it on my machine!
T H E E X E
this is why me and my parents have a safe word. so in case of some crazy phone call, we can use that safe word to know if the other person is real or a bot
imagine having parents in 2024
https://voca.ro/1bu9LWfnfj8j
very interesting. first is the gen time being low (16 seconds for a 7 sec gen vs 1+ minute for a 15 second gen. it seems the gen time is related to how long the output is, not how long the input speech is). second is that the closer the new sentence is, the better (obviously). finally, once it hits the substituted words, the later parts, even if the same as the input, sound worse
>once it hits the substituted words, the later parts, even if the same as the input, sound worse
i think this is due to the later original parts having to be adapted to match the flow of the substitution
https://voca.ro/14X7eEsal92L
for this 21 seconds, it took 2m and 30s. "Well, I think we're doing very well. The new polls just came out." is from the original audio clip. the rest is generated.
seems to me the longer the original text is, the better quality the gens, because it uses that text only as input. the rest is ignored
It seems like we're not quite there yet, but this is a step up.
Is anyone working on voice synthesis? Not just TTS or RVC, I mean writing something like "female, 35 years old, Irish accent, [sassy:flirty:0.5]" and getting a novel voice you can then feed into the pipeline?
haven't seen anything like that, but that would be very cool
Thinking it over, a lot of the work in terms of collecting and cleaning audio has already been done in order to train individual voices for RVC. What would then need to happen is to collect all (or at least some reasonable fraction) of those voices, and recaption them not with character/actor names but with the qualities of the voice itself. Villain, hero/heroine, accent, age, etc. That in and of itself isn't trivial, but then after that I'm not sure how that dataset would turn into a general model. There's something there though.
Shit like VoiceCraft is heavily looked down upon by most AI researchers because it's too powerful. Now imagine if you could just write "desperate, 20 year old, female, crying" and get something as good... It would be the dream of scammers.
god i fricking resent how all of the cool shit is gimped by the existence of buttholes
https://voca.ro/1ebPXNRaZwMk
Guys I can only get so far as getting the audicraft notebook working. Is there a working notebook for voice craft?
I had more fun with speech in the couple weeks it was around than all the other boring coonetshit combined.
I'm glad voice synth is still getting attention. This is far better than the previous open source stuff. Elevenlabs still wins in quality and cloning, but this isn't too far behind. It has a certain low quality to it though that sounds like it was trained solely on C-SPAN callers.
https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf
>Gigaspeech training set (Chen et al., 2021a) is used as the training data, which contains 9k hours of audiobooks, podcasts, and YouTube videos at 16kHz audio sampling rate. Audio files that are shorter than 2 seconds are dropped.
>The training of the 830M VOICECRAFT model took about 2 weeks on 4 NVIDIA A40 GPUs.
that should be ~$650 on vast.ai. so someone training a new model with a better dataset for under $1k is possible
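Back-of-the-envelope on that figure, assuming roughly $0.48/hr per A40 on vast.ai (an assumed rate; spot prices fluctuate):

```python
gpus = 4                 # "4 NVIDIA A40 GPUs" per the paper
hours = 14 * 24          # "about 2 weeks"
rate = 0.48              # assumed $/GPU-hour; check current listings
cost = gpus * hours * rate
print(f"${cost:,.0f}")   # ballpark ~$645
```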
Nice
Gradio UI when? I don't want to have to make one from scratch and I hate poopiter notebooks.
Now: https://github.com/friendlyFriend4000/VoiceCraft
I don't know if this repo is yours, but I keep getting this:
Traceback (most recent call last):
File "<env_path>/lib/python3.9/site-packages/gradio/routes.py", line 534, in predict
output = await route_utils.call_process_api(
File "<env_path>/lib/python3.9/site-packages/gradio/route_utils.py", line 226, in call_process_api
output = await app.get_blocks().process_api(
(...)
File "<project_path>/data/tokenizer.py", line 140, in tokenize_audio
wav, sr = torchaudio.load(audio_path, frame_offset=offset, num_frames=num_frames)
File "<env_path>/lib/python3.9/site-packages/torch/_ops.py", line 502, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: Invalid argument: num_frames must be -1 or greater than 0.
it's mine. is your cut off at 0s by chance?
Yes. I did set the cutoff manually later, though, but I still cannot get the same results as the colab code
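The traceback above suggests a 0s cut-off maps to `num_frames == 0`, which `torchaudio.load` rejects. A defensive sketch (hypothetical helper, not from the repo) that converts a cut-off to a frame count torchaudio will accept:

```python
def frames_for_cutoff(cutoff_sec: float, sample_rate: int = 16000) -> int:
    """Translate a cut-off in seconds into a torchaudio-safe frame count.

    torchaudio.load requires num_frames to be -1 (read to end of file)
    or greater than 0, so a zero-length request falls back to -1
    instead of raising "num_frames must be -1 or greater than 0".
    """
    num_frames = int(cutoff_sec * sample_rate)
    return num_frames if num_frames > 0 else -1
```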
back in my day they made rollercoaster tycoon with fricking assembly
interesting issue
dafuq
its cells...tormented
at first i thought it's a bot with that numbername, but i have witnessed schizophrenics say things like that
VoiceCraft for ComfyUI, in progress.
https://github.com/kijai/ComfyUI-VoiceCraft
why the frick would you add it to an imagegen UI instead of a textgen one?
comfyui isn't an imagegen specific ui, it's a node based editor for any model.
cumrag is desperate for clout now that SAI is sinking
That was very easy. Downloaded the models automatically and everything. Unfortunately I do not have enough VRAM to generate with this. Might frick around with CPU.
did you have to install espeak manually? it says in the repository that it's required, but i'm not sure if that means i have to install it using the comfyui manager, or git clone it somewhere else and then point to it.
how to install espeak:
https://bootphon.github.io/phonemizer/install.html
thanks. do you also happen to know what node allows you to point to the library path?
I did, I just grabbed the msi and it worked.
If you're using the example json file / workflow from the repo, there should be a node that has the load library grey dot on it. Drag that dot out and it will give you the option to wire it up to a string primitive node.
I didn't do this though. I think the default autodetected it.
yeah i just found the primitive node. okay now it's just a matter of actually utilizing the feature set on here. does the audio tensor need to be pointed to as well?
there's a workflow set up in the demo folder
>non commerical license
>weights not released
So what's the point?
?
everything you wrote is false
Stupid little homosexual
Is there a way to mix voice models? In an attempt to come up with hardly recognizable voices I can use in games.
No one click exe installer, no download
it's that simple
The community will be limited to a handful of autistic morons with no artistic sensibility until such a thing is made, so it is doomed to wallow in obscurity and mediocrity until a heroic man of the people makes these things available to non-codegays.
this is correct. things need to be idiot proof, as in click a button and it just works. not because there are non-codegays but because 80% of the population is moronic and needs handholding
Once that happens, somebody will make a racist robocall in Biden's voice on election day to entire black neighborhoods. Then all hell breaks loose.
Oh no that's horrible
What if some chuds actually do this?
Hypothetically how would one support this chud?
weird how i already made this yesterday and today you are talking about biden and Black folk
https://voca.ro/1ifYes5VJ362
sad that biden was the only audio file i had on the disk. I don't really care about biden. presidents don't really have much agency anyway
>still barely any example gens posted
guys...
Because there's no exe.
This doesnt make a lick of sense
Someone post an exe that does everything for me
Okay, so it seems like you can use this thing to train voice models but what software do you actually use to implement the voice model?