VoiceCraft - speech editing and zero-shot text-to-speech (TTS) on in-the-wild data including audiobooks, internet videos, and podcasts.
To clone an unseen voice or edit a recording, VoiceCraft needs only a few seconds of the voice.
Running locally on a 3080, it takes 8 seconds to create 13 seconds of high quality voice.
https://jasonppy.github.io/VoiceCraft_web/
It's nice but how would one implement it in a nice gui? I can only code with an LLM
Look into gradio, you can see how similar projects have used it. But I'm sure we don't have to wait long until we get a webui.
i got it working with jupyter but it's so inconvenient. i need a helpful autist to port it to gradio
you need to sign an agreement to download
>you need to sign an agreement to download
Hmm, it seems though you only need Gigaspeech if you're training a model yourself and it's not needed for inference, and SpeechColab appears unaffiliated with VoiceCraft's team. If you really want the audio dataset, just submit fake info like the madlad anon who used an .edu address to give us Llama, just as Prometheus gave humans fire.
how new are you? that's how huggingface works
I haven't used huggingface before
But is there a way to do this anonymously?
Literally just use a fake email
>Gigaspeech is downloaded through HuggingFace. Note that you need to sign an agreement in order to download the dataset (it needs your auth token)
How do I get this without any authentication or signing any agreements?
Anon if you get sued for abusing the weights there's no torrent exception.
https://huggingface.co/pyp1/VoiceCraft/tree/main
sorry I misread, yeah no way around that.
can this be downloaded in such a way that I dont need to create any kind of account anywhere so nobody ever knew it was me?
(except for CIA who knows my ISP, but I have no trouble with them so its ok)
you don't need an account to download these models. When you start the gay jupyter shit it downloads them itself
so please give me instructions how to do the whole thing? in 10 steps?
what do I need?
in addition to Linux laptop
on windows install wsl and then conda in wsl if you dont have it already. git clone the repo. then follow the github environment set up. then start jupyter and change os.environ["CUDA_VISIBLE_DEVICES"]="0"
and add
os.system("mfa model download dictionary english_us_arpa")
os.system("mfa model download acoustic english_us_arpa")
by the end of cell 3 the first time you run it then delete it
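Put together, the cell edit described above would look roughly like this (a sketch, assuming the conda env with `mfa` on PATH is active when the notebook kernel starts):

```python
import os

# pin inference to the first GPU; change the index on multi-GPU boxes
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# one-time download of the Montreal Forced Aligner dictionary and
# acoustic model -- delete these two lines after the first successful run
os.system("mfa model download dictionary english_us_arpa")
os.system("mfa model download acoustic english_us_arpa")
```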
or someone just opened a pull request for some docker shit i havent looked into it
https://github.com/jasonppy/VoiceCraft/pull/25
Are you dyslexic?
No, I’m not going to do that.
where can I download the exe?
Bumping because this is cool, but I should have just done it on WSL instead of trying to make it work on windows. I just wasted a couple of hours being a moron
went for WSL, got this moronic error picrel
>it was conda update that fricked it all
I hate python
your fault for not properly using environments
>HuggingFace Spaces demo coming
looking forward to that special thread. if it captures emma watson voice nicely, she will read good night stories for me every night
For me it's aurora aksnes, her voice is so cute :3
needs xformers, are AMD users out of luck?
>are AMD users out of luck?
In ML contexts: yes.
Werks on my end.
Did you replace all the dependencies etc by hand or does this work out of the box with environments set up?
The docker solution got mad because I have no nvidia interfaces for it.
it works on CPU anyway
How slow is it?
Whenever somebody says "it works on CPU" in ML, it works the same way as it worked for this guy
on CPU (7800x3d) it's faster than what the 3060 guy has
https://github.com/ROCm/xformers
How many petabytes of RAM does one's PC need to run this? I have been using Microsoft's free TTS API to read shit for me, but I think sending them all that data for free might not be good.
2.4e-5
/vsg/ bros... we're so back
https://vocaroo.com/1cuuHDbdemww
come to think of it, what happened to the /vsg/ threads?
jesus frick can you frick off back to le reddit with your stupid fricking halfwit questions? you have absolutely no idea what you're even talking about, read a fricking book you low IQ Black person tourist
looks like you hate anonymous, prolly a microsoft shill
>happened to the /vsg/ threads?
There were zero happenings, aside from 11 labs, so everyone dropped it. Literal dead general.
Now there's this, and suno.ai doing music-via-prompting-as-a-service. Shits wild.
So, maybe VSG is finally back. I'm excited, eagerly awaiting the day TTS one-shot can do sillytavern rp and work with LLMs, and get it right. Capability is there now, but piss poor quality.
gui when?
frick you eleven labs
>Puyuan Peng, Po-Yao Huang, Daniel Li
in light of recent events please audit the code VERY CLOSELY before running the software
Did chinese people frick with some other code recently or something? What recent events?
No, nothing to worry about gweilo
Some chinese guy by the name of Jia Tan pushed a really well obfuscated backdoor into a compression library used by some 50+% of linux packages and system binaries. This is probably the most significant CVE ever. The backdoor has lived in the git repo for several months and was even updated without anyone noticing. Evidence shows that he has been collaborating with other Chinese to push similar updates to other open source repos, including the Linux kernel source itself.
this only affected bleeding edge Black folk. no one sane is affected. and this is off-topic
>this backdoor being used in system libraries for the past several months and went unnoticed until now only because some autist saw his ssh logins were taking 500ms longer than before isn't really a big deal! nothing wrong here! also not on topic!
Are you Chinese? Are you the developer of VoiceCraft? Regardless, you can go ahead and play with your AI toys, I'm just answering the other anon's question which was prompted by a warning to be careful since open source != perfectly safe.
Welcome to the promised land of local, /vsg/. Try not to burn out your GPUs too fast while you're here.
>/sdg/ not a vertical cliff
https://vocaroo.com/12vTi6URKLNU
Eleven libs seething
Not bad! Use a better prompt like a copy pasta.
Oh shit, is voice craft actually the real deal?
>high quality
>16kHz
Into the trash it goes
i will never install conda
i will never install linux
i will never use windows subsystem for linux
i will run this on windows
bro I'll just get it running on windo-
>InterpolationResolutionError: KeyError raised while resolving interpolation: "Environment variable 'USER' not found
okay, I'll just set it mysel-
>AttributeError: module 'os' has no attribute 'uname'
what? why does that function not exist on windows? at least I can just replace os with platform and it'll wor-
>AttributeError: 'uname_result' object has no attribute 'sysname'
WSL time
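The three crashes above all come from Unix-only assumptions in the code (`USER` env var, `os.uname()`). A portable workaround sketch, just illustrating the idea rather than the repo's actual fix:

```python
import os
import platform

# USER is a Unix convention; Windows sets USERNAME instead
user = os.environ.get("USER") or os.environ.get("USERNAME", "unknown")

# os.uname() does not exist on Windows; the platform module covers both,
# and platform.system() stands in for the missing uname().sysname
sysname = platform.system()   # "Windows", "Linux", or "Darwin"
hostname = platform.node()
```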
ML Python "people" doing everything in their power to make sure their code isn't portable outside of their own specific machine. Every fricking time.
Is it English only?
>no web ui or one click installer
I'm honestly too lazy to give a frick
Where's a proper webui version?
didnt openai also just release another one? I just want the best dagoth ur, offline, runtime doesnt matter. Which of the plethora of xtts, bark, coqui, tortoise and whatevers is the best? Judging from the huggingface arena ill go with xtts
How much VRAM is the minimum?
>enter poorgaygus maximus
it's release day 1 homie, go harass /lmg/ a little and come back in a few weeks
I don't know
the repeat 3 batch 4 script I snagged from lmg took 16gb of VRAM + 31gb of shared ram
it's a shitty transformers-based TTS, so for 15 seconds of audio you need up to 24 gb vram, so it's not worth any time wasted on installing.
voicecraft vs gpt sovits?
https://vocaroo.com/1bRF3QW0bX2v
/g/entlemen, im trying to do the needful on windows. i got everything running, and then this line. it got stuck forever. espeak can be run from the command line from PATH. apparently triton is not an issue here. there is some memory usage though. using inference_speech_editing.ipynb
so it was because the encodec_fn model was symlinked, so it didnt work because muh file permission
after some editing and removing linux specific commands in the code, i was able to run it, fully on windows, without wsl
just an idea, brownpill the hispanics about pajeets using the latam voice of goku (mario castañeda)
it takes about 1 minute for my 3060 to run using the giga830M model
https://voca.ro/1c2vVpJtkprL
default demo voice
included is the fix for the linux anti-windows error. go to "src\audiocraft\audiocraft\utils\cluster.py" and comment out these lines
follow the instruction in the given jupyter notebook (double click on the markdown cell to edit mode to read the text. i forgot to do reddit spacing).
the edited notebook. it worked on my machine:
https://files.catbox.moe/fahsys.ipynb
just replace the inference_speech_editing.ipynb with this
Are you using conda and etc?
also you need to run vscode in admin mode to load the model, for some reason. also i cant get the mfa aligner to create the csv for me
the command is
mfa align -j 1 --output_format csv demo/temp --clean english_us_arpa english_us_arpa demo/temp/mfa_alignments --beam 1000 --retry_beam 2000
yes im using conda
>it takes about 1 minute for my 3060 to run using the giga830M model
1. are you sure it's using your GPU and not cpu?
2. even with ram fallback that's slow
it used 100% of my cuda and a little bit of vram. ram use is not much, cpu use was ~60%. im using my 3060 on a potato setup with PCI gen 3 so it is slow. also the inference time varies a lot, changing the output text a bit and it only takes ~20s
also the mfa issue seems to be because of some permission thing, i moved it to a different drive and it seems to run but it still doesnt generate the csv :'(
3060 guy here, it uses a lot of gpu and cpu so i dont know what is wrong lol. it suddenly stopped working and i had to do a clean install
install instruction minimum for inference +
https://rentry.org/3rdkmdth
it seems like the generation length is why it takes so long? my previous gens were ~15 seconds long
What if someone torrents the shit you need auth to download?
why would anyone want to help autist morons like you? just make a burn account ffs
ok so the deal is that, this whole thing gotta be run in admin mode.
to generate the mfa csv file, you need to run an admin cmd, activate conda, run the command there instead of inside the jupyter to see the progress, else you would just be waiting with no progress bar.
to generate the csv for the demo, it takes 142s (default 1 worker - 1 voice - i guess it is faster if you do a batch of a few workers at once?).
also you need to download the models. i only found that out after searching for it in the code
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa
it is pretty slow for a 7 second long audio. but this only needs to be done once
>To clone an unseen voice or edit a recording, VoiceCraft needs only a few seconds of the voice.
I'm moronic and don't know how computers werk but want to jerk off to cartoon characters saying lewd things, should I beat my head against this until I manage to get it to work or is there a significantly easier path available?
waitTM
are the results from this better than using xtts2 + rvc?
Funny how all of these guys start popping up a month after coqui dies
its nice but nobody can do anything with this shit until SOMEONE MAKES A FRICKING C++ LIBRARY FOR FRICK SAKE
>WhisperSpeech
>Vits
>Metavoice
>OpenVoice
>StyleTTS
>Tortoise TTS
Not a single fricking one has a decent C++ library.
This is just a guess, but AI built on C++ is virtually impossible to scale to different systems without the most cancerous form of containerization.
no, look at llama.cpp, look at whisper.cpp, look at stable-diffusion.cpp
yeah but it took python bindings for people to adopt it and find faults or improvements.
>C++
>not C
I'll take either. Just a single implementation so I can use it in a game or something
if you're too lazy to write your own library for it yourself then you're too lazy to make a meaningful game in the first place anon
the truth hurts but it's something you can change
be the change you want to see in the world
One is vastly more difficult than the other moron
moronic logic.
they would both be the same difficulty if you weren't some glorified rpgmaker drag and drop frickchuckle that doesn't deserve to call yourself a game developer
actually I use scratch
Can we get some decent examples in this thread? I gotta be honest, the ones here aren't the ELEVENLABS KILLER?!?!? shit I was expecting
>To clone an unseen voice or edit a recording, VoiceCraft needs only a few seconds of the voice.
This is nothing new, but it is still nice.
Is it better than XTTS2+RVC?
This is a human voice of persona 3:
https://voca.ro/1iyFPj4eF84W
This is a cloned voice using XTTS2+RVC:
https://voca.ro/12q8WElDmO7Q
I haven't tested voicecraft, but i wonder if it's better than this.
https://voca.ro/17dXFZXLrnTS
this one took 7.4 seconds + the 200 seconds needed to build the mfa
https://voca.ro/1hHudQpOUUpm
when the sentence is completely different
sounds more soulful
Interesting, but the cloning is polluted by room noise/reverb.
It's not really "clean", i don't know if it's on purpose or not but it affects the quality, specially on the second one.
Overall, it seems to be a great TTS and has a lot of potential if you mix it with RVC. i'll wait for a GUI to test it because i hate conda notebooks.
>This is a human voice of persona 3:
>https://voca.ro/1iyFPj4eF84W
For comparison's sake, I did this on VALLE-X, using the clip you provided. it took 6.3 seconds to generate (3060ti)
https://voca.ro/1gyaCkwzXIei (Little Tom Miiverse post)
Though to be fair, I did first turn the audio into an npz (the format VALLE-X uses, only took 3 seconds to create) and I had to regen 3 times (the other two had random erratic pronunciations, the usual shit)
Honestly, so far I'm kinda not that sold on VoiceCraft, though if it's less prone to erratic glitches, that's pretty good at least.
the results sound better than xtts2, but i've yet to find any clear documentation on how to configure it for better results. everything i've done with it is with just default parameters from this repository which allows you to use it with rvc in a webui.
https://github.com/Vali-98/XTTS-RVC-UI
I think this would provide better results IF you could have more control than just feeding it an input value and an index value, but as it stands, the base xtts2 output on its own is inferior to voicecraft, but with rvc it's way better while being faster. the length of the prompt doesn't affect the time it takes to compute. you can't really control the expressiveness, but i don't think you can with voicecraft either. the only options i've seen that offer expression control are openvoice, where you can use emojis or w/e to tell it what the emotional state is, or bark with emojis as well, or *sad* *upset* etc.
https://github.com/myshell-ai/OpenVoice
https://github.com/suno-ai/bark
my current opinion is that bark is inferior in terms of tonal reproduction, but the expressiveness is higher. someone correct me if i'm wrong.
>the results sound better than xtts2
Does it really?
It feels as if voicecraft is trying to mask the robotic voice by giving the effect of a person using a shitty mic and speaking far from it.
I don't think there is a single crisp, high-quality voicecraft example.
>single crisp, high-quality
thats because the audio sample is only 16k
wow that's cool! let me git clone this open-source repo and run it on my machine!
T H E E X E
this is why me and my parents have a safe word. so in case of some crazy phone call, we can use that safe word to know if the other person is real or a bot
imagine having parents in 2024
https://voca.ro/1bu9LWfnfj8j
very interesting. first is the gen time being low (16 seconds for a 7 sec gen vs 1+ minute for a 15 second gen. it seems the gen time is related to how long the output is, not how long the input speech is). second is that the closer the new sentence is, the better (obviously). finally, once it hits the substituted words, the later parts, even if the same as the input, sound worse
>once it hits the substituted words, the later parts, even if the same as the input, sound worse
i think this is due to the later original parts having to be adapted to match the flow of the substitution
https://voca.ro/14X7eEsal92L
for this 21 seconds, it took 2m and 30s. "Well, I think we're doing very well. The new polls just came out." is from the original audio clip. the rest is generated.
seems to me the longer the original text is, the better quality the gens, because it uses that text only as input. the rest is ignored
It seems like we're not quite there yet, but this is a step up.
Is anyone working on voice synthesis? Not just TTS or RVC, I mean writing something like "female, 35 years old, Irish accent, [sassy:flirty:0.5]" and getting a novel voice you can then feed into the pipeline?
haven't seen anything like that, but that would be very cool
Thinking it over, a lot of the work in terms of collecting and cleaning audio has already been done in order to train individual voices for RVC. What would then need to happen is to collect all (or at least some reasonable fraction) of those voices, and recaption them not with character/actor names but with the qualities of the voice itself. Villain, hero/heroine, accent, age, etc. That in and of itself isn't trivial, but then after that I'm not sure how that dataset would turn into a general model. There's something there though.
Shit like VoiceCraft is heavily looked down upon by most AI researchers because it's too powerful. Now imagine if you could just write "desperate, 20 year old, female, crying" and get something as good... It would be the dream of scammers.
god i fricking resent how all of the cool shit is gimped by the existence of buttholes
https://voca.ro/1ebPXNRaZwMk
Guys I can only get so far as getting the audicraft notebook working. Is there a working notebook for voice craft?
I had more fun with speech in the couple weeks it was around than all the other boring coonetshit combined.
I'm glad voice synth is still getting attention. This is far better than the previous open source stuff. Elevenlabs still wins in quality and cloning, but this isn't too far behind. It has a certain low quality to it though that sounds like it was trained solely on C-SPAN callers.
https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf
>Gigaspeech training set (Chen et al., 2021a) is used as the training data, which contains 9k hours of audiobooks, podcasts, and YouTube videos at 16kHz audio sampling rate. Audio files that are shorter than 2 seconds are dropped.
>The training of the 830M VOICECRAFT model took about 2 weeks on 4 NVIDIA A40 GPUs.
that should be ~$650 on vast.ai. so someone training a new model with a better dataset for under $1k is possible
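Back-of-the-envelope on that figure, assuming roughly $0.48/hr per A40 on vast.ai (an assumed rate; spot prices fluctuate):

```python
gpus = 4                 # "4 NVIDIA A40 GPUs" per the paper
hours = 14 * 24          # "about 2 weeks"
rate = 0.48              # assumed $/GPU-hour; check current listings
cost = gpus * hours * rate
print(f"${cost:,.0f}")   # ballpark ~$645
```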
Nice
Gradio UI when? I don't want to have to make one from scratch and I hate poopiter notebooks.
Now: https://github.com/friendlyFriend4000/VoiceCraft
I don't know if this repo is yours, but I keep getting this:
Traceback (most recent call last):
File "<env_path>/lib/python3.9/site-packages/gradio/routes.py", line 534, in predict
output = await route_utils.call_process_api(
File "<env_path>/lib/python3.9/site-packages/gradio/route_utils.py", line 226, in call_process_api
output = await app.get_blocks().process_api(
(...)
File "<project_path>/data/tokenizer.py", line 140, in tokenize_audio
wav, sr = torchaudio.load(audio_path, frame_offset=offset, num_frames=num_frames)
File "<env_path>/lib/python3.9/site-packages/torch/_ops.py", line 502, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: Invalid argument: num_frames must be -1 or greater than 0.
it's mine. is your cut off at 0s by chance?
Yes. I did set the cutoff manually later, though, but I still cannot get the same results as the colab code
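The traceback above suggests a 0s cut-off maps to `num_frames == 0`, which `torchaudio.load` rejects. A defensive sketch (hypothetical helper, not from the repo) that converts a cut-off to a frame count torchaudio will accept:

```python
def frames_for_cutoff(cutoff_sec: float, sample_rate: int = 16000) -> int:
    """Translate a cut-off in seconds into a torchaudio-safe frame count.

    torchaudio.load requires num_frames to be -1 (read to end of file)
    or greater than 0, so a zero-length request falls back to -1
    instead of raising "num_frames must be -1 or greater than 0".
    """
    num_frames = int(cutoff_sec * sample_rate)
    return num_frames if num_frames > 0 else -1
```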
back in my day they made rollercoaster tycoon with fricking assembly
interesting issue
dafuq
its cells...tormented
at first i thought it's a bot with that numbername, but i have witnessed schizophrenics say things like that
VoiceCraft for ComfyUI, in progress.
https://github.com/kijai/ComfyUI-VoiceCraft
why the frick would you add it to an imagegen UI instead of a textgen one?
comfyui isn't an imagegen specific ui, it's a node based editor for any model.
cumrag is desperate for clout now that SAI is sinking
That was very easy. Downloaded the models automatically and everything. Unfortunately I do not have enough VRAM to generate with this. Might frick around with CPU.
did you have to install espeak manually? it says in the repository that it's required, but i'm not sure if that means i have to install it using the comfyui manager, or git clone it somewhere else and then point to it.
how to install espeak:
https://bootphon.github.io/phonemizer/install.html
thanks. do you also happen to know what node allows you to point to the library path?
I did, I just grabbed the msi and it worked.
If you're using the example json file / workflow from the repo, there should be a node that has the load library grey dot on it. Drag that dot out and it will give you the option to wire it up to a string primitive node.
I didn't do this though. I think the default autodetected it.
yeah i just found the primitive node. okay now it's just a matter of actually utilizing the feature set on here. does the audio tensor need to be pointed to as well?
there's a workflow set up in the demo folder
>non commerical license
>weights not released
So what's the point?
?
everything you wrote is false
Stupid little homosexual
Is there a way to mix voice models? In an attempt to come up with hardly recognizable voices I can use in games.
No one click exe installer, no download
it's that simple
The community will be limited to a handful of autistic morons with no artistic sensibility until such a thing is made, so it is doomed to wallow in obscurity and mediocrity until a heroic man of the people makes these things available to non-codegays.
this is correct. things need to be idiot proof, as in click a button and it just works. not because there are non-codegays but because 80% of the population is moronic and needs handholding
Once that happens, somebody will make a racist robocall in Biden's voice on election day to entire black neighborhoods. Then all hell breaks loose.
Oh no that's horrible
What if some chuds actually do this?
Hypothetically how would one support this chud?
weird how i already made this yesterday and today you are talking about biden and Black folk
https://voca.ro/1ifYes5VJ362
sad that biden was the only audio file i had on the disk. I don't really care about biden. presidents don't really have much agency anyway
>still barely any example gens posted
guys...
Because there's no exe.
This doesnt make a lick of sense
Someone post an exe that does everything for me
Okay, so it seems like you can use this thing to train voice models but what software do you actually use to implement the voice model?