VoiceCraft

VoiceCraft - speech editing and zero-shot text-to-speech (TTS) on in-the-wild data including audiobooks, internet videos, and podcasts.

To clone an unseen voice or edit a recording, VoiceCraft needs only a few seconds of the voice.

Running locally on a 3080, it takes 8 seconds to generate 13 seconds of high-quality voice.

https://jasonppy.github.io/VoiceCraft_web/

  1. 2 months ago
    Anonymous

    It's nice, but how would one implement it in a nice GUI? I can only code with an LLM

    • 2 months ago
      Anonymous

      Look into gradio, you can see how similar projects have used it. But I'm sure we don't have to wait long until we get a webui.

      • 2 months ago
        Anonymous

        i got it working with jupyter but it's so inconvenient. i need a helpful autist to port it to gradio

        • 2 months ago
          Anonymous

          >you need to sign an agreement to download

          • 2 months ago
            Anonymous

            Hmm, it seems that you only need Gigaspeech if you're training a model yourself; it's not needed for inference, and SpeechColab appears unaffiliated with VoiceCraft's team. If you really want the audio dataset, just submit fake info like the madlad anon who used an .edu address to give us Llama, just as Prometheus gave humans fire.

          • 2 months ago
            Anonymous

            how new are you? that's how huggingface works

          • 2 months ago
            Anonymous

            I haven't used Huggingface before.
            But is there a way to do this anonymously?

          • 2 months ago
            Anonymous

            Literally just use a fake email

  2. 2 months ago
    Anonymous

    >Gigaspeech is downloaded through HuggingFace. Note that you need to sign an agreement in order to download the dataset (it needs your auth token)
    How do I get this without any authentication or signing any agreements?

    • 2 months ago
      Anonymous

      Anon if you get sued for abusing the weights there's no torrent exception.

    • 2 months ago
      Anonymous

      https://huggingface.co/pyp1/VoiceCraft/tree/main

    • 2 months ago
      Anonymous

      sorry I misread, yeah no way around that.

    • 2 months ago
      Anonymous

      can this be downloaded in such a way that I don't need to create any kind of account anywhere, so nobody ever knows it was me?

      (except for the CIA, who knows my ISP, but I have no trouble with them so it's ok)

      • 2 months ago
        Anonymous

        you don't need an account to download these models. When you start the gay jupyter shit it downloads them itself

        • 2 months ago
          Anonymous

          so please give me instructions on how to do the whole thing? in 10 steps?
          what do I need?
          in addition to a Linux laptop

          • 2 months ago
            Anonymous

            on windows, install WSL and then conda in WSL if you don't have it already. git clone the repo, then follow the github environment setup. then start jupyter and change os.environ["CUDA_VISIBLE_DEVICES"]="0"

            and add
            os.system(f"mfa model download dictionary english_us_arpa")
            os.system(f"mfa model download acoustic english_us_arpa")
            by the end of cell 3 the first time you run it, then delete it

            or someone just opened a pull request for some docker shit, i haven't looked into it
            https://github.com/jasonppy/VoiceCraft/pull/25
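            The notebook tweaks above can be sketched as a plain script (a minimal sketch; it assumes `mfa` is already on PATH inside the conda env, and the `download_mfa_models`/`dry_run` helper is mine, not part of VoiceCraft):

```python
import os
import subprocess

# Pin inference to the first GPU, as suggested for the notebook.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# One-time MFA model downloads -- the same two commands quoted above.
MFA_DOWNLOADS = [
    ["mfa", "model", "download", "dictionary", "english_us_arpa"],
    ["mfa", "model", "download", "acoustic", "english_us_arpa"],
]

def download_mfa_models(dry_run=True):
    """Run (or, with dry_run, just return) the MFA model-download commands."""
    if dry_run:
        return [" ".join(cmd) for cmd in MFA_DOWNLOADS]
    for cmd in MFA_DOWNLOADS:
        subprocess.run(cmd, check=True)
    return []

print(download_mfa_models())
```

            Running it with dry_run=False performs the actual downloads; once the models are cached, the two os.system lines can be deleted from the notebook, as the post says.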

          • 2 months ago
            Anonymous

            Are you dyslexic?

          • 2 months ago
            Anonymous

            No, I’m not going to do that.

  3. 2 months ago
    Anonymous

    where can I download the exe?

  4. 2 months ago
    Anonymous

    Bumping because this is cool, but I should have just done it on WSL instead of trying to make it work on Windows. I just wasted a couple of hours being a moron

    • 2 months ago
      Anonymous

      went for WSL, got this moronic error picrel

      • 2 months ago
        Anonymous

        >it was conda update that fricked it all
        I hate python

        • 2 months ago
          Anonymous

          your fault for not properly using environments

  5. 2 months ago
    Anonymous

    >HuggingFace Spaces demo coming
    looking forward to that special thread. if it captures emma watson voice nicely, she will read good night stories for me every night

    • 2 months ago
      Anonymous

      For me it's Aurora Aksnes, her voice is so cute :3

  6. 2 months ago
    Anonymous

    needs xformers, are AMD users out of luck?

    • 2 months ago
      Anonymous

      >are AMD users out of luck?
      In ML contexts: yes.

    • 2 months ago
      Anonymous

      Werks on my end.

      • 2 months ago
        Anonymous

        Did you replace all the dependencies etc. by hand, or does this work out of the box with environments set up?
        The docker solution got mad because I have no nvidia interfaces for it.

    • 2 months ago
      Anonymous

      it works on CPU anyway

      • 2 months ago
        Anonymous

        How slow is it?

      • 2 months ago
        Anonymous

        Whenever somebody says "it works on CPU" in ML, it works the same way as it worked for this guy

        • 2 months ago
          Anonymous

          on CPU (7800x3d) it's faster than what the 3060 guy has

    • 2 months ago
      Anonymous

      https://github.com/ROCm/xformers

  7. 2 months ago
    Anonymous

    How many petabytes of RAM one's PC needs to run this? I have been using Microsoft TTS free API to read shit for me, but I think sending them all that bunch of data for free might not be good.

    • 2 months ago
      Anonymous

      2.4e-5
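      Taking the joke literally, 2.4e-5 petabytes works out to an ordinary desktop amount of RAM:

```python
# 2.4e-5 PB expressed in more familiar units (decimal prefixes).
petabytes = 2.4e-5
gigabytes = petabytes * 1e15 / 1e9  # PB -> bytes -> GB

print(round(gigabytes))  # -> 24, i.e. a normal consumer RAM kit
```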

  8. 2 months ago
    Anonymous

    /vsg/ bros... we're so back

    • 2 months ago
      Anonymous

      https://vocaroo.com/1cuuHDbdemww

    • 2 months ago
      Anonymous

      come to think of it, what happened to the /vsg/ threads?

      jesus frick can you frick off back to le reddit with your stupid fricking halfwit questions? you have absolutely no idea what you're even talking about, read a fricking book you low IQ Black person tourist

      • 2 months ago
        Anonymous

        looks like you hate anonymous, prolly a microsoft shill

      • 2 months ago
        Anonymous

        >happened to the /vsg/ threads?
        There were zero happenings, aside from 11 labs, so everyone dropped it. Literal dead general.
        Now there's this, and suno.ai doing music-via-prompting-as-a-service. Shit's wild.
        So maybe VSG is finally back. I'm excited, eagerly awaiting the day one-shot TTS can do sillytavern rp, work with LLMs, and get it right. The capability is there now, but piss-poor quality.

  9. 2 months ago
    Anonymous

    gui when?

  10. 2 months ago
    Anonymous

    frick you eleven labs

  11. 2 months ago
    Anonymous

    >Puyuan Peng, Po-Yao Huang, Daniel Li
    in light of recent events please audit the code VERY CLOSELY before running the software

    • 2 months ago
      Anonymous

      Did chinese people frick with some other code recently or something? What recent events?

      • 2 months ago
        Anonymous

        No, nothing to worry about gweilo

      • 2 months ago
        Anonymous

        Some chinese guy by the name of Jia Tan pushed a really well obfuscated backdoor into a compression library used by some 50+% of linux packages and system binaries. This is probably the most significant CVE ever. The backdoor has lived in the git repo for several months and was even updated without anyone noticing. Evidence shows that he has been collaborating with other Chinese to push similar updates to other open source repos, including the Linux kernel source itself.

        • 2 months ago
          Andres Freud

          this only affected bleeding edge Black folk. no one sane is affected. and this is off-topic

          • 2 months ago
            Anonymous

            >this backdoor being used in system libraries for the past several months and went unnoticed until now only because some autist saw his ssh logins were taking 500ms longer than before isn't really a big deal! nothing wrong here! also not on topic!
            Are you Chinese? Are you the developer of VoiceCraft? Regardless, you can go ahead and play with your AI toys, I'm just answering the other anon's question which was prompted by a warning to be careful since open source != perfectly safe.

  12. 2 months ago
    Anonymous

    Welcome to the promised land of local, /vsg/. Try not to burn out your GPUs too fast while you're here.

    • 2 months ago
      Anonymous

      >/sdg/ not a vertical cliff

  13. 2 months ago
    Anonymous

    https://vocaroo.com/12vTi6URKLNU

    • 2 months ago
      Anonymous

      Eleven libs seething

    • 2 months ago
      Anonymous

      Not bad! Use a better prompt like a copy pasta.

  14. 2 months ago
    Anonymous

    Oh shit, is voice craft actually the real deal?

  15. 2 months ago
    Anonymous

    >high quality
    >16kHz
    Into the trash it goes

  16. 2 months ago
    Anonymous

    i will never install conda
    i will never install linux
    i will never use windows subsystem for linux
    i will run this on windows

    • 2 months ago
      Anonymous

      bro I'll just get it running on windo-
      >InterpolationResolutionError: KeyError raised while resolving interpolation: "Environment variable 'USER' not found
      okay, I'll just set it mysel-
      >AttributeError: module 'os' has no attribute 'uname'
      what? why does that function not exist on windows? at least I can just replace os with platform and it'll wor-
      >AttributeError: 'uname_result' object has no attribute 'sysname'
      WSL time
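      The os.uname chain above fails because os.uname only exists on POSIX; a portable guard (a sketch, not the project's actual code) looks like:

```python
import os
import platform

def get_sysname():
    """Return the OS name portably.

    os.uname() only exists on POSIX, and platform.uname() returns a
    uname_result whose field is named 'system', not 'sysname' -- which is
    exactly the AttributeError chain quoted above.
    """
    if hasattr(os, "uname"):
        return os.uname().sysname
    return platform.uname().system  # e.g. 'Windows'

print(get_sysname())
```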

      • 2 months ago
        Anonymous

        ML Python "people" doing everything in their power to make sure their code isn't portable outside of their own specific machine. Every fricking time.

  17. 2 months ago
    Anonymous

    Is it English only?

  18. 2 months ago
    Anonymous

    >no web ui or one click installer
    I'm honestly too lazy to give a frick

  19. 2 months ago
    Anonymous

    Where's a proper webui version?

  20. 2 months ago
    Anonymous

    didn't openai also just release another one? I just want the best dagoth ur, offline, runtime doesn't matter. Which of the plethora of xtts, bark, coqui, tortoise and whatevers is the best? Judging from the huggingface arena I'll go with xtts

  21. 2 months ago
    Anonymous

    How much VRAM is the minimum?

    • 2 months ago
      Anonymous

      >enter poorgaygus maximus
      it's release day 1 homie, go harass /lmg/ a little and come back in a few weeks

    • 2 months ago
      Anonymous

      I don't know
      the repeat 3 batch 4 script I snagged from lmg took 16gb of VRAM + 31gb of shared ram

    • 2 months ago
      Anonymous

      it's a shitty transformers-based TTS, so for 15 seconds of audio you need up to 24 GB of VRAM, so it's not worth the time wasted installing it.

  22. 2 months ago
    Anonymous

    voicecraft vs gpt sovits?

  23. 2 months ago
    Anonymous

    https://vocaroo.com/1bRF3QW0bX2v

  24. 2 months ago
    Anonymous

    /g/entlemen, im trying to do the needful on windows. i got everything running, and then this line. it got stuck forever. espeak can be run from the command line from PATH. apparently triton is not an issue here. there is some memory usage though. using inference_speech_editing.ipynb

    • 2 months ago
      Anonymous

      so it was because the encodec_fn model was symlinked, so it didn't work because muh file permissions
      after some editing and removing linux-specific commands in the code, i was able to run it fully on windows, without wsl

  25. 2 months ago
    Anonymous

    just an idea: brownpill the hispanics about pajeets using the latam voice of goku (Mario Castañeda)

  26. 2 months ago
    Anonymous

    it takes about 1 minute for my 3060 to run using the giga830M model

    https://voca.ro/1c2vVpJtkprL
    default demo voice

    included is the fix for the linux anti-windows error. go to "src\audiocraft\audiocraft\utils\cluster.py" and comment out these lines

    follow the instructions in the given jupyter notebook (double-click on the markdown cell into edit mode to read the text. i forgot to do reddit spacing).

    the edited notebook. it worked on my machine:
    https://files.catbox.moe/fahsys.ipynb
    just replace the inference_speech_editing.ipynb with this

    • 2 months ago
      Anonymous

      Are you using conda and etc?

    • 2 months ago
      Anonymous

      also you need to run vscode in admin mode to load the model, for some reason. also i can't get the mfa aligner to create the csv for me

      the command is
      mfa align -j 1 --output_format csv demo/temp --clean english_us_arpa english_us_arpa demo/temp/mfa_alignments --beam 1000 --retry_beam 2000

      yes, I'm using conda
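      The align command is easier to tweak if you build it programmatically (a sketch; the flags are exactly the ones quoted above, the helper name is made up):

```python
import subprocess

def build_mfa_align_cmd(corpus_dir="demo/temp",
                        out_dir="demo/temp/mfa_alignments",
                        beam=1000, retry_beam=2000, jobs=1):
    """Build the `mfa align` command quoted above.

    Returned as a list so subprocess can run it without shell-quoting issues.
    """
    return [
        "mfa", "align",
        "-j", str(jobs),
        "--output_format", "csv",
        corpus_dir, "--clean",
        "english_us_arpa", "english_us_arpa",
        out_dir,
        "--beam", str(beam),
        "--retry_beam", str(retry_beam),
    ]

cmd = build_mfa_align_cmd()
print(" ".join(cmd))
# To actually run it (needs MFA installed, ideally from an elevated shell
# per the thread): subprocess.run(cmd, check=True)
```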

    • 2 months ago
      Anonymous

      >it takes about 1 minutes for my 3060 to run using the giga830M model
      1. are you sure it's using your GPU and not cpu?
      2. even with ram fallback that's slow

      • 2 months ago
        Anonymous

        it used 100% of my cuda and a little bit of vram. ram use is not much, cpu use was ~60%. i'm using my 3060 in a potato setup with PCIe gen 3 so it is slow. also the inference time varies a lot; change the output text a bit and it only takes ~20s

        also the mfa issue seems to be because of some permission; i moved it to a different drive and it seems to run, but it still doesn't generate the csv :'(

    • 2 months ago
      Anonymous

      3060 guy here, it uses a lot of gpu and cpu so i don't know what is wrong lol. it suddenly stopped working and i had to do a clean install

      minimum install instructions for inference:

      https://rentry.org/3rdkmdth

      • 2 months ago
        Anonymous

        it seems like the generation length is why it takes so long? my previous gens were ~15 seconds long

  27. 2 months ago
    Anonymous

    What if someone torrents the shit you need auth to download?

    • 2 months ago
      Anonymous

      why would anyone want to help autist morons like you? just make a burner account ffs

  28. 2 months ago
    Anonymous

    ok so the deal is that this whole thing has gotta be run in admin mode.
    to generate the mfa csv file, you need to run an admin cmd, activate conda, and run the command there instead of inside jupyter to see the progress, else you would just be waiting with no progress bar.

    generating the csv for the demo takes 142s (default 1 worker - 1 voice - i guess it is faster if you do a batch with a few workers at once?).

    also you need to download the models. i only found that out after searching for it in the code

    mfa model download dictionary english_us_arpa

    mfa model download acoustic english_us_arpa

    it is pretty slow for a 7-second-long audio, but this only needs to be done once

  29. 2 months ago
    Anonymous

    >To clone an unseen voice or edit a recording, VoiceCraft needs only a few seconds of the voice.
    I'm moronic and don't know how computers werk, but I want to jerk off to cartoon characters saying lewd things. should I beat my head against this until I manage to get it to work, or is there a significantly easier path available?

    • 2 months ago
      Anonymous

      waitTM

  30. 2 months ago
    Anonymous

    are the results from this better than using xtts2 + rvc?

  31. 2 months ago
    Anonymous

    Funny how all of these guys start popping up a month after coqui dies

  32. 2 months ago
    Anonymous

    it's nice but nobody can do anything with this shit until SOMEONE MAKES A FRICKING C++ LIBRARY FOR FRICK'S SAKE

    >WhisperSpeech
    >Vits
    >Metavoice
    >OpenVoice
    >StyleTTS
    >Tortoise TTS

    Not a single fricking one has a decent C++ library.

    • 2 months ago
      Anonymous

      This is just a guess, but AI built on C++ is virtually impossible to scale to different systems without the most cancerous form of containerization.

      • 2 months ago
        Anonymous

          no, look at llama.cpp, look at whisper.cpp, look at stable-diffusion.cpp

        • 2 months ago
          Anonymous

          yeah but it took python bindings for people to adopt it and find faults or improvements.

    • 2 months ago
      Anonymous

      >C++
      >not C

      • 2 months ago
        Anonymous

        I'll take either. Just a single implementation so I can use it in a game or something

        • 2 months ago
          Anonymous

          if you're too lazy to write your own library for it yourself then you're too lazy to make a meaningful game in the first place anon
          the truth hurts but it's something you can change
          be the change you want to see in the world

          • 2 months ago
            Anonymous

            One is vastly more difficult than the other moron
            moronic logic.

          • 2 months ago
            Anonymous

            they would both be the same difficulty if you weren't some glorified rpgmaker drag and drop frickchuckle that doesn't deserve to call yourself a game developer

          • 2 months ago
            Anonymous

            actually I use scratch

  33. 2 months ago
    Anonymous

    Can we get some decent examples in this thread? I gotta be honest, the ones here aren't the ELEVENLABS KILLER?!?!? shit I was expecting

  34. 2 months ago
    Anonymous

    >To clone an unseen voice or edit a recording, VoiceCraft needs only a few seconds of the voice.
    This is nothing new, but it is still nice.
    Is it better than XTTS2+RVC?

    This is a human voice of persona 3:
    https://voca.ro/1iyFPj4eF84W
    This is a cloned voice using XTTS2+RVC:
    https://voca.ro/12q8WElDmO7Q

    I haven't tested voicecraft, but i wonder if it's better than this.

    • 2 months ago
      Anonymous

      https://voca.ro/17dXFZXLrnTS
      this one took 7.4 seconds + the 200 seconds needed to build the mfa alignment

      • 2 months ago
        Anonymous

        https://voca.ro/1hHudQpOUUpm
        when the sentence is completely different

      • 2 months ago
        Anonymous

        sounds more soulful

      • 2 months ago
        Anonymous

        Interesting, but the cloning is polluted by room noise/reverb.
        It's not really "clean"; i don't know if it's on purpose or not, but it affects the quality, especially on the second one.
        Overall it seems to be a great TTS with a lot of potential if you mix it with RVC. i'll wait for a GUI to test it because i hate conda notebooks.

    • 2 months ago
      Anonymous

      >This is a human voice of persona 3:
      >https://voca.ro/1iyFPj4eF84W

      For comparison's sake, I did this on VALLE-X using the clip you provided; it took 6.3 seconds to generate (3060ti)
      https://voca.ro/1gyaCkwzXIei (Little Tom Miiverse post)

      Though to be fair, I did first turn the audio into an npz (the format VALLE-X uses, only took 3 seconds to create) and I had to regen 3 times (the other two had random erratic pronunciations, the usual shit)

      Honestly, so far I'm kinda not that sold on VoiceCraft, though if it's less prone to erratic glitches, that's pretty good at least.

    • 2 months ago
      Anonymous

      the results sound better than xtts2, but i've yet to find any clear documentation on how to configure it for better results. everything i've done with it uses just the default parameters from this repository, which lets you use it with rvc in a webui.
      https://github.com/Vali-98/XTTS-RVC-UI

      I think this would provide better results IF you could have more control than just feeding it an input value and an index value, but as it stands, the base xtts2 output on its own is inferior to voicecraft, while with rvc it's way better and faster. the length of the prompt doesn't affect the time it takes to compute. you can't really control the expressiveness, but i don't think you can with voicecraft either. the only options i've seen offer expression control are openvoice, where you can use emojis or w/e to tell it what the emotional state is, or bark with emojis as well, or *sad* *upset* etc.
      https://github.com/myshell-ai/OpenVoice
      https://github.com/suno-ai/bark

      my current opinion is that bark is inferior in terms of tonal reproduction, but the expressiveness is higher. someone correct me if i'm wrong.

      • 2 months ago
        Anonymous

        >the results sound better than xtts2
        Does it really?
        It feels as if voicecraft is trying to mask the robotic voice by giving the effect of a person using a shitty mic and speaking far from it.
        I don't think there is a single crisp, high-quality voicecraft example.

        • 2 months ago
          Anonymous

          >single crisp, high-quality
          that's because the audio sample rate is only 16 kHz
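          For the record, the "crispness" ceiling follows directly from the sample rate: the highest representable frequency is half of it (Nyquist). A quick check:

```python
def nyquist_khz(sample_rate_hz):
    """Highest frequency (kHz) a given sample rate can represent."""
    return sample_rate_hz / 2 / 1000

# VoiceCraft / Gigaspeech audio vs. CD audio
print(nyquist_khz(16_000))   # 8.0 kHz -- everything above is simply gone
print(nyquist_khz(44_100))   # 22.05 kHz -- the full audible range
```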

  35. 2 months ago
    Anonymous

    wow that's cool! let me git clone this open-source repo and run it on my machine!

    • 2 months ago
      Anonymous

      T H E E X E

  36. 2 months ago
    Anonymous

    this is why me and my parents have a safe word. so in case of some crazy phone call, we can use that safe word to know if the other person is real or a bot

    • 2 months ago
      Anonymous

      imagine having parents in 2024

  37. 2 months ago
    Anonymous

    https://voca.ro/1bu9LWfnfj8j

    very interesting. first, the gen time is low (16 seconds for a 7-second gen vs 1+ minute for a 15-second gen; it seems the gen time is related to how long the output is, not how long the input speech is). second, the closer the new sentence is to the original, the better (obviously). finally, once it hits the substituted words, the later parts, even if the same as the input, sound worse

    • 2 months ago
      Anonymous

      >once it hit the substituted words, the later, even if the same as the input, would sound worse
      i think this is due to the later original parts having to be adapted to match the flow of the substitution

      https://voca.ro/14X7eEsal92L
      for this 21 seconds, it took 2m and 30s. "Well, I think we're doing very well. The new polls just came out." is from the original audio clip. the rest is generated.

      it seems to me the longer the original text is, the better quality the gens, because it uses only that text as input; the rest is ignored

  38. 2 months ago
    Anonymous

    It seems like we're not quite there yet, but this is a step up.

  39. 2 months ago
    Anonymous

    Is anyone working on voice synthesis? Not just TTS or RVC, I mean writing something like "female, 35 years old, Irish accent, [sassy:flirty:0.5]" and getting a novel voice you can then feed into the pipeline?

    • 2 months ago
      Anonymous

      haven't seen anything like that, but that would be very cool

      • 2 months ago
        Anonymous

        Thinking it over, a lot of the work in terms of collecting and cleaning audio has already been done in order to train individual voices for RVC. What would then need to happen is to collect all (or at least some reasonable fraction) of those voices, and recaption them not with character/actor names but with the qualities of the voice itself. Villain, hero/heroine, accent, age, etc. That in and of itself isn't trivial, but then after that I'm not sure how that dataset would turn into a general model. There's something there though.

    • 2 months ago
      Anonymous

      Shit like VoiceCraft is heavily looked down upon by most AI researchers because it's too powerful. Now imagine if you could just write "desperate, 20 year old, female, crying" and get something as good... It would be the dream of scammers.

      • 2 months ago
        Anonymous

        god i fricking resent how all of the cool shit is gimped by the existence of buttholes

  40. 2 months ago
    Anonymous

    https://voca.ro/1ebPXNRaZwMk

    Guys, I can only get as far as getting the audiocraft notebook working. Is there a working notebook for voicecraft?

  41. 2 months ago
    Anonymous

    I had more fun with speech in the couple weeks it was around than all the other boring coonetshit combined.

  42. 2 months ago
    Anonymous

    I'm glad voice synth is still getting attention. This is far better than the previous open-source stuff. Elevenlabs still wins in quality and cloning, but this isn't too far behind. It has a certain low quality to it, though, that sounds like it was trained solely on C-SPAN callers.

    • 2 months ago
      Anonymous

      https://jasonppy.github.io/assets/pdfs/VoiceCraft.pdf
      >Gigaspeech training set (Chen et al., 2021a) is used as the training data, which contains 9k hours of audiobooks, podcasts, and YouTube videos at 16kHz audio sampling rate. Audio files that shorter than 2 seconds are dropped.
      >The training of the 830M VOICECRAFT model took about 2 weeks on 4 NVIDIA A40 GPUs.
      that should be ~$650 on vast.ai. so someone training a new model with a better dataset for under $1k is possible
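      The estimate is plain GPU-hour arithmetic (the ~$0.50/hr A40 rate is an assumption consistent with the ~$650 figure; the 2 weeks x 4 A40s is from the paper quote above):

```python
gpus = 4
days = 14             # "about 2 weeks" per the paper
hourly_rate = 0.50    # assumed $/hr per rented A40 on a service like vast.ai

gpu_hours = gpus * days * 24
cost = gpu_hours * hourly_rate

print(gpu_hours)  # 1344 GPU-hours
print(cost)       # 672.0 -- about the $650 quoted, comfortably under $1k
```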

      • 2 months ago
        Anonymous

        Nice

  43. 2 months ago
    Anonymous

    Gradio UI when? I don't want to have to make one from scratch and I hate poopiter notebooks.

    • 2 months ago
      Anonymous

      Now: https://github.com/friendlyFriend4000/VoiceCraft

      • 2 months ago
        Anonymous

        I don't know if this repo is yours, but I keep getting this:

        Traceback (most recent call last):
        File "<env_path>/lib/python3.9/site-packages/gradio/routes.py", line 534, in predict
        output = await route_utils.call_process_api(
        File "<env_path>/lib/python3.9/site-packages/gradio/route_utils.py", line 226, in call_process_api
        output = await app.get_blocks().process_api(
        (...)
        File "<project_path>/data/tokenizer.py", line 140, in tokenize_audio
        wav, sr = torchaudio.load(audio_path, frame_offset=offset, num_frames=num_frames)
        File "<env_path>/lib/python3.9/site-packages/torch/_ops.py", line 502, in __call__
        return self._op(*args, **kwargs or {})
        RuntimeError: Invalid argument: num_frames must be -1 or greater than 0.
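        That RuntimeError is what torchaudio.load raises when a 0-second cutoff gets converted into num_frames == 0; a defensive conversion (a sketch -- the helper is hypothetical, not the repo's actual tokenize_audio logic) would be:

```python
def seconds_to_load_args(cut_off_sec, sample_rate, offset_sec=0.0):
    """Convert a cutoff in seconds to torchaudio.load's frame arguments.

    torchaudio.load(path, frame_offset=..., num_frames=...) rejects
    num_frames == 0, which is exactly what a 0s cutoff produces -- the
    error in the traceback above.
    """
    frame_offset = int(offset_sec * sample_rate)
    num_frames = int((cut_off_sec - offset_sec) * sample_rate)
    if num_frames <= 0:
        num_frames = -1  # -1 means "read to the end of the file"
    return frame_offset, num_frames

print(seconds_to_load_args(0.0, 16_000))   # (0, -1): fall back to whole file
print(seconds_to_load_args(3.5, 16_000))   # (0, 56000)
```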

        • 2 months ago
          Anonymous

          it's mine. is your cutoff at 0s by chance?

          • 2 months ago
            Anonymous

            Yes. I did set the cutoff manually later, though, but I still cannot get the same results as the colab code

  44. 2 months ago
    Anonymous

    back in my day they made rollercoaster tycoon with fricking assembly

  45. 2 months ago
    Anonymous

    interesting issue

    • 2 months ago
      Anonymous

      dafuq

    • 2 months ago
      Anonymous

      its cells...tormented

      • 2 months ago
        Anonymous

        at first i thought it's a bot with that numbername, but i have witnessed schizophrenics say things like that

  46. 2 months ago
    Anonymous

    VoiceCraft for ComfyUI, in progress.
    https://github.com/kijai/ComfyUI-VoiceCraft

    • 2 months ago
      Anonymous

      why the frick would you add it to an imagegen UI instead of a textgen one?

      • 2 months ago
        Anonymous

        comfyui isn't an imagegen-specific ui, it's a node-based editor for any model.

      • 2 months ago
        Anonymous

        cumrag is desperate for clout now that SAI is sinking

    • 2 months ago
      Anonymous

      That was very easy. Downloaded the models automatically and everything. Unfortunately I do not have enough VRAM to generate with this. Might frick around with CPU.

      • 2 months ago
        Anonymous

        did you have to install espeak manually? it says in the repository that it's required, but i'm not sure if that means i have to install it using the comfyui manager, or git clone it somewhere else and then point to it.

        • 2 months ago
          Anonymous

          how to install espeak:
          https://bootphon.github.io/phonemizer/install.html

          • 2 months ago
            Anonymous

            thanks. do you also happen to know what node allows you to point to the library path?

        • 2 months ago
          Anonymous

          I did, I just grabbed the msi and it worked.

          thanks. do you also happen to know what node allows you to point to the library path?

          If you're using the example json file / workflow from the repo, there should be a node that has the load-library grey dot on it. Drag that dot out and it will give you the option to wire it up to a string primitive node.
          I didn't do this though. I think the default autodetected it.

          • 2 months ago
            Anonymous

            yeah i just found the primitive node. okay now it's just a matter of actually utilizing the feature set on here. does the audio tensor need to be pointed to as well?

          • 2 months ago
            Anonymous

            there's a workflow set up in the demo folder

  47. 2 months ago
    Anonymous

    >non-commercial license
    >weights not released

    So what's the point?

    • 2 months ago
      Anonymous

      ?
      everything you wrote is false

      • 2 months ago
        Anonymous

        Stupid little homosexual

    • 2 months ago
      Anonymous

      Stupid little homosexual

  48. 2 months ago
    Anonymous

    Is there a way to mix voice models? In an attempt to come up with hardly recognizable voices I can use in games.

  49. 2 months ago
    Anonymous

    No one click exe installer, no download
    it's that simple

    • 2 months ago
      Anonymous

      The community will be limited to a handful of autistic morons with no artistic sensibility until such a thing is made, so it is doomed to wallow in obscurity and mediocrity until a heroic man of the people makes these things available to non-codegays.

      • 2 months ago
        Anonymous

        this is correct. things need to be idiot-proof, as in click a button and it just works. not because there are non-codegays, but because 80% of the population is moronic and needs handholding

    • 2 months ago
      Anonymous

      Once that happens, somebody will make a racist robocall in Biden's voice on election day to entire black neighborhoods. Then all hell breaks loose.

      • 2 months ago
        Anonymous

        Oh no that's horrible
        What if some chud actually did this?
        Hypothetically, how would one support this chud?

        • 2 months ago
          Anonymous

          weird how i already made this yesterday and today you are talking about biden and Black folk
          https://voca.ro/1ifYes5VJ362

          sad that biden was the only audio file i had on the disk. I don't really care about biden. presidents don't really have much agency anyway

  50. 2 months ago
    Anonymous

    >still barely any example gens posted

    guys...

    • 2 months ago
      Anonymous

      Because there's no exe.

  51. 2 months ago
    Anonymous

    This doesn't make a lick of sense
    Someone post an exe that does everything for me

  52. 2 months ago
    Anonymous

    Okay, so it seems like you can use this thing to train voice models but what software do you actually use to implement the voice model?
