Is anyone aware of techniques or research on this?

  • Bassman1805@lemmy.world
    1 year ago

    A degree in audio engineering.

    It’s possible and there has been plenty of research done into it, but it’s a hard problem and not just a question of slapping an off-the-shelf EQ on it.

    • intensely_human@lemm.ee
      1 year ago

      Here you go.

      You record the vibration of all the little receiver hairs in your ear as you speak. At the same time, you record your voice with a microphone.

      You train a neural network to transform the sound file from the recording into the set of vibration states it corresponds with in your ear hairs.

      Then you just run the resulting model on the sound file, and magnetically vibrate the ear hairs in the way the model predicts.

      Boom! 1st person speaking voice experience.
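
      The "learn a transform from paired recordings" idea can be sketched with something much simpler than a neural network. The block below swaps in a normalized LMS adaptive FIR filter. It assumes a second, time-aligned target signal is available (a contact microphone is one guess; the comment above does not propose this), and `nlms_fit`, `fir`, the tap count, and the step size are all hypothetical choices for illustration.

```python
import random


def fir(signal, coeffs):
    """Apply an FIR filter (direct form) to a list of samples."""
    buf = [0.0] * len(coeffs)
    out = []
    for x in signal:
        buf = [x] + buf[:-1]  # newest sample first
        out.append(sum(c * b for c, b in zip(coeffs, buf)))
    return out


def nlms_fit(mic, target, taps=8, mu=0.5, eps=1e-8):
    """Fit FIR weights so that fir(mic, w) approximates target.

    Normalized LMS: after each sample, nudge the weights along the
    input buffer, scaled by the prediction error and the buffer energy.
    """
    w = [0.0] * taps
    buf = [0.0] * taps
    for x, d in zip(mic, target):
        buf = [x] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))
        err = d - y
        norm = sum(b * b for b in buf) + eps
        step = mu * err / norm
        w = [wi + step * bi for wi, bi in zip(w, buf)]
    return w


if __name__ == "__main__":
    # Synthetic check: the "target" is the mic signal passed through a
    # known (made-up) filter, so the fit should recover that filter.
    rng = random.Random(0)
    mic = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
    true_filter = [0.9, 0.3, -0.2]
    target = fir(mic, true_filter)
    print(nlms_fit(mic, target, taps=3))
```

      A linear filter can only capture a fixed frequency-response difference between the two signals, so this is a lower bound on what a trained model might do, not a substitute for one.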

  • j4k3@lemmy.world (OP)
    1 year ago

    To clarify, I am really curious about how speech comprehension is affected when a text-to-speech AI is trained to sound like the listener (in this case, me).

    Over the last couple of weeks I have been working on learning about and using offline AI. I don’t know why, but I decided to ask here first.

    If anyone is curious, this is what my current favorite LLM has to say (Llama 2 70B Q4 GGML):

    You

    what does it take to tune a recording of my voice and match it to how I hear my own voice?

    Assistant

    To tune your recorded voice so that it matches the way you perceive your own voice, there are several steps involved in audio processing. Here’s an overview of the process:

    • Recording: Firstly, ensure that you have a high-quality recording of your voice using proper equipment such as a microphone and soundproofing to minimize background noise. This will provide a good starting point for further editing.

    • Equalization (EQ): Use an equalizer to adjust the frequency balance in your recorded audio. You can cut specific frequencies that are too prominent or boost those that are lacking. For example, if you feel like your voice sounds nasal, you may want to reduce some of the mid-range frequencies around 1 kHz and above.

    • Compression: Apply compression to even out the dynamic range in your recording. This will help balance the loudness between different parts of your speech or singing, making it sound more consistent.

    • Reverb/Delay: Add a small amount of reverb or delay to create a sense of space and depth around your voice. This can make it feel like you’re hearing yourself in a room rather than through headphones or speakers.

    • Stereo Imaging: Use stereo imaging tools to adjust the width and positioning of your voice within the mix. This will help create an illusion that your voice is coming from a specific location, making it sound more natural.

    • Monitoring: Finally, listen back to your processed audio on different devices such as headphones or speakers in various environments (e.g., quiet room vs noisy cafe) to ensure that the changes you’ve made translate well across different listening scenarios. Make further adjustments if necessary until you are satisfied with how it sounds.
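
    The compression step in the list above can be made concrete with a minimal sketch. It assumes samples are floats in the range -1.0 to 1.0; the function name `compress` and the default threshold, ratio, and release values are illustrative assumptions, not settings taken from the thread.

```python
def compress(samples, threshold=0.5, ratio=4.0, release=0.999):
    """Feed-forward compressor sketch: instant attack, exponential release.

    Levels above `threshold` are scaled so that they grow only 1/ratio
    as fast (in log terms) as the input level does.
    """
    out = []
    env = 0.0
    for x in samples:
        # Envelope follower: jump up immediately, decay slowly.
        env = max(abs(x), env * release)
        if env > threshold:
            target = threshold * (env / threshold) ** (1.0 / ratio)
            gain = target / env
        else:
            gain = 1.0
        out.append(x * gain)
    return out


if __name__ == "__main__":
    quiet = compress([0.1] * 5)  # below threshold: passed through
    loud = compress([0.8] * 5)   # above threshold: turned down
    print(quiet[-1], loud[-1])
```

    Real compressors usually add a smoothed attack time and make-up gain; both are left out here to keep the gain computation visible.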

    Remember, everyone perceives their own voice differently due to bone conduction and other factors, so there may be some trial-and-error involved in finding a sound that feels right for you. It’s also important to keep in mind that the goal is not necessarily to make your recorded voice identical to how it sounds when speaking but rather to create an enjoyable listening experience for others who will hear it through speakers or headphones.

    You

    Is there any formulaic approximation or rule regarding bone conduction and how humans perceive their own voice?

    Assistant

    Bone conduction is the process by which sound vibrations are transmitted directly to the inner ear via the skull, bypassing the outer and middle ears. This occurs when we speak or sing, causing us to hear our voices differently than others do because of this direct transmission pathway.

    There isn’t a specific formulaic approximation or rule that can accurately predict how each individual perceives their own voice through bone conduction since it varies from person to person due to factors such as head shape and size, vocal tract resonance, and the acoustics of the surrounding environment. However, in general, people tend to hear their voices as deeper or richer when speaking because low-frequency sounds are more efficiently transmitted through bone conduction than high frequencies.

    It’s important to note that while understanding how we perceive our own voice can be helpful for tuning a recording of it, the goal should ultimately be to create an enjoyable listening experience for others who will hear your recorded voice through speakers or headphones rather than trying to perfectly replicate what you hear when speaking.
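
    The point about low frequencies travelling more efficiently through bone suggests a very rough first approximation: boost the low end of the recording with a low-shelf filter. The sketch below uses the widely published Audio EQ Cookbook biquad formulas. The 300 Hz corner and +6 dB gain are placeholder guesses to be tuned by ear, not measured bone-conduction values, and the function names are hypothetical.

```python
import cmath
import math


def low_shelf(fs, f0, gain_db, slope=1.0):
    """Return normalized biquad coefficients (b, a) for a low-shelf filter."""
    A = 10 ** (gain_db / 40)
    w0 = 2 * math.pi * f0 / fs
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / 2 * math.sqrt((A + 1 / A) * (1 / slope - 1) + 2)
    k = 2 * math.sqrt(A) * alpha
    b0 = A * ((A + 1) - (A - 1) * cos_w0 + k)
    b1 = 2 * A * ((A - 1) - (A + 1) * cos_w0)
    b2 = A * ((A + 1) - (A - 1) * cos_w0 - k)
    a0 = (A + 1) + (A - 1) * cos_w0 + k
    a1 = -2 * ((A - 1) + (A + 1) * cos_w0)
    a2 = (A + 1) + (A - 1) * cos_w0 - k
    return [b0 / a0, b1 / a0, b2 / a0], [1.0, a1 / a0, a2 / a0]


def biquad(samples, b, a):
    """Run samples through a biquad (direct form I)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def response_db(b, a, freq, fs):
    """Magnitude response of the biquad at `freq`, in dB."""
    z1 = cmath.exp(-2j * math.pi * freq / fs)  # z^-1 on the unit circle
    h = (b[0] + b[1] * z1 + b[2] * z1 * z1) / (a[0] + a[1] * z1 + a[2] * z1 * z1)
    return 20 * math.log10(abs(h))


if __name__ == "__main__":
    b, a = low_shelf(fs=44100, f0=300, gain_db=6.0)
    for f in (50, 300, 1000, 5000):
        print(f, round(response_db(b, a, f, 44100), 2))
```

    As the answer above notes, the real difference varies from person to person, so a single shelf like this is only a starting point for adjusting by ear.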

  • ArchTemperedKoala@lemmy.world
    1 year ago

    My guess is that there are some subjective components to it that simple changes can’t account for. Each person would have to make the adjustments themselves until the recording matches what they hear.