What a privilege to be one of the last fully human beings.
I know that, in the foreseeable future, the artists formerly known as humans will be a tactile hybrid of flesh and chips.
Perhaps I shouldn’t have been surprised, then, when Microsoft researchers showed up to hasten the hopelessly inevitable future a little.
It all seemed so innocent and so scientific. The title of the researchers’ paper was creatively opaque: “Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers.”
What do you think this means? Is there a new, faster way for a machine to transcribe the words you say?
The researchers’ abstract starts off benignly enough. It uses many words, phrases and acronyms. It explains that the neural codec language model is called VALL-E.
This name will surely soften you up. What’s so scary about technology that sounds like a cute little robot from a heartwarming movie?
Well, perhaps this: “VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.”
I have often wanted to emerge learning capabilities. Instead, I have had to wait for them to emerge.
What emerges from the researchers’ last sentence is a chill. The big brains at Microsoft now need only 3 seconds of your speech to fake longer sentences, and perhaps whole speeches, that you never made but that sound like you.
I won’t go too far into the science, as neither of us would benefit.
I’ll merely mention that VALL-E uses an audio library assembled by Meta, one of the most admired and trusted companies in the world. It is called LibriLight, a repository of 7,000 speakers talking for a total of 60,000 hours.
Naturally, I wanted to hear VALL-E’s work.
I heard a man speak for 3 seconds. Then, for 8 seconds, I heard his VALL-E version say: “They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission.”
I defy you to notice much difference, if any.
It is true that many of the prompts sound like the worst pieces of eighteenth-century literature. An example: “Thus did this humane and right-minded father comfort his unhappy daughter, and her mother, embracing her again, did all she could to soothe her feelings.”
But what could I do other than listen to more of the examples the researchers present? Some VALL-E versions were more suspicious than others. The diction didn’t sound right. They felt spliced.
However, the overall effect is quite terrifying.
You have already been warned, of course. You know that when scammers call, you shouldn’t talk to them, in case they record you, recreate your speech, and use your abstracted voice to order expensive and unsavory products.
This, though, seems to be another level of sophistication. Maybe I’ve already watched too many episodes of Peacock’s The Capture, where deepfakes are presented as a normal part of government. Maybe Microsoft is such a nice and harmless company these days that I really don’t have to worry.
However, I am not comforted by the idea that someone, anyone, can easily be fooled into believing I said something I never said. Especially since the researchers say they can also reproduce the “emotion and acoustic environment” of the first 3 seconds of someone’s speech.
You’ll be relieved that the researchers have identified this potential for distress. They offer: “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker.”
The solution? Developing a detection system, say the researchers.
This might leave one or two people wondering, “Then why did you do this?”
Often in technology, the answer is: “Because we can.”
Rachel Maga is a technology journalist currently working at Globe Live Media agency. She has been in the Technology Journalism field for over 5 years now. Her life’s biggest milestone is the inside tour of Tesla Industries, which was gifted to her by the legend Elon Musk himself.