Speech To Text To Speech

DP
What happens if you then re-synthesise from the text? With Sphinx, I don't know, with this algorithm?


JR
Mmm.. what would happen? That's a big question.


DP
Because, I mean, that direction is not univocal. It doesn't have a single solution, just one possibility.


JR
I think in the other direction it would have a definite mapping. I think so. Because, you know, you are going to have a translation of those phonemes in the end, into whatever speech synthesis algorithm you use. Clear parameters that any phoneme maps to. So I think you would get much more consistent output than you would [in the opposite direction].
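JR's point, that once you have phonemes each one maps to definite synthesis parameters, can be illustrated with a toy lookup. Everything in the sketch below is hypothetical: the phoneme inventory, the formant values, and the one-word lexicon are made up for illustration; a real system would use a pronunciation dictionary such as CMUdict and an actual synthesiser.

```python
# Toy illustration: text -> phonemes -> synthesis parameters is a
# deterministic mapping, unlike the recognition direction.
# The phoneme set and (F1, F2) formant centres below are invented
# rough values, not taken from Sphinx or any real synthesiser.

PHONEMES = {
    "HH": (300, 1600),  # hypothetical formant pair for 'h'
    "AH": (700, 1200),  # hypothetical formant pair for schwa-like vowel
    "L":  (360, 1300),  # hypothetical formant pair for 'l'
    "OW": (450, 800),   # hypothetical formant pair for 'ow'
}

def text_to_phonemes(word):
    """Hypothetical one-word lexicon lookup; a real system would
    consult a pronunciation dictionary such as CMUdict."""
    lexicon = {"hello": ["HH", "AH", "L", "OW"]}
    return lexicon[word]

def phonemes_to_params(phonemes):
    """Each phoneme maps to exactly one parameter tuple."""
    return [PHONEMES[p] for p in phonemes]

params = phonemes_to_params(text_to_phonemes("hello"))
```

Run it twice and the same text yields the same parameters every time, which is the "definite mapping" of the synthesis direction; recognition, by contrast, is one-to-many.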


DP
And what if you speak a phrase, and it is recognised, outputting a text. From that text you re-synthesise the speech, and then feed it back in again, forming a loop. Would this never change? Would it never deviate from the original?


JR 
You mean synthesise and then have it listen to itself. It becomes kind of like 'I Am Sitting in a Room'.


HHR 
Somebody did that, by the way, I think.


DP
I would, of course.


JR
Yeah, maybe we can find out what happens.


DP
It's interesting because then there's a kind of space in the algorithm itself.


HHR
Maybe "gh gh gh" at some point settles on some words..


DP
Yeah, or it is so good that it doesn't.


JR
Yeah maybe it's perfectly lossless.


DP
Yeah exactly. That would be interesting.


JR 
It would be really interesting. Well, maybe there's a kind of convergence point, depending on what the initial qualities of the speaker's voice are. Maybe it converges and maybe it doesn't converge. Maybe it oscillates, I don't know. That could be really interesting.
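The convergence question can be prototyped with a toy model. In the sketch below, everything is illustrative: the round-trip transform is a made-up lossy string map standing in for recognise(synthesise(text)), nothing like what Sphinx actually does. The loop is iterated until it either settles on a fixed point, enters an oscillating cycle, or runs out of steps.

```python
# Toy model of the speech -> text -> speech loop: iterate a lossy
# round trip and see whether it converges, cycles, or keeps drifting.

def round_trip(text):
    """Hypothetical lossy recognise(synthesise(text)) step: collapses
    a few confusable sounds, mimicking recognition error."""
    return (text.replace("th", "d")   # 'th' heard as 'd'
                .replace("v", "f")    # voiced/unvoiced confusion
                .replace("ee", "i"))  # vowel reduction

def iterate_loop(text, max_steps=20):
    """Feed the output back as input until a repeat appears.
    A repeat of the immediately preceding state is a fixed point
    (convergence); an earlier repeat is an oscillating cycle."""
    seen = []
    for _ in range(max_steps):
        if text in seen:
            idx = seen.index(text)
            return seen[idx:], ("cycle" if len(seen) - idx > 1
                                else "converged")
        seen.append(text)
        text = round_trip(text)
    return seen, "no repetition found"

history, outcome = iterate_loop("the thieves flee over the field")
```

With this particular transform the loop converges after one pass, because the substitutions are idempotent; a real recogniser's errors need not be, which is exactly why the converge/oscillate question is open.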


DP
Or maybe it ends up as pure sine tones, because the noisy or muffled parts would probably fade away after a while. Could be, I don't know.


JR
I don't know, I'm really curious about that. Sphinx doesn't have a synthesis engine, I don't think.


DP 
Probably you would have to reimplement this. But that could be interesting, I mean, you could live code SC: you could speak something into SC and SC would resynthesise what you are speaking to it, using the same features that Sphinx would analyse. So it's a quite literal translation of live coding. I mean, you are telling SC what to synthesise. Not how to control the synthesis but, quite literally, what sound it should output.
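The analyse-then-resynthesise idea (and DP's earlier "sine tones" speculation) can be sketched outside SuperCollider as well. The following is a minimal Python/NumPy model and has nothing to do with Sphinx's actual feature set: it reduces each frame of audio to just two features, a crude zero-crossing pitch estimate and an RMS amplitude, and rebuilds the signal as frame-wise sine tones from those features alone.

```python
# Minimal analysis/resynthesis sketch: per-frame pitch + amplitude in,
# sine tones out. Illustrative only; real recognisers use far richer
# features (e.g. MFCCs) than a zero-crossing count.
import numpy as np

SR = 16000    # sample rate in Hz (assumed)
FRAME = 512   # frame length in samples (assumed)

def analyse(signal):
    """Per-frame RMS amplitude and a crude zero-crossing pitch estimate."""
    frames = signal[:len(signal) // FRAME * FRAME].reshape(-1, FRAME)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    # count sign changes within each frame, then convert to Hz
    signs = np.signbit(frames).astype(np.int8)
    crossings = np.abs(np.diff(signs, axis=1)).sum(axis=1)
    freq = crossings * SR / (2 * FRAME)
    return freq, rms

def resynthesise(freq, rms):
    """Rebuild the signal as phase-continuous frame-wise sine tones."""
    out, phase = [], 0.0
    t = np.arange(FRAME) / SR
    for f, a in zip(freq, rms):
        # sqrt(2) * rms restores the peak amplitude of a sine
        out.append(np.sqrt(2) * a * np.sin(2 * np.pi * f * t + phase))
        phase += 2 * np.pi * f * FRAME / SR
    return np.concatenate(out)

# a 440 Hz test tone survives the round trip with its pitch and
# amplitude roughly intact
sig = np.sin(2 * np.pi * 440 * np.arange(SR) / SR)
f, a = analyse(sig)
out = resynthesise(f, a)
```

Anything the two features cannot capture, noise, formants, consonant transients, is discarded on every pass, which is one way DP's "fades away after a while" intuition could play out in a feedback loop.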


JR
It would be nice to be able to do that in a more flexible way. Because at that point you might as well just remove the language interpretation layer altogether and.. I mean, that's basically the idea of the vocoder, right? You translate the sound of your voice into synthesis parameters which try to reproduce it.


DP
Yeah, but with that in between step that goes through the language understanding.


JR
So there should be a level of language understanding. Or perhaps the phonetic level would be interesting.


DP
Yeah.


JR
Phonetic training is something I would really like to explore a little bit, if there's time. Because that seems to be an area where there are quite a lot of interesting possibilities, in terms of linguistic gestures being recognised as something meaningful. The way you say something rather than what you say.

Experimentalstudio meeting, 18_10_2018

Jonathan Reus, Hanns Holger Rutz, David Pirrò

POZ [28_10_18]

This reminds me of Jürg Lehni's installation ⬀Apple Talk, which is centred around this exact same principle. From his own description of the work:

"Apple Talk confronts two Macintosh computers with the imperfections of verbal communication. Using two microphones, text-to-speech and voice recognition software, the two machines continuously transform written sentences into spoken words and back into written form in an endless loop, leaving room for error and automatic interpretation". 

The video shows how this kind of configuration tends to converge to a unique sentence, which is probably the reason why he introduced the repetition detection.