AmanoKei's Laid-Back Tech Commentary

I write explanatory articles mainly about the technical side of synthesized singing voices.

Exploring the mystery of the poor sound quality reveals Crypton's intentions 【What is the identity of Hatsune Miku NT?】

12/19 GMT+9

Many people seem to be misunderstanding the central points of this post, so let me state them up front.
 
1) How Hatsune Miku NT works.
2) Hatsune Miku NT and V4x aim in different directions (V4x is more polished at this point, but NT's quality may improve over time).
3) Why Crypton decided to create Hatsune Miku NT.

 

 

Hatsune Miku NT is, frankly, rather underwhelming.

Aside from the fact that you can draw the pitch freely, VOCALOID4 is still generally said to have better sound quality.

【初音ミク NT Original+:歌声デモンストレーション】 (Hatsune Miku NT Original+: vocal demonstration) - YouTube

Now, let's talk about the internal functions of Hatsune Miku NT, and why Crypton created it in the first place.

 

Let's start by thinking about the output sound

That said, just listening to the output audio won't settle anything, so I'll think it through while looking at the waveform and spectrogram and manipulating the input in various ways.
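If you want to try the same kind of inspection yourself, here is a minimal sketch in Python using scipy and matplotlib; the file name miku_nt_sample.wav is just a placeholder for whatever render you export.

```python
# Minimal sketch: load a render and look at its waveform and spectrogram.
# "miku_nt_sample.wav" is a placeholder name; substitute any exported WAV.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, data = wavfile.read("miku_nt_sample.wav")
if data.ndim > 1:                         # fold stereo down to mono
    data = data.mean(axis=1)

f, t, sxx = spectrogram(data, fs=rate, nperseg=2048, noverlap=1536)

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
ax_wave.plot(np.arange(len(data)) / rate, data)
ax_wave.set_ylabel("amplitude")
ax_spec.pcolormesh(t, f, 10 * np.log10(sxx + 1e-12), shading="auto")
ax_spec.set_ylabel("frequency [Hz]")
ax_spec.set_xlabel("time [s]")
plt.tight_layout()
plt.show()
```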

①Is Hatsune Miku NT an AI Voicebank?

There are rumors that Hatsune Miku NT is an AI voicebank created from Hatsune Miku V4x and other data, but I think this can be ruled out.

First, to confirm that it is not an AI voicebank, let's try feeding it input that would break down if it were one.

Here is what happens in CeVIO AI when the phoneme [s] is held for 3 bars at 120 BPM (roughly 6 seconds of continuous [s], assuming 4/4).

The parameter shown is VOL, and as you can see, it breaks down completely.

[Image: CeVIO AI's VOL parameter breaking down over 3 bars of [s]]

On the other hand, here is 29 bars of [s] in Hatsune Miku NT, also at 120 BPM (roughly 58 seconds of continuous [s]).

As you can see from the waveform below, there is no breakdown at all.

[Image: Hatsune Miku NT waveform for 29 bars of [s], showing no breakdown]

 

The fact that it returns decent output even when you feed it phoneme patterns an AI would not expect strongly suggests that this is not an AI.

(Unless Crypton built the AI with this kind of input in mind, but I doubt they designed for someone entering 30 bars of voiceless notes.)

 

②Is Hatsune Miku NT waveform-synthesis software?

So then, is Hatsune Miku NT waveform-synthesis software like VOCALOID?

I think the answer is half yes and half no.

 

First of all, most waveform-synthesis software, such as VOCALOID and UTAU, uses a method called corpus-based (unit-selection) synthesis, which builds the song by re-sampling and splicing either "raw voice waveforms" or "waveforms that closely reproduce the raw recording."
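As a toy illustration of the unit-selection idea (my own sketch, not VOCALOID's or UTAU's actual algorithm; real engines typically optimise the whole phrase with dynamic programming and score spectra and durations, not just pitch), it looks roughly like this:

```python
# Toy unit-selection sketch: each phoneme has candidate "units" recorded at a few pitches,
# and we greedily pick the chain that minimises target cost + join cost.
# The corpus below is hypothetical data purely for illustration.

def select_units(phonemes, target_pitches, corpus):
    """Pick one recorded candidate per phoneme, balancing closeness to the
    wanted pitch (target cost) against smoothness with the previous unit (join cost)."""
    chosen, prev = [], None
    for phoneme, target in zip(phonemes, target_pitches):
        best, best_cost = None, float("inf")
        for cand in corpus[phoneme]:                  # candidate recorded pitches (MIDI notes)
            target_cost = abs(cand - target)
            join_cost = 0 if prev is None else abs(cand - prev)
            cost = target_cost + 0.5 * join_cost
            if cost < best_cost:
                best, best_cost = cand, cost
        chosen.append((phoneme, best))
        prev = best
    return chosen

# Hypothetical mini-corpus: fragments recorded at MIDI notes 57 (A3), 64 (E4), 69 (A4).
corpus = {"a": [57, 64, 69], "k a": [57, 64], "s a": [64, 69]}
print(select_units(["k a", "a", "s a", "a"], [62, 63, 67, 68], corpus))
```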

If you ask me if Hatsune Miku NT falls into this category, I have my doubts.

 

As you can hear from UTAU's default engine (raspy as it sounds), the pronunciation itself is perfectly natural human pronunciation.

 

In comparison, there are many things about Hatsune Miku NT that sound "strange as human pronunciation."

It's hard to say exactly what it is, but I think the transition from consonants to vowels is particularly unnatural.



In other words, Hatsune Miku NT is likely not "simply waveform-synthesis software that cuts and pastes a real voice," nor "waveform-synthesis software that cuts and pastes a high-quality reproduction of a real voice."

 

Then what is Hatsune Miku NT?

There are many hints about Hatsune Miku NT's identity, but only a few people seem to have noticed them.

First, let's read this sentence from the official website.

[Image: excerpt from the official Hatsune Miku NT website describing the voice library]

This is a 「高品位」 voice library created with newly developed resynthesis technology.

I'll talk about "resynthesis technology" later, but I'm curious about the use of "high-dignity/purity(高品位)" instead of "high quality(高品質)" here.

NT is certainly not "high quality," and the sentence would have worked without this word, so there must be a reason why "dignity" was deliberately chosen.

※Incidentally, "resynthesis" is also a synthesizer term. And this is also quite important, but please look up the meaning on your own.

 

Next is this word.

[Image: excerpt from the official website mentioning "multi-sample point"]

I'm curious about the "multi-sample point."

Normally I would take it to mean "Hatsune Miku NT was recorded at multiple pitches," but in that case "multisample" alone would do. I wondered why they added the extra "point," did some research, and found a surprising origin.

Sample points: raw data from an A/D converter used to calculate waveform points ("All About Oscilloscopes," published by Technotronics, April 2017).

I was surprised to learn that this is a term used in oscilloscopes and the like, but the explanation is worth noting here.
There was also an explanation of "waveform points," which I'll quote as well.

A digital value that represents the voltage at a certain point of a signal. The waveform points can be calculated from the sample points and stored in memory. ("All About Oscilloscopes," published by Technotronics, April 2017)

To put it simply, you take "sample points" from measurements of a signal and then calculate the "waveform points" from them.

If we apply this to the "multi-sample point" of Hatsune Miku NT, we can say that "waveforms can be calculated from real voice samples of several pitches."

In other words, Hatsune Miku NT does not directly process waveforms but rather "extracts specific data from the voice and reconstructs it based on that data."
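If that reading is correct, "multi-sample point" would amount to something like the following sketch: spectral envelopes are measured from recordings at a few pitches, and the envelope for an in-between pitch is calculated from them rather than pasting either recording directly. The pitches and numbers below are fabricated purely for illustration.

```python
# Illustration of "calculating the waveform from voice samples at several pitches".
# The two envelopes below are made-up numbers standing in for measured data.
import numpy as np

freq_bins = np.linspace(0, 8000, 9)                                  # Hz
env_at_a3 = np.array([0, -3, -6, -12, -20, -28, -35, -42, -50])      # dB, "recorded" at A3 (220 Hz)
env_at_a4 = np.array([0, -2, -5, -10, -17, -25, -33, -40, -48])      # dB, "recorded" at A4 (440 Hz)

def envelope_for_pitch(f0, f_lo=220.0, f_hi=440.0):
    """Interpolate the stored envelopes (weighted in log-frequency) for an intermediate f0."""
    w = (np.log2(f0) - np.log2(f_lo)) / (np.log2(f_hi) - np.log2(f_lo))
    w = float(np.clip(w, 0.0, 1.0))
    return (1 - w) * env_at_a3 + w * env_at_a4

# Envelope for E4 (~330 Hz): neither recording is used directly;
# a new envelope is *calculated* from the sampled points.
print(envelope_for_pitch(329.63).round(1))
```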

 

People who are somewhat familiar with speech synthesis may say, "Isn't that a vocoder?" But I am convinced that it is not a vocoder.

 

What is the nature of resynthesis technology?

As it turns out, I think of it as a "primitive synthesizer."

That is, the kind of synthesizer that processes sine waves to produce various sounds.

[Image: a synthesizer]

To go a bit further, I think the concept builds on the formant-singing sound source used in the PLG100-SG developed by Yamaha.

※(For more details, please refer to pages 20~23 of "Vocaloid Technology(「ボーカロイド技術論」)")
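To make the "primitive synthesizer" image concrete, here is a minimal additive sketch in the spirit of formant singing. To be clear, this is not the PLG100-SG's actual algorithm; the formant values are rough textbook figures for an /a/-like vowel, and the rest is my own simplification.

```python
# Additive / formant-style synthesis sketch (illustrative only, not Yamaha's implementation).
import numpy as np

sr, dur, f0 = 44100, 1.0, 440.0                      # sample rate, length, sung pitch (A4)
t = np.arange(int(sr * dur)) / sr

formants = [(800, 80), (1150, 90), (2900, 120)]      # (centre Hz, bandwidth Hz), roughly /a/

def formant_gain(freq):
    """Weight a harmonic by its distance to each formant peak (Gaussian-shaped resonances)."""
    return sum(np.exp(-((freq - fc) / bw) ** 2) for fc, bw in formants)

tone = np.zeros_like(t)
n = 1
while n * f0 < sr / 2:                               # sum harmonics up to the Nyquist limit
    tone += formant_gain(n * f0) * np.sin(2 * np.pi * n * f0 * t)
    n += 1

tone /= np.max(np.abs(tone))                         # normalise; write to WAV with scipy if desired
```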

 

This is only a guess, but I think the rough structure of Hatsune Miku NT is to calculate, from the given parameters, the "spectral envelope composed of integer-order harmonics (the core of the voice)" and the "spectral envelope of the aperiodic components (breath)" separately in an abstract form, and then merge them.

(For a detailed explanation of this term, please refer to here)

amanokei.hatenablog.com

 

There are several reasons for this, but I'll list a few of the most convincing.

 

First, here is the extracted "integer harmonics (the core of voice)/voiced sound" of Hatsune Miku V4x.

In a typical real voice, the overtones in the upper range are mixed in with the breath components and cannot be cleanly extracted.

[Image: extracted voiced component (integer harmonics) of Hatsune Miku V4x]

 

This is the extracted "integer harmonics (the core of voice)/voiced sound" of Hatsune Miku NT.

As you can see, even the overtones in the higher registers are clearly extracted to an unpleasant degree. This is a level that is impossible with the human voice.

[Image: extracted voiced component (integer harmonics) of Hatsune Miku NT]

 

This is a sample of the "aperiodic component (breath)" extracted from Hatsune Miku NT and Hatsune Miku V4x.

For Hatsune Miku V4x, it sounds as if the "integer harmonics (the core of the voice)/voiced sound" had been removed from the original sample and the remainder's volume raised, like a whisper.

For Hatsune Miku NT, on the other hand, there is no correlation with the original sample at all, and it sounds as if the engine is simulating what the "aperiodic component (breath)" at this pitch would be like.

[Image: comparison of the extracted aperiodic (breath) components of Hatsune Miku NT and V4x]

 

Looking at this, you can see that Hatsune Miku NT is not "pieces of a real voice," or even "pieces of a faithfully imitated real voice," cut and pasted together.

If these assumptions are accurate, the internal structure of Hatsune Miku NT probably looks something like this (a rough numerical sketch of steps 2 and 5 follows the list).

  1. Parameters (lyrics, pitch, and dynamics) are input.
  2. Generate the spectral envelope (the stack of formants) for the "integer harmonics (the core of the voice)/voiced sound" according to those parameters.
  3. Based on 2, generate phoneme fragments by simulating behaviour along the time axis (attack and decay timbre, etc.).
  4. Connect the generated fragments.
  5. Simulate and generate the spectral envelope of the "aperiodic component (breath)," and combine it with 4.
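To make steps 2 and 5 concrete, here is a rough numerical sketch based purely on my own assumptions (nothing here comes from Crypton): the voiced part is a harmonic series shaped by a spectral envelope, the breath part is noise shaped by an envelope of its own, and the two streams are simply mixed at the end.

```python
# Sketch of steps 2 and 5 under my own assumptions, not Crypton's published design:
# voiced part = harmonics shaped by a spectral envelope, breath part = shaped noise, then mix.
import numpy as np

sr, dur, f0 = 44100, 1.0, 261.63                     # C4
t = np.arange(int(sr * dur)) / sr
rng = np.random.default_rng(0)

def envelope_db(freq, tilt=-6.0):
    """Toy spectral envelope: flat below 500 Hz, rolling off about 6 dB per octave above it."""
    return np.where(freq < 500, 0.0, tilt * np.log2(freq / 500))

# Step 2: "integer-order harmonics (the core of the voice)" shaped by the envelope.
voiced = np.zeros_like(t)
n = 1
while n * f0 < sr / 2:
    amp = 10 ** (envelope_db(n * f0) / 20)
    voiced += amp * np.sin(2 * np.pi * n * f0 * t)
    n += 1

# Step 5: "aperiodic component (breath)" — white noise given its own spectral colouring
# (the same toy envelope is reused here only to keep the example short).
noise = rng.standard_normal(len(t))
freqs = np.fft.rfftfreq(len(t), 1 / sr)
breath_env = 10 ** (envelope_db(np.maximum(freqs, 1.0)) / 20)
breath = np.fft.irfft(np.fft.rfft(noise) * breath_env, len(t))

# Final merge of the two streams; the breath level is a free parameter of the sketch.
mix = voiced / np.max(np.abs(voiced)) + 0.05 * breath / np.max(np.abs(breath))
mix /= np.max(np.abs(mix))
```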

 

Thinking about it this way, you can understand why the sounds at phoneme boundaries become tricky.

I think that is why Crypton postponed the release of NT so many times. In the end, the technology couldn't keep up with the direction Crypton had in mind, and the current Hatsune Miku NT was released as a crystallization of compromise and resignation.

Long, steady vowel tones are easy to reproduce with a synthesizer, but the instantaneous, complex sounds of consonant-to-vowel transitions must be very hard to produce.

On the other hand, VOCALOID uses phonemes modeled from the original voice, and UTAU uses the original waveform, so their transition sounds are cleaner.

 

Is Hatsune Miku NT a "New Technology"?

[Image: excerpt from the official website]

To put it simply, the core technology of Hatsune Miku NT is far from being "new technology."

I believe that this new technology is a "comprehensive concept that combines various existing technologies, ideas, etc.".

As supporting evidence, the phrase "new technology (新技術)" is used only here; everywhere else the wording is "newly developed."

 

I believe that this "new technology" is based on the formant-singing function of Yamaha's PLG100-SG that I mentioned earlier, with AIST perhaps improving the voice resolution and Crypton adding various functions and the UI.

I think this is what they meant when they said "we will continue to collaborate with Yamaha" at the Magical Mirai presentation. And "we can reproduce VOCALOID tones" may mean that the timbre of the formant-singing voicebank will be based on VOCALOID timbres.

 

Why did they develop Hatsune Miku NT?

In the first place, why did they develop NT instead of simply creating a voicebank with VOCALOID5?

To get a clue to this question, we have to go back to the announcement of Hatsune Miku NT at Magical Mirai.

I believe it was there that Wataru Sasaki (wat) said something like, "VOCALOID5 has a human voice mixed in, so it's not Hatsune Miku, it's Saki Fujita".

I believe this to be half true and half false.

(Please note that there will probably be a lot of speculation from here on.)

 

Originally, I think Yamaha tried to add AI functions to VOCALOID5.


Unlike earlier VOCALOID4 promotions, this video has a very "futuristic" feel to it. These functions and controls are all things that only show their real value once the voicebank becomes an AI. (In particular, the "you" in "I sing for you" around 1:04 sounds very unnatural.)

 

You might object that VOCALOID5 was released in 2018 while VOCALOID:AI Misora Hibari was not announced until 2019.
However, back in 2017, Pompeu Fabra University, which develops singing-voice-synthesis technology in collaboration with Yamaha, had already published a paper called "Neural Parametric Singing Voice Synthesis" (NPSS), a precursor to what is now called "AI singing voice synthesis."

The Misora Hibari AI is thought to have been created based on this.

However, since Yamaha updates VOCALOID roughly every three years, the research and practical implementation were not ready in time for the release, and what came out was the somewhat lopsided product that is "VOCALOID5 without AI."

 

With this in mind, let's look at WAT's statement again.

"VOCALOID5 has a human voice mixed in, so it's not Hatsune Miku. It's Saki Fujita."

Yes, this is not talking about VOCALOID5, but it can be taken as a statement about AI voice synthesis in general.

If you wanted to make Hatsune Miku into an AI, the obvious approach would be to have Saki Fujita sing and train on that, but the result would not be a Hatsune Miku AI. It would only be a "Saki Fujita AI."

That said, even if you turned the voice output from VOCALOID into an AI, you would only get a degraded copy of the VOCALOID version of Hatsune Miku.

 

I'm going to change the subject a bit, but perhaps it's inevitable that Hatsune Miku V4x is often said to be more polished than Hatsune Miku NT.

The reason for this is that V4x is "a masterpiece of VOCALOID Hatsune Miku (created by Wataru Sasaki/wat)", and Crypton has done a thorough job of specializing in VOCALOID voice processing.

(※The AHS live broadcast mentioned that "half-hearted processing will result in an error sound," so it seems that specialized processing is the only way to go.)

The official website mentions the effort only briefly, but it's surely not a level of work that the word "carefully" can cover.

The voice database for "Hatsune Miku V4X" was carefully created by editing the voice of voice actress Saki Fujita, who recorded a large number of voices in a music studio, to include a variety of voice colors.

[Image: excerpt from the official Hatsune Miku V4X website]

 

And it seems that Crypton did not go for AI, nor for "emulating the best of Hatsune Miku V4x", but for "a singing voice synthesis technology based on waveform synthesis that allows more flexible singing expression."

It is said that even AHS was not told the details of VOCALOID5 in advance, so the timing may be that Crypton saw the future of AI when NPSS was announced and made up their mind at that stage.

This is because, counting from when VOCALOID5 was announced, it would still have taken several years of research to release a new Miku.

Is it possible that Hatsune Miku will return to VOCALOID?

I think it's a possibility.

If you're wondering why Crypton didn't simply create a "Hatsune Miku AI" that is really a "Saki Fujita AI," it's probably because they love Hatsune Miku as an existence in her own right.

To put it simply, "Fujita Saki AI" is a misinterpretation.

 

At the Magical Mirai announcement, wat-san was in tears. I suspect this was because he was overwhelmed by the reality that "Hatsune Miku was born thanks to Yamaha, but for Hatsune Miku to continue to be Hatsune Miku, she will have to leave VOCALOID." I'm not really sure, though.

 

As I mentioned in "Is Hatsune Miku NT a 'New Technology'?", I think AIST did most of the technical development. Still, it wouldn't be surprising if Yamaha provided core technology and UI-related patents, so I don't think it's a lie to say that Crypton and Yamaha still have a good relationship.

 

I think Hatsune Miku NT was born in the process of Crypton working toward a final answer to the question, "What is Hatsune Miku?"

If so, once Crypton finds an answer to that eternal question, they might even be able to make a VOCALOID:AI Hatsune Miku.

 

Hatsune Miku has become "高品位(high-dignity/purity)"

The most common answer to the question "When would Hatsune Miku stop being Hatsune Miku?" has been "when the voice comes from someone other than Saki Fujita."

 

And now, Hatsune Miku has gone from being "a thing made by cutting and pasting human voices" to being "a synthesizer that reproduces human voices."

You may finally understand why Hatsune Miku NT was described as "高品位."

"high-dignity, high-purity"

 

By abstracting Hatsune Miku's voice, Hatsune Miku NT has enhanced the purity of Hatsune Miku, and I believe this has raised her to an existence and personality one step removed from reality.

 

From samplers to synthesizers.

 

Summary

Hatsune Miku NT may be a new kind of virtual being, one able to step one dimension away from the real-world existence of Saki Fujita, one of the creators of Hatsune Miku..........Maybe!

 

It's kind of emotional, isn't it?

 

※The second half of this discussion is largely speculation, so please take it as reference only. I don't like seeing people rag on Hatsune Miku NT, so I started thinking about why it was born, and this is what I came up with.
If this is far from the truth, I'm very sorry to Yamaha, Crypton, and wat!