声フェチ野郎の音声生成録

Full of knowledge that was useful for audio generation

Results: Activation Maximization with a prior in the audio domain

Introduction


I'm writing this blog post so that the results can be played in audio form. All implementations and further information can be found on my GitHub.


Neural networks are predominant in various tasks, including object detection, speech recognition, and emotion detection. However, their inner workings are, in general, not understandable to human beings. To understand how the models tackle these problems, visualization techniques such as feature visualization have been developed. In this repository, I'm going to share applications of Activation Maximization (AM), which is one of the feature visualization techniques.

In AM, the input data is optimized so that it strongly activates a selected neuron. The neuron can be a filter in a layer, a classification output, and so on. In our case, the input is optimized against the output of the classifier to observe what the model considers a certain class to be. That's why I call it class-based Activation Maximization, which is also mentioned in this paper. For further information on AM, please see this excellent explanation of AM.
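To make the idea concrete, here is a minimal sketch of class-based AM on a toy linear softmax classifier. It is not the code from my repository: the weights `W`, the bias `b`, the step count, and the learning rate are all hypothetical values chosen for illustration, and the gradient is written by hand so that only the standard library is needed.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(x, W, b):
    # Toy classifier (assumption): one linear layer followed by softmax.
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)

def activation_maximization(x, W, b, target, steps=200, lr=0.1):
    # Gradient ascent on log p[target] with respect to the input x.
    x = list(x)
    for _ in range(steps):
        p = class_probs(x, W, b)
        # d log p[target] / dx_i = W[target][i] - sum_k p[k] * W[k][i]
        grad = [
            W[target][i] - sum(p[k] * W[k][i] for k in range(len(W)))
            for i in range(len(x))
        ]
        x = [xi + lr * g for xi, g in zip(x, grad)]
    return x

# Hypothetical 3-class classifier over a 2-dimensional input.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x0 = [0.0, 0.0]

x_opt = activation_maximization(x0, W, b, target=1)
p_before = class_probs(x0, W, b)[1]
p_after = class_probs(x_opt, W, b)[1]
print(f"p(target) before: {p_before:.3f}, after: {p_after:.3f}")
```

The classifier stays fixed throughout; only the input moves. With a real network the hand-written gradient would be replaced by automatic differentiation, but the loop has the same shape.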


In this experiment, I optimize the input noise of a GAN, which is employed as a prior, as shown below. Two forms of audio data are used: raw audio and mel-spectrograms. We're going to observe how the results differ across the data forms and the structures of the models. What's more, a conditional GAN was also tried in order to examine the importance of conditioning on a certain emotion.

[Figure: GAN employed as a prior for Activation Maximization]
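The difference from plain AM is that the optimized variable is the generator's noise vector rather than the data itself. The sketch below illustrates this under strong simplifying assumptions: the "generator" is a single hypothetical `tanh` layer with made-up weights `A`, the classifier is a linear softmax with made-up weights `W_cls`, and the gradient is derived by hand. The actual experiments use trained GANs and neural classifiers, so treat this only as an outline of the optimization loop.

```python
import math

def matvec(M, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generator(z, A):
    # Toy stand-in for a GAN generator (assumption): output bounded in [-1, 1],
    # similar to normalized audio samples.
    return [math.tanh(h) for h in matvec(A, z)]

def classifier(x, W):
    # Toy stand-in for the emotion classifier (assumption).
    return softmax(matvec(W, x))

def optimize_latent(z, A, W, target, steps=300, lr=0.2):
    # Gradient ascent on log p[target] with respect to the noise z.
    # The generator and classifier weights are never updated.
    z = list(z)
    for _ in range(steps):
        x = generator(z, A)
        p = classifier(x, W)
        # Gradient of log p[target] with respect to the generated sample x.
        grad_x = [
            W[target][i] - sum(p[k] * W[k][i] for k in range(len(W)))
            for i in range(len(x))
        ]
        # Back through tanh: d tanh(h) / dh = 1 - tanh(h)^2.
        grad_h = [g * (1.0 - xi * xi) for g, xi in zip(grad_x, x)]
        # Back through the linear map: grad_z = A^T grad_h.
        grad_z = [
            sum(A[i][j] * grad_h[i] for i in range(len(A)))
            for j in range(len(z))
        ]
        z = [zj + lr * g for zj, g in zip(z, grad_z)]
    return z

# Hypothetical weights: 2-dimensional noise, 3 generated samples, 2 classes.
A = [[1.0, 0.5], [-0.5, 1.0], [0.3, -0.8]]
W_cls = [[1.0, -1.0, 0.5], [-1.0, 1.0, -0.5]]
z0 = [0.1, -0.1]

z_opt = optimize_latent(z0, A, W_cls, target=1)
prob_before = classifier(generator(z0, A), W_cls)[1]
prob_after = classifier(generator(z_opt, A), W_cls)[1]
sample = generator(z_opt, A)
print(f"p(target) before: {prob_before:.3f}, after: {prob_after:.3f}")
```

Because every candidate passes through the generator, the optimized result always stays inside the generator's output range. That constraint is what lets the GAN act as a prior here.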

Results

Neutral

[Figure: f:id:shinshoji01:20201115172702p:plain]

model audio
epoch 0
mel_wavenet
mel_lstm
mel_cnn
audio_wavenet
audio_lstm

Sad

[Figure: f:id:shinshoji01:20201115172725p:plain]

model audio
epoch 0
mel_wavenet
mel_lstm
mel_cnn
audio_wavenet
audio_lstm

Angry

[Figure: f:id:shinshoji01:20201115172748p:plain]

model audio
epoch 0
mel_wavenet
mel_lstm
mel_cnn
audio_wavenet
audio_lstm

Happy

[Figure: f:id:shinshoji01:20201115172514p:plain]

model audio
epoch 0
mel_wavenet
mel_lstm
mel_cnn
audio_wavenet
audio_lstm