Thoughts on a Neural Network for Music
I noticed a few months ago that OpenAI announced Sora. For those who don't know, Sora is video-generation software in the same vein as ChatGPT, except that instead of text, it generates videos, and it generates them with object permanence.
Object permanence was the long-sought feature for any AI video generator. In the earlier days (maybe last year or so, and before), attempts at video generation always fell short because objects wouldn't stay put in the scene--the results looked like some sort of odd, infantile hallucination.
Luckily, the folks at OpenAI seem to have cracked the code. The word going around is that they leaned on the Unreal Engine. Basically, that would let them work with 3D objects rendered into a 2D video scene. Those objects don't go away until it's time to look at something else, which yields the effect of object permanence in the generated video.
Okay...good for those guys. But what does this have to do with MUSIC?
Well, I'm glad you asked. I haven't thought about this enough to publish a paper, and I haven't written any code for it...but it's an interesting idea.
There are a lot of "music generators" out there. I think the fundamental problem with most of them is that they don't focus on the basic building blocks of their medium. When we think of music, we might think of songs, notes, descriptions of songs...but that may be the wrong frame. What we are looking for here is the fundamental objects that underpin music, at least from a computer's perspective.
Much like the OpenAI team determined that the fundamental object behind 2D video is a 3D model, I think the fundamental object of music IS a note--but a note played on a certain instrument.
(Okay, forget vocals for now, which can be tricky from this standpoint. If you're hurting for vocals, just use a Drake or Adele AI, like the ones they use for those annoying YouTube videos.)
Yes, so I think that, besides the humble note (pitch + duration), we will also need to take timbre into account.
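To make that concrete, here's a minimal sketch (Python, purely illustrative; the class and field names are my own, not anyone's standard) of what that fundamental object might look like: a note as pitch plus duration, tagged with the instrument that supplies its timbre.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Note:
    """The proposed fundamental object: a pitched event tied to an instrument."""
    pitch: int        # MIDI pitch number, 0-127 (60 = middle C)
    duration: float   # length in beats (or seconds, whichever the corpus uses)
    instrument: str   # timbre tag, e.g. "piano", "cello", "electric_guitar"

# A C-major arpeggio on piano, expressed as a sequence of these objects
arpeggio = [
    Note(pitch=60, duration=0.5, instrument="piano"),
    Note(pitch=64, duration=0.5, instrument="piano"),
    Note(pitch=67, duration=0.5, instrument="piano"),
    Note(pitch=72, duration=1.0, instrument="piano"),
]
```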
Okay, so is this revolutionary? Not at all. There are probably 1000 people working on this right now who have gotten a lot farther than me.
But the thought stands. What if we take timbre into account?
There are already a number of AI applications that use stem-splitting technology to isolate the different instrument tracks of a song. This is incredibly useful for my purposes--creating an effective song generator!
I think, if I or anyone else is going to go down this rabbit hole, it would be interesting to gather a huge collection of music, split the stems of each song, and THEN train a neural network on the result.
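The dataset-building step might look something like this sketch, assuming you reach for an off-the-shelf stem splitter such as Deezer's open-source Spleeter (the 4-stem choice, file layout, and folder names here are my own assumptions; check the Spleeter docs for the options your version actually supports).

```python
# Sketch: split a folder of songs into per-instrument stems as training data.
# Assumes `pip install spleeter`; not a tested pipeline.
from pathlib import Path

from spleeter.separator import Separator


def split_corpus(song_dir: str, stem_dir: str) -> None:
    """Split every song in song_dir into instrument stems under stem_dir."""
    separator = Separator("spleeter:4stems")  # vocals / drums / bass / other
    for song in Path(song_dir).glob("*.mp3"):
        # Writes e.g. stem_dir/<song name>/{vocals,drums,bass,other}.wav
        separator.separate_to_file(str(song), stem_dir)


if __name__ == "__main__":
    split_corpus("raw_songs", "stems")
```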
Neural networks are amazing at teasing out the sequencing of certain "words" (in this case, "notes" will do), but they are not adept at object permanence on their own. What we need from a music generator is something akin to instrument permanence, unless we just want to stay in the realm of infantile fever dreams.
Like I said above, I think stem-splitting will be of great use here.
Alternatively, if stem-splitting weren't an option, we could build a song generator trained strictly on music containing a single instrument. This may not be as hard as it sounds. Imagine getting Chopin's complete works for piano (you easily can) and training a neural network on them. Then there would be no need for "instrument permanence," because it's already baked into the corpus of sound.
I think both of these ideas are worthwhile (the song generator with "instrument permanence" and the single-instrument "Chopin generator"), and I intend to explore them further.
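If someone wanted to prototype that, here is roughly the shape it could take: read a folder of Chopin MIDI files, flatten each track into a pitch sequence, and fit a small next-note predictor. This is a sketch, not a tested recipe; the folder name, model size, and training loop are assumptions on my part, while the pretty_midi and PyTorch calls are standard.

```python
# Sketch of the single-instrument "Chopin generator" idea.
from pathlib import Path

import pretty_midi
import torch
import torch.nn as nn


def load_pitch_sequences(midi_dir: str) -> list[list[int]]:
    """Extract each non-drum track's note pitches, in onset order."""
    sequences = []
    for path in Path(midi_dir).glob("*.mid"):
        midi = pretty_midi.PrettyMIDI(str(path))
        for inst in midi.instruments:
            if inst.is_drum:
                continue
            notes = sorted(inst.notes, key=lambda n: n.start)
            sequences.append([n.pitch for n in notes])
    return sequences


class NoteLSTM(nn.Module):
    """Tiny next-pitch predictor; timbre is implicit since the corpus is piano-only."""

    def __init__(self, vocab: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, x):
        out, _ = self.lstm(self.embed(x))
        return self.head(out)


def train(midi_dir: str, epochs: int = 5, seq_len: int = 64):
    seqs = load_pitch_sequences(midi_dir)
    # Chop every track into fixed-length windows of (input, next-note target) pairs.
    windows = [
        torch.tensor(s[i : i + seq_len + 1])
        for s in seqs
        for i in range(0, len(s) - seq_len - 1, seq_len)
    ]
    data = torch.stack(windows)
    model = NoteLSTM()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for batch in data.split(32):
            x, y = batch[:, :-1], batch[:, 1:]
            logits = model(x)
            loss = loss_fn(logits.reshape(-1, 128), y.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
        print(f"epoch {epoch}: loss {loss.item():.3f}")
    return model


if __name__ == "__main__":
    train("chopin_midi")  # hypothetical folder of Chopin piano MIDI files
```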