We've created MuseNet, a deep neural network that can generate 4-minute musical compositions with 10 different instruments, and can combine styles from country to Mozart to the Beatles. MuseNet was not explicitly programmed with our understanding of music, but instead discovered patterns of harmony, rhythm, and style by learning to predict the next token in hundreds of thousands of MIDI files. MuseNet uses the same general-purpose unsupervised technology as GPT-2, a large-scale transformer model trained to predict the next token in a sequence, whether audio or text.

Since MuseNet knows many different styles, we can blend generations in novel ways. Here the model is given the first 6 notes of a Chopin Nocturne, but is asked to generate a piece in a pop style with piano, drums, bass, and guitar. The model manages to blend the two styles convincingly, with the full band joining in at around the 30-second mark.

We're excited to see how musicians and non-musicians alike will use MuseNet to create new compositions! In simple mode (shown by default), you'll hear random uncurated samples that we've pre-generated. Choose a composer or style, an optional start of a famous piece, and start generating. This lets you explore the variety of musical styles the model can create. In advanced mode you can interact with the model directly.
We created composer and instrumentation tokens to give more control over the kinds of samples MuseNet generates. During training, these composer and instrumentation tokens were prepended to each sample, so the model would learn to use this information when making note predictions. At generation time, we can then condition the model to create samples in a chosen style by starting with a prompt such as a Rachmaninoff piano opening, or by prompting with the band Journey, with piano, bass, guitar, and drums. Generations will be more natural if you pick instruments closest to the composer or band's usual style.
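To make the conditioning concrete, here is a minimal sketch of how such a prompt might be assembled. The token spellings and the `build_prompt` helper are illustrative assumptions, not MuseNet's actual vocabulary or API.

```python
# Illustrative sketch: prepend composer and instrumentation control tokens to an
# optional opening snippet of note tokens. Token names and this helper are
# hypothetical, not MuseNet's actual vocabulary or API.

def build_prompt(composer, instruments, opening_notes=()):
    """Return a token sequence: control tokens first, then any opening notes."""
    control = [f"<composer:{composer}>"] + [f"<inst:{name}>" for name in instruments]
    return control + list(opening_notes)

# Condition on a Rachmaninoff piano opening...
rachmaninoff_prompt = build_prompt("rachmaninoff", ["piano"])

# ...or prompt with the band Journey, with piano, bass, guitar, and drums.
journey_prompt = build_prompt("journey", ["piano", "bass", "guitar", "drums"])

# Style blending: a Chopin opening, but pop-band instrumentation.
chopin_opening = ["<note-1>", "<note-2>", "<note-3>",
                  "<note-4>", "<note-5>", "<note-6>"]  # placeholder note tokens
blended_prompt = build_prompt("pop", ["piano", "drums", "bass", "guitar"],
                              opening_notes=chopin_opening)
```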
MuseNet uses the recompute and optimized kernels of Sparse Transformer to train a 72-layer network with 24 attention heads, with full attention over a context of 4096 tokens. This long context may be one reason why it is able to remember long-term structure in a piece, as in our sample imitating Chopin. It can also create musical melodic structures, as in our sample imitating Mozart.
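As a rough illustration of the memory trade-off behind "recompute," here is a minimal sketch using activation checkpointing in PyTorch. The layer count, head count, and context length mirror the figures quoted above; the model width, the dense attention blocks standing in for the Sparse Transformer's optimized kernels, and the omitted causal mask are assumptions made for brevity.

```python
# Sketch of the "recompute" (activation checkpointing) idea that helps a very deep
# model with a long context fit in memory. Dense PyTorch blocks stand in for the
# Sparse Transformer's optimized kernels; D_MODEL is an assumed width, and the
# causal mask needed for next-token prediction is omitted for brevity.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

N_LAYERS, N_HEADS, N_CTX, D_MODEL = 72, 24, 4096, 1536

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(D_MODEL)
        self.attn = nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True)
        self.ln2 = nn.LayerNorm(D_MODEL)
        self.mlp = nn.Sequential(nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(),
                                 nn.Linear(4 * D_MODEL, D_MODEL))

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        return x + self.mlp(self.ln2(x))

class Body(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList(Block() for _ in range(N_LAYERS))

    def forward(self, x):
        for block in self.blocks:
            # Recompute: discard intermediate activations during the forward pass
            # and rebuild them during backprop, trading compute for memory over
            # the long 4096-token context.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```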
Music generation is a useful domain for testing the Sparse Transformer because it sits on a middle ground between text and images. It has the fluid token structure of text (in images you can look back N tokens and find the row above, whereas in music there is no fixed number of tokens to look back to find the previous measure). Yet we can easily hear whether the model is capturing long-term structure on the order of hundreds to thousands of tokens: it is much more obvious when a music model messes up structure by changing the rhythm than when a text model goes on a brief tangent.

We collected training data for MuseNet from many different sources. ClassicalArchives and BitMidi donated their large collections of MIDI files for this project, and we also found several collections online, including jazz, pop, African, Indian, and Arabic styles. Additionally, we used the MAESTRO dataset.
The transformer is trained on sequential data: given a set of notes, we ask it to predict the upcoming note.
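A minimal sketch of that objective, assuming a generic model that maps encoded note tokens to next-token logits; the tensor shapes and optimizer handling are illustrative, not MuseNet's training code.

```python
# Next-note prediction sketch: shift the token sequence by one and minimize
# cross-entropy between the model's predictions and the shifted targets.
# `model` is any callable returning logits of shape (batch, time, vocab).
import torch
import torch.nn.functional as F

def training_step(model, optimizer, tokens):
    """tokens: LongTensor of shape (batch, time) holding encoded MIDI events."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t+1 from tokens <= t
    logits = model(inputs)                            # (batch, time-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```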
We can visualize the embeddings from MuseNet to gain insight into what the model has learned. Here we use t-SNE to create a 2-D map of the cosine similarity of various musical composer and style embeddings.
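A minimal sketch of how such a map can be produced with scikit-learn, assuming placeholder embedding vectors and style names in place of the rows of MuseNet's learned embedding table.

```python
# Sketch: t-SNE with a cosine metric over composer/style embedding vectors,
# plotted in 2-D. The names and random vectors below are placeholders for the
# model's actual learned embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

style_names = ["chopin", "mozart", "rachmaninoff", "journey", "jazz", "pop"]
style_embeddings = np.random.randn(len(style_names), 512)  # placeholder vectors

coords = TSNE(n_components=2, metric="cosine", perplexity=3,
              init="random", random_state=0).fit_transform(style_embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), name in zip(coords, style_names):
    plt.annotate(name, (x, y))
plt.title("t-SNE map of composer and style embeddings (cosine similarity)")
plt.show()
```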