The stock GPT model uses dense attention, which works best at sequence lengths in the hundreds or thousands of tokens, so it isn't suitable for any kind of raw audio, which involves extremely long sequences of millions of tokens at the millisecond level. (A WAV may be scores of megabytes long; even a highly optimized lossy encoding like MP3 or Vorbis is still megabytes for most music.) If you tried, it'd be a failure because 1024 or 2048 tokens would encode a few tens of milliseconds of audio at best, and it's impossible to meaningfully predict based on so short a window; most sounds or phonemes or musical notes are far longer than that! You can use it for very high level encodings like ABC notation or, if you brute force it a bit, you can generate MIDI via ABC (see https://www.gwern.net/GPT-2-music). This will let you generate folk or instrumental style music with a few instruments in a simple style. (Note the hack that iGPT resorts to, with pixel-encoding, to make even tiny images of 64px workable with enormous compute: a 64^2^ RGB image is a 'sequence' of length l=64*64*3=12,288, which is well into the painful territory for dense GPT.)
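To put rough numbers on the paragraph above, here is a back-of-the-envelope sketch. It assumes one token per audio sample, mono, at a CD-style 44.1 kHz sampling rate; none of those figures come from the comment itself, and a byte-level encoding of 16-bit stereo would shrink the window by another 4x. The function names are made up for illustration.

```python
# Back-of-the-envelope: how much raw audio fits in a dense-attention window?
# Assumptions (not from the original comment): one token per sample,
# mono audio, 44.1 kHz sampling rate.

SAMPLE_RATE_HZ = 44_100


def window_ms(context_tokens, sample_rate_hz=SAMPLE_RATE_HZ):
    """Milliseconds of audio covered by a window of sample-level tokens."""
    return 1000.0 * context_tokens / sample_rate_hz


def tokens_for(seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of sample-level tokens in a clip of the given duration."""
    return int(seconds * sample_rate_hz)


def dense_attention_pairs(seq_len):
    """Query-key pairs scored by one dense self-attention layer (n^2)."""
    return seq_len * seq_len


if __name__ == "__main__":
    for ctx in (1024, 2048):
        print(f"{ctx} tokens cover about {window_ms(ctx):.1f} ms of audio")

    song_tokens = tokens_for(180)  # a 3-minute track
    print(f"3-minute track: {song_tokens:,} tokens")

    blowup = dense_attention_pairs(song_tokens) / dense_attention_pairs(1024)
    print(f"dense attention cost vs. a 1024-token window: {blowup:.1e}x")

    # The iGPT comparison from the text: a 64x64 RGB image, one token per channel value.
    print(f"64x64 RGB image as a sequence: {64 * 64 * 3:,} tokens")
```

Under these assumptions a whole song is several million tokens, while the context window covers a small fraction of a single note.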
MuseNet goes one level below ABC by operating on a MIDI encoding of music. This requires shifting from dense attention to a more scalable form of attention (in its case, Sparse Transformers), which can handle lengths of tens of thousands with acceptable compute requirements & quality. MuseNet was better but still fairly limited. (Not raw audio, a few instruments, definitely no voices etc.)
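A rough way to see why the switch matters: dense attention scores on the order of n^2 query-key pairs per layer, while the factorized patterns in Sparse Transformers are designed to bring that down to roughly n*sqrt(n). The sketch below only counts pairs under that idealized scaling; it ignores constants, heads, and implementation details, and the function names are hypothetical.

```python
import math


def dense_cost(n):
    """Approximate query-key pairs for dense attention over n tokens."""
    return n * n


def sparse_cost(n):
    """Approximate pairs if each token attends to about sqrt(n) others.

    This is the idealized O(n * sqrt(n)) scaling, not an exact count for
    any particular sparse attention pattern.
    """
    return n * math.isqrt(n)


if __name__ == "__main__":
    for n in (1024, 4096, 16384, 65536):
        ratio = dense_cost(n) / sparse_cost(n)
        print(f"n={n:>6}: dense/sparse pair ratio is about {ratio:.0f}x")
```

The savings grow with sequence length, which is why tens of thousands of tokens become tolerable where dense attention would not be.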
Jukebox operates at the raw audio level, and it does this by using much larger models (<10b parameters), conditioned on lyrics/artist metadata (from n~1m songs, IIRC), and a hybrid architecture: not just Sparse Transformers, but VAE-style codebooks providing discrete embeddings of the music style for more global consistency compared to a pure autoregressive token-by-token approach like GPT/MuseNet. Jukebox is extremely impressive: it generates raw audio, for most genres of music, in the style of specific artists, and it even learns to synthesize singing voices (!). It doesn't quite have the global coherency that GPT or MuseNet samples can achieve, like choruses, because I think its attention window is still de facto limited to something like 20 seconds, which limits learning & long-range coherency; but I think fixing that is just a matter of adding another layer to the hierarchy and maybe another order of magnitude of parameters, and that would close much of the remaining quality gap.
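The "something like 20 seconds" figure can be sanity-checked with a little arithmetic. The numbers below are as I recall them from the Jukebox paper (44.1 kHz audio, three codebook levels compressing by 8x, 32x and 128x, and an 8192-token context), so treat them as assumptions rather than gospel; the fourth level with 512x compression is purely hypothetical, standing in for "another layer in the hierarchy".

```python
# Assumed figures (recalled from the Jukebox paper, not from the comment above).
SAMPLE_RATE_HZ = 44_100
CONTEXT_TOKENS = 8192
HOP_LENGTHS = {"bottom": 8, "middle": 32, "top": 128}  # samples per code

# Hypothetical extra level: another 4x compression on top of the existing ones.
HYPOTHETICAL_HOP = 512


def seconds_covered(hop, context_tokens=CONTEXT_TOKENS,
                    sample_rate_hz=SAMPLE_RATE_HZ):
    """Seconds of raw audio spanned by one context window at a given level."""
    return context_tokens * hop / sample_rate_hz


if __name__ == "__main__":
    for level, hop in HOP_LENGTHS.items():
        print(f"{level:>6} level ({hop}x): {seconds_covered(hop):.1f} s per window")

    extra = seconds_covered(HYPOTHETICAL_HOP)
    print(f"hypothetical 4th level ({HYPOTHETICAL_HOP}x): {extra:.0f} s per window")
```

With these assumed numbers the top level sees a bit over 20 seconds at a time, and one more 4x level would stretch the window to roughly a minute and a half, enough to span a verse and a chorus.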
Jukebox suggests that if you created a large enough model, you could probably dispense with the VAE part and just use pure Transformers.