That is not easy at all.
If the music is stereo and the speech is panned dead centre, you can remove the speech to a certain degree by inverting the phase of either the left or the right track and then mixing the two together as a mono file. That leaves the music, although it will sound thinned out, because anything else that sits in the centre cancels along with the speech.
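As a rough illustration of the invert-and-mix trick, here is a minimal Python sketch. It assumes you already have the two channels as equal-length lists of float samples; the function name `remove_centre` and the test tones are made up for the example and do not come from any particular audio tool.

```python
import math

def remove_centre(left, right):
    """Invert the right channel and mix to mono: (L - R) / 2.

    Anything identical in both channels (dead centre) cancels out.
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    return [(l - r) / 2.0 for l, r in zip(left, right)]

# Hypothetical test signal: "speech" dead centre, "music" only on the left.
rate = 8000
speech = [0.5 * math.sin(2 * math.pi * 300 * n / rate) for n in range(rate)]
music = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
left = [s + m for s, m in zip(speech, music)]
right = list(speech)

# The centred speech cancels; the left-only music survives at half level.
mono = remove_centre(left, right)
```

In a real recording the cancellation is only partial: stereo reverb on the voice stays behind, and any music that is also in the centre (often bass and kick drum) disappears with the speech.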
Extracting the speech can be done in a very crude way by EQ filtering the entire file (shelving or band-pass) so that only the range between about 200 Hz and 5 kHz is kept, but you'll still have the mid-range music in there to some degree.
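To show what that kind of crude filtering does, here is a small Python sketch using two first-order (RC-style) filters in series. This is an assumption about how you might do it by hand, not how any particular editor's EQ works, and all function names are invented for the example. The 200 Hz and 5 kHz defaults are the rough figures mentioned above.

```python
import math

def one_pole_high_pass(samples, rate, cutoff_hz):
    """First-order RC high-pass: attenuates content below cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / rate
    a = rc / (rc + dt)
    out, prev_in, prev_out = [], 0.0, 0.0
    for x in samples:
        prev_out = a * (prev_out + x - prev_in)
        prev_in = x
        out.append(prev_out)
    return out

def one_pole_low_pass(samples, rate, cutoff_hz):
    """First-order RC low-pass: attenuates content above cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / rate
    b = dt / (rc + dt)
    out, prev_out = [], 0.0
    for x in samples:
        prev_out += b * (x - prev_out)
        out.append(prev_out)
    return out

def keep_speech_band(samples, rate, low_hz=200.0, high_hz=5000.0):
    """Crude band-pass: high-pass at low_hz, then low-pass at high_hz."""
    return one_pole_low_pass(one_pole_high_pass(samples, rate, low_hz), rate, high_hz)

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def band_gain(freq_hz, rate=44100):
    """Output/input RMS ratio for a half-second test tone at freq_hz."""
    tone = [math.sin(2 * math.pi * freq_hz * n / rate) for n in range(rate // 2)]
    return rms(keep_speech_band(tone, rate)) / rms(tone)

# A 1 kHz tone passes almost untouched; 50 Hz and 15 kHz tones are reduced.
gains = {f: band_gain(f) for f in (50, 1000, 15000)}
```

First-order filters only roll off at about 6 dB per octave, so this is gentler than a proper EQ would be. Either way, any music that falls inside the same band comes through with the speech.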
There may be some clever software that does it all very well, but I don't know about it.