2nd Place Solution to the Taylor Swift Big Data Competition

by Devin Didericksen

tswift

The Competition

I recently participated in the Taylor Swift Big Data Competition, which was a predictive analytics competition with the stated problem: "Given a short sub-second clip of monophonic audio from Taylor Swift, can you predict if it is a mega success (> 300M views on YouTube) or a flop (< 30M views on YouTube)". It was a lot of fun to compete against some very talented data scientists. Thank you to Big Data Utah and Ben Taylor for putting it together, and to Skullcandy and Ampian Staffing for providing the prizes!

In the end I took 2nd place, but I was happy with that especially since the person I lost to was Jason Maughan, my extremely talented coworker at PurePredictive. A few people have since asked me for my solution, so I thought I'd share this ipython notebook with anyone that was interested. In addition to my final solution, I also want to share each step of my end-to-end approach (mistakes and all). The reason I do this is because the process of trying something, failing, then asking the right questions, and reiterating is much more important and helpful than the final solution itself in this case.

Getting Started

The timing of the competition had a big influence on my initial approach. Coincidentally, just a week before the competition started I assembled a personal linux box for both typical daily use and also for the purpose of running machine learning models. I made sure that I got a good GPU so that I could quickly iterate on deep learning models. When I heard about the Taylor Swift competition, and how it would require analyzing audio data, I knew that no model could beat a properly tuned deep learning or recurrent neural network model (or so I thought) because audio and visual data is deep learning's bread and butter. Plus, it gave me an excuse to test out my new GPU.

Since I am not too familiar with audio data, I figured I would skip the exploratory data analysis step and jump right into building models (mistake #1). I ran a few of the better machine learning models on the raw features just to get a benchmark for comparison against the deep learning model that I was about to build. The highest validation accuracy rate I could get was around 80%.

Attempt #1 - Deep Learning

I began building the deep learning model on the raw data. After trying out several different architectures I settled on a 5-layer neural network with architecture 3333-2000-1000-500-1 and dropout applied to all layers. I used the AdaDelta learning rate method because it converged faster than all the others on this particular problem. With this approach I was able to hit 92% accuracy. At the time I thought this was pretty good (I was wrong), and it was interesting for me to see how much better the deep learning model performed on the raw data when compared to the others.

Data Augmentation

I then came up with an idea to make the model better. Deep learning models really excel when data sets get very large, so I came up with an idea on how to augment the data to make it several times bigger in length while only sacrificing less than 1% in width. The idea is to take different windows of equal length on one record to turn that one record into several records. After I augmented the data and fed it into the deep learning model I was able to get to 97.5% accuracy. I submitted this result onto the leaderboard and was in first place momentarily. I soon got bumped down to 2nd, then to 3rd, then to 4th. I kept trying different tweaks unsuccessfully to improve a model that I was committed to (mistake #2).

Deep Learning Code

To build my deep learning model, I used nolearn's NeuralNet python library that runs on Lasagne which runs on Theano. I read a blog post on it a couple months ago, and since then I had been meaning to try it out. After using it for a while, I think personally I will stick with just Lasagne. I had a few issues that I could not remediate: I couldn't get weight initialization to perform as well as Lasagne, some of the update learning methods behaved strangely, I couldn't get 1-D convolutions to work, and it didn't run as fast as Lasagne on some of the same problems. It is entirely possible that some of these issues were due to user error... However, it seems that bugs are inevitable in nolearn since it sits on top of both Theano and Lasagne, which both get updated daily. Though, I have to say, I really did like its readability and ease of coding. You can see my code below:

In [1]:
#Load Training and Validation Data
import time
import numpy as np
from numpy import genfromtxt
t0 = time.time()
x_train = genfromtxt('X_train_public.csv', delimiter=',')/32768.0 #Neural Net input must be between [-1,1]
y_train = genfromtxt('Y_train_public.csv', delimiter=',')
x_valid = genfromtxt('X_test_public.csv', delimiter=',')/32768.0 #Neural Net input must be between [-1,1]
y_valid = genfromtxt('Y_test_public.csv', delimiter=',')

x_train = np.vstack((x_train,x_valid))
y_train = np.concatenate((y_train, y_valid), axis=0)

print x_train.shape
print y_train.shape
(22400, 3333)
(22400,)
In [2]:
#Feature Engineering
#x_train = np.column_stack((x_train2,
#    x_train2.min(axis=1),
#    np.percentile(x_train2, 0.25, axis=1),
#    np.median(x_train2, axis=1),
#    np.percentile(x_train2, 0.75, axis=1),
#    x_train2.max(axis=1),
#    (x_train2.max(axis=1) - x_train2.min(axis=1))/2,
#    (np.percentile(x_train2, 0.75, axis=1) - np.percentile(x_train2, 0.25, axis=1))/2,
#    np.mean(x_train2, axis=1),
#    np.std(x_train2, axis=1)))
In [3]:
#Frequence Binning
#bucketsize = 20
#buckets = np.linspace(-16700, 16700, num=(16700*2)/bucketsize+1)
#buckets = np.concatenate((np.array([-35000]),buckets,np.array([35000])))
#
#freq_train = np.zeros(shape=(x_train.shape[0],buckets.shape[0]-1))
#for j in range(x_train.shape[1]):
#    freq_train[j,], bins = np.histogram(x_train[j,], buckets)
#
#freq_valid = np.zeros(shape=(x_valid.shape[0],buckets.shape[0]-1))
#for j in range(x_valid.shape[1]):
#    freq_valid[j,], bins = np.histogram(x_valid[j,], buckets)
In [4]:
#Data Augmentation (Training Set)
m = 24 #multiplier; must be greater than 1

x_train2 = x_train[:,0:(x_train.shape[1]-m+1)]
for j in range(1,m):
    x_train2 = np.vstack((x_train2, x_train[:,j:(x_train.shape[1]-m+1+j)]))
    
y_train2 = y_train
for k in range(1,m):
    y_train2 = np.concatenate((y_train2,y_train), axis=0)
In [5]:
#Convert to Theano Types
import theano
x_train2 = x_train2.astype(theano.config.floatX)
y_train2 = y_train2.astype('int32')
print x_train2.shape
print y_train2.shape
(560000, 3309)
(560000,)
Using gpu device 0: GeForce GTX 970
In [6]:
#Train Deep Neural Net
import lasagne
from lasagne import layers, init, nonlinearities
from lasagne.updates import sgd, momentum, adadelta, nesterov_momentum
from nolearn.lasagne import NeuralNet

model = NeuralNet(
    layers=[
        ('input', layers.InputLayer),
        ('dropout1', layers.DropoutLayer),
        
        ('hidden2', layers.DenseLayer),
        ('dropout2', layers.DropoutLayer),
        
        ('hidden3', layers.DenseLayer),
        ('dropout3', layers.DropoutLayer),
        
        ('hidden4', layers.DenseLayer),
        ('dropout4', layers.DropoutLayer),
        
        ('output', layers.DenseLayer)
        ],
    input_shape = (None, 3333-(m-1)),
    dropout1_p = 0.2,
    
    hidden2_num_units = 2000,
    dropout2_p = 0.5,
    
    hidden3_num_units = 1000,
    dropout3_p = 0.5,
    
    hidden4_num_units = 500,
    dropout4_p = 0.5,
    
    output_num_units = 2,
    output_nonlinearity = lasagne.nonlinearities.softmax,

    # optimization method:
    update = adadelta,
    
    regression = False, 
    max_epochs = 25,
    verbose = 1,
    )
model.fit(x_train2, y_train2)
  InputLayer        	(None, 3309)        	produces    3309 outputs
  DropoutLayer      	(None, 3309)        	produces    3309 outputs
  DenseLayer        	(None, 2000)        	produces    2000 outputs
  DropoutLayer      	(None, 2000)        	produces    2000 outputs
  DenseLayer        	(None, 1000)        	produces    1000 outputs
  DropoutLayer      	(None, 1000)        	produces    1000 outputs
  DenseLayer        	(None, 500)         	produces     500 outputs
  DropoutLayer      	(None, 500)         	produces     500 outputs
  DenseLayer        	(None, 2)           	produces       2 outputs

 Epoch  |  Train loss  |  Valid loss  |  Train / Val  |  Valid acc  |  Dur
--------|--------------|--------------|---------------|-------------|-------
     1  |    0.195508  |    0.132620  |     1.474200  |     95.03%  |  38.1s
     2  |    0.074554  |    0.088546  |     0.841985  |     96.83%  |  38.1s
     3  |    0.049587  |    0.080220  |     0.618145  |     97.20%  |  38.1s
     4  |    0.041327  |    0.080187  |     0.515381  |     97.35%  |  38.1s
     5  |    0.035278  |    0.075999  |     0.464189  |     97.62%  |  38.1s
     6  |    0.031850  |    0.075061  |     0.424315  |     97.64%  |  38.1s
     7  |    0.029673  |    0.072712  |     0.408086  |     97.72%  |  38.1s
     8  |    0.027742  |    0.076409  |     0.363076  |     97.78%  |  38.1s
     9  |    0.026780  |    0.077089  |     0.347386  |     97.81%  |  38.0s
    10  |    0.025199  |    0.072465  |     0.347737  |     97.89%  |  38.1s
    11  |    0.024997  |    0.073400  |     0.340567  |     97.94%  |  38.1s
    12  |    0.024438  |    0.073778  |     0.331234  |     97.93%  |  38.1s
    13  |    0.024032  |    0.070117  |     0.342740  |     98.01%  |  38.1s
    14  |    0.022728  |    0.071153  |     0.319428  |     98.07%  |  38.4s
    15  |    0.022183  |    0.070663  |     0.313927  |     98.06%  |  38.1s
    16  |    0.021876  |    0.075103  |     0.291278  |     98.05%  |  38.1s
    17  |    0.021575  |    0.076225  |     0.283046  |     98.01%  |  38.2s
    18  |    0.022001  |    0.069903  |     0.314732  |     98.06%  |  38.1s
    19  |    0.020615  |    0.075668  |     0.272438  |     98.07%  |  38.1s
    20  |    0.021212  |    0.072457  |     0.292747  |     98.17%  |  38.1s
    21  |    0.020438  |    0.072918  |     0.280294  |     98.13%  |  38.0s
    22  |    0.020433  |    0.081651  |     0.250249  |     98.02%  |  38.0s
    23  |    0.019790  |    0.073019  |     0.271021  |     98.14%  |  38.0s
    24  |    0.020288  |    0.073248  |     0.276977  |     98.12%  |  38.0s
    25  |    0.019251  |    0.078618  |     0.244864  |     98.10%  |  38.0s
Out[6]:
NeuralNet(X_tensor_type=<function matrix at 0x7fdc171be320>,
     batch_iterator_test=<nolearn.lasagne.BatchIterator object at 0x7fdc07745e90>,
     batch_iterator_train=<nolearn.lasagne.BatchIterator object at 0x7fdc07745e50>,
     dropout1_p=0.2, dropout2_p=0.5, dropout3_p=0.5, dropout4_p=0.5,
     eval_size=0.2, hidden2_num_units=2000, hidden3_num_units=1000,
     hidden4_num_units=500, input_shape=(None, 3309),
     layers=[('input', <class 'lasagne.layers.input.InputLayer'>), ('dropout1', <class 'lasagne.layers.noise.DropoutLayer'>), ('hidden2', <class 'lasagne.layers.dense.DenseLayer'>), ('dropout2', <class 'lasagne.layers.noise.DropoutLayer'>), ('hidden3', <class 'lasagne.layers.dense.DenseLayer'>), ('dropou...<class 'lasagne.layers.noise.DropoutLayer'>), ('output', <class 'lasagne.layers.dense.DenseLayer'>)],
     loss=<function negative_log_likelihood at 0x7fdc08216410>,
     max_epochs=25, more_params={}, on_epoch_finished=(),
     on_training_finished=(),
     output_nonlinearity=<theano.tensor.nnet.nnet.Softmax object at 0x7fdc16e36ed0>,
     output_num_units=2, regression=False,
     update=<function adadelta at 0x7fdc0867c758>, use_label_encoder=False,
     verbose=1, y_tensor_type=TensorType(int32, vector))
In [7]:
#Data Augmentation (Validation Set)
#x_valid2 = x_valid[:,0:(x_valid.shape[1]-m+1)]
#for j in range(1,m):
#    x_valid2 = np.vstack((x_valid2, x_valid[:,j:(x_valid.shape[1]-m+1+j)]))
#    
#y_valid2 = y_valid
#for k in range(1,m):
#    y_valid2 = np.concatenate((y_valid2,y_valid), axis=0)
#
#x_valid2 = x_valid2.astype(theano.config.floatX)
#y_valid2 = y_valid2.astype('int32')
#print x_valid2.shape
#print y_valid2.shape
In [8]:
#my_score = model.score(x_valid2,y_valid2)
#print my_score
In [9]:
#Load Test Data
x_test = genfromtxt('X_test_private.csv', delimiter=',')/32768.0
print x_test.shape
(9600, 3333)
In [10]:
#Data Augmentation (Test Set)
x_test2 = x_test[:,0:(x_test.shape[1]-m+1)]
for j in range(1,m):
    x_test2 = np.vstack((x_test2, x_test[:,j:(x_test.shape[1]-m+1+j)]))

x_test2 = x_test2.astype(theano.config.floatX)
print x_test2.shape
(240000, 3309)
In [11]:
#Run Model on Test Data
y_test = model.predict(x_test2)
print y_test.shape
t1 = time.time()
t1 - t0
(240000,)
Out[11]:
1902.892056941986
In [12]:
y_test2 = y_test[0:x_test.shape[0]]
for j in range(1,m-1):
    y_test2 = np.column_stack((y_test2,y_test[j*x_test.shape[0]:(j+1)*x_test.shape[0]]))
print y_test2.shape
y_avg = np.mean(y_test2, axis=1)
print y_avg.shape
print y_avg[(y_avg>.3) & (y_avg<.7)].shape
print y_avg[(y_avg>.35) & (y_avg<.65)].shape
print y_avg[(y_avg>.4) & (y_avg<.6)].shape
y_avg = np.around(y_avg,0)
(9600, 24)
(9600,)
(205,)
(153,)
(115,)
In [13]:
#Save Competition Submission
np.savetxt("my_submission.csv", np.array(y_avg,dtype=int), delimiter=",",fmt='%i')

Rethinking the Question

At this point I had pretty much maxed out the accuracy rate on the deep learning model, and it was apparent that those ahead of me on the leaderboard knew something that I didn't. If I wanted to win the competition I would have to go back to the drawing board. This isn't always easy to do, but good data scientists don't let themselves get married to any one approach, one model, one programming language, etc. They side with the data and with what works, and they aren't afraid to change course when needed.

Up to this point I was still trying to answer the original question of how to predict song popularity given a sub-second clip. I then looked at the data, thought through the problem, and thought about what would be the reason every model performed so well on the test set, and it wasn't until then that I realized that predicting popularity was the wrong question... and I was wasting effort trying to come up with an answer to a wrong question.

I had three key insights that clued me in to coming up with a much better question and much better answer:

  1. There were a limited number of songs sampled.
  2. The clips in the test set were taken from the same songs as the clips in the training set - so even though the model had not seen those specific clips before, it had seen those songs before.
  3. Almost every modern pop song (including Taylor Swift's) has a computer generated rythmn and beat that is on a loop that repeats the same frequencies perfectly throughout most of the song.

Answering The New Question

Using these key insights, the new question I formulated was: "Can you predict whether a sound clip is part of Song Set A or Song Set B?". With the new question I am no longer thinking in terms of what makes a song popular, but what makes a song belong to a set of songs as apposed to another set of songs. With this being my new guide, it made sense to create a model that identified repeated frequencies to classify a sound clip as part of Song Set A or Song Set B. And the more the model overfits to the training data, the better. When over-fitting is desired (which it rarely is) there is no better way to accomplish that than using k-Nearest Neighbor with k = 1.

Attempt #2 - kNN

To identify the beat and rhythmn frequencies, I binned the frequencies. For instance, one variable was counting the number of times that a clip had frequencies between 1260 and 1280, and another for 1280 to 1300, and so on covering the entire spectrum. I then fed these features into a 1-Nearest Neighbor classifier. For this task I switched to R because sklearn's knn implementation in python uses euclidean distance, as opposed to R's that uses standardized vectors (either PCA or SVD). The R code below is what was used for my final submission, which put me at 99.56% accuracy.

In [14]:
setwd("~/R")
library(class)

x_train <- read.csv("X_train_public.csv", header = FALSE)
y_train <- read.csv("Y_train_public.csv", header = FALSE)
x_valid <- read.csv("X_test_public.csv", header = FALSE)
y_valid <- read.csv("Y_test_public.csv", header = FALSE)
x_test <- read.csv("X_test_private.csv", header = FALSE)

x_train2 <- rbind(x_train, x_valid)
y_train2 <- rbind(y_train, y_valid)

bins <- c(-35000, seq(-16700, 16700, by = 20), 35000)
x_train2 <-  t(apply(x_train2, 1, function(x) table(cut(as.numeric(x), breaks = bins))))
x_test <-  t(apply(x_test, 1, function(x) table(cut(as.numeric(x), breaks = bins))))

knn20 <- knn(x_train2, x_test, cl = as.factor(y_train2$V1), k = 1)
write.csv(knn20, "my_submission.csv")