I’ve been slogging away at this Kaggle competition for about five months now. Roughly 99% of that effort went into building, improving, and debugging my data pipeline, and the last 1% into the deep learning model itself. As continuing to train and optimize my model becomes cost-prohibitive, I’m looking to gain what I can from the experience (read: justify months of effort and about $200 of AWS bills) with a short list of lessons that will hopefully save me (and you) some time and money on the next project.
(Kaggle, for those who don’t know, is a data science competition platform acquired by Google in 2017, and an excellent resource for getting started with data science.)
1. Use GitHub
Starting off with a real shocker! This one isn’t too exciting, nor is it hard to implement. My “version control” during the project was saving the state of my EC2 instance as an AWS image and simply launching from the image next time I wanted to pick up where I left off. Honestly this wasn’t a horrible approach, despite a couple extra bucks per month down the hole, except if my image were corrupted or deregistered, all my work would be lost (this never happened, but I came close a couple times).
But GitHub offers more than just another way to save your changes. Jupyter notebooks don’t really offer much in the way of modularity or multiple code paths, so I wound up with “Frankensteined” notebooks: commented blocks scattered throughout, boolean variables dictating what ran and in what way, and notebook duplicates with who-knows-what version of various methods. Once I transitioned to version control à la GitHub, I was able to use new branches to experiment without having to worry about preserving my working code. For example, in overfitting my model on a subset of the data to ensure it was training properly, I no longer had to worry about adding an extra flow with a “should_subset” variable; instead I could branch off and add [:1] as frequently as my heart desired.
However, Jupyter doesn’t play very well with Git, nor does it provide much in the way of modularity, which brings me to my next lesson learned:
2. .ipynb for the playground, .py for the real work
This may be controversial, as Jupyter seems to be the de facto tool for research purposes, but you give up a lot when you resign yourself to writing all your code in a Jupyter notebook. Plain .py files not only work properly with Git (whereas Jupyter is…lacking, since a notebook is saved as JSON with cell outputs embedded, which makes for noisy diffs), they also let you write libraries, or “modules”, easily! This means every call to a method can come from the same block of code, whereas notebooks can’t share code with each other nearly as easily.
Solely using Jupyter resulted in multiple notebooks with a lot of copy-pasted code that needed to be kept in sync with its source — not what you want when trying to track down a bug in your data pipeline!
Not to say Jupyter doesn’t have a place in a Kaggle project: I would certainly continue to use it for data exploration and visualization. But I’d also define most of the methods in my main Python files and import them into my notebook, so I’d know that code that works in the notebook will also work in my pipeline. Regardless of how you go about it, visualizing the data at each step is very important (cue #3…)
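To make the notebook/module split concrete, here is a minimal sketch. The file name pipeline.py and the function scale_pixels are made up for illustration and are not from my actual project; the point is only that the function lives in one .py file and the notebook imports it.

```python
# pipeline.py (hypothetical file name): shared code lives here, not in a notebook.


def scale_pixels(pixels, max_value=255.0):
    """Scale raw pixel intensities into the range [0, 1].

    Illustrative helper only; any real preprocessing step would go here.
    """
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    return [p / max_value for p in pixels]


# In the notebook, import the function instead of copy-pasting it:
#
#   from pipeline import scale_pixels
#   scale_pixels([0, 51, 255])
#
# The training script imports the very same function, so the notebook and
# the pipeline can never drift out of sync.
```

If you edit pipeline.py while the notebook is open, IPython’s autoreload extension (%load_ext autoreload, then %autoreload 2) should pick up the change without a kernel restart.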
3. Visualize EVERY transformation you perform on the data
I can’t stress this enough. Remember I mentioned trying to track down a bug in my pipeline? Well that was after about a month of not knowing it existed, which was a month of confusing errors and poor results. In scrounging for documentation I came across this post, where a single pixel offset cost the developer two months of debugging!
Always visualize the data before and after each change to ensure the operation is being performed as you expect. If it’s an image, show it. If it’s CSV data, do a sanity check on the summary statistics and make sure to replace NaN values. Even with NLP data, you can print the text before and after stemming, or plot the embedding vectors on a 2D or 3D graph.
These small checks will save you a lot of time debugging and mean you don’t have to backtrack and do it later anyway. Plus you can keep the visualization code in a separate Jupyter notebook so it doesn’t get in the way!
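For tabular data, a before-and-after check can be as small as the sketch below. It uses only the standard library. The helper names (summarize, fill_nan_with_mean) and the sample values are invented for illustration, and filling NaNs with the mean is just one possible choice of transformation.

```python
import math
import statistics


def summarize(values):
    """Return simple summary statistics, counting NaNs separately."""
    clean = [v for v in values if not math.isnan(v)]
    return {
        "count": len(values),
        "nan_count": len(values) - len(clean),
        "min": min(clean),
        "max": max(clean),
        "mean": statistics.fmean(clean),
    }


def fill_nan_with_mean(values):
    """Hypothetical transformation: replace NaN entries with the column mean."""
    mean = statistics.fmean(v for v in values if not math.isnan(v))
    return [mean if math.isnan(v) else v for v in values]


raw = [1.0, 2.0, float("nan"), 3.0]  # made-up column with one missing value
cleaned = fill_nan_with_mean(raw)

# Print the statistics before and after so a bad transformation is obvious.
print("before:", summarize(raw))
print("after: ", summarize(cleaned))

# Cheap assertions catch silent breakage the next time the pipeline changes.
assert summarize(cleaned)["nan_count"] == 0
assert len(cleaned) == len(raw)
```

The same pattern carries over to images (show the image before and after) and text (print a few samples before and after): run the step on a tiny input, look at both sides, and assert on whatever must stay true.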
4. Optimize for the right metric
This can be a big ask, but is important nonetheless. Many online ML (that’s what the PROS call machine learning) tutorials provide code to load data and train a model, and they almost always use one of the built-in metrics, such as categorical accuracy or mean squared error. But if you’re trying to detect tumors in brain scans, and the competition values recall over accuracy (a.k.a. you want to catch as many cases as you can, at the expense of some false positives), then optimizing for accuracy isn’t going to give you a model that performs well!
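A toy example shows how far apart the two metrics can be. The labels below are made up (5 positive scans out of 100) and the two “models” are hard-coded prediction lists, not real classifiers.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == 1 for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Made-up data: 5 scans with a tumor (1), 95 without (0).
y_true = [1] * 5 + [0] * 95

# "Model" A never flags anything.
always_negative = [0] * 100

# "Model" B flags all 5 tumors, plus 10 healthy scans by mistake.
cautious = [1] * 5 + [1] * 10 + [0] * 85

print("A:", accuracy(y_true, always_negative), recall(y_true, always_negative))
print("B:", accuracy(y_true, cautious), recall(y_true, cautious))
```

Model A wins on accuracy (0.95 versus 0.90) while catching zero tumors; model B has the lower accuracy but a recall of 1.0. If the leaderboard rewards recall, A is the worse submission despite the better-looking number.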
Always review the competition metric before you begin working with the data, and certainly before starting on the model. Sometimes one of the standard metrics will suffice, and sometimes you’ll have to write a custom objective function in some random source code file and encounter errors that are near-undiagnosable due to TensorFlow’s very FUN graph system (my first Stack Overflow post, sadly unanswered).
Long story short, you can’t expect to edge out hundreds of ML PhDs when you’re not even optimizing for the right thing.
5. Choose a competition you’re psyched about
This may come as a surprise, but my passions do not lie in labeling “amateur” (crappy) smartphone images of historical landmarks. As I kept working on the project, the initial interest faded and I found it harder and harder to convince myself to open my laptop and invest a couple more hours on it.
In stark contrast, just last week I whipped up a quick March Madness bot for an office competition; the results were underwhelming (I found a bug hours after the submission deadline, but so it goes) but I found the time flying by, Deep Work style, even after a full day of programming at work. I’ll take this as a lesson for future competitions, or even a personal note regarding Kaggle in general: it’s nice to have a rank on a leaderboard to show the fruits of your labor, but you won’t be able to do your very best on it if you can’t find a reason to stay engaged and keep improving.
Writing coherent English is apparently harder than writing code. But YOUR takeaway should be that while Kaggle does a great job making data science seem easy, the pros are posting state-of-the-art scores due to solid foundations and intuition built from many failures, many of which I haven’t yet encountered.
Trying to teach yourself data science and machine learning can be brutal, but I’ve also never seen more collaboration and high-quality free resources in any other field. Stick with it! Thanks for reading!
PS: If anyone wants to take a stab at the Google Landmark Recognition Challenge, check out my kernel to download the dataset to TFRecord files in S3!
Follow me on Twitter @Eli_password123 for more!