Five tips for getting started in data science programming

Put the time and effort in early and it will make you a great programmer later on

Keith McNulty

If you want to be a genuine data scientist, you need to be able to code. There’s no getting around it. Some people don’t like this idea, and a number of companies are already tapping into that discomfort by offering ‘automated data science’ products — we do the coding so you don’t have to. If these are your only toolkit, you are not a data scientist.

The litmus test of a strong data scientist is that they are not scared of any data set or any problem. They might not know how to handle it straight away, and in fact it’s quite common that they don’t. But they know they can find out how, and they can eventually produce neat, efficient, reproducible code to handle the problem if it comes their way again. If you want to be a great data scientist, that’s the mindset you need to aim for.

So much of the inner confidence and quiet competence of a strong data scientist comes from how they learned to code in the first place. If you are just starting out, how you go about those early weeks and months of learning is critical to whether or not you will flourish further down the line. If you take the lazy approach (the how but not the why), you’ll develop habits that will make you less confident and efficient later. If you put the work in early and understand the how AND the why, you’ll gradually start to feel that confidence build and your capability expand faster and faster as the months go by.

Here are five tips to help you make a great start as you embark on your learning.

1. Choose the right learning sources

People learn in different ways. For example, I am not great at video learning. I need a detailed written narrative that I can carefully analyze and understand at a pace that I am happy with.

Avoid sources that are purely practical: the ones that show you what to do but don’t explain why it works. If you are copy-pasting a method to solve a coding problem, and you have no idea why the method worked, then you haven’t really learned anything, because you will have no idea how to apply that method later if a similar problem pops up again.

Good learning sources invest time in breaking down the underlying logic of a method. The best ones actually encourage you to code the method yourself through nudges and tips, rather than give you the entire thing ready-made. Thoughtful educators will provide follow-up questions that require you to take what you’ve learned and apply it to another context in order to establish that you have learned it well.

It’s hard to find all of this in an online module, so I would recommend that you have a written resource for in-depth learning in your language of choice. If you ask friends, colleagues or classmates what they use, make sure that they have a similar philosophy to learning before accepting their recommendation.

2. Get skin in the game

Just as a sports team will try harder if there is a prize at stake, you will learn better if you have an incentive. Incentives are not credits that you can put on your resume if you have completed an online module. Incentives are real achievements that have made your current or future work better and stronger — where you and others can visibly see how things have improved because of the work you’ve put in.

As an example, when I first learned to code, I set myself a challenge on one of my own datasets. It was several hundred thousand lines of data which my colleagues had processed annually via Excel. It was a highly manual effort and was taking longer and longer every year because Excel was struggling with the increasing size of the dataset.

As I learned the basics, I also spent time applying my new learning to this dataset. It wasn’t easy. I made lots of errors and spent long hours trying to work out what was wrong and how I could fix it. But this trial and error process was important because it forced me to engage with the inner workings of the language I was learning and get a deep understanding of how it worked under the hood.

Several weeks of work led to a fully automated script that could handle these larger and larger datasets with ease — something both I and my colleagues were excited and awed by. The tangible benefits of my learnings were clear, and it gave me the incentive and confidence to continue at pace.

Working on your own dataset, one you have a strong familiarity with, is one of the most effective ways to put early learning into practice. Avoid random datasets from the internet where you may not understand what the variables represent or what kinds of manipulations are sensible and relevant. It’s much better to have skin in the game.

3. Errors are your friend, not your enemy

When you first learn you will make a LOT of errors. But that is a really good thing if you respond to them in the right way.

Whatever your language of choice, error messages can appear terse or unhelpful to the untrained eye, but spend a little more time on them and, nine times out of ten, you’ll get a decent understanding of exactly why your command didn’t work. This is important because if you understand why it didn’t work this time, you’ll know how to make it work next time.

Too many times I see friends and colleagues completely ignoring the text of error messages and coming straight to me or others asking for help. Since I have learned to treat the error message as my best data science friend, often I can take one look at the error and tell them straight away what the problem is.

When you see an error message, pursue it as the primary route to solving your problem. Often it will mention another function or operation and you’ll need to dive into that too to understand what went wrong. All of this is such an important part of gaining a deep understanding of the environment you are operating in.
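As a small illustration of reading an error rather than skipping past it, here is a sketch in Python (my choice for the example; the same habit applies in R). The `total_sales` function and its input values are hypothetical and exist only to trigger a typical beginner error.

```python
import traceback


def total_sales(values):
    """Add up a list of sales figures (hypothetical helper for illustration)."""
    total = 0
    for value in values:
        total += value  # fails if a value is text rather than a number
    return total


try:
    total_sales([120, 95, "87"])  # the last entry was read in as text
except TypeError as err:
    # The message itself names the operation and the two types involved.
    error_text = str(err)
    print(error_text)
    # The full traceback also shows the exact line that failed.
    traceback.print_exc()
```

The message points at the operation that failed and the types it was given, which is usually enough to trace the problem back to one value that was stored as text instead of as a number.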

4. Learn your base language before add-ons

Languages like R and Python benefit from a rich ecosystem of add-ons and packages to help import functionality needed for certain common tasks or problems. But be careful not to jump into these too quickly. These packages depend on their base language and could not operate without it. You will make life more difficult for yourself if you become too dependent on these without having a decent understanding of your base language.

If you don’t learn about how data types and data structures work in your base language, or if you don’t thoroughly understand how your system prioritizes between base functionality and imported functionality, you could end up in all sorts of twists later down the line that you don’t understand how to get out of. Errors will pop up and you will have no idea what they mean. Functions may produce a completely unexpected output that you have no understanding of.
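To make the point about base versus imported functionality concrete, here is a minimal Python sketch using only the standard library. It assumes nothing beyond the built-in `pow` and the `math` module; the variable names are just for illustration.

```python
import builtins
import math

# The built-in pow keeps whole numbers as integers.
base_result = builtins.pow(2, 3)

# math.pow converts its arguments to floats and returns a float.
imported_result = math.pow(2, 3)

# Importing the name directly replaces the built-in `pow` in this module,
# so the same call now behaves differently from this point on.
from math import pow

shadowed_result = pow(2, 3)

print(type(base_result).__name__, base_result)
print(type(shadowed_result).__name__, shadowed_result)
```

The two calls look identical, but after the import the name `pow` refers to a different function that returns a different type. If you don't know which version your code is using, this is the kind of surprise that is hard to debug.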

Early on, I set myself the challenge of completing a task in the base language before I then attempted it using add-on packages. At the beginning of my learning journey, when my manipulations were relatively straightforward, this was very beneficial to my understanding of my base language. I recommend this approach to anyone in the early stages of learning.
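As one possible version of this exercise, the sketch below computes a mean per group using only base Python, with no imports at all. The `records` data and the `mean_by_group` function are hypothetical examples, not taken from my own work.

```python
# Hypothetical records: (team, score)
records = [
    ("north", 4.0),
    ("south", 3.0),
    ("north", 2.0),
    ("south", 5.0),
    ("east", 1.0),
]


def mean_by_group(rows):
    """Return the mean score per group using only base Python."""
    totals = {}
    counts = {}
    for group, score in rows:
        # dict.get supplies a starting value the first time a group appears
        totals[group] = totals.get(group, 0.0) + score
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


group_means = mean_by_group(records)
print(group_means)
```

Once the logic is clear in base Python, the same result is available from an add-on such as pandas through `DataFrame.groupby(...).mean()`, and you will have a much better idea of what that call is doing for you.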

5. Embrace the community

One of the main reasons I love working in open source data science is its community. Whatever the problem you are facing, there’s an extremely strong chance someone has faced it before and can give you advice to help you learn. No single textbook can hope to cover all the questions you might have as you learn, so the community will gradually become a key resource for you as you advance your learning.

Newbies can be scared of the community, but there really is no reason to be. The biggest reticence is often intellectual. Is this a stupid question? Will I get an embarrassing slap-down? A little bit of thought and care on your part can help ease your concerns here.

First, choose your community carefully. If you are a beginner, don’t post questions to a Twitter hashtag that will push them to experienced programmers. Find online groups and hashtags that match your level of development and direct your questions to those folks.

If you are using a more wide-ranging resource like StackOverflow, learn its rules and follow them. If you are a beginner, it’s very, very likely that the question you want to ask has already been answered, so search for it carefully before you consider posting it as a new question. If you do post it as a new question, ensure you are really specific and provide a minimal reproducible example of your code. If you post a generic question with no example, you are certain to get smacked down and you probably deserve it!
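For readers who haven't seen one, a minimal reproducible example might look something like the Python sketch below. The data and the problem are hypothetical; the point is that the post contains the data, the exact call, what came back, and what was expected.

```python
# Minimal reproducible example: sorting gives an unexpected order.

# Data: small enough to paste in full.
values = ["10", "9", "2"]

# The exact call that produced the surprising result.
actual = sorted(values)
print(actual)

# What I expected to get: numeric order.
expected = ["2", "9", "10"]
print(actual == expected)
```

Anyone can paste this, run it, and see the same result. In this case the values are strings, so they are compared character by character, and `sorted(values, key=int)` would give the numeric order.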

If you do get a response to your question and you think it is too brief — for example someone just posts the code you need without an explanation — don’t be afraid to ask them to explain why it works. Most respondents want to help, and they want to build their reputation on the platform, and so they will usually be willing to expand on their response.

These are just a few things that I recommend if you are at the start of your data science learning journey and you aspire to be a great data scientist in the future. Good luck on this exciting journey!

Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter.