Juggling in Data Science

data science tips

Fact: Multi-tasking is inefficient.

However, I don’t believe that having multiple data projects on the go is bad. It’s the only way to maximise impact as a data scientist, because blockers that are out of your control come up all too often:

  • more data is required (dependent on someone else)
  • business questions need answering
  • models need training

As part of my job as a data science consultant, context switching is inevitable. If I’m not required to change gears from time to time then we don’t have enough business coming in, and I should be concerned about my job!

My position is slap bang in the middle of business and academia. I work closely with business people to understand where ML can help, then work closely with academics to explore the various options.

It isn’t uncommon to have more than 3 projects on the go at a time, so recently I have been asking myself “how the hell do I minimise this context switching pain???”.

Where the challenges come in

Data science is a particularly tricky beast because you have to maintain business context and the “data feel” that comes with time.

The biggest problem I have with stepping in and out is getting over the panicked feeling of dropping into a project where I feel like a complete n00b again – “I know there was something funny going on with the sensor data, but I can’t quite remember what” – or that unadulterated unpleasantness that comes with opening up a Jupyter notebook from a previous late-night hack…

What I want to share here are the 3 habits that I use to keep myself sane, and my project juggling smooth.

1. Port from Jupyter

Every time I finish a chunk of work in Jupyter I carve out the meat of the task and store it as a function or suite of functions. Jupyter acts as a great innovation space but is utterly useless for building systems.

Combine this methodology with the `%aimport` functionality of the autoreload extension that is baked into notebooks and you can fly through the construction of some intense data pipelines.

Now you can look back at your hack and read it more like a story (go function naming!). Your mental parse-rate skyrockets and you have the basis of some maintainable code that you can actually make use of later.
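As a rough sketch of what this can look like, assume the late night hack was cleaning up some sensor readings. The module name `sensor_cleaning` and all three function names are made up for illustration; only the `%load_ext autoreload`, `%autoreload 1` and `%aimport` magics shown in the comments are real IPython features.

```python
# sensor_cleaning.py -- hypothetical module carved out of a notebook hack.
from statistics import median


def drop_missing(readings):
    """Remove readings that were never recorded (None)."""
    return [r for r in readings if r is not None]


def clip_spikes(readings, max_ratio=3.0):
    """Cap any reading larger than max_ratio times the median."""
    if not readings:
        return []
    ceiling = max_ratio * median(readings)
    return [min(r, ceiling) for r in readings]


def clean_sensor_readings(readings):
    """The 'story' version of the original hack: one named step per call."""
    return clip_spikes(drop_missing(readings))


# In the notebook (IPython magics, so not valid in a plain .py file):
#   %load_ext autoreload
#   %autoreload 1
#   %aimport sensor_cleaning
#   sensor_cleaning.clean_sensor_readings([1.0, None, 2.0, 50.0])
```

With `%autoreload 1`, only modules registered through `%aimport` are reloaded before each cell runs, so edits to the module show up in the notebook without restarting the kernel.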

As a side note, I am an engineer at heart and always strive for object-oriented code where possible. What I have found when using this method is that class code doesn’t port well to and from notebooks. Sometimes you just want to evaluate the output of a function, and this can become very annoying very fast if you go balls deep with objects.

2. Standardise my work process

Do yourself a favour and standardise your project process. I was introduced to Cookiecutter Data Science and have never looked back. Is it perfect? No. But the idea behind having a consistent project layout is incredibly valuable. Keeping your data flowing from raw, to interim, to processed etc. in manageable chunks is a godsend when you drop back in. Knowing where to look for what you need is a game-changer!

The key advantage I find in treating the data as a DAG (directed acyclic graph) is that you can drop back into a project and scope what you need to refresh yourself about. If I am only looking at the final, cleaned data then my life is going to be much less painful than having to go back to first principles.
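To make that concrete, here is a minimal sketch that assumes the stage folders live under `data/raw`, `data/interim` and `data/processed` inside the project root. The helper names `stage_dir` and `latest_populated_stage` are hypothetical and not part of Cookiecutter Data Science itself.

```python
from pathlib import Path

# Stage folders in pipeline order, following the raw -> interim -> processed
# flow described above.
STAGES = ("raw", "interim", "processed")


def stage_dir(project_root, stage):
    """Return the data folder for one stage, e.g. <root>/data/processed."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return Path(project_root) / "data" / stage


def latest_populated_stage(project_root):
    """Return the most downstream stage holding at least one data file.

    Hidden placeholder files such as .gitkeep are ignored.
    Returns None if no stage folder holds any data file.
    """
    for stage in reversed(STAGES):
        folder = stage_dir(project_root, stage)
        if not folder.is_dir():
            continue
        for item in folder.iterdir():
            if item.is_file() and not item.name.startswith("."):
                return stage
    return None
```

Calling `latest_populated_stage(".")` from a project root tells you how far down the pipeline you can start when you drop back in, so you only reopen the earlier stages if something looks off.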

The familiarity that comes with common filenames and similar layouts is akin to the familiarity that comes with a programming language. You can offload more to habit and auto-pilot and save the thinking time for the interesting, important stuff. You can switch projects without feeling like you’ve switched worlds.

3. Journal

Finally, I have recently become a big believer in keeping a data journal. Every time I become stuck in the quagmire of frantic idleness I take a step back and write. If I’m coming in very cold to a project I’ll scan through my last few entries and reconnect with my thoughts.

Yes, git commits will give you some info, but having a richer and more ad hoc place for your thoughts enables you to muse in a way that would be blasphemous in a git message.

I have found a particularly useful tool for logging this stuff is nvALT. Any time I do anything worthy of noting down, discover something interesting in the data, or need to just brain dump I can very quickly open it up and go for it.
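If you would rather keep the journal next to the code, the same habit can be sketched with a plain text file per project. This is not how nvALT works internally; `append_entry`, `last_entries` and the `## ` entry marker are all assumptions made for this example.

```python
from datetime import datetime
from pathlib import Path


def append_entry(journal_path, text, now=None):
    """Append one timestamped entry to a plain-text journal file."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M")
    path = Path(journal_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(f"## {stamp}\n{text.strip()}\n\n")


def last_entries(journal_path, n=3):
    """Return the n most recent entries, oldest first.

    Limitation of this sketch: an entry whose text contains "## "
    will be split into two entries.
    """
    path = Path(journal_path)
    if not path.exists() or n <= 0:
        return []
    chunks = path.read_text(encoding="utf-8").split("## ")
    entries = ["## " + chunk.strip() for chunk in chunks if chunk.strip()]
    return entries[-n:]
```

Dropping in cold then becomes a case of printing `last_entries("journal.md")` before touching any code.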

Summing up

Despite believing deeply that multi-tasking is a terrible idea, I do believe that context switching is a necessary evil. In some cases I have my best realisations about other projects when working on something else – the possibility for cross-pollination is fantastic and hard to quantify. The green-field thinking that comes with getting your head out of one problem and into another is invaluable, and it is a skill well worth mastering.

That being said, unless some care is taken it’s easy to burn out in the frenzy of changing. I’ve run through a few ways that I’ve tried to stay cool under pressure and I’d love to hear of any more!
