Data scientists, often compared to unicorns, are highly sought after in the ever-growing world of big data. The role has been called the ‘Sexiest job of the 21st century’, with the number of advertised roles seeing a dramatic rise in the last few years, as figures from indeed.com illustrate.
So what skills are required to make a competent data scientist? Listing these is actually surprisingly difficult, since consensus on the precise role of a data scientist is lacking. Sought-after skills range from traditional business analytics through to cutting-edge deep learning techniques. Big Data Borat, a leading authority on all things AI (!), describes data science as “statistics on a Mac”, so experience with OS X is a must. I see data science as an intersection between three fields: machine learning/statistics, software engineering, and subject matter expertise.
The ‘bread and butter’ of a data scientist’s toolkit is, of course, statistics and machine learning. Whether identifying trends or patterns in a dataset, forecasting future events, or building an object classifier, these techniques are fundamental. Increasingly, data scientists don’t need to fully understand the nuances of algorithms they are using. Libraries such as Scikit-learn, TensorFlow, and Spark ML, to name but a few, are adding increasing levels of abstraction from the method’s fundamentals. This isn’t a bad thing, as it makes data science more accessible to those who have a less formal mathematical education. Indeed, numerous companies have a business model based around democratising AI. However, to be a successful data scientist one must go much further than this. When time allows, I go as far as writing my own implementations of algorithms using only basic mathematical libraries. I find this the best way to deeply understand the inner workings of the method.
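As a small illustration of the kind of from-scratch exercise described above, here is a minimal sketch of ordinary least squares for a single feature, written in plain Python with no third-party libraries. The function name `fit_line` and the toy data are hypothetical, chosen only for this example; a library implementation would handle many cases this one ignores.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Uses the closed-form solution for one feature:
    slope = covariance(x, y) / variance(x).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of products of deviations from the means.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1 (hypothetical, for illustration only).
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

Writing out the two sums by hand makes it clear what a library call such as a linear regression `fit` method is doing for the simplest case, which is the point of the exercise.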
So, is a data scientist simply a re-branded machine learning expert? Certainly not! In my opinion the crucial difference between a data scientist and a machine learning expert is that a data scientist can put their solution into production. This means having a good working knowledge of software engineering best practice (e.g. Git-flow, continuous integration), and writing production-ready code (e.g. well documented, suitable tests written). That is, your production code isn’t hacked together!
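To make ‘well documented, suitable tests written’ concrete, here is a minimal sketch of a small data-preparation helper with a docstring and two tests. The function `min_max_scale` and the test names are hypothetical, and the tests are written as plain functions with `assert` so they could be picked up by a test runner such as pytest or simply called directly.

```python
def min_max_scale(values):
    """Scale a sequence of numbers linearly onto the range [0, 1].

    Raises ValueError if the sequence is empty or if every value is the
    same, because the scaling is undefined in both cases.
    """
    if not values:
        raise ValueError("values must not be empty")
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must not all be equal")
    span = hi - lo
    return [(v - lo) / span for v in values]


def test_min_max_scale_maps_extremes_to_zero_and_one():
    # Smallest value maps to 0, largest to 1, midpoint to 0.5.
    assert min_max_scale([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_min_max_scale_rejects_constant_input():
    # Constant input has zero span, so the function should refuse it.
    try:
        min_max_scale([3.0, 3.0])
    except ValueError:
        return
    raise AssertionError("expected ValueError for constant input")
```

The edge cases (empty and constant input) are exactly the ones that tend to be skipped in hacked-together code and then fail in production.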
Subject matter expertise will always be required on a project, though this need not come directly from the data scientist. As a consultant I frequently work on problems in which I have little or no experience. However, I always work closely with the client to understand the problem space and ensure the data science direction makes sense in the business context. What is important is being able to rapidly gain enough knowledge to understand the question being asked (or, more often, to understand the question that should be asked!).
In terms of broader skills, good communication is essential. This is something that I cannot stress enough. Data scientists who are unable to articulate the purpose of their analysis, any caveats on the results, and the business context of their conclusions to a non-technical audience will have little real-world utility.
A less quantifiable but equally important aspect of a successful data scientist is attitude. Their attitude towards technology should be one of agnosticism; the correct technology is the one that is most appropriate for the problem posed. Their attitude towards non-data-science tasks should not necessarily be one of enthusiasm, but at a minimum willingness. This is especially important in a startup, where the development team may be lean, with each member wearing multiple hats; I could probably write an entire blog post comprising only the different data science, engineering, devops, project management, and sales tasks I’ve performed over the last two years! Finally, a successful data scientist should be acutely aware of the 80-20 rule (something I’ve noticed many have trouble following). The aim should be ‘good enough to meet the project requirements.’ If the time it takes you to improve your model by a fraction of a percent costs more than the additional value it generates, then don’t bother.
I’m sure there are many other skills and attitudes I have omitted, but in my view this post outlines a minimum set of requirements for a competent data scientist. I would love to hear your views in the comments section below.