The data scientist market: Some thoughts and tips
By Gabey Goh July 14, 2015
- Q&A with the authors of ‘The Data Science Handbook’
- Need for the right skillsets, and the right kind of programmes
THE four-man team of Max Song, William Chen, Carl Shan and Henry Wang have published The Data Science Handbook, a compilation of in-depth interviews with 25 data scientists who share their insights, stories, and advice, now available via Amazon.
Digital News Asia (DNA) posed some questions to the Data Science Handbook (DSH) team on common misconceptions about the field, what makes up the ‘right’ skillsets for aspiring students and tips for organisations and interested practitioners alike on embracing this field.
DNA: What is your advice for mid-career professionals looking to make the jump into this field?
DSH: The good news is that data science is not just restricted to graduate students or those with PhDs. Those who have already been working in the technology or analysis fields will arguably already have some experience, though the challenge is to find the time to learn new skills.
As psychological preparation, consider reading DJ Patil’s interview in our book for his excellent advice about useful mental models when making transitions.
Afterwards, it depends on where you are:
- As an analyst or statistician, you can look to transition from R (which is pure scripting) to a general purpose programming language. Python is a good choice, with a rich ecosystem of libraries. If you are feeling more adventurous, you can move to heavier production languages, like Scala or Java.
- As a software engineer, consider taking Andrew Ng’s Introduction to Machine Learning course on Coursera. There is also a Data Science track on Udacity and a wealth of well-explained statistical knowledge on Khan Academy.
We also link to other resources at the bottom of our website at thedatasciencehandbook.com.
DNA: Quite a few Asian governments have aggressively embarked on promoting the take-up of ‘data scientist’ programmes/ courses in universities to beef up the talent pipeline. In your opinion, what are the key considerations that must be taken into account when developing such initiatives?
DSH: The question of the ‘right’ education for a data scientist is interesting because many early and successful data scientists never went through a traditional data science training programme.
But that’s also because in their time, there was no real programme that could have prepared you for the new field. As educational institutions catch up to the modern career demands of the field, our interviews and personal experience suggest that it’s crucial to think about a few factors:
- A rigorous foundation of linear algebra: While calculus is useful for differential equations and modelling surfaces, a solid understanding of linear algebra is unparalleled as a foundation on which to build an understanding of machine learning algorithms.
- Chances to apply theoretical knowledge with practical data sets: Completing problem sets and writing proofs is critical for building a strong formal understanding, but nothing reveals the nuances, and highlights the limitations, of theoretical understanding like doing very applied projects. For example, many statistical tests assume a normal distribution in the errors of your analysis, which makes the theory elegant and the math tractable. Real-world errors are rarely so cleanly distributed.
- Programming skills are a huge multiplier: Many new entrants to data science are PhDs from diverse quantitative disciplines who are usually versed in statistics and scripting languages (R, Python), but who might not have had a focused undergraduate computer science education. As a result, a common thread among our interviewees was learning programming on the job, from colleagues and projects. While this is doable, having a stronger foundational knowledge of programming greatly accelerates the pace of deploying useful machine learning models. We see this in our interviewees – for example Jace Kohlmeier of Khan Academy and Kunal Punera of RelateIQ – emphasising that the data scientists they hire are well-versed enough to actually code the algorithms they prototype, instead of having to rely on another engineer to implement them and put them into production.
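To make the point about normally distributed errors concrete, here is a minimal sketch in Python using only the standard library. The data is simulated, and the 5% outlier rate and tenfold outlier scale are illustrative assumptions rather than figures from the interviews. It compares the excess kurtosis (roughly zero for a normal distribution, larger for heavy tails) of clean Gaussian errors against errors contaminated by occasional large outliers.

```python
import random


def excess_kurtosis(values):
    """Sample excess kurtosis: near 0 for normal data, larger for heavy tails."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 ** 2) - 3.0


def simulate_errors(n, heavy_tailed, rng):
    """Draw n simulated errors.

    Gaussian by default; if heavy_tailed is True, about 5% of the draws
    are scaled up tenfold (an illustrative assumption) to mimic outliers.
    """
    errors = []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        if heavy_tailed and rng.random() < 0.05:
            e *= 10.0
        errors.append(e)
    return errors


rng = random.Random(0)  # fixed seed so the run is repeatable
normal_errors = simulate_errors(5000, False, rng)
messy_errors = simulate_errors(5000, True, rng)

print("clean errors, excess kurtosis:", round(excess_kurtosis(normal_errors), 2))
print("messy errors, excess kurtosis:", round(excess_kurtosis(messy_errors), 2))
```

A value far from zero for the second set is a hint that a test relying on normal errors may give misleading results on that data.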
Data science programmes should accept that data science is conducted almost entirely in industry, and thus practical and empirical work needs to be the centrepiece of any advanced programme.
Simultaneously, the very landscape of education itself is changing.
There is a growing wealth of free resources online that help one study data science. One example we love to highlight is The Open Source Data Science Masters – made by one of our interviewees, Clare Corthell.
When building a programme or a course, the course instructor should take advantage of what an in-person classroom experience offers, such as group projects, regular feedback, and group presentations – and augment these things with what virtual resources are able to better provide.
DNA: While the education sector is gearing up to include this emerging field in their syllabus/ course offerings, industry/ public sector demand for data scientists continues to grow rapidly. In your observations, what are some of the stopgap measures being put in place to address this talent gap, and have they been successful?
DSH: Some stopgap measures include an increase in the credentialing of online resources (e.g., a Coursera certification for data science now exists). There is a tremendous wealth of data science resources available for free online, from books to MOOCs (massive open online courses) to data science communities. These have been remarkably successful in sharing data science knowledge.
Simultaneously, we also see internal human capital shifts – for example, engineers being asked to learn statistics, or statisticians being asked to learn programming.
The instances we have heard of, where the wrong people are brought in or are hired for the wrong job, mostly occur at bigger companies, where hiring decisions are made separately from the product owner who has a good understanding of the problem.
At startups and small companies, where headcount is a constraining factor, the demands of doing data science are usually concentrated in one or two people.
In an innovative way to solve this problem, some VC (venture capital) firms have begun to retain data science talent in-house for their portfolio companies.
At bigger companies, management that has been influenced by the media attention around data science might be susceptible to the idea that they need to hire a few astrophysics PhDs to help them figure out what to do about all their ‘big data.’
Another market dynamic is the rise of high-powered data science SaaS (Software-as-a-Service) companies that try to abstract away the messiness by creating software that makes high-dimensional, complex statistical analysis tractable for the layman. Ayasdi, which spans financial and healthcare verticals, is a good example of this.
One common misconception that might hold many people back is that because data science involves mathematics, it is hard to learn.
There are some great resources – like the O’Reilly Media book Machine Learning for Hackers, or the new book Data Smart: Using Data Science to Transform Information into Insight from one of our interviewees, John Foreman – that provide a surprisingly gentle and accessible introduction for curious laymen to the nuts-and-bolts of machine learning.
Finally, data science is an emerging field, with different definitions and conceptions. Unlike iOS engineering, where most people understand what the job entails, data science is still developing.
We are now seeing the emergence of ‘data engineer’ as a complementary new profession to data scientist.
DNA: What are some of the most common misconceptions about being a Data Scientist? What should young people who are thinking about pursuing this as a career be aware of?
DSH: [These are]:
- The importance of communication: Statistics and communications skills are the bread and butter of data science, but an undervalued skill that complements the ability to do data analysis is the ability to clearly communicate your findings. All of the pioneering data scientists we interviewed had this quality and recommended that we learn it. That doesn’t mean skimping on understanding the trade-offs involved in machine learning models, but it means that being able to share your findings in an articulate way is a multiplier on your impact.
- Data science is a team sport: One of the takeaways from talking to DJ [Patil], and many other data scientists, is the importance of being a team player. The data scientist is the feedback loop that helps initiate, iterate, and drive decisions in the company, and thus the data scientist will be working with product managers and people all across the company, convincing them and driving product and business decisions.
- Statistics is disproportionately useful: From the interviewees we talked to, it became clear that having a solid foundation in general mathematics, and then focusing on statistics, was really useful. You’ll need multivariable calculus, linear/ matrix algebra, optimisation, and differential equations to understand the deeper pieces of statistics and machine learning, and they will get you thinking in the right way. Applied math is a lot more directly applicable than pure math in data science. Statistics will guide you to understand uncertainty in data and pull valid insights from data.
- Data science is about the product: It's about quantifying what makes a product or a feature ‘good,’ it's about measuring ‘engagement’ and ‘quality.’ It's about figuring out what people want in a product (using data), how to improve it (using data), and how to measure the impact of any change (using data).