A bit more than a year ago, I took the plunge and left my academic job to try my luck as a corporate data scientist, first at IBM (obviously a very big company) and now at Peltarion (a startup which I still want to call small although it is growing rapidly). I’d like to share some of my experiences and impressions of the different roles I’ve been in, as I have come to recognize that data science work differs greatly depending on the nature of the role and the company.
So, without further ado, I present my last three data science positions:
1. Bioinformatics scientist at Stockholm University
I was working as a senior bioinformatician at SciLifeLab/Stockholm University in different capacities for seven years. At first I was hired as a general bioinformatics go-to person in a core facility that performs DNA sequencing, and I was involved in a lot of different things: setting up data pipelines, deciding on quality control routines, assessing what had gone wrong, delivering data to and communicating with customers, performing routine or custom analysis, and sometimes doing some actual research and writing papers. After a while, I moved into a different role where my job was more explicitly to help researchers with data analysis, statistics and programming – more research-oriented and long-term work. In a way, I was an academic data science consultant. Of course, we didn’t really call it “data science” because we were doing science, plain and simple. But in terms of what we did all day, it was in many ways similar to “data science” in industry.
Characteristics of data science in an academic (biology) setting
- The final product is almost always a paper. This has some positive and negative implications. On the positive side, there is a strong focus on reproducibility. On the flip side, there is almost no emphasis on putting predictive models into production or making them easily usable. Code quality can also be spotty as a result.
- Bioinformatics data scientists tend to be good at data visualization, often in R or Python. They understand the concept of batch effects (drift in distribution parameters) and are good at dealing with high-dimensional data where the number of examples is usually much smaller than the number of dimensions (n << p), for example datasets with measurements of 20,000 genes for 20 different individuals. This makes it necessary for bioinformatics data scientists to be familiar with dimensionality reduction and multivariate methods such as PCA, PLS, t-SNE and so on.
- They like to use notebooks (Jupyter or R Markdown) to communicate analyses, because these have a similar structure to scientific presentations or manuscripts.
- They often like to use pipelining tools such as Nextflow, Snakemake or Bpipe to chain operations together.
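To make the n << p point in the list above a bit more concrete, here is a minimal sketch of how the first principal component can be computed when there are far fewer samples than features. It uses only the Python standard library and works on the small n x n Gram matrix rather than the p x p covariance matrix. The function names, the toy data and the use of power iteration are assumptions made for illustration; in real work a library implementation would normally be used.

```python
import math
import random


def center_columns(X):
    """Subtract each feature's mean so PCA describes variation around the average sample."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def first_pc_scores(X, n_iter=200, seed=0):
    """Return (scores, explained_fraction) for the first principal component.

    Uses the n x n Gram matrix of the centered data, which is the cheaper
    route when the number of samples n is much smaller than the number of
    features p.
    """
    Xc = center_columns(X)
    n = len(Xc)
    gram = [[sum(a * b for a, b in zip(Xc[i], Xc[k])) for k in range(n)]
            for i in range(n)]

    # Power iteration: repeatedly multiply by the Gram matrix and renormalize
    # to converge on its leading eigenvector.
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        v = [sum(gram[i][k] * u[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]

    # Rayleigh quotient gives the leading eigenvalue (u has unit length).
    eigenvalue = sum(u[i] * sum(gram[i][k] * u[k] for k in range(n))
                     for i in range(n))
    eigenvalue = max(eigenvalue, 0.0)
    total = sum(gram[i][i] for i in range(n))

    # Sample scores on PC1 are the eigenvector scaled by the singular value.
    scores = [math.sqrt(eigenvalue) * x for x in u]
    return scores, eigenvalue / total


def make_toy_data(n_per_group=3, p=100, n_shifted=20, shift=6.0, seed=1):
    """Hypothetical data: two groups of samples that differ in a subset of features."""
    rng = random.Random(seed)
    X = []
    for group in (0, 1):
        for _ in range(n_per_group):
            row = [rng.gauss(0.0, 1.0) for _ in range(p)]
            if group == 1:
                for j in range(n_shifted):
                    row[j] += shift
            X.append(row)
    return X


X = make_toy_data()  # 6 samples, 100 features
scores, explained = first_pc_scores(X)
print([round(s, 1) for s in scores], round(explained, 2))
```

In this toy setting the two groups end up on opposite sides of the first component, which is the kind of global view of a dataset that such methods are typically used for. The sign of the component is arbitrary.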
During this time, I was also consulting part time for a few startups. From one of these gigs, I learned to build very complex processing pipelines with Snakemake. From another, I learned to build obscure functionality for web applications in Shiny. These are both tools that fit naturally into a bioinformatician’s mindset. For yet another customer, I suggested a way to use PCA and MDS to view their data from a global point of view which they had not considered, guiding them onto a path that eventually resulted in a model to accelerate business-to-business sales, which is featured in this Medium article.
2. Senior data scientist at IBM
After having been an academic consultant for quite a while, I decided to try to be a corporate one for a change. I got a position at IBM’s consulting arm, Global Business Services, in Kista outside of Stockholm. During my time there I participated in a handful of projects mostly related to the manufacturing industry. Conveniently, the knowledge of high-dimensional data that I had from biology stood me in good stead when working on these problems. It was not difficult to apply the skills I had obtained from academia in this setting.
Characteristics of data science in a big consulting setting
- Pragmatism is the word to summarize data science in “big consulting” environments. There isn’t time to think through every wrinkle of a problem as there is (albeit in theory only) in academia. The end goal is specified by a contract which you try to fulfill as closely as possible within the allotted amount of time. Notably, your task is not to do as much as possible but to do exactly what has been agreed upon. There is almost always a trade-off between time and model performance.
- Consultants are good at giving effective presentations. One of the first things I learned was to completely rework the way I had done presentations in academia to more clearly highlight the important findings and tailor them for the management level in companies. Communication is a very important skill for a data science consultant; maybe the most important one.
- As in academia, there is not much emphasis on productization, because that part will typically be handled by a software engineering team that comes in after you have completed a proof of concept (PoC), so long as that PoC leads to a longer engagement. On the other hand, the IBM stack (see below) has good support for deploying models (e.g., via Node-RED).
- For a variety of reasons, some of which may be company-specific, my team made little use of code version control with GitHub. Since we mostly worked with short PoC projects, it was more of a priority to find a promising approach in the allotted time, after which the software engineering team would come in to build the final implementation. Also, some of my colleagues worked mainly with non-code tools such as SPSS Modeler, which has its own built-in version control. We ensured reproducibility mainly through the version control mechanism in Box, where we stored scripts, documentation and metadata.
- Automated data cleaning and model building (AutoML) are important in this setting because of the time constraints. Data cleaning can yield big “quick wins” but is tedious, and a lot can be gained by automating it, for example using packages such as vtreat for R. AutoML tools such as TPOT, auto-sklearn or H2O are helpful for rapidly finding a good-enough model.
- Feature importance or other types of model explanation are very important for communicating results to customers (also see below). Decision trees are still used surprisingly often, and for random forests and gradient boosting, there are feature importance measures and various tree-model explanation interfaces.
- It’s quite common to encounter projects with unbalanced data and to use tools like SMOTE, ADASYN or ROSE to do smart oversampling of the rare class(es). It is also not uncommon that some classes are so rare that one needs to pursue an anomaly detection approach rather than standard classification.
- In terms of tooling, there was a larger emphasis on using commercial products (preferably from the IBM ecosystem) such as SPSS Modeler rather than open-source programming languages. Naturally, one has to rapidly become conversant with IBM Bluemix (now called IBM Cloud) offerings and associated products in order to be an effective IBM consultant.
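The “smart oversampling” of rare classes mentioned in the list above can be sketched roughly as follows. This is a simplified, standard-library-only illustration of the core idea behind SMOTE (creating synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbours). It is not the API of any particular package, and the function name and parameters are hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority-class points by interpolation.

    minority: list of feature vectors (lists of floats) from the rare class.
    k: number of nearest minority neighbours to choose from.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")

    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen point, excluding itself.
        others = [pt for pt in minority if pt is not base]
        neighbours = sorted(others, key=lambda pt: math.dist(base, pt))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the line segment between the two points.
        gap = rng.random()
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Hypothetical rare-class examples with two features each.
rare = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_oversample(rare, n_new=5, k=2)
print(len(new_points))
```

Because every synthetic point lies between two existing rare-class examples, this adds variety compared with simply duplicating rows. Such oversampling should only be applied to the training data, never to the data used for evaluation.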
3. Data scientist at Peltarion
In the autumn of 2017, I got an offer from a deep learning company, Peltarion, that I had applied to before starting at IBM. I decided to take it on the strength of the skills of my new colleagues, many of whom I knew from the Stockholm AI and machine learning scene. As the company is a startup, I have worn many hats during the first six months, working on customer projects, writing documentation and blog posts, testing our deep learning platform, sitting together with beta testers, keeping an eye on competitors and so on.
Characteristics of data science in a startup setting
- There is more emphasis on software engineering practices than in academia or big consulting. Git and GitHub (or some equivalent) are not “nice-to-haves” but the core of the whole enterprise, and frequent pull requests and code reviews are much more common. Virtual environments and containers (e.g., Docker) are important (though also found in academic bioinformatics to a large extent).
- Data scientists in startups tend to think more about deployment and productization of models, because it hits closer to home (there often isn’t a supporting software engineering team to do that for the data scientists, or the startup is building its own deployment functionality, like we are at Peltarion).
- Startup data scientists tend to be more informed about the latest technical advances in machine learning. Consultants don’t have time to keep up as much (or to install and play with the latest tools) and academics are often more interested in keeping up with the latest scientific advances in their specific field rather than general machine learning news. It is also more important to keep track of competitors.
- Reproducibility is achieved by writing libraries rather than chaining together operations with pipelines. Continuous integration (CI), for example with Travis or Jenkins, is much more common than in academia, although it is starting to appear there as well. At Peltarion, CI is essential because we need to move fast and make every effort to minimize technical debt that could impact us in the future.
The role of the data scientist can differ considerably based on environment and industry, but it is essential to any organization that wants to use data to improve its business.
This article was based on this blog post and was originally posted on the blog Follow the Data.