Data Science

The questions in this section were shown to developers involved in Business Intelligence, Data Analysis, Data Engineering, Machine Learning, or to those whose job role was Data Analyst / Data Engineer / Data Scientist or Business Analyst.

What kind of activity is data science, data analytics, data engineering, or machine learning for you?

Many respondents combine data science responsibilities with other activities. This suggests that the field has been democratized and that the market has room to grow.

Which of the following activities are you involved in?

At JetBrains, we created Datalore – a collaborative data science platform for teams. While providing an excellent coding experience for data professionals, Datalore also brings no-code automations for data exploration and visualization workflows. This means that even non-technical users can do ad hoc reporting and data visualization in the same tool as the core data team.

How did you learn data science, machine learning, or data engineering?

Dedicated data specialist positions, such as Data Scientist, Data Engineer, and Machine Learning Engineer, are relatively new. Many of our respondents transitioned into these roles after working or training in adjacent fields, and therefore needed to upskill through independent study or online courses. While postgraduate degrees in STEM have traditionally been the most common path into data science or machine learning, trends from the last seven years show that an increasing number of people working in these areas entered with a bachelor's degree, rising from 20% in 2015 to 31% in 2021. As the number of people graduating from new undergraduate programs specializing in data skills increases, we may see these results shifting in favor of people who obtained these skills through formal education.

Which IDEs or editors do you use for data science or data analytics?

Jupyter notebooks won out as the preferred editor for data science and data analytics work, with around 40% of the respondents indicating they used notebooks for these activities. This result was even higher among respondents who reported doing data gathering and visualization, exploratory data analysis, or machine learning modeling, with 70% reporting that they use Jupyter notebooks.

Learn more about this topic with our recent research. We found that from 2019 to 2020, the number of Python 3 notebooks grew by 87%, and the number of Python 2 notebooks increased by 12%.

How much of your working time is spent inside notebooks?

What do you use notebooks for?

Jupyter notebooks remain one of the most popular tools, used by 42% of the respondents, and more than 50% of those users cite data work as their main activity. Notebooks are used primarily for exploratory work, such as exploring data and creating model prototypes. However, even among those who work primarily as data specialists, only a minority of respondents use notebooks for more than 40% of their working time.

Do you version your notebooks?

What versioning tools do you use?

A large share of respondents version their notebooks. This is a good sign, as it indicates that many data professionals see notebooks as code that needs to be maintained. The most popular tools among those who version their notebooks are Git and GitHub.

Versioning Jupyter notebooks via the Git command-line interface (CLI) can be tough. Luckily, DataSpell has a rich range of features for working with Git, making it easy to perform core tasks through the UI, such as setting up a repo, adding and pushing notebooks, and viewing differences between commits of notebooks – all without having to remember a single Git command! Check out this article to learn more about how to use Git with Jupyter notebooks in DataSpell.
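
One reason notebook diffs are hard to read on the command line is that a notebook file is JSON that stores cell outputs and execution counters alongside the code. The sketch below is only an illustration of that point, not a DataSpell feature or a recommended workflow: it assumes the standard nbformat 4 layout, and the strip_outputs helper and the example notebook are hypothetical names made up for this example.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code-cell outputs cleared."""
    # Deep copy via a JSON round trip so the caller's notebook is left untouched.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results
            cell["execution_count"] = None  # drop the run counter
    return cleaned


# A tiny hand-written notebook used purely for illustration.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

cleaned = strip_outputs(example)
# Stable key order keeps the serialized form consistent between commits.
print(json.dumps(cleaned, indent=1, sort_keys=True))
```

Committing the cleaned form means a diff between two commits shows changes to the code and text only, at the cost of not keeping the outputs in version control.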

What types of data sources do you work with?

Besides local files, SQL databases remain the most commonly used data sources among data specialists.

What tools do you use to present the results of your research?

With Datalore you can turn Jupyter notebooks into beautiful data apps in seconds. Arrange the cells on the canvas and publish the result in Static or Interactive mode. Your stakeholders will be able to access the report via a link.

What sorts of methods and algorithms do you use?

Core machine learning algorithms, such as regression and tree-based methods, continue to be used widely. However, the majority of respondents also use neural networks, especially transformer architectures. The increasing ease of use and growing popularity of transformer networks may also explain why over a quarter of the respondents reported doing NLP work. Interestingly, only a fifth of the respondents reported using statistical testing as part of their work, suggesting that machine learning and deep learning have overtaken classical statistics as a core data skill.

Which machine learning frameworks do you use?

TensorFlow was the most popular deep learning framework among all respondents, although it and PyTorch were equally used by respondents who do data work as a main activity. Scikit-learn was the most popular machine learning library, although specialist packages and frameworks for tree-based modeling, such as XGBoost and LightGBM, were used by a notable minority of participants.

Which enterprise machine learning solutions do you use?

Amazon services are the most popular enterprise cloud solutions.

Including yourself, how many members does your data team have?

The majority of the respondents – 70% – work on teams of no more than 10 people. One in five works on a team with more than 15 data specialists.

Does your team or data department have a dedicated Data Engineer role?

Almost 50% of the teams or departments have a dedicated Data Engineer role.

Does your team or data department have a dedicated Machine Learning Engineer role?

Just over 50% of the respondents reported that their teams have either dedicated data engineers or dedicated machine learning engineers. Both Data Engineer and ML Engineer are broad titles that can vary widely depending on the company, so it’s possible that people in either of these roles are responsible for similar tasks related to machine learning, such as model deployment and data pipeline management. Unsurprisingly, the larger a team is, the more likely it is to have people working in one of these roles. Over 80% of respondents on data teams with 1–2 members had neither a dedicated data engineer nor an ML engineer, whereas 79% of respondents on data teams with more than 15 people had dedicated data engineers, and 65% had dedicated ML engineers.

Do you train machine learning or deep learning models?

Just under half of respondents train machine or deep learning models, with this figure rising to 60% among those who perform data work as their main activity. This suggests that predictive modeling is becoming a core component of data work in the industry.

Do you use GPUs to train your models?

How much VRAM do you usually need for your machine learning tasks?

Most respondents indicated that they use GPUs to train their machine learning or deep learning models. VRAM needs differed depending on how the respondents do data work. Among those who do data work as a hobby or for educational purposes, 40% indicated that 8 GB was sufficient, compared to only 18% of those who do data work as a main work activity.

How much time do you spend monthly on model training?

Most respondents indicated that they spend up to 20 hours a week training models, which may include the time that the models spend training overnight. Almost a third spend 5 hours a week or less training models. This is consistent with previous results showing that model training forms a relatively small part of data science work, with the majority of time being spent on data preparation and exploration.

What sorts of computational resources do you use for data science tasks?

The majority of the respondents use local resources to complete their data science work. This is consistent with other answers in our survey, which show that the main activity done in notebooks is data exploration and visualization and that the majority of respondents work with local files. Surprisingly, this did not differ much depending on how the respondent does data work. People who do data work as their main activity were as likely to use local resources as those who do it as a hobby or for educational purposes.

Which specific tools do you use for tracking model training experiments?

The majority of the respondents said they do not use any tools to track the performance of their model training experiments. However, the use of such tools was much more likely on data teams comprising 15 or more people (58% of respondents from such teams use at least one), when the team has a dedicated Machine Learning Engineer (62%), or when the respondent was involved in machine learning modeling and ML Ops work (63%). This indicates that this sort of tooling tends to be used in environments where there is specialist knowledge pertaining to machine-learning-model development.
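
For readers unfamiliar with what such tooling records, here is a minimal sketch of the core idea: storing the parameters and metrics of each training run so that runs can be compared later. It is not the API of any tool named in the survey. The log_run and best_run helpers, the runs.jsonl file name, and all parameter and metric values are hypothetical placeholders made up for this example.

```python
import json
import tempfile
from pathlib import Path


def log_run(log_path, params, metrics):
    """Append one training run (parameters plus metrics) as a JSON line."""
    record = {"params": params, "metrics": metrics}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def best_run(log_path, metric):
    """Return the logged run with the highest value of the given metric."""
    with open(log_path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])


# Placeholder values for illustration only; they are not survey results.
log_file = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(log_file, {"learning_rate": 0.1, "max_depth": 3}, {"accuracy": 0.81})
log_run(log_file, {"learning_rate": 0.05, "max_depth": 5}, {"accuracy": 0.86})

best = best_run(log_file, "accuracy")
print(best["params"])
```

Dedicated tracking tools add much more on top of this, such as dashboards and artifact storage, which may be part of why they show up mainly on larger teams with specialist roles.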

What charts do you mostly use for data visualizations?

The majority of data specialists use simple but meaningful plots for exploring and presenting data. This held regardless of the type of data activities the respondents were involved in, from data gathering and exploratory data analysis to data orchestration and ML Ops.

Data Science: 2022

Thank you for your time!

We hope you found our report useful. Share this report with your friends and colleagues.

If you have any questions or suggestions, please contact us at surveys@jetbrains.com.