Data Science
The questions in this section were shown to developers involved in Business Intelligence, Data Analysis, Data Engineering, Machine Learning, or to those whose job role was Data Analyst / Data Engineer / Data Scientist or Business Analyst.
A considerable number of respondents seem to be juggling data science responsibilities alongside other activities. These findings suggest a democratization of the field is in progress, implying potential opportunities for data science market growth.
PyCharm
An all-in-one Python IDE for building data pipelines, analyzing data, prototyping, and deploying ML models with excellent support for Python, scientific libraries, interactive Jupyter notebooks, Anaconda, SQL and NoSQL databases, and more.
The majority of data science professionals find value in employing tried-and-true plots for data exploration and presentation. These types of charts are widely used in various data-related tasks such as data gathering, exploratory data analysis, data orchestration, and ML Ops.
Datalore
Datalore by JetBrains is a collaborative data science and analytics platform for teams, accessible right from the browser. Datalore notebooks are compatible with Jupyter and offer smart coding assistance for Python, SQL, R, and Scala notebooks, as well as no-code visualizations and data wrangling. Datalore’s Report builder allows teams to turn a notebook full of code and experiments into a clear, data-driven story. Teams can share notebooks, edit them together in real time, and organize their projects in workspaces.
Close to half of all teams and departments have a dedicated Data Engineer or Machine Learning Engineer.
Specialized roles like Data Scientist, Data Engineer, and Machine Learning Engineer are relatively recent additions to the job market. Many respondents transition into these roles from related fields, necessitating the acquisition of new skills through self-study or online courses.
While the majority of data science professionals do not version their notebooks, a substantial proportion (41%) opt to do so, and most of them choose Git or GitHub for versioning.
Various implementations of Jupyter notebooks are widely popular in data science, with common use cases including exploratory data analysis, experimenting with data and data querying, and model prototyping. Approximately 40% of data science professionals use Jupyter notebooks to present their work results, but, interestingly, many (almost 50%) spend only 10%–20% of their time using Jupyter notebooks.
Although the majority uses local files, the share of those using SQL databases grew by 10 percentage points over the past year, highlighting the importance of SQL for data science.
Most polled data scientists process custom-collected data, with the most prevalent data types being transactional data, time series data, images, and machine-generated data. Interestingly, 30% work with synthetic data – data manufactured artificially rather than generated by real-world events.
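To illustrate what "manufactured artificially" can mean in practice, here is a minimal sketch of generating a synthetic time series in Python. The function name `make_synthetic_series` and its parameters are hypothetical and chosen only for this example. The survey does not say how respondents produce their synthetic data, and real workflows typically rely on far more elaborate generators.

```python
import random


def make_synthetic_series(n_points, seed=0, trend=0.5, noise_scale=1.0):
    """Return an artificial time series: a linear trend plus Gaussian noise.

    Hypothetical helper for illustration only; no real-world events are involved.
    """
    rng = random.Random(seed)  # seeded so the same series can be regenerated
    return [trend * t + rng.gauss(0.0, noise_scale) for t in range(n_points)]


if __name__ == "__main__":
    series = make_synthetic_series(10)
    print(len(series))  # number of generated points
```

Seeding the generator keeps the data reproducible, which matters when synthetic samples are used to test a pipeline or compare models.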
Machine or deep learning models are trained by approximately 40% of all respondents. However, this figure jumps to more than 60% among those who consider data work their primary activity. This trend suggests that predictive modeling is becoming the central aspect of working with data.
While half of the data science professionals retrain or update their machine learning models at least once a month, most spend less than 20 hours per month on the task.
The majority (81%) of data science professionals use GPUs for model training. Efficient use of graphics processors can accelerate training and leave more time for experimentation, which in turn can improve model performance. This makes GPUs an increasingly attractive resource for researchers and data specialists, and it emphasizes the importance and relevance of technological innovations in the world of machine learning.
Higher computing power is a clear trend for machine learning tasks. Nearly 80% of data science professionals now use 16 GB or more VRAM, while the share of those using 8 GB decreased by six percentage points over the past year.
Core machine learning algorithms, like regression and tree-based methods, remain prevalent, though a significant number of data science professionals also embrace neural networks. The rising popularity and user-friendliness of transformer networks might explain why 30% of the respondents engage in NLP work. Interestingly, only 24% of participants reported using statistical testing in their work, which may indicate that machine and deep learning have surpassed classical statistics as fundamental data skills.
Amazon services stand out as the most popular enterprise cloud solutions. Remarkably, there has been a significant increase (of over 10 percentage points) in the adoption of enterprise machine learning solutions compared to the previous year.
TensorFlow edges slightly ahead of scikit-learn and PyTorch in popularity, with Keras and XGBoost also showing solid adoption rates. Interestingly, a significant proportion of respondents (19%) reported not using any specific framework.
TensorBoard is the most commonly used experiment tracking tool, with a 23% share, followed by MLflow with 10% and WandB with 7%. However, two-thirds of data science professionals aren’t using any specific tools for tracking their model training experiments.
Machine learning and AI have become crucial components of daily business life, so it should come as no surprise that almost half of our respondents use various AI-based features integrated into the software they use.
Data quality is a typical issue for professionals and organizations that work with data, as nearly 50% dedicate 30% of their time or more to data preparation. An Anaconda study also confirms that data cleaning is emerging as the most time-consuming aspect of data professionals’ workflow. Almost half of our respondents opt for Integrated Development Environments (IDEs) to handle these types of tasks.
Thank you for your time!
We hope you found our report useful. Share this report with your friends and colleagues.
If you have any questions or suggestions, please contact us at surveys@jetbrains.com.