Big Data


The questions in this section were shown to developers involved in Data Analysis, Data Engineering, Machine Learning, or to those whose job role was Data Analyst / Data Engineer / Data Scientist. This survey was targeted specifically at developers, so the results may not be representative of the wider big data audience.

Which of these batch processing tools do you use?

Spark: 31%
Hadoop MapReduce: 16%
Hive: 13%
Dask: 7%
Pig: 3%
Tez: 1%
Other: 3%
None: 56%

Which of these streaming processing tools do you use?

Spark Streaming: 20%
Flink: 8%
Storm: 6%
Dask: 5%
Beam: 4%
Apache NiFi: 3%
Samza: 2%
Other: 3%
None: 65%

Professionals who are not involved in data pipeline creation tend to use traditional relational databases for building data lakes. Spark remains the most popular tool for both batch processing (Spark, 31%) and stream processing (Spark Streaming, 20%).

Which of these orchestration tools do you use?

Airflow: 22%
Custom or self-made: 10%
Apache NiFi: 6%
Apache Oozie: 6%
Prefect: 3%
Luigi: 2%
Dagster: 2%
Other: 5%
None: 59%

Quite predictably, Apache Airflow is the most popular orchestration tool (22%), especially among data engineers. Interestingly, 10% of respondents use custom or self-made orchestration tools.

Which of these tools do you use for Spark execution?

Kubernetes: 37%
YARN: 30%
Amazon EMR: 27%
Google DataProc: 11%
Azure HDInsight: 9%
Mesos: 5%
Nomad: 5%
DataBricks: 5%
AWS Glue: 3%
Other: 2%
None: 13%

Kubernetes (37%), YARN (30%), and Amazon EMR (27%) are the most popular options for Spark execution.

Which of these tools do you use for building data lakes?

Traditional relational DB: 24%
Delta Lake: 15%
MPP: 6%
Iceberg: 4%
Hudi: 3%
Other: 7%
None: 54%

Which of these MPP tools do you use?

BigQuery: 15%
Redshift: 13%
Azure SQL Data Warehouse: 11%
Azure Data Explorer: 9%
ClickHouse: 5%
Greenplum: 3%
Spanner: 3%
Other: 4%
None: 61%

Most respondents (61%) do not use MPP tools. Among those who do, BigQuery (15%), Redshift (13%), and Azure SQL Data Warehouse (11%) are the most popular.

Do you work with message brokers or message queues (e.g. Kafka, RabbitMQ)?

Which of these tools do you use for messaging and delivery?

RabbitMQ: 49%
Kafka: 42%
Amazon SQS: 20%
ActiveMQ: 9%
RocketMQ: 7%
Azure Event Hub: 5%
Amazon Kinesis: 4%

Big Data: 2022

Thank you for your time!

We hope you found our report useful. Share this report with your friends and colleagues.

If you have any questions or suggestions, please contact us at surveys@jetbrains.com.