PyCharm
 

Spark monitoring

Last modified: 17 June 2024

With the Spark plugin, you can monitor your Spark cluster and submitted jobs right in the IDE.

In this chapter:

  1. Establish a connection to a Spark server from scratch

    Note: In addition to creating a connection manually, you can also quickly create a connection from an AWS EMR cluster if you have Spark running on it.

  2. Establish a connection to a Spark server from a Zeppelin notebook

  3. View job graphs

  4. Filter out the monitoring data

Once you have established a connection to the Spark server, the Spark monitoring tool window appears.

Spark monitoring: jobs

At any time, you can open the connection settings in one of the following ways:

  • Open the IDE settings (Ctrl+Alt+S) and go to the Tools | Big Data Tools Settings page.

  • Open the Big Data Tools tool window (View | Tool Windows | Big Data Tools), select a Spark connection, and click Connection settings.

  • Click Connection Settings in any tab of the Spark monitoring tool window.

When you select an application in the Spark monitoring tool window, you can use the following tabs to monitor data:

  • Info: high-level information on the submitted application, such as App id or Attempt id.

  • Jobs: a summary of the application jobs. Click a job to see more details on it. Use the Visualization tab to view the job DAG.

  • Stages: details of each stage.

  • Environment: the values for the environment and configuration variables.

  • Executors: the processes launched for an application that run tasks and keep data in memory or disk storage across them. Use the Logs tab to view the executor stdout and stderr logs.

  • Storage: persisted RDDs and DataFrames.

  • SQL: details about the execution of SQL queries (if the application uses them).

You can also preview information on Tasks, which are units of work that are sent to one executor.

For more information about these types of data, refer to the Spark documentation.
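
If you want to inspect similar data from a script, Spark also exposes a monitoring REST API under /api/v1. The sketch below maps the tab names listed above to paths in that API. Whether the plugin itself reads from these endpoints is an assumption, and the names TAB_PATHS, tab_url, and fetch_tab, as well as the host and application ID in the usage note, are hypothetical.

```python
import json
from urllib.request import urlopen

# Tab name -> path below /api/v1/applications/<app-id> in Spark's monitoring REST API.
# An empty path means the application resource itself (ID, attempts, and so on).
TAB_PATHS = {
    "Info": "",
    "Jobs": "jobs",
    "Stages": "stages",
    "Environment": "environment",
    "Executors": "executors",
    "Storage": "storage/rdd",
    "SQL": "sql",  # only available in recent Spark versions
}


def tab_url(base_url, app_id, tab):
    """Build the REST URL whose data is comparable to the given tab."""
    url = f"{base_url.rstrip('/')}/api/v1/applications/{app_id}"
    path = TAB_PATHS[tab]
    return f"{url}/{path}" if path else url


def fetch_tab(base_url, app_id, tab, timeout=10):
    """Fetch and decode the JSON for one tab.

    Needs a reachable Spark UI or History Server at base_url.
    """
    with urlopen(tab_url(base_url, app_id, tab), timeout=timeout) as response:
        return json.load(response)
```

For example, tab_url("http://localhost:4040", "app-123", "Jobs") builds http://localhost:4040/api/v1/applications/app-123/jobs. By default, a running application serves its UI on port 4040 and the History Server listens on port 18080; adjust the base URL to your setup.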

At any time, you can click Refresh in the Spark monitoring tool window to manually refresh the monitoring data. Alternatively, you can configure automatic updates at a set time interval using the list next to the Refresh button.
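
The same idea of refreshing at a fixed interval can be sketched in a script. This is a minimal illustration, not the plugin's implementation: auto_refresh and its parameters are hypothetical names, and fetch stands for any callable that returns the current monitoring data.

```python
import time


def auto_refresh(fetch, interval_seconds, rounds, sleep=time.sleep):
    """Call fetch() `rounds` times, pausing interval_seconds between calls.

    The sleep function is injectable so the loop can be exercised without waiting.
    Returns the list of fetched snapshots, oldest first.
    """
    snapshots = []
    for i in range(rounds):
        snapshots.append(fetch())
        # No pause is needed after the final fetch.
        if i < rounds - 1:
            sleep(interval_seconds)
    return snapshots
```

A shorter interval shows changes sooner but puts more load on the Spark server, which is the same trade-off you make when choosing a value in the list next to the Refresh button.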