Run applications with Spark Submit
With the Big Data Tools plugin, you can execute applications on Spark clusters. PyCharm provides run/debug configurations to run the spark-submit script in Spark’s bin directory. You can execute an application locally or using an SSH configuration.
Run an application with the Spark Submit configurations
Prepare an application to run. It can be a jar or py file.
Select Add Configuration in the list of run/debug configurations.
If you have already created any run/debug configuration, select Edit configurations from the list.
Click the Add New Configuration button.
Select the local or SSH variant of the Spark Submit configuration from the list of available configurations.
Fill in the configuration parameters:
Mandatory parameters (local run):
Spark home: a path to the Spark installation directory.
Application: a path to the executable file.
Main class: the name of the main class. Select it from the list.
Name: a name to distinguish between run/debug configurations.
Allow parallel run: select to allow running multiple instances of this run configuration in parallel.
Store as project file: save the file with the run configuration settings to share it with other team members. The default location is .idea/runConfigurations. However, if you do not want to share the .idea directory, you can save the configuration to any other directory within the project.
Run arguments: additional arguments of the spark-submit command, for example, --executor-memory or --total-executor-cores.
Cluster manager: select the management method to run an application on a cluster. The SparkContext can connect to several types of cluster managers (Spark’s own standalone cluster manager, Mesos, or YARN). See more details in the Cluster Mode Overview.
Master: the format of the master URL passed to Spark.
Proxy user: the name of the user to impersonate for the Spark connection.
Specify Shell options if you want to execute any scripts before the Spark submit.
Enter the path to bash and specify the script to be executed. It is recommended to provide an absolute path to the script.
Select the Interactive checkbox if you want to launch the script in the interactive mode. You can also specify environment variables, for example, USER=jetbrains.
Before launch: in this area, you can specify tasks that must be performed before starting the selected run/debug configuration. The tasks are performed in the order they appear in the list.
Show this page: select this checkbox to show the run/debug configuration settings prior to actually starting the run/debug configuration.
Activate tool window: by default this checkbox is selected and the Run tool window opens when you start the run/debug configuration.
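The parameters above correspond roughly to a spark-submit command line. The following is a minimal sketch of that mapping, not the plugin's actual implementation: the function name build_spark_submit and the paths are hypothetical, while --class, --master, --proxy-user, and --executor-memory are standard spark-submit options.

```python
import shlex
from pathlib import PurePosixPath


def build_spark_submit(spark_home, application, main_class=None,
                       master=None, proxy_user=None, run_arguments=()):
    """Assemble a spark-submit command line as a list of arguments."""
    # Spark home: the script lives in the bin directory of the installation.
    command = [str(PurePosixPath(spark_home) / "bin" / "spark-submit")]
    if main_class:
        # Main class: only meaningful for jar applications.
        command += ["--class", main_class]
    if master:
        command += ["--master", master]
    if proxy_user:
        command += ["--proxy-user", proxy_user]
    # Run arguments: passed through unchanged.
    command += list(run_arguments)
    # Application: spark-submit expects it after the options.
    command.append(application)
    return command


command = build_spark_submit(
    spark_home="/opt/spark",                      # hypothetical path
    application="/home/user/jobs/wordcount.py",   # hypothetical path
    master="local[*]",
    run_arguments=["--executor-memory", "2g"],
)
# shlex.join quotes arguments such as local[*] for safe use in a shell.
print(shlex.join(command))
```

Building the command as a list and quoting it only for display avoids shell interpretation of characters such as the brackets in local[*].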
Click Add options and select an option to add to your configuration:
Spark Configuration: Spark configuration options available through a properties file or a list of properties.
Dependencies: files and archives (jars) that are required for the application to be executed.
Maven: Maven-specific dependencies. You can add repositories or exclude some packages from the execution context.
Driver: Spark Driver settings, such as memory, CPU, local driver libraries, Java options, and a class path.
Executor: Executor settings, such as memory, CPU, and archives.
Spark Monitoring Integration: ability to monitor the execution of your application with Spark Monitoring.
Kerberos: settings for establishing a secured connection with Kerberos.
Logging: an option to print debug logging.
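As a rough illustration, the additional options listed above resemble the following standard spark-submit flags. This is a sketch under the assumption that each option group maps to the flags shown; the helper name option_flags and the sample values are hypothetical, and the exact command the IDE generates may differ.

```python
def option_flags(properties_file=None, conf=None, jars=(), py_files=(),
                 packages=(), exclude_packages=(), repositories=(),
                 driver_memory=None, executor_memory=None, executor_cores=None,
                 principal=None, keytab=None, verbose=False):
    """Translate optional settings into spark-submit flags."""
    flags = []
    # Spark Configuration: a properties file or individual properties.
    if properties_file:
        flags += ["--properties-file", properties_file]
    for key, value in (conf or {}).items():
        flags += ["--conf", f"{key}={value}"]
    # Dependencies and Maven: list-valued flags take comma-separated values.
    for flag, values in (("--jars", jars),
                         ("--py-files", py_files),
                         ("--packages", packages),
                         ("--exclude-packages", exclude_packages),
                         ("--repositories", repositories)):
        if values:
            flags += [flag, ",".join(values)]
    # Driver, Executor, and Kerberos: single-valued flags.
    for flag, value in (("--driver-memory", driver_memory),
                        ("--executor-memory", executor_memory),
                        ("--executor-cores", executor_cores),
                        ("--principal", principal),
                        ("--keytab", keytab)):
        if value:
            flags += [flag, str(value)]
    # Logging: print additional debug output.
    if verbose:
        flags.append("--verbose")
    return flags


flags = option_flags(
    conf={"spark.eventLog.enabled": "true"},
    jars=["libs/a.jar", "libs/b.jar"],   # hypothetical files
    driver_memory="1g",
    verbose=True,
)
print(flags)
```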
Mandatory parameters (SSH run):
SSH configuration: click ... and create a new SSH configuration. Specify the URL of the remote host with the Spark cluster and the user's credentials to access it. Then click Test Connection to ensure you can connect to the remote server.
Target directory: the directory on the remote host to upload the executable files.
Spark home: a path to the Spark installation directory.
Application: a path to the executable file.
Main class: the name of the main class. Select it from the list.
Name: a name to distinguish between run/debug configurations.
Allow parallel run: select to allow running multiple instances of this run configuration in parallel.
Store as project file: save the file with the run configuration settings to share it with other team members. The default location is .idea/runConfigurations. However, if you do not want to share the .idea directory, you can save the configuration to any other directory within the project.
Run arguments: additional arguments of the spark-submit command, for example, --executor-memory or --total-executor-cores.
Cluster manager: select the management method to run an application on a cluster. The SparkContext can connect to several types of cluster managers (Spark’s own standalone cluster manager, Mesos, or YARN). See more details in the Cluster Mode Overview.
Master: the format of the master URL passed to Spark.
Proxy user: the name of the user to impersonate for the Spark connection.
Specify Shell options if you want to execute any scripts before the Spark submit.
Enter the path to bash and specify the script to be executed. It is recommended to provide an absolute path to the script.
Select the Interactive checkbox if you want to launch the script in the interactive mode. You can also specify environment variables, for example, USER=jetbrains.
Before launch: in this area, you can specify tasks that must be performed before starting the selected run/debug configuration. The tasks are performed in the order they appear in the list.
Show this page: select this checkbox to show the run/debug configuration settings prior to actually starting the run/debug configuration.
Activate tool window: by default this checkbox is selected and the Run tool window opens when you start the run/debug configuration.
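The value of the Master field depends on the selected cluster manager. The sketch below classifies a master URL by the forms that Spark documents for local runs, the standalone manager, Mesos, and YARN. The function name cluster_manager_for and the host names are hypothetical, and other forms that Spark may accept are deliberately not covered.

```python
import re


def cluster_manager_for(master_url):
    """Return the kind of cluster manager a master URL refers to."""
    # local, local[K], local[*], or local[K,F] run Spark on this machine.
    if re.fullmatch(r"local(\[(\*|\d+)(,\s*\d+)?\])?", master_url):
        return "local"
    # spark://HOST:PORT addresses Spark's own standalone cluster manager.
    if master_url.startswith("spark://"):
        return "standalone"
    # mesos://HOST:PORT addresses a Mesos cluster.
    if master_url.startswith("mesos://"):
        return "mesos"
    # For YARN the cluster location comes from the Hadoop configuration.
    if master_url == "yarn":
        return "yarn"
    raise ValueError(f"Unrecognized master URL: {master_url!r}")


for url in ("local[*]", "spark://master.example.com:7077", "yarn"):
    print(url, "->", cluster_manager_for(url))
```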
Click Add options and select an option to add to your configuration:
Spark Configuration: Spark configuration options available through a properties file or a list of properties.
Dependencies: files and archives (jars) that are required for the application to be executed.
Maven: Maven-specific dependencies. You can add repositories or exclude some packages from the execution context.
Driver: Spark Driver settings, such as memory, CPU, local driver libraries, Java options, and a class path.
Executor: Executor settings, such as memory, CPU, and archives.
Spark Monitoring Integration: ability to monitor the execution of your application with Spark Monitoring.
Kerberos: settings for establishing a secured connection with Kerberos.
Logging: an option to print debug logging.
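Conceptually, an SSH run uploads the executable to the target directory on the remote host and then invokes spark-submit there. The following sketch shows one way to express those two steps with the standard scp and ssh commands. It is an assumption about the workflow, not a description of how the plugin works internally, and the host, user, and paths are hypothetical.

```python
import shlex
from pathlib import PurePosixPath


def remote_submit_commands(host, user, target_dir, spark_home,
                           local_app, master):
    """Return the upload and run commands for a remote spark-submit."""
    # Target directory: the uploaded file keeps its name on the remote host.
    # The local path is assumed to use forward slashes.
    remote_app = PurePosixPath(target_dir) / PurePosixPath(local_app).name
    upload = ["scp", local_app, f"{user}@{host}:{remote_app}"]
    # Spark home: here it refers to the installation on the remote host.
    submit = [str(PurePosixPath(spark_home) / "bin" / "spark-submit"),
              "--master", master, str(remote_app)]
    # ssh takes the remote command as a single, shell-quoted string.
    run = ["ssh", f"{user}@{host}", shlex.join(submit)]
    return upload, run


upload, run = remote_submit_commands(
    host="spark.example.com",        # hypothetical host
    user="user",                     # hypothetical account
    target_dir="/tmp/jobs",
    spark_home="/opt/spark",
    local_app="build/wordcount.py",
    master="yarn",
)
print(shlex.join(upload))
print(shlex.join(run))
```

The commands are only constructed and printed here; passing each list to subprocess.run would execute them.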
Click OK to save the configuration. Then select the configuration from the list of created configurations and run it.
Inspect the execution results in the Run tool window.