Run applications with Spark Submit
With the Big Data Tools plugin, you can execute applications on Spark clusters. DataGrip provides run/debug configurations to run the spark-submit script in Spark’s bin directory. You can execute an application locally or using an SSH configuration.
Run an application with the Spark Submit configurations
Prepare an application to run.
Open the Plugin settings and install the FTP/SFTP/WebDAV Connectivity (formerly Remote Hosts Access) plugin.
From the main menu, open the run/debug configurations dialog. Alternatively, press Alt+Shift+F10, then 0.
Click the Add New Configuration button.
Select the local or the SSH Spark Submit configuration from the list of available configurations.
Fill in the configuration parameters:
Mandatory parameters (local configuration):
Spark home: a path to the Spark installation directory.
Application: a path to the executable file.
Class: the name of the main class of the jar archive. Select it from the list.
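As a rough illustration, the three mandatory fields correspond to a minimal spark-submit invocation. The Python sketch below only assembles the argument list. How the fields map to the command line is an assumption made here, not a description of what the IDE runs internally, and the installation path, jar path, and class name are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List


def minimal_submit_command(spark_home: str, application: str, main_class: str) -> List[str]:
    """Assemble a minimal spark-submit argument list from the mandatory fields."""
    # The spark-submit script lives in the bin directory of the Spark installation.
    script = PurePosixPath(spark_home) / "bin" / "spark-submit"
    # --class names the entry point; the application jar follows the options.
    return [str(script), "--class", main_class, application]


# Hypothetical values, for illustration only.
command = minimal_submit_command("/opt/spark", "build/libs/app.jar", "com.example.Main")
print(" ".join(command))
```

The command is kept as a list rather than a single string so that paths containing spaces need no shell quoting.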
Optional parameters:
Name: a name to distinguish between run/debug configurations.
Allow parallel run: select to allow running multiple instances of this run configuration in parallel.
Store as project file: save the file with the run configuration settings to share it with other team members. The default location is .idea/runConfigurations. However, if you do not want to share the .idea directory, you can save the configuration to any other directory within the project.
Run arguments: arguments passed to your application.
Cluster manager: select the management method to run an application on a cluster. The SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos, or YARN). See more details in the Cluster Mode Overview.
Master: the format of the master URL passed to Spark.
Before launch: in this area you can specify tasks that must be performed before starting the selected run/debug configuration. The tasks are performed in the order they appear in the list.
Show this page: select this checkbox to show the run/debug configuration settings prior to actually starting the run/debug configuration.
Activate tool window: by default this checkbox is selected and the Run tool window opens when you start the run/debug configuration.
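The Cluster manager and Master fields above together determine the master URL passed to Spark. The Python sketch below shows the URL formats Spark accepts for the cluster managers mentioned in this list, plus local mode. The function name and the example host names are hypothetical, and the default ports are Spark's documented defaults for standalone (7077) and Mesos (5050).

```python
from typing import Optional


def master_url(cluster_manager: str, host: Optional[str] = None, port: Optional[int] = None) -> str:
    """Return a Spark master URL for the given cluster manager."""
    manager = cluster_manager.lower()
    if manager == "local":
        # Run locally with as many worker threads as there are logical cores.
        return "local[*]"
    if manager == "yarn":
        # The resource manager address is read from the Hadoop configuration,
        # so the master URL is just the literal string "yarn".
        return "yarn"
    schemes = {"standalone": ("spark", 7077), "mesos": ("mesos", 5050)}
    if manager not in schemes:
        raise ValueError(f"unsupported cluster manager: {cluster_manager}")
    if not host:
        raise ValueError(f"{cluster_manager} requires a host name")
    scheme, default_port = schemes[manager]
    return f"{scheme}://{host}:{port or default_port}"


# Hypothetical hosts, for illustration only.
print(master_url("standalone", "spark-master.example.com"))
print(master_url("mesos", "10.0.0.5", 5051))
print(master_url("yarn"))
```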
Click Add options and select an option to add to your configuration:
Spark Configuration: Spark configuration options available through a properties file or a list of properties.
Dependencies: files and archives (jars) that are required for the application to be executed.
Maven: Maven-specific dependencies. You can add repositories or exclude some packages from the execution context.
Driver: Spark driver settings, such as the amount of memory to use for the driver process. In cluster mode, you can also specify the number of cores.
Executor: executor settings, such as the amount of memory and the number of cores.
Spark Monitoring Integration: ability to monitor the execution of your application with Spark Monitoring.
Kerberos: settings for establishing a secured connection with Kerberos.
Shell options: select if you want to execute any scripts before the Spark submit.
Enter the path to bash and specify the script to be executed. It is recommended to provide an absolute path to the script.
Select the Interactive checkbox if you want to launch the script in the interactive mode. You can also specify environment variables, for example, USER=jetbrains.
Advanced Submit Options:
Proxy user: the username to impersonate when submitting the application (the spark-submit --proxy-user option).
Driver Java options, Driver library path, and Driver class path: add additional driver options. For more details, refer to Runtime Environment.
Archives: comma-separated list of archives to be extracted into the working directory of each executor.
Print additional debug output: run spark-submit with the --verbose option to print debugging information.
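Most of the additional options listed above have a counterpart among the spark-submit command-line flags. The Python sketch below shows one plausible mapping. The flags themselves are standard spark-submit options, but the function, its parameter names, and the pairing with the dialog fields are assumptions made for illustration, as are the example property, jar path, and memory sizes.

```python
from typing import Dict, List, Optional, Sequence


def submit_options(
    conf: Optional[Dict[str, str]] = None,
    jars: Sequence[str] = (),
    packages: Sequence[str] = (),
    driver_memory: Optional[str] = None,
    executor_memory: Optional[str] = None,
    executor_cores: Optional[int] = None,
    proxy_user: Optional[str] = None,
    archives: Sequence[str] = (),
    verbose: bool = False,
) -> List[str]:
    """Translate optional settings into spark-submit flags."""
    args: List[str] = []
    # Spark Configuration: one --conf flag per property.
    for key, value in (conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    # Dependencies and Maven: comma-separated lists.
    if jars:
        args += ["--jars", ",".join(jars)]
    if packages:
        args += ["--packages", ",".join(packages)]
    # Driver and Executor resources.
    if driver_memory:
        args += ["--driver-memory", driver_memory]
    if executor_memory:
        args += ["--executor-memory", executor_memory]
    if executor_cores is not None:
        args += ["--executor-cores", str(executor_cores)]
    # Advanced Submit Options.
    if proxy_user:
        args += ["--proxy-user", proxy_user]
    if archives:
        args += ["--archives", ",".join(archives)]
    if verbose:
        args.append("--verbose")
    return args


# Hypothetical values, for illustration only.
options = submit_options(
    conf={"spark.sql.shuffle.partitions": "64"},
    jars=["libs/udfs.jar"],
    driver_memory="2g",
    executor_memory="4g",
    executor_cores=2,
    verbose=True,
)
print(" ".join(options))
```

These flags go between the spark-submit script and the application jar, because everything after the jar is passed to the application as run arguments.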
Mandatory parameters (SSH configuration):
SSH configuration: click ... and create a new SSH configuration. Specify the URL of the remote host that runs the Spark cluster and the user credentials to access it. Then click Test Connection to ensure you can connect to the remote server.
Target upload directory: the directory on the remote host to which the executable files are uploaded.
Spark home: a path to the Spark installation directory.
Application: a path to the executable file.
Class: the name of the main class of the jar archive. Select it from the list.
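Conceptually, the SSH configuration combines two steps: copying the application to the target upload directory and running spark-submit on the remote host. The Python sketch below builds equivalent scp and ssh commands. It is an assumption about the overall flow, not a description of how the IDE performs the upload, and the user, host, and paths are hypothetical. The local application path is assumed to use POSIX-style separators.

```python
import shlex
from pathlib import PurePosixPath
from typing import List, Tuple


def remote_submit_commands(
    user: str,
    host: str,
    upload_dir: str,
    spark_home: str,
    local_app: str,
    main_class: str,
) -> Tuple[List[str], List[str]]:
    """Build an upload command and a remote spark-submit command."""
    target = f"{user}@{host}"
    # The jar keeps its file name inside the target upload directory.
    remote_app = PurePosixPath(upload_dir) / PurePosixPath(local_app).name
    upload = ["scp", local_app, f"{target}:{remote_app}"]
    submit = [
        str(PurePosixPath(spark_home) / "bin" / "spark-submit"),
        "--class",
        main_class,
        str(remote_app),
    ]
    # ssh takes the remote command as one string, so quote it for the remote shell.
    run = ["ssh", target, shlex.join(submit)]
    return upload, run


# Hypothetical values, for illustration only.
upload, run = remote_submit_commands(
    user="dev",
    host="spark.example.com",
    upload_dir="/home/dev/uploads",
    spark_home="/opt/spark",
    local_app="build/app.jar",
    main_class="com.example.Main",
)
print(" ".join(upload))
print(" ".join(run))
```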
Optional parameters:
Name: a name to distinguish between run/debug configurations.
Allow parallel run: select to allow running multiple instances of this run configuration in parallel.
Store as project file: save the file with the run configuration settings to share it with other team members. The default location is .idea/runConfigurations. However, if you do not want to share the .idea directory, you can save the configuration to any other directory within the project.
Run arguments: arguments passed to your application.
Cluster manager: select the management method to run an application on a cluster. The SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos, or YARN). See more details in the Cluster Mode Overview.
Master: the format of the master URL passed to Spark.
Before launch: in this area you can specify tasks that must be performed before starting the selected run/debug configuration. The tasks are performed in the order they appear in the list.
Show this page: select this checkbox to show the run/debug configuration settings prior to actually starting the run/debug configuration.
Activate tool window: by default this checkbox is selected and the Run tool window opens when you start the run/debug configuration.
Click Add options and select an option to add to your configuration:
Spark Configuration: Spark configuration options available through a properties file or a list of properties.
Dependencies: files and archives (jars) that are required for the application to be executed.
Maven: Maven-specific dependencies. You can add repositories or exclude some packages from the execution context.
Driver: Spark driver settings, such as the amount of memory to use for the driver process. In cluster mode, you can also specify the number of cores.
Executor: executor settings, such as the amount of memory and the number of cores.
Spark Monitoring Integration: ability to monitor the execution of your application with Spark Monitoring.
Kerberos: settings for establishing a secured connection with Kerberos.
Shell options: select if you want to execute any scripts before the Spark submit.
Enter the path to bash and specify the script to be executed. It is recommended to provide an absolute path to the script.
Select the Interactive checkbox if you want to launch the script in the interactive mode. You can also specify environment variables, for example, USER=jetbrains.
Advanced Submit Options:
Proxy user: the username to impersonate when submitting the application (the spark-submit --proxy-user option).
Driver Java options, Driver library path, and Driver class path: add additional driver options. For more details, refer to Runtime Environment.
Archives: comma-separated list of archives to be extracted into the working directory of each executor.
Print additional debug output: run spark-submit with the --verbose option to print debugging information.
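The environment variables mentioned under Shell options use the NAME=value form, as in USER=jetbrains. The Python sketch below shows one way such entries could be turned into an environment for a script that runs before the submit step. The function name and the parsing rules are assumptions made for illustration, since the exact syntax the dialog accepts is not specified here.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def build_environment(
    assignments: Iterable[str],
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Merge NAME=value assignments into a copy of the base environment."""
    # Start from the current process environment unless a base is given.
    env = dict(os.environ if base is None else base)
    for item in assignments:
        # Split on the first "=" only, so values may contain "=" themselves.
        name, separator, value = item.partition("=")
        if not separator or not name:
            raise ValueError(f"expected NAME=value, got: {item!r}")
        env[name] = value
    return env


# A small base environment keeps the example output predictable.
env = build_environment(["USER=jetbrains"], base={"PATH": "/usr/bin"})
print(env)
```

The resulting mapping could then be passed as the env argument of subprocess.run when launching the script with bash.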
Click OK to save the configuration. Then select the configuration from the list of created configurations and run it.
Inspect the execution results in the Run tool window.