PyCharm 2022.2 Help

Configure Big Data Tools environment

Before you start working with Big Data Tools, you need to install the required plugins and configure connections to servers.

Install the required plugins

  1. Whatever you do in PyCharm, you do it in a project. So, open an existing project (File | Open) or create a new project (File | New | Project).

  2. Press Ctrl+Alt+S to open the IDE settings and select Plugins | Marketplace.

  3. Install the Big Data Tools plugin.

  4. Restart the IDE. After the restart, the Big Data Tools tab appears in the rightmost group of the tool windows. Click it to open the Big Data Tools window.

Once Big Data Tools support is enabled in the IDE, you can configure a connection to a Zeppelin, Spark, Google Storage, or S3 server. You can connect to HDFS, WebHDFS, AWS S3, and a local drive using config files and URIs.

Configure a server connection

  1. In the Big Data Tools window, click Add a connection and select the server type. The Big Data Tools Connection dialog opens.

  2. In the Big Data Tools Connection dialog, specify the following parameters depending on the server type:

    • File Systems: HDFS, Local, SFTP

    • Storages: AWS S3, Minio, Linode, DigitalOcean Spaces, GS, Azure, Yandex Object Storage, Alibaba OSS

    • Monitoring: Hadoop, Kafka, Spark, Hive Metastore, Flink.

    • Notebooks: Zeppelin

    • Data Processing Platforms: AWS EMR

    HDFS connection

    Mandatory parameters:

    • Name: a name that distinguishes this connection from other connections.

    • Root path: a path on the target server to be the root for the HDFS connection.

      When the connection is successfully established, the Driver home path field shows the target IP address of the connection, including the port number. Example: hdfs://127.0.0.1:65224/.

    • Config path: a path to the HDFS configuration files directory. See the samples of configuration files.

    Optionally, you can set up:

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.

    • Enable tunneling (NameNode operations only): creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to a host in the network is available. SSH tunneling currently works only for the following NameNode operations: list files, get meta info.

      Select the checkbox and specify a configuration of an SSH connection (click ... to create a new SSH configuration).

    Note that the Big Data Tools plugin uses the HADOOP_USER_NAME environment variable to log in to the server. If this variable is not defined, the user.name property is used.

    See more examples of the Hadoop File System configuration files.
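The login fallback described in the note above can be sketched in Python. This is only an illustration of the lookup order, not the plugin's code: the function name resolve_hadoop_user and the default_user parameter are hypothetical, and the operating system login name stands in for the Java user.name property.

```python
import getpass
import os


def resolve_hadoop_user(env=None, default_user=None):
    """Return HADOOP_USER_NAME if it is set and non-empty, else a fallback name."""
    env = os.environ if env is None else env
    user = env.get("HADOOP_USER_NAME")
    if user:
        return user
    # Assumption: the OS login name approximates Java's user.name property.
    return default_user if default_user is not None else getpass.getuser()


# The environment variable wins when it is defined.
print(resolve_hadoop_user({"HADOOP_USER_NAME": "etl"}, default_user="alice"))
# Otherwise the fallback is used.
print(resolve_hadoop_user({}, default_user="alice"))
```

If the server shows files owned by an unexpected user, checking the value of HADOOP_USER_NAME in the environment that launches the IDE is a reasonable first step.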

    Local FS

    Mandatory parameters:

    • Name: a name that distinguishes this connection from other connections.

    • Root path: a path to the root directory.

    Optionally, you can set up:

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.

    SFTP connection

    Mandatory parameters:

    • Name: a name that distinguishes this connection from other connections.

    • SSH config: select an SSH configuration, which contains the needed server address and credentials.

    • Root path: a path on the target server to be the root for the SFTP connection.

    Optionally, you can set up:

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.

    Configure S3 connection

    Mandatory parameters:

    • Name: a name that distinguishes this connection from other connections.

    • Choose the way to get buckets:

      • Select Custom roots and, in the Roots field, specify the name of the bucket or the path to a directory in the bucket. You can specify multiple names or paths by separating them with a comma.

      • Select All buckets in the account. You can then use the bucket filter to show only buckets with particular names.

    • Region: an AWS region of the specified bucket. You can select one from the list or let PyCharm autodetect it.

    • Authentication type: the authentication method. You can use your account credentials (by default), or opt to enter the access and secret keys. You can also use a named profile that is located in the default AWS config location (~/.aws/credentials on Linux or macOS, or C:\Users\<USERNAME>\.aws\credentials on Windows). If needed you can specify any profile from a custom credential file.

    Optionally, you can set up:

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.

    • Use custom endpoint: select if you want to specify a custom endpoint and a signing region.

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Enable tunneling. Creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to the host in the network is available.

    • Operation timeout (s): enter a timeout (in seconds) for operations performed on the remote storage, such as getting file info, listing or deleting objects. The default value is 15 seconds.
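A named profile in an AWS-style credentials file is an INI section that holds an access key and a secret key. The following Python sketch reads one profile from such a file. The helper name read_profile and the sample key values are made up for illustration, and this is not how the plugin itself loads credentials.

```python
import configparser


def read_profile(credentials_text, profile="default"):
    """Return (access_key, secret_key) for one profile of an AWS-style credentials file."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    section = parser[profile]  # raises KeyError if the profile does not exist
    return section["aws_access_key_id"], section["aws_secret_access_key"]


# Sample content in the format of ~/.aws/credentials (placeholder values).
SAMPLE = """
[default]
aws_access_key_id = sample_access_key
aws_secret_access_key = sample_secret_key

[analytics]
aws_access_key_id = another_access_key
aws_secret_access_key = another_secret_key
"""

print(read_profile(SAMPLE, "analytics"))
```

To read a real file, pass the text of ~/.aws/credentials (or of a custom credential file) instead of SAMPLE.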

    Configure AWS EMR connection

    Mandatory parameters:

    • Name: a name that distinguishes this connection from other connections.

    Optionally, you can set up:

    • Select if you want to specify a custom endpoint or a region. You can select a region from the list or let PyCharm autodetect it.

    • Authentication type: the authentication method. You can use your account credentials (by default), or opt to enter the access and secret keys. You can also use a named profile that is located in the default AWS config location (~/.aws/credentials on Linux or macOS, or C:\Users\<USERNAME>\.aws\credentials on Windows). If needed you can specify any profile from a custom credential file.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Click the Open SSH Key Settings link to create an SSH connection authenticated with a private key file. You need to specify the Amazon EC2 key pair private key in the EMR SSH Keystore dialog.

    Configure Minio connection

    Mandatory parameters:

    • Endpoint: specify an endpoint to connect to.

    • Name: a name that distinguishes this connection from other connections.

    Optionally, you can set up:

    • Bucket filter and Filter type help define a specific set of buckets to preview and work with. You can set filters that contain, match, or start with the pattern, or that use a regular expression.

    • Access credentials: Access Key and Secret Key.

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.
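The four filter types mentioned for Bucket filter can be pictured with a short Python sketch. The type names ("contains", "matches", "starts_with", "regex") are hypothetical labels, not the plugin's identifiers. It is also an assumption that "match" means exact equality and that a regular expression may match anywhere in the bucket name.

```python
import re


def bucket_matches(name, pattern, filter_type):
    """Check one bucket name against a pattern using the given filter type."""
    if filter_type == "contains":
        return pattern in name
    if filter_type == "matches":
        return name == pattern  # assumption: exact match
    if filter_type == "starts_with":
        return name.startswith(pattern)
    if filter_type == "regex":
        return re.search(pattern, name) is not None  # assumption: partial match
    raise ValueError(f"Unknown filter type: {filter_type}")


def filter_buckets(names, pattern, filter_type):
    """Keep only the bucket names accepted by the filter."""
    return [name for name in names if bucket_matches(name, pattern, filter_type)]


buckets = ["logs-2022", "raw-logs", "models"]
print(filter_buckets(buckets, "logs", "contains"))
print(filter_buckets(buckets, "logs", "starts_with"))
```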

    Configure Linode connection

    Mandatory parameters:

    • Name: a name that distinguishes this connection from other connections.

    Optionally, you can set up:

    • Region: a region of the specified bucket. You can select one from the list or let PyCharm autodetect it.

    • Bucket filter and Filter type help define a specific set of buckets to preview and work with. You can set filters that contain, match, or start with the pattern, or that use a regular expression.

    • Access credentials: Access Key and Secret Key.

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.

    Configure DigitalOcean Spaces connection

    Mandatory parameters:

    • Name: a name that distinguishes this connection from other connections.

    Optionally, you can set up:

    • Region: a DigitalOcean region of the specified bucket. You can select one from the list or let PyCharm autodetect it.

    • Choose the way to get buckets:

      • Select Custom roots and, in the Roots field, specify the name of the bucket or the path to a directory in the bucket. You can specify multiple names or paths by separating them with a comma.

      • Select All buckets in the account. You can then use the bucket filter to show only buckets with particular names.

    • Access credentials: Access Key and Secret Key.

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.
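The Roots field accepts several comma-separated bucket names or paths. A minimal Python sketch of how such a value could be split into bucket and directory parts is shown below. The function name parse_roots and the sample bucket names are hypothetical, and the plugin's own parsing may differ.

```python
def parse_roots(roots_field):
    """Split a comma-separated Roots value into (bucket, path_in_bucket) pairs."""
    roots = []
    for item in roots_field.split(","):
        item = item.strip().strip("/")
        if not item:
            continue  # ignore empty entries such as a trailing comma
        bucket, _, prefix = item.partition("/")
        roots.append((bucket, prefix))
    return roots


print(parse_roots("my-bucket, other-bucket/data/2022"))
# [('my-bucket', ''), ('other-bucket', 'data/2022')]
```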

    Connection settings for Google Cloud Storage

    Mandatory parameters:

    • Name: a name that distinguishes this connection from other connections.

    • Choose the way to get buckets:

      • Select Custom roots and, in the Roots field, specify the name of the bucket or the path to a directory in the bucket. You can specify multiple names or paths by separating them with a comma.

      • Select All buckets in the account. You can then use the bucket filter to show only buckets with particular names.

    • Google app credentials: a path to the Cloud Storage JSON file (required if the bucket is not publicly shared).

    Optionally, you can set up:

    • Project ID: available if you have selected All buckets in the account. This overrides the project ID specified in the JSON credentials file. Enter a project ID to use buckets from a project other than the one specified in the credentials file.

    • Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.
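The Project ID override can be pictured as follows: a service account JSON key file carries a project_id field, and a value entered in the connection settings takes precedence over it. The Python sketch below only illustrates that precedence. The function name effective_project_id and the sample project names are hypothetical.

```python
import json


def effective_project_id(credentials_json, override=None):
    """Return the override if given, else the project_id stored in the credentials JSON."""
    if override:
        return override
    return json.loads(credentials_json).get("project_id")


# Trimmed-down stand-in for a service account key file (placeholder values).
CREDENTIALS = '{"type": "service_account", "project_id": "team-project"}'

print(effective_project_id(CREDENTIALS))
print(effective_project_id(CREDENTIALS, override="shared-data-project"))
```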

    Connection settings for Azure Storage

    Mandatory parameters:

    • Name: a name that distinguishes this connection from other connections.

    • Endpoint: specify an endpoint to connect to.

    • Choose the way to get Microsoft Azure containers:

      • Select Custom roots and, in the Container field, specify the name of the container or the path to a directory in the container. You can specify multiple names or paths by separating them with a comma.

      • Select All containers in the account. You can then use the container filter to show only containers with particular names.

    • Authentication type: the authentication method. You can access the storage by username and key, by a connection string, or using a SAS token.

    Optionally, you can set up:

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.
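When you authenticate with a connection string, the value is a list of Key=Value pairs separated by semicolons. The Python sketch below splits such a string into a dictionary so that you can inspect the account name or endpoint suffix before pasting it into the dialog. The helper name and the sample account values are made up.

```python
def parse_connection_string(connection_string):
    """Split an Azure Storage style connection string into a dict of its parts."""
    parts = {}
    for segment in connection_string.split(";"):
        if not segment.strip():
            continue  # tolerate a trailing semicolon
        key, separator, value = segment.partition("=")
        if not separator:
            raise ValueError(f"Malformed segment: {segment!r}")
        # Only the first "=" separates key and value; keys often end with "=" padding.
        parts[key.strip()] = value.strip()
    return parts


SAMPLE = (
    "DefaultEndpointsProtocol=https;AccountName=sampleaccount;"
    "AccountKey=c2FtcGxlLWtleQ==;EndpointSuffix=core.windows.net"
)
print(parse_connection_string(SAMPLE)["AccountName"])
```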

    Connection settings for Yandex Object Storage

    Mandatory parameters:

    • Name: a name that distinguishes this connection from other connections.

    Optionally, you can set up:

    • Bucket filter and Filter type help define a specific set of buckets to preview and work with. You can set filters that contain, match, or start with the pattern, or that use a regular expression.

    • Authentication type: the authentication method. You can use your account credentials (by default), or opt to enter the access and secret keys. You can also use a named profile that is located in the default Yandex Object Storage config location. If needed you can specify any profile from a custom credential file.

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.

    Alibaba connection

    Mandatory parameters:

    • Region: an Alibaba OSS region. Select one from the list.

    • Name: a name that distinguishes this connection from other connections.

    Optionally, you can set up:

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.

    • Authentication type: the authentication method. You can use your account credentials (by default) or opt to enter the access and secret keys.

      You can also use a named profile that is located in the default OSS config location (~/.oss/credentials on Linux or macOS, or C:\Users\<USERNAME>\.oss\credentials on Windows). If needed you can specify any profile from a custom credential file.

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    Configure Hadoop connection

    Mandatory parameters:

    • URL: the path to the target server.

    • Name: a name that distinguishes this connection from other connections.

    Optionally, you can set up:

    • Enable tunneling. Creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to the host in the network is available.

      Select the checkbox and specify a configuration of an SSH connection (click ... to create a new SSH configuration).

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.

    • Enable HTTP basic authentication: connection with the HTTP authentication using the specified username and password.

    • Enable HTTP proxy: connection with the HTTP proxy using the specified host, port, username, and password.

    • HTTP Proxy: connection with the HTTP or SOCKS Proxy authentication. Select if you want to use the IDE HTTP Proxy settings or use custom settings with the specified host name, port, login, and password.

    • Kerberos authentication settings: opens the Kerberos authentication settings.

      Kerberos settings

      Specify the following options:

      • Enable Kerberos auth: select to use the Kerberos authentication protocol.

      • Krb5 config file: a file that contains Kerberos configuration information.

      • JAAS login config file: a file that consists of one or more entries; each specifies which underlying authentication technology should be used for a particular application or applications.

      • Use subject credentials only: allows the mechanism to obtain credentials from some vendor-specific location. Select this checkbox and provide the username and password.

      • To include additional login information in the PyCharm log, select Kerberos debug logging and JGSS debug logging.

        Note that the Kerberos settings are effective for all your Spark connections.

    • You can also reuse any of the existing Spark connections. Just select it from the Spark Monitoring list.
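The Enable tunneling option corresponds conceptually to local port forwarding over SSH: traffic sent to a local port is relayed through a reachable gateway host to a server inside the private network. The Python sketch below builds an equivalent OpenSSH command line. The host names, ports, and function name are hypothetical, and the plugin manages its own tunnel rather than running this command.

```python
def ssh_tunnel_command(gateway, target_host, target_port, local_port, user=None):
    """Build an OpenSSH command that forwards local_port to target_host:target_port."""
    destination = f"{user}@{gateway}" if user else gateway
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward ports
        "-L", f"{local_port}:{target_host}:{target_port}",
        destination,
    ]


# Hypothetical hosts: a gateway that is reachable and a server that is not.
command = ssh_tunnel_command("gateway.example.com", "10.0.0.5", 8088, 18088, user="alice")
print(" ".join(command))
```

With such a tunnel running, a client would use localhost and the local port instead of the private address.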

    Configure Kafka connection

    Mandatory parameters:

    • URL: the path to the target server.

    • Name: a name that distinguishes this connection from other connections.

    Optionally, you can set up:

    • Properties: the list of configurable connection parameters. You can specify a file with the properties; otherwise, the available properties are retrieved from the Kafka documentation. Start typing a property name and press Ctrl+Space to select the target property.

    • Enable tunneling. Creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to the host in the network is available.

      Select the checkbox and specify a configuration of an SSH connection (click ... to create a new SSH configuration).

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.

    • Click the question mark next to the Kafka support is limited message to preview the list of the currently supported features.
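A properties file for the Properties option is a plain list of key=value lines. The Python sketch below parses a simplified subset of that format (it ignores line continuations and the ":" separator that Java properties files also allow). The function name is hypothetical and the sample values are placeholders. The keys shown are standard Kafka client settings.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and comments."""
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # simplification: lines without "=" are ignored
        properties[key.strip()] = value.strip()
    return properties


SAMPLE = """
# Connection settings (placeholder values)
bootstrap.servers=localhost:9092
security.protocol = SASL_SSL
sasl.mechanism=PLAIN
"""

print(parse_properties(SAMPLE))
```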

    Configure Spark connection

    Mandatory parameters:

    • URL: the path to the target server.

    • Name: a name that distinguishes this connection from other connections.

    Optionally, you can set up:

    • Enable tunneling. Creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to the host in the network is available.

      Select the checkbox and specify a configuration of an SSH connection (click ... to create a new SSH configuration).

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.

    • Enable HTTP basic authentication: connection with the HTTP authentication using the specified username and password.

    • Enable HTTP proxy: connection with the HTTP proxy using the specified host, port, username, and password.

    • HTTP Proxy: connection with the HTTP or SOCKS Proxy authentication. Select if you want to use the IDE HTTP Proxy settings or use custom settings with the specified host name, port, login, and password.

    • Kerberos authentication settings: opens the Kerberos authentication settings.

      Kerberos settings

      Specify the following options:

      • Enable Kerberos auth: select to use the Kerberos authentication protocol.

      • Krb5 config file: a file that contains Kerberos configuration information.

      • JAAS login config file: a file that consists of one or more entries, each specifying which underlying authentication technology should be used for a particular application or applications.

      • Use subject credentials only: allows the mechanism to obtain credentials from some vendor-specific location. Select this checkbox and provide the username and password.

      • To include additional login information in the PyCharm log, select Kerberos debug logging and JGSS debug logging.

        Note that the Kerberos settings are effective for all your Spark connections.

    Configure Hive connection

    Mandatory parameters:

    • Name: a name that distinguishes this connection from other connections.

    • Properties: select how to specify your Hive configuration properties: enter them explicitly or load them from a configuration folder. If you select Explicit, you can enter a value for the metastore.thrift.uris property in the URL field and enter any other properties in the Other properties field.

    Optionally, you can set up:

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable tunneling. Creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to the host in the network is available.

      Select the checkbox and specify a configuration of an SSH connection (click ... to create a new SSH configuration).

    • Database pattern: if you want to view only some of your Hive databases in the editor tab, use this field to enter a regular expression for the database names.

    • Table pattern: if you want to view only some of your database tables in the editor tab, use this field to enter a regular expression for the table names.
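To preview what a Database pattern or Table pattern would keep, you can try the regular expression in Python first. This sketch assumes the pattern must match the whole name, which may not be how the plugin applies it, and the function and database names are made up.

```python
import re


def filter_names(names, pattern):
    """Return the names that fully match the regular expression; keep all if it is empty."""
    if not pattern:
        return list(names)
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


databases = ["sales", "sales_archive", "tmp_sales"]
print(filter_names(databases, r"sales.*"))
# ['sales', 'sales_archive']
```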

    Configure Zeppelin connection

    Mandatory parameters:

    • URL: the path to the target server.

    • Login and Password: your credentials to access the target server.

    • Name: a name that distinguishes this connection from other connections.

    Optionally, you can set up:

    • Login as anonymous: select to log in without using your credentials.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if you want to restrict using this connection. By default, the newly created connections are enabled.

    • Library Versions: Scala Version, Spark Version, and Hadoop Version: these values are derived from the plugin bundles. If needed, specify any alternative version values.

    • Enable HTTP basic authentication: connection with the HTTP authentication using the specified username and password.

    • Proxy: connection with the HTTP or SOCKS Proxy authentication. Select if you want to use the IDE HTTP Proxy settings or use custom settings with the specified host name, port, login, and password.

    • Enable tunneling. Creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to the host in the network is available.

      Select the checkbox and specify a configuration of an SSH connection (click ... to create a new SSH configuration).

    • Notifications. Select Enable cell execution notification if you want to be notified when execution time exceeds the specified time interval (60 seconds by default).

  3. Once you fill in the settings, click Test connection to ensure that all configuration parameters are correct. Then click OK.
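If Test connection fails, it can help to check whether the server's host and port are reachable at all from your machine. The Python sketch below performs only a plain TCP connection attempt. It is a rough pre-check, not a reproduction of what Test connection verifies, and the function name and sample address are hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call (hypothetical address):
# can_connect("master.example.com", 50070)
```

A True result only shows that the port accepts connections. Credentials, protocol versions, and proxy settings can still cause the connection test in the IDE to fail.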

You can disable any connection if you temporarily do not need it. Right-click the corresponding item in the Big Data Tools window and select Disable Connection from the context menu. The server changes its visual appearance and behavior: you cannot preview its content. To restore the connection, right-click it and select Enable Connection from the context menu.

For your convenience, you can rename the server root and copy a path to it. To quickly access all the required actions, right-click the target server in the Big Data Tools window and select the corresponding command from the context menu.

Now that you have established a connection to the server, you can start working with your notebooks. However, it might be a good practice to ensure that all the libraries and packages required for execution on a particular server are installed and available.

Configure notebook dependencies

  1. From the main menu, select File | Project Structure.

  2. In the Project Structure dialog, select Modules in the list of the Project Settings. Then select any of the configured connections in the list of the modules and double-click System Dependencies.

  3. Inspect the list of the added libraries. Click the list and start typing to search for a particular library.

    Configure dependencies

  4. If needed, modify the list of the libraries:

    • Click the Add button to add a new library.

    • Click the Specify Documentation URL button and specify the URL of the external documentation.

    • Click the Exclude button to select the items that you want PyCharm to ignore (folders, archives and folders within the archives), and click OK.

    • Click the Remove button to remove the selected ordinary item from the library or to restore the selected excluded items. The restored items will stay in the library.

Manage Zeppelin interpreters

You can configure interpreters on a Zeppelin server. Once an interpreter is added, it is available for all notes on this server.

Configure Zeppelin interpreters

  1. Open interpreter settings using one of the following ways:

    • Click the interpreter settings on the notebook toolbar.

    • Right-click a Zeppelin server in the Big Data Tools tool window and select Open Interpreter Settings from the context menu.

  2. Preview the list of the available interpreters in the Interpreter Settings window.

    Interpreter settings

    Note that the list of the interpreters is identical to the list that opens in the Interpreter Bindings dialog for Zeppelin 0.8 and earlier. For Zeppelin 0.9, Interpreter Bindings shows only interpreters in use. To filter the list of the interpreters, type the target name in the Search field.

    You can use the following actions of the interpreter toolbar:

    • Refresh: updates the list of the interpreters.

    • Add an interpreter: opens a dialog to add a new interpreter. You can add a new interpreter to an existing group of interpreters and configure its settings.

    • Delete the selected interpreter: deletes the selected interpreter.

    • Restart the interpreter: restarts the selected interpreter.

    • Manage repositories: opens a dialog to add, remove, and modify interpreter repositories.

  3. Preview the settings of the target interpreter.

    • When an interpreter has resolved all dependencies and it is ready for use, its status is shown as Ready.

    • If the selected interpreter is a root of the interpreter group, you should see the interpreters that are included in this group. For example, the spark group consists of %spark, %spark.sql, %spark.pyspark, %spark.ipyspark, %spark.r, %spark.ir, %spark.shiny, %spark.kotlin.

    • Select SHARED, SCOPED, or ISOLATED interpreter binding modes. In shared mode, every note using this interpreter shares a single interpreter instance. Scoped and isolated modes can be used under per user or per note dimensions. In scoped per note mode, each note will create a new interpreter instance in the same interpreter process. In isolated per note mode, each note will create a new interpreter process.

    • Select the Set permission checkbox and specify the owner names, if you want to restrict access to the selected interpreter.

    • Select the Connect to existing process checkbox to provide a Host and Port on the target server.

    • You can add interpreter Properties or modify the predefined set of properties and their values. Properties are exported as environment variables on the system if the property name consists of upper-case characters, numbers, or underscores ([A-Z_0-9]). Otherwise, the property is set as a common interpreter property. See more details in the Apache Zeppelin documentation.

      For example, you can add the zeppelin.SparkInterpreter.precode property and put some code into the Value field to execute on interpreter init.

      Add the zeppelin.SparkInterpreter.precode property

      This code is resolved in a note after initialization of the interpreter:

      Resolving the zeppelin.SparkInterpreter.precode property in a note

    • In the Dependencies area add any library you want to use with the selected interpreter. If needed, specify the files that should be excluded.
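The naming rule for interpreter properties described above can be checked with a short Python sketch: a name made up only of upper-case letters, digits, and underscores is exported as an environment variable, and any other name is treated as a common interpreter property. The function name classify_property is hypothetical.

```python
import re

# Names consisting only of the characters in [A-Z_0-9] become environment variables.
ENV_NAME = re.compile(r"[A-Z_0-9]+")


def classify_property(name):
    """Tell whether a property name would be exported as an environment variable."""
    if ENV_NAME.fullmatch(name):
        return "environment variable"
    return "interpreter property"


print(classify_property("SPARK_HOME"))
print(classify_property("zeppelin.SparkInterpreter.precode"))
```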

Click Refresh to update the list of the interpreters. To restart the selected interpreter, click Restart the interpreter.

Manage repositories

  1. To open Repository Settings, click Manage repositories on the interpreter toolbar.

    Manage repositories

    You can refresh the list of the repositories (Refresh), add a new repository (New repository), and remove the selected repository (Remove the selected repository).

  2. To add a new repository, click New repository and fill in the repository settings:

    Mandatory parameters:

    • Id: a unique name of the repository

    • Url: address of the repository

    Optionally, you can set up:

    • Name: a username to access the repository

    • Password: a password to access the repository

    • Host: an HTTP or HTTPS server where the repository resides

    • Port: a port of the repository server

    • Name and Password: user credentials to access the repository server

Samples of Hadoop File System configuration files

Type

Sample configuration

HDFS

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://example.com:9000/</value>
  </property>
</configuration>

S3

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>sample_access_key</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>sample_secret_key</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://example.com/</value>
  </property>
</configuration>

WebHDFS

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.webhdfs.impl</name>
    <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>webhdfs://master.example.com:50070/</value>
  </property>
</configuration>

WebHDFS and Kerberos

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.webhdfs.impl</name>
    <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>webhdfs://master.example.com:50070</value>
  </property>
  <property>
    <name>hadoop.security.authentication</name>
    <value>Kerberos</value>
  </property>
  <property>
    <name>dfs.web.authentication.kerberos.principal</name>
    <value>testuser@EXAMPLE.COM</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>

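Before pointing Config path at a directory, you may want to confirm that a configuration file is well-formed and defines the expected file system. The Python sketch below reads the name and value pairs from a file in the format shown in the samples. The function name load_hadoop_config is hypothetical, and the check is independent of how the plugin reads these files.

```python
import xml.etree.ElementTree as ET


def load_hadoop_config(xml_text):
    """Return a dict of property names and values from a Hadoop configuration document."""
    root = ET.fromstring(xml_text)
    config = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is None:
            continue  # skip malformed entries without a name
        config[name.strip()] = (prop.findtext("value") or "").strip()
    return config


SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://example.com:9000/</value>
  </property>
</configuration>
"""

print(load_hadoop_config(SAMPLE)["fs.defaultFS"])
```
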
Last modified: 12 August 2022