DataSpell 2023.2 Help

HDFS

Connect to an HDFS server

  1. In the Big Data Tools window, click Add a connection and select HDFS.

  2. In the Big Data Tools dialog that opens, specify the connection parameters:

    HDFS connection
    • Name: a name that distinguishes this connection from the others.

    • In Configuration source, select one of:

      • Custom: in the Cluster URI box, enter the URI of your HDFS server. If Kerberos is used to control access to your HDFS server, select Kerberos under Authentication.

      • Configuration Folder: a path to the directory with the HDFS configuration files. See the samples of configuration files.

    Optionally, you can set up:

    • Enable connection: deselect if you want to disable this connection. Newly created connections are enabled by default.

    • Hadoop user name: enter a username to log in to the server. If not specified, the HADOOP_USER_NAME environment variable is used. If this variable is not defined, the user.name property is used. If Kerberos is enabled, it overrides any of these three values.

    • Enable tunneling (Only NameNode operation). This option creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network, but an SSH connection to a host in that network is available. SSH tunneling currently works only for the following NameNode operations: list files, get meta info.

      Select the checkbox and specify a configuration of an SSH connection (click ... to create a new SSH configuration).

    • Under Extended Connection Settings, you can set up:

      • Root path: a path on the target server to be the root for the HDFS connection.

      • Operation timeout (s): enter a timeout (in seconds) for operations performed on the remote storage, such as getting file info, listing or deleting objects. The default value is 15 seconds.

  3. Once you fill in the settings, click Test connection to ensure that all configuration parameters are correct. Then click OK.
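The fallback order described for Hadoop user name can be illustrated with a short Python sketch. It is not the IDE's implementation: the function name `resolve_hadoop_user` and its parameters are hypothetical, the current OS login name stands in for the JVM `user.name` property, and the Kerberos override is not modeled.

```python
import getpass
import os


def resolve_hadoop_user(explicit_name=None, environ=None, fallback_user=None):
    """Pick a user name in the order: explicit value, HADOOP_USER_NAME, user.name.

    Hypothetical helper for illustration only. `fallback_user` stands in for
    the JVM "user.name" property; when omitted, the OS login name is used.
    """
    environ = os.environ if environ is None else environ

    # 1. A name entered in the connection settings wins.
    if explicit_name:
        return explicit_name

    # 2. Otherwise use the HADOOP_USER_NAME environment variable, if set.
    env_name = environ.get("HADOOP_USER_NAME")
    if env_name:
        return env_name

    # 3. Otherwise fall back to the stand-in for the user.name property.
    return fallback_user if fallback_user is not None else getpass.getuser()


if __name__ == "__main__":
    print(resolve_hadoop_user(None, {"HADOOP_USER_NAME": "hdfs"}, "localuser"))
```

Passing the environment as a plain dictionary keeps the sketch easy to try without changing real environment variables.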

When the connection is successfully established, the Driver home path field shows the target IP address of the connection, including the port number. Example: hdfs://127.0.0.1:65224/.
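Before entering a value in the Cluster URI box, it can help to check that the URI splits into the expected scheme, host, and port. The following Python sketch uses only the standard library. The function name `describe_cluster_uri` is hypothetical and the check is a plain syntax check, not what the IDE does when you click Test connection.

```python
from urllib.parse import urlsplit


def describe_cluster_uri(uri):
    """Split a cluster URI such as hdfs://host:port/ into its parts.

    Hypothetical helper for illustration; it validates syntax only and
    does not contact the server.
    """
    parts = urlsplit(uri)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a usable cluster URI: {uri!r}")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,        # None when the URI has no port
        "path": parts.path or "/",  # treat a missing path as the root
    }


if __name__ == "__main__":
    print(describe_cluster_uri("hdfs://127.0.0.1:65224/"))
```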

Samples of Hadoop File System configuration files

Type

Sample configuration

HDFS

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://example.com:9000/</value>
  </property>
</configuration>

S3

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>sample_access_key</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>sample_secret_key</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://example.com/</value>
  </property>
</configuration>

WebHDFS

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.webhdfs.impl</name>
    <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>webhdfs://master.example.com:50070/</value>
  </property>
</configuration>

WebHDFS and Kerberos

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.webhdfs.impl</name>
    <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>webhdfs://master.example.com:50070</value>
  </property>
  <property>
    <name>hadoop.security.authentication</name>
    <value>Kerberos</value>
  </property>
  <property>
    <name>dfs.web.authentication.kerberos.principal</name>
    <value>testuser@EXAMPLE.COM</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
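
The sample files above share one layout: a configuration root element with property children, each holding a name and a value. As a rough way to check a file before pointing the Configuration Folder option at its directory, the following Python sketch reads such a document into a dictionary using the standard library. The function name `read_hadoop_properties` and the `SAMPLE` constant are hypothetical, and the sketch ignores other Hadoop configuration features such as variable substitution.

```python
import xml.etree.ElementTree as ET

# Copy of the HDFS sample from the table above, used as test input.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://example.com:9000/</value>
  </property>
</configuration>"""


def read_hadoop_properties(xml_text):
    """Return the name/value pairs of a Hadoop-style configuration document.

    Hypothetical helper for illustration; it only reads <property> elements
    that are direct children of <configuration>.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "configuration":
        raise ValueError("expected a <configuration> root element")

    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:  # skip properties without a usable name
            props[name.strip()] = value.strip()
    return props


if __name__ == "__main__":
    for key, value in read_hadoop_properties(SAMPLE).items():
        print(f"{key} = {value}")
```

A missing fs.defaultFS entry in the result is a quick hint that the file does not say which file system to connect to.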
Last modified: 23 August 2023