HDFS

Connect to an HDFS server

In the Big Data Tools window, click the button for adding a connection and select HDFS.
In the Big Data Tools dialog that opens, specify the connection parameters:
- Name: a name for the connection, to distinguish it from other connections.
- In Configuration source, select one of:
  - Custom: in the Cluster URI box, enter the URI of your HDFS server. If Kerberos is used to control access to your HDFS server, select Kerberos under Authentication.
  - Configuration Folder: a path to the directory with the HDFS configuration files. See the samples of configuration files.
Optionally, you can set up:
- Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.
- Enable connection: deselect to disable this connection. By default, newly created connections are enabled.
- Hadoop user name: enter a username to log in to the server. If not specified, the HADOOP_USER_NAME environment variable is used. If this variable is not defined, the user.name property is used. If Kerberos is enabled, it overrides any of these three values.
- Enable tunneling (NameNode operations only): creates an SSH tunnel to the remote host. This can be useful if the target server is in a private network but an SSH connection to a host in that network is available. SSH tunneling currently works only for the following NameNode operations: listing files and getting meta information.
  Select the checkbox and specify an SSH connection configuration (click ... to create a new SSH configuration).
- Under Extended Connection Settings, you can set up:
  - Root path: a path on the target server to be the root for the HDFS connection.
  - Operation timeout (s): enter a timeout (in seconds) for operations performed on the remote storage, such as getting file info, listing or deleting objects. The default value is 15 seconds.
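The precedence described for Hadoop user name can be sketched as a small function. This is an illustration only, not the plugin's code: `resolve_hadoop_user` and its parameters are hypothetical names, the Java `user.name` system property is passed in as a plain string, and returning the Kerberos principal when Kerberos is enabled is an assumption about what "overrides" means here.

```python
from typing import Mapping, Optional


def resolve_hadoop_user(
    configured_name: Optional[str],
    env: Mapping[str, str],
    java_user_name: str,
    kerberos_principal: Optional[str] = None,
) -> str:
    """Pick a user name following the documented precedence (hypothetical helper)."""
    # Kerberos, when enabled, overrides the other three sources.
    # Assumption: the principal is what ends up being used.
    if kerberos_principal:
        return kerberos_principal
    # 1. The value entered in the Hadoop user name field.
    if configured_name:
        return configured_name
    # 2. The HADOOP_USER_NAME environment variable.
    env_name = env.get("HADOOP_USER_NAME")
    if env_name:
        return env_name
    # 3. The user.name property (a Java system property, passed in as a string).
    return java_user_name
```

For example, `resolve_hadoop_user(None, {"HADOOP_USER_NAME": "bob"}, "carol")` falls through the empty field and picks the environment variable.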
Once you fill in the settings, click Test connection to ensure that all configuration parameters are correct. Then click OK.

When the connection is successfully established, the Driver home path field shows the target IP address of the connection, including the port number. Example: hdfs://127.0.0.1:65224/.
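To check which host and port such a value refers to, the URI can be split with the Python standard library. A minimal sketch; `split_hdfs_uri` is a hypothetical helper name, and the input is the example value from the text above.

```python
from urllib.parse import urlsplit


def split_hdfs_uri(uri: str):
    """Return (scheme, host, port) for a URI such as hdfs://host:port/."""
    parts = urlsplit(uri)
    # hostname and port are parsed from the network location part of the URI.
    return parts.scheme, parts.hostname, parts.port


print(split_hdfs_uri("hdfs://127.0.0.1:65224/"))
# prints ('hdfs', '127.0.0.1', 65224)
```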

Samples of Hadoop File System configuration files

Type: HDFS
Sample configuration:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.hdfs.impl</name>
        <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://example.com:9000/</value>
      </property>
    </configuration>

Type: S3
Sample configuration:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <value>sample_access_key</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>sample_secret_key</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>s3a://example.com/</value>
      </property>
    </configuration>

Type: WebHDFS
Sample configuration:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.webhdfs.impl</name>
        <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>webhdfs://master.example.com:50070/</value>
      </property>
    </configuration>

Type: WebHDFS and Kerberos
Sample configuration:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.webhdfs.impl</name>
        <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>webhdfs://master.example.com:50070</value>
      </property>
      <property>
        <name>hadoop.security.authentication</name>
        <value>Kerberos</value>
      </property>
      <property>
        <name>dfs.web.authentication.kerberos.principal</name>
        <value>testuser@EXAMPLE.COM</value>
      </property>
      <property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
      </property>
    </configuration>
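Before pointing Configuration Folder at a directory, it can help to confirm which properties a configuration file actually defines. The sketch below reads the name/value pairs from the HDFS sample above using the Python standard library. `read_hadoop_properties` is a hypothetical helper name, and the XML is embedded as a string here rather than read from a file.

```python
import xml.etree.ElementTree as ET

# The HDFS sample configuration from this page, embedded for illustration.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://example.com:9000/</value>
  </property>
</configuration>
"""


def read_hadoop_properties(xml_text: str) -> dict:
    """Collect <property> name/value pairs from a Hadoop-style configuration."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        # Skip malformed entries that have no <name> element.
        if name is not None:
            properties[name.strip()] = (value or "").strip()
    return properties


props = read_hadoop_properties(SAMPLE)
print(props["fs.defaultFS"])
# prints hdfs://example.com:9000/
```

The same function works on the S3 and WebHDFS samples, since they share the `<configuration>`/`<property>` layout.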

Last modified: 29 March 2023