DataSpell 2024.3 Help

Work with Data Wrangler

Data Wrangler is a no-code tool that simplifies data cleaning and preparation.

It offers an interactive user interface that allows you to view and analyze the data, displays column statistics and visualizations, and automatically generates Python code.

Open Data Wrangler

  1. Open a Jupyter notebook.

  2. Run code cell to create a pandas dataframe. For example, run cell with the following code:

    import pandas as pd # Data data = { 'Name': ['John', 'Anna', 'Peter', 'Linda', 'Dina', 'Kate', 'Tom', 'Emily'], 'Age': [22, 78, 22, 30, 45, 30, 35, 40], 'Gender': ['Male', 'Female', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female'], 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego'], 'Occupation': ['Engineer', 'Doctor', 'Teacher', 'Nurse', 'Architect', 'Lawyer', 'Accountant', 'Scientist'] } # Create a DataFrame df = pd.DataFrame(data) # Display the DataFrame df
  3. In the upper-right corner of the output cell, click Open Data Wrangler.

    Click Open Data Wrangler button
  4. Data Wrangler will open in a new tab:

    Data Wrangler tab opened

Use Data Wrangler transformations

Data Wrangler transformations in the Data Wrangler tab

Transformation

Description

Sort and filter

Filter

Filters rows in a selected column based on a specified condition and value

Clean and remove

Drop column

Removes a selected column from a table

Remove duplicates

Removes all rows that have duplicate values from a selected column

Drop missing values

Removes all rows with missing values from a selected column

Remove rows with NaN values

Removes rows that contain empty values from a table

Drop rows

Removes selected rows from a table

Find and replace

Find and replace

Replaces cells with a specified matching pattern from a selected column

Create additional column

Transform column with string

Transforms strings in a selected column. You can select one of the following transformations:

  • Capitalize first character

  • Convert text to lowercase

  • Convert text to uppercase

One-hot encoding categorical variables

Splits categorical data from a selected column into a new column for each category

Other

Fill missing

Replaces cells with missing values with a new value in a selected column

Round numerical

Rounds numbers in a selected column to the specified number of decimal places:

  • Round: rounds a number to the nearest integer.

    If the fraction of the number is 0.5 or higher, it rounds up. If it's less than 0.5, it rounds down.

  • Floor: rounds a number down to the nearest integer

  • Ceil: rounds a number up to the nearest integer

Split column

Splits a selected column into several columns based on a user defined delimiter

Change a type of column

Changes the data type of selected column

Normalize and scale

Min-Max scaling

Rescales a selected numerical column between a minimum and maximum value

Z-Score normalization

Transforms the data from a selected column into a distribution with a mean of 0 and a standard deviation of 1

Handling outliers or skewed distributions

Outlier detection with IQR

Detects outliers in a selected column using Interquartile Range

Reduce skewness

Reduces skewness by applying logarithmic or square root transformation to the data in a selected colum

Outlier detection with MAD

Detects outliers in a selected column using Median Absolute Deviation

Outlier detection with Euclidean distance

Detects outliers in a selected column using Euclidean Distance

Export Code to Notebook

You can create a new cell in your Jupyter Notebook with all the data transformation code you generated.

  1. Click Export Code to Notebook Export Code to Notebook.

    You can view the history of changes applied to data before you export the code.

    Click Export Code to Notebook button
  2. Your Jupyter notebook will open, and a new cell with generated code will be added to the notebook.

Export Data to File

You can save the transformed dataset as a new file.

  1. Click Export Data to File Export Data to File.

  2. Choose an extractor and configure additional settings.

    Export data
  3. Click the Browse button Browse to choose the location for your file.

  4. Click Export to File to save the data as a file.

Example: remove duplicate entries

One of the common data cleaning tasks is to remove duplicate entries to prevent biased results from your analysis.

You can use Data Wrangler to transform your data through the interface. Data Wrangler will automatically generate the Python code required for the removing of duplicates.

  1. Open Data Wrangler.

  2. Select Remove duplicates from the Transformations drop-down list.

  3. Select the column from the Column drop-down list.

  4. Check the generated code.

    Check code generated by Data Wrangler
  5. Click Apply.

  6. Click Export Code to Notebook if you want to add a new code cell with generated code to your notebook or click Export Data to File to save transformed data as a file.

Last modified: 20 November 2024