A repository of exercises to support the training.
Before starting this exercise, make sure you have the following prerequisites set up:
Compute Instances: Development workstations that data scientists can use to work with data and models.
For this exercise, we recommend using a Standard_DS11_v2 compute instance. You can configure this compute target within Azure Machine Learning Studio under the Compute section.
Compute Clusters: Scalable clusters of virtual machines for on-demand processing of experiment code.
In this exercise, you'll dive into the Azure Machine Learning Designer feature to explore and preprocess data for machine learning tasks. Using the Titanic dataset (sourced from here), you'll work through the tasks described below.
Azure Machine Learning Designer offers an intuitive visual interface for building, testing, and deploying machine learning models without writing a single line of code. By leveraging its drag-and-drop functionality, data scientists and analysts can streamline the data preprocessing workflow and gain valuable insights into their datasets.
Let's explore techniques to identify and handle missing data within the dataset, ensuring robustness and reliability in subsequent analyses.
If all is well, you'll see the data asset that you just created in the Designer under the Data tab.
Drag the data asset onto the pipeline canvas. From there, you can also right-click it and select Preview data.
You typically apply data transformations to prepare the data for modeling. In the pane on the left (not the menu where you see Authoring, Assets, Manage), first select Component and then expand the Data Transformation section, which contains a wide range of modules you can use to transform data before model training.
Drag a Select Columns in Dataset module to the canvas, below the Titanic module. Then connect the output at the bottom of the Titanic module to the input at the top of the Select Columns in Dataset module.
Select the Select Columns in Dataset module, and in its Settings pane on the right, select Edit column. Then in the Select columns window, select By name and use the + links to add all columns.
Since this is a missing-data exercise, you can safely assume the dataset has missing values. I could tell you where (or you could just scroll down to find out), but let's run the pipeline and see the missing data for yourself.
To run the pipeline, click the Configure & Submit button on the ribbon.
Basics:
- Experiment name: Create new
- New experiment name: Enter a unique name
- Job display name: Unchanged
- Job description: Unchanged
- Job tags: Unchanged

Inputs & Outputs: Unchanged

Runtime settings:
- Select compute type: Compute cluster
- Select Azure ML compute cluster: Previously created
- Select datastore: Previously created
- Continue on step failure: Selected
At this point, the pipeline will execute. It will take some time, so this is perhaps a good moment for a quick break.
It looks like we don't know the age of 177 passengers, and we don't even know where two passengers embarked.
Cabin information is also missing for a whopping 687 passengers.
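The Designer surfaces these counts visually, but if you'd like to verify them yourself, here's a minimal pandas sketch (the `titanic.csv` path is a hypothetical local copy of the dataset):

```python
import pandas as pd

# Load a local copy of the Titanic dataset (hypothetical path).
df = pd.read_csv("titanic.csv")

# Count the missing (NaN) values in each column.
print(df.isnull().sum())
# For the standard Titanic dataset, expect: Age 177, Cabin 687, Embarked 2.
```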
Apply appropriate techniques such as imputation or deletion to handle missing data.
There are a few ways to handle missing data, such as treating missing values as zero, deleting rows with missing data, replacing empty values with the mean or median of that column, and assigning a new category to unknown categorical data.
Some datasets have missing values that appear as zero. While the Titanic dataset doesn't have this problem, keep in mind that if it did, you might mistake those zeros for actual age values rather than recognizing them as missing. This is why it's important to review your raw data.
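If your dataset did use zero as a stand-in for "unknown", one option is to convert those zeros to proper missing values before any analysis. A rough pandas sketch (the column choice and file path are illustrative, not from the Titanic data):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("titanic.csv")  # hypothetical local copy

# If 0 actually means "unknown" in a column, convert it to NaN so that
# downstream tools treat it as missing rather than as a real value.
df["Age"] = df["Age"].replace(0, np.nan)
```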
Option 1: Delete rows with missing data
For a model that cannot handle missing data, the most prudent thing to do is to remove rows that have information missing.
Let's remove some data from the Embarked column, which has only two rows with missing data, using the Clean Missing Data module.
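The Clean Missing Data module handles this step in the Designer; for reference, the same operation in pandas looks roughly like this (assuming a local `titanic.csv`):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # hypothetical local copy

# Drop only the rows where Embarked is missing; everything else is kept.
df = df.dropna(subset=["Embarked"])
```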
Option 2: Replace empty values with the mean or median for that data
Sometimes, our model cannot handle missing values, and we also cannot afford to remove too much data. In this case, we can sometimes fill in missing data with an average calculated on the basis of the rest of the dataset. Note that imputing data like this can affect model performance in a negative way. Usually, it's better to simply remove missing data, or to use a model designed to handle missing values.
Here, we impute data for the Age field. We use the mean Age from the remaining rows, given that >80% of these have values.
The Age field no longer has empty cells.
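For reference, a rough pandas equivalent of this mean-imputation step (again assuming a local `titanic.csv`):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # hypothetical local copy

# Compute the mean of the known ages (NaNs are excluded by default)
# and use it to fill in the missing ones.
mean_age = df["Age"].mean()
df["Age"] = df["Age"].fillna(mean_age)
```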
Option 3: Assign a new category to unknown categorical data
The Cabin field is categorical, because the Titanic had a finite number of cabins. Unfortunately, many records have no cabin listed.
For this exercise, it makes perfect sense to create an Unknown category and assign it to the cases where the cabin is unknown.
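In the Designer, you'd configure this replacement in the Clean Missing Data module; in code, a rough pandas equivalent (assuming a local `titanic.csv`) would be:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # hypothetical local copy

# Treat "no cabin listed" as its own category instead of dropping rows.
df["Cabin"] = df["Cabin"].fillna("Unknown")
```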
Run the pipeline as an experiment again.
That's it! No more missing data!
We only lost two records (where Embarked was empty).
That said, we had to make some approximations to fill the gaps in the Age and Cabin columns, and those will certainly influence the performance of any model we train on this data.
After completing this exercise, you'll have gained practical experience in using Azure Machine Learning Designer to preprocess and explore datasets for machine learning tasks. Stay tuned for upcoming modules where we'll delve deeper into advanced data analytics and model development techniques.
Happy exploring!
If you're not using the Azure resources created in this lab for other training modules, you can delete them to avoid incurring further charges.
Open the Azure portal at https://portal.azure.com, and in the top search bar, search for the resources you created in this lab.
On the resource page, select Delete and follow the instructions to delete the resource. Alternatively, you can delete the entire resource group to clean up all resources at the same time.