Simplifying Machine Learning model development with ColumnTransformer and Pipeline

This article explains how to use the ColumnTransformer and Pipeline classes of scikit-learn to simplify the process of developing and deploying a machine learning model.

KSV Muralidhar

Published in Towards Data Science

6 min read · Feb 7, 2021

Introduction to ColumnTransformer

ColumnTransformer lets us apply different transforms to different sets of columns with a single fit() or fit_transform() call. For example, we can mean impute the first column and one-hot encode the second column of a data frame in one call. The ColumnTransformer class is imported from the sklearn.compose module.
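The import itself is a single line. The snippet below assumes scikit-learn is installed.

```python
# ColumnTransformer lives in the sklearn.compose module
from sklearn.compose import ColumnTransformer
```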

Let’s dive deep into ColumnTransformer by looking at an example. Consider the data frame below where we have to one hot encode ‘col1’ and ordinal encode ‘col2’.
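A frame along these lines could serve as the running example. The values are made up for illustration and are not the article's original data.

```python
import pandas as pd

# Illustrative data (assumed values): 'col1' is nominal, 'col2' is ordinal
df = pd.DataFrame({
    "col1": ["a", "b", "c", "a"],
    "col2": ["low", "high", "medium", "low"],
})
print(df)
```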

Conventionally, we create an instance of OneHotEncoder class and fit it to the data frame and transform it, as shown below.
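A minimal sketch of the conventional approach, using the same assumed example values. Fitting OneHotEncoder on the whole frame encodes every column it is given.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative data (assumed values)
df = pd.DataFrame({
    "col1": ["a", "b", "c", "a"],
    "col2": ["low", "high", "medium", "low"],
})

encoder = OneHotEncoder()
# The default output is a sparse matrix, so convert it to a dense array for display
encoded = encoder.fit_transform(df).toarray()

print(encoder.categories_)  # one category array per input column
print(encoded.shape)        # 3 categories in col1 + 3 categories in col2 = 6 columns
```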

The above step one-hot encoded both columns of the data frame, which is not what we wanted. ColumnTransformer solves this problem. Let’s create a data frame and transform its columns using ColumnTransformer.
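One way this could look is sketched below. The data values, the transform names and the ordering low < medium < high are assumptions for illustration. Setting sparse_threshold=0 is a convenience that forces a dense array so the result is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Illustrative data (assumed values)
df = pd.DataFrame({
    "col1": ["a", "b", "c", "a"],
    "col2": ["low", "high", "medium", "low"],
})

col_transform = ColumnTransformer(
    transformers=[
        # (name, transform, columns)
        ("one_hot_col1", OneHotEncoder(), ["col1"]),
        ("ordinal_col2",
         OrdinalEncoder(categories=[["low", "medium", "high"]]),  # assumed ordering
         ["col2"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = col_transform.fit_transform(df)
print(transformed)  # three one-hot columns for col1, then one ordinal column for col2
```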

As seen above, we could perform multiple transforms on multiple columns with a single fit_transform() call. This operation would have been more complex without ColumnTransformer. In the above example, we created an instance of the ColumnTransformer class and passed an argument named ‘transformers’, which is a list of tuples describing the transforms we want to perform.

Each tuple in the list of transforms has three parts:

Name of the transform, which can be any name, but each transform in the list must have a unique name.
The transform itself.
The columns to transform. Pass a list even for a single column, so that the transform receives two-dimensional input (a bare column name selects a one-dimensional column, which most transforms reject).

Let’s work with another example. Consider the data frame below where we have to one hot encode ‘col1’ and keep the ‘col2’ as it is.
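A minimal sketch of this second example, again with assumed values. Only ‘col1’ is listed in the transformers.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative data (assumed values)
df = pd.DataFrame({
    "col1": ["a", "b", "c", "a"],
    "col2": [10, 20, 30, 40],
})

col_transform = ColumnTransformer(
    transformers=[("one_hot_col1", OneHotEncoder(), ["col1"])],
    sparse_threshold=0,  # always return a dense array
)

transformed = col_transform.fit_transform(df)
print(transformed)  # only the three one-hot columns derived from col1
```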

Where is ‘col2’ in the output?

By default, a ColumnTransformer returns only the columns it has transformed. In the previous example, we transformed both columns, so both were returned. In the current example, we transformed only ‘col1’, so only the transformed ‘col1’ is returned. To keep the other columns, we need to pass the argument remainder='passthrough'. The default value of this argument is 'drop', which drops the untransformed columns.
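The same sketch with remainder='passthrough' keeps ‘col2’. The data values are assumed as before. The passed-through columns are appended to the right of the transformed ones.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative data (assumed values)
df = pd.DataFrame({
    "col1": ["a", "b", "c", "a"],
    "col2": [10, 20, 30, 40],
})

col_transform = ColumnTransformer(
    transformers=[("one_hot_col1", OneHotEncoder(), ["col1"])],
    remainder="passthrough",  # keep untransformed columns instead of dropping them
    sparse_threshold=0,       # always return a dense array
)

transformed = col_transform.fit_transform(df)
print(transformed)  # three one-hot columns, followed by the untouched col2
```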

Limitations of ColumnTransformer

There are a few limitations of ColumnTransformer which are discussed below.

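By default, ColumnTransformer outputs an array even if the input is a DataFrame object, which makes it difficult to keep track of the columns (scikit-learn 1.2 and later can return a DataFrame through set_output(transform="pandas")).
In a ColumnTransformer, we cannot chain multiple transforms on a single column, as the following example shows.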

Consider the data frame below, where we have to mode impute and one hot encode ‘col1’ and median impute ‘col2’. Here we are trying to apply multiple transforms to a single column (‘col1’).
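A sketch of this attempt is below, with assumed data values and transform names. The outcome depends on the scikit-learn version: older releases raise a ValueError because OneHotEncoder rejects NaN, while releases from 0.24 onward treat NaN as an extra category. In both cases the encoder sees the original, un-imputed ‘col1’.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Illustrative data (assumed values) with one missing entry per column
df = pd.DataFrame({
    "col1": ["a", "b", np.nan, "a"],
    "col2": [1.0, np.nan, 3.0, 4.0],
})

col_transform = ColumnTransformer(
    transformers=[
        ("mode_col1", SimpleImputer(strategy="most_frequent"), ["col1"]),
        ("one_hot_col1", OneHotEncoder(), ["col1"]),  # receives the ORIGINAL col1
        ("median_col2", SimpleImputer(strategy="median"), ["col2"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

result, error = None, None
try:
    result = col_transform.fit_transform(df)
except ValueError as exc:
    error = exc

if error is not None:
    print("Error:", error)
else:
    # Each transform's output sits side by side: the imputed col1,
    # the one-hot columns of the un-imputed col1, and the imputed col2
    print(result.shape)
```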

We got an error saying ‘Input contains NaN’. Why is ‘NaN’ present even after imputing the ‘col1’ in the first step of the above ColumnTransformer?

This is because ColumnTransformer takes the columns directly from the input data frame/array in each step. The transforms are applied independently and their outputs are placed side by side, so the output of one step is not the input to the next. The one-hot encoding step therefore took ‘col1’ from the input data frame, not from step 1. Let’s look at another example. In the data frame below, we’ll try to ordinal encode and min-max scale ‘col1’ and median impute ‘col2’.
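This attempt can be sketched as follows, with assumed data values and transform names. The try/except is only there so the script prints the error instead of stopping.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Illustrative data (assumed values)
df = pd.DataFrame({
    "col1": ["a", "b", "c", "a"],
    "col2": [1.0, np.nan, 3.0, 4.0],
})

col_transform = ColumnTransformer(
    transformers=[
        ("ordinal_col1", OrdinalEncoder(), ["col1"]),
        ("scale_col1", MinMaxScaler(), ["col1"]),  # receives the raw strings
        ("median_col2", SimpleImputer(strategy="median"), ["col2"]),
    ]
)

error_message = None
try:
    col_transform.fit_transform(df)
except ValueError as exc:
    error_message = str(exc)

print("Error:", error_message)
```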

Again there is an error, this time saying “could not convert string to float: 'a'”, even after adding an ordinal encoder in step 1. To tackle this problem, let’s look at Pipeline.

Introduction to Pipeline

A Pipeline is a sequence of operations in which the output of one step becomes the input to the next step. The Pipeline class is imported from the sklearn.pipeline module.
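The import is again a single line, assuming scikit-learn is installed.

```python
# Pipeline lives in the sklearn.pipeline module
from sklearn.pipeline import Pipeline
```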

In the example below, we’ll first median impute the columns of the data frame and then min-max scale them using Pipeline.
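A minimal sketch of such a pipeline, with assumed data values and step names:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Illustrative numeric data (assumed values) with missing entries
df = pd.DataFrame({
    "col1": [1.0, np.nan, 3.0, 5.0],
    "col2": [10.0, 20.0, np.nan, 40.0],
})

pipe = Pipeline(
    steps=[
        ("median_impute", SimpleImputer(strategy="median")),  # step 1
        ("min_max_scale", MinMaxScaler()),                     # step 2 receives step 1's output
    ]
)

scaled = pipe.fit_transform(df)
print(scaled)  # no NaN left, every column scaled to the range [0, 1]
```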

As seen above, we could perform multiple transforms with a single fit() or fit_transform() call. Unlike ColumnTransformer, a Pipeline is sequential: the output of each step becomes the input to the next step. Similar to ColumnTransformer, Pipeline takes a list of steps. Each step can be a transform, and the final step may also be an estimator such as a classifier. Each step in Pipeline has two parts:

Name of step.
The operation itself.

Without using a Pipeline, the above operation would have been performed using the conventional process as shown below.
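For comparison, the conventional version might look like this, using the same assumed data. Each transform is fitted and applied by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Illustrative numeric data (assumed values) with missing entries
df = pd.DataFrame({
    "col1": [1.0, np.nan, 3.0, 5.0],
    "col2": [10.0, 20.0, np.nan, 40.0],
})

# First fit_transform(): median imputation
imputed = SimpleImputer(strategy="median").fit_transform(df)

# Second fit_transform(): min-max scaling of the imputed values
scaled_manual = MinMaxScaler().fit_transform(imputed)
print(scaled_manual)
```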

The conventional process took two fit_transform() calls to transform the columns, whereas Pipeline brought it down to one.

Limitations of Pipeline

Similar to ColumnTransformer, Pipeline also outputs an array by default.
Unlike ColumnTransformer, Pipeline does not let us specify which columns to transform. Every step is applied to all the columns it receives.

Using ColumnTransformer in conjunction with Pipeline

As seen in one of the examples previously, we couldn’t apply multiple transforms to a single column using ColumnTransformer. We got an error while trying it. Let’s discuss the example once again. Consider a data frame below, where we have to mode impute and one hot encode ‘col1’ and median impute ‘col2’. Here we are trying to apply two transforms to a single column (‘col1’).
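A sketch of the combined approach is below. The names ‘col1_pipe’, ‘mode_col1’, ‘one_hot_encode’ and ‘col_transform’ follow the discussion in this section, while the data values and the name ‘median_col2’ are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative data (assumed values) with one missing entry per column
df = pd.DataFrame({
    "col1": ["a", "b", np.nan, "a"],
    "col2": [1.0, np.nan, 3.0, 4.0],
})

# Sequential steps for col1: mode imputation, then one-hot encoding
col1_pipe = Pipeline(
    steps=[
        ("mode_col1", SimpleImputer(strategy="most_frequent")),
        ("one_hot_encode", OneHotEncoder()),
    ]
)

col_transform = ColumnTransformer(
    transformers=[
        ("col1_pipe", col1_pipe, ["col1"]),  # the whole pipeline acts as one transform
        ("median_col2", SimpleImputer(strategy="median"), ["col2"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = col_transform.fit_transform(df)
print(transformed)  # two one-hot columns for col1, then the imputed col2
```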

What happened above?

As discussed above, in the pipeline (‘col1_pipe’), the output of the first step (‘mode_col1’) became the input to the second step (‘one_hot_encode’). We then passed the pipeline as a transform to ColumnTransformer (‘col_transform’), where this sequence of steps is applied to ‘col1’ and median imputation is applied to ‘col2’.

Thus, using ColumnTransformer in conjunction with Pipeline simplifies both the model development and deployment process and also reduces the size of the code.

Know more about my work at https://ksvmuralidhar.in/