A pipeline is just a series of steps you perform on data in `sklearn`. (The `sklearn` guide to them is here.)

A "typical" pipeline in ML projects:

- Preprocesses the data to clean and transform variables
- Possibly selects a subset of variables from among the features to avoid overfitting (see also this)
- Runs a model on those cleaned variables
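A minimal sketch of those three steps as one pipeline (`SelectKBest` is used here purely as an illustrative feature-selection option; the examples below skip that step):

```python
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge

# preprocess -> select a subset of features -> estimate a model
sketch_pipe = make_pipeline(
    SimpleImputer(),                  # clean/transform variables
    SelectKBest(f_regression, k=5),   # keep the 5 "best" features
    Ridge(1.0),                       # run a model on the result
)
```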

```{tip}
You can set up pipelines with `make_pipeline`.
```

```{margin}
<img src="https://media.giphy.com/media/k5b6fkFnSA3yo/source.gif" alt="Mario" style="width:200px;">
```

For example, here is a simple pipeline:

In [1]:

```
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
ridge_pipe = make_pipeline(SimpleImputer(),Ridge(1.0))
```

You put a series of steps inside `make_pipeline`, separated by commas.

The pipeline object (printed out below) is a list of steps, where each step has a name (e.g. "simpleimputer") and a task associated with that name (e.g. "SimpleImputer()").

In [2]:

```
ridge_pipe
```


```{tip}
You can `.fit()` and `.predict()` pipelines like any model, and they can be used in `cross_validate` too!
```
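The auto-generated step names are handy if you want to grab a piece of the pipeline later. A quick sketch (the names come from lowercasing the class names):

```python
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge

ridge_pipe = make_pipeline(SimpleImputer(), Ridge(1.0))

# .steps is a list of (name, estimator) tuples
print([name for name, est in ridge_pipe.steps])  # ['simpleimputer', 'ridge']

# .named_steps lets you pull out one step by its name
print(ridge_pipe.named_steps['ridge'].alpha)     # 1.0
```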

Using it is the same as using any estimator! After I load the data we've been using on the last two pages (code below), we can fit and predict like on the "one model intro" page:

In [3]:

```
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_validate

url = 'https://github.com/LeDataSciFi/ledatascifi-2021/blob/main/data/Fannie_Mae_Plus_Data.gzip?raw=true'
fannie_mae = pd.read_csv(url, compression='gzip').dropna()
y = fannie_mae.Original_Interest_Rate
fannie_mae = (fannie_mae
              .assign(l_credscore=np.log(fannie_mae['Borrower_Credit_Score_at_Origination']),
                      l_LTV=np.log(fannie_mae['Original_LTV_(OLTV)']),
                      )
              .iloc[:, -11:]
              )

rng = np.random.RandomState(0)  # this helps us control the randomness so we can reproduce results exactly
X_train, X_test, y_train, y_test = train_test_split(fannie_mae, y, random_state=rng)
```

In [4]:

```
ridge_pipe.fit(X_train,y_train)
ridge_pipe.predict(X_test)
```


Those are the same numbers as before - good!

And we can use this pipeline in the cross-validation method just like before:

In [5]:

```
cross_validate(ridge_pipe, X_train, y_train,
               cv=KFold(5), scoring='r2')['test_score'].mean()
```


```{warning}
(Virtually) All preprocessing should be done in the pipeline!
```

This is the link you should start with to see how you might clean and preprocess data. Key preprocessing steps include

- Filling in missing values (imputation) or dropping those observations
- Standardization
- Encoding categorical data
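Each of those steps has a corresponding `sklearn` transformer. A quick sketch on a made-up toy frame (the column names here are just for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

toy = pd.DataFrame({'income': [50.0, np.nan, 70.0],
                    'state':  ['PA', 'NY', 'PA']})

# imputation: fill missing values (the default strategy is the column mean)
imputed = SimpleImputer().fit_transform(toy[['income']])

# standardization: rescale to mean 0, std 1
scaled = StandardScaler().fit_transform(imputed)

# encoding: turn a categorical column into dummy variables
dummies = OneHotEncoder().fit_transform(toy[['state']]).toarray()
```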

With real-world data, you'll have many data types. So the preprocessing steps you apply to one column won't necessarily be what the next column needs.

I use `ColumnTransformer` to assemble the preprocessing portion of my full pipeline; it allows me to process different variables differently.

**The generic steps to preprocess in a pipeline:**

- Set up a pipeline for numerical data
- Set up a pipeline for categorical variables
- Set up the `ColumnTransformer`:
  - `ColumnTransformer()` is a function, so it needs "()"
  - Its first argument is a list, so now it is `ColumnTransformer([])`
  - Each element in that list is a tuple that has three parts:
    - the name of the step (you decide the name),
    - the estimator/pipeline to use in that step,
    - and which variables to use it on
  - **Put the pipeline for each variable type as its own tuple inside** `ColumnTransformer([<here!>])`
- Use the `ColumnTransformer` you set up as the first step inside your glorious estimation pipeline.

So, let me put this together:

```{tip}
This is good pseudocode!
```

In [6]:

```
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector

#############
# Step 1: how to deal with numerical vars
# pro-tip: you might set up several numeric pipelines, because
# some variables might need very different treatment!
#############

numer_pipe = make_pipeline(SimpleImputer())
# this deals with missing values (somehow?)
# you might also standardize the vars in this numer_pipe

#############
# Step 2: how to deal with categorical vars
#############

cat_pipe = make_pipeline(OneHotEncoder(drop='first'))
# notes on this cat pipe:
# OneHotEncoder is just one way to deal with categorical vars
# drop='first' is necessary if the model is regression

#############
# Step 3: combine the subparts
#############

preproc_pipe = ColumnTransformer(
    [  # arg 1 of ColumnTransformer is a list, so this starts the list
        # a tuple for the numerical vars: name, pipe, which vars to apply to
        ("num_impute", numer_pipe, ['l_credscore', 'TCMR']),
        # a tuple for the categorical vars: name, pipe, which vars to apply to
        ("cat_trans", cat_pipe, ['Property_state']),
    ],
    remainder='drop'  # you either drop or passthrough any vars not modified above
)

#############
# Step 4: put the preprocessing into an estimation pipeline
#############

new_ridge_pipe = make_pipeline(preproc_pipe, Ridge(1.0))
```
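One aside: `make_column_selector` is imported above but not used. Instead of hard-coding column names, it can pick columns by dtype, which is handy when you have many columns. A sketch of that alternative (same pipes, different selection):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector

numer_pipe = make_pipeline(SimpleImputer())
cat_pipe = make_pipeline(OneHotEncoder(drop='first'))

# select columns by dtype instead of listing them by name
preproc_by_dtype = ColumnTransformer(
    [("num", numer_pipe, make_column_selector(dtype_include=np.number)),
     ("cat", cat_pipe, make_column_selector(dtype_include=object))],
    remainder='drop')
```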

The data loaded above has no categorical vars, so I'm going to reload the data and keep different variables:

- `'TCMR'` and `'l_credscore'` are numerical
- `'Property_state'` is categorical
- `'l_LTV'` is in the data, but should be dropped (because of `remainder='drop'`)

So here is the raw data:

In [7]:

```
url = 'https://github.com/LeDataSciFi/ledatascifi-2021/blob/main/data/Fannie_Mae_Plus_Data.gzip?raw=true'
fannie_mae = pd.read_csv(url, compression='gzip').dropna()
y = fannie_mae.Original_Interest_Rate
fannie_mae = (fannie_mae
              .assign(l_credscore=np.log(fannie_mae['Borrower_Credit_Score_at_Origination']),
                      l_LTV=np.log(fannie_mae['Original_LTV_(OLTV)']),
                      )
              [['TCMR', 'Property_state', 'l_credscore', 'l_LTV']]
              )

rng = np.random.RandomState(0)  # this helps us control the randomness so we can reproduce results exactly
X_train, X_test, y_train, y_test = train_test_split(fannie_mae, y, random_state=rng)
display(X_train.head())
display(X_train.describe().T.round(2))
```

We could `.fit()` and `.transform()` using the `preproc_pipe` from step 3 (or just `.fit_transform()` to do it in one command) to see how it transforms the data. But the output is tough to use:

In [8]:

```
preproc_pipe.fit_transform(X_train)
```


So I added a convenience function (`df_after_transform`) to the community codebook to show the dataframe after the ColumnTransformer step.

Notice:

- The `l_LTV` column is gone!
- The property state variable is now 50+ variables (one dummy for each state, and a few territories)
- The numerical variables aren't changed (there are no missing values, so the imputation does nothing)

This is the transformed data:

In [9]:

```
from df_after_transform import df_after_transform
df_after_transform(preproc_pipe,X_train)
```


In [10]:

```
display(df_after_transform(preproc_pipe, X_train)
        .describe().T.round(2)
        .iloc[:7, :])  # only show a few variables for space...
```

- Using pipes is the same as using any model: `.fit()` and `.predict()`, and put them into cross-validation
- When modeling, you should spend time interrogating model predictions, plotting and printing. Does the model struggle to predict certain observations? Does it excel at some?
- You'll want to tweak parts of your pipeline. The next pages cover how we can do that.
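As a tiny preview of that tweaking: every parameter inside a pipeline is addressable as `<stepname>__<parameter>`, so you can change one piece without rebuilding the whole thing. A sketch:

```python
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge

pipe = make_pipeline(SimpleImputer(), Ridge(1.0))

# address a step's parameter as <stepname>__<parameter>
pipe.set_params(ridge__alpha=10)
print(pipe.named_steps['ridge'].alpha)  # 10
```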