
Introduction To Machine Learning Pipeline

Tanav Bajaj

--

Prerequisite: Basic knowledge of what ML is.

In this article, I will give an overview of the ML pipeline.

The Wikipedia definition of machine learning is: "Machine learning is the study of computer algorithms that can improve automatically through experience and by the use of data."

But what does that really mean? Machine Learning means using data to make predictions, or more broadly, using data to extract knowledge.

Now, when we work with data, we follow these steps:

  1. Import Data
  2. Clean Data
  3. Split the Data into Training/Testing sets
  4. Create a model
  5. Train the model
  6. Make Predictions
  7. Evaluate and Improve

I’ll give a brief intro of these steps right now and go into more detail in the upcoming articles.

ML Workflow

Step 1 Importing Data

Data is usually available in one of four major formats: CSV, JSON, SQLite, and BigQuery.

For beginners, CSV files are preferred as they are a more organised form of data. These are comma-separated files where each record consists of data and its attributes. Pandas (a Python library) is used to read a dataset’s .csv files.

import pandas as pd

df = pd.read_csv(r'Path where the CSV file is stored\File name.csv')
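Once the file is loaded, a quick look at it helps confirm it was read correctly. A minimal sketch, assuming df is the DataFrame loaded above:

print(df.head())    # first five rows
print(df.shape)     # number of rows and columns
print(df.dtypes)    # data type of each column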

If you wish to read more, check out this article on how to work with JSON in Python: https://towardsdatascience.com/working-with-json-in-python-a53c3b88cc0

Step 2 Cleaning the Data

This varies from dataset to dataset. Some simpler datasets might need minimal cleaning, such as filling missing entries with NULL values, but others might need a lot of work.

For example, I worked on a COVID dataset where all I needed to do was drop the unnecessary columns, fill empty entries with NULL, and sort the data a little.

But bigger datasets may require adding proper indexing, filtering on keywords, removing duplicate data, fixing syntax errors (such as changing the date format so that it can easily be visualised), filtering out unwanted outliers, and at times fixing typos.

To be successful in this step, a good understanding of the data is needed. To gain that understanding, we can plot graphs and charts and see what trends our eyes can find. Python libraries such as Matplotlib, Plotly, or Seaborn can be used for this.
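As a rough sketch of what this step can look like in pandas (the column names below are hypothetical and only for illustration):

import pandas as pd
import matplotlib.pyplot as plt

df = df.drop(columns=['Unnecessary Column'])   # drop columns we don't need
df = df.drop_duplicates()                      # remove duplicate records
df['Cases'] = df['Cases'].fillna(0)            # fill missing entries
df['Date'] = pd.to_datetime(df['Date'])        # fix the date format for easier visualisation

df.plot(x='Date', y='Cases')                   # quick visual check for trends
plt.show()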

Step 3 Split the Data into Training/Testing sets

The train-test split is a technique for evaluating the performance of a machine learning algorithm. The procedure involves taking a dataset and dividing it into two subsets.

We’ll do this using the Scikit-Learn library.

from sklearn.model_selection import train_test_split

# X holds the feature (independent) columns taken from df, y the target (dependent) column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

We use this code on the dataset df we imported and cleaned above. The data gets split in an 80/20 ratio (as set by the test_size=0.2 value at the end of the code). Here X holds the independent (feature) variables taken from df, and y holds the dependent (target) variable. Such a split is commonly seen in regression analysis (I will be covering this in future articles).

This method isn’t perfect, because random splitting can lead to overfitting or underfitting. To avoid this, cross-validation methods are used. There are many cross-validation methods, the most used being K-Fold Cross-Validation and Leave-One-Out Cross-Validation.

from sklearn.model_selection import KFold

from sklearn.model_selection import LeaveOneOut

The Sklearn library is used for these as well, and more detail can be found in its documentation.
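As a rough illustration, K-Fold Cross-Validation is usually wired up like this (the LinearRegression model here is just a placeholder for the example):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

model = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=42)     # 5 folds, shuffled before splitting
scores = cross_val_score(model, X_train, y_train, cv=kf)  # one score per fold
print(scores.mean())                                      # average performance across folds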

Step 4 Creating a model

Now the data is ready to be played around with using an ML Model.

Machine learning algorithms can be broadly categorised into one of three types:

  1. Supervised learning — is a machine learning task that establishes the mathematical relationship between input X and output Y variables. Such X, Y pairs constitute the labelled data used for model building, in an effort to learn how to predict the output from the input.
  2. Unsupervised learning — is a machine learning task that makes use of only the input X variables. Such X variables are unlabeled data that the learning algorithm uses in modelling the inherent structure of the data.
  3. Reinforcement learning — is a machine learning task that decides on the next course of action and it does this by learning through trial and error in an effort to maximize the reward.

Our job here is to find which model will be the perfect fit for our data. For example, in the case of supervised learning, we would need to check whether it is a classification problem or a regression problem and then go into depth on which regression/classification algorithm to use.
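For instance, if the data calls for a classification model, a simple starting point could look like the sketch below (LogisticRegression is just one possible choice, used here for illustration):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)   # a simple baseline classifier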

Once we are ready we move on to the next step.

Step 5 Training a model

In this step, we use our training data (in the case of supervised learning) and try to get our desired result. We look at the result we get and then adjust the weights and biases to minimize loss (the penalty for a bad prediction; an ideal model would have 0 loss). So the goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.

This process is called empirical risk minimization.
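With Scikit-Learn, this loop of adjusting weights and biases is handled inside a single call. A minimal sketch, continuing the hypothetical model and split from the earlier steps:

model.fit(X_train, y_train)   # find parameters with low loss on the training set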

Step 6 Make Predictions

In regular programming we get an output; in machine learning, we get a prediction. It is, for all intents and purposes, the same thing. The word prediction is somewhat misleading: in time series analysis, where we are attempting to “tell the future”, it is accurate, but in most other cases it simply means the desired output. For example, a marketing model for a company shows what the best course of action would be, and a model for a bank shows whether or not a transaction was a scam.
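Continuing the same hypothetical example, making predictions is a single call on the held-out test data:

y_pred = model.predict(X_test)   # predictions for data the model has never seen
print(y_pred[:5])                # peek at the first few predictions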

Step 7 Evaluate and Improve

Once we get the prediction, we check whether it is our desired outcome using parameters such as accuracy, precision and recall (in supervised learning).

Accuracy is defined as the percentage of correct predictions for the test data.

Precision is defined as the fraction of relevant examples (true positives) among all of the examples which were predicted to belong in a certain class.

Recall is defined as the fraction of examples that were predicted to belong to a class with respect to all of the examples that truly belong in the class.

Otherwise, depending on the situation, various other parameters are also checked using statistics and visualisations such as the confusion matrix, the receiver operating characteristic (ROC) curve, cluster distortion, and mean squared error (MSE).
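A rough sketch of how these metrics can be computed with Scikit-Learn for a binary classification model, continuing the hypothetical example above:

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

print(accuracy_score(y_test, y_pred))     # fraction of correct predictions
print(precision_score(y_test, y_pred))    # true positives among predicted positives
print(recall_score(y_test, y_pred))       # true positives among actual positives
print(confusion_matrix(y_test, y_pred))   # full breakdown of correct and incorrect predictions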

Once we get all this information we repeat whichever above step is necessary to make our model closer to the desired outcome.

Note: No machine learning model will provide 100% accuracy; if it does, it might be a case of overfitting.

Below is a schematic representation of working with the Penguins dataset. Here, quantitative inputs like bill length, bill depth, flipper length and body mass are taken along with qualitative inputs like sex and island to predict which species the penguin belongs to. (Detailed Article)

Credit: Data Professor

This was a very broad overview of how to approach a situation that requires machine learning. Stay tuned for more articles in this series.


Tanav Bajaj

Caffeine-fueled Prompt Engineer who can say "Hello World!" and train ML models like it's nobody's business!