Practical-1

  AIM:  Data preprocessing using the scikit-learn Python library.

Theory: 

To analyze our data and extract insights from it, we need to process the data before we start building our machine learning model, i.e. we need to convert the data into a form the model can understand, since machines cannot directly work with raw inputs such as images, audio, or free text.

Data is processed into an efficient format that the algorithm can easily interpret, so that it produces the required output accurately.

Real-world data is rarely perfect: it is often incomplete, inconsistent (with outliers and noisy values), and unstructured. Preprocessing the raw data helps to organize, scale, clean (remove outliers), and standardize it, i.e. simplify it before feeding it to the machine learning algorithm.

The process of data preprocessing involves a few steps:

  • Data cleaning:  the data we use may have missing values (rows or columns that do not contain any value) or noisy data (irrelevant data that is difficult for the machine to interpret). To handle this, we can delete the empty rows and columns or fill them with suitable values, and we can use methods such as regression and clustering to deal with noisy data.
  • Data transformation:  the process of converting the raw data into a format suitable for the model. It may include steps such as categorical encoding, scaling, normalization, and standardization.
  • Data reduction:  this helps to reduce the size of the data we are working with (for easier analysis) while maintaining the integrity of the original data. A small sketch of all three steps is shown below.
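As a rough illustration of these three steps (the toy numbers below are made up for the example, not taken from the Loan Prediction data), scikit-learn offers SimpleImputer for filling missing values, StandardScaler for transformation, and PCA for reduction:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Toy numeric data with a missing value
data = np.array([[1.0, 200.0],
                 [2.0, np.nan],
                 [3.0, 600.0]])

# Data cleaning: fill the missing value with the column mean
data = SimpleImputer(strategy="mean").fit_transform(data)

# Data transformation: standardize each column to zero mean and unit variance
data = StandardScaler().fit_transform(data)

# Data reduction: keep only the first principal component
data = PCA(n_components=1).fit_transform(data)
print(data)
```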


Dataset Description:

For this article, I have used a subset of the Loan Prediction data set (observations with missing values are dropped).

Now, let's get started by importing the required packages and the data set.
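A minimal sketch of this step; the file name "loan_prediction.csv" and the target column "Loan_Status" are assumptions made for the example:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the (already cleaned) subset of the Loan Prediction data set
df = pd.read_csv("loan_prediction.csv")

# Separate the features from the target and create a train/test split
X = df.drop("Loan_Status", axis=1)
y = df["Loan_Status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```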

Let's take a closer look at our data set.
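One way to do that, assuming the pandas DataFrame X_train created in the previous step:

```python
# Inspect the first few rows, the column types, and basic statistics
print(X_train.head())
print(X_train.dtypes)
print(X_train.describe())
```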



Normalization:

Normalization is the process of scaling values into a common range, such as [-1, 1]. This ensures that large values in the data set do not dominate the learning process, so every feature has a similar impact on the model. Normalization is also useful when we want to quantify the similarity of a pair of samples, for example with a dot product.
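A minimal sketch with scikit-learn, using toy numbers made up for the example: MinMaxScaler rescales each feature into a chosen range such as [-1, 1], while Normalizer rescales each sample to unit norm so that dot products between samples behave like cosine similarities:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Toy data: two numeric features on very different scales
X = np.array([[1.0, 5000.0],
              [2.0, 300.0],
              [4.0, 8000.0]])

# Scale each feature (column) into the range [-1, 1]
X_minmax = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Rescale each sample (row) to unit L2 norm
X_unit = Normalizer(norm="l2").fit_transform(X)

print(X_minmax)
print(X_unit)
```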


Encoding

Often the features in our data are not continuous values but categories with text labels. For this data to be processed by a machine learning model, these categorical features must be converted into a machine-understandable (numeric) form.
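A small sketch of label encoding with scikit-learn; the column name "Property_Area" is used here only as an illustrative example:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column with text labels
df = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban", "Urban"]})

# Replace the text labels with integer codes
le = LabelEncoder()
df["Property_Area"] = le.fit_transform(df["Property_Area"])

print(df)
print(le.classes_)  # original labels, indexed by their integer codes
```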


All our categorical features are now encoded. You can look at the updated data set using  X_train.head()  and compare the frequency distribution of each feature before and after the encoding. Now that we are done with label encoding, let's run a logistic regression model on the data set with both the categorical and the continuous features.
It works now, but the accuracy is still the same as the one we got with logistic regression on the standardized numeric features alone. This means the categorical features we added are not very significant for our objective function.
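A minimal sketch of that final step, assuming X_train/X_test now contain only numeric (scaled and label-encoded) columns and y_train/y_test hold the loan-approval labels from the earlier split:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit a logistic regression model on the preprocessed features
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```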
