Practical-2

 

Feature Selection/Elimination using Scikit-learn in Python


What is feature selection?

Feature selection is one of the most significant steps in machine learning. It is the process of narrowing down the set of input features to a subset that is actually used, without sacrificing the predictive information the data carries.
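
As a minimal sketch of what this looks like with scikit-learn, the example below uses SelectKBest to keep the two features of the built-in Iris dataset with the highest univariate ANOVA F-scores. The choice of dataset, scoring function, and k here is purely illustrative.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Load a small built-in dataset: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-score against the target.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape)                  # (150, 4)
print(X_selected.shape)         # (150, 2)
print(selector.get_support())   # boolean mask of the features that were kept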

Why feature selection?


1. It improves model performance: irrelevant features act as noise in your data, which makes machine learning models perform poorly.

2. It leads to faster machine learning models.

3. It prevents overfitting: if the data has more columns than rows, a model can fit the training data perfectly, yet fail to generalize to new samples, and so it learns nothing useful.

4. Removing garbage: much of the time a dataset contains non-informative features, such as name or ID variables. Poor-quality input will generate poor-quality output (see the sketch after this list).
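
As a short sketch of point 4, assuming a made-up toy table with an ID column and a constant column: ID-like variables can simply be dropped by name, and scikit-learn's VarianceThreshold removes zero-variance (constant) features automatically.

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# A made-up toy table: "id" is a row identifier, "country" is constant.
df = pd.DataFrame({
    "id":      [101, 102, 103, 104],
    "age":     [25, 32, 47, 51],
    "country": [1, 1, 1, 1],
    "income":  [40, 55, 80, 90],
})

X = df.drop(columns=["id"])       # ID variables carry no predictive signal
selector = VarianceThreshold()    # default threshold 0.0 drops constant columns
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)            # (4, 2) -- "country" was removed as well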


Advantages:

  1. The main benefit of feature selection is that it reduces overfitting. By removing extraneous data, it allows the model to focus only on the important features of the data, and not get hung up on features that don’t matter. 

  2. Another benefit of removing irrelevant information is that it improves the accuracy of the model’s predictions. It also reduces the computation time required to train the model. 

  3. Finally, having a smaller number of features makes your model more interpretable and easier to understand. Overall, feature selection is key to being able to predict values with any degree of accuracy.

Disadvantages:
  • A feature that is not useful by itself can be very useful when combined with others, and filter methods, which score each feature independently, can miss it.
      o Example 1: the phrase “data mining” can be very predictive in document classification, while each individual term (“data”, “mining”) is not.
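
A small illustration of this weakness, using a synthetic XOR-style target built purely for demonstration: a univariate filter such as f_classif scores each feature as useless on its own, while a model that sees both features together predicts the target almost perfectly.

import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 2)).astype(float)
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)   # target depends on BOTH features

F, p_values = f_classif(X, y)
print(p_values)           # large p-values: each feature alone looks uninformative

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.score(X, y))   # ~1.0: together, the two features determine y exactly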





 

 
