Fictionally Irrelevant.

Know any Dataset in 4 Lines of Python

Harshit Singhai

Data has always been used to empower smarter decision-making. When solving any machine learning problem, the first thing a data scientist does is Exploratory Data Analysis (EDA). It is the first step towards solving any regression or classification problem.

For me, EDA is the most monotonous yet crucial task in the machine learning pipeline. It's important to get the gist of the dataset at hand, visualize its features, and pick the best machine learning model for the job.

At a high level, EDA is the practice of describing data by means of statistical and visualization techniques to bring its important aspects into focus for further analysis. This involves looking at your dataset from many angles, describing it, and summarizing it without making any assumptions about its contents.
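To make "describing and summarizing" concrete, here is a minimal sketch of a few manual EDA steps using plain pandas. The toy DataFrame and its column names are invented for illustration and are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration only.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "attack": [49, 62, 82, None],
    "type": ["grass", "grass", "fire", "fire"],
})

# Count, mean, std, min, quartiles, and max for the numeric columns.
summary = df.describe()

# Number of missing values in each column.
missing = df.isna().sum()

# How often each category appears in a categorical column.
type_counts = df["type"].value_counts()

print(summary)
print(missing)
print(type_counts)
```

Repeating these steps for every column of every new dataset is the monotonous part that a profiling tool can take over.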

Pandas-profiling

https://pypi.org/project/pandas-profiling/

pip install pandas-profiling

For the dataset, I’ve used The Complete Pokemon Dataset from Kaggle. Download the zip file and extract the pokemon.csv file.

Import the libraries, load the CSV file, and write the report.

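import pandas as pd
import pandas_profiling

# Load the extracted CSV file into a DataFrame.
df = pd.read_csv('pokemon.csv')
# Profile the DataFrame and save the report as an HTML file.
pandas_profiling.ProfileReport(df).to_file("pokemon_summary.html")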

The code above generates a pokemon_summary.html file in the current working directory. The generated file contains a complete EDA overview of the dataset.
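If you would rather open the report from Python than hunt for it in a file manager, something like the sketch below might help. It uses only the standard library; the helper name report_uri is hypothetical, and it assumes the report was written to the current working directory under the filename used above.

```python
from pathlib import Path
import webbrowser


def report_uri(filename="pokemon_summary.html"):
    """Return a file:// URI for the report, or None if the file does not exist."""
    path = Path(filename).resolve()
    return path.as_uri() if path.is_file() else None


if __name__ == "__main__":
    uri = report_uri()
    if uri is None:
        print("Report not found; run the profiling script first.")
    else:
        # Open the HTML report in the default web browser.
        webbrowser.open(uri)
```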

That’s it for today. See you soon.