Welcome
Welcome to Lesson 1 of the Data Exploration with Python series! In this tutorial, we will explore a public dataset from the Indian Open Government Data Platform that contains Country/Mission-wise OCI details. I am using some of these less explored datasets to mimic a typical work situation, where the data you are given comes from a new domain or holds information you have not seen before.
We will cover the lesson through the steps below, introducing you to some basic commands/functions to analyze the dataset:
- Import the required Python libraries
- Read the dataset into a dataframe. Since it is a small file, the dataset is already saved to my GitHub repository.
- Analyze the dataset through various commands/functions
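The three steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the inline CSV text, its column names, and its values are invented stand-ins for the OCI file, so the snippet runs without downloading anything.

```python
import io

import pandas as pd

# Hypothetical stand-in for the OCI file: column names and values are
# invented for illustration and do not come from the real dataset.
csv_text = """Country,Mission,Date,OCI_Issued
USA,New York,01-04-2019,120
USA,Houston,01-04-2019,80
UK,London,01-05-2019,95
Canada,Toronto,01-05-2019,60
"""

# Step 2: read the CSV into a dataframe (read_csv also accepts a file path or URL)
df = pd.read_csv(io.StringIO(csv_text))

# Step 3: take a first look at the data
print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names
print(df.head())            # first 5 rows by default
```

With the real file, you would pass its path or raw URL to `read_csv` in place of the `io.StringIO` object.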
You may use either of these options to run through the lesson:
1. Click on the launch-binder button below. You will get an interactive, live JupyterLab notebook created directly from my GitHub repo, so you can start playing around with the data right away!
Note: It takes about 1-2 minutes for the notebook to be ready, since it prepares the entire environment on a server. So, hang in there!
2. Set up the whole environment on your laptop. You may do so by creating a virtual environment, installing the Python libraries needed, and cloning my GitHub repo from https://github.com/hgante/telestreak.git. My notes from How I setup JupyterLab for Data Exploration in Python will come in handy!
I have also embedded the entire Jupyter notebook in a GitHub gist below, for your reference.
Summary
In conclusion, we used some basic commands and functions from the pandas library to read a public dataset containing OCI details. We also performed some basic data transformations and analysis to build familiarity with this dataset.
Data transformations
In addition to the basic commands, I have demonstrated two basic data transformations:
- renaming a dataframe column
- changing the datatype of a column from object to date
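As a rough sketch of these two transformations, assuming a hypothetical dataframe whose column names, values, and day-month-year date format are invented for illustration (the real dataset may differ):

```python
import pandas as pd

# Hypothetical data: names, values, and date format are assumptions
df = pd.DataFrame({
    "Country": ["USA", "UK"],
    "Date": ["01-04-2019", "01-05-2019"],
    "OCI Issued": [120, 95],
})

# Rename a column; rename returns a new dataframe unless inplace=True is used
df = df.rename(columns={"OCI Issued": "oci_issued"})

# Convert the text dates to datetime; format is given explicitly as day-month-year
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")

print(df.dtypes)
```

Passing an explicit `format` avoids ambiguity between day-first and month-first dates.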
Functions for data exploration
Here’s a list of the functions I have used in the tutorial. I highly encourage you to read the documentation for any function or command to learn about its functionality and parameters. In a Jupyter notebook, you can do this by placing your cursor on the function name and hitting Shift + Tab.
Function | Description |
read_csv | Reads the .csv file saved in this repository into a pandas dataframe |
shape | Returns the number of rows and columns in the dataset |
columns | Lists the column names |
head | Displays the first 5 rows of the dataset (the default) |
min | Returns the minimum of the values over the requested axis |
max | Returns the maximum of the values over the requested axis |
sum | Returns the sum of the values over the requested axis |
mean | Returns the mean of the values over the requested axis |
describe | Generates summary statistics (count, mean, standard deviation, min, quartiles, max) for the dataset |
date_range | Generates a fixed-frequency range of dates, useful when selecting data within a date range |
nunique | Counts the distinct entries along a particular axis |
groupby | Groups rows by one or more keys so that each group can be aggregated |
sort_values | Sorts the values in ascending or descending order |
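To see several of these functions together, here is a small sketch on a hypothetical dataframe. The column names and values are invented for illustration and are not taken from the OCI dataset.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "country": ["USA", "USA", "UK", "Canada"],
    "mission": ["New York", "Houston", "London", "Toronto"],
    "oci_issued": [120, 80, 95, 60],
})

lowest = df["oci_issued"].min()        # smallest value in the column
highest = df["oci_issued"].max()       # largest value in the column
total = df["oci_issued"].sum()         # sum of the column
average = df["oci_issued"].mean()      # arithmetic mean of the column
n_countries = df["country"].nunique()  # number of distinct countries

# Total per country, largest first
by_country = (
    df.groupby("country")["oci_issued"]
    .sum()
    .sort_values(ascending=False)
)

# Three month-start dates beginning January 2019
months = pd.date_range(start="2019-01-01", periods=3, freq="MS")

print(df.describe())  # summary statistics for the numeric columns
print(by_country)
print(months)
```

Chaining `groupby`, `sum`, and `sort_values` like this is a common way to rank categories by a total.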
To sum up, with these basic commands you should be able to read most CSV files into a dataframe and start exploring them for deeper insights.
Most importantly, I hope you enjoyed this tutorial. We welcome comments and examples on the dataset you’ve just explored!