CodeNewbie Community 🌱

Neha


Data Analysis in Python using Jupyter Notebook - Part 1

In this tutorial we will be using Python to conduct basic data analysis on a World Population data set.

Preliminary Work

We will use Jupyter Notebook, which lets you run code one cell at a time as you go. Jupyter Notebook can be installed by typing the following in your terminal:

pip install notebook

Alternatively, Jupyter Notebook can be obtained by downloading Anaconda, which provides many preinstalled libraries, including pandas and Matplotlib, both of which this tutorial uses.
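Whichever route you take, it can help to confirm that the packages used later in this tutorial are available. The following is a minimal sketch, not an official check: the list of package names is an assumption based on the imports used further down.

```python
import importlib.util

# Packages this tutorial relies on (assumed list); 'notebook' is Jupyter Notebook itself.
required = ["notebook", "pandas", "matplotlib"]

# find_spec returns None when a top-level package cannot be found.
status = {name: importlib.util.find_spec(name) is not None for name in required}

for name, installed in status.items():
    if installed:
        print(f"{name}: installed")
    else:
        print(f"{name}: missing (try: pip install {name})")
```

Any package reported as missing can be installed with pip in the same way as notebook.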

Next, we will obtain the population data set from datahub.io and save it on our machine. For this tutorial, download the csv file of the data set. Be sure to take note of where the data set is saved.

Starting the Notebook

Open the Jupyter Notebook application using:

jupyter notebook

The application opens in your web browser. First, navigate to the folder where you want to store your work. Then, at the top right of the screen, click 'New' and, under the Notebook heading, click 'Python 3'.

Drop down menu to create a new Jupyter Notebook

At the top of the screen, we will see that our notebook by default is named 'Untitled'. We can directly click that name to change it to a more relevant title such as 'Population Data Analysis'.

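To start off, we want to add a heading that describes what we will be doing in the Notebook. We do this by selecting the drop-down and changing the cell type to ‘Heading’. Recent versions of Jupyter Notebook no longer offer a separate ‘Heading’ type; if you don't see it, choose ‘Markdown’ and start the line with # signs as described below.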
Cell type drop down menu in Jupyter Notebook

The number of # signs before the text denotes the heading level. For instance, one # sign denotes heading level 1, ## denotes heading level 2, and so on.

# World Population Data Analysis

Next, we will add a description for our notebook. We can do this by selecting the drop-down and changing the cell type to ‘Markdown’. Text with no # in front of it is displayed as a regular paragraph.

In this tutorial we will be using Python to conduct basic data analysis
on a World Population data set.

We can use a level 2 subheading to label a subsection of our Notebook.

## Preliminary Steps

Heading levels in Jupyter Notebook as demonstrated by # signs leading each line of text

For the next cell we want to add a brief description of what we’re doing next using 'Markdown'. We want to note that we'll import the necessary libraries.

Import libraries needed for data analysis

Now that all our headings and descriptions are set up, we can run the cells and see all the text displayed cleanly.

Jupyter Notebook headings and markdown text

In the next section of this tutorial, we will begin writing code.

Examining the Data Set using Python

Now we'll import the necessary libraries. Matplotlib and pandas are common libraries used for data analysis. They will allow us to create plots and extract useful information from our data. As we import these libraries, we give them shorthand names such as 'plt' and 'pd' so when we need to call functions from them, we won't need to type out a long name each time. This can be done by using 'as' after importing. After adding a new cell, the type should be changed to ‘Code’.

import matplotlib.pyplot as plt
import pandas as pd

Make sure to run each cell as you go. Later cells depend on the imports and variables defined in earlier ones, so skipping a cell will cause errors further down.

Next, we’ll read in the csv file and display some rows. Make sure the file path is the location where population_csv.csv is saved on your machine. Continue to add comments using markdown as you go at your own discretion. I'm using the name 'popData' to label the data set.

popData = pd.read_csv('../Downloads/population_csv.csv')
popData.head()
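If you want to try the read-and-preview step before the download is in place, here is a small self-contained sketch. It reads from an in-memory string instead of a file. The rows are placeholders invented for illustration, not real population figures, and the column names are assumed to match the ones used later in this tutorial.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: placeholder rows, not real data.
csv_text = """Country Name,Country Code,Year,Value
Country A,AAA,2000,100
Country A,AAA,2001,110
Country B,BBB,2000,200
"""

# read_csv accepts a file path or a file-like object such as StringIO.
demo = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; with no argument it returns the first five.
preview = demo.head(2)
print(preview)
```

Note the parentheses after head: it is a method, so it has to be called to return the rows.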

Then we will run some code to get a basic idea of the data set we’ve obtained. The following line returns the dimensions of the data set.

popData.shape

The line below prints a summary of the data set: each column's name, its number of non-null values, and its data type.

popData.info()

The next line provides basic statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns.

popData.describe()
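To see what these three calls report on something small enough to check by eye, here is a minimal sketch using a hand-made DataFrame. The values are placeholders invented for illustration, not real population figures, and the name sample is hypothetical.

```python
import pandas as pd

# Placeholder rows invented for illustration; not real population figures.
sample = pd.DataFrame({
    "Country Name": ["Country A", "Country A", "Country B", "Country B"],
    "Country Code": ["AAA", "AAA", "BBB", "BBB"],
    "Year": [2000, 2001, 2000, 2001],
    "Value": [100, 110, 200, 220],
})

# shape is an attribute (no parentheses): (number of rows, number of columns).
rows, cols = sample.shape
print(rows, cols)

# info() prints column names, non-null counts, and data types.
sample.info()

# describe() summarizes the numeric columns only (here: Year and Value).
stats = sample.describe()
print(stats.loc["mean", "Value"])
```

With four rows and four columns, shape is (4, 4), and the mean of the placeholder values is 157.5.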

If we want to select only certain columns for display, we can do that with the following. In this case we are displaying the 'Country Code' and population 'Value' columns.

popData[['Country Code', 'Value']]
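The double square brackets matter here. The sketch below, again on invented placeholder values with a hypothetical frame named sample, shows the difference between passing a list of column names and passing a single name.

```python
import pandas as pd

# Placeholder rows invented for illustration; not real population figures.
sample = pd.DataFrame({
    "Country Name": ["Country A", "Country B"],
    "Country Code": ["AAA", "BBB"],
    "Year": [2000, 2000],
    "Value": [100, 200],
})

# A list of column names (double brackets) returns a DataFrame.
subset = sample[["Country Code", "Value"]]

# A single column name (single brackets) returns a Series.
values = sample["Value"]

print(type(subset).__name__)
print(type(values).__name__)
```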

Creating a Line Plot

To plot the data for a specific country, we'll need to create a table containing only that country's data. A table is known as a DataFrame in pandas. The data set lists country after country, and we want only the data for Zimbabwe. We can keep just the rows whose 'Country Name' is Zimbabwe with this line of code.

zwe = popData.loc[popData['Country Name'] == 'Zimbabwe']

To display the DataFrame created, simply type:

zwe
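The filter works in two steps: the comparison produces a column of True/False values, and loc keeps the rows marked True. Here is a minimal sketch of that idea on invented placeholder rows. The names sample, mask, and country_a are hypothetical.

```python
import pandas as pd

# Placeholder rows invented for illustration; not real population figures.
sample = pd.DataFrame({
    "Country Name": ["Country A", "Country A", "Country B", "Country B"],
    "Year": [2000, 2001, 2000, 2001],
    "Value": [100, 110, 200, 220],
})

# Step 1: compare every row's country name, giving a True/False Series.
mask = sample["Country Name"] == "Country A"
print(mask.tolist())

# Step 2: keep only the rows where the mask is True.
country_a = sample.loc[mask]
print(country_a)
```

The filtered DataFrame keeps the original row labels, which is why the index of a filtered table from the full data set will not start at 0.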

Now we will use a line graph to plot the population of Zimbabwe. We will plot the population 'Value' against 'Year'. Here's how we can do this:

zwe.plot('Year', 'Value')

The graph looks great, but it's not very descriptive. We can add labels to the graph to make sure our data is clear for anyone who looks at it.

zwe.plot('Year', 'Value')
ylab = 'Population'
title = 'Zimbabwe Population'
plt.ylabel(ylab)
plt.title(title)

Line plot of Zimbabwe population over time in years
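As a variation on the labelling step, the plot call returns a Matplotlib Axes object, so the labels can also be set on that object directly. This is a sketch under assumptions rather than the approach used above: the data are invented placeholders, the names country and ax are hypothetical, and the Agg backend line is only there so the code runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import matplotlib.pyplot as plt
import pandas as pd

# Placeholder rows invented for illustration; not real population figures.
country = pd.DataFrame({
    "Year": [2000, 2001, 2002],
    "Value": [100, 110, 120],
})

# DataFrame.plot returns the Axes it drew on.
ax = country.plot(x="Year", y="Value", legend=False)

# Set the labels on that Axes instead of going through plt.
ax.set_ylabel("Population")
ax.set_title("Country A Population")

plt.close(ax.get_figure())  # release the figure when running as a script
```

Working with the returned Axes becomes handy once a notebook contains several plots, because each label is attached to a specific plot rather than to whichever one was drawn last.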

The expected code execution is located here for reference.

Up Next

Now you know how to create a Jupyter Notebook, read in a CSV file, and perform basic data analysis. In the upcoming installments, you'll learn how to write more streamlined code, more ways to filter data, and how to plot multiple lines on the same graph!
