CodeNewbie Community 🌱

Neha


Data Analysis in Python using Jupyter Notebook - Part 2

In this tutorial we will expand on our knowledge and learn how to plot multiple lines on the same graph, streamline our code, and create bar graphs.

Preliminary Steps

If you have not worked through Part 1, obtain the population data set and run the following so that the code in this tutorial works. Adjust the file path to match where you saved the file:

import matplotlib.pyplot as plt
import pandas as pd
popData = pd.read_csv('../Downloads/population_csv.csv')

Explore the data

To further understand our data set and know the unique countries/regions in the data, we'll use the following line of code. This will give us a table with only one country/region per row so we can see all the places listed.

popData.drop_duplicates(subset = "Country Name")

We can see there are 263 unique parts of the world listed after running the above code.
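If we only need the count, or the list of names, a sketch like the one below may help. It uses a tiny made-up table called toy (a hypothetical stand-in for popData with the same column names; the values are placeholders, not real figures).

```python
import pandas as pd

# Toy stand-in for popData; the values are placeholders, not real figures.
toy = pd.DataFrame({
    "Country Name": ["Spain", "Spain", "Vietnam", "Zimbabwe", "Zimbabwe"],
    "Country Code": ["ESP", "ESP", "VNM", "ZWE", "ZWE"],
    "Year": [1960, 1961, 1960, 1960, 1961],
    "Value": [30, 31, 32, 4, 5],
})

# One row per country/region, as in the tutorial
one_per_country = toy.drop_duplicates(subset="Country Name")

# The number of distinct names, without building the table
n_unique = toy["Country Name"].nunique()

# The distinct names themselves, as a sorted Python list
names = sorted(toy["Country Name"].unique())
print(n_unique, names)
```

On the real data, popData["Country Name"].nunique() should report the same number as the row count of the de-duplicated table.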

Plotting Multiple Lines on the Same Plot

Now we'll try plotting multiple countries on the same graph to see how their populations grow relative to each other. We can filter the countries we want to plot using the 'Country Name' field.

pop = popData.loc[(popData['Country Name'] == 'Zimbabwe') | 
(popData['Country Name'] == 'Vietnam') | 
(popData['Country Name'] == 'Spain')]
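Chaining == comparisons with | works, but it grows quickly as we add countries. One possible alternative is the pandas isin method, sketched here on a made-up table (toy and its values are placeholders, and "Brazil" is only there so the filter has something to exclude).

```python
import pandas as pd

# Toy stand-in for popData; the values are placeholders, not real figures.
toy = pd.DataFrame({
    "Country Name": ["Brazil", "Spain", "Vietnam", "Zimbabwe", "Brazil", "Spain"],
    "Year": [1960, 1960, 1960, 1960, 1961, 1961],
    "Value": [70, 30, 32, 4, 72, 31],
})

# The tutorial's approach: one comparison per country, joined with |
mask_or = ((toy["Country Name"] == "Zimbabwe") |
           (toy["Country Name"] == "Vietnam") |
           (toy["Country Name"] == "Spain"))

# The same filter written with isin and a list of names
wanted = ["Zimbabwe", "Vietnam", "Spain"]
mask_isin = toy["Country Name"].isin(wanted)

pop_or = toy.loc[mask_or]
pop_isin = toy.loc[mask_isin]
print(pop_or.equals(pop_isin))
```

Both masks select the same rows, so either form can feed the plotting code that follows.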

I initially struggled to figure out how to put these three countries on the same plot. After some research, I learned that there is one line of code using the Seaborn package that accomplishes this. Next, we'll add some labels to the graph to make it clear.

import seaborn as sns
sns.lineplot(data = pop, x = 'Year', y = 'Value', hue = 'Country Name')
ylab = 'Population'
title = 'Population'
plt.ylabel(ylab)
plt.title(title)
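The hue argument tells Seaborn to draw one line per distinct 'Country Name'. To get a feel for that grouping, here is a small sketch that reshapes a made-up table (toy, with placeholder values) so each country becomes its own column. This is only an illustration of the idea, not what Seaborn does internally.

```python
import pandas as pd

# Toy stand-in for the filtered table; values are placeholders.
toy = pd.DataFrame({
    "Country Name": ["Spain", "Spain", "Vietnam", "Vietnam", "Zimbabwe", "Zimbabwe"],
    "Year": [1960, 1961, 1960, 1961, 1960, 1961],
    "Value": [30, 31, 32, 33, 4, 5],
})

# One row per year, one column per country
wide = toy.pivot(index="Year", columns="Country Name", values="Value")
print(wide)
```

Calling wide.plot() on a table shaped like this would also draw one line per country using pandas alone.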

This code works, but there is a more streamlined way to obtain the data for specific countries: change the row index of the table. The line below makes the country code the index.

popData.index = popData['Country Code']

As you can see, the numerical index for each row has now been replaced with the Country Code.

popData.head()
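Another way to get a similar result is the pandas set_index method. The sketch below assumes we want to keep 'Country Code' as a regular column too (drop=False), and it runs on a made-up table (toy, with placeholder values) rather than the real data.

```python
import pandas as pd

# Toy stand-in for popData; values are placeholders.
toy = pd.DataFrame({
    "Country Name": ["Spain", "Spain", "Zimbabwe", "Zimbabwe"],
    "Country Code": ["ESP", "ESP", "ZWE", "ZWE"],
    "Year": [1960, 1961, 1960, 1961],
    "Value": [30, 31, 4, 5],
})

# set_index returns a new table; drop=False keeps the column as well
indexed = toy.set_index("Country Code", drop=False)

# The first rows now show the country code on the left
print(indexed.head(2))

# The original table still has its numerical index
print(list(toy.index))
```

Unlike assigning to popData.index, this leaves the original table untouched, which can be convenient when experimenting in a notebook.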

Now using .loc with the Country Code, the data for a specific country or region can be obtained.

popData.loc['ZWE']
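One detail worth knowing: what .loc returns depends on how many rows carry the label. A rough sketch on a made-up table (toy, with placeholder values, where Vietnam deliberately has only one row):

```python
import pandas as pd

# Toy stand-in for popData; values are placeholders.
toy = pd.DataFrame({
    "Country Name": ["Spain", "Spain", "Vietnam", "Zimbabwe", "Zimbabwe"],
    "Country Code": ["ESP", "ESP", "VNM", "ZWE", "ZWE"],
    "Year": [1960, 1961, 1960, 1960, 1961],
    "Value": [30, 31, 32, 4, 5],
})
indexed = toy.set_index("Country Code", drop=False)

# 'ZWE' labels several rows, so we get a table (DataFrame) back
zwe = indexed.loc["ZWE"]

# 'VNM' labels exactly one row here, so we get a single row (Series) back
vnm = indexed.loc["VNM"]

# Wrapping the label in a list always gives a table, even for one row
vnm_table = indexed.loc[["VNM"]]

print(type(zwe).__name__, type(vnm).__name__, type(vnm_table).__name__)
```

In the population data every country has one row per year, so popData.loc['ZWE'] comes back as a table.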

To avoid the redundancy of the country code getting listed twice, the following code removes the display of the index.

zwe = popData.loc['ZWE']
print(zwe.to_string(index = False))

Now we can continue to plot the data for Zimbabwe just as we did in Part 1 of this series.

zwe.plot('Year', 'Value')
ylab = 'Population'
title = 'Zimbabwe Population'
plt.ylabel(ylab)
plt.title(title)

Writing zwe = popData.loc['ZWE'] is much more streamlined than zwe = popData.loc[popData['Country Name'] == 'Zimbabwe']. Before, the code had to check whether the field 'Country Name' was equal to Zimbabwe. With the current code, it only needs to check whether the index value is 'ZWE'. Now we can use this same idea to rewrite our code for plotting Spain, Zimbabwe, and Vietnam on the same graph!

pop = popData.loc[['ZWE', 'VNM', 'ESP']]
sns.lineplot(data = pop, x = 'Year', y = 'Value', hue = 'Country Name')
ylab = 'Population'
title = 'Population'
plt.ylabel(ylab)
plt.title(title)

As we can see, we're able to get a line plot with all three countries just like we did before.
Line plot of Zimbabwe, Vietnam, and Spain population over time in years
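To convince ourselves that the two ways of selecting rows agree, we can compare them directly. The sketch below uses a made-up table (toy, with placeholder values); 'XXX' is a hypothetical code that is not in the table.

```python
import pandas as pd

# Toy stand-in for popData; values are placeholders.
toy = pd.DataFrame({
    "Country Name": ["Spain", "Spain", "Vietnam", "Vietnam", "Zimbabwe", "Zimbabwe"],
    "Country Code": ["ESP", "ESP", "VNM", "VNM", "ZWE", "ZWE"],
    "Year": [1960, 1961, 1960, 1961, 1960, 1961],
    "Value": [30, 31, 32, 33, 4, 5],
})
indexed = toy.set_index("Country Code", drop=False)

# Selecting by name (boolean mask) and by code (index label)
by_name = toy.loc[toy["Country Name"] == "Zimbabwe"]
by_code = indexed.loc["ZWE"]

# Ignore the differing row labels and compare the contents
same_rows = by_name.reset_index(drop=True).equals(by_code.reset_index(drop=True))

# A list of codes returns the rows in the order the codes are listed
picked = indexed.loc[["ZWE", "VNM", "ESP"]]
order = list(picked["Country Code"].unique())

# A code that is not in the index raises KeyError
try:
    indexed.loc[["XXX"]]
    missing_raises = False
except KeyError:
    missing_raises = True

print(same_rows, order, missing_raises)
```

The KeyError is a useful safety net: a mistyped country code fails loudly, whereas a mistyped name in a boolean mask silently returns an empty table.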

Filtering by Population Value

Now we are going to display a bar graph of the countries/regions with a population above 1 billion in the year 1985. The code below selects those rows.

highPop = popData.loc[(popData.Value > 1000000000) & 
(popData.Year == 1985)]
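The parentheses around each condition matter: in Python, & binds more tightly than > and ==, so leaving them out changes what gets compared. Here is a small sketch on a made-up table (toy, with placeholder values and a toy threshold of 100), along with the pandas query method as one possible alternative spelling.

```python
import pandas as pd

# Toy stand-in for popData; values and the threshold are placeholders.
toy = pd.DataFrame({
    "Country Name": ["Region A", "Region A", "Region B",
                     "Region B", "Region C", "Region C"],
    "Year": [1985, 1986, 1985, 1986, 1985, 1986],
    "Value": [150, 160, 90, 120, 200, 210],
})

# Each condition in its own parentheses, joined with &
mask = (toy["Value"] > 100) & (toy["Year"] == 1985)
high = toy.loc[mask]

# The same filter written as a query string
high_q = toy.query("Value > 100 and Year == 1985")

print(list(high["Country Name"]))
```

Note that toy["Value"] and toy.Value refer to the same column; the bracket form also works for names with spaces, such as 'Country Name'.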

To plot a bar graph, we can set the 'kind' parameter to "bar".

highPop.plot('Country Name', 'Value', kind = "bar")
ylab = 'Population'
title = 'Population > 1B'
plt.ylabel(ylab)
plt.title(title)

We get a bar graph with all the countries/regions that satisfy our criteria.
Bar graph denoting countries and regions with a population of over 1 billion
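The plot call also hands back the axes it drew on, which gives us another way to set labels. A minimal sketch, assuming a made-up table (toy, with placeholder values) and sorting the bars from largest to smallest before plotting:

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import pandas as pd

# Toy stand-in for highPop; values are placeholders.
toy = pd.DataFrame({
    "Country Name": ["Region A", "Region B", "Region C"],
    "Value": [150, 300, 200],
})

# Sort so the tallest bar comes first, then plot and keep the axes
ordered = toy.sort_values("Value", ascending=False)
ax = ordered.plot("Country Name", "Value", kind="bar", legend=False)

# Label through the axes object instead of plt
ax.set_ylabel("Population")
ax.set_title("Population (toy data)")
```

Both styles work; using the axes object tends to be clearer once a notebook has more than one figure.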

To find the years in which Zimbabwe's population is over 10,000,000, we can do the following:

zwePop = popData.loc[(popData['Country Name'] == 'Zimbabwe') &
(popData.Value > 10000000)] 
zwePop.plot('Year', 'Value', kind = "bar")
ylab = 'Population'
title = 'Zimbabwe Population'
plt.ylabel(ylab)
plt.title(title)

Bar graph of population in Zimbabwe each year
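The filtered table can also answer a direct question, such as the first year the threshold was crossed. A rough sketch on a made-up table (toy, with placeholder values and a toy threshold of 10):

```python
import pandas as pd

# Toy stand-in for popData; values and the threshold are placeholders.
toy = pd.DataFrame({
    "Country Name": ["Zimbabwe"] * 5,
    "Year": [1990, 1991, 1992, 1993, 1994],
    "Value": [8, 9, 11, 12, 13],
})

threshold = 10
zwe_pop = toy.loc[(toy["Country Name"] == "Zimbabwe") &
                  (toy["Value"] > threshold)]

# The earliest year that passed the filter
first_year = int(zwe_pop["Year"].min())
print(first_year)

# Large numbers are easier to read with underscores; Python ignores them
ten_million = 10_000_000
```

On the real data the threshold would be 10_000_000, which is the same number as 10000000.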
The expected code execution is located here for reference.

Up Next

Now you have learned a little more about plots. We'll continue building upon what we've learned here in the next installment!
