Neha

Posted on Mar 26, 2022

Data Analysis in Python using Jupyter Notebook - Part 3

#python #tutorial #cnc2022 #learninpublic

In this tutorial we will learn how to use scatter plots and customize our graphs. We will also be making use of a new data set. In the last two articles, we looked at population over time, but that data set only included population as a measurable value, and a scatter plot needs at least two varying values.

Preliminary Steps

First we’ll import the packages needed for data analysis. Then we will download the pharmaceutical drug spending data set from datahub.io and save it on our machine. Make sure to take note of where the csv file is saved.

import matplotlib.pyplot as plt
import pandas as pd

Now we’ll read in the csv file. Denoting the index column as 0 (the first column) will make it easier to obtain data for specific countries.

spendData = pd.read_csv('../Downloads/pharm_data_csv.csv',
                        index_col = 0)
spendData.head()
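To see what index_col = 0 does, here is a minimal sketch that reads a tiny in-memory CSV instead of the downloaded file. The rows are made-up values for illustration, and the column layout is assumed from the column names used in this tutorial (the real file may contain additional columns).

```python
import io
import pandas as pd

# Hypothetical three-row sample; the numbers are invented for illustration.
sample_csv = io.StringIO(
    "LOCATION,TIME,PC_HEALTHXP,PC_GDP,USD_CAP,TOTAL_SPEND\n"
    "AUS,2004,15.0,1.2,400.0,8000.0\n"
    "AUS,2005,14.5,1.2,420.0,8500.0\n"
    "CAN,2005,17.0,1.6,580.0,18000.0\n"
)

# index_col=0 turns the first column (LOCATION) into the row index.
sample = pd.read_csv(sample_csv, index_col=0)

print(sample.index.name)         # LOCATION
print(sample.loc['AUS'].shape)   # (2, 5)
```

Because the country code is now the index, rows can be looked up by label rather than by position.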

To get a quick look at the statistical distributions of the data, we’ll do the following:

spendData.describe()
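As a rough illustration of what describe() returns, the sketch below runs it on a small toy DataFrame with made-up numbers rather than on the real data set.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({'USD_CAP': [100.0, 200.0, 300.0],
                    'PC_GDP': [1.0, 1.5, 2.0]})

summary = toy.describe()

# One row per statistic, one column per numeric column.
print(list(summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary.loc['mean', 'USD_CAP'])  # 200.0
```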

Now we’re all set to conduct data analysis!

Australia Pharmaceutical Spending

Using .loc, we can then obtain the data specifically related to Australia’s spending.

aus = spendData.loc['AUS']
aus
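One thing worth knowing about .loc with a country label: the type of the result depends on how many rows match. The sketch below uses a toy DataFrame with invented values to show the difference. A label that matches several rows gives a DataFrame (which is what the scatter plots need), while a label that matches a single row gives a Series.

```python
import pandas as pd

# Toy data indexed by country code; values are invented for illustration.
toy = pd.DataFrame(
    {'TIME': [2004, 2005, 2005], 'USD_CAP': [400.0, 420.0, 580.0]},
    index=pd.Index(['AUS', 'AUS', 'CAN'], name='LOCATION'),
)

several = toy.loc['AUS']      # two matching rows: a DataFrame
single = toy.loc['CAN']       # one matching row: a Series
as_frame = toy.loc[['CAN']]   # a list of labels always gives a DataFrame

print(type(several).__name__)   # DataFrame
print(type(single).__name__)    # Series
print(as_frame.shape)           # (1, 2)
```

Passing a list of labels is a simple way to keep a DataFrame even when a country has only one row.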

We'll use a scatter plot to show the correlation between % of Health Spending and Spending in USD per capita for Australia. These columns are denoted as PC_HEALTHXP and USD_CAP respectively.

aus.plot(kind = 'scatter', x = 'USD_CAP', y = 'PC_HEALTHXP')
xlab = 'Spending in USD per capita'
ylab = '% of Health spending'
title = 'Australia Pharmaceutical Spending'
plt.xlabel(xlab)
plt.ylabel(ylab)
plt.title(title)
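The plt.xlabel, plt.ylabel and plt.title calls act on whichever plot is currently active. As an alternative sketch, DataFrame.plot also returns the matplotlib Axes it drew on, so the labels can be set on that object directly. The toy DataFrame below uses invented values and stands in for the aus data.

```python
import pandas as pd

# Toy stand-in for the Australia data; values are invented for illustration.
toy = pd.DataFrame({
    'USD_CAP': [300.0, 400.0, 500.0],
    'PC_HEALTHXP': [14.0, 15.0, 16.0],
})

# DataFrame.plot returns the Axes object that holds the scatter plot.
ax = toy.plot(kind='scatter', x='USD_CAP', y='PC_HEALTHXP')
ax.set_xlabel('Spending in USD per capita')
ax.set_ylabel('% of Health spending')
ax.set_title('Australia Pharmaceutical Spending')
```

Both styles give the same figure. Holding on to ax is mostly useful once a notebook cell draws more than one plot.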

This is one way of creating a scatter plot: calling DataFrame.plot and passing in kind = 'scatter'. We can include a little more information by letting the size of each data point on the graph represent the Total Spending.

aus.plot(kind = 'scatter', x = 'USD_CAP', y = 'PC_HEALTHXP',
         s = aus['TOTAL_SPEND']/100)
xlab = 'Spending in USD per capita'
ylab = '% of Health spending'
title = 'Australia Pharmaceutical Spending'
plt.xlabel(xlab)
plt.ylabel(ylab)
plt.title(title)
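The s argument is passed through to matplotlib, where it is the marker area in points squared, which is why the raw totals are divided by 100 to keep the markers a reasonable size. Here is a small sketch with invented values that reads the sizes back from the plot to confirm what was drawn.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    'USD_CAP': [300.0, 400.0, 500.0],
    'PC_HEALTHXP': [14.0, 15.0, 16.0],
    'TOTAL_SPEND': [5000.0, 10000.0, 20000.0],
})

# Scale total spending down so the markers stay a reasonable size.
sizes = toy['TOTAL_SPEND'] / 100
ax = toy.plot(kind='scatter', x='USD_CAP', y='PC_HEALTHXP', s=sizes)

# The first collection on the Axes is the set of scatter markers.
drawn = np.asarray(ax.collections[0].get_sizes())
print(drawn)
```

The divisor is a judgment call: a larger value gives smaller markers, and the right choice depends on the range of TOTAL_SPEND in the rows being plotted.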

Plotting the % of Health spending against the year, we see a similar trend. The year is represented by TIME in the DataFrame.

aus.plot(kind = 'scatter', x = 'TIME', y = 'PC_HEALTHXP',
         s = aus['TOTAL_SPEND']/100)
xlab = 'Year'
ylab = '% of Health spending'
title = 'Australia Pharmaceutical Spending'
plt.xlabel(xlab)
plt.ylabel(ylab)
plt.title(title)

Pharmaceutical Spending in 2005

Now we’ll compare the spending of all countries in 2005. To obtain all the data, we’ll do the following:

spend2005 = spendData.loc[(spendData.TIME == 2005)]
spend2005
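The filter above works in two steps: the comparison produces a True/False value for every row, and .loc keeps only the rows marked True. A minimal sketch on toy data (invented values) makes the intermediate step visible.

```python
import pandas as pd

# Toy data indexed by country code; values are invented for illustration.
toy = pd.DataFrame(
    {'TIME': [2004, 2005, 2005], 'USD_CAP': [400.0, 420.0, 580.0]},
    index=pd.Index(['AUS', 'AUS', 'CAN'], name='LOCATION'),
)

# Step 1: compare every row's year with 2005.
mask = toy['TIME'] == 2005
print(mask.tolist())             # [False, True, True]

# Step 2: keep only the rows where the mask is True.
only_2005 = toy.loc[mask]
print(only_2005.index.tolist())  # ['AUS', 'CAN']
```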

Similar to what we did with Australia, we’ll look at the correlation between % of Health Spending and Spending in USD per capita for all countries in 2005. The size of each point corresponds to Total Spending. This time we'll be using DataFrame.plot.scatter, which is equivalent to passing kind = 'scatter' to DataFrame.plot. We'll keep using this syntax when we add a colormap later on.

spend2005.plot.scatter(x = 'USD_CAP', y = 'PC_HEALTHXP',
                       s = spend2005['TOTAL_SPEND']/100)
xlab = 'Spending in USD per capita'
ylab = '% of Health spending'
title = 'Pharmaceutical Spending in 2005'
plt.xlabel(xlab)
plt.ylabel(ylab)
plt.title(title)

The graph shows much more variation and is no longer linear. To go a step further, we can add another piece of data from the table: the percentage of GDP, denoted as PC_GDP.

spend2005.plot.scatter('USD_CAP', 'PC_HEALTHXP',
                       s = spend2005['TOTAL_SPEND']/100, c = 'PC_GDP',
                       colormap = 'viridis')
xlab = 'Spending in USD per capita'
ylab = '% of Health spending'
title = 'Pharmaceutical Spending in 2005'
plt.xlabel(xlab)
plt.ylabel(ylab)
plt.title(title)

This graph adds a colormap to denote the percentage of GDP. We get a good view of how all these values appear compared against each other.
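As a small sketch of how the colormap arguments behave, the example below colors points by a PC_GDP column on toy data (invented values) and then inspects the result. It also draws the same plot with kind = 'scatter' to show that, as far as I can tell, both spellings accept the same c and colormap arguments.

```python
import pandas as pd

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    'USD_CAP': [300.0, 400.0, 500.0],
    'PC_HEALTHXP': [14.0, 15.0, 16.0],
    'PC_GDP': [1.0, 1.5, 2.0],
})

# c names the column that drives the color; colormap names the color scale.
ax = toy.plot.scatter(x='USD_CAP', y='PC_HEALTHXP',
                      c='PC_GDP', colormap='viridis')

# The same plot using the kind='scatter' spelling.
ax_kind = toy.plot(kind='scatter', x='USD_CAP', y='PC_HEALTHXP',
                   c='PC_GDP', colormap='viridis')

print(ax.collections[0].get_cmap().name)   # viridis
print(len(ax.figure.axes))                 # 2: the plot plus its colorbar
```

Other named matplotlib colormaps such as 'plasma' or 'coolwarm' can be swapped in for 'viridis' in the same way.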

The expected code execution is located here for reference.

Closing

In this tutorial we learned to create scatter plots, closing out the series. We’ve conducted basic data analysis on countries around the world and looked at both population trends and pharmaceutical drug spending. Congrats on making it to the end of the series. You've learned a lot!

