Homework 0
Assignment:Write a tutorial explaining how to construct an interesting data visualization of the Palmer Penguins data set.
1.Data Import and Cleaning
Reading in Palmer Penguins data set.
import pandas as pd
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)
Choosing the relevant columns.
cols=["Species",
"Island",
"Culmen Length (mm)",
"Culmen Depth (mm)",
"Flipper Length (mm)",
"Body Mass (g)",
"Sex"]
penguins["Species"] = penguins["Species"].str.split().str.get(0)
penguins = penguins[cols]
Clean the data
#drops all rows with NaN values
penguins = penguins.dropna()
#drops rows with "." as Sex
penguins = penguins[penguins['Sex']!= "."]
penguins.head()
2.Visualization
Seaborn is a Python data visualization library based on matplotlib.
Reference:https://seaborn.pydata.org/examples/index.html
Scatterplot of Culmen Length against Body Mass for Three Penguin Species, Separated by Sex
import seaborn as sns
##create a facet grid called scatterplot
sns.relplot(data=penguins, # data that needs to be plotted
x="Culmen Length (mm)", # column name for x-axis
y="Body Mass (g)", # column name for y-axis
hue="Species", # column name for color coding
col = "Sex",
sizes=(40, 400),
alpha=.5)
Matplotlib is a library for creating interactive visualizations in Python.
Reference:https://matplotlib.org/3.5.0/tutorials/introductory/pyplot.html
Boxplot of Sex against Body Mass
from matplotlib import pyplot as plt
#creat an empty plot with figure size (10,7)
fig, ax = plt.subplots(figsize=(10,7))
#plot the graph with boxplot method
sns.boxplot(data=penguins, # data that needs to be plotted
x="Body Mass (g)", # column name for x-axis
y="Sex", # column name for y-axis
hue="Species", # column name for color coding
width=.2)
Plotly is a graphing library makes interactive graphs.
Reference:https://plotly.com/python-api-reference/index.html
Hitogram of the number of penguins with differnt Culmen Length
from plotly import express as px
fig=px.histogram(penguins, #data that needs to be plotted
x="Culmen Length (mm)", # column name for x-axis
color="Species", # column name for color coding
opacity=0.5, # set the opacity
nbins=30, # set the number of bins
barmode='stack',
width=600, # the figure width in pixels
height=300) # the figure height in pixels
# reduce whitespace
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# show the plot
fig.show()
Compare to matplotlib, plotly can offer a more ornate visualization, and plotly is easy to modify and export the plot.