Homework 0

Assignment:Write a tutorial explaining how to construct an interesting data visualization of the Palmer Penguins data set.

1.Data Import and Cleaning

Reading in Palmer Penguins data set.

import pandas as pd
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)

Choosing the relevant columns.

cols=["Species",
      "Island",
      "Culmen Length (mm)",
      "Culmen Depth (mm)",
      "Flipper Length (mm)",
      "Body Mass (g)",
      "Sex"]
penguins["Species"] = penguins["Species"].str.split().str.get(0)
penguins = penguins[cols]

Clean the data

#drops all rows with NaN values 
penguins = penguins.dropna()
#drops rows with "." as Sex
penguins = penguins[penguins['Sex']!= "."]
penguins.head()

image4.png

2.Visualization

Seaborn is a Python data visualization library based on matplotlib.

Reference:https://seaborn.pydata.org/examples/index.html

Scatterplot of Culmen Length against Body Mass for Three Penguin Species, Separated by Sex

import seaborn as sns
##create a facet grid called scatterplot
sns.relplot(data=penguins,          # data that needs to be plotted   
            x="Culmen Length (mm)", # column name for x-axis
            y="Body Mass (g)",      # column name for y-axis
            hue="Species",          # column name for color coding
            col = "Sex",
            sizes=(40, 400),
            alpha=.5)

image1.png

Matplotlib is a library for creating interactive visualizations in Python.

Reference:https://matplotlib.org/3.5.0/tutorials/introductory/pyplot.html

Boxplot of Sex against Body Mass

from matplotlib import pyplot as plt
#creat an empty plot with figure size (10,7)
fig, ax = plt.subplots(figsize=(10,7))
#plot the graph with boxplot method
sns.boxplot(data=penguins,     # data that needs to be plotted
            x="Body Mass (g)", # column name for x-axis
            y="Sex",           # column name for y-axis
            hue="Species",     # column name for color coding
            width=.2)

image2.png

Plotly is a graphing library makes interactive graphs.

Reference:https://plotly.com/python-api-reference/index.html

Hitogram of the number of penguins with differnt Culmen Length

from plotly import express as px
fig=px.histogram(penguins, #data that needs to be plotted
                   x="Culmen Length (mm)", # column name for x-axis
                   color="Species", # column name for color coding
                   opacity=0.5,     # set the opacity   
                   nbins=30,        # set the number of bins
                   barmode='stack',
                   width=600,       # the figure width in pixels
                   height=300)      # the figure height in pixels
# reduce whitespace
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# show the plot
fig.show()

image3.png

Compare to matplotlib, plotly can offer a more ornate visualization, and plotly is easy to modify and export the plot.

Written on April 2, 2022