Commit 2405b27: add csv example with egg volume

1 file changed: hands-on.qmd (71 additions, 31 deletions)

execute:
  warning: false
---

Let's load the necessary packages. DuckDB has its own R package that is mostly a wrapper around `dbplyr` and `DBI`:

```{r}
#| message: false

library(tidyverse)
library(dbplyr)  # to query databases in a tidyverse-style manner
library(DBI)     # to connect to databases
# install.packages("duckdb") # install this package to get the DuckDB API
library(duckdb)  # specific to DuckDB
```

## The dataset

ARCTIC SHOREBIRD DEMOGRAPHICS NETWORK [https://doi.org/10.18739/A2222R68W](https://doi.org/10.18739/A2222R68W){target="_blank"}

Data set hosted by the NSF Arctic Data Center (<https://arcticdata.io>).

Field data on shorebird ecology and environmental conditions were collected from 1993 to 2014 at 16 field sites in Alaska, Canada, and Russia.

Data were not collected in every year at all sites. Studies of the population ecology of these birds included nest monitoring to determine timing of reproduction and reproductive success; live capture of birds to collect blood samples, feathers, and fecal samples for investigations of population structure and pathogens; banding of birds to determine annual survival rates; resighting of color-banded birds to determine space use and site fidelity; and use of light-sensitive geolocators to investigate migratory movements. Data on climatic conditions, prey abundance, and predators were also collected. Environmental data included weather stations that recorded daily climatic conditions, surveys of seasonal snowmelt, weekly sampling of terrestrial and aquatic invertebrates that are prey of shorebirds, live trapping of small mammals (alternate prey for shorebird predators), and daily counts of potential predators (jaegers, falcons, foxes). Detailed field methods for each year are available in the ASDN_protocol_201X.pdf files. All research was conducted under permits from relevant federal, state, and university authorities.

See `01_ASDN_Readme.txt` provided in the `data` folder for full metadata information about this data set.

## Analyzing the bird dataset using csv files (raw data)

Let us import the csv file with the bird species information:

```{r}
# Import the species table
species_csv <- read_csv("data/species.csv")

glimpse(species_csv)
```

Let's explore what is in the `Relevance` attribute/column:

```{r}
species_csv %>%
  # … (intermediate lines elided in this diff)
  count()
```

We are interested in the `Study species` because, according to the metadata, they are the species included in the data sets for banding, resighting, and/or nest monitoring. Let us extract those species and sort them in alphabetical order:

```{r}
# list of the bird species included in the study
species_study <- species_csv %>%
  filter(Relevance == "Study species") %>%
  select(Scientific_name, Code) %>%
  arrange(Scientific_name)

species_study
```

We would like to know what the average egg size is for each of those bird species. How would we do that?

We will need more information than what we have in our species table. Actually, we will also need to retrieve information from the nests and eggs monitoring tables.

An egg is in a nest, and a nest is associated with a species.

```{r}
# information about the nests
nests_csv <- read_csv("data/ASDN_Bird_nests.csv")

# information about the eggs
eggs_csv <- read_csv("data/ASDN_Bird_eggs.csv")
```

How do we join those tables?

```{r}
glimpse(eggs_csv)
```

`Nest_ID` seems promising as a foreign key! As a quick sanity check, we can verify that every egg points to a known nest (see the sketch below).
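
A minimal sketch of such a check with `dplyr::anti_join()`, assuming the `Nest_ID` column is present in both tables as the glimpses suggest:

```{r}
# Count eggs whose Nest_ID has no match in the nests table (0 would be ideal)
eggs_csv %>%
  anti_join(nests_csv, by = "Nest_ID") %>%
  count()
```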

```{r}
glimpse(nests_csv)
```

`Species` is probably the field we will use to join the nests table to the species table.

OK, let's do it!

First, compute the volume of each egg. We can use the following formula, where $W$ is the egg's width and $L$ its length:

$\frac{\pi}{6} W^2 L$

```{r}
# volume of each egg, using the ellipsoid approximation above
eggs_volume_df <- eggs_csv %>%
  mutate(egg_volume = pi / 6 * Width^2 * Length)
```
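
As a quick numeric sanity check (made-up dimensions, for illustration only): an egg 30 mm wide and 40 mm long should come out to roughly 18,850 mm³.

```{r}
# pi/6 * 30^2 * 40 ≈ 18850
pi / 6 * 30^2 * 40
```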

Now let's join this information to the nests table and average by species:

```{r}
species_egg_volume_avg <- left_join(nests_csv, eggs_volume_df, by = "Nest_ID") %>%
  group_by(Species) %>%
  summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %>%
  arrange(desc(egg_volume_avg)) %>%
  drop_na()

species_egg_volume_avg
```

Ideally we would like the scientific names...

```{r}
species_egg_volume_named <- species_study %>%
  inner_join(species_egg_volume_avg, by = join_by(Code == Species))

species_egg_volume_named
```

## Let's connect to our first database

### Load the bird database

This database has been built from the csv files we just manipulated, so the data should be very similar (note we did not say identical; more on this in the last section):

```{r}
conn <- dbConnect(duckdb::duckdb(), dbdir = "./data/bird_database.duckdb", read_only = FALSE)
```
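
A quick way to check that the connection worked is to list the tables the database contains; `dbListTables()` is a standard `DBI` function:

```{r}
# List the tables available in the database
dbListTables(conn)
```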

### Let's try to reproduce the analysis we just did

```{r}
# … (chunk contents elided in this diff)
```
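
The elided chunk presumably creates references to the database tables. A minimal sketch of what that could look like with `dplyr::tbl()` (the `species` name is used below; the other table names are assumptions):

```{r}
#| eval: false
# Hypothetical sketch: create lazy references to tables in the DuckDB database
species <- tbl(conn, "species")
nests   <- tbl(conn, "Bird_nests")  # assumed table name
eggs    <- tbl(conn, "Bird_eggs")   # assumed table name
```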

Note that those are not dataframes but tables. What `dbplyr` is actually doing behind the scenes…

#### How can I get a "real dataframe?"

You add `collect()` to your query:

```{r}
species %>%
  # … (intermediate lines elided in this diff)
  collect()
```

Note that this means the full query is going to be run and its result saved in your memory. This might slow things down, so you generally want to `collect()` the smallest data frame you can.

#### How can you see the SQL query equivalent to the tidyverse code?

```{r}
species %>%
  # … (intermediate lines elided in this diff)
  head(3) %>%
  show_query()
```

This is a great way to start getting familiar with the SQL syntax, because although you can do a lot with `dbplyr`, you cannot do everything that SQL can do. So at some point you might want to start using SQL directly.

Here is how you could run the query using the SQL code directly.
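
The actual chunk is elided in this diff; a minimal sketch of running SQL directly with `DBI::dbGetQuery()` (the query shown is an assumption, modeled on the examples below):

```{r}
#| eval: false
# Hypothetical sketch: send a SQL string to DuckDB, get a regular dataframe back
dbGetQuery(conn, "
  SELECT Relevance, COUNT(*) AS num_species
  FROM species
  GROUP BY Relevance
")
```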

Does that code look familiar? But this time, here is really the query that was used to retrieve this information:

```{r}
species %>%
  group_by(Relevance) %>%
  summarize(num_species = n()) %>%
  show_query()
```

```{r}
species %>%
  mutate(Code = paste("X", Code)) %>%
  # … (rest of the chunk elided in this diff)
```

Limitation: there is no way to add or update data; `dbplyr` is view-only. If you want to add or update data, you'll need to use the `DBI` package functions, as sketched below.
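
For instance, a minimal sketch of appending a row with `DBI::dbAppendTable()` (the column names follow the species table used above; the values are made up for illustration):

```{r}
#| eval: false
# Hypothetical example: append one made-up row to the species table
new_species <- tibble(
  Code = "xxxx",                   # made-up code, for illustration only
  Scientific_name = "Avis exemplum",
  Relevance = "Study species"
)
dbAppendTable(conn, "species", new_species)
```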

### Disconnecting from the database

Before we close our session, it is good practice to disconnect from the database first:

```{r}
DBI::dbDisconnect(conn, shutdown = TRUE)
```

## How did we create this database

You might be wondering how we created this database from our csv files. Most databases have some function to help you import csv files. Note that since there are no data modeling constraints (the data does not have to be normalized or tidy) nor data type constraints, a lot of things can go wrong. This is a great opportunity to implement QA/QC on your data, and to help you keep it clean and tidy moving forward as new data are collected.
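
For DuckDB specifically, a minimal sketch of one way to do it, using `duckdb_read_csv()` from the `duckdb` package (assuming a fresh connection; the table and file names follow the examples above):

```{r}
#| eval: false
# Hypothetical sketch: create a DuckDB table from a csv file
conn <- dbConnect(duckdb::duckdb(), dbdir = "./data/bird_database.duckdb")
duckdb::duckdb_read_csv(conn, name = "species", files = "data/species.csv")
DBI::dbDisconnect(conn, shutdown = TRUE)
```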