Commit 2405b27: add csv example with egg volume

1 file changed: hands-on.qmd (71 additions, 31 deletions)

execute:
  warning: false
---

Let's load the necessary packages. DuckDB has its own R package that is mostly a wrapper around `dbplyr` and `DBI`:

```{r}
#| message: false

library(tidyverse)
library(dbplyr)  # to query databases in a tidyverse-style manner
library(DBI)     # to connect to databases
# install.packages("duckdb") # install this package to get the DuckDB API
library(duckdb)  # specific to DuckDB
```

## The dataset

ARCTIC SHOREBIRD DEMOGRAPHICS NETWORK [https://doi.org/10.18739/A2222R68W](https://doi.org/10.18739/A2222R68W){target="_blank"}

Data set hosted by the NSF Arctic Data Center (<https://arcticdata.io>).

Field data on shorebird ecology and environmental conditions were collected from 1993 to 2014 at 16 field sites in Alaska, Canada, and Russia.

Data were not collected in every year at all sites. Studies of the population ecology of these birds included nest monitoring to determine timing of reproduction and reproductive success; live capture of birds to collect blood samples, feathers, and fecal samples for investigations of population structure and pathogens; banding of birds to determine annual survival rates; resighting of color-banded birds to determine space use and site fidelity; and use of light-sensitive geolocators to investigate migratory movements. Data on climatic conditions, prey abundance, and predators were also collected. Environmental data included weather stations that recorded daily climatic conditions, surveys of seasonal snowmelt, weekly sampling of terrestrial and aquatic invertebrates that are prey of shorebirds, live trapping of small mammals (alternate prey for shorebird predators), and daily counts of potential predators (jaegers, falcons, foxes). Detailed field methods for each year are available in the ASDN_protocol_201X.pdf files. All research was conducted under permits from relevant federal, state, and university authorities.

See `01_ASDN_Readme.txt` provided in the `data` folder for full metadata information about this data set.

## Analyzing the bird dataset using csv files (raw data)

Let us import the csv file with the bird species information:

```{r}
# Import the species table
species_csv <- read_csv("data/species.csv")

glimpse(species_csv)
```

Let's explore what is in the `Relevance` attribute/column:

```{r}
species_csv %>%
  # … (intermediate lines elided in this diff)
  count()
```

We are interested in the `Study species` because, according to the metadata, they are the species included in the data sets for banding, resighting, and/or nest monitoring. Let us extract those species and sort them in alphabetical order:

```{r}
# list of the bird species included in the study
species_study <- species_csv %>%
  filter(Relevance == "Study species") %>%
  select(Scientific_name, Code) %>%
  arrange(Scientific_name)

species_study
```

We would like to know what the average egg size is for each of those bird species. How would we do that?

We will need more information than what we have in our species table. Actually, we will also need to retrieve information from the nests and eggs monitoring tables.

An egg is in a nest, and a nest is associated with a species.

```{r}
# information about the nests
nests_csv <- read_csv("data/ASDN_Bird_nests.csv")

# information about the eggs
eggs_csv <- read_csv("data/ASDN_Bird_eggs.csv")
```

How do we join those tables?

```{r}
glimpse(eggs_csv)
```

`Nest_ID` seems promising as a foreign key! As a quick sanity check, we can verify that every egg points to a known nest (see the sketch below).
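
A minimal sketch of such a check with `dplyr::anti_join()`, assuming the `Nest_ID` column is present in both tables as the glimpses suggest:

```{r}
# Count eggs whose Nest_ID has no match in the nests table (0 would be ideal)
eggs_csv %>%
  anti_join(nests_csv, by = "Nest_ID") %>%
  count()
```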

```{r}
glimpse(nests_csv)
```

`Species` is probably the field we will use to join the nests table to the species table.

OK, let's do it!

First, compute the volume of each egg. We can use the following formula, where $W$ is the egg's width and $L$ its length:

$\frac{\pi}{6} W^2 L$

```{r}
# volume of each egg, using the ellipsoid approximation above
eggs_volume_df <- eggs_csv %>%
  mutate(egg_volume = pi / 6 * Width^2 * Length)
```
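
As a quick numeric sanity check (made-up dimensions, for illustration only): an egg 30 mm wide and 40 mm long should come out to roughly 18,850 mm³.

```{r}
# pi/6 * 30^2 * 40 ≈ 18850
pi / 6 * 30^2 * 40
```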

Now let's join this information to the nests table and average by species:

```{r}
species_egg_volume_avg <- left_join(nests_csv, eggs_volume_df, by = "Nest_ID") %>%
  group_by(Species) %>%
  summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %>%
  arrange(desc(egg_volume_avg)) %>%
  drop_na()

species_egg_volume_avg
```

Ideally we would like the scientific names...

```{r}
species_egg_volume_named <- species_study %>%
  inner_join(species_egg_volume_avg, by = join_by(Code == Species))

species_egg_volume_named
```

## Let's connect to our first database

### Load the bird database

This database has been built from the csv files we just manipulated, so the data should be very similar (note we did not say identical; more on this in the last section):

```{r}
conn <- dbConnect(duckdb::duckdb(), dbdir = "./data/bird_database.duckdb", read_only = FALSE)
```
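
A quick way to check that the connection worked is to list the tables the database contains; `dbListTables()` is a standard `DBI` function:

```{r}
# List the tables available in the database
dbListTables(conn)
```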

### Let's try to reproduce the analysis we just did

```{r}
# … (chunk contents elided in this diff)
```
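
The elided chunk presumably creates references to the database tables. A minimal sketch of what that could look like with `dplyr::tbl()` (the `species` name is used below; the other table names are assumptions):

```{r}
#| eval: false
# Hypothetical sketch: create lazy references to tables in the DuckDB database
species <- tbl(conn, "species")
nests   <- tbl(conn, "Bird_nests")  # assumed table name
eggs    <- tbl(conn, "Bird_eggs")   # assumed table name
```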

Note that those are not dataframes but tables. What `dbplyr` is actually doing behind the scenes…

#### How can I get a "real dataframe?"

You add `collect()` to your query:

```{r}
species %>%
  # … (intermediate lines elided in this diff)
  collect()
```

Note that this means the full query is going to be run and its result saved in your memory. This might slow things down, so you generally want to `collect()` the smallest data frame you can.

#### How can you see the SQL query equivalent to the tidyverse code?

```{r}
species %>%
  # … (intermediate lines elided in this diff)
  head(3) %>%
  show_query()
```

This is a great way to start getting familiar with the SQL syntax, because although you can do a lot with `dbplyr`, you cannot do everything that SQL can do. So at some point you might want to start using SQL directly.

Here is how you could run the query using the SQL code directly.
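
The actual chunk is elided in this diff; a minimal sketch of running SQL directly with `DBI::dbGetQuery()` (the query shown is an assumption, modeled on the examples below):

```{r}
#| eval: false
# Hypothetical sketch: send a SQL string to DuckDB, get a regular dataframe back
dbGetQuery(conn, "
  SELECT Relevance, COUNT(*) AS num_species
  FROM species
  GROUP BY Relevance
")
```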

Does that code look familiar? But this time, here is really the query that was used to retrieve this information:

```{r}
species %>%
  group_by(Relevance) %>%
  summarize(num_species = n()) %>%
  show_query()
```

```{r}
species %>%
  mutate(Code = paste("X", Code)) %>%
  # … (rest of the chunk elided in this diff)
```

Limitation: there is no way to add or update data; `dbplyr` is view-only. If you want to add or update data, you'll need to use the `DBI` package functions, as sketched below.
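
For instance, a minimal sketch of appending a row with `DBI::dbAppendTable()` (the column names follow the species table used above; the values are made up for illustration):

```{r}
#| eval: false
# Hypothetical example: append one made-up row to the species table
new_species <- tibble(
  Code = "xxxx",                   # made-up code, for illustration only
  Scientific_name = "Avis exemplum",
  Relevance = "Study species"
)
dbAppendTable(conn, "species", new_species)
```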

### Disconnecting from the database

Before we close our session, it is good practice to disconnect from the database first:

```{r}
DBI::dbDisconnect(conn, shutdown = TRUE)
```

## How did we create this database

You might be wondering how we created this database from our csv files. Most databases have some function to help you import csv files. Note that since there are no data modeling constraints (the data does not have to be normalized or tidy) nor data type constraints, a lot of things can go wrong. This is a great opportunity to implement QA/QC on your data, and to help you keep it clean and tidy moving forward as new data are collected.
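
For DuckDB specifically, a minimal sketch of one way to do it, using `duckdb_read_csv()` from the `duckdb` package (assuming a fresh connection; the table and file names follow the examples above):

```{r}
#| eval: false
# Hypothetical sketch: create a DuckDB table from a csv file
conn <- dbConnect(duckdb::duckdb(), dbdir = "./data/bird_database.duckdb")
duckdb::duckdb_read_csv(conn, name = "species", files = "data/species.csv")
DBI::dbDisconnect(conn, shutdown = TRUE)
```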