Commit d635a60

add the DB egg example
1 parent d9429fb commit d635a60

1 file changed

+78
-19
lines changed

hands-on.qmd

Lines changed: 78 additions & 19 deletions
@@ -28,6 +28,7 @@ Data were not collected in every year at all sites. Studies of the population ec
2828

2929
See `01_ASDN_Readme.txt` provided in the `data` folder for full metadata information about this data set.
3030

31+
3132
## Analyzing the bird dataset using csv files (raw data)
3233

3334
Let us import the csv files with the bird species information:
@@ -44,7 +45,7 @@ Let's explore what is in the `Relevance` attribute/column:
4445
```{r}
4546
species_csv %>%
4647
group_by(Relevance) %>%
47-
count()
48+
summarize(num_species = n())
4849
```
4950

5051
We are interested in the `Study species` because according to the metadata they are the species that are included in the data sets for banding, resighting, and/or nest monitoring. Let us extract the species and sort them in alphabetical order:
@@ -59,6 +60,8 @@ species_study <- species_csv %>%
5960
species_study
6061
```
6162

63+
#### Average egg volume
64+
6265
We would like to know what the average egg size is for each of those bird species. How would we do that?
6366

6467
We will need more information than what we have in our species table. We will also need to retrieve information from the nests and eggs monitoring tables.
@@ -95,9 +98,7 @@ $\frac{\Pi}6W^2L$
9598

9699
```{r}
97100
eggs_area_df <- eggs_csv %>%
98-
mutate(egg_volume = pi/6*Width^2*Length) # %>%
99-
# group_by(Nest_ID) %>%
100-
# summarise(eggs_area_avg = mean(egg_area, na.rm = TRUE))
101+
mutate(egg_volume = pi/6*Width^2*Length)
101102
```
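
As a quick sanity check of the volume formula used in the `mutate()` call, here is a minimal worked example on a single hypothetical egg. The measurements and their units are assumptions made for illustration; they are not taken from the data set.

```r
# Hypothetical egg measurements (illustrative values, assumed to be in millimetres)
egg_width  <- 20
egg_length <- 30

# Same expression as in the mutate() call: pi/6 * Width^2 * Length
egg_volume <- pi / 6 * egg_width^2 * egg_length

# 20^2 * 30 / 6 = 2000, so the volume is 2000 * pi
egg_volume
```

Because the width is squared, an error in the width measurement affects the volume more than the same relative error in the length.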
102103

103104
Now let's join this information to the nest table, and average by species:
@@ -131,43 +132,49 @@ This database has been built from the csv files we just manipulated, so the data
131132
conn <- dbConnect(duckdb::duckdb(), dbdir = "./data/bird_database.duckdb", read_only = FALSE)
132133
```
133134

135+
List all the tables present in the database:
136+
137+
```{r}
138+
dbListTables(conn)
139+
```
140+
134141
### Let's try to reproduce the analysis we just did
135142

136143
```{r}
137-
species <- tbl(conn, "Species")
138-
species
144+
species_db <- tbl(conn, "Species")
145+
species_db
139146
```
140147

141148
```{r}
142-
species %>%
149+
species_db %>%
143150
filter(Relevance=="Study species") %>%
144151
select(Scientific_name) %>%
145152
arrange(Scientific_name) %>%
146153
head(3)
147154
```
148155

149-
Note that those are not dataframes but tables. What `dbplyr` is actually doing behind the scenes is translating all those dplyr operations into SQL, sending the SQL to the database, retrieving results, etc.
156+
Note that those are not data frames but tables. What `dbplyr` is actually doing behind the scenes is translating all those dplyr operations into SQL, sending the SQL to the database, retrieving results, etc.
150157

151-
#### How can I get a "real dataframe?"
158+
#### How can I get a "real data frame"?
152159

153160
You add `collect()` to your query.
154161

155162
```{r}
156-
species %>%
163+
species_db %>%
157164
filter(Relevance=="Study species") %>%
158165
select(Scientific_name) %>%
159166
arrange(Scientific_name) %>%
160167
head(3) %>%
161168
collect()
162169
```
163170

164-
Note it means the full query is going to be ran and save in you memory. This might slow things down so you generally want to collect on the smallest data frame you can
171+
Note that this means the full query will be run and the result saved in your environment. This might slow things down, so you generally want to collect the smallest data frame you can.
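
To make the difference between a lazy table and a collected data frame concrete, here is a minimal sketch. It assumes the `dplyr`, `dbplyr`, `DBI` and `duckdb` packages are installed, and it uses an in-memory DuckDB database with a made-up `Species_demo` table rather than the real `Species` table.

```r
library(dplyr)
library(DBI)

# In-memory DuckDB database with a toy table (hypothetical rows, for illustration only)
demo_conn <- dbConnect(duckdb::duckdb())
dbWriteTable(demo_conn, "Species_demo", data.frame(
  Scientific_name = c("Calidris alpina", "Arenaria interpres", "Pluvialis dominica"),
  Relevance       = c("Study species", "Study species", "Other")
))

# Building the pipeline only describes the query; nothing is pulled into R yet
lazy_tbl <- tbl(demo_conn, "Species_demo") %>%
  filter(Relevance == "Study species") %>%
  arrange(Scientific_name)

is_lazy <- inherits(lazy_tbl, "tbl_lazy")

# collect() runs the query and returns a local data frame (a tibble)
local_df <- collect(lazy_tbl)
is_local <- is.data.frame(local_df)

dbDisconnect(demo_conn, shutdown = TRUE)
```

Filtering before `collect()` means only the matching rows travel from the database into R, which is why it pays to collect as late as possible.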
165172

166173
#### How can you see the SQL query equivalent to the tidyverse code?
167174

168175
```{r}
169176
# Add show_query() to the end to see what SQL it is sending!
170-
species %>%
177+
species_db %>%
171178
filter(Relevance=="Study species") %>%
172179
select(Scientific_name) %>%
173180
arrange(Scientific_name) %>%
@@ -177,10 +184,10 @@ species %>%
177184

178185
This is a great way to start getting familiar with the SQL syntax, because although you can do a lot with `dbplyr`, you cannot do everything that SQL can do. So at some point you might want to start using SQL directly.
179186

180-
Here is how you could run the query using the SQL code directly
187+
Here is how you could run the query using the SQL code directly:
181188

182189
```{r}
183-
# Establish a set of Parquet files
190+
# query the database using SQL
184191
dbGetQuery(conn, "SELECT Scientific_name FROM Species WHERE (Relevance = 'Study species') ORDER BY Scientific_name LIMIT 3")
185192
```
186193

@@ -189,34 +196,79 @@ You can do pretty much anything with these quasi-tables, including grouping, sum
189196
Let's count how many species there are per `Relevance` category:
190197

191198
```{r}
192-
species %>%
199+
species_db %>%
193200
group_by(Relevance) %>%
194201
summarize(num_species = n())
195202
```
196203

197204
Does that code look familiar? But this time, here is the actual query that was used to retrieve this information:
198205

199206
```{r}
200-
species %>%
207+
species_db %>%
201208
group_by(Relevance) %>%
202209
summarize(num_species = n()) %>%
203210
show_query()
204211
```
212+
You can also create new columns using mutate:
205213

206214
```{r}
207-
species %>%
215+
species_db %>%
208216
mutate(Code = paste("X", Code)) %>%
209217
head()
210218
```
219+
What does the query look like?
211220

212221
```{r}
213-
species %>%
222+
species_db %>%
214223
mutate(Code = paste("X", Code)) %>%
215224
head() %>%
216225
show_query()
217226
```
227+
:::warning
228+
Limitation: no way to add or update data in the database, `dbplyr` is view only. If you want to add or update data, you'll need to use the `DBI` package functions.
229+
:::
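
As a minimal sketch of what using the `DBI` functions for writes could look like, the example below appends a row and updates a value. It assumes the `DBI` and `duckdb` packages are installed; the `Species_demo` table and its rows are hypothetical and live in a throwaway in-memory database, so the real `bird_database.duckdb` is not touched.

```r
library(DBI)

# Throwaway in-memory database with a toy table (hypothetical rows)
demo_conn <- dbConnect(duckdb::duckdb())
dbWriteTable(demo_conn, "Species_demo", data.frame(
  Code      = c("dunl", "rutu"),
  Relevance = c("Study species", "Other")
))

# Add a row: dbAppendTable() inserts the rows of a data frame into an existing table
dbAppendTable(demo_conn, "Species_demo",
              data.frame(Code = "amgp", Relevance = "Other"))

# Update a row: dbExecute() runs a SQL statement and returns the number of rows affected
n_updated <- dbExecute(
  demo_conn,
  "UPDATE Species_demo SET Relevance = 'Study species' WHERE Code = 'rutu'"
)

result <- dbGetQuery(demo_conn,
                     "SELECT Code, Relevance FROM Species_demo ORDER BY Code")

dbDisconnect(demo_conn, shutdown = TRUE)
```

On a real database, the connection has to be opened with `read_only = FALSE` for these statements to succeed.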
230+
231+
#### Average egg volume
218232

219-
Limitation: no way to add or update data, `dbplyr` is view only. If you want to add or update data, you'll need to use the `DBI` package functions.
233+
Let's calculate the average egg volume per species directly in the database:
234+
235+
```{r}
236+
# loading all the necessary tables
237+
eggs_db <- tbl(conn, "Bird_eggs")
238+
nests_db <- tbl(conn, "Bird_nests")
239+
```
240+
241+
Compute the volume:
242+
243+
```{r}
244+
eggs_area_db <- eggs_db %>%
245+
mutate(egg_volume = pi/6*Width^2*Length)
246+
```
247+
248+
Now let's join this information to the nest table, and average by species:
249+
250+
```{r}
251+
species_egg_volume_avg_db <- left_join(nests_db, eggs_area_db, by="Nest_ID") %>%
252+
group_by(Species) %>%
253+
summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %>%
254+
arrange(desc(egg_volume_avg)) %>%
255+
collect() %>%
256+
drop_na()
257+
258+
species_egg_volume_avg_db
259+
```
260+
261+
```{r}
262+
species_egg_volume_avg_db <- left_join(nests_db, eggs_area_db, by="Nest_ID") %>%
263+
group_by(Species) %>%
264+
summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %>%
265+
arrange(desc(egg_volume_avg)) %>%
266+
show_query()
267+
```
268+
269+
:::note
270+
Why does the SQL query include the volume computation?
271+
:::
220272

221273
### Disconnecting from the database
222274

@@ -226,6 +278,13 @@ Before we close our session, it is good practice to disconnect from the database
226278
DBI::dbDisconnect(conn, shutdown = TRUE)
227279
```
228280

281+
229282
## How did we create this database
230283

231284
You might be wondering how we created this database from our csv files. Most databases have some function to help you import csv files. Note that since csv files carry no data modeling constraints (they do not have to be normalized or tidy) and no data type constraints, a lot of things can go wrong during the import. This is a great opportunity to implement QA/QC on your data and to help keep it clean and tidy moving forward as new data are collected.
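
As an illustration only, here is one way a csv file could be imported into DuckDB from R. This is a sketch under stated assumptions, not necessarily how `bird_database.duckdb` was built: it assumes the `DBI` and `duckdb` packages are installed, the `Eggs_demo` table and its values are made up, and it relies on DuckDB's `read_csv_auto()` SQL function, which infers column types from the file.

```r
library(DBI)

# Write a small hypothetical csv to a temporary file so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(Nest_ID = c("n1", "n2"),
             Width   = c(20.5, 21.0),
             Length  = c(30.1, 29.8)),
  csv_path, row.names = FALSE
)
# Use forward slashes so the path is safe inside the SQL string on every platform
csv_path <- normalizePath(csv_path, winslash = "/")

demo_conn <- dbConnect(duckdb::duckdb())

# Create a table from the csv; column types are inferred by DuckDB, not declared
dbExecute(demo_conn, sprintf(
  "CREATE TABLE Eggs_demo AS SELECT * FROM read_csv_auto('%s', header = true)",
  csv_path
))

imported <- dbGetQuery(demo_conn, "SELECT * FROM Eggs_demo ORDER BY Nest_ID")

dbDisconnect(demo_conn, shutdown = TRUE)
```

Because the types are inferred rather than declared, checking the resulting column types after the import is a natural first QA/QC step.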
285+
286+
287+
```{sql}
288+
```
289+
290+
