hands-on.qmd: 78 additions & 19 deletions
@@ -28,6 +28,7 @@ Data were not collected in every year at all sites. Studies of the population ec
28
28
29
29
See `01_ASDN_Readme.txt` provided in the `data` folder for full metadata information about this data set.
30
30
31
+
31
32
## Analyzing the bird dataset using csv files (raw data)
32
33
33
34
Let us import the csv files with the bird species information:
@@ -44,7 +45,7 @@ Let's explore what is in the `Relevance` attribute/column:
44
45
```{r}
45
46
species_csv %>%
46
47
group_by(Relevance) %>%
47
-
count()
48
+
summarize(num_species = n())
48
49
```
49
50
50
51
We are interested in the `Study species` because according to the metadata they are the species that are included in the data sets for banding, resighting, and/or nest monitoring. Let us extract the species and sort them in alphabetical order:
We would like to know the average egg size for each of those bird species. How would we do that?
63
66
64
67
We will need more information than what we have in our species table. We will also need to retrieve information from the nest and egg monitoring tables.
### Let's try to reproduce the analysis we just did
135
142
136
143
```{r}
137
-
species <- tbl(conn, "Species")
138
-
species
144
+
species_db <- tbl(conn, "Species")
145
+
species_db
139
146
```
140
147
141
148
```{r}
142
-
species %>%
149
+
species_db %>%
143
150
filter(Relevance=="Study species") %>%
144
151
select(Scientific_name) %>%
145
152
arrange(Scientific_name) %>%
146
153
head(3)
147
154
```
148
155
149
-
Note that those are not dataframes but tables. What `dbplyr` is actually doing behind the scenes is translating all those dplyr operations into SQL, sending the SQL to the database, retrieving results, etc.
156
+
Note that those are not data frames but tables. What `dbplyr` is actually doing behind the scenes is translating all those dplyr operations into SQL, sending the SQL to the database, retrieving results, etc.
150
157
151
-
#### How can I get a "real dataframe?"
158
+
#### How can I get a "real data frame"?
152
159
153
160
You add `collect()` to your query.
154
161
155
162
```{r}
156
-
species %>%
163
+
species_db %>%
157
164
filter(Relevance=="Study species") %>%
158
165
select(Scientific_name) %>%
159
166
arrange(Scientific_name) %>%
160
167
head(3) %>%
161
168
collect()
162
169
```
163
170
164
-
Note it means the full query is going to be ran and save in you memory. This might slow things down so you generally want to collect on the smallest data frame you can
171
+
Note that this means the full query is run and the result is saved in your R environment. This might slow things down, so you generally want to collect the smallest data frame you can.
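
As a minimal sketch of the difference, the snippet below builds a small in-memory DuckDB database and checks what kind of object you have before and after `collect()`. The connection name `demo_conn` and the table rows are hypothetical stand-ins, not the course data, and the sketch assumes the `duckdb`, `DBI`, `dplyr`, and `dbplyr` packages are installed.

```r
library(dplyr)
library(dbplyr)

# Hypothetical in-memory database standing in for the course database
demo_conn <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(demo_conn, "Species", data.frame(
  Code = c("aaaa", "bbbb", "cccc"),
  Scientific_name = c("Pluvialis dominica", "Calidris alpina", "Vulpes lagopus"),
  Relevance = c("Study species", "Study species", "Other (hypothetical)")
))

# Build the query; nothing is pulled into R at this point
lazy_query <- tbl(demo_conn, "Species") %>%
  filter(Relevance == "Study species") %>%
  select(Scientific_name) %>%
  arrange(Scientific_name)

# The object is still a lazy table that lives in the database
is_lazy <- inherits(lazy_query, "tbl_lazy")

# collect() runs the query and returns a local tibble (a data frame)
local_df <- collect(lazy_query)

DBI::dbDisconnect(demo_conn, shutdown = TRUE)
```

Filtering and selecting before `collect()` keeps the work in the database, so only the two matching rows are transferred into R.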
165
172
166
173
#### How can you see the SQL query equivalent to the tidyverse code?
167
174
168
175
```{r}
169
176
# Add show_query() to the end to see what SQL it is sending!
170
-
species %>%
177
+
species_db %>%
171
178
filter(Relevance=="Study species") %>%
172
179
select(Scientific_name) %>%
173
180
arrange(Scientific_name) %>%
@@ -177,10 +184,10 @@ species %>%
177
184
178
185
This is a great way to start getting familiar with the SQL syntax, because although you can do a lot with `dbplyr`, you cannot do everything that SQL can do. So at some point you might want to start using SQL directly.
179
186
180
-
Here is how you could run the query using the SQL code directly
187
+
Here is how you could run the query using the SQL code directly:
181
188
182
189
```{r}
183
-
# Establish a set of Parquet files
190
+
# query the database using SQL
184
191
dbGetQuery(conn, "SELECT Scientific_name FROM Species WHERE (Relevance = 'Study species') ORDER BY Scientific_name LIMIT 3")
185
192
```
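
When the filter value comes from a variable, one option is to pass it as a query parameter rather than pasting it into the SQL string. The sketch below assumes a hypothetical in-memory DuckDB database (`demo_conn`, with made-up rows) and uses the `params` argument of `dbGetQuery()` with a `?` placeholder.

```r
# Hypothetical in-memory database standing in for the course database
demo_conn <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(demo_conn, "Species", data.frame(
  Scientific_name = c("Pluvialis dominica", "Calidris alpina", "Vulpes lagopus"),
  Relevance = c("Study species", "Study species", "Other (hypothetical)")
))

wanted <- "Study species"

# The ? placeholder is filled from params, so no manual quoting is needed
res <- DBI::dbGetQuery(
  demo_conn,
  "SELECT Scientific_name FROM Species WHERE Relevance = ? ORDER BY Scientific_name LIMIT 3",
  params = list(wanted)
)

DBI::dbDisconnect(demo_conn, shutdown = TRUE)
```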
186
193
@@ -189,34 +196,79 @@ You can do pretty much anything with these quasi-tables, including grouping, sum
189
196
Let's count how many species there are per `Relevance` category:
190
197
191
198
```{r}
192
-
species %>%
199
+
species_db %>%
193
200
group_by(Relevance) %>%
194
201
summarize(num_species = n())
195
202
```
196
203
197
204
Does that code look familiar? This time, however, here is the query that was actually used to retrieve this information:
198
205
199
206
```{r}
200
-
species %>%
207
+
species_db %>%
201
208
group_by(Relevance) %>%
202
209
summarize(num_species = n()) %>%
203
210
show_query()
204
211
```
212
+
You can also create new columns using `mutate()`:
205
213
206
214
```{r}
207
-
species %>%
215
+
species_db %>%
208
216
mutate(Code = paste("X", Code)) %>%
209
217
head()
210
218
```
219
+
What does the query look like?
211
220
212
221
```{r}
213
-
species %>%
222
+
species_db %>%
214
223
mutate(Code = paste("X", Code)) %>%
215
224
head() %>%
216
225
show_query()
217
226
```
227
+
:::warning
228
+
Limitation: the `dbplyr` workflow shown here is view only, so it gives you no way to add or update data in the database. If you want to add or update data, you'll need to use the `DBI` package functions.
229
+
:::
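
As a minimal sketch of what writing through `DBI` can look like, the snippet below appends a row with `dbAppendTable()` and updates it with `dbExecute()`. It runs against a hypothetical in-memory DuckDB database (`demo_conn`) with made-up rows, not the course database.

```r
# Hypothetical in-memory database with a toy Species table
demo_conn <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(demo_conn, "Species", data.frame(
  Code = c("aaaa", "bbbb"),
  Relevance = c("Study species", "Other (hypothetical)")
))

# Add a new row to the existing table
DBI::dbAppendTable(
  demo_conn, "Species",
  data.frame(Code = "cccc", Relevance = "Other (hypothetical)")
)

# Update a row with plain SQL; dbExecute() returns the number of rows affected
n_updated <- DBI::dbExecute(
  demo_conn,
  "UPDATE Species SET Relevance = 'Study species' WHERE Code = 'cccc'"
)

# Check the result
n_rows <- DBI::dbGetQuery(demo_conn, "SELECT COUNT(*) AS n FROM Species")$n

DBI::dbDisconnect(demo_conn, shutdown = TRUE)
```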
230
+
231
+
#### Average egg volume
218
232
219
-
Limitation: no way to add or update data, `dbplyr` is view only. If you want to add or update data, you'll need to use the `DBI` package functions.
233
+
Calculate the average egg volume per species directly in the database:
234
+
235
+
```{r}
236
+
# loading all the necessary tables
237
+
eggs_db <- tbl(conn, "Bird_eggs")
238
+
nests_db <- tbl(conn, "Bird_nests")
239
+
```
240
+
241
+
Compute the volume:
242
+
243
+
```{r}
244
+
eggs_area_db <- eggs_db %>%
245
+
mutate(egg_volume = pi/6*Width^2*Length)
246
+
```
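
As a sanity check on the formula (an interpretation, not something stated in the data set metadata), the expression in `mutate()` is the volume of an ellipsoid whose long axis is the egg length $L$ and whose two short axes both equal the egg width $W$:

$$
V = \frac{4}{3}\pi \cdot \frac{L}{2} \cdot \frac{W}{2} \cdot \frac{W}{2} = \frac{\pi}{6} W^2 L
$$

For example, an egg with $W = 2$ and $L = 3$ (in the same length unit) gives $V = 2\pi \approx 6.28$ cubic units.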
247
+
248
+
Now let's join this information to the nest table and average by species:
Why does the SQL query include the volume computation?
271
+
:::
220
272
221
273
### Disconnecting from the database
222
274
@@ -226,6 +278,13 @@ Before we close our session, it is good practice to disconnect from the database
226
278
DBI::dbDisconnect(conn, shutdown = TRUE)
227
279
```
228
280
281
+
229
282
## How did we create this database?
230
283
231
284
You might be wondering how we created this database from our csv files. Most databases have some function to help you import csv files. Note that since csv files carry no data modeling constraints (they do not have to be normalized or tidy) and no data type constraints, a lot of things can go wrong during the import. This is a great opportunity to implement QA/QC on your data and to help you keep it clean and tidy moving forward as new data are collected.
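
The snippet below is one possible sketch of such an import, not the procedure actually used to build the course database. It writes a tiny made-up csv file, declares the table with types and constraints first, loads the file, and then shows that a duplicate key is rejected. The file contents, column types, and the `demo_conn` name are all assumptions for illustration.

```r
# Write a tiny, made-up csv file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Code,Scientific_name,Relevance",
  "aaaa,Genus alpha,Study species",
  "bbbb,Genus beta,Other"
), csv_path)

demo_conn <- DBI::dbConnect(duckdb::duckdb())

# Declare types and constraints up front so bad rows fail loudly on import
DBI::dbExecute(demo_conn, "
  CREATE TABLE Species (
    Code TEXT PRIMARY KEY,
    Scientific_name TEXT NOT NULL,
    Relevance TEXT
  )")

# Read the csv in R and append its rows to the constrained table
species_df <- read.csv(csv_path, stringsAsFactors = FALSE)
DBI::dbAppendTable(demo_conn, "Species", species_df)

n_loaded <- DBI::dbGetQuery(demo_conn, "SELECT COUNT(*) AS n FROM Species")$n

# Inserting an existing Code again violates the primary key and raises an error
dup_rejected <- inherits(
  try(
    DBI::dbExecute(
      demo_conn,
      "INSERT INTO Species VALUES ('aaaa', 'Genus alpha', 'Study species')"
    ),
    silent = TRUE
  ),
  "try-error"
)

DBI::dbDisconnect(demo_conn, shutdown = TRUE)
```

Declaring the table before loading means the database, rather than a later analysis step, is the place where duplicate codes or missing names get caught.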