@@ -159,7 +159,7 @@ it classified 3 malignant observations as benign, and 4 benign observations as
malignant. The accuracy of this classifier is roughly
89%, given by the formula

- $$ \mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{1+57}{1+57+4+3} = 0.892 $$
+ $$ \mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{1+57}{1+57+4+3} = 0.892. $$

But we can also see that the classifier only identified 1 out of 4 total malignant
tumors; in other words, it misclassified 75% of the malignant cases present in the
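To make the arithmetic in this hunk easy to verify, here is a small self-contained sketch (plain Python, separate from the chapter's code cells) that recomputes the accuracy from the confusion-matrix counts quoted above, along with the fraction of malignant cases the classifier actually caught.

```python
# Counts taken from the passage above: 1 malignant tumor correctly identified,
# 3 malignant tumors classified as benign, 4 benign tumors classified as
# malignant, and 57 benign tumors correctly identified.
tp, fn, fp, tn = 1, 3, 4, 57

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.8923..., roughly 89%

malignant_recall = tp / (tp + fn)
print(malignant_recall)  # 0.25, i.e. 75% of malignant cases were missed
```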
@@ -279,7 +279,7 @@ are completely determined by a
but is actually totally reproducible. As long as you pick the same seed
value, you get the same result!

- ```{index} sample; numpy.random.choice
+ ```{index} sample, to_list
```

Let's use an example to investigate how randomness works in Python. Say we
@@ -291,6 +291,8 @@ Below we use the seed number `1`. At
that point, Python will keep track of the randomness that occurs throughout the code.
For example, we can call the `sample` method
on the series of numbers, passing the argument `n=10` to indicate that we want 10 samples.
+ The `to_list` method converts the resulting series into a basic Python list to make
+ the output easier to read.

```{code-cell} ipython3
import numpy as np
@@ -300,7 +302,7 @@ np.random.seed(1)

nums_0_to_9 = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

- random_numbers1 = nums_0_to_9.sample(n=10).to_numpy()
+ random_numbers1 = nums_0_to_9.sample(n=10).to_list()
random_numbers1
```
You can see that `random_numbers1` is a list of 10 numbers
@@ -309,7 +311,7 @@ we run the `sample` method again,
we will get a fresh batch of 10 numbers that also look random.

```{code-cell} ipython3
- random_numbers2 = nums_0_to_9.sample(n=10).to_numpy()
+ random_numbers2 = nums_0_to_9.sample(n=10).to_list()
random_numbers2
```

@@ -319,12 +321,12 @@ as before---and then call the `sample` method again.

```{code-cell} ipython3
np.random.seed(1)
- random_numbers1_again = nums_0_to_9.sample(n=10).to_numpy()
+ random_numbers1_again = nums_0_to_9.sample(n=10).to_list()
random_numbers1_again
```

```{code-cell} ipython3
- random_numbers2_again = nums_0_to_9.sample(n=10).to_numpy()
+ random_numbers2_again = nums_0_to_9.sample(n=10).to_list()
random_numbers2_again
```

@@ -336,21 +338,21 @@ obtain a different sequence of random numbers.

```{code-cell} ipython3
np.random.seed(4235)
- random_numbers = nums_0_to_9.sample(n=10).to_numpy()
- random_numbers
+ random_numbers1_different = nums_0_to_9.sample(n=10).to_list()
+ random_numbers1_different
```

```{code-cell} ipython3
- random_numbers = nums_0_to_9.sample(n=10).to_numpy()
- random_numbers
+ random_numbers2_different = nums_0_to_9.sample(n=10).to_list()
+ random_numbers2_different
```

In other words, even though the sequences of numbers that Python is generating *look*
random, they are totally determined when we set a seed value!

So what does this mean for data analysis? Well, `sample` is certainly not the
- only data frame method that uses randomness in Python. Many of the functions
- that we use in `scikit-learn`, `pandas`, and beyond use randomness&mdash;many
+ only place where randomness is used in Python. Many of the functions
+ that we use in `scikit-learn` and beyond use randomness&mdash;some
of them without even telling you about it. Also note that when Python starts
up, it creates its own seed to use. So if you do not explicitly
call the `np.random.seed` function, your results
@@ -387,22 +389,23 @@ reproducible.
In this book, we will generally only use packages that play nicely with `numpy`'s
default random number generator, so we will stick with `np.random.seed`.
You can achieve more careful control over randomness in your analysis
- by creating a `numpy` [`RandomState` object](https://numpy.org/doc/1.16/reference/generated/numpy.random.RandomState.html)
+ by creating a `numpy` [`Generator` object](https://numpy.org/doc/stable/reference/random/generator.html)
once at the beginning of your analysis, and passing it to
the `random_state` argument that is available in many `pandas` and `scikit-learn`
- functions. Those functions will then use your `RandomState` to generate random numbers instead of
- `numpy`'s default generator. For example, we can reproduce our earlier example by using a `RandomState`
+ functions. Those functions will then use your `Generator` to generate random numbers instead of
+ `numpy`'s default generator. For example, we can reproduce our earlier example by using a `Generator`
object with the `seed` value set to 1; we get the same lists of numbers once again.
```{code}
- rnd = np.random.RandomState(seed=1)
- random_numbers1_third = nums_0_to_9.sample(n=10, random_state=rnd).to_numpy()
+ from numpy.random import Generator, PCG64
+ rng = Generator(PCG64(seed=1))
+ random_numbers1_third = nums_0_to_9.sample(n=10, random_state=rng).to_list()
random_numbers1_third
```
```{code}
array([2, 9, 6, 4, 0, 3, 1, 7, 8, 5])
```
```{code}
- random_numbers2_third = nums_0_to_9.sample(n=10, random_state=rnd).to_numpy()
+ random_numbers2_third = nums_0_to_9.sample(n=10, random_state=rng).to_list()
random_numbers2_third
```
```{code}
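A brief aside on the replacement code above: NumPy's documentation recommends constructing a `Generator` with `np.random.default_rng`, and `default_rng(1)` creates the same PCG64-backed generator as `Generator(PCG64(seed=1))`. The sketch below shows this equivalent, slightly shorter spelling; it assumes the same `nums_0_to_9` series defined earlier in the chapter.

```python
import numpy as np
import pandas as pd

nums_0_to_9 = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# default_rng(1) builds a Generator backed by PCG64 seeded with 1,
# so it yields the same stream as Generator(PCG64(seed=1)).
rng = np.random.default_rng(1)
nums_0_to_9.sample(n=10, random_state=rng).to_list()
```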
@@ -1830,7 +1833,7 @@ summary_df = pd.DataFrame(
)
plt_irrelevant_accuracies = (
    alt.Chart(summary_df)
-     .mark_line() # point=True
+     .mark_line(point=True)
    .encode(
        x=alt.X("ks", title="Number of Irrelevant Predictors"),
        y=alt.Y(
@@ -1864,12 +1867,12 @@ this evidence; if we fix the number of neighbors to $K=3$, the accuracy falls of

plt_irrelevant_nghbrs = (
    alt.Chart(summary_df)
-     .mark_line() # point=True
+     .mark_line(point=True)
    .encode(
        x=alt.X("ks", title="Number of Irrelevant Predictors"),
        y=alt.Y(
            "nghbrs",
-             title="Number of neighbors",
+             title="Tuned number of neighbors",
        ),
    )
)
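The recurring change in the plotting hunks of this commit replaces a commented-out option with `mark_line(point=True)`. In Altair, the `point` argument of `mark_line` overlays a point mark at each observation on top of the line, which makes the individual estimates easier to read off the curve. Below is a minimal sketch with made-up data (the `demo_df` values are hypothetical, not taken from the chapter):

```python
import altair as alt
import pandas as pd

# Hypothetical data, only to illustrate the mark option.
demo_df = pd.DataFrame({
    "ks": [0, 5, 10, 15, 20],
    "accuracy": [0.89, 0.87, 0.85, 0.81, 0.78],
})

demo_plot = (
    alt.Chart(demo_df)
    .mark_line(point=True)  # line plus a visible point at each row of demo_df
    .encode(
        x=alt.X("ks", title="Number of Irrelevant Predictors"),
        y=alt.Y("accuracy", title="Estimated Accuracy"),
    )
)
demo_plot
```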
@@ -1894,7 +1897,7 @@ plt_irrelevant_nghbrs_fixed = (
    alt.Chart(
        melted_summary_df
    )
-     .mark_line() # point=True
+     .mark_line(point=True)
    .encode(
        x=alt.X("ks", title="Number of Irrelevant Predictors"),
        y=alt.Y(
@@ -2134,7 +2137,7 @@ where the elbow occurs, and whether adding a variable provides a meaningful incr

fwd_sel_accuracies_plot = (
    alt.Chart(accuracies)
-     .mark_line() # point=True
+     .mark_line(point=True)
    .encode(
        x=alt.X("size", title="Number of Predictors"),
        y=alt.Y(