@@ -159,7 +159,7 @@ it classified 3 malignant observations as benign, and 4 benign observations as
malignant. The accuracy of this classifier is roughly
89%, given by the formula

- $$ \mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{1+57}{1+57+4+3} = 0.892 $$
+ $$ \mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{1+57}{1+57+4+3} = 0.892. $$

But we can also see that the classifier only identified 1 out of 4 total malignant
tumors; in other words, it misclassified 75% of the malignant cases present in the
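To make the arithmetic in this hunk easy to verify, here is a small self-contained sketch (plain Python, separate from the chapter's code cells) that recomputes the accuracy from the confusion-matrix counts quoted above, along with the fraction of malignant cases the classifier actually caught.

```python
# Counts taken from the passage above: 1 malignant tumor correctly identified,
# 3 malignant tumors classified as benign, 4 benign tumors classified as
# malignant, and 57 benign tumors correctly identified.
tp, fn, fp, tn = 1, 3, 4, 57

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.8923..., roughly 89%

malignant_recall = tp / (tp + fn)
print(malignant_recall)  # 0.25, i.e. 75% of malignant cases were missed
```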
@@ -279,7 +279,7 @@ are completely determined by a
but is actually totally reproducible. As long as you pick the same seed
value, you get the same result!

- ```{index} sample; numpy.random.choice
+ ```{index} sample, to_list
```

Let's use an example to investigate how randomness works in Python. Say we
@@ -291,6 +291,8 @@ Below we use the seed number `1`. At
that point, Python will keep track of the randomness that occurs throughout the code.
For example, we can call the `sample` method
on the series of numbers, passing the argument `n=10` to indicate that we want 10 samples.
+ The `to_list` method converts the resulting series into a basic Python list to make
+ the output easier to read.

```{code-cell} ipython3
import numpy as np
@@ -300,7 +302,7 @@ np.random.seed(1)

nums_0_to_9 = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

- random_numbers1 = nums_0_to_9.sample(n=10).to_numpy()
+ random_numbers1 = nums_0_to_9.sample(n=10).to_list()
random_numbers1
```
You can see that `random_numbers1` is a list of 10 numbers
@@ -309,7 +311,7 @@ we run the `sample` method again,
we will get a fresh batch of 10 numbers that also look random.

```{code-cell} ipython3
- random_numbers2 = nums_0_to_9.sample(n=10).to_numpy()
+ random_numbers2 = nums_0_to_9.sample(n=10).to_list()
random_numbers2
```

@@ -319,12 +321,12 @@ as before---and then call the `sample` method again.

```{code-cell} ipython3
np.random.seed(1)
- random_numbers1_again = nums_0_to_9.sample(n=10).to_numpy()
+ random_numbers1_again = nums_0_to_9.sample(n=10).to_list()
random_numbers1_again
```

```{code-cell} ipython3
- random_numbers2_again = nums_0_to_9.sample(n=10).to_numpy()
+ random_numbers2_again = nums_0_to_9.sample(n=10).to_list()
random_numbers2_again
```

@@ -336,21 +338,21 @@ obtain a different sequence of random numbers.

```{code-cell} ipython3
np.random.seed(4235)
- random_numbers = nums_0_to_9.sample(n=10).to_numpy()
- random_numbers
+ random_numbers1_different = nums_0_to_9.sample(n=10).to_list()
+ random_numbers1_different
```

```{code-cell} ipython3
- random_numbers = nums_0_to_9.sample(n=10).to_numpy()
- random_numbers
+ random_numbers2_different = nums_0_to_9.sample(n=10).to_list()
+ random_numbers2_different
```

In other words, even though the sequences of numbers that Python is generating *look*
random, they are totally determined when we set a seed value!

So what does this mean for data analysis? Well, `sample` is certainly not the
- only data frame method that uses randomness in Python. Many of the functions
- that we use in `scikit-learn`, `pandas`, and beyond use randomness&mdash;many
+ only place where randomness is used in Python. Many of the functions
+ that we use in `scikit-learn` and beyond use randomness&mdash;some
of them without even telling you about it. Also note that when Python starts
up, it creates its own seed to use. So if you do not explicitly
call the `np.random.seed` function, your results
@@ -387,22 +389,23 @@ reproducible.
In this book, we will generally only use packages that play nicely with `numpy`'s
default random number generator, so we will stick with `np.random.seed`.
You can achieve more careful control over randomness in your analysis
- by creating a `numpy` [`RandomState` object](https://numpy.org/doc/1.16/reference/generated/numpy.random.RandomState.html)
+ by creating a `numpy` [`Generator` object](https://numpy.org/doc/stable/reference/random/generator.html)
once at the beginning of your analysis, and passing it to
the `random_state` argument that is available in many `pandas` and `scikit-learn`
- functions. Those functions will then use your `RandomState` to generate random numbers instead of
- `numpy`'s default generator. For example, we can reproduce our earlier example by using a `RandomState`
+ functions. Those functions will then use your `Generator` to generate random numbers instead of
+ `numpy`'s default generator. For example, we can reproduce our earlier example by using a `Generator`
object with the `seed` value set to 1; we get the same lists of numbers once again.
```{code}
- rnd = np.random.RandomState(seed=1)
- random_numbers1_third = nums_0_to_9.sample(n=10, random_state=rnd).to_numpy()
+ from numpy.random import Generator, PCG64
+ rng = Generator(PCG64(seed=1))
+ random_numbers1_third = nums_0_to_9.sample(n=10, random_state=rng).to_list()
random_numbers1_third
```
```{code}
array([2, 9, 6, 4, 0, 3, 1, 7, 8, 5])
```
```{code}
- random_numbers2_third = nums_0_to_9.sample(n=10, random_state=rnd).to_numpy()
+ random_numbers2_third = nums_0_to_9.sample(n=10, random_state=rng).to_list()
random_numbers2_third
```
```{code}
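A brief aside on the replacement code above: NumPy's documentation recommends constructing a `Generator` with `np.random.default_rng`, and `default_rng(1)` creates the same PCG64-backed generator as `Generator(PCG64(seed=1))`. The sketch below shows this equivalent, slightly shorter spelling; it assumes the same `nums_0_to_9` series defined earlier in the chapter.

```python
import numpy as np
import pandas as pd

nums_0_to_9 = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# default_rng(1) builds a Generator backed by PCG64 seeded with 1,
# so it yields the same stream as Generator(PCG64(seed=1)).
rng = np.random.default_rng(1)
nums_0_to_9.sample(n=10, random_state=rng).to_list()
```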
@@ -1830,7 +1833,7 @@ summary_df = pd.DataFrame(
)
plt_irrelevant_accuracies = (
    alt.Chart(summary_df)
-     .mark_line() # point=True
+     .mark_line(point=True)
    .encode(
        x=alt.X("ks", title="Number of Irrelevant Predictors"),
        y=alt.Y(
@@ -1864,12 +1867,12 @@ this evidence; if we fix the number of neighbors to $K=3$, the accuracy falls of

plt_irrelevant_nghbrs = (
    alt.Chart(summary_df)
-     .mark_line() # point=True
+     .mark_line(point=True)
    .encode(
        x=alt.X("ks", title="Number of Irrelevant Predictors"),
        y=alt.Y(
            "nghbrs",
-             title="Number of neighbors",
+             title="Tuned number of neighbors",
        ),
    )
)
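The recurring change in the plotting hunks of this commit replaces a commented-out option with `mark_line(point=True)`. In Altair, the `point` argument of `mark_line` overlays a point mark at each observation on top of the line, which makes the individual estimates easier to read off the curve. Below is a minimal sketch with made-up data (the `demo_df` values are hypothetical, not taken from the chapter):

```python
import altair as alt
import pandas as pd

# Hypothetical data, only to illustrate the mark option.
demo_df = pd.DataFrame({
    "ks": [0, 5, 10, 15, 20],
    "accuracy": [0.89, 0.87, 0.85, 0.81, 0.78],
})

demo_plot = (
    alt.Chart(demo_df)
    .mark_line(point=True)  # line plus a visible point at each row of demo_df
    .encode(
        x=alt.X("ks", title="Number of Irrelevant Predictors"),
        y=alt.Y("accuracy", title="Estimated Accuracy"),
    )
)
demo_plot
```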
@@ -1894,7 +1897,7 @@ plt_irrelevant_nghbrs_fixed = (
    alt.Chart(
        melted_summary_df
    )
-     .mark_line() # point=True
+     .mark_line(point=True)
    .encode(
        x=alt.X("ks", title="Number of Irrelevant Predictors"),
        y=alt.Y(
@@ -2134,7 +2137,7 @@ where the elbow occurs, and whether adding a variable provides a meaningful incr

fwd_sel_accuracies_plot = (
    alt.Chart(accuracies)
-     .mark_line() # point=True
+     .mark_line(point=True)
    .encode(
        x=alt.X("size", title="Number of Predictors"),
        y=alt.Y(