{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Groupby and Arrest Data\n", "\n", "In our merging exercises, we examined the relationship between county-level violent arrest totals and county-level drug arrest totals. In those exercises, you were given a dataset that provided you with county-level arrest totals. But that's not actually how the data is provided by the state of California. This week we will work with the *raw* California arrest data, which is not organized by county or even county-year. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Gradescope Autograding\n", "\n", "Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.\n", "\n", "For this assignment, please name your file `exercise_groupby.ipynb` before uploading.\n", "\n", "You can check that you have answers for all questions in your `results` dictionary with this code:\n", "\n", "```python\n", "assert set(results.keys()) == {\n", " \"ex4_num_rows\",\n", " \"ex5_collapsed_vars\",\n", " \"ex7_alameda_1980_share_violent_arrestees_black\",\n", " \"ex11_white_drug_share\",\n", " \"ex11_black_drug_share\",\n", " \"ex12_proportionate\",\n", "}\n", "```\n", "\n", "\n", "### Submission Limits\n", "\n", "Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade. Submissions that error out will **not** count against this total." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1\n", "\n", "Import the raw California arrest data from the State Attorney General's office. 
Please use [this link](https://github.com/nickeubank/MIDS_Data/blob/master/OnlineArrestData1980-2021.csv) (the original is [here](https://openjustice.doj.ca.gov/data), but they keep updating it and I get tired of updating solutions, so... please use my copy!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning the Group Structure of Your Data\n", "\n", "### Exercise 2\n", "\n", "What is the unit of observation for this dataset? In other words, when row zero says that there were 505 arrests for `VIOLENT` crimes, what exactly is that telling you: 505 arrests in 1980? 505 arrests in Alameda County?\n", "\n", "(Please answer in Markdown)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Testing Your Assumptions\n", "\n", "It's important to be able to test whether the data you are working with really is organized the way you think it is, especially when working with groupby. Let's discuss how to check your answer to Exercise 2 with the `.duplicated()` method. \n", "\n", "Consider the following toy data:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"| | social_security_numbers | second_column |\n",
"|---|---|---|\n",
"| 0 | 111111111 | a |\n",
"| 1 | 222222222 | a |\n",
"| 2 | 222222222 | a |\n",
"| 3 | 333333333 | a |\n",
"| 4 | 333333333 | b |\n",
"