Practical Example

Practical Example#

Let’s explore a practical example: what if we wanted to know how common homicide was in each state in the United States and identify those places where per capita homicide lowest and highest. To start with, this means we’ll need some data, so we’ll use data from the U.S. Centers for Disease Control and Prevention for homicide totals by state and data from the U.S. Census Bureau to get population in 2020. Let’s read those files into dataframes and see what we’ve got!

Note: we’ve removed some extraneous columns to make the example clearer

import pandas as pd

pd.set_option("mode.copy_on_write", True)

population = pd.read_csv("data/population.csv")
population
NAME POPESTIMATE2020
0 United States 331501080
1 Northeast Region 57525633
2 Midwest Region 68935174
3 South Region 126409007
4 West Region 78631266
5 Alabama 5024803
6 Alaska 732441
7 Arizona 7177986
8 Arkansas 3012232
9 California 39499738
10 Colorado 5784308
11 Connecticut 3600260
12 Delaware 991886
13 District of Columbia 690093
14 Florida 21569932
15 Georgia 10725800
16 Hawaii 1451911
17 Idaho 1847772
18 Illinois 12785245
19 Indiana 6785644
20 Iowa 3188669
21 Kansas 2935880
22 Kentucky 4503958
23 Louisiana 4651203
24 Maine 1362280
25 Maryland 6172679
26 Massachusetts 7022220
27 Michigan 10067664
28 Minnesota 5707165
29 Mississippi 2956870
30 Missouri 6154481
31 Montana 1086193
32 Nebraska 1961455
33 Nevada 3114071
34 New Hampshire 1377848
35 New Jersey 9279743
36 New Mexico 2117566
37 New York 20154933
38 North Carolina 10457177
39 North Dakota 778962
40 Ohio 11790587
41 Oklahoma 3962031
42 Oregon 4241544
43 Pennsylvania 12989625
44 Rhode Island 1096229
45 South Carolina 5130729
46 South Dakota 887099
47 Tennessee 6920119
48 Texas 29217653
49 Utah 3281684
50 Vermont 642495
51 Virginia 8632044
52 Washington 7718785
53 West Virginia 1789798
54 Wisconsin 5892323
55 Wyoming 577267
56 Puerto Rico 3281538
homicides = pd.read_csv("data/homicides.csv")
homicides
STATE DEATHS
0 AL 654
1 AK 55
2 AZ 525
3 AR 371
4 CA 2368
5 CO 336
6 CT 152
7 DE 86
8 FL 1530
9 GA 1093
10 HI 46
11 ID 42
12 IL 1353
13 IN 620
14 IA 106
15 KS 195
16 KY 404
17 LA 873
18 ME 21
19 MD 649
20 MA 183
21 MI 811
22 MN 198
23 MS 576
24 MO 803
25 MT 65
26 NE 76
27 NV 217
28 NH 14
29 NJ 362
30 NM 216
31 NY 875
32 NC 884
33 ND 30
34 OH 1004
35 OK 342
36 OR 157
37 PA 994
38 RI 29
39 SC 622
40 SD 52
41 TN 753
42 TX 2212
43 UT 95
44 VT 14
45 VA 531
46 WA 322
47 WV 114
48 WI 334
49 WY 25

There are two items to note here. The first is that the population data has the full state name listed, while the homicide data only has state abbreviations. The second is that while the homicide data has 50 states, the population data has 57 since there are additional regions listed (e.g. “United States”, “Northeast Region”, etc.).

What we’ll want to do, is to add in the state names for each state in the homicide data, then merge the two datasets into one. Once we have that combined dataset, we can calculate the per capita homicide rate in each state and sort our data to determine the highest/lowest per capita.

To add in the state names for each state, let’s load in a dataset that contains the 50 states and their abbreviations:

names = pd.read_csv("data/state_codes.csv")
names
State Code
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
5 Colorado CO
6 Connecticut CT
7 Delaware DE
8 District of Columbia DC
9 Florida FL
10 Georgia GA
11 Hawaii HI
12 Idaho ID
13 Illinois IL
14 Indiana IN
15 Iowa IA
16 Kansas KS
17 Kentucky KY
18 Louisiana LA
19 Maine ME
20 Maryland MD
21 Massachusetts MA
22 Michigan MI
23 Minnesota MN
24 Mississippi MS
25 Missouri MO
26 Montana MT
27 Nebraska NE
28 Nevada NV
29 New Hampshire NH
30 New Jersey NJ
31 New Mexico NM
32 New York NY
33 North Carolina NC
34 North Dakota ND
35 Ohio OH
36 Oklahoma OK
37 Oregon OR
38 Pennsylvania PA
39 Rhode Island RI
40 South Carolina SC
41 South Dakota SD
42 Tennessee TN
43 Texas TX
44 Utah UT
45 Vermont VT
46 Virginia VA
47 Washington WA
48 West Virginia WV
49 Wisconsin WI
50 Wyoming WY

We see another issue here, though: there are 51 states listed with abbreviations (remember, Python starts counting at 0) since the District of Columbia is included while there are 50 states listed in the homicides DataFrame. Let’s left-merge the data so we only keep the 50 states for which we have data. Since the column in homicides that has the labels is labeled ‘STATE’ and the column in names with the state labels is called ‘Code’, we’ll use the ‘left_on’ and ‘right_on’ parameters to specify those columns:

homicides_with_states = pd.merge(
    homicides, names, left_on="STATE", right_on="Code", how="left"
)
homicides_with_states
STATE DEATHS State Code
0 AL 654 Alabama AL
1 AK 55 Alaska AK
2 AZ 525 Arizona AZ
3 AR 371 Arkansas AR
4 CA 2368 California CA
5 CO 336 Colorado CO
6 CT 152 Connecticut CT
7 DE 86 Delaware DE
8 FL 1530 Florida FL
9 GA 1093 Georgia GA
10 HI 46 Hawaii HI
11 ID 42 Idaho ID
12 IL 1353 Illinois IL
13 IN 620 Indiana IN
14 IA 106 Iowa IA
15 KS 195 Kansas KS
16 KY 404 Kentucky KY
17 LA 873 Louisiana LA
18 ME 21 Maine ME
19 MD 649 Maryland MD
20 MA 183 Massachusetts MA
21 MI 811 Michigan MI
22 MN 198 Minnesota MN
23 MS 576 Mississippi MS
24 MO 803 Missouri MO
25 MT 65 Montana MT
26 NE 76 Nebraska NE
27 NV 217 Nevada NV
28 NH 14 New Hampshire NH
29 NJ 362 New Jersey NJ
30 NM 216 New Mexico NM
31 NY 875 New York NY
32 NC 884 North Carolina NC
33 ND 30 North Dakota ND
34 OH 1004 Ohio OH
35 OK 342 Oklahoma OK
36 OR 157 Oregon OR
37 PA 994 Pennsylvania PA
38 RI 29 Rhode Island RI
39 SC 622 South Carolina SC
40 SD 52 South Dakota SD
41 TN 753 Tennessee TN
42 TX 2212 Texas TX
43 UT 95 Utah UT
44 VT 14 Vermont VT
45 VA 531 Virginia VA
46 WA 322 Washington WA
47 WV 114 West Virginia WV
48 WI 334 Wisconsin WI
49 WY 25 Wyoming WY

Now we have those items combined, but we don’t need the “STATE” or “Code” columns, so let’s drop them using the drop method. We’ll do this operation in place so that rather than having to write:

homicides3 = homicides_with_states.drop(columns=['STATE','Code'])

which leads to a proliferation of variables each time we want to do a new operation (homicide3, homicide4, etc.); instead we can perform the operation directly on the data using the inplace=True keyword.

homicides_with_states = homicides_with_states.drop(columns=["STATE", "Code"])
homicides_with_states
DEATHS State
0 654 Alabama
1 55 Alaska
2 525 Arizona
3 371 Arkansas
4 2368 California
5 336 Colorado
6 152 Connecticut
7 86 Delaware
8 1530 Florida
9 1093 Georgia
10 46 Hawaii
11 42 Idaho
12 1353 Illinois
13 620 Indiana
14 106 Iowa
15 195 Kansas
16 404 Kentucky
17 873 Louisiana
18 21 Maine
19 649 Maryland
20 183 Massachusetts
21 811 Michigan
22 198 Minnesota
23 576 Mississippi
24 803 Missouri
25 65 Montana
26 76 Nebraska
27 217 Nevada
28 14 New Hampshire
29 362 New Jersey
30 216 New Mexico
31 875 New York
32 884 North Carolina
33 30 North Dakota
34 1004 Ohio
35 342 Oklahoma
36 157 Oregon
37 994 Pennsylvania
38 29 Rhode Island
39 622 South Carolina
40 52 South Dakota
41 753 Tennessee
42 2212 Texas
43 95 Utah
44 14 Vermont
45 531 Virginia
46 322 Washington
47 114 West Virginia
48 334 Wisconsin
49 25 Wyoming

And just because it’s easier to reference cleaned and ordered labels, let’s rename ‘DEATHS’ to ‘homicides’ using rename.

homicides_with_states = homicides_with_states.rename(
    columns={"DEATHS": "homicides", "State": "state"}
)
homicides_with_states
homicides state
0 654 Alabama
1 55 Alaska
2 525 Arizona
3 371 Arkansas
4 2368 California
5 336 Colorado
6 152 Connecticut
7 86 Delaware
8 1530 Florida
9 1093 Georgia
10 46 Hawaii
11 42 Idaho
12 1353 Illinois
13 620 Indiana
14 106 Iowa
15 195 Kansas
16 404 Kentucky
17 873 Louisiana
18 21 Maine
19 649 Maryland
20 183 Massachusetts
21 811 Michigan
22 198 Minnesota
23 576 Mississippi
24 803 Missouri
25 65 Montana
26 76 Nebraska
27 217 Nevada
28 14 New Hampshire
29 362 New Jersey
30 216 New Mexico
31 875 New York
32 884 North Carolina
33 30 North Dakota
34 1004 Ohio
35 342 Oklahoma
36 157 Oregon
37 994 Pennsylvania
38 29 Rhode Island
39 622 South Carolina
40 52 South Dakota
41 753 Tennessee
42 2212 Texas
43 95 Utah
44 14 Vermont
45 531 Virginia
46 322 Washington
47 114 West Virginia
48 334 Wisconsin
49 25 Wyoming

Great - now we’re ready to merge in our population data. Let’s perform an outer merge so that we keep all our data, then we can check what’s missing and remove it as appropriate.

combined = pd.merge(
    homicides_with_states, population, how="outer", left_on="state", right_on="NAME"
)
combined
homicides state NAME POPESTIMATE2020
0 654.0 Alabama Alabama 5024803
1 55.0 Alaska Alaska 732441
2 525.0 Arizona Arizona 7177986
3 371.0 Arkansas Arkansas 3012232
4 2368.0 California California 39499738
5 336.0 Colorado Colorado 5784308
6 152.0 Connecticut Connecticut 3600260
7 86.0 Delaware Delaware 991886
8 1530.0 Florida Florida 21569932
9 1093.0 Georgia Georgia 10725800
10 46.0 Hawaii Hawaii 1451911
11 42.0 Idaho Idaho 1847772
12 1353.0 Illinois Illinois 12785245
13 620.0 Indiana Indiana 6785644
14 106.0 Iowa Iowa 3188669
15 195.0 Kansas Kansas 2935880
16 404.0 Kentucky Kentucky 4503958
17 873.0 Louisiana Louisiana 4651203
18 21.0 Maine Maine 1362280
19 649.0 Maryland Maryland 6172679
20 183.0 Massachusetts Massachusetts 7022220
21 811.0 Michigan Michigan 10067664
22 198.0 Minnesota Minnesota 5707165
23 576.0 Mississippi Mississippi 2956870
24 803.0 Missouri Missouri 6154481
25 65.0 Montana Montana 1086193
26 76.0 Nebraska Nebraska 1961455
27 217.0 Nevada Nevada 3114071
28 14.0 New Hampshire New Hampshire 1377848
29 362.0 New Jersey New Jersey 9279743
30 216.0 New Mexico New Mexico 2117566
31 875.0 New York New York 20154933
32 884.0 North Carolina North Carolina 10457177
33 30.0 North Dakota North Dakota 778962
34 1004.0 Ohio Ohio 11790587
35 342.0 Oklahoma Oklahoma 3962031
36 157.0 Oregon Oregon 4241544
37 994.0 Pennsylvania Pennsylvania 12989625
38 29.0 Rhode Island Rhode Island 1096229
39 622.0 South Carolina South Carolina 5130729
40 52.0 South Dakota South Dakota 887099
41 753.0 Tennessee Tennessee 6920119
42 2212.0 Texas Texas 29217653
43 95.0 Utah Utah 3281684
44 14.0 Vermont Vermont 642495
45 531.0 Virginia Virginia 8632044
46 322.0 Washington Washington 7718785
47 114.0 West Virginia West Virginia 1789798
48 334.0 Wisconsin Wisconsin 5892323
49 25.0 Wyoming Wyoming 577267
50 NaN NaN United States 331501080
51 NaN NaN Northeast Region 57525633
52 NaN NaN Midwest Region 68935174
53 NaN NaN South Region 126409007
54 NaN NaN West Region 78631266
55 NaN NaN District of Columbia 690093
56 NaN NaN Puerto Rico 3281538

We can see from the above that we have a number of NaN values. Let’s look at rows with any NaN values:

combined[combined.isna().any(axis=1)]
homicides state NAME POPESTIMATE2020
50 NaN NaN United States 331501080
51 NaN NaN Northeast Region 57525633
52 NaN NaN Midwest Region 68935174
53 NaN NaN South Region 126409007
54 NaN NaN West Region 78631266
55 NaN NaN District of Columbia 690093
56 NaN NaN Puerto Rico 3281538

Let’s just break down the above statement for a moment for comprehension. combined.isna() returns a True value for each entry in the DataFrame with a NaN value. The .any(axis=1) looks across axis 1 and if any of the entries in that row (across all columns) has a True value, then that row is represented with a True. Let’s look at the steps so it’s clear:

combined.isna()
homicides state NAME POPESTIMATE2020
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 False False False False
6 False False False False
7 False False False False
8 False False False False
9 False False False False
10 False False False False
11 False False False False
12 False False False False
13 False False False False
14 False False False False
15 False False False False
16 False False False False
17 False False False False
18 False False False False
19 False False False False
20 False False False False
21 False False False False
22 False False False False
23 False False False False
24 False False False False
25 False False False False
26 False False False False
27 False False False False
28 False False False False
29 False False False False
30 False False False False
31 False False False False
32 False False False False
33 False False False False
34 False False False False
35 False False False False
36 False False False False
37 False False False False
38 False False False False
39 False False False False
40 False False False False
41 False False False False
42 False False False False
43 False False False False
44 False False False False
45 False False False False
46 False False False False
47 False False False False
48 False False False False
49 False False False False
50 True True False False
51 True True False False
52 True True False False
53 True True False False
54 True True False False
55 True True False False
56 True True False False
combined.isna().any(axis=1)
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
47    False
48    False
49    False
50     True
51     True
52     True
53     True
54     True
55     True
56     True
dtype: bool

So when we put it altogether, it only displaces those rows with NaN values:

combined[combined.isna().any(axis=1)]
homicides state NAME POPESTIMATE2020
50 NaN NaN United States 331501080
51 NaN NaN Northeast Region 57525633
52 NaN NaN Midwest Region 68935174
53 NaN NaN South Region 126409007
54 NaN NaN West Region 78631266
55 NaN NaN District of Columbia 690093
56 NaN NaN Puerto Rico 3281538

What we can see here is that we don’t need any of these regions, as none are U.S. states, so let’s drop them using the dropna method:

combined = combined.dropna(axis=0)
combined
homicides state NAME POPESTIMATE2020
0 654.0 Alabama Alabama 5024803
1 55.0 Alaska Alaska 732441
2 525.0 Arizona Arizona 7177986
3 371.0 Arkansas Arkansas 3012232
4 2368.0 California California 39499738
5 336.0 Colorado Colorado 5784308
6 152.0 Connecticut Connecticut 3600260
7 86.0 Delaware Delaware 991886
8 1530.0 Florida Florida 21569932
9 1093.0 Georgia Georgia 10725800
10 46.0 Hawaii Hawaii 1451911
11 42.0 Idaho Idaho 1847772
12 1353.0 Illinois Illinois 12785245
13 620.0 Indiana Indiana 6785644
14 106.0 Iowa Iowa 3188669
15 195.0 Kansas Kansas 2935880
16 404.0 Kentucky Kentucky 4503958
17 873.0 Louisiana Louisiana 4651203
18 21.0 Maine Maine 1362280
19 649.0 Maryland Maryland 6172679
20 183.0 Massachusetts Massachusetts 7022220
21 811.0 Michigan Michigan 10067664
22 198.0 Minnesota Minnesota 5707165
23 576.0 Mississippi Mississippi 2956870
24 803.0 Missouri Missouri 6154481
25 65.0 Montana Montana 1086193
26 76.0 Nebraska Nebraska 1961455
27 217.0 Nevada Nevada 3114071
28 14.0 New Hampshire New Hampshire 1377848
29 362.0 New Jersey New Jersey 9279743
30 216.0 New Mexico New Mexico 2117566
31 875.0 New York New York 20154933
32 884.0 North Carolina North Carolina 10457177
33 30.0 North Dakota North Dakota 778962
34 1004.0 Ohio Ohio 11790587
35 342.0 Oklahoma Oklahoma 3962031
36 157.0 Oregon Oregon 4241544
37 994.0 Pennsylvania Pennsylvania 12989625
38 29.0 Rhode Island Rhode Island 1096229
39 622.0 South Carolina South Carolina 5130729
40 52.0 South Dakota South Dakota 887099
41 753.0 Tennessee Tennessee 6920119
42 2212.0 Texas Texas 29217653
43 95.0 Utah Utah 3281684
44 14.0 Vermont Vermont 642495
45 531.0 Virginia Virginia 8632044
46 322.0 Washington Washington 7718785
47 114.0 West Virginia West Virginia 1789798
48 334.0 Wisconsin Wisconsin 5892323
49 25.0 Wyoming Wyoming 577267

Now that we have our data let’s take one more step to clean the formatting of our data. Let’s drop the extra column with state names and rename ‘POPESTIMATE2020’ to ‘population’:

combined = combined.drop(columns=["NAME"])
combined = combined.rename(columns={"POPESTIMATE2020": "population"})
combined
homicides state population
0 654.0 Alabama 5024803
1 55.0 Alaska 732441
2 525.0 Arizona 7177986
3 371.0 Arkansas 3012232
4 2368.0 California 39499738
5 336.0 Colorado 5784308
6 152.0 Connecticut 3600260
7 86.0 Delaware 991886
8 1530.0 Florida 21569932
9 1093.0 Georgia 10725800
10 46.0 Hawaii 1451911
11 42.0 Idaho 1847772
12 1353.0 Illinois 12785245
13 620.0 Indiana 6785644
14 106.0 Iowa 3188669
15 195.0 Kansas 2935880
16 404.0 Kentucky 4503958
17 873.0 Louisiana 4651203
18 21.0 Maine 1362280
19 649.0 Maryland 6172679
20 183.0 Massachusetts 7022220
21 811.0 Michigan 10067664
22 198.0 Minnesota 5707165
23 576.0 Mississippi 2956870
24 803.0 Missouri 6154481
25 65.0 Montana 1086193
26 76.0 Nebraska 1961455
27 217.0 Nevada 3114071
28 14.0 New Hampshire 1377848
29 362.0 New Jersey 9279743
30 216.0 New Mexico 2117566
31 875.0 New York 20154933
32 884.0 North Carolina 10457177
33 30.0 North Dakota 778962
34 1004.0 Ohio 11790587
35 342.0 Oklahoma 3962031
36 157.0 Oregon 4241544
37 994.0 Pennsylvania 12989625
38 29.0 Rhode Island 1096229
39 622.0 South Carolina 5130729
40 52.0 South Dakota 887099
41 753.0 Tennessee 6920119
42 2212.0 Texas 29217653
43 95.0 Utah 3281684
44 14.0 Vermont 642495
45 531.0 Virginia 8632044
46 322.0 Washington 7718785
47 114.0 West Virginia 1789798
48 334.0 Wisconsin 5892323
49 25.0 Wyoming 577267

Now we’re ready to conduct our analysis. Let’s create a new column called ‘homicide_rate’ and it will be the number of homicides divided by the state population. Since this number is typically a small value, this is often reported as the number of homicides per 100,000 people, so we just multiply the result by 100,000.

combined["homicide_rate"] = combined["homicides"] / combined["population"] * 100000
combined
homicides state population homicide_rate
0 654.0 Alabama 5024803 13.015436
1 55.0 Alaska 732441 7.509137
2 525.0 Arizona 7177986 7.314029
3 371.0 Arkansas 3012232 12.316448
4 2368.0 California 39499738 5.994976
5 336.0 Colorado 5784308 5.808819
6 152.0 Connecticut 3600260 4.221917
7 86.0 Delaware 991886 8.670351
8 1530.0 Florida 21569932 7.093207
9 1093.0 Georgia 10725800 10.190382
10 46.0 Hawaii 1451911 3.168238
11 42.0 Idaho 1847772 2.273008
12 1353.0 Illinois 12785245 10.582511
13 620.0 Indiana 6785644 9.136937
14 106.0 Iowa 3188669 3.324271
15 195.0 Kansas 2935880 6.641961
16 404.0 Kentucky 4503958 8.969888
17 873.0 Louisiana 4651203 18.769338
18 21.0 Maine 1362280 1.541533
19 649.0 Maryland 6172679 10.514073
20 183.0 Massachusetts 7022220 2.606013
21 811.0 Michigan 10067664 8.055493
22 198.0 Minnesota 5707165 3.469323
23 576.0 Mississippi 2956870 19.480058
24 803.0 Missouri 6154481 13.047404
25 65.0 Montana 1086193 5.984204
26 76.0 Nebraska 1961455 3.874675
27 217.0 Nevada 3114071 6.968370
28 14.0 New Hampshire 1377848 1.016077
29 362.0 New Jersey 9279743 3.900970
30 216.0 New Mexico 2117566 10.200390
31 875.0 New York 20154933 4.341369
32 884.0 North Carolina 10457177 8.453524
33 30.0 North Dakota 778962 3.851279
34 1004.0 Ohio 11790587 8.515267
35 342.0 Oklahoma 3962031 8.631936
36 157.0 Oregon 4241544 3.701482
37 994.0 Pennsylvania 12989625 7.652261
38 29.0 Rhode Island 1096229 2.645433
39 622.0 South Carolina 5130729 12.123034
40 52.0 South Dakota 887099 5.861803
41 753.0 Tennessee 6920119 10.881316
42 2212.0 Texas 29217653 7.570766
43 95.0 Utah 3281684 2.894855
44 14.0 Vermont 642495 2.179005
45 531.0 Virginia 8632044 6.151498
46 322.0 Washington 7718785 4.171641
47 114.0 West Virginia 1789798 6.369434
48 334.0 Wisconsin 5892323 5.668393
49 25.0 Wyoming 577267 4.330752

Lastly, let’s sort the data by ‘homicide_rate’ to see what the states are with the highest and lowest rates:

combined = combined.sort_values(by="homicide_rate")
combined
homicides state population homicide_rate
28 14.0 New Hampshire 1377848 1.016077
18 21.0 Maine 1362280 1.541533
44 14.0 Vermont 642495 2.179005
11 42.0 Idaho 1847772 2.273008
20 183.0 Massachusetts 7022220 2.606013
38 29.0 Rhode Island 1096229 2.645433
43 95.0 Utah 3281684 2.894855
10 46.0 Hawaii 1451911 3.168238
14 106.0 Iowa 3188669 3.324271
22 198.0 Minnesota 5707165 3.469323
36 157.0 Oregon 4241544 3.701482
33 30.0 North Dakota 778962 3.851279
26 76.0 Nebraska 1961455 3.874675
29 362.0 New Jersey 9279743 3.900970
46 322.0 Washington 7718785 4.171641
6 152.0 Connecticut 3600260 4.221917
49 25.0 Wyoming 577267 4.330752
31 875.0 New York 20154933 4.341369
48 334.0 Wisconsin 5892323 5.668393
5 336.0 Colorado 5784308 5.808819
40 52.0 South Dakota 887099 5.861803
25 65.0 Montana 1086193 5.984204
4 2368.0 California 39499738 5.994976
45 531.0 Virginia 8632044 6.151498
47 114.0 West Virginia 1789798 6.369434
15 195.0 Kansas 2935880 6.641961
27 217.0 Nevada 3114071 6.968370
8 1530.0 Florida 21569932 7.093207
2 525.0 Arizona 7177986 7.314029
1 55.0 Alaska 732441 7.509137
42 2212.0 Texas 29217653 7.570766
37 994.0 Pennsylvania 12989625 7.652261
21 811.0 Michigan 10067664 8.055493
32 884.0 North Carolina 10457177 8.453524
34 1004.0 Ohio 11790587 8.515267
35 342.0 Oklahoma 3962031 8.631936
7 86.0 Delaware 991886 8.670351
16 404.0 Kentucky 4503958 8.969888
13 620.0 Indiana 6785644 9.136937
9 1093.0 Georgia 10725800 10.190382
30 216.0 New Mexico 2117566 10.200390
19 649.0 Maryland 6172679 10.514073
12 1353.0 Illinois 12785245 10.582511
41 753.0 Tennessee 6920119 10.881316
39 622.0 South Carolina 5130729 12.123034
3 371.0 Arkansas 3012232 12.316448
0 654.0 Alabama 5024803 13.015436
24 803.0 Missouri 6154481 13.047404
17 873.0 Louisiana 4651203 18.769338
23 576.0 Mississippi 2956870 19.480058

We can see that the states of New Hampshire, Maine, Vermont, Idaho, and Massachusetts, each have homicide rates between and 1 and 2.6 people per 100,000. Arkansas, Alabama, Missouri, Louisiana, and Mississippi have far higher rates ranging from 12.3 to 19.5 people per 100,000. The homicide rate is nearly 20 times higher in Mississippi than it is in New Hampshire.

All of the skills and tools shown here scale up to larger datasets as well. As we can see, data preparation was a significant part of the effort and being able to programmatically complete the process was essential. Data preparation requires careful design to take care of potential imperfections, omissions, or mismatches in the data. Once taken care of, the subsequent analysis can proceed for us to derive insights from our data.

Summary#

In this section, we saw that merging is a key tool for combining datasets and is able to accommodate many of the common types of merges that we would want to execute. Merging data allows us to integrate the information from two DataFrames logically and coherently to prepare it for further analysis and querying, but care must be taken in structuring the merge so that we don’t lose data or introduce unwanted NaN values. Next, you’ll get some practice merging data. Once we have our data combined, in an upcoming lesson, we’ll explore how to group our data by its columns.