Processing Data with Data Frames
Now that we have the basics of using data frames out of the way, let’s dig a little deeper into how to use them to do some basic operations and analysis. We’ll use the same data file we finished up with last time: USC00200228.csv which contains rain, snowfall, snow depth, minimum, maximum, and observation temperature data each day for January - March 2020 in Ann Arbor, MI.
So, let’s read that data again and populate our data frame.
[1]:
import pandas as pd
import numpy as np
file = "USC00200228.csv"
data = pd.read_csv(file)
print(data.head(9))
STATION NAME DATE PRCP SNOW SNWD TMAX TMIN \
0 USC00200228 ANN ARBOR SE, MI US 2020-01-01 0.05 0.8 2.0 30 24
1 USC00200228 ANN ARBOR SE, MI US 2020-01-02 0.00 0.0 1.0 37 23
2 USC00200228 ANN ARBOR SE, MI US 2020-01-03 0.00 0.0 0.0 47 30
3 USC00200228 ANN ARBOR SE, MI US 2020-01-04 0.00 0.0 0.0 43 33
4 USC00200228 ANN ARBOR SE, MI US 2020-01-05 0.02 0.2 0.0 35 29
5 USC00200228 ANN ARBOR SE, MI US 2020-01-06 0.04 0.6 0.0 39 29
6 USC00200228 ANN ARBOR SE, MI US 2020-01-07 0.00 0.0 0.0 45 25
7 USC00200228 ANN ARBOR SE, MI US 2020-01-08 0.02 0.4 0.0 43 23
8 USC00200228 ANN ARBOR SE, MI US 2020-01-09 0.00 0.1 0.0 25 12
TOBS
0 24
1 30
2 40
3 33
4 29
5 31
6 25
7 23
8 19
Note that by passing the 9 as an argument tohead(), the first 9 columns were returned.
Basic Stats
Pandas gives us an all quick tool to get some basic information about the dataset, describe():
[2]:
data.describe()
[2]:
| PRCP | SNOW | SNWD | TMAX | TMIN | TOBS | |
|---|---|---|---|---|---|---|
| count | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 90.000000 |
| mean | 0.114333 | 0.358889 | 1.188889 | 40.877778 | 23.633333 | 28.633333 |
| std | 0.279084 | 0.893534 | 1.606828 | 9.840101 | 8.423416 | 9.985889 |
| min | 0.000000 | 0.000000 | 0.000000 | 21.000000 | 1.000000 | 1.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 34.000000 | 18.250000 | 22.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 38.500000 | 23.000000 | 29.000000 |
| 75% | 0.085000 | 0.200000 | 2.000000 | 48.000000 | 29.000000 | 33.000000 |
| max | 1.540000 | 5.700000 | 6.000000 | 62.000000 | 42.000000 | 61.000000 |
which returns some basic statistical information for all columns that contain numeric data. You can control the output of describe() to only gather a subset of this information or change the percentile values to something else. See the documentation for more details.
While describe() is uesful to get a quick look at your data, typically, we may want to store the results of one of these statistical operations in a variable for use later on. In that case, it is simple to use the proper pandas statistical tool:
[3]:
datamax = data.max()
print(datamax)
STATION USC00200228
NAME ANN ARBOR SE, MI US
DATE 2020-03-30
PRCP 1.54
SNOW 5.7
SNWD 6
TMAX 62
TMIN 42
TOBS 61
dtype: object
The result of this type of operation (max(), min(), mean(), std(), count() all behave in a similar same way) is a pandas Series:
[4]:
type(datamax)
[4]:
pandas.core.series.Series
which means that we can quickly extract the maximum value for snowfall and save it to a variable:
[5]:
maxsnow = datamax['SNOW']
print(maxsnow)
5.7
Again, a similar procedure would be used to get specific values of the other basic statistical operations.
Sorting and querying
Another tool that is often needed when working with a dataset is the ability to sort the data. There are two ways do this, we can sort by the index (the row or column names):
[6]:
data.sort_index(axis=1)
[6]:
| DATE | NAME | PRCP | SNOW | SNWD | STATION | TMAX | TMIN | TOBS | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 | ANN ARBOR SE, MI US | 0.05 | 0.8 | 2.0 | USC00200228 | 30 | 24 | 24 |
| 1 | 2020-01-02 | ANN ARBOR SE, MI US | 0.00 | 0.0 | 1.0 | USC00200228 | 37 | 23 | 30 |
| 2 | 2020-01-03 | ANN ARBOR SE, MI US | 0.00 | 0.0 | 0.0 | USC00200228 | 47 | 30 | 40 |
| 3 | 2020-01-04 | ANN ARBOR SE, MI US | 0.00 | 0.0 | 0.0 | USC00200228 | 43 | 33 | 33 |
| 4 | 2020-01-05 | ANN ARBOR SE, MI US | 0.02 | 0.2 | 0.0 | USC00200228 | 35 | 29 | 29 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 85 | 2020-03-26 | ANN ARBOR SE, MI US | 0.00 | 0.0 | 0.0 | USC00200228 | 57 | 27 | 44 |
| 86 | 2020-03-27 | ANN ARBOR SE, MI US | 0.27 | 0.0 | 0.0 | USC00200228 | 58 | 38 | 38 |
| 87 | 2020-03-28 | ANN ARBOR SE, MI US | 1.54 | 0.0 | 0.0 | USC00200228 | 51 | 36 | 42 |
| 88 | 2020-03-29 | ANN ARBOR SE, MI US | 0.64 | 0.0 | 0.0 | USC00200228 | 50 | 42 | 50 |
| 89 | 2020-03-30 | ANN ARBOR SE, MI US | 0.03 | 0.0 | 0.0 | USC00200228 | 62 | 40 | 40 |
90 rows × 9 columns
Here I’m sorting by axis 1, or the columns. Axis 0 corresponds to the rows and if I sort those by index, the data frame would look identical to what it would be if I simply printed it, since the index values are just a counter (0, 1, 2, 3, …).
Alternatively, I can sort by value:
[7]:
data.sort_values(by="TMAX")
[7]:
| STATION | NAME | DATE | PRCP | SNOW | SNWD | TMAX | TMIN | TOBS | |
|---|---|---|---|---|---|---|---|---|---|
| 19 | USC00200228 | ANN ARBOR SE, MI US | 2020-01-20 | 0.02 | 0.8 | 5.0 | 21 | 8 | 11 |
| 45 | USC00200228 | ANN ARBOR SE, MI US | 2020-02-15 | 0.00 | 0.0 | 3.0 | 22 | 1 | 7 |
| 51 | USC00200228 | ANN ARBOR SE, MI US | 2020-02-21 | 0.00 | 0.0 | 0.0 | 25 | 11 | 12 |
| 8 | USC00200228 | ANN ARBOR SE, MI US | 2020-01-09 | 0.00 | 0.1 | 0.0 | 25 | 12 | 19 |
| 20 | USC00200228 | ANN ARBOR SE, MI US | 2020-01-21 | 0.00 | 0.0 | 4.0 | 26 | 11 | 19 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69 | USC00200228 | ANN ARBOR SE, MI US | 2020-03-10 | 0.22 | 0.0 | 0.0 | 61 | 41 | 51 |
| 68 | USC00200228 | ANN ARBOR SE, MI US | 2020-03-09 | 0.00 | 0.0 | 0.0 | 61 | 33 | 41 |
| 79 | USC00200228 | ANN ARBOR SE, MI US | 2020-03-20 | 0.36 | 0.0 | 0.0 | 62 | 39 | 61 |
| 80 | USC00200228 | ANN ARBOR SE, MI US | 2020-03-21 | 0.00 | 0.0 | 0.0 | 62 | 25 | 25 |
| 89 | USC00200228 | ANN ARBOR SE, MI US | 2020-03-30 | 0.03 | 0.0 | 0.0 | 62 | 40 | 40 |
90 rows × 9 columns
In this case, the by keyword is required. Which column is used to perform the sort. Also, the default is ascending sort order, but we can do descending instead:
[8]:
data.sort_values(by="SNOW",ascending=False)
[8]:
| STATION | NAME | DATE | PRCP | SNOW | SNWD | TMAX | TMIN | TOBS | |
|---|---|---|---|---|---|---|---|---|---|
| 17 | USC00200228 | ANN ARBOR SE, MI US | 2020-01-18 | 0.71 | 5.7 | 5.0 | 30 | 15 | 30 |
| 57 | USC00200228 | ANN ARBOR SE, MI US | 2020-02-27 | 0.42 | 4.6 | 6.0 | 31 | 18 | 21 |
| 56 | USC00200228 | ANN ARBOR SE, MI US | 2020-02-26 | 0.21 | 2.3 | 2.0 | 40 | 29 | 29 |
| 82 | USC00200228 | ANN ARBOR SE, MI US | 2020-03-23 | 0.20 | 2.1 | 2.0 | 40 | 18 | 32 |
| 36 | USC00200228 | ANN ARBOR SE, MI US | 2020-02-06 | 0.15 | 1.6 | 2.0 | 31 | 21 | 23 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49 | USC00200228 | ANN ARBOR SE, MI US | 2020-02-19 | 0.00 | 0.0 | 1.0 | 40 | 20 | 20 |
| 50 | USC00200228 | ANN ARBOR SE, MI US | 2020-02-20 | 0.00 | 0.0 | 0.0 | 31 | 14 | 14 |
| 51 | USC00200228 | ANN ARBOR SE, MI US | 2020-02-21 | 0.00 | 0.0 | 0.0 | 25 | 11 | 12 |
| 52 | USC00200228 | ANN ARBOR SE, MI US | 2020-02-22 | 0.00 | 0.0 | 0.0 | 38 | 12 | 22 |
| 89 | USC00200228 | ANN ARBOR SE, MI US | 2020-03-30 | 0.03 | 0.0 | 0.0 | 62 | 40 | 40 |
90 rows × 9 columns
If you couldn’t already tell, the result of this operation is still a DataFrame:
[9]:
type(data.sort_values(by="SNOW",ascending=False))
[9]:
pandas.core.frame.DataFrame
For any DataFrame, you can select a subset of the data using it’s index. If you want a single index, the result is a Series.
[10]:
maxtemp = data['TMAX']
print(maxtemp)
print("Type: ",type(maxtemp))
0 30
1 37
2 47
3 43
4 35
..
85 57
86 58
87 51
88 50
89 62
Name: TMAX, Length: 90, dtype: int64
Type: <class 'pandas.core.series.Series'>
This means that I can combine the sort above to pull just the information that I want. For example,
[11]:
maxtemp_sorted = data.sort_values(by="TMAX",ascending=False)["TMAX"]
print(maxtemp_sorted)
89 62
80 62
79 62
69 61
68 61
..
20 26
51 25
8 25
45 22
19 21
Name: TMAX, Length: 90, dtype: int64
Remember, the numbers in the left column here are the index values of the series. The temperatures themselves are in the right column. Let’s say I also wanted the dates that correspond to those temperatures, still sorted by max temperature. Then I just have to change the column index to date:
[12]:
sortedDates = data.sort_values(by="TMAX",ascending=False)["DATE"]
print(sortedDates)
89 2020-03-30
80 2020-03-21
79 2020-03-20
69 2020-03-10
68 2020-03-09
...
20 2020-01-21
51 2020-02-21
8 2020-01-09
45 2020-02-15
19 2020-01-20
Name: DATE, Length: 90, dtype: object
And if I wanted to keep both of these sets of information together in a smaller data frame, I can do that too using the loc attribute, which locates the data in just the indices that I specify.
[13]:
sortedTemp = data.sort_values(by="TMAX",ascending=False).loc[:,["TMAX","DATE"]]
print(sortedTemp)
TMAX DATE
89 62 2020-03-30
80 62 2020-03-21
79 62 2020-03-20
69 61 2020-03-10
68 61 2020-03-09
.. ... ...
20 26 2020-01-21
51 25 2020-02-21
8 25 2020-01-09
45 22 2020-02-15
19 21 2020-01-20
[90 rows x 2 columns]
The loc attribute works by selecting a subset of our initial data frame, by value. I need to specify 2 values when using loc (it might look like 3 values, but the second value is itself a list and considered to be a single object), since my data frame has 2 axes. In this example, I pulled all rows using the “:” and then only 2 of the columns.
Note that in that last operation, I combined the effects of multiple methods of the data object! This is totally fine and actually to be encouraged. This statement can be read like this: “take the data DataFrame, sort it by the values in the TMAX column, and only return the resulting TMAX and DATE columns”.
Normal python slicing also works on a DataFrame.
[14]:
sortedTemp[0:1]
[14]:
| TMAX | DATE | |
|---|---|---|
| 89 | 62 | 2020-03-30 |
Gives me the just the maximum temperature in our record and the date of that observation.
Try to understand the difference between slicing and using loc. Again, loc gets the data by “value” where as slicing gets the data by position. In this last example, I wanted the very first entry in my sorted data frame, so I used slicing. If I were to use loc, I would need to know the index value of that entry, which I now see is 89:
[15]:
print(sortedTemp.loc[89])
TMAX 62
DATE 2020-03-30
Name: 89, dtype: object
Since I want the “top row” of my sorted data frame, it makes more sense to use slicing to get that information, since I probably wouldn’t know that the max temp observation was the 89th entry in my dataset. All this is to say both slicing and using loc to extract values from a data frame are useful. It just depends on what operation I’m doing.
Of course, I could have done this entire operation in a single line using both of these tools. In other words, start with the original data frame, sort it by the max temperature data in decending order, only get the TMAX and DATE columns and only return the first row from the result:
[16]:
print(data.sort_values(by="TMAX",ascending=False).loc[:,["TMAX","DATE"]][0:1])
TMAX DATE
89 62 2020-03-30
And no, I don’t know why you have to slice using “[0:1]” instead of just “[0]” if we only want the 1st row. I imagine it has to do with there being a row header in our data frame, but I’m too lazy to look it up right now. Although we can accomplish the same thing (selecting by position) using the iloc attribute instead of slicing:
[17]:
print(data.sort_values(by="TMAX",ascending=False).loc[:,["TMAX","DATE"]].iloc[0])
TMAX 62
DATE 2020-03-30
Name: 89, dtype: object
And in that case, I do only use a single integer if I want a single row. Hopefully I’ve exhausted this topic at this point.
There is a lot to be done with pandas data frames beyond what we’ve covered so far. We will continue down this path in the next lesson so that we can see how to apply logic to our data frames, to mathematic operations, and a few other things. Again, I emphasize the efficiency and ease of use of pandas here. We aren’t going to cover anything in our discussions on pandas that couldn’t be otherwise accomplished using a Numpy array. However, the Pandas Series and DataFrame datatype make doing the same tasks significantly easier, as long as you understand what is going on with the more complex data type.