Processing Data with Data Frames

Now that we have the basics of using data frames out of the way, let’s dig a little deeper into how to use them to do some basic operations and analysis. We’ll use the same data file we finished up with last time: USC00200228.csv which contains rain, snowfall, snow depth, minimum, maximum, and observation temperature data each day for January - March 2020 in Ann Arbor, MI.

So, let’s read that data again and populate our data frame.

[1]:
import pandas as pd
import numpy as np

file = "USC00200228.csv"
data = pd.read_csv(file)
print(data.head(9))
       STATION                 NAME        DATE  PRCP  SNOW  SNWD  TMAX  TMIN  \
0  USC00200228  ANN ARBOR SE, MI US  2020-01-01  0.05   0.8   2.0    30    24
1  USC00200228  ANN ARBOR SE, MI US  2020-01-02  0.00   0.0   1.0    37    23
2  USC00200228  ANN ARBOR SE, MI US  2020-01-03  0.00   0.0   0.0    47    30
3  USC00200228  ANN ARBOR SE, MI US  2020-01-04  0.00   0.0   0.0    43    33
4  USC00200228  ANN ARBOR SE, MI US  2020-01-05  0.02   0.2   0.0    35    29
5  USC00200228  ANN ARBOR SE, MI US  2020-01-06  0.04   0.6   0.0    39    29
6  USC00200228  ANN ARBOR SE, MI US  2020-01-07  0.00   0.0   0.0    45    25
7  USC00200228  ANN ARBOR SE, MI US  2020-01-08  0.02   0.4   0.0    43    23
8  USC00200228  ANN ARBOR SE, MI US  2020-01-09  0.00   0.1   0.0    25    12

   TOBS
0    24
1    30
2    40
3    33
4    29
5    31
6    25
7    23
8    19

Note that by passing the 9 as an argument tohead(), the first 9 columns were returned.

Basic Stats

Pandas gives us an all quick tool to get some basic information about the dataset, describe():

[2]:
data.describe()
[2]:
PRCP SNOW SNWD TMAX TMIN TOBS
count 90.000000 90.000000 90.000000 90.000000 90.000000 90.000000
mean 0.114333 0.358889 1.188889 40.877778 23.633333 28.633333
std 0.279084 0.893534 1.606828 9.840101 8.423416 9.985889
min 0.000000 0.000000 0.000000 21.000000 1.000000 1.000000
25% 0.000000 0.000000 0.000000 34.000000 18.250000 22.000000
50% 0.000000 0.000000 0.000000 38.500000 23.000000 29.000000
75% 0.085000 0.200000 2.000000 48.000000 29.000000 33.000000
max 1.540000 5.700000 6.000000 62.000000 42.000000 61.000000

which returns some basic statistical information for all columns that contain numeric data. You can control the output of describe() to only gather a subset of this information or change the percentile values to something else. See the documentation for more details.

While describe() is uesful to get a quick look at your data, typically, we may want to store the results of one of these statistical operations in a variable for use later on. In that case, it is simple to use the proper pandas statistical tool:

[3]:
datamax = data.max()
print(datamax)
STATION            USC00200228
NAME       ANN ARBOR SE, MI US
DATE                2020-03-30
PRCP                      1.54
SNOW                       5.7
SNWD                         6
TMAX                        62
TMIN                        42
TOBS                        61
dtype: object

The result of this type of operation (max(), min(), mean(), std(), count() all behave in a similar same way) is a pandas Series:

[4]:
type(datamax)
[4]:
pandas.core.series.Series

which means that we can quickly extract the maximum value for snowfall and save it to a variable:

[5]:
maxsnow = datamax['SNOW']
print(maxsnow)
5.7

Again, a similar procedure would be used to get specific values of the other basic statistical operations.

Sorting and querying

Another tool that is often needed when working with a dataset is the ability to sort the data. There are two ways do this, we can sort by the index (the row or column names):

[6]:
data.sort_index(axis=1)
[6]:
DATE NAME PRCP SNOW SNWD STATION TMAX TMIN TOBS
0 2020-01-01 ANN ARBOR SE, MI US 0.05 0.8 2.0 USC00200228 30 24 24
1 2020-01-02 ANN ARBOR SE, MI US 0.00 0.0 1.0 USC00200228 37 23 30
2 2020-01-03 ANN ARBOR SE, MI US 0.00 0.0 0.0 USC00200228 47 30 40
3 2020-01-04 ANN ARBOR SE, MI US 0.00 0.0 0.0 USC00200228 43 33 33
4 2020-01-05 ANN ARBOR SE, MI US 0.02 0.2 0.0 USC00200228 35 29 29
... ... ... ... ... ... ... ... ... ...
85 2020-03-26 ANN ARBOR SE, MI US 0.00 0.0 0.0 USC00200228 57 27 44
86 2020-03-27 ANN ARBOR SE, MI US 0.27 0.0 0.0 USC00200228 58 38 38
87 2020-03-28 ANN ARBOR SE, MI US 1.54 0.0 0.0 USC00200228 51 36 42
88 2020-03-29 ANN ARBOR SE, MI US 0.64 0.0 0.0 USC00200228 50 42 50
89 2020-03-30 ANN ARBOR SE, MI US 0.03 0.0 0.0 USC00200228 62 40 40

90 rows × 9 columns

Here I’m sorting by axis 1, or the columns. Axis 0 corresponds to the rows and if I sort those by index, the data frame would look identical to what it would be if I simply printed it, since the index values are just a counter (0, 1, 2, 3, …).

Alternatively, I can sort by value:

[7]:
data.sort_values(by="TMAX")
[7]:
STATION NAME DATE PRCP SNOW SNWD TMAX TMIN TOBS
19 USC00200228 ANN ARBOR SE, MI US 2020-01-20 0.02 0.8 5.0 21 8 11
45 USC00200228 ANN ARBOR SE, MI US 2020-02-15 0.00 0.0 3.0 22 1 7
51 USC00200228 ANN ARBOR SE, MI US 2020-02-21 0.00 0.0 0.0 25 11 12
8 USC00200228 ANN ARBOR SE, MI US 2020-01-09 0.00 0.1 0.0 25 12 19
20 USC00200228 ANN ARBOR SE, MI US 2020-01-21 0.00 0.0 4.0 26 11 19
... ... ... ... ... ... ... ... ... ...
69 USC00200228 ANN ARBOR SE, MI US 2020-03-10 0.22 0.0 0.0 61 41 51
68 USC00200228 ANN ARBOR SE, MI US 2020-03-09 0.00 0.0 0.0 61 33 41
79 USC00200228 ANN ARBOR SE, MI US 2020-03-20 0.36 0.0 0.0 62 39 61
80 USC00200228 ANN ARBOR SE, MI US 2020-03-21 0.00 0.0 0.0 62 25 25
89 USC00200228 ANN ARBOR SE, MI US 2020-03-30 0.03 0.0 0.0 62 40 40

90 rows × 9 columns

In this case, the by keyword is required. Which column is used to perform the sort. Also, the default is ascending sort order, but we can do descending instead:

[8]:
data.sort_values(by="SNOW",ascending=False)
[8]:
STATION NAME DATE PRCP SNOW SNWD TMAX TMIN TOBS
17 USC00200228 ANN ARBOR SE, MI US 2020-01-18 0.71 5.7 5.0 30 15 30
57 USC00200228 ANN ARBOR SE, MI US 2020-02-27 0.42 4.6 6.0 31 18 21
56 USC00200228 ANN ARBOR SE, MI US 2020-02-26 0.21 2.3 2.0 40 29 29
82 USC00200228 ANN ARBOR SE, MI US 2020-03-23 0.20 2.1 2.0 40 18 32
36 USC00200228 ANN ARBOR SE, MI US 2020-02-06 0.15 1.6 2.0 31 21 23
... ... ... ... ... ... ... ... ... ...
49 USC00200228 ANN ARBOR SE, MI US 2020-02-19 0.00 0.0 1.0 40 20 20
50 USC00200228 ANN ARBOR SE, MI US 2020-02-20 0.00 0.0 0.0 31 14 14
51 USC00200228 ANN ARBOR SE, MI US 2020-02-21 0.00 0.0 0.0 25 11 12
52 USC00200228 ANN ARBOR SE, MI US 2020-02-22 0.00 0.0 0.0 38 12 22
89 USC00200228 ANN ARBOR SE, MI US 2020-03-30 0.03 0.0 0.0 62 40 40

90 rows × 9 columns

If you couldn’t already tell, the result of this operation is still a DataFrame:

[9]:
type(data.sort_values(by="SNOW",ascending=False))
[9]:
pandas.core.frame.DataFrame

For any DataFrame, you can select a subset of the data using it’s index. If you want a single index, the result is a Series.

[10]:
maxtemp = data['TMAX']
print(maxtemp)
print("Type: ",type(maxtemp))
0     30
1     37
2     47
3     43
4     35
      ..
85    57
86    58
87    51
88    50
89    62
Name: TMAX, Length: 90, dtype: int64
Type:  <class 'pandas.core.series.Series'>

This means that I can combine the sort above to pull just the information that I want. For example,

[11]:
maxtemp_sorted = data.sort_values(by="TMAX",ascending=False)["TMAX"]
print(maxtemp_sorted)
89    62
80    62
79    62
69    61
68    61
      ..
20    26
51    25
8     25
45    22
19    21
Name: TMAX, Length: 90, dtype: int64

Remember, the numbers in the left column here are the index values of the series. The temperatures themselves are in the right column. Let’s say I also wanted the dates that correspond to those temperatures, still sorted by max temperature. Then I just have to change the column index to date:

[12]:
sortedDates = data.sort_values(by="TMAX",ascending=False)["DATE"]
print(sortedDates)
89    2020-03-30
80    2020-03-21
79    2020-03-20
69    2020-03-10
68    2020-03-09
         ...
20    2020-01-21
51    2020-02-21
8     2020-01-09
45    2020-02-15
19    2020-01-20
Name: DATE, Length: 90, dtype: object

And if I wanted to keep both of these sets of information together in a smaller data frame, I can do that too using the loc attribute, which locates the data in just the indices that I specify.

[13]:
sortedTemp = data.sort_values(by="TMAX",ascending=False).loc[:,["TMAX","DATE"]]
print(sortedTemp)
    TMAX        DATE
89    62  2020-03-30
80    62  2020-03-21
79    62  2020-03-20
69    61  2020-03-10
68    61  2020-03-09
..   ...         ...
20    26  2020-01-21
51    25  2020-02-21
8     25  2020-01-09
45    22  2020-02-15
19    21  2020-01-20

[90 rows x 2 columns]

The loc attribute works by selecting a subset of our initial data frame, by value. I need to specify 2 values when using loc (it might look like 3 values, but the second value is itself a list and considered to be a single object), since my data frame has 2 axes. In this example, I pulled all rows using the “:” and then only 2 of the columns.

Note that in that last operation, I combined the effects of multiple methods of the data object! This is totally fine and actually to be encouraged. This statement can be read like this: “take the data DataFrame, sort it by the values in the TMAX column, and only return the resulting TMAX and DATE columns”.

Normal python slicing also works on a DataFrame.

[14]:
sortedTemp[0:1]
[14]:
TMAX DATE
89 62 2020-03-30

Gives me the just the maximum temperature in our record and the date of that observation.

Try to understand the difference between slicing and using loc. Again, loc gets the data by “value” where as slicing gets the data by position. In this last example, I wanted the very first entry in my sorted data frame, so I used slicing. If I were to use loc, I would need to know the index value of that entry, which I now see is 89:

[15]:
print(sortedTemp.loc[89])
TMAX            62
DATE    2020-03-30
Name: 89, dtype: object

Since I want the “top row” of my sorted data frame, it makes more sense to use slicing to get that information, since I probably wouldn’t know that the max temp observation was the 89th entry in my dataset. All this is to say both slicing and using loc to extract values from a data frame are useful. It just depends on what operation I’m doing.

Of course, I could have done this entire operation in a single line using both of these tools. In other words, start with the original data frame, sort it by the max temperature data in decending order, only get the TMAX and DATE columns and only return the first row from the result:

[16]:
print(data.sort_values(by="TMAX",ascending=False).loc[:,["TMAX","DATE"]][0:1])
    TMAX        DATE
89    62  2020-03-30

And no, I don’t know why you have to slice using “[0:1]” instead of just “[0]” if we only want the 1st row. I imagine it has to do with there being a row header in our data frame, but I’m too lazy to look it up right now. Although we can accomplish the same thing (selecting by position) using the iloc attribute instead of slicing:

[17]:
print(data.sort_values(by="TMAX",ascending=False).loc[:,["TMAX","DATE"]].iloc[0])
TMAX            62
DATE    2020-03-30
Name: 89, dtype: object

And in that case, I do only use a single integer if I want a single row. Hopefully I’ve exhausted this topic at this point.

There is a lot to be done with pandas data frames beyond what we’ve covered so far. We will continue down this path in the next lesson so that we can see how to apply logic to our data frames, to mathematic operations, and a few other things. Again, I emphasize the efficiency and ease of use of pandas here. We aren’t going to cover anything in our discussions on pandas that couldn’t be otherwise accomplished using a Numpy array. However, the Pandas Series and DataFrame datatype make doing the same tasks significantly easier, as long as you understand what is going on with the more complex data type.