Processing Data with Data Frames

Now that we have the basics of using data frames out of the way, let’s dig a little deeper into how to use them to do some basic operations and analysis. We’ll use the same data file we finished up with last time: USC00200228.csv which contains rain, snowfall, snow depth, minimum, maximum, and observation temperature data each day for January - March 2020 in Ann Arbor, MI.

So, let’s read that data again and populate our data frame.

[1]:

import pandas as pd
import numpy as np

file = "USC00200228.csv"
data = pd.read_csv(file)
print(data.head(9))

       STATION                 NAME        DATE  PRCP  SNOW  SNWD  TMAX  TMIN  \
0  USC00200228  ANN ARBOR SE, MI US  2020-01-01  0.05   0.8   2.0    30    24
1  USC00200228  ANN ARBOR SE, MI US  2020-01-02  0.00   0.0   1.0    37    23
2  USC00200228  ANN ARBOR SE, MI US  2020-01-03  0.00   0.0   0.0    47    30
3  USC00200228  ANN ARBOR SE, MI US  2020-01-04  0.00   0.0   0.0    43    33
4  USC00200228  ANN ARBOR SE, MI US  2020-01-05  0.02   0.2   0.0    35    29
5  USC00200228  ANN ARBOR SE, MI US  2020-01-06  0.04   0.6   0.0    39    29
6  USC00200228  ANN ARBOR SE, MI US  2020-01-07  0.00   0.0   0.0    45    25
7  USC00200228  ANN ARBOR SE, MI US  2020-01-08  0.02   0.4   0.0    43    23
8  USC00200228  ANN ARBOR SE, MI US  2020-01-09  0.00   0.1   0.0    25    12

   TOBS
0    24
1    30
2    40
3    33
4    29
5    31
6    25
7    23
8    19

Note that by passing the 9 as an argument tohead(), the first 9 columns were returned.

Basic Stats

Pandas gives us an all quick tool to get some basic information about the dataset, describe():

[2]:

data.describe()

[2]:

	PRCP	SNOW	SNWD	TMAX	TMIN	TOBS
count	90.000000	90.000000	90.000000	90.000000	90.000000	90.000000
mean	0.114333	0.358889	1.188889	40.877778	23.633333	28.633333
std	0.279084	0.893534	1.606828	9.840101	8.423416	9.985889
min	0.000000	0.000000	0.000000	21.000000	1.000000	1.000000
25%	0.000000	0.000000	0.000000	34.000000	18.250000	22.000000
50%	0.000000	0.000000	0.000000	38.500000	23.000000	29.000000
75%	0.085000	0.200000	2.000000	48.000000	29.000000	33.000000
max	1.540000	5.700000	6.000000	62.000000	42.000000	61.000000

which returns some basic statistical information for all columns that contain numeric data. You can control the output of describe() to only gather a subset of this information or change the percentile values to something else. See the documentation for more details.

While describe() is uesful to get a quick look at your data, typically, we may want to store the results of one of these statistical operations in a variable for use later on. In that case, it is simple to use the proper pandas statistical tool:

[3]:

datamax = data.max()
print(datamax)

STATION            USC00200228
NAME       ANN ARBOR SE, MI US
DATE                2020-03-30
PRCP                      1.54
SNOW                       5.7
SNWD                         6
TMAX                        62
TMIN                        42
TOBS                        61
dtype: object

The result of this type of operation (max(), min(), mean(), std(), count() all behave in a similar same way) is a pandas Series:

[4]:

type(datamax)

[4]:

pandas.core.series.Series

which means that we can quickly extract the maximum value for snowfall and save it to a variable:

[5]:

maxsnow = datamax['SNOW']
print(maxsnow)

5.7

Again, a similar procedure would be used to get specific values of the other basic statistical operations.

Sorting and querying

Another tool that is often needed when working with a dataset is the ability to sort the data. There are two ways do this, we can sort by the index (the row or column names):

[6]:

data.sort_index(axis=1)

[6]:

	DATE	NAME	PRCP	SNOW	SNWD	STATION	TMAX	TMIN	TOBS
0	2020-01-01	ANN ARBOR SE, MI US	0.05	0.8	2.0	USC00200228	30	24	24
1	2020-01-02	ANN ARBOR SE, MI US	0.00	0.0	1.0	USC00200228	37	23	30
2	2020-01-03	ANN ARBOR SE, MI US	0.00	0.0	0.0	USC00200228	47	30	40
3	2020-01-04	ANN ARBOR SE, MI US	0.00	0.0	0.0	USC00200228	43	33	33
4	2020-01-05	ANN ARBOR SE, MI US	0.02	0.2	0.0	USC00200228	35	29	29
...	...	...	...	...	...	...	...	...	...
85	2020-03-26	ANN ARBOR SE, MI US	0.00	0.0	0.0	USC00200228	57	27	44
86	2020-03-27	ANN ARBOR SE, MI US	0.27	0.0	0.0	USC00200228	58	38	38
87	2020-03-28	ANN ARBOR SE, MI US	1.54	0.0	0.0	USC00200228	51	36	42
88	2020-03-29	ANN ARBOR SE, MI US	0.64	0.0	0.0	USC00200228	50	42	50
89	2020-03-30	ANN ARBOR SE, MI US	0.03	0.0	0.0	USC00200228	62	40	40

90 rows × 9 columns

Here I’m sorting by axis 1, or the columns. Axis 0 corresponds to the rows and if I sort those by index, the data frame would look identical to what it would be if I simply printed it, since the index values are just a counter (0, 1, 2, 3, …).

Alternatively, I can sort by value:

[7]:

data.sort_values(by="TMAX")

[7]:

	STATION	NAME	DATE	PRCP	SNOW	SNWD	TMAX	TMIN	TOBS
19	USC00200228	ANN ARBOR SE, MI US	2020-01-20	0.02	0.8	5.0	21	8	11
45	USC00200228	ANN ARBOR SE, MI US	2020-02-15	0.00	0.0	3.0	22	1	7
51	USC00200228	ANN ARBOR SE, MI US	2020-02-21	0.00	0.0	0.0	25	11	12
8	USC00200228	ANN ARBOR SE, MI US	2020-01-09	0.00	0.1	0.0	25	12	19
20	USC00200228	ANN ARBOR SE, MI US	2020-01-21	0.00	0.0	4.0	26	11	19
...	...	...	...	...	...	...	...	...	...
69	USC00200228	ANN ARBOR SE, MI US	2020-03-10	0.22	0.0	0.0	61	41	51
68	USC00200228	ANN ARBOR SE, MI US	2020-03-09	0.00	0.0	0.0	61	33	41
79	USC00200228	ANN ARBOR SE, MI US	2020-03-20	0.36	0.0	0.0	62	39	61
80	USC00200228	ANN ARBOR SE, MI US	2020-03-21	0.00	0.0	0.0	62	25	25
89	USC00200228	ANN ARBOR SE, MI US	2020-03-30	0.03	0.0	0.0	62	40	40

90 rows × 9 columns

In this case, the by keyword is required. Which column is used to perform the sort. Also, the default is ascending sort order, but we can do descending instead:

[8]:

data.sort_values(by="SNOW",ascending=False)

[8]:

	STATION	NAME	DATE	PRCP	SNOW	SNWD	TMAX	TMIN	TOBS
17	USC00200228	ANN ARBOR SE, MI US	2020-01-18	0.71	5.7	5.0	30	15	30
57	USC00200228	ANN ARBOR SE, MI US	2020-02-27	0.42	4.6	6.0	31	18	21
56	USC00200228	ANN ARBOR SE, MI US	2020-02-26	0.21	2.3	2.0	40	29	29
82	USC00200228	ANN ARBOR SE, MI US	2020-03-23	0.20	2.1	2.0	40	18	32
36	USC00200228	ANN ARBOR SE, MI US	2020-02-06	0.15	1.6	2.0	31	21	23
...	...	...	...	...	...	...	...	...	...
49	USC00200228	ANN ARBOR SE, MI US	2020-02-19	0.00	0.0	1.0	40	20	20
50	USC00200228	ANN ARBOR SE, MI US	2020-02-20	0.00	0.0	0.0	31	14	14
51	USC00200228	ANN ARBOR SE, MI US	2020-02-21	0.00	0.0	0.0	25	11	12
52	USC00200228	ANN ARBOR SE, MI US	2020-02-22	0.00	0.0	0.0	38	12	22
89	USC00200228	ANN ARBOR SE, MI US	2020-03-30	0.03	0.0	0.0	62	40	40

90 rows × 9 columns

If you couldn’t already tell, the result of this operation is still a DataFrame:

[9]:

type(data.sort_values(by="SNOW",ascending=False))

[9]:

pandas.core.frame.DataFrame

For any DataFrame, you can select a subset of the data using it’s index. If you want a single index, the result is a Series.

[10]:

maxtemp = data['TMAX']
print(maxtemp)
print("Type: ",type(maxtemp))

0     30
1     37
2     47
3     43
4     35
      ..
85    57
86    58
87    51
88    50
89    62
Name: TMAX, Length: 90, dtype: int64
Type:  <class 'pandas.core.series.Series'>

This means that I can combine the sort above to pull just the information that I want. For example,

[11]:

maxtemp_sorted = data.sort_values(by="TMAX",ascending=False)["TMAX"]
print(maxtemp_sorted)

89    62
80    62
79    62
69    61
68    61
      ..
20    26
51    25
8     25
45    22
19    21
Name: TMAX, Length: 90, dtype: int64

Remember, the numbers in the left column here are the index values of the series. The temperatures themselves are in the right column. Let’s say I also wanted the dates that correspond to those temperatures, still sorted by max temperature. Then I just have to change the column index to date:

[12]:

sortedDates = data.sort_values(by="TMAX",ascending=False)["DATE"]
print(sortedDates)

89    2020-03-30
80    2020-03-21
79    2020-03-20
69    2020-03-10
68    2020-03-09
         ...
20    2020-01-21
51    2020-02-21
8     2020-01-09
45    2020-02-15
19    2020-01-20
Name: DATE, Length: 90, dtype: object

And if I wanted to keep both of these sets of information together in a smaller data frame, I can do that too using the loc attribute, which locates the data in just the indices that I specify.

[13]:

sortedTemp = data.sort_values(by="TMAX",ascending=False).loc[:,["TMAX","DATE"]]
print(sortedTemp)

    TMAX        DATE
89    62  2020-03-30
80    62  2020-03-21
79    62  2020-03-20
69    61  2020-03-10
68    61  2020-03-09
..   ...         ...
20    26  2020-01-21
51    25  2020-02-21
8     25  2020-01-09
45    22  2020-02-15
19    21  2020-01-20

[90 rows x 2 columns]

The loc attribute works by selecting a subset of our initial data frame, by value. I need to specify 2 values when using loc (it might look like 3 values, but the second value is itself a list and considered to be a single object), since my data frame has 2 axes. In this example, I pulled all rows using the “:” and then only 2 of the columns.

Note that in that last operation, I combined the effects of multiple methods of the data object! This is totally fine and actually to be encouraged. This statement can be read like this: “take the data DataFrame, sort it by the values in the TMAX column, and only return the resulting TMAX and DATE columns”.

Normal python slicing also works on a DataFrame.

[14]:

sortedTemp[0:1]

[14]:

	TMAX	DATE
89	62	2020-03-30

Gives me the just the maximum temperature in our record and the date of that observation.

Try to understand the difference between slicing and using loc. Again, loc gets the data by “value” where as slicing gets the data by position. In this last example, I wanted the very first entry in my sorted data frame, so I used slicing. If I were to use loc, I would need to know the index value of that entry, which I now see is 89:

[15]:

print(sortedTemp.loc[89])

TMAX            62
DATE    2020-03-30
Name: 89, dtype: object

Since I want the “top row” of my sorted data frame, it makes more sense to use slicing to get that information, since I probably wouldn’t know that the max temp observation was the 89th entry in my dataset. All this is to say both slicing and using loc to extract values from a data frame are useful. It just depends on what operation I’m doing.

Of course, I could have done this entire operation in a single line using both of these tools. In other words, start with the original data frame, sort it by the max temperature data in decending order, only get the TMAX and DATE columns and only return the first row from the result:

[16]:

print(data.sort_values(by="TMAX",ascending=False).loc[:,["TMAX","DATE"]][0:1])

    TMAX        DATE
89    62  2020-03-30

And no, I don’t know why you have to slice using “[0:1]” instead of just “[0]” if we only want the 1st row. I imagine it has to do with there being a row header in our data frame, but I’m too lazy to look it up right now. Although we can accomplish the same thing (selecting by position) using the iloc attribute instead of slicing:

[17]:

print(data.sort_values(by="TMAX",ascending=False).loc[:,["TMAX","DATE"]].iloc[0])

TMAX            62
DATE    2020-03-30
Name: 89, dtype: object

And in that case, I do only use a single integer if I want a single row. Hopefully I’ve exhausted this topic at this point.

There is a lot to be done with pandas data frames beyond what we’ve covered so far. We will continue down this path in the next lesson so that we can see how to apply logic to our data frames, to mathematic operations, and a few other things. Again, I emphasize the efficiency and ease of use of pandas here. We aren’t going to cover anything in our discussions on pandas that couldn’t be otherwise accomplished using a Numpy array. However, the Pandas Series and DataFrame datatype make doing the same tasks significantly easier, as long as you understand what is going on with the more complex data type.