Skip to main content

Pandas

Quick reference to the Python Pandas library.

$ pip install pandas

Import the package into your .py file.

import pandas as pd

Creating and writing data

Using dict

df = pd.DataFrame({'Col_1': [34, 11], 'Col_2': [46, 79]})
pd.DataFrame({'Col_1': ['some_data_1-1', 12],
'Col_2': ['some_data_1-2', 22]},
index=['Row 1', 'Row 2'])
pd.Series([1, 2, 3, 4, 5])
pd.Series([11, 12, 13], index=['Row 1', 'Row 2', 'Row 3'], name='Col 1')

Reading CSV

df_data = pd.read_csv("/data-source/data.csv")
df_data.shape

>>> (no_rows, no_cols)
df_data.head()

df_data.head(10)
# Read the CSV and use the first column for the row index
df_data = pd.read_csv("/data-source/data.csv", index_col=0)

Writing CSV

df_data.to_csv('to_data.csv')

Indexing, selecting and assigning

df_data = pd.DataFrame({'Col1': ['some_data_1_1', 12],
'Col2': ['some_data_1_2', 22]},
index=['Row1', 'Row2'])

Native accessors

Accessing data from a column or a row of a column. A DataFrame can be considered to be a merge of Series.

df_data.Col1

df_data['Col1']

df_data['Col1'][0]

>>> some_data_1_1

Indexing

Both loc and iloc are row-first, column-second.

Using iloc, index based selection:

# Access the data of the first row
df_data.iloc[0]

# Access the data of all rows and first column
df_data.iloc[:, 0]

df_data.iloc[1:3, 0]

# indexing here is done based on the index column value
df_data.iloc[[1,2,3], 0]

df_data.iloc[-5, 0]

df_data.iloc[-5] # last 5 rows of all columns

Using loc, label based selection:

# Access the data of the first row
df_data.loc[0, 'Col1']

>>> some_data_1_1

df_data.loc[:, ['Col1', 'Col2']]

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.

Manipulating indexes

df_data.set_index('col1')

Selecting

reviews.country == 'Italy'

>>>
0 True
1 False
...
129969 False
129970 False
Name: country, Length: 129971, dtype: bool
reviews.loc[reviews.country == 'Italy']

>>> # Return dataframe with all rows that are true
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]

>>> # Return dataframe with all rows that are true for the compund of conditions

isin

# isin select from a list
reviews.loc[reviews.country.isin(['Italy', 'France'])]

isnull and notnull

Methods lets you highlight values which are (or are not) empty (NaN). For example, to filter out wines lacking a price tag in the dataset:

# filters with the conditional
reviews.loc[reviews.price.notnull()]

Assigning

# change all values in the column
reviews['critic'] = 'everyone'

# or with an iterable
reviews['index_backwards'] = range(len(reviews), 0, -1)

>>>
0 129971
1 129970
...
129969 2
129970 1
Name: index_backwards, Length: 129971, dtype: int64

Summary functions & maps

Summaries

reviews.points.describe()

>>>
count 129971.000000
mean 88.447138
...
75% 91.000000
max 100.000000
Name: points, Length: 8, dtype: float64

reviews.taster_name.describe()

>>>
count 103727
unique 19
top Roger Voss
freq 25514
Name: taster_name, dtype: object

reviews.points.mean()

reviews.taster_name.unique()

reviews.taster_name.value_counts()


Map functions

map()

review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)

The function you pass to map() should expect a single value from the Series (a point value, in the above example), and return a transformed version of that value. map() returns a new Series where all the values have been transformed by your function.

Apply functions

apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.

def remean_points(row):
row.points = row.points - review_points_mean
return row

reviews.apply(remean_points, axis='columns')

If we had called reviews.apply() with axis='index', then instead of passing a function to transform each row, we would need to give a function to transform each column.

Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on.

Built-in mapping operations

review_points_mean = reviews.points.mean()
reviews.points - review_points_mean

>>>
0 -1.447138
1 -1.447138
...
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64


reviews.country + " - " + reviews.region_1

>>>
0 Italy - Etna
1 NaN
...
129969 France - Alsace
129970 France - Alsace
Length: 129971, dtype: object

These operators are faster than map() or apply() because they use speed ups built into pandas. All of the standard Python operators (><==, and so on) work in this manner. They are not as flexible as map() or apply(), for complex operations and conditions.

Grouping and sorting

groupby

reviews.groupby('points').points.count()
# same can be achieved by .value_counts()

reviews.groupby('points').price.min()

>>>
points
80 5.0
81 5.0
...
99 44.0
100 80.0
Name: price, Length: 21, dtype: float64

Each group generated can be considered as being a slice of our DataFrame containing only data with values that match. This DataFrame is accessible to directly using the apply() method.

reviews.groupby('winery').apply(lambda df: df.title.iloc[0])

>>>
winery
1+1=3 1+1=3 NV Rosé Sparkling (Cava)
10 Knots 10 Knots 2010 Viognier (Paso Robles)
...
àMaurice àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object
reviews.groupby(['country', 'province']).apply(lambda df:df.loc[df.points.idxmax()])

reviews.groupby(['country']).price.agg([len, min, max])

Multi index


countries_reviewed = reviews.groupby(['country','province']).description.agg([len])

mi = countries_reviewed.index


# the multi-index method you will use most often is the one for converting back to a regular index, the `reset_index()`
countries_reviewed.reset_index()

Multi-indices have several methods for dealing with their tiered structure which are absent for single-level indices. They also require two levels of labels to retrieve a value.

Sorting

To get data in the order we want it in, we can sort it ourselves.

countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')

# descending
countries_reviewed.sort_values(by='len', ascending=False)

# by index
countries_reviewed.sort_index()

# by 2 values
countries_reviewed.sort_values(by=['country', 'len'])

Dtype

# Returns the data type.
reviews.price.dtype

# Convert a int64 to float64
reviews.points.astype('float64')

Missing data

Missing values are given NaN.

# selecting a null from a column.
reviews[pd.isnull(reviews.country)]

Filling missing values

reviews.region_2.fillna("Unknown")
# Fills all the NaNs with 'Unknown'

Backfill strategy

Fill each missing value with the first non-null value that appears sometime after the given record in the database.

Replace

reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")

replace() method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like "Unknown""Undisclosed""Invalid", and so on.

Renaming and combining

reviews.rename(columns={'points': 'score'})

# Renaming the index
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})

# renaming the axis
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')

Using a dict is most convenient.

Combining

Pandas has three core methods for doing this. In order of increasing complexity, these are concat()join(), and merge(). Most of what merge() can do can also be done more simply with join().

concat when the columns are the same.

canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])

join() lets you combine different DataFrame objects which have an index in common.

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

References

Sample DataFrame: reviews

countrydescriptiondesignationpointspriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
0ItalyAromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive offering unripened apple, citrus and dried sage alongside brisk acidity.Vulkà Bianco87NaNSicily & SardiniaEtnaNaNKerin O’Keefe@kerinokeefeNicosia 2013 Vulkà Bianco (Etna)White Blend
1PortugalThis is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016.Avidagos8715.0DouroNaNNaNRoger Voss@vossrogerQuinta dos Avidagos 2011 Avidagos Red (Douro)Portuguese Red
2USTart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.NaN8714.0OregonWillamette ValleyWillamette ValleyPaul Gregutt@paulgwineRainstorm 2013 Pinot Gris (Willamette Valley)Pinot Gris
3USPineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of honey-drizzled guava and mango giving way to a slightly astringent, semidry finish.Reserve Late Harvest8713.0MichiganLake Michigan ShoreNaNAlexander PeartreeNaNSt. Julian 2013 Reserve Late Harvest Riesling (Lake Michigan Shore)Riesling
4USMuch like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics. Nonetheless, if you think of it as a pleasantly unfussy country wine, it's a good companion to a hearty winter stew.Vintner's Reserve Wild Child Block8765.0OregonWillamette ValleyWillamette ValleyPaul Gregutt@paulgwineSweet Cheeks 2012 Vintner's Reserve Wild Child Block Pinot Noir (Willamette Valley)Pinot Noir

ak © mmxxii