Pandas

Quick reference to the Python Pandas library.

$ pip install pandas

Import the package into your .py file.

import pandas as pd

Creating and writing data

Using dict

df = pd.DataFrame({'Col_1': [34, 11], 'Col_2': [46, 79]})

pd.DataFrame({'Col_1': ['some_data_1-1', 12],
              'Col_2': ['some_data_1-2', 22]},
             index=['Row 1', 'Row 2'])

pd.Series([1, 2, 3, 4, 5])

pd.Series([11, 12, 13], index=['Row 1', 'Row 2', 'Row 3'], name='Col 1')

Reading CSV

df_data = pd.read_csv("/data-source/data.csv")

df_data.shape

>>> (no_rows, no_cols)

df_data.head()

df_data.head(10)

# Read the CSV and use the first column for the row index
df_data = pd.read_csv("/data-source/data.csv", index_col=0)

Writing CSV

df_data.to_csv('to_data.csv')

Indexing, selecting and assigning

df_data = pd.DataFrame({'Col1': ['some_data_1_1', 12],
                          'Col2': ['some_data_1_2', 22]},
                         index=['Row1', 'Row2'])

Native accessors

Accessing data from a column or a row of a column. A DataFrame can be considered to be a merge of Series.

df_data.Col1

df_data['Col1']

df_data['Col1'][0]

>>> some_data_1_1

Indexing

Both loc and iloc are row-first, column-second.

Using iloc, index based selection:

# Access the data of the first row
df_data.iloc[0]

# Access the data of all rows and first column
df_data.iloc[:, 0]

df_data.iloc[1:3, 0]

# indexing here is done based on the index column value
df_data.iloc[[1,2,3], 0]

df_data.iloc[-5, 0]

df_data.iloc[-5] # last 5 rows of all columns

Using loc, label based selection:

# Access the data of the first row
df_data.loc[0, 'Col1']

>>> some_data_1_1

df_data.loc[:, ['Col1', 'Col2']]

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.

Manipulating indexes

df_data.set_index('col1')

Selecting

reviews.country == 'Italy'

>>>
0          True
1         False
          ...
129969    False
129970    False
Name: country, Length: 129971, dtype: bool

reviews.loc[reviews.country == 'Italy']

>>> # Return dataframe with all rows that are true

reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]

>>> # Return dataframe with all rows that are true for the compund of conditions

isin

# isin select from a list
reviews.loc[reviews.country.isin(['Italy', 'France'])]

isnull and notnull

Methods lets you highlight values which are (or are not) empty (NaN). For example, to filter out wines lacking a price tag in the dataset:

# filters with the conditional
reviews.loc[reviews.price.notnull()]

Assigning

# change all values in the column
reviews['critic'] = 'everyone'

# or with an iterable
reviews['index_backwards'] = range(len(reviews), 0, -1)

>>>
0         129971
1         129970
           ...
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64

Summary functions & maps

Summaries

reviews.points.describe()

>>>
count    129971.000000
mean         88.447138
             ...
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64

reviews.taster_name.describe()

>>>
count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object

reviews.points.mean()

reviews.taster_name.unique()

reviews.taster_name.value_counts()

Map functions

map()

review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)

The function you pass to map() should expect a single value from the Series (a point value, in the above example), and return a transformed version of that value. map() returns a new Series where all the values have been transformed by your function.

Apply functions

apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.

def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')

If we had called reviews.apply() with axis='index', then instead of passing a function to transform each row, we would need to give a function to transform each column.

Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on.

Built-in mapping operations

review_points_mean = reviews.points.mean()
reviews.points - review_points_mean

>>>
0        -1.447138
1        -1.447138
            ...
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64


reviews.country + " - " + reviews.region_1

>>>
0            Italy - Etna
1                     NaN
               ...
129969    France - Alsace
129970    France - Alsace
Length: 129971, dtype: object

These operators are faster than map() or apply() because they use speed ups built into pandas. All of the standard Python operators (>, <, ==, and so on) work in this manner. They are not as flexible as map() or apply(), for complex operations and conditions.

Grouping and sorting

groupby

reviews.groupby('points').points.count()
# same can be achieved by .value_counts()

reviews.groupby('points').price.min()

>>>
points
80      5.0
81      5.0
       ...
99     44.0
100    80.0
Name: price, Length: 21, dtype: float64

Each group generated can be considered as being a slice of our DataFrame containing only data with values that match. This DataFrame is accessible to directly using the apply() method.

reviews.groupby('winery').apply(lambda df: df.title.iloc[0])

>>>
winery
1+1=3                          1+1=3 NV Rosé Sparkling (Cava)
10 Knots                 10 Knots 2010 Viognier (Paso Robles)
                                  ...
àMaurice    àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka                         Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object

reviews.groupby(['country', 'province']).apply(lambda df:df.loc[df.points.idxmax()])

reviews.groupby(['country']).price.agg([len, min, max])

Multi index

countries_reviewed = reviews.groupby(['country','province']).description.agg([len])

mi = countries_reviewed.index

# the multi-index method you will use most often is the one for converting back to a regular index, the `reset_index()`
countries_reviewed.reset_index()

Multi-indices have several methods for dealing with their tiered structure which are absent for single-level indices. They also require two levels of labels to retrieve a value.

Sorting

To get data in the order we want it in, we can sort it ourselves.

countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')

# descending
countries_reviewed.sort_values(by='len', ascending=False)

# by index
countries_reviewed.sort_index()

# by 2 values
countries_reviewed.sort_values(by=['country', 'len'])

Dtype

# Returns the data type.
reviews.price.dtype

# Convert a int64 to float64
reviews.points.astype('float64')

Missing data

Missing values are given NaN.

# selecting a null from a column.
reviews[pd.isnull(reviews.country)]

Filling missing values

reviews.region_2.fillna("Unknown")
# Fills all the NaNs with 'Unknown'

Backfill strategy

Fill each missing value with the first non-null value that appears sometime after the given record in the database.

Replace

reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")

replace() method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on.

Renaming and combining

reviews.rename(columns={'points': 'score'})

# Renaming the index
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})

# renaming the axis
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')

Using a dict is most convenient.

Combining

Pandas has three core methods for doing this. In order of increasing complexity, these are concat(), join(), and merge(). Most of what merge() can do can also be done more simply with join().

concat when the columns are the same.

canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])

join() lets you combine different DataFrame objects which have an index in common.

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

References

Pandas documentation
Kaggle

Sample DataFrame: reviews

country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive offering unripened apple, citrus and dried sage alongside brisk acidity.	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend
1	Portugal	This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016.	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red
2	US	Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.	NaN	87	14.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris
3	US	Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of honey-drizzled guava and mango giving way to a slightly astringent, semidry finish.	Reserve Late Harvest	87	13.0	Michigan	Lake Michigan Shore	NaN	Alexander Peartree	NaN	St. Julian 2013 Reserve Late Harvest Riesling (Lake Michigan Shore)	Riesling
4	US	Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics. Nonetheless, if you think of it as a pleasantly unfussy country wine, it's a good companion to a hearty winter stew.	Vintner's Reserve Wild Child Block	87	65.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Sweet Cheeks 2012 Vintner's Reserve Wild Child Block Pinot Noir (Willamette Valley)	Pinot Noir

Pandas

Creating and writing data​

Indexing, selecting and assigning​

Summary functions & maps​

Grouping and sorting​

Dtype​

Renaming and combining​