PANDAS ITERROWS f()
from PadhAI courseware
Pandas_DF_iterrows

Seaborn ships with several built-in datasets, one of which is planets. Load this dataset into a DataFrame and explore some of its properties using the techniques learnt above.

In [ ]:
import seaborn as sns
import numpy as np
import pandas as pd
In [ ]:
snsPlanets = sns.load_dataset('planets')
In [ ]:
snsPlanets
Out[ ]:
               method  number  orbital_period   mass  distance  year
0     Radial Velocity       1      269.300000   7.10     77.40  2006
1     Radial Velocity       1      874.774000   2.21     56.95  2008
2     Radial Velocity       1      763.000000   2.60     19.84  2011
3     Radial Velocity       1      326.030000  19.40    110.62  2007
4     Radial Velocity       1      516.220000  10.50    119.47  2009
...               ...     ...             ...    ...       ...   ...
1030          Transit       1        3.941507    NaN    172.00  2006
1031          Transit       1        2.615864    NaN    148.00  2007
1032          Transit       1        3.191524    NaN    174.00  2007
1033          Transit       1        4.125083    NaN    293.00  2008
1034          Transit       1        4.187757    NaN    260.00  2008

1035 rows × 6 columns

In [ ]:
snsPlanets.describe()
Out[ ]:
            number  orbital_period        mass     distance         year
count  1035.000000      992.000000  513.000000   808.000000  1035.000000
mean      1.785507     2002.917596    2.638161   264.069282  2009.070531
std       1.240976    26014.728304    3.818617   733.116493     3.972567
min       1.000000        0.090706    0.003600     1.350000  1989.000000
25%       1.000000        5.442540    0.229000    32.560000  2007.000000
50%       1.000000       39.979500    1.260000    55.250000  2010.000000
75%       2.000000      526.005000    3.040000   178.500000  2012.000000
max       7.000000   730000.000000   25.000000  8500.000000  2014.000000
In [ ]:
snsPlanets.describe
Out[ ]:
<bound method NDFrame.describe of                method  number  orbital_period   mass  distance  year
0     Radial Velocity       1      269.300000   7.10     77.40  2006
1     Radial Velocity       1      874.774000   2.21     56.95  2008
2     Radial Velocity       1      763.000000   2.60     19.84  2011
3     Radial Velocity       1      326.030000  19.40    110.62  2007
4     Radial Velocity       1      516.220000  10.50    119.47  2009
...               ...     ...             ...    ...       ...   ...
1030          Transit       1        3.941507    NaN    172.00  2006
1031          Transit       1        2.615864    NaN    148.00  2007
1032          Transit       1        3.191524    NaN    174.00  2007
1033          Transit       1        4.125083    NaN    293.00  2008
1034          Transit       1        4.187757    NaN    260.00  2008

[1035 rows x 6 columns]>
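
The output above is not a summary table: without parentheses, `describe` is only referenced, not called, so the notebook prints the bound method together with the frame it belongs to. A minimal sketch of the difference, using a small made-up frame (the name `toy` and its values are illustrative assumptions, not taken from the planets data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only to illustrate the point
toy = pd.DataFrame({"mass": [7.1, 2.21, np.nan], "year": [2006, 2008, 2011]})

method_ref = toy.describe   # no parentheses: a reference to the method, nothing is computed
summary = toy.describe()    # parentheses: the method runs and returns a DataFrame

print(callable(method_ref))          # the reference can still be called later
print(type(summary).__name__)
print(summary.loc["count", "mass"])  # count skips the missing value
```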
In [ ]:
snsPlanetsT = snsPlanets.T
In [ ]:
snsPlanetsT
Out[ ]:
0 1 2 3 4 5 6 7 8 9 ... 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034
method Radial Velocity Radial Velocity Radial Velocity Radial Velocity Radial Velocity Radial Velocity Radial Velocity Radial Velocity Radial Velocity Radial Velocity ... Transit Transit Imaging Transit Imaging Transit Transit Transit Transit Transit
number 1 1 1 1 1 1 1 1 1 2 ... 1 1 1 1 1 1 1 1 1 1
orbital_period 269.3 874.774 763.0 326.03 516.22 185.84 1773.4 798.5 993.3 452.8 ... 3.06785 0.925542 NaN 3.352057 NaN 3.941507 2.615864 3.191524 4.125083 4.187757
mass 7.1 2.21 2.6 19.4 10.5 4.8 4.64 NaN 10.3 1.99 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
distance 77.4 56.95 19.84 110.62 119.47 76.39 18.15 21.41 73.1 74.79 ... 60.0 470.0 19.2 3200.0 10.1 172.0 148.0 174.0 293.0 260.0
year 2006 2008 2011 2007 2009 2008 2002 1996 2008 2010 ... 2012 2014 2011 2012 2012 2006 2007 2007 2008 2008

6 rows × 1035 columns

In [ ]:
snsPlanets.values
Out[ ]:
array([['Radial Velocity', 1, 269.3, 7.1, 77.4, 2006],
       ['Radial Velocity', 1, 874.774, 2.21, 56.95, 2008],
       ['Radial Velocity', 1, 763.0, 2.6, 19.84, 2011],
       ...,
       ['Transit', 1, 3.1915239, nan, 174.0, 2007],
       ['Transit', 1, 4.1250828, nan, 293.0, 2008],
       ['Transit', 1, 4.187757, nan, 260.0, 2008]], dtype=object)
In [ ]:
snsPlanets.index
Out[ ]:
RangeIndex(start=0, stop=1035, step=1)
In [ ]:
snsPlanets.head()
Out[ ]:
            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009
In [ ]:
snsPlanets.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035 entries, 0 to 1034
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   method          1035 non-null   object 
 1   number          1035 non-null   int64  
 2   orbital_period  992 non-null    float64
 3   mass            513 non-null    float64
 4   distance        808 non-null    float64
 5   year            1035 non-null   int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 48.6+ KB

Go through each row of the DataFrame and drop it if any of its cells has no data (that is, delete the entire row).

In [ ]:
snsPlanets.describe()
Out[ ]:
            number  orbital_period        mass     distance         year
count  1035.000000      992.000000  513.000000   808.000000  1035.000000
mean      1.785507     2002.917596    2.638161   264.069282  2009.070531
std       1.240976    26014.728304    3.818617   733.116493     3.972567
min       1.000000        0.090706    0.003600     1.350000  1989.000000
25%       1.000000        5.442540    0.229000    32.560000  2007.000000
50%       1.000000       39.979500    1.260000    55.250000  2010.000000
75%       2.000000      526.005000    3.040000   178.500000  2012.000000
max       7.000000   730000.000000   25.000000  8500.000000  2014.000000
In [ ]:
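# Note: plain assignment does not copy the data. Both names below refer to the
# same DataFrame object as snsPlanets, so the in-place drops in Method - I also
# show up in dfPlanets and df2Planets. Use snsPlanets.copy() for independent copies.
dfPlanets = snsPlanets
df2Planets = snsPlanets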
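
Because the two assignments above bind new names rather than copy data, it may help to see the effect in isolation. The sketch below uses a made-up three-row frame (`base`, `alias` and `independent` are hypothetical names chosen for illustration) to contrast plain assignment with `DataFrame.copy()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for snsPlanets
base = pd.DataFrame({"mass": [7.1, np.nan, 2.6], "year": [2006, 2008, 2011]})

alias = base                # same object under a second name
independent = base.copy()   # separate object with its own data

base.dropna(inplace=True)   # in-place change made through the original name

print(alias is base)        # both names point at one object
print(len(alias))           # the alias sees the dropped row
print(len(independent))     # the copy still has every row
```

If the three methods are meant to start from the same uncleaned data, creating the working frames with `.copy()` is one way to keep them independent.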

Method - I => loop over every cell and drop each row that contains a null value

In [ ]:
for r in snsPlanets.index:
    for c in snsPlanets.columns:
        if pd.isnull(snsPlanets.loc[r, c]):
            snsPlanets.drop(r, inplace = True)
            break
In [ ]:
snsPlanets.describe()
Out[ ]:
          number  orbital_period        mass    distance         year
count  498.00000      498.000000  498.000000  498.000000   498.000000
mean     1.73494      835.778671    2.509320   52.068213  2007.377510
std      1.17572     1469.128259    3.636274   46.596041     4.167284
min      1.00000        1.328300    0.003600    1.350000  1989.000000
25%      1.00000       38.272250    0.212500   24.497500  2005.000000
50%      1.00000      357.000000    1.245000   39.940000  2009.000000
75%      2.00000      999.600000    2.867500   59.332500  2011.000000
max      6.00000    17337.500000   25.000000  354.000000  2014.000000
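
The nested loop in Method - I asks, for each row, whether any cell is null. As a rough sketch, the same question can be asked of all rows at once with a boolean mask. The frame below is made up for illustration (`toy`, `has_null` and `cleaned` are hypothetical names), and this is an alternative formulation rather than the courseware's own method:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in different columns
toy = pd.DataFrame(
    {
        "method": ["Radial Velocity", "Transit", "Imaging"],
        "mass": [7.1, np.nan, 2.6],
        "distance": [77.4, 172.0, np.nan],
    }
)

# True for every row that has at least one missing cell
has_null = toy.isnull().any(axis=1)

# Keep only the rows where the mask is False; toy itself is left unchanged
cleaned = toy[~has_null]

print(has_null.tolist())
print(cleaned.index.tolist())
```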

Method - II

DataFrame.iterrows() iterates over the DataFrame and yields, for each row, the index label together with the row's contents as a pandas Series object.

In [ ]:
for i, r in snsPlanets.iterrows():
    print(i, r)
    break
0 method            Radial Velocity
number                          1
orbital_period              269.3
mass                          7.1
distance                     77.4
year                         2006
Name: 0, dtype: object
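
Note the `dtype: object` in the row printed above: each row is packed into a single Series, so values from differently typed columns have to share one dtype. The sketch below, on a made-up numeric frame (`toy` is a hypothetical name), shows the same effect where an integer column comes back as floats inside the row:

```python
import pandas as pd

# Hypothetical toy frame with one integer and one float column
toy = pd.DataFrame({"number": [1, 2], "mass": [7.1, 2.21]})

for label, row in toy.iterrows():
    # row is a Series whose index holds the column names
    print(label, type(row).__name__, row.dtype)
    break

first_label, first_row = next(toy.iterrows())
print(toy["number"].dtype)         # the column keeps its integer dtype
print(type(first_row["number"]))   # inside the row the value is a float
```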

Rewrite the Method - I loop using df.iterrows()

In [ ]:
dfPlanets.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 498 entries, 0 to 784
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   method          498 non-null    object 
 1   number          498 non-null    int64  
 2   orbital_period  498 non-null    float64
 3   mass            498 non-null    float64
 4   distance        498 non-null    float64
 5   year            498 non-null    int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 27.2+ KB
In [ ]:
for i, r in dfPlanets.iterrows():
    print(pd.isnull(r))
    break       # to stop after first iteration
method            False
number            False
orbital_period    False
mass              False
distance          False
year              False
Name: 0, dtype: bool
In [ ]:
for i, r in dfPlanets.iterrows():
    print(pd.isnull(r).any())
    break       # to stop after first iteration
False
In [ ]:
for i, r in dfPlanets.iterrows():
    if pd.isnull(r).any():
        dfPlanets.drop(i, inplace = True)
In [ ]:
dfPlanets.describe()
Out[ ]:
          number  orbital_period        mass    distance         year
count  498.00000      498.000000  498.000000  498.000000   498.000000
mean     1.73494      835.778671    2.509320   52.068213  2007.377510
std      1.17572     1469.128259    3.636274   46.596041     4.167284
min      1.00000        1.328300    0.003600    1.350000  1989.000000
25%      1.00000       38.272250    0.212500   24.497500  2005.000000
50%      1.00000      357.000000    1.245000   39.940000  2009.000000
75%      2.00000      999.600000    2.867500   59.332500  2011.000000
max      6.00000    17337.500000   25.000000  354.000000  2014.000000
In [ ]:
dfPlanets.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 498 entries, 0 to 784
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   method          498 non-null    object 
 1   number          498 non-null    int64  
 2   orbital_period  498 non-null    float64
 3   mass            498 non-null    float64
 4   distance        498 non-null    float64
 5   year            498 non-null    int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 27.2+ KB
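
The iterrows() loop in Method - II drops rows from the frame while it is still being iterated. A more cautious variant, sketched here on a made-up frame (`toy`, `labels_to_drop` and `cleaned` are hypothetical names), first collects the labels of the offending rows and then drops them in a single call after the loop has finished:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; rows 1 and 2 each have one missing cell
toy = pd.DataFrame(
    {"mass": [7.1, np.nan, 2.6, 19.4], "distance": [77.4, 56.95, np.nan, 110.62]}
)

# First pass: only record which index labels have a missing value
labels_to_drop = [label for label, row in toy.iterrows() if row.isnull().any()]

# Second step: one drop call, made outside the loop, returning a new frame
cleaned = toy.drop(index=labels_to_drop)

print(labels_to_drop)
print(cleaned.index.tolist())
```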

Method - III => DataFrame.dropna() to drop every row that contains a null value

In [ ]:
df2Planets.dropna(inplace = True)
In [ ]:
df2Planets.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 498 entries, 0 to 784
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   method          498 non-null    object 
 1   number          498 non-null    int64  
 2   orbital_period  498 non-null    float64
 3   mass            498 non-null    float64
 4   distance        498 non-null    float64
 5   year            498 non-null    int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 27.2+ KB
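
`dropna()` can also be restricted to certain columns or relaxed to drop only fully empty rows. The sketch below shows a few of its documented parameters (`subset`, `how`) on a made-up frame; the `toy` frame and the result names are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different patterns of missing values
toy = pd.DataFrame(
    {
        "orbital_period": [269.3, np.nan, 3.94, np.nan],
        "mass": [7.1, 2.21, np.nan, np.nan],
        "year": [2006, 2008, 2006, 2011],
    }
)

# Default behaviour: drop a row if any cell is missing
any_null = toy.dropna()

# Only look at the mass column when deciding what to drop
mass_only = toy.dropna(subset=["mass"])

# Drop a row only when both listed columns are missing
both_missing = toy.dropna(how="all", subset=["orbital_period", "mass"])

print(any_null.index.tolist())
print(mass_only.index.tolist())
print(both_missing.index.tolist())
```

Without `inplace=True`, each call returns a new frame and leaves `toy` untouched.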

Next Exercise

  • Filter and show only those rows where the planet was found in the 2010s, the method is either 'Radial Velocity' or 'Transit', and the distance is large (above the 75th percentile)
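
One possible way to approach this exercise is to combine three boolean conditions with `&`. The sketch below runs on a small made-up frame rather than the real planets data, and it assumes the 75th percentile is taken over the distance column of the whole frame before filtering; `toy`, `threshold`, `mask` and `result` are hypothetical names:

```python
import pandas as pd

# Hypothetical toy frame with the three columns the exercise refers to
toy = pd.DataFrame(
    {
        "method": ["Radial Velocity", "Transit", "Imaging", "Transit",
                   "Radial Velocity", "Transit", "Radial Velocity", "Transit"],
        "distance": [20.0, 450.0, 500.0, 50.0, 400.0, 30.0, 600.0, 10.0],
        "year": [2011, 2012, 2013, 2008, 2009, 2014, 2010, 2011],
    }
)

# Assumption: "large" means above the 75th percentile of all distances
threshold = toy["distance"].quantile(0.75)

mask = (
    toy["year"].between(2010, 2019)                        # found in the 2010s
    & toy["method"].isin(["Radial Velocity", "Transit"])   # either of the two methods
    & (toy["distance"] > threshold)                        # large distance
)
result = toy[mask]

print(threshold)
print(result)
```

Each condition is wrapped so that `&` combines whole boolean Series; Python's `and` would not work element by element here.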