PANDAS FILTERS
from Padhai Class
pdFilters

Next Exercise - FILTERS - QUERY

  • Filter and show only those rows which have planets that are found in the 2010s
  • and method is 'Radial Velocity'
  • and 'Transit'
  • and distance is large (> 75 percentile)
In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
In [ ]:
planets = sns.load_dataset('planets')
In [ ]:
planets = pd.DataFrame(planets)
In [ ]:
dfplanets = planets
In [ ]:
dfplanets.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035 entries, 0 to 1034
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   method          1035 non-null   object 
 1   number          1035 non-null   int64  
 2   orbital_period  992 non-null    float64
 3   mass            513 non-null    float64
 4   distance        808 non-null    float64
 5   year            1035 non-null   int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 48.6+ KB
In [ ]:
dfplanets.dropna(inplace=True)
In [ ]:
dfplanets.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 498 entries, 0 to 784
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   method          498 non-null    object 
 1   number          498 non-null    int64  
 2   orbital_period  498 non-null    float64
 3   mass            498 non-null    float64
 4   distance        498 non-null    float64
 5   year            498 non-null    int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 27.2+ KB
In [ ]:
dfplanets.head(4)
Out[ ]:
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007

planets found in the year after and including 2010

In [ ]:
dfplanets[dfplanets['year'] > 2010]
Out[ ]:
method number orbital_period mass distance year
2 Radial Velocity 1 763.00000 2.6000 19.84 2011
28 Radial Velocity 1 181.40000 3.2000 45.52 2013
46 Radial Velocity 1 3.23570 0.0036 1.35 2012
58 Radial Velocity 1 277.02000 1.7000 80.64 2013
63 Radial Velocity 1 305.50000 20.6000 92.51 2013
... ... ... ... ... ... ...
634 Radial Velocity 3 431.70000 0.5270 25.87 2011
636 Radial Velocity 1 124.60000 9.1800 149.25 2013
641 Radial Velocity 1 5.05050 1.0680 44.46 2013
649 Transit 1 2.70339 1.4700 178.00 2013
784 Radial Velocity 3 580.00000 0.9470 135.00 2012

144 rows × 6 columns

planets found in the year 2010

In [ ]:
dfplanets[dfplanets['year'] == 2010]
Out[ ]:
method number orbital_period mass distance year
9 Radial Velocity 2 452.80000 1.990 74.79 2010
10 Radial Velocity 2 883.00000 0.860 74.79 2010
45 Radial Velocity 1 380.80000 1.800 20.21 2010
67 Radial Velocity 4 3848.86000 1.059 13.47 2010
137 Radial Velocity 3 1050.30000 4.950 16.13 2010
145 Radial Velocity 4 124.26000 0.046 4.70 2010
149 Radial Velocity 1 2.64561 0.022 19.80 2010
166 Radial Velocity 2 5584.00000 3.400 90.00 2010
203 Radial Velocity 1 533.00000 6.100 7.01 2010
225 Radial Velocity 1 1313.00000 0.630 56.34 2010
230 Radial Velocity 6 5.75962 0.041 39.39 2010
231 Radial Velocity 6 16.35670 0.038 39.39 2010
232 Radial Velocity 6 49.74700 0.080 39.39 2010
233 Radial Velocity 6 122.72000 0.074 39.39 2010
234 Radial Velocity 6 602.00000 0.067 39.39 2010
235 Radial Velocity 6 2248.00000 0.205 39.39 2010
276 Radial Velocity 1 1845.00000 0.950 56.05 2010
280 Radial Velocity 1 1117.00000 1.160 56.18 2010
286 Radial Velocity 1 466.00000 0.500 53.82 2010
320 Radial Velocity 1 327.80000 0.600 54.94 2010
323 Radial Velocity 1 472.00000 0.580 50.43 2010
399 Radial Velocity 1 9.49400 0.026 21.30 2010
404 Radial Velocity 2 4970.00000 0.360 17.99 2010
413 Radial Velocity 1 122.10000 0.050 9.24 2010
415 Radial Velocity 1 6.49500 0.960 121.07 2010
429 Radial Velocity 1 68.27000 0.770 65.62 2010
436 Radial Velocity 2 263.30000 0.270 15.71 2010
437 Radial Velocity 2 1657.00000 0.710 15.71 2010
483 Radial Velocity 1 176.30000 2.900 126.10 2010
506 Radial Velocity 1 4.64550 0.013 24.05 2010
519 Radial Velocity 2 75.29000 0.770 69.44 2010
520 Radial Velocity 2 1314.00000 2.290 69.44 2010
521 Radial Velocity 1 282.40000 0.480 51.81 2010
539 Radial Velocity 2 110.90000 0.150 59.03 2010
573 Radial Velocity 2 613.80000 1.850 68.35 2010
574 Radial Velocity 2 825.00000 0.895 68.35 2010
605 Radial Velocity 1 1319.00000 13.000 54.92 2010
606 Radial Velocity 1 225.70000 0.210 29.94 2010
626 Radial Velocity 1 57.00000 0.470 23.80 2010
627 Radial Velocity 1 16.20000 1.250 223.21 2010
631 Radial Velocity 1 41.39700 0.298 11.03 2010
In [ ]:
dfplanets.head(3)
Out[ ]:
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
In [ ]:
dfplanets[dfplanets['method'] == 'Radial Velocity']
Out[ ]:
method number orbital_period mass distance year
0 Radial Velocity 1 269.3000 7.100 77.40 2006
1 Radial Velocity 1 874.7740 2.210 56.95 2008
2 Radial Velocity 1 763.0000 2.600 19.84 2011
3 Radial Velocity 1 326.0300 19.400 110.62 2007
4 Radial Velocity 1 516.2200 10.500 119.47 2009
... ... ... ... ... ... ...
638 Radial Velocity 1 511.0980 8.820 31.33 2002
640 Radial Velocity 1 111.7000 2.100 14.90 2009
641 Radial Velocity 1 5.0505 1.068 44.46 2013
642 Radial Velocity 1 311.2880 1.940 17.24 1999
784 Radial Velocity 3 580.0000 0.947 135.00 2012

497 rows × 6 columns

In [ ]:
dfplanets[dfplanets['method'] == "Transit"]
Out[ ]:
method number orbital_period mass distance year
649 Transit 1 2.70339 1.47 178.0 2013
In [ ]:
dfplanets.mean()
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  """Entry point for launching an IPython kernel.
Out[ ]:
number               1.734940
orbital_period     835.778671
mass                 2.509320
distance            52.068213
year              2007.377510
dtype: float64
In [ ]:
dfplanets.median()
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  """Entry point for launching an IPython kernel.
Out[ ]:
number               1.000
orbital_period     357.000
mass                 1.245
distance            39.940
year              2009.000
dtype: float64
In [ ]:
dfplanets.quantile([.75, .85])
Out[ ]:
number orbital_period mass distance year
0.75 2.0 999.60 2.8675 59.3325 2011.0
0.85 3.0 1700.85 5.1000 88.0455 2011.0
In [ ]:
dfplanets.quantile(.75)
Out[ ]:
number               2.0000
orbital_period     999.6000
mass                 2.8675
distance            59.3325
year              2011.0000
Name: 0.75, dtype: float64
In [ ]:
type(dfplanets['distance'].quantile(0.75))
Out[ ]:
numpy.float64
In [ ]:
dfplanets.quantile(0.75)
Out[ ]:
number               2.0000
orbital_period     999.6000
mass                 2.8675
distance            59.3325
year              2011.0000
Name: 0.75, dtype: float64

as taught in class

In [ ]:
df_ = dfplanets.copy()
per75 = df_.distance.quantile(0.75)
In [ ]:
per75
Out[ ]:
178.5
In [ ]:
df_.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 498 entries, 0 to 784
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   method          498 non-null    object 
 1   number          498 non-null    int64  
 2   orbital_period  498 non-null    float64
 3   mass            498 non-null    float64
 4   distance        498 non-null    float64
 5   year            498 non-null    int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 27.2+ KB

Method I

In [ ]:
for i, r in df_.iterrows():
    if r['year'] < 2010:
        df_.drop(i, inplace=True)
        continue
    if r['method'] != 'Radial Velocity' and r['method'] != 'Transit':
        df_.drop(i, inplace = True)
        continue
    if r['distance'] < per75:
        df_.drop(i, inplace = True)
        continue
In [ ]:
df_.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 324 entries, 87 to 1028
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   method          324 non-null    object 
 1   number          324 non-null    int64  
 2   orbital_period  324 non-null    float64
 3   mass            12 non-null     float64
 4   distance        154 non-null    float64
 5   year            324 non-null    int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 17.7+ KB

Method II

In [ ]:
df_ = dfplanets.copy()
In [ ]:
df_.describe()
Out[ ]:
number orbital_period mass distance year
count 324.000000 324.000000 12.000000 154.000000 324.000000
mean 2.135802 70.534822 3.851250 654.435649 2012.043210
std 1.472161 257.299126 3.685388 527.480906 1.243429
min 1.000000 0.355000 0.340000 180.000000 2010.000000
25% 1.000000 3.731436 1.165000 303.000000 2011.000000
50% 2.000000 9.264922 2.470000 429.000000 2012.000000
75% 2.000000 26.213000 5.618750 800.000000 2013.000000
max 7.000000 3000.000000 11.100000 3200.000000 2014.000000
In [ ]:
df_ = df_[
          (df_['year'] >= 2010) &
          ((df_['method'] == 'Radial Velocity') | (df_['method'] == 'Transit')) &
          (df_['distance'] > per75)
]
In [ ]:
df_.describe()
Out[ ]:
number orbital_period mass distance year
count 50.000000 50.000000 50.000000 50.000000 50.000000
mean 1.300000 763.904808 3.322740 133.142600 2011.360000
std 0.505076 966.789870 3.648002 70.378699 1.120496
min 1.000000 2.703390 0.770000 65.620000 2010.000000
25% 1.000000 255.555000 1.325000 80.205000 2011.000000
50% 1.000000 550.500000 1.875000 121.070000 2011.000000
75% 2.000000 873.625000 3.400000 150.097500 2012.000000
max 3.000000 5584.000000 20.600000 354.000000 2014.000000

Using QUERY f()

In [ ]:
df_ = dfplanets.copy()
per75 = df_.distance.quantile(0.75)
In [ ]:
per75
Out[ ]:
178.5
In [ ]:
# Code executes but the results are not correct
df_ = df_.query('(year >= 2010) & (method == "Radial Velocity" | method == "Transit") & (distance > @@per75)')
In [ ]:
df_.describe()
Out[ ]:
number orbital_period mass distance year
count 154.000000 154.000000 7.000000 154.000000 154.000000
mean 2.045455 55.483100 5.207143 654.435649 2011.448052
std 1.834103 159.178501 3.891918 527.480906 1.078865
min 1.000000 0.742960 1.250000 180.000000 2010.000000
25% 1.000000 3.276083 2.450000 303.000000 2011.000000
50% 1.000000 4.641130 4.000000 429.000000 2011.000000
75% 2.000000 16.228912 7.600000 800.000000 2012.000000
max 7.000000 1022.000000 11.100000 3200.000000 2014.000000

Next Exercise

  • Modify the method column to have only the abbreviation of each method
  • eg: instead of 'Radial Velocity', just make it RV and TR for 'Transit'
In [ ]: