DPLYR Tutorial – Data Manipulation using DPLYR Package in R Programming



DPLYR-TUTORIAL—Data-Manipulation-with-DPLYR-in-R-





Why to use dplyr ?

  • It is really useful for data exploration and transformation
  • Fast while dealing with data frames

Functionality of dplyr

  • It is useful while dealing with : ‘select’,‘filter’, ‘mutate’ , ’ arrange’ , ‘summarise’, which can be use as five basic verbs
  • Can be very useful in handling inner joins, left join, semi-join, anti-join
# loading packages 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(datasets)
#install.packages("hflights")

library(hflights)
# Lets explore data 

data("hflights")
head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 5424 2011     1          1         6    1400    1500            AA       428
## 5425 2011     1          2         7    1401    1501            AA       428
## 5426 2011     1          3         1    1352    1502            AA       428
## 5427 2011     1          4         2    1403    1513            AA       428
## 5428 2011     1          5         3    1405    1507            AA       428
## 5429 2011     1          6         4    1359    1503            AA       428
##      TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 5424  N576AA                60      40      -10        0    IAH  DFW      224
## 5425  N557AA                60      45       -9        1    IAH  DFW      224
## 5426  N541AA                70      48       -8       -8    IAH  DFW      224
## 5427  N403AA                70      39        3        3    IAH  DFW      224
## 5428  N492AA                62      44       -3        5    IAH  DFW      224
## 5429  N262AA                64      45       -7       -1    IAH  DFW      224
##      TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424      7      13         0                         0
## 5425      6       9         0                         0
## 5426      5      17         0                         0
## 5427      9      22         0                         0
## 5428      9       9         0                         0
## 5429      6      13         0                         0
  • ‘as_tibble’ creates a “a local data frame”
  • Tibble data frame will print first ten rows and total columns which fit to the screen in a systematic manner in comparison to raw data
# Convert to tibble 

flights <-  as_tibble(hflights)
flights
## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # ... with 227,486 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

Lets use filter to understand it ,

If we want to view all flights on February 1 we can use the following two methods

METHOD – I

flights[flights$Month ==2 & flights$DayofMonth ==1, ]
## # A tibble: 577 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     2          1         2    1401    1539 AA                  428
##  2  2011     2          1         2      NA      NA AA                  460
##  3  2011     2          1         2      NA      NA AA                  533
##  4  2011     2          1         2      NA      NA AA                 1121
##  5  2011     2          1         2    1746    2109 AA                 1294
##  6  2011     2          1         2      NA      NA AA                 1436
##  7  2011     2          1         2    1032    1358 AA                 1700
##  8  2011     2          1         2      NA      NA AA                 1820
##  9  2011     2          1         2     558     912 AA                 1994
## 10  2011     2          1         2    1820    2112 AS                  731
## # ... with 567 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

METHOD – II Using Filter

filter(flights, flights$Month ==2 , flights$DayofMonth ==1)
## # A tibble: 577 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     2          1         2    1401    1539 AA                  428
##  2  2011     2          1         2      NA      NA AA                  460
##  3  2011     2          1         2      NA      NA AA                  533
##  4  2011     2          1         2      NA      NA AA                 1121
##  5  2011     2          1         2    1746    2109 AA                 1294
##  6  2011     2          1         2      NA      NA AA                 1436
##  7  2011     2          1         2    1032    1358 AA                 1700
##  8  2011     2          1         2      NA      NA AA                 1820
##  9  2011     2          1         2     558     912 AA                 1994
## 10  2011     2          1         2    1820    2112 AS                  731
## # ... with 567 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

If we want to use AND that means applying condition on two attributes

filter(flights, UniqueCarrier =="AA" | UniqueCarrier == "UA" )
## # A tibble: 5,316 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # ... with 5,306 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

We can use filter in the following manner as well by uinsg %in%

filter(flights, UniqueCarrier %in% c("AA","UA") )
## # A tibble: 5,316 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # ... with 5,306 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

We can use select function to select columns by name

select(flights, DepTime, ArrTime, FlightNum)
## # A tibble: 227,496 x 3
##    DepTime ArrTime FlightNum
##      <int>   <int>     <int>
##  1    1400    1500       428
##  2    1401    1501       428
##  3    1352    1502       428
##  4    1403    1513       428
##  5    1405    1507       428
##  6    1359    1503       428
##  7    1359    1509       428
##  8    1355    1454       428
##  9    1443    1554       428
## 10    1443    1553       428
## # ... with 227,486 more rows

We can use colon to select multiple columns in continuation in a dataframe, also we can use ‘matches’ to match the exact word of column name

select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
## # A tibble: 227,496 x 7
##     Year Month DayofMonth TaxiIn TaxiOut ArrDelay DepDelay
##    <int> <int>      <int>  <int>   <int>    <int>    <int>
##  1  2011     1          1      7      13      -10        0
##  2  2011     1          2      6       9       -9        1
##  3  2011     1          3      5      17       -8       -8
##  4  2011     1          4      9      22        3        3
##  5  2011     1          5      9       9       -3        5
##  6  2011     1          6      6      13       -7       -1
##  7  2011     1          7     12      15       -1       -1
##  8  2011     1          8      7      12      -16       -5
##  9  2011     1          9      8      22       44       43
## 10  2011     1         10      6      19       43       43
## # ... with 227,486 more rows

We can also use pipeline or chain for using multiple operation

Here we are filtering two columns and also using certain condition on a column

flights %>%  select(UniqueCarrier, DepDelay) %>% 
  filter(DepDelay > 50)
## # A tibble: 12,901 x 2
##    UniqueCarrier DepDelay
##    <chr>            <int>
##  1 AA                  90
##  2 AA                  67
##  3 AA                  55
##  4 AA                  74
##  5 AA                 125
##  6 AA                  82
##  7 AA                  99
##  8 AA                  53
##  9 AA                  70
## 10 AA                  61
## # ... with 12,891 more rows

Using arrange : to reorder rows

flights %>% 
  select(UniqueCarrier, DepDelay) %>% 
  arrange(DepDelay) # By default it will give ascending order of  the rows in DepDelay
## # A tibble: 227,496 x 2
##    UniqueCarrier DepDelay
##    <chr>            <int>
##  1 OO                 -33
##  2 MQ                 -23
##  3 XE                 -19
##  4 XE                 -19
##  5 CO                 -18
##  6 EV                 -18
##  7 XE                 -17
##  8 CO                 -17
##  9 XE                 -17
## 10 MQ                 -17
## # ... with 227,486 more rows

If we want to use descending order then simply we can use the following :

flights %>% 
  select(UniqueCarrier, DepDelay) %>% 
  arrange(desc(DepDelay))
## # A tibble: 227,496 x 2
##    UniqueCarrier DepDelay
##    <chr>            <int>
##  1 CO                 981
##  2 AA                 970
##  3 MQ                 931
##  4 UA                 869
##  5 MQ                 814
##  6 MQ                 803
##  7 CO                 780
##  8 CO                 758
##  9 DL                 730
## 10 MQ                 691
## # ... with 227,486 more rows

note : is.na() can be useful while filtering na values

flights %>%  filter(!is.na(DepTime))
## # A tibble: 224,591 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # ... with 224,581 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

slice() filter rows by position

flights %>%  slice(100:150)
## # A tibble: 51 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1         12         3    1631    1739 AA                 1121
##  2  2011     1         13         4    1630    1733 AA                 1121
##  3  2011     1         14         5    1629    1734 AA                 1121
##  4  2011     1         15         6    1632    1736 AA                 1121
##  5  2011     1         16         7    1708    1819 AA                 1121
##  6  2011     1         17         1    1632    1744 AA                 1121
##  7  2011     1         18         2    1625    1740 AA                 1121
##  8  2011     1         19         3    1629    1731 AA                 1121
##  9  2011     1         20         4    1641    1752 AA                 1121
## 10  2011     1         21         5    1638    1746 AA                 1121
## # ... with 41 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

keeping the first four rows within each group

flights %>%  group_by(Month, DayofMonth) %>% slice(1:4) 
## # A tibble: 1,460 x 21
## # Groups:   Month, DayofMonth [365]
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          1         6     728     840 AA                  460
##  3  2011     1          1         6    1631    1736 AA                 1121
##  4  2011     1          1         6    1756    2112 AA                 1294
##  5  2011     1          2         7    1401    1501 AA                  428
##  6  2011     1          2         7     719     821 AA                  460
##  7  2011     1          2         7    1959    2106 AA                  533
##  8  2011     1          2         7    1636    1759 AA                 1121
##  9  2011     1          3         1    1352    1502 AA                  428
## 10  2011     1          3         1     717     834 AA                  460
## # ... with 1,450 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

Adding a new column by using functions on the existing data

flights %>% 
  select(Distance, AirTime) %>% 
  mutate(Speed = Distance / AirTime*60) # here mutate will give us new column by the name of the speed 
## # A tibble: 227,496 x 3
##    Distance AirTime Speed
##       <int>   <int> <dbl>
##  1      224      40  336 
##  2      224      45  299.
##  3      224      48  280 
##  4      224      39  345.
##  5      224      44  305.
##  6      224      45  299.
##  7      224      43  313.
##  8      224      40  336 
##  9      224      41  328.
## 10      224      45  299.
## # ... with 227,486 more rows

Using summarise , group_by functions to reduce variables to values

# You are advised to use the following two codes to see how it will show data by using na.rm = TRUE and without using na.rm 
# It is further advisable to use na.rm while finding mean, median or other calculation on any data where missing values are there 



flights %>% 
  group_by(Dest) %>% 
  summarise(avg_delay = mean(ArrDelay))
## # A tibble: 116 x 2
##    Dest  avg_delay
##    <chr>     <dbl>
##  1 ABQ          NA
##  2 AEX          NA
##  3 AGS           4
##  4 AMA          NA
##  5 ANC          NA
##  6 ASE          NA
##  7 ATL          NA
##  8 AUS          NA
##  9 AVL          NA
## 10 BFL          NA
## # ... with 106 more rows
flights %>% 
  group_by(Dest) %>% 
  summarise(avg_delay = mean(ArrDelay, na.rm = TRUE))
## # A tibble: 116 x 2
##    Dest  avg_delay
##    <chr>     <dbl>
##  1 ABQ        7.23
##  2 AEX        5.84
##  3 AGS        4   
##  4 AMA        6.84
##  5 ANC       26.1 
##  6 ASE        6.79
##  7 ATL        8.23
##  8 AUS        7.45
##  9 AVL        9.97
## 10 BFL      -13.2 
## # ... with 106 more rows
# Lets find the percentage of flights cancelled or diverted

flights %>% 
  group_by(UniqueCarrier) %>% 
  summarise_each(funs(mean), Cancelled , Diverted)
## Warning: `summarise_each_()` was deprecated in dplyr 0.7.0.
## Please use `across()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## # A tibble: 15 x 3
##    UniqueCarrier Cancelled Diverted
##    <chr>             <dbl>    <dbl>
##  1 AA              0.0185   0.00185
##  2 AS              0        0.00274
##  3 B6              0.0259   0.00576
##  4 CO              0.00678  0.00263
##  5 DL              0.0159   0.00303
##  6 EV              0.0345   0.00318
##  7 F9              0.00716  0      
##  8 FL              0.00982  0.00327
##  9 MQ              0.0290   0.00194
## 10 OO              0.0139   0.00349
## 11 UA              0.0164   0.00241
## 12 US              0.0113   0.00147
## 13 WN              0.0155   0.00229
## 14 XE              0.0155   0.00345
## 15 YV              0.0127   0

for each carrier, lets calculate the minimum and maximum arrival and departure delay

flights %>% 
  group_by(UniqueCarrier) %>% 
  summarise_each(funs(min( . , na.rm = TRUE), max(. , na.rm = TRUE)), matches("Delay"))
## # A tibble: 15 x 5
##    UniqueCarrier ArrDelay_min DepDelay_min ArrDelay_max DepDelay_max
##    <chr>                <int>        <int>        <int>        <int>
##  1 AA                     -39          -15          978          970
##  2 AS                     -43          -15          183          172
##  3 B6                     -44          -14          335          310
##  4 CO                     -55          -18          957          981
##  5 DL                     -32          -17          701          730
##  6 EV                     -40          -18          469          479
##  7 F9                     -24          -15          277          275
##  8 FL                     -30          -14          500          507
##  9 MQ                     -38          -23          918          931
## 10 OO                     -57          -33          380          360
## 11 UA                     -47          -11          861          869
## 12 US                     -42          -17          433          425
## 13 WN                     -44          -10          499          548
## 14 XE                     -70          -19          634          628
## 15 YV                     -32          -11           72           54

Helper functions ‘n()’ counts the number of rows in a group

Helper function ‘n_distinct(vector)’ counts the number of distinct items in that vector

# for each day of the year , count the total number of flights and sort in descending order 

library(dplyr)
flights %>% 
  group_by(Month, DayofMonth) %>% 
  summarise(flight_count = n()) %>% 
  arrange(desc(flight_count))
## `summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.
## # A tibble: 365 x 3
## # Groups:   Month [12]
##    Month DayofMonth flight_count
##    <int>      <int>        <int>
##  1     8          4          706
##  2     8         11          706
##  3     8         12          706
##  4     8          5          705
##  5     8          3          704
##  6     8         10          704
##  7     1          3          702
##  8     7          7          702
##  9     7         14          702
## 10     7         28          701
## # ... with 355 more rows

We can also use “tally” function

flights %>% 
  group_by(Month, DayofMonth) %>% 
  tally(sort = TRUE)
## # A tibble: 365 x 3
## # Groups:   Month [12]
##    Month DayofMonth     n
##    <int>      <int> <int>
##  1     8          4   706
##  2     8         11   706
##  3     8         12   706
##  4     8          5   705
##  5     8          3   704
##  6     8         10   704
##  7     1          3   702
##  8     7          7   702
##  9     7         14   702
## 10     7         28   701
## # ... with 355 more rows

for each destination, count the total number of flights and the number of distinct planes that flew there

flights %>% 
  group_by(Dest) %>% 
  summarise(flight_count = n(), plane_count = n_distinct(TailNum))
## # A tibble: 116 x 3
##    Dest  flight_count plane_count
##    <chr>        <int>       <int>
##  1 ABQ           2812         716
##  2 AEX            724         215
##  3 AGS              1           1
##  4 AMA           1297         158
##  5 ANC            125          38
##  6 ASE            125          60
##  7 ATL           7886         983
##  8 AUS           5022        1015
##  9 AVL            350         142
## 10 BFL            504          70
## # ... with 106 more rows

for each destination lets check the cancelled and not cancelled flights

flights %>% 
  group_by(Dest) %>% 
  select(Cancelled) %>% 
  table()
## Adding missing grouping variables: `Dest`
##      Cancelled
## Dest     0    1
##   ABQ 2787   25
##   AEX  712   12
##   AGS    1    0
##   AMA 1265   32
##   ANC  125    0
##   ASE  120    5
##   ATL 7745  141
##   AUS 4995   27
##   AVL  347    3
##   BFL  503    1
##   BHM 2697   39
##   BKG  108    2
##   BNA 3451   30
##   BOS 1724   28
##   BPT    3    0
##   BRO 1665   27
##   BTR 1733   29
##   BWI 2527   24
##   CAE  547   14
##   CHS 1191    9
##   CID  408    2
##   CLE 2132    8
##   CLT 4671   64
##   CMH 1334   14
##   COS 1637   20
##   CRP 4720   93
##   CRW  350    7
##   CVG 1518   17
##   DAL 9378  442
##   DAY  446    5
##   DCA 2664   35
##   DEN 5892   28
##   DFW 6500  153
##   DSM  635   12
##   DTW 2568   33
##   ECP  727    2
##   EGE  108    2
##   ELP 3012   24
##   EWR 4244   70
##   FLL 2455    7
##   GJT  401    2
##   GPT 1586   32
##   GRK   40    2
##   GRR  672    5
##   GSO  624    6
##   GSP 1116    7
##   GUC   86    0
##   HDN  109    1
##   HNL  401    1
##   HOB  299   10
##   HRL 3881  102
##   HSV  911   12
##   IAD 1958   22
##   ICT 1484   33
##   IND 1726   24
##   JAN 1984   27
##   JAX 2123   12
##   JFK  677   18
##   LAS 4067   15
##   LAX 6031   33
##   LBB 1309   24
##   LCH  352   12
##   LEX  578    6
##   LFT 2257   56
##   LGA 2681   49
##   LIT 1553   26
##   LRD 1168   20
##   MAF 2263   43
##   MCI 3133   41
##   MCO 3671   16
##   MDW 2072   22
##   MEM 2352   47
##   MFE 1116   12
##   MIA 2439   24
##   MKE 1568   20
##   MLU  288    4
##   MOB 1641   33
##   MSP 1986   24
##   MSY 6783   40
##   MTJ  163    1
##   OAK  685    5
##   OKC 3114   56
##   OMA 2025   19
##   ONT  950    2
##   ORD 5649   99
##   ORF  711    6
##   PBI 1242   11
##   PDX 1232    3
##   PHL 2340   27
##   PHX 5067   29
##   PIT 1652   12
##   PNS 1516   23
##   PSP  106    0
##   RDU 1727   13
##   RIC  893    7
##   RNO  243    0
##   RSW  941    7
##   SAN 2924   12
##   SAT 4853   40
##   SAV  855    8
##   SDF 1269   10
##   SEA 2611    4
##   SFO 2804   14
##   SHV  778    9
##   SJC  884    1
##   SJU  389    2
##   SLC 2024    9
##   SMF 1011    3
##   SNA 1651   10
##   STL 2479   30
##   TPA 3074   11
##   TUL 2870   54
##   TUS 1550   15
##   TYS 1202    8
##   VPS  870   10
##   XNA 1138   34

Window functions

# for each carrier , calculate which two days of the year they had trheir longest departure delays 

flights %>% 
  group_by(UniqueCarrier) %>% 
  select(Month, DayofMonth, DepDelay) %>% 
  filter(min_rank(desc(DepDelay))<= 2) %>% 
  arrange(UniqueCarrier, desc(DepDelay))
## Adding missing grouping variables: `UniqueCarrier`
## # A tibble: 30 x 4
## # Groups:   UniqueCarrier [15]
##    UniqueCarrier Month DayofMonth DepDelay
##    <chr>         <int>      <int>    <int>
##  1 AA               12         12      970
##  2 AA               11         19      677
##  3 AS                2         28      172
##  4 AS                7          6      138
##  5 B6               10         29      310
##  6 B6                8         19      283
##  7 CO                8          1      981
##  8 CO                1         20      780
##  9 DL               10         25      730
## 10 DL                4          5      497
## # ... with 20 more rows

for each month, calculate the number of flights and the change from the preivous month

flights %>% 
  group_by(Month) %>% 
  summarise(flight_count = n()) %>% 
  mutate(change = flight_count - lag(flight_count))
## # A tibble: 12 x 3
##    Month flight_count change
##    <int>        <int>  <int>
##  1     1        18910     NA
##  2     2        17128  -1782
##  3     3        19470   2342
##  4     4        18593   -877
##  5     5        19172    579
##  6     6        19600    428
##  7     7        20548    948
##  8     8        20176   -372
##  9     9        18065  -2111
## 10    10        18696    631
## 11    11        18021   -675
## 12    12        19117   1096

we can also use tally() function

flights %>% 
  group_by(Month) %>% 
  tally() %>% 
  mutate(change = n - lag(n)) # where n is flight count 
## # A tibble: 12 x 3
##    Month     n change
##    <int> <int>  <int>
##  1     1 18910     NA
##  2     2 17128  -1782
##  3     3 19470   2342
##  4     4 18593   -877
##  5     5 19172    579
##  6     6 19600    428
##  7     7 20548    948
##  8     8 20176   -372
##  9     9 18065  -2111
## 10    10 18696    631
## 11    11 18021   -675
## 12    12 19117   1096

Few more function

# need to see few rows for random sampling without replacement

flights %>% sample_n(5)
## # A tibble: 5 x 21
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
## 1  2011     4          3         7    2102    2153 XE                 2803
## 2  2011    10         11         2    1440    1825 CO                 1539
## 3  2011     8         29         1    1100    1304 WN                 1317
## 4  2011     8         18         4    1545    1751 XE                 2529
## 5  2011     2         13         7    1440    1630 XE                 2998
## # ... with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## #   AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>, Dest <chr>,
## #   Distance <int>, TaxiIn <int>, TaxiOut <int>, Cancelled <int>,
## #   CancellationCode <chr>, Diverted <int>
# randomly sample a fraction of rows, with replacement 

flights %>%  sample_frac(0.25, replace = TRUE)
## # A tibble: 56,874 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     3         25         5      49     142 XE                 2450
##  2  2011     9          5         1    1148    1243 CO                 1101
##  3  2011     3          8         2    1334    1450 WN                 1896
##  4  2011    10          6         4    1041    1430 WN                  734
##  5  2011    10         28         5    1442    1535 WN                  249
##  6  2011     8          3         3    1929    2016 XE                 2418
##  7  2011     3          1         2     719     921 XE                 2888
##  8  2011     3         17         4    1946    2258 XE                 2822
##  9  2011     6          1         3    1301    1405 XE                 2189
## 10  2011     3         16         3     729    1215 CO                  282
## # ... with 56,864 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

In base R , we use the following to check the structure of object

str(flights)
## tibble [227,496 x 21] (S3: tbl_df/tbl/data.frame)
##  $ Year             : int [1:227496] 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int [1:227496] 1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int [1:227496] 1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek        : int [1:227496] 6 7 1 2 3 4 5 6 7 1 ...
##  $ DepTime          : int [1:227496] 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
##  $ ArrTime          : int [1:227496] 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
##  $ UniqueCarrier    : chr [1:227496] "AA" "AA" "AA" "AA" ...
##  $ FlightNum        : int [1:227496] 428 428 428 428 428 428 428 428 428 428 ...
##  $ TailNum          : chr [1:227496] "N576AA" "N557AA" "N541AA" "N403AA" ...
##  $ ActualElapsedTime: int [1:227496] 60 60 70 70 62 64 70 59 71 70 ...
##  $ AirTime          : int [1:227496] 40 45 48 39 44 45 43 40 41 45 ...
##  $ ArrDelay         : int [1:227496] -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
##  $ DepDelay         : int [1:227496] 0 1 -8 3 5 -1 -1 -5 43 43 ...
##  $ Origin           : chr [1:227496] "IAH" "IAH" "IAH" "IAH" ...
##  $ Dest             : chr [1:227496] "DFW" "DFW" "DFW" "DFW" ...
##  $ Distance         : int [1:227496] 224 224 224 224 224 224 224 224 224 224 ...
##  $ TaxiIn           : int [1:227496] 7 6 5 9 9 6 12 7 8 6 ...
##  $ TaxiOut          : int [1:227496] 13 9 17 22 9 13 15 12 22 19 ...
##  $ Cancelled        : int [1:227496] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : chr [1:227496] "" "" "" "" ...
##  $ Diverted         : int [1:227496] 0 0 0 0 0 0 0 0 0 0 ...

Using dplyr to check the structure of object

glimpse(flights)
## Rows: 227,496
## Columns: 21
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011~
## $ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2~
## $ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 1355, 1443~
## $ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 1454, 1554~
## $ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"~
## $ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428, 428, 42~
## $ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N492AA", "N~
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, 56, 63, ~
## $ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, 41, 44, ~
## $ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29, 5, -9, ~
## $ DepDelay          <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, -2, -3, ~
## $ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IA~
## $ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DF~
## $ Distance          <int> 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 22~
## $ TaxiIn            <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6, 12, 8,~
## $ TaxiOut           <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11, 13, 15~
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", ~
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~

Now time to connect the database

  • dplyr can connect to database
  • only generate select statements