Edit Template

DPLYR Tutorial – Data Manipulation using DPLYR Package in R Programming



DPLYR-TUTORIAL—Data-Manipulation-with-DPLYR-in-R-





Why to use dplyr ?

  • It is really useful for data exploration and transformation
  • Fast while dealing with data frames

Functionality of dplyr

  • It is useful while dealing with : ‘select’,‘filter’, ‘mutate’ , ’ arrange’ , ‘summarise’, which can be use as five basic verbs
  • Can be very useful in handling inner joins, left join, semi-join, anti-join
# loading packages 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(datasets)
#install.packages("hflights")

library(hflights)
# Lets explore data 

data("hflights")
head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 5424 2011     1          1         6    1400    1500            AA       428
## 5425 2011     1          2         7    1401    1501            AA       428
## 5426 2011     1          3         1    1352    1502            AA       428
## 5427 2011     1          4         2    1403    1513            AA       428
## 5428 2011     1          5         3    1405    1507            AA       428
## 5429 2011     1          6         4    1359    1503            AA       428
##      TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 5424  N576AA                60      40      -10        0    IAH  DFW      224
## 5425  N557AA                60      45       -9        1    IAH  DFW      224
## 5426  N541AA                70      48       -8       -8    IAH  DFW      224
## 5427  N403AA                70      39        3        3    IAH  DFW      224
## 5428  N492AA                62      44       -3        5    IAH  DFW      224
## 5429  N262AA                64      45       -7       -1    IAH  DFW      224
##      TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424      7      13         0                         0
## 5425      6       9         0                         0
## 5426      5      17         0                         0
## 5427      9      22         0                         0
## 5428      9       9         0                         0
## 5429      6      13         0                         0
  • ‘as_tibble’ creates a “a local data frame”
  • Tibble data frame will print first ten rows and total columns which fit to the screen in a systematic manner in comparison to raw data
# Convert to tibble 

flights <-  as_tibble(hflights)
flights
## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # ... with 227,486 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

Lets use filter to understand it ,

If we want to view all flights on February 1 we can use the following two methods

METHOD - I

flights[flights$Month ==2 & flights$DayofMonth ==1, ]
## # A tibble: 577 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     2          1         2    1401    1539 AA                  428
##  2  2011     2          1         2      NA      NA AA                  460
##  3  2011     2          1         2      NA      NA AA                  533
##  4  2011     2          1         2      NA      NA AA                 1121
##  5  2011     2          1         2    1746    2109 AA                 1294
##  6  2011     2          1         2      NA      NA AA                 1436
##  7  2011     2          1         2    1032    1358 AA                 1700
##  8  2011     2          1         2      NA      NA AA                 1820
##  9  2011     2          1         2     558     912 AA                 1994
## 10  2011     2          1         2    1820    2112 AS                  731
## # ... with 567 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

METHOD - II Using Filter

filter(flights, flights$Month ==2 , flights$DayofMonth ==1)
## # A tibble: 577 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     2          1         2    1401    1539 AA                  428
##  2  2011     2          1         2      NA      NA AA                  460
##  3  2011     2          1         2      NA      NA AA                  533
##  4  2011     2          1         2      NA      NA AA                 1121
##  5  2011     2          1         2    1746    2109 AA                 1294
##  6  2011     2          1         2      NA      NA AA                 1436
##  7  2011     2          1         2    1032    1358 AA                 1700
##  8  2011     2          1         2      NA      NA AA                 1820
##  9  2011     2          1         2     558     912 AA                 1994
## 10  2011     2          1         2    1820    2112 AS                  731
## # ... with 567 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

If we want to use AND that means applying condition on two attributes

filter(flights, UniqueCarrier =="AA" | UniqueCarrier == "UA" )
## # A tibble: 5,316 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # ... with 5,306 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

We can use filter in the following manner as well by uinsg %in%

filter(flights, UniqueCarrier %in% c("AA","UA") )
## # A tibble: 5,316 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # ... with 5,306 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

We can use select function to select columns by name

select(flights, DepTime, ArrTime, FlightNum)
## # A tibble: 227,496 x 3
##    DepTime ArrTime FlightNum
##      <int>   <int>     <int>
##  1    1400    1500       428
##  2    1401    1501       428
##  3    1352    1502       428
##  4    1403    1513       428
##  5    1405    1507       428
##  6    1359    1503       428
##  7    1359    1509       428
##  8    1355    1454       428
##  9    1443    1554       428
## 10    1443    1553       428
## # ... with 227,486 more rows

We can use colon to select multiple columns in continuation in a dataframe, also we can use ‘matches’ to match the exact word of column name

select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
## # A tibble: 227,496 x 7
##     Year Month DayofMonth TaxiIn TaxiOut ArrDelay DepDelay
##    <int> <int>      <int>  <int>   <int>    <int>    <int>
##  1  2011     1          1      7      13      -10        0
##  2  2011     1          2      6       9       -9        1
##  3  2011     1          3      5      17       -8       -8
##  4  2011     1          4      9      22        3        3
##  5  2011     1          5      9       9       -3        5
##  6  2011     1          6      6      13       -7       -1
##  7  2011     1          7     12      15       -1       -1
##  8  2011     1          8      7      12      -16       -5
##  9  2011     1          9      8      22       44       43
## 10  2011     1         10      6      19       43       43
## # ... with 227,486 more rows

We can also use pipeline or chain for using multiple operation

Here we are filtering two columns and also using certain condition on a column

flights %>%  select(UniqueCarrier, DepDelay) %>% 
  filter(DepDelay > 50)
## # A tibble: 12,901 x 2
##    UniqueCarrier DepDelay
##    <chr>            <int>
##  1 AA                  90
##  2 AA                  67
##  3 AA                  55
##  4 AA                  74
##  5 AA                 125
##  6 AA                  82
##  7 AA                  99
##  8 AA                  53
##  9 AA                  70
## 10 AA                  61
## # ... with 12,891 more rows

Using arrange : to reorder rows

flights %>% 
  select(UniqueCarrier, DepDelay) %>% 
  arrange(DepDelay) # By default it will give ascending order of  the rows in DepDelay
## # A tibble: 227,496 x 2
##    UniqueCarrier DepDelay
##    <chr>            <int>
##  1 OO                 -33
##  2 MQ                 -23
##  3 XE                 -19
##  4 XE                 -19
##  5 CO                 -18
##  6 EV                 -18
##  7 XE                 -17
##  8 CO                 -17
##  9 XE                 -17
## 10 MQ                 -17
## # ... with 227,486 more rows

If we want to use descending order then simply we can use the following :

flights %>% 
  select(UniqueCarrier, DepDelay) %>% 
  arrange(desc(DepDelay))
## # A tibble: 227,496 x 2
##    UniqueCarrier DepDelay
##    <chr>            <int>
##  1 CO                 981
##  2 AA                 970
##  3 MQ                 931
##  4 UA                 869
##  5 MQ                 814
##  6 MQ                 803
##  7 CO                 780
##  8 CO                 758
##  9 DL                 730
## 10 MQ                 691
## # ... with 227,486 more rows

note : is.na() can be useful while filtering na values

flights %>%  filter(!is.na(DepTime))
## # A tibble: 224,591 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # ... with 224,581 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

slice() filter rows by position

flights %>%  slice(100:150)
## # A tibble: 51 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1         12         3    1631    1739 AA                 1121
##  2  2011     1         13         4    1630    1733 AA                 1121
##  3  2011     1         14         5    1629    1734 AA                 1121
##  4  2011     1         15         6    1632    1736 AA                 1121
##  5  2011     1         16         7    1708    1819 AA                 1121
##  6  2011     1         17         1    1632    1744 AA                 1121
##  7  2011     1         18         2    1625    1740 AA                 1121
##  8  2011     1         19         3    1629    1731 AA                 1121
##  9  2011     1         20         4    1641    1752 AA                 1121
## 10  2011     1         21         5    1638    1746 AA                 1121
## # ... with 41 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

keeping the first four rows within each group

flights %>%  group_by(Month, DayofMonth) %>% slice(1:4) 
## # A tibble: 1,460 x 21
## # Groups:   Month, DayofMonth [365]
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          1         6     728     840 AA                  460
##  3  2011     1          1         6    1631    1736 AA                 1121
##  4  2011     1          1         6    1756    2112 AA                 1294
##  5  2011     1          2         7    1401    1501 AA                  428
##  6  2011     1          2         7     719     821 AA                  460
##  7  2011     1          2         7    1959    2106 AA                  533
##  8  2011     1          2         7    1636    1759 AA                 1121
##  9  2011     1          3         1    1352    1502 AA                  428
## 10  2011     1          3         1     717     834 AA                  460
## # ... with 1,450 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

Adding a new column by using functions on the existing data

flights %>% 
  select(Distance, AirTime) %>% 
  mutate(Speed = Distance / AirTime*60) # here mutate will give us new column by the name of the speed 
## # A tibble: 227,496 x 3
##    Distance AirTime Speed
##       <int>   <int> <dbl>
##  1      224      40  336 
##  2      224      45  299.
##  3      224      48  280 
##  4      224      39  345.
##  5      224      44  305.
##  6      224      45  299.
##  7      224      43  313.
##  8      224      40  336 
##  9      224      41  328.
## 10      224      45  299.
## # ... with 227,486 more rows

Using summarise , group_by functions to reduce variables to values

# You are advised to use the following two codes to see how it will show data by using na.rm = TRUE and without using na.rm 
# It is further advisable to use na.rm while finding mean, median or other calculation on any data where missing values are there 



flights %>% 
  group_by(Dest) %>% 
  summarise(avg_delay = mean(ArrDelay))
## # A tibble: 116 x 2
##    Dest  avg_delay
##    <chr>     <dbl>
##  1 ABQ          NA
##  2 AEX          NA
##  3 AGS           4
##  4 AMA          NA
##  5 ANC          NA
##  6 ASE          NA
##  7 ATL          NA
##  8 AUS          NA
##  9 AVL          NA
## 10 BFL          NA
## # ... with 106 more rows
flights %>% 
  group_by(Dest) %>% 
  summarise(avg_delay = mean(ArrDelay, na.rm = TRUE))
## # A tibble: 116 x 2
##    Dest  avg_delay
##    <chr>     <dbl>
##  1 ABQ        7.23
##  2 AEX        5.84
##  3 AGS        4   
##  4 AMA        6.84
##  5 ANC       26.1 
##  6 ASE        6.79
##  7 ATL        8.23
##  8 AUS        7.45
##  9 AVL        9.97
## 10 BFL      -13.2 
## # ... with 106 more rows
# Lets find the percentage of flights cancelled or diverted

flights %>% 
  group_by(UniqueCarrier) %>% 
  summarise_each(funs(mean), Cancelled , Diverted)
## Warning: `summarise_each_()` was deprecated in dplyr 0.7.0.
## Please use `across()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## # A tibble: 15 x 3
##    UniqueCarrier Cancelled Diverted
##    <chr>             <dbl>    <dbl>
##  1 AA              0.0185   0.00185
##  2 AS              0        0.00274
##  3 B6              0.0259   0.00576
##  4 CO              0.00678  0.00263
##  5 DL              0.0159   0.00303
##  6 EV              0.0345   0.00318
##  7 F9              0.00716  0      
##  8 FL              0.00982  0.00327
##  9 MQ              0.0290   0.00194
## 10 OO              0.0139   0.00349
## 11 UA              0.0164   0.00241
## 12 US              0.0113   0.00147
## 13 WN              0.0155   0.00229
## 14 XE              0.0155   0.00345
## 15 YV              0.0127   0

for each carrier, lets calculate the minimum and maximum arrival and departure delay

flights %>% 
  group_by(UniqueCarrier) %>% 
  summarise_each(funs(min( . , na.rm = TRUE), max(. , na.rm = TRUE)), matches("Delay"))
## # A tibble: 15 x 5
##    UniqueCarrier ArrDelay_min DepDelay_min ArrDelay_max DepDelay_max
##    <chr>                <int>        <int>        <int>        <int>
##  1 AA                     -39          -15          978          970
##  2 AS                     -43          -15          183          172
##  3 B6                     -44          -14          335          310
##  4 CO                     -55          -18          957          981
##  5 DL                     -32          -17          701          730
##  6 EV                     -40          -18          469          479
##  7 F9                     -24          -15          277          275
##  8 FL                     -30          -14          500          507
##  9 MQ                     -38          -23          918          931
## 10 OO                     -57          -33          380          360
## 11 UA                     -47          -11          861          869
## 12 US                     -42          -17          433          425
## 13 WN                     -44          -10          499          548
## 14 XE                     -70          -19          634          628
## 15 YV                     -32          -11           72           54

Helper functions ‘n()’ counts the number of rows in a group

Helper function ‘n_distinct(vector)’ counts the number of distinct items in that vector

# for each day of the year , count the total number of flights and sort in descending order 

library(dplyr)
flights %>% 
  group_by(Month, DayofMonth) %>% 
  summarise(flight_count = n()) %>% 
  arrange(desc(flight_count))
## `summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.
## # A tibble: 365 x 3
## # Groups:   Month [12]
##    Month DayofMonth flight_count
##    <int>      <int>        <int>
##  1     8          4          706
##  2     8         11          706
##  3     8         12          706
##  4     8          5          705
##  5     8          3          704
##  6     8         10          704
##  7     1          3          702
##  8     7          7          702
##  9     7         14          702
## 10     7         28          701
## # ... with 355 more rows

We can also use “tally” function

flights %>% 
  group_by(Month, DayofMonth) %>% 
  tally(sort = TRUE)
## # A tibble: 365 x 3
## # Groups:   Month [12]
##    Month DayofMonth     n
##    <int>      <int> <int>
##  1     8          4   706
##  2     8         11   706
##  3     8         12   706
##  4     8          5   705
##  5     8          3   704
##  6     8         10   704
##  7     1          3   702
##  8     7          7   702
##  9     7         14   702
## 10     7         28   701
## # ... with 355 more rows

for each destination, count the total number of flights and the number of distinct planes that flew there

flights %>% 
  group_by(Dest) %>% 
  summarise(flight_count = n(), plane_count = n_distinct(TailNum))
## # A tibble: 116 x 3
##    Dest  flight_count plane_count
##    <chr>        <int>       <int>
##  1 ABQ           2812         716
##  2 AEX            724         215
##  3 AGS              1           1
##  4 AMA           1297         158
##  5 ANC            125          38
##  6 ASE            125          60
##  7 ATL           7886         983
##  8 AUS           5022        1015
##  9 AVL            350         142
## 10 BFL            504          70
## # ... with 106 more rows

for each destination lets check the cancelled and not cancelled flights

flights %>% 
  group_by(Dest) %>% 
  select(Cancelled) %>% 
  table()
## Adding missing grouping variables: `Dest`
##      Cancelled
## Dest     0    1
##   ABQ 2787   25
##   AEX  712   12
##   AGS    1    0
##   AMA 1265   32
##   ANC  125    0
##   ASE  120    5
##   ATL 7745  141
##   AUS 4995   27
##   AVL  347    3
##   BFL  503    1
##   BHM 2697   39
##   BKG  108    2
##   BNA 3451   30
##   BOS 1724   28
##   BPT    3    0
##   BRO 1665   27
##   BTR 1733   29
##   BWI 2527   24
##   CAE  547   14
##   CHS 1191    9
##   CID  408    2
##   CLE 2132    8
##   CLT 4671   64
##   CMH 1334   14
##   COS 1637   20
##   CRP 4720   93
##   CRW  350    7
##   CVG 1518   17
##   DAL 9378  442
##   DAY  446    5
##   DCA 2664   35
##   DEN 5892   28
##   DFW 6500  153
##   DSM  635   12
##   DTW 2568   33
##   ECP  727    2
##   EGE  108    2
##   ELP 3012   24
##   EWR 4244   70
##   FLL 2455    7
##   GJT  401    2
##   GPT 1586   32
##   GRK   40    2
##   GRR  672    5
##   GSO  624    6
##   GSP 1116    7
##   GUC   86    0
##   HDN  109    1
##   HNL  401    1
##   HOB  299   10
##   HRL 3881  102
##   HSV  911   12
##   IAD 1958   22
##   ICT 1484   33
##   IND 1726   24
##   JAN 1984   27
##   JAX 2123   12
##   JFK  677   18
##   LAS 4067   15
##   LAX 6031   33
##   LBB 1309   24
##   LCH  352   12
##   LEX  578    6
##   LFT 2257   56
##   LGA 2681   49
##   LIT 1553   26
##   LRD 1168   20
##   MAF 2263   43
##   MCI 3133   41
##   MCO 3671   16
##   MDW 2072   22
##   MEM 2352   47
##   MFE 1116   12
##   MIA 2439   24
##   MKE 1568   20
##   MLU  288    4
##   MOB 1641   33
##   MSP 1986   24
##   MSY 6783   40
##   MTJ  163    1
##   OAK  685    5
##   OKC 3114   56
##   OMA 2025   19
##   ONT  950    2
##   ORD 5649   99
##   ORF  711    6
##   PBI 1242   11
##   PDX 1232    3
##   PHL 2340   27
##   PHX 5067   29
##   PIT 1652   12
##   PNS 1516   23
##   PSP  106    0
##   RDU 1727   13
##   RIC  893    7
##   RNO  243    0
##   RSW  941    7
##   SAN 2924   12
##   SAT 4853   40
##   SAV  855    8
##   SDF 1269   10
##   SEA 2611    4
##   SFO 2804   14
##   SHV  778    9
##   SJC  884    1
##   SJU  389    2
##   SLC 2024    9
##   SMF 1011    3
##   SNA 1651   10
##   STL 2479   30
##   TPA 3074   11
##   TUL 2870   54
##   TUS 1550   15
##   TYS 1202    8
##   VPS  870   10
##   XNA 1138   34

Window functions

# for each carrier , calculate which two days of the year they had trheir longest departure delays 

flights %>% 
  group_by(UniqueCarrier) %>% 
  select(Month, DayofMonth, DepDelay) %>% 
  filter(min_rank(desc(DepDelay))<= 2) %>% 
  arrange(UniqueCarrier, desc(DepDelay))
## Adding missing grouping variables: `UniqueCarrier`
## # A tibble: 30 x 4
## # Groups:   UniqueCarrier [15]
##    UniqueCarrier Month DayofMonth DepDelay
##    <chr>         <int>      <int>    <int>
##  1 AA               12         12      970
##  2 AA               11         19      677
##  3 AS                2         28      172
##  4 AS                7          6      138
##  5 B6               10         29      310
##  6 B6                8         19      283
##  7 CO                8          1      981
##  8 CO                1         20      780
##  9 DL               10         25      730
## 10 DL                4          5      497
## # ... with 20 more rows

for each month, calculate the number of flights and the change from the preivous month

flights %>% 
  group_by(Month) %>% 
  summarise(flight_count = n()) %>% 
  mutate(change = flight_count - lag(flight_count))
## # A tibble: 12 x 3
##    Month flight_count change
##    <int>        <int>  <int>
##  1     1        18910     NA
##  2     2        17128  -1782
##  3     3        19470   2342
##  4     4        18593   -877
##  5     5        19172    579
##  6     6        19600    428
##  7     7        20548    948
##  8     8        20176   -372
##  9     9        18065  -2111
## 10    10        18696    631
## 11    11        18021   -675
## 12    12        19117   1096

we can also use tally() function

flights %>% 
  group_by(Month) %>% 
  tally() %>% 
  mutate(change = n - lag(n)) # where n is flight count 
## # A tibble: 12 x 3
##    Month     n change
##    <int> <int>  <int>
##  1     1 18910     NA
##  2     2 17128  -1782
##  3     3 19470   2342
##  4     4 18593   -877
##  5     5 19172    579
##  6     6 19600    428
##  7     7 20548    948
##  8     8 20176   -372
##  9     9 18065  -2111
## 10    10 18696    631
## 11    11 18021   -675
## 12    12 19117   1096

Few more function

# need to see few rows for random sampling without replacement

flights %>% sample_n(5)
## # A tibble: 5 x 21
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
## 1  2011     4          3         7    2102    2153 XE                 2803
## 2  2011    10         11         2    1440    1825 CO                 1539
## 3  2011     8         29         1    1100    1304 WN                 1317
## 4  2011     8         18         4    1545    1751 XE                 2529
## 5  2011     2         13         7    1440    1630 XE                 2998
## # ... with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## #   AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>, Dest <chr>,
## #   Distance <int>, TaxiIn <int>, TaxiOut <int>, Cancelled <int>,
## #   CancellationCode <chr>, Diverted <int>
# randomly sample a fraction of rows, with replacement 

flights %>%  sample_frac(0.25, replace = TRUE)
## # A tibble: 56,874 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     3         25         5      49     142 XE                 2450
##  2  2011     9          5         1    1148    1243 CO                 1101
##  3  2011     3          8         2    1334    1450 WN                 1896
##  4  2011    10          6         4    1041    1430 WN                  734
##  5  2011    10         28         5    1442    1535 WN                  249
##  6  2011     8          3         3    1929    2016 XE                 2418
##  7  2011     3          1         2     719     921 XE                 2888
##  8  2011     3         17         4    1946    2258 XE                 2822
##  9  2011     6          1         3    1301    1405 XE                 2189
## 10  2011     3         16         3     729    1215 CO                  282
## # ... with 56,864 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

In base R , we use the following to check the structure of object

str(flights)
## tibble [227,496 x 21] (S3: tbl_df/tbl/data.frame)
##  $ Year             : int [1:227496] 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int [1:227496] 1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int [1:227496] 1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek        : int [1:227496] 6 7 1 2 3 4 5 6 7 1 ...
##  $ DepTime          : int [1:227496] 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
##  $ ArrTime          : int [1:227496] 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
##  $ UniqueCarrier    : chr [1:227496] "AA" "AA" "AA" "AA" ...
##  $ FlightNum        : int [1:227496] 428 428 428 428 428 428 428 428 428 428 ...
##  $ TailNum          : chr [1:227496] "N576AA" "N557AA" "N541AA" "N403AA" ...
##  $ ActualElapsedTime: int [1:227496] 60 60 70 70 62 64 70 59 71 70 ...
##  $ AirTime          : int [1:227496] 40 45 48 39 44 45 43 40 41 45 ...
##  $ ArrDelay         : int [1:227496] -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
##  $ DepDelay         : int [1:227496] 0 1 -8 3 5 -1 -1 -5 43 43 ...
##  $ Origin           : chr [1:227496] "IAH" "IAH" "IAH" "IAH" ...
##  $ Dest             : chr [1:227496] "DFW" "DFW" "DFW" "DFW" ...
##  $ Distance         : int [1:227496] 224 224 224 224 224 224 224 224 224 224 ...
##  $ TaxiIn           : int [1:227496] 7 6 5 9 9 6 12 7 8 6 ...
##  $ TaxiOut          : int [1:227496] 13 9 17 22 9 13 15 12 22 19 ...
##  $ Cancelled        : int [1:227496] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : chr [1:227496] "" "" "" "" ...
##  $ Diverted         : int [1:227496] 0 0 0 0 0 0 0 0 0 0 ...

Using dplyr to check the structure of object

glimpse(flights)
## Rows: 227,496
## Columns: 21
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011~
## $ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2~
## $ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 1355, 1443~
## $ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 1454, 1554~
## $ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"~
## $ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428, 428, 42~
## $ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N492AA", "N~
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, 56, 63, ~
## $ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, 41, 44, ~
## $ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29, 5, -9, ~
## $ DepDelay          <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, -2, -3, ~
## $ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IA~
## $ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DF~
## $ Distance          <int> 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 22~
## $ TaxiIn            <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6, 12, 8,~
## $ TaxiOut           <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11, 13, 15~
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", ~
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~

Now time to connect the database

  • dplyr can connect to database
  • only generate select statements



Company

Our ebook website brings you the convenience of instant access to a diverse range of titles, spanning genres from fiction and non-fiction to self-help, business.

Features

Most Recent Posts

eBook App for FREE

Lorem Ipsum is simply dumy text of the printing typesetting industry lorem.

Hot

Category

Our ebook website brings you the convenience of instant access.

Help

Privacy Policy

Mailing List

© 2023 Created with Royal Elementor Addons