DPLYR-TUTORIAL—Data-Manipulation-with-DPLYR-in-R-
Sachin Sharma
August 31, 2021
Why to use dplyr ?
- It is really useful for data exploration and transformation
- Fast while dealing with data frames
Functionality of dplyr
- It is useful while dealing with : ‘select’,‘filter’, ‘mutate’ , ’ arrange’ , ‘summarise’, which can be use as five basic verbs
- Can be very useful in handling inner joins, left join, semi-join, anti-join
# loading packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(datasets)
#install.packages("hflights")
library(hflights)
# Lets explore data
data("hflights")
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 5424 2011 1 1 6 1400 1500 AA 428
## 5425 2011 1 2 7 1401 1501 AA 428
## 5426 2011 1 3 1 1352 1502 AA 428
## 5427 2011 1 4 2 1403 1513 AA 428
## 5428 2011 1 5 3 1405 1507 AA 428
## 5429 2011 1 6 4 1359 1503 AA 428
## TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 5424 N576AA 60 40 -10 0 IAH DFW 224
## 5425 N557AA 60 45 -9 1 IAH DFW 224
## 5426 N541AA 70 48 -8 -8 IAH DFW 224
## 5427 N403AA 70 39 3 3 IAH DFW 224
## 5428 N492AA 62 44 -3 5 IAH DFW 224
## 5429 N262AA 64 45 -7 -1 IAH DFW 224
## TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 7 13 0 0
## 5425 6 9 0 0
## 5426 5 17 0 0
## 5427 9 22 0 0
## 5428 9 9 0 0
## 5429 6 13 0 0
- ‘as_tibble’ creates a “a local data frame”
- Tibble data frame will print first ten rows and total columns which fit to the screen in a systematic manner in comparison to raw data
# Convert to tibble
flights <- as_tibble(hflights)
flights
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## 7 2011 1 7 5 1359 1509 AA 428
## 8 2011 1 8 6 1355 1454 AA 428
## 9 2011 1 9 7 1443 1554 AA 428
## 10 2011 1 10 1 1443 1553 AA 428
## # ... with 227,486 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## # Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
Lets use filter to understand it ,
If we want to view all flights on February 1 we can use the following two methods
METHOD – I
flights[flights$Month ==2 & flights$DayofMonth ==1, ]
## # A tibble: 577 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 2 1 2 1401 1539 AA 428
## 2 2011 2 1 2 NA NA AA 460
## 3 2011 2 1 2 NA NA AA 533
## 4 2011 2 1 2 NA NA AA 1121
## 5 2011 2 1 2 1746 2109 AA 1294
## 6 2011 2 1 2 NA NA AA 1436
## 7 2011 2 1 2 1032 1358 AA 1700
## 8 2011 2 1 2 NA NA AA 1820
## 9 2011 2 1 2 558 912 AA 1994
## 10 2011 2 1 2 1820 2112 AS 731
## # ... with 567 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## # Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
METHOD – II Using Filter
filter(flights, flights$Month ==2 , flights$DayofMonth ==1)
## # A tibble: 577 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 2 1 2 1401 1539 AA 428
## 2 2011 2 1 2 NA NA AA 460
## 3 2011 2 1 2 NA NA AA 533
## 4 2011 2 1 2 NA NA AA 1121
## 5 2011 2 1 2 1746 2109 AA 1294
## 6 2011 2 1 2 NA NA AA 1436
## 7 2011 2 1 2 1032 1358 AA 1700
## 8 2011 2 1 2 NA NA AA 1820
## 9 2011 2 1 2 558 912 AA 1994
## 10 2011 2 1 2 1820 2112 AS 731
## # ... with 567 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## # Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
If we want to use AND that means applying condition on two attributes
filter(flights, UniqueCarrier =="AA" | UniqueCarrier == "UA" )
## # A tibble: 5,316 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## 7 2011 1 7 5 1359 1509 AA 428
## 8 2011 1 8 6 1355 1454 AA 428
## 9 2011 1 9 7 1443 1554 AA 428
## 10 2011 1 10 1 1443 1553 AA 428
## # ... with 5,306 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## # Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
We can use filter in the following manner as well by uinsg %in%
filter(flights, UniqueCarrier %in% c("AA","UA") )
## # A tibble: 5,316 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## 7 2011 1 7 5 1359 1509 AA 428
## 8 2011 1 8 6 1355 1454 AA 428
## 9 2011 1 9 7 1443 1554 AA 428
## 10 2011 1 10 1 1443 1553 AA 428
## # ... with 5,306 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## # Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
We can use select function to select columns by name
select(flights, DepTime, ArrTime, FlightNum)
## # A tibble: 227,496 x 3
## DepTime ArrTime FlightNum
## <int> <int> <int>
## 1 1400 1500 428
## 2 1401 1501 428
## 3 1352 1502 428
## 4 1403 1513 428
## 5 1405 1507 428
## 6 1359 1503 428
## 7 1359 1509 428
## 8 1355 1454 428
## 9 1443 1554 428
## 10 1443 1553 428
## # ... with 227,486 more rows
We can use colon to select multiple columns in continuation in a dataframe, also we can use ‘matches’ to match the exact word of column name
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
## # A tibble: 227,496 x 7
## Year Month DayofMonth TaxiIn TaxiOut ArrDelay DepDelay
## <int> <int> <int> <int> <int> <int> <int>
## 1 2011 1 1 7 13 -10 0
## 2 2011 1 2 6 9 -9 1
## 3 2011 1 3 5 17 -8 -8
## 4 2011 1 4 9 22 3 3
## 5 2011 1 5 9 9 -3 5
## 6 2011 1 6 6 13 -7 -1
## 7 2011 1 7 12 15 -1 -1
## 8 2011 1 8 7 12 -16 -5
## 9 2011 1 9 8 22 44 43
## 10 2011 1 10 6 19 43 43
## # ... with 227,486 more rows
We can also use pipeline or chain for using multiple operation
Here we are filtering two columns and also using certain condition on a column
flights %>% select(UniqueCarrier, DepDelay) %>%
filter(DepDelay > 50)
## # A tibble: 12,901 x 2
## UniqueCarrier DepDelay
## <chr> <int>
## 1 AA 90
## 2 AA 67
## 3 AA 55
## 4 AA 74
## 5 AA 125
## 6 AA 82
## 7 AA 99
## 8 AA 53
## 9 AA 70
## 10 AA 61
## # ... with 12,891 more rows
Using arrange : to reorder rows
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(DepDelay) # By default it will give ascending order of the rows in DepDelay
## # A tibble: 227,496 x 2
## UniqueCarrier DepDelay
## <chr> <int>
## 1 OO -33
## 2 MQ -23
## 3 XE -19
## 4 XE -19
## 5 CO -18
## 6 EV -18
## 7 XE -17
## 8 CO -17
## 9 XE -17
## 10 MQ -17
## # ... with 227,486 more rows
If we want to use descending order then simply we can use the following :
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay))
## # A tibble: 227,496 x 2
## UniqueCarrier DepDelay
## <chr> <int>
## 1 CO 981
## 2 AA 970
## 3 MQ 931
## 4 UA 869
## 5 MQ 814
## 6 MQ 803
## 7 CO 780
## 8 CO 758
## 9 DL 730
## 10 MQ 691
## # ... with 227,486 more rows
note : is.na() can be useful while filtering na values
flights %>% filter(!is.na(DepTime))
## # A tibble: 224,591 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## 7 2011 1 7 5 1359 1509 AA 428
## 8 2011 1 8 6 1355 1454 AA 428
## 9 2011 1 9 7 1443 1554 AA 428
## 10 2011 1 10 1 1443 1553 AA 428
## # ... with 224,581 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## # Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
slice() filter rows by position
flights %>% slice(100:150)
## # A tibble: 51 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 12 3 1631 1739 AA 1121
## 2 2011 1 13 4 1630 1733 AA 1121
## 3 2011 1 14 5 1629 1734 AA 1121
## 4 2011 1 15 6 1632 1736 AA 1121
## 5 2011 1 16 7 1708 1819 AA 1121
## 6 2011 1 17 1 1632 1744 AA 1121
## 7 2011 1 18 2 1625 1740 AA 1121
## 8 2011 1 19 3 1629 1731 AA 1121
## 9 2011 1 20 4 1641 1752 AA 1121
## 10 2011 1 21 5 1638 1746 AA 1121
## # ... with 41 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## # Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
keeping the first four rows within each group
flights %>% group_by(Month, DayofMonth) %>% slice(1:4)
## # A tibble: 1,460 x 21
## # Groups: Month, DayofMonth [365]
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 1 6 728 840 AA 460
## 3 2011 1 1 6 1631 1736 AA 1121
## 4 2011 1 1 6 1756 2112 AA 1294
## 5 2011 1 2 7 1401 1501 AA 428
## 6 2011 1 2 7 719 821 AA 460
## 7 2011 1 2 7 1959 2106 AA 533
## 8 2011 1 2 7 1636 1759 AA 1121
## 9 2011 1 3 1 1352 1502 AA 428
## 10 2011 1 3 1 717 834 AA 460
## # ... with 1,450 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## # Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
Adding a new column by using functions on the existing data
flights %>%
select(Distance, AirTime) %>%
mutate(Speed = Distance / AirTime*60) # here mutate will give us new column by the name of the speed
## # A tibble: 227,496 x 3
## Distance AirTime Speed
## <int> <int> <dbl>
## 1 224 40 336
## 2 224 45 299.
## 3 224 48 280
## 4 224 39 345.
## 5 224 44 305.
## 6 224 45 299.
## 7 224 43 313.
## 8 224 40 336
## 9 224 41 328.
## 10 224 45 299.
## # ... with 227,486 more rows
Using summarise , group_by functions to reduce variables to values
# You are advised to use the following two codes to see how it will show data by using na.rm = TRUE and without using na.rm
# It is further advisable to use na.rm while finding mean, median or other calculation on any data where missing values are there
flights %>%
group_by(Dest) %>%
summarise(avg_delay = mean(ArrDelay))
## # A tibble: 116 x 2
## Dest avg_delay
## <chr> <dbl>
## 1 ABQ NA
## 2 AEX NA
## 3 AGS 4
## 4 AMA NA
## 5 ANC NA
## 6 ASE NA
## 7 ATL NA
## 8 AUS NA
## 9 AVL NA
## 10 BFL NA
## # ... with 106 more rows
flights %>%
group_by(Dest) %>%
summarise(avg_delay = mean(ArrDelay, na.rm = TRUE))
## # A tibble: 116 x 2
## Dest avg_delay
## <chr> <dbl>
## 1 ABQ 7.23
## 2 AEX 5.84
## 3 AGS 4
## 4 AMA 6.84
## 5 ANC 26.1
## 6 ASE 6.79
## 7 ATL 8.23
## 8 AUS 7.45
## 9 AVL 9.97
## 10 BFL -13.2
## # ... with 106 more rows
# Lets find the percentage of flights cancelled or diverted
flights %>%
group_by(UniqueCarrier) %>%
summarise_each(funs(mean), Cancelled , Diverted)
## Warning: `summarise_each_()` was deprecated in dplyr 0.7.0.
## Please use `across()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## # A tibble: 15 x 3
## UniqueCarrier Cancelled Diverted
## <chr> <dbl> <dbl>
## 1 AA 0.0185 0.00185
## 2 AS 0 0.00274
## 3 B6 0.0259 0.00576
## 4 CO 0.00678 0.00263
## 5 DL 0.0159 0.00303
## 6 EV 0.0345 0.00318
## 7 F9 0.00716 0
## 8 FL 0.00982 0.00327
## 9 MQ 0.0290 0.00194
## 10 OO 0.0139 0.00349
## 11 UA 0.0164 0.00241
## 12 US 0.0113 0.00147
## 13 WN 0.0155 0.00229
## 14 XE 0.0155 0.00345
## 15 YV 0.0127 0
for each carrier, lets calculate the minimum and maximum arrival and departure delay
flights %>%
group_by(UniqueCarrier) %>%
summarise_each(funs(min( . , na.rm = TRUE), max(. , na.rm = TRUE)), matches("Delay"))
## # A tibble: 15 x 5
## UniqueCarrier ArrDelay_min DepDelay_min ArrDelay_max DepDelay_max
## <chr> <int> <int> <int> <int>
## 1 AA -39 -15 978 970
## 2 AS -43 -15 183 172
## 3 B6 -44 -14 335 310
## 4 CO -55 -18 957 981
## 5 DL -32 -17 701 730
## 6 EV -40 -18 469 479
## 7 F9 -24 -15 277 275
## 8 FL -30 -14 500 507
## 9 MQ -38 -23 918 931
## 10 OO -57 -33 380 360
## 11 UA -47 -11 861 869
## 12 US -42 -17 433 425
## 13 WN -44 -10 499 548
## 14 XE -70 -19 634 628
## 15 YV -32 -11 72 54
Helper functions ‘n()’ counts the number of rows in a group
Helper function ‘n_distinct(vector)’ counts the number of distinct items in that vector
# for each day of the year , count the total number of flights and sort in descending order
library(dplyr)
flights %>%
group_by(Month, DayofMonth) %>%
summarise(flight_count = n()) %>%
arrange(desc(flight_count))
## `summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.
## # A tibble: 365 x 3
## # Groups: Month [12]
## Month DayofMonth flight_count
## <int> <int> <int>
## 1 8 4 706
## 2 8 11 706
## 3 8 12 706
## 4 8 5 705
## 5 8 3 704
## 6 8 10 704
## 7 1 3 702
## 8 7 7 702
## 9 7 14 702
## 10 7 28 701
## # ... with 355 more rows
We can also use “tally” function
flights %>%
group_by(Month, DayofMonth) %>%
tally(sort = TRUE)
## # A tibble: 365 x 3
## # Groups: Month [12]
## Month DayofMonth n
## <int> <int> <int>
## 1 8 4 706
## 2 8 11 706
## 3 8 12 706
## 4 8 5 705
## 5 8 3 704
## 6 8 10 704
## 7 1 3 702
## 8 7 7 702
## 9 7 14 702
## 10 7 28 701
## # ... with 355 more rows
for each destination, count the total number of flights and the number of distinct planes that flew there
flights %>%
group_by(Dest) %>%
summarise(flight_count = n(), plane_count = n_distinct(TailNum))
## # A tibble: 116 x 3
## Dest flight_count plane_count
## <chr> <int> <int>
## 1 ABQ 2812 716
## 2 AEX 724 215
## 3 AGS 1 1
## 4 AMA 1297 158
## 5 ANC 125 38
## 6 ASE 125 60
## 7 ATL 7886 983
## 8 AUS 5022 1015
## 9 AVL 350 142
## 10 BFL 504 70
## # ... with 106 more rows
for each destination lets check the cancelled and not cancelled flights
flights %>%
group_by(Dest) %>%
select(Cancelled) %>%
table()
## Adding missing grouping variables: `Dest`
## Cancelled
## Dest 0 1
## ABQ 2787 25
## AEX 712 12
## AGS 1 0
## AMA 1265 32
## ANC 125 0
## ASE 120 5
## ATL 7745 141
## AUS 4995 27
## AVL 347 3
## BFL 503 1
## BHM 2697 39
## BKG 108 2
## BNA 3451 30
## BOS 1724 28
## BPT 3 0
## BRO 1665 27
## BTR 1733 29
## BWI 2527 24
## CAE 547 14
## CHS 1191 9
## CID 408 2
## CLE 2132 8
## CLT 4671 64
## CMH 1334 14
## COS 1637 20
## CRP 4720 93
## CRW 350 7
## CVG 1518 17
## DAL 9378 442
## DAY 446 5
## DCA 2664 35
## DEN 5892 28
## DFW 6500 153
## DSM 635 12
## DTW 2568 33
## ECP 727 2
## EGE 108 2
## ELP 3012 24
## EWR 4244 70
## FLL 2455 7
## GJT 401 2
## GPT 1586 32
## GRK 40 2
## GRR 672 5
## GSO 624 6
## GSP 1116 7
## GUC 86 0
## HDN 109 1
## HNL 401 1
## HOB 299 10
## HRL 3881 102
## HSV 911 12
## IAD 1958 22
## ICT 1484 33
## IND 1726 24
## JAN 1984 27
## JAX 2123 12
## JFK 677 18
## LAS 4067 15
## LAX 6031 33
## LBB 1309 24
## LCH 352 12
## LEX 578 6
## LFT 2257 56
## LGA 2681 49
## LIT 1553 26
## LRD 1168 20
## MAF 2263 43
## MCI 3133 41
## MCO 3671 16
## MDW 2072 22
## MEM 2352 47
## MFE 1116 12
## MIA 2439 24
## MKE 1568 20
## MLU 288 4
## MOB 1641 33
## MSP 1986 24
## MSY 6783 40
## MTJ 163 1
## OAK 685 5
## OKC 3114 56
## OMA 2025 19
## ONT 950 2
## ORD 5649 99
## ORF 711 6
## PBI 1242 11
## PDX 1232 3
## PHL 2340 27
## PHX 5067 29
## PIT 1652 12
## PNS 1516 23
## PSP 106 0
## RDU 1727 13
## RIC 893 7
## RNO 243 0
## RSW 941 7
## SAN 2924 12
## SAT 4853 40
## SAV 855 8
## SDF 1269 10
## SEA 2611 4
## SFO 2804 14
## SHV 778 9
## SJC 884 1
## SJU 389 2
## SLC 2024 9
## SMF 1011 3
## SNA 1651 10
## STL 2479 30
## TPA 3074 11
## TUL 2870 54
## TUS 1550 15
## TYS 1202 8
## VPS 870 10
## XNA 1138 34
Window functions
# for each carrier , calculate which two days of the year they had trheir longest departure delays
flights %>%
group_by(UniqueCarrier) %>%
select(Month, DayofMonth, DepDelay) %>%
filter(min_rank(desc(DepDelay))<= 2) %>%
arrange(UniqueCarrier, desc(DepDelay))
## Adding missing grouping variables: `UniqueCarrier`
## # A tibble: 30 x 4
## # Groups: UniqueCarrier [15]
## UniqueCarrier Month DayofMonth DepDelay
## <chr> <int> <int> <int>
## 1 AA 12 12 970
## 2 AA 11 19 677
## 3 AS 2 28 172
## 4 AS 7 6 138
## 5 B6 10 29 310
## 6 B6 8 19 283
## 7 CO 8 1 981
## 8 CO 1 20 780
## 9 DL 10 25 730
## 10 DL 4 5 497
## # ... with 20 more rows
for each month, calculate the number of flights and the change from the preivous month
flights %>%
group_by(Month) %>%
summarise(flight_count = n()) %>%
mutate(change = flight_count - lag(flight_count))
## # A tibble: 12 x 3
## Month flight_count change
## <int> <int> <int>
## 1 1 18910 NA
## 2 2 17128 -1782
## 3 3 19470 2342
## 4 4 18593 -877
## 5 5 19172 579
## 6 6 19600 428
## 7 7 20548 948
## 8 8 20176 -372
## 9 9 18065 -2111
## 10 10 18696 631
## 11 11 18021 -675
## 12 12 19117 1096
we can also use tally() function
flights %>%
group_by(Month) %>%
tally() %>%
mutate(change = n - lag(n)) # where n is flight count
## # A tibble: 12 x 3
## Month n change
## <int> <int> <int>
## 1 1 18910 NA
## 2 2 17128 -1782
## 3 3 19470 2342
## 4 4 18593 -877
## 5 5 19172 579
## 6 6 19600 428
## 7 7 20548 948
## 8 8 20176 -372
## 9 9 18065 -2111
## 10 10 18696 631
## 11 11 18021 -675
## 12 12 19117 1096
Few more function
# need to see few rows for random sampling without replacement
flights %>% sample_n(5)
## # A tibble: 5 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 4 3 7 2102 2153 XE 2803
## 2 2011 10 11 2 1440 1825 CO 1539
## 3 2011 8 29 1 1100 1304 WN 1317
## 4 2011 8 18 4 1545 1751 XE 2529
## 5 2011 2 13 7 1440 1630 XE 2998
## # ... with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## # AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>, Dest <chr>,
## # Distance <int>, TaxiIn <int>, TaxiOut <int>, Cancelled <int>,
## # CancellationCode <chr>, Diverted <int>
# randomly sample a fraction of rows, with replacement
flights %>% sample_frac(0.25, replace = TRUE)
## # A tibble: 56,874 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 3 25 5 49 142 XE 2450
## 2 2011 9 5 1 1148 1243 CO 1101
## 3 2011 3 8 2 1334 1450 WN 1896
## 4 2011 10 6 4 1041 1430 WN 734
## 5 2011 10 28 5 1442 1535 WN 249
## 6 2011 8 3 3 1929 2016 XE 2418
## 7 2011 3 1 2 719 921 XE 2888
## 8 2011 3 17 4 1946 2258 XE 2822
## 9 2011 6 1 3 1301 1405 XE 2189
## 10 2011 3 16 3 729 1215 CO 282
## # ... with 56,864 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>, DepDelay <int>,
## # Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
In base R , we use the following to check the structure of object
str(flights)
## tibble [227,496 x 21] (S3: tbl_df/tbl/data.frame)
## $ Year : int [1:227496] 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ Month : int [1:227496] 1 1 1 1 1 1 1 1 1 1 ...
## $ DayofMonth : int [1:227496] 1 2 3 4 5 6 7 8 9 10 ...
## $ DayOfWeek : int [1:227496] 6 7 1 2 3 4 5 6 7 1 ...
## $ DepTime : int [1:227496] 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
## $ ArrTime : int [1:227496] 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
## $ UniqueCarrier : chr [1:227496] "AA" "AA" "AA" "AA" ...
## $ FlightNum : int [1:227496] 428 428 428 428 428 428 428 428 428 428 ...
## $ TailNum : chr [1:227496] "N576AA" "N557AA" "N541AA" "N403AA" ...
## $ ActualElapsedTime: int [1:227496] 60 60 70 70 62 64 70 59 71 70 ...
## $ AirTime : int [1:227496] 40 45 48 39 44 45 43 40 41 45 ...
## $ ArrDelay : int [1:227496] -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
## $ DepDelay : int [1:227496] 0 1 -8 3 5 -1 -1 -5 43 43 ...
## $ Origin : chr [1:227496] "IAH" "IAH" "IAH" "IAH" ...
## $ Dest : chr [1:227496] "DFW" "DFW" "DFW" "DFW" ...
## $ Distance : int [1:227496] 224 224 224 224 224 224 224 224 224 224 ...
## $ TaxiIn : int [1:227496] 7 6 5 9 9 6 12 7 8 6 ...
## $ TaxiOut : int [1:227496] 13 9 17 22 9 13 15 12 22 19 ...
## $ Cancelled : int [1:227496] 0 0 0 0 0 0 0 0 0 0 ...
## $ CancellationCode : chr [1:227496] "" "" "" "" ...
## $ Diverted : int [1:227496] 0 0 0 0 0 0 0 0 0 0 ...
Using dplyr to check the structure of object
glimpse(flights)
## Rows: 227,496
## Columns: 21
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011~
## $ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ DayOfWeek <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2~
## $ DepTime <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 1355, 1443~
## $ ArrTime <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 1454, 1554~
## $ UniqueCarrier <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"~
## $ FlightNum <int> 428, 428, 428, 428, 428, 428, 428, 428, 428, 428, 42~
## $ TailNum <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N492AA", "N~
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, 56, 63, ~
## $ AirTime <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, 41, 44, ~
## $ ArrDelay <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29, 5, -9, ~
## $ DepDelay <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, -2, -3, ~
## $ Origin <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IA~
## $ Dest <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DF~
## $ Distance <int> 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 22~
## $ TaxiIn <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6, 12, 8,~
## $ TaxiOut <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11, 13, 15~
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", ~
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
Now time to connect the database
- dplyr can connect to database
- only generate select statements
// add bootstrap table styles to pandoc tables function bootstrapStylePandocTables() { $('tr.odd').parent('tbody').parent('table').addClass('table table-condensed'); } $(document).ready(function () { bootstrapStylePandocTables(); });