Provides supercharged version of mutate
with group_by
, order_by
and aggregation over arbitrary window frame
around a row for dataframes and lazy (remote) tbl
s of class tbl_lazy
.
Arguments
- x
(
data.frame
ortbl_lazy
)- ...
expressions to be passed to
mutate
- .by
(character vector, optional: Yes) Columns to group by
- .order_by
(string, optional: Yes) Columns to order by
- .frame
(vector, optional: Yes) Vector of length 2 indicating the number of rows to consider before and after the current row. When argument
.index
is provided (typically a column of type date or datetime), before and after can be interval objects. See examples. When input istbl_lazy
, only number of rows as vector of length 2 is supported.- .index
(string, optional: Yes, default: NULL) index column. This is supported when input is a dataframe only.
- .desc
(flag, default: FALSE) Whether to order in descending order
- .complete
(flag, default: FALSE) This will be passed to
slider::slide
/slider::slide_vec
. Should the function be evaluated on complete windows only? If FALSE or NULL, the default, then partial computations will be allowed. This is supported when input is a dataframe only.
Details
A window function returns a value for every input row of a dataframe
or lazy_tbl
based on a group of rows (frame) in the neighborhood of the
input row. This function implements computation over groups (partition_by
in SQL) in a predefined order (order_by
in SQL) across a neighborhood of
rows (frame) defined by a (up, down) where
up/down are number of rows before and after the corresponding row
up/down are interval objects (ex:
c(days(2), days(1))
). Interval objects are currently supported for dataframe only. (nottbl_lazy
)
This implementation is inspired by spark's window API.
Implementation Details:
For dataframe input:
Iteration per row over the window is implemented using the versatile
slider
.Application of a window aggregation can be optionally run in parallel over multiple groups (see argument
.by
) by setting a future parallel backend. This is implemented using furrr package.function subsumes regular usecases of
mutate
For tbl_lazy
input:
Uses
dbplyr::window_order
anddbplyr::window_frame
to translate topartition_by
and window frame specification.
Examples
library("magrittr")
# example 1 (simple case with dataframe)
# Using iris dataset,
# compute cumulative mean of column `Sepal.Length`
# ordered by `Petal.Width` and `Sepal.Width` columns
# grouped by `Petal.Length` column
iris %>%
tidier::mutate_(sl_mean = mean(Sepal.Length),
.order_by = c("Petal.Width", "Sepal.Width"),
.by = "Petal.Length",
.frame = c(Inf, 0),
) %>%
dplyr::slice_min(n = 3, Petal.Width, by = Species)
#> # A tibble: 15 × 6
#> Petal.Length Sepal.Length Sepal.Width Petal.Width Species sl_mean
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 1.4 4.8 3 0.1 setosa 4.8
#> 2 1.4 4.9 3.6 0.1 setosa 4.85
#> 3 1.1 4.3 3 0.1 setosa 4.3
#> 4 1.5 4.9 3.1 0.1 setosa 4.9
#> 5 1.5 5.2 4.1 0.1 setosa 5.05
#> 6 3.5 5 2 1 versicolor 5
#> 7 3.5 5.7 2.6 1 versicolor 5.35
#> 8 3.3 4.9 2.4 1 versicolor 4.95
#> 9 3.3 5 2.3 1 versicolor 5
#> 10 4.1 5.8 2.7 1 versicolor 5.8
#> 11 4 6 2.2 1 versicolor 6
#> 12 3.7 5.5 2.4 1 versicolor 5.5
#> 13 5.6 6.1 2.6 1.4 virginica 6.1
#> 14 5.1 6.3 2.8 1.5 virginica 6.3
#> 15 5 6 2.2 1.5 virginica 6
# example 2 (detailed case with dataframe)
# Using a sample airquality dataset,
# compute mean temp over last seven days in the same month for every row
set.seed(101)
airquality %>%
# create date column
dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) %>%
# create gaps by removing some days
dplyr::slice_sample(prop = 0.8) %>%
dplyr::arrange(date_col) %>%
# compute mean temperature over last seven days in the same month
tidier::mutate_(avg_temp_over_last_week = mean(Temp, na.rm = TRUE),
.order_by = "Day",
.by = "Month",
.frame = c(lubridate::days(7), # 7 days before current row
lubridate::days(-1) # do not include current row
),
.index = "date_col"
)
#> # A tibble: 122 × 8
#> Month Ozone Solar.R Wind Temp Day date_col avg_temp_over_last_week
#> <int> <int> <int> <dbl> <int> <int> <date> <dbl>
#> 1 6 NA 264 14.3 79 6 1973-06-06 76.3
#> 2 6 NA 250 9.2 92 12 1973-06-12 85.2
#> 3 6 37 284 20.7 72 17 1973-06-17 85.2
#> 4 6 NA 135 8 75 25 1973-06-25 75.5
#> 5 6 NA 31 14.9 77 29 1973-06-29 76.4
#> 6 7 49 248 9.2 85 2 1973-07-02 84
#> 7 7 77 276 5.1 88 7 1973-07-07 83.4
#> 8 7 27 175 14.9 81 13 1973-07-13 83
#> 9 7 48 260 6.9 81 16 1973-07-16 82.7
#> 10 7 52 82 12 86 27 1973-07-27 82.4
#> # ℹ 112 more rows
# example 3
airquality %>%
# create date column as character
dplyr::mutate(date_col =
as.character(lubridate::make_date(1973, Month, Day))
) %>%
tibble::as_tibble() %>%
# as `tbl_lazy`
dbplyr::memdb_frame() %>%
mutate_(avg_temp = mean(Temp),
.by = "Month",
.order_by = "date_col",
.frame = c(3, 3)
) %>%
dplyr::collect() %>%
dplyr::select(Ozone, Solar.R, Wind, Temp, Month, Day, date_col, avg_temp)
#> # A tibble: 153 × 8
#> Ozone Solar.R Wind Temp Month Day date_col avg_temp
#> <int> <int> <dbl> <int> <int> <int> <chr> <dbl>
#> 1 41 190 7.4 67 5 1 1973-05-01 68.8
#> 2 36 118 8 72 5 2 1973-05-02 66.2
#> 3 12 149 12.6 74 5 3 1973-05-03 66.2
#> 4 18 313 11.5 62 5 4 1973-05-04 66
#> 5 NA NA 14.3 56 5 5 1973-05-05 64.9
#> 6 28 NA 14.9 66 5 6 1973-05-06 63.3
#> 7 23 299 8.6 65 5 7 1973-05-07 62.6
#> 8 19 99 13.8 59 5 8 1973-05-08 64.3
#> 9 8 19 20.1 61 5 9 1973-05-09 66.1
#> 10 NA 194 8.6 69 5 10 1973-05-10 66.1
#> # ℹ 143 more rows