Provides supercharged version of mutate
with group_by
, order_by
and aggregation over arbitrary window frame
around a row for dataframes and lazy (remote) tbl
s of class tbl_lazy
.
Arguments
- x
(
data.frame
ortbl_lazy
)- ...
expressions to be passed to
mutate
- .by
(expression, optional: Yes) Columns to group by
- .order_by
(expression, optional: Yes) Columns to order by
- .frame
(vector, optional: Yes) Vector of length 2 indicating the number of rows to consider before and after the current row. When argument
.index
is provided (typically a column of type date or datetime), before and after can be interval objects. See examples. When input istbl_lazy
, only number of rows as vector of length 2 is supported.- .index
(expression, optional: Yes, default: NULL) index column. This is supported when input is a dataframe only.
- .complete
(flag, default: FALSE) This will be passed to
slider::slide
/slider::slide_vec
. Should the function be evaluated on complete windows only? If FALSE or NULL, the default, then partial computations will be allowed. This is supported when input is a dataframe only.
Details
A window function returns a value for every input row of a dataframe
or lazy_tbl
based on a group of rows (frame) in the neighborhood of the
input row. This function implements computation over groups (partition_by
in SQL) in a predefined order (order_by
in SQL) across a neighborhood of
rows (frame) defined by a (up, down) where
up/down are number of rows before and after the corresponding row
up/down are interval objects (ex:
c(days(2), days(1))
). Interval objects are currently supported for dataframe only. (nottbl_lazy
)
This implementation is inspired by spark's window API.
Implementation Details:
For dataframe input:
Iteration per row over the window is implemented using the versatile
slider
.Application of a window aggregation can be optionally run in parallel over multiple groups (see argument
.by
) by setting a future parallel backend. This is implemented using furrr package.function subsumes regular usecases of
mutate
For tbl_lazy
input:
Uses
dbplyr::window_order
anddbplyr::window_frame
to translate topartition_by
and window frame specification.
Examples
library("magrittr")
# example 1 (simple case with dataframe)
# Using iris dataset,
# compute cumulative mean of column `Sepal.Length`
# ordered by `Petal.Width` and `Sepal.Width` columns
# grouped by `Petal.Length` column
iris %>%
mutate(sl_mean = mean(Sepal.Length),
.order_by = c(Petal.Width, Sepal.Width),
.by = Petal.Length,
.frame = c(Inf, 0),
) %>%
dplyr::slice_min(n = 3, Petal.Width, by = Species)
#>
#> Attaching package: ‘purrr’
#> The following object is masked from ‘package:magrittr’:
#>
#> set_names
#> # A tibble: 15 × 6
#> Petal.Length Sepal.Length Sepal.Width Petal.Width Species sl_mean
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 1.4 4.8 3 0.1 setosa 4.8
#> 2 1.4 4.9 3.6 0.1 setosa 4.85
#> 3 1.1 4.3 3 0.1 setosa 4.3
#> 4 1.5 4.9 3.1 0.1 setosa 4.9
#> 5 1.5 5.2 4.1 0.1 setosa 5.05
#> 6 3.5 5 2 1 versicolor 5
#> 7 3.5 5.7 2.6 1 versicolor 5.35
#> 8 4 6 2.2 1 versicolor 6
#> 9 3.3 5 2.3 1 versicolor 5
#> 10 3.3 4.9 2.4 1 versicolor 4.95
#> 11 3.7 5.5 2.4 1 versicolor 5.5
#> 12 4.1 5.8 2.7 1 versicolor 5.8
#> 13 5.6 6.1 2.6 1.4 virginica 6.1
#> 14 5 6 2.2 1.5 virginica 6
#> 15 5.1 6.3 2.8 1.5 virginica 6.3
# example 2 (detailed case with dataframe)
# Using a sample airquality dataset,
# compute mean temp over last seven days in the same month for every row
set.seed(101)
airquality %>%
# create date column
dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) %>%
# create gaps by removing some days
dplyr::slice_sample(prop = 0.8) %>%
dplyr::arrange(date_col) %>%
# compute mean temperature over last seven days in the same month
tidier::mutate(avg_temp_over_last_week = mean(Temp, na.rm = TRUE),
.order_by = Day,
.by = Month,
.frame = c(lubridate::days(7), # 7 days before current row
lubridate::days(-1) # do not include current row
),
.index = date_col
)
#> # A tibble: 122 × 8
#> Month Ozone Solar.R Wind Temp Day date_col avg_temp_over_last_week
#> <int> <int> <int> <dbl> <int> <int> <date> <dbl>
#> 1 6 NA 286 8.6 78 1 1973-06-01 NaN
#> 2 6 NA 242 16.1 67 3 1973-06-03 78
#> 3 6 NA 186 9.2 84 4 1973-06-04 72.5
#> 4 6 NA 264 14.3 79 6 1973-06-06 76.3
#> 5 6 29 127 9.7 82 7 1973-06-07 77
#> 6 6 NA 273 6.9 87 8 1973-06-08 78
#> 7 6 NA 259 10.9 93 11 1973-06-11 83
#> 8 6 NA 250 9.2 92 12 1973-06-12 85.2
#> 9 6 23 148 8 82 13 1973-06-13 86.6
#> 10 6 NA 332 13.8 80 14 1973-06-14 87.2
#> # ℹ 112 more rows
# example 3
airquality %>%
# create date column as character
dplyr::mutate(date_col =
as.character(lubridate::make_date(1973, Month, Day))
) %>%
tibble::as_tibble() %>%
# as `tbl_lazy`
dbplyr::memdb_frame() %>%
mutate(avg_temp = mean(Temp),
.by = Month,
.order_by = date_col,
.frame = c(3, 3)
) %>%
dplyr::collect() %>%
dplyr::select(Ozone, Solar.R, Wind, Temp, Month, Day, date_col, avg_temp)
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # A tibble: 153 × 8
#> Ozone Solar.R Wind Temp Month Day date_col avg_temp
#> <int> <int> <dbl> <int> <int> <int> <chr> <dbl>
#> 1 41 190 7.4 67 5 1 1973-05-01 68.8
#> 2 36 118 8 72 5 2 1973-05-02 66.2
#> 3 12 149 12.6 74 5 3 1973-05-03 66.2
#> 4 18 313 11.5 62 5 4 1973-05-04 66
#> 5 NA NA 14.3 56 5 5 1973-05-05 64.9
#> 6 28 NA 14.9 66 5 6 1973-05-06 63.3
#> 7 23 299 8.6 65 5 7 1973-05-07 62.6
#> 8 19 99 13.8 59 5 8 1973-05-08 64.3
#> 9 8 19 20.1 61 5 9 1973-05-09 66.1
#> 10 NA 194 8.6 69 5 10 1973-05-10 66.1
#> # ℹ 143 more rows