Computes distances via dist and saves them as a file-backed matrix (FBM) using the bigstatsr package, or connects to an existing FBM backing file on disk.

bigdist(mat, file, method = "euclidean", type = "float")

Arguments

mat

Numeric matrix. When missing, bigdist attempts to connect to an existing backing file. See the 'file' argument.

file

(string) Name of the backing file to be created, or of an existing backing file to connect to. Do not include the trailing ".bk". See Details for the backing file name format.

method

(string or function) See the method argument of dist. This is ignored when mat is missing.

type

(string, default: 'float') Storage type of the FBM. See FBM. This is ignored when mat is missing.

Value

An object of class 'bigdist'.

Details

A 'bigdist' object is a list whose 'fbm' element holds the FBM connection. The backing file name has the form <somename>_<size>_<type>.bk, where size is the number of observations and type is the storage type, such as 'double' or 'float'.
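
As a rough illustration of the naming convention, the sketch below builds and parses such a file name in base R. The helpers `make_bk_name` and `parse_bk_name` are hypothetical and are not part of disto; only the `<somename>_<size>_<type>.bk` pattern comes from the description above.

```r
# Hypothetical helpers (not part of disto) illustrating the
# <somename>_<size>_<type>.bk naming convention.
make_bk_name <- function(name, size, type) {
  paste0(name, "_", size, "_", type, ".bk")
}

parse_bk_name <- function(path) {
  # Drop the directory and the ".bk" extension, then split on underscores.
  stem  <- sub("\\.bk$", "", basename(path))
  parts <- strsplit(stem, "_", fixed = TRUE)[[1]]
  n     <- length(parts)
  list(
    name = paste(parts[seq_len(n - 2)], collapse = "_"),  # may itself contain "_"
    size = as.integer(parts[n - 1]),                      # number of observations
    type = parts[n]                                       # storage type
  )
}

bk   <- make_bk_name("temp_ex1", 100, "float")
info <- parse_bk_name(bk)
```

Here `bk` is "temp_ex1_100_float.bk", and `info` recovers the name "temp_ex1", the size 100 and the type "float".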

The bigstatsr package stores matrices on disk and allows efficient computation on them. The disto package provides a unified frontend to read parts of distance matrices and to apply functions over rows/columns. For efficient operations, write C++ functions that talk to bigstatsr's FBM.

The distance computation and the writing to the FBM may be parallelized by setting a future backend.
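
A minimal sketch of what setting a backend might look like follows. It assumes the future and disto packages are installed and that bigdist uses whichever plan is active when it is called. The choice of `multisession` with two workers and the file name "temp_par" are illustrative assumptions, not recommendations from the package.

```r
# Sketch only: assumes the 'future' and 'disto' packages are installed.
library(disto)

# Select a parallel backend; plan() returns the previous plan invisibly.
old_plan <- future::plan(future::multisession, workers = 2)

set.seed(1)
amat <- matrix(rnorm(1e3), ncol = 10)

# Distances are computed and written to the FBM under the active plan.
temp_par <- bigdist(mat = amat, file = file.path(tempdir(), "temp_par"))

# Restore the previous backend when done.
future::plan(old_plan)
```

Restoring the earlier plan keeps the change local to this computation, so other code in the session is not affected.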

Examples

# basics of 'bigdist'
# create a random matrix
set.seed(1)
amat <- matrix(rnorm(1e3), ncol = 10)
td <- tempdir()
# create a bigdist object with FBM (file-backed matrix) on disk
temp <- bigdist(mat = amat, file = file.path(td, "temp_ex1"))
#> ----
#> Location: /tmp/RtmpNBdlQr/temp_ex1_100_float.bk
#> Size on disk: 0 GB
#> Computing distances ...
#> Completed!
#> ----
temp
#> $fbm
#> A Filebacked Big Matrix of type 'float' with 100 rows and 100 columns.
#>
#> attr(,"class")
#> [1] "bigdist"
temp$fbm$backingfile
#> [1] "/tmp/RtmpNBdlQr/temp_ex1_100_float.bk"
temp$fbm[1, 2]
#> [1] 4.631341
# connect to FBM on disk as a bigdist object
temp2 <- bigdist(file = file.path(td, "temp_ex1_100_float"))
temp2
#> $fbm
#> A Filebacked Big Matrix of type 'float' with 100 rows and 100 columns.
#>
#> attr(,"class")
#> [1] "bigdist"
temp2$fbm[1,2]
#> [1] 4.631341
# check the size of bigdist object
bigdist_size(temp)
#> [1] 100
# bigdist accessors
# ij
bigdist_extract(temp, 1, 2)
#>          [,1]
#> [1,] 4.631341
bigdist_extract(temp, 1:2, 3:4)
#>          [,1]     [,2]
#> [1,] 3.976406 3.531089
#> [2,] 5.591309 4.661480
bigdist_extract(temp, 1:2, 3:4, product = "inner")
#> [1] 3.976406 4.661480
dim(bigdist_extract(temp, 1:2,))
#> [1] 2 100
dim(bigdist_extract(temp, , 3:4))
#> [1] 100 2
# k (lower triangle indexing)
bigdist_extract(temp, k = 3:7)
#> [1] 3.531089 4.131712 4.124086 3.900174 3.730360
# bigdist replacers
# ij
bigdist_replace(temp, 1, 2, 10)
#> $fbm
#> A Filebacked Big Matrix of type 'float' with 100 rows and 100 columns.
#>
#> attr(,"class")
#> [1] "bigdist"
bigdist_extract(temp, 1, 2)
#>      [,1]
#> [1,]   10
bigdist_replace(temp, 1:2, 3:4, 11:12)
#> $fbm
#> A Filebacked Big Matrix of type 'float' with 100 rows and 100 columns.
#>
#> attr(,"class")
#> [1] "bigdist"
bigdist_extract(temp, 1:2, 3:4, product = "inner")
#> [1] 11 12
# k (lower triangle indexing)
bigdist_replace(temp, k = 3:7, value = 51:55)
#> $fbm
#> A Filebacked Big Matrix of type 'float' with 100 rows and 100 columns.
#>
#> attr(,"class")
#> [1] "bigdist"
bigdist_extract(temp, k = 3:7)
#> [1] 51 52 53 54 55
# subset a bigdist object
temp_subset <- bigdist_subset(temp, index = 21:30, file = file.path(td, "temp_ex2"))
temp_subset
#> $fbm
#> A Filebacked Big Matrix of type 'float' with 10 rows and 10 columns.
#>
#> attr(,"class")
#> [1] "bigdist"
temp_subset$fbm$backingfile
#> [1] "/tmp/RtmpNBdlQr/temp_ex2_10_float.bk"
# convert a dist object (in memory) to a bigdist object
temp3 <- as_bigdist(dist(mtcars), file = file.path(td, "temp_ex3"))
#> ----
#> Location: /tmp/RtmpNBdlQr/temp_ex3_32_double.bk
#> Size on disk: 0 GB
#> completed!
#> ----
temp3
#> $fbm
#> A Filebacked Big Matrix of type 'double' with 32 rows and 32 columns.
#>
#> attr(,"class")
#> [1] "bigdist"
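
The `k` argument in the examples indexes the lower triangle. The base R sketch below shows how such an index could map to a (row, column) pair, assuming `k` follows the column-wise ordering used by `dist` objects. That assumption is consistent with the output above, where `k = 3` returns the same value as entry [1, 4]. The helper `k_to_ij` is hypothetical and not part of disto.

```r
# Hypothetical helper (not part of disto): convert lower-triangle
# indices k into (row, col) pairs, assuming the column-wise ordering
# that stats::dist uses for n observations.
k_to_ij <- function(k, n) {
  # Positions of the strict lower triangle, enumerated column by column.
  idx <- which(lower.tri(matrix(0, n, n)), arr.ind = TRUE)
  idx[k, , drop = FALSE]
}

set.seed(1)
amat <- matrix(rnorm(1e3), ncol = 10)
d    <- dist(amat)
full <- as.matrix(d)

ij <- k_to_ij(3:7, n = 100)

# The k-th element of the dist vector should equal the (row, col)
# entry of the full symmetric matrix.
same <- all.equal(as.vector(d)[3:7], unname(full[ij]))
```

With 100 observations, `k = 3` maps to row 4, column 1, which by symmetry is the same distance as entry [1, 4].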