This function partitions a parquet file into multiple partitions based on the specified maximum partition size.

Usage

partition_data(
  original_file,
  partition_folder,
  max_partition_size = 25,
  units = "mb"
)

Arguments

original_file

Path to the original parquet file.

partition_folder

Path to the folder where the partitions will be stored.

max_partition_size

Maximum size of each partition (the default is 25).

units

Units of storage supported by files_size() (the default is 'mb').

Value

None. Writes the partitions to partition_folder.

Note

To improve performance, the size of each partition is estimated by assuming that storage demand is homogeneous along the original file. This assumption may be unrealistic, so max_partition_size does not guarantee that the largest partition is at most this size. This is especially true for small files/partitions, where the storage savings from using parquet become weaker.
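
The homogeneity assumption can be illustrated with a minimal sketch. This is not the package's actual implementation: the helper name estimate_partition_rows() and its arguments are hypothetical, and it only shows how a row count per partition might be derived if every row is assumed to take the same amount of storage.

```r
# Hypothetical helper (not part of the package): estimate how many rows go
# into each partition, assuming every row takes the same amount of storage.
estimate_partition_rows <- function(file_size, n_rows, max_partition_size = 25) {
  stopifnot(file_size > 0, n_rows > 0, max_partition_size > 0)

  # Number of partitions needed if storage were spread evenly across rows.
  n_partitions <- ceiling(file_size / max_partition_size)

  # Rows assigned to each partition (the last one may hold fewer).
  rows_per_partition <- ceiling(n_rows / n_partitions)

  list(n_partitions = n_partitions, rows_per_partition = rows_per_partition)
}

# A 110 MB file with one million rows and a 25 MB limit.
est <- estimate_partition_rows(file_size = 110, n_rows = 1e6,
                               max_partition_size = 25)
est$n_partitions        # 5
est$rows_per_partition  # 200000
```

Because real rows rarely compress equally, a partition built from a fixed row count can end up larger than the limit, which is why the maximum is not guaranteed.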

Why only partition parquet files?

This function is intended for sharing files through a communication tool that limits the size per message (e.g. email). It is therefore recommended to convert the file into a more efficient format before partitioning the data.
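
The conversion step recommended above might look like the following sketch. It assumes the arrow package is installed and uses temporary files and the built-in mtcars data as stand-ins for real data. The final call to partition_data() is left commented out because it depends on this package being loaded.

```r
# Sketch only: assumes the arrow package is available.
library(arrow)

# Hypothetical paths; temporary files stand in for real data.
csv_path <- tempfile(fileext = ".csv")
parquet_path <- tempfile(fileext = ".parquet")

# Create a small CSV file to convert.
write.csv(mtcars, csv_path, row.names = FALSE)

# Read the CSV and write it back out as parquet.
data <- read_csv_arrow(csv_path)
write_parquet(data, parquet_path)

# The parquet file can then be partitioned, e.g.:
# partition_data(parquet_path, "partitions/", max_partition_size = 25, units = "mb")
```

For a file this small the parquet version may not be smaller than the CSV. The savings generally show up on larger data sets.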

Examples

if (FALSE) {
# original_file must point to a parquet file
partition_data("data/original_data.parquet", "partitions/",
               max_partition_size = 25, units = 'mb')
}