
Standardize the data types of a file's columns based on a unifying dictionary
unify_classes.Rd
This function is part of a group of functions intended to handle the scenario where equivalent data is stored heterogeneously across files (e.g., with different column names and data types). In particular, it unifies the data types of a group of a file's columns based on a unifying dictionary.
Arguments
- df
Data frame, data frame extension (e.g., a tibble), or lazy data frame (e.g., from dbplyr or dtplyr). Its columns should be a subset of selected_columns.
- dict
Data frame that represents a refined unifying dictionary (possibly created by sort_partial_dictionary() and refined by the user) containing information about a group of files intended to be processed together. It must have at least three columns: uniname, uniclass, and a column named after the file.
- file
String with the name of a file that belongs to the group of files (i.e., it is a column name in dict).
- selected_columns
Atomic vector that is a subset of the uniname column of dict. In other words, it is the set of desired columns, each of which should be present in at least one file of the group.
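To make the argument descriptions above concrete, here is a minimal sketch of what a unifying dictionary might look like. The file names, column names, and classes are made up for illustration, and the idea that each file column stores that file's original column name is an assumption, not something this page states.

```r
# Hypothetical unifying dictionary for two files (all values are made up).
# Assumption: each file column holds the original column name in that file,
# with NA where the file lacks the column.
dict <- data.frame(
  uniname  = c("ID", "YEAR", "MONTH"),
  uniclass = c("character", "integer", "integer"),
  data_2020.parquet = c("id", "year", "month"),
  data_2021.parquet = c("person_id", "yr", NA),
  stringsAsFactors = FALSE
)

file <- "data_2021.parquet"
selected_columns <- c("ID", "YEAR")

# Checks implied by the argument descriptions above.
stopifnot(
  all(c("uniname", "uniclass", file) %in% names(dict)),
  all(selected_columns %in% dict$uniname)
)
```

Running checks like these before the main loop can surface a mismatched file name or a misspelled column early, rather than partway through processing.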
Value
An object of the same type as df. The output has the following properties:
- It has the same column names and column order as df.
- A subset of the columns may change data type based on the uniclass column of dict.
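The two properties above can be illustrated with a small base R sketch. The helper coerce_by_dict() is hypothetical and is not the package's implementation: it only handles in-memory data frames and a handful of classes, and exists to show that names and order are preserved while types follow uniclass.

```r
# Hypothetical helper (not part of the package): coerce each column of `df`
# to the class listed for it in `dict$uniclass`.
coerce_by_dict <- function(df, dict) {
  for (col in intersect(names(df), dict$uniname)) {
    target <- dict$uniclass[match(col, dict$uniname)]
    cast <- switch(target,
      character = as.character,
      integer   = as.integer,
      numeric   = as.numeric,
      logical   = as.logical,
      stop("unsupported class: ", target)
    )
    df[[col]] <- cast(df[[col]])
  }
  df
}

# Made-up dictionary and data for illustration.
dict <- data.frame(
  uniname  = c("ID", "YEAR", "MONTH"),
  uniclass = c("character", "integer", "integer"),
  stringsAsFactors = FALSE
)
df <- data.frame(ID = c(1, 2), YEAR = c("2020", "2021"),
                 stringsAsFactors = FALSE)

out <- coerce_by_dict(df, dict)
sapply(out, class)  # ID is now character, YEAR is now integer
```

Assigning with df[[col]] replaces each column in place, which is why the column names and their order are unchanged in the result.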
See also
For a full example, see the vignette process_data_with_partial_dict on the website, or run vignette('process_data_with_partial_dict', package = 'dataRC').
Examples
if (FALSE) {
  library(dplyr)  # provides %>%, collect() and bind_rows()

  # Parameters
  folder <- 'my_folder'
  files <- list.files(folder)
  dict <- readxl::read_excel('my_dict.xlsx')
  selected_columns <- dict$uniname[1L:3L]

  # Build a single data set, keeping each file lazy until collect().
  df <- NULL
  for (file in files) {
    df0 <- arrow::open_dataset(file.path(folder, file)) %>%
      unify_colnames(dict, file, selected_columns) %>%
      unify_classes(dict, file, selected_columns) %>%
      collect()
    df <- dplyr::bind_rows(df, df0)
  }

  # Put the columns in the desired order and save the result.
  df %>%
    relocate_columns(selected_columns) %>%
    arrow::write_parquet('unified_data.parquet')
}
if (FALSE) {
  library(dplyr)  # provides %>%, collect() and n()

  # Parameters
  folder <- 'my_folder'
  files <- list.files(folder)
  dict <- readxl::read_excel('my_dict.xlsx')
  selected_columns <- c('ID', 'YEAR', 'MONTH')

  # Count the number of people per month using lazy evaluation.
  # open_dataset() defers reading until collect() is called.
  df <- NULL
  for (file in files) {
    df0 <- arrow::open_dataset(file.path(folder, file)) %>%
      unify_colnames(dict, file, selected_columns) %>%
      unify_classes(dict, file, selected_columns) %>%
      dplyr::distinct() %>%
      dplyr::group_by(YEAR, MONTH) %>%
      dplyr::summarise(n_ = n()) %>%
      collect()
    df <- rbind(df, df0)
  }

  # Add up the per-file counts and save the result.
  df %>%
    relocate_columns(selected_columns) %>%
    dplyr::group_by(YEAR, MONTH) %>%
    dplyr::summarise(n_ = sum(n_)) %>%
    arrow::write_parquet('people_per_month.parquet')
}