This is a first step toward creating a "raw partial dictionary" that eases the unification of databases that share common information but may be heterogeneous across files. For instance, the databases could have different numbers of variables, or a variable's name and data type may change across files.

Usage

create_partial_dictionary(
  folder,
  files,
  dict_path,
  n_infer = 100L,
  overwrite = FALSE,
  verbose = TRUE
)

Arguments

folder

A character string giving the root folder where the data is stored.

files

A character vector of file paths (relative to folder) from which to extract column names and classes.

dict_path

The path where the dictionary will be saved.

n_infer

The number of rows from which to infer column classes in each file (the default is 100).

overwrite

A boolean indicating whether to overwrite the dictionary if it already exists. The default is FALSE to protect your existing dictionaries.

verbose

A boolean (the default is TRUE) indicating whether to display progress messages.

Value

None. The function saves the raw partial dictionary as an Excel file.

Details

This function extracts the column names and classes (data types) from each file and stores them in a dictionary saved as an Excel file with two sheets (one for the names and one for the classes). This raw dictionary lacks polish and is of little use in this form, so it is highly recommended to refine it; the function sort_partial_dictionary() accomplishes this job. So, why not merge both functions in the first place? There are at least two reasons. First, creating this preliminary partial dictionary is potentially time-consuming, so splitting the process guarantees that an error during refinement does not throw away the heavy work. Second, it allows the user to apply custom processing in between.
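Because the raw dictionary is a plain Excel file, you can inspect it before refining. A minimal sketch using the readxl package, assuming a hypothetical dictionary path and that the sheets are in the order described above (the actual sheet names may differ):

```r
library(readxl)  # provides read_excel()

# Hypothetical path to a dictionary produced by create_partial_dictionary()
dict_path <- 'my_folder/my_dictionary.xlsx'

# First sheet: column names per file; second sheet: inferred classes per file
raw_names   <- read_excel(dict_path, sheet = 1)
raw_classes <- read_excel(dict_path, sheet = 2)

# Inspect the structure of both sheets before refining the dictionary
str(raw_names)
str(raw_classes)
```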

Note

The n_infer argument is a critical parameter that comes with a trade-off between speed and certainty that the classes are properly inferred. If your data is small or you are not in a hurry, you can replace it with Inf. However, even with a small value of 100, I have not experienced any problems across hundreds of files with millions of observations and tens of variables. It is important to have all the data files under a common (root) folder; of course, it may be partitioned into sub-folders.
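For a small collection of files, you could trade speed for certainty and scan every row when inferring classes. A sketch with hypothetical paths:

```r
folder <- 'my_folder/my_data'                 # hypothetical root folder
files  <- list.files(folder, recursive = TRUE)

# n_infer = Inf scans all rows of every file (slower, but classes are certain)
create_partial_dictionary(folder, files,
                          dict_path = 'my_folder/my_dictionary.xlsx',
                          n_infer   = Inf)
```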

See also

For a full example, see the vignette process_data_with_partial_dict on the website or with the command vignette('process_data_with_partial_dict', package = 'dataRC').

Examples

if (FALSE) {
# Create a partial dictionary from a list of files.
folder <- 'my_folder/my_data'
files <- list.files(folder, recursive = TRUE)
create_partial_dictionary(folder, files, "my_folder/my_dictionary.xlsx")
}