
Write an Excel file containing the names and classes of each file
create_partial_dictionary.Rd
This is the first step in creating a "raw partial dictionary" that eases the unification of databases that share common information but may be heterogeneous across files. For instance, the databases could have different numbers of variables, or a variable's name and data type may change across files.
Usage
create_partial_dictionary(
folder,
files,
dict_path,
n_infer = 100L,
overwrite = FALSE,
verbose = TRUE
)
Arguments
- folder
A character string with the root folder where the data is stored.
- files
A character vector of file paths (relative to the
folder
) from which to extract column names and classes.
- dict_path
The path where the dictionary will be saved.
- n_infer
The number of rows to infer column classes from in each file (the default is
100
).
- overwrite
A boolean indicating whether to overwrite the dictionary if it already exists. The default is
FALSE
to protect your existing dictionaries.
- verbose
A boolean (the default is
TRUE
) indicating whether to display progress messages.
Details
This function extracts the column names and classes (data types)
from each file and stores them in a dictionary saved as an Excel file with
two sheets (one for the names and one for the classes). This raw
dictionary lacks polish and is almost useless in this form, so it is highly
recommended to refine it. The function sort_partial_dictionary()
accomplishes this job. So, why not merge both functions in the first
place? Well, there are at least two reasons. First, the creation of this
preliminary partial dictionary is potentially time-consuming, so by
splitting the process we guarantee that an error in the refinement does not
discard the heavy work. Second, it allows the user to apply custom
processing in between.
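The two-step workflow described above can be sketched as follows. This is a minimal sketch that assumes the dataRC package is installed; the exact arguments accepted by sort_partial_dictionary() are an assumption here, so check its own help page before use.

```r
library(dataRC)

folder <- 'my_folder/my_data'                   # root folder holding the raw files
files  <- list.files(folder, recursive = TRUE)  # paths relative to 'folder'
dict   <- 'my_folder/my_dictionary.xlsx'

# Heavy step: scan every file once and write the raw partial dictionary.
create_partial_dictionary(folder, files, dict)

# Cheap step: refine the raw dictionary. If this step errors out, the raw
# dictionary already saved on disk is untouched and can be reused, so the
# heavy scan does not have to be repeated.
sort_partial_dictionary(dict)  # assumed call form; see its help page
```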
Note
The n_infer
argument is a critical parameter that involves a trade-off
between speed and certainty that the class is properly inferred. If your
data is small or you are not in a hurry, you could replace it with Inf
.
However, even with a small value of 100 I have not experienced any problems
with hundreds of files with millions of observations and tens of variables.
It is important to have all the data files in a common folder (root folder).
Of course, it may be partitioned into sub-folders.
See also
For a full example, see the vignette
process_data_with_partial_dict
in the
website
or with the command
vignette('process_data_with_partial_dict', package = 'dataRC')
.
Examples
if (FALSE) {
# Create a partial dictionary from a list of files.
folder <- 'my_folder/my_data'
files <- list.files(folder, recursive = TRUE)
create_partial_dictionary(folder, files, "my_folder/my_dictionary.xlsx")
}