Load a dataset
HF_load_dataset(
  path,
  name = NULL,
  data_dir = NULL,
  data_files = NULL,
  split = NULL,
  cache_dir = NULL,
  features = NULL,
  download_config = NULL,
  download_mode = NULL,
  ignore_verifications = FALSE,
  save_infos = FALSE,
  script_version = NULL,
  ...
)
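path: path to, or name of, the dataset loading script
name: name of the dataset configuration
data_dir: dataset directory
data_files: dataset files
split: split to load (default: all splits)
cache_dir: cache directory
features: features of the dataset
download_config: download configuration
download_mode: download mode
ignore_verifications: whether to ignore verifications
save_infos: whether to save the dataset information
script_version: version of the dataset loading script
...: additional arguments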
Returns a data frame.
This function does the following under the hood:
1. Downloads the dataset loading script from ``path`` and imports it into the library, unless it is already cached there. Dataset loading scripts are small Python scripts that define the citation, info and format of the dataset, contain the URL of the original data files, and hold the code that loads examples from those files. Some of the scripts are available at https://github.com/huggingface/datasets/datasets, and you can upload your own to share them using the CLI ``datasets-cli``.
2. Runs the dataset loading script, which will:
   * Download the dataset files from the original URL (see the script) if they are not already downloaded and cached.
   * Process the dataset and cache it in typed Arrow tables. Arrow tables are arbitrarily long, typed tables that can store nested objects and be mapped to numpy/pandas/python standard types. They can be accessed directly from disk, loaded into RAM, or even streamed over the web.
3. Returns a dataset built from the splits requested in ``split`` (default: all).
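The download-once, cache, then select-a-split flow described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `load_cached_dataset()`, `demo_fetch()` and the toy data are hypothetical names introduced here, and an RDS file stands in for the Arrow cache.

```r
# Hypothetical stand-in for a dataset loading script: returns every split.
demo_fetch <- function() {
  list(
    train = data.frame(text = c("good", "bad"), label = c(1L, 0L)),
    test  = data.frame(text = "fine", label = 1L)
  )
}

# Hypothetical helper mirroring the three steps; an RDS file replaces Arrow.
load_cached_dataset <- function(path, split = NULL, cache_dir = tempdir(),
                                fetch = demo_fetch) {
  cache_file <- file.path(cache_dir, paste0(basename(path), ".rds"))

  # Steps 1-2: build the dataset only when no cached copy exists yet.
  if (!file.exists(cache_file)) {
    saveRDS(fetch(), cache_file)
  }

  # Step 3: return the requested split, or all splits when `split` is NULL.
  splits <- readRDS(cache_file)
  if (is.null(split)) splits else splits[[split]]
}

# Use a fresh cache directory so the example does not pick up stale files.
cache <- tempfile("hf_cache_")
dir.create(cache)

train      <- load_cached_dataset("demo", split = "train", cache_dir = cache)
all_splits <- load_cached_dataset("demo", cache_dir = cache)
```

The second call reads the cached file instead of calling `fetch()` again, which is the behaviour the steps above describe. With the package and its Python dependencies installed, a real call would presumably look like `HF_load_dataset("imdb", split = "train")`; the dataset name is only an example.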