Load_dataset — HF_load_dataset • fastai

Load a dataset

HF_load_dataset(
  path,
  name = NULL,
  data_dir = NULL,
  data_files = NULL,
  split = NULL,
  cache_dir = NULL,
  features = NULL,
  download_config = NULL,
  download_mode = NULL,
  ignore_verifications = FALSE,
  save_infos = FALSE,
  script_version = NULL,
  ...
)

Arguments

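path: path to a dataset loading script, or the name of a dataset
name: name of the dataset configuration
data_dir: directory containing the dataset's data files
data_files: path(s) to the source data file(s)
split: which split of the data to load; if NULL, all splits are returned
cache_dir: directory in which cached data is read and written
features: feature types to use for the dataset
download_config: specific download configuration parameters
download_mode: download mode, controlling whether existing downloads and cached data are reused
ignore_verifications: whether to skip verification of the downloaded and processed dataset information (checksums, sizes, splits); FALSE by default
save_infos: whether to save the dataset information (checksums, sizes, splits); FALSE by default
script_version: version of the dataset loading script to use
...: additional arguments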
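
A minimal sketch of how these arguments might be combined in a call. The dataset name "glue", the configuration "mrpc", and the cache location are illustrative assumptions and do not come from this page. The call itself needs the fastai R package, its Python backend, and network access, so it is guarded and only runs in an interactive session.

```r
# Illustrative values only: "glue" / "mrpc" and the cache path are assumptions.
dataset_path <- "glue"
config_name  <- "mrpc"
cache_path   <- file.path(tempdir(), "hf_datasets")

# Guarded: runs only in an interactive session with fastai installed.
if (interactive() && requireNamespace("fastai", quietly = TRUE)) {
  train <- fastai::HF_load_dataset(
    path = dataset_path,
    name = config_name,
    split = "train",
    cache_dir = cache_path
  )
  print(train)
}
```

Arguments left at NULL are expected to fall back to the loader's defaults.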

Value

data frame

Details

This method does the following under the hood:

1. Download the dataset loading script from ``path`` and import it into the library, if it is not already cached inside the library. Processing scripts are small Python scripts that define the citation, info and format of the dataset, contain the URL to the original data files, and hold the code to load examples from the original data files. You can find some of the scripts here: https://github.com/huggingface/datasets/datasets and easily upload yours to share them using the CLI ``datasets-cli``.

2. Run the dataset loading script, which will:
   * Download the dataset file from the original URL (see the script) if it is not already downloaded and cached.
   * Process the dataset and cache it in typed Arrow tables. Arrow tables are arbitrarily long, typed tables which can store nested objects and be mapped to numpy/pandas/python standard types. They can be accessed directly from disk, loaded in RAM or even streamed over the web.

3. Return a dataset built from the requested splits in ``split`` (default: all).
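
The effect of ``split`` in the last step can be sketched as follows. This is an illustration under assumptions rather than a reference example: "glue" and "mrpc" are hypothetical choices, and the calls are guarded because they need the fastai R package, its Python backend, and network access.

```r
# Hypothetical dataset and configuration, used only for illustration.
base_args  <- list(path = "glue", name = "mrpc")
train_args <- c(base_args, list(split = "train"))

if (interactive() && requireNamespace("fastai", quietly = TRUE)) {
  # split left at NULL: every available split is requested.
  all_splits <- do.call(fastai::HF_load_dataset, base_args)

  # A single split is requested; the files cached by the first call
  # should be reused instead of being downloaded again.
  train_only <- do.call(fastai::HF_load_dataset, train_args)
}
```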