The dataset is split into 30 distinct blocks (00.jsonl.zst to 29.jsonl.zst).
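As a small illustration of that naming scheme, the sketch below builds the list of block file names. The function name `shard_names` is hypothetical, and no download URL is assumed; prefix each name with whichever mirror you actually use.

```python
def shard_names(count: int = 30) -> list[str]:
    """Return the zero-padded block file names, 00.jsonl.zst onward."""
    # Two-digit zero padding matches the 00 ... 29 naming in the text.
    return [f"{i:02d}.jsonl.zst" for i in range(count)]


if __name__ == "__main__":
    names = shard_names()
    print(names[0], names[-1], len(names))
```

Generating the names this way makes it easy to fetch or inspect a single block rather than the whole set.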
from The Pile’s official repository:
pip install pile
Alternatively, download the .torrent file from the-eye.eu or huggingface.co/datasets/EleutherAI/the_pile.
Because of the massive file size, it is highly recommended to use streaming to inspect the data without downloading the entire 825 GiB at once.
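One way to inspect the data incrementally is to decompress a single block lazily and parse only the first few records. The sketch below is a minimal illustration under stated assumptions, not an official loader: the helper names (`zst_lines`, `iter_records`, `peek`) are hypothetical, the third-party `zstandard` package is assumed to be installed for the decompression step, and each line is assumed to be one JSON object.

```python
import io
import json
from itertools import islice
from typing import Iterable, Iterator


def zst_lines(path: str) -> Iterator[str]:
    """Lazily yield text lines from a .jsonl.zst file.

    Assumption: the third-party `zstandard` package is installed.
    """
    import zstandard  # imported here so the parsing helpers work without it

    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        # TextIOWrapper decodes the decompressed bytes line by line.
        yield from io.TextIOWrapper(reader, encoding="utf-8")


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON Lines text lazily, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def peek(lines: Iterable[str], n: int = 3) -> list[dict]:
    """Return the first n records without consuming the rest."""
    return list(islice(iter_records(lines), n))


if __name__ == "__main__":
    # Hypothetical in-memory sample standing in for a real block.
    sample = ['{"text": "first"}\n', '{"text": "second"}\n']
    print(peek(sample, 1))
```

For a real block, `peek(zst_lines("00.jsonl.zst"), 3)` would read only as much of the file as is needed to produce three records, assuming the file is present locally.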
Originally curated by EleutherAI, The Pile is an 825 GiB diverse, open-source language modeling dataset composed of smaller datasets ranging from English Wikipedia to academic papers, legal documents, and code.