If you are looking for the specific manifest or code that generated this file, you can find it in the official . The dataset is hosted via TensorFlow Datasets (TFDS).
The files labeled with "junk" in their name contain the data that was discarded during these cleaning steps [1, 2].
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019 (Journal of Machine Learning Research, 2020).