COLLECTED BY
The Open Syllabus collection contains WARC files from a mid-2021 crawl of about 50 million unique seed URLs extracted from the Open Syllabus version 2.6 dataset and their page requisites. The bulk of the seed URLs are from ".com", ".org", ".edu", and ".uk" TLDs.
Crawl Summary
Crawl start: 2021-04-12
Crawl end: 2021-09-05
Seed URLs: 49,735,419
Archived URLs: 338,690,414
Collection Size: 25 TB
Crawler: Heritrix/3.3.0-hq1-SNAPSHOT-2015-03-16T18:09:23Z
Crawl depth: maxHops=0
Seed Summary
Unique URLs: 49,735,419
Unique Canonical URLs: 48,956,395
Unique Hosts: 984,223
IPv4 Addresses: 3,328
Unique TLDs: 21,761
Unique IANA Valid TLDs: 739
Wayback Machine URLs*: 6,568,213
* NOTE: More than 13% of the URLs in the dataset point to the Wayback Machine!
The Wayback Machine - https://web.archive.org/web/20210413144146/https://github.com/ipython-books/cookbook-2nd/issues
Labels: 3
Milestones: 0