README

To check that the format of this archive works for you, please download and inspect at least the first manifest file, and preferably both the first manifest and its associated .tar.xz archive:

1.7M manifest_0000000000_0000100000.tsv.xz

212M data_0000000000_0000100000.tar.xz

cdd582ea4b95ec9d9a48d8e75e67464c  data_0000000000_0000100000.tar.xz
b1255acf700c74af04a82e199c2b3aee  manifest_0000000000_0000100000.tsv.xz
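
The checksums above can be compared against a downloaded file before it is unpacked. The following is a minimal sketch using only the Python standard library; the helper names md5_of_file and verify_download are illustrative and not part of the archive tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True when the file at `path` hashes to `expected_md5`."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, verify_download("data_0000000000_0000100000.tar.xz", "cdd582ea4b95ec9d9a48d8e75e67464c") should return True for an intact copy of the first archive.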

Total Size

The archive currently consists of 830 individual segments totalling roughly 274G when compressed.

Manifest Format

All manifest files are xz-compressed TSV files. Their names follow the format manifest_${start}_${end}.tsv.xz. Each manifest and its corresponding archive contain the requests with ids in the half-open range [$start, $end).
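
Because the id range is encoded in the file name, it is possible to tell whether a segment covers a given request id without opening it. The sketch below assumes names exactly of the form shown in this README (manifest_..._....tsv.xz and data_..._....tar.xz); parse_segment_name and segment_contains are hypothetical helper names.

```python
import re

# Expected suffix per file kind, taken from the names listed in this README.
_SUFFIX = {"manifest": ".tsv.xz", "data": ".tar.xz"}
_SEGMENT_RE = re.compile(r"^(manifest|data)_(\d+)_(\d+)(\.\w+\.xz)$")


def parse_segment_name(file_name):
    """Return (kind, start, end) parsed from a segment file name."""
    match = _SEGMENT_RE.match(file_name)
    if match is None or match.group(4) != _SUFFIX[match.group(1)]:
        raise ValueError("unrecognised segment name: %r" % file_name)
    kind, start, end, _ = match.groups()
    return kind, int(start), int(end)


def segment_contains(file_name, request_id):
    """True when request_id falls in the half-open range [start, end)."""
    _, start, end = parse_segment_name(file_name)
    return start <= request_id < end
```

Note that the end of the range is exclusive, so the first segment covers ids 0 through 99999.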

The first line contains a header:

id timestamp url length md5
id
Nondecreasing integer primary key for web requests.
timestamp
Unix timestamp of roughly when the request was made. Not necessarily monotonically increasing, due to clock skew among nodes and backfilled requests.
url
URL.
length
Length in bytes of the UTF-8 encoded response. If the original response could not be cleanly round-tripped through UTF-8 it is not included in the resulting archive.
md5
md5 hash of the response.

The manifest files can be used to locate requests for a specific url without extracting every archive, or similarly, to locate the most recent version of a url if more than one request for it is present.
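
As an illustration of the lookup described above, the following sketch scans one manifest for the most recent request for a url. It assumes the fields are separated by tabs with no quoting and that the timestamp parses as a number; "most recent" is taken to mean the largest timestamp, with the larger id breaking ties, which is a choice made here rather than something the archive prescribes. The function name latest_request_for_url is hypothetical.

```python
import csv
import lzma


def latest_request_for_url(manifest_path, url):
    """Return the manifest row (a dict) of the newest request for `url`, or None."""
    best_row = None
    best_key = None
    with lzma.open(manifest_path, "rt", encoding="utf-8", newline="") as handle:
        # Assumption: plain tab-separated fields, no quoting or escaping.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["url"] != url:
                continue
            # Timestamps are not monotonic in id, so compare them explicitly.
            key = (float(row["timestamp"]), int(row["id"]))
            if best_key is None or key > best_key:
                best_row, best_key = row, key
    return best_row
```

The id of the returned row gives the name of the entry to extract from the corresponding archive, ${id}.html.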

Master Manifest

A master manifest containing data about all data and other manifest files is also available (please note this is the hash as of 2020-09-23):

108K master_manifest.tsv

cfc2cc827cae6c04ce77dcaaf57762f6  master_manifest.tsv

The first line contains a header:

file_name type range_start range_end size md5
file_name
Name of data or manifest file.
type
Single character indicating the type of the file:
d
Data file
m
Manifest file
range_start
Start (inclusive) of the half-open range of included web request ids: [start, end)
range_end
End (exclusive) of the half-open range of included web request ids: [start, end)
size
Length in bytes of the corresponding file.
md5
md5 hash of the corresponding file.
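
The master manifest makes it possible to find which files cover a given request id without downloading anything else. A minimal sketch, assuming master_manifest.tsv is an uncompressed, tab-separated file with the header shown above; files_for_request is a hypothetical helper name.

```python
import csv


def files_for_request(master_manifest_path, request_id):
    """Map file type ('d' or 'm') to the file name whose range holds request_id."""
    found = {}
    with open(master_manifest_path, "r", encoding="utf-8", newline="") as handle:
        # Assumption: plain tab-separated fields, no quoting or escaping.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            start = int(row["range_start"])
            end = int(row["range_end"])
            # Ranges are half-open: start is included, end is not.
            if start <= request_id < end:
                found[row["type"]] = row["file_name"]
    return found
```

The size and md5 columns of the matching rows can then be used to check the files once they are downloaded.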

Archive Format

All archives are .tar.xz files. They follow the same naming scheme as the manifest files. The corresponding manifest is always the first entry in the tar.

The remaining entries are all HTML files named ${id}.html. The first \n terminated line is an abbreviated header so they can be used independently of the manifest files. The header has the form:

<!-- ${timestamp} ${url} -->

(note: this line is also tab separated for ease of processing)

Please refer to the manifest format definition if the field names are not self-explanatory.
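
The header line can be separated from the response with a few lines of Python. This is a sketch under stated assumptions: the exact placement of the tabs inside the comment is not spelled out here, so the fields are split on any whitespace, which in turn assumes the url itself contains no whitespace. The timestamp is returned as a string rather than converted. The name split_entry is hypothetical.

```python
def split_entry(raw_bytes):
    """Split one ${id}.html entry into (timestamp, url, body_bytes)."""
    # The header is the first \n-terminated line; everything after is the body.
    header, _, body = raw_bytes.partition(b"\n")
    text = header.decode("utf-8").strip()
    if not (text.startswith("<!--") and text.endswith("-->")):
        raise ValueError("entry does not start with the expected header comment")
    # Assumption: timestamp and url are separated by tabs or spaces,
    # and the url contains no whitespace.
    fields = text[len("<!--"):-len("-->")].split()
    if len(fields) != 2:
        raise ValueError("expected a timestamp and a url in the header")
    timestamp, url = fields
    return timestamp, url, body
```

The body is returned as bytes and left untouched, so its original line endings are preserved.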

The remainder of the file is the best-effort original response. Its lines may be terminated by \r\n or by \n alone, depending on which server responded to the initial request. For example, in the first archive ./2.html has \n terminated lines, while ./3.html has \r\n terminated lines.
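
An archive can be read entry by entry without unpacking it to disk. The sketch below streams a .tar.xz with the standard tarfile module and yields each entry's name and raw bytes in tar order, so the manifest comes first and the ${id}.html entries follow. Bytes are returned undecoded so that \r\n and \n line endings survive as stored. The name iter_archive is hypothetical.

```python
import tarfile


def iter_archive(archive_path):
    """Yield (member_name, raw_bytes) for each regular file, in tar order."""
    # "r|xz" reads the archive as a forward-only stream, so each member
    # must be consumed before moving on to the next one.
    with tarfile.open(archive_path, mode="r|xz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            extracted = archive.extractfile(member)
            yield member.name, extracted.read()
```

Since the data archives are large, a consumer interested in only a few ids may prefer to stop iterating as soon as those entries have been seen.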

Best-effort original response

Recent changes have altered the response format for some pages starting near id 68184762. The newer method of access currently requires the HTML to first be normalized by a browser capable of running JavaScript. This means that the markup is altered from the true source by the browser, and in some cases by JavaScript that runs on the page.

The semantic meaning of the HTML is generally unchanged, so consumers using CSS-style selectors and a true HTML parser are likely unaffected.