To ensure the format of this archive works for you, please download and inspect at least the first manifest file, and preferably both the first manifest and its associated .tar.xz archive:
cdd582ea4b95ec9d9a48d8e75e67464c data_0000000000_0000100000.tar.xz
b1255acf700c74af04a82e199c2b3aee manifest_0000000000_0000100000.tsv.xz
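A downloaded file can be checked against the hashes listed above before it is unpacked. The following is a minimal sketch, assuming the listed values are MD5 digests of the files as downloaded; the helper names `md5_of_file` and `matches_published_md5` are illustrative and not part of the archive tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte segments never sit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected_hex):
    """Compare a file's digest with a published hash (hypothetical helper)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_md5("manifest_0000000000_0000100000.tsv.xz", "b1255acf700c74af04a82e199c2b3aee")` should be true for an intact download of the first manifest.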
The archive currently consists of 830 individual segments totalling roughly 274G of data while compressed.
All manifest files are TSV files. Their names follow a common format:
manifest_${start}_${end}.tsv.xz
The manifest and corresponding archive contain requests with ids in [$start, $end).
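The half-open id range can be read straight from a file name. This is a rough sketch, assuming the zero-padded decimal form seen in the checksum listing and only the `manifest_` and `data_` prefixes shown there; `parse_segment_name` and `segment_contains` are hypothetical helper names.

```python
import re

# Matches names such as manifest_0000000000_0000100000.tsv.xz
# and data_0000000000_0000100000.tar.xz.
SEGMENT_NAME = re.compile(r"^(manifest|data)_(\d+)_(\d+)\.(?:tsv|tar)\.xz$")


def parse_segment_name(file_name):
    """Return (kind, start, end) parsed from a segment file name."""
    match = SEGMENT_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unrecognised segment name: {file_name!r}")
    kind, start, end = match.groups()
    return kind, int(start), int(end)


def segment_contains(file_name, request_id):
    """True if request_id falls in the half-open range [start, end)."""
    _, start, end = parse_segment_name(file_name)
    return start <= request_id < end
```

Because the range is half-open, the first segment covers id 99999 but not id 100000.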
The first line contains a header:
id timestamp url length md5
The manifest files can be used to locate requests for a specific url without extracting every archive, or similarly, to locate the most recent version of a url if more than one request for it is present.
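A lookup of the most recent request for one url might look like the sketch below. It assumes the manifest is UTF-8 text and that the `timestamp` column holds an integer that grows over time; neither is stated above, so adjust the comparison if the real format differs. `latest_request_for_url` is a hypothetical helper name.

```python
import csv
import lzma


def latest_request_for_url(manifest_path, url):
    """Return the manifest row with the newest timestamp for `url`, or None."""
    latest = None
    # lzma.open in text mode decompresses the .tsv.xz on the fly.
    with lzma.open(manifest_path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["url"] != url:
                continue
            # Assumption: timestamps are integers, so larger means more recent.
            if latest is None or int(row["timestamp"]) > int(latest["timestamp"]):
                latest = row
    return latest
```

The `id` of the returned row names the matching `${id}.html` entry in the archive that shares the manifest's range.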
A master manifest containing data about all data and other manifest files is also available (please note this is the hash as of 2020-09-23):
cfc2cc827cae6c04ce77dcaaf57762f6 master_manifest.tsv
The first line contains a header:
file_name type range_start range_end size md5
The fields are file_name, type (d or m), range_start and range_end (the [start, end) id range), size, and md5.
All archives are .tar.xz files. They follow the same naming scheme as the manifest files. In addition, the corresponding manifest is always the first entry in the tar.
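Since the manifest is the first tar entry, it can be read without walking the rest of the archive. The sketch below is one way to do that; it assumes the embedded manifest is UTF-8 TSV and that it may or may not itself be xz-compressed, which is not specified above. `read_embedded_manifest` is a hypothetical helper name.

```python
import csv
import io
import lzma
import tarfile


def read_embedded_manifest(archive_path):
    """Return the rows of the manifest stored as the first tar entry."""
    with tarfile.open(archive_path, "r:xz") as archive:
        first = archive.next()  # first member only; later entries are not read
        if first is None or not first.isfile():
            raise ValueError("archive has no leading regular file entry")
        raw = archive.extractfile(first).read()
    # Assumption: the embedded manifest might keep its .xz compression.
    if first.name.endswith(".xz"):
        raw = lzma.decompress(raw)
    text = io.StringIO(raw.decode("utf-8"), newline="")
    return list(csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE))
```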
The remaining entries are all HTML files named ${id}.html. The first \n terminated line is an abbreviated header so they can be used independently of the manifest files. The header has the form:
<!-- ${timestamp} ${url} -->
(note: this line is also tab separated for ease of processing)
Please refer to the manifest format definition if the field names are not self-explanatory.
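Splitting an entry into its header fields and the original response could be sketched as follows. The exact placement of the tabs inside the comment is an assumption here, so the code accepts either tab or space separation; `split_entry` is a hypothetical helper name.

```python
def split_entry(raw):
    """Split the bytes of one ${id}.html entry into (timestamp, url, body)."""
    header, newline, body = raw.partition(b"\n")
    if not newline:
        raise ValueError("entry has no newline-terminated header line")
    line = header.decode("utf-8").strip()
    if not (line.startswith("<!--") and line.endswith("-->")):
        raise ValueError("first line is not the abbreviated header comment")
    inner = line[4:-3].strip()  # drop the comment markers and outer whitespace
    if "\t" in inner:
        # Tab separated, as the note on the header form describes.
        timestamp, _, url = inner.partition("\t")
    else:
        # Fallback assumption: space separated, with the url as the last token.
        timestamp, _, url = inner.rpartition(" ")
    timestamp, url = timestamp.strip(), url.strip()
    if not timestamp or not url:
        raise ValueError("header does not contain a timestamp and a url")
    return timestamp, url, body
```

The body is returned as untouched bytes, so whatever line endings the server sent are preserved.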
The remainder of the file is the best-effort original response. Its lines may be terminated by \r\n or by \n alone, depending on which server responded to the initial request. For example, in the first archive ./2.html has \n terminated lines, while ./3.html has \r\n terminated lines.
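A consumer that cares about line endings can detect which style a response uses and normalize a copy. This is only a sketch with hypothetical helper names; keep the original bytes around if you intend to compare lengths or checksums afterwards, since normalizing changes both.

```python
def detect_line_ending(body):
    """Return "crlf", "lf", or None for a response body given as bytes."""
    crlf = body.count(b"\r\n")
    lf = body.count(b"\n") - crlf  # bare \n that are not part of \r\n
    if crlf == 0 and lf == 0:
        return None
    # Mixed content is labelled by whichever style is more common.
    return "crlf" if crlf > lf else "lf"


def normalize_newlines(body):
    """Return a copy of the body with every \\r\\n replaced by \\n."""
    return body.replace(b"\r\n", b"\n")
```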
Recent changes have altered the response format for some pages starting near id 68184762. The newer method of access currently requires the HTML to first be normalized by a browser capable of running JavaScript. This means that the markup differs from the true source, both because of the browser itself and, in some cases, because of JavaScript that runs on the page.
The semantic meaning of the HTML is generally unchanged, so any consumer using CSS-style selectors and a true HTML parser is likely unaffected.