I don’t think of The New Yorker as a technical reference, but “The Cobweb” by Jill Lepore explains archive.org remarkably well, including why the archive has so many gaps. It turns out to be trivial to remove material from the archive:
The Wayback Machine collects every Web page it can find, unless that page is blocked; blocking a Web crawler requires adding only a simple text file, “robots.txt,” to the root of a Web site. The Wayback Machine will honor that file and not crawl that site, and it will also, when it comes across a robots.txt, remove all past versions of that site. When the Conservative Party in Britain deleted ten years’ worth of speeches from its Web site, it also added a robots.txt, which meant that, the next time the Wayback Machine tried to crawl the site, all its captures of those speeches went away, too.
So it’s an archive only of stuff that publishers want archived…
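The blocking mechanism Lepore describes is just the standard robots exclusion protocol. As a minimal sketch of how a crawler decides a site is off-limits, here is Python’s standard-library parser applied to a hypothetical robots.txt that blocks everything (the user-agent and URL are illustrative, not the Wayback Machine’s actual configuration):

```python
# Sketch: how a robots.txt-honoring crawler decides whether a page is
# blocked, using Python's standard urllib.robotparser.
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks every crawler from the whole site.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Any user-agent is now refused access to any path on the site.
print(parser.can_fetch("ia_archiver", "https://example.com/speeches/2004/"))
# prints False
```

Two lines in a text file, and a compliant crawler will skip the entire site; per Lepore, the Wayback Machine would also retroactively hide its existing captures.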
Even if you already knew the above, I recommend Lepore’s article for the quality of the writing.