I don’t think of the New Yorker as a technical reference, but “The Cobweb” by Jill Lepore explains archive.org remarkably well, including why the archive has so many gaps. It turns out to be trivial to remove stuff from the archive:
The Wayback Machine collects every Web page it can find, unless that page is blocked; blocking a Web crawler requires adding only a simple text file, “robots.txt,” to the root of a Web site. The Wayback Machine will honor that file and not crawl that site, and it will also, when it comes across a robots.txt, remove all past versions of that site. When the Conservative Party in Britain deleted ten years’ worth of speeches from its Web site, it also added a robots.txt, which meant that, the next time the Wayback Machine tried to crawl the site, all its captures of those speeches went away, too.
So it’s an archive only of stuff that publishers want archived…
Even if you already knew the above, I recommend Lepore’s article for the quality of the writing.
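To make the mechanism concrete, here is a minimal sketch using Python’s standard urllib.robotparser of how a robots.txt-honoring crawler decides whether it may fetch a page. The “ia_archiver” user agent is the name commonly associated with the Internet Archive’s crawler, but treat that name, and the example.com URLs, as assumptions for illustration.

```python
# Sketch: how a compliant crawler consults robots.txt before fetching.
from urllib.robotparser import RobotFileParser

# A site owner who wants to keep the Internet Archive out (and, per the
# article, retroactively hide past captures) might publish something like
# this at https://example.com/robots.txt. "ia_archiver" is assumed here
# to be the Wayback Machine's user agent.
ROBOTS_TXT = """
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler asks this question before every fetch.
print(parser.can_fetch("ia_archiver", "https://example.com/speeches/2004.html"))   # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/speeches/2004.html"))  # True
```

The point of the sketch is only that the entire “blocking” mechanism is a convention: nothing stops a crawler from ignoring the file, which is why the opt-out regime described below rests on legal precedent rather than technology.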
If they didn’t honor robots.txt, they would be exposing themselves to all sorts of infringement liability, thanks to our unbalanced copyright law. At least this way they are coasting on precedents set by Google, which could afford the lawyers to make “opt-out” via robots.txt, rather than opt-in, the default.
Now it would be awesome if the Library of Congress were the ones doing this, but until then, the Internet Archive is doing sterling work on a shoestring budget, another contribution of San Francisco to the betterment of the Internet.
Nice – but we get a Streisand Effect: http://en.wikipedia.org/wiki/Streisand_effect
If you’re meticulously erasing history, alarm bells will start ringing. On the other hand – feel free to scrub info – I’m not stopping you; just take note of the interesting effect it has (whether felt immediately or not).