Software ideas for a Web archive?

A friend’s daughter is tasked with developing a Web-accessible archive for a multi-year collection of material that has been generated by an organization within a university. All of the material will be public, so there are no security issues and everything can be indexed by search engines. Ideally all of this can be maintained by non-programmers from Web browsers and minimal technical effort will be required for setup (though perhaps some programming would be useful/needed for an ingestion step).

The material is a mixture of PDFs, images, text, etc. She found some interesting software targeted at this very problem. Examples:

All of these provide for comprehensive tagging of each item, boolean searches, etc. But I wonder/worry that these are overkill. The collection is not especially valuable and I don’t know if people want to take the trouble to craft elaborate queries.

I was thinking that she might be better off using standard WordPress. Every item that is in the archive can become a WordPress post dated whenever the item was created (maybe this can be done via a batch process inserting things into the WordPress tables). She and anyone else involved in the project can tag items with however many tags make sense. At that point users can

  1. search with Google
  2. search by date (WordPress lets you go back and look at posts by date)
  3. search by tag
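As a rough illustration of the batch step, here is a minimal Python sketch that turns archive items into request bodies for the WordPress REST API (`POST /wp-json/wp/v2/posts`) rather than writing to the tables directly. The sample items, the tag IDs, and the helper name `build_post_payload` are all made up for illustration; the tags would have to exist in WordPress already, and actually sending the requests requires authentication, which is not shown.

```python
from datetime import date


def build_post_payload(title, body_html, created, tag_ids):
    """Build the JSON body for one post (hypothetical helper)."""
    return {
        "title": title,
        "content": body_html,
        "status": "publish",
        # Backdate the post to the day the archive item was created.
        "date": f"{created.isoformat()}T00:00:00",
        # The REST API expects numeric tag IDs, not tag names.
        "tags": list(tag_ids),
    }


# Made-up sample items; a real ingestion step would read these from a
# spreadsheet or a directory listing.
items = [
    {
        "title": "Spring 2011 newsletter",
        "body_html": '<a href="/files/newsletter-2011-05.pdf">PDF</a>',
        "created": date(2011, 5, 3),
        "tag_ids": [3, 7],
    },
]

payloads = [build_post_payload(**item) for item in items]
```

Each payload could then be sent with any HTTP client; going through the API rather than the database keeps WordPress responsible for slugs, revisions, and tag counts.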

One advantage for WordPress over the above systems that are built for archiving is that WordPress is much more popular and constantly being improved (changed, anyway!). There are plugin modules available, e.g., to improve full-text searching through PDFs. For those who already have a museum collection organized, there is even a “Culture Object” plugin that is designed to import a collection into WordPress.

Readers: Better ideas?

14 thoughts on “Software ideas for a Web archive?”

  1. Note: I haven’t actually used this system and haven’t talked to anyone who has. Seems to be oriented at researchers making their supplemental content available, so it may not be tuned or appropriate to your use case.
    SHARE’s open API tools and services help bring together scholarship distributed across research ecosystems for the purpose of greater discoverability. However, SHARE does not guarantee a complete aggregation of searched outputs. For this reason, SHARE results should not be used for methodological analyses, such as systematic reviews.

  2. If the contents change only rarely, I recommend using a static site generator like Hugo. Serving it can then be ultra-cheap, the attack surface is minimized, and performance is maximized.

    • I had similar thoughts. Over the years I’ve seen perfectly good data pools be lost due to 1) neglect, 2) software obsolescence, 3) simple hackers. Nobody really wants to take on the job of keeping them current. I can’t count the small organizations that followed the advice to “put it in a database” and either had their data stuck in dBase III or FoxPro databases or had to pay for conversions.

      A static web page written in HTML and NOT a bunch of script will be easily accessible for decades, and be fairly easy to secure. Even more so, easy to back up, reload and move to a new platform if necessary. Even if we don’t have “web browsers” I am sure HTML will manage to be displayed for a long time.

      I’m not familiar with Hugo, but I’d suggest keeping away from final sites with JavaScript and even site generators that produce extensive CSS. You want the data marked up, not formatted for display. Trust the user and the browser. Don’t be a marketer control freak.
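To make the plain-HTML idea concrete, here is a minimal Python sketch (not Hugo, and not anything the commenters actually built) that renders a single script-free “browse by tag” page from a list of items. The item fields (`title`, `href`, `date`, `tags`) and the sample data are assumptions for illustration only.

```python
from collections import defaultdict
from html import escape


def render_tag_index(items):
    """Return one static HTML page listing items under each tag."""
    by_tag = defaultdict(list)
    for item in items:
        for tag in item["tags"]:
            by_tag[tag].append(item)

    lines = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        "<title>Archive by tag</title></head><body>",
        "<h1>Archive by tag</h1>",
    ]
    for tag in sorted(by_tag):
        lines.append(f"<h2>{escape(tag)}</h2>")
        lines.append("<ul>")
        # ISO dates sort correctly as plain strings.
        for item in sorted(by_tag[tag], key=lambda i: i["date"]):
            href = escape(item["href"], quote=True)
            title = escape(item["title"])
            lines.append(f'<li><a href="{href}">{title}</a> ({item["date"]})</li>')
        lines.append("</ul>")
    lines.append("</body></html>")
    return "\n".join(lines)


# Made-up sample items.
items = [
    {"title": "Board minutes", "href": "files/minutes-2012.pdf",
     "date": "2012-02-10", "tags": ["minutes"]},
    {"title": "Annual report", "href": "files/report-2011.pdf",
     "date": "2011-12-01", "tags": ["reports", "minutes"]},
]

page = render_tag_index(items)
```

The output is ordinary markup with no scripts or styling, so it can be regenerated whenever tags change and served from any web server.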

    • Arthur: That’s a great idea, but I don’t know if it is as flexible for querying as WordPress. For example, you might be able to have this do “browse by tag” for one tag, but not combine tags or combine tag+date search. Also, I think it is unlikely that they’ll be happy with the v1.0 site. Inevitably people will want to add/modify tags.

  3. If you’re going minimalist and letting Google do the work, why not just an S3 bucket? (OK, maybe with a static catalog.)

  4. The institution likely already has a subscription to Archive-It. Press a button, done.

    • Thanks for the suggestion. Archive-It seems to be targeted at materials that are already on the Web. I’m not sure that it would support even minimally structured searching either. The material here is a mishmash of stuff that is mostly not on the Web, but nearly all in electronic form.

  5. Sounds like a job for Apache Solr and file management scripts: an upload script where someone uploads the files, with fields for free-form tags, and Solr indexes the file contents and the entered tags on upload. The file retrieval interface could just be the default web server “file browser” interface (e.g., under a “/files” directory) for best compatibility with web crawlers and migration to different systems, with the top level being an interface to Solr for searching (we know web crawlers can be hit and miss at times).
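As a rough sketch of the upload step described above, the Python below only builds the URL for Solr’s extracting request handler (`/update/extract`, also known as Solr Cell), passing the document ID and the entered tags as `literal.*` parameters. It assumes the handler is enabled, that a core named `archive` exists, and that its schema has a multi-valued `tags` field; those names, the base URL, and the helper `build_extract_url` are hypothetical. The file itself would then be POSTed to this URL with an HTTP client, which is not shown.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, core, doc_id, tags):
    """Build the /update/extract URL for one uploaded file (hypothetical helper)."""
    params = [("literal.id", doc_id), ("commit", "true")]
    # Repeating the parameter yields one value per tag in a multi-valued field.
    params += [("literal.tags", tag) for tag in tags]
    return f"{base_url}/solr/{core}/update/extract?{urlencode(params)}"


url = build_extract_url(
    "http://localhost:8983",   # assumed local Solr instance
    "archive",                 # assumed core name
    "report-2011.pdf",
    ["annual report", "2011"],
)
```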

    If you want to create a more “api” friendly application, use CouchDB with the entered metadata in the json document and uploaded file as document attachment.
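A minimal Python sketch of what such a CouchDB document could look like: the entered metadata becomes ordinary JSON fields and the uploaded file goes in as an inline, base64-encoded entry under `_attachments`. The helper name, the metadata fields, and the sample file are made up for illustration; the resulting document would be sent with a `PUT` to the database, which is not shown.

```python
import base64
import json


def build_couchdb_doc(doc_id, metadata, filename, content_type, file_bytes):
    """Combine metadata and one file into a single CouchDB document (hypothetical helper)."""
    doc = dict(metadata)
    doc["_id"] = doc_id
    doc["_attachments"] = {
        filename: {
            "content_type": content_type,
            # Inline attachments carry the file as base64 text.
            "data": base64.b64encode(file_bytes).decode("ascii"),
        }
    }
    return doc


doc = build_couchdb_doc(
    "notes-2012-02-10",
    {"title": "Meeting notes", "tags": ["minutes"], "created": "2012-02-10"},
    "notes.txt",
    "text/plain",
    b"hello",
)
body = json.dumps(doc)  # request body for PUT /<database>/notes-2012-02-10
```

Inline attachments are convenient for small files; for large PDFs or images it may be preferable to upload the attachment separately after creating the document.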

    I assume you have already analyzed the obvious, ? Once you upload a certain number of items, you can get your own collection in

    • [Disclaimer: I used to be Apache Solr / Lucene PMC and contributor]

      I was going to suggest Solr too. It is a very powerful search engine (Lucene under the hood) that can do all that you want. It also comes with Apache Tika (to extract raw text from binary files such as PDF, DOC, etc.), and it supports multiple languages. The issue with using Solr and Tika is that you need to be a developer. Also, Solr is just a search engine API / library, not a search engine application: you have to write your own front-end UI and pump data to it for indexing. However, there are some front-end UIs [1] (I have never used them).

      It sounds like your friend’s daughter is not a developer (why would you have suggested WordPress otherwise?). If so, WordPress, so far, looks to be the ideal solution.

