Tool for retrieving a subset of Gmail correspondence?

Folks:

It is pretty common in litigation for a party to demand, via discovery, “all email correspondence between the defendant and Joe Smith” or “all email correspondence in which ‘rebar failure’ is discussed.”

I’m wondering if there is a good automated tool for extracting this from Gmail or another IMAP source.

It would be nice if the tool could do the selection by sending a search term to Gmail. But it would also be acceptable if the tool were capable only of grabbing one IMAP folder. In that case the Gmail user could use Gmail tools to search for a particular recipient or subject line substring, then put all of the results in a folder named “JoeSmith” and have the tool pull all of JoeSmith.
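Both variants described above (passing a search term to Gmail, or grabbing one folder such as "JoeSmith") can be sketched with Python's standard `imaplib`. This is a minimal sketch, not a finished tool. `gmail_search_criteria` and `dump_folder` are hypothetical helper names, and `X-GM-RAW` is Gmail's own IMAP search extension, so the query variant is assumed to work only against Gmail and not against other IMAP servers.

```python
import imaplib  # used by the caller to build the connection (see note below)
from pathlib import Path


def gmail_search_criteria(query=None):
    """Return IMAP SEARCH arguments: every message, or a Gmail-style query."""
    if query is None:
        return ("ALL",)
    # X-GM-RAW is a Gmail-only extension; the query must be sent as a quoted string.
    escaped = query.replace("\\", "\\\\").replace('"', '\\"')
    return ("X-GM-RAW", f'"{escaped}"')


def dump_folder(conn, folder, out_dir, query=None):
    """Save each matching message in `folder` as a raw .eml file in `out_dir`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Open read-only so that fetching does not mark anything as read.
    status, _ = conn.select(f'"{folder}"', readonly=True)
    if status != "OK":
        raise RuntimeError(f"cannot open folder {folder!r}")
    status, data = conn.search(None, *gmail_search_criteria(query))
    if status != "OK":
        raise RuntimeError("search failed")
    saved = []
    for num in data[0].split():
        seq = num.decode()
        status, parts = conn.fetch(seq, "(RFC822)")
        if status != "OK":
            continue
        raw = parts[0][1]  # full message as bytes
        path = out / f"{int(seq):06d}.eml"
        path.write_bytes(raw)
        saved.append(path)
    return saved
```

To try it, `conn` would be an `imaplib.IMAP4_SSL("imap.gmail.com")` object on which `login()` has already been called (Gmail typically wants an app password or OAuth for this). Then `dump_folder(conn, "JoeSmith", "out")` pulls the pre-filtered folder, while `dump_folder(conn, "[Gmail]/All Mail", "out", query="from:joe")` lets Gmail do the selection. The "All Mail" folder name is an assumption and varies with the account's language.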

Another nice-to-have feature would be to preserve the conversation threading, but I don’t think it is necessary to fulfill the requirements of the legal discovery process.

Output could be one text file per conversation, one PDF file per conversation, or one huge file with page breaks between messages or, preferably, between conversations.
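As a rough illustration of the threading and output ideas above, here is a sketch that groups already-downloaded messages into conversations using their `Message-ID`, `In-Reply-To`, and `References` headers, then joins them into one text blob with a form feed (page break) between conversations. The function names are hypothetical, and header-based grouping is an assumption on my part; Gmail's own conversation view may group messages differently.

```python
from collections import defaultdict
from email import message_from_string
from email.utils import mktime_tz, parsedate_tz


def sent_time(msg):
    """Timestamp from the Date header, or 0 if it is missing or unparseable."""
    parsed = parsedate_tz(msg.get("Date", ""))
    return mktime_tz(parsed) if parsed else 0


def group_conversations(messages):
    """Group messages linked by References/In-Reply-To, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    keyed = []
    for i, msg in enumerate(messages):
        own = (msg.get("Message-ID") or f"<no-id-{i}>").strip()
        linked = (msg.get("References", "") + " " + msg.get("In-Reply-To", "")).split()
        for other in linked:
            parent[find(other)] = find(own)  # merge the two threads
        keyed.append((own, msg))

    threads = defaultdict(list)
    for own, msg in keyed:
        threads[find(own)].append(msg)
    ordered = [sorted(t, key=sent_time) for t in threads.values()]
    return sorted(ordered, key=lambda t: sent_time(t[0]))


def render(conversations):
    """One text blob: a rule between messages, a form feed between conversations."""
    blocks = []
    for thread in conversations:
        parts = []
        for msg in thread:
            if msg.is_multipart():
                body = "[multipart body omitted in this sketch]"
            else:
                body = msg.get_payload()
            parts.append(
                f"From: {msg.get('From')}\nDate: {msg.get('Date')}\n"
                f"Subject: {msg.get('Subject')}\n\n{body}"
            )
        blocks.append("\n\n-----\n\n".join(parts))
    return "\f".join(blocks)
```

Writing one file per conversation instead would only mean writing each element of `blocks` to its own file. Producing PDF is left out here, since it would need a third-party library.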

To me the most obvious way to implement this is as an IMAP client to Gmail. However, given that Gmail already does most of what one would want, I wonder if it wouldn’t make more sense to implement this as browser scripts that simulate user actions and clicks within a Web browser.

Does this exist already? If not, does it seem like it would make a welcome open-source software product?

Thanks in advance for ideas.

Note: About a year ago, I developed a specification for software to do something fairly similar. This was for a friend’s startup that was going to build databases of corporate email, but the company ended up moving in a different direction: Extracting Conversation Timelines from Email

[Update: I should note that the Gmail web interface already does virtually everything that is required above. All that the Google programmers would need to add is an option to “print everything to PDF” where “everything” is a set of search results, a folder, or all the messages that are selected via a checkbox.]

15 thoughts on “Tool for retrieving a subset of Gmail correspondence?”

  1. How about using Thunderbird (the desktop email client) to interface with Gmail? It plays well with Gmail (my preference is for all my email to be saved locally, and it does that for all my Gmail accounts, some of which use POP and some IMAP), and once you have the mail in Thunderbird, you can run searches against it through existing UI, or, if that’s not enough, write an add-on to run queries against the email database. There may also be suitable existing Thunderbird add-ons.

    Regarding exporting messages to file, trying it right now shows that messages are saved one to a file, with a .eml extension, and are text files, which sounds like it could meet your requirements after running that output through a script to rename/reformat/combine.

  2. How long would it take to download 10 GB of email into Thunderbird? When I look at just my personal Gmail account it says that I have 105,000 messages in “All Mail”. I can’t figure out how many GB this is because the only report I have of my usage is, I think, summing Picasa, Google Drive, etc.

  3. This sounds just like the kind of thing Postini services might do. (At least worth looking into.)

  4. I would use getmail, a Python app I use to keep my Gmail backed up, on the advice of an ex-Google employee who uses it for the same purpose. Getmail stores email in the very common Maildir format, which means one file per mail. Not as user-friendly as archiving conversation threads per file, perhaps, but probably sufficient for lawyers, who can use their armies of staff to properly re-assemble conversations (billable hours!).

    Getmail can be configured to retrieve from a particular folder:
    http://pyropus.ca/software/getmail/configuration.html#retriever-parameters
    http://pyropus.ca/software/getmail/configuration.html#retriever-examples

    ...so I would go with your idea to pre-filter everything into a Gmail tag. Then retrieve via getmail on that tag. I do not have experience with how Gmail converts tag names to IMAP folder names, but I would think that should be Google-able, or that Gmail does the obvious thing (tag name == folder name).

    I’m not personally aware of a user script for this, but never really went looking for one.

  5. You can use Thunderbird with Gmail via IMAP. No need to download all your mail, the headers will be enough for using Thunderbird’s search. You can do a custom search or a simple search by ‘sender’ or ‘recipient’ to be the other party in the conversation.

    Using the ImportExportTools extension for Thunderbird, you can save the selected messages into a variety of formats.

    Also, the Extra Folder Columns extension for Thunderbird will let you see how much space each of your Gmail folders occupies (over IMAP, without downloading everything).

  6. I use the Mac Mail app and all my mail is up on servers and accessed through IMAP. Apple spotlight indexes the whole thing, and I frequently search just as you are describing.

  7. Larry: As discussed in the original posting, searching is not the problem (though I guess it would be more luxurious to do it on a $2000 Apple computer and also pay them every year for a service that Google provides free). The problem is turning a set of search results into a big PDF or folder of text files that Acrobat can process.

  8. Phil, sorry, I thought you had a Mac at home so I was trying to propose a solution with no additional cost to you. Assuming you did, you could create a smart mailbox with all your search terms, then select all the messages and save them to a rich-text file, which can then be easily saved as a PDF. No need for any fancy tools if you just want to get the messages out to satisfy a litigation request. I’m sure some mail apps for the PC can handle the job as well. To me the hard part is the search. Gmail can do it, but most other mail clients do a poor job of it. It seems most people have multiple email accounts, so using IMAP and a mail client to aggregate them all makes sense, based on your use case. Then do the search and export based on that. I don’t know anyone who only uses Gmail, so a solution that relies on Gmail is also pretty limiting.

  9. Use a tool like offlineimap to dump messages into a local maildir. Use notmuch (notmuchmail.org) to index it, then search and output files of individual messages, or display a thread.

    notmuch search --output=files from:joe

    notmuch show `notmuch search --output=threads from:bob`

  10. (I pushed “enter” too soon) … and there are other apps-script APIs for loading stuff into docs or spreadsheets (and probably other formats) that you can then download later.
    You can also set these up to trigger at particular times, so you could have a daily script that runs over your Gmail and applies labels according to some more complex criteria, for example.
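Several of the comments above end up with a local Maildir (the format getmail and offlineimap write) that then needs filtering by correspondent. A minimal sketch of that step, using only Python's standard `mailbox` module, might look like the following. `involves` and `select_from_maildir` are hypothetical names, and this is a plain linear scan over headers, not a substitute for an indexed search tool such as notmuch.

```python
import mailbox
from email.utils import getaddresses


def involves(msg, address):
    """True if `address` appears in the From, To, or Cc headers of `msg`."""
    fields = []
    for name in ("From", "To", "Cc"):
        fields.extend(str(value) for value in msg.get_all(name, []))
    found = {addr.lower() for _, addr in getaddresses(fields)}
    return address.lower() in found


def select_from_maildir(path, address):
    """Return every message in the Maildir at `path` that involves `address`."""
    box = mailbox.Maildir(path, create=False)
    return [msg for msg in box if involves(msg, address)]
```

For example, `select_from_maildir("/path/to/Maildir", "joe@example.com")` (both arguments are placeholders) would return the messages exchanged with that address, which could then be written out one per file or combined.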
