U.S. Government creates jobs for data warehouse experts and recent CS grads

It seems that the U.S. government is now collecting data on all phone calls made within the U.S. (example story from the Guardian). If the Federales are getting data from Verizon it seems safe to assume that they are getting it from all telcos. Let’s think about how much data this might be.

There are 310 million Americans. The FCC publishes some statistics on telephone usage but I can’t find just the simple total number of calls (see http://hraunfoss.fcc.gov/edocs_public/attachmatch/DOC-301823A1.pdf ). Back of the envelope: the typical cell phone plan includes 900 minutes per month and the average call might last 3 minutes, so that’s 300 calls per month or 10 per day (if the 900 minutes are used up, which is probably not reasonable to assume, but this is back-of-the-envelope). Assume another 10 per day made on wired phones (lots of business lines and people whose only job is to answer the phone) and that’s 20 calls per day for the average person who uses a phone. We’ll subtract out the very young and old, so that is 200 million people times 20 calls per day = 4 billion calls per day.

A call record has to include two 10-digit numbers, a date-time stamp, and a number of minutes. If it is wireless there is presumably some additional data about the cell(s), such as location or at least cell ID. That has to be 100 bytes per record. So the government would be collecting a minimum of 400 GB per day of information that would have to be stored in a data warehouse. The data warehouse machinery, such as links to additional dimensions, would at least double this to 800 GB.  Round up to 1 TB to make calculation easier and that is 365 TB of data per year or a minimum of 1.6 petabytes for the years that the Obama Administration has been in power.
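For anyone who wants to plug in their own guesses, the arithmetic above can be sketched in a few lines of Python. The constants are just the rough figures from the text, the units are decimal (1 GB = 10^9 bytes), and the 4.4-year span (roughly January 2009 to mid-2013) is my assumption for "the years that the Obama Administration has been in power"; none of this is based on any real record format.

```python
# Back-of-the-envelope estimate; every constant is a guess taken from the text.
PHONE_USERS = 200_000_000      # Americans minus the very young and old
CALLS_PER_USER_PER_DAY = 20    # 10 wireless + 10 wired
BYTES_PER_RECORD = 100         # two numbers, timestamp, duration, cell info
WAREHOUSE_OVERHEAD = 2         # links to additional dimensions roughly double it


def estimate(users=PHONE_USERS,
             calls_per_user=CALLS_PER_USER_PER_DAY,
             record_bytes=BYTES_PER_RECORD,
             overhead=WAREHOUSE_OVERHEAD):
    """Return (calls per day, raw bytes per day, warehoused bytes per day)."""
    calls_per_day = users * calls_per_user
    raw_bytes_per_day = calls_per_day * record_bytes
    stored_bytes_per_day = raw_bytes_per_day * overhead
    return calls_per_day, raw_bytes_per_day, stored_bytes_per_day


calls, raw, stored = estimate()
print(f"{calls / 1e9:.0f} billion calls per day")
print(f"{raw / 1e9:.0f} GB per day of raw records")
print(f"{stored / 1e9:.0f} GB per day in the warehouse")

# Round 800 GB up to 1 TB per day, as in the text.
TB_PER_DAY_ROUNDED = 1
YEARS = 4.4                    # assumption: January 2009 to mid-2013
tb_per_year = TB_PER_DAY_ROUNDED * 365
total_pb = tb_per_year * YEARS / 1000
print(f"{tb_per_year} TB per year, about {total_pb:.1f} PB over {YEARS} years")
```

Changing any one guess (say, 5 calls per person per day, or 50 bytes per record) scales the result linearly, so even a fairly large error in the inputs leaves the total in the low-petabyte range.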

[I’ll be very interested to see comments from readers who work in this area and can tell me where the above calculations are wrong.]

Questions of civil liberties aside, it would seem that our government has created an interesting data warehousing challenge. Is 1.6 petabytes off the charts for size? No. This article says that eBay is at 9.2 petabytes and some individual telcos are in the same league.

This looks like a great opportunity for young people. Check out the NSA’s careers section, in particular their Computer Science Development Program for recent CS grads where they get rotating assignments and explicit classroom training (and pay that can be as much as \$97,33 per year (plus benefits!)). It is hard to imagine a better opportunity to start a career in Big Data than this three-year program.

[Separately… as long as the government is collecting all of that stuff maybe they could give it back to us in a useful form! A friend was recently involved in a tax dispute with New York State. The state asserted that someone they wanted to tax was a resident. My friend wanted wireline phone records to show that in fact the “New York resident” was making daily phone calls from a wired phone in the Boston area. In our age of unlimited long distance it turns out not to be that easy to get such records. Wouldn’t it have been nice to go to www.nsa.gov, type in your name and some sort of authentication and get an official printout of all the phone calls that you’ve ever made and where you were located at the time? If the government can have these data, why can’t we citizens at least get them too?]

6 thoughts on “U.S. Government creates jobs for data warehouse experts and recent CS grads”

1. Alexey says:

Don’t forget that this kind of data is very suitable for data compression algorithms. Probably 5 to 10 times at least.

2. Jess says:

There isn’t much point to warehousing the data if you can’t search, index, slice, and dice it. That reduces the opportunities for compression. Perhaps a geohash is slightly more compact than a given lat-long pair (although probably not as compact as a somewhat-equivalent cell site ID), but e.g. phone numbers are going to be indexed as-is.

3. Brian says:

“The NSA’s Utah Data Center will be able to handle and process five zettabytes of data, according to William Binney, a former NSA technical director turned whistleblower. Binney’s calculation is an estimate. An NSA spokeswoman says the actual data capacity of the center is classified.”

But it doesn’t say whether he is basing it on inside knowledge or, if it’s an estimate, what his estimation methodology is.