Data Warehousing

Frozen salmon at Petersburg Fisheries. Petersburg, Alaska.

Data Warehousing for Cavemen

by Philip Greenspun for Web Tools Review

"Another segment of society that has constructed a language of its own is business. ... [The businessman] is speaking a language that is familiar to him and dear to him. Its portentous nouns and verbs invest ordinary events with high adventure; the executive walks among ink erasers caparisoned like a knight. This we should be tolerant of--every man of spirit wants to ride a white horse. ... A good many of the special words of business seem designed more to express the user's dreams than to express his precise meaning."
-- last chapter of The Elements of Style, Strunk and White

This document is intended to make you buzzword-compliant with the MIS world. In terms simple enough even for an MIT computer science Ph.D. to understand, I'm going to explain OLTP, data warehousing, and OLAP. Kiss that ghetto post-doc goodbye and watch big companies line up to pay you $300/hour to romance their most critical data.

Let's imagine a conversation between the Chief Information Officer of WalMart and a sales guy from Sybase. Note that I've picked these companies for concreteness but they stand for "big Management Information System (MIS) user" and "big relational database management system (RDBMS) vendor". I picked WalMart because there aren't any near my house and Sybase because I use Informix and Oracle behind my Web sites and can't afford to annoy them any more than I did with my book.

Walmart: "I want to keep track of sales in all of my stores simultaneously."
Sybase: "You need our wonderful RDBMS software. You can stuff data as sales are rung up at cash registers and simultaneously query data out right here in your office. That's the beauty of concurrency control."

So Walmart buys a $1 million Hewlett-Packard multi-CPU server and a $500,000 Sybase license. They buy one of the beginner books I recommend on data modeling and build themselves a nice normalized SQL data model:

SALES table
product id store id quantity sold date/time of sale

567 17 1 1997-10-22 09:35:14

219 16 4 1997-10-22 09:35:14

219 17 1 1997-10-22 09:35:17

...

SALES table
product id	store id	quantity sold	date/time of sale
567	17	1	1997-10-22 09:35:14
219	16	4	1997-10-22 09:35:14
219	17	1	1997-10-22 09:35:17
...

PRODUCTS table
product id product name product category manufacturer id
567 Colgate Gel Pump 6.4 oz. toothpaste 68

219 Diet Coke 12 oz. can soda 5

...

PRODUCTS table
product id	product name	product category	manufacturer id
567	Colgate Gel Pump 6.4 oz.	toothpaste	68
219	Diet Coke 12 oz. can	soda	5
...

STORES table
store id city id store location phone number
16 34 510 Main Street 415-555-1212

17 58 13 Maple Avenue 914-555-1212

...

STORES table
store id	city id	store location	phone number
16	34	510 Main Street	415-555-1212
17	58	13 Maple Avenue	914-555-1212
...

CITIES table
city id city name state population
34 San Francisco California 700,000

58 East Fishkill New York 30,000

...

CITIES table
city id	city name	state	population
34	San Francisco	California	700,000
58	East Fishkill	New York	30,000
...

After a few months of stuffing data into these tables, a WalMart executive, call her Jennifer Amolucre asks "I noticed that there was a Colgate promotion recently, directed at people who live in small towns. How much Colgate toothpaste did we sell in those towns yesterday? And how much on the same day a month ago?"

At this point, reflect that because our data model is normalized, this information can't be obtained from scanning one table. A normalized data model is one in which all the information in a row depends only on the primary key. So for example, we don't store the city population in the STORES table. We store that information once/city in the CITIES table and only store a CITY_ID in the STORES table (so if there are multiple stores in the same city, the population will be pulled out of the same slot for all the stores).

Ms. Amolucre's query will look something like this...


select sum(sales.quantity_sold) 
from sales, products, stores, cities
where products.manufacturer_id = 68   -- restrict to Colgate-Palmolive
and products.product_category = 'toothpaste'
and cities.population < 40000
and sales.datetime_of_sale::date = 'yesterday'::date  -- restrict to yesterday
and sales.product_id = products.product_id
and sales.store_id = stores.store_id
and stores.city_id = cities.city_id

If you find this query tough to read, you might want to refresh your knowledge of SQL by reading my little tutorial in Chapter 11 of Database Backed Web Sites. Anyway, the basic idea is that you have to do a 4-way JOIN of some fairly good-sized tables. Moreover, these tables are being updated as Ms. Amolucre's query is executed.

The $1 million HP Unix box crawls to a near-halt. The cash registers stop being able to ring up customers. Eventually the dbadmins realize that the system collapses every time Ms. Amolucre's toothpaste query gets run. They complain to Sybase tech support.

Walmart: "We type in the toothpaste query and our system wedges."
Sybase: "Of course it does! You built an on-line transaction processing (OLTP) system. You can't feed it a decision support system (DSS) query and expect things to work!"
Walmart: "But I thought the whole point of SQL and your RDBMS was that users could query and insert simultaneously."
Sybase: "Uh, not exactly."
Walmart: "Can you fix your system so that it doesn't lock up?"
Sybase: "No. But we made this great loader tool so that you can copy everything from your OLTP system into a separate DSS system at 100 GB/hour."

Basically what Sybase wants Walmart to do is set up another RDBMS installation on a separate computer. Walmart needs to buy another $1 million monster HP Unix box. They need to buy another $500,000 RDBMS license. They also need to hire programmers to make sure that the OLTP data is copied out nightly and stuffed into the DSS system. Walmart is now building the data warehouse.

Insight 1

A data warehouse is a separate RDBMS installation that contains copies of data from on-line systems. A data warehouse would not be necessary if RDBMS software worked as advertised. It is merely a $10 million bandaid applied to the limitations of modern computers and RDBMS software.

As long as we're copying...

As long as you're copying data from the OLTP system into the DSS system (data warehouse), you might as well think about organizing and indexing it for faster retrieval. Extra indices on production tables are bad because they slow down inserts and updates. Every time you add or modify a row to a table, the RDBMS has to update the indices to keep them consistent. But in a data warehouse, the data are static. You build indices once and they take up space and sometimes make queries faster and that's it.

If you know that Jennifer Amolucre is going to do the toothpaste query every day then you can denormalize the data model for her. If you add a TOWN_POPULATION column to the STORES table and copy in data from the CITIES table, for example, you sacrifice some cleanliness of data model but now Ms. Amolucre's query only requires a 3-way JOIN. If you add MANUFACTURER_ID and PRODUCT_CATEGORY columns to the SALES table, then you don't need to JOIN in the PRODUCTS table.

This requires a lot of custom programming in SQL and procedural languages. In fact, I'd venture to say that more programmers are at work today doing jobs like this than are at work developing shrink-wrap apps. Sometimes MIS managers call this process "data transformation" or "data grooming".

As long as we're copying from Database A ...

... we might as well copy from Database B. Suppose Walmart acquires Kmart and that Kmart is using a different set of OLTP data models and a different brand of RDBMS to support them, e.g., Informix. It would be too expensive and disruptive to replace Kmart's OLTP system with the Walmart one. But you might want to do toothpaste queries across all the stores from both divisions. So you find some way of also copying data from the Kmart Informix into the Walmart Sybase DSS denormalized data models.

One of the more important functions of a data warehouse in a company that has disparate computing systems is to provide a view for management as though the company were in fact integrated.

Isn't that just what a db-backed Web server does?

It turns out that a company's $10 million data warehousing project and their $100,000 database-backed Web site project involve many of the same tasks and can be supported with the same tools. The essential common tasks are

gather up data from disparate sources
display it in some unified way

For both data warehouse and db-backed Web site, the objective is to quickly write some bug-free programs in a reasonably high-level language. Most of the intelligence is going to be reflected in SQL queries and the rest of it is just mechanically transporting bytes from machine A to machine B.

There are literally thousands of data warehousing tools on the market. All of these exist because popular programming languages like C are too unproductive and error-prone to be used with corporate data. What would I personally use if I had to build a data warehouse tomorrow? AOLserver!

AOLserver? The web server program?

AOLserver can maintain pools of open connections to a variety of relational databases. It includes an interpreter for a safe computer language (Tcl). It is reliable and has lots of nice debugging aids. In fact, I once used AOLserver to perform this very task. I wrote a bunch of Tcl scripts that queried Database A with Data Model 1 (developed by neophytes and requiring 7-way JOINs) and issued SQL inserts into Database B with Data Model 2 (developed by me and requiring a single table scan instead of a 7-way JOIN). It took about 8 hours to transfer 400 Mbytes of data on a SPARC Ultra 2 so I don't think I'm going to win any prizes for speed/byte here but it got the job done. I think 99% of the time was chewed up by the RDBMS installations themselves.

Anyway, the idea here is to underscore that if you're already a wizard at building db-backed Web sites, the task of building a data warehouse is very similar. Moreover, the things that you look for in a Web development tool will serve you well in the data warehousing world.

What if?

Suppose that a new MBA, Giovanni Giovanericco, arrives at Walmart and asks the DSS system "How many toothpaste tubes were sold as a function of advertising dollars spent per person in towns with populations less than 40,000?" No problem. Suppose that the query is changed to "How would we predict toothpaste sales to change if we doubled spending on advertising?" Can you formulate that in SQL?

Suppose you ask "Show me the sales in stores with the best-paid managers." Quite easy in SQL. "Find the correlation between management pay and sales." Quite difficult.

It turns out that SQL is actually the wrong language for many purposes. Just as a trivial example, consider how painful has been your experience with Web sites that directly expose SQL-style set queries to the users. If you specify what you want, you get zero rows back. If you loosen everything up to be sure of getting at least one row, you end up with 1000 results. The SQL query processor finds sets of rows that exactly satisfy the WHERE clauses. It then returns "0 rows selected". It does not return "0 rows selected but the last 17 were killed off by the hour_of_travel constraint and 98 before that were killed off by the cheapest_fare clause."

Walmart: "You said SQL was great and would solve our problems. But we can't ask our most important questions in SQL."
Sybase: "We only sold you an OLTP system and then a DSS system. These questions you've brought to us are online analytical processing (OLAP) queries. You can't expect to run these against a relational database. You need an OLAP system. It will only cost you another $1 million in hardware and $500,000 in software licenses.

At some point, it is much more efficient to suck all the data out of an RDBMS into virtual memory data structures and explore from there. That's more or less what OLAP systems do. If you can't standardize your data along a bunch of canned dimensions (e.g., geographic, time, product category), then you might have to write programs from scratch.

What do you in fact find at a big company?

A big company these days will have three copies of the same data. There will be a huge RDBMS installation for OLTP. Every night data from the OLTP system will be copied into the even bigger RDBMS installation for DSS queries. Every night data from the DSS system will be copied into some kind of OLAP system. This is why you have to buy stock in Seagate.

Buzzwords that you should recognize

The MIS world is filled with managers who make $500,000/year but couldn't type a correct SQL query to save their lives. They've papered over their complete lack of technical knowledge by developing a modest list of jargon. At the end of the day, all that you can really do is type SQL into an RDMBS so most of this jargon obscures the issues, but I guess it does intimidate even more clueless executives who are higher up the ladder. So that people who read Web Tools Review won't be embarrassed in conversation, here is a first stab at a jargon decoder:

datamart: a view or a copy of a subset of the data in a data warehouse. In a company with lots of divisions, it might be overwhelming for "the toothpaste" guys to see all the tables. So a collection of views and tables specific to toothpaste is produced. If the toothpaste guys want to do a lot of DSS queries, they sometimes will get their very own RDBMS installation (more money for HP; more money for Sybase).
EIS: executive information system (same as decision support system (DSS))

More Information

I wish I could say "Go over to IBM's Web site (or Oracle's or Informix's or Sybase's) and look at their much more erudite explanation of what these different kinds of products can do." But even companies that make products for OLTP, DSS, and OLAP don't bother to publish any high-level tutorial information. If you don't already know what all of their stuff is good for then you'll be completely lost on their Web sites.

Of course, after I wrote the above, a reader helpfully recommended Ralph Kimball's The Data Warehouse Toolkit.

philg@mit.edu

Reader's Comments

There actually are some fairly good data warehouse tutorials on the net. But I'd rather read a good book (like philg's, for example), so go out and buy Ralph Kimball's. He's the one who really understood that a *big* business opportunity existed where Oracle and DB2 (not to mention older "legacy" databases) feared to tread -- producing sales reports. Actually his stuff is good and readable and full of useful examples and even a little mini-warehouse system he wrote in Microsoft Access (a feat worth a medal in itself).
I went to a technical presentation at the Oregon Graduate Institute about a year ago -- their public seminars used to be on things like the internals of Chorus (an early-90s French variant of Mach) or n-dimensional object-oriented whoozis or whatever. Now that it's the Era of the Net, of course, these talks have turned into thinly disguised marketing pitches.
And at this one, the pitch was for a local company developing a "data cube" front-end analysis tool for the aforementioned Moby Data Warehouses, like Kimball's Red Brick or the equivalent in Oracle or what-have-you. And the example the guy used about how great their tool is was that Walmart has used data warehouses to figure out that their sales of diapers and beer go up at 5 pm, so they put them in the display areas at the front of the store. Why? Because dad goes home and gets a call from mom on the cell phone to remind him about picking up the diapers, so of course he then picks up a six pack at the same time. And this takes $20 million and a cafeteria full of DBAs, sysadmins, financial specialists, data extracting/mining/tuning/cleaning/ignoring specialists and so forth to figure out.
I'm being a bit unkind about the corporate value of data warehouses. The presenter also, after vigorous questioning, admitted *another* result of data warehouses. It seems that corporate America has never been able to relate the selling price of a retail item to its production cost. I know that is a big shock -- it was to me -- but it's true. In other words, if you have an EEE-wide shoe size, you can find it in the big stores because they know people require all kinds of shoe sizes and stock accordingly, even though some sizes move slowly or not at all.
Now come the data warehouse reports to inform regional sales managers that the marginal cost of producing those EEEs is higher than the regular ones, in fact it cuts significantly into the overall margins for shoes. So . . . they stop carrying your size and you have to go elsewhere at a much increased direct and indirect cost in time, car fumes and annoyance, not to mention the much larger lifetime sales loss to Big Mall Box, Inc. because they lost you as a customer forever.
This reminds me of the other efficient market created by computers that we know about -- the airline reservation racket, where highly-advanced theorists and programmers have created the yield management system to maximize airline ticket profits. This allows me to sit next to you and pay $400 for a round trip that cost you $1200, even though we bought our seats less than 24 hours apart. It has worked so well that the airlines are now being impelled to force small travel agents -- the bread and butter providers of business travelers and the backbone of their cash flow -- out of business. You say you prefer to book your flights with your friendly local travel agent? Sorry, go with Mega Agency or book 'em online . . . if you dare.
The moral of the story: no doubt data warehouses and yield management systems have their place, but never forget what their place is, especially when the well-dressed sales-, er, "corporate executive" is pitching you the latest Pet Rock, er, data cube presentation technology.

-- Fred Heutte, November 2, 1997

Only a couple days after my last comment, I was paging through the October 27 issue of Information Week and came across a Tandem ad, running double-truck across pages 54-55:
[First, you have to imagine a large thirty-ish guy standing in a doorway, wearing a diaper. Ok, got that?]
"AT 6:32 PM EVERY WEDNESDAY, OWEN BLY BUYS DIAPERS AND BEER. DO NOT JUDGE OWEN. ACCOMMODATE HIM."
"If a data mining query discovers that between 6 and 8pm men buy diapers and beer, chances are you'll see more diapers and beer. It's with this kind of valuable -- and sometimes odd -- information that Tandem is helping people in retail, banking, telecommunications and insurance uncover business opportunities."
This also uncovers a recurring theme in modern data processing, one that changed the traditional acronym of GIGO from "garbage in/garbage out" to "garbage in/gospel out".
http://www.journalism.wisc.edu/jargon/jargon_21.html

-- Fred Heutte, November 6, 1997

Whenever I see someone trying to explain or justify data warehousing, almost inevitably the Parable of the Beer and Diapers is brought up. It has entered the apocrypha of the data warehousing community as the single compelling example of the utility of data warehousing. In every case, it is attributed to a different corporation, often to some nameless "large retail store." It has all the markings of an Urban Legend, like the one about the guy who wakes up in a hotel in a foreign city in a bathtub full of ice, a splitting pain in his side, and a note written on the mirror saying "Call 911," his kidneys having been removed by organ bootleggers. Does anyone know if the Beer and Diapers story actually happened, or if some enterprising Tom Vu of the data warehousing world made it up?

-- Jin Choi, February 2, 1999

Here is where the Diapers/Beer NOT LEGEND came from:
From: Ronny Kohavi ronnyk@bluemartini.com Date: Thu, 06 Jul 2000 22:16:36 -0700 Subject: Origin of "diapers and beer" For my invited talk at ICML in 1998, I tracked the beer and diapers example further. Check out slide 21 in http://robotics.stanford.edu/~ronnyk/chasm.pdf
Basically, I found the person in Blischok's group who ran the queries. K. Heath ran self joins in SQL (1990), trying to find two itemsets that have baby items, which are particularly profitable. She found this beer and diapers pattern in their data of 50 stores over a day period.
When I talked to her, she mentioned that she didn't think the pattern was significant, but it was interesting.
-- Ronny

-- Tom Mathews, June 23, 2001

Add a comment