Data Warehousing for Cavemenby Philip Greenspun for Web Tools Review |
"Another segment of society that has constructed a language of its own is business. ... [The businessman] is speaking a language that is familiar to him and dear to him. Its portentous nouns and verbs invest ordinary events with high adventure; the executive walks among ink erasers caparisoned like a knight. This we should be tolerant of--every man of spirit wants to ride a white horse. ... A good many of the special words of business seem designed more to express the user's dreams than to express his precise meaning."This document is intended to make you buzzword-compliant with the MIS world. In terms simple enough even for an MIT computer science Ph.D. to understand, I'm going to explain OLTP, data warehousing, and OLAP. Kiss that ghetto post-doc goodbye and watch big companies line up to pay you $300/hour to romance their most critical data.
-- last chapter of The Elements of Style, Strunk and White
Let's imagine a conversation between the Chief Information Officer of WalMart and a sales guy from Sybase. Note that I've picked these companies for concreteness but they stand for "big Management Information System (MIS) user" and "big relational database management system (RDBMS) vendor". I picked WalMart because there aren't any near my house and Sybase because I use Informix and Oracle behind my Web sites and can't afford to annoy them any more than I did with my book.
Walmart: "I want to keep track of sales in all of my stores simultaneously."
Sybase: "You need our wonderful RDBMS software. You can stuff data as sales are rung up at cash registers and simultaneously query data out right here in your office. That's the beauty of concurrency control."
So Walmart buys a $1 million Hewlett-Packard multi-CPU server and a $500,000 Sybase license. They buy one of the beginner books I recommend on data modeling and build themselves a nice normalized SQL data model:
product id | store id | quantity sold | date/time of sale |
---|---|---|---|
567 | 17 | 1 | 1997-10-22 09:35:14 |
219 | 16 | 4 | 1997-10-22 09:35:14 |
219 | 17 | 1 | 1997-10-22 09:35:17 |
... |
product id | product name | product category | manufacturer id |
---|---|---|---|
567 | Colgate Gel Pump 6.4 oz. | toothpaste | 68 |
219 | Diet Coke 12 oz. can | soda | 5 |
... |
store id | city id | store location | phone number |
---|---|---|---|
16 | 34 | 510 Main Street | 415-555-1212 |
17 | 58 | 13 Maple Avenue | 914-555-1212 |
... |
city id | city name | state | population |
---|---|---|---|
34 | San Francisco | California | 700,000 |
58 | East Fishkill | New York | 30,000 |
... |
After a few months of stuffing data into these tables, a WalMart executive, call her Jennifer Amolucre asks "I noticed that there was a Colgate promotion recently, directed at people who live in small towns. How much Colgate toothpaste did we sell in those towns yesterday? And how much on the same day a month ago?"
At this point, reflect that because our data model is normalized, this information can't be obtained from scanning one table. A normalized data model is one in which all the information in a row depends only on the primary key. So for example, we don't store the city population in the STORES table. We store that information once/city in the CITIES table and only store a CITY_ID in the STORES table (so if there are multiple stores in the same city, the population will be pulled out of the same slot for all the stores).
Ms. Amolucre's query will look something like this...
select sum(sales.quantity_sold) from sales, products, stores, cities where products.manufacturer_id = 68 -- restrict to Colgate-Palmolive and products.product_category = 'toothpaste' and cities.population < 40000 and sales.datetime_of_sale::date = 'yesterday'::date -- restrict to yesterday and sales.product_id = products.product_id and sales.store_id = stores.store_id and stores.city_id = cities.city_id
If you find this query tough to read, you might want to refresh your knowledge of SQL by reading my little tutorial in Chapter 11 of Database Backed Web Sites. Anyway, the basic idea is that you have to do a 4-way JOIN of some fairly good-sized tables. Moreover, these tables are being updated as Ms. Amolucre's query is executed.
The $1 million HP Unix box crawls to a near-halt. The cash registers stop being able to ring up customers. Eventually the dbadmins realize that the system collapses every time Ms. Amolucre's toothpaste query gets run. They complain to Sybase tech support.
Walmart: "We type in the toothpaste query and our system wedges."
Sybase: "Of course it does! You built an on-line transaction processing (OLTP) system. You can't feed it a decision support system (DSS) query and expect things to work!"
Walmart: "But I thought the whole point of SQL and your RDBMS was that users could query and insert simultaneously."
Sybase: "Uh, not exactly."
Walmart: "Can you fix your system so that it doesn't lock up?"
Sybase: "No. But we made this great loader tool so that you can copy everything from your OLTP system into a separate DSS system at 100 GB/hour."
Basically what Sybase wants Walmart to do is set up another RDBMS installation on a separate computer. Walmart needs to buy another $1 million monster HP Unix box. They need to buy another $500,000 RDBMS license. They also need to hire programmers to make sure that the OLTP data is copied out nightly and stuffed into the DSS system. Walmart is now building the data warehouse.
If you know that Jennifer Amolucre is going to do the toothpaste query every day then you can denormalize the data model for her. If you add a TOWN_POPULATION column to the STORES table and copy in data from the CITIES table, for example, you sacrifice some cleanliness of data model but now Ms. Amolucre's query only requires a 3-way JOIN. If you add MANUFACTURER_ID and PRODUCT_CATEGORY columns to the SALES table, then you don't need to JOIN in the PRODUCTS table.
This requires a lot of custom programming in SQL and procedural languages. In fact, I'd venture to say that more programmers are at work today doing jobs like this than are at work developing shrink-wrap apps. Sometimes MIS managers call this process "data transformation" or "data grooming".
One of the more important functions of a data warehouse in a company that has disparate computing systems is to provide a view for management as though the company were in fact integrated.
For both data warehouse and db-backed Web site, the objective is to quickly write some bug-free programs in a reasonably high-level language. Most of the intelligence is going to be reflected in SQL queries and the rest of it is just mechanically transporting bytes from machine A to machine B.
There are literally thousands of data warehousing tools on the market. All of these exist because popular programming languages like C are too unproductive and error-prone to be used with corporate data. What would I personally use if I had to build a data warehouse tomorrow? AOLserver!
AOLserver? The web server program?
AOLserver can maintain pools of open connections to a variety of relational databases. It includes an interpreter for a safe computer language (Tcl). It is reliable and has lots of nice debugging aids. In fact, I once used AOLserver to perform this very task. I wrote a bunch of Tcl scripts that queried Database A with Data Model 1 (developed by neophytes and requiring 7-way JOINs) and issued SQL inserts into Database B with Data Model 2 (developed by me and requiring a single table scan instead of a 7-way JOIN). It took about 8 hours to transfer 400 Mbytes of data on a SPARC Ultra 2 so I don't think I'm going to win any prizes for speed/byte here but it got the job done. I think 99% of the time was chewed up by the RDBMS installations themselves.
Anyway, the idea here is to underscore that if you're already a wizard at building db-backed Web sites, the task of building a data warehouse is very similar. Moreover, the things that you look for in a Web development tool will serve you well in the data warehousing world.
Suppose you ask "Show me the sales in stores with the best-paid managers." Quite easy in SQL. "Find the correlation between management pay and sales." Quite difficult.
It turns out that SQL is actually the wrong language for many purposes. Just as a trivial example, consider how painful has been your experience with Web sites that directly expose SQL-style set queries to the users. If you specify what you want, you get zero rows back. If you loosen everything up to be sure of getting at least one row, you end up with 1000 results. The SQL query processor finds sets of rows that exactly satisfy the WHERE clauses. It then returns "0 rows selected". It does not return "0 rows selected but the last 17 were killed off by the hour_of_travel constraint and 98 before that were killed off by the cheapest_fare clause."
Walmart: "You said SQL was great and would solve our problems. But we can't ask our most important questions in SQL."At some point, it is much more efficient to suck all the data out of an RDBMS into virtual memory data structures and explore from there. That's more or less what OLAP systems do. If you can't standardize your data along a bunch of canned dimensions (e.g., geographic, time, product category), then you might have to write programs from scratch.
Sybase: "We only sold you an OLTP system and then a DSS system. These questions you've brought to us are online analytical processing (OLAP) queries. You can't expect to run these against a relational database. You need an OLAP system. It will only cost you another $1 million in hardware and $500,000 in software licenses.
Of course, after I wrote the above, a reader helpfully recommended Ralph Kimball's The Data Warehouse Toolkit.
There actually are some fairly good data warehouse tutorials on the net. But I'd rather read a good book (like philg's, for example), so go out and buy Ralph Kimball's. He's the one who really understood that a *big* business opportunity existed where Oracle and DB2 (not to mention older "legacy" databases) feared to tread -- producing sales reports. Actually his stuff is good and readable and full of useful examples and even a little mini-warehouse system he wrote in Microsoft Access (a feat worth a medal in itself).I went to a technical presentation at the Oregon Graduate Institute about a year ago -- their public seminars used to be on things like the internals of Chorus (an early-90s French variant of Mach) or n-dimensional object-oriented whoozis or whatever. Now that it's the Era of the Net, of course, these talks have turned into thinly disguised marketing pitches.
And at this one, the pitch was for a local company developing a "data cube" front-end analysis tool for the aforementioned Moby Data Warehouses, like Kimball's Red Brick or the equivalent in Oracle or what-have-you. And the example the guy used about how great their tool is was that Walmart has used data warehouses to figure out that their sales of diapers and beer go up at 5 pm, so they put them in the display areas at the front of the store. Why? Because dad goes home and gets a call from mom on the cell phone to remind him about picking up the diapers, so of course he then picks up a six pack at the same time. And this takes $20 million and a cafeteria full of DBAs, sysadmins, financial specialists, data extracting/mining/tuning/cleaning/ignoring specialists and so forth to figure out.
I'm being a bit unkind about the corporate value of data warehouses. The presenter also, after vigorous questioning, admitted *another* result of data warehouses. It seems that corporate America has never been able to relate the selling price of a retail item to its production cost. I know that is a big shock -- it was to me -- but it's true. In other words, if you have an EEE-wide shoe size, you can find it in the big stores because they know people require all kinds of shoe sizes and stock accordingly, even though some sizes move slowly or not at all.
Now come the data warehouse reports to inform regional sales managers that the marginal cost of producing those EEEs is higher than the regular ones, in fact it cuts significantly into the overall margins for shoes. So . . . they stop carrying your size and you have to go elsewhere at a much increased direct and indirect cost in time, car fumes and annoyance, not to mention the much larger lifetime sales loss to Big Mall Box, Inc. because they lost you as a customer forever.
This reminds me of the other efficient market created by computers that we know about -- the airline reservation racket, where highly-advanced theorists and programmers have created the yield management system to maximize airline ticket profits. This allows me to sit next to you and pay $400 for a round trip that cost you $1200, even though we bought our seats less than 24 hours apart. It has worked so well that the airlines are now being impelled to force small travel agents -- the bread and butter providers of business travelers and the backbone of their cash flow -- out of business. You say you prefer to book your flights with your friendly local travel agent? Sorry, go with Mega Agency or book 'em online . . . if you dare.
The moral of the story: no doubt data warehouses and yield management systems have their place, but never forget what their place is, especially when the well-dressed sales-, er, "corporate executive" is pitching you the latest Pet Rock, er, data cube presentation technology.
-- Fred Heutte, November 2, 1997
Only a couple days after my last comment, I was paging through the October 27 issue of Information Week and came across a Tandem ad, running double-truck across pages 54-55:[First, you have to imagine a large thirty-ish guy standing in a doorway, wearing a diaper. Ok, got that?]
"AT 6:32 PM EVERY WEDNESDAY, OWEN BLY BUYS DIAPERS AND BEER. DO NOT JUDGE OWEN. ACCOMMODATE HIM."
"If a data mining query discovers that between 6 and 8pm men buy diapers and beer, chances are you'll see more diapers and beer. It's with this kind of valuable -- and sometimes odd -- information that Tandem is helping people in retail, banking, telecommunications and insurance uncover business opportunities."
This also uncovers a recurring theme in modern data processing, one that changed the traditional acronym of GIGO from "garbage in/garbage out" to "garbage in/gospel out".
http://www.journalism.wisc.edu/jargon/jargon_21.html
-- Fred Heutte, November 6, 1997
Whenever I see someone trying to explain or justify data warehousing, almost inevitably the Parable of the Beer and Diapers is brought up. It has entered the apocrypha of the data warehousing community as the single compelling example of the utility of data warehousing. In every case, it is attributed to a different corporation, often to some nameless "large retail store." It has all the markings of an Urban Legend, like the one about the guy who wakes up in a hotel in a foreign city in a bathtub full of ice, a splitting pain in his side, and a note written on the mirror saying "Call 911," his kidneys having been removed by organ bootleggers. Does anyone know if the Beer and Diapers story actually happened, or if some enterprising Tom Vu of the data warehousing world made it up?
-- Jin Choi, February 2, 1999
Here is where the Diapers/Beer NOT LEGEND came from:From: Ronny Kohavi ronnyk@bluemartini.com Date: Thu, 06 Jul 2000 22:16:36 -0700 Subject: Origin of "diapers and beer" For my invited talk at ICML in 1998, I tracked the beer and diapers example further. Check out slide 21 in http://robotics.stanford.edu/~ronnyk/chasm.pdf
Basically, I found the person in Blischok's group who ran the queries. K. Heath ran self joins in SQL (1990), trying to find two itemsets that have baby items, which are particularly profitable. She found this beer and diapers pattern in their data of 50 stores over a day period.
When I talked to her, she mentioned that she didn't think the pattern was significant, but it was interesting.
-- Ronny
-- Tom Mathews, June 23, 2001