Organizations have complex requirements for their information systems. A period of rapid economic growth can result in insane schedules and demands that a new information system be ready within weeks. Finally, organizations are fickle and have no compunction about changing the requirements mid-stream.
Technical people have traditionally met these challenges ... by arguing over programming tools. The data model can't represent the information that the users need, the application doesn't do what what the users need it to do, and instead of writing code, the "engineers" are arguing about Java versus Lisp versus ML versus C# versus Perl versus VB. If you want to know why computer programmers get paid less than medical doctors, consider the situation of two trauma surgeons arriving at an accident scene. The patient is bleeding profusely. If surgeons were like programmers, they'd leave the patient to bleed out in order to have a really satisfying argument over the merits of two different kinds of tourniquet.
The authors were asked to help Siemens and Boston Consulting Group (BCG) realize a knowledge sharing system for 17,000 telephone switch salespeople spread among 84 countries. This was back in the 1990s when (a) telephone companies were expanding capacity, and (b) corporations invested in information systems as a way of beating competitors.
Siemens had spent 6 months working with a Web development contractor that was expert in building HTML pages but had trouble programming SQL. They'd promised to launch the elaborately specified system 6 weeks from our first meeting. We concluded that many of the features that they wanted could be adapted from the source code behind the photo.net online community but that adding the "knowledge repository" would require 4 programmers working full-time for 6 weeks. What's worse, in looking at the specs we decided that the realized system would be unusably complex, especially for busy salespeople.
Instead of blindly cranking out the code, we assigned only one programmer to the project, our friend Tracy Adams. She turned the human-readable design notebooks into machine-readable database metadata tables. Tracy proceeded to build a program-to-write-the-program with no visible results. Siemens and BCG were nervous until Week 4 when the completed system was available for testing.
"How do you like it?" we asked. "This is the worst information system that we've ever used," they replied. "How do you compare it to your specs?" we asked. "Hmmm... maybe we should simplify the specification," they replied.
After two more iterations the system, dubbed "ICN Sharenet" was launched on time and was adopted quickly, credited by Siemens with $122 million in additional sales during its first year of operation.
If you're programming one Web page at a time, you can switch to the language du jour in search of higher productivity. But you won't achieve significant gains unless you quit writing code for one page at a time. Think about ways to write down a machine-readable description of the application and user experience, then let the computer generate the application automatically.
One thing that we hope you've learned during this course is the value of testing with users and iterative improvement of an application. If an application is machine-generated, you can test it with users, edit the specification based on their feedback, and regenerate the application in a matter of minutes, ready for a new test.
We're going to explore metadata (data about the data model) and automatic code generation in the problem domain of knowledge management.
An organization's second attempt at an information system intended to help beginners and average performers with the expertise of the most capable is typically dubbed knowledge sharing. In a knowledge-sharing system, User A has the ability to put a question in front of the community so that Users B, C, and D can write new material and/or point A to previously authored articles.
If nothing else, a knowledge-sharing system provides a means for employees at the leaf nodes of a hierarchy to converse amongst themselves. Consider the organization depicted in figure 15.1. Suppose that a worker in Singapore has a question that could be answered by a worker in Des Moines. The act of finding the coworker and getting assistance requires going all the way up the hierarchy to the chief executive in London and then back down a different path through the hierarchy to Des Moines. This bottleneck could be eliminated by eliminating the hierarchy. However, most organizations don't want to eliminate their hierarchies. It is the hierarchy that enables the corporation to reduce management complexity by establishing profit-and-loss responsibility at intermediate levels. Better to supplement the hierarchy with an informal mechanism by which the Singapore-based worker can ask for help and the Des Moines-based worker can offer it, i.e., a knowledge-sharing system.
**** insert drawing of corporate hierarchy ****
Your ontology will consist of class definitions and, because a relational database is the source of persistence behind your online community, your implementation of this ontology will consist of SQL table definitions, one table for each class. To assist you in developing a useful ontology for your community, here are a couple of examples.
Let's assume that we'll have at least the following types in our ontology:
languageand one of type
person. Why not link to the
userstable instead? John McCarthy might not be a registered user of the system. Some of the people you'll be referencing, e.g., John von Neumann, are dead.
Each object comprises a set of elements. An element is stored in a column. For every object in the system, we want to record the following elements:
- for the
- date_of_birth, title
- for the
- syntax_example, garbage_collection_p (whether the language has automatic storage allocation like Lisp or memory leaks like C), strongly_typed_p, type_inferencing_p, lexical_scoping_p, date_first_used
- for the
- this is for storing references to books and journal articles so you want all the fields that you'd expect to see when referencing something; include also an abstract field
- for the
- complexity_for_insertion, complexity_for_retrieval (varchars containing "O(1)", "O(N)", etc.)
- for the
- examples of systems are "Multics", "Xerox Alto", "TCP/IP", "MIT Lisp Machine", "Apple Macintosh", "Unix", "World Wide Web". Includes fields for date_of_conception, date_of_birth, organization_name, open_source_p. No need to include fields for the names of developers because we can instead use links to objects of type
personto represent prime developers or promoters.
- for the
- examples of problems are "traveling salesman", "dining philosophers", "sort", "query for inclusion in sorted list". We'll want elements for storing initial conditions and solution criteria. In general, objects of type
problemwill be linked to objects of type
algorithm(algorithms that solve the problem),
publication(papers that set forth the problem), and
person(people who were involved in stating or solving the problem)
- for the
- examples include "Quicksort" and "binary search" elements for pseudo_code and high_level_explanation. In general, objects of type
algorithmwill be linked to objects of type
problem(what need the algorithm addresses),
publication(papers that describe the algorithm or implementations of it), and
person(people who were involved in developing the algorithm)
For an example of what a completed system of this nature might look like, visit Paul Black's Dictionary of Algorithms, Data Structures, and Problems at http://www.nist.gov/dads/.
Object types include:
- for the
- For each kind of airplane flying, there is one entry in this table. An example might be "Cessna 172" or "Boeing 747". We need elements to specify performance such as stall_speed (how slow you can go before tumbling out of the sky), approach_speech (how fast you should go when coming near the runway to land), and cruise_speed. We want elements such as date_certified, manufacturer_name, and manufacturer_address to describe the design.
- for the
- An entry in this table is a specific airplane, very likely a rental machine belonging to a flight school. We want elements such as date_manufactured, ifr_capable_p (legal to fly in the clouds?), and optional_equipment.
- for the
- We want to know where the airport is: lat_long; elevation; relation_to_city (distance and direction from a named town). We want to know whether the airport is military-only, private, or public. We want to know whether or not the airport has a rotating green/white beacon and runway lights. We want to store the frequencies for weather information, contacting other pilots (if non-towered) or the control tower (if towered), and air traffic control for instrument flight clearances. We need to record runway lengths and conditions. An airport may have several runways, however, thus giving rise to a many-to-one relation, which is why we model runways separately and link them to airports.
- for the
- number (e.g., "09/27"), length, condition. Note that the runway number implies the magnetic orientation: 09 implies a heading of 090 or landing facing magnetic east; if the wind favors a landing on the same strip of asphalt in the opposite direction, you're on 27, which implies a heading of 270 or due west (36 faces north; 18 faces south).
- for the
- The U.S. Federal Aviation Administration maintains a nationwide network of Very High Frequency Omni Ranging beacons (VORs). These transmit two signals, one of which is constant in phase regardless of an airplane's bearing to the VOR. The second signal varies in phase as one circles a VOR. Thus a VOR receiver in the airplane can compare the phase of the two signals and determine that an airplane is, for example, on the 123-degree radial from the VOR. If you didn't have a Global Positioning System receiver in your airplane, you'd determine your position on the chart by plotting radials out from two VORs. For a navaid, we need to store its type (could be an old non-directional beacon, which simply puts out an AM radio-style broadcast), frequency, position, and Morse code ID (you want to listen to the dot-dash pattern to make sure that you're receiving the proper navaid).
- for the
- menu_excerpt, hours, distance_from_airport, phone_number, url, email, street_address
- for the
- price, distance_from_airport, phone_number, url, email, street_address
For an example of a running system along these lines, visit http://www.airnav.com/airports/ and type in the name or code for your favorite airport.
For ease of evaluation and interaction with your classmates, we suggest placing the user pages at http://yourservername/km/.
Here's an example SQL data model for a metadata repository:
-- note that this is designed for the Oracle 8i/9i RDBMS -- we'll have one row in this table for every object type -- and thus for every new SQL table that gets defined; an -- object type and its database table name are the same; -- Oracle limits schema objects to 30 characters and thus -- we limit a table_name to 21 characters so that we can -- have some freedom to create schema objects whose names -- are prefixed with an object type -- a "pretty name" is a synonym used when presenting pages -- to users; the prettiness could be as simple as replacing -- underscores with spaces or spelling out abbreviations; -- e.g., for an object type of "airplane_design", the pretty -- form might be "airplane design", and pretty_plural -- "airplane designs" create table km_metadata_object_types ( table_name varchar(21) primary key, pretty_name varchar(100) not null, pretty_plural varchar(100) ); -- here is the table for elements that are unique to an object type -- (the housekeeping elements can be defined implicitly in the source -- code for the application generator); there will be one row in -- the metadata table per element create table km_metadata_elements ( metadata_id integer primary key, table_name not null references km_metadata_object_types, column_name varchar(30) not null, pretty_name varchar(100) not null, abstract_data_type varchar(30) not null, -- ie. "text" or "shorttext" "boolean" "user" -- this one is not null except when abstract_data_type is "user" oracle_data_type varchar(30), -- "varchar(4000)" -- e.g., "not null" or "check foobar in ('christof', 'patrick')" extra_sql varchar(4000), -- values are 'text', 'textarea', 'select', 'radio', -- 'selectmultiple', 'checkbox', 'checkboxmultiple', 'selectsql' presentation_type varchar(100) not null, -- e.g., for textarea, this would be "rows=6 cols=60", for select, Tcl list, -- for selectsql, an SQL query that returns N district values -- for email addresses mailto: presentation_options varchar(4000), -- pretty_name is going to be the short prompt, -- e.g., for an update page, but we also need something -- longer if we have to walk the user through a long form entry_explanation varchar(4000), -- if they click for yet more help help_text varchar(4000), -- note that this does NOT translate into a "not null" constraint in Oracle -- if we did this, it would preclude an interface in which users create rows incrementally mandatory_p char(1) check (mandatory_p in ('t','f')), -- ordering in Oracle table creation, 0 would be on top, 1 underneath, etc. sort_key integer, -- ordering within a form, lower number = higher on page form_sort_key integer, -- if there are N forms, starting with 0, to define this object, -- on which does this go? (relevant for very complex objects where -- you need more than one page to submit) form_number integer, -- for full text index include_in_ctx_index_p char(1) check (include_in_ctx_index_p in ('t','f')), -- add forms should be prefilled with the default value default_value varchar(200), check ((abstract_data_type not in ('user') and oracle_data_type is not null) or (abstract_data_type in ('user'))), unique(table_name,column_name) );
text/plain. You can save this to your local file system as
km-generated.sqland feed it to your SQL client when you're satisfied.
In addition to the housekeeping elements that you've defined for your
application, each object table should have an
column. The value of this column should be unique across all of the
tables in the km module, which is easy to do in Oracle if you use a
single sequence to generate all the keys. Given unique object IDs
across types, if you were to add a
table, you'd be able to publish cleaner URLs that pass around only
object IDs rather than object IDs and types.
In addition to the metadata-driven object table definitions, your script should define a generalized mapping table to support links between knowledge objects. Here's an Oracle-syntax example:
Notice that this table allows the users to map an object to any other object in the system, regardless of type.
create table km_object_object_map ( object_id_a integer not null, object_id_b integer not null, -- the objects are uniquely identified above but let's save ourselves -- hassle by recording in which tables to find them table_name_a not null references km_metadata_object_types, table_name_b not null references km_metadata_object_types, -- User-entered reason for relating two objects, e.g. -- to distinguish between John McCarthy the developer of -- Lisp and Gerry Sussman and Guy Steele, who added lexical scoping -- in the Scheme dialect map_comment varchar(4000), creation_user not null references users, creation_date date default sysdate not null, primary key (object_id_a, object_id_b) );
For simplicity, assume that associations are bidirectional. Suppose
that a knowledge author associates the Huffman encoding algorithm
(used in virtually every compression scheme, including JPEG) with the
person David A. Huffman (1925-1999; an MIT graduate student at the
time of his invention, which was submitted as a term paper). We
should also interpret that to mean that the person David A. Huffman is
associated with the algorithm for Huffman encoding. This is why the
km_object_object_map have names such as
"object_id_a" instead of "from_object".
In an Oracle database, the primary key constraint above has the side effect of creating an index that makes it fast to ask the question "What objects are related to Object 17, where Object 17 happens to appear in the A slot?" For efficiency in querying "What objects are related to Object 17, where Object 17 happens to appear in the B slot?", create a concatenated index on the columns in the reverse order from that of the primary key constraint.
The "Trees" chapter of SQL for Web Nerds, at http://philip.greenspun.com/sql/trees, gives some examples of concatenated indices. If you're using Oracle you may want to read the composite indices section of the Performance Guide and Reference manual (online and in the product documentation) and the SQL Reference manual's section on "Create Index".
drop-tables-generate. This should generate DROP TABLE statements from the metadata tables. You probably won't get your data model right the first time, so you might as well be ready to clear out the RDBMS and start over.
Feed the database management system the results of your data model creation and clean-up scripts until you stop getting error messages.
This traditional approach has some drawbacks. First, it adds a mouse click before the administrator can see any user names. Ideally, you want every page of an application to display information and/or potential actions rather than pure bureaucracy and navigation. Second, and more seriously, this approach doesn't scale very well. When an administrator says "I need to see users who've registered within the last 30 days, who've contributed more than 4 product reviews, and who've bought at least $100 of stuff so that I can spam them with a coupon," another option must be added to the list. Eventually the navigation page groans with choices.
Imagine instead that the very first mouse click takes the administrator to a page that shows all the users who've registered in the last 30 days, in one big long list. At the top are sliders. Each slider controls a dimension, each of which can restrict or expand the number of items in the list. Here are some example dimensions for a community e-commerce site such as amazon.com:
How to implement dimensional controls? Sadly, there is no HTML tag that will generate a little continuous slider. You can simulate a slider by offering, for each dimension, a set of discrete points along the dimension, each of which is a simple text hyperlink anchor. For example, for content quality you might offer "4 or better", "3 or better", "2 or better", "all".
/km/directory. At the very least, the index page should display an unordered list of object types and, next to each type, options to "browse" or "create". You don't have any information in the database, so you should build a script called
object-createfirst. This page will query the metadata tables to build a data entry form to create a single object of a particular type.
When your object creation pipeline is done inserting the row into the
database, it should redirect the author's browser to a page where the
object is displayed (name the script
you don't have a better idea). Presumably the original author has
authority to edit this object and therefore this page should display
small hyperlinks to edit single fields. All of these links can target
object-edit-element with different arguments.
The object display page should also summarize all the currently linked
objects and have an "add link" hyperlink whose target is
The page returned by
link-add will look virtually
identical to the index page, i.e., a list of object types. Each
object type can be a hyperlink to a multi-purpose script at
one-type-browse. When called with only a
table_name argument, this page will display a table of object
names with dimensional controls at the top. The dimensions should be
"mine|everyone's" and "creation date". The user ought to be able to
click on a table header and sort by that column.
When called with extra arguments,
pass those arguments through to
script very similar to
object-display, but only showing
enough information that the author can positively identify the object
and with the additional ability to accept arguments for a potential
It is important to create an incentive system that rewards users for exhibiting the desired behavior. At amazon.com, for example, the site owners want users to write a lot of reader reviews. At the same time, they apparently don't want to pay people to write reviews. The solution circa 2003 is to recognize contributors with a "reviewer rank". If a lot of other Amazon users have clicked to say that they found your reviews useful, you may rise above 1000 and a "Top 1000 Reviewer" icon appears next to your name. From the home page of Amazon, navigate to "Friends and favorites" (under "Special Features"). Then, underneath "Explore", click on "Top Reviewers". Notice that some of the top 10 reviewers have written more than 5000 reviews, all free of charge to Amazon!
What makes sense to reward in an online community? We could start with a couple of obvious activities: content authoring and question answering. Every night our system could query the content tables and update user ranks according to how many articles and answers they'd posted into the database. Is it really a good idea to reward users purely on the basis of volume? Shouldn't we give more weight to content that has actually helped people? For example, suppose that there are ten answers to a discussion forum question. It makes sense to give the maximum reward to the author of the answer that the person asking the question felt was most valuable. If a question can be marked "urgent" by the asker, it probably makes sense to give greater rewards to people who answer urgent questions than non-urgent ones. An article is nice, but an article that prompts another user to say "I reused this idea in my area of the organization" is much nicer and should be encouraged with a greater reward.
create sequence km_object_view_id; create table km_object_views ( object_view_id integer primary key, -- which user user_id not null references users, -- two columns to specify which object object_id integer not null, table_name varchar(21) not null, view_time timestamp(0) not null, reuse_p char(1) default 'f' check(reuse_p in ('t','f')) );
object-view-oneso that it will insert a row into the
km_object_viewstable if and only if there isn't already a log row for this user/object pair within twenty-four hours. You can do this with the following procedure:
Let's start with Oracle. Here's an example of an INSERT statement that only has an effect if there isn't already a row in the table:
The structure of this statement is "insert into KM_OBJECT_VIEWS the result of querying the 1-row system table DUAL". We're not pulling any data from the DUAL table, only including constants in the SELECT list. Nor is the WHERE clause restricting results based on information in the DUAL table; it is querying KM_OBJECT_VIEWS. This is a seemingly perverse way to use SQL, but in fact is fairly conventional because there are no IF statements in standard SQL.
insert into km_object_views (object_view_id, user_id, object_id, table_name, view_time) select km_object_view_id.nextval, 227, 891, 'algorithm', current_timestamp(0) from dual where 0 = (select count(*) from km_object_views where user_id = 227 and object_id = 891 and view_time > current_timestamp - interval '1' day);
Suppose, however, that two copies of this INSERT start simultaneously. Recall that a transaction processing system provides the ACID guarantees: Atomicity, Consistency, Isolation, and Durability. Oracle's implementation of isolation, "the results of a transaction are invisible to other transactions until the transaction is complete", works by giving each user a virtual version of the database as it was when the transaction started.
|Session A||Session B|
Sends INSERT to Oracle at system change number ("SCN", a pseudo-time internal to Oracle) 30561.
Oracle counts the rows in
Oracle inserts a row into
Sends INSERT to Oracle at system change number 30562, a tick after
Session A started its transaction but several ticks before Session A
accomplished its insertion.
Oracle, busy with other users, doesn't start counting rows in
Having found 0 rows in the count, the INSERT proceeds to insert one row, thus creating a duplicate log entry.
Now consider the same query running in SQL Server:
There are minor syntatic differences from the Oracle statement above, but the structure is the same. A new row is inserted only if no matching rows are found within the last twenty-four hours.
insert into km_object_views (user_id, object_id, table_name, view_time) select 227, 891, 'algorithm', current_timestamp where 0 = (select count(*) from km_object_views where user_id = 227 and object_id = 891 and datediff(hour, view_time, current_timestamp) < 24)
SQL Server achieves the same isolation level as Oracle ("Read Committed"), but in a different way. Instead of creating virtual versions of the database, SQL Server holds exclusive locks during data-modification operations. In the example above, Session B's INSERT cannot begin until Session A's INSERT has completed. Once it is allowed to begin, Session B will see the result of Session A's insert, and will therefore not insert a duplicate row.
More: See the "Understanding Locking in SQL Server" chapter of SQL Server Books Online, the Microsoft SQL Server documentation.
Whenever you are performing logging, it is considerate to do it on the server's time, not the user's. In many Web development environments, you can do this by calling an API procedure that will close the TCP connection to the user, which stops the upper-right browser corner icon from spinning/waving. Meanwhile your thread (IIS, AOLserver, Apache 2) or process (Apache 1.x) is still alive on the server and can run whatever code is necessary to perform the logging. Many Web servers allow you to define filters that run after the delivery of a page to the user.
Help with date/time arithmetic: see the "Dates" chapter of SQL for Web Nerds at http://philip.greenspun.com/sql/dates.
object-view-oneto add a "I reused this knowledge" button. This should link to
object-mark-reused, a page that updates the
reuse_pflag of the most recent relevant row in
km_object_views. The page should raise an error if it can't find a row to update.
object-view-onethat does its logging on the server's time, explain the concurrency problem that arises in Exercise 7 and talk about ways to address it.
Write up your solutions to these non-coding exercises either in your
km module overview document or in a file named
metadata-exercises in the same directory.
km_object_viewsthat will make the code in Exercises 6 and 7 go fast.
/km/admin/statisticsto show, by day, the number of objects viewed and reused. This report should be broken down by object type and all the statistics should be links to "drill-down" pages where the underlying data are exposed, e.g., which actual users viewed or reused knowledge and when.