Abstraction-Filtration-Comparison of a SQL schema or Data Model

by Philip Greenspun, Jin S. Choi, and John Morgan; updated February 2011

Site Home : Software : Abstraction, Filtration, Comparison : One Example


As noted in the parent article, this example walks through the Abstraction-Filtration-Comparison of a SQL data model or "schema". It is taken from the ArsDigita Community System, a 1990s toolkit for building online communities (what ten years later came to be known as "Web 2.0") described in this 1998 book chapter.

Out of the hundreds of tables in the toolkit, for this analysis we've extracted just those for user registration and discussion forum. The source code is in the SQL language for the Oracle relational database management system.

Abstraction Level 0: the raw source code, including comments

create table users (
	user_id			integer not null primary key,
	first_names		varchar(100) not null,
	last_name		varchar(100) not null,
	screen_name		varchar(100),
	constraint users_screen_name_unique unique(screen_name),
	priv_name		integer default 0,
	email			varchar(100) not null unique,
	priv_email		integer default 5,
	email_bouncing_p	char(1) default 'f' check(email_bouncing_p in ('t','f')),
	-- converted_p means password is bogus; we imported this guy
	-- from a system where we only had email address
	converted_p		char(1) default 'f' check(converted_p in ('t','f')),
	password		varchar(30) not null,
	-- we put homepage_url here so that we can
	-- always make names hyperlinks without having to 
	-- JOIN to users_contact
	url			varchar(200),
	-- to suppress email alerts
	on_vacation_until	date,
	-- set when user reappears at site
	last_visit		date,
	-- this is what most pages query against (since the above column
	-- will only be a few minutes old for most pages in a session)
	second_to_last_visit	date,
	-- how many times this person has visited
	n_sessions		integer default 1,
	registration_date	date,
	registration_ip		varchar(50),
	-- state the user is in in the registration process
	user_state		varchar(100) check(user_state in ('need_email_verification_and_admin_approv', 'need_admin_approv', 'need_email_verification', 'rejected', 'authorized', 'banned', 'deleted')),
	-- admin approval system
        approved_date           date,
        approving_user          references users(user_id),
	approving_note       	varchar(4000),
	-- email verification system
	email_verified_date      date,
	-- used if the user rejected before they reach 
	-- the authorized state
	rejected_date		date,
	rejecting_user		integer references users(user_id),
	rejecting_note          varchar(4000),
	-- user was active but is now deleted from the system 
	-- may be revived
	deleted_date		date,	
	deleting_user   	integer references users(user_id),
	deleting_note          varchar(4000),
	-- user was active and now not allowed into the system
	banned_date		date,
	-- who and why this person was banned
	banning_user		references users(user_id),
	banning_note		varchar(4000),
	-- customer relationship manager fields
	crm_state		varchar(50), -- forward reference: references crm_user_states,
	crm_state_entered_date	date, -- when the current state was entered
	-- so user's can tell us their life story
	bio			varchar(4000)
);

create table users_preferences (
	user_id			integer primary key references users,
	prefer_text_only_p	char(1) default 'f' check (prefer_text_only_p in ('t','f')),
	-- an ISO 639 language code (in lowercase)
	language_preference	char(2) default 'en',
	dont_spam_me_p		char(1) default 'f' check (dont_spam_me_p in ('t','f')),
	email_type		varchar(64)
);


create table users_demographics (
	user_id		   	integer primary key references users,
	birthdate		date,
	priv_birthdate		integer,
	sex			char(1) check (sex in ('m','f')),
	priv_sex		integer,
	postal_code		varchar(80),
	priv_postal_code	integer,
	ha_country_code		char(2) references country_codes(iso),
	priv_country_code	integer,
	affiliation		varchar(40),
	-- these last two have to do with how the person
	-- became a member of the community
	how_acquired		varchar(40),
	-- will be non-NULL if they were referred by another user
	referred_by		integer references users(user_id)
);

create table users_contact (
	user_id		integer primary key references users,
	home_phone	varchar(100),
	priv_home_phone	integer,
	work_phone	varchar(100),
	priv_work_phone	integer,
	cell_phone	varchar(100),
	priv_cell_phone	integer,
	pager		varchar(100),
	priv_pager	integer,
	fax		varchar(100),
	priv_fax	integer,
	-- to facilitate users talking to each other and Web server
	-- sending instant messages, we keep the AOL Instant Messenger 
	-- screen name
	aim_screen_name		varchar(50),
	priv_aim_screen_name	integer,
	-- also the ICQ# (they have multi-user chat)
	-- currently this is probably only a 32-bit integer but
	-- let's give them 50 chars anyway
	icq_number		varchar(50),
	priv_icq_number		integer,
	-- Which address should we mail to?
	m_address		char(1) check (m_address in ('w','h')),
	-- home address
	ha_line1		varchar(80),
	ha_line2		varchar(80),
	ha_city			varchar(80),
	ha_state		varchar(80),
	ha_postal_code		varchar(80),
	ha_country_code		char(2) references country_codes(iso),
	priv_ha			integer,
	-- work address
	wa_line1		varchar(80),
	wa_line2		varchar(80),
	wa_city			varchar(80),
	wa_state		varchar(80),
	wa_postal_code		varchar(80),
	wa_country_code		char(2) references country_codes(iso),
	priv_wa			integer,
	-- used by the intranet module
        note			varchar(4000),
        current_information	varchar(4000)
);

create table bboard_topics (
	topic_id	integer not null primary key,
	-- topic name
	topic		varchar(200) unique not null,
	-- read access rights
	-- can be one of any (anonymous), public (any registered user), group
	read_access	varchar(16) default 'any' check (read_access in ('any','public','group')),
	-- write (post new message) access
	-- can be one of (public, group)
	write_access 	varchar(16) default 'public' check (write_access in ('public','group')),
 	users_can_initiate_threads_p	char(1) default 't' check (users_can_initiate_threads_p in ('t','f')),
	backlink	varchar(4000),	-- a URL pointing back to the relevant page
	backlink_title	varchar(4000),	-- what to say for the link back
	blather		varchar(4000),	-- arbitrary HTML text that goes at the top of the page
	-- posting is always restricted to members
	-- is viewing restricted to members or only posting?
	restricted_p	char(1) default 'f' check (restricted_p in ('t','f')),
	primary_maintainer_id	integer not null references users(user_id),
	subject_line_suffix	varchar(40),	-- whether to put something after the subject line, e.g., 'name', 'date'
	notify_of_new_postings_p	char(1) default 't' check (notify_of_new_postings_p in ('t','f')),	-- send email when a message is added?
	pre_post_caveat		varchar(4000),	-- special HTML to encourage user to search elsewhere before posting a new message
	-- 'unmoderated', 'new_threads_by_maintainer', 'new_threads_by_helpers'
	-- 'all_threads_by_maintainer', 'all_threads_by_helpers','answers_only_from_helpers', 'moderated_topics'
	moderation_policy	varchar(40),
	-- used for keeping messages for 50 US states, for example
	-- where each state is a top level posting but not really a 
	-- question
	-- if this isn't NULL then we put in an "about" link
	policy_statement	varchar(4000),
	-- presentation_type  q-and-a (Question and answer format), threads (standard listserve), or ed_com (Question and response pages separated, editiorial language)
	presentation_type	varchar(20) default 'q_and_a' constraint check_presentation_type check(presentation_type in ('q_and_a','threads', 'ed_com', 'usgeospatial')),
	-- stuff just for Q&A  use
	q_and_a_sort_order	varchar(4) default 'asc' not null check (q_and_a_sort_order in ('asc','desc')),
	q_and_a_categorized_p	char(1) default 'f' check (q_and_a_categorized_p in ('t','f')),
	q_and_a_new_days	integer default 7,
	q_and_a_solicit_category_p	char(1) default 't' check (q_and_a_solicit_category_p in ('t','f')),
	q_and_a_cats_user_extensible_p	char(1) default 'f' check (q_and_a_cats_user_extensible_p in ('t','f')),
	-- use the interest level system
	q_and_a_use_interest_level_p 	char(1) default 't' check (q_and_a_use_interest_level_p in ('t','f')),
	-- for popular boards, only show categories for non-new msgs
	q_and_a_show_cats_only_p	char(1) default 'f' check (q_and_a_show_cats_only_p in  ('t','f')),
	-- for things like NE43 memory project and 6.001 pset site 
	-- top level threads can have custom sort keys, e.g., date
	-- of story (rather than date of posting)
	custom_sort_key_p 		char(1) default 'f' check (custom_sort_key_p in  ('t','f')),
	custom_sort_key_name		varchar(50),	-- for display
	-- SQL data type, lowercase, e.g., "date" (ANSI format so that it sorts)
	-- we really only use this for user input validation
	custom_sort_key_type		varchar(20),
	custom_sort_order		varchar(4) default 'asc' not null check (custom_sort_order in ('asc','desc')),
	-- display to user if there aren't message yet
	custom_sort_not_found_text	varchar(4000),
	-- ask user to supply a sort key with new postings
	custom_sort_solicit_p		char(1) default 'f' check (custom_sort_solicit_p in  ('t','f')),
	-- ask user to supply a pretty sort key for display
	-- e.g., "Fall 1997" instead of 9-29-97
	custom_sort_solicit_pretty_p   char(1) default 'f' check (custom_sort_solicit_pretty_p in  ('t','f')),
	custom_sort_pretty_name		varchar(50),	-- for display
	custom_sort_pretty_explanation	varchar(100),	-- why we ask for it
	-- fragment of Tcl code that evaluates to 0 if a sort key is
	-- bad, 1 if OK, assumed to include "$custom_sort_key"
	custom_sort_validation_code	varchar(4000),
	-- for the 2nd round of 6.001 discussion thinking
	category_centric_p		char(1) default 'f' check (category_centric_p in  ('t','f')),
	-- image and file uploading
	uploads_anticipated		varchar(30) check (uploads_anticipated in ('images','files','images_or_files')),
	-- should this forum come up on the user interface?
	active_p			char(1) default 't' check (active_p in ('t','f')),
	group_id			integer references user_groups,
       -- Columns for unified presentation.
       -- default_topic_p is 't' if the web service admin wants that
       -- topic to be a default bboard forum for users
       default_topic_p            varchar(1) default 't' check (default_topic_p in ('t','f')),
       -- the default color set by the web service admin for
       -- displaying topic summary lines for a forum 
       -- in #XXXXXX format (Hexadecimal)
       color			  varchar(7),
       -- the default icon set by the web service admin for displaying
       -- topic summary lines for the forum 
       icon_id			  integer REFERENCES bboard_icons
);

create table bboard_q_and_a_categories (
	topic_id	not null references bboard_topics,	
	category	varchar(200) not null
);

create table bboard (
	msg_id		char(6) primary key,
	refers_to	char(6),
	topic_id	not null references bboard_topics,
	category	varchar(200),	-- only used for categorized Q&A forums
	originating_ip	varchar(16),	-- stored as string, separated by periods
	user_id		integer not null references users,
	one_line	varchar(700)
			constraint bboard_one_line_nn not null,
	message		clob,
	-- html_p - is the message in html or not
	html_p		char(1) default 'f' check (html_p in ('t','f')),
	posting_time	date,
	expiration_days	integer,	-- optional N days after posting_time to expire
	-- really only used for postings that initiate threads
	interest_level	integer check ( interest_level >= 0 and interest_level <= 10 ),
	sort_key	varchar(700),
	-- only used for weirdo things like NE43 memory project and 
	-- 6.001
	-- if this is a DATE, it has to be an ANSI so that it will
	-- sort lexicographically
	-- I guess we should constraint this to be UNIQUE
	custom_sort_key		varchar(100),
	custom_sort_key_pretty	varchar(100),
	-- stuff for US geospatial forums
	epa_region	integer check(epa_region >= 1 and epa_region <= 10),
	usps_abbrev	references states,
	fips_county_code	references counties,
	zip_code	varchar(5),
        urgent_p        char(1) default 'f' not null check (urgent_p in ('t','f'))
);

create table bboard_uploaded_files (
	bboard_upload_id	integer primary key,
	msg_id			not null unique references bboard,
	file_type		varchar(100),	-- e.g., "photo"
	file_extension		varchar(50), 	-- e.g., "jpg"
	-- can be useful when deciding whether to present all of something
	n_bytes			integer,
	-- what this file was called on the client machine
	client_filename		varchar(4000) not null,
	-- generally the filename will be "*msg_id*-*upload_id*.extension"
	-- where the extension was the originally provided (so 
	-- that ns_guesstype will work)
	filename_stub		varchar(200) not null,
	-- fields that only make sense if this is an image
	caption			varchar(4000),
	-- will be null if the photo was small to begin with
	thumbnail_stub		varchar(200),
	original_width		integer,
	original_height		integer
);

Abstraction Level 1: the tables, with columns grouped

TABLE USERS
generated integer primary key
name, split into two columns for first and last
an optional "screenname" alias 
email address, constrained NOT NULL and UNIQUE
information about email address validity
converted_p: was the record imported from an earlier and more primitive system?
password
URL for user's personal home page
vacation status (to suppress email alerts)
usage information (number of visits, date/time of last two visits)
registration information (when and from where? was the registration complete? approved? if so, by whom and why? banned? deleted?)
customer relationship manager (CRM) state
biography

TABLE USERS_PREFERENCES
1:1 relation with USERS table with a reference to the user_id
does the user prefer text or HTML email?
what language does the user want to see the interface in?
has the user said he doesn't want to be spammed?

TABLE USERS_DEMOGRAPHICS
1:1 relation with USERS table with a reference to the user_id
age, sex, geographic location (standard stuff for advertisers)
how did the user come to the community? was he or she referred by another user?

TABLE USERS_CONTACT
1:1 relation with USERS table with a reference to the user_id
home, work, cell, pager, fax phone numbers
instant message service screen names (AIM, ICQ [ouch, we're dating ourselves here!])
home and work mailing addresses
extra columns marked for use by an "intranet module"

TABLE BBOARD_TOPICS
generated integer primary key
topic name
is the topic active or an archive that should be mostly hidden?
how should the topic be displayed relative to other topics? At the top? In a special color? with an icon?
is the topic associated with a user group?
who can read and write to this topic? (any registered user or just members of a group?)
who can initiate a new thread? Just a moderator or any authorized user?
is this topic associated with a Web page elsewhere? if so, store info to present a "backlink"
explanatory information to present on the top-level page for this topic
explanatory information to present to users preparing to post a new thread, e.g., "check the FAQ first"
who is the primary maintainer of this topic and does he or she want email when a new thread is posted?
how should threads be displayed? Grouped into categories? Sorted by date? With a name/date suffix?
what's the moderation policy of this topic? 
if the topic is categorized, can users add new categories?
a 10-column subsystem for using a custom sort key for this forum
can users upload images or other file attachments to forum postings?

TABLE BBOARD_Q_AND_A_CATEGORIES
many:1 relation with BBOARD_TOPICS referencing topic_id
text string for a category to group messages

TABLE BBOARD
many:1 relation with BBOARD_TOPICS referencing topic_id (every message is part of a topic)
unusual primary key and sorting system explained in this book
  6-character string primary key, 
  nullable refers_to column that can point to another row in the same table
  unusual 700-character sort_key so that mesages can sort in a threaded form with a simple lexicographic sort
category (if using the BBOARD_Q_AND_A_CATEGORIES system)
user_id, IP address from which posting came, and time of posting
one-line summary and full message text
format of message (HTML or plain text)
does the message expire? If so, when?
interest level of message, as rated by moderator
urgency of message
columns to support a custom sort key
columns to support geospatialization of the discussion (group messages by region or zip code)

TABLE BBOARD_UPLOADED_FILES
generated integer primary key
many:1 relation with BBOARD referencing msg_id
type and extension of file uploaded
size of file uploaded
name of the file, with and without potential crud from the client computer such as "C:\Users\Joe"
if a photo: caption, thumbnail version, original width, original height

Abstraction Level 2: the tables, with purpose and relations

TABLE USERS
essential information about a registered user, including email address and name
machinery for requiring that users be approved or verified

TABLE USERS_PREFERENCES
1:1 relation with USERS table
what kind of email and language to use with this user?

TABLE USERS_DEMOGRAPHICS
1:1 relation with USERS table with a reference to the user_id
information about how we acquired the user and what kind of ads he or she might find relevant

TABLE USERS_CONTACT
1:1 relation with USERS table with a reference to the user_id
phone numbers, instant message accounts, mailing address 

TABLE BBOARD_TOPICS
one for each discussion forum on the system, e.g., "Canon EOS", "Nature Photography", "Weddings"
moderation and categorization policy

TABLE BBOARD_Q_AND_A_CATEGORIES
many:1 relation with BBOARD_TOPICS referencing topic_id
helper table for topics with enough messages that categorization is required for archives

TABLE BBOARD
many:1 relation with BBOARD_TOPICS referencing topic_id (every message is part of a topic)
one row for each posting from a user
lexicographic sort on the sort_key will show a threaded discussion

TABLE BBOARD_UPLOADED_FILES
many:1 relation with BBOARD referencing msg_id
holds files, e.g., photographs, attached to discussion forum messages
can hold any type of file, e.g., document, video, spreadsheet

Abstraction Level 3: modules

A user-registration module providing for publisher-optional user approval and verification processes.

A discussion forum module providing for 
  multiple topics
  multiple categories per topic
  different moderation policies per topic

Filtration of Abstraction Level 1

Most RDBMS tables have a generated integer primary key. It is also typical to use referential integrity constraints when there is a many:1 relation between tables. These should therefore be filtered out.

Note that in-line comments are presented in a green font.

TABLE USERS
generated integer primary key
name, split into two columns for first and last
an optional "screenname" alias (arguably innovative to realize that IM is big)
email address, constrained NOT NULL and UNIQUE (maybe innovative to use email as unique username)
information about email address validity (probably wouldn't have this in v1.0)
converted_p: was the record imported from an earlier and more primitive system?
password
URL for user's personal home page
vacation status (to suppress email alerts)(added due to bitter experience)
usage information (number of visits, date/time of last two visits)(not full user tracking, so boring)
registration information (when and from where? was the registration complete? approved? if so, by whom and why? banned? deleted?)
customer relationship manager (CRM) state (not obvious to apply CRM to online community)
biography (obvious that users would want to know about each other)

TABLE USERS_PREFERENCES
1:1 relation with USERS table with a reference to the user_id
does the user prefer text or HTML email?
what language does the user want to see the interface in?
has the user said he doesn't want to be spammed?

TABLE USERS_DEMOGRAPHICS
1:1 relation with USERS table with a reference to the user_id
age, sex, geographic location (standard stuff for advertisers) (Web advertising was new in 1995 when this software was written)
how did the user come to the community? was he or she referred by another user?

TABLE USERS_CONTACT
1:1 relation with USERS table with a reference to the user_id
home, work, cell, pager, fax phone numbers
instant message service screen names (AIM, ICQ [ouch, we're dating ourselves here!])
home and work mailing addresses
extra columns marked for use by an "intranet module" (if this were present in a Program B that did not have an intranet module, it would be highly suggestive of copying)

TABLE BBOARD_TOPICS
generated integer primary key
topic name
is the topic active or an archive that should be mostly hidden?
how should the topic be displayed relative to other topics? At the top? In a special color? with an icon? (an idiosyncratic way of recording this info)
is the topic associated with a user group? (more or less required since the larger system has groups)
who can read and write to this topic? (any registered user or just members of a group?) (again, the particular way this was done is an expression)
who can initiate a new thread? Just a moderator or any authorized user?
is this topic associated with a Web page elsewhere? if so, store info to present a "backlink"
explanatory information to present on the top-level page for this topic
explanatory information to present to users preparing to post a new thread, e.g., "check the FAQ first"
who is the primary maintainer of this topic and does he or she want email when a new thread is posted?
how should threads be displayed? Grouped into categories? Sorted by date? With a name/date suffix?
what's the moderation policy of this topic? 
if the topic is categorized, can users add new categories?
a 10-column subsystem for using a custom sort key for this forum
can users upload images or other file attachments to forum postings?

TABLE BBOARD_Q_AND_A_CATEGORIES
many:1 relation with BBOARD_TOPICS referencing topic_id
text string for a category to group messages (once you have the idea of the table, you need this column and there is really only one way to do it)

TABLE BBOARD
many:1 relation with BBOARD_TOPICS referencing topic_id (every message is part of a topic)
unusual primary key and sorting system explained in this book
  6-character string primary key, 
  nullable refers_to column that can point to another row in the same table
  unusual 700-character sort_key so that mesages can sort in a threaded form with a simple lexicographic sort
category (if using the BBOARD_Q_AND_A_CATEGORIES system)
user_id, IP address from which posting came, and time of posting
one-line summary and full message text
format of message (HTML or plain text)
does the message expire? If so, when?
interest level of message, as rated by moderator
urgency of message
columns to support a custom sort key
columns to support geospatialization of the discussion (group messages by region or zip code) (this was added for scorecard.org and is very unusual)

TABLE BBOARD_UPLOADED_FILES
generated integer primary key
many:1 relation with BBOARD referencing msg_id
type and extension of file uploaded
size of file uploaded
name of the file, with and without potential crud from the client computer such as "C:\Users\Joe"
if a photo: caption, thumbnail version, original width, original height

Filtration of Abstraction Level 2

The fact that a bunch of tables exist in a 1:1 relation can be filtered out. To do this rather than a single USERS table with additional columns was an implementation choice by a programmer, perhaps a poor one, and does not affect the capability of the system. Arguably this could be considered an abstraction, to lump multiple tables into one concept, but due to the 1:1 relation among the tables I think this is better done as a filtration step.

Let's suppose that we decide that that "demands of the industry" require basic user registration information, but not an approval process. And "widely accepted programming practices" include multiple topics, but not file attachments or multiple languages for the overall interface. Here's what the filtered Level 2 might look like...

TABLES HOLDING USER INFORMATION 
essential information about a registered user, including email address and name
what kind of email and language to use with this user?
information about how we acquired the user and what kind of ads he or she might find relevant
phone numbers, instant message accounts, mailing address
machinery for publisher-optional approval and verification of users

TABLE BBOARD_TOPICS
one for each discussion forum on the system, e.g., "Canon EOS", "Nature Photography", "Weddings"
moderation and categorization policy

TABLE BBOARD_Q_AND_A_CATEGORIES
many:1 relation with BBOARD_TOPICS referencing topic_id
helper table for topics with enough messages that categorization is required for archives

TABLE BBOARD
many:1 relation with BBOARD_TOPICS referencing topic_id (every message is part of a topic)
one row for each posting from a user
lexicographic sort on the sort_key will show a threaded discussion

TABLE BBOARD_UPLOADED_FILES
many:1 relation with BBOARD referencing msg_id
holds files, e.g., photographs, attached to discussion forum messages
can hold any type of file, e.g., document, video, spreadsheet

Filtration of Abstraction Level 3

One could argue that, at least in 2010, the "demands of the industry being serviced" require that publishers have an ability to register users by email address and name and, optionally, approve or reject particular users and/or require the users verify their email address. Thus the entire user registration module would be filtered out due to "industry demands" or perhaps "widely accepted programming practices".

On the other side, one could argue that this code, written in 1995, should not be held to the standard of what industry demands now. This was an innovative system for the time (Amazon.com launched in July 1995, around the same time as this code base) and part of the reason that publishers now demand these features is that they saw them supported by this software on photo.net, the Hearst Corporation publishing and ecommerce sites, etc.

The same argument could occur regarding the discussion forum module. As this was described in the 1997 book Database-backed Web Sites, which was one of the earliest tutorials on how to build systems like this, one could argue that the reason it is now "widely accepted" is due to the book, which was freely available on the Web even before being published in hardcopy and was read by hundreds of thousands of people.

We'll take a middle-ground here...

A user-registration module providing for publisher-optional user approval and verification processes.

A discussion forum module providing for 
  multiple topics
  multiple categories per topic
  different moderation policies per topic
  file attachments

More

Go back up to the main article.
philg@mit.edu