Abstraction-Filtration-Comparison in Software Copyright Litigation

by Philip Greenspun, Jin S. Choi, and John Morgan; updated March 2011

Site Home : Software : one article

Suppose that you're a federal judge handling a case regarding software copyright infringement. A plaintiff says that Computer Program B is the same as Computer Program A or that Computer Program B was derived from ("is a derivative work of") Computer Program A. The defendant says that the programs are different. How will this dispute be resolved? The answer, generally, is via an Abstraction-Filtration-Comparison, explained below.

Note that this method can also be applied in trade secret and patent litigation, though it is not explicitly mandated by case law. In patent litigation, for example, this method can be used to show similarity between an accused system and a prior art system, e.g., an earlier version of the accused system that predates the patent priority date.

Regardless of the type of action, one method of software comparison that the courts definitely do not encourage is the parties bringing two printouts, each containing 2 million lines of source code, into the courtroom and asking the judge or a jury to "read these printouts and see for yourself".

Direct Copy

If Program B is a direct copy of Program A, more or less, that's pretty easy to establish using Unix shell commands such as diff. Generally speaking, the software will not be contained in just one file and therefore the effort starts with the find command to identify the source code filenames associated with each program. It is straightforward to determine how many of the filenames are different, how many are new, how many have been dropped. It is also straightforward to run diff over a set of files to determine how many lines of code were added or changed. You can choose to restrict the change report to non-comment lines if desired. It is often the case that the comments at the top of each source code file are updated with a new copyright date, for example.

Direct Derivative

It would be rather brazen for someone to grab someone else's source code and sell it essentially unmodified. It is also uncommon for software to remain static. Even if someone were to steal the source code for Adobe Photoshop, for example, and sell it as AmazingSmartPhotos, by the time the case came before a court there would likely have been substantial changes to the source code due to normal product evolution.

Since we don't have the source code to Photoshop available to us, let's look at a real-world open-source software product, the Ingres database management system. Can we demonstrate quantitatively that Ingres 9.2 was derived from Ingres 9.0?

Number of Files Lines of Code
Ingres 9.0 6826 4,341,445
Ingres 9.2 6920 4,560,764

The "lines of code" column includes comments and blank lines for both systems. The overall code size has grown approximately 5 percent between 9.0 and 9.2. Excluding files that were obviously old and ones provided as examples, there are approximately 101 new files in 9.2; 22 files were removed. The 101 new files contain 47,263 lines of code, approximately one percent of the total code size.

Of the 6920 files in Ingres 9.2., 6804 files are shared with Ingres 9.0. Running the Unix commands "diff" and "diffstat" yields the result that these 6804 files, which contain nearly all of the 4.56 million lines of source code in Ingres 9.2, have approximately 254,157 new or changed lines of source code or comments. Note that counted among these 254,157 lines are changes as simple as changing the copyright date in a comment header.

How do we produce a report like this? Here are some example Unix shell commands:

# Get a list of all the source files, sorted, excluding vdba directory.

find 90 -name \*.c -o -name \*.h -o -name \*.qsc -o -name \*.qsh \
-o -name \*.qc -o -name \*.qh -o -name \*.sc -o -name \*.sh -o -name \*.st \
-o -name \*.sy -o -name \*.yf -o -name \*.yi -o -name \*.lex \
-o -name \*.lfm -o -name \*.s -o -name \*.m64 -o -name \*.mar -o -name \*.msg \
-o -name \*.roc | grep -v vdba | cut -b 1-3 --complement | sort > 90-src-files

find 92 -name \*.c -o -name \*.h -o -name \*.qsc -o -name \*.qsh \
-o -name \*.qc -o -name \*.qh -o -name \*.sc -o -name \*.sh \
-o -name \*.st -o -name \*.sy -o -name \*.yf -o -name \*.yi \
-o -name \*.lex -o -name \*.lfm -o -name \*.s -o -name \*.m64 \
-o -name \*.mar -o -name \*.msg -o -name \*.roc \
| grep -v vdba | cut -b 1-3 --complement | sort > 92-src-files

# Get a diff of changed files
diff -u 90-src-files 92-src-files > changed-files

# Removed files count
egrep -- ^-src changed-files | wc -l

# New files count
grep -- +src changed-files | wc -l

# summing lines of code
cd into 90 or 92
xargs wc -l < ../90-src-files

Indirect Derivative (Abstraction-Filtration-Comparison)

Suppose that Program B is in a different computer language than Program A. This could be the case because of the following: Abstraction-Filtration-Comparison is a test developed by United States Court of Appeals for the Second Circuit (CT, NY, VT) in 1992 for Computer Associates v. Altai ("CA v. Altai").

In the Abstraction step, an expert must produce a set of progressively higher-level descriptions of both programs. The CA v. Altai opinion quotes a law review article:

At the lowest level of abstraction, a computer program may be thought of in its entirety as a set of individual instructions organized into a hierarchy of modules. At a higher level of abstraction, the instructions in the lowest-level modules may be replaced conceptually by the functions of those modules. At progressively higher levels of abstraction, the functions of higher-level modules conceptually replace the implementations of those modules in terms of lower-level modules and instructions, until finally, one is left with nothing but the ultimate function of the program.... A program has structure at every level of abstraction at which it is viewed. At low levels of abstraction, a program's structure may be quite complex; at the highest level it is trivial.
For typical object-oriented software, e.g., in C++, C#, or Java, grouping by methods, classes, and packages is likely to be fruitful. These are the abstractions that the original programmers considered most useful in managing the complexity of the program and sticking to these constructs is the most objective possible approach. Without a similarity of class decomposition, it would be possible to copy some algorithms or methods from Program A into Program B, but it can't be said that Program B itself is a copy of Program A.

Essentially what needs to be produced is a massive block diagram of each program, with detail about what is happening within each method of each class, how the classes are grouped together, and how objects of different classes communicate with each other. To provide a court with the requested progressively higher levels of abstraction, the same diagram is presented but with details elided. Generally speaking the least abstracted levels will not be printable legibly even as a poster, but will have to be broken up into multiple tiles.

Filtration is what comes next. The CA v. Altai opinion says that anything not protectable by copyright should be filtered out. Even more so than with abstraction, this is a process requiring expert judgment and discretion. An algorithm that is in there for reasons of efficiency, e.g., an O[NlogN] sorting algorithm such as Quicksort, should arguably be filtered out because there aren't that many different ways to sort and certainly no competent programmer would choose an O[N^2] method. A procedure call that is there because that's the only way to do something via a standard library, e.g., to send an HTML page back to a Web browser or to display text on an Android phone screen, should be filtered out because it has be "dictated by external factors". Similarly, the fact that the program written for a big company uses a relational database and SQL must be filtered out because it would be required by "demands of the industry being serviced". A catch-all element dictated by external factors is "widely accepted programming practices within the computer industry". That one will give lawyers and experts plenty to argue about!

The 1992 CA v. Altai decision came down during the infancy of the open-source software movement, but the judges were prescient in explicitly saying that "elements taken from the public domain" should be filtered out.

Comparison is the final step. In my opinion, if an expert has done his or her job properly, and been given a sufficient opportunity to explain the underlying technology and the abstraction diagrams, this should be doable by a layperson. How similar is too similar? That's something that should be evident from the diagrams and, while possibly the subject of attorney argument, determined by a finder of fact.

Example: Database Application

A database application is a computer program whose primary purpose is entering and retrieving information from a computer-managed database. Corporate accounting systems, hospital clinical information systems, and ecommerce sites are all examples of database applications. The heart of any database application is the data model or schema, which specifies how data are to be represented in the database management system (DBMS). The vast majority of modern database applications are built in relational database management systems, such as DB2, Microsoft SQL Server, Oracle or a similar RDBMS that uses SQL as the language for defining and querying tables. The best place to start with a database application is with the SQL code that defines the tables and columns.

In this example, we'll look at a discussion forum system built by two of the authors in the mid-1990s. This is part of the open-source ArsDigita Community System and many of its features were added to support the 600,000 registered users of photo.net, an online community started by Philip Greenspun in 1993. You can see the system running right here on this server with a very simple HTML template: /bboard/.

See the full analysis on this separate page.

Example: C Program (sort)

How about a program in a standard imperative programming language? C is the most widely used language, serving as the basis for the Windows and Unix/GNU-Linux operating systems as well as for popular desktop applications such as Microsoft Office and the Firefox browser. Smartphones such as the iPhone have a C-based operating system and all iPhone applications are writtin in Objective-C, a newer variant of the original C language.

For this example, we've chosen a standard open-source data sorting program, GNU sort. The entire program fits into one file with 4626 lines of code.

See the full analysis of sort.

Example: Comparison (more/less)

In this example we perform abstraction, filtration, and finally comparison on two terminal pagers often used interchangably. The results of this process are presented side-by-side to illustrate the differences and similarities between the two programs. Many modern operating systems have gone so far as to link /usr/bin/more to /usr/bin/less leading many people to believe that they are in fact interchangable. Are they right?

See the full analysis of pagers.


About the Authors

Philip Greenspun has been a software developer since 1976 and, in addition to his honest labor, has a Ph.D. in Computer Science from MIT. He has been a software expert witness in a variety of cases, mostly in Federal court.

Jin S. Choi graduated MIT in 1994 with a bachelor's in Electrical Engineering and Computer Science. He has been a working software engineer ever since, with extensive experience building and maintaining Internet applications, including electronic medical record systems and online communities.

John Patrick Morgan is a graduate of the Olin College of Engineering and is a working software engineer. He has been working with Philip Greenspun since 2008.