Note that this method can also be applied in trade secret and patent litigation, though it is not explicitly mandated by case law. In patent litigation, for example, this method can be used to show similarity between an accused system and a prior art system, e.g., an earlier version of the accused system that predates the patent priority date.
Regardless of the type of action, one method of software comparison that the courts definitely do not encourage is the parties bringing two printouts, each containing 2 million lines of source code, into the courtroom and asking the judge or a jury to "read these printouts and see for yourself".
Since we don't have the source code to Photoshop available to us, let's look at a real-world open-source software product, the Ingres database management system. Can we demonstrate quantitatively that Ingres 9.2 was derived from Ingres 9.0?
              Number of Files    Lines of Code
Ingres 9.0    6826               4,341,445
Ingres 9.2    6920               4,560,764
The "lines of code" column includes comments and blank lines for both systems. The overall code size has grown approximately 5 percent between 9.0 and 9.2. Excluding files that were obviously old and ones provided as examples, there are approximately 101 new files in 9.2; 22 files were removed. The 101 new files contain 47,263 lines of code, approximately one percent of the total code size.
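The percentages quoted above can be checked directly from the table. The following is a small sketch, not part of the original analysis; the helper name pct and the variable names are made up for illustration, and the inputs are the line counts given in the text.

```shell
# Line counts copied from the table and paragraph above.
loc_90=4341445        # Ingres 9.0, all source lines
loc_92=4560764        # Ingres 9.2, all source lines
new_file_loc=47263    # lines in the roughly 101 new files

# pct NUMERATOR DENOMINATOR (hypothetical helper):
# print the ratio as a percentage with two decimal places.
pct() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", 100 * n / d }'
}

growth=$(pct $((loc_92 - loc_90)) "$loc_90")
new_share=$(pct "$new_file_loc" "$loc_92")

echo "Growth from 9.0 to 9.2: ${growth}%"
echo "New files as a share of 9.2: ${new_share}%"
```

This prints a growth of 5.05% and a new-file share of 1.04%, consistent with the "approximately 5 percent" and "approximately one percent" figures.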
Of the 6920 files in Ingres 9.2, 6804 files are shared with Ingres 9.0. Running the Unix commands "diff" and "diffstat" shows that these 6804 files, which contain nearly all of the 4.56 million lines of source code in Ingres 9.2, have approximately 254,157 new or changed lines of source code or comments. Note that counted among these 254,157 lines are changes as simple as changing the copyright date in a comment header.
How do we produce a report like this? Here are some example Unix shell commands:
    # Get a list of all the source files, sorted, excluding the vdba directory.
    # "cut -b 1-3 --complement" (GNU cut) strips the leading "90/" or "92/".
    find 90 -name \*.c -o -name \*.h -o -name \*.qsc -o -name \*.qsh \
        -o -name \*.qc -o -name \*.qh -o -name \*.sc -o -name \*.sh -o -name \*.st \
        -o -name \*.sy -o -name \*.yf -o -name \*.yi -o -name \*.lex \
        -o -name \*.lfm -o -name \*.s -o -name \*.m64 -o -name \*.mar -o -name \*.msg \
        -o -name \*.roc | grep -v vdba | cut -b 1-3 --complement | sort > 90-src-files

    find 92 -name \*.c -o -name \*.h -o -name \*.qsc -o -name \*.qsh \
        -o -name \*.qc -o -name \*.qh -o -name \*.sc -o -name \*.sh \
        -o -name \*.st -o -name \*.sy -o -name \*.yf -o -name \*.yi \
        -o -name \*.lex -o -name \*.lfm -o -name \*.s -o -name \*.m64 \
        -o -name \*.mar -o -name \*.msg -o -name \*.roc \
        | grep -v vdba | cut -b 1-3 --complement | sort > 92-src-files

    # Get a diff of the two file lists
    diff -u 90-src-files 92-src-files > changed-files

    # Removed files count (anchored so that only "-src..." diff lines match)
    grep -c -- '^-src' changed-files

    # New files count (anchored so that only "+src..." diff lines match)
    grep -c -- '^+src' changed-files

    # Sum the lines of code. Run this from inside the 90 directory
    # (or from inside 92, reading ../92-src-files instead).
    # xargs may split a long list into batches, giving several "total" lines.
    xargs wc -l < ../90-src-files
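The commands above cover the file-level counts but not the count of new or changed lines within the shared files. One way to approximate that number is sketched below. It assumes a list of shared paths (for example, the output of comm -12 90-src-files 92-src-files) and counts the ">" lines of diff's default output format instead of using diffstat; the function name count_changed and the tiny demonstration trees are hypothetical, and the result will not necessarily match diffstat's tally exactly.

```shell
# count_changed OLD_DIR NEW_DIR FILE_LIST (hypothetical helper)
# For every path in FILE_LIST present in both trees, run diff and count
# the lines that are new or changed on the NEW_DIR side. In diff's default
# output format, those are exactly the lines beginning with ">".
count_changed() {
    old=$1 new=$2 list=$3
    while IFS= read -r f; do
        [ -f "$old/$f" ] && [ -f "$new/$f" ] || continue
        diff "$old/$f" "$new/$f"
    done < "$list" | grep -c '^>'
}

# Tiny made-up trees standing in for the 90 and 92 directories.
tmp=$(mktemp -d)
mkdir -p "$tmp/90/src" "$tmp/92/src"
printf 'Copyright 2005\nint x;\nint y;\n' > "$tmp/90/src/a.c"
printf 'Copyright 2008\nint x;\nint y;\nint z;\n' > "$tmp/92/src/a.c"
printf 'same\n' > "$tmp/90/src/b.c"
printf 'same\n' > "$tmp/92/src/b.c"
printf 'src/a.c\nsrc/b.c\n' > "$tmp/shared-files"

changed=$(count_changed "$tmp/90" "$tmp/92" "$tmp/shared-files")
echo "new or changed lines: $changed"
rm -rf "$tmp"
```

In the demonstration, the changed copyright line and the added declaration are each counted once, which illustrates why trivial edits such as a copyright date inflate the total.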
In the Abstraction step, an expert must produce a set of progressively higher-level descriptions of both programs. The CA v. Altai opinion quotes a law review article:
"At the lowest level of abstraction, a computer program may be thought of in its entirety as a set of individual instructions organized into a hierarchy of modules. At a higher level of abstraction, the instructions in the lowest-level modules may be replaced conceptually by the functions of those modules. At progressively higher levels of abstraction, the functions of higher-level modules conceptually replace the implementations of those modules in terms of lower-level modules and instructions, until finally, one is left with nothing but the ultimate function of the program.... A program has structure at every level of abstraction at which it is viewed. At low levels of abstraction, a program's structure may be quite complex; at the highest level it is trivial."

For typical object-oriented software, e.g., in C++, C#, or Java, grouping by methods, classes, and packages is likely to be fruitful. These are the abstractions that the original programmers considered most useful in managing the complexity of the program and sticking to these constructs is the most objective possible approach. Without a similarity of class decomposition, it would be possible to copy some algorithms or methods from Program A into Program B, but it can't be said that Program B itself is a copy of Program A.
Essentially what needs to be produced is a massive block diagram of each program, with detail about what is happening within each method of each class, how the classes are grouped together, and how objects of different classes communicate with each other. To provide a court with the requested progressively higher levels of abstraction, the same diagram is presented but with details elided. Generally speaking the least abstracted levels will not be printable legibly even as a poster, but will have to be broken up into multiple tiles.
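The idea of presenting the same structure with details elided can be illustrated, very crudely, with a file list. The sketch below treats the directory hierarchy as a stand-in for the module hierarchy and collapses paths to a chosen depth. This is only an analogy under that assumption: real abstraction diagrams require expert judgment about methods and classes, and the function name outline and the sample paths are hypothetical.

```shell
# outline DEPTH < FILE_LIST (hypothetical helper)
# Collapse each path to its first DEPTH components and print each
# resulting prefix with the number of files that fall under it.
outline() {
    cut -d/ -f1-"$1" | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Made-up paths; none may contain spaces for the awk step to work.
files='src/back/dmf/dmc.c
src/back/dmf/dml.c
src/back/psf/psq.c
src/front/tm/go.c'

level2=$(printf '%s\n' "$files" | outline 2)   # more detail
level1=$(printf '%s\n' "$files" | outline 1)   # most abstract view

printf 'Depth 2:\n%s\n' "$level2"
printf 'Depth 1:\n%s\n' "$level1"
```

Lower depths give the coarser views; raising the depth restores detail without changing the underlying data, which is the property the tiled diagrams are meant to have.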
Filtration is what comes next. The CA v. Altai opinion says that anything not protectable by copyright should be filtered out. Even more so than with abstraction, this is a process requiring expert judgment and discretion. An algorithm that is in there for reasons of efficiency, e.g., an O(N log N) average-case sorting algorithm such as Quicksort, should arguably be filtered out because there aren't that many different ways to sort and certainly no competent programmer would choose an O(N^2) method. A procedure call that is there because that's the only way to do something via a standard library, e.g., to send an HTML page back to a Web browser or to display text on an Android phone screen, should be filtered out because it has been "dictated by external factors". Similarly, the fact that the program written for a big company uses a relational database and SQL must be filtered out because it would be required by "demands of the industry being serviced". A catch-all element dictated by external factors is "widely accepted programming practices within the computer industry". That one will give lawyers and experts plenty to argue about!
The 1992 CA v. Altai decision came down during the infancy of the open-source software movement, but the judges were prescient in explicitly saying that "elements taken from the public domain" should be filtered out.
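Once an expert has decided which elements are unprotectable, the mechanical part of filtration is a set difference. The sketch below assumes two plain-text lists, one element name per line; the function name filter_out and the element names are hypothetical, and the hard part (deciding what belongs on the second list) is not something a script can do.

```shell
# filter_out ELEMENTS EXCLUDED (hypothetical helper)
# Print the lines of ELEMENTS that do not appear verbatim in EXCLUDED.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file.
filter_out() {
    grep -v -x -F -f "$2" "$1"
}

tmp=$(mktemp -d)
# Made-up element names standing in for an expert's abstraction output.
printf '%s\n' qsort_rows send_html_page rank_forum_posts > "$tmp/elements"
# Made-up list of elements judged unprotectable (efficiency, external factors).
printf '%s\n' qsort_rows send_html_page > "$tmp/unprotectable"

kept=$(filter_out "$tmp/elements" "$tmp/unprotectable")
echo "Elements remaining for comparison: $kept"
rm -rf "$tmp"
```

Only rank_forum_posts survives in this example, so it is the only element that would be carried forward to the comparison step.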
Comparison is the final step. In my opinion, if an expert has done his or her job properly, and been given a sufficient opportunity to explain the underlying technology and the abstraction diagrams, this should be doable by a layperson. How similar is too similar? That's something that should be evident from the diagrams and, while possibly the subject of attorney argument, determined by a finder of fact.
In this example, we'll look at a discussion forum system built by two of the authors in the mid-1990s. This is part of the open-source ArsDigita Community System and many of its features were added to support the 600,000 registered users of photo.net, an online community started by Philip Greenspun in 1993. You can see the system running right here on this server with a very simple HTML template: /bboard/.
See the full analysis on this separate page.
For this example, we've chosen a standard open-source data sorting program, GNU sort. The entire program fits into one file with 4626 lines of code.
See the full analysis of sort.
/usr/bin/more to /usr/bin/less, leading many people to believe that they are in fact interchangeable. Are they right?
See the full analysis of pagers.
Jin S. Choi graduated from MIT in 1994 with a bachelor's in Electrical Engineering and Computer Science. He has been a working software engineer ever since, with extensive experience building and maintaining Internet applications, including electronic medical record systems and online communities.
John Patrick Morgan is a graduate of the Olin College of Engineering and is a working software engineer. He has been working with Philip Greenspun since 2008.