Is “data scientist” the new “programmer”?

Back in the 1970s, being a “programmer” meant writing one or more files of code that input data, processed it in some way, and then output a result. A program that occupied more than 256 KB of memory, even on a mainframe, would have been considered bloated (and wouldn’t have run at all on a “minicomputer,” at least not without a painful process of overlaying). Thus, there tended to be a lot of interesting stuff going on within every few lines of code and certainly an entire file of code might contain nearly everything interesting about an application.

Today’s “software developer” is typically mired in tedium. To trace out the code behind a simple function might require going through 25 files, each of which contains a Java method that kicks a message to another method in some other file. Development tools such as Eclipse can speed up the tedious process of looking at a 20-layer call stack, but there remains a low density of interesting stuff to look at. A line of code that actually does something is buried amidst hundreds of lines of glue, interface, and overhead code. How did applications get so bloated and therefore boring to look at? I blame hardware engineers! They delivered the gift of infinite memory to the world’s coders and said coders responded with bloat beyond anyone’s wildest imagination.

Does the interesting 1970s “programmer” job still exist? While teaching an intro “data science” class at Harvard, I wondered if the person we call a “data scientist” is doing essentially the same type of work as a 1970s Fortran programmer. Consider that the “data scientist” uses compact languages such as SQL and R. An entire interesting application may fit in one file. There is an input, some processing, and an output answer.

Readers: What do you think? Is it more interesting to work in “data science” than “software engineering” or “programming”?

Older readers: Is today’s “data science” more like a programming job from the 1970s “scarce memory” days?



Full post, including comments

Fallout from the Java = SUV posting

The “Java = SUV posting” continues to resonate in my inbox.

The last two students using Java dropped 6.171.  They were not keeping pace with the PHPers and those who sold their souls to Bill Gates.  (Recall that all the students in 6.171 had built a 10,000-line Java program in 6.170 so they all knew the language itself quite well.)

Lots of professional Java programmers emailed to say “If only those students had used Libraries X and Y, they would have done okay.”  Sadly X and Y were never the same in any two emails so it is easy to understand how the students went wrong (i.e., it is not obvious how one is supposed to choose among the 100 different ways to get something done in the world of Java tools).

Similarly there was no agreement among Java programmers as to whether it is good to have SQL queries prominently featured in source code or better to make everything into Java objects and magically generate SQL behind the programmers’ backs.  Half of those emailing said that SQL was impossibly hard to write and what people really needed was to see the programmers’ custom-created methods.  The other half seemed to think that a database application ought to be primarily expressed in SQL, a concise declarative query language that has been standard for 25+ years.  These are 100% incompatible points of view.

My friend Curtis, an old-time Silicon Valley monster C hacker, AIMed me to say that he’d seen the Slashdot article:

“My problem with Java is that it makes hard things hard, and easy things hard.  The amount of hassle doesn’t scale with the complexity of the problem.  Whereas with PHP you can write “Hello World” without having to read a 200-page book.  Java is a train wreck with dozens of classes with slightly different methods that do similar things.  On the other hand, it kills me that the PHP database interface is so bad.  Actually PHP just kills me anyway…why they had to invent a new language, I’ll never know.”

I pointed out to Curtis that the latest Technology Review, MIT’s alumni rag, picked the developer of PHP as one of its “100 Bold Young Innovators You Need to Know”:

“Rasmus Lerdorf has learned five languages while living around the world.  But it’s the language he invented that has had global impact.  In 1995, without any formal programming training, Lerdorf developed a server language to help him set up Web sites. … He named the language PHP, for PHP hypertext preprocessor.”

Curtis’s response to Tech Review?  “People mistake creation for innovation”.


Java is the SUV of programming tools

Our students this semester in 6.171, Software Engineering for Internet Applications, have divided themselves into roughly three groups.  One third has chosen to use Microsoft .NET, building pages in C#/ASP.NET connecting to SQL Server.  One third has chosen to use scripting languages such as PHP connecting to PostgreSQL and sometimes Oracle.  The final third, which seems to be struggling the most, is using Java Server Pages (JSP) with Oracle on Linux.  JSP is fantastically simpler than “full-blown J2EE”, which is the recommended-by-Sun way of building applications, but still it seems to be too complex for seniors and graduate students in the MIT computer science program, despite the fact that they all had at least one semester of Java experience in 6.170.

After researching how to do bind variables in Java, which turns out to be much harder and more error-prone than in 20-year-old C interfaces to relational databases, I had an epiphany:  Java is the SUV of programming tools.

A project done in Java will cost 5 times as much, take twice as long, and be harder to maintain than a project done in a scripting language such as PHP or Perl.  People who are serious about getting the job done on time and under budget will use tools such as Visual Basic (which controlled all the machines that decoded the human genome).  But the programmers and managers using Java will feel good about themselves because they are using a tool that, in theory, has a lot of power for handling problems of tremendous complexity.  Just like the suburbanite who drives his SUV to the 7-11 on a paved road but feels good because in theory he could climb a 45-degree dirt slope.  If a programmer is attacking a truly difficult problem he or she will generally have to use a language with systems programming and dynamic type extension capability, such as Lisp.  This corresponds to the situation in which my friend, the proud owner of an original-style Hummer, got stuck in the sand on his first off-road excursion; an SUV can’t handle a true off-road adventure for which a tracked vehicle is required.

With Web applications, nearly all of the engineering happens in the SQL database and the interaction design, which is embedded in the page flow links.  None of the extra power of Java is useful when the source of persistence is a relational database management system such as Oracle or SQL Server.  Mostly what you get with Java are reams of repetitive declarations at the top of every script so that the relevant code for serving a page is buried several screens down.  With a dynamic language such as Lisp, PHP, Perl, Python, or Tcl, you could do bind variables by having the database interface look at local variables in the caller’s environment.  With Java the programmer is counting question marks in the SQL query and saying “Associate the 7th question mark with the number 4247”, an action that will introduce a bug into the program as soon as the SQL query is modified (since now the 7th question mark has been moved to become the 8th question mark in the query).
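To make the contrast concrete, here is a minimal sketch in Python using the standard sqlite3 module, chosen only because it runs anywhere; it is not the Oracle or JDBC setup the students used. The users table, its rows, and the helper fetch_with_caller_locals are all made up for illustration. The first function binds by position, question-mark style; the second lets a small helper pull named values out of the caller's local variables.

```python
import inspect
import re
import sqlite3


def fetch_with_caller_locals(conn, sql):
    """Hypothetical helper: bind each :name placeholder in sql from the
    local variable of the same name in the calling function."""
    caller_locals = inspect.currentframe().f_back.f_locals
    # Naive placeholder scan; good enough for a sketch, not for SQL
    # that contains colons inside string literals.
    names = re.findall(r":(\w+)", sql)
    params = {name: caller_locals[name] for name in names}
    return conn.execute(sql, params).fetchall()


def build_demo_db():
    # Made-up schema and rows, purely for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER, name TEXT, state TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [(4247, "Alice", "MA"), (4248, "Bob", "CA")],
    )
    return conn


def find_positional(conn):
    # Question-mark style: the tuple order must match the order of the
    # question marks, so editing the query can silently shift a binding.
    sql = "SELECT name FROM users WHERE state = ? AND user_id = ?"
    return conn.execute(sql, ("MA", 4247)).fetchall()


def find_named(conn):
    # Caller's-environment style: each placeholder names the local
    # variable it binds to, so reordering the query is harmless.
    user_id = 4247
    state = "MA"
    sql = "SELECT name FROM users WHERE state = :state AND user_id = :user_id"
    return fetch_with_caller_locals(conn, sql)


if __name__ == "__main__":
    demo = build_demo_db()
    print(find_positional(demo))
    print(find_named(demo))
```

Adding a new condition to the front of the WHERE clause in find_positional requires renumbering the tuple by hand; in find_named it requires only a new local variable.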
