Is “data scientist” the new “programmer”?

Back in the 1970s, being a “programmer” meant writing one or more files of code that input data, processed it in some way, and then output a result. A program that occupied more than 256 KB of memory, even on a mainframe, would have been considered bloated (and wouldn’t have run at all on a “minicomputer,” at least not without a painful process of overlaying). Thus, there tended to be a lot of interesting stuff going on within every few lines of code, and certainly an entire file of code might contain nearly everything interesting about an application.

Today’s “software developer” is typically mired in tedium. To trace out the code behind a simple function might require going through 25 files, each of which contains a Java method that kicks a message to another method in some other file. Development tools such as Eclipse can speed up the tedious process of looking at a 20-layer call stack, but there remains a low density of interesting stuff to look at. A line of code that actually does something is buried amidst hundreds of lines of glue, interface, and overhead code. How did applications get so bloated and therefore boring to look at? I blame hardware engineers! They delivered the gift of infinite memory to the world’s coders and said coders responded with bloat beyond anyone’s wildest imagination.

Does the interesting 1970s “programmer” job still exist? While teaching an intro “data science” class at Harvard, I wondered if the person we call a “data scientist” is doing essentially the same type of work as a 1970s Fortran programmer. Consider that the “data scientist” uses compact languages such as SQL and R. An entire interesting application may fit in one file. There is an input, some processing, and an output answer.
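
To make the “input, some processing, and an output answer” shape concrete, here is a minimal single-file sketch. It is not from any real project: the `CLAIMS` rows, the `claims` table, and the `average_claim_by_region` function are all hypothetical names invented for illustration, and it uses Python’s standard-library `sqlite3` module as a stand-in for whatever SQL database a data scientist would actually query.

```python
import sqlite3

# Hypothetical input data: (region, claim_amount) rows, invented for illustration.
CLAIMS = [
    ("north", 120.0),
    ("north", 80.0),
    ("south", 200.0),
    ("south", 100.0),
    ("south", 300.0),
]


def average_claim_by_region(rows):
    """Load rows into an in-memory SQLite table and return {region: mean amount}."""
    conn = sqlite3.connect(":memory:")
    try:
        # Input: put the raw rows somewhere SQL can reach them.
        conn.execute("CREATE TABLE claims (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO claims VALUES (?, ?)", rows)
        # Processing: one aggregate query does all the interesting work.
        result = conn.execute(
            "SELECT region, AVG(amount) FROM claims GROUP BY region ORDER BY region"
        ).fetchall()
    finally:
        conn.close()
    # Output: a plain dict that can be printed or handed to the next step.
    return dict(result)


if __name__ == "__main__":
    for region, mean in average_claim_by_region(CLAIMS).items():
        print(f"{region}: {mean:.2f}")
```

Nearly every line here either moves data or computes something, which is the density the 1970s comparison is getting at.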

Readers: What do you think? Is it more interesting to work in “data science” than “software engineering” or “programming”?

Older readers: Is today’s “data science” more like a programming job from the 1970s “scarce memory” days?


27 thoughts on “Is “data scientist” the new “programmer”?”

  1. Suspect data scientist is still the new word for statistician, but it’s among the things reinflated by the printing press, like mortgage brokers.

  2. I was a “Programmer” in the ’70’s, and I keep thinking how much of what my early programs did would be done by a spreadsheet now (or any time since the late ’80’s).

  3. I am a software engineer in the defense industry. We still have code that was written in the nineties. I get to do bit banging and embedded code. So yes, there is a lot of that, but we muddle around in process too much to get much done.

  4. If their code is under version control and covered by automated tests then yes they are the new programmers.

  5. Data scientist sits at the Venn diagram intersection of Computer Science, Statistics, and some Domain expertise. Hugely hyped, but fueled by computing horsepower, and nearly infinite data from our connected world. Very lucrative in places, but no one can quite agree what makes a good one.

  6. TimB: Even if the money were half of what today’s coder gets paid it might still be a better job because one is spared the tedium of looking at millions of lines of Java that do almost nothing!

  7. The demands on today’s applications vs. yesterday’s have increased considerably. Take the first browser, for example. All that it had to do was render formatted text and images and navigate to links. Today’s browsers have to handle all this plus much more. They have to provide secure browsing and work with various scripting languages and embedded applications, to name a few. The same can be said for spreadsheet applications, word processors, et al.

    This is for the 1980’s and 1990’s. If you go further back in history and look at the 1970’s, you will see even further simplifications. The 1970’s “programs” were very much a “program” for a single operation, and they were heavy on algorithm vs. being an “application” (this is why I quoted “program”). Those are easier [1] to work with and code for and thus take fewer resources and less memory.

    Now, all this brings back memories from the good-old-days when I started coding: PEEK and POKE into the memory of a Commodore computer, dealing with Extended and Expanded memory of an Intel architecture, or making sure big-endian or little-endian are used properly so the program works properly on different architectures.

    [1] “easier” here means you are dealing with a well-defined requirement and a controlled environment in which the program is running.

  8. In my experience so far, game programming is one area where there still isn’t too much room for bloat — at least, there is a large layer of “engine-y” code that benefits a lot from being small and efficient, sitting underneath “application-y” code that may admit bloat here and there.

  9. The common theme between a ’70s programmer and a current “data scientist” is that their programs are mostly non-interactive, single-purpose, and developed by a small team.

  10. mmm..

    Statistics = Math + Data
    Programming = Logic + Structures

    DataScientist = Statistics + Programming

    DataScientist.getResult()

  11. Why do you care so much about how interesting the code itself is? The code is a means to an end. I’m more impressed by the creations that code enables. The products and applications enabled by code today are exponentially more diverse, useful, and meaningful than they were in the 70s, 80s, and 90s.

  12. Victor: Good point, but when you’re hired to code you have to look at code! Being on a team of 500 or 5,000 looking at huge quantities of overhead code is not interesting day to day. Being on a team of 3 working with data might be! I had a lot of fun poking around in a monster database of insurance claims recently. No program was longer than about 5 pages, including view abstractions.

  13. I seem to recall seeing ads for ‘data science bootcamps’ for turning out the next generation of cannon fodder, so in this respect it might be like being a developer. Become a well-paid scientist in just six weeks!

  14. The last time I examined a Java call stack, it was 100+ calls deep, most of it various frameworks calling each other in a twisty little maze. In the end, there was SQL querying.

  15. I run a department of programmers and data scientists. It is a completely different job, they are not even close to becoming programmers.

    The fact that a programmer is defined here by the fact that they use R, and that R can fit in a file, is not a premise for defining what makes a programmer.

    Data Scientists never use R for production code unless it’s a trivial script. Beyond that, many programmers now use functional languages.

    Then finally, the opinion of “Today’s “software developer” is typically mired in tedium. To trace out the code behind a simple function might require going through 25 files, each of which contains a Java method that kicks a message to another method in some other file. ”

    Is one of someone that has clearly never done any interesting programming, used a decent IDE, or learned clean code and correct tooling.

  16. As a data scientist who has the luxury of building prototypes that then get productionalized by a data engineer, I can relate:

    I write a script of maybe ~250 lines spread across 2 or 3 files that
    Pulls data from source systems, Performs ETL and feature engineering, trains a model that I built, aggregates, saves and evaluates the results of the experiment, and does all this in a distributed and (in essence) scalable way on production hardware.

    In past projects, I’ve seen my colleague spread this code across dozens of files buried in 2k-3k lines of boilerplate code to make it work with our existing frameworks, pipelines, etc. Sure, this is all important and allows for so much more resilience than my tiny script, but I absolutely prefer looking at my own code where (almost) every line actually does something.

  17. Data scientists can write terse code because they outsource the hard work to large libraries (e.g. numpy, pandas, keras).

    You didn’t mention web developers, who need to handle a huge number of user interactions.

    I’ve not coded until late 2000’s so it’s hard to compare, but I suspect the systems current large code bases interact with are more complicated and more versatile than those that existed in the 70’s.

    I just transitioned from a web developer to data scientist and love it. But I was a math major and find machine learning exciting. I’d recommend everyone at least look into what it is to see if it’s right for them.

  18. “Does the interesting 1970s “programmer” job still exist?”

    Certainly. If by “interesting” you mean, valuable? Then yes. Financial companies, economists, researchers, etc do just this day after day, but they’re not called “programmers.”

    “While teaching an intro “data science” class at Harvard, I wondered if the person we call a “data scientist” is doing essentially the same type of work as a 1970s Fortran programmer.”

    Absolutely. Not sure if you’ve managed to leave the academic walls of New England academia lately, but if you look at what Berkeley is doing, you can see where this is heading. There, data science is computer literacy 101 for everyone, regardless of major. It has taken over 50 years for the goal of making everyone into a programmer who can solve problems and/or answer questions to become a reality.

    Is it “hyped”? If by hype one means paying exorbitant salaries, then maybe. Can’t blame those that are just following the money. However, if one were to look closely at what Microsoft and even Google are doing, it’s easy to see that 2/3 of the function of what a data scientist is currently doing could easily be automated, leaving the filtering and massaging of data to the domain expert. In essence, anyone who opens Microsoft Office in the near future would today be considered a data scientist.

    “Consider that the “data scientist” uses compact languages such as SQL and R. An entire interesting application may fit in one file. There is an input, some processing, and an output answer.”

    Yup. Not that much different from the SQL-ridden Tcl scripts running on Oracle databases of the ArsDigita days.

    “Readers: What do you think? Is it more interesting to work in “data science” than “software engineering” or “programming”?”

    Let me rephrase that question… Is it more interesting to use the incredible amount of computing power available on everyone’s desk – let’s not even talk about data centers – to solve an itch (personal/organizational) than to look at the massive amounts of seemingly unnecessary software abstractions for no clear valuable purpose? Then yes.

  19. What Stefan said in #18 is good feedback that we can use to further explain why today’s programs are more complex.

    Sure, the “~250 lines spread across 2 or 3 files” is enough for a prototype or even a standalone “program,” but once you convert that into an application or integrate it into an existing application, you will then have to add error handling, logging, security, data formatting for the UI, integration with public APIs, and I can go on and on.

    @Stefan and all, do you now see why that ~250 lines of code is a totally different animal? This is why we cannot compare the 1970’s “programs” to today’s “application”.

  20. George: I don’t think that you and Stefan are disagreeing. Stefan is saying that he’d rather do the interesting part. He isn’t saying that the additional thousands of lines of code are useless, only that they are dull and that working with them is a dull job.

  21. You are right because the installed base is so much smaller for data science systems. Most of programming is tedium today because there is so much installed base for programmers to navigate and contort themselves through. Data science systems have a lot of “freedom space” as a practice. Similar to working in a startup vs. a large company, there isn’t a “bureaucracy” to navigate through. It’s the lack of established social structures that enables the feeling of progress and freedom.

  22. There is truth to this concept. As a Data Science and Analytics Manager for a large firm, I’ve seen firsthand that Data Scientist is a ‘suitcase’ term that garners all sorts of wild imagination from executives because the media feeds them the hype (see the differences between NLP and time-series-based predictive ML; many executives don’t grasp the fundamental differences and think Data Scientists are great at both). At the end of the day I love this analogy because I think Data Scientists generally bring programming back to its first principles, including quality and feature engineering, which imho have been overshadowed in the last decade by frameworks, SOA, libraries, and a race to the bottom for programmer talent. The re-calibration is refreshing and represents the pendulum swinging back to high-quality programmers who actually understand the business problem and code efficiently and effectively to meet those needs. Successful Data Scientists in my firm take on many traits similar to the 10x engineer/programmer, which we have been missing for some time.

  23. OK, ‘data mining’ I get, and while ‘data science’ seems like mostly a rebranding, maybe it’s a useful one. But now there’s such a thing as a ‘data engineer’??

    I’m drawing a blank as to what that would be… optimized data ‘massaging’??

    In any case, if we end up with ‘data designers’, the field will have likely jumped the shark!

  24. Comparing single-purpose, small, static (in the sense of flexibility) code to a big, neglected codebase with multiple features developed over a significant amount of time by many people doesn’t seem fair to me.
    I’ve seen code where “every few lines of code and certainly an entire file of code might contain nearly everything interesting about an application,” written mostly by one man, where UI internals were interspersed with logic and it was mostly impossible to change anything without breaking something else.
    My point is, when you have a single-purpose, small task without all that UI or networking overhead, done alone, it is very easy to keep it simple and smart, but most applications are so big and complex that they need to be developed by many people, be flexible, and be robust to deliver any value.
    And mostly the bloat isn’t intentional; it comes mainly from lack of knowledge, floating requirements, and dependencies, whereas in data science there is mostly one data source, few dependencies, and one task.
    It’s like comparing a group of a few enthusiasts sharing one goal to a whole society with various points of view and interests.
