Big data and machine learning

Continuing the report from my sojourn among some of the most capable programmers in Silicon Valley…

“You look like someone who might know awk,” said a top software engineer at one of the biggest web companies, where a 12 petabyte data set is a common starting point for analysis. “I think that was a polite way of saying that I had gray hair,” she continued. “Big data is ‘batch processing’ or ‘stuff that requires more work than can be done interactively.’ I’ve seen young programmers for whom big data is their first encounter with batch processing. It is like watching a dog eat peanut butter. It takes them a while to learn that machine learning is simply dividing things into bins and then clustering.” What are her secrets? “Transform everything into tab-delimited text files and use standard Unix tools. This results in much faster run times than code using the shiny new tools and data structures. My favorite algorithm is the gradient boosted decision tree mostly because I like to hear people trying to say it without getting tongue-tied.”
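
Her recipe is simple enough to sketch. The snippet below is a rough illustration of the idea rather than her actual pipeline, written in Python for readability even though she would reach for awk; the file name (events.tsv) and its column layout are made up, and a roughly equivalent standard-tools one-liner appears in the trailing comment.

    # Bin rows of a tab-delimited file by one column, then summarize each bin.
    # events.tsv and its layout (user<TAB>latency_ms) are placeholders.
    import csv
    from collections import defaultdict
    from statistics import mean

    bins = defaultdict(list)
    with open("events.tsv", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            user, latency_ms = row[0], float(row[1])
            bins[user].append(latency_ms)

    # "Dividing things into bins" and summarizing each one: print the ten biggest bins.
    for user, values in sorted(bins.items(), key=lambda kv: -len(kv[1]))[:10]:
        print(f"{user}\t{len(values)}\t{mean(values):.1f}")

    # Roughly the same job with standard tools:
    # awk -F'\t' '{n[$1]++; s[$1]+=$2} END {for (u in n) print u, n[u], s[u]/n[u]}' events.tsv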

An artificial intelligence specialist said that, from her perspective, machine learning is fundamentally changing programming: “hundreds of lines of code to drive Nvidia GPUs with CUDA instead of millions of lines of code. AI is both freedom from programming and freedom from understanding. Google is replacing PageRank with RankBrain and when this is complete they won’t know why certain pages are offered as the best results.” In her view a “future-proof skill set does not involve much coding background; it will be more about statistics and domain knowledge.” Some tools to learn? The Torch library and Lua.
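
To give a sense of scale for the hundreds-versus-millions-of-lines claim, here is a toy sketch, in plain Python with NumPy rather than the Torch/Lua she recommends, and with entirely made-up data: the model-specific code is about ten lines, and on a GPU a framework would be doing the same arithmetic for you.

    # Toy logistic regression trained by gradient descent; sizes and data are arbitrary.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))                   # fake feature matrix
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # fake labels

    w = np.zeros(20)                                  # model weights
    for _ in range(200):                              # plain gradient descent
        p = 1.0 / (1.0 + np.exp(-X @ w))              # predicted probabilities
        w -= 0.1 * X.T @ (p - y) / len(y)             # gradient of the log loss

    print("training accuracy:", ((p > 0.5) == y).mean())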

Is Deep Learning all hype? Perhaps not. The recognition rate on standard image sets has gone up dramatically recently. “There are no new ideas,” said one hardware/software expert. “The guys in the 60s were pretty smart. They just didn’t have fast enough hardware.” Could it be that the Singularity is in fact within sight? The best market-based evidence for this seems to be that people who teach at Singularity University are being offered $20,000 speaking engagements at corporate events.

9 thoughts on “Big data and machine learning”

  1. These sound more like marketroids than engineers. The dramatic improvement in image recognition still wasn’t enough to make the Fire Phone, Google Goggles, or Google Glass sell, or to bring a self-driving car out of vaporware, & they seem to rebrand one development, the convolutional neural network, as many new ideas.

    I would say today’s experience of searching for something on Goog by abbreviating sentences & repeatedly rephrasing until it spits out the desired result isn’t much better than library searches 20 years ago, with a few improvements like counting citations & sponsored links. It is humans who are getting better at formulating their queries & persisting with their searches, & of course economic policies that assign higher valuations to smaller innovations to prevent a recession.

  2. I read recently that Sussman is concerned about having AI systems that we don’t understand making life and death decisions, such as driving automated cars, and that it would be better to have rule-based systems so that if something goes wrong we can determine why it went wrong.

    The machine learning / neural network approach certainly seems to be bearing remarkable fruit recently, but the idea of handing over control of important systems to computer programs that we don’t understand does sound disconcerting to me.

    Thoughts?

  3. @jack crossfire
    You seem to have no idea what you are talking about. Google Glass did not use any deep learning technology. The Amazon Fire Phone was a spectacular failure due to its pricing and management. Advances in deep learning did not appear or gain mainstream acceptance until 2013. Your comments regarding web search are meaningless. I suggest you read up on recurrent networks, LSTMs, and the Deep Brain acquisition before writing further deluded comments. At least then they would be more humorous.

    @Phil I don’t think machine learning is fundamentally changing programming; it is, of course, becoming an irreplaceable tool, much like databases. There are still millions of lines of glue code that need to be written. As for the recommendation of Torch or Lua, there is no guarantee that the community won’t adopt TensorFlow (C++ and Python) or stick with the already successful Caffe. It is incorrect to claim that a few lines of CUDA replace millions of lines of C++. It is possible to build more powerful models and achieve end-to-end training, yet the gains remain in the core components. There is still a huge amount of other code that needs to be written for it all to work.

  4. By the way Phil, does this seem familiar?

    Imagine going to work with your phone and a display/keyboard. You take an available desk and connect to the 34″ Dell curved ultrawide monitor there, putting your phone into Mac OS X mode, where it is also available as a touchpad. Your files are all accessible via Google Drive, and your keyboard is the cover of the tablet/display, which you configure to act as a second monitor. Later in the day you head out for a business trip, and a few hours later, when you get to the hotel, you set up a similar configuration using the monitor on the desk there. Later, you head for bed, putting the phone on the nightstand and running the display in iOS mode, catching up on the news in Flipboard.

    https://research.gigaom.com/2015/11/microsofts-lumia-950-the-shape-of-things-to-come/

  5. The hype is symptomatic of underlying advances in what we can do with deep learning, and machine learning in general.

    To be clear, gradient-boosted decision trees have nothing to do with deep learning or the recent advances in image recognition. The winners of ImageNet, and all their closest contenders, are all deep convolutional nets. On some problems, DL is only competing with itself.

    While neural networks have been around for a long time, there *are* in fact new ideas. Researchers are setting new records in accuracy on ImageNet using deep-convnet architectures that weren’t imagined decades ago.

    Attention models and combinations of DL and reinforcement learning are producing better and better results. It’s more than hardware. It’s the convergence of more powerful hardware, bigger data and better ideas.

    I work for a deep-learning startup, Skymind, supporting an open-source deep-learning tool, Deeplearning4j. http://deeplearning4j.org/

  6. It is kind of Orwellian how “machine learning” is pretty much now newspeak for plain old “pattern recognition” or classification.

    It’s as if everyone has given up on the much more important problem of machines actually understanding anything at all, at even the level of a two-year-old.

  7. Using Unix tools is all well and good, but manually moving the data around a bunch of computers and manually spinning up processes to chew on the data is ridiculous. Taking care of those chores is what the frameworks buy you. You don’t have to use the over-engineered Java stuff; you just need some sort of cluster filesystem and some sort of job scheduler. Script those Unix tools with HTCondor and use Ceph to hold the CSV files.
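
To make that last suggestion concrete, here is a rough sketch (not a tested recipe) of fanning a per-shard Unix pipeline out over a cluster with HTCondor, with the shards sitting on a Ceph mount; the paths, the shard count, and the wrapper script name (process_shard.sh) are all placeholders.

    # Write a minimal HTCondor submit description and submit it with condor_submit.
    # All paths, the shard count, and process_shard.sh are made-up placeholders;
    # /mnt/ceph stands for an ordinary POSIX mount of the Ceph filesystem.
    import subprocess
    import textwrap
    from pathlib import Path

    submit = textwrap.dedent("""\
        universe   = vanilla
        executable = process_shard.sh
        arguments  = /mnt/ceph/shards/part-$(Process).tsv
        output     = logs/part-$(Process).out
        error      = logs/part-$(Process).err
        log        = logs/shards.log
        queue 100
        """)

    Path("logs").mkdir(exist_ok=True)        # local directory for per-job logs
    Path("shards.sub").write_text(submit)    # the submit description file
    subprocess.run(["condor_submit", "shards.sub"], check=True)

Ceph appears here only as a plain filesystem mount, which is the point: the per-shard work stays ordinary Unix tools, and the scheduler handles placement, queuing, and retries.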
