In this episode of The Dr. Data Show, Eric Siegel answers the question, "What the heck do 'data science' and 'big data' really mean?"
Sign up for future episodes and more info: http://www.TheDoctorDataShow.com
Attend Predictive Analytics World: http://www.pawcon.com
Read Dr. Data's book: http://www.thepredictionbook.com
Welcome to "The Dr. Data Show"! I'm Eric Siegel.
“Data science.” “Big data.” What the hell do these buzzwords really, specifically mean? Are they just cockamamie -- intentionally vague jargon that overhypes and overpromises? Or are these terms actually helpful -- do they somehow designate, like, the most profound impact of the Information Age? Well, I’ll start with the vague and overhyping side and then circle back to why these buzzwords may matter after all. It’s time for the Dr. Data buzzword smackdown.
There are a lotta problems with these words.
First, "data scientist" is redundant. It's like calling a librarian a "book librarian." If you're doing science, it involves data. Duh!
Furthermore, don't tell anyone I said this, but real sciences like physics and chemistry don't have "science" in their name. Your science is trying too hard if it has to call itself a science: Social science, political science, data science, and I gotta say -- even though I have three degrees in it and was a professor of it -- computer science is an arbitrarily defined field. It's just the amalgam of everything to do with computers -- as a concept and as an appliance -- from the engineering of how to build them and the deep mathematics about their theoretical limitations to how to make them more user friendly, and even business strategies for managing a team of programmers...
Universities might as well also have a "toaster science" department, which covers the engineering of better toasters as well as the culinary arts on how to best cook with them.
But I digress. Ok, next buzzword: “Big data.” First of all, it's just grammatically incorrect. It’s like looking at the Pacific Ocean and saying “big water.” It should be “a lotta data” or “plenty of data.”
But the real problem with "big data" is that it emphasizes the size. 'Cause what’s exciting about data isn't how much of it there is per se -- it's about how quickly it's growing -- which is amazing by the way. There’s always so much more data today than there was yesterday. So we're gonna run out of adjectives really quickly: “big data,” “bigger data,” “even bigger data,” “the biggest data.” Actually, there’s been a long-running conference called the International Conference on Very Large Data Bases since 1975. I’m not joking. That's before the first Star Wars movie came out!
Now, in some cases, people use the terms data science and big data just to refer to machine learning, i.e., when computers learn from the experience encoded in data. That's the topic of most episodes of this program, The Dr. Data Show. It’s a show about machine learning -- which is a well-defined field and by the way is also often called predictive analytics, especially when you're talking about its deployment in the private or public sector. I would urge folks to use the well-defined terms machine learning or predictive analytics if in fact that's what you’re specifically talking about.
But as for data science and big data, in their general usage they suffer from a terrible case of vagueness. They have a wide range of subjective definitions, which compete and conflict. Basically, they're often used to mean nothing more specific than "some clever use of data." The terms don't necessarily refer to any particular technology, method, or value proposition. They're just plain subjective -- you can use them to mean whichever technology you'd like: machine learning, data visualization, or even just basic reporting.
But much worse than that, this vagueness often serves to mislead and misrepresent by alluding to capabilities that don't exist. For example, the popular press -- as well as certain analytics vendors -- sometimes use "data science" to denote some whole collection of methods that includes machine learning as well as some other advanced methods. The problem is, those other advanced methods are implied but often actually just don't really exist. They're vaporware. This confusion is sometimes inadvertent -- such as when journalists aren’t fully knowledgeable about the topic yet want it to sound as powerful as possible -- but, either way, the end result is souped-up hype that overpromises and circulates misinformation.
All these issues, by the way, also apply to the older-school term "data mining," which is also totally subjective. Besides, calling it "data mining" is like saying “dirt mining” instead of "gold mining." Malfunction, failed analogy... 'Cause we aren't searching for data, we're searching within data...
For the complete transcript and more: http://www.TheDoctorDataShow.com