There was a social media post, a bit of a brain teaser, about a murder mystery of sorts: a man was killed one afternoon, and there were several suspects, each with an alibi; it’s your basic whodunit. One of the suspects was a chef who was making breakfast. Many comments on the post insisted it had to be the chef since no one makes breakfast in the afternoon.
I have had breakfast for dinner many times. McDonald’s offers “all-day breakfast” at least in the U.S. Perhaps this is a regional thing, but it brings me to the point that we are inclined to assume the world operates as we see it from our own perspective. It’s human nature, but I wonder to what degree it contributes to increased friction and unwillingness to accept that non-preferred alternatives exist.
You might be wondering what this has to do with data and analytics. First, we underestimate the importance of clear operational definitions. What do you mean by “breakfast”? Does it refer to the time of the day the meal is eaten, or the type of the food typically eaten in the morning? Many analytical attempts are flawed from the initial stages because the definitions are subject to interpretation. Second, we live and (nearly) die by assumptions. Many assumed that “breakfast” referred to the meal eaten at a specific time of the day, but that is not true in all contexts. Successfully operationalizing any analytic requires not only identifying all relevant assumptions and the consequences of violating them, but also preparing for these consequences. In practice, assumptions are violated all the time; assumptions we don’t even know we are making can have devastating consequences.
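To make the operational-definition point concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Meal` record, the 11:00 cutoff, and the list of "breakfast foods" are hypothetical and come from nowhere but my own head. It shows how the same four meals produce different "breakfast" counts depending on which definition the analyst silently picks.

```python
from dataclasses import dataclass
from datetime import time

# Hypothetical list of "typical morning foods", assumed for this illustration only.
BREAKFAST_FOODS = {"pancakes", "omelet", "cereal"}


@dataclass(frozen=True)
class Meal:
    food: str
    eaten_at: time


def is_breakfast_by_time(meal: Meal) -> bool:
    """Definition A: breakfast is any meal eaten before 11:00 (an assumed cutoff)."""
    return meal.eaten_at < time(11, 0)


def is_breakfast_by_food(meal: Meal) -> bool:
    """Definition B: breakfast is any meal made of typical morning foods."""
    return meal.food in BREAKFAST_FOODS


meals = [
    Meal("pancakes", time(8, 0)),
    Meal("pancakes", time(15, 30)),  # the chef's afternoon "breakfast"
    Meal("omelet", time(19, 0)),     # breakfast for dinner
    Meal("steak", time(7, 0)),       # a morning meal that is not breakfast food
]

count_by_time = sum(is_breakfast_by_time(m) for m in meals)
count_by_food = sum(is_breakfast_by_food(m) for m in meals)
disagreements = [m for m in meals if is_breakfast_by_time(m) != is_breakfast_by_food(m)]

print(f"Breakfasts by time of day: {count_by_time}")
print(f"Breakfasts by type of food: {count_by_food}")
print(f"Meals the two definitions disagree on: {len(disagreements)}")
```

Neither definition is wrong; the trouble starts when two people report "the number of breakfasts" without saying which one they used.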
The second point is connected, perhaps somewhat convolutedly, to the fact that the end consumer of analytics is a human one way or another. (It is also related to the topic of collaboration involving analytics professionals, but that’s a separate discussion.) We are drawn to view things in ways that are comfortable to us. I have no desire to get political here, but I am convinced that the greater polarization in society today is due to the hyper-clusterization of the viewpoints, opinions, and beliefs, rather than the feeling of connectedness at the personal level, and much of this is being perpetuated by social media and other aspects of “Big Data.” We want to believe that we know the Truth and everyone who disagrees is an idiot, thus caring about the opinions more than about the human to whom the opinions belong. (I recently read “The Fourth Industrial Revolution” by Klaus Schwab and found something similar expressed, although he’s obviously far more authoritative than I am!) This, in turn, makes it easier to capture in algorithms what makes each of us tick.
Big Data may be enabling a higher level of personalization, but personalization isn’t the same as human connection. Could data-driven personalization actually be contributing to the dehumanization of humans? Artificial intelligence and deep learning and such, as useful as they are, cannot be at the expense of being human, since there is always a human somewhere in the process. I’m not concerned about AI taking over the world as much as that it may be eliminating our need to be human.
For if it ever gets to the point where we have to re-learn how to be human, no amount of data-driven intelligence is going to save humanity.
There is an inside joke between my husband and me about the infomercials touting that you can learn to play the piano in a flash. He (jokingly) threatens to achieve in a mere four hours what took me many years of blood, sweat, and tears (I was a professionally trained classical pianist in my previous life), but for now, it remains an empty threat (thankfully).
I think it is fair to say that most reasonable people understand these programs do not turn a complete newbie into a professional pianist in only a few hours. I have always and strongly encouraged people to learn and enjoy playing the piano, as that also enables them to appreciate the work of other musicians more deeply and perhaps even collaborate under the right circumstances. However, offering the skill as a professional service for a fee is a different story, and it would be irresponsible for me to encourage that to someone who has only some cursory training.
The same is true of data science (or anything else for that matter). Learning should be encouraged so that one can appreciate it as well as understand its potential more intelligently–needed today simply to stay competitive. However, the line between intelligent appreciation and hard skills is becoming increasingly blurred, with the unsolicited “Learn Data Science in X Days/Weeks” advertisements showing up on my feeds daily. Along with the popularity of analytics democratization, is it becoming another factor that could threaten the integrity and eventual well-being of advanced analytics?
It is curious there seems to be anywhere from a tacit acceptance to enthusiastic encouragement for this in data science. To be fair, I do not believe that these short programs are put together under the presumption that they make a complete novice into a fully competent data scientist. That said, the expectations are usually not clearly articulated, and I am rather annoyed by what is essentially a marketing tactic that takes advantage of the hype. It plays very well into the rapid-results culture that often encourages shortcuts.
I recognize that not everyone in these programs is starting from scratch, and those with an adjacent background and just a few missing skills have a much easier time transitioning into this much-coveted discipline. There are other factors, obviously. We can also question what one means by a “data scientist” (let’s not get started on that here), but it suffices to say that what a business needs from a “data scientist” runs a wide gamut, not all of which is about learning specific algorithms or programming languages. However, I expect any “data scientist” to have the following hard competencies at the minimum:
- Solid understanding of probability concepts, on which any analysis design is heavily dependent regardless of the methodology (statistical or otherwise) ultimately employed.
- Solid ability to code, whatever the language. A data scientist must be comfortable getting around very messy raw data, big or small–it is the science of data, after all. The specific language is secondary, as long as its strengths and weaknesses are understood. What is more important is one’s ability to logic his or her way through a messy pile of data while programming efficiently and in a well-structured manner; one can always learn another language. (A recent comment hinted data scientists had to program in Python. Nonsense. I once coded something entirely in Base SAS just to prove a point and, of course, because I could.)
I purposely left out analysis techniques from the criteria. This is where short courses are perfectly suited–you can always learn techniques. But you need the above two first and foremost, and their development is not measured in weeks.
Can you learn data science in a flash? Like you can learn to play the piano in a flash.
“Everyone’s a data scientist–if they have the right tools”–a well-known business publication commented on social media, referencing an article on data democratization.
This is like saying everyone is a driver if he or she has the right car. While that may technically be true, you need to learn how to drive, then you need a driver’s license (well, at least legally in most places). It still says nothing about whether you know where you are going–you need a map or good directions. Some people are bad with directions; some are simply bad drivers. You could be the world’s best driver with the best car, but without the right map and directions, you have no chance of reaching your destination. What if, one day, all GPS maps cease to exist?*
Outside of the political context, Webster defines “democratic” as “relating to, appealing to, or available to the broad masses of the people.” The comment above by the publication implies democratization of analytics is equivalent to the democratization of data plus tools. However, this is true only if you define analytics to consist strictly of tools that can replace all understanding. The democratization process is different between data and analytics; data is a tangible asset, while analytics is a set of human activities on that asset, which may or may not include tools (technically speaking, it is possible to do 100% of analytics by human power only).
Then how should analytics be made available to the broad masses? You can put a car in everyone’s hands, but not everyone should drive. On the other hand, access to the benefits of vehicles can be near universal, as even those who cannot drive a car can generally ride in one. When it comes to analytics, however, far too many people equate democratization with universal permission to drive (i.e., execution of analytics), rather than universal access to its benefits. This has led to perhaps one of the most critical problems with analytics today: your ability to carry out the analysis tasks says nothing about whether the analysis is valid, and the recent trend of actively shifting the focus away from analysis validity and directly toward technology is troubling for the future of decision-making. Not too long ago, I came across a blog touting a “Big Data Easy Button”; again, without the right analysis design, it is simply an easy button to execute the wrong analyses, and even if the analysis is correct, its benefits often still remain out of the reach of those who need the insights to make decisions.
The recent re-emergence of the p-value controversy is another case in point. Those who recall p-values from that one statistics class will remember that anything less than 0.05 meant a statistically significant result. However, the p-value says nothing about whether the analysis itself is valid in the first place. The fact that you can apply the mechanics of the statistical analysis to obtain the p-value does not validate your conclusion, much like the fact that you can operate a vehicle says nothing about whether you can actually reach the intended destination. Unfortunately, the p-value has achieved statistical easy button status; the fixation on p-values routinely leads to false conclusions, driving the editor of a well-known periodical to call for a ban of p-values altogether. The problem is not with the tool but with the users of the tool and the context in which the tool is used; banning the tool does not eradicate bad users of statistics or fix the context, and the true benefits of statistics still never reach those who really need them.
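As a rough illustration of how the mechanics can be executed flawlessly while the conclusion is still wrong, here is a small simulation sketch in Python. It is my own toy setup, not anything from the controversy itself: both groups are drawn from the same distribution, so every "significant" result is a false alarm, and the z-test assumes a known standard deviation of 1 purely to keep the code short. The function names, sample sizes, and seed are all assumptions.

```python
import random
from statistics import NormalDist, mean


def two_sided_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sample z-test for a difference in means, assuming a known common sigma."""
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_positives(n_tests=1000, n_per_group=30, alpha=0.05, seed=0):
    """Run many comparisons where no real difference exists and count p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Both groups come from the same standard normal distribution.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p_value(group_a, group_b) < alpha:
            hits += 1
    return hits


false_positives = count_false_positives()
print(f"{false_positives} of 1000 comparisons on pure noise had p < 0.05")
```

Roughly one comparison in twenty should clear the 0.05 bar even though there is nothing to find. Anyone who runs enough comparisons and reports only the winners can press the easy button all day long.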
While there is so much attention on doing analytics, many are still so far out of the reach of its benefits. People are convinced that one must drive rather than ride for analytics to be universally accessible, and many continue to wait for the ride that will never come, while bad drivers clog up the streets, not knowing where they are going and causing massive pile-ups. Show me a place where everyone with a great car is a great driver who knows exactly where he or she is going, and I will show you an analytics easy button that makes everyone a data scientist.
*P.S. What did we do before GPS?
One of the prevailing challenges in making analytics successful is obtaining quick wins. What businesses often need is an analytics equivalent of the agile sprint approach, in which actionable results are delivered in short cycles; yet for many, this remains out of reach. Too often an elaborate analytical project is planned or carried out, stemming from insistence (on rigor and perfection), naiveté (about all that is possible), or neglect (of the original intent).
Contrary to the common perception, an effective execution of “agile analytics” is almost all about design rather than about tools and/or technology; the speed to insight is driven almost entirely by how the analysis itself is designed. While tools and technology can sometimes help, they cannot make up for poor design–they will simply deliver an invalid solution a tad faster. And when the analysis is sufficiently simple, a wrong design delivers with less effort a solution to the wrong problem. This is like the worst case of “are we there yet?”–wrong directions and/or wrong destination; having a faster car does not correct for either, and all the passengers are hungry and cranky.
Many activities related to analytics have direct parallels in technology, although this often eludes even highly skilled analytics professionals. The basic technology development lifecycle framework–Discover, Design, Develop, Deploy, or any of the other widely accepted variants thereof–applies to analytics as well. However, for analytics, so much of the attention is currently going to “Develop” (that is, where the analytical techniques are applied) that the other three phases are practically being ignored. This has an impact not only on the feasibility of agile analytics but on the effectiveness of analytics overall.
Just like in technology, the “Develop” in analytics–effective application of analytical techniques to obtain insights or build an algorithm–must be based on good functional requirements that are designed to build the business needs as well as operational constraints into the solution. Then, the analyst’s job is to use the best techniques to fulfill those requirements through development. The challenge is that data scientists are often expected to do both the “Design” and the “Develop” based on their ability to apply analytical techniques–that is, based solely on their ability to do the “Develop.” This is the equivalent of expecting technology developers to be great architects based on their ability to code. The technology field is now mature enough to realize this, but somehow the analytics community has yet to acknowledge the same. To further complicate the matter, many data scientists implicitly believe they can design based on their ability to develop, since the distinction is not very well understood even in the advanced analytics community.
What makes an analytical professional great at analytical design? That is a discussion for another time. However, it suffices to say that there are far more expert analytical developers than there are professionals who excel in analytical design. And as with anything else, it pays off to get it right up front–getting the solution design right is critical in setting the right expectations. Then, if the ecosystem of analytics consumption is in place, you will have two major pieces of foundation needed toward getting the real value out of analytics that are missing from so many of the analytical
Science: “the state of knowing: knowledge as distinguished from ignorance or misunderstanding.” (Merriam-Webster)
I once held the title of “Lead Data Scientist.” While the title made no difference to me (I see it as a label), I spent more time explaining what I did and what my value was to the business world. If I were really smart, I would have built a predictive model to see that coming, would have retired, and would be on a beach somewhere with a fruity drink. Alas, as usual I am reminded of my own shortcomings, as a data scientist or otherwise.
Then I recently came across a heated discussion on what a data scientist was. For as long as the world has coveted this resource, it is curious how this is still a hot debate—we all desperately need it, but we don’t really understand what it is.
Interestingly enough, the question is never with “data” but rather with “science.” The Webster definition above is only the first one, but in the subsequent definitions there are recurring references to (systematic) knowledge. If we break down the term “data science,” it would simply mean systematic knowledge of data, much as the discipline of social science represents systematic knowledge of human society and that of political science represents systematic knowledge of politics. A “data scientist,” then, is someone who studies data systematically. Like social science and political science, data science has areas of specialization.
The thing is, the specialized areas within the “data science” discipline imply specialized skill sets (i.e., methodologies and techniques) rather than expertise in the area of application; however, the business world has yet to understand how to articulate this well. Even among the analytics experts, the term is associated in some circles with a very specific set of methodologies and approaches, while others have a much broader interpretation.
Additionally, it is important to recognize that people hired primarily for their skill sets have different functions from those hired primarily for business expertise. Specifically, the skill sets are tools to solve business problems and rarely the business goals themselves, and this often leads to misaligned expectations. The technology discipline perhaps understands this a little better, and as a result its functional maturity is probably about a decade ahead of “data science.”
The point is, we all must think about what we need when we hire a “data scientist.” So, short of abolishing the term, I submit the following:
- It is imperative to articulate specific business problems for which you need a data scientist. “We need data scientists in order to stay competitive” is too broad by itself to be effective, like saying “we need a social scientist in order to better understand society.” Defining the technical qualifications is very straightforward once the business needs are well defined.
- Hiring a skill set equates to hiring a consultant with that skill set. An organization does not have to be a consulting organization, and the “data scientist” does not even have to ever work with anyone outside of his/her own group for this to be true. A skill set with no consultative aptitude will never help you solve your business problems.
Some have proposed the term “data artist,” which, of course, induces another set of heated discussions. As a trained performance artist, I believe I am sufficiently qualified to declare what I do with data is not art, although there certainly are some creative elements to it. Now, if we can stop arguing about what to call ourselves and actually get some work done, so that we can get to sipping that fruity drink….
In a recent Dilbert strip (April 24, 2015), Dilbert declares: “I found the root cause of our problems… It’s people. They’re buggy.” The thing is, even with all the hype around analytics, this is so true that it almost isn’t funny.
There are two key disconnects in many analytics initiatives today: the business feeling that the data scientists do not understand the business needs, and the data scientists feeling like the business is not listening to what the data have to say. And the gap is not closing nearly as fast as we would like.
What we have is a failure to communicate.
Many data scientists have been hired, and many great algorithms have been developed, many of which have died a lonely death without ever being appreciated. In less dire situations, some analytics is being consumed, but it is far from optimal. What we all need to realize is that analytics is really just a methodology, and a data scientist is just a person with the appropriate skills to apply the said methodology. In reality, however, the business and the data scientists often have practical expectations of each other that are not very well articulated, resulting in friction that leads to distrust and/or indifference. With that said, I have a message for the business leaders and another for the data scientists.
To the business leaders: If great analytics happens in the forest, and no one is there to make decisions with it, does it make a business impact? The answer is no. Hiring data scientists does not make analytics happen. In fact, hiring data scientists is not one of the first things I would recommend to any organization starting out with analytics. Analytics is not just-add-water—there must be a culture and an ecosystem along with the right processes and functions in place to make it all work. Unless the organization was built data-driven from the ground up, building an analytics capability will always involve a degree of retrofitting. Until the people are ready to receive analytics, it will not be received.
To the data scientists: Building the best model is your job, but you must keep in mind that it is not the end objective–it is simply a means to the end. You have a specific skill set for which people are willing to pay, and your objective is to help people do better at whatever they are trying to do better. There is always a person at the end, and often in between; care for all people involved, and build a positive relationship with all of them. Build models for others–whether you like it or not, being an expert holder of a skill set means that you have a responsibility as a consultant in some form and thus a responsibility to manage your relationship with the end client, with very few exceptions. Without people ready to embrace your work, even your most beautiful algorithms will sit idle. Strive to connect with the people involved, and you will find that you will not only have an entirely different relationship with your client but also approach the analysis differently.
Being successful in analytics, whether you are a business executive or a data scientist, is not at all about the capability to do analytics. It is about people working together and relating to each other. You have a much better chance of success by doing basic analytics with the right people, processes, functions, and culture, than by doing great analytics without them.
P.S. I’ve used the term “data scientist” here for mere convenience. What it should really be called is another discussion!
Mark Twain wrote: “Figures often beguile me, particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: ‘There are three kinds of lies: lies, damned lies, and statistics.'”*
I am OK with this statement because I get the intent, and I hope many would agree that the entire scientific discipline of statistics is not a lie. Like any good statistician, I must insist on significant evidence to the contrary before rejecting the null hypothesis that at least some of statistics is honest, and in the absence of such evidence, I am not quite ready to concede that statistics is all a big lie.
Webster defines lying as making “an untrue statement with intent to deceive.” Lack of competence does not make one a liar, so not knowing how to use statistics correctly is a different issue. The key to lying is the “intent to deceive,” and this can be in the form of unwillingness to face the reality. This past week I heard multiple references to anecdotes of someone’s desire to make the results look “not so bad”; it can also go the other way to make someone else look “not so good.” It is not that the numbers are easy to manipulate, but rather that it is easy to appear data-driven.
Back when I taught introductory statistics courses, the syllabus always included the topic of subjectivity and the impact it may have on how the results are conveyed. We looked at various mass-circulation articles, identifying the author and/or the sponsor of the piece, the potential biases and their potential impact on the conclusions. While the results may be perfectly valid in one sense, it is important to take an objective view in order to understand what is really going on. The same is true in business settings.
The assumptions are critical–especially the business assumptions, which may be called business contexts or caveats that may or may not be made explicit. Statistical assumptions are important for sure; however, in practice, the violations of contextual assumptions are far more impactful than the violations of the statistical assumptions–many methodologies are fairly robust against violations of statistical assumptions and can generally produce directionally correct results. One may choose only the results that support one’s cause and ignore others that are more important, or choose the methodology or display that allows one’s story to be told, or choose to analyze in such a way that the results would only justify one’s position. Selecting the data to fit one’s pre-formed story, rather than letting the data coalesce into a story, is the opposite of being data-driven–call it agenda-driven analytics.
Agenda-driven analytics will tell you only what you want to hear, not necessarily what you need to hear. And in this case, analytics will never have a chance to do what it can do–it will be an involuntary participant in the advancement of an agenda it doesn’t even support. In the meantime, others, including the customers, suffer from lack of better treatment; depending on the context, the consequences may be quite grave.
P.S. I should fully expect a flurry of hate mail from my esteemed statistical colleagues for saying that statistical assumptions are not very important!
*“Chapters from my Autobiography–XX,” North American Review no. 618, July 5, 1907.
It is said that some astounding proportion of BI and analytics efforts fail. Depending on the context, that number appears to range from 50% to 80%. Certainly, numerous debriefs have been conducted on what worked and what did not work; many have opined on the top reasons for failure. So, why does this continue to happen?
Take analytical pilots. (Here we refer to pilots whose main concern involves analyzing the data and not simply implementing a tool–the latter deserves a separate discussion.) Pilots are particularly important, because the resulting decisions shape the course of what is to come. At the risk of stating the obvious, organizations conduct them to do something with little to no precedent, and pilots are a financially prudent way to see if it works before investing in a larger-scale capability; starting small does provide an opportunity to work out the kinks, while also allowing organizations to plan better.
However, the fundamental reasons for the start-small approach deserve more careful thought in analytics. Specifically, we should ask whether the organization is ready for the consequences of the analysis results, and recognize that the transformation expected from the positive conclusion does not happen naturally. A pilot intended to prove the value of analytics is especially tricky, as the very need to prove may indicate that someone in the organization has not yet bought into the idea of the consequences–applying analysis results to make changes, and changes are uncomfortable. There is merit to convincing the unconvinced, but the degree to which the entire organization becomes convinced is a huge factor in whether there is a realistic future for analytics beyond the pilot.
That is, the organization must have the collective desire to be data-driven and have the next steps already defined, ready to accept change. The main goal of the analysis should be simply to prove the sufficiency of the business impact, with everything else already in place or ready to be executed immediately upon the completion of the analysis. Unfortunately, much non-financial planning and many decisions are put on hold until the results of the analysis are available; the situation is exacerbated by the unconvinced or the marginally convinced. We have seen pilot analyses executed, only to be followed by a lack of priority, a long time to define the next steps, and finally the demise of the effort.
But a data-driven culture does not deprioritize the conversion of data-driven efforts into business results. First, if for some reason a well-selected and well-executed analytical pilot ends up with less-than-favorable results, it has others in the wings waiting to benefit from the learning. Second and more importantly, for a pilot to be effective, those who will consume its results must be willing, able, and empowered to do so immediately–to accept and act on those results to change themselves. The real challenge with analytics is that, without the resulting operational or strategic change–i.e., non-analytical change–it has no business value. And pilots cannot succeed with no business value. A data-driven culture is all about establishing an ecosystem of consumption of analytics throughout the organization and less about acquiring tools and data scientists. Having experience and capabilities in analytics is not a prerequisite.
I am not ruling out the possibility that there exist organizations for whom analytics makes no sense whatsoever, but I have yet to come across one in my nearly two decades of looking at analytics and analytical practices. I have, however, seen plenty that were not ready to consume. I am willing to bet that some form of analytics could have a sufficiently positive impact in well more than the 20-50% of the opportunities suggested by the failure rate. I am also willing to bet that a substantial portion, if not the majority, of the failures never came close to implementing the non-analytical changes needed to understand the business value.
Are you going to be content with continuing the trend of failure, or are you going to challenge your business to transform?
It is not uncommon to hear business leaders say how predictive analytics is important and strategic. However, is predictive analytics really the Holy Grail of analytical maturity?
We can start by clarifying what predictive analytics is and where it resides in relation to the business objectives for leveraging analytics. We can slice the analytics space along the following three dimensions:
- Predictive vs. Explanatory: Is the primary objective to quantify the likelihood of a particular outcome, or explain a particular phenomenon or behavior?
- Exploratory vs. Confirmatory: Is the primary objective to discover something new that can help you form a hypothesis, or to confirm the hypothesis you have already formed?
- Strategic vs. Tactical: Is the goal to inform business strategy decisions or to inform and execute on a specific set of actions?
We can table the discussion on methodology—from the business perspective, the specific quantitative methodology, statistical or otherwise, is secondary. Predictive methodologies can be applied while the objectives of the analysis remain explanatory, and in practice this is rather common. We should also acknowledge that some combinations of the above do not really exist, at least theoretically, and the distinctions can get a little blurry sometimes.
The point is that predictive analytics is just one class of just one dimension that defines the business objectives for analytics. The concern is that a blind focus on the “predictive” could be boxing organizations into analytical activities that do not necessarily address the most impactful business needs.
Going back to the strategic importance of predictive analytics, I believe it is important to make the following distinction: that having access to the predictive analytics capability is certainly strategic to the business, but what predictive analytics accomplishes is almost always tactical. The results of predictive analytics (scores, alerts, etc.) are most commonly used to automate certain aspects of decision making, such as recommending the next movie to watch, rank-ordering or prioritizing customers to target, or making decisions on a large volume of credit card applications very rapidly.
Businesses must start with the business objectives, then leverage the right analytical approach for those objectives in order to realize the full potential of data-driven decision making. While predictive analytics capabilities indeed often indicate a level of analytical maturity, they are only one part of analytics maturity. Setting any specific type of analytics as the Holy Grail cheats the organization of the best impact analytics could have. And it would be a shame if business leaders became disillusioned with analytics because that specific type of analytics did not produce the aggregate business impact they were expecting.
Over the last year or so:
- A number of articles have already declared the death of Big Data.
- A question was posed on a site frequented by data science professionals whether businesses should embrace Big Data.
- A meme was immortalized: “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it” (Credit Dan Ariely of Duke University).
- Data science, whatever it means, continues to be one of the hottest professions.
While some have already written the eulogy for Big Data, others still struggle to grasp what it is. Many of the definitions are more conceptual than tangible, perhaps leading to equating Big Data with certain sets of technology items if only to put some boundaries around the Big Data concept. And perhaps it is this lack of tangibility that lends Big Data to be concurrently huggable, sexy, and dead.
Human nature likes things to be tangible. If we recall the introductory statistics course that everyone had to take, a sample size of 30 made it large, and a p-value of 0.05 made the results statistically significant; hopefully it has been pointed out that nothing magical happens at these thresholds. It is also important to remember that Big Data is not always unstructured, and structured data is not always small; the size and the form are two different things. That said, the size of data does eventually imply tangible technology impact, which is easier to talk about.
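For anyone who wants to see just how unmagical the 0.05 threshold is, here is a tiny sketch in Python. The two test statistics (1.97 and 1.95) are made-up values I chose to sit on either side of the line, and the helper name is hypothetical; the only library call is the standard normal CDF from Python's `statistics` module.

```python
from statistics import NormalDist


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Two hypothetical studies with nearly identical evidence.
p_just_under = two_sided_p_value(1.97)
p_just_over = two_sided_p_value(1.95)

print(f"z = 1.97 gives p = {p_just_under:.4f}")
print(f"z = 1.95 gives p = {p_just_over:.4f}")
print(f"difference in p = {p_just_over - p_just_under:.4f}")
```

One result lands just under 0.05 and the other just over, yet the evidence behind them is practically the same. Labeling the first "significant" and the second "not significant" says more about our appetite for tangible cutoffs than about the data.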
What about the business impact? Business problems don’t care how big the data is that solves them. Data of any size is no good until someone makes some sense out of it and uses it to effect a positive change in the business. “Doing” Big Data does not directly lead to solved business problems, yet so much of the focus is still on the size and the form and less on why “doing it” is essential in the first place. The successes come from the ability to leverage the right data to solve a business problem and effect change; starting with the size or form of the data and not with the business problem is the proverbial hammer looking for a nail to hit.
So perhaps the question is whether businesses should embrace a data-driven culture. Most people would probably answer yes. Now the difficult part—this means that it is a shift in business culture. Along with this culture comes the realization that size does not matter and that it’s what you do with it that counts.