If the area responsible for the purchase and operation of the ovens for a cookie company were also responsible for the quality of the cookie ingredients, would you expect to love the cookies?
Aspirations in data and analytics are everywhere today, with the growing realization that effective use of information is critical to achieving business success. Yet even with much investment and effort in the best tools and resources, conflicts and silos are still all too commonplace in data and analytics.
It is also interesting that data is often organizationally framed as an adjacency to technology. Recent high-profile data breaches and new regulations have forced information risk management into one of the top priorities for many organizations. The management and governance of information are being addressed today primarily from the technology point of view, with a heavy focus on risk (security and privacy) and compliance. While the need for risk management and compliance is unquestionable, the currently prevailing approach does this outside of the context of the business need for information, or independently of the value of the information to the business; that is, it only addresses the need to manage the information supply chain and not the information value chain.
While no analogy is perfect, we can find a pretty good one in something we may all appreciate—cookies:
- Data are the ingredients to the insights, as the flour, the butter, etc. are the ingredients to the cookies.
- Analytical development results in how to formulate the explanations that lead to insights from the data, just as recipe development results in how to formulate cookies from the ingredients.
- Technology stores, transports, and provides the tools to transform the data into insights, just as appliances and vessels store, transport, and provide the tools to mix and bake the ingredients to produce cookies.
- The business user consumes the insights, just as the consumer (buys and) eats the cookies.
The cookie value chain is complete when the cookie is consumed; without consumption of cookies, the value chain and its components have no reason to exist other than for academic or scientific reasons. To ensure that the value chain generates the desired value, the responsibility of each component must be owned by the people who understand the impact of their expertise on the end consumer. As consumers, we understand that while the butter is stored in refrigerators, refrigerator technicians are not experts on butter, and we understand that the fact that cookies are baked in an oven does not make oven technicians experts on recipe development.
In contrast, the data responsibilities and even analytics responsibilities often fall under IT. On the surface this seems reasonable since technology often generates the data as well as provides key capabilities in tools and environment throughout the supply chain. However, there are at least two challenges with this approach: first, it assumes custody means expertise, and second, it is designed to manage only the information supply chain, stopping short of managing the value. And dual roles are rarely as effective in practice as intended in a value chain.
It does not have to be complicated. In fact, far too often the fundamentals become lost in the focus on implementing the details of information management. Also, a change in responsibilities and ownership does not necessarily mean a change in organizational structure. Of course, all this needs to be done in the way that best suits the organization and its business goals, and the Human Resources department should play a huge part.
Finally, none of this means that the information supply chain does not need to be managed; rather, information risk management and compliance and the general supply chain must be an integral part of the larger strategy and management of the information value chain. After all, what is the importance of the tightly managed appliances if no cookies are consumed?
There was a social media post, a bit of a brain teaser, about a murder mystery of sorts: a man was killed one afternoon, and there were several suspects, each with an alibi; it’s your basic whodunit. One of the suspects was a chef who was making breakfast. Many comments on the post insisted it had to be the chef since no one makes breakfast in the afternoon.
I have had breakfast for dinner many times. McDonald’s offers “all-day breakfast” at least in the U.S. Perhaps this is a regional thing, but it brings me to the point that we are inclined to assume the world operates as we see it from our own perspective. It’s human nature, but I wonder to what degree it contributes to increased friction and unwillingness to accept that non-preferred alternatives exist.
You might be wondering what this has to do with data and analytics. First, we underestimate the importance of clear operational definitions. What do you mean by “breakfast”? Does it refer to the time of the day the meal is eaten, or the type of the food typically eaten in the morning? Many analytical attempts are flawed from the initial stages because the definitions are subject to interpretation. Second, we live and (nearly) die by assumptions. Many assumed that “breakfast” referred to the meal eaten at a specific time of the day, but that is not true in all contexts. Successfully operationalizing any analytic requires not only identifying all relevant assumptions and the consequences of violating them, but also preparing for these consequences. In practice, assumptions are violated all the time; assumptions we don’t even know we are making can have devastating consequences.
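To make the point about operational definitions concrete, here is a minimal sketch in Python. The meal log, the food list, and the 11 a.m. cutoff are all made up for illustration; the only point is that the same data yield different answers to "how many breakfasts?" depending on which definition is applied.

```python
# Hypothetical meal log: (food, hour of day eaten on a 24-hour clock).
MEALS = [("pancakes", 8), ("pancakes", 19), ("salad", 7), ("eggs", 14)]

# Assumed definitions, invented for this illustration.
BREAKFAST_FOODS = {"pancakes", "eggs"}
MORNING_CUTOFF_HOUR = 11


def is_breakfast_by_time(meal):
    """Definition A: breakfast is any meal eaten before the morning cutoff."""
    _, hour = meal
    return hour < MORNING_CUTOFF_HOUR


def is_breakfast_by_food(meal):
    """Definition B: breakfast is any meal made of typical breakfast food."""
    food, _ = meal
    return food in BREAKFAST_FOODS


def count_breakfasts(meals, definition):
    """Count the meals that satisfy the given definition of breakfast."""
    return sum(1 for meal in meals if definition(meal))


if __name__ == "__main__":
    print("By time of day:", count_breakfasts(MEALS, is_breakfast_by_time))
    print("By type of food:", count_breakfasts(MEALS, is_breakfast_by_food))
```

Only one meal in this toy log (the morning pancakes) counts as breakfast under both definitions, so any analysis built on "breakfast" has to state which definition it uses before a single number is computed.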
The second point is connected, perhaps somewhat convolutedly, to the fact that the end consumer of analytics is a human one way or another. (It is also related to the topic of collaboration involving analytics professionals, but that’s a separate discussion.) We are drawn to view things in ways that are comfortable to us. I have no desire to get political here, but I am convinced that the greater polarization in society today is due to the hyper-clusterization of the viewpoints, opinions, and beliefs, rather than the feeling of connectedness at the personal level, and much of this is being perpetuated by social media and other aspects of “Big Data.” We want to believe that we know the Truth and everyone who disagrees is an idiot, thus caring about the opinions more than about the human to whom the opinions belong. (I recently read “The Fourth Industrial Revolution” by Klaus Schwab and found something similar expressed, although he’s obviously far more authoritative than I am!) This, in turn, makes it easier to capture in algorithms what makes each of us tick.
Big Data may be enabling a higher level of personalization, but personalization isn’t the same as human connection. Could data-driven personalization actually be contributing to the dehumanization of humans? Artificial intelligence and deep learning and such, as useful as they are, cannot be at the expense of being human, since there is always a human somewhere in the process. I’m not concerned about AI taking over the world as much as that it may be eliminating our need to be human.
For if it ever gets to the point where we have to re-learn how to be human, no amount of data-driven intelligence is going to save humanity.
There is an inside joke between my husband and me about the infomercials touting that you can learn to play the piano in a flash. He (jokingly) threatens to achieve in a mere four hours what took me many years of blood, sweat, and tears (I was a professionally trained classical pianist in my previous life), but for now, it remains an empty threat (thankfully).
I think it is fair to say that most reasonable people understand these programs do not turn a complete newbie into a professional pianist in only a few hours. I have always strongly encouraged people to learn and enjoy playing the piano, as that also enables them to appreciate the work of other musicians more deeply and perhaps even collaborate under the right circumstances. However, offering the skill as a professional service for a fee is a different story, and it would be irresponsible for me to encourage that for someone who has only some cursory training.
The same is true of data science (or anything else for that matter). Learning should be encouraged so that one can appreciate it as well as understand its potential more intelligently–needed today simply to stay competitive. However, the line between intelligent appreciation and hard skills is becoming increasingly blurred, with the unsolicited “Learn Data Science in X Days/Weeks” advertisements showing up on my feeds daily. Along with the popularity of analytics democratization, is it becoming another factor that could threaten the integrity and eventual well-being of advanced analytics?
It is curious there seems to be anywhere from a tacit acceptance to enthusiastic encouragement for this in data science. To be fair, I do not believe that these short programs are put together under the presumption that they make a complete novice into a fully competent data scientist. That said, the expectations are usually not clearly articulated, and I am rather annoyed by what is essentially a marketing tactic that takes advantage of the hype. It plays very well into the rapid-results culture that often encourages shortcuts.
I recognize that not everyone in these programs is starting from scratch, and those with a more adjacent background and just a few missing skills have a much easier time transitioning into this much coveted discipline. There are other factors, obviously. We can also question what one means by a “data scientist” (let’s not get started on that here), but it suffices to say that what a business needs from a “data scientist” runs a wide gamut, not all of which is about learning specific algorithms or programming languages. However, I expect any “data scientist” to have the following hard competencies at the minimum:
- Solid understanding of probability concepts, on which any analysis design is heavily dependent regardless of the methodology (statistical or otherwise) ultimately employed.
- Solid ability to code, whatever the language. A data scientist must be comfortable getting around very messy raw data, big or small–it is the science of data, after all. The specific language is secondary, as long as its strengths and weaknesses are understood. What is more important is one’s ability to logic his or her way through a messy pile of data while programming efficiently and in a well-structured manner; one can always learn another language. (A recent comment hinted data scientists had to program in Python. Nonsense. I once coded something entirely in Base SAS just to prove a point and, of course, because I could.)
I purposely left out analysis techniques from the criteria. This is where short courses are perfectly suited–you can always learn techniques. But you need the above two first and foremost, and their development is not measured in weeks.
Can you learn data science in a flash? Like you can learn to play the piano in a flash.
“Everyone’s a data scientist–if they have the right tools”–a well-known business publication commented on social media, referencing an article on data democratization.
This is like saying everyone is a driver if he or she has the right car. While that may technically be true, you need to learn how to drive, then you need a driver’s license (well, at least legally in most places). It still says nothing about whether you know where you are going–you need a map or good directions. Some people are bad with directions; some are simply bad drivers. You could be the world’s best driver with the best car, but without the right map and directions, you have no chance of reaching your destination. What if, one day, all GPS maps cease to exist?*
Outside of the political context, Webster defines “democratic” as “relating to, appealing to, or available to the broad masses of the people.” The comment above by the publication implies democratization of analytics is equivalent to the democratization of data plus tools. However, this is true only if you define analytics to consist strictly of tools that can replace all understanding. The democratization process is different between data and analytics; data is a tangible asset, while analytics is a set of human activities on that asset, which may or may not include tools (technically speaking, it is possible to do 100% of analytics by human power only).
Then how should analytics be made available to the broad masses? You can put a car in everyone’s hands, but not everyone should drive. On the other hand, the access to the benefits of vehicles can be near universal, as even those who cannot drive a car can generally ride in one. When it comes to analytics, however, far too many people equate democratization with universal permission to drive (i.e., execution of analytics), rather than universal access to its benefits. This has led to perhaps one of the most critical problems with analytics today: your ability to carry out the analysis tasks says nothing about whether the analysis is valid, and the recent trend of actively shifting the focus away from analysis validity and directly toward technology is troubling for the future of decision-making. Not too long ago, I came across a blog touting a “Big Data Easy Button”; again, without the right analysis design, it is simply an easy button to execute the wrong analyses, and even if the analysis is correct, its benefits often still remain out of the reach of those who need the insights to make decisions.
The recent re-emergence of the p-value controversy is another case in point. For those who recall p-values from that one statistics class, less than 0.05 and you had a statistically significant result. However, it says nothing about whether the analysis itself is valid in the first place. The fact that you can apply the mechanics of the statistical analysis to obtain the p-value does not validate your conclusion, much like the fact that you can operate a vehicle says nothing about whether you can actually reach the intended destination. Unfortunately, the p-value has achieved the statistical easy button status; the fixation on p-values routinely leads to false conclusions, driving the editor of a well-known periodical to call for a ban of p-values altogether. The problem is not with the tool but with the users of the tool and the context in which the tool is used. Banning the tool does not eradicate bad users of statistics or fix the context, and the true benefits of statistics still never reach those who really need them.
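One way to see how the mechanics can run flawlessly on a flawed design is a small simulation. The sketch below is my own illustration, not taken from any of the articles in the controversy: it compares two groups drawn from the same distribution, so there is no real difference to find, and repeats the comparison many times. The number of comparisons, the group size, the seed, and the known standard deviation are all assumptions chosen for the example.

```python
import math
import random


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_statistic(group_a, group_b, sigma=1.0):
    """Two-sample z statistic, assuming a known common standard deviation."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    standard_error = sigma * math.sqrt(1 / len(group_a) + 1 / len(group_b))
    return (mean_a - mean_b) / standard_error


def count_false_positives(n_tests=200, n_per_group=30, alpha=0.05, seed=0):
    """Count 'significant' results when both groups come from the same source."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p_value(z_statistic(group_a, group_b)) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    hits = count_false_positives()
    print(f"{hits} of 200 comparisons of pure noise came out 'significant'")
```

At a 0.05 threshold, roughly one in twenty of these comparisons is expected to come out "significant" by chance alone. Every p-value here is computed correctly; what is invalid is a design that runs many comparisons and reports only the hits.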
While there is so much attention on doing analytics, many are still so far out of the reach of its benefits. People are convinced that one must drive rather than ride for analytics to be universally accessible, and many continue to wait for the ride that will never come, while bad drivers clog up the streets, not knowing where they are going and causing massive pile-ups. Show me a place where everyone with a great car is a great driver who knows exactly where he or she is going, and I will show you an analytics easy button that makes everyone a data scientist.
*P.S. What did we do before GPS?
One of the prevailing challenges in making analytics successful is obtaining quick wins. What businesses often need is an analytics equivalent of the agile sprint approach, in which actionable results are delivered in short cycles; yet for many, this continues to escape reality. Too often an elaborate analytical project is planned or carried out, stemming from insistence (on rigor and perfection), naiveté (about all that is possible), or neglect (of the original intent).
Contrary to the common perception, an effective execution of “agile analytics” is almost all about design rather than about tools and/or technology; the speed to insight is driven almost entirely by how the analysis itself is designed. While tools and technology can sometimes help, they cannot make up for poor design–they will simply deliver an invalid solution a tad faster. And when the analysis is sufficiently simple, a wrong design delivers with less effort a solution to the wrong problem. This is like the worst case of “are we there yet?”–wrong directions and/or wrong destination; having a faster car does not correct for either, and all the passengers are hungry and cranky.
Many activities related to analytics have direct parallels in technology, although this often eludes even highly skilled analytics professionals. The basic technology development lifecycle framework–Discover, Design, Develop, Deploy, or any of the other widely accepted variants thereof–applies to analytics as well. For analytics, however, so much of the attention is currently going to “Develop”–that is, where the analytical techniques are applied–that the other three phases are practically being ignored. This has an impact not only on the feasibility of agile analytics but on the effectiveness of analytics overall.
Just like in technology, the “Develop” in analytics–effective application of analytical techniques to obtain insights or build an algorithm–must be based on good functional requirements that are designed to build the business needs as well as operational constraints into the solution. Then, the analyst’s job is to use the best techniques to fulfill those requirements through development. The challenge is, data scientists are often expected to do both the “Design” and the “Develop” based on their ability to apply analytical techniques–that is, based solely on their ability to do the “Develop.” This is the equivalent of expecting technology developers to be great architects based on their ability to code. The technology field is now mature enough to realize this, but somehow the analytics community has yet to acknowledge the same. To further complicate the matter, many data scientists implicitly believe they can design based on their ability to develop, since the distinction is not very well understood even in the advanced analytics community.
What makes an analytical professional great at analytical design? That is a discussion for another time. However, it suffices to say that there are far more expert analytical developers than there are professionals who excel in analytical design. And as with anything else, it pays off to get it right up front–getting the solution design right is critical in setting the right expectations. Then, if the ecosystem of analytics consumption is in place, you will have two major pieces of foundation needed toward getting the real value out of analytics that are missing from so many of the analytical
Science–“the state of knowing: knowledge as distinguished from ignorance or misunderstanding.” (Merriam-Webster)
I once held the title of “Lead Data Scientist.” While the title made no difference to me (I see it as a label), I spent more time explaining what I did and what my value was to the business world. If I were really smart, I would have built a predictive model to see that coming, retired, and would now be on a beach somewhere, with a fruity drink. Alas, as usual I am reminded of my own shortcomings, as a data scientist or otherwise.
Then I recently came across a heated discussion on what a data scientist was. For as long as the world has coveted this resource, it is curious how this is still a hot debate—we all desperately need it, but we don’t really understand what it is.
Interestingly enough, the question is never with “data” but rather with “science.” The Webster definition above is only the first one, but in the subsequent definitions there are recurring references to (systematic) knowledge. If we break down the term “data science,” it would simply mean systematic knowledge of data, much as the discipline of social science represents systematic knowledge of human society and that of political science represents systematic knowledge of politics. Therefore, “data science” is systematic knowledge of data, and a “data scientist” is someone who studies data systematically. Like social science and political science, data science has areas of specialization.
The thing is, the specialized areas within the “data science” discipline imply specialized skill sets (i.e., methodologies and techniques) rather than expertise in the area of application; however, the business world has yet to understand how to articulate this well. Even among the analytics experts, the term is associated in some circles with a very specific set of methodologies and approaches, while others have a much broader interpretation.
Additionally, it is important to recognize that people hired primarily for their skill sets have different functions from those hired primarily for business expertise. Specifically, the skill sets are tools to solve business problems and rarely the business goals themselves, and this often leads to misaligned expectations. The technology discipline perhaps understands this a little better, and as a result its functional maturity is probably about a decade ahead of “data science.”
The point is, we all must think about what we need when we hire a “data scientist.” So, short of abolishing the term, I submit the following:
- It is imperative to articulate specific business problems for which you need a data scientist. “We need data scientists in order to stay competitive” is too broad by itself to be effective, like saying “we need a social scientist in order to better understand society.” Defining the technical qualifications is very straightforward once the business needs are well defined.
- Hiring a skill set equates to hiring a consultant with that skill set. An organization does not have to be a consulting organization, and the “data scientist” does not even have to ever work with anyone outside of his/her own group for this to be true. A skill set with no consultative aptitude will never help you solve your business problems.
Some have proposed the term “data artist,” which, of course, induces another set of heated discussions. As a trained performance artist, I believe I am sufficiently qualified to declare what I do with data is not art, although there certainly are some creative elements to it. Now, if we can stop arguing about what to call ourselves and actually get some work done, so that we can get to sipping that fruity drink….