In our experience, the mistake of “waiting for perfect data” probably kills more projects than any other. Here’s a typical scenario:
The project starts out well. The management team defines the goals, calculates the potential return on investment, develops a project plan, gets a budget approved, assembles the team, and launches the project. The trouble starts with a desire to make sure that the data is in “good” condition.
Most organizations today have vast amounts of data. Much of it is stored in relational databases, but it can also be found in survey results, physicians’ notes, CSV files, and software usage logs, to name a few sources. The data is often in different formats because it comes from a variety of sources, and some of it will be missing, corrupted, or poorly organized. Available data is almost always messy; collected and stored for purposes other than analytics, it must be parsed, cleaned, and transformed into a format suitable for analytical modeling and visualization. Further, your data will never contain every relevant predictor for the business problem you are trying to solve, especially since the things that drive human behavior are so varied and complex. Nevertheless, insights are almost always available with the data you have, even if it isn’t the data you wish you had.
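As a rough illustration, a few lines of Python capture the kind of parsing and cleaning involved. This is only a sketch; the file name and the columns (order_date, region, revenue) are hypothetical, not taken from any client engagement:

```python
# A minimal sketch of typical cleanup for a messy extract (hypothetical columns).
import pandas as pd

df = pd.read_csv("survey_export.csv")

# Dates arrive in mixed formats; coerce unparseable values to NaT instead of failing.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Free-text categories are rarely consistent; normalize whitespace and case.
df["region"] = df["region"].str.strip().str.lower()

# Numeric fields exported as text often carry stray characters such as commas.
df["revenue"] = pd.to_numeric(
    df["revenue"].astype(str).str.replace(",", "", regex=False), errors="coerce"
)

# See how much is missing before deciding what to impute or drop.
print(df.isna().mean().sort_values(ascending=False))
```

Real engagements involve far more of this work, but none of it requires the data to be perfect before the project begins.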
Experienced data scientists expect to work with messy data, and they have tools and techniques to get around the most challenging data problems. Yet, many organizations are reluctant to start on a project until they are confident that their data is complete and well organized. As most people who work with data know, that almost never happens. Too often, the delays caused by waiting for ideal data prevent the project from getting off the ground.
This is what happened with one federal agency that called us in for a data science engagement. In every weekly meeting, management said they were “getting the data together” for us. After nine months of waiting for the perfect data, a contract modification had to be issued to extend the period of performance. Several months later another contract extension was needed, and then another. Finally, after almost two and a half years, the project was completed.
Because of the organization’s reluctance to release data it considered inadequate, the project took more time and cost more money than necessary. Had we been allowed to work with early versions of the data, we could have completed the project in just a few months.
On a more positive note, another of our clients acknowledged from the beginning that there were missing values in their data and that some of the records were inconsistent. Nevertheless, this large, multinational corporation provided us with a sample data set. Two of our data scientists analyzed it, and within forty-five minutes they had identified segments of the data that were good enough to begin the project. The client’s decision to proceed despite the data issues ultimately saved the organization significant expense and led to faster results.
At the other extreme from organizations that think their data must be perfect are organizations that think their data is already perfect. This latter error is usually less harmful to a project than the former, but its costs can still be high.
Most people who are only moderately familiar with data science assume that the major emphasis of a predictive analytics project is on model building. In reality, our firm typically spends 65 to 80 percent of our time on understanding, cleansing, preparing, and validating the data, known as extract, transform, and load (ETL), before the modeling process begins. When the data preparation is thorough and well executed, the modeling process goes more smoothly and produces better results. Sloppy data preparation, on the other hand, leads to poor modeling results.
No organization has perfect data. Even when a data set is relatively clean, the modelers must spend time understanding it and making sure that it is properly prepared for the modeling process. When an organization thinks its data is perfect, it tends to have unrealistic expectations about the time and cost required to complete an engagement.
This is what happened with one of our clients, a health care services company. When our firm presented our project plan, company leaders pushed back hard on the schedule because they felt we were planning to spend too much time on data preparation. They simply could not understand why the modeling process would be delayed by data preparation, since they were providing such clean data. After some difficult conversations, Elder Research commenced the engagement, being careful not to rush the data-understanding and the data-preparation processes. In the course of the assignment, we found some significant problems in the data and reported them to management. For example, on some customer documents there were multiple ship dates, sometimes months apart, and on others the ship date preceded the order date. The client reluctantly acknowledged that the data needed some cleaning before the modeling process could begin, and more reasonable expectations were established.
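Problems like these are usually easy to surface with a few simple consistency checks run early in the engagement. The sketch below is one illustrative way to do so, assuming a hypothetical orders file with document_id, order_date, and ship_date columns:

```python
# Illustrative consistency checks on a hypothetical orders extract.
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date", "ship_date"])

# Documents that carry more than one distinct ship date.
multi_ship = orders.groupby("document_id")["ship_date"].nunique()
multi_ship = multi_ship[multi_ship > 1]

# Records where the ship date precedes the order date.
bad_sequence = orders[orders["ship_date"] < orders["order_date"]]

print(f"{len(multi_ship)} documents with more than one ship date")
print(f"{len(bad_sequence)} records shipped before they were ordered")
```

Checks of this sort cost little to run, and reporting their results early helps set realistic expectations about how much preparation the data actually needs.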
Connecting data understanding with a clearly defined business problem can be a challenge, but it is well worth the effort. Documenting assumptions and risks along the way sheds new light on the available data, informs expansions and enhancements to the data layer, and, perhaps more importantly, provides new insights into existing business processes. Refining the connection between the business question to be answered and the data available to answer it clarifies the analytics strategy and roadmap an organization can use to reach a state of pervasive analytics.