Data Science project management must be customized to work best with each organization, but we find that our projects are most successful when managed using an Agile + CRISP-DM process rather than a traditional Waterfall approach. Sprint planning in an Agile + CRISP-DM framework constantly encourages the team to consider emerging requirements and to pivot based on findings from the previous sprint.
The Data Science Delivery Process
Data science initiatives are project-oriented, so they have a defined start and end. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a high-level, extensible process that is an effective framework for data science projects. Figure 1 shows its six main steps (the circles). Although the steps are shown in the general order in which they are executed, it is important to note that CRISP-DM, like the Agile software development process, is an iterative process framework. Each step can be revisited as many times as needed to refine problem understanding and results. This iterative cycle enables information to be shared and lessons to be learned between project activities. Rather than trying to perfect one stage before moving to another, the project team can create a “minimally viable project” (MVP) in a rapid-prototyping mode, learning lessons and taking notes for the next iteration, and thus be much more aware of “downstream” issues as the solution comes into focus.
Figure 1. The CRISP-DM Process Framework
The six CRISP-DM steps are:
- Business Understanding: For data science projects to be successful it is important to have a thorough understanding of the business problem. It is essential to meet with stakeholders and domain subject matter experts to explicitly define “success criteria” for the project. The success criteria are typically framed as decisions made more accurately, efficiently, promptly, and transparently to drive a core objective of the organization. Stakeholders include both strategic and tactical parties impacted by the anticipated solution, including executives and end users. Stakeholders will propose and receive feedback on realistic goals, that is, the magnitude of the benefits they may expect from the project. The analytics team facilitates the conversation on what a “good” model must look like and the evaluation metrics by which to assess the success of any solution.
- Data Understanding: The quality and granularity of the data must be assessed to determine if they will support the objectives defined in the business understanding phase. This will likely involve data acquisition, integration, description, and quality assessment. After such assessment, it is often the case that stakeholder expectations need adjustment. Often key inputs that are known to impact the desired outcome are not accessible, or they have high missing rates. Some inputs may be concatenated from diverse sources and represent different things. The list of problems may be quite long, but this should not discourage the project team. Competent data science teams have learned many methods to work successfully with noisy and incomplete data. This step can also include a review of publicly available data to assess whether external data sources might enhance results. Issues during Data Understanding typically cause us to revisit the Business Understanding step one or more times.
- Data Preparation: Next, one needs to access, transform, and condition available data into a format suitable for modeling and scoring, called an Analytic Base Table (ABT). It must match the granularity of the decision that the deployed model will serve, implying aggregation of raw data to that level, or rationally allocating values down to the required level. The inputs must be available at the time an estimate is made, and every target outcome label should be carefully curated. Data Preparation may include processes such as data cleansing, missing data imputation, feature transformation, case weighting and/or outcome balancing, data abstraction, feature engineering, and evaluation of feature importance. It is frequently in this step that the “art” of data analytics becomes most valuable. Note that this step often takes up a significant portion of the time and resources needed for a data analytics project and is iterated upon repeatedly.
- Modeling: Models are specified in a wide variety of ways. A model is a representation of an object, system, or business process containing an optimal mix of core features relevant to the desired use cases, for example classification or prediction. Multiple model types might be developed, as guided by the requirements set forth in the Business Understanding step. Model approaches as well as the “optimal” mix of core features must balance benefits across multiple criteria such as simplicity, interpretability, and speed vs. accuracy.
- Evaluation: Multiple competing models must be evaluated to determine which model (or model ensemble) best addresses the business objectives. The success criteria identified in the Business Understanding step are used to create a metric that scores the performance of each model in light of intended usage, including criteria such as the costs of errors (i.e. false positives and false negatives). This evaluation will determine not only which model(s) are best, but also which thresholds (or sensitivity levels) are most appropriate. Once evaluation results are available it is important to communicate them to stakeholders. This will undoubtedly lead to revisiting Business Understanding and other previous steps, which will refine the expectations of analysts and end users while communicating the assumptions and limitations of the chosen approach. One output of the Evaluation step is a business case for future studies of the problem that builds upon the merits of the present project. Before the model is implemented, a final evaluation of the model is done on fresh data that was not used previously in the Modeling or Evaluation steps.
- Deployment: Last, focus on how to make results actionable and easy for the end users of the analytic product to understand. This step highlights the success criteria established in the Business Understanding step. Actions include hardening the data infrastructure to bring inputs reliably to the model, designing the best way to make model results accessible (spreadsheet, visualization, interactive dashboard), educating end users on how to interpret the insights, and reviewing the assumptions and limitations of the data and modeling techniques with end users and key stakeholders. Deployment usually requires IT and Information Security to authorize new software and to upgrade IT infrastructure to support the optimal end-user delivery mechanisms. It is important for the analytics team to interface with these teams frequently throughout the life cycle, but especially during deployment, to ensure the deployment can proceed effectively.
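The Evaluation step's idea of scoring models by the cost of their errors, choosing a threshold accordingly, and then checking the result on fresh data can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed method: the function names (`expected_cost`, `best_threshold`), the cost values, the candidate thresholds, and the toy scores and labels are all hypothetical.

```python
from typing import Sequence


def expected_cost(scores: Sequence[float], labels: Sequence[int],
                  threshold: float, cost_fp: float, cost_fn: float) -> float:
    """Total error cost when every case scoring >= threshold is flagged."""
    total = 0.0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 0:
            total += cost_fp  # false positive: flagged, but outcome was negative
        elif not flagged and label == 1:
            total += cost_fn  # false negative: missed a positive outcome
    return total


def best_threshold(scores: Sequence[float], labels: Sequence[int],
                   candidates: Sequence[float],
                   cost_fp: float, cost_fn: float) -> float:
    """Candidate threshold with the lowest total error cost."""
    return min(candidates,
               key=lambda t: expected_cost(scores, labels, t, cost_fp, cost_fn))


# Hypothetical model scores and true outcomes from a validation set.
val_scores = [0.10, 0.35, 0.40, 0.65, 0.80, 0.90]
val_labels = [0, 0, 1, 0, 1, 1]
candidates = [0.3, 0.5, 0.7]

# Assumption: stakeholders judge a missed case five times as costly as a false alarm.
threshold = best_threshold(val_scores, val_labels, candidates,
                           cost_fp=1.0, cost_fn=5.0)

# Final check on fresh data that played no part in modeling or threshold selection.
holdout_scores = [0.05, 0.45, 0.55, 0.75, 0.95]
holdout_labels = [0, 1, 0, 1, 1]
holdout_cost = expected_cost(holdout_scores, holdout_labels, threshold,
                             cost_fp=1.0, cost_fn=5.0)
print(threshold, holdout_cost)
```

On this toy data the chosen threshold moves from 0.7 to 0.3 once missed cases are weighted five times as heavily as false alarms, which is why the error costs need to come from the stakeholder conversation in Business Understanding rather than from the modeling team alone.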
An Alternative to CRISP-DM
We encourage you to consider the formal adoption of the CRISP-DM process but want to point out an alternative methodology created by Microsoft called the Team Data Science Process (TDSP). TDSP is an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently. Its two advantages are that it is more modern, with updated technology stacks and considerations, and that Microsoft provides more in-depth documentation for it. Its disadvantages are that it is verbose and can sometimes make the data science process unnecessarily complex. Elder Research has experience and success with both frameworks, but we more frequently use CRISP-DM and encourage organizations searching for a straightforward and effective data analytics process to employ the power of an Agile + CRISP-DM framework.
Advantages of an Agile Analytics Process
Now that we have reviewed the CRISP-DM framework it is important to understand why the Agile methodology is preferred over the Waterfall standard for analytics project management. The Waterfall approach breaks down project activities into linear sequential phases, meaning the start of each phase depends on the finalization of the previous one. For software development and data analytics this linear dependency tends to become inflexible and less iterative as progress flows downwards in one direction (hence the name). The approach does not work well with data analytics because:
- It is rigid and does not embrace changes, whereas analytics requires the flexibility to pivot, always with a view of adding value.
- It demands detailed and complete requirement specifications up front, whereas data scientists do not know in advance what question(s) the data can answer.
Agile project management is instead an iterative and incremental approach. It develops and delivers requirements throughout the project life cycle, focusing on refining instead of defining. Agile projects are built on trust, flexibility, empowerment, and collaboration, and they stay responsive to findings throughout the project. Agile:
- Prioritizes regular collaboration with stakeholders to integrate their feedback into the design
- Embraces changing requirements, even late in the development cycle
- Values the introspection of intermediate results from inductive processes like data mining
- Improves customer satisfaction through early and continuous delivery of valuable and actionable insight
Figure 2 illustrates the core differences between the two project management methods. Elder Research data scientists and project managers embrace Agile and use it for all our engagements.
Figure 2. Waterfall vs Agile Project Flow
Agile trusts the self-organizing team to respond to “realities on the ground” in a quick and efficient manner. It fosters collaboration and builds trust between the data science team and project stakeholders, ensuring they are an integral part of the process and are in much more regular communication than is common under other frameworks. There are many articles and much documentation about Agile, but we recommend Excella as a great resource for Agile training and Scrum Master certification.
CRISP-DM can be closely aligned and integrated with Agile, where each phase of an extended CRISP-DM cycle (Goal Definition, Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, Knowledge Application, and Measurement) is a column or stage within an Agile board. An analytics project may be planned as several Agile sprints, each utilizing all CRISP-DM phases but focused primarily on one or two of them. For example, the first sprint likely focuses on discovery: core business needs, data availability and consolidation, current practices, and current performance. The next may be to make a baseline model using just three inputs. Then one or two sprints can focus on finding the best model specification. The final sprint may hone and harden the deployment structure. Again, each sprint touches on every phase of CRISP-DM but the central focus changes with each one. In this Agile structure, the project remains time-boxed for each sprint and for the final project deliverables. Every sprint has scheduled review and collaboration sessions with relevant stakeholders. Each sprint builds on the products and learning of the previous one, leading to a final result that is understood, accepted, and owned by the project stakeholders.
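One way to picture the board described above is as a small data structure: the phases are the columns, each task sits in one column and belongs to one sprint, and a sprint's focus is the phase carrying most of its tasks. The sketch below is a hypothetical illustration, not a tool we prescribe: the `Task` and `Board` classes, their methods, and the sample task titles are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

# Board columns, taken from the phases listed in the text above.
PHASES = [
    "Goal Definition", "Business Understanding", "Data Understanding",
    "Data Preparation", "Modeling", "Evaluation", "Deployment",
    "Knowledge Application", "Measurement",
]


@dataclass
class Task:
    title: str
    phase: str   # the board column the task lives in
    sprint: int  # the time-boxed sprint the task is planned for


@dataclass
class Board:
    phases: List[str] = field(default_factory=lambda: list(PHASES))
    tasks: List[Task] = field(default_factory=list)

    def add(self, title: str, phase: str, sprint: int) -> None:
        """Place a task in a column, rejecting phases that are not on the board."""
        if phase not in self.phases:
            raise ValueError(f"unknown phase: {phase}")
        self.tasks.append(Task(title, phase, sprint))

    def column(self, phase: str) -> List[str]:
        """Titles of all tasks in one column, across sprints."""
        return [t.title for t in self.tasks if t.phase == phase]

    def sprint_focus(self, sprint: int) -> Optional[str]:
        """Phase holding the most tasks in a sprint, or None if the sprint is empty."""
        counts = Counter(t.phase for t in self.tasks if t.sprint == sprint)
        return counts.most_common(1)[0][0] if counts else None


board = Board()
# Sprint 1: discovery.
board.add("Interview stakeholders on core business needs", "Business Understanding", 1)
board.add("Document current practices and performance", "Business Understanding", 1)
board.add("Inventory available data sources", "Data Understanding", 1)
# Sprint 2: baseline model with three inputs.
board.add("Assemble the three baseline inputs", "Data Preparation", 2)
board.add("Fit baseline model", "Modeling", 2)
board.add("Compare baseline against current practice", "Modeling", 2)

print(board.sprint_focus(1), "|", board.sprint_focus(2))
```

Keeping the phase and the sprint as separate fields reflects the point that every sprint may touch any phase while its central focus shifts from one sprint to the next.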