How would you systematically go about planning a machine learning or data science project? As a data scientist or ML engineer, you don’t want to make the mistake of diving straight into solution design and modeling without first understanding the problem and objectives at hand. It’s also crucial to spend time on the data itself. If you like frameworks and would prefer to build some discipline around structuring your data science or machine learning process, CRISP-DM (Cross-Industry Standard Process for Data Mining) can help you. CRISP-DM is a well-known, widely used industry-standard process that modularizes a data science or machine learning project into iterative steps. It breaks a project down into six iterative phases:
- Business understanding: This involves understanding the problem statement and the target users. Think about what you’re solving and why it matters. Identify the gaps in the current state and understand how the problem is solved today. Then quantify the business impact you expect to achieve once you solve the problem for your target users, and translate that impact into outcome and output metrics. Make sure you define your success and failure metrics before diving into the data. It behooves you to understand your problem, gather domain expertise, and identify relevant factors.
- Data understanding: This involves gathering data, validating data, and performing exploratory data analysis (EDA). Gathering data entails sourcing the relevant data from identified sources, labeling the data if it’s not already labeled, and creating the relevant features. Validating data involves deciding how you or the business would like to handle missing, empty, or erroneous data and outliers. This step also includes performing quality control over your data to ensure the values are what you expect, and cleaning the data appropriately. Finally, EDA involves exploring the data, performing statistical analysis and visualization, identifying relationships and patterns, and applying dimensionality reduction.
- Data preparation: Data preparation entails feature engineering and selection, followed by the steps that get your data ready for modeling. This includes splitting your data into training and test sets, scaling or standardizing your features (with the scaling parameters learned from the training set only, so that no information leaks from the test set), resolving any class imbalances, and encoding categorical features where needed.
- Modeling: Modeling involves both model selection and tuning. The modeling process involves evaluating various algorithms via cross-validation, optimizing hyperparameters, documenting and versioning the model, and subsequently re-training the model. It also involves making the necessary trade-offs (performance, interpretability, and computational cost) when choosing the optimal algorithm as part of your model selection.
- Evaluation: This is where you score your model on the test set, interpret its output, and evaluate its performance. It’s a good idea to write unit and integration tests for your model to make it more robust. User tests can then build more confidence before you operationalize your model.
- Deployment: After you have evaluated the performance of your model and tested your solution, it’s time to deploy your model, adhering to the software deployment process and security measures already in place. Your final model could be exposed as an API or integrated into an existing product or service. Ensure that you have a plan in place to monitor the performance of your model and to re-train it if the need arises.
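To make the data preparation, modeling, and evaluation phases more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration and not part of CRISP-DM itself: the synthetic two-class data set, the hypothetical `SUCCESS_THRESHOLD`, the helper names, and the choice of a k-nearest-neighbors classifier. The sketch holds out a test set first, fits the scaler on training rows only, picks `k` by cross-validation, and scores the chosen model on the test set exactly once.

```python
import random
import statistics
from collections import Counter

def make_toy_data(n=200, seed=42):
    """Synthetic, balanced two-class data with features on very different scales."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        features = [rng.gauss(2.0 * label, 1.0), rng.gauss(1000.0 * label, 500.0)]
        rows.append((features, label))
    return rows

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a test set before anything is fitted."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_scaler(rows):
    """Learn per-feature mean and standard deviation from the given rows."""
    columns = list(zip(*(features for features, _ in rows)))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def scale(rows, scaler):
    """Standardize features with previously learned means and deviations."""
    means, stds = scaler
    scaled = []
    for features, label in rows:
        z = [(x - m) / s for x, m, s in zip(features, means, stds)]
        scaled.append((z, label))
    return scaled

def knn_predict(train_rows, features, k):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    def distance(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    nearest = sorted(train_rows, key=distance)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train_rows, eval_rows, k):
    hits = sum(knn_predict(train_rows, f, k) == y for f, y in eval_rows)
    return hits / len(eval_rows)

def cross_validate(rows, k, folds=5):
    """Mean validation accuracy; the scaler is refit inside every fold."""
    scores = []
    for i in range(folds):
        val = rows[i::folds]
        train = [r for j, r in enumerate(rows) if j % folds != i]
        scaler = fit_scaler(train)
        scores.append(accuracy(scale(train, scaler), scale(val, scaler), k))
    return statistics.fmean(scores)

# Hypothetical success metric, fixed before looking at any data.
SUCCESS_THRESHOLD = 0.75

data = make_toy_data()
train_rows, test_rows = train_test_split(data)

# Modeling: choose the hyperparameter k using the training data only.
best_k = max((1, 3, 5), key=lambda k: cross_validate(train_rows, k))

# Evaluation: score the chosen model once on the held-out test set.
scaler = fit_scaler(train_rows)
test_accuracy = accuracy(scale(train_rows, scaler), scale(test_rows, scaler), best_k)

print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
print("meets success threshold:", test_accuracy >= SUCCESS_THRESHOLD)
```

The toy data is balanced by construction, so no class-imbalance handling appears here. In a real project you would more likely reach for an established library than hand-rolled helpers, but the ordering of the steps is the point of the sketch.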
Again, remember that this is an iterative process, so you might find yourself moving back and forth between phases if project requirements change or you’re not satisfied with the outcomes. You might also want to modify some steps depending on your project requirements and timelines. Overall, CRISP-DM is a domain-agnostic process that can help you build discipline and a framework around executing your data science and machine learning projects.
Source: indiaai.gov.in