- Finding a relevant business problem to solve: Often neglected, this is the most important step of the process, since generating business value is the end goal of any data analyst. A clear objective and a restricted data space to explore are essential to avoid wasting resources. Because it requires deep knowledge of the problem domain, this step may be carried out by a domain expert rather than the data analyst.
- Data extraction: The next step is to collect data for analysis. It could be as simple as loading a CSV file, but more often than not it involves gathering data from multiple sources and formats.
- Data cleansing: After gathering the data, the dataset needs to be prepared for processing. Likely the most time-consuming step, data cleansing can include handling missing fields, corrupt data, outliers, and duplicate entries.
- Data exploration: This is often what comes to mind when thinking of data analysis. Data exploration involves generating statistics, features, and visualizations from the data to better understand its underlying patterns. This then leads to insights that might generate business value.
- Data modeling and model validation (optional): Training a statistical or machine learning model is not always required, since a data analyst usually generates value through insights found in the data exploration step. A model may still uncover additional information: easily interpretable models, such as linear or tree-based models, and clustering techniques often expose patterns that would otherwise be difficult to detect with data visualization alone.
- Storytelling: This last step draws together everything uncovered previously to present a solution to the business problem proposed in the first step, or at least a path for further exploration. It’s all about communicating findings clearly to stakeholders and convincing them to take a course of action that will create business value.
These are the most common steps of data analysis. Although they have been presented as a list, they are usually not executed strictly in sequence, and some steps may require several iterations as new data sources are added and new information is uncovered.
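To make the extraction, cleansing, and exploration steps more concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the inline `RAW_CSV` data, the column names (`order_id`, `region`, `amount`), the helper names (`extract`, `cleanse`, `explore`), and the outlier rule (distance from the median measured in median absolute deviations, with an arbitrary threshold). A real project would more likely use a dataframe library and rules chosen with domain knowledge.

```python
import csv
import io
import statistics
from collections import defaultdict

# Hypothetical inline data standing in for a real source (file, database, API).
RAW_CSV = """order_id,region,amount
1,north,120.0
2,south,95.5
2,south,95.5
3,north,
4,east,110.0
5,east,5000.0
6,south,101.0
7,north,99.0
"""


def extract(text):
    """Data extraction: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def cleanse(rows, threshold=5.0):
    """Data cleansing: drop duplicate IDs, missing amounts, and outliers."""
    seen = set()
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if row["order_id"] in seen or not amount:
            continue  # duplicate entry or missing field
        seen.add(row["order_id"])
        cleaned.append({**row, "amount": float(amount)})
    if not cleaned:
        return cleaned

    # Treat a value as an outlier when it lies more than `threshold` median
    # absolute deviations (MAD) from the median. The threshold is arbitrary.
    amounts = [r["amount"] for r in cleaned]
    median = statistics.median(amounts)
    mad = statistics.median([abs(a - median) for a in amounts])
    if mad == 0:
        return cleaned  # no spread to measure against, keep everything
    return [r for r in cleaned if abs(r["amount"] - median) <= threshold * mad]


def explore(rows):
    """Data exploration: compute the mean amount per region."""
    by_region = defaultdict(list)
    for row in rows:
        by_region[row["region"]].append(row["amount"])
    return {region: statistics.mean(values) for region, values in by_region.items()}


if __name__ == "__main__":
    raw = extract(RAW_CSV)
    clean = cleanse(raw)
    print(f"{len(raw)} raw rows, {len(clean)} after cleansing")
    for region, mean_amount in explore(clean).items():
        print(f"{region}: {mean_amount:.2f}")
```

In this toy dataset, cleansing removes one duplicate order, one row with a missing amount, and one extreme value before any statistics are computed. In practice the loop would be revisited as exploration reveals further problems in the data, which is one reason the steps rarely run in strict sequence.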