Introduction
Data profiling is where to start when data quality is a priority. This step ensures that the data you have access to is legitimate and has acceptable quality. Data profiling focuses on examining and analyzing data, followed by creating a useful summary of that data. Effective data profiling falls into three categories:
- Structural discovery, which validates the data’s consistency and correct formatting
- Content discovery, which focuses on individual records to check for errors
- Relationship discovery, which examines how parts of the data relate to each other
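As a rough illustration of the three categories, the sketch below runs one check of each kind on a tiny in-memory data set. The records, field names, and function names are hypothetical and exist only for this example; a real profiling tool would cover far more cases.

```python
import re

# Hypothetical sample records; all field names and values are illustrative.
customers = [
    {"id": "1", "email": "ana@example.com", "signup": "2023-01-15"},
    {"id": "2", "email": "bob@example", "signup": "2023-02-30x"},
    {"id": "3", "email": "", "signup": "2023-03-01"},
]
orders = [
    {"order_id": "10", "customer_id": "1"},
    {"order_id": "11", "customer_id": "4"},
]

# Assumed expected format for dates: YYYY-MM-DD (format only, not calendar validity).
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def structural_discovery(rows, field, pattern):
    """Share of values in `field` that fully match the expected format."""
    matches = sum(1 for row in rows if pattern.fullmatch(row[field]))
    return matches / len(rows)


def content_discovery(rows, field):
    """Ids of the records whose `field` is empty."""
    return [row["id"] for row in rows if not row[field].strip()]


def relationship_discovery(child_rows, child_key, parent_rows, parent_key):
    """Child key values that have no matching parent record."""
    parent_values = {row[parent_key] for row in parent_rows}
    return [row[child_key] for row in child_rows if row[child_key] not in parent_values]


format_share = structural_discovery(customers, "signup", DATE_PATTERN)
empty_emails = content_discovery(customers, "email")
orphan_orders = relationship_discovery(orders, "customer_id", customers, "id")
```

Here two of the three signup dates match the format, customer 3 has an empty email, and one order points at a customer id that does not exist.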
Data discovery is meant to provide insight into the data in the inventory and the trends within it. Before you profile your data, you need to consider 10 data profiling steps to make your data discovery endeavor successful. Our platform at DQLabs does AI-driven data profiling and accepts data from multiple sources in different formats. The data profiling steps are:
Step 1
Identify the data domains. Gather the domains of data you want to profile and verify that they are all credible. It is important to clearly understand the domains because it gives a picture of how data flows within the organization. This ensures that the focus data is not overwhelming to the data analyst and that too much time isn’t wasted looking at data that will end up not adding value to the analysis stage.
This process involves using the data’s semantics to discover its functional meaning. To achieve this, an analyst requires a domain profile containing the data’s main characteristics. For instance, if the data belongs to an enterprise, the first step would be to identify which product characteristics are present in the data. The next step is checking the specific fields or characteristics to ensure they are standard; this can be achieved by parsing the data with rules to understand whether it’s trustworthy. When the data is in a spreadsheet of rows and columns, you create the profile by analyzing the individual columns. This can be done by running the data discovery process with data rules and column name rules. Data rules select the columns whose values meet the threshold defined by the rule. Column name rules select the columns whose names match the rule’s logic.
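One possible reading of column name rules and data rules is sketched below, assuming a table stored column by column. The table, the regular expressions, and the 0.6 threshold are illustrative assumptions, not the rules any particular platform uses.

```python
import re

# Hypothetical table stored column-wise; names and values are illustrative.
table = {
    "cust_email": ["ana@example.com", "bob@example.com", "n/a"],
    "contact": ["carl@example.com", "dee@example.com", "eve@example.com"],
    "notes": ["called twice", "ana@example.com", "no answer"],
}

# Assumed rules for an "email" domain: one on column names, one on values.
EMAIL_NAME = re.compile(r"e-?mail", re.IGNORECASE)
EMAIL_VALUE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def columns_matching_name_rule(table, name_pattern):
    """Columns whose name matches the rule's pattern."""
    return [name for name in table if name_pattern.search(name)]


def columns_matching_data_rule(table, value_pattern, threshold):
    """Columns where the share of matching values reaches `threshold`."""
    selected = []
    for name, values in table.items():
        if not values:
            continue  # an empty column cannot satisfy a data rule
        share = sum(1 for v in values if value_pattern.fullmatch(v)) / len(values)
        if share >= threshold:
            selected.append(name)
    return selected


by_name = columns_matching_name_rule(table, EMAIL_NAME)
by_data = columns_matching_data_rule(table, EMAIL_VALUE, threshold=0.6)
```

The name rule finds only `cust_email`, while the data rule also finds `contact`, whose name gives no hint of its content. Applying both kinds of rule covers columns that either one alone would miss.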
Step 2
Get authorization and protect any sensitive data. Request authorization for all required domains and state exactly what data will be needed from each domain. This will ensure that sensitive data not useful in data discovery remains safe as the process continues. It is important to understand that not all available data in each domain will be used, and the organization might be reluctant to give access to some sensitive data. In some cases, the organization can access its data but may be prohibited from sharing it because of an agreement with a client. For instance, organizations working with military or intelligence services might be restricted from sharing specific information on previous and upcoming transactions.
After parsing the data with rules, the sensitive data is highlighted and prepared to be masked. Data discovery also involves taking action on sensitive data to increase the overall health of the organization’s data. Data masking involves obscuring the original sensitive data by adding other content to make it unidentifiable. This ensures that going forward, the sensitive data remains hidden, thereby enhancing the data’s privacy.
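The masking step could look roughly like the sketch below. It assumes a simple character-substitution scheme that keeps the last four characters visible; the field names and the scheme are illustrative choices, and production masking usually has stricter requirements.

```python
def mask_value(value, visible=4, fill="*"):
    """Replace all but the last `visible` characters with `fill`."""
    if len(value) <= visible:
        return fill * len(value)  # too short to leave anything visible
    return fill * (len(value) - visible) + value[-visible:]


def mask_records(rows, sensitive_fields):
    """Return copies of the rows with the sensitive fields masked."""
    return [
        {key: mask_value(val) if key in sensitive_fields else val
         for key, val in row.items()}
        for row in rows
    ]


# Hypothetical records; "card_number" stands in for a field flagged as sensitive.
payments = [
    {"customer": "Ana", "card_number": "4111111111111111"},
    {"customer": "Bob", "card_number": "123"},
]
masked = mask_records(payments, sensitive_fields={"card_number"})
```

The original rows are left unchanged, so the unmasked data can stay in a restricted store while only the masked copy moves on to profiling.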
Step 3
Uncover potential internal sources. Understand how the organization’s data is generated: where it is generated, how it is generated, and how it is shared. If the organization has online platforms, understand which data they generate and whether it mixes with data generated from its offices. This will help organize the data logically, making the profiling process faster and more effective. This step is crucial because it allows analysts to decide how to structure their profiling process.
The discovered data should be categorized based on possible usage. For instance, the data can be categorized into quantitative and qualitative data. Qualitative data will require context to be added for successful profiling. Examples of qualitative data include employee satisfaction feedback and customer complaints. Quantitative data, however, is numeric and requires no further action for successful profiling. Many analysts make the mistake of ignoring qualitative data and instead focus on quantitative data that is easy to analyze, such as revenue, number of customers, and other easy-to-understand numeric data. This can lead to incomplete reports because qualitative data provides context on major changes in the quantitative data. For instance, a major drop in quantitative data, such as sales, can be explained by a qualitative analysis of customers’ ease in using a new online platform.
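A first-pass split into quantitative and qualitative columns can be sketched as follows. The rule used here, that a column is quantitative when every non-empty value parses as a number, is a simplifying assumption, and the sample columns are hypothetical.

```python
def is_number(text):
    """True if `text` parses as a float (this also accepts 'nan' and 'inf')."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def categorize_columns(table):
    """Label each column 'quantitative' or 'qualitative'."""
    categories = {}
    for name, values in table.items():
        non_empty = [v for v in values if v.strip()]
        numeric = bool(non_empty) and all(is_number(v) for v in non_empty)
        categories[name] = "quantitative" if numeric else "qualitative"
    return categories


# Hypothetical columns; names and values are illustrative.
survey = {
    "revenue": ["1200.50", "980", ""],
    "customer_feedback": ["slow checkout", "great support"],
}
categories = categorize_columns(survey)
```

Columns labeled qualitative are the ones that need context added before profiling, so a split like this gives the analyst a list of where that extra work is required.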
Step 4
Uncover potential external sources. Identify which external data sources are likely to provide enriching data. This step of data profiling includes vetting the reliability of the external sources and analyzing their relationship to the organization. External data sources allow the analyst to understand the organization’s operations better so as not to make data profiling decisions in isolation from the industry’s standards. By using external sources, an analyst gains an edge in understanding the internal data, especially the outliers. Understanding these sources therefore makes the profiling process faster, because the analyst already knows where to look for reference data.
External data provides a good comparator for the conclusions reached from the internal data. However, there is a quality risk associated with external sources because the organization may not control some of them. For instance, industry performance data extracted from external sources requires the extra step of the analyst vetting the source. The analyst should know exactly which external data they will need. External data, such as the number of vendors and active customers, should be updated regularly to match internal data sources. While uncovering potential external sources, the analyst must also narrow their focus to what directly impacts the organization and the analysis they aim to undertake.
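Using external data as a comparator could be sketched like this. The metrics, the figures, and the 10% tolerance are invented for illustration only; metrics without a usable external figure are skipped.

```python
def compare_to_benchmark(internal, benchmark, tolerance=0.10):
    """Metrics whose relative deviation from the benchmark exceeds `tolerance`."""
    flagged = {}
    for metric, value in internal.items():
        reference = benchmark.get(metric)
        if reference is None or reference == 0:
            continue  # no usable external comparator for this metric
        deviation = (value - reference) / reference
        if abs(deviation) > tolerance:
            flagged[metric] = round(deviation, 3)
    return flagged


# Hypothetical figures; neither set comes from a real source.
internal_metrics = {"return_rate": 0.08, "avg_order_value": 52.0, "churn": 0.05}
industry_benchmark = {"return_rate": 0.05, "avg_order_value": 50.0}
flagged = compare_to_benchmark(internal_metrics, industry_benchmark)
```

A flagged metric is a prompt to look closer. It does not prove an error, because the external source itself may be stale or unreliable.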
Step 5
Prioritize candidates of source data. After uncovering all the internal and external sources and getting authorization to the data sources, the next step is setting priorities on source data. Setting priorities will make the profiling process flow seamlessly and provide more insight during the data discovery process. Failure to set priorities can lead to more time consumed by data sets that eventually end up making little to no impact on the analysis results. Like every other activity within an organization, data profiling has to be optimized to minimize the time from the start of data analysis to the publishing of the final analysis.
The analyst can map the way forward by creating a list of source data with the priorities set. The priority setting determines the time and resources allocated toward gathering the data. For instance, the high-priority data would require thorough profiling to ensure that it meets the quality and content threshold that matches its position in the priorities list. This also allows the analyst to optimize the source data discovery process in terms of cost and time. Like any other business activity, the resources spent on data discovery must match the value derived from the process to make economic sense.
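A priority list of this kind can be sketched as a simple ranking. The scoring rule (expected value divided by effort), the 1-to-5 scales, and the source names are assumptions made for the example; an organization would substitute its own criteria.

```python
from dataclasses import dataclass


@dataclass
class SourceCandidate:
    name: str
    expected_value: int  # analyst's estimate, 1 (low) to 5 (high)
    effort: int          # cost in time and resources, 1 (cheap) to 5 (expensive)


def prioritize(candidates):
    """Order candidates by value per unit of effort, highest first."""
    return sorted(candidates, key=lambda c: c.expected_value / c.effort, reverse=True)


# Hypothetical source candidates with illustrative scores.
candidates = [
    SourceCandidate("crm_export", expected_value=5, effort=2),
    SourceCandidate("web_logs", expected_value=3, effort=3),
    SourceCandidate("industry_report", expected_value=4, effort=1),
]
ranked_names = [c.name for c in prioritize(candidates)]
```

Ranking by value per unit of effort is one way to make the resources spent match the value derived. A source with high value but very high cost can rank below a cheaper, moderately useful one.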
Source: datasciencecentral.com