Data quality is key to the productivity of any business. Bad data can have a significant negative impact on decision-making, slow down progress, and cost a fortune to fix.
Building on work pioneered by Thomas C. Redman, CoreValue understands the importance of an advanced data quality program that uses data quality metrics to identify areas for improvement and that sustains resiliency and data quality over time. This matters most in large environments, where even small wins can add up to large savings. To provide an effective program for its clients, CoreValue embraces the following elements:
- Identify a problem
- Identify the impact of the problem
- Build a model for reprocessing data
- Reintegrate data
- Update all reports
To illustrate the point, one of CoreValue’s clients needed to continuously download and evaluate huge quantities of data. The data was generated by customers’ activities across a wide spectrum of services, as well as by on-premises equipment provided by the company. Because of the complexity and quantity of data processed daily, data processing often led to missed service level agreements (SLAs), which in turn undermined the client’s ability to understand and support its customers. It was critical for the client to establish effective data management and consistent reporting for all data sources.
By applying the five elements of an advanced data quality program noted above, CoreValue delivered a 360-degree view of the client’s data quality: it identified data problems and their impact, allowed all data to be reprocessed and reintegrated, and updated all reports based on the newly cleaned data.
The program also let us address individual elements in more depth. For example, the data modeling work encompassed the following:
- Database capacity prediction. Predictive analysis uses statistical techniques and historical data to make predictions about future capacity. One of the most common methods employed in predictive modeling is linear regression. Unfortunately, applying regression is challenging because the behavior of a database changes over time: system administrators may change retention policies or simply delete data, which can lead to poor predictions. We obtained significantly more accurate models by finding the optimal subset of “clean” data for each database and applying linear regression to only that subset.
- Automated anomaly detection. Because “big data” needs effective anomaly detection, we proposed enhancements that enable real-time anomaly identification. Using R and PostgreSQL, we built an alarm system to monitor jobs from Scheduler so that users could react to issues immediately. The alarm system used Storage, Model, and Shell script to perform checks and sent an alarm whenever an anomaly was detected. In essence, these alarms monitor upper and lower thresholds for each module’s start time and duration on every weekday.
- Uniform dashboards. By using Tableau for reporting, we created uniform dashboards for all systems so that correlations between different metrics became easier to understand. In the reports we monitor:
– Metrics crossing into unacceptable ranges
– Unexpected changes or trends
– Variance in metrics data
– Data sliced by host, cluster, geographical tags, etc.
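The capacity prediction idea described above can be sketched in a few lines. The project itself used R and PostgreSQL; the Python below is only an illustration, and the way it picks the “clean” subset is an assumption, not the method used for the client. It takes the longest trailing window of measurements whose straight-line fit has the best R², which discards history recorded before a retention change or bulk delete. The function names, the `min_points` parameter, and the sample numbers are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


def best_clean_window(
    days: Sequence[float], sizes: Sequence[float], min_points: int = 5
) -> Tuple[int, float, float, float]:
    """Assumed selection rule: try every trailing window of at least
    `min_points` samples and keep the one with the highest R^2.
    Ties (within a small tolerance) go to the longer window.
    Returns (start_index, slope, intercept, r_squared)."""
    if len(days) < min_points:
        raise ValueError("not enough samples to fit a window")
    best = None
    for start in range(len(days) - min_points + 1):
        slope, intercept, r2 = fit_line(days[start:], sizes[start:])
        if best is None or r2 > best[3] + 1e-9:
            best = (start, slope, intercept, r2)
    return best


def day_capacity_reached(slope: float, intercept: float, capacity: float) -> Optional[float]:
    """Day on which the fitted line crosses `capacity`; None if usage is not growing."""
    if slope <= 0:
        return None
    return (capacity - intercept) / slope


if __name__ == "__main__":
    # Hypothetical daily sizes in GB; an administrator deletes data on day 4.
    days = list(range(10))
    sizes = [100, 110, 120, 130, 50, 60, 70, 80, 90, 100]
    start, slope, intercept, r2 = best_clean_window(days, sizes)
    print("clean window starts at day", days[start])
    print("forecast day for 200 GB:", day_capacity_reached(slope, intercept, 200.0))
```

Fitting the whole series here would mix the pre-delete and post-delete regimes and flatten the growth estimate. Restricting the fit to the window after the delete recovers the actual growth rate. A production version would likely need a stronger selection criterion than R² alone, because short windows tend to fit well by chance.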
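The weekday threshold alarm described above can be illustrated in a similar way. The production system was built with R and PostgreSQL; this Python sketch only shows the shape of the check. How the upper and lower thresholds were actually derived is not stated, so the sketch assumes mean ± k standard deviations per module and weekday. The `Run` record, the function names, and the sample job are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, pstdev
from typing import Dict, List, Tuple


@dataclass
class Run:
    module: str          # hypothetical scheduler job/module name
    started: datetime    # when the job started
    duration_min: float  # how long it ran, in minutes


def start_minute(run: Run) -> int:
    """Start time expressed as minutes after midnight."""
    return run.started.hour * 60 + run.started.minute


def bounds(values: List[float], k: float) -> Tuple[float, float]:
    """Assumed rule: lower/upper threshold = mean -/+ k standard deviations."""
    centre, spread = mean(values), pstdev(values)
    return centre - k * spread, centre + k * spread


def build_thresholds(history: List[Run], k: float = 3.0) -> Dict:
    """One pair of thresholds per (module, weekday) for start time and duration."""
    grouped = defaultdict(list)
    for run in history:
        grouped[(run.module, run.started.weekday())].append(run)
    return {
        key: {
            "start": bounds([start_minute(r) for r in runs], k),
            "duration": bounds([r.duration_min for r in runs], k),
        }
        for key, runs in grouped.items()
    }


def check_run(run: Run, thresholds: Dict) -> List[str]:
    """Return alarm messages for a run; an empty list means no anomaly."""
    limits = thresholds.get((run.module, run.started.weekday()))
    if limits is None:
        return [f"{run.module}: no baseline for this weekday"]
    alarms = []
    observed = {"start": start_minute(run), "duration": run.duration_min}
    for metric, value in observed.items():
        low, high = limits[metric]
        if not low <= value <= high:
            alarms.append(
                f"{run.module}: {metric}={value} outside [{low:.1f}, {high:.1f}]"
            )
    return alarms


if __name__ == "__main__":
    # Four hypothetical Monday runs of one job form the baseline.
    history = [
        Run("load_usage", datetime(2024, 1, 1, 2, 0), 30),
        Run("load_usage", datetime(2024, 1, 8, 2, 5), 32),
        Run("load_usage", datetime(2024, 1, 15, 1, 55), 28),
        Run("load_usage", datetime(2024, 1, 22, 2, 0), 30),
    ]
    thresholds = build_thresholds(history)
    slow_run = Run("load_usage", datetime(2024, 1, 29, 2, 3), 55)
    for message in check_run(slow_run, thresholds):
        print(message)
```

Keeping separate thresholds per weekday means that a job that legitimately runs longer on, say, Mondays does not trigger false alarms. The sketch ignores jobs that start close to midnight, where minutes after midnight wrap around, and it treats a weekday with no history as an alarm instead of guessing a baseline.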
As a result of the newly implemented data quality program, our client was able to save almost half a million dollars in processing time.
Processing time, of course, depends on the complexity of the environment, the volume of data, and the overall amount of processing the data requires. Depending on those factors, a data quality program can save anywhere from a few hundred thousand dollars to millions.
In any case, data quality analysis is crucial to the success of any system of record used for official reporting.
Author: Olena Medvedyeva, Data Scientist at CoreValue