Data Science combines mathematics, statistics, and programming, applied to collected data, with the work of cleaning, preparing, and staging that data. In short, it is the scientific approach to extracting knowledge from data. Ideally, the knowledge extracted is aligned with business needs and of real value.
Done correctly, Data Science turns massive volumes of data into actionable, valuable intelligence and delivers predictive and prescriptive analytics that help organizations make better decisions.
Corevalue has applied Data Science in a number of scenarios to help our clients gain actionable insights. In one instance, our data team was asked to analyze a complicated data pipeline for a large organization. The organization struggled to meet its daily data delivery SLA because various nightly data cleansing and processing jobs ran long, failed altogether, or behaved in unexpected ways. Over time, as scale and complexity increased with the move to the cloud and to microservice architectures, its existing monitoring techniques and tools needed to be extended.
Challenges: data comes from different sources (Netezza, Oracle, RedShift, and PostgreSQL), yet reporting should be similar for all of them. With so many metrics, applications, and high-performance systems, keeping track of performance had become a difficult task.
Outcomes from the Data Science Engagement:
- Database capacity prediction. Predictive analysis based on statistical techniques and historical data was applied to forecast future capacity.
One of the most common methods employed in predictive modeling is linear regression. Unfortunately, applying regression to storage capacity time series is challenging because the behavior of the series changes: system administrators may change retention policies or simply delete data. Blind application of regression to the entire data set therefore often leads to poor predictions. Significantly more accurate models were obtained by finding the optimal subset of "clean" data for each database and applying linear regression to that subset only. Capacity prediction is performed for databases in the Netezza and RedShift clusters. The R programming language was used for modeling, and Tableau was used to present the results on a weekly basis.
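The subset idea can be sketched in a few lines. The engagement used R, and the exact rule for choosing the "clean" subset is not described here, so the following Python sketch makes its own assumptions: the clean subset is taken to be the most recent run of observations, and the start of that run is chosen by the best R² among all suffixes of a minimum length. The names `fit_line`, `fit_clean_suffix`, `forecast`, and `min_points` are hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least squares for y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A perfectly flat series is fitted exactly by a constant line.
    r2 = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


def fit_clean_suffix(days: Sequence[float], used_gb: Sequence[float],
                     min_points: int = 5) -> Tuple[int, float, float, float]:
    """Assumption (not from the article): the clean subset is a suffix of the
    series. Try every suffix with at least min_points observations and keep
    the one with the highest R^2. Returns (start, slope, intercept, r2)."""
    if len(days) < min_points:
        raise ValueError("not enough observations")
    best = None
    for start in range(len(days) - min_points + 1):
        slope, intercept, r2 = fit_line(days[start:], used_gb[start:])
        if best is None or r2 > best[3]:
            best = (start, slope, intercept, r2)
    return best


def forecast(slope: float, intercept: float, day: float) -> float:
    """Predicted usage on a future day from the fitted line."""
    return slope * day + intercept


# Made-up series: steady growth, a bulk delete on day 10, then slower growth.
days = list(range(20))
used_gb = [100 + 5 * d for d in range(10)] + [20 + 2 * d for d in range(10, 20)]
start, slope, intercept, r2 = fit_clean_suffix(days, used_gb)
print(start, slope, intercept, forecast(slope, intercept, 30))
```

Selecting by R² alone tends to favor short windows, which is why the sketch enforces a minimum window length; a production model would need a more careful selection criterion.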
- Automated anomaly detection. Big data needs effective detection of anomalies, i.e. deviations from what can be expected based on past history, so that engineers can focus on real issues. We therefore proposed enhancements that enable real-time identification of anomalies, including an Alarm System that lets users react to issues immediately. The Alarm System was built for a set of jobs from Appworx Scheduler.
The system has three main components: storage, a model, and a shell script that performs the check and sends an alarm if an anomaly is detected. PostgreSQL is used as the central repository for historical data and the model, as well as for the alarm history. The R programming language (packages: dplyr, tidyr, lubridate, foreach, jsonlite, PivotalR) is used for modeling.
The idea of the model is to build upper and lower thresholds for each module's start time and duration for every weekday. This approach works because there is no visible trend in the data. The model is designed to find start time and duration thresholds for a single job, for a sequence of jobs, or for jobs that are executed in parallel. For every weekday, a number of historical observations are selected and statistics are calculated. We search a given sequence of parameter values and pick those that reduce the number of errors, that is, maximize true positives and minimize false alerts. The shell script is scheduled to run every 5 minutes from 12 am to 9 am.
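The per-weekday threshold idea can be illustrated with a short sketch. The production model was written in R and the statistics it uses are not specified here, so this Python version assumes thresholds of the form mean ± k × standard deviation over the most recent observations for each weekday, and it assumes a small set of labeled observations is available for choosing k. The names `build_thresholds`, `is_anomaly`, `pick_k`, `last_n`, and the candidate values for k are hypothetical. Values are plain numbers, for example a duration in minutes or a start time in minutes after midnight.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_thresholds(history, k=3.0, last_n=8):
    """history: chronological list of (weekday, value) pairs, weekday 0-6.
    Returns {weekday: (lower, upper)} using mean +/- k * stdev (an assumption)."""
    by_day = defaultdict(list)
    for weekday, value in history:
        by_day[weekday].append(value)
    thresholds = {}
    for weekday, values in by_day.items():
        recent = values[-last_n:]  # only the latest observations per weekday
        if len(recent) < 2:
            continue  # stdev needs at least two points
        centre, spread = mean(recent), stdev(recent)
        thresholds[weekday] = (centre - k * spread, centre + k * spread)
    return thresholds


def is_anomaly(thresholds, weekday, value):
    """True if value falls outside the thresholds for that weekday."""
    bounds = thresholds.get(weekday)
    if bounds is None:
        return False  # no model for this weekday, so raise no alarm
    lower, upper = bounds
    return value < lower or value > upper


def pick_k(history, labeled, candidates=(1.5, 2.0, 2.5, 3.0)):
    """labeled: list of (weekday, value, is_real_issue).
    Returns the first candidate k with the fewest misses plus false alerts."""
    def errors(k):
        thresholds = build_thresholds(history, k=k)
        return sum(is_anomaly(thresholds, d, v) != truth for d, v, truth in labeled)
    return min(candidates, key=errors)


# Made-up Monday (weekday 0) job durations in minutes.
history = [(0, v) for v in [30, 32, 31, 29, 30, 28, 31, 29]]
thresholds = build_thresholds(history, k=3.0)
print(thresholds[0], is_anomaly(thresholds, 0, 45))
```

In a setup like the one described, a scheduled script would call a check such as `is_anomaly` for each running job and send an alarm when it returns true.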
- Reports. We created meaningful, uniform dashboards for all systems that make it possible to understand correlations between different metrics. In the reports we monitor:
- Metrics crossing into unacceptable ranges
- Unexpected changes or trends
- Variance in metrics data
- Data sliced by host, cluster, geographical tags, etc.
Reports are built in Tableau and sent to mailing distribution lists on a daily basis.
Thus we proposed a full cycle of system performance data analysis: data collection, data preprocessing, data modeling and threshold identification, anomaly notification and alarms, and data visualization and reporting.