

Automating Large-Scale Data Quality Verification

Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann
Amazon Research
{sseb,langed,phschmid,celikelm,biessmann}@amazon.com

Andreas Grafberger∗
University of Augsburg
[email protected]
∗ work done while at Amazon Research

ABSTRACT
Modern companies and institutions rely on data to guide every single business process and decision. Missing or incorrect information seriously compromises any decision process downstream. Therefore, a crucial, but tedious task for everyone involved in data processing is to verify the quality of their data. We present a system for automating the verification of data quality at scale, which meets the requirements of production use cases. Our system provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables 'unit tests' for data. We efficiently execute the resulting constraint validation workload by translating it to aggregation queries on Apache Spark. Our platform supports the incremental validation of data quality on growing datasets, and leverages machine learning, e.g., for enhancing constraint suggestions, for estimating the 'predictability' of a column, and for detecting anomalies in historic data quality time series. We discuss our design decisions, describe the resulting system architecture, and present an experimental evaluation on various datasets.

PVLDB Reference Format:
Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann and Andreas Grafberger. Automating Large-Scale Data Quality Verification. PVLDB, 11(12): 1781-1794, 2018.
DOI: https://doi.org/10.14778/3229863.3229867

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Proceedings of the VLDB Endowment, Vol. 11, No. 12. Copyright 2018 VLDB Endowment 2150-8097/18/8.
DOI: https://doi.org/10.14778/3229863.3229867

1. INTRODUCTION
Data is at the center of modern enterprises and institutions. Online retailers, for example, rely on data to support customers making buying decisions, to forecast demand [7], to schedule deliveries, and more generally, to guide every single business process and decision. Missing or incorrect information seriously compromises any decision process downstream, ultimately damaging the overall effectiveness and efficiency of the organization. The quality of data has effects across teams and organizational boundaries, especially in large organizations with complex systems that result in complex data dependencies. Furthermore, there is a trend across different industries towards more automation of business processes with machine learning (ML) techniques. These techniques are often highly sensitive to input data, as the deployed models rely on strong assumptions about the shape of their inputs [42], and subtle errors introduced by changes in data can be very hard to detect [34]. At the same time, there is ample evidence that the volume of data available for training is often a decisive factor for a model's performance [17, 44]. In modern information infrastructures, data lives in many different places (e.g., in relational databases, in 'data lakes' on distributed file systems, behind REST APIs, or is constantly being scraped from web resources), and comes in many different formats. Many such data sources do not support integrity constraints and data quality checks, and often there is not even an accompanying schema available, as the data is consumed in a 'schema-on-read' manner, where a particular application takes care of the interpretation. Additionally, there is a growing demand for applications consuming semi-structured data such as text, videos and images.

Due to these circumstances, every team and system involved in data processing has to take care of data validation in some way, which often results in tedious and repetitive work. As a concrete example, imagine an on-demand video platform, where the machines that stream videos to users write log files about the platform usage. These log files must regularly be ingested into a central data store to make the data available for further analysis, e.g., as training data for recommender systems. Analogous to all data produced in real-world scenarios, these log files might have data quality issues, e.g., due to bugs in program code or changes in the semantics of attributes. Such issues potentially result in failures of the ingestion process. Even if the ingestion process still works, the errors in the data might cause unexpected errors in downstream systems that consume the data. As a consequence, the team managing the daily ingestion process must validate the data quality of the log files, e.g., by checking for missing values, duplicate records, or anomalies in size.

We therefore postulate that there is a pressing need for increased automation of data validation. We present a system that we built for this task and that meets the demands of production use cases. The system is built on the following principles. At the heart of our system is declarativity; we want users to spend time on thinking about how their data should look, and not have to worry too much about how to implement the actual quality checks.

Therefore, our system offers a declarative API that allows users to define checks on their data by composing a huge variety of available constraints. Additionally, data validation tools should provide high flexibility to their users. The users should be able to define custom code for validation (e.g., call a REST service for some data and write a complex function that compares the result to some statistic computed on the data). Our vision is that users should be able to write 'unit-tests' for data (Section 3.1), analogous to established testing practices in software engineering. Furthermore, data validation systems have to acknowledge the fact that data is being continuously produced, therefore they should allow for the integration of growing datasets. Our proposed system explicitly supports the incremental computation of quality metrics on growing datasets (Section 3.2), and allows its users to run anomaly detection algorithms on the resulting historical time series of quality metrics (Section 3.4). The last principle to address is scalability: data validation systems should seamlessly scale to large datasets. To address this requirement, we designed our system to translate the required data metrics computations to aggregation queries, which can be efficiently executed at scale with a distributed dataflow engine such as Apache Spark [50].

The contributions of this paper are the following:
• We present a declarative API combining common quality constraints with user-defined validation code, which enables 'unit tests' for data (Section 3.1).
• We discuss how to efficiently execute the constraint validation by translating checks to aggregation queries with dedicated support for incremental computation on growing datasets (Sections 3.2 and 4).
• We give examples of how machine learning can be leveraged in data quality verification, e.g., for enhancing constraint suggestions, for estimating the predictability of a column, and for detecting anomalies in historic data quality time series (Sections 3.3 and 3.4).
• We present an experimental evaluation of our proposed approaches (Section 5).

2. DATA QUALITY DIMENSIONS
We first take a look at data quality literature to understand common dimensions of quality. The quality of data can refer to the extension of the data (i.e., data values), or to the intension of the data (i.e., the schema) [4]. We focus on extensional data quality in the following. We treat schema problems with the concept of completeness; for example, if there is no attribute value specified in the open schema of an entity, we consider it missing. There are multiple studies on measuring the extensional data quality [4, 30, 39]. In the following, we briefly describe the most commonly referred to dimensions.

Completeness refers to the degree to which an entity includes data required to describe a real-world object. In tables in relational database systems, completeness can be measured by the presence of null values, which is usually interpreted as a missing value. In other contexts, for instance in a product catalog, it is important to calculate completeness given the correct context, i.e., the schema of the product category. For example, the absence of a value for the attribute shoe size is not relevant for products in the category notebooks. In this case, the attribute value is missing because the attribute is not applicable [37]. We are only interested in measuring completeness when an attribute is applicable.

Consistency is defined as the degree to which a set of semantic rules are violated. Intra-relation constraints define a range of admissible values, such as a specific data type, an interval for a numerical column, or a set of values for a categorical column. They can also involve rules over multiple columns, e.g., 'if category is t-shirts, then the range of size is {S, M, L}'. Inter-relation constraints may involve columns from multiple tables. For example, a column customerId may only include values from a given reference table of all customers.

Accuracy is the correctness of the data and can be measured in two dimensions: syntactic and semantic. Syntactic accuracy compares the representation of a value with a corresponding definition domain, whereas semantic accuracy compares a value with its real-world representation. For example, for a product attribute color, a value blue can be considered syntactically accurate in the given domain even if the correct value would be red, whereas a value XL would be considered neither semantically nor syntactically accurate.

3. APPROACH
We introduce our general machinery for automated large-scale data quality verification. We first present our declarative API, which allows users to specify constraints on their datasets, and detail how we translate these constraints into computations of metrics on the data, which allow us to subsequently evaluate the constraints (Section 3.1). Next, we discuss how to extend this approach to scenarios where we need to evaluate such constraints for incrementally growing datasets (e.g., in the case of ingestion pipelines in a data warehouse) in Section 3.2. Lastly, we describe extensions of our approach such as the suggestion of constraints (Section 3.3) and anomaly detection on historical data quality time series of a dataset (Section 3.4).

3.1 'Unit Tests' for Data
The general idea behind our system is to enable users to easily define 'unit tests' for their datasets in a declarative manner [10]. These 'unit-tests' consist of constraints on the data which can be combined with user-defined functions, e.g., custom code. Table 1 shows the constraints available to our users. We want them to focus on the definition of their checks and their validation code, but not on the computation of the metrics required for the constraints. Therefore, we design our system to translate the user-defined constraints into an efficiently executable metrics computation.

Declarative definition of data quality constraints. In our system, users define checks for their datasets, which either result in errors or warnings during execution, if the validation fails. This approach provides a high flexibility for users: they can write complex functions and leverage existing libraries for their validation code; they can use external data or even can call external services. In order to showcase our system, we introduce an exemplary use case on an on-demand video platform. Assume that the machines that stream videos to users write log files about the platform usage, with details such as the type of device used, the length of the session, the customer id or the location.
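To give a first impression of the kind of computation such declarative constraints are translated into, the following is a minimal sketch (not the system's actual internals) of how a Size, Completeness and Compliance metric could be computed together in a single Spark aggregation pass over such a log dataset. The DataFrame `logs`, the object name and the column names are only assumed for illustration.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object MetricsAsAggregations {

  // Computes Size, Completeness("title") and Compliance("count >= 0")
  // in one aggregation query, so that all three metrics share a single
  // scan over the data.
  def computeMetrics(logs: DataFrame): (Long, Double, Double) = {
    val row = logs.agg(
      count(lit(1)).as("size"),                                         // Size = N
      (count(col("title")) / count(lit(1))).as("completenessTitle"),    // fraction of non-null values
      (sum(when(col("count") >= 0, 1).otherwise(0)) / count(lit(1)))
        .as("complianceCountNonNegative")                               // fraction matching the predicate
    ).head()

    (row.getLong(0), row.getDouble(1), row.getDouble(2))
  }
}

Combining the aggregations into one generated query is also what later enables the scan sharing described in Section 4.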

3 Table 1: Constraints available for composing user-defined data quality checks. constraint arguments semantic dimension completeness column isComplete check that there are no missing values in a column custom validation of the fraction of missing values in a column hasCompleteness column, udf dimension consistency isUnique column check that there are no duplicates in a column column, udf hasUniqueness custom validation of the unique value ratio in a column hasDistinctness column, udf custom validation of the unique row ratio in a column validation of the fraction of values that are in a valid range isInRange column, value range validation of the largest fraction of values that have the same type hasConsistentType column validation whether all values in a numeric column are non-negative isNonNegative column column pair validation whether values in the 1s column are always less than in the 2nd column isLessThan validation whether all rows match predicate satisfies predicate validation whether all rows matching 1st predicate also match 2nd predicate predicate pair satisfiesIf hasPredictability column, column(s), udf user-defined validation of the predictability of a column consistency ) statistics (can be used to verify dimension udf hasSize custom validation of the number of records custom validation of the maximum fraction of values of the same data type hasTypeConsistency column, udf column custom validation of the number of distinct non-null values in a column hasCountDistinct hasApproxCountDistinct column, udf custom validation of the approx. number of distinct non-null values hasMin column, udf custom validation of a column’s minimum value column, udf custom validation of a column’s maximum value hasMax hasMean column, udf custom validation of a column’s mean value column, udf custom validation of a column’s standard deviation hasStandardDeviation column, quantile, udf custom validation of a particular quantile of a column (approx.) hasApproxQuantile hasEntropy column, udf custom validation of a column’s entropy hasMutualInformation column pair, udf custom validation of a column pair’s mutual information column, udf custom validation of column histogram hasHistogramValues column pair, udf custom validation of a column pair’s correlation hasCorrelation time hasNoAnomalies metric, detector validation of anomalies in time series of metric values These log files must regularly be ingested into a central val numTitles = callRestService (...) 1 val maxExpectedPhoneRatio = computeRatio (...) 2 data store to make the data available for further analysis, 3 e.g., as training data for recommender systems. Analogous var checks = Array() 4 to all data produced in real-world scenarios, these log files 5 might have data quality issues, e.g., due to bugs in pro- 6 checks += Check(Level.Error) gram code, data loss, redeployments of services, or changes 7 .isComplete("customerId", "title", "impressionStart", "impressionEnd", 8 in semantics of data columns. Such issues might potentially 9 "deviceType", "priority") result in several negative consequences, e.g., the ingestion .isUnique("customerId", "countryResidence", 10 process might fail and need to be manually restarted af- "deviceType", "title") 11 ter communication with the data provider. 
Even if the 12 .hasCountDistinct("title", _ <= numTitles) ingestion process still works, the errors in the data might .hasHistogramValues("deviceType", 13 cause unexpected errors in downstream systems that con- 14 _.ratio("phone") <= maxExpectedPhoneRatio) 15 sume the data. In many cases these errors might be hard 16 checks += Check(Level.Error) to detect, e.g., they might cause regressions in the predic- .isNonNegative("count") 17 tion quality of a machine learning model, which makes as- .isLessThan("impressionStart", "impressionEnd") 18 sumptions about the shape of particular features computed 19 .isInRange("priority", ("hi", "lo")) from the input data [34]. Therefore, the video streaming 20 service could use our system to validate the data quality checks += Check(Level.Warning , on="delta") 21 22 .hasNoAnomalies(Size , OnlineNormal(stdDevs =3)) before starting the data ingestion process, by declaring a checks += Check(Level.Error , on="delta") 23 custom set of checks that should hold on the data. List- 24 .hasNoAnomalies(Size , OnlineNormal(stdDevs =4)) ing 1 depicts a toy example of how such a declarative qual- 25 ity check for video stream logs could look like and high- 26 checks += Check(Level.Warning) lights the combination of declarative constraint definitions .hasPredictability("countryResidence", 27 with custom code. External data is fetched in the begin- 28 ("zipCode", "cityResidence"), precision =0.99) 29 ning and used throughout the quality check: an external 30 Verification.run(data , checks) REST service is called to determine the overall number of movies in the system and the expected ratio of smartphone Listing 1: Example for declarative data quality con- watchers is computed (see lines 1 & 2). Then, a set of straint definitions using our API. completeness and consistency checks is defined, e.g., we re- quire the columns customerId , title , impressionStart , 1783

impressionEnd, deviceType and priority to be complete (lines 7 to 9), and we dictate that the column combination customerId, countryResidence, deviceType, and title is unique in the data at hand (lines 10 and 11). We make sure that the number of distinct values in the title column is less than or equal to the overall number of movies in the system (line 12) and we check that the ratio of 'phone' devices meets our expectations by investigating a histogram of the deviceType column in lines 13 and 14. Subsequently, we issue another set of consistency checks that define the expected shape of the data (e.g., no negative values in the count column, a happens-before relationship between the viewing timestamps, and a set of valid values for the priority column, lines 16 to 19). Next, we have two checks that rely on comparisons to previously computed metrics on former versions of the dataset (available from a central 'metrics database'): we advise the system to detect anomalies in the time series of sizes of records that have been added to the dataset over time and issue a warning if the size is more than three standard deviations away from the previous mean and throw an error if it is more than four standard deviations away (see lines 21 to 24). Finally, we define a predictability check for the countryResidence column which dictates that our system should be able to predict values in this column with a precision of 99% by inspecting the corresponding values in the zipCode and cityResidence columns.

Table 2: Computable metrics to base constraints on.

completeness dimension
  Completeness          fraction of non-missing values in a column

consistency dimension
  Size                  number of records
  Compliance            ratio of columns matching predicate
  Uniqueness            unique value ratio in a column
  Distinctness          unique row ratio in a column
  ValueRange            value range verification for a column
  DataType              data type inference for a column
  Predictability        predictability of values in a column

statistics (can be used to verify dimension consistency)
  Minimum               minimal value in a column
  Maximum               maximal value in a column
  Mean                  mean value in a column
  StandardDeviation     standard deviation of the value distribution in a column
  CountDistinct         number of distinct values in a column
  ApproxCountDistinct   number of distinct values in a column, estimated by a hyperloglog sketch [21]
  ApproxQuantile        approximate quantile of the value distribution in a column [15]
  Correlation           correlation between two columns
  Entropy               entropy of the value distribution in a column
  Histogram             histogram of an optionally binned column
  MutualInformation     mutual information between two columns

Translating constraints to metrics computations. In the following, we detail how our system executes the actual data quality verification. The declarative definition of constraints (which are evaluated by the user code) relies on a particular set of data quality metrics that our system computes from the data at hand. The system inspects the checks and their constraints, and collects the metrics required to evaluate the checks. Table 2 lists all data quality metrics supported by our system. We directly address the completeness and consistency data quality dimensions listed in Section 2. Let D denote the dataset with N records, on which we operate, and let c_v denote the cardinality of value v in a particular column of dataset D. Furthermore, let V denote the set of unique values in a particular column of the dataset D. We calculate Completeness as the fraction of non-missing values in a column: |{d ∈ D | d(col) ≠ null}| / N. For measuring consistency, we provide metrics on the number of unique values, the data types, the data set size, the value ranges, and a general predicate matching metric. The metric Size for example refers to the number of records N, while the Compliance metric denotes the ratio of records which satisfy a particular predicate: |{d ∈ D | p(d)}| / N. The metric Uniqueness refers to the unique value ratio [19] in a particular column: |{v ∈ V | c_v = 1}| / |V|, while Distinctness corresponds to the unique row ratio |V| / N in the column. In addition, we implement standard summary statistics for numerical columns that can be used for defining additional semantic rules on datasets, such as Minimum, Maximum, Mean, StandardDeviation, Histogram, and Entropy, which we for example compute as −∑_v (c_v / N) log(c_v / N). We also include standard statistics such as Correlation and MutualInformation for measuring the amount of association between two columns, where the latter is computed as ∑_{v1} ∑_{v2} (c_{v1v2} / N) log( (c_{v1v2} N) / (c_{v1} c_{v2}) ). As some metrics are rather expensive to compute and might involve re-partitioning or sorting the data, our system provides approximations of metrics such as quantiles in the form of ApproxQuantile (computed via an efficient online algorithm [15]) or ApproxCountDistinct for estimating the number of distinct values with a hyperloglog sketch [21].

Lastly, we offer an implementation of Predictability. In an attempt to automate the verification of the correctness of values, we train a machine learning model that predicts a value for a target column t of a particular record from all observed values l_1, ..., l_k ∈ V_t in the target column, given the corresponding values l_{i_1}, ..., l_{i_n} of input columns i_1, ..., i_n for the particular record, e.g., using the maximum a posteriori decision rule: argmax_k p(l_k | l_{i_1}, ..., l_{i_n}). An example would be to predict the value of a 'color' column in a product table from text in the 'description' and 'name' columns. We train this model on a sample of observed values in the target column, and measure its prediction quality on the held-out rest of the data. We return the quality score, calculated using standard measures such as precision, recall or F1-score of the predictions, as value of the metric.
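To make the entropy and mutual information formulas above concrete, the following small, self-contained Scala sketch computes both from in-memory value counts. In the system itself these counts are obtained via aggregation queries (Section 4); the object and method names here are purely illustrative.

object AssociationMetricsSketch {

  // Entropy of a column, computed from the cardinalities c_v of its values,
  // mirroring -sum_v (c_v / N) log(c_v / N) from the text.
  def entropy(counts: Map[String, Long]): Double = {
    val n = counts.values.sum.toDouble
    -counts.values.map { c => (c / n) * math.log(c / n) }.sum
  }

  // Mutual information of a column pair, computed from the pairwise
  // cooccurrence counts c_{v1 v2} and the marginal counts c_{v1}, c_{v2}.
  def mutualInformation(pairCounts: Map[(String, String), Long]): Double = {
    val n = pairCounts.values.sum.toDouble
    val left = pairCounts.groupBy(_._1._1).mapValues(_.values.sum)
    val right = pairCounts.groupBy(_._1._2).mapValues(_.values.sum)
    pairCounts.map { case ((v1, v2), c) =>
      (c / n) * math.log((c * n) / (left(v1).toDouble * right(v2)))
    }.sum
  }
}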
After having inspected the checks and collected the metrics, the system triggers the efficient computation of the metrics (see Section 4 for details on how we physically execute these computations), invokes the user-defined validation code from the constraints, and evaluates the results.

Output. After execution of the data quality verification, our system reports which constraints succeeded and which failed, including information on the predicate applied to the metric into which the constraint was translated, and the value that made a constraint fail.

...
Success("isComplete(title)",
  Completeness("title") == 1.0),
Success("isNonNegative(count)",
  Compliance("count >= 0") == 1.0),
Failure("isUnique(customerId, countryResidence, deviceType, title)",
  Uniqueness("customerId", "countryResidence", "deviceType", "title") == 1.0, 0.9967),
Failure("isInRange(priority, ('hi', 'lo'))",
  Compliance("priority IN ('hi', 'lo')") == 1.0, 0.833),
...

Listing 2: Exemplary output of data quality verification showing metrics, applied predicates and results.

Listing 2 shows an excerpt of a potential output for our example. We see that our isComplete(title) constraint has been translated to a predicate Completeness(title) == 1.0 which held on the data. Analogously, our constraint isNonNegative(count) leads to the predicate Compliance("count >= 0") == 1.0 and also matched all records. On the contrary, our unique constraint has failed, as only 99.67% of records have been identified as unique, and the predicate which the system generated from the isInRange constraint only matched 83.3% of records.

3.2 Incremental Computation of Metrics for Growing Datasets
In real-world deployments, data is seldom static; instead we typically encounter systems that continuously produce data (e.g., by interacting with users). Therefore it is of utmost importance for data validation systems like ours to support scenarios where we continuously ingest new batches of records for a particular dataset. In such cases, we need access to updated metrics for the whole dataset as well as for the new records, and we must be able to update such metrics incrementally without having to access the previous data (see Figure 1 for details). In the following, we present our incremental metrics computation machinery built for this task.

[Figure 1: diagram contrasting a batch computation that consumes the whole snapshots D^(1), D^(2), D^(3) with an incremental computation that consumes only the deltas ∆D^(1), ∆D^(2), ∆D^(3) and states S^(1), S^(2), S^(3) to produce the metric at t=1, t=2, t=3.]
Figure 1: Instead of repeatedly running the batch computation on growing input data D, we support running an incremental computation that only needs to consume the latest dataset delta ∆D^(t) and a state S of the computation.

Computational model. Let D^(t) denote the snapshot of dataset D at time t, and let ∆D^(t) denote the delta data at time t (the additional records) required to form D^(t+1). Note that we restrict ourselves to append-only cases where a dataset at time t is simply the union of all previous deltas: D^(t) = ⋃_{k=1}^{t} ∆D^(k). Instead of computing metrics for the growing dataset from all snapshots, incremental computation introduces a state S, a function f for updating the state from a delta and the previous state, and a function g for computing the actual metric from the state S^(t) such that m^(t) = g(S^(t)). Furthermore, we need an initial 'empty' state S^(0). The benefit of incremental computation is that it allows us to compute the series of metrics for the dataset snapshots via a recursive computation that only consumes the deltas:

  S^(t) = f(∆D^(t), S^(t−1))

Reformulation of quality metrics. In the following, we present a set of reformulations of the existing metrics to enable incremental computation for them. For each metric, we show how to 'split' the computation of the metrics for the new dataset D^(t+1) into the computation of sufficient statistics over the previous dataset D and the dataset delta ∆D (we drop the indexes here to improve readability). Once such a reformulation is given, we can conduct the computation for D^(t+1) by loading the persisted sufficient statistics for D and updating these from values computed only on the newly added records ∆D.

Notation: Let N and ∆N denote the number of records in the datasets D and ∆D. Let V and ∆V denote all unique values in a particular column of the dataset D or ∆D. The set of unique values in the new dataset D^(t+1) is simply the union V ∪ ∆V of the sets of unique values from the previous dataset and the delta dataset. Furthermore, let c_v and ∆c_v denote the cardinality of value v in a particular column of dataset D or ∆D.

The number of records Size is the most straightforward metric to rewrite, as the size of the new dataset D^(t+1) is simply the sum N + ∆N of the size N of the previous dataset D plus the size ∆N of the delta dataset ∆D. For an incremental version of Compliance, we need to maintain two intermediate results, namely the absolute number of records |{d ∈ D | p(d)}| that previously matched the predicate as well as the size N of the previous dataset D. Then we can compute the compliance for the new dataset D^(t+1) from these retained values and the number of records |{d ∈ ∆D | p(d)}| that matched the predicate in the delta ∆D as well as the size ∆N of the delta:

  ( |{d ∈ D | p(d)}| + |{d ∈ ∆D | p(d)}| ) / (N + ∆N)

We can reformulate Completeness as compliance with an 'is not null' predicate. The incremental computation of Uniqueness requires us to know the cardinalities c_v of the values as well as the set of distinct values V in the previous dataset D. We need to inspect the sum of the cardinalities for each value c_v + ∆c_v in the previous dataset and the delta:

  |{v ∈ V ∪ ∆V | c_v + ∆c_v = 1}| / |V ∪ ∆V|

We also compute incremental Distinctness along these lines by comparing the number of distinct values in the data |V ∪ ∆V| to the size of the data N + ∆N:

  |V ∪ ∆V| / (N + ∆N)
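As an illustration of the state/update/metric split (S, f, g) introduced above, the following sketch implements the incremental Compliance reformulation in plain Scala. In practice the per-delta counts would come from Spark aggregations and the state would be persisted by a state provider (Section 4.1); the type and method names here are assumptions for this example, not the system's API.

// Sufficient statistics for an incremental Compliance metric: the number of
// records that matched the predicate so far, and the total number of records.
case class ComplianceState(matched: Long, total: Long)

object IncrementalComplianceSketch {

  val empty: ComplianceState = ComplianceState(0L, 0L)

  // Folds a delta into the previous state; corresponds to
  // (|{d in D | p(d)}| + |{d in dD | p(d)}|) and (N + dN) from the text.
  def update(previous: ComplianceState, deltaMatched: Long, deltaTotal: Long): ComplianceState =
    ComplianceState(previous.matched + deltaMatched, previous.total + deltaTotal)

  // Computes the metric from the state; Completeness follows the same scheme
  // with an 'is not null' predicate, and Size is just the running total.
  def metric(state: ComplianceState): Double =
    if (state.total == 0) Double.NaN else state.matched.toDouble / state.total
}

// Example with two successive deltas:
// val s1 = IncrementalComplianceSketch.update(IncrementalComplianceSketch.empty, 98, 100)
// val s2 = IncrementalComplianceSketch.update(s1, 195, 200)
// IncrementalComplianceSketch.metric(s2)  // => 293 / 300 = 0.9766...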

Incremental computation of Entropy requires us to estimate the probability p(v) of a particular value v occurring in the column from the value's cardinality c_v in the previous data, its cardinality ∆c_v in the delta, and the sizes N and ∆N of the previous dataset and the delta:

  −∑_v ( (c_v + ∆c_v) / (N + ∆N) ) log( (c_v + ∆c_v) / (N + ∆N) )

The incremental computation of MutualInformation requires us to maintain histograms about the cardinalities c_{v1} of the first column, c_{v2} of the second column, as well as cooccurrence counts c_{v1v2} for all pairwise occurrences, and merge these with the corresponding counts ∆c_{v1v2} for the delta dataset:

  ∑_{v1} ∑_{v2} ( (c_{v1v2} + ∆c_{v1v2}) / (N + ∆N) ) log( ( (c_{v1v2} + ∆c_{v1v2}) (N + ∆N) ) / ( (c_{v1} + ∆c_{v1}) (c_{v2} + ∆c_{v2}) ) )

In order to compute our Predictability metric, we evaluate the prediction quality of a multinomial naive bayes model (trained on features extracted from the user-specified input columns) for the target column. The parameters are typically estimated using a smoothed version of the maximum likelihood estimate: argmax_k ∑_i f_i log( (N_{ki} + α_i) / (N_k + α) ). Here, N_{ki} denotes the number of times feature i occurs in class k, N_k stands for the overall number of feature occurrences in class k, a uniform prior is used for the sake of simplicity, α_i is the smoothing term per feature and α the sum of the smoothing terms. For updating a classification model from a previous dataset D^(t) to D^(t+1), we need to know N_{ki} and N_k, which we can easily compute by adding the counts ∆N_{ki} and ∆N_k from the delta ∆D to the counts N_{ki} and N_k for the previous version of the dataset D^(t):

  argmax_k ∑_i f_i log( (N_{ki} + ∆N_{ki} + α_i) / (N_k + ∆N_k + α) )

The data structures which we use for the ApproxQuantile and ApproxCountDistinct metrics naturally support incremental computations and therefore do not require special care from our side.

3.3 Constraint Suggestion
The benefits of our system to users heavily depend on the richness and specificity of the checks and constraints, which the users define and for which our system will regularly compute data quality metrics. As a consequence, it is very important for a system like ours to make the adoption process as simple as possible. Therefore we provide machinery to automatically suggest constraints and identify data types for datasets (even if no schema is available). Such suggestion functionality can then be integrated into ingestion pipelines and can also be used during exploratory data analysis. The starting point for our constraint suggestion machinery is a dataset where individual columns are known and have names, but no further schema information such as data types or constraints is available. A classical example for such a dataset would be a CSV file living in a distributed filesystem. Our system assists the user in identifying data types of columns and suggests potential constraints to users, which they can use as a foundation to design declarative checks for the dataset at hand.

Heuristics on summary statistics. Our constraint suggestion functionality is built on a heuristics-based approach employing single-column profiling [1]. While more complex data profiling would certainly be helpful, we are required to be able to consume terabyte-sized tables with several billions of rows, and therefore have to restrict ourselves to simple statistics. As already explained, the user provides a single table dataset with no type information and no schema information except column names as input. Furthermore, the user can optionally specify a set of columns to inspect (and a sample size to use during the suggestion phase) to speed up the process. Our system then executes single column profiling in three passes over the data. In the first pass, we compute the data size, run data type detection on each column, and compute the completeness as well as the approximate number of distinct values via hyperloglog sketches [21, 18] for each column of interest. The profiling tasks in the second pass operate on the columns which we identified to have numeric types. For every such column, we compute summary statistics such as the minimum, maximum, mean, standard deviation, and approximate quartiles [15]. In a third pass, we compute the frequency distribution of values for columns with a cardinality below a user-specified threshold (in order to bound the required memory). Afterwards, our system recommends constraints for the dataset at hand, based on heuristics leveraging the profiling results. In the following, we list a selection of the heuristic rules which we apply:

• If a column is complete in the sample at hand, we suggest an isComplete (not null) constraint.
• If a column is incomplete in the sample at hand, we suggest a hasCompleteness constraint. We model the fact whether a value is present or not as a Bernoulli-distributed random variable, estimate a confidence interval for the corresponding probability, and return the start value of the interval as lower bound for the completeness in the data (a sketch of such an estimate follows after this list).
• If the detected type of the column is different from 'string', we suggest a hasConsistentType constraint for the detected type.
• For key discovery, we investigate an approximation of the 'unique row ratio' [12]: if the ratio of dataset size to the approximated number of distinct values in that column is within the approximation error bound of the used hyperloglog sketch, we suggest an isUnique constraint.
• If a column is numeric and its observed values fall into a certain range, we suggest a Compliance constraint with a predicate that matches only values within that range (e.g., a range of only positive values if the observed minimum is 0).
• If the number of distinct values in a column is below a particular threshold, we interpret the column as categorical and suggest an isInRange constraint that checks whether future values are contained in the set of already observed values.

Note that we see constraint suggestion as a 'human-in-the-loop' process with a low computational budget and therefore rely on the end user to select from and validate our suggestions, which might not necessarily hold for future data (or even the sample at hand in the case of unique constraint suggestions).

Learning semantics of column and table names. We notice that the actual names of columns often contain inherent semantics that allow humans to intuitively infer column types and constraints. Examples for such names are 'id', which is commonly used for an artificial primary key column of type string or int, 'is deleted', which probably refers to a boolean column, or 'price per unit', which indicates a numeric column. We therefore train a machine learning model to predict constraints solely based on the name of the table and column, as well as its type. The training data for this model is extracted from the schemas of tables from open source projects. Our system integrates this model by leveraging its predictions to enhance (and potentially correct) the suggestions made by our heuristics. In the case where our heuristic rules suggest an isUnique constraint, we consult the classifier's probabilistic prediction to decide whether to follow the suggestion or not.

3.4 Anomaly Detection
Anomaly detection in our system operates on historic time series of data quality metrics (e.g., the ratio of missing values for different versions of a dataset). We pose no restriction on the anomaly detection algorithm to apply, and our system ships with a handful of standard algorithms. Examples are an algorithm that simply checks for user-defined thresholds. The algorithm from our example called OnlineNormal computes a running mean and variance estimate and compares the series values to a user-defined bound on the number of standard deviations they are allowed to be different from the mean. An additional method allows users to specify the degree of differencing applied prior to running the anomaly detection. This gives users the possibility to apply a simple technique in order to stationarize the to-be analysed time series [22]. Data quality metrics such as the number of missing values in a continuously produced dataset might be subject to seasonality or trends (e.g., the loss only occurs at certain times when the system is under heavy load). In these cases, asserting the correct behaviour may not be feasible with user-supplied thresholds. To this end, we allow users to plug in their own algorithms for anomaly detection and time series prediction.
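For concreteness, a minimal sketch of an OnlineNormal-style detector as just described: it keeps a running mean and variance estimate (here via Welford's algorithm, an assumption on our part) and flags metric values that deviate from the mean by more than the configured number of standard deviations. This illustrates the idea rather than the system's actual implementation.

// Illustrative sketch, not the system's OnlineNormal class.
class OnlineNormalSketch(stdDevs: Double) {

  private var count = 0L
  private var mean = 0.0
  private var m2 = 0.0

  // Returns true if the new metric value is anomalous with respect to the
  // values seen so far, then folds the value into the running estimates.
  def isAnomalyAndUpdate(value: Double): Boolean = {
    val anomalous = count >= 2 && {
      val stdDev = math.sqrt(m2 / (count - 1))
      math.abs(value - mean) > stdDevs * stdDev
    }
    // Welford update of the running mean and variance
    count += 1
    val delta = value - mean
    mean += delta / count
    m2 += delta * (value - mean)
    anomalous
  }
}

// Example on a series of dataset sizes:
// val detector = new OnlineNormalSketch(stdDevs = 3.0)
// Seq(100.0, 102.0, 98.0, 101.0, 250.0).map(detector.isAnomalyAndUpdate)
// // => List(false, false, false, false, true)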
[Figure 2: architecture diagram with a declarative API layer (Checks + Constraints, constraint definitions, Metrics, Analyzers), a runtime layer with batch and incremental computation on Spark plus anomaly detectors, and a storage layer with the metrics history in DynamoDB and intermediate states in S3.]
Figure 2: System architecture: users declaratively define checks and constraints to combine with their verification code. The system identifies the required metrics and efficiently computes them in the runtime layer. The history of metrics and intermediate states of incremental computations are maintained in AWS storage services.

4. IMPLEMENTATION
We implement our data validation library on top of the distributed dataflow engine Apache Spark [50], using AWS infrastructure for storage. Note that our library does not depend on any functionality exclusive to Spark, and would be easily extendable to leverage different runtimes, as long as they support SQL queries, user-defined aggregation functions and simple machine learning models¹. We decided for Spark due to the fact that a Scala/JVM environment makes it very easy for users to write custom verification code and interact with external libraries and systems. Figure 2 gives an overview of the applied architecture. Our system operates on DataFrames, a relational abstraction for a partitioned (and often denormalized) table. The user-facing API consists of so-called Checks and Constraints, which allow our users to declaratively define on which statistics of the data their verification code should be run. When executing the checks, our library inspects the contained constraints and identifies the required Metrics that must be computed on the data in order to run the user-defined verification code of the constraints. For each metric, our library chooses a so-called Analyzer (which can be seen as a physical operator) capable of computing the particular metric on the data. The selected Analyzers are given to an AnalysisRunner in our runtime layer which schedules the execution of the metrics computation. This runner applies a set of simple optimizations to the computation of multiple metrics. For all metrics that do not require re-partitioning the data, the runner collects their required aggregation functions and executes them in a single generated SparkSQL query over the data to benefit from scan-sharing. In our example from Section 3.1, such metrics would be the Size of the dataset, the Completeness of six columns, as well as the Compliance for the three satisfies constraints. All these metrics will be computed simultaneously in a single pass over the data. The resulting metrics are finally stored in a document database (DynamoDB) for later retrieval (and usage by anomaly detection algorithms). The runtime for incremental computations stores the states of incremental analyzers in a distributed filesystem (S3).

For the predictability estimation, we have to train a machine learning model on the user-specified input columns and evaluate how well it predicts values in the target column. We developed a pluggable architecture, where we featurize the input columns by concatenating their string representations, and tokenize and hash them via Spark's Tokenizer and HashingTF transformers. Afterwards, any classification algorithm from SparkML [27] can be used to learn a prediction model; in our experiments we used Spark's Naive Bayes [36] implementation as it offers a scalable lower bound on prediction accuracy, does not require hyperparameter optimization, and is simple to train incrementally. We apply the trained classification model to predict values on a held-out fraction of the data and report the prediction quality (e.g., measured using precision) as predictability value.

¹ A system with support for materialized views would even allow us to simplify our incremental computation machinery.
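A condensed sketch of such a predictability pipeline with SparkML is shown below. It assumes a DataFrame with string-typed columns; the object, helper and column names are hypothetical, and details such as the feature dimensionality and the handling of unseen labels are simplified.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, concat_ws, lit}

object PredictabilitySketch {

  // Concatenates the string representations of the input columns, tokenizes
  // and hashes them, trains a naive bayes model for the target column, and
  // returns the prediction quality on held-out data as predictability score.
  def predictability(data: DataFrame, target: String, inputs: Seq[String]): Double = {
    val featurized = data.withColumn(
      "__text", concat_ws(" ", inputs.map(c => coalesce(col(c).cast("string"), lit(""))): _*))

    val pipeline = new Pipeline().setStages(Array(
      new StringIndexer().setInputCol(target).setOutputCol("__label").setHandleInvalid("skip"),
      new Tokenizer().setInputCol("__text").setOutputCol("__tokens"),
      new HashingTF().setInputCol("__tokens").setOutputCol("features"),
      new NaiveBayes().setLabelCol("__label").setFeaturesCol("features")
    ))

    val Array(train, test) = featurized.randomSplit(Array(0.8, 0.2))
    val predictions = pipeline.fit(train).transform(test)

    new MulticlassClassificationEvaluator()
      .setLabelCol("__label")
      .setMetricName("weightedPrecision")
      .evaluate(predictions)
  }
}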

4.1 Incremental Computation
In the following, we detail how to make our system's analyzers 'state-aware' to enable them to conduct incremental computations. A corresponding base class in Scala is shown in Listing 3, where M denotes the type of metric to compute and S denotes the type of state required. Persistence and retrieval of the state are handled outside of the implementation by a so-called StateProvider. The method initialState produces an initial empty state, apply produces a state and the corresponding metric for an initial dataset, and update consumes the current state and a delta dataset, and produces the updated state, as well as the corresponding metrics, both for the dataset as a whole and for the delta, in the form of a tuple (S, M, M). Furthermore, the method applyOrUpdateFromPersistedState executes the incremental computation and takes care of managing the involved states using StateProviders.

trait IncrementalAnalyzer[M, S]
    extends Analyzer[M] {

  def initialState(initialData: DataFrame): S

  def update(
      state: S,
      delta: DataFrame): (S, M, M)

  def updateFromPersistedState(
      stateProvider: Option[StateProvider],
      nextStateProvider: StateProvider,
      delta: DataFrame): (M, M)
}

trait StateProvider {

  def persistState[S](
      state: S,
      analyzer: IncrementalAnalyzer[M, S])

  def loadState[S](
      analyzer: IncrementalAnalyzer[M, S]): S
}

Listing 3: Interface for incremental analyzers.
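To illustrate how a concrete analyzer fills in this interface, the following simplified, self-contained sketch implements the Size metric in the same shape as Listing 3, but omits the state-provider plumbing; the names are ours, not the system's.

import org.apache.spark.sql.DataFrame

// 'SizeState' is the sufficient statistic (the record count seen so far).
case class SizeState(numRows: Long)

class IncrementalSizeAnalyzerSketch {

  def initialState(initialData: DataFrame): SizeState =
    SizeState(initialData.count())

  // Consumes the previous state and a delta, and returns the updated state,
  // the metric for the whole dataset, and the metric for the delta,
  // analogous to the (S, M, M) result in Listing 3.
  def update(state: SizeState, delta: DataFrame): (SizeState, Double, Double) = {
    val deltaRows = delta.count()
    val newState = SizeState(state.numRows + deltaRows)
    (newState, newState.numRows.toDouble, deltaRows.toDouble)
  }
}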
State management. In order to execute the incremental computation, a user needs to configure state providers to allow for state management. Typically, the state for a particular snapshot of the dataset resides in a directory on S3 and we provide a corresponding provider implementation. Given these, we call applyOrUpdateFromPersistedState which will compute the metric and persist the state. To compute the updated metric for the next snapshot of the dataset, we need two StateProviders, one which provides the state for the old snapshot, and another one which will receive the updated state computed from the old state and the delta. Note that this internal API is typically hidden from the users, who are advised to program our system using the declarative checks API from Section 3.1. In the following we discuss additional implementation details. When incrementally computing metrics that require a re-partitioning of the data (e.g., entropy and uniqueness that require us to group the data by the respective column), we implement the incremental scheme as follows. The state S is composed of a histogram over the data (the result of delta.select(columns).groupBy(columns).count()). The update function merges the previous histogram for the current data with the histogram of the delta via an outer join on the grouping columns and computes the corresponding counts for the delta and the whole dataset. The analyzer then computes the metric from the state by an aggregation over the histogram. We showcase an example of incrementally updating the entropy of a column in Figure 3.

[Figure 3: example tables for the previous state S^(t) and the delta ∆D^(t+1), the grouping query SELECT column, COUNT(*) FROM delta GROUP BY column, a FULL OUTER JOIN merging the two histograms, and the resulting entropy H(col) = −∑_i ((c_i + ∆c_i) / (N + ∆N)) log((c_i + ∆c_i) / (N + ∆N)).]
Figure 3: Example for an incremental update of the entropy of a column: the frequencies of values in the delta records ∆D^(t+1) are computed via a grouping query and merged with the previous state S^(t) via a full outer join. After adding up the counts, we have the updated state S^(t+1), from which the entropy of the column in the updated dataset D^(t+1) can be computed.

Optimizations. During the computation of multiple metrics, we apply a set of manually enforced query optimizations: (a) we cache the result of the count operation on dataframes, as many metrics require the size of the delta for example; (b) we apply scan sharing for aggregations: we run all aggregations that rely on the same grouping (or no grouping) of the data in the same pass over the data.

4.2 Efficiently Suggesting Constraints
The major design objective in our constraint suggestion component is to keep the computation of required summary statistics cheap so that it can be executed during an ingestion pipeline for large datasets. It is therefore crucial to keep the number of passes over the data independent of the number of columns in the dataframe at hand. We assume our input to be a dataframe with named columns with unknown types (initially of type 'string' during ingestion, e.g., when reading CSV files). For the computation of the summary statistics required for our heuristics from Section 3.3, we only use aggregations that do not require a re-partitioning of the table, and we only conduct two passes over the data, where the aggregations share scans. Furthermore, we have an estimate of the number of distinct values for the columns of interest after the first pass, which allows us to control the memory for sketches and histograms (e.g., by only computing the full value distribution for columns with low cardinality) used in the second pass.

As discussed in Section 3.3, we leverage a machine learning model for deciding upon unique constraint suggestion. This model's inputs are the table name, as well as column name and type. As training data for this model, we extract a dataset of 2,453 (table name, column name, type, is unique) tuples from the database schemas of several open source projects such as mediawiki, wordpress and oscommerce. On this schema data, we train a logistic regression

9 40 no grouping 5.1 Batch Computation 35 brand In this first set of experiments, we evaluate how well our material 30 product id metrics computation scales to large datasets and showcase 25 the efficiency of our machine learning-based predictability 20 estimation. 15 runtime (seconds) 10 . In order to eval- Scalability of metrics computations 5 120 0 100 20 80 40 60 uate the scalability of our system’s batch metrics computa- # records in dataset (millions) tion, we compute a set of quality metrics on the resulting growing product dataset, which we read from S3. Figure 4 Figure 4: Linearly increasing runtime for different shows the results, where each point represents the runtime batch metrics computations on a growing product on a particular version of the growing dataset. The plot dataset with up to 120 million records. labeled ‘no grouping’ refers to the results for computing a set of six metrics (size of the data and completeness of five columns) which do not require us to re-partition the data. Therefore these metrics can be computed by aggregations in model using hashed character n-grams of the names and a a single pass over the data. The remaining lines refer to the one-hot-encoding of the type as features. We leverage the computation of metrics such as entropy and uniqueness on SGDClassifier from combined with the HashingVectorizer , which require id material , brand the columns product and [32], and tune the model’s hyperparameters (fea- scikit-learn us to repartition the data (e.g., grouping it by the column for ture vector dimensionality, regularization, size of n-grams) which we want to compute these metrics). Due to the inher- using 5-fold cross validation. We achieve an AUC score of ent re-partitioning, these metrics are typically more costly 0 . 859 for the ROC curve, using a logistic loss function with to compute and their cost is related to the cardinality of the L1 regularization and a regularization factor of 0.001 on n- respective column. Nevertheless, all four evaluated work- 8 grams of up to size 5 from the input data hashed to 10 loads exhibit a runtime linearly growing with the dataset dimensional feature vectors. We leverage the probabilistic size, which is expected as our system internally generates prediction of this model (giving us a hint on whether the simple aggregation queries with custom aggregation func- naming of the column indicates a unique constraint) as a [3]. SparkSQL tions to be run by score for our rule-based unique constraint suggestion and only issue the suggestion to the user if the model assigns a Predictability estimation with naive bayes . We show- probability greather than 50%. case our predictability estimation functionality on a set of 845,000 fashion products from 10 popular brands which we extracted from the larger product dataset. We set the task 5. EXPERIMENTAL EVALUATION brand of predicting the value of the column from other In the following, we run a set of scalability experiments , , , description name columns, such as size , bulletpoints for our batch metrics computation, apply our predictabil- manufacturer and combinations of those. We run the cor- ity estimation to a product dataset (Section 5.1), and in- responding experiments with Spark on a single c4.8xlarge vestigate the benefits of applying incremental computation instance. We take different samples from the dataset (100k to growing datasets (Section 5.2). Finally, we evaluate our records, 200k records, 400k records, 600k records, full dataset). 
constraint suggestion functionality on two external datasets On each of these, we train a naive bayes model with hashed (Section 5.3) and showcase a simple anomaly detection use- column brand input columns as features for predicting the case in Section 5.4. F on 80% of the sample. Finally, we calculate the weighted 1 score for the predictions on the remaining 20%. We repeat For our Spark-based experiments, we leverage a dataset this for different input columns and differently sized samples representing a sample from an internal product catalog, which column alone is already name of the data. We find that the consists of approximately 120 million records, where each a very good predictor for brand , as it results in a weighted record describes a product with several hundred attributes; score of over 97% on all samples of the dataset. The F 1 the data size is roughly 50GB in parquet format. We mimic best results are achieved when leveraging a combination of a use case with a growing append-only dataset, and ran- column with the the bulletpoints name , description and domly partition our product data into 10 ‘deltas’ of about scores of 97.3%, F manufacturer columns, where we reach 1 12 million records for that. Additionally, we use two exter- 98.8%, 99.4%, 99.5% for the 100k , 200k, 400k, 600k samples 2 consists of nal datasets for evaluation. The first dataset of the data, and of 99.5% for the full dataset. Based on these comments from May 2015 in several discussion boards on AnalysisRunner results (which can be computed with an . The sec- the social news aggregation website reddit.com from Section 4), a user could configure a constraint for fu- ond dataset contains information about 5,180 twitter users, ture data to get notified once the predictability drops below which we extracted from the publicly available twitter sam- the observed values, e.g.: ple stream. Check(Level.Warning) .hasPredictability("brand", ("name", The Spark-based experiments leverage a cluster on Elastic "bulletpoints", "description", MapReduce with 5 workers (c3.4xlarge instances) running "manufacturer"), f1 =0.97) Apache Spark 2.0.2 and HDFS 2.7.3. We additionally record the runtime of the model training for different featurizations. We find that the runtime scales 2 https://www.kaggle.com/reddit/reddit-comments-may-2015 linearly for growing data and mostly depends on the length 1789

10 80 35 40 30 batch batch 70 batch batch 30 35 25 incremental incremental 60 incremental incremental 30 25 50 20 25 40 20 15 20 30 15 10 15 20 10 runtime (seconds) runtime (seconds) 10 5 runtime (seconds) 10 runtime (seconds) 5 0 5 0 120 80 40 20 0 20 40 60 100 100 60 0 120 80 80 60 40 20 0 120 100 80 100 120 20 40 0 60 # records in dataset (millions) # records in dataset (millions) # records in dataset (millions) # records in dataset (millions) Figure 8: Runtimes for Figure 7: Runtimes for Figure 6: Runtimes for Figure 5: Runtimes for uniqueness and entropy uniqueness and entropy uniqueness and entropy size and completeness product id on . brand on . material on . id , material , on product color , brand . is basically just a copy of the original column. The overhead of the strings in the input columns (e.g., training a model column alone with lots of text takes description of maintaining this histogram is also what makes the incre- on the name mental computation always perform worse. While this is a column combined with longer than training on the column). This is expected as naive bayes drawback, the incremental approach still has the advantage bulletpoints the conducts a single pass through the data, and sums up the of not requiring access to the full dataset during computa- tion time, which greatly simplifies ingestion pipelines. feature vectors per class, which result from tokenizing and hashing the input columns. Note that the training is very efficient; it takes less than ten seconds in all cases, even on 5.3 Constraint Suggestion the full dataset. We evaluate our constraint suggestion component on a sample of 50,000 records from the reddit dataset as well 5.2 Benefits of Incremental Computation as on the twitter users dataset. In each case, we take a random sample of 10% of the records, have our system sug- We revisit our scalability experiment on growing data to gest constraints based on the sampled data and compute validate our assumptions about the benefits of incremental the coverage of these constraints on the remaining 90% of computation. We compare batch analysis, which always has the datasets. The suggested constraints as well as their cov- to consume the dataset as a whole (the union of all the erage is shown in Table 3. The reddit dataset is a very deltas seen so far) against our incremental approach which easy case, as all columns are complete and only have types maintains a state and always operates on this state and the string and integer. This simple structure is reflected by the current delta only. fact that all suggested constraints suggested hold on the Figure 5 shows the results for again computing the metrics test set. The experiment on the twitter users dataset is that do not require us to re-partition the data (we referred to more interesting, as we have columns with missing values this experiment as ‘non-grouping’ previously). These met- location such as and columns with a small set of discrete rics can be computed in a single pass over the data, and the values such as . The completeness of the lang col- location actual state for the incremental computation is tiny here, as . umn in the sample is 0 28076 and the suggested constraint only one or two numbers per metric have to be maintained. hasCompleteness >= 0.28 holds on the test data, e.g., the While the runtime of the batch analysis grows linearly with ratio of missing values does not increase. 
5.3 Constraint Suggestion

We evaluate our constraint suggestion component on a sample of 50,000 records from the reddit dataset as well as on the twitter users dataset. In each case, we take a random sample of 10% of the records, have our system suggest constraints based on the sampled data, and compute the coverage of these constraints on the remaining 90% of the datasets. The suggested constraints as well as their coverage are shown in Table 3. The reddit dataset is a very easy case, as all columns are complete and only have the types string and integer. This simple structure is reflected by the fact that all suggested constraints hold on the test set. The experiment on the twitter users dataset is more interesting, as we have columns with missing values such as location and columns with a small set of discrete values such as lang. The completeness of the location column in the sample is 0.28076 and the suggested constraint hasCompleteness >= 0.28 holds on the test data, i.e., the ratio of missing values does not increase. The system correctly suggests an isUnique constraint for the columns id and screen name, both of which are actually primary keys for the data. However, the system also suggests two constraints which do not hold for the data. The first one is the range of values for the lang column. Here we identified ten different values which account for more than 99% of the records in the test data, but miss rare languages such as turkish or hungarian. A failing constraint on that column can nevertheless be helpful; we found by manual investigation that the data in this column is not correctly normalized, e.g., there are different capitalizations of the same language value such as 'en-gb' and 'en-GB'. In the second case, the system erroneously suggests an isUnique constraint for the statuses count column, due to the fact that there are many different values for this column in the sample at hand and we only know an approximation of the number of distinct values; the uniqueness value of this column is only 64% in the test data. The second error is corrected, however, once we leverage the predictions of our classifier for unique constraints: while the classifier assigns a high probability of 81% that our suggested unique constraint on the id column is valid, it only assigns a 2% probability to the statuses count column being unique; therefore, we decide against the suggested constraint.
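To illustrate how such suggestions can be derived from a sample, the sketch below computes completeness and an approximate distinct count for a single column and maps them to candidate constraints. It is an illustrative heuristic only, not the system's actual suggestion algorithm; the function name and thresholds are assumptions, and it targets a recent Spark version.

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{approx_count_distinct, col, count, lit, sum, when}

  // Illustrative heuristic for deriving suggestions from a sample. The
  // HyperLogLog-based approx_count_distinct mirrors the approximate distinct
  // counts discussed above, which can lead to spurious isUnique suggestions.
  def suggestConstraints(sample: DataFrame, column: String): Seq[String] = {
    val stats = sample.agg(
      count(lit(1)).as("numRows"),
      sum(when(col(column).isNotNull, 1L).otherwise(0L)).as("numNonNull"),
      approx_count_distinct(col(column)).as("approxDistinct")
    ).head()

    val numRows = stats.getAs[Long]("numRows").toDouble
    val completeness = stats.getAs[Long]("numNonNull") / numRows
    val approxDistinct = stats.getAs[Long]("approxDistinct").toDouble

    // Suggest isComplete for fully populated columns, otherwise a lower bound
    // on the observed completeness (rounded down to two decimals).
    val completenessSuggestion =
      if (completeness == 1.0) s"isComplete($column)"
      else f"hasCompleteness($column, >= ${math.floor(completeness * 100) / 100}%.2f)"

    // Suggest a unique constraint if the approximate distinct count (almost)
    // matches the number of rows in the sample.
    val uniquenessSuggestion =
      if (approxDistinct >= 0.99 * numRows) Seq(s"isUnique($column)") else Seq.empty

    completenessSuggestion +: uniquenessSuggestion
  }

Applied to the twitter users sample, a heuristic of this kind would emit something like hasCompleteness(location, >= 0.28) as well as the spurious isUnique suggestion for statuses count, which is exactly why the additional classifier trained on column names is useful as a filter.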

Unfortunately, the classifier produces a false negative for the screen name column, which is indeed unique in our sample at hand; however, this would also not be immediately obvious to humans (as different users are allowed to have the same screen name on many social networking platforms), and we prefer conservative and robust suggestions (e.g., rather having false negatives than false positives), which build trust in our users.

Table 3: Constraint suggestion and type prediction for the reddit comments and twitter users datasets. Constraints are suggested based on a 10% sample of the data, and their coverage is computed on the remaining 90% of the data. We leverage a machine learning model trained on column names to decide upon potential unique constraints.

dataset | column | suggested constraints | coverage | classifier score
reddit-comments | id | isComplete | 1.0 | -
 | | isUnique | 1.0 | 0.83
 | created utc | isComplete, hasConsistentType(integral), isNonNegative | 1.0 | -
 | subreddit | isComplete | 1.0 | -
 | author | isComplete | 1.0 | -
 | ups | isComplete, hasConsistentType(integral), isNonNegative | 1.0 | -
 | downs | isComplete, hasConsistentType(integral), isNonNegative | 1.0 | -
 | score | isComplete, hasConsistentType(integral) | 1.0 | -
 | edited | isComplete | 1.0 | -
 | controversiality | isComplete, hasConsistentType(integral), isNonNegative, isInRange(0, 1) | 1.0 | -
 | text | isComplete | 1.0 | -
twitter-users | id | isComplete, hasConsistentType(integral), isNonNegative | 1.0 | -
 | | isUnique | 1.0 | 0.81
 | name | isComplete | 1.0 | -
 | screen name | isComplete | 1.0 | -
 | | isUnique | 1.0 | 0.01
 | lang | isComplete | 1.0 | -
 | | isInRange('en', 'pt', ...) | 0.991 | -
 | location | hasCompleteness >= 0.28 | 1.0 | -
 | followers count | isComplete, hasConsistentType(integral), isNonNegative | 1.0 | -
 | statuses count | isComplete, hasConsistentType(integral), isNonNegative | 1.0 | -
 | | isUnique | 0.636 | 0.02
 | verified | isComplete, hasConsistentType(boolean) | 1.0 | -
 | geo enabled | isComplete, hasConsistentType(boolean) | 1.0 | -

5.4 Anomaly Detection

In order to showcase our anomaly detection functionality, we apply it to a fictitious use case on the reddit dataset. Assume that we want to leverage the discussion data for an information extraction task such as question answering. A potential data quality problem would now be that the data could contain large amounts of spam and trolling posts, which would negatively influence our machine learning model that we aim to train on the data. If we regularly ingest data from reddit, we would like to be alarmed if there are signs of increased spamming or trolling activity. The reddit data contains a controversiality field for each post, and the series of the ratio of such posts in a board per day might be a good signal for detecting potential trolling. To leverage our anomaly detection functionality for this task, we inspect the historic time series of the Mean(controversiality) metric per discussion board (subreddit), which we would need to compute during ingestion. In our declarative API, the corresponding check looks as follows:

  Check(Level.Warning, groupBy = "subreddit")
    .hasNoAnomalies("controversiality", Mean,
      OnlineNormal(upperDeviationFactor = 3))

This indicates that we want to be warned if the mean controversiality on a particular day in a discussion board is more than three standard deviations higher than the previous mean. Figure 9 illustrates the result of this analysis for a selection of discussion boards from the reddit dataset. We see that there are discussion boards with relatively low mean and variance in the controversiality over time, such as anime and askreddit. However, there are also boards with spikes in controversiality, such as cringepics and chicagobulls, which get marked by our anomaly detection approach. Manual investigation of these spikes revealed that they are strongly correlated to the number of deleted users on the particular day, which indicates that they indeed result from trolling behavior.

Figure 9: Anomalies detected in the time series of the Mean(controversiality) metric for different boards in the reddit dataset, which indicate trolling behavior potentially decreasing data quality. (Panels: cringepics, anime, chicagobulls, askreddit; y-axis: mean controversiality; x-axis: day.)
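The OnlineNormal(upperDeviationFactor = 3) strategy used in the check above can be approximated with a simple running estimate of mean and variance. The sketch below is a minimal illustration of that idea based on Welford's online algorithm; the class and method names are assumptions and do not mirror the system's internal implementation.

  // Minimal sketch of an 'online normal' anomaly detection strategy: maintain
  // running mean and variance of the previously seen metric values and flag a
  // new observation that exceeds mean + upperDeviationFactor * stddev.
  class OnlineNormalDetector(upperDeviationFactor: Double = 3.0) {
    private var count = 0L
    private var mean = 0.0
    private var m2 = 0.0 // sum of squared deviations from the running mean

    def isAnomaly(value: Double): Boolean = {
      // Compare against the statistics of the *previous* observations only.
      val stdDev = if (count > 1) math.sqrt(m2 / (count - 1)) else 0.0
      val anomalous = count > 1 && value > mean + upperDeviationFactor * stdDev
      update(value)
      anomalous
    }

    private def update(value: Double): Unit = {
      count += 1
      val delta = value - mean
      mean += delta / count
      m2 += delta * (value - mean)
    }
  }

  // Example: scan the historic Mean(controversiality) series of one board and
  // collect the days whose value lies more than three standard deviations
  // above the mean of the preceding days.
  // val detector = new OnlineNormalDetector()
  // val anomalousDays = dailyMeans.zipWithIndex.collect {
  //   case (value, day) if detector.isAnomaly(value) => day
  // }

Whether anomalous observations should also update the running statistics is a design choice; this sketch updates them unconditionally, which makes the detector gradually adapt to level shifts in the metric.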
6. LEARNINGS

We report on learnings on different levels that we gained from users of our data validation system. On an organisational level, there are many benefits in using a common data quality library. Such a common library helps establish a shared vocabulary across teams for discussing data quality and also establishes best practices on how to measure data quality, leading to a common way to monitor the metrics of datasets. It is also a big advantage if producers as well as consumers of datasets leverage the same system for verifying data quality, as they can re-use checks and constraints from each other, e.g., the producer can choose to adopt checks from downstream consumers earlier in the data processing pipeline.

On a technical level, users highlighted the fact that our data quality library runs on Spark, which they experienced as a fast, scalable way to do data processing, partly due to the optimizations that our platform is applying.

Our system helped reduce manual and ad-hoc analysis on their data, e.g., sampling and eyeballing the results to identify possible problems such as incomplete fields, outliers and deviations from the expected number of rows. Instead, such checks can now be run in an automated way as a part of ingestion pipelines. Additionally, data producers can leverage our system to halt their data publishing pipelines when they encounter cases of data anomalies. By that, they can ensure that downstream data processing, which often includes training ML models, only works with vetted data.

7. RELATED WORK

Data cleaning has been an active research area for decades; see recent surveys [1, 10, 38] for an overview.

Declarative data quality verification. The idea to allow declarative definitions of data quality standards is well-established. Ilyas et al. provide an overview of standard consistency definitions [23]. In fact, every DBMS supports standard integrity constraints such as key constraints or nullable fields. A relevant extension are denial constraints, a first-order logic formalism that allows the coverage of more business rules across two tuples [11]. A popular paradigm for defining dependencies between columns are functional dependencies and conditional functional dependencies [6]; there exist fast, approximate algorithms for discovering them [31]. In contrast to this line of work, our predictability metric relies on ML to learn relationships between columns and uses empirical tests on hold-out datasets for validation. We conjecture that, due to their inherent robustness to outliers, ML methods are more suitable for our use case of automatically detecting changes in data quality on many large datasets. Galhardas et al. propose AJAX, a declarative language for data cleaning as an extension of SQL, as well as an execution model for executing data cleaning programs [14]. We similarly optimize the validation of our declarative data quality constraints to minimize computational effort. We combine a larger set of constraints into a unified framework, but do not support the automatic execution of data repair methods.

ML for data cleaning. Multiple researchers have suggested to use ML for cleaning data. While traditional methods can be used to generate candidates for fixing incorrect data (e.g., violations of functional dependencies), active learning methods can be used to select and prioritize human effort [49]. ActiveClean similarly uses active learning for prioritization, but at the same time it learns and updates a convex loss model [24]. HoloClean generates a probabilistic model over a dataset that combines integrity constraints and external data sources to generate data repair suggestions [35]. BoostClean automatically selects an ensemble of error detection and repair combinations using statistical boosting [25].

Data validation in ML applications. The challenges involved in building complex machine learning applications in the real world have recently been highlighted by many researchers, e.g., with respect to managing the resulting models [26], software engineering [42, 8], pipeline abstractions [2, 43, 46], and learnings from real-world systems [34, 7]. A new focus is being put on data management questions with respect to large-scale machine learning. Examples include managing the metadata of artifacts produced during machine learning workloads, including schemas and summary statistics of the datasets used for training and testing [47, 48, 40, 29, 28, 20, 33, 41], the discovery and organization of enterprise datasets [16], and machine learning-specific sanity checks [45]. As a consequence, modern machine learning platforms begin to have explicit data validation components [7, 5, 9].

8. CONCLUSION

We presented a system for automating data quality verification tasks, which scales to large datasets and meets the requirements of production use cases. The system provides a declarative API to its users, which combines common quality constraints with custom validation code, and thereby enables 'unit tests' for data. We discussed how to efficiently execute the constraint validation by translating checks to scalable metrics computations, and elaborated on reformulations of the metrics to enable incremental computations on growing datasets. Additionally, we provided examples for the use of machine learning techniques in data quality verification, e.g., for enhancing constraint suggestions, for estimating the predictability of a column, and for detecting anomalies in historic data quality time series.

In the future, we want to extend our machine learning-based constraint suggestion by leveraging more metadata as well as historical data about constraints defined with our API. Moreover, we will investigate the benefits of fitting well-known distributions to numeric columns to be able to understand this data in more detail and suggest more fine-grained constraints. Another direction is to provide users with more comprehensive error messages for failing checks and to allow them easy access to records that made a particular constraint fail. Furthermore, we will apply seasonal ARIMA [22] and neural network-based time series forecasting [13] in order to enhance our anomaly detection functionality to also be able to handle seasonal and intermittent time series. Finally, we will explore validating data quality in streaming scenarios, which should be a natural extension of our discussed incremental use case.

9. REFERENCES

[1] Z. Abedjan, L. Golab, and F. Naumann. Profiling relational data: a survey. VLDB Journal, 24(4):557–581, 2015.
[2] P. Andrews, A. Kalro, H. Mehanna, and A. Sidorov. Productionizing Machine Learning Pipelines at Scale. Machine Learning Systems workshop at ICML, 2016.
[3] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al. Spark SQL: Relational data processing in Spark. SIGMOD, 1383–1394, 2015.
[4] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino. Methodologies for data quality assessment and improvement. ACM Computing Surveys, 41(3):16, 2009.
[5] D. Baylor, E. Breck, H.-T. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc, et al. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. KDD, 1387–1395, 2017.
[6] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. ICDE, 746–755, 2007.
[7] J.-H. Böse, V. Flunkert, J. Gasthaus, T. Januschowski, D. Lange, D. Salinas, S. Schelter, M. Seeger, and Y. Wang. Probabilistic demand forecasting at scale. PVLDB, 10(12):1694–1705, 2017.
[8] E. Breck, S. Cai, E. Nielsen, M. Salib, and D. Sculley. The ML test score: A rubric for ML production readiness and technical debt reduction. Big Data, 1123–1132, 2017.
[9] E. Breck, N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich. Data Infrastructure for Machine Learning. SysML, 2018.
[10] X. Chu, I. F. Ilyas, S. Krishnan, and J. Wang. Data cleaning: Overview and emerging challenges. SIGMOD, 2201–2206, 2016.
[11] X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13):1498–1509, 2013.
[12] T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. SIGMOD, 240–251, 2002.
[13] V. Flunkert, D. Salinas, and J. Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. CoRR, abs/1704.04110, 2017.
[14] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. VLDB, 371–380, 2001.
[15] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. SIGMOD Record, 30:58–66, 2001.
[16] A. Halevy, F. Korn, N. F. Noy, C. Olston, N. Polyzotis, S. Roy, and S. E. Whang. GOODS: Organizing Google's Datasets. SIGMOD, 795–806, 2016.
[17] A. Y. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. Intelligent Systems, 24(2):8–12, 2009.
[18] H. Harmouch and F. Naumann. Cardinality estimation: An experimental survey. PVLDB, 11(4):499–512, 2017.
[19] J. M. Hellerstein. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.
[20] J. M. Hellerstein, V. Sreekanti, J. E. Gonzalez, J. Dalton, A. Dey, S. Nag, K. Ramachandran, S. Arora, A. Bhattacharyya, S. Das, et al. Ground: A data context service. CIDR, 2017.
[21] S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. EDBT, 683–692, 2013.
[22] R. J. Hyndman and G. Athanasopoulos. Forecasting: principles and practice. OTexts, 2014.
[23] I. F. Ilyas and X. Chu. Trends in cleaning relational data: Consistency and deduplication. Foundations and Trends in Databases, 5(4):281–393, 2015.
[24] S. Krishnan, M. J. Franklin, K. Goldberg, J. Wang, and E. Wu. ActiveClean: An interactive data cleaning framework for modern machine learning. SIGMOD, 2117–2120, 2016.
[25] S. Krishnan, M. J. Franklin, K. Goldberg, and E. Wu. BoostClean: Automated error detection and repair for machine learning. CoRR, abs/1711.01299, 2017.
[26] A. Kumar, R. McCann, J. Naughton, and J. M. Patel. Model Selection Management Systems: The Next Frontier of Advanced Analytics. SIGMOD Record, 44(4):17–22, 2016.
[27] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al. MLlib: Machine learning in Apache Spark. JMLR, 17(1):1235–1241, 2016.
[28] H. Miao, A. Li, L. S. Davis, and A. Deshpande. On model discovery for hosted data science projects. Workshop on Data Management for End-to-End Machine Learning at SIGMOD, 6, 2017.
[29] H. Miao, A. Li, L. S. Davis, and A. Deshpande. Towards unified data and lifecycle management for deep learning. ICDE, 571–582, 2017.
[30] F. Naumann. Quality-driven Query Answering for Integrated Information Systems. Springer, 2002.
[31] T. Papenbrock and F. Naumann. A hybrid approach to functional dependency discovery. SIGMOD, 821–833, 2016.
[32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830, 2011.
[33] J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire. noWorkflow: a tool for collecting, analyzing, and managing provenance from python scripts. PVLDB, 10(12):1841–1844, 2017.
[34] N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich. Data management challenges in production machine learning. SIGMOD, 1723–1726, 2017.
[35] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. HoloClean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, 2017.

[36] J. D. Rennie, L. Shih, J. Teevan, and D. R. Karger. Tackling the poor assumptions of naive bayes text classifiers. ICML, 616–623, 2003.
[37] T. Rukat, D. Lange, and C. Archambeau. An interpretable latent variable model for attribute applicability in the amazon catalogue. Interpretable ML Symposium at NIPS, 2017.
[38] S. Sadiq, J. Freire, R. J. Miller, T. Dasu, I. F. Ilyas, F. Naumann, D. Srivastava, X. L. Dong, S. Link, and X. Zhou. Data quality: the role of empiricism. SIGMOD Record, 46(4):35–43, 2018.
[39] M. Scannapieco and T. Catarci. Data quality under a computer science perspective. Archivi & Computer, 2, 1–15, 2002.
[40] S. Schelter, J.-H. Boese, J. Kirschnick, T. Klein, and S. Seufert. Automatically Tracking Metadata and Provenance of Machine Learning Experiments. Machine Learning Systems workshop at NIPS, 2017.
[41] S. Schelter, J.-H. Boese, J. Kirschnick, T. Klein, and S. Seufert. Declarative Metadata Management: A Missing Piece in End-to-End Machine Learning. SysML, 2018.
[42] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J. Crespo, and D. Dennison. Hidden Technical Debt in Machine Learning Systems. NIPS, 2503–2511, 2015.
[43] E. R. Sparks, S. Venkataraman, T. Kaftan, M. J. Franklin, and B. Recht. KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics. ICDE, 535–546, 2017.
[44] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. ICCV, 843–852, 2017.
[45] M. Terry, D. Sculley, and N. Hynes. The Data Linter: Lightweight, Automated Sanity Checking for ML Data Sets. Machine Learning Systems Workshop at NIPS, 2017.
[46] T. van der Weide, D. Papadopoulos, O. Smirnov, M. Zielinski, and T. van Kasteren. Versioning for end-to-end machine learning pipelines. Workshop on Data Management for End-to-End Machine Learning at SIGMOD, 2, 2017.
[47] J. Vanschoren, J. N. Van Rijn, B. Bischl, and L. Torgo. OpenML: networked science in machine learning. KDD, 49–60, 2014.
[48] M. Vartak, H. Subramanyam, W.-E. Lee, S. Viswanathan, S. Husnoo, S. Madden, and M. Zaharia. ModelDB: A System for Machine Learning Model Management. Workshop on Human-In-the-Loop Data Analytics at SIGMOD, 14, 2016.
[49] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 4(5):279–289, 2011.
[50] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. HotCloud, 95, 2010.
