
Transcript

Automated Performance Management for the Big Data Stack

Anastasios Arvanitis, Shivnath Babu, Eric Chu, Adrian Popescu, Alkis Simitsis, Kevin Wilkinson
Unravel Data Systems
{tasos,shivnath,eric,adrian,alkis,kevinw}@unraveldata.com

ABSTRACT

More than 10,000 enterprises worldwide today use the big data stack that is composed of multiple distributed systems. At Unravel, we have worked with a representative sample of these enterprises that covers most industry verticals. This sample also covers the spectrum of choices for deploying the big data stack across on-premises datacenters, private cloud deployments, public cloud deployments, and hybrid combinations of these. In this paper, we aim to bring attention to the performance management requirements that arise in big data stacks. We provide an overview of the requirements both at the level of individual applications as well as holistic clusters and workloads. We present an architecture that can provide automated solutions for these requirements and then do a deep dive into a few of these solutions.

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2019. 9th Biennial Conference on Innovative Data Systems Research (CIDR '19), January 13-16, 2019, Asilomar, California, USA.

1. BIG DATA STACK

Many applications in fields like health care, genomics, financial services, self-driving technology, government, and media are being built on what is popularly known today as the big data stack. What is unique about the big data stack is that it is composed of multiple distributed systems. The typical evolution of the big data stack in an enterprise usually goes through the following stages (also illustrated in Figure 1).

Figure 1: Evolution of the big data stack in an enterprise

Big Data Extract-Transform-Load (ETL): Storage systems like HDFS, S3, and Azure Blob Store (ABS) are used to store the large volumes of structured, semi-structured, and unstructured data in the enterprise. Distributed processing engines like MapReduce, Tez, and Pig/Hive (usually running on MapReduce or Tez) are used for data extraction, cleaning, and transformations of the data.

Big Data Business Intelligence (BI): MPP SQL systems like Impala, Presto, LLAP, Drill, BigQuery, RedShift, or Azure SQL DW are added to the stack; sometimes alongside incumbent MPP SQL systems like Teradata and Vertica. Compared to the traditional MPP systems, the newer ones have been built to deal with data stored in a different distributed storage system like HDFS, S3, or ABS. These systems power the interactive SQL queries that are common in business intelligence workloads.

Big Data Science: As enterprises mature in their use of the big data stack, they start bringing in more data-science workloads that leverage machine learning and AI. This stage is usually when the Spark distributed system starts to be used more and more.

Big Data Streaming: Over time, enterprises begin to understand the importance of making data-driven decisions in near real-time as well as how to overcome the challenges in implementing them. Usually at this point in the evolution, systems like Kafka, Cassandra, and HBase are added to the big data stack to support applications that ingest and process data in a continuous streaming fashion.

Industry analysts estimate that there are more than 10,000 enterprises worldwide that are running applications in production on a big data stack comprising three or more distributed systems [1]. At Unravel, we have worked closely with around 50 of these enterprises, and have had detailed conversations with around 350 more of these enterprises. These enterprises cover almost every industry vertical and run their stacks in on-premises datacenters, private cloud deployments, public cloud deployments, or in hybrid combinations (e.g., regularly-scheduled workloads like the Big Data

ETL runs in on-premises datacenters while the non-sensitive data is replicated to one or more public clouds where ad-hoc workloads like Big Data BI run). Sizes of these clusters vary from a few tens to a few thousands of nodes. Furthermore, in some of the production deployments on the cloud, the size of an auto-scaling cluster can vary from 1 node to 1000 nodes in under 10 minutes.

The goal of this paper is to bring attention to the performance management requirements that arise in big data stacks. A number of efforts like Polystore [7], HadoopDB [3], and hybrid flows [18] have addressed challenges in stacks composed of multiple systems. However, their primary focus was not on the performance management requirements that we address. We split these requirements into two categories: application performance requirements and operational performance requirements. Next, we give an overview of these requirements.

1.1 Application Performance Requirements

The nature of distributed applications is that they interact with many different components that could be independent or interdependent. This nature is often referred to in popular literature as "having many moving parts." In such an environment, questions like the following can become nontrivial to answer:

• Failure: What caused this application to fail, and how can I fix it?

• Stuck: This application seems to have made little progress in the last hour. Where is it stuck?

• Runaway: Will this application ever finish, or will it finish in a reasonable time?

• SLA: Will this application meet its SLA?

• Change: Is the behavior (e.g., performance, resource usage) of this application very different from the past? If so, in what way and why?

• Rogue/victim: Is this application causing problems on my cluster; or vice versa, is the performance of this application being affected by one or more other applications?

It has to be borne in mind that almost every application in the big data stack interacts with multiple distributed systems. For example, a SQL query may interact with Spark for its computational aspects, with YARN for its resource allocation and scheduling aspects, and with HDFS or S3 for its data access and IO aspects. Or, a streaming application may interact with Kafka, Flink, and HBase (as illustrated in Figure 1).

1.2 Operational Performance Requirements

Many performance requirements also arise at the "macro" level compared to the level of individual applications. Examples of such requirements are:

• Configuring resource allocation policies in order to meet SLAs in multi-tenant clusters [9, 19].

• Detecting rogue applications that can affect the performance of SLA-bound applications through a variety of low-level resource interactions [10].

• Configuring the 100s of configuration settings that distributed systems are notoriously known for having in order to get the desired performance.

• Tuning data partitioning and storage layout.

• Optimizing dollar costs on the cloud. All types of resource usage on the cloud cost money. For example, picking the right node type for a cloud cluster can have a major impact on the overall cost of running a workload.

• Capacity planning using predictive analysis in order to account for workload growth proactively.

• Identifying in an efficient way who (e.g., user, tenant, group) is running an application or a workload, who is causing performance problems, and so on. Such an accounting process is typically known as 'chargeback'.

2. ARCHITECTURE OF A PERFORMANCE MANAGEMENT SOLUTION

Addressing the challenges from Section 1 needs an architecture like the one shown in Figure 2. Next, we will discuss the main components of this architecture.

Full-Stack Data Collection: To answer questions like those raised in Sections 1.1 and 1.2, monitoring data is needed from every level of the big data stack. For example, (i) SQL queries, execution plans, data pipeline dependency graphs, and logs from the application level; (ii) resource allocation and wait-time metrics from the resource management and scheduling level; (iii) actual CPU, memory, and network usage metrics from the infrastructure level; (iv) data access and storage metrics from the file-system and storage level; and so on. Collecting such data in nonintrusive and low-overhead ways from production clusters remains a major technical challenge, but this problem has received attention in the database and systems community [8].

Event-driven Data Processing: Some of the clusters that we work with are more than 500 nodes in size and run multiple hundreds of thousands of applications every day across ETL, BI, data science, and streaming. These deployments generate tens of terabytes of logs and metrics every day. The velocity and volume challenges from this data are definitely nontrivial. However, the variety and consistency challenges here, to the best of our knowledge, have not been addressed by the database and systems community.

The Variety Challenge: The monitoring data collected from the big data stack covers the full spectrum from unstructured logs to semistructured data pipeline dependency DAGs and to structured time-series metrics. Stitching this data together to create meaningful and usable representations of application performance is a nontrivial challenge.

The Consistency Challenge: Monitoring data has to be collected independently and in real-time from various moving parts of the multiple distributed systems that comprise the big data stack. Thus, no prior assumptions can be made about the timeliness or order in which the monitoring data arrives at the processing layer in Figure 2. For example, consider a Hive query Q that runs two MapReduce jobs J1 and J2. Suppose J1 and J2 in turn run 200 and 800 containers C1, ..., C200 and C201, ..., C1000 respectively.
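The order-independence this example calls for can be made concrete with a short sketch. The following Python model is hypothetical (the class and field names are ours, not Unravel's implementation): because each event carries the identifiers it refers to, and every state update is a set union, which is commutative and idempotent, ingesting the events for Q, J1, J2, and their containers in any arrival order, even with duplicates, converges to the same final state.

```python
import random

# Hypothetical sketch of an order-insensitive aggregator for monitoring
# events (names are illustrative, not Unravel's implementation). Every
# update is a set union keyed by identifiers carried in the event itself,
# so ingestion is commutative and idempotent: any arrival order of the
# events converges to the same state.

class MonitoringState:
    def __init__(self):
        self.jobs_of_query = {}        # query id -> {job ids}
        self.containers_of_job = {}    # job id -> {container ids}

    def ingest(self, event):
        kind, payload = event
        if kind == "job":              # ("job", (query_id, job_id))
            query_id, job_id = payload
            self.jobs_of_query.setdefault(query_id, set()).add(job_id)
        elif kind == "container":      # ("container", (job_id, container_id))
            job_id, container_id = payload
            self.containers_of_job.setdefault(job_id, set()).add(container_id)

    def snapshot(self):
        return (self.jobs_of_query, self.containers_of_job)

# The example from the text: query Q runs jobs J1 and J2, which in turn
# run containers C1..C200 and C201..C1000 respectively.
events = [("job", ("Q", "J1")), ("job", ("Q", "J2"))]
events += [("container", ("J1", f"C{i}")) for i in range(1, 201)]
events += [("container", ("J2", f"C{i}")) for i in range(201, 1001)]

in_order = MonitoringState()
for e in events:
    in_order.ingest(e)

shuffled = MonitoringState()
for e in random.sample(events, len(events)):  # an arbitrary arrival order
    shuffled.ingest(e)

assert in_order.snapshot() == shuffled.snapshot()
```

A production processing layer must additionally tolerate monitoring data lost in transit, for example by reconciling against periodic snapshots; that failure-handling dimension is outside this sketch.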
One may expect that the monitoring data from these

Figure 2: Architecture of a performance management platform for the big data stack

components arrives in the order Q, J1, C1, ..., C200, J2, C201, ..., C1000. However, the data can come in any order, e.g., C1, ..., C100, J1, Q, C101, ..., C200, C201, ..., C1000, J2. As a result, the data processing layer in Figure 2 has to be based on event-driven processing algorithms whose outputs converge to the same final state irrespective of the timeliness and order in which the monitoring data arrives. From the user's perspective, she should get the same insights irrespective of the timeliness and order in which the monitoring data arrives. An additional complexity that we do not have space to discuss further is the chance of some monitoring data getting lost in transit due to failure, network partitions, overload, etc. It is critical to account for this aspect in the overall architecture.

ML-driven Insights and Policy-driven Actions: Enabling all the monitoring data to be collected and stored in a single place opens up interesting opportunities to apply statistical analysis and learning algorithms to this data. These algorithms can generate insights that, in turn, can be applied manually by the user or automatically based on configured policies to address the performance requirements identified in Sections 1.1 and 1.2. Unlike the big data stack that we consider in this paper, efforts such as self-driving databases [2, 16] address similar problems for traditional database systems like MySQL, PostgreSQL, and Oracle, and in the cloud (e.g., [6, 12, 13, 14]). In the next section, we will use example problems to dive deeper into the solutions.

3. SOLUTIONS DEEP DIVE

3.1 Application Failure

In distributed systems, applications can fail due to many reasons. But when an application fails, users are required to fix the cause of the failure to get the application running successfully. Since applications in distributed systems interact with multiple components, a failed application throws up a large set of raw logs. These logs typically contain thousands of messages, including errors and stacktraces. Hunting for the root cause of an application failure from these messy, raw, and distributed logs is hard for experts, and a nightmare for the thousands of new users coming to the big data stack. The question we will explore in this section is how to automatically generate insights into a failed application in a multi-engine big data stack that will help the user get the application running successfully. We will use Spark as our example, but the concepts generalize to the big data stack.

Automatic Identification of the Root Cause of Application Failure: Spark platform providers like Amazon, Azure, Databricks, and Google Cloud as well as Application Performance Management (APM) solution providers like Unravel have access to a large and growing dataset of logs from millions of Spark application failures. This dataset is a gold mine for applying state-of-the-art artificial intelligence (AI) and machine learning (ML) techniques. Next, let us look at possible ways to automate the process of failure diagnosis by building predictive models that continuously learn from logs of past application failures for which the respective root causes have been identified. These models can then automatically predict the root cause when an application fails. Such actionable root-cause identification improves the productivity of Spark users significantly.

A distributed Spark application consists of a Driver container and one or more Executor containers. A number of logs are available every time a Spark application fails. However, the logs are extremely verbose and messy. They contain multiple types of messages, such as informational messages from every component of Spark, error messages in many different formats, stacktraces from code running on the Java Virtual Machine (JVM), and more. The complexity of Spark usage and internals makes things worse. Types of failures and error messages differ across Spark SQL, Spark Streaming, iterative machine learning and graph applications, and interactive applications from Spark shell and notebooks (e.g., Jupyter, Zeppelin). Furthermore, failures in distributed systems routinely propagate from one component to another. Such propagation can cause a flood of error messages in the log and obscure the root cause.

Figure 3 shows our overall solution to deal with these problems and to automate root cause analysis (RCA) for Spark application failures. Overall, the solution consists of:

• Continuously collecting logs from a variety of Spark application failures

• Converting logs into feature vectors

• Learning a predictive model for RCA from these feature vectors

Data collection for training: As the saying goes: garbage in, garbage out. Thus, it is critical to train RCA models on representative input data. In addition to relying on logs

Figure 3: Approach for automatic root cause analysis (RCA)

from real-life Spark application failures observed on customer sites, we have also invested in a lab framework where root causes can be artificially injected to collect even larger and more diverse training data.

Structured versus unstructured data: Logs are mostly unstructured data. To keep the accuracy of model predictions to a high level in automated RCA, it is important to combine this unstructured data with some structured data. Thus, whenever we collect logs, we are careful to collect trustworthy structured data in the form of key-value pairs that we additionally use as input features in the predictive models. These include Spark platform information and environment details of Scala, Hadoop, OS, and so on.

Labels: ML techniques for prediction fall into two broad categories: supervised learning and unsupervised learning. We use both techniques in our overall solution. For the supervised learning part, we attach root-cause labels with the logs collected from an application failure. This label comes from a taxonomy of root causes that we have created based on millions of Spark application failures seen in the field and in our lab. Broadly speaking, as shown in Figure 4, the taxonomy can be thought of as a tree data structure that categorizes the full space of root causes. For example, the first non-root level of this tree can be failures caused by: (i) Configuration errors, (ii) Deployment errors, (iii) Resource errors, (iv) Data errors, (v) Application errors, and (vi) Unknown factors.

Figure 4: Taxonomy of failures

The leaves of the taxonomy tree form the labels used in the supervised learning techniques. In addition to a text label representing the root cause, each leaf also stores additional information such as: (a) a description template to present the root cause to a Spark user in a way that she will easily understand, and (b) recommended fixes for this root cause.

The labels are associated with the logs in one of two ways. First, the root cause is already known when the logs are generated, as a result of injecting a specific root cause we have designed to produce an application failure in our lab framework. The second way in which a label is given to the logs for an application failure is when a Spark domain expert manually diagnoses the root cause of the failure.

Input Features: Once the logs are available, there are various ways in which the feature vector can be extracted from these logs (recall the overall approach in Figure 3). One way is to transform the logs into a bit vector (e.g., 1001100001). Each bit in this vector represents whether a specific message template is present in the respective logs. A prerequisite to this approach is to extract all possible message templates from the logs. A more traditional approach for feature vectors from the domain of information retrieval is to represent the logs for a failure as a bag of words. This approach is mostly similar to the bit vector approach except for a couple of differences: (a) each bit in the vector now corresponds to a word instead of a message template, and (b) instead of 0s and 1s, it is more common to use numeric values generated using techniques like TF-IDF.

More recent advances in ML have popularized vector embeddings. In particular, we use the Doc2Vec technique [11]. At a high level, these vector embeddings map words (or paragraphs, or entire documents) to multidimensional vectors by evaluating the order and placement of words with respect to their neighboring words. Similar words map to nearby vectors in the feature vector space. The Doc2Vec technique uses a three-layer neural network to gauge the context of the document and relate similar content together.

Figure 5: Feature vector generation

Once the feature vectors are generated along with the label, a variety of supervised learning techniques can be applied for automatic RCA. We have evaluated both shallow as well as deep learning techniques, including random forests, support vector machines, Bayesian classifiers, and neural networks. The overall results produced by our solution are promising as shown in Figure 5. (Only one result is shown due to space constraints.) In this figure, 14 different types of root causes of failure are injected into runs of various Spark applications in order to collect a large set of logs. Figure 5 shows the accuracy of the approach in Figure 3 to predict the correct root cause based on a 75-25% split of training and test data. The accuracy of prediction is fairly high.

We are currently enhancing the solution in some key ways. One of these is to quantify the degree of confidence in the root cause predicted by the model in a way that users will easily understand. Another key enhancement is to speed up the ability to incorporate new types of application failures. The bottleneck currently is in generating labels. We are working on active learning techniques [4] that nicely prioritize the human efforts required in generating labels. The intuition behind active learning is to pick the unlabeled failure instances that provide the most useful information to build an accurate model. The expert labels these instances and then the predictive model is rebuilt.

Automatic Fixes for Failed Applications: We did a deeper analysis of the Spark application failure logs available to us from more than 20 large-scale production clusters. The key findings from this analysis are:

• There is a "90-10" rule in the root cause of application failures. That is, in all the clusters, more than 90% of the failures were caused by less than 10 unique root causes.

• The two most common causes were: (i) the application fails due to out of memory (OOM) in some component; and (ii) the application fails due to timeout while waiting for some resource.

For application failures caused by OOM, we designed algorithms that, in addition to using examples of successful and failed runs of the application from history, can intelligently try out a limited number of memory configurations to get the application quickly to a running state; followed by getting the application to a resource-efficient running state.

As mentioned earlier, a Spark application runs one Driver container and one or more Executor containers. The application has multiple configuration parameters that control the allocation and usage of memory at the overall container level, and also at the level of the JVMs running within the container. If the overall usage at the container level exceeds the allocation, then the application will be killed by the resource management layer. If the overall usage at the Java heap level exceeds the allocation, then the application will be killed by the JVM.

The algorithm we developed to enable finding fixes automatically for OOM problems refines intervals based on successful and failed runs of the application. For illustration, let m represent the Executor container allocation. We define two variables, m_lo and m_hi, where m_lo is the maximum known setting of m that causes OOM; and m_hi is the minimum known setting of m that does not cause OOM. Given a run of the application that failed due to OOM while running with m = m_curr, we can update m_lo to:

m_lo = max(m_curr, m_lo)

Given a run of the application that succeeded while running with m = m_curr, we can update m_hi to:

m_hi = min(m_obs, m_hi)

Here, m_obs is the observed usage of m by the application in the successful run. At any point:

• A new run of the application can be done with m set to (m_lo + m_hi)/2

• m_hi is the most resource-efficient setting that is known to run the application successfully so far

The above approach is incomplete because the search space of configuration parameters to deal with OOM across the Driver, Executor, container, JVM, as well as a few other parameters that affect Spark memory usage is multi-dimensional. Space constraints prevent us from going into further details, but the algorithm from a related problem can be adapted to the OOM problem [5].

Figure 6: Automated tuning of a failed Spark application

Figure 6 shows an example of how the algorithm works in practice. Note that the first run is a failure due to OOM. The second run, which was based on a configuration setting produced by the algorithm, managed to get the application running successfully. The third run—the next in sequence produced by the algorithm—was able to run the application successfully, while also running it faster than the second run. The third run was faster because it used a more memory-efficient configuration than the second run. Overallocation of memory can make an application slow because of the large wait to get that much allocated. Note how the algorithm is able to automatically find configurations that run the application successfully while being resource efficient. Thereby, we can remove the burden of manually troubleshooting failed applications from the hands of users, enabling them to focus entirely on solving business problems with the big data stack.

3.2 Cluster Optimization

Performing cluster level workload analysis and optimization on a multiplicity of distributed systems in big data stacks is not straightforward. Key objectives often met in practice include performance management, autoscaling, and cost optimization. Satisfying such objectives is imperative for both on-premises and cloud deployments and can serve different classes of users like Ops and Devs altogether.

Operational Insights: Toward this end, we analyze the metrics collected and provide a rich set of actionable insights, as for example:

• Insights into application performance issues; e.g., determine whether an application issue is due to code inefficiency, contention with cloud/cluster resources, or hardware failure or inefficiency (e.g., slow node)

• Insights on cluster tuning based on aggregation of application data; e.g., determine whether a compute cluster is properly tuned at both a cluster and application level

• Insights on cluster utilization, cloud usage, and autoscaling.

We also provide users with tools to help them understand how they are using their compute resources, as for example, compare cluster activity between two time periods, aggregated cluster workload, summary reports for cluster usage, chargeback reports, and so on. A distinctive difference from other monitoring tools (e.g., Ambari, Cloudera Manager, Vertica [17]) is that we offer a single pane of glass for supporting the entire big data stack, not just individual systems, and also employ advanced analytics techniques to unravel problems and inefficiencies, whilst we also recommend concrete solutions to such issues.

One of the most difficult challenges in managing multi-tenant big data stack clusters is understanding how resources are being used by the applications running in the clusters. We are providing a forensic view into each cluster's key performance indicators (KPIs) over time and how they relate to the applications running in the cluster. For example, we can pinpoint the applications causing a sudden spike in the total CPU (e.g., vcores) or memory usage. And then, we enable drill down into these applications to understand their behavior, and whenever possible, we also provide recommendations and insights to help improve how the applications run.

In addition to that, we also provide cluster level recommendations to fine tune cluster wide parameters to maximize a cluster's efficiency based upon the cluster's typical workload. For doing so, we work as follows:

• Collect performance data of prior completed applications

• Analyze the applications w.r.t. the cluster's current configuration

• Generate recommended cluster parameter changes

• Predict and quantify the impact that these changes will have on applications that will execute in the future

Example recommendations involve parameters such as MapSplitSizeParams, HiveExecReducersBytesParam, HiveExecParallelParam, MapReduceSlowStartParam, MapReduceMemoryParams, etc.

Figure 7: An example set of cluster wide recommendations

Figure 7 shows example recommendations for tuning the size of map containers (top) and reduce containers (bottom) on a production cluster, and in particular the allocated memory in MB. In this example, at the cluster level, the default value of the memory for map tasks was set to 4096MB. Our analysis of historical data has identified alternative memory sizes for map containers. The figure shows a distribution of applications over different memory sizes shown as a histogram, along with a reward (here, the predicted number of memory savings shown as a green bar) and a calculated risk (here, the percentage of jobs that are predicted to run if the candidate value is applied, shown as a red bar). Based on these data, we make a recommendation to set the memory size to 2048MB and calculate the improvement potential: the recommended value could halve memory usage for 97% of the expected workload. Similar recommendations are made for the reduce containers shown at the bottom of Figure 7.

Figure 8: Example improvements of applying cluster level recommendations

Figure 8 shows example improvements of applying cluster level recommendations on a production cloud deployment of a financial institution: our cluster tuning enabled ∼200% more applications (i.e., from 902 to 1934 applications/day) to be run at ∼50% lower cost (i.e., from 641 to 341 vCore-Hours/day), increasing the organization's confidence in using the cloud.

Figure 9: Example resource utilization for two queues

Workload Analysis: Typically, how an application performs depends on what else is also running in the big data stack, altogether forming an application workload. A workload may contain heterogeneous applications in Hive, Spark SQL, Spark ML, etc. Understanding how these applications run and affect each other is critical. We analyze queue usage (various systems use different terms like 'queue' or 'pool' to characterize resource budget configurations) on a set of clusters and identify queue usage trends, suboptimal queue designs, workloads that run suboptimally on queues, convoys, 'problem'

8 Figure 10: Disk capacity forecasting applications (e.g., recognize and remediate excessive applica- 4. CONCLUSIONS tion wait times), ‘problem’ users (e.g., users who frequently In this paper, we attempted to bring attention to the per- run applications that reach max capacity for a long period), formance management requirements that arise in big data queue usage per application type or user or project, etc. stacks. We provided an overview of the requirements both Figure 9 shows exemplar resource utilization charts for at the level of individual applications as well as holistic clus- two queues over a time range. In this example, the work- ters and workloads. We also presented an architecture that load running in the root.oper queue does not use all the can provide automated solutions for these requirements and resources allocated, here VCores and Memory, whilst the discussed a few of the solutions. workload in the root.olap queue needs more resources; pend- The approach that we have presented here is complemen- ing resources (in purple) go beyond the resources allocated tary to a number of other research areas in the database (in black). Similar analysis can be done for other metrics Polystore and systems community such as HadoopDB [7], like Disk, Scheduling, and so on. [3], and hybrid flows [18] (which have addressed challenges Based on such findings, we generate queue level insights self- in stacks composed of multiple systems) as well as and recommendations including queue design/settings modi- [2, 16] (which have addressed similar prob- driving databases fications (e.g., change resource budget for a queue or max/min lems for traditional database systems like MySQL, Post- limits), workload reassignment to different queues (e.g., move greSQL, and Oracle). Related complementary efforts also in- an application or a workload from one queue to another), clude the application of machine learning techniques to data queue usage forecasting, etc. 
Any of these recommendations management systems and cloud databases, such as (a) ML could be applied to the situation shown in Figure 9. A typi- techniques for workload and resource management for cloud cal big data deployment involves 100s of queues and such a databases [13, 14], (b) a reinforcement learning algorithm task can be tedious. for elastic resource management that employs adaptive state We can enforce some of these recommendation using auto- space partitionining [12], (c) a self-managing controller for actions, which enable complex actionable rules on a mul- tuning multi-tenant DBMS based on learning techniques tiplicity of cluster metrics. Each rule consists of a logical for tenants’ behavior, plans, and history [6], and (d) dy- expression and an action. A logical expression is used to ag- namic scaling algorithms for scaling clusters of virtual ma- gregate cluster metrics and to evaluate the rule, and consists chines [15]. of two conditions: At Unravel, we are building the next generation perfor- mance management system by solving real-world challenges A prerequisite condition that causes a violation (e.g., • arising from the big data stack which is a gold mine of data number of applications running or memory used) for applied research, including AI and ML. We are working with many enterprises that have challenging problems; and A defining condition, who/what/when can cause a vi- • by helping them understand and address these problems, we olation (e.g., user, application) help them scale at the right cost. An action is a concrete, executable task such as kill an ap- 5. REFERENCES plication, move an application to a different queue, send an HTTP post, notify a user, and so on. [1] Companies using the big data stack. https://idatalabs. com/tech/products/apache-hadoop[Online; accessed Beside the current status of the big data stack Forecasting: 24-August-2018]. 
Forecasting: Beside the current status of the big data stack systems, enterprises need to be able to provision for resources, usage, cost, job scheduling, and so on. One advantage of our architecture is that it collects a plethora of historical operational and application metrics. These can be used for capacity planning with predictive time-series models (e.g., [20]). Figure 10 shows an example disk capacity forecasting chart; the black line shows actual utilization and the light blue line shows a forecast within an error bound.

Figure 10: Disk capacity forecasting
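As a toy illustration of the idea (a stand-in for, not a description of, the predictive time-series models such as [20] referenced above; the function and data are hypothetical), even a least-squares linear trend over historical disk usage yields a forecast with a crude error band:

```python
import statistics

def linear_forecast(history, horizon):
    """Fit y = a + b*t by least squares over utilization samples and project
    `horizon` steps ahead, attaching a +/- residual-stddev error band."""
    n = len(history)
    t_mean = (n - 1) / 2
    y_mean = sum(history) / n
    # Closed-form simple linear regression over t = 0..n-1.
    b = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(history)) / \
        sum((t - t_mean) ** 2 for t in range(n))
    a = y_mean - b * t_mean
    residuals = [y - (a + b * t) for t, y in enumerate(history)]
    err = statistics.pstdev(residuals)
    return [(a + b * t, err) for t in range(n, n + horizon)]

# Daily disk usage (TB) growing roughly 1 TB/day: forecast the next 3 days.
usage = [100, 101, 102, 103, 104, 105, 106]
for point, err in linear_forecast(usage, 3):
    print(f"{point:.1f} TB +/- {err:.1f}")
```

Real capacity planning would use seasonality-aware models with proper uncertainty intervals, but the input (historical metrics) and output (a forecast with an error bound, as in Figure 10) take the same form.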

4. CONCLUSIONS
In this paper, we attempted to bring attention to the performance management requirements that arise in big data stacks. We provided an overview of the requirements both at the level of individual applications as well as holistic clusters and workloads. We also presented an architecture that can provide automated solutions for these requirements and discussed a few of the solutions.

The approach that we have presented here is complementary to a number of other research areas in the database and systems community, such as HadoopDB [7], Polystore [3], and hybrid flows [18] (which have addressed challenges in stacks composed of multiple systems), as well as self-driving databases [2, 16] (which have addressed similar problems for traditional database systems like MySQL, PostgreSQL, and Oracle). Related complementary efforts also include the application of machine learning techniques to data management systems and cloud databases, such as (a) ML techniques for workload and resource management for cloud databases [13, 14], (b) a reinforcement learning algorithm for elastic resource management that employs adaptive state space partitioning [12], (c) a self-managing controller for tuning multi-tenant DBMSs based on learning techniques for tenants' behavior, plans, and history [6], and (d) dynamic scaling algorithms for scaling clusters of virtual machines [15].

At Unravel, we are building the next-generation performance management system by solving real-world challenges arising from the big data stack, which is a gold mine of data for applied research, including AI and ML. We are working with many enterprises that have challenging problems; by helping them understand and address these problems, we help them scale at the right cost.

5. REFERENCES
[1] Companies using the big data stack. https://idatalabs.com/tech/products/apache-hadoop [Online; accessed 24-August-2018].
[2] Oracle autonomous database cloud. https://www.oracle.com/database/autonomous-database.html [Online; accessed 24-August-2018].
[3] A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, and A. Silberschatz. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB, 2(1):922–933, 2009.
[4] S. Duan and S. Babu. Guided problem diagnosis through active learning. In 2008 International Conference on Autonomic Computing, ICAC 2008, June 2-6, 2008, Chicago, Illinois, USA, pages 45–54, 2008.
[5] S. Duan, V. Thummala, and S. Babu. Tuning database configuration parameters with iTuned. PVLDB, 2(1):1246–1257, 2009.
[6] A. J. Elmore, S. Das, A. Pucher, D. Agrawal, A. El Abbadi, and X. Yan. Characterizing tenant behavior for placement and crisis mitigation in multitenant DBMSs. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013, pages 517–528, 2013.
[7] V. Gadepally, P. Chen, J. Duggan, A. J. Elmore, B. Haynes, J. Kepner, S. Madden, T. Mattson, and M. Stonebraker. The BigDAWG polystore system and architecture. In 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016, Waltham, MA, USA, September 13-15, 2016, pages 1–6, 2016.
[8] H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB, 4(11):1111–1122, 2011.
[9] S. A. Jyothi, C. Curino, I. Menache, S. M. Narayanamurthy, A. Tumanov, J. Yaniv, R. Mavlyutov, I. Goiri, S. Krishnan, J. Kulkarni, and S. Rao. Morpheus: Towards automated SLOs for enterprise clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, pages 117–134, 2016.
[10] P. Kalmegh, S. Babu, and S. Roy. Analyzing query performance and attributing blame for contentions in a cluster computing framework. CoRR, abs/1708.08435, 2017.
[11] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1188–1196, 2014.
[12] K. Lolos, I. Konstantinou, V. Kantere, and N. Koziris. Elastic management of cloud applications using adaptive reinforcement learning. In 2017 IEEE International Conference on Big Data, BigData 2017, Boston, MA, USA, December 11-14, 2017, pages 203–212, 2017.
[13] R. Marcus and O. Papaemmanouil. WiSeDB: A learning-based workload management advisor for cloud databases. PVLDB, 9(10):780–791, 2016.
[14] R. Marcus and O. Papaemmanouil. Releasing cloud databases from the chains of performance prediction models. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings, 2017.
[15] J. Ortiz, B. Lee, and M. Balazinska. PerfEnforce demonstration: Data analytics with performance guarantees. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 2141–2144, 2016.
[16] A. Pavlo, G. Angulo, J. Arulraj, H. Lin, J. Lin, L. Ma, P. Menon, T. C. Mowry, M. Perron, I. Quah, S. Santurkar, A. Tomasic, S. Toor, D. V. Aken, Z. Wang, Y. Wu, R. Xian, and T. Zhang. Self-driving database management systems. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings, 2017.
[17] A. Simitsis, K. Wilkinson, J. Blais, and J. Walsh. VQA: Vertica Query Analyzer. In International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 701–704, 2014.
[18] A. Simitsis, K. Wilkinson, U. Dayal, and M. Hsu. HFMS: Managing the lifecycle and complexity of hybrid analytic data flows. In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 1174–1185, 2013.
[19] Z. Tan and S. Babu. Tempo: Robust and self-tuning resource management in multi-tenant parallel databases. PVLDB, 9(10):720–731, 2016.
[20] S. J. Taylor and B. Letham. Forecasting at scale. https://peerj.com/preprints/3190.pdf [Online; accessed 24-August-2018].
