{"title": "A Zero-Positive Learning Approach for Diagnosing Software Performance Regressions", "book": "Advances in Neural Information Processing Systems", "page_first": 11627, "page_last": 11639, "abstract": "The field of machine programming (MP), the automation of the development of software, is making notable research advances. This is, in part, due to the emergence of a wide range of novel techniques in machine learning. In this paper, we apply MP to the automation of software performance regression testing. A performance regression is a software performance degradation caused by a code change. We present AutoPerf \u2013 a novel approach to automate regression testing that utilizes three core techniques: (i) zero-positive learning, (ii) autoencoders, and (iii) hardware telemetry. We demonstrate AutoPerf\u2019s generality and efficacy against 3 types of performance regressions across 10 real performance bugs in 7 benchmark and open-source programs. On average, AutoPerf exhibits 4% profiling\noverhead and accurately diagnoses more performance bugs than prior state-of-the-art approaches. Thus far, AutoPerf has produced no false negatives.", "full_text": "A Zero-Positive Learning Approach for\n\nDiagnosing Software Performance Regressions\n\nMejbah Alam\n\nIntel Labs\n\nJustin Gottschlich\n\nIntel Labs\n\nmejbah.alam@intel.com\n\njustin.gottschlich@intel.com\n\nNesime Tatbul\n\nIntel Labs and MIT\n\nJavier Turek\n\nIntel Labs\n\ntatbul@csail.mit.edu\n\njavier.turek@intel.com\n\nTimothy Mattson\n\nIntel Labs\n\nAbdullah Muzahid\nTexas A&M University\n\ntimothy.g.mattson@intel.com\n\nabdullah.muzahid@tamu.edu\n\nAbstract\n\nThe \ufb01eld of machine programming (MP), the automation of the development\nof software, is making notable research advances. This is, in part, due to the\nemergence of a wide range of novel techniques in machine learning. In this paper,\nwe apply MP to the automation of software performance regression testing. A\nperformance regression is a software performance degradation caused by a code\nchange. We present AutoPerf \u2013 a novel approach to automate regression testing\nthat utilizes three core techniques: (i) zero-positive learning, (ii) autoencoders,\nand (iii) hardware telemetry. We demonstrate AutoPerf\u2019s generality and ef\ufb01cacy\nagainst 3 types of performance regressions across 10 real performance bugs in 7\nbenchmark and open-source programs. On average, AutoPerf exhibits 4% pro\ufb01ling\noverhead and accurately diagnoses more performance bugs than prior state-of-the-\nart approaches. Thus far, AutoPerf has produced no false negatives.\n\n1\n\nIntroduction\n\nMachine programming (MP) is the automation of the development and maintenance of software.\nResearch in MP is making considerable advances, in part, due to the emergence of a wide range of\nnovel techniques in machine learning and formal program synthesis [11, 12, 16, 37, 43, 44, 46, 47, 52,\n55, 62]. A recent review paper proposed Three Pillars of Machine Programming as a framework for\norganizing research on MP [25]. These pillars are intention, invention, and adaptation.\nIntention is concerned with simplifying and broadening the way a user\u2019s ideas are expressed to\nmachines. Invention is the exploration of ways to automatically discover the right algorithms to\nful\ufb01ll those ideas. Adaptation is the re\ufb01nement of those algorithms to function correctly, ef\ufb01ciently,\nand securely for a speci\ufb01c software and hardware ecosystem. 
In this paper, we apply MP to the\nautomation of software testing, with a speci\ufb01c emphasis on parallel program performance regressions.\nUsing the three pillars nomenclature, this work falls principally in the adaptation pillar.\nSoftware performance regressions are defects that are erroneously introduced into software as it\nevolves from one version to the next. While they do not impact the functional correctness of the\nsoftware, they can cause signi\ufb01cant degradation in execution speed and resource ef\ufb01ciency (e.g., cache\ncontention). From database systems to search engines to compilers, performance regressions are\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fcommonly experienced by almost all large-scale software systems during their continuous evolution\nand deployment life cycle [7, 24, 30, 32, 34]. It may be impossible to entirely avoid performance\nregressions during software development, but with proper testing and diagnostic tools, the likelihood\nfor such defects to silently leak into production code might be minimized.\nToday, many benchmarks and testing tools are available to detect the presence of performance\nregressions [1, 6, 8, 17, 42, 57], but diagnosing their root causes still remains a challenge. Exist-\ning solutions either focus on whole program analysis rather than code changes [15], or depend\non previously seen instances of performance regressions (i.e., rule-based or supervised learning\napproaches [20, 29, 33, 59]). Furthermore, analyzing multi-threaded programs running over highly\nparallel hardware is much harder due to the probe effect often incurred by traditional software pro\ufb01lers\nand debuggers [23, 26, 27]. Therefore, a more general, lightweight, and reliable approach is needed.\nIn this work, we propose AutoPerf, a new framework for software performance regression diagnostics,\nwhich fuses multiple state-of-the-art techniques from hardware telemetry and machine learning to\ncreate a unique solution to the problem. First, we leverage hardware performance counters (HWPCs)\nto collect \ufb01ne-grained information about run-time executions of parallel programs in a lightweight\nmanner [10]. We then utilize zero-positive learning (ZPL) [36], autoencoder neural networks [60],\nand k-means clustering [35] to build a general and practical tool based on this data. Our tool, AutoPerf,\ncan learn to diagnose potentially any type of regression that can be captured by HWPCs, with minimal\nsupervision.\nWe treat performance defects as anomalies that represent deviations from the normal behavior of\na software program. Given two consecutive versions of a program P , Pi and Pi+1, the main task\nis to identify anomalies in Pi+1\u2019s behavior with respect to the normal behavior represented by that\nof Pi. To achieve this, \ufb01rst we collect HWPC pro\ufb01les for functions that differ in Pi and Pi+1, by\nrunning each program with a set of test inputs. We then train autoencoder models using the pro\ufb01les\ncollected for Pi, which we test against the HWPC pro\ufb01les collected for Pi+1. 
Run instances where the autoencoder reconstruction error (RE) is above a certain threshold are classified as regressions. Finally, these regressions are analyzed to determine their types, causes, and locations in Pi+1.

Our framework enhances the state of the art along three dimensions:

• Generality: ZPL and autoencoders eliminate the need for labeled training data, while HWPCs provide data on any detectable event. This enables our solution to generalize to any regression pattern.

• Scalability: Low-overhead HWPCs are collected only for changed code, while training granularity can be adjusted via k-means clustering. This enables our solution to scale with data growth.

• Accuracy: We apply a statistical heuristic for thresholding the autoencoder reconstruction error, which enables our solution to identify performance defects with significantly higher accuracy.

In the rest of this paper, after some background, we first present our approach and then show the effectiveness of our solution with an experimental study on real-world benchmarks (the PARSEC [17] and Phoenix [57] benchmark suites) and open-source software packages (Boost, Memcached, and MySQL). With only 4% average profiling overhead, our tool can successfully detect three types of performance regressions common in parallel software (true sharing, false sharing, and NUMA latency), at consistently higher accuracy than two state-of-the-art approaches [21, 33].

2 Motivation

Industrial software development is constantly seeking to accelerate the rate at which software is delivered. Due to the ever increasing frequency of deployments, software performance defects are leaking into production software at an alarming rate [34]. Because this trend is showing no sign of slowing, there is an increasing need for the practical adoption of techniques that automatically discover performance anomalies and prevent their integration into production-quality software [54]. To achieve this goal, we must first understand the challenges that inhibit building practical solutions. This section discusses those challenges and their potential solutions.

2.1 Challenges: Diagnosing Software Performance Regressions

Detailed software performance diagnostics are hard to capture. We see two core challenges.

Figure 1: Example of performance regressions in parallel software.

Examples are limited. Software performance regressions can manifest in a variety of forms and frequencies. Due to this, it is practically impossible to exhaustively identify all of them a priori. In contrast, normal performance behaviors are significantly easier to observe and faithfully capture.

Profiling may perturb performance behavior. Software profiling via code instrumentation may cause perturbations in a program's run-time behavior. This is especially true for parallel software, where contention signatures can be significantly altered by even the most minute probe effect [26, 48] (e.g., a resource contention defect may become unobservable).

These challenges call for an approach that (i) does not rely on training data that includes performance regressions and (ii) uses a profiling technique that incurs minimal execution overhead (i.e., less than 5%) so as not to perturb a program's performance signature.
Next, we provide concrete examples of performance bugs that are sensitive to these two criteria.

2.2 Examples: Software Performance Regressions

Cache contention may occur when multiple threads of a program attempt to access a shared memory cache concurrently. It comes in two flavors: (i) true sharing, involving access to the same memory location, and (ii) false sharing, involving access to disjoint memory locations on the same cache line. For example, a true sharing defect in MySQL 5.5 is shown in Figure 1(a). Unfortunately, the developers' attempt to fix this issue caused a performance regression due to a false sharing defect. This defect, shown in Figure 1(b), was introduced into MySQL version 5.6 and led to more than a 67% performance degradation [9].

NUMA latency may arise in Non-Uniform Memory Access (NUMA) architectures due to a mismatch between where data is placed in memory and the CPU threads accessing it. For example, the streamcluster application of the PARSEC benchmark was shown to experience a 25.7% overall performance degradation due to NUMA [17].

These types of performance defects are generally challenging to identify from source code. An automatic approach can leverage HWPCs as features to identify these defects (more in Section 4.2).

2.3 A New Approach: Zero-Positive Learning Meets Hardware Telemetry

To address the problem, we propose a novel approach that consists of two key ingredients: zero-positive learning (ZPL) [36] and hardware telemetry [10].

ZPL is an implicitly supervised ML technique. It is a specific instance of one-class classification, where all training data lies within one class (i.e., the non-anomalous space). ZPL was originally developed for anomaly detection (AD). In AD terminology, a positive refers to an anomalous data sample, while a negative refers to a normal one, thus the name zero-positive learning. Any test data that sufficiently deviates from the negative distribution is deemed an anomaly. Thus, ZPL, if coupled with the right ML modeling technique, can provide a practical solution to the first challenge, as it does not require anomalous data.

Hardware telemetry enables profiling program executions using hardware performance counters (HWPCs). HWPCs are a set of special-purpose registers built into CPUs to store counts of a wide range of hardware-related activities, such as instructions executed, cycles elapsed, cache hits or misses, branch (mis)predictions, etc. Modern-day processors provide hundreds of HWPCs, and more are being added with every new architecture. As such, HWPCs provide a lightweight means for collecting fine-grained profiling information without modifying source code, addressing the second challenge.

Figure 2: Overview of AutoPerf

3 Related Work

There has been an extensive body of prior research in software performance analysis using statistical and ML techniques [15, 31, 38, 53, 58]. Most of the past ML approaches are based on traditional supervised learning models (e.g., Bayesian networks [20, 59], Markov models [29], decision trees [33]). A rare exception is the unsupervised behavior learning (UBL) approach of Dean et al., which is based on self-organizing maps [21]. Unfortunately, UBL does not perform well beyond a limited number of input features.
To the best of our knowledge, ours is the first scalable ML approach for software performance regression analysis that relies only on normal (i.e., no examples of performance regressions) training data.

Prior efforts commonly focus on analyzing a specific type of performance defect (e.g., false and/or true sharing cache contention [22, 33, 39–41, 51, 63], NUMA defects [42, 56, 61]). Some of these also leverage HWPCs like we do [14, 18, 28, 40, 56, 61]. However, our approach is general enough to analyze any type of performance regression based on HWPCs, including cache contention and NUMA latency. Overall, the key difference of our contribution lies in its practicality and generality. Section 5 presents an experimental comparison of our approach against two of the above approaches [21, 33].

4 The Zero-Positive Learning Approach

In this section, we present a high-level overview of our approach, followed by a detailed discussion of its important and novel components.

4.1 Design Overview

A high-level design of AutoPerf is shown in Figure 2. Given two versions of a software program, AutoPerf first compares their performance. If a degradation is observed, then the cause is likely to lie within the functions that differ in the two versions. Hence, AutoPerf automatically annotates the modified functions in both versions of the program and collects their HWPC profiles. The data collected for the older version is used for zero-positive model training, whereas the data collected for the newer version is used for inferencing based on the trained model. AutoPerf uses an autoencoder neural network to model the normal performance behavior of a function [60]. To scale with a large number of functions, training data for functions with similar performance signatures is clustered using k-means clustering, and a single autoencoder model per cluster is trained [35]. Performance regressions are identified by measuring the reconstruction error that results from testing the autoencoders with profile data from the new version of the program. If the error is sufficiently high, then the corresponding execution of the function is marked as a performance bug and its root cause is analyzed as the final step of the diagnosis.

4.2 Data Collection

Modern processors provide various hardware performance counters (HWPCs) to count low-level system events such as cache misses, instruction counts, and memory accesses [10]. AutoPerf uses the Performance Application Programming Interface (PAPI) to read the values of hardware performance counters [49]. For example, for the specific hardware platform that we used in our experimental work (see Section 5.1 for details), PAPI provides access to 50 different HWPCs. Many of these performance counters reflect specific performance features of a program running on that hardware. For example, Hit Modified (HITM) is closely related to cache contention [45]. Essentially, this counter is incremented every time a processor accesses a memory cache line that is modified in another processor's cache. Any program with true or false sharing defects will see a significant increase in the HITM counter's value. Similarly, the counter for off-core requests served by remote DRAM (OFFCORE_RESPONSE: REMOTE_DRAM) can be used to identify NUMA-related performance defects [42]. AutoPerf exploits these known features in its final root-cause analysis step.

To collect HWPC profiles of all modified functions, we execute both of the annotated program versions with a set of test inputs (i.e., regression test cases). Test inputs generally capture a variety of input sizes and thread counts. During each execution of an annotated function foo, AutoPerf reads HWPCs at both the entry and the exit points of foo, calculates their differential values, normalizes these values with respect to foo's instruction count and the thread count, and records the resulting values as one sample in foo's HWPC profile.
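As an illustration of this per-function profiling step, the following minimal Python sketch shows how a single execution of an annotated function could be turned into one normalized HWPC sample. The helper read_hwpcs() and the counter names used below are hypothetical placeholders standing in for a PAPI-backed counter reader; they are not part of AutoPerf or of PAPI itself.

    def read_hwpcs():
        # Hypothetical stand-in for a PAPI-backed counter reader. In a real setup this
        # would return the current raw values of the monitored hardware counters.
        return {"TOTAL_INSTRUCTIONS": 0.0, "HITM": 0.0, "REMOTE_DRAM": 0.0}

    def record_sample(fn, args, thread_count, profile):
        # Read counters at the entry and exit points of the annotated function,
        # and compute their differential values for this execution.
        before = read_hwpcs()
        result = fn(*args)
        after = read_hwpcs()
        delta = {name: after[name] - before[name] for name in before}
        # Normalize by the function's instruction count and by the thread count so that
        # samples taken under different inputs and thread configurations are comparable.
        instructions = max(delta.get("TOTAL_INSTRUCTIONS", 1.0), 1.0)
        sample = {name: value / (instructions * thread_count) for name, value in delta.items()}
        profile.append(sample)  # one sample in the function's HWPC profile
        return result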
4.3 Diagnosing Performance Regressions

AutoPerf uses HWPC profiles to diagnose performance regressions in a modified program. First, it learns the distribution of a function's performance based on the HWPC profile data collected from the original program. Then, it detects deviations in performance as anomalies based on the HWPC profile data collected from the modified program.

4.3.1 Autoencoder-based Training and Inference

Our approach to performance regression automation requires solving a zero-positive learning task. Zero-positive learning involves a one-class training problem, where only negative (non-anomalous) samples are used at training time [50]. We employ autoencoders to learn the distribution of the non-anomalous data [13]. At test time, we then exploit the autoencoder to discover any deviation that would indicate a sample from the positive class. The autoencoder model is a natural fit for our ZPL approach, since it is unsupervised (i.e., does not require labeled training data, as in one-class training) and it works well with multi-dimensional inputs (i.e., data from multiple HWPCs).

To formalize, let {x_i}, i = 1, ..., N_old, be a set of N_old samples obtained from profiling the old version of the function foo. Next, we train an autoencoder A_foo(x) = f(g(x)) such that it minimizes the reconstruction error over all samples, i.e., L(x_i, A_foo(x_i)) = Σ_i ||x_i − A_foo(x_i)||^2. During training, the autoencoder A_foo(x) learns a manifold embedding represented by its encoder g(x). Its decoder f(x) learns the projection back to sample space. Learning the manifold embedding is crucial for the autoencoder to reconstruct a sample with high fidelity.

Once the autoencoder is trained, AutoPerf collects an additional set of samples {z_i}, i = 1, ..., N_new, by profiling the newer version of function foo's code. Next, the system discovers anomalies by encoding and decoding the new samples z_i and measuring the reconstruction error, i.e.,

ε(z_i) = ||z_i − A_foo(z_i)||^2    (1)

If the reconstruction error for a sample z_i is above a certain threshold γ, i.e., ε(z_i) > γ, the sample is marked as anomalous, as it lies sufficiently distant from its back-projected reconstruction.
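For concreteness, the sketch below shows one possible realization of this training and inference procedure in Keras with TensorFlow (the framework we use for our implementation, see Section 5.1). The layer sizes, optimizer settings, and epoch count are illustrative assumptions rather than the exact configuration used in our experiments, and the threshold γ (gamma) is taken as given here; a statistical heuristic for choosing it is described in Section 4.3.2.

    import numpy as np
    from tensorflow import keras

    def build_autoencoder(n_counters, code_dim=8):
        # A small dense autoencoder A(x) = f(g(x)): the encoder g maps a sample of
        # HWPC values to a low-dimensional embedding, and the decoder f maps it back.
        inputs = keras.Input(shape=(n_counters,))
        hidden = keras.layers.Dense(16, activation="relu")(inputs)
        code = keras.layers.Dense(code_dim, activation="relu")(hidden)
        hidden = keras.layers.Dense(16, activation="relu")(code)
        outputs = keras.layers.Dense(n_counters)(hidden)
        model = keras.Model(inputs, outputs)
        # Mean squared error corresponds to minimizing sum_i ||x_i - A(x_i)||^2 during training.
        model.compile(optimizer="adam", loss="mse")
        return model

    def detect_regressions(x_old, z_new, gamma, epochs=100):
        # Train only on samples from the old (assumed normal) version of the function,
        # then flag new-version samples whose reconstruction error exceeds gamma (Eq. 1).
        autoencoder = build_autoencoder(x_old.shape[1])
        autoencoder.fit(x_old, x_old, epochs=epochs, batch_size=32, verbose=0)
        reconstructed = autoencoder.predict(z_new, verbose=0)
        errors = np.sum((z_new - reconstructed) ** 2, axis=1)
        return errors > gamma  # True marks a potentially regressed execution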
Figure 3: Histograms of reconstruction error ε for training samples {x_i} of two real datasets: (a) dedup [17] and (b) MySQL [9].

4.3.2 Reconstruction Error Threshold Heuristic

Success in detecting anomalous samples depends heavily on setting the right value for the threshold γ. If the value is too high, we may fail to detect many anomalous samples, raising the number of false negatives. If it is too low, AutoPerf will flag many non-anomalous samples as anomalous, increasing the number of false positives. Figures 3(a) and (b) show the reconstruction errors for the training samples of the dedup and MySQL datasets, respectively [9, 17]. Clearly, the difference between the histograms signals that naïvely setting a threshold would not generalize across datasets or even functions.

The skewness of the reconstruction error distributions across all test applications ranges from -0.46 to 0.08, and the kurtosis ranges from 1.96 to 2.64. Therefore, we approximate the reconstruction error's distribution with a Normal distribution and define a threshold γ(t) relative to the errors as

γ(t) = μ_ε + t · σ_ε    (2)

where μ_ε is the mean reconstruction error and σ_ε its standard deviation over the training samples {x_i}. The parameter t controls the level of thresholding. For example, with t = 2, the threshold provides (approximately) a 95% confidence interval for the reconstruction error.

To find the cause (and type) of a performance regression, we calculate the reconstruction error (RE) of each performance counter for every anomalous sample and sort the counters accordingly. We then take a majority vote across the anomalous samples: the counter that most often ranks first is taken to be the cause of the performance regression, and we report that counter and the corresponding regression type as the root cause.

4.3.3 Scaling to Many Functions via k-means Clustering

So far, we have focused on analyzing the performance of a single function that is modified with new code. In reality, the number of functions that change between versions of the code is typically much larger. For example, 27 functions are modified between the two versions of MySQL used in our experiments [9]. Training one autoencoder per such function is impractical, and the number of samples required to train them grows as well. To alleviate this, we group multiple functions into clusters and assign an autoencoder to each group. AutoPerf applies k-means clustering for this purpose [35]. It computes k clusters from the training samples. Then, we assign function f to cluster c if c contains more samples of f than any other cluster. For each cluster c, we build one autoencoder and train it using the training samples of all the functions that belong to that cluster. During inferencing, when we analyze profiling samples for a newer version of a function, we feed them to the autoencoder of the cluster to which that function belongs.

5 Experimental Evaluation

In this section, we (i) evaluate AutoPerf's ability to diagnose performance regressions and compare it with two state-of-the-art machine learning based approaches, Jayasena et al. [33] and UBL [21], (ii) analyze our clustering approach, and (iii) quantify profiling and training overheads.

Table 1: Diagnosis ability of AutoPerf vs. DT [33] and UBL [21]. TS = True Sharing, FS = False Sharing, and NL = NUMA Latency.
K, L, M are the # of executions (K = 6, L = 10, M = 20).\n\nNormal\nProgram\n\nAnomalous\nProgram\n\nUBL\n\nDefect\nType\n\nFalse Positive Rate\nAutoPerf DT\n0.0\nblackscholesL\n0.0\nbodytrackL\n0.0\ndedupL\nhistogramM\n0.0\nlinear_regressionM 0.0\n0.0\nreverse_indexM\n0.0\nstreamclusterL\n0.3\nBoostL\nMemcachedL\n0.0\n0.2\nMySQLL\n\nN/A 0.2\n0.8\n0.7\n0.2\n1.0\n0.0\n0.0\n0.3\n0.0\n0.4\n0.15\nN/A 0.6\n0.4\n1.0\n1.0\n0.4\n0.1\n1.0\n\nNL\nblackscholesK\nTS\nbodytrackK\nTS\ndedupK\nFS\nhistogramM\nlinear_regressionM FS\nFS\nreverse_indexM\nNL\nstreamclusterK\nFS\nBoostL\nTS\nMemcachedL\nFS\nMySQLL\n\nUBL\n\nFalse Negative Rate\nAutoPerf DT\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n\nN/A 0.0\n0.1\n0.17\n0.0\n0.0\n1.0\n0.1\n0.4\n0.35\n0.1\n0.05\nN/A 0.1\n0.2\n0.2\n0.4\n0.3\n0.8\n0.5\n\n5.1 Experimental Setup\n\nWe used PAPI to read hardware performance counter values [49], and Keras with TensorFlow to\nimplement autoencoders [19]. PAPI provides a total of 50 individualized and composite HWPCs. We\nread the 33 individualized counters during pro\ufb01ling as input features to AutoPerf. We performed all\nexperiments on a 12-core dual socket Intel Xeon R(cid:13) Scalable 8268 processor [3] with 32GB RAM.\nWe used 7 programs with known performance defects from the PARSEC [17] and the Phoenix [57]\nbenchmark suites. Additionally, we evaluated 3 open-source programs: Boost [2], Memcached [4],\nand MySQL [5].\n\n5.2 Diagnosis Ability\n\nWe experiment with 10 programs to evaluate AutoPerf. Two versions of source code for each program\nare used for these experiments: 1) a version without any performance defect; 2) a version where a\nperformance defect is introduced after updating one or more functions in the \ufb01rst version. We run\nthe \ufb01rst version n number of times. If a system reports x number of these runs as anomalous (i.e.,\npositive), we de\ufb01ne false positive rate as x/n. Similarly, we run the second version m number of\ntimes and de\ufb01ne false negative rate as x/m, where x is the number of anomalous runs detected as\nnon-anomalous. Each run of a program uses different inputs.\nAutoPerf\u2019s diagnosis results are summarized in Table 1. We experimented with 3 different types of\nperformance defects across 7 benchmark programs and 3 real-world applications. These are known\nperformance bugs (con\ufb01rmed by developers) in real-world and benchmark applications. A version\nof each application, for which corresponding performance defect is reported, is used for generating\nanomalous runs. AutoPerf detects performance defects in all anomalous runs. However, it reports 3\nfalse positive runs in Boost and 2 false positive runs in MySQL. Anomalies in Boost are detected\nin a function that implements a spinlock. It implements lock acquisition by iteratively trying to\nacquire the lock within a loop. Moreover, these runs are con\ufb01gured with increased number of threads.\nWe suspect that these false positive test runs experienced increased lock contention, which was not\npresent in training runs. This could be improved by increasing the variability of inputs for training\nruns. The two false positive runs in MySQL are reported in two functions. These are small functions\nwith reduced number of instructions, which could affect the accuracy of pro\ufb01ling at a \ufb01xed sampling\nrate.\nWe quantitatively compared AutoPerf with two state-of-the-art machine learning based approaches:\nJayasena et al. [33] and UBL [21]. Jayasena et al. 
uses a decision tree built from 12 performance counters to detect true sharing and false sharing defects (DT in Table 1). This approach is limited to detecting false sharing and true sharing defects; therefore, it cannot detect the NUMA performance defects in blackscholes and streamcluster. Moreover, [33] uses fixed ratios of various counters and therefore fails to detect all anomalous runs in 6 programs and reports false positive runs for all 8 programs.

We implemented UBL using a 120×120 self-organizing map (SOM) to detect performance anomalies. Table 1 shows that UBL reports a greater number of false positive runs for 7 programs and a greater number of false negative runs for 7 programs. The reduction in accuracy is caused by the SOM's limited ability to handle large variations in performance counter values. Overall, AutoPerf produces false positives only for Boost and MySQL, whereas the other approaches produce false positives or false negatives for nearly every program. We further evaluated the anomaly prediction accuracy of AutoPerf using standard receiver operating characteristic (ROC) curves. Figure 4 shows the ROC curves for Boost and MySQL. Although AutoPerf produces false positives for these two applications, the ROC curves show that it achieves better accuracy than UBL on both.

Figure 4: Diagnosis of false sharing defects in (a) Boost and (b) MySQL. Each plot shows the true positive and false positive rates of AutoPerf and the state-of-the-art approach UBL [21] for one application under different thresholds.

5.3 Impact of Clustering

To analyze the many functions that may change between code versions with a reduced number of autoencoders, AutoPerf groups similar functions into clusters and trains an autoencoder for each cluster. We evaluated whether this clustering reduces the accuracy of the system compared to using one autoencoder for each function.

One way to evaluate this is to test against a program with multiple performance defects in different functions. To achieve this, we performed a sensitivity analysis using a synthetic program constructed with seven functions. We created a modified version of this program by introducing a performance defect in each of these functions and evaluated the F1 score of AutoPerf with different numbers of clusters for these seven functions. AutoPerf achieves a reasonable F1 score (from 0.73 to 0.81) using one autoencoder per function. When it uses one autoencoder across all seven functions, the F1 score degrades significantly to 0.31. Using k-means clustering, we can achieve reasonable accuracy even without one autoencoder per function. As shown in Figure 5(a), accuracy (F1 score) increases as k increases from 2 to 3 to 4.

We also evaluated the effects of clustering in three real-world programs: Boost, Memcached, and MySQL. Figure 5(b) shows the accuracy for these programs using the F1 score. For Memcached, AutoPerf creates three clusters from eight candidate functions (i.e., changed functions). The F1 score after clustering equals the F1 score of an approach that uses one autoencoder per function. For the other two programs, Boost and MySQL, clustering results in a slightly reduced F1 score.
However, as shown in\nFigure 5(c), the clustering approach reduces overall training time of AutoPerf by 2.5x to 5x.\n\n5.4 Effectiveness of the Error Threshold\n\nWe evaluated the effectiveness of our threshold method for \u03b3 (t). We compared with a base approach\nof setting an arbitrary threshold based on the input vector x instead of reconstruction errors. This\narbitrary threshold, \u03b1 (t), implies that if the difference between the output and input vector length\nis more than t% of the input vector length x, it is marked as anomalous. We compared accuracy of\nAutoPerf with UBL and this base approach using the mean true positive rates and mean false positive\nrates of these approaches across 10 candidate applications listed in Table 1. Figure 6(a) shows\n\n8\n\n\f(a) Sensitivity to k\n\n(c) Impact on training time\nFigure 5: Impact of clustering, where k denotes the number of clusters (i.e., autoencoders)\n\n(b) Impact on accuracy\n\nthe accuracy of AutoPerf using arbitrary threshold and \u03b3 (t). We evaluated AutoPerf with different\nthresholds determined using equation (2), where values of t ranges from 0 to 3. AutoPerf achieves\ntrue positive rate of 1 and false positive rate of 0.05 using \u03b3 (t) at t = 2. For arbitrary threshold using\n\u03b1 (t), we experimented with increasing values of t from 0 to 55, at which point both true positive rate\nand false positive rate become 1. Figure 6 also shows the accuracy of UBL with different thresholds.\n\u03b3 (t) achieves increased accuracy compared to UBL and \u03b1 (t). Moreover, \u03b1 (t) performs even worse\nthan the best results from UBL.\n\n(a)\n\n(b)\n\nFigure 6: (a) Effect of error threshold, (b) Pro\ufb01ling overhead.\n\n5.5 Pro\ufb01ling and Training Overheads\n\nPro\ufb01ling of a program introduces performance overhead. However, AutoPerf uses HWPCs to\nimplement a lightweight pro\ufb01ler. The execution time of an application increases by only 4%, on\naverage, with AutoPerf. MySQL experiments results in the highest performance overhead of 7%\namong three real-world applications. AutoPerf monitors greater number of modi\ufb01ed functions\nin MySQL compared to the other two real-world applications: Memcached and Boost. We also\ncollected the training time of autoencoders. On average, it takes approximately 84 minutes to train an\nautoencoder. An autoencoder for MySQL, which models a cluster with many functions, takes the\nlongest training time, which is little less than 5 hours using our experimental setup (Section 5.1).\n\n6 Conclusion\n\nIn this paper, we presented AutoPerf, a generalized software performance analysis system. For\nlearning, it uses a fusion of zero-positive learning, k-means clustering, and autoencoders. For features,\nit uses hardware telemetry in the form of hardware performance counters (HWPCs). We showed that\nthis design can effectively diagnose some of the most complex software performance bugs, like those\nhidden in parallel programs. Although HWPCs are useful to detect performance defects with minimal\nperturbation, it can be challenging to identify the root cause of such bugs with HWPCs alone. Further\ninvestigation into a more expressive program abstraction, coupled with our zero-positive learning\napproach, could pave the way for better root cause analysis. With better root cause analysis, we might\nbe able to realize an automatic defect correction system for such bugs.\n\n9\n\n\fAcknowledgments\n\nWe thank Jeff Hammond for his suggestions regarding experimental setup details. 
We thank Mostofa\nPatwary for research ideas in the early stages of this work. We thank Pradeep Dubey for general\nresearch guidance and continuous feedback. We thank Marcos Carranza and Dario Oliver for their\nhelp improving the technical correctness of the paper. We also thank all the anonymous reviewers\nand area chairs for their excellent feedback and suggestions that have helped us improve this work.\n\nReferences\n[1] Apache HTTP server benchmarking tool. https://httpd.apache.org/docs/2.4/\n\nprograms/ab.html.\n\n[2] Boost C++ Library. https://www.boost.org/.\n[3] Intel Xeon Platinum 8268 Processor. https://ark.intel.com/.\n[4] Memcached: A Distributed Memory Object Caching System. https://memcached.org/.\n[5] MySQL Database. http://www.mysql.com/.\n[6] SysBench Benchmark Tool. https://dev.mysql.com/downloads/benchmarks.\n\nhtml.\n\n[7] MySQL bug 16504. https://bugs.mysql.com/bug.php?id=16504, 2006.\n[8] Visual Performance Analyzer.\n\nftp://ftp.software.ibm.com/aix/tools/\n\nperftools/SystempTechUniv2006/UnixLasVegas2006-A09.pdf, 2006.\n\n[9] Bug 79454::Inef\ufb01cient InnoDB row stats implementation. https://bugs.mysql.com/\n\nbug.php?id=79454, 2015.\n\n[10] IA-32 Architectures Software Developers Manual Volume 3b System Programming Guide, part\n\n2. Intel Manual, September 2016.\n\n[11] A. Adams, K. Ma, L. Anderson, R. Baghdadi, T.-M. Li, M. Gharbi, B. Steiner, S. Johnson,\nK. Fatahalian, F. Durand, and J. Ragan-Kelley. Learning to Optimize Halide with Tree Search\nand Random Programs. ACM Trans. Graph., 38(4):121:1\u2013121:12, July 2019.\n\n[12] M. B. S. Ahmad, J. Ragan-Kelley, A. Cheung, and S. Kamil. Automatically Translating Image\n\nProcessing Libraries to Halide. ACM Transactions on Graphics, 38(6), Nov 2019.\n\n[13] G. Alain and Y. Bengio. What Regularized Auto-Encoders Learn from the Data-Generating\n\nDistribution. Journal of Machine Learning Research, 15:3743\u20133773, 2014.\n\n[14] J. Arulraj, P.-C. Chang, G. Jin, and S. Lu. Production-run Software Failure Diagnosis via\nHardware Performance Counters. In Proceedings of the Eighteenth International Conference\non Architectural Support for Programming Languages and Operating Systems, ASPLOS \u201913,\npages 101\u2013112, New York, NY, USA, 2013. ACM.\n\n[15] M. Attariyan, M. Chow, and J. Flinn. X-ray: Automating Root-cause Diagnosis of Performance\nAnomalies in Production Software. In Proceedings of the 10th USENIX Conference on Operat-\ning Systems Design and Implementation, OSDI\u201912, pages 307\u2013320, Berkeley, CA, USA, 2012.\nUSENIX Association.\n\n[16] K. Becker and J. Gottschlich. AI Programmer: Autonomously Creating Software Programs\n\nUsing Genetic Algorithms. CoRR, abs/1709.05703, 2017.\n\n[17] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization\nand Architectural Implications. In Proceedings of the 17th International Conference on Parallel\nArchitectures and Compilation Techniques, PACT \u201908, pages 72\u201381, New York, NY, USA, 2008.\nACM.\n\n[18] M. Brocanelli and X. Wang. Hang Doctor: Runtime Detection and Diagnosis of Soft Hangs for\nSmartphone Apps. In Proceedings of the Thirteenth EuroSys Conference, EuroSys \u201918, pages\n6:1\u20136:15, New York, NY, USA, 2018. ACM.\n\n[19] F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.\n\n10\n\n\f[20] I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, and J. S. Chase. Correlating Instrumentation\nData to System States: A Building Block for Automated Diagnosis and Control. 
In Proceedings\nof the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume\n6, OSDI\u201904, pages 16\u201316, Berkeley, CA, USA, 2004. USENIX Association.\n\n[21] D. J. Dean, H. Nguyen, and X. Gu. UBL: Unsupervised Behavior Learning for Predicting\nPerformance Anomalies in Virtualized Cloud Systems. In Proceedings of the 9th International\nConference on Autonomic Computing, ICAC \u201912, pages 191\u2013200, New York, NY, USA, 2012.\nACM.\n\n[22] A. Eizenberg, S. Hu, G. Pokam, and J. Devietti. Remix: Online Detection and Repair of\nCache Contention for the JVM. In Proceedings of the 37th ACM SIGPLAN Conference on\nProgramming Language Design and Implementation, PLDI \u201916, pages 251\u2013265, New York, NY,\nUSA, 2016. ACM.\n\n[23] J. Gait. A Probe Effect in Concurrent Programs. Software Practice and Experience, 16(3):225\u2013\n\n233, March 1986.\n\n[24] T. Glek. Massive Performance Regression From Switching to GCC 4.5. http://gcc.gnu.\n\norg/ml/gcc/2010-06/msg00715.html.\n\n[25] J. Gottschlich, A. Solar-Lezama, N. Tatbul, M. Carbin, M. Rinard, R. Barzilay, S. Amarasinghe,\nJ. B. Tenenbaum, and T. Mattson. The Three Pillars of Machine Programming. In Proceedings\nof the 2Nd ACM SIGPLAN International Workshop on Machine Learning and Programming\nLanguages, MAPL 2018, pages 69\u201380, New York, NY, USA, 2018. ACM.\n\n[26] J. E. Gottschlich, M. P. Herlihy, G. A. Pokam, and J. G. Siek. Visualizing Transactional Memory.\nIn Proceedings of the 21st International Conference on Parallel Architectures and Compilation\nTechniques, PACT \u201912, pages 159\u2013170, New York, NY, USA, 2012. ACM.\n\n[27] J. E. Gottschlich, G. A. Pokam, C. L. Pereira, and Y. Wu. Concurrent Predicates: A Debugging\nTechnique for Every Parallel Programmer. In Proceedings of the 22nd International Conference\non Parallel Architectures and Compilation Techniques, PACT \u201913, pages 331\u2013340, Piscataway,\nNJ, USA, 2013. IEEE Press.\n\n[28] J. L. Greathouse, Z. Ma, M. I. Frank, R. Peri, and T. Austin. Demand-driven Software\nRace Detection Using Hardware Performance Counters. In 2011 38th Annual International\nSymposium on Computer Architecture (ISCA), pages 165\u2013176, June 2011.\n\n[29] X. Gu and H. Wang. Online Anomaly Prediction for Robust Cluster Systems. In Proceedings\nof the 2009 IEEE International Conference on Data Engineering, ICDE \u201909, pages 1000\u20131011,\nWashington, DC, USA, 2009. IEEE Computer Society.\n\n[30] S. Han, Y. Dang, S. Ge, D. Zhang, and T. Xie. Performance Debugging In The Large via Mining\n\nMillions of Stack Traces. In ICSE, pages 145\u2013155. IEEE, 2012.\n\n[31] L. Huang, J. Jia, B. Yu, B.-G. Chun, P. Maniatis, and M. Naik. Predicting Execution Time\nIn Proceedings of the 23rd\nof Computer Programs Using Sparse Polynomial Regression.\nInternational Conference on Neural Information Processing Systems - Volume 1, NIPS\u201910,\npages 883\u2013891, USA, 2010. Curran Associates Inc.\n\n[32] P. Huang, X. Ma, D. Shen, and Y. Zhou. Performance Regression Testing Target Prioritization\nvia Performance Risk Analysis. In Proceedings of the 36th International Conference on Software\nEngineering, ICSE 2014, pages 60\u201371. ACM, 2014.\n\n[33] S. Jayasena, S. Amarasinghe, A. Abeyweera, G. Amarasinghe, H. D. Silva, S. Rathnayake,\nX. Meng, and Y. Liu. Detection of False Sharing Using Machine Learning. In 2013 SC -\nInternational Conference for High Performance Computing, Networking, Storage and Analysis\n(SC), pages 1\u20139, Nov 2013.\n\n[34] G. Jin, L. 
Song, X. Shi, J. Scherpelz, and S. Lu. Understanding and Detecting Real-world\nPerformance Bugs. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming\nLanguage Design and Implementation, PLDI \u201912, pages 77\u201388, New York, NY, USA, 2012.\nACM.\n\n[35] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An\nEf\ufb01cient k-Means Clustering Algorithm: Analysis and Implementation. IEEE Trans. Pattern\nAnal. Mach. Intell., 24(7):881\u2013892, July 2002.\n\n11\n\n\f[36] T. J. Lee, J. Gottschlich, N. Tatbul, E. Metcalf, and S. Zdonik. Greenhouse: A Zero-Positive\nMachine Learning System for Time-Series Anomaly Detection. In Inaugural Conference on\nSystems and Machine Learning (SysML\u201918), Stanford, CA, USA, February 2018.\n\n[37] C. Lemieux, R. Padhye, K. Sen, and D. Song. PerfFuzz: Automatically Generating Pathological\nInputs. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing\nand Analysis, ISSTA 2018, pages 254\u2013265, New York, NY, USA, 2018. ACM.\n\n[38] J. Li, Y. Chen, H. Liu, S. Lu, Y. Zhang, H. S. Gunawi, X. Gu, X. Lu, and D. Li. Pcatch:\nAutomatically Detecting Performance Cascading Bugs in Cloud Systems. In Proceedings of\nthe Thirteenth EuroSys Conference, EuroSys \u201918, pages 7:1\u20137:14, New York, NY, USA, 2018.\nACM.\n\n[39] T. Liu and E. D. Berger. SHERIFF: Precise Detection and Automatic Mitigation of False Sharing.\nIn Proceedings of the 2011 ACM International Conference on Object Oriented Programming\nSystems Languages and Applications, OOPSLA \u201911, pages 3\u201318, New York, NY, USA, 2011.\nACM.\n\n[40] T. Liu and X. Liu. Cheetah: Detecting False Sharing Ef\ufb01ciently and Effectively. In Proceedings\nof the 2016 International Symposium on Code Generation and Optimization (CGO\u201916), pages\n1\u201311, 2016.\n\n[41] T. Liu, C. Tian, Z. Hu, and E. D. Berger. PREDATOR: Predictive False Sharing Detection. In\nProceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel\nProgramming, PPoPP \u201914, pages 3\u201314, New York, NY, USA, 2014. ACM.\n\n[42] X. Liu and J. Mellor-Crummey. A Tool to Analyze the Performance of Multithreaded Programs\non NUMA Architectures. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles\nand Practice of Parallel Programming, PPoPP \u201914, pages 259\u2013272, New York, NY, USA, 2014.\nACM.\n\n[43] C. Loncaric, M. D. Ernst, and E. Torlak. Generalized Data Structure Synthesis. In Proceedings\nof the 40th International Conference on Software Engineering, ICSE \u201918, pages 958\u2013968, New\nYork, NY, USA, 2018. ACM.\n\n[44] S. Luan, D. Yang, C. Barnaby, K. Sen, and S. Chandra. Aroma: Code Recommendation via\nStructural Code Search. Proc. ACM Program. Lang., 3(OOPSLA):152:1\u2013152:28, Oct. 2019.\n\n[45] L. Luo, A. Sriraman, B. Fugate, S. Hu, G. Pokam, C. J. Newburn, and J. Devietti. LASER:\nLight, Accurate Sharing dEtection and Repair. In 2016 IEEE International Symposium on High\nPerformance Computer Architecture (HPCA), pages 261\u2013273, March 2016.\n\n[46] S. Mandal, T. A. Anderson, J. Gottschlich, S. Zhou, and A. Muzahid. Learning Fitness Functions\n\nfor Genetic Algorithms, 2019.\n\n[47] R. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, and\nN. Tatbul. Neo: A Learned Query Optimizer. Proc. VLDB Endow., 12(11):1705\u20131718, July\n2019.\n\n[48] C. E. McDowell and D. P. Helmbold. Debugging Concurrent Programs. ACM Comput. Surv.,\n\n21(4):593\u2013622, Dec. 1989.\n\n[49] S. V. 
Moore. A Comparison of Counting and Sampling Modes of Using Performance Monitoring\n\nHardware. In In International Conference on Computational Science (ICCS 2002, 2002.\n\n[50] M. M. Moya and D. R. Hush. Network Constraints and Multi-objective Optimization for\n\nOne-class Classi\ufb01cation. Neural Networks, 9(3):463 \u2013 474, 1996.\n\n[51] M. Nanavati, M. Spear, N. Taylor, S. Rajagopalan, D. T. Meyer, W. Aiello, and A. War\ufb01eld.\nWhose Cache Line is It Anyway?: Operating System Support for Live Detection and Repair\nof False Sharing. In Proceedings of the 8th ACM European Conference on Computer Systems,\nEuroSys \u201913, pages 141\u2013154, New York, NY, USA, 2013. ACM.\n\n[52] L. Nelson, J. Bornholt, R. Gu, A. Baumann, E. Torlak, and X. Wang. Scaling symbolic\nevaluation for automated veri\ufb01cation of systems code with Serval. In 27th ACM Symposium on\nOperating Systems Principles (SOSP). ACM, October 2019.\n\n[53] T. H. Nguyen, B. Adams, Z. M. Jiang, A. E. Hassan, M. Nasser, and P. Flora. Automated Detec-\ntion of Performance Regressions Using Statistical Process Control Techniques. In Proceedings\nof the 3rd ACM/SPEC International Conference on Performance Engineering, ICPE \u201912, pages\n299\u2013310, New York, NY, USA, 2012. ACM.\n\n12\n\n\f[54] A. Nistor, T. Jiang, and L. Tan. Discovering, Reporting, and Fixing Performance Bugs. In 2013\n10th Working Conference on Mining Software Repositories (MSR), pages 237\u2013246, May 2013.\n[55] P. M. Phothilimthana, A. S. Elliott, A. Wang, A. Jangda, B. Hagedorn, H. Barthels, S. J.\nKaufman, V. Grover, E. Torlak, and R. Bodik. Swizzle Inventor: Data Movement Synthesis for\nGPU Kernels. In Proceedings of the Twenty-Fourth International Conference on Architectural\nSupport for Programming Languages and Operating Systems, ASPLOS \u201919, pages 65\u201378, New\nYork, NY, USA, 2019. ACM.\n\n[56] A. Rane and J. Browne. Enhancing Performance Optimization of Multicore Chips and Multichip\nNodes with Data Structure Metrics. In Proceedings of the 21st International Conference on\nParallel Architectures and Compilation Techniques, PACT \u201912, pages 147\u2013156, New York, NY,\nUSA, 2012. ACM.\n\n[57] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce\nfor Multi-core and Multiprocessor Systems. In HPCA \u201907: Proceedings of the 2007 IEEE\n13th International Symposium on High Performance Computer Architecture, pages 13\u201324,\nWashington, DC, USA, 2007. IEEE Computer Society.\n\n[58] L. Song and S. Lu. Statistical Debugging for Real-world Performance Problems. In Proceed-\nings of the 2014 ACM International Conference on Object Oriented Programming Systems\nLanguages & Applications, OOPSLA \u201914, pages 561\u2013578, New York, NY, USA, 2014. ACM.\n[59] Y. Tan, H. Nguyen, Z. Shen, X. Gu, C. Venkatramani, and D. Rajan. PREPARE: Predictive\nPerformance Anomaly Prevention for Virtualized Cloud Systems. In Proceedings of the 2012\nIEEE 32Nd International Conference on Distributed Computing Systems, ICDCS \u201912, pages\n285\u2013294, Washington, DC, USA, 2012. IEEE Computer Society.\n\n[60] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and Composing Robust\nFeatures with Denoising Autoencoders. In Proceedings of the 25th International Conference on\nMachine Learning, ICML \u201908, pages 1096\u20131103, New York, NY, USA, 2008. ACM.\n\n[61] R. Yang, J. Antony, A. Rendell, D. Robson, and P. Strazdins. 
Pro\ufb01ling Directed NUMA\nOptimization on Linux Systems: A Case Study of the Gaussian Computational Chemistry Code.\nIn Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium,\nIPDPS \u201911, pages 1046\u20131057, Washington, DC, USA, 2011. IEEE Computer Society.\n\n[62] X. Zhang, A. Solar-Lezama, and R. Singh. Interpreting Neural Network Judgments via Minimal,\nIn Proceedings of the 32Nd International Conference\nStable, and Symbolic Corrections.\non Neural Information Processing Systems, NIPS\u201918, pages 4879\u20134890, USA, 2018. Curran\nAssociates Inc.\n\n[63] Q. Zhao, D. Koh, S. Raza, D. Bruening, W.-F. Wong, and S. Amarasinghe. Dynamic Cache\nContention Detection in Multi-threaded Applications. In Proceedings of the 7th ACM SIG-\nPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE \u201911, pages\n27\u201338, New York, NY, USA, 2011. ACM.\n\n13\n\n\f", "award": [], "sourceid": 6217, "authors": [{"given_name": "Mejbah", "family_name": "Alam", "institution": "Intel Labs"}, {"given_name": "Justin", "family_name": "Gottschlich", "institution": "Intel Labs"}, {"given_name": "Nesime", "family_name": "Tatbul", "institution": "Intel Labs and MIT"}, {"given_name": "Javier", "family_name": "Turek", "institution": "Intel Labs"}, {"given_name": "Tim", "family_name": "Mattson", "institution": "Intel"}, {"given_name": "Abdullah", "family_name": "Muzahid", "institution": "Texas A&M University"}]}