Wednesday, October 28, 2009

Our Paper at Teradata Partners Conference

At the recent Teradata Partners conference, Teradata announced the Extreme Performance Appliance 4555, the first solid-state data warehouse and cloud computing device. In contrast to Oracle's database machine, which uses up to 5 terabytes of flash storage, Teradata is using solid-state drives, which improve performance for both read and write operations.

According to Scott Gnau, Teradata Chief Development Officer, the new appliance, based on multi-core Intel processor technology and the 64-bit SLES operating system, will allow scaling from seven to 200 terabytes of user data.

Teradata announced versions of its Teradata Express software for Amazon's Elastic Compute Cloud (EC2) and VMware Player. Teradata Express provides Teradata developers and testers access to a database at no charge. This announcement directly competes with Greenplum’s "Enterprise Data Cloud" strategy.

Planning and managing a DW environment is difficult. Teradata TASM simplifies the creation of rules to manage the performance of mixed workload environments, but it can still be difficult to select the right TASM parameters capable of satisfying the SLO for each workload. In our paper “Capacity Management and Optimization in TASM Environments”, co-authored with Doug Brown, we demonstrated that:

• It is easy to change TASM settings, but it is difficult to decide how to change values to satisfy SLGs for each workload.
• Modeling and optimization technology can be used to justify strategic capacity planning and tactical performance management decisions and set TASM rules to satisfy workload SLGs.
• Workload characterization and performance prediction results can be used to justify realistic SLGs, set throttling, priorities and resource allocation TASM rules and organize a continuous process of proactive Service Level Management.

Progress in technology gives decision makers many options for planning, managing and controlling the performance of critical applications supporting business processes. Even with the currently available tools, it is still difficult to evaluate different options and make the right decisions. The role of modeling and optimization is to automate the evaluation process, provide information to justify capacity planning, performance management and workload management decisions, and enable verification of actual results against expectations, while helping define a process for continuous proactive performance management.

Our Presentation at Oracle Open World 2009

Over 40,000 people from over 120 countries attended Oracle Open World 2009 earlier this month. One of the most important announcements during the show was Oracle OLTP DB Machine V2. Oracle conducted several benchmark tests and demonstrated excellent performance for both OLTP and BI DW applications.

On Tuesday, I presented a paper that I co-authored with Alex Lupersolsky on “Modeling and Optimization for Multi-tier Virtualized Oracle Environments”. In this paper, we reviewed the challenges of planning and managing a complex environment where it is easy to add hardware and to change software parameters controlling workload concurrency, priorities, and allocation of CPU and memory resources, but difficult to make decisions that will satisfy SLOs effectively.

We also reviewed several case studies illustrating the impact of workload growth and evaluating different options, including creation of RAC and an Oracle OLTP Database Machine V2.

We analyzed modeling results predicting the impact of migrating ETL, OLTP, BI and archiving workloads to Oracle DB Machine V2. We used the following measurement data to build the model:

1) For each RAC node we used performance measurement data contained in GV$ views and Oracle OEM Grid Control:
• total physical CPU utilization
• the number of CPUs used
• total I/O rate in IOps and KBps
• the number of OS-visible disks
• read/write ratio
• average I/O operation response time

2) For each database instance per workload element (User/Program/Machine/Module):
• #executions
• total or average server response time per execution
• average number of parallel sessions (client sessions existing at the same time)
• parsing and execution CPU time consumed
• # physical IO operations with storage
• GV$ views contain information about master and slave sessions running in the same or different instances, allowing us to estimate the average intra-request parallelism and the average amount of data transferred between master and slave sessions through a "node interconnect"

3) For each Exadata cell:
• arrival rate/throughput in number of SQL requests/hour,
• average response time,
• CPU utilization,
• number of logical and physical I/Os per hour per User/Program/Machine/Module

We discussed how modeling and optimization can be used to compare alternatives, justify and verify operational workload management, tactical performance tuning and strategic capacity planning decisions to ensure SLO support for the critical workloads.

We illustrated the importance of workload management. Without any constraints, low priority ETL workloads can monopolize resources. Workload management, database tuning and hardware configuration changes can all improve performance for one workload, but they also carry the risk of moving rather than eliminating bottlenecks and negatively affecting other workloads. Strategic capacity planning, tactical performance management and operational workload management decisions should take into consideration the interdependence between servers and workloads as well as virtualization overhead.

It is impossible to manually evaluate all of the possible permutations of changes in concurrency, priority or resource allocation, database tuning or hardware upgrade options. We demonstrated how comparing the actual with expected results, based on modeling and optimization, enables organizations to practice continuous, proactive service level management.

Monday, August 31, 2009

Challenges of Teradata Workload Management in TASM Environment

It is easy to change Teradata Active System Management (TASM) parameters affecting workload priorities, concurrency and resource allocation settings, but it is difficult to decide how to change values to satisfy Service Level Goals (SLGs) for each workload.

Modeling and optimization technology can be used to justify not only strategic capacity planning and tactical performance management decisions, but also the operational workload management TASM parameters needed to satisfy workload SLGs.







Let's review how workload characterization and performance prediction results can be used to justify realistic SLGs, set Concurrency/Throttling, Priorities and Resource Allocation TASM rules and organize continuous proactive Service Level Management.

Reducing the concurrency (throttling) level reduces the number of concurrently processed requests (the Multi-Programming Level, or MPL), but it increases the number of requests waiting for a thread, as shown in the figure below:






One of the challenges is to find, for each workload, an approximation of the probability distribution of the number of requests in the system, the number of requests waiting for service and the number of requests being processed.
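As a rough illustration of such a distribution, the sketch below treats a throttled workload as an M/M/c queue in which c plays the role of the MPL limit. This is a textbook simplification and not a description of how TASM works internally: Poisson arrivals, exponential service times and a per-request service rate that does not depend on the MPL are all assumptions, and the function and parameter names are hypothetical.

```python
from math import factorial


def throttle_queue(arrival_rate, service_rate, mpl):
    """M/M/c approximation of a throttled workload; c is the MPL limit."""
    load = arrival_rate / service_rate      # offered load, in busy slots
    rho = load / mpl                        # utilization of the MPL slots
    if rho >= 1.0:
        raise ValueError("unstable: arrival rate exceeds throttled capacity")

    # Probability that the system is empty (standard M/M/c result).
    below = sum(load ** k / factorial(k) for k in range(mpl))
    at_or_above = load ** mpl / (factorial(mpl) * (1.0 - rho))
    p0 = 1.0 / (below + at_or_above)

    def prob(n):
        """Probability of exactly n requests in the system."""
        if n < mpl:
            return p0 * load ** n / factorial(n)
        return p0 * load ** n / (factorial(mpl) * mpl ** (n - mpl))

    p_wait = at_or_above * p0               # Erlang C: an arrival has to wait
    waiting = p_wait * rho / (1.0 - rho)    # mean requests held by the throttle
    return {"prob": prob, "p_wait": p_wait,
            "waiting": waiting, "in_service": load}


if __name__ == "__main__":
    # Assumed workload: 3 requests/s arriving, 1 request/s served per slot.
    for mpl in (4, 6, 8):
        m = throttle_queue(arrival_rate=3.0, service_rate=1.0, mpl=mpl)
        print(mpl, round(m["waiting"], 3), round(m["in_service"], 3))
```

In this simplified model the mean number of requests in service stays equal to the offered load, so a tighter MPL shows up as a longer wait queue rather than as less work being done. A real system, where service time depends on contention, needs the kind of calibrated model discussed in this post.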

Below are performance prediction results illustrating how throttling the Batch workload can improve the performance of other workloads.




Changing the priority of one workload can improve its performance, but negatively affect the performance of other workloads.




One approach is to reduce the priority of non-critical workloads that use an excessive amount of resources.




A hardware upgrade or a change of the DBMS or OS release can change the balance of resource usage, and it requires reevaluation of the workload management TASM parameters.



Below are performance prediction results illustrating the impact of the proposed hardware upgrade and of changes to the workload management TASM parameters.


  • As we can see, the challenge in Teradata workload management is to coordinate the selection of TASM parameters to satisfy SLGs for each workload.
  • Modeling and optimization technology can be used to justify strategic capacity planning and tactical performance management decisions and set TASM rules to satisfy workload SLGs.
  • Workload characterization and performance prediction results can be used to justify realistic SLGs, set Throttling, Priorities and Resource Allocation TASM rules and organize continuous proactive Service Level Management.

Sunday, August 23, 2009

Hot Summer

During the last several months, we've seen a significant burst of activity. Many customers, in spite of budget cuts, are starting to evaluate how to streamline and optimize their IT operations. In the next couple of postings, I will describe several examples illustrating how analytic modeling technology is used to justify movement of workloads and data from one system to another, how modeling technology is used to reduce the risk of performance surprises during implementation of new applications, and why planning of hardware upgrades, OS changes and migration to a new release of the DBMS should include reevaluation of the workload management rules. Next week I will be on vacation and plan to finish several papers.



I am working on a paper for Oracle Open World 2009: "Modeling and Optimization of Virtualized Multi-Tier Distributed Environment". We will review the challenges of planning and managing complex multi-tier virtualized distributed environments with many interdependent servers supporting multiple workloads.



Any change in workload management, database tuning, or hardware can improve performance for one workload while also moving one or more bottlenecks to another server on another tier and negatively affecting other workloads, for a variety of reasons:



  • There are many parameters you can change, including concurrency, priority or resource allocation by workload; you can also change the database design, create new indexes or materialized views, or upgrade the hardware configuration

  • It is impossible to evaluate all possible permutations of parameters

  • We will discuss how modeling technology can answer specific "what if" questions

  • We will also review how optimization technology iteratively and intelligently generates "what if" questions for the modeling engine to find what should be changed within workload management, performance tuning and hardware upgrades to satisfy SLOs for critical workloads

  • We will also review how comparison of the actual results (after the change) with expected results enables organizations to implement a continuous proactive performance management process



Another paper I am preparing, for the upcoming Teradata Partners Conference, is titled "Teradata Infrastructure Optimization in TASM Environment". It covers the application of modeling and optimization to workload management and the creation of a continuous, closed loop proactive performance management process. In this paper we will discuss:


  • Challenges of setting workload management Teradata TASM parameters

  • The role of modeling and optimization in finding optimal workload management parameters to meet Service Level Goals for each workload

  • Workload characterization in a TASM environment

  • Strategic capacity planning in a TASM environment

  • Tactical performance management in a TASM environment

  • Operational workload management in a TASM environment

  • How to optimize the selection of TASM throttling, priority and resource allocation rules based on SLG for each workload

  • How to use performance prediction results to organize a continuous, closed loop proactive performance management process in a TASM environment



For CMG 2009 I am preparing a half-day session titled "Hands on Workshop on Modeling and Optimization in Virtualized Multi-tier Environments". This is an intensive "hands on" workshop for performance management professionals who would like to learn how to build and apply analytic models to proactively manage the performance of applications in virtualized multi-tier environments based on VMware, WebLogic and WebSphere Application Servers as well as Oracle, DB2, Teradata and SQL Server Database Servers. During the workshop, attendees will learn how to build and apply analytic models to predict the impact of workload and database size growth, of implementing new applications, of adding or moving VMs and of upgrading hardware. We will not use our commercial modeling tool, BEZVision; instead I will teach attendees how to use an Excel spreadsheet with prepared exercises to illustrate how to perform workload characterization, build simple analytic queueing network models, and apply modeling results to justify strategic capacity planning, tactical performance management and operational workload management recommendations. At the end of the workshop, participants will summarize results and will be ready to present a report with capacity management recommendations.



In addition, Tim R. Norton invited me to participate in a Panel Discussion at CMG titled "Hardware’s Cheap so Why Do Modeling?". I will be preparing some materials for this panel as well. The cost of hardware is rapidly trending down while other costs are rising even more rapidly. The result of this interplay is that cost saving opportunities are shrinking while the analysis takes increasingly more time, effort and money. This panel of world-renowned experts in application and systems modeling will candidly discuss this and other questions related to the future of modeling as a tool to achieve business objectives. Panel discussion areas:


  • Hardware’s cheap so why do modeling at all?

  • Does analysis cost more than just buying the hardware?

  • How close is good enough?

  • Hardware’s evermore powerful so why try for prediction precision?

  • Business vs. Math: What’s the trade-off between political costs and technical value?

  • How can the modeler find the tipping-point?

  • How does application and infrastructure complexity affect the value of modeling?

  • Is there such a thing as a “simple” model anymore?

  • Is modeling headed to the clouds?

  • Is traditional modeling at odds with the utility model of cloud computing?

  • Where’s the value as datacenters move toward commodity pricing and “on demand” capacity?

  • What’s driving the costs du jour?

  • Can a modeling analysis effort be successful before it is superseded by the next management priority?

  • How can modeling optimize multiple mutually exclusive objectives?



My wife does not know yet, but if I have time left between hiking and finishing papers, I have an obligation to prepare an abstract for CMG on a Late Breaking paper with Charlie Gary on "New application infrastructure modeling and optimization"

Wednesday, July 15, 2009

Predicting New Application Implementation Impact

Business people, application developers and IT management are all concerned that, even after thorough stress testing, new applications in production environments will not perform as expected and may negatively impact existing production applications. Indeed, HP LoadRunner, JMeter and other stress testing software can evaluate the impact of increasing the number of users in pre-production environments, but they cannot evaluate the impact of implementing new applications on systems with a different architecture or different hardware, software and DBMS platforms. For example, a significant increase in the volume of data, a change of the policy and rules of distribution of resources in virtualized application server environments (priorities, concurrency and resource allocation) and changes in the workload management policies of DBMS servers can all shift bottlenecks from one tier or server to another and negatively affect the performance of new and existing applications.

Oracle Real Application Testing (RAT) allows you to capture, analyze and replay production transactions on a small test system to evaluate the impact of upgrades and system changes. These include a new OS or DBMS patch or version, performance tuning, parameter and schema changes, and configuration changes such as conversion from a single instance to RAC or ASM.

DBAs can test and upgrade data center infrastructure components. In fact, the goal of RAT is to assist DBAs in testing and identifying the full impact of upgrades and system changes and include them in a certification process.

Value of new application certification

· Identify potential problems with a new application and justify the changes required to ensure that the new application will perform well and that existing applications will still meet their SLOs after the new application is implemented
· Organize collaboration between business people, application developers and IT management in setting realistic SLOs, negotiating SLAs and organizing proactive SLM
· Provide a basis for comparing actual with expected results and organizing a continuous Proactive Performance Management (PPM) process during the application life cycle

Sunday, May 17, 2009

Collaborate09 IOUG Conference

In April I presented a paper on Modeling and Optimization for Multi-Tier Virtualized Oracle Environments at the Collaborate09 IOUG conference in Orlando. In spite of the bad economy and the danger of a swine flu epidemic, Collaborate09 attracted about 4500 people and over 200 exhibitors. The conference had many tracks divided into 3 major sub-conferences: the Independent Oracle Users Group (IOUG), the Oracle Applications Users Group (OAUG) and the Quest User Group, which unites JD Edwards and PeopleSoft users.

I attended Oracle’s presentations on new developments in Oracle Enterprise Manager and Grid Control. These tools contain valuable information for modeling and performance optimization of multi-tier virtualized distributed environments.
I had an opportunity to see new developments in the systems management area presented by Oracle, IBM, CA, HP and BMC. All of them have repositories containing performance measurement data with valuable information for building and calibrating analytical models.
Several speakers presented papers about challenges in planning and managing virtualization. VMware introduced performance measurement data characterizing the hypervisor overhead. Several presenters described their experience in virtualizing DBMS servers.

Customers like the fact that virtualization can potentially reduce the cost of hardware and software by more than 50%.

We saw attempts to virtualize DBMS servers supporting test and development environments, where the performance level is not so critical. Most customers are still skeptical about virtualizing DBMS servers in production environments.

Saturday, March 21, 2009

MODELING AND OPTIMIZATION IN VIRTUALIZED MULTI-TIER DISTRIBUTED ENVIRONMENT

At the end of April, Cisco will announce their Unified Computing System (UCS), which will compete with HP, IBM and Dell. This is part of Cisco's plan of partnering with EMC. EMC is planning to announce a new Symmetrix and VMware device. VMware is also planning to announce VMware vSphere 4.0 at the same time, on April 21. (EMC's announcement was actually reported by the Boston Globe on April 14.) Cisco is also buying Tidal Software, which focuses on job scheduling, application performance management, and automation software products.

It is clear that virtualization solutions are becoming cheaper, but complexity of planning and management of the multi-tier virtualized environment increases the risk of surprises. Systems management tools will play an important role in competition between virtualization and cloud computing solution providers.

In order to avoid surprises, you will need to know what the impact of virtualization and cloud computing is on the performance of your applications. Analytical queueing network models can be used to evaluate capacity planning, performance tuning and workload management options, provide reasonable expectations and justify proactive performance management decisions.

We will review how prediction results can be used to set realistic Service Level Objectives, find the most effective solutions, set expectations, verify the results and organize proactive Service Level Management. We will discuss how performance prediction can be used to find the right candidates for virtualization, justify hardware and software upgrades for the application tier and DBMS tier, optimize VM migration, predict the impact of new VMs and new workload implementation, set optimum level of concurrency, priorities, and resource allocation for each workload to support critical workloads’ SLOs with minimum cost.

Virtualization in a Multi-Tier Distributed Environment
Virtualization can reduce cost, but hypervisor overhead can negatively affect performance. As a result, not all applications are good candidates for virtualization. For example, applications with high I/O rate can have significant performance degradation after virtualization. Hypervisor overhead that depends on the number of VMs and workload parameters can affect performance of all applications.

Workload and database size growth, implementation of new applications and the addition of new VMs can increase the hypervisor’s CPU, memory and I/O overhead and can negatively affect application performance.

We will review how modeling and optimization technology can be used to evaluate options and justify strategic capacity planning, tactical performance management and operational workload management decisions, verify results and enable organization of continuous proactive performance management.

Response time in the multi-tier environment includes service time and queueing time for CPU, I/O and interprocessor communication in application servers and DBMS servers, plus different types of delays caused by limited concurrency. Workloads have different resource utilization profiles.

Complexity of requests, volume of data and processing speed all affect service time. The virtualization overhead of managing VMs affects CPU service time and elongates I/O response time due to hypervisor scheduling overhead.

Increase in number of users, implementation of new applications, and concurrency limitations increase contention for resources and affect queueing time.

Workloads are interdependent because they compete for the shared resources, and changing the priority of one workload can improve its response time, but it can increase queueing time and response time for others.
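The priority trade-off can be illustrated with a much simpler model than the ones used in our work. The sketch below applies the classical non-preemptive priority formula for a single M/M/1 server (Cobham's formula); it is an assumption-laden stand-in, not the dispatching mechanism of any particular hypervisor or DBMS, and the workload rates are made up for the example.

```python
def priority_waits(classes):
    """Mean queueing time per class, highest priority first.

    classes: list of (arrival_rate, service_rate) tuples for one M/M/1
    server with non-preemptive priorities (Cobham's formula).
    """
    # Mean residual service seen by an arrival; for exponential service
    # E[S^2] = 2 / mu^2, so lam * E[S^2] / 2 = lam / mu^2.
    residual = sum(lam / mu ** 2 for lam, mu in classes)
    waits = []
    higher = 0.0                      # utilization of strictly higher classes
    for lam, mu in classes:
        upto = higher + lam / mu      # utilization including this class
        if upto >= 1.0:
            raise ValueError("server is saturated")
        waits.append(residual / ((1.0 - higher) * (1.0 - upto)))
        higher = upto
    return waits


if __name__ == "__main__":
    critical = (0.2, 1.0)   # hypothetical (arrival rate, service rate)
    batch = (0.5, 1.0)
    print("critical first:", priority_waits([critical, batch]))
    print("batch first:   ", priority_waits([batch, critical]))
```

Swapping the order of the two workloads shortens the queueing time of whichever one is promoted and lengthens it for the other, while the utilization-weighted sum of the waits stays the same. That conservation property is one way to see why a priority change redistributes delay rather than removing it.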

Let’s review the application of modeling and optimization to the simple configuration shown in Figure 1. Application servers have unbalanced usage of resources by Java EE applications. Java EE applications generate SQL requests accessing data from Oracle DBMS servers. We will predict the impact of virtualization, replacing the servers with VMs placed on one physical server. Each JVM running the application server software has a limited number of execution threads and a limited number of connections to the databases. The JVM thread pool size and connection pool size limit the number of requests that can access the DBMS concurrently.


Figure 1. An example of Server Consolidation and Virtualization. Two physical application servers are replaced by two VMs in one physical host.

In order to support workload growth and implementation of new applications, you can add a VM to an existing physical server or create a cluster of application servers and place the new VM in a new physical server. Decisions about migration of VMs between physical servers, change of a VM’s priority, tuning decisions, and change of concurrency should take into consideration that all components of the system are interdependent, and a change in one place can move a bottleneck and affect other workloads.

VIRTUALIZATION IS THE FOUNDATION OF CLOUD COMPUTING

Virtualization is the first step toward Cloud Computing. An individual VM can run on a desktop or can be moved to a private Cloud inside the Computer Center where the VM will run on a so called “software mainframe” – a shared, high availability, high performance computing platform based on distributed physical machines. For example, VMware VMotion can move a hosted operating environment from one physical machine to another.

Workloads in this environment can then be moved from an in-house private Cloud to an Internet Accessible Cloud provider. New capacity can be added to the VM runtime environment when needed during high peak processing within the physical server, or VMs can be moved to a bigger physical machine using VMotion, and finally, the total physical capability of the Cloud hosting environment can be increased as well.

The customer’s challenge is to decide what type of workloads should be placed on an in-house private cloud and which applications should be moved to the Internet accessible cloud provider, and which provider should be selected to support SLO with minimum cost.

The provider’s challenge is to manage concurrently different workloads, dynamically move VMs and reallocate resources depending on SLO and actual activity within VMs. New capacity can be added to the VM run time environment during high peak processing within the physical server. VMs can be moved to the bigger physical machine. The total physical capability of the Cloud hosting environment can be increased as well.


Figure 2. Virtualization is a first step toward Cloud Computing. Within the data center there are traditional computing resources and private clouds. Applications can be moved back and forth to an external Cloud Host Environment. One of the challenges is to find candidates for virtualization. Another challenge is workload management and migration of VMs between physical hosts within the private cloud of the data center. And finally, a decision has to be made about which applications should be moved to an external cloud host environment, and when.

Challenges of modeling virtualized multi-tier environments

Closed queueing network models based on MVA algorithms can be applied to modeling Java EE applications in multi-tier distributed environments with Oracle DBMS.
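For readers unfamiliar with MVA, here is a minimal sketch of the textbook exact Mean Value Analysis recursion for a single-class closed network with queueing stations and a think time. It is only the basic algorithm, not the modified multi-class version used in our models, and the service demands in the example are invented.

```python
def mva(demands, think_time, population):
    """Exact single-class MVA for a closed queueing network.

    demands: service demand (seconds per request) at each queueing station,
             for example CPU and disk.
    think_time: delay between requests from the same user, in seconds.
    population: number of concurrent users or sessions.
    """
    queue = [0.0] * len(demands)        # mean queue length at each station
    response, throughput = 0.0, 0.0
    for n in range(1, population + 1):
        # An arriving request sees the queue left by n - 1 other requests.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        response = sum(residence)
        throughput = n / (think_time + response)
        queue = [throughput * r for r in residence]   # Little's law
    utilization = [throughput * d for d in demands]
    return response, throughput, utilization


if __name__ == "__main__":
    # Hypothetical demands: 50 ms of CPU and 80 ms of disk per request.
    for users in (1, 10, 50):
        r, x, u = mva([0.05, 0.08], think_time=1.0, population=users)
        print(users, round(r, 3), round(x, 2), [round(v, 2) for v in u])
```

As the population grows, throughput flattens near the reciprocal of the largest demand and response time climbs, which is the saturation behavior the prediction charts later in this post show for real workloads.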

The queueing network model of clusters of virtual servers with hundreds of VMs containing JVMs with limited number of JVM threads and limited connection pool sizes affecting the flow of requests between application and DBMS server can be very complex. On the other hand, the structure of each individual server model is the same. We will review use of the hierarchical modeling approach where the lower level tier is treated as an additional I/O device (like disk) for the current tier.

We apply a central server, closed queueing network model to model each server. An iterative multi-class Mean Value Analysis algorithm has been modified to reflect intra-request parallelism and to take into consideration the software limitations on the number of requests that can be concurrently processed by the server (MPL) and the limitation of CPU utilization by workload.
Modeling parameters are dynamically recalculated during the modeling iterations to reflect contention for memory between requests:

• Response time of DBMS server workloads is calculated during workload characterization based on measurement data extracted from SQLArea for SQL requests that belong to the corresponding workloads.

• Response time of the workload’s request to the application server includes the application server’s own response time and the total time the request spends in the DBMS server, calculated as the product of the average DBMS response time for the corresponding workload and the number of calls to the DBMS server per request to the application server.

• Measured response time of the Web server includes average response time of the Web server plus corresponding response time of the application server.

• Calibration results synchronize and coordinate all model parameters, correct inaccuracies of measurement data and workload characterization assumptions.

• To be able to calibrate each server independently, calibration should fix external parameters: response time, throughput, think time and the average number of processes generating requests. Only CPU utilization and I/O rate are redistributed between workloads while preserving total node metrics.
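The way response times roll up from tier to tier in the bullets above can be written out as a few lines of arithmetic. The sketch below is only that roll-up, with hypothetical names and made-up numbers in seconds; it does not include calibration or the iterative solution.

```python
def tier_response_times(web_own, app_own, dbms_response, dbms_calls_per_request):
    """Roll up mean response times (seconds) across three tiers.

    web_own, app_own: time spent in the Web and application server themselves.
    dbms_response: mean DBMS response time per call for this workload.
    dbms_calls_per_request: DBMS calls issued per application server request.
    """
    # Application tier: own time plus all time spent waiting on the DBMS.
    app_total = app_own + dbms_calls_per_request * dbms_response
    # Web tier: own time plus the full application tier response time.
    web_total = web_own + app_total
    return app_total, web_total


if __name__ == "__main__":
    app_total, web_total = tier_response_times(
        web_own=0.02, app_own=0.15, dbms_response=0.04, dbms_calls_per_request=6
    )
    print(round(app_total, 3), round(web_total, 3))
```

Because the DBMS term is multiplied by the number of calls, a small change in DBMS response time can dominate the end-to-end figure for chatty workloads.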





Figure 3. Iterative Modeling Algorithm Applied to Interdependent Queueing Network Models of Servers in a Multi-Tier Distributed Environment.

If one physical server hosts both VM/AS and VM/DBMS servers, the interconnection between VMs should be taken into consideration. Bottom-up effect incorporates next tier delay change for calling server workloads. Top-down effect reflects change of concurrency level (equivalent number of parallel sessions) and equivalent think time of workloads on the called server. Queueing network MVA algorithm of each server should take into consideration Regression Analysis coefficients to predict how different planned changes will affect response time (service, queuing and delays), throughput and resource utilization for each workload.

PREDICTING IMPACT OF THE WORKLOAD AND DATABASE SIZE GROWTH

The graph below shows the result of performance prediction showing the impact of the workload and database size increase in a multi-tier environment with the two application servers and two DBMS servers shown in Figure 1. As you can see in Figure 4, the response time of one of the workloads will not be affected, but the response time of the Admin Cluster workload will be significantly increased.
Predicted Impact of Workload and Database Growth on Response Time and Throughput without Virtualization


Figure 4. As a result of the expected workload growth the response time for the Physical Cluster and Admin Cluster workloads will grow significantly, but other workloads’ response times will not be significantly affected. Throughput for all workloads, except DBora1-Catch All will be reduced by 10-30% in a year.

One of the areas where analytical queueing network models can provide value is selection of candidates for virtualization. Planning virtualization includes analysis of many alternatives, including selection of candidates for virtualization, justification for the size and number of the physical servers required to support selected VMs, making decisions about which VMs should be placed on which physical servers, etc.
Let’s review performance prediction results based on measurement data collected on the system shown in Figure 1. The CPU of Application server #1 is underutilized, while the utilization of the Application server #2 CPU is almost 100% (Figure 5). We have unbalanced application servers, and one of the solutions is to evaluate the impact of virtualization: what if we place Application server #1 workloads into VM1 and workloads of Application server #2 into VM2 on the physical application tier server #2?
Predicted Impact of Workload and Database Growth on CPU Utilization of Application Server #1 and Application Server #2 Without Virtualization






Figure 5. Application server #1 is underutilized, but Application server #2 is saturated. CPU utilization of Application server #1 will be growing from 7% to almost 12%, and Application server #2 CPU utilization will be saturated at the level of 100%.

Predicted Impact of Virtualization
Replacement of the physical servers with VMs and placement of the VMs on one host with double CPU capacity is shown in Figure 7 where the hypervisor controls VMs to balance the utilization of resources, but the impact of the workload growth on performance of applications is a concern. Analytical models take into consideration workload profiles [5], and also the hypervisor overhead, which will be increasing with workload growth. Performance prediction results shown in Figure 6 illustrate that response time and throughput of different workloads will be affected differently depending on the workloads’ profiles. Prediction results set realistic expectations, reduce risk of surprises and provide the information to review different proactive performance management measures, including change of workloads’ priorities, level of concurrency, resource allocation, etc.




Figure 6. Physician Cluster and Admin Cluster workload response times and OE workload throughput will be very sensitive to workload growth, and CPU utilization of the host server will increase from 60% to 85% in a virtual environment with two VMs representing the workloads from Application servers 1 and 2.
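
A first-cut check of a consolidation candidate can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate, not the analytical queueing model used for the predictions above: the function name, the fixed overhead fraction and the example numbers are assumptions made for illustration. Note that a measured utilization of 100% on a saturated server understates its real demand, so the estimate is a lower bound.

```python
def consolidated_utilization(demands, host_capacity, overhead_fraction):
    """Rough host CPU utilization after placing several servers' workloads on one host.

    demands:           CPU demand of each source server, expressed as a fraction
                       of one source server's capacity (0.07 means 7%).
    host_capacity:     host CPU capacity in the same units (2.0 = double capacity).
    overhead_fraction: ASSUMED hypervisor overhead as a fraction of guest demand;
                       in reality it grows with the workload.
    """
    guest_demand = sum(demands)
    total_demand = guest_demand * (1.0 + overhead_fraction)
    # Utilization cannot exceed 100%; beyond that the host is saturated.
    return min(total_demand / host_capacity, 1.0)


# Hypothetical inputs: 7% and 100% busy servers, a host with double CPU
# capacity, and an assumed 10% hypervisor overhead.
estimate = consolidated_utilization([0.07, 1.0], host_capacity=2.0, overhead_fraction=0.10)
print(f"Estimated host CPU utilization: {estimate:.1%}")
```

An estimate like this only screens candidates; it says nothing about response time or about how individual workloads will be affected, which is what the model predictions provide.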

Predicted Impact of Change in Workload Priority

Performance prediction results can be used to identify when the SLO for a critical workload is not satisfied and where the potential problem lies: in the application server or the DBMS server, in CPU or I/O, in service time or queueing time, or in delays caused by concurrency limitations. One of the possible tuning options is to increase the dispatching priority of the critical workload. Figure 7 illustrates the performance prediction results reflecting the impact of the proposed priority change for one of the workloads. As a result, this workload will see an improvement in response time and throughput, but the change will negatively affect the performance of other workloads. The model takes into consideration that all workloads compete for resources, and an improvement in one place can create a bottleneck in another. Sometimes improved performance on an application server can create a bottleneck in a DBMS server and vice versa.



Figure 7. Increasing the priority of the VM hosting the Physician Cluster workload will improve response time for this workload, but other workloads, especially the Admin Cluster workload, will be negatively affected. As a result of the change, throughput and CPU utilization of each workload will be affected as well.

Predicted Impact of Concurrency Limitation

Enforcement of concurrency limitation for one of the workloads limits the number of concurrent requests for this workload and can have a very different impact on all other workloads. Reduction in the number of JVM threads in the application server can limit resource consumption for one workload, but increase consumption of resources by other workloads using a different JVM. It can move a bottleneck from the application server to the DBMS server. Performance prediction results in Figure 8 illustrate the change in response time, throughput and CPU utilization as a result of change in level of concurrency for the Physician Cluster workload.





Figure 8. Implementation of throttling and limitation of the level of concurrency for Physician Cluster will increase response time for the Physician Cluster workload, but will significantly improve the response time for all other workloads.

Predicted Impact of Change in CPU Consumption Limit

Limitation of the CPU utilization by one of the workloads can have a similar impact. As is shown in Figure 9, setting a limit on CPU consumption on the DBMS server for the OE workload will increase the response time and reduce throughput for the OE workload, but significantly improve performance for other workloads, especially DBora1. Modeling results take into consideration not only contention for resources of the application server, but also the impact on performance of the DBMS servers.






Figure 9. Setting a limit on CPU consumption on the DBMS server for the OE workload will increase the response time and reduce throughput for the OE workload, but significantly improve performance for other workloads, especially DBora1.




Figure 10. As a result of adding a new node to an Oracle RAC system, response time improved and throughput increased. CPU time consumed increased because the throughput increased.

Performance prediction results allow evaluation of the impact of the proposed Oracle RAC hardware upgrade on each workload. Shared disk subsystems, a variable degree of parallelism, contention for the interconnect, and memory limitations can affect Oracle RAC scalability. These factors can also limit the ability of RAC to provide consistent service in dynamic environments with mixed workloads.

Additional nodes will allow redistribution of requests between nodes. The arrival rate to each node will diminish, which will reduce average CPU utilization, and each node will process requests faster. Users will wait less time for a response and will be able to generate more requests. This is positive, because the system will be able to process more business transactions: system throughput will increase.
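
The effect of spreading the arrival rate over more nodes can be illustrated with the textbook single-queue (M/M/1) response time formula, R = S / (1 - U), where U = (arrival rate per node) × S. The sketch below is a simplification chosen for illustration, not the multi-tier model behind Figure 10: it assumes requests are split evenly across nodes and ignores the shared disk, interconnect and memory effects mentioned above, as well as the feedback from users generating more requests. The function name and the numbers are hypothetical.

```python
def node_response_time(arrival_rate, service_time, nodes):
    """Per-node CPU utilization and M/M/1 response time with an even request split.

    arrival_rate: total requests per second arriving at the cluster.
    service_time: CPU seconds needed per request on one node.
    nodes:        number of nodes sharing the arrival stream.
    """
    per_node_rate = arrival_rate / nodes
    utilization = per_node_rate * service_time
    if utilization >= 1.0:
        raise ValueError("node is saturated; the open-queue formula does not apply")
    return utilization, service_time / (1.0 - utilization)


# Hypothetical workload: 16 requests/s, 0.1 s of CPU per request.
for n in (2, 3):
    u, r = node_response_time(arrival_rate=16.0, service_time=0.1, nodes=n)
    print(f"{n} nodes: utilization {u:.0%}, response time {r:.3f} s")
```

Because response time grows non-linearly as utilization approaches 100%, even a moderate drop in per-node utilization can shorten response time noticeably.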

According to the performance prediction results, after adding a new index, performance of one of the workloads will be improved, but several workloads that do not include SQL accessing tables with the new index will experience performance degradation.





Figure 11. Prediction results show that creating the new index will have a different expected effect on response time for different workloads. OE and DBora1 workloads will benefit the most, while Admin Cluster and Physician Cluster will not. Modeled expectations provide the basis for comparing actual results with expected results and for verifying that the goal of the change has been reached.

Predicted Impact of Adding a New VM Containing a New Application DENTIST on a New Virtual Server

Tuning will reduce contention for storage subsystems, and DBMS throughput will increase. Suddenly, the maximum number of JVM threads will become a bottleneck. Increasing the number of JVM threads will increase the number of concurrent requests within the application server and the amount of heap memory used by all concurrent requests. Heap size is limited to 2 GB, and creation of an additional JVM will be required to support the increased number of concurrent requests. Creating a JVM within the same application server will increase contention for the CPU, so a new physical application server will be required. Creating a new JVM on a new application server will balance the application servers’ utilization and reduce the time requests spend within the application tier, but it will increase the arrival rate of requests to the DBMS tier and increase contention for the DBMS server again. The DBA can decide to increase the degree of parallelism or change priority and resource allocation for one of the workloads, but this will affect different workloads differently: some of them may benefit from it, and some may not. Modeling results show that adding a new VM containing a new application accessing data from the same DBMS server will increase contention for the DBMS and affect the performance of all workloads.




Figure 12. Predicted impact of adding a new VM containing a new application DENTIST on a new virtual server accessing data from the same DBMS server
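
The heap constraint in the chain of effects described above lends itself to a quick calculation. The sketch below is a rough sizing aid under stated assumptions: the 2 GB heap limit comes from the text, while the function name, the per-request heap figure and the request counts are hypothetical, and real heap usage also depends on garbage collection behavior and shared data.

```python
import math


def jvms_required(concurrent_requests, heap_per_request_mb, heap_limit_mb=2048):
    """Minimum number of JVMs so that the heap of all concurrent requests fits.

    heap_per_request_mb is an ASSUMED average; heap_limit_mb defaults to the
    2 GB limit mentioned in the text.
    """
    total_heap_mb = concurrent_requests * heap_per_request_mb
    return max(1, math.ceil(total_heap_mb / heap_limit_mb))


# Hypothetical growth in concurrency after DBMS tuning removes a bottleneck.
for requests in (200, 400):
    print(requests, "concurrent requests ->", jvms_required(requests, heap_per_request_mb=8), "JVM(s)")
```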

OPTIMIZATION AND AUTOMATION OF THE PERFORMANCE MANAGEMENT DECISIONS

Each workload has unique performance, resource utilization and data usage profiles. Changes to the hardware and software configurations, applications and database tuning can affect the performance of workloads differently. Finding the best configuration and the rules defining concurrency, priority, resource allocation and migration of VMs and JVMs between virtualization servers to support individual SLOs is a very difficult task.

Analytical models can be used to evaluate different options to justify workload management, performance tuning and capacity planning decisions [7,8,9]. To make the model independent of the number of servers in the system, we can build models hierarchically. Each server is modeled by a separate queueing model in which called servers are included as additional data sources, and calling servers determine the equivalent number of users (sessions) and the equivalent think time. The whole model is solved iteratively, server by server, with several iterations until convergence.
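
The hierarchical solver itself is not specified here in enough detail to reproduce. As a stand-in, the sketch below shows the standard exact Mean Value Analysis (MVA) recursion for a single-class closed queueing network, which uses the same inputs the paragraph mentions: a number of users (sessions), a think time and per-station service demands. It is a textbook technique swapped in for illustration, not the authors' algorithm, and the station demands in the example are hypothetical.

```python
def mva(service_demands, users, think_time):
    """Exact single-class MVA for a closed network of queueing stations.

    service_demands: seconds of service a request needs at each station.
    users:           number of concurrent users (sessions).
    think_time:      seconds a user waits between requests.
    Returns (throughput, response_time, mean_queue_lengths).
    """
    queue = [0.0] * len(service_demands)
    throughput = 0.0
    response = 0.0
    for n in range(1, users + 1):
        # Arrival theorem: an arriving request sees the queue left by n - 1 users.
        residence = [d * (1.0 + q) for d, q in zip(service_demands, queue)]
        response = sum(residence)
        # Response time law for a closed system with think time.
        throughput = n / (think_time + response)
        # Little's law applied to each station.
        queue = [throughput * r for r in residence]
    return throughput, response, queue


# Hypothetical demands: 0.1 s on the application server CPU, 0.2 s on the DBMS.
x, r, q = mva([0.1, 0.2], users=20, think_time=1.0)
print(f"throughput {x:.2f} req/s, response time {r:.2f} s")
```

In a hierarchical scheme, a recursion of this kind could be run per server, with the results of one server's model feeding the equivalent user count and think time of the next, repeated until the values stop changing.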

Dynamically adjusted evaluation scenarios can be used to evaluate different software parameters and find the level of concurrency, priority and resource allocation for each workload that will satisfy the SLO for each workload and minimize Total Cost of Ownership (TCO).

The algorithm starts by evaluating the impact of changing the software parameters. If changing the software parameters is not sufficient to satisfy the SLOs, then migration of VMs and JVMs to balance usage of resources is evaluated. If that is not enough, then a hardware capacity increase on overutilized servers is considered.
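
The escalation order described above can be written as a short loop. This is a minimal sketch of the control flow only: the function name, the stage labels and the `meets_all_slos` callback are hypothetical, and in practice the callback would run the multi-tier model for the candidate change and compare the predicted metrics with every workload's SLO.

```python
def first_sufficient_change(stages, meets_all_slos):
    """Return the first (stage_name, candidate) predicted to satisfy all SLOs.

    stages:         ordered list of (stage_name, candidates), cheapest stage first.
    meets_all_slos: HYPOTHETICAL callback that evaluates one candidate change
                    with the model and returns True if every SLO is met.
    """
    for stage_name, candidates in stages:
        for candidate in candidates:
            if meets_all_slos(candidate):
                return stage_name, candidate
    return None  # no evaluated change is sufficient


# Illustrative stages in the order given in the text; candidates are placeholders.
stages = [
    ("software parameters", ["raise priority", "limit concurrency"]),
    ("VM/JVM migration", ["move VM2 to another host"]),
    ("hardware upgrade", ["add CPU capacity"]),
]
print(first_sufficient_change(stages, lambda c: c == "move VM2 to another host"))
```

Ordering the stages from cheapest to most expensive means a hardware upgrade is only proposed after the lower-cost options have been evaluated.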

We can describe such optimization as multi-criteria (each workload has its own SLO) and multi-dimensional (system and workload software parameters plus hardware parameters). It is not possible to optimize for any workload separately, because all workloads use the same physical resources and therefore affect each other. There is not even an approximate analytical expression linking the variables to the goal functions. We have to run the multi-tier model for each combination of software and hardware parameters of all workloads and all servers to get the corresponding performance metrics and compare them with the SLOs.

The following steps make the search for the optimum solution more effective:

1. Select the workload whose SLO will be violated first.
2. Select the server corresponding to the greatest component of the workload’s system response time.
3. Look at the components of the workload’s request elapsed time on this server and take the corresponding action: if the request spends the most time on (or waiting for) the CPU, increase the workload’s CPU limit or priority; if the request spends the most time waiting for an execution thread, increase the number of available threads if the available memory allows (this applies to the application tier; the DBMS has similar concurrency limitations); etc.
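
The three steps above can be sketched as a selection over the model's predicted response time breakdown. The data layout, the component names, the action table and the numbers below are assumptions made for illustration. In particular, "the workload whose SLO will be violated first" is approximated here by the highest ratio of predicted response time to SLO, which is a simplification.

```python
# ASSUMED mapping from the dominant time component to a tuning action.
ACTIONS = {
    "cpu": "increase CPU limit or priority",
    "thread_wait": "increase number of threads if memory allows",
}


def next_tuning_action(predictions, slos):
    """Pick (workload, server, component, action) for the next model run.

    predictions: {workload: {server: {component: seconds}}} from the model.
    slos:        {workload: response time SLO in seconds}.
    """
    def total(workload):
        return sum(sum(parts.values()) for parts in predictions[workload].values())

    # Step 1: workload with the worst predicted response time relative to its SLO.
    workload = max(predictions, key=lambda w: total(w) / slos[w])
    # Step 2: server contributing most to that workload's response time.
    servers = predictions[workload]
    server = max(servers, key=lambda s: sum(servers[s].values()))
    # Step 3: largest elapsed time component on that server decides the action.
    component = max(servers[server], key=servers[server].get)
    return workload, server, component, ACTIONS.get(component, "review manually")


# Hypothetical predicted breakdown (seconds) for two workloads.
predictions = {
    "OE": {"app": {"cpu": 0.2, "thread_wait": 0.1}, "dbms": {"cpu": 0.9, "io": 0.3}},
    "Admin Cluster": {"app": {"cpu": 0.1, "thread_wait": 0.4}, "dbms": {"cpu": 0.1, "io": 0.1}},
}
slos = {"OE": 2.0, "Admin Cluster": 0.5}
print(next_tuning_action(predictions, slos))
```

After the chosen action is applied, the model has to be rerun and the selection repeated, because the change shifts load onto other workloads and servers.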

Each such change affects all workloads, so after the model is rerun, another workload may violate its SLO earlier or another server may become the bottleneck.

If some SLOs are still not satisfied after the workload control adjustments, we turn to the server-level software parameters, which are mostly available for virtual servers: CPU share, memory share (which affects the swapping I/O rate), etc. Finally, if changing software parameters and moving VMs and JVMs cannot satisfy the SLOs, a hardware upgrade is evaluated.

SUMMARY
The complexity of virtualized, multi-tier distributed Oracle environments makes it difficult to plan and manage them effectively. We presented a methodology and approach to modeling complex, multi-tier distributed environments with virtualization. We demonstrated how modeling and optimization improve effectiveness and reduce the risk of performance surprises during planning and management of such environments, and we reviewed how to model the impact of workload growth and other changes on hypervisor overhead. Performance prediction and optimization technology allows evaluating different options, setting realistic SLOs, finding virtualization candidates, predicting the impact of workload growth, new VMs and new applications, and justifying VM migration and hardware upgrades of application tier servers and Oracle DBMS servers. It also provides a framework for organizing a continuous proactive performance management process.

REFERENCES
  1. B. Zibitsker, IOUG 2008, Reducing Risk of Surprises in Changing Multi-tier Distributed Oracle RAC Environment
  2. B. Zibitsker, DAMA 2007, Enterprise Data Management and Optimization
  3. B. Zibitsker, CMG 2008, Hands on Workshop on Performance Prediction for Multi-tier Distributed Environments
  4. J. Buzen, B. Zibitsker, CMG 2006, Challenges of Performance Prediction in Multi-tier Parallel Processing Environments
  5. B. Zibitsker, G. Sigalov, A. Lupersolsky, Modeling and Proactive Performance Management of Multi-tier Distributed Environments, International Conference "Mathematical Methods for Analysis and Optimization of Information and Telecommunication Networks" (Byelorussian Winter Workshop in Queueing Theory – 2007)
  6. M. Friedman, S. Marksamer, Measure IT, March 2007, A Realistic Assessment of the Performance of Windows Guest Virtual Machines
  7. J. Nocedal, S. J. Wright, Numerical Optimization, ISBN 0-387-98793-2
  8. M. W. Trosset (Department of Computational & Applied Mathematics, Rice University, Houston, TX), V. Torczon (Department of Computer Science, College of William & Mary, Williamsburg, VA), Numerical Optimization Using Computer Experiments
  9. Charbonneau, High Altitude Observatory, National Center for Atmospheric Research, Boulder, Colorado