Performance Gains Release 2.1 -> 2.3

Between the YAWL 2.1 and 2.2 releases, a great deal of refactoring was done to improve performance, particularly in three key areas that had proven problematic: the persistence of case state, event notification handling, and the logging framework.


Persistence of Case State

Each change of state in each process instance is persisted to the database so that, if a server restart is required, instances can continue transparently from their saved state. Naturally, each commit to the database takes some time to perform. It was found that an unnecessarily large number of commits was being performed for common activities, such as starting a work item or creating a sub-net. A re-definition of which actions could be grouped into logical units of work removed the unnecessary commits and eliminated the repeated creation and closing of database connection and transaction objects. These changes resulted in a marked decrease in the time each process instance spends on persistence activities.
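
The effect of grouping state changes into a logical unit of work can be illustrated with a minimal sketch. This is not the Engine's code (YAWL itself is written in Java): it uses Python's standard sqlite3 module, and the table name `case_state`, the `CHANGES` rows and both function names are hypothetical, chosen only to show one commit per change versus one commit per unit of work.

```python
import sqlite3

def open_store():
    """Create an in-memory database with a hypothetical case-state table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE case_state (case_id TEXT, item TEXT, status TEXT)")
    conn.commit()
    return conn

# Hypothetical state changes caused by starting a single work item.
CHANGES = [
    ("1", "item_A", "fired"),
    ("1", "item_A", "executing"),
    ("1", "net_marking", "updated"),
]

def persist_each_change(conn, changes):
    """Commit after every state change; returns the number of commits issued."""
    commits = 0
    for row in changes:
        conn.execute("INSERT INTO case_state VALUES (?, ?, ?)", row)
        conn.commit()
        commits += 1
    return commits

def persist_as_unit_of_work(conn, changes):
    """Write all changes in one transaction; returns the number of commits issued."""
    with conn:  # commits once on success, rolls back if an exception is raised
        conn.executemany("INSERT INTO case_state VALUES (?, ?, ?)", changes)
    return 1
```

Both functions store the same three rows, but the second issues a single commit instead of three. Under the assumption that commit cost dominates, the saving grows with the number of state changes per activity.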


Event Notification Handling

It was found that the framework for the notification of events between the YAWL Engine and registered custom services, which had remained relatively static since early YAWL versions, was failing to cope when large numbers of process instances were invoked within a short space of time. The symptoms were a gradual increase in the time the Engine took to start each process instance and, under heavy load, occasional work items being announced twice to services.

To solve these issues, the framework was stripped back and completely rebuilt, so that a process instance now takes the same amount of time to launch whether there are zero or many thousands of cases already running (see Figure 1).


Logging Framework

The new logging framework, introduced in version 2.1, improved the capabilities of the process logs, but brought a large increase in the number and complexity of relations between tables. The process logs are written to frequently for each process instance, and it was found that while the way in which primary key lookup was being performed worked well under low-to-medium load, response times degraded quickly as the number of concurrent cases increased. This turned out to be the major reason for the performance issues demonstrated by release 2.1 (as can be seen in Figure 1). A complete refactoring of the primary key lookup algorithms and a new method for caching repeatedly required keys solved these performance issues.
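
The general idea of caching repeatedly required keys can be sketched as follows. The text above does not describe the actual lookup algorithm or log schema, so everything here is an assumption for illustration: the table `log_task`, the class `KeyCache` and its `key_for` method are hypothetical, and the sketch is in Python with sqlite3 rather than the Engine's own Java code.

```python
import sqlite3

def make_log_db():
    """Create an in-memory database with a hypothetical log lookup table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log_task (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    return conn

class KeyCache:
    """Resolves primary keys by name, remembering keys it has already resolved."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}       # name -> primary key
        self.db_lookups = 0   # how many times the database was actually queried

    def key_for(self, name):
        # Serve repeated requests from memory without touching the database.
        if name in self.cache:
            return self.cache[name]
        self.db_lookups += 1
        row = self.conn.execute(
            "SELECT id FROM log_task WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            # First time this name is logged: create the row and use its new key.
            cur = self.conn.execute("INSERT INTO log_task (name) VALUES (?)", (name,))
            key = cur.lastrowid
        else:
            key = row[0]
        self.cache[name] = key
        return key
```

With such a cache, logging many events that refer to the same task costs one database lookup for the first event and none afterwards, so lookup cost no longer grows with the number of concurrent cases that share the key.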


Figure 1 shows the results of a series of 'stress' tests performed against versions 2.1, 2.2 and 2.3, where 10,000 cases were launched in the Engine sequentially, each new launch occurring as soon as the previous launch completed. The y-axis shows the number of milliseconds taken to start each new case; the x-axis shows the number of concurrently running cases.

For each version, Figure 1 shows the time that elapsed for the launch of each instance, measured from the moment a custom service initiated the launch to the moment the Engine completed it and returned the new case id to the service. It also shows the effect that the number of cases already running had on each new launch. All communications between the Engine and the custom service were performed over Interface B. It can be seen that all three versions perform similarly until a threshold of around 700 concurrent cases is reached, at which point the performance of version 2.1 begins to degrade alarmingly. While versions 2.2 and 2.3 gurgle along at an average of around 21ms per case regardless of the number of concurrent cases (that is, the number of cases already running has no effect on how long the Engine takes to start a new case), the test for release 2.1 was stopped at 3,000 cases, when each new case was taking over 500ms and growing. The graph demonstrates that the performance problems encountered with version 2.1 under load have been eliminated from version 2.2 onwards.
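
The shape of such a measurement can be sketched as a small timing harness. This is not the test code used for Figure 1: the function names `run_stress_test` and `mean_ms` are hypothetical, `launch_case` stands for whatever call a custom service would make to start a case and wait for its id, and the sketch assumes that no case completes during the run, so the number of cases already running equals the number launched so far.

```python
import time

def run_stress_test(launch_case, total_cases):
    """Launch cases one after another, each as soon as the previous launch returns.

    Returns a list of (cases_already_running, elapsed_ms) pairs, one per launch.
    """
    samples = []
    for running in range(total_cases):
        start = time.perf_counter()
        launch_case()  # assumed to block until the new case id is returned
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        samples.append((running, elapsed_ms))
    return samples

def mean_ms(samples):
    """Average launch time in milliseconds over all recorded samples."""
    return sum(ms for _, ms in samples) / len(samples)
```

Plotting the second element of each pair against the first gives a graph of the kind described above; a flat line indicates that launch time is independent of the number of running cases.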

All tests were carried out on the same MacBook Pro (i7, 8GB).