The Search for Fast Aggregates - Implementation & Results

By Daniel Wood, 13 August 2011

In the final part of this two part series, we go into a little more detail about the aggregate technique we developed in the first article. We also carry out a number of comparison aggregate tests and present the results. We also try to address some of the limitations of this technique and ways in which you could try to work around these:

Link to Part One - Trial & Error

A very quick overview of where we are at

Summary fields & aggregate calcs require all record data to be sent from host to client
This makes aggregates on large datasets or large records slow
We found an "All Values" value list based on a single field, does not send all record data from host to client. Instead, it sends just that field value from all records (perhaps using the fields index).
Using this idea, we build a value list of values we want to aggregate, the idea being we evaluate the value list and apply an aggregate to the values, bypassing the need to download all the records.
These value lists like to combine unique values from records into a single entry. For this reason our value list had to be based on a special calculation whose contents was unique for every record, that contained the value we wanted to aggregate, and be easy for us to retrieve the original value when needed.

If you haven't yet read part one of the article then it is probably a really good idea to give it a look before reading this one, it will make this article make a lot more sense to you!

Retrieving the Value List contents

So at this stage we have built our special calculation, and built a value list based on that. The end result is we obtain a value list with 1 entry per record for every record in the table. Here is a snippet of what the value list looks like:

107.6JDBD
107.7BFHA
107.7DDGB
107.7FGIC
107.7HCFB
107.7HEFE
107.7HHEI
107.7IFCD
107.8BAGB
107.8BICE
107.8CJAF

You can see the numeric component is the weight field value we wish to aggregate eventually. The alpha part is a conversion of the primary key value from a number into an alphabetical string.

The value list contents is obtained using the ValueListItems function:

ValueListItems ( Get ( FileName ) ; "Aggregate_Weight All" )

With the contents of the value list obtained, we can use the Filter function to strip out the alpha-representation of the primary key:

Filter ( theValueList ; "0123456789-.&#195;&#130;&#194;&#182;")

You will note we have also kept some other possible characters that occur in a number, the negative sign and a decimal point. We also retain the return delimiters to keep the values as a list. The end result is:

Hey, this is looking promising! We now have a list of every Weight value in a list, and we obtained it by getting just the value, not the entire record, we are on the right track.

So how do we perform an aggregate?

We looked at a couple of ways to turn this list into an aggregate result, and decided that a custom function would be a great way to achieve this.

Our initial attempt focused simply on a sum aggregate. We began by running a substitute on the list, replacing all return characters with plus signs:

Substitute ( theList ; "&#195;&#130;&#194;&#182;" ; "+" )

Taking the snippet of the value list above, this gave us:

107.6+107.7+107.7+107.7+107.7+107.7+107.7+107.7+107.8+107.8+107.8

Which is now a simple FileMaker expression. We then ran this result through the Evaluate function. This function takes a text string as a parameter, and evaluates it as if it were a FileMaker expression:

Let ( [
  theList = ValueListItems ( Get ( FileName ) ; "Aggregate_Weight All" ) ;
  theList = Substitute ( theList ; "&#195;&#130;&#194;&#182;" ; "+" )
];
  Evaluate ( theList )
)

End result - well it "kind" of worked. With small lists it worked great, but we found the larger the number of items we aggregated, we were presented with a "?" result.

The reason for this is that FileMaker has a limit on the number of characters permitted in any calculation expression. According to the FileMaker websites technical specifications, this limit is 30,000 characters.

To get around this issue, we built a recursive custom function that performed the above operation on the list, but in blocks of 500 values at a time. So the first 500 values were summed, and the result of that was added to the sum of the second 500, the third 500, and so on until each block of values had been summed. This worked great, and was pretty quick too. For example if you were summing 10,000 values, the custom function would perform 20 recursions, each summing 500 values.

That's great for sum, but what about min/max/average?

Because we were using the Evaluate function, we realised that we could adapt our custom function to take a second parameter, that being the name of the aggregate function we wished to apply

fastaggregates2 1

This is the original version of the custom function. The overall list is broken into chunks of 1000 values aggregated at any given recursion of the function.

The first few variable declarations in the function grab the 1000 values, chop off the trailing return character, and replace all return delimiters with semi-colons (the character that is used as the separator for aggregate functions). At this point we end up with something like:

107.6;107.7;107.7;107.7;107.7;107.7;107.7;107.7;107.8;107.8;107.8

The AggregateFunction parameter is the name of an aggregate function, which is then placed before the above string. For example if the parameter was sum, we build the following string:

Sum ( ... the above string... )

Run that sucker through the Evaluate function, and you have your aggregate! The handling of blocks of values using recursion keeps the evaluation expression under the necessary character limit.

The "Average" aggregate function, a curly one!

Sum, Min and Max work great with this custom function. The reason is that if you add values, you get one total result. Similarly there is only ever 1 minimum and 1 maximum value in an overall list. It doesn't matter if you add the list up in smaller blocks, or compare blocks of values with other blocks to obtain a min or max value, the outcome is always the same.

Average is different however. To determine average aggregate, the total number of values has to be known, and the Average function determines this based on the number of parameters it is passed.

Consider this example, a list of the following values:

Average ( 1 ; 2 ; 3 ; 4 ; 5 )

The result of this calculation is 3. The average function first sums the values, and then divides it by the number of passed values to obtain the average. In this case, that is 15/3 = 5.

Now consider out implementation of the above custom function using blocks of values from the overall list. What happens if we split the list into two:

1
2
3

and

4
5

Using the custom function, we end up with the average of the first list being 6/3 = 2. The average of the second list is 9/2 = 4.5. The function would then combine the results into another aggregate calculation:

Average ( 2 ; 4.5 )

Which is deemed to be 6.5 / 2 = 3.25 , this is certainly NOT the expected result of 5 we were after.

The simple fact is the average function has to be aggregated in a different way. We first need to sum the contents of the list, and then divide it by the number of values in the original list.

Rather than try to make the current custom function any more complicated, we simply created a new custom function called ValueList_Average that simply runs the other Aggregate custom function as a Sum aggregate, then divides the result by the number of values in the list:

fastaggregates2 2

Enough technical mumbo jumbo, time for some results

For testing, we built an example file (see bottom of this article). The test file carries out three types of aggregate tests:

Summary fields
Aggregate Functions
Value List technique with our Custom Aggregate Function

For each type of aggregate, we test the sum, average, min and max aggregates. The testing is done on a Contacts table with 10,000 records. There are a total of 32 stored fields on the table, and each record averages about 0.6kb in size for a total table size of 5.9 megabytes.

Test One: File opened locally client machine

For the first test, we are eliminating network as a factor in testing. This will hopefully give us insight into the performance of the functions used, and the performance of the value list retrieval and processing.

fastaggregates2 3

fastaggregates2 4

fastaggregates2 5

As you can see, the results of the 3 tests are very similar. Multiple run-throughs of the test yield various aggregates reach the 1 second mark, but on following tests are under a second, and it is fairly random. This suggests the summary fields and aggregate functions are basically identical in speed. The higher frequency of more 1 second results for the value list implementation means it's basically the same speed, but maybe fractionally slower.

Speeds for all 3 tests are very fast with 10,000 records when the database is local because no records have to be sent over the network, so any overhead is a result of the aggregate methods themselves - of which there is very little overhead at all.

Test Two: File opened remotely over wide area network - 10k records

Here is the real acid test. The file has been hosted on a remote server and is being accessed over a fairly average New Zealand internet connection. In order to give a fair test for each method, the file was closed and reopened after each test. The reason for this is to clear the clients cache of downloaded records from the host.

fastaggregates2 6

fastaggregates2 7

fastaggregates2 8

So, what can we tell from these results? Well first thing to note is that these tests above were carried out in order of the aggregates shown. This gives us clear indication that the very first time an aggregate is evaluated, the high performance hit is taken first time, as all of the records are downloaded from host to client. As you can see, the summary and aggregate function tests yield pretty much the same time taken to download all 10,000 records.

The value list implementation on the other hand, was over 11 times faster than the other two aggregate methods. The performance gain in just aggregating using a value list is huge. The main reason is far less traffic is sent between host and client. We showed in the local file test that overhead of the actual processing of the values is negligible, so by reducing the major bottleneck of traffic sent between host and client, we have greatly improved the speed of the aggregate calculations!

Note that if the tests were run a second or third time directly after opening the file and running the first test, they were almost instant to calculate, the reason is the file had downloaded and cached all the records in the first 2 cases, and in the third the value list had been cached, so future tests were essentially working with local data.

Changes to weight values on the records prior to running a new test resulted in the summary field and aggregate function tests taking relatively the same amount of time. The value list option took slightly longer in this case because since the value list had changed contents, the whole value list is resent from host to client - however given it is pretty damn quick the performance impact is minimal.

Test Three: File opened remotely over wide area network - 100k records

For the last test we are going all out and bumping up the record count to 100,000. The file is opened under the same conditions as test number two.

fastaggregates2 10

fastaggregates2 11

fastaggregates2 12

Again, tests 1 and 2 take relatively the same amount of time, this time over 10 minutes, subsequent aggregates are much faster as they are working on the now cached records.

But once again we see the value list test has shone through, being over 10 times faster than the regular aggregate methods.

Another thing we did when running this 100k test was to fire up an application called Wireshark. This is a utility which lets you capture network traffic and keep track of packets transmitted and data flowing over the network.

We setup Wireshark to only capture data sent between the client and the host server, on FileMaker port 5003. Here is what we found:

fastaggregates2 13

For the aggregate function test we found that total packet traffic between host and client was 84,608 packets of data over the 10+ minutes. There was a total of 72,588,262 total bytes transmitted, which equates to a whopping 69.2 megabytes!.

fastaggregates2 14

And here are the results for the value list technique. Note the total packets transmitted was 5,372. That is close to 16 times less packets transmitted - and less packets means less overall latency and data transmitted. The total data transmitted was 4,430,483 bytes which equates to only 4.2 megabytes - 16.5x less data transmitted!

Wow, so it's faster that's awesome... and here's the bad news...

To this point we've been demonstrating a pretty ideal example. We have aggregated every record in the table. But seriously how often would you really want to do that? Perhaps there are some situations where you might want to do that, for example if you are using a special reporting table where you are generating records for reporting purposes.

More often than not however you want to aggregate a subset of a tables overall records. Consider aggregate functions, they work best when aggregating related records through a relationship, which generally are filtered to be a subset of the tables overall records. Summary fields also excel in sub-summary reports where they are tallying subsets of the records in groupings.

Unfortunately this value list technique does have limitations on how far you can take it.

The value list has to be stored.

This is perhaps the biggest limiting factor. Because the field used in the value list has to be stored, you cannot introduce complex filtering into your calculation based on globals, or fields from other tables.

Lets consider an example where you have a couple of global fields where you enter a date range. You want to sum all records whose date field falls within your chosen range. How would you achieve this?

Well first instinct would be to change your value list calculation to reflect this filtering. One cool and useful property of the value list, is if a record contains an empty value for the field used in the value list, that record is essentially omitted from the value list. this has no performance impact when it comes to our method which is great - if anything it makes the method faster as the a smaller set of values in the value list means faster processing times.

Back to the problem though. Our date range is specified in global fields, so we cannot use those in the calculation, which means we can't set the calculation result to blank if the record calls outside of the chosen date range, bummer!

Having said that, some filtering is possible...

Hard-coded filtering is possible though and works quite well. Consider a contacts table with 10,000 records. There is a status field that contains either "Active" or "Deleted". Further to that is a Country field.

Lets say we want to find the average weight of all contacts living in america whose status is active. To do this we would simply modify our value list calculation as such:

if ( Status = "Active" and Country = "USA" ; 
     Weight & numToText ( Contact ID ) ; ""
)

Here we have hard-coded a check in. If the contact is active and in the USA, we put a result in the calculation, if not the result is blank, and this record will not appear in the value list. Then we just run through the usual process of obtaining the aggregate from the value list.

This is nice, but very restrictive. For each subset of records you wanted to aggregate in this way, you would require a separate calculation field and value list to do so. Perhaps this might be fine depending on your situation, but it is quite limited.

Found sets and summaries

The other thing this technique is not, is a replacement for summary fields in sub-summary reports. This technique simply cannot be used to replace summary fields in this manner. Calculations just can't know about sub-summary parts in this way and if they could, in all likelihood the calculation would end up being unstored anyway, thus breaking at the value list stage.

So in conclusion....

We've talked about a lot in this article and the previous one. We've shown various aggregate methods, and why some perform slow over wide are networks. We've walked through the process of coming up with an alternative solution to the performance issue, and as a result came up with a technique combining a number of FileMaker areas including Value Lists, custom functions, and calculations.

We have also gone into detail about building an aggregate custom function to turn a list of values into the desired result. We then showed results of testing of each of the three presented aggregate methods, and showed that the value list method has huge potential for significant performance gains under very specific conditions.

Finally, we showed that this technique is most certainly not a perfect replacement for good old fashioned summary fields and aggregate functions. It can only be used under certain conditions, but if those match your situation, then you could be in for a good performance gain!

Example File

Please find attached an example file. This file was used for all the screenshots in this article, and is provided to help you fully understand what is going on in this article, and to let you experiment in FileMaker with this solution. In the file you will also find a couple of additional tests briefly touched upon in the article but not fully tested. These are in the area of aggregating a subset of the overall records in a table.

Download Example File With 10,000 Records (3.5mb)

Download Example File With 100,000 Records (36.4mb)

Readers Note:

Christoph Teschers pointed out in the comments that in the example files the tests do not yield correct results if negative values are used in the field being aggregated. He points out that this can be resolved by adding a - symbol into the Filter function, ie:

 Filter ( ValueListItems ( Get ( FileName ) ; "Aggregate_Weight All" ) ; "-0123456789.&#195;&#130;&#194;&#182;" )

The Search for Fast Aggregates - Implementation & Results

A very quick overview of where we are at

Retrieving the Value List contents

So how do we perform an aggregate?

That's great for sum, but what about min/max/average?

The "Average" aggregate function, a curly one!

Enough technical mumbo jumbo, time for some results

Test One: File opened locally client machine

Test Two: File opened remotely over wide area network - 10k records

Test Three: File opened remotely over wide area network - 100k records

Wow, so it's faster that's awesome... and here's the bad news...

The value list has to be stored.

Having said that, some filtering is possible...

Found sets and summaries

So in conclusion....

Example File

Readers Note:

Categories(show all)

Subscribe

Tags

The Search for Fast Aggregates - Implementation & Results

A very quick overview of where we are at

Retrieving the Value List contents

So how do we perform an aggregate?

That's great for sum, but what about min/max/average?

The "Average" aggregate function, a curly one!

Enough technical mumbo jumbo, time for some results

Test One: File opened locally client machine

Test Two: File opened remotely over wide area network - 10k records

Test Three: File opened remotely over wide area network - 100k records

Wow, so it's faster that's awesome... and here's the bad news...

The value list has to be stored.

Having said that, some filtering is possible...

Found sets and summaries

So in conclusion....

Example File

Readers Note:

Categories(show all)

Subscribe

Thank you!

Tags