
SAS Visual Investigator - Chunk Size and Indexer Performance


This post is adapted from the SAS R&D article Have you tried increasing your chunk size? written by Liam Macpherson, Senior Software Developer at SAS.

 

In SAS Visual Investigator, data must be indexed before it is searchable by end users. Re-indexing is often necessary when the configuration of existing objects (entities, relationships, and transactions) has been updated. The performance of the indexer is therefore vital.

 

Under global properties, the Indexer Partition option allows SAS Visual Investigator administrators to optimize parallel data processing during indexing. To enable this, data is divided into equal-sized segments, known as partitions. Administrators control how many partitions are processed simultaneously; within each partition, the system reads, enriches, and writes data in chunks, starting from the beginning of the partition and working through to the end.

 

There are three key settings that influence the size of the partitions and chunks, sketched together in the example after this list:

 

  • Partition multiplier
  • Partition size
  • Chunk size
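
To make these settings concrete, here is a minimal sketch of the partition-and-chunk arithmetic, assuming the multiplier simply scales the effective partition size. The formula and the partition settings used in the example are illustrative assumptions, not the indexer's actual implementation.

```python
import math

def plan_indexing(total_records: int, partition_size: int,
                  partition_multiplier: int, chunk_size: int) -> None:
    """Illustrative only: how records might be split into partitions,
    then worked through chunk by chunk within each partition."""
    # Assumption: the multiplier scales the effective partition size.
    effective_partition = partition_size * partition_multiplier
    partitions = math.ceil(total_records / effective_partition)
    chunks_per_partition = math.ceil(effective_partition / chunk_size)
    print(f"{partitions} partitions, each processed in "
          f"{chunks_per_partition} chunks of up to {chunk_size} records")

# Example: the 2,000,000-record table used in the tests below,
# with assumed partition settings.
plan_indexing(2_000_000, partition_size=100_000,
              partition_multiplier=1, chunk_size=5_000)
```

Raising the chunk size reduces the number of chunks per partition, so each unit of work is larger: fewer round trips, but more data held in memory at once.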

 

This post focuses on chunk size: its usage and its limitations. It also explores how adjusting the chunk size can enhance indexing performance, ultimately improving data accessibility and helping SAS Visual Investigator users make faster and more informed decisions.

 

01_AL_SAS_VI_index_chunking_1.png


 

Introduction

 

Indexing is often seen as a black box: data in, search results out. So when something breaks or doesn't perform as expected, there is often little understanding of why, or of what can be done to resolve it. And when we want to dig deeper into one particular mechanism, we often find ourselves opening multiple black boxes nested within one another.

 

We have taken steps to improve output and clarity. The Job Monitor gives us indexer metrics which provide a peek into the first box, showing how long the process spent in each of the nested black boxes.

 

02_AL_SAS_VI_index_chunking_1p5.png

 

Adding tuning settings directly to the SAS Visual Investigator Administrator User Interface (Manage Investigate and Search) allows some tweaking based on previous experience or guidance from experts. The indexer team carried out formal testing using these tuning settings and has analyzed the results here.

 

This testing was carried out to verify a fix for a particular customer, and the results far exceeded our expectations: an original throughput of 62,500 records/minute increased to 1.2 million records/minute. That is an 1820% increase in throughput, since (1,200,000 − 62,500) / 62,500 = 18.2. In our own testing, we saw marked increases between the pre-fix and post-fix versions.

 

These improvements help, but more testing and learning are required, as is making even more parts of the process visible to end users.

 

Terminology

 

Search Index: An external piece of software which stores data in such a way that it can be easily accessed and searched. In SAS Visual Investigator this has been Elasticsearch (prior to version 10.8) and OpenSearch (version 10.8 and later).

 

Indexing: This is the process of taking external data, enriching it and loading it into the search index.

 

Bulk Indexing: This is taking all the data stored for a given data object and loading it all into the search index.

 

Incremental Indexing: This takes a subset of the data and loads it into the search index based on the date at which it was most recently updated. The intent is to index smaller volumes of data more frequently.

 

Chunk Size: The smallest unit of processing. A chunk of data is read, enriched, and indexed as a single unit. You must consider memory implications when setting the chunk size, as different object types have different memory requirements.
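
As a rough mental model, the read-enrich-index cycle over one partition can be sketched as follows. The function names, record shape, and incremental-indexing filter are illustrative assumptions, not the indexer's actual API.

```python
from datetime import datetime
from typing import Iterable, Optional

CHUNK_SIZE = 5_000  # the tunable setting this post focuses on

def enrich(chunk: list[dict]) -> list[dict]:
    return chunk  # placeholder for enrichment logic

def write_to_search_index(chunk: list[dict]) -> None:
    pass  # placeholder for a bulk write to OpenSearch/Elasticsearch

def index_partition(records: Iterable[dict],
                    last_run: Optional[datetime] = None) -> None:
    """Illustrative only: process one partition in fixed-size chunks."""
    chunk: list[dict] = []
    for record in records:
        # Incremental indexing: only records updated since the last run.
        if last_run is not None and record["updated_at"] <= last_run:
            continue
        chunk.append(record)
        if len(chunk) == CHUNK_SIZE:
            write_to_search_index(enrich(chunk))
            chunk = []
    if chunk:  # flush the final, partially filled chunk
        write_to_search_index(enrich(chunk))
```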

 

Considerations

 

Any testing of the indexer has two main limitations: it is fixed in time, and it is tied to several factors outside of our control. The indexers are constantly evolving and improving, while organization-specific environments and use cases are nuanced and complex, so further research is always needed.

 

At the time of writing, the indexer is at version 3.23.28; the testing was carried out against version 1.6.39 for 10.8 Hot Fix 6, as a means of verifying a proposed fix the indexer team was putting forward. We extended it to cover more data stores and tweaked it to give as full a picture as we could get, while still confirming that the customer's indexing would perform as well as it could in Hot Fix 6.

 

Many releases have followed and aspects of the indexer have changed, but this testing and result set can act as a point of comparison for some of the variables which customers have control over. We also hope it will guide testing we want to carry out in the future.

 

We are not testing:

 

  • Relationships
  • Transactions
  • Customer representative data or deployments
  • The tweaking of any other settings (everything else is left at its default)
  • The latest version of the indexer, which may contain performance improvements that affect the benefit of tweaking any particular setting

 

Testing Methodology

 

The testing was repeated against all our currently supported data stores, making for simple comparison of results. These tests analyze the effect of changing the chunk size with various data loads. We looked at two questions for each test case:

 

  1. Did the indexing complete?
  2. How long did the indexing take to complete?

 

Our variable factors, which combine into the test matrix sketched after this list, were:

 

  1. Chunk size: set intervals of 1,000, 5,000, 7,500, and 10,000.
  2. Target records: set intervals of 10,000, 100,000, 1,000,000, and 2,000,000.
  3. Data stores: IBM DB2, Microsoft SQL Server, MySQL, Oracle, PostgreSQL and Teradata.
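
Taken together, these factors form the full test matrix. Here is a small sketch enumerating it, with the values taken directly from the lists above:

```python
from itertools import product

chunk_sizes = [1_000, 5_000, 7_500, 10_000]
target_records = [10_000, 100_000, 1_000_000, 2_000_000]
data_stores = ["IBM DB2", "Microsoft SQL Server", "MySQL",
               "Oracle", "PostgreSQL", "Teradata"]

# Every combination is one incremental test case: 4 x 4 x 6 = 96 runs.
test_cases = list(product(data_stores, chunk_sizes, target_records))
print(len(test_cases))  # 96
```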

 

All tests were carried out against tables containing similar, partially randomly generated data in the following format:

 

  • Two primary key columns of type VARCHAR.
  • A single TIMESTAMP column.

 

An example of this can be seen below.

 

03_AL_SAS_VI_index_chunking_2.png
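
For reference, here is a minimal sketch of generating rows in this shape. The column names and value formats are assumptions made for illustration; only the column types (two VARCHAR primary keys plus a TIMESTAMP) come from the test setup above.

```python
import random
import string
from datetime import datetime, timedelta

def random_varchar(length: int = 12) -> str:
    return "".join(random.choices(string.ascii_uppercase + string.digits,
                                  k=length))

def generate_rows(n: int) -> list[tuple[str, str, datetime]]:
    """Rows of (pk_one VARCHAR, pk_two VARCHAR, updated_at TIMESTAMP).
    Note: uniqueness of the generated keys is not enforced in this sketch."""
    base = datetime(2024, 1, 1)
    return [(random_varchar(), random_varchar(),
             base + timedelta(seconds=random.randrange(90 * 24 * 3600)))
            for _ in range(n)]

rows = generate_rows(10_000)  # smallest target-record interval in the tests
```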

 

An external entity was created in SAS Visual Investigator backed by this table. We also did a baseline full index of the entire table (always 2,000,000 records in these tests), which follows a slightly different path and is therefore useful for comparing full indexing against incremental.

 

Testing was carried out against a development machine with out-of-the-box configuration. The machines have 32GB of RAM and 4 CPUs.

 

Results

 

The results section is split by data store:

 

IBM DB2

Incremental

04_AL_SAS_VI_index_chunking_3.png

 

05_AL_SAS_VI_index_chunking_14.png

 

Bulk

06_AL_SAS_VI_index_chunking_4.png

 

07_AL_SAS_VI_index_chunking_4p5.png

 

Microsoft SQL Server

Incremental

08_AL_SAS_VI_index_chunking_5.png

 

09_AL_SAS_VI_index_chunking_15.png

 

Bulk

10_AL_SAS_VI_index_chunking_6.png

 

11_AL_SAS_VI_index_chunking_6p5.png

 

MySQL

Incremental

12_AL_SAS_VI_index_chunking_7.png

 

13_AL_SAS_VI_index_chunking_16.png

 

Bulk

14_AL_SAS_VI_index_chunking_8.png

 

15_AL_SAS_VI_index_chunking_8p1.png

 

Oracle

Incremental

16_AL_SAS_VI_index_chunking_8p5.png

 

17_AL_SAS_VI_index_chunking_17.png

 

Bulk

18_AL_SAS_VI_index_chunking_9.png

  

15a_AL_SAS_VI_index_chunking_9p5.png

 

PostgreSQL

Incremental

19_AL_SAS_VI_index_chunking_10.png

 

20_AL_SAS_VI_index_chunking_18.png

 

Bulk

21_AL_SAS_VI_index_chunking_11.png

  

22_AL_SAS_VI_index_chunking_11p5.png

 

Teradata

Incremental

23_AL_SAS_VI_index_chunking_12.png

 

24_AL_SAS_VI_index_chunking_19.png

 

Bulk

25_AL_SAS_VI_index_chunking_13.png

 

26_AL_SAS_VI_index_chunking_13p5.png

 

Analysis

 

This section looks at some of the results with a view to using them to guide tuning, or as a point of comparison for other result sets going forward.

 

Incremental vs Bulk

 

Incremental indexing being slower than bulk is the expected behavior: the mechanisms through which we split the workload of records to be read, processed, and written to the search index introduce extra overhead. Comparisons between the two are therefore intended to show that chunk size has an effect on both bulk and incremental indexing, rather than to show how much quicker bulk indexing is than incremental indexing.

 

It is worth noting that bulk indexing can be viewed as the simplest method of indexing, and therefore an ideal to aim for. That is why seeing incremental indexing approach bulk indexing times through chunk size tuning is an excellent indicator that systems can indeed benefit from tuning.

 

Data Stores

 

The first thing that jumps out is the stark difference between the data stores. Teradata takes around 105 minutes with a chunk size of 5000, while PostgreSQL takes under 10 minutes to process the same volume of records with the same chunk size. This speaks more to the differences between the data stores and the hardware backing them than to how the indexer interacts with each one. There are some specific limitations which the data vendors force upon the indexer's processes, but the main reason for the huge discrepancy is simply that these are different data stores which handle queries differently and run on different hardware. This should not be read as "don't bother tuning if you are using Teradata/MySQL/Microsoft SQL Server". It may just be that chunk size should not be the first port of call for tuning.

 

Chunk Size Recommendation

 

Over the years we have changed the default chunk size. Generally speaking, we want it increased to the point where we are utilizing as much of the available resources as we can: maximizing the indexer's memory usage without overloading it, and reading from the database and writing to the search index as much, and as quickly, as we can. Sometimes one of these resources will push back or cause a bottleneck. This explains the diminishing returns in throughput we see in the IBM DB2, Oracle, and PostgreSQL results when the chunk size increased from 1000 to 5000.
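
As a back-of-envelope illustration of that trade-off, chunk size can be bounded by the memory available to the indexer. The formula and numbers below are illustrative assumptions only, not a SAS-provided sizing rule.

```python
def max_safe_chunk_size(heap_bytes: int, bytes_per_record: int,
                        concurrent_chunks: int, headroom: float = 0.5) -> int:
    """Illustrative upper bound: keep all in-flight chunks within a
    fraction (headroom) of the indexer's available heap."""
    budget = heap_bytes * headroom
    return int(budget / (bytes_per_record * concurrent_chunks))

# Assumed example: 4 GiB heap, ~2 KiB per enriched record, 4 chunks in flight.
print(max_safe_chunk_size(4 * 1024**3, 2 * 1024, 4))  # 262144
```

Per the terminology section, different object types have different memory requirements, so the per-record estimate is the part most likely to vary in practice.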

 

The same resourcing pressure also explains the failures we saw in the Oracle tests with the largest volumes and chunk sizes: we were simply reading too much into the indexer's memory. More can be understood about the mechanisms running in parallel by looking at the workers which operate as part of indexing.

 

Further Testing

 

Further testing is always needed. Automation of this testing will allow for even simpler and more frequent comparisons to be drawn as the indexer improves version to version.

 

Areas requiring further testing based on the discoveries here are:

 

  • Replicating these tests with more representative environments; the hypothesis being that we'd see even bigger improvements, or at least see the results scale.
  • Identifying the resourcing issues preventing chunk size from having an effect in SQL Server and Teradata.
  • Testing the effects of the other tunable components of the indexer:
    • Partition multiplier
    • Partition size
    • Worker pools
  • Testing combinations of these settings together.

 

It’s worth noting that the above settings are well documented in the indexer performance tuning chapter of the SAS Visual Investigator: Administrator’s Guide.

 

 

Find more articles from SAS Global Enablement and Learning here.
