
SAS Visual Investigator - Chunk Size and Indexer Performance


This post is adapted from the SAS R&D article Have you tried increasing your chunk size? written by Liam Macpherson, Senior Software Developer at SAS.

 

In SAS Visual Investigator, data must be indexed before it is searchable by end users. Re-indexing is often necessary when the configuration of existing objects (entities, relationships, and transactions) has been updated. The performance of the indexer is therefore vital.

 

Under global properties, the Indexer Partition option allows SAS Visual Investigator administrators to optimize parallel data processing during indexing. To enable this, data is divided into equal-sized segments, known as partitions. Administrators control how many partitions are processed simultaneously; within each partition, the system reads, enriches, and writes data in chunks, starting from the beginning of the partition and working through to the end.

 

There are three key settings that influence the size of the partitions and chunks, sketched together in the example after this list:

 

  • Partition multiplier
  • Partition size
  • Chunk size
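
To make these settings concrete, here is a minimal sketch of the partition-and-chunk arithmetic, assuming the multiplier simply scales the effective partition size. The formula and the partition settings used in the example are illustrative assumptions, not the indexer's actual implementation.

```python
import math

def plan_indexing(total_records: int, partition_size: int,
                  partition_multiplier: int, chunk_size: int) -> None:
    """Illustrative only: how records might be split into partitions,
    then worked through chunk by chunk within each partition."""
    # Assumption: the multiplier scales the effective partition size.
    effective_partition = partition_size * partition_multiplier
    partitions = math.ceil(total_records / effective_partition)
    chunks_per_partition = math.ceil(effective_partition / chunk_size)
    print(f"{partitions} partitions, each processed in "
          f"{chunks_per_partition} chunks of up to {chunk_size} records")

# Example: the 2,000,000-record table used in the tests below,
# with assumed partition settings.
plan_indexing(2_000_000, partition_size=100_000,
              partition_multiplier=1, chunk_size=5_000)
```

Raising the chunk size reduces the number of chunks per partition, so each unit of work is larger: fewer round trips, but more data held in memory at once.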

 

This post focuses on chunk size: its usage and its limitations. It also explores how adjusting the chunk size can enhance indexing performance, ultimately improving data accessibility and helping SAS Visual Investigator users make faster and more informed decisions.

 

01_AL_SAS_VI_index_chunking_1.png


 

Introduction

 

Indexing is often seen as a black box: data in, search results out. So when something breaks or doesn't perform as expected, there is often little understanding of why, or of what can be done to resolve it. And when we want to dig deeper into one particular mechanism, we often find ourselves opening multiple black boxes nested within one another.

 

We have taken steps to improve output and clarity. The Job Monitor gives us indexer metrics which provide a peek into the first box, showing how long the process spent in each of the nested black boxes.

 

02_AL_SAS_VI_index_chunking_1p5.png

 

Adding tuning settings directly to the SAS Visual Investigator Administrator User Interface (Manage Investigate and Search) allows some tweaking based on previous experience or guidance from experts. The indexer team carried out formal testing using these tuning settings and has analyzed the results here.

 

This testing was carried out to verify a fix for a particular customer, and the results far exceeded our expectations: an original throughput of 62,500 records/minute increased to 1.2 million records/minute. That is an 1820% increase in throughput, since (1,200,000 − 62,500) / 62,500 = 18.2. In our own testing, we saw marked increases between the pre-fix and post-fix versions.

 

These improvements help, but more testing and learning are required, as is making even more parts of the process visible to end users.

 

Terminology

 

Search Index: An external piece of software which stores data in such a way that it can be easily accessed and searched. In SAS Visual Investigator this has been Elasticsearch (prior to version 10.8) and OpenSearch (version 10.8 and later).

 

Indexing: This is the process of taking external data, enriching it and loading it into the search index.

 

Bulk Indexing: This is taking all the data stored for a given data object and loading it all into the search index.

 

Incremental Indexing: This takes a subset of the data and loads it into the search index based on the date at which it was most recently updated. The intent is to index smaller volumes of data more frequently.

 

Chunk Size: The smallest unit of processing. A chunk of data is read, enriched, and indexed as a single unit. You must consider memory implications when setting the chunk size, as different object types have different memory requirements.
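
As a rough mental model, the read-enrich-index cycle over one partition can be sketched as follows. The function names, record shape, and incremental-indexing filter are illustrative assumptions, not the indexer's actual API.

```python
from datetime import datetime
from typing import Iterable, Optional

CHUNK_SIZE = 5_000  # the tunable setting this post focuses on

def enrich(chunk: list[dict]) -> list[dict]:
    return chunk  # placeholder for enrichment logic

def write_to_search_index(chunk: list[dict]) -> None:
    pass  # placeholder for a bulk write to OpenSearch/Elasticsearch

def index_partition(records: Iterable[dict],
                    last_run: Optional[datetime] = None) -> None:
    """Illustrative only: process one partition in fixed-size chunks."""
    chunk: list[dict] = []
    for record in records:
        # Incremental indexing: only records updated since the last run.
        if last_run is not None and record["updated_at"] <= last_run:
            continue
        chunk.append(record)
        if len(chunk) == CHUNK_SIZE:
            write_to_search_index(enrich(chunk))
            chunk = []
    if chunk:  # flush the final, partially filled chunk
        write_to_search_index(enrich(chunk))
```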

 

Considerations

 

Any testing of the indexer has two main limitations: it is fixed in time, and it is tied to several factors outside of our control. The indexers are constantly evolving and improving, while organization-specific environments and use cases are nuanced and complex, so further research is always needed.

 

At the time of writing, the indexer is at version 3.23.28; the testing was carried out against version 1.6.39 for 10.8 Hot Fix 6, as a means of verifying a proposed fix the indexer team was putting forward. We extended it to cover more data stores and tweaked it to give as full a picture as we could get, while still confirming that the customer's indexing would perform as well as it could in Hot Fix 6.

 

Many releases have followed and aspects of the indexer have changed, but this testing and result set can act as a point of comparison for some of the variables which customers have control over. We also hope it will guide testing we want to carry out in the future.

 

We are not testing:

 

  • Relationships
  • Transactions
  • Customer representative data or deployments
  • The tweaking of any other settings (everything else is left at its default)
  • The latest version of the indexer, which may contain performance improvements that affect the benefit of tweaking any particular setting

 

Testing Methodology

 

The testing was repeated against all our currently supported data stores, making for simple comparison of results. These tests analyze the effect of changing the chunk size with various data loads. We looked at two questions for each test case:

 

  1. Did the indexing complete?
  2. How long did the indexing take to complete?

 

Our variable factors, which combine into the test matrix sketched after this list, were:

 

  1. Chunk size: set intervals of 1,000, 5,000, 7,500, and 10,000.
  2. Target records: set intervals of 10,000, 100,000, 1,000,000, and 2,000,000.
  3. Data stores: IBM DB2, Microsoft SQL Server, MySQL, Oracle, PostgreSQL and Teradata.
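
Taken together, these factors form the full test matrix. Here is a small sketch enumerating it, with the values taken directly from the lists above:

```python
from itertools import product

chunk_sizes = [1_000, 5_000, 7_500, 10_000]
target_records = [10_000, 100_000, 1_000_000, 2_000_000]
data_stores = ["IBM DB2", "Microsoft SQL Server", "MySQL",
               "Oracle", "PostgreSQL", "Teradata"]

# Every combination is one incremental test case: 4 x 4 x 6 = 96 runs.
test_cases = list(product(data_stores, chunk_sizes, target_records))
print(len(test_cases))  # 96
```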

 

All tests were carried out against tables containing similar, partially randomly generated data in the following format:

 

  • Two primary key columns of type VARCHAR.
  • A single TIMESTAMP column.

 

An example of this can be seen below.

 

03_AL_SAS_VI_index_chunking_2.png
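
For reference, here is a minimal sketch of generating rows in this shape. The column names and value formats are assumptions made for illustration; only the column types (two VARCHAR primary keys plus a TIMESTAMP) come from the test setup above.

```python
import random
import string
from datetime import datetime, timedelta

def random_varchar(length: int = 12) -> str:
    return "".join(random.choices(string.ascii_uppercase + string.digits,
                                  k=length))

def generate_rows(n: int) -> list[tuple[str, str, datetime]]:
    """Rows of (pk_one VARCHAR, pk_two VARCHAR, updated_at TIMESTAMP).
    Note: uniqueness of the generated keys is not enforced in this sketch."""
    base = datetime(2024, 1, 1)
    return [(random_varchar(), random_varchar(),
             base + timedelta(seconds=random.randrange(90 * 24 * 3600)))
            for _ in range(n)]

rows = generate_rows(10_000)  # smallest target-record interval in the tests
```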

 

An external entity was created in SAS Visual Investigator backed by this table. We also did a baseline full index of the entire table (always 2,000,000 records in these tests), which follows a slightly different path and is therefore useful for comparing full indexing against incremental.

 

Testing was carried out against a development machine with out-of-the-box configuration. The machines have 32GB of RAM and 4 CPUs.

 

Results

 

The results section is split by data store:

 

IBM DB2

Incremental

04_AL_SAS_VI_index_chunking_3.png

 

05_AL_SAS_VI_index_chunking_14.png

 

Bulk

06_AL_SAS_VI_index_chunking_4.png

 

07_AL_SAS_VI_index_chunking_4p5.png

 

Microsoft SQL Server

Incremental

08_AL_SAS_VI_index_chunking_5.png

 

09_AL_SAS_VI_index_chunking_15.png

 

Bulk

10_AL_SAS_VI_index_chunking_6.png

 

11_AL_SAS_VI_index_chunking_6p5.png

 

MySQL

Incremental

12_AL_SAS_VI_index_chunking_7.png

 

13_AL_SAS_VI_index_chunking_16.png

 

Bulk

14_AL_SAS_VI_index_chunking_8.png

 

15_AL_SAS_VI_index_chunking_8p1.png

 

Oracle

Incremental

16_AL_SAS_VI_index_chunking_8p5.png

 

17_AL_SAS_VI_index_chunking_17.png

 

Bulk

18_AL_SAS_VI_index_chunking_9.png

  

15a_AL_SAS_VI_index_chunking_9p5.png

 

PostgreSQL

Incremental

19_AL_SAS_VI_index_chunking_10.png

 

20_AL_SAS_VI_index_chunking_18.png

 

Bulk

21_AL_SAS_VI_index_chunking_11.png

  

22_AL_SAS_VI_index_chunking_11p5.png

 

Teradata

Incremental

23_AL_SAS_VI_index_chunking_12.png

 

24_AL_SAS_VI_index_chunking_19.png

 

Bulk

25_AL_SAS_VI_index_chunking_13.png

 

26_AL_SAS_VI_index_chunking_13p5.png

 

Analysis

 

This section looks at some of the results with a view to using them to guide tuning, or as a point of comparison for other result sets going forward.

 

Incremental vs Bulk

 

Incremental indexing being slower than bulk is the expected behavior: the mechanisms through which we split the workload of records to be read, processed, and written to the search index introduce extra overhead. Comparisons between the two are therefore intended to show that chunk size has an effect on both bulk and incremental indexing, rather than to show how much quicker bulk indexing is than incremental indexing.

 

It is worth noting that bulk indexing can be viewed as the simplest method of indexing, and therefore an ideal to aim for. That is why seeing incremental indexing approach bulk indexing times through chunk size tuning is an excellent indicator that systems can indeed benefit from tuning.

 

Data Stores

 

The first thing that jumps out is the stark difference between the data stores. Teradata takes around 105 minutes with a chunk size of 5000, while PostgreSQL takes under 10 minutes to process the same volume of records with the same chunk size. This speaks more to the differences between the data stores and the hardware backing them than to how the indexer interacts with each one. There are some specific limitations which the data vendors force upon the indexer's processes, but the main reason for the huge discrepancy is simply that these are different data stores which handle queries differently and run on different hardware. This should not be read as "don't bother tuning if you are using Teradata/MySQL/Microsoft SQL Server". It may just be that chunk size should not be the first port of call for tuning.

 

Chunk Size Recommendation

 

Over the years we have changed the default chunk size. Generally speaking, we want it increased to the point where we are utilizing as much of the available resources as we can: maximizing the indexer's memory usage without overloading it, and reading from the database and writing to the search index as much, and as quickly, as we can. Sometimes one of these resources will push back or cause a bottleneck. This explains the diminishing returns in throughput we see in the IBM DB2, Oracle, and PostgreSQL results when the chunk size increased from 1000 to 5000.
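
As a back-of-envelope illustration of that trade-off, chunk size can be bounded by the memory available to the indexer. The formula and numbers below are illustrative assumptions only, not a SAS-provided sizing rule.

```python
def max_safe_chunk_size(heap_bytes: int, bytes_per_record: int,
                        concurrent_chunks: int, headroom: float = 0.5) -> int:
    """Illustrative upper bound: keep all in-flight chunks within a
    fraction (headroom) of the indexer's available heap."""
    budget = heap_bytes * headroom
    return int(budget / (bytes_per_record * concurrent_chunks))

# Assumed example: 4 GiB heap, ~2 KiB per enriched record, 4 chunks in flight.
print(max_safe_chunk_size(4 * 1024**3, 2 * 1024, 4))  # 262144
```

Per the terminology section, different object types have different memory requirements, so the per-record estimate is the part most likely to vary in practice.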

 

The same resourcing pressure also explains the failures we saw in the Oracle tests with the largest volumes and chunk sizes: we were simply reading too much into the indexer's memory. More can be understood about the mechanisms running in parallel by looking at the workers which operate as part of indexing.

 

Further Testing

 

Further testing is always needed. Automation of this testing will allow for even simpler and more frequent comparisons to be drawn as the indexer improves version to version.

 

Areas requiring further testing based on the discoveries here are:

 

  • Replicating these tests with more representative environments; the hypothesis being that we'd see even bigger improvements, or at least see the results scale.
  • Identifying the resourcing issues preventing chunk size from having an effect in SQL Server and Teradata.
  • Testing the effects of the other tunable components of the indexer:
    • Partition multiplier
    • Partition size
    • Worker pools
  • Testing combinations of these settings together.

 

It’s worth noting that the above settings are well documented in the indexer performance tuning chapter of the SAS Visual Investigator: Administrator’s Guide.

 

 

Find more articles from SAS Global Enablement and Learning here.
