SAS Product Suggestions

Your product suggestions make our software better!

The SAS parquet libname engine is unable to determine the length of string variables and by default assigns a length of 32767 bytes.  For datasets of even medium size (~100k rows) this results in unusably slow read times.  The recommended workaround is to use CHAR_COLUMN_LIMIT to change the default size allocation.  This however is a substandard solution for a couple of reasons:


1. Not all columns need the same length: a flag column ("Y" / "N") doesn't need 400 bytes of space allocated just because another column does need that much space
2. Users are required to specify this in advance, potentially even before they know what is in the dataset; if no spec was provided with the data, the user may not know what an appropriate length limit is
3. If the user selects a length that is too short, SAS silently truncates the data with no warning or error


The feature request here is that SAS should determine the length of each column individually and assign the appropriate amount of space, with no prior knowledge or input required from the user.

6 Comments
LinusH
Tourmaline | Level 20

This means SAS needs to query the data to be able to assign an appropriate length, since I don't think this information is available in Parquet metadata.

This is a big ask, and will potentially consume a lot of resources during libname assignment.

The basic problem here is that Base SAS doesn't support VARCHAR/STRING data types. CAS does, however, but your use case might not be suitable for a CAS-only implementation.

Maybe SAS can do something using Data Catalog agents somehow, so this can be fixed in batch (could be a feasible solution for a multi-user environment?).

Patrick
Opal | Level 21

String columns without a defined length are always a challenge - not only with Parquet files but also with any DB that supports character-type columns without a defined length.
SAS V9 doesn't support such column types, so the best it can do at a metadata level is assign the maximum length for SAS character columns. The CAS Varchar column allows for such undefined lengths and then somehow works out the max length "on the go" when converting the data to another format like sas7bdat under compute.

I often consider these string(*) data types lazy, because very often people would/could know the actual max length when creating the table (= know your data).

What I would wish for on the SAS side is a way to predefine metadata (other than the SAS 9 table metadata object) so that when accessing a table from SAS it uses this metadata to read the data into a SAS table. Something that also works under SAS Viya compute.

gowerc
Fluorite | Level 6

> This means SAS needs to query the data to be able to assign an appropriate length, since I don't think this information is available in Parquet metadata. This is a big ask, and will potentially consume a lot of resources during libname assignment.

Yes and no.  While it is an expensive operation, it is actually negligible compared to the overall runtime SAS takes to convert the data into the sas7bdat format.  As an example, I created a dummy dataset with 25 character columns and 1 million rows, with no missing values and cell lengths varying between 1 and 40 for each column.  With a manually specified fixed width of 40, my current SAS server was able to read in the dataset in 11.05 seconds (this assumes you know the length in advance, and ignores the fact that in real examples not every column would be the same length; also for reference, incorrectly setting the length limit to 1600 instead takes 28.47 seconds, a significant performance degradation).

The Python script below uses the open-source Apache Arrow library (pyarrow) to read in the data and loop through each character column to determine its maximum length individually.  This script took 0.721 seconds to run on the same dataset on the same server. Yes, it has to read in the entire dataset, which is an expensive operation, but I'm hoping this example shows that it is (relatively) cheap compared to actually converting the data format to sas7bdat, to the point where I doubt a user would even notice.

At the very least I think it would be good to have this as an option the user could toggle, so they can choose between performance and usability (it could even be off by default for backwards compatibility), e.g. something like

option PARQUET_AUTO_STRING_LENGTH = yes;

  
Python script for reference:

import argparse

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq


parser = argparse.ArgumentParser(
    description='Report the maximum length of each string column in a Parquet file'
)
parser.add_argument(
    'infile',
    type=str,
    nargs=1,
    help='The file to read-in'
)
args = parser.parse_args()

table = pq.read_table(args.infile[0])


def get_string_len(col) -> int:
    # Byte length of every cell, then the maximum across the column
    value_len = pc.call_function("binary_length", [col])
    maxlen = pc.call_function("max", [value_len]).as_py()
    return maxlen


col_lens = {}
for col_name in table.column_names:
    col_data = table[col_name]
    if pa.types.is_string(col_data.type):
        col_lens[col_name] = get_string_len(col_data)

print(col_lens)

 Example of use:

python3 pqmaxlen.py ./data/t_char40_s3.parquet


At an absolute minimum I would expect SAS to allow us to specify the column lengths individually, as it is a huge waste of processing time to allocate, say, $400 to a flag variable just because one other string variable in the dataset does need the full $400 width. But given how cheap it is (relatively) to determine the max length, I really would expect SAS to handle this for the user and abstract it away from them.
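If per-column limits were exposed, the output of the script above could even be turned into the required spec automatically. A minimal sketch, assuming the `col_lens` dict produced by the earlier script, which renders it as a standard SAS data-step length statement:

```python
def sas_length_statement(col_lens: dict[str, int]) -> str:
    """Render a {column: max length} dict as a SAS length statement,
    e.g. for use in a data step that reads the Parquet libref."""
    parts = " ".join(f"{name} ${length}" for name, length in col_lens.items())
    return f"length {parts};"


print(sas_length_statement({"flag": 1, "comment": 400}))
# length flag $1 comment $400;
```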

SASKiwi
PROC Star

@gowerc - Are you aware that most SAS software enhancements now happen only in SAS Viya and not SAS 9.4? Viya is updated monthly; SAS 9.4 only gets maintenance releases, every 2 years or so. Improvements like this are only likely in Viya unless there is huge customer demand.

SASKiwi
PROC Star

FYI there is the SCANSTRINGCOLUMNS option for certain Viya relational database connectors which does exactly what you want, just not for Parquet; in SAS 9.4 it is only available for Google BigQuery.

gowerc
Fluorite | Level 6

@SASKiwi - Apologies, I'm not aware of the relationships between all the different products. I believe my company currently licenses SAS Analytics Pro, which is surfaced to users via Docker containers running on Kubernetes.

 

> FYI there is the SCANSTRINGCOLUMNS option for certain Viya relational database connectors which does exactly what you want, just not for Parquet; in SAS 9.4 it is only available for Google BigQuery.


Yes, that looks like it would be perfect if it could be extended to cover the Parquet libname engine as well.