Presentation: Three SQL Server Always Encrypted Findings.

On Friday I gave a presentation at the April meeting of the newly created Santa Fe SQL Server User Group santafesql.org.

I was a little nervous since I haven’t given a presentation since a DOE conference in Amarillo, TX in 1991. But in spite of my nerves, it went OK since it was a short 30-minute talk and since we had a great group of attendees. The group was patient with a couple technical difficulties and interacted well when I lost focus doing things like shifting back from slides to demos.

My three findings were amazement, horror, and satisfaction.

The first finding, amazement, was the subject of my last blog post. SQL Server’s ability to flawlessly upgrade an Always Encrypted database over an upgrade and a platform shift to Linux is amazing. I covered it more thoroughly in the blog post than I did in the presentation Taking SQL Server Always Encrypted on a road trip.

I had some pretty serious technical issues in my presentation.   For my demos, I used my paperspace client machine, a Windows 10/Windows Server 2016 at paperspace.com which was a little too risky. There was a little lag in the demos, but my bigger problems were from the laptop that I used. The display was very pixellated with little purple and green dots that came from the hdmi port on the laptop.

In the first demo, I zoomed in on the cygwin ssh command to setup the tunnel to SQL Server. I talked about the tunneled port to the Linux database server and ran sample queries on an Always Encrypted table.

The second finding was horror. I haven’t covered the horror in a blog post yet. In a recent programming job, I needed to write a C# program to load 45 million rows of data and update it with over 50000 rows of daily updates. We were required to keep the data encrypted since it was sensitive, which was the use case for Always Encrypted. I also needed to unduplicate the data. I turned to a nice ETL pattern by Veronika Peralta in an academic paper, Extraction and Integration of MovieLens and IMDb Data, apmd.prism.uvsq.fr.

The paper showed how to extract, clean up, unduplicate data and more. To unduplicate the author outlined a sample query to do a group-by on the key attributes. I quickly discovered that I couldn’t use group by on the key attributes. This is a similar version of the query using the NOAA data:


SELECT
noaa_year, noaa_jday, noaa_month, noaa_day, noaa_hour, noaa_min,
count(*) AS dcount,
min(dt ), max(dt ),
min(zen ), max(zen ),
min(dw_solar ), max(dw_solar ),
min(dw_solar_qc ), max(dw_solar_qc ),
...snip...
min(temp ), max(temp ),
min(temp_qc ), max(temp_qc ),
min(pressure ), max(pressure ),
min(pressure_qc ), max(pressure_qc ),
min(file_date ), max(file_date ),
min(added_date ), max(added_date )
FROM _raw_SurfRad
GROUP BY
noaa_year, noaa_jday, noaa_month, noaa_day, noaa_hour, noaa_min
HAVING COUNT(*) > 1;

This resulted in the 33299 error:


Msg 33299, Level 16, State 2, Line 4
Encryption scheme mismatch for columns/variables 'ID'.
The encryption scheme for the columns/variables is (encryption_type = 'DETERMINISTIC', encryption_algorithm_name = 'AEAD_AES_256_CBC_HMAC_SHA_256', column_encryption_key_name = 'noaakey_cek', column_encryption_key_database_name = 'noaaC')
and the expression near line '4' expects it to be (encryption_type = 'PLAINTEXT') (or weaker).

This error sent me back to a review of college data structures books. I came up with an alternative solution in the middle tier. This was to sort the data first, then after sorting on the group-by columns all duplicates would be in adjacent groups in (O(Nlog(N)) time). The database was able to help a little by sorting, which only took 48 seconds on a large unindexed table.
This only left adjacencies to process. So I walked through the full sorted list, putting groups of duplicates into a dictionary object and sending the dictionary object to a method which used a linq query on the dictionary to find the min and max values and then put one resulting record into a duplicate table. This resulted in the ugliest code I think I’ve ever written: noaa-etl-daily My joke was that the code looked like an American flag, lots of red, white and blue:

The code was even harder to debug. The debugger did some very bizarre jumping around when stepping through a linq query.

Other horrors that I discussed were that the encrypted columns needed about 3X more storage space. Also, the encryption wizard will fail with a client side memory error when trying to do initial encryption on a large table, so pre-encrypting the empty table is a necessary step.

There were some concerns and questions from the audience regarding using encrypted columns in where clauses with randomized encryption. I showed an example with deterministic encryption. Howerver, I’ll need to add some more demo queries with randomized encrypted columns. My claim is if the query is parameterized properly, then it is possible to query on an exact value. I showed the checkbox in SSMS for parameterized variables.

The third finding was satisfaction. At my last job where I used Always Encrypted it was a good feeling to know that the data was secure on the database server. My message was that it helped me sleep better at night. The other good findings I discussed (good from a dba’s perspective) are that the risks are in the middle tier and not in the database server, and that proper certificate management practices by the development team in the middle tier are the key to mitigating these risks. I also added that a satisfying finding (from a dba’s perspective) was that the performance hit was limited to cpu utilization the middle tier, and not the database server or the network.

My concluding message with the SQL Server Always Encrypted feature was a word of warning when working with vendors who might make empty promises about delivering new technologies. It’s really tempting to think that a vendor can just turn on AE to solve the encryption problem. The changes to the schema and code may be substantial and may essentially turn into a complete rewrite if the application has very involved SQL queries.

The group gave me lots of feedback that will help me with my submission to an upcoming sqlsaturday in Denver.  I now have a great list of questions for which I’ll have to go get answers. I’ll add them as edits to this post.

Here are the links from the presentation:

www.researchgate.net
apmd.prism.uvsq.fr
Two links to Veronika Peralta’s paper, Extraction and Integration of MovieLens and IMDb Data.

t-sql-tuesday-69-encryption
Early initial blogs on AE, look at Aaron’s blog for a more detailed look at types of potential errors.

www.databasejournal.com
A good introduction to AE.

noaa-etl-daily
The horror, the C# code I wrote by following Veronica’s ETL pattern, with workarounds in Linq.