Semi-Structured JSON Data Handling Automation: Part 2

Semi-Structured JSON Data Handling Automation: Part 2


This blog post presents a technique for automatically building database views based on semi-structured JSON data stored in Snowflake tables. It’s real time-saver, and you’ll find the complete code plus a usage example near the bottom of this post. Feel free to jump there if you’re in need of a quick solution. Or, read on if you’d like to dig a little deeper – you’ll find details here on exactly how everything works!

Overview

The first of this two-part blog post series described a technique for automating creation of a database view over JSON data that has been loaded into a table column of type VARIANT.  This is typically done to hide the complexity associated with SQL SELECT statements that must reference JSON document elements by their hierarchical paths.  But, it’s a manual process to both create and maintain these views, so having an automated process can help reduce errors and save time.

Background

The stored procedure outlined in the first blog post was pretty basic in its construction.  Its primary purpose was to illustrate the concept of how the overall process essentially works. Here are some topics that we’ll consider in this blog post to enhance that stored procedure so it’s ready for regular use in the real world:

  • Column case : We should have an option that allows the generated view’s column case match that of the JSON elements, rather than being uppercase (which is Snowflake’s default behavior).
  • Column types – We should also have an option to disable defining the datatype of each view column such that it matches that of the underlying data. This will allow us to handle cases where multiple JSON documents have attributes with the same name but actually contain data with different datatypes. I’ll provide more details for this in the sections below.
  • Arrays – JSON documents can contain both simple and object arrays, and the existing stored procedure simply returns these as view columns of datatype ARRAY.  It would be desirable to have the contents of these arrays exposed as separate view columns.

Column Case

Here’s an example of one of the documents in the sample JSON dataset from the first blog post:

{
  "ID": 1,
  "color": "black",
  "category": "hue",
  "type": "primary",
  "code": {
    "rgb": "255,255,255",
    "hex": "#000"
  }
}

Now, compare that with the Data Definition Language generated by the existing stored procedure:

CREATE OR REPLACE VIEW mydatabase.public.colors_vw AS 
SELECT 
json_data:"ID"::FLOAT as ID, 
json_data:"category"::STRING as category, 
json_data:"code"."hex"::STRING as code_hex, 
json_data:"code"."rgb"::STRING as code_rgb, 
json_data:"color"::STRING as color, 
json_data:"type"::STRING as type
FROM mydatabase.public.colors

When we look at the view’s description, we see that the column names are uppercased (which is Snowflake’s default behavior):

desc view colors_vw;
+----------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------+
| name     | type              | kind   | null? | default | primary key | unique key | check | expression | comment |
|----------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------|
| ID       | FLOAT             | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| CATEGORY | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| CODE_HEX | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| CODE_RGB | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| COLOR    | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| TYPE     | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
+----------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------+

But what if we want the case of the generated view column names to match that of the JSON document elements? It turns out that the fix is relatively easy. What we need to do is enclose the view column name aliases within double quotes, like so:

CREATE OR REPLACE VIEW mydatabase.public.colors_vw AS 
SELECT 
json_data:"ID"::FLOAT as "ID", 
json_data:"category"::STRING as "category", 
json_data:"code"."hex"::STRING as "code_hex", 
json_data:"code"."rgb"::STRING as "code_rgb", 
json_data:"color"::STRING as "color", 
json_data:"type"::STRING as "type"
FROM mydatabase.public.colors;

The case of the generated view’s columns will now match those of the JSON document elements:

desc view colors_vw;
+----------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------+
| name     | type              | kind   | null? | default | primary key | unique key | check | expression | comment |
|----------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------|
| ID       | FLOAT             | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| category | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| code_hex | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| code_rgb | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| color    | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| type     | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
+----------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------+

We’ll implement this by adding a COLUMN_CASE parameter to the stored procedure definition:

create or replace procedure create_view_over_json (TABLE_NAME varchar, COL_NAME varchar, VIEW_NAME varchar, COLUMN_CASE varchar)

I happen to like self-describing parameters, so we’ll give the user who calls this stored procedure to specify either of these two settings for this parameter:

  • ‘uppercase cols’ : The generated view column names will be uppercased (the current behavior)
  • ‘match col case’ : The generated view column names will match those of the JSON document elements

Since these parameter settings differ in their first character, we can simply check to see if the first character of the COLUMN_CASE parameter is set to ‘M’ and, and if it is, add double-quotes around the column aliases that get generated for the view creation DDL.

First, here’s the parameter check, with assignment to the alias_dbl_quote variable:

var alias_dbl_quote = "";
if (COLUMN_CASE.toUpperCase().charAt(0) == 'M') {
      alias_dbl_quote = "\""; }

Later, as we loop through the elements in the JSON documents and build column expressions, we’ll include a reference to the alias_dbl_quote variable:

while (element_res.next()) {
      col_list += COL_NAME + ":" + element_res.getColumnValue(1);                       // Path name
      col_list += "::" + element_res.getColumnValue(2);                                 // Datatype
      col_list += + alias_dbl_quote + element_res.getColumnValue(3) + alias_dbl_quote;  // Alias
     }

It turns out that this ability to enclose view column aliases with double quotes solves another potentially thorny problem.  Sometimes JSON documents contain elements whose names are actually reserved words, which can cause SQL compilation errors to be thrown during the CREATE VIEW execution. Here’s an example of what that might look like:

{
  "ID": 1,
  "color": "black",
  "category": "hue",
  "table": "primary",
  "code": {
    "rgb": "255,255,255",
    "hex": "#000"
  }
}

We’d get an error if we tried to generate a view that had TABLE as a column name:

CREATE OR REPLACE VIEW mydatabase.public.colors_vw AS 
SELECT 
json_data:"ID"::FLOAT as ID, 
json_data:"category"::STRING as category, 
json_data:"code"."hex"::STRING as code_hex, 
json_data:"code"."rgb"::STRING as code_rgb, 
json_data:"color"::STRING as color, 
json_data:"table"::STRING as table
json_data:"ID"::STRING as ID
FROM mydatabase.public.colors;

The easiest work-around in this case is to enclose those column aliases with double quotes – and we now have that option with our new COLUMN_CASE parameter:

CREATE OR REPLACE VIEW mydatabase.public.colors_vw AS 
SELECT 
json_data:"ID"::FLOAT as "ID", 
json_data:"category"::STRING as "category", 
json_data:"code"."hex"::STRING as "code_hex", 
json_data:"code"."rgb"::STRING as "code_rgb", 
json_data:"color"::STRING as "color", 
json_data:"table"::STRING as "table"
FROM mydatabase.public.colors;

Column Types

Our existing stored procedure sets the datatype of each view column so it matches that of the underlying data. However, sometimes there are cases where multiple JSON documents have attributes with the same name but actually contain data with different datatypes. For example, one document might have an attribute that contains the value “1” while another document has the same attribute with a value of “none”. Here’s what that might look like:

{
  "ID": 1,
  "color": "black",
  "category": "hue",
  "type": "primary",
  "code": {
    "rgb": "255,255,255",
    "hex": "#000"
  }
},{
  "ID": "none",
  "color": "white",
  "category": "value",
  "code": {
    "rgb": "0,0,0",
    "hex": "#FFF"
  }
}

The current stored procedure would try to generate a view containing two columns of the same name, but with different datatypes:

CREATE OR REPLACE VIEW mydatabase.public.colors_vw AS 
SELECT 
json_data:"ID"::FLOAT as ID, 
json_data:"ID"::STRING as ID,
json_data:"category"::STRING as category, 
json_data:"code"."hex"::STRING as code_hex, 
json_data:"code"."rgb"::STRING as code_rgb, 
json_data:"color"::STRING as color, 
json_data:"type"::STRING as type
FROM mydatabase.public.colors

That obviously won’t work.  There’s probably a couple of different ways around this problem. For the sake of simplicity (or maybe because I’m just basically lazy) I’ve chosen to add a parameter that disables setting the datatype of each view column so it matches that of the underlying data and instead forces all datatypes to STRING (or ARRAY, as we’ll discuss later).  We’ll start off by adding a COLUMN_TYPE parameter to the stored procedure definition:

create or replace procedure create_view_over_json (TABLE_NAME varchar, COL_NAME varchar, VIEW_NAME varchar, COLUMN_CASE varchar, COLUMN_TYPE varchar)

Next, we’ll give the user who calls this stored procedure to specify either of these settings for this parameter:

  • ‘match datatypes’ : The generated view column will match those of the corresponding JSON data attributes (the current behavior)
  • ‘string datatypes’ : The generated view column names will be set to STRING or ARRAY

Once again, since these parameter settings differ in their first character, we can simply check to see if the first character of the COLUMN_TYPE parameter is set to ‘S’ and – if so – set the datatypes of the columns that get generated for the view creation DDL to STRING or ARRAY.  

First, here’s the parameter check, with assignment to the attribute_type variable:

if (COLUMN_TYPE.toUpperCase().charAt(0) == 'S') {
   attribute_type = "DECODE (typeof(f.value),'ARRAY','ARRAY','STRING')"; }

You might recall from the first blog post that this variable was originally defined like so:

var attribute_type = "DECODE (substr(typeof(f.value),1,1),'A','ARRAY','B','BOOLEAN','I','FLOAT','D','FLOAT','STRING')";

So, what we’re doing is overriding that definition with this new one, which simply returns either STRING or ARRAY for the column datatype.  Here’s what that looks like for our current example:

CREATE OR REPLACE VIEW mydatabase.public.colors_vw AS 
SELECT 
json_data:"ID"::STRING as ID,
json_data:"category"::STRING as category, 
json_data:"code"."hex"::STRING as code_hex, 
json_data:"code"."rgb"::STRING as code_rgb, 
json_data:"color"::STRING as color, 
json_data:"type"::STRING as type
FROM mydatabase.public.colors

Yes, it’s a lazy solution, but it effectively solves the problem!

Arrays

JSON documents can contain both simple and object arrays, and the current stored procedure simply returns these as view columns of datatype ARRAY.  Here’s an example of what a simple array might look like – I’ll simply modify our current example dataset like so:

{
  "ID": 1,
  "color": "black",
  "category": "hue",
  "type": "primary",
  "code": {
    "rgb": [255,255,255],
    "hex": "#000"
  }
}

The existing procedure will generate the corresponding view column with a datatype of ARRAY:

desc view colors_vw;
+----------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------+
| name     | type              | kind   | null? | default | primary key | unique key | check | expression | comment |
|----------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------|
| ID       | FLOAT             | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| CATEGORY | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| CODE_HEX | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| CODE_RGB | ARRAY             | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| COLOR    | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| TYPE     | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
+----------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------+

This could present problems for end users, because when they access the data in this view they’ll get something like this:

select * from colors_vw;
+----------+-------------------+--------+-------+----------+
|  ID | CATEGORY | CODE_HEX | CODE_RGB | COLOR  | TYPE     |
|----+----------+----------+----------+--------+-----------|
|  1 | hue      | #000     | [        | black  | primary   |
|    |          |          |   255,   |        |           |
|    |          |          |   255,   |        |           |
|    |          |          |   255    |        |           |
|    |          |          | ]        |        |           |
|  2 | value    | #FFF     | [        | white  | NULL      |
|    |          |          |   0,     |        |           |
|    |          |          |   0,     |        |           |
|    |          |          |   0      |        |           |
|    |          |          | ]        |        |           |
|  3 | hue      | #FF0     | [        | red    | primary   |
|    |          |          |   255,   |        |           |
|    |          |          |   0,     |        |           |
|    |          |          |   0      |        |           |
|    |          |          | ]        |        |           |
|  4 | hue      | #00F     | [        | blue   | primary   |
|    |          |          |   0,     |        |           |
|    |          |          |   0,     |        |           |
|    |          |          |   255    |        |           |
|    |          |          | ]        |        |           |
|  5 | hue      | #FF0     | [        | yellow | primary   |
|    |          |          |   255,   |        |           |
|    |          |          |   255,   |        |           |
|    |          |          |   0      |        |           |
|    |          |          | ]        |        |           |
|  6 | hue      | #0F0     | [        | green  | secondary |
|    |          |          |   0,     |        |           |
|    |          |          |   255,   |        |           |
|    |          |          |   0      |        |           |
|    |          |          | ]        |        |           |
+----+----------+----------+----------+--------+-----------+

It would be desirable to have the contents of these arrays exposed as separate view columns; that way, end users can look at individual array elements as columnar data rather than being forced to dig around within an array construct.  In order to do this, we’ll need a second query that can extract these individual array elements whenever we come across an element of datatype ARRAY. Here’s how we’ll build it:

var array_query = "SELECT DISTINCT \n"+
                 path_name + " AS path_name, \n" +
                 attribute_type + " AS attribute_type, \n" +
                 alias_name + " AS attribute_name, \n" +
                 "f.index \n" +
                 "FROM \n" + 
                 TABLE_NAME + ", \n" +
                 "LATERAL FLATTEN(" + COL_NAME + ":" + element_res.getColumnValue(1) + ", RECURSIVE=>true) f \n" +
                 "WHERE REGEXP_REPLACE(f.path, '.+(\\\\w+\\\\[.+\\\\]).+', 'SubArrayEle') != 'SubArrayEle' ";  // Don't return elements of nested arrays

Note that this second query looks a lot like the query that returns the elements and their datatypes that we explored in the first blog post.  Here are the main differences between this query and that one:

  • The text of this query is assigned to array_query, while the text of the other query is assigned to element_query.  The statement object and result sets will be similarly named in order to keep everything separate.
  • This query includes a reference to index, which is a column returned by the LATERAL FLATTEN that provides the index of each array element returned by the query. We’ll use this to generate both the element path and the view column alias.
  • The LATERAL FLATTEN in this query includes a reference to the current (outer) element whose datatype was found to be ARRAY – it’s contained within element_res.getColumnValue(1).  That’s the key to being able to seek out the elements of this particular array.

Notice that this query includes a WHERE clause that effectively prevents the return of elements that might be present in any nested arrays.  Suffice it to say that handling nested arrays gets pretty complicated, so we’ll be forced to codify them as columns of type ARRAY (in other words, we’re “exploding out” only the first level arrays here).

We’ll execute this query in a similar fashion to how we executed the element_query in the first blog post:

var array_stmt = snowflake.createStatement({sqlText:array_query});
var array_res = array_stmt.execute();

We can now loop through the array elements returned by this query and build view column expressions that look like this:  

col_name:"code"."rgb"[0]::FLOAT as code_rgb_0

The mechanism we’ll use is very similar to that which was described for the element query in the first blog post.  Here it is:

while (array_res.next()) {
     simple_array_col_list += COL_NAME + ":" + element_res.getColumnValue(1);    // Path name
     simple_array_col_list += "[" + array_res.getColumnValue(4) + "]";           // Array index
     simple_array_col_list += "::" + array_res.getColumnValue(2);                // Datatype
     simple_array_col_list += " as " + alias_dbl_quote + element_res.getColumnValue(3) + "_" + array_res.getColumnValue(4) + alias_dbl_quote;   // Alias
     }

Note the reference to element_res.getColumnValue(1) which gives us the current (outer) element whose datatype was found to be ARRAY – Again, this is the key to being able to seek out the elements of this particular array.  Note also that the current array element’s index (which was returned by that index column) is referenced as array_res.getColumnValue(4) in the path definition, and is also used to generate the column alias.  Incidentally, that column alias incorporates the array element’s alias element_res.getColumnValue(3) as a prefix which helps ensure the resulting view column name is unique.

Once the loop completes, the contents of simple_array_col_list are appended to the view column list contained by the col_list string.

Now that we’ve got the necessary logic in place, the stored procedure is now capable of creating a view that exposes the contents of the array’s elements as separate columns.  Here’s what that would look like for our current example:

CREATE OR REPLACE VIEW mydatabase.public.colors_vw AS 
SELECT 
json_data:"ID"::FLOAT as ID, 
json_data:"category"::STRING as category, 
json_data:"code"."hex"::STRING as code_hex, 
json_data:"code"."rgb"[0]::FLOAT as code_rgb_0, 
json_data:"code"."rgb"[1]::FLOAT as code_rgb_1, 
json_data:"code"."rgb"[2]::FLOAT as code_rgb_2, 
json_data:"color"::STRING as color, 
json_data:"type"::STRING as type
FROM mydatabase.public.colors;

Here’s what we’ll see when we describe the view:

desc view colors_vw;
+------------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------+
| name       | type              | kind   | null? | default | primary key | unique key | check | expression | comment |
|------------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------|
| ID         | FLOAT             | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| CATEGORY   | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| CODE_HEX   | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| CODE_RGB_0 | FLOAT             | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| CODE_RGB_1 | FLOAT             | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| CODE_RGB_2 | FLOAT             | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| COLOR      | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| TYPE       | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
+------------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------+

As a side note, you may remember from the first blog post that I mentioned that view column datatypes will essentially be limited to ARRAY, BOOLEAN, FLOAT, and STRING.  The reason for doing this is to prevent cases where the same element is returned with multiple datatypes similar to what we explored in the Column Types section above.  For example, suppose another instance of this array elsewhere in the JSON document set looked like this: [10.3, 255, 255]. That would cause us to end up with two separate view columns of the same name (CODE_RGB_0) but with datatypes of INTEGER and FLOAT, thus causing an error to be thrown when the stored procedure tries to run the view creation DDL.  So, the best (and, once again, easiest) solution is to cast all numeric datatypes to FLOAT – and that’s why you see FLOAT as the datatype for the numerics in this example’s array.

Finally, here’s what the end result looks like when queried by an end user:

select * from colors_vw;
+----+----------+----------+------------+------------+------------+--------+-----------+
| ID | CATEGORY | CODE_HEX | CODE_RGB_0 | CODE_RGB_1 | CODE_RGB_2 | COLOR  | TYPE      |
|----+----------+----------+------------+------------+------------+--------+-----------|
|  1 | hue      | #000     |        255 |        255 |        255 | black  | primary   |
|  2 | value    | #FFF     |          0 |          0 |          0 | white  | NULL      |
|  3 | hue      | #FF0     |        255 |          0 |          0 | red    | primary   |
|  4 | hue      | #00F     |          0 |          0 |        255 | blue   | primary   |
|  5 | hue      | #FF0     |        255 |        255 |          0 | yellow | primary   |
|  6 | hue      | #0F0     |          0 |        255 |          0 | green  | secondary |
+----+----------+----------+------------+------------+------------+--------+-----------+

Object Arrays

Object arrays add a level of complexity to the mix.  Here’s an example of what they might look like in some JSON source data:

{
  ID: 1,
  primary: false, 
  name:  { first: "Jack", last: "Doe"},
  contact: {
      phone: [
        { type: "work", number:"404-555-1234" },
        { type: "mobile", number:"770-555-1234" } 
             ] ,
      email: "[email protected]" 
      },
},{
  ID: 2,
  name:  { first: "Diane", last: "Smith"},
  primary: true, 
  val: 12342.56,
  contact: {
      phone: [
        { type: "home", number:"678-555-5678" },
        { type: "work", number:"404-555-5678" } 
             ] ,
      email: "[email protected]"
     },
}

The difference between object arrays and (what I’m referring to as) simple arrays is that object array elements are referenced by name rather than by index. Consequently, we have to construct the view column expressions accordingly. For example:

a1.value:"type"::STRING as "phone_type",
a1.value:"number"::STRING as "phone_number"

Now compare this to the view column expression for the simple array (above) you’ll notice two important differences:

  • The element values are referenced by name (“type” and “number”, in this case)
  • An a1.value reference has been introduced in place of the name of the VARIANT column in the JSON source table 

It turns out that we’ll need to include a LATERAL FLATTEN in the generated view definition for each object array that we find in the source JSON structure.  We’ll then alias each of those LATERAL FLATTENs as well – in this case, that alias happens to be a1 since it’s the first (and, in this case, only) object array in the source JSON structure.  Had there been two object arrays then we would have two LATERAL FLATTENs in the generated view definition, with the first aliased as a1 and the second aliased as a2.  Likewise for additional object arrays.

Consequently, the mechanism we used to build view column expressions for object arrays looks a little different from that for simple arrays.  Here it is:

while (array_res.next()) {
     object_array_col_list += "a" + array_num + ".value:" + array_res.getColumnValue(1).substring(1);    // Path name
     object_array_col_list += "::" + array_res.getColumnValue(2);                                        // Datatype
     object_array_col_list += " as " + alias_dbl_quote + element_res.getColumnValue(3) + array_res.getColumnValue(3) + alias_dbl_quote;   // Alias
     }

Once the loop completes, the contents of object_array_col_list are appended to the view column list contained by the col_list string.  What’s different, though, is the inclusion of the LATERAL FLATTENs discussed earlier.  Rather than get into the intricacies of the code, I’ll just show you want gets generated for our example JSON dataset:

CREATE OR REPLACE VIEW mydatabase.public.contacts_vw AS 
SELECT 
json_data:"ID"::FLOAT as "ID", 
json_data:"primary"::BOOLEAN as "primary", 
json_data:"val"::FLOAT as "val", 
json_data:"name"."first"::STRING as "name_first", 
json_data:"name"."last"::STRING as "name_last",
a1.value:"number"::STRING as "contact_phone_number", 
a1.value:"type"::STRING as "contact_phone_type", 
json_data:"contact"."email"::STRING as "contact_email"
FROM mydatabase.public.contacts,
 LATERAL FLATTEN(json_data:"contact"."phone") a1;

Notice how that LATERAL FLATTEN got added at the bottom of the generated view definition, along with the a1 alias.  Again, had there been additional object arrays in the JSON structure then they’d get their own LATERAL FLATTENs (along with corresponding aliases).  

Here’s what the resulting view looks like:

desc view contacts_vw;
+----------------------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------+
| name                 | type              | kind   | null? | default | primary key | unique key | check | expression | comment |
|----------------------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------|
| ID                   | FLOAT             | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| primary              | BOOLEAN           | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| val                  | FLOAT             | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| name_first           | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| name_last            | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| contact_phone_number | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| contact_phone_type   | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| contact_email        | VARCHAR(16777216) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
+----------------------+-------------------+--------+-------+---------+-------------+------------+-------+------------+---------+

And the data:

select * from contacts_vw;
+----+---------+----------+------------+-----------+----------------------+--------------------+----------------------+
| ID | primary |      val | name_first | name_last | contact_phone_number | contact_phone_type | contact_email        |
|----+---------+----------+------------+-----------+----------------------+--------------------+----------------------|
|  1 | False   |     NULL | Jack       | Doe       | 404-555-1234         | work               | [email protected]      |
|  1 | False   |     NULL | Jack       | Doe       | 770-555-1234         | mobile             | [email protected]      |
|  2 | True    | 12342.56 | Diane      | Smith     | 678-555-5678         | home               | [email protected] |
|  2 | True    | 12342.56 | Diane      | Smith     | 404-555-5678         | work               | [email protected] |
+----+---------+----------+------------+-----------+----------------------+--------------------+----------------------+

It’s worth noting here that we’re getting two records for each of the contacts in our example – one for each object array record.  That’s intentional, because object arrays can be composed of any arbitrary number of elements (with unique names) and records. A sufficiently large number of individual object array records – if we were to represent them as individual view columns – could result in the creation of a view with an untenable number of columns.  Thus, the approach I’ve taken with this stored procedure is to represent unique instances of object array elements as view columns, with each object array record materialized as a corresponding view record. The result is effectively a Cartesian product. I suppose this is one of the rare instances where one could make the case that a Cartesian product is the preferable outcome.  In any case, that’s the approach I ultimately decided to take, and my reasoning behind it.

Multi-schema Support

An important thing to keep in mind is that there are cases where multiple JSON document schemas exist within a given semi-structured JSON dataset. A common solution in this case is to build multiple views over such a dataset, each having WHERE clauses that return the records for the schema of interest. Note that this stored procedure does not, however, provide provision for handling those cases. In other words, the views it generates reflect “flattened” versions of all the underlying JSON document schemas found within the given dataset.  

If this describes your use case, all is not lost. You can still use this stored procedure as a starting point. One solution you might consider would be implementing a pre-process step whereby JSON document records are written to separate columns (or tables) based on their schema prior to generating corresponding views with this procedure, possibly through the use of materialized views. An alternative approach would be to add WHERE clauses to this procedure’s element and array queries so as to isolate the proper JSON schema for each execution (note that this would require inclusion of corresponding WHERE clauses in the generated views as well). Either way, it can be done. I hope this stored procedure helps you get there faster.

What About Other Semi-Structured Data Formats?

Snowflake provides excellent support for multiple semi-structured data formats, including Avro, Parquet, ORC, and XML. The stored procedure presented here should be able to work with most of these, although I haven’t rigorously tested it for those use cases as of this writing. If changes are required to support those formats they’d likely need to be made in the regular expressions path name, datatype, and alias portions of the queries.

Putting It All Together

If you created the version of the stored procedure from the first blog post of this series, you might consider removing that version before creating the updated version below. Otherwise, you’ll actually end up with two separate versions of this stored procedure that are differentiated by the number of parameters each requires.  Here’s how to get rid of that prior version:

drop procedure if exists CREATE_VIEW_OVER_JSON(VARCHAR, VARCHAR, VARCHAR);

And now here’s the complete code for the finished stored procedure that incorporates all of the functionality described in both blog posts:

CREATE OR REPLACE PROCEDURE create_view_over_json (TABLE_NAME varchar, COL_NAME varchar, VIEW_NAME varchar, COLUMN_CASE varchar, COLUMN_TYPE varchar)
RETURNS VARCHAR
LANGUAGE javascript
AS
$$
// CREATE_VIEW_OVER_JSON - Craig Warman, Snowflake Computing, DEC 2019
//
// This stored procedure creates a view on a table that contains JSON data in a column.
// of type VARIANT.  It can be used for easily generating views that enable access to 
// this data for BI tools without the need for manual view creation based on the underlying 
// JSON document structure.  
//
// Parameters:
// TABLE_NAME    - Name of table that contains the semi-structured data.
// COL_NAME      - Name of VARIANT column in the aforementioned table.
// VIEW_NAME     - Name of view to be created by this stored procedure.
// COLUMN_CASE   - Defines whether or not view column name case will match
//                 that of the corresponding JSON document attributes.  When
//                 set to 'uppercase cols' the view column name for a JSON 
//                 document attribute called "City.Coord.Lon" would be generated 
//                 as "CITY_COORD_LON", but if this parameter is set to 
//                 'match col case' then it would be generated as "City_Coord_Lon".
// COLUMN_TYPE   - The datatypes of columns generated for the view will match
//                 those of the corresponding JSON data attributes if this param-
//                 eter is set to 'match datatypes'. But when this parameter is 
//                 set to 'string datatypes' then the datatype of all columns 
//                 in the resulting view will be set to STRING (VARCHAR) or ARRAY.
//
// Usage Example:
// call create_view_over_json('db.schema.semistruct_data', 'variant_col', 'db.schema.semistruct_data_vw', 'match col case', 'match datatypes');
//
// Important notes:
//   - JSON documents may contain attributes that are actually reserved words, which
//     may cause SQL compilation errors to be thrown during the CREATE VIEW execution.
//     The easiest work-around in this case is to specify the 'match col case' for the
//     COLUMN_CASE parameter since this causes column names to be enclosed by double 
//     quotes when the view is generated.
//   - Sometimes there cases where the JSON documents attributes with the same name 
//     but actually contain data with different datatypes.  For example, one document
//     might have an attribute that contains the value "1" while another document 
//     has the same attribute with a value of "none".  This may lead to problems since
//     Snowflake will interpret the datatype of the first document as being numeric, 
//     while the second would be a string value.  Specifying 'string datatypes' for the
//     COLUMN_TYPE parameter should help alleviate such issues.     
//   - Column names for arrays in the JSON document structure will be prefixed by the
//     path to the array.  For example, consider a simple array such as:
//                  "code": {
//                    "rgb": [255,255,0]
//     The corresponding view columns in this case would be code_rgb_0, code_rgb_1, and
//     code_rgb_2.
//   - Column names for object arrays are similarly generated.  For example, consider an 
//     object array such as:
//                 contact: {
//                     phone: [
//                       { type: "work", number:"404-555-1234" },
//                       { type: "mobile", number:"770-555-1234" } 
//     The corresponding view columns in this case would be contact_phone_type and 
//     contact_phone_number.
//   - This procedure will work for arrays that are one level deep in the JSON structure.
//     Nested arrays will be materialized in the view as columns of type ARRAY.
//   - This procedure does not provide provisions for handling cases where multiple JSON
//     document schemas exist within a given dataset.  In other words, the views it 
//     generates reflect "flattened" versions of all the underlying JSON document schemas
//     found within the given dataset. Multi-schema support could be added by
//     implementing a pre-process step whereby JSON document records are written to
//     separate columns (or tables) based on their schema prior to generating
//     corresponding views with this procedure, possibly through the use of materialized
//     views. An alternative approach would be to add WHERE clauses to this procedure's
//     element and array queries so as to isolate the proper JSON schema for each
//     execution (note that this would require inclusion of corresponding WHERE clauses
//     in the generated views as well).
//   - Execution of this procedure may take an extended period of time for very 
//     large datasets, or for datasets with a wide variety of document attributes
//     (since the view will have a large number of columns)
//
// Attribution:
// I leveraged code developed by Alan Eldridge as the basis for this stored procedure.

var alias_dbl_quote = "";
var path_name = "regexp_replace(regexp_replace(f.path,'\\\\[(.+)\\\\]'),'(\\\\w+)','\"\\\\1\"')"   // This generates paths with levels enclosed by double quotes (ex: "path"."to"."element").  It also strips any bracket-enclosed array element references (like "[0]")
var attribute_type = "DECODE (substr(typeof(f.value),1,1),'A','ARRAY','B','BOOLEAN','I','FLOAT','D','FLOAT','STRING')";    // This generates column datatypes of ARRAY, BOOLEAN, FLOAT, and STRING only
var alias_name = "REGEXP_REPLACE(REGEXP_REPLACE(f.path, '\\\\[(.+)\\\\]'),'[^a-zA-Z0-9]','_')" ;                           // This generates column aliases based on the path
var table_list = TABLE_NAME;
var col_list = "";
var array_num = 0;

if (COLUMN_CASE.toUpperCase().charAt(0) == 'M') {
   alias_dbl_quote = "\""; }          // COLUMN_CASE parameter is set to 'match col case' so add double quotes around view column alias name 
if (COLUMN_TYPE.toUpperCase().charAt(0) == 'S') {
   attribute_type = "DECODE (typeof(f.value),'ARRAY','ARRAY','STRING')"; }   // COLUMN_TYPE parameter is set to 'string datatypes' so typecast to STRING instead of value returned by TYPEPOF function

// Build a query that returns a list of elements which will be used to build the column list for the CREATE VIEW statement
var element_query = "SELECT DISTINCT \n" +
                    path_name + " AS path_name, \n" +
                    attribute_type + " AS attribute_type, \n" +
                    alias_name + " AS alias_name \n" +
                    "FROM \n" + 
                    TABLE_NAME + ", \n" +
                    "LATERAL FLATTEN(" + COL_NAME + ", RECURSIVE=>true) f \n" +
                    "WHERE TYPEOF(f.value) != 'OBJECT' \n" +
                    "AND NOT contains(f.path,'[') ";      // This prevents traversal down into arrays

// Run the query...
var element_stmt = snowflake.createStatement({sqlText:element_query});
var element_res = element_stmt.execute();

// ...And loop through the list that was returned
while (element_res.next()) {

// Add any non-array elements and datatypes to the column list
// They will look something like this when added: 
//    col_name:"name"."first"::STRING as "name_first", 
//    col_name:"name"."last"::STRING as "name_last"
// Note that double-quotes around the column aliases will be added
// only when the COLUMN_CASE parameter is set to 'match col case'

   if (element_res.getColumnValue(2) != 'ARRAY') {
      if (col_list != "") {
         col_list += ", \n";}
      col_list += COL_NAME + ":" + element_res.getColumnValue(1);                               // Start with the element path name
      col_list += "::" + element_res.getColumnValue(2);                                         // Add the datatype
      col_list += " as " + alias_dbl_quote + element_res.getColumnValue(3) + alias_dbl_quote;   // And finally the element alias 
   }

// Array elements get handled in the following section:
   else {
      array_num++;
      var simple_array_col_list = "";
      var object_array_col_list = "";

// Build a query that returns the elements in the current array
      var array_query = "SELECT DISTINCT \n"+
                         path_name + " AS path_name, \n" +
                         attribute_type + " AS attribute_type, \n" +
                         alias_name + " AS attribute_name, \n" +
                         "f.index \n" +
                         "FROM \n" + 
                         TABLE_NAME + ", \n" +
                         "LATERAL FLATTEN(" + COL_NAME + ":" + element_res.getColumnValue(1) + ", RECURSIVE=>true) f \n" +
                         "WHERE REGEXP_REPLACE(f.path, '.+(\\\\w+\\\\[.+\\\\]).+', 'SubArrayEle') != 'SubArrayEle' ";  // This prevents return of elements of nested arrays (the entire array will be returned in this case)

// Run the query...
      var array_stmt = snowflake.createStatement({sqlText:array_query});
      var array_res = array_stmt.execute();

// ...And loop through the list that was returned.
// Add array elements and datatypes to the column list
// The way that they're added depends on the type of array:
//
// Simple arrays: 
// These are lists of values that are addressible by their index number
//   For example: 
//      "code": {
//         "rgb": [255,255,0]
// These will be added to the view column list like so:
//    col_name:"code"."rgb"[0]::FLOAT as code_rgb_0, 
//    col_name:"code"."rgb"[1]::FLOAT as code_rgb_1, 
//    col_name:"code"."rgb"[2]::FLOAT as code_rgb_2
//
// Object arrays:
// Collections of objects that addressible by key
// For example:
//     contact: {
//         phone: [
//           { type: "work", number:"404-555-1234" },
//           { type: "mobile", number:"770-555-1234" } 
// These will be added to the view column list like so:
//    a1.value:"type"::STRING as "phone_type",
//    a1.value:"number"::STRING as "phone_number"
// Along with an additional LATERAL FLATTEN construct in the table list:
//    FROM mydatabase.public.contacts,
//     LATERAL FLATTEN(json_data:"contact"."phone") a1;
//

      while (array_res.next()) {
         if (array_res.getColumnValue(1).substring(1) == "") {              // The element path name is empty, so this is a simple array element
             if (simple_array_col_list != "") {
                simple_array_col_list += ", \n";}
             simple_array_col_list += COL_NAME + ":" + element_res.getColumnValue(1);    // Start with the element path name
             simple_array_col_list += "[" + array_res.getColumnValue(4) + "]";           // Add the array index
             simple_array_col_list += "::" + array_res.getColumnValue(2);                // Add the datatype
             simple_array_col_list += " as " + alias_dbl_quote + element_res.getColumnValue(3) + "_" + array_res.getColumnValue(4) + alias_dbl_quote;   // And finally the element alias - Note that the array alias is added as a prefix to ensure uniqueness
             }
         else {                                                             // This is an object array element
             if (object_array_col_list != "") {
                object_array_col_list += ", \n";}
             object_array_col_list += "a" + array_num + ".value:" + array_res.getColumnValue(1).substring(1);    // Start with the element name (minus the leading '.' character)
             object_array_col_list += "::" + array_res.getColumnValue(2);                                        // Add the datatype
             object_array_col_list += " as " + alias_dbl_quote + element_res.getColumnValue(3) + array_res.getColumnValue(3) + alias_dbl_quote;   // And finally the element alias - Note that the array alias is added as a prefix to ensure uniqueness
             }
      }


// If no object array elements were found then add the simple array elements to the 
// column list...
      if (object_array_col_list == "") {
          if (col_list != "") {
             col_list += ", \n";
		  }
          col_list += simple_array_col_list;
          }
// ...otherwise, add the object array elements to the column list along with a
// LATERAL FLATTEN clause that references the current array to the table list
      else {
          if (col_list != "") {
             col_list += ", \n";
		  }
          col_list += object_array_col_list;
          table_list += ",\n LATERAL FLATTEN(" + COL_NAME + ":" + element_res.getColumnValue(1) + ") a" + array_num;
          }
   }

}

// Now build the CREATE VIEW statement
var view_ddl = "CREATE OR REPLACE VIEW " + VIEW_NAME + " AS \n" +
               "SELECT \n" + col_list + "\n" +
               "FROM " + table_list;

// Now run the CREATE VIEW statement
var view_stmt = snowflake.createStatement({sqlText:view_ddl});
var view_res = view_stmt.execute();
return view_res.next();
$$; 

Calling the stored procedure is pretty straightforward.  Here’s an example:

call create_view_over_json('db.schema.semistruct_data', 'variant_col', 'db.schema.semistruct_data_vw', 'match col case', 'match datatypes');    

Take a look at the History tab of the Snowflake web interface after calling this stored procedure.  There you’ll see at least three SQL statements being executed – first, the stored procedure call, and then beneath that you’ll see where it runs a query that returns the JSON document elements and their datatypes.  You’ll also see one or more queries associated with the arrays in your JSON document structure (if any are present).  Finally, you should see the view creation DDL that it generated – It’ll start off with “CREATE OR REPLACE VIEW”.

Final Thoughts

My primary goal for these blog posts was to provide an automated process for generating views over semi-structured JSON data. But my secondary goal was to demonstrate how SQL statements can be dynamically generated and executed by Snowflake stored procedures. I hope the code I’ve provided here will prove helpful to the Snowflake community for both of these scenarios.

Laisser un commentaire

Votre adresse e-mail ne sera pas publiée.