
Hacking MySQL’s JSON_ARRAYAGG Function to Create Dynamic, Multi-Value Dimensions | by Dakota Smith | Jul, 2023

As you can probably guess, this article details obstacles I ran into at work when I found myself in a similar situation.

Ultimately, the solution was a somewhat hack-y, but remarkably understandable, workaround using the native MySQL JSON data type.

The JSON data type in MySQL was added in version 5.7.8, and provides a lot of useful utility for both storage and modeling.

Under the JSON data type umbrella (officially called “JSON documents”) are two different data structures: JSON arrays and JSON objects.

A JSON array can simply be thought of as an array (a list, if you’re a Pythonista): values enclosed by square brackets [ ] and separated by commas.

  • An example MySQL JSON array value: ["foo", "bar", 1, 2]

A JSON object can be thought of as a hash table (or, again in Python terms, a dictionary): key-value pairs, separated by commas, and enclosed by curly brackets { }. Keys in a JSON object must be strings.

  • An example of a MySQL JSON object value: {"foo": "bar", "1": 2}
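
To make the two structures concrete, here is a minimal sketch that builds each one with MySQL's JSON_ARRAY and JSON_OBJECT functions and checks the result with JSON_TYPE. The literal values are only illustrative, and the column aliases are my own.

```sql
-- Build one JSON array and one JSON object from plain values,
-- then ask MySQL which kind of JSON document each one is.
SELECT
    JSON_ARRAY('foo', 'bar', 1, 2)                AS example_array,
    JSON_TYPE(JSON_ARRAY('foo', 'bar', 1, 2))     AS array_type,    -- ARRAY
    JSON_OBJECT('foo', 'bar', '1', 2)             AS example_object,
    JSON_TYPE(JSON_OBJECT('foo', 'bar', '1', 2))  AS object_type;   -- OBJECT
```

JSON_OBJECT takes its arguments as alternating keys and values, so the call above produces an object with the keys "foo" and "1".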

MySQL has a number of functions that can be used to deal with both of these formats, almost none of which perform any kind of aggregation.

Fortunately, though, there are two that do: JSON_ARRAYAGG and JSON_OBJECTAGG. And they both return JSON documents, which means we can use MySQL’s built-in functions to access the values therein.

The MySQL function JSON_ARRAYAGG acts a lot like GROUP_CONCAT. The biggest difference is that it returns a JSON array, which, again, comes with a number of helpful built-in functions.
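
As a rough side-by-side, the sketch below runs both aggregates over a small, hypothetical table. The table name subscriptions_demo and its rows are assumptions made up for illustration; only the customer_id and subscription column names come from the queries in this article. JSON_ARRAYAGG requires MySQL 5.7.22 or later.

```sql
-- Hypothetical sample data; the table name and values are illustrative only.
CREATE TEMPORARY TABLE subscriptions_demo (
    customer_id  INT,
    subscription VARCHAR(50)
);

INSERT INTO subscriptions_demo (customer_id, subscription) VALUES
    (1, 'news'), (1, 'sports'),
    (2, 'sports'), (2, 'news'),
    (3, 'news');

-- GROUP_CONCAT returns a comma-separated string per customer;
-- JSON_ARRAYAGG returns a JSON array per customer.
SELECT
    customer_id,
    GROUP_CONCAT(subscription)   AS concat_string,
    JSON_ARRAYAGG(subscription)  AS json_array
FROM subscriptions_demo
GROUP BY customer_id;
```

The string result has to be parsed before anything can be done with the individual values, whereas the JSON array can be passed straight to MySQL's other JSON functions.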

The JSON array data type solves one of our two problems with astounding simplicity: the problem of reliably counting the number of subscriptions in a combination. This is accomplished using the JSON_LENGTH function. The syntax is wonderfully simple:

SELECT JSON_LENGTH(JSON_ARRAY("foo", "bar", "hello", "world"));
-- JSON_ARRAY function used here just to quickly create an example array

The result of this statement is 4, since there are 4 values in the generated JSON array.
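
One detail worth knowing, shown in the small sketch below with made-up literals: JSON_LENGTH counts only top-level elements, so a nested array counts as a single element, and for an object it counts the keys.

```sql
SELECT
    JSON_LENGTH('[1, [2, 3]]')        AS nested_array_length,  -- 2: the inner array is one element
    JSON_LENGTH('{"a": 1, "b": 2}')   AS object_length;        -- 2: one per key
```

For a flat list of subscription names, as in this article, the length is simply the number of subscriptions.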

But let’s return to the combination of subscriptions. Unfortunately, JSON_ARRAYAGG doesn’t come with the ordering functionality that GROUP_CONCAT has. Ordering the subscription values, even in a CTE before the base query, doesn’t return the desired results:

WITH
subscriptions_ordered AS (
SELECT
customer_id,
subscription
FROM subscriptions
ORDER BY subscription
)
, subscriptions_grouped AS (
SELECT
customer_id,
JSON_ARRAYAGG(subscription) AS subscriptions,
JSON_LENGTH(JSON_ARRAYAGG(subscription)) AS num_subscriptions
FROM
subscriptions_ordered
GROUP BY customer_id
)
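SELECT
subscriptions,
COUNT(*) AS num_accounts,
num_subscriptions
FROM subscriptions_grouped
-- num_subscriptions is included so the query also runs under ONLY_FULL_GROUP_BY
GROUP BY subscriptions, num_subscriptions
;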

The number of subscriptions in each combination is there, thanks to the JSON_LENGTH function, but combinations that are effectively the same are once again mischaracterized as distinct because of their order.

Using ROW_NUMBER to force the ordering of values

ROW_NUMBER is a window function that creates an index. The index has to be defined; that is, you have to tell it where to start, how to increment (directionally), and where to end.

We can see a quick example of this by applying the ROW_NUMBER function and telling it to order by the subscription field:

SELECT 
customer_id,
subscription,
ROW_NUMBER() OVER(ORDER BY subscription) AS alphabetical_row_num
FROM subscriptions
;

Look closely at the results. Even though we didn’t use an ORDER BY statement at the end of our query, the data nonetheless comes back ordered according to the ORDER BY in the OVER clause. (This is how MySQL behaves in practice here rather than a documented guarantee.)

But of course this still isn’t exactly what we want. What we need to do next is add a PARTITION BY clause to our window function, so that the ordering of the results is related to (and in fact bounded by) each customer ID. Like so:

SELECT 
customer_id,
subscription,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY subscription) AS alphabetical_order
FROM subscriptions
;
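
To see what the partitioned numbering looks like without touching a real table, here is a small sketch over inline rows. The derived table demo and its values are hypothetical. Window functions require MySQL 8.0 or later.

```sql
-- Inline, made-up rows standing in for the subscriptions table.
SELECT
    customer_id,
    subscription,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY subscription) AS alphabetical_order
FROM (
    SELECT 1 AS customer_id, 'sports' AS subscription
    UNION ALL SELECT 1, 'news'
    UNION ALL SELECT 2, 'news'
    UNION ALL SELECT 2, 'sports'
) AS demo;
-- For both customers, 'news' is numbered 1 and 'sports' is numbered 2:
-- the numbering restarts within each customer_id partition.
```

Although the two customers' rows were entered in opposite orders, each customer ends up with the same alphabetical numbering.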

You can probably see where this is going.

If we execute the JSON_ARRAYAGG function against these results in a CTE, we see that the duplicate combinations now look exactly the same, thanks to the subscriptions being forced into alphabetical order by the ROW_NUMBER function:

WITH 
subscriptions_ordered AS (
SELECT
customer_id,
subscription,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY subscription) AS alphabetical_order
FROM subscriptions
)
SELECT
customer_id,
JSON_ARRAYAGG(subscription) AS subscriptions
FROM subscriptions_ordered
GROUP BY 1
ORDER BY 2
;

Now all we need to do is add in the grouping CTE following the one executing ROW_NUMBER, and alter the base query:

WITH 
subscriptions_ordered AS (
SELECT
customer_id,
subscription,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY subscription) AS alphabetical_order
FROM subscriptions
)
, subscriptions_grouped AS (
SELECT
customer_id,
JSON_ARRAYAGG(subscription) AS subscriptions,
JSON_LENGTH(JSON_ARRAYAGG(subscription)) AS num_subscriptions
FROM subscriptions_ordered
GROUP BY customer_id
)
SELECT
subscriptions,
COUNT(*) AS num_customers,
num_subscriptions
FROM subscriptions_grouped
GROUP BY subscriptions, num_subscriptions
ORDER BY num_customers DESC
;

This gives not only accurately distinct combinations of subscriptions, but also the number of customers who’ve purchased those combinations, and how many subscriptions comprise each of them.

Voila!

  • We wanted to know how many customers purchased different combinations of subscriptions, and how many subscriptions were in each of those combinations. This presented two problems: how best to obtain the latter, and how to generate accurately distinct subscription combinations.

  • To obtain the number of subscriptions in each combination, we chose to go with one of MySQL’s JSON functions, JSON_ARRAYAGG. The resulting aggregation was returned to us as a JSON data type, allowing us to use the JSON_LENGTH function.

  • We then needed to force the ordering of values inside the JSON array so that duplicate combinations didn’t mistakenly appear distinct. To do this, we used the window function ROW_NUMBER in a CTE prior to the base query, partitioning by customer ID and ordering the subscriptions alphabetically (in ascending order).

  • This ultimately allowed us to aggregate up to accurately distinct combinations of subscriptions; and with this we were able to use a simple COUNT function to see how many customers had purchased each combination.