Analyzing Clickstream Events Using Amazon Athena UDFs
In today's digital age, businesses are constantly seeking ways to understand and analyze user behavior on their websites. Clickstream events provide valuable insights into how users interact with a website, and analyzing this data can help businesses make informed decisions to improve user experience and drive conversions.
Clickstream Analytics on AWS collects, ingests, analyzes, and visualizes clickstream events from your websites and mobile applications. The solution manages an ingestion endpoint to receive clickstream events, which are multiple events in a batch sent by the solution‘s SDKs.
Once the ingestion endpoint receives the events, they are stored in an Amazon S3 bucket without additional processing. The bucket path is configured as a Glue table in the solution's AWS Glue Data Catalog. So the data is available for analysis using Amazon Athena.
One use case is to query and analyze the raw clickstream data to gain immediate insights after the data is stored in the S3 bucket. For example, the operators can debug the clickstream events without waiting for the data to be processed. However, the challenges of querying the raw data are:
- the clickstream events are compressed by SDKs, so the data is not easily query-able
- reach the Lambda payload limitation
Response payload size exceeded maximum allowed payload size (6291556 bytes)
when using Athena UDF to extract the events
In this post, I will show you how to use Amazon Athena UDFs to query the raw clickstream data to overcome the challenges.
The steps are:
- clone repo:
https://github.com/zxkane/aws-athena-query-federation
- Follow the steps to build and deploy the UDFs as Lambda function. After completing the deployment, find the ARN of Lambda function. Let’s say it as
clickstream-udfs
. - Go to the console of Glue. Run the below query to load the latest partitions of raw data.
1msck repair table <your project id>.ingestion_events;
- Run below sample query to view compressed data.
1-- view compressed data
2USING EXTERNAL FUNCTION decompress_clickstream_common_fields(col1 VARCHAR) RETURNS VARCHAR LAMBDA '<your lambda arn>',
3 EXTERNAL FUNCTION decompress_clickstream_attribute_fields(col1 VARCHAR) RETURNS VARCHAR LAMBDA '<your lambda arn>',
4 EXTERNAL FUNCTION decompress_clickstream_user_fields(col1 VARCHAR) RETURNS VARCHAR LAMBDA '<your lambda arn>'
5SELECT
6 json_parse(decompress_clickstream_user_fields(data)),
7 json_parse(decompress_clickstream_common_fields(data)),
8 json_parse(decompress_clickstream_attribute_fields(data))
9FROM "<your project id>"."ingestion_events"
10WHERE year='2024' and month='06' and day='20' and hour='02'
11limit 10;
12
13-- count the received raw events
14USING EXTERNAL FUNCTION decompress_clickstream_common_fields(col1 VARCHAR) RETURNS VARCHAR LAMBDA '<your lambda arn>',
15 EXTERNAL FUNCTION decompress_clickstream_attribute_fields(col1 VARCHAR) RETURNS VARCHAR LAMBDA '<your lambda arn>',
16 EXTERNAL FUNCTION decompress_clickstream_user_fields(col1 VARCHAR) RETURNS VARCHAR LAMBDA '<your lambda arn>'
17SELECT
18 sum(json_array_length(json_parse(decompress_clickstream_common_fields(data))))
19FROM "<your project id>"."ingestion_events"
20WHERE year='2024' and month='06' and day='20' and hour='02';
This conclusion summarizes the key benefits of using Amazon Athena UDFs for querying raw clickstream data, provides some final thoughts and considerations.
- Immediate access to data: You can analyze clickstream events as soon as they're stored in the S3 bucket, without waiting for additional processing.
- Debugging capabilities: Operators can quickly debug clickstream events by directly querying the raw data.
- Overcoming compression challenges: The UDFs allow you to decompress and parse the data on-the-fly, making it easily queryable.
- Avoiding Lambda payload limitations: By using separate UDFs for different parts of the data, you can circumvent the Lambda payload size restrictions.