Deep Dive Clickstream Analytics Series: Data Pipeline Observability

Oct 21, 2024 · 5 min read · Clickstream Analytics AWS Observability Monitoring Logging Troubleshooting ·

Share on:

Overview

In this post, we will explore the observability features of our clickstream solution. Observability is crucial for understanding the health of your data pipeline, identifying issues promptly, and ensuring optimal performance. We'll cover the monitoring, logging, and troubleshooting capabilities built into the solution.

Overview

The clickstream analytics solution incorporates several observability features to help users monitor and maintain their data pipelines:

Logging
Custom CloudWatch dashboards
Automated alerting
Pipeline health checks
Troubleshooting tools

Let's delve into each of these components.

Logging

The solution utilizes Amazon CloudWatch Logs and Amazon S3 to centralize logs from various components of the data pipeline and store them cost-effectively. This includes:

Data ingestion service logs, generated from the containers in ECS. You can also enable access logs for the Application Load Balancer when configuring the data pipeline.
Data processing job logs, which use Amazon EMR Serverless to process the raw ingestion data and persist the job logs in an S3 bucket.
Data modeling workflow logs, which use AWS Step Functions and AWS Lambda to orchestrate the workflow. All these logs are stored in CloudWatch Logs. Lambda functions now support configuring CloudWatch log groups, allowing all logs to be stored in a single CloudWatch log group, facilitating easy searching and analysis across the entire workflow.
Web console application logs, which are stored in CloudWatch Logs.

Furthermore, if you need to search and analyze logs across the entire data pipeline, consider implementing the AWS solution Centralized Logging with OpenSearch.

Custom CloudWatch Dashboards

The solution automatically creates custom CloudWatch dashboards for each data pipeline. These dashboards provide at-a-glance views of key metrics, including:

Ingestion service health and performance
Data processing job health and key metrics statistics
Data modeling workflow health and performance
Redshift resource usage

The challenge lies in the modular design of the data pipeline. The pipeline has different components based on user choices, and even the data modeling on Redshift can be either Redshift Serverless or provisioned Redshift. Therefore, the dashboard is not fixed before the customer creates the data pipeline. The implementation approach is as follows:

A common metrics stack is deployed before all other components, monitoring the metrics configurations in AWS Systems Manager Parameter Store.
Each module (stack) creates one or more metric configurations as standard parameters (which are free but have a 4KB size limit) with the name pattern /Clickstream/metrics/<project id>/<stack id prefix>/<stack id>/<index>.
A function in the common metrics stack updates the dashboard based on the metrics specified in the parameters.

:inline CloudWatch Dashboard Example 1 — Figure 1: Example CloudWatch Dashboard for the Data Pipeline with KDS as sink, enabling data processing and data modeling with Redshift Serverless

:inline CloudWatch Dashboard Example 2 — Figure 2: Example CloudWatch Dashboard for the Data Pipeline with MSK as sink, enabling data processing only

You can refer to the FAQ How do I monitor the health of the data pipeline for my project? for guidance on using these metrics to monitor the health of your data pipeline.

Automated Alerting

To proactively notify users of potential issues, the solution sets up built-in CloudWatch Alarms for critical metrics. These alarms can trigger notifications via Amazon SNS, allowing users to respond quickly to problems. Some key alarms include:

ECS Cluster CPU Utilization, used for scaling out/in of the ingestion fleet
ECS Pending Task Count, which alarms when the fleet is scaling out
Data processing job failures
Failures in the workflow of loading data into Redshift
Abnormal situations where no data is loaded into Redshift
Maximum file age exceeding the data processing interval threshold, which often indicates a failure in the data loading workflow or insufficient Redshift capacity
Failures in the workflow of refreshing materialized views
Failures in the workflow of scanning metadata

Troubleshooting

To aid in troubleshooting, consider the following steps:

Review the Troubleshooting documentation to see if it addresses your issue.
Log Queries: Query CloudWatch Logs for common troubleshooting scenarios. When encountering an "Internal Server Error" in the solution web console, collect verbose error logs of the Lambda function running the solution web console as follows:
1. Navigate to the CloudFormation stack of the solution web console.
2. Click the Resources tab to search for the Lambda function with Logical ID ApiFunction68.
3. Click the link in the Physical ID column to open the Lambda function on the Function page.
4. Choose Monitor and then select View logs in CloudWatch. Refer to Accessing logs with the console for more details.
Error Pattern Detection: Use automated analysis of logs to identify common error patterns and suggest potential solutions.
Data Quality Checks: Implement regular checks on processed data to identify potential data quality issues early in the pipeline.
Performance Analyzer: Utilize tools to analyze and optimize query performance in Redshift, including recommendations for table design and query structure.
Contact Support: Get expert assistance with this solution if you have an active support plan.

Best Practices for Pipeline Observability

To make the most of these observability features, consider the following best practices:

Regular Monitoring: Check the CloudWatch dashboards regularly to familiarize yourself with normal patterns and quickly spot anomalies.
Alert Tuning: Adjust alert thresholds based on your specific use case to minimize false positives while ensuring you catch real issues. Add additional alerts as needed.
Log Analysis: Use CloudWatch Logs Insights to perform ad-hoc analysis on your logs when troubleshooting specific issues.
On-call Integration: Integrate alert notifications with your preferred tools, such as PagerDuty or AWS Chatbot.
Continuous Improvement: Use the insights gained from observability tools to continuously improve your data pipeline's performance and reliability.

Conclusion

Observability is a critical aspect of maintaining a healthy and efficient clickstream analytics pipeline. By leveraging the built-in monitoring, logging, and troubleshooting capabilities of the solution, you can ensure that your data pipeline remains robust and performant, allowing you to focus on deriving insights from your clickstream data.