Business Client Dashboard Alerts

Critical API Failure Alert

Description

Success rate of contacts API or messages API is low

Action Items

  1. Find the API error codes in the Requests/sec panels for the contacts or messages API.
  2. Check the Error Codes documentation.
  3. Check the CoreApp Requests/sec and DB Queries/sec panels to see if the failures correlate with Coreapp failures or database failures.
  4. Check the CoreApp Overview dashboard (fill the Node variable with the problematic Coreapp) and the MySQL Overview dashboard for more information.

No Stats Alert

Description

Missing data for monitoring

Action Items

  1. Access the Prometheus targets endpoint (e.g., http://your-monitoring-hostname:9090/targets) to verify that the webstats and appstats endpoint states are UP.
  2. If Prometheus fails to connect to the Webapp, run WADebug to troubleshoot errors.
  3. If the Webapp and Coreapp containers are running, check if WA_WEB_ENDPOINT, WA_WEB_USERNAME, and WA_WEB_PASSWORD in the .env file are valid.
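
The target check in step 1 can also be scripted against the Prometheus HTTP API, which exposes the same information as the targets page at /api/v1/targets. The following is a minimal sketch, not part of the product tooling. The function names are illustrative, and it assumes your scrape jobs carry the usual job and instance labels.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload):
    """Return (job, instance, health) for every active target that is not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job"),
            target.get("labels", {}).get("instance"),
            target.get("health"),
        )
        for target in targets
        if target.get("health") != "up"
    ]


def fetch_targets(base_url):
    """Fetch the targets payload from a Prometheus server's HTTP API."""
    url = base_url.rstrip("/") + "/api/v1/targets"
    with urlopen(url, timeout=10) as response:
        return json.load(response)
```

For example, unhealthy_targets(fetch_targets("http://your-monitoring-hostname:9090")) returns an empty list when every scraped endpoint is UP.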

Coreapp Overview Dashboard Alerts

Callback Failure Alert

Description

Success rate of sending callbacks to the Webhook URL specified in the application settings is low

Action Items

  1. Find the callback response error codes from the Callback Requests/sec panel.
  2. Grep the Coreapp logs for network error to see the actual error messages.
  3. Based on error codes and messages:
    • Verify if your Webhook is reachable by the Coreapp.
    • Verify if your Webhook always returns an HTTP 200 OK response after processing notifications.
    • Verify if your Webhook takes a long time to respond.
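
Where grep is not available, the log search in step 2 can be approximated with a few lines of Python. This is a minimal sketch. The helper name is illustrative, and the log file location depends on your deployment.

```python
import re


def grep_lines(lines, pattern):
    """Yield (line_number, line) for each line matching pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            yield number, line.rstrip("\n")
```

Passing an open log file and the pattern "network error" yields each matching line with its line number. The same helper works for the QSqlError and "Stream error" searches described in other alerts on this page.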

High Pending Outgoing Message Alert

Description

Outgoing message queue is close to being full; API requests will fail with System overloaded error (1016) soon

Action Items

  1. Check the Outgoing Messages panel row for any unusual traffic increases. If there are any, try to reduce the traffic load until the alert clears.
  2. Verify if your database has failed over to another region recently. The WhatsApp Business API may not keep up with the load due to cross-region latency.
  3. If outgoing messages are queuing up slowly over time, you should report the bug to us.
  4. If a single WhatsApp Business API Client cannot meet your load requirements, set up Multiconnect to support much higher loads.

High Queuing Callback Alert

Description

Callback queue is close to being full; API requests will fail with System overloaded error (1016) soon

Action Items

  1. Check the Callback Error Rate panel to verify callbacks are processing successfully.
  2. Reduce the callback processing time for your Webhook.
  3. Configure max_concurrent_requests in the application settings to increase the number of in-flight callback requests (by default, it's 6).

Database Transactions Error Rate Alert

Description

Error rate of database transaction operations (transaction, commit, rollback) is high.

Action Items

  1. Find the problematic database and operation from the DB Transactions/sec panel.
  2. Grep the Coreapp logs for QSqlError to see the actual SQL error code and message.
  3. Based on the error code and message:
    1. Verify if your database is running healthy by checking the MySQL Overview Dashboard or your own database dashboard.
    2. If the error indicates a schema issue or a bad SQL query, submit a Direct Support Ticket for investigation.

Database Read Query Error Rate Alert

Description

Error rate of database read operations (select, prepare) is high.

Action Items

  1. Find the problematic database and operation from the DB Read Queries/sec panel.
  2. Grep the Coreapp logs for QSqlError to see the actual SQL error code and message.
  3. Based on the error code and message:
    1. Verify if your database is running healthy by checking the MySQL Overview Dashboard or your own database dashboard.
    2. If the error indicates a schema issue or a bad SQL query, submit a Direct Support Ticket for investigation.

Database Write Query Error Rate Alert

Description

Error rate of database write operations (insert, update, delete, etc.) is high.

Action Items

  1. Find the problematic database and operation from the DB Write Queries/sec panel.
  2. Grep the Coreapp logs for QSqlError to see the actual SQL error code and message.
  3. Based on the error code and message:
    1. Verify if your database is running healthy by checking the MySQL Overview Dashboard or your own database dashboard.
    2. If the error indicates a schema issue or a bad SQL query, submit a Direct Support Ticket for investigation.

Average Database Transaction Latency(ms) Alert

Description

Average latency of database transaction operations (transaction, commit, rollback) is high.

Action Items

We recommend the database latency be less than 15ms to achieve high messaging throughput.

  1. Find the slow database from the Average DB Transaction Latency(ms) panel.
  2. Verify if your database is running healthy by checking the MySQL Overview Dashboard or your own database dashboard.
  3. Use mysqlslap or pgbench to measure transaction latency with concurrent clients.
  4. Follow the High Throughput Recommendations to set up your database with similar configurations.

Common issues

  • Database instance is running out of CPU/Memory/Disk/IOPS/Connections.
  • Database instance is running on top of magnetic disk instead of SSD.
  • Database instance is in different region from your Coreapp and has a high network round trip time.

Average Database Read Query Latency(ms) Alert

Description

Average latency of database read operations (select, prepare) is high.

Action Items

We recommend the database latency be less than 15ms to achieve high messaging throughput.

  1. Find the slow database from the Average DB Read Query Latency(ms) panel.
  2. Verify if your database is running healthy by checking the MySQL Overview Dashboard or your own database dashboard.
  3. Use mysqlslap or pgbench to measure read latency with concurrent clients.
  4. Follow the High Throughput Recommendations to set up your database with similar configurations.

Common issues

  • Database instance is running out of CPU/Memory/Connections.
  • Database instance is in different region from your Coreapp and has a high network round trip time.

Average Database Write Query Latency(ms) Alert

Description

Average latency of the database write operations (insert, update, delete, etc.) is high.

Action Items

We recommend the database latency be less than 15ms to achieve high messaging throughput.

  1. Find the slow database from the Average DB Write Query Latency(ms) panel.
  2. Verify if your database is running healthy by checking the MySQL Overview Dashboard or your own database dashboard.
  3. Use mysqlslap or pgbench to measure write latency with concurrent clients.
  4. Follow the High Throughput Recommendations to set up your database with similar configurations.

Common issues

  • Database instance is running out of CPU/Memory/Disk/IOPS/Connections.
  • Database instance is running on top of magnetic disk instead of SSD.
  • Database instance is in different region from your Coreapp and has a high network round trip time.

Average Callback Request Latency(ms) Alert

Description

Average latency of callback requests to Webhook URL specified in the application settings is high.

Action Items

We recommend the callback latency be less than 80ms to achieve high throughput.

  1. Run a benchmark against your Webhook server and perform profiling to identify bottlenecks.
  2. Perform non-critical operations asynchronously and return an HTTP 200 OK response immediately.
  3. Increase the number of Webhook servers behind your load balancers if they run out of system resources.
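
The pattern of acknowledging first and processing later can be sketched with the Python standard library. This is an illustration only, under several assumptions: the process function is a hypothetical placeholder for your own slow work, the server listens on plain HTTP on localhost (a production Webhook would sit behind TLS), and the in-memory queue loses pending notifications if the process exits.

```python
import queue
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

pending = queue.Queue()  # notifications waiting for background processing


def process(body):
    """Hypothetical placeholder for slow, non-critical work on one notification."""


def worker():
    """Drain the queue in the background so request threads never wait on it."""
    while True:
        body = pending.get()
        try:
            process(body)
        finally:
            pending.task_done()


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        pending.put(self.rfile.read(length))  # hand off instead of processing inline
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        """Silence the default per-request logging."""


def start_server(port=0):
    """Start the worker and the HTTP server on background threads."""
    threading.Thread(target=worker, daemon=True).start()
    server = ThreadingHTTPServer(("127.0.0.1", port), WebhookHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the handler only reads the body and enqueues it, the response time no longer depends on how long the notification takes to process.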

Server Connection Requests Alert

Description

The Coreapp constantly loses connections to the WhatsApp servers. Unstable connections will impact messaging performance of the Coreapp and cause API failures.

Action Items

  1. Grep the Coreapp logs for “Stream error” to see the actual connection loss error, its message, and how often it occurs.
  2. If the connection is lost periodically for hours, a Coreapp restart may mitigate the issues.
  3. Maintain logs and submit a Direct Support Ticket for more investigation.

Messages Decrypted/sec Alert

Description

The Coreapp is unable to decrypt incoming messages from the WhatsApp server fast enough, which will trigger connection loss.

Action Items

  1. Verify the database is running well by checking the DB Read/Write/Transaction Latency panel. We recommend the database latency be less than 15ms to achieve high throughput. Follow the Average DB Write Query Latency(ms) Alert Action Items above to resolve database issues.
  2. Check if your Coreapp instance is running out of CPU. If so, upgrade to a larger instance.
  3. Maintain logs and submit a Direct Support Ticket to have incoming messages rate limited on the WhatsApp server side.

Machine Overview Dashboard Alerts

High CPU Usage Alert

Description

CPU Utilization of a machine is too high

Action Items

  1. Check the CPU Detailed Util % panel to get utilization distribution.
  2. Run atop or top on the machine to find the most CPU consuming processes. It may also be worth checking out the Container Overview dashboard for container level CPU metrics by filling the Machine variable with the problematic machine.
  3. If the Webapp, Coreapp, or database consumes most of the CPU, find a more powerful machine to host them. For High Availability/Multiconnect mode, if the Webapp and Coreapp containers are running on the same machine, try moving them to separate machines.

High Disk Usage Alert

Description

Disk Utilization of a device on a machine is too high

Action Items

  1. Run the du and df commands on the device to analyze disk usage. It may also be worth checking out the Container Overview dashboard for container level disk metrics by filling the Machine variable with the problematic machine.
  2. Clean up unnecessary space-consuming data on the device; if there are media files or logs, set up a cron job to clean up old data periodically.
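
The periodic cleanup in step 2 could be a small script invoked from cron. This is a minimal sketch. The function name, the retention period, and the directory you point it at are all your own choices, and it defaults to a dry run so that nothing is deleted until you have reviewed the list.

```python
import os
import time


def remove_old_files(directory, max_age_days, now=None, dry_run=True):
    """Find regular files older than max_age_days; delete them unless dry_run."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    matched = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # Skip symlinks so the script never deletes outside the tree.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getmtime(path) < cutoff:
                if not dry_run:
                    os.remove(path)
                matched.append(path)
    return matched
```

Call it first with the default dry_run=True to review the returned paths, then with dry_run=False once the list looks right.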

High Memory Usage Alert

Description

Memory Utilization of a machine is too high

Action Items

  1. Check the Memory Details panel to get utilization distribution.
  2. Run atop or top on the machine to find the most memory consuming process. It may also be worth checking out the Container Overview dashboard for container level memory metrics by filling the Machine variable with the problematic machine.
  3. If the Webapp, Coreapp, or database consumes most of the memory, find a more powerful machine to host them.
  4. If the Coreapp's memory usage is increasing slowly over time, it's probably due to a memory leak; you should report the bug to us. Restart the Coreapp to mitigate the memory issues.

Too Many Open Files Alert

Description

Machine is going to run out of file descriptors soon

Action Items

  1. Check the File Descriptor panel for the open file limit.
  2. Configure a higher value (e.g., fs.file-max = 600000) in the /etc/sysctl.conf file to increase the open file limit.
  3. Run sysctl -p to apply changes.
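
To confirm the new limit took effect, you can read the kernel's own counters. On Linux, /proc/sys/fs/file-nr holds three numbers: allocated file handles, allocated but unused handles, and the system-wide maximum (fs.file-max). This is a minimal sketch with illustrative function names. It assumes a Linux host, and whether the File Descriptor panel is built from the same counters is not confirmed here.

```python
def file_handle_usage(file_nr_text):
    """Parse /proc/sys/fs/file-nr content; return (allocated, maximum, fraction)."""
    allocated, _unused, maximum = (int(field) for field in file_nr_text.split())
    return allocated, maximum, allocated / maximum


def read_file_nr(path="/proc/sys/fs/file-nr"):
    """Read the raw counters from the kernel (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Note that fs.file-max is the system-wide limit. Per-process limits (ulimit -n) are configured separately.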

MySQL Overview Dashboard Alerts

Too Many DB Connections Alert

Description

DB connection pool utilization is high; new DB requests may fail with Too many connections errors soon

Action Items

  1. Check the Connections panel for the current connection limit.
  2. Increase the MySQL system variable max_connections (by default, it's 151) in my.cnf and restart the MySQL server. See the MySQL Server System Variables documentation for more information.
  3. For AWS RDS, you need to migrate to a larger RDS instance. See the RDS Instance Sizing section of the AWS Deployment Details for guidance.
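
Connection pool utilization is the ratio of open connections to the limit. In MySQL the two values come from SHOW GLOBAL STATUS LIKE 'Threads_connected' and SHOW VARIABLES LIKE 'max_connections'. This is a minimal sketch of the arithmetic. The function names and the 0.8 threshold are illustrative assumptions, not the threshold this alert uses.

```python
def connection_utilization(threads_connected, max_connections):
    """Return the fraction of the connection limit currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return threads_connected / max_connections


def nearing_limit(threads_connected, max_connections, threshold=0.8):
    """Return True when utilization is at or above the (assumed) threshold."""
    return connection_utilization(threads_connected, max_connections) >= threshold
```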

WebApp Overview Dashboard Alerts

HTTP Server High Pending Connections Alert

Description

Webapp internal HTTP server connection queue is close to full

Action Items

  1. Check the Business Client dashboard for unusual API traffic or high API request latency.
  2. Check the Webapp logs for more information.
  3. Check if the Webapp CPU utilization is high, and if so, find a more powerful machine for the Webapp.