Business Client Dashboard Alerts

Critical API Failure Alert

Description

Success rate of contacts API or messages API is low

Action Items

  1. Find the API error codes in the Requests/sec panels for the contacts or messages API.
  2. Check the Error Codes documentation.
  3. Check the CoreApp Requests/sec and DB Queries/sec panels to see whether the failures correlate with Coreapp failures or database failures.
  4. Check the CoreApp Overview dashboard (fill the Node variable with the problematic Coreapp) and the MySQL Overview dashboard for more information.

No Stats Alert

Description

Missing data for monitoring

Action Items

  1. Access the Prometheus targets endpoint (e.g., http://your-monitoring-hostname:9090/targets) to verify that the webstats and appstats endpoint states are UP.
  2. If Prometheus fails to connect to the Webapp, run WADebug to troubleshoot errors.
  3. If the Webapp and Coreapp containers are running, check if WA_WEB_ENDPOINT, WA_WEB_USERNAME, and WA_WEB_PASSWORD in the .env file are valid.
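The target check in step 1 can also be scripted against the Prometheus HTTP API, which exposes the same information as JSON at /api/v1/targets. The following is a minimal sketch, not part of the monitoring setup itself: the job names and scrape URLs in the sample payload are illustrative assumptions, and fetch_targets is a hypothetical helper that is defined but not called here.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (job, scrapeUrl) for every active target whose health is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "?"), t.get("scrapeUrl", "?"))
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_targets(base_url):
    """Fetch the target listing from a Prometheus server (performs a network call)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


# Illustrative payload in the shape returned by /api/v1/targets.
# Job names and URLs are assumptions made for this example.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "webstats"}, "scrapeUrl": "http://webapp.example/metrics", "health": "up"},
            {"labels": {"job": "appstats"}, "scrapeUrl": "http://coreapp.example/metrics", "health": "down"},
        ]
    },
}

down = unhealthy_targets(sample)
for job, url in down:
    print(f"target not UP: job={job} url={url}")
```

Against a live server, unhealthy_targets(fetch_targets("http://your-monitoring-hostname:9090")) would list the targets to investigate.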

CoreApp Overview Dashboard Alerts

Callback Failure Alert

Description

Success rate of sending callbacks to the Webhook URL specified in the application settings is low

Action Items

  1. Find the callback response error codes from the Callback Requests/sec panel.
  2. Grep the Coreapp logs for "network error" to see the actual error messages.
  3. Based on error codes and messages:
    • Verify that your Webhook is reachable by the Coreapp.
    • Verify that your Webhook always returns an HTTP 200 OK response after processing notifications.
    • Check whether your Webhook takes a long time to respond.
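One common way to keep a Webhook both fast and consistently successful is to acknowledge each notification immediately and do the slow work afterwards. The sketch below illustrates that pattern with Python's standard library only. It is an assumption about how a Webhook might be structured, not a description of any particular deployment; handle is a hypothetical placeholder for application logic, and a production Webhook would also need TLS and persistence.

```python
import json
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

pending = queue.Queue()  # notifications waiting for slow processing


def handle(notification):
    """Hypothetical placeholder for application-specific processing."""


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        pending.put(body)        # defer the real work to the worker thread
        self.send_response(200)  # acknowledge right away
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def worker():
    """Drain the queue in the background so request handling stays fast."""
    while True:
        body = pending.get()
        try:
            handle(json.loads(body))
        except json.JSONDecodeError:
            pass  # malformed payloads are skipped in this sketch
        finally:
            pending.task_done()


def make_server(port=0):
    """Create the server; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), WebhookHandler)
```

To run it, start worker in a daemon thread (threading.Thread(target=worker, daemon=True).start()) and call make_server(8080).serve_forever(); the port number is arbitrary.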

High Pending Outgoing Message Alert

Description

Outgoing message queue is close to being full; API requests will fail with System overloaded error (1016) soon

Action Items

  1. Check the Outgoing Messages panel row for any unusual traffic increases. If there are any, try to reduce the traffic load until the alert clears.
  2. Verify whether your database has failed over to another region recently. The WhatsApp Business API may not keep up with the load due to cross-region latency.
  3. If outgoing messages are queuing up slowly over time, you should report the bug to us.
  4. If a single WhatsApp Business API Client cannot meet your load requirements, set up Multiconnect to support much higher loads.

High Queuing Callback Alert

Description

Callback queue is close to being full; API requests will fail with System overloaded error (1016) soon

Action Items

  1. Check the Callback Error Rate panel to verify callbacks are processing successfully.
  2. Reduce the callback processing time for your Webhook.
  3. Configure max_concurrent_requests in the application settings to increase the number of in-flight callback requests (the default is 6).
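Steps 2 and 3 pull on the same lever. With a fixed number of in-flight requests, sustained callback throughput is bounded by roughly that number divided by the average Webhook response time (Little's law). The sketch below is a back-of-the-envelope estimate under that simplification; the latencies and the value 12 are illustrative assumptions, not recommended settings.

```python
def max_callback_rate(max_concurrent_requests, webhook_latency_s):
    """Upper bound on callbacks/sec for a given concurrency and average latency."""
    if webhook_latency_s <= 0:
        raise ValueError("webhook_latency_s must be positive")
    return max_concurrent_requests / webhook_latency_s


# Default concurrency of 6 with a Webhook that takes 0.5 s per callback.
baseline = max_callback_rate(6, 0.5)

# Halving the Webhook latency doubles the ceiling...
faster_webhook = max_callback_rate(6, 0.25)

# ...and so does doubling the number of in-flight requests.
more_inflight = max_callback_rate(12, 0.5)

print(baseline, faster_webhook, more_inflight)
```

If the incoming callback rate stays above this ceiling, the queue grows no matter how large it is, so the fix is to reduce latency or raise concurrency.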

Machine Overview Dashboard Alerts

High CPU Usage Alert

Description

CPU Utilization of a machine is too high

Action Items

  1. Check the CPU Detailed Util % panel to get utilization distribution.
  2. Run atop or top on the machine to find the processes consuming the most CPU. The Container Overview dashboard also shows container-level CPU metrics when you fill the Machine variable with the problematic machine.
  3. If the Webapp, Coreapp, or database consumes most of the CPU, find a more powerful machine to host them. For High Availability/Multiconnect mode, if the Webapp and Coreapp containers are running on the same machine, try moving them to separate machines.

High Disk Usage Alert

Description

Disk Utilization of a device on a machine is too high

Action Items

  1. Run the du and df commands on the device to analyze disk usage. The Container Overview dashboard also shows container-level disk metrics when you fill the Machine variable with the problematic machine.
  2. Clean up unnecessary space-consuming data on the device; if there are media files or logs, set up a cron job to clean up old data periodically.
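The periodic cleanup in step 2 could be done by a small script that a cron job invokes. The following is a minimal sketch, assuming that file modification time is an acceptable measure of age; remove_old_files is a hypothetical helper, and it defaults to a dry run so that nothing is deleted until the output has been reviewed.

```python
import time
from pathlib import Path


def remove_old_files(directory, max_age_days, dry_run=True):
    """Find regular files under `directory` older than `max_age_days`.

    Returns the matching paths; deletes them only when dry_run is False.
    """
    cutoff = time.time() - max_age_days * 86400
    matched = []
    for path in Path(directory).rglob("*"):
        # Skip directories and symlinks; only plain files are candidates.
        if path.is_symlink() or not path.is_file():
            continue
        if path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            matched.append(path)
    return matched
```

Calling remove_old_files(some_directory, 7) first lists what a 7-day retention would delete; passing dry_run=False performs the deletion. The directory and retention period are deployment-specific choices.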

High Memory Usage Alert

Description

Memory Utilization of a machine is too high

Action Items

  1. Check the Memory Details panel to get utilization distribution.
  2. Run atop or top on the machine to find the processes consuming the most memory. The Container Overview dashboard also shows container-level memory metrics when you fill the Machine variable with the problematic machine.
  3. If the Webapp, Coreapp, or database consumes most of the memory, find a more powerful machine to host them.
  4. If the Coreapp's memory usage is increasing slowly over time, it's probably due to a memory leak; you should report the bug to us. Restart the Coreapp to mitigate the memory issues.

Too Many Open Files Alert

Description

Machine is going to run out of file descriptors soon

Action Items

  1. Check the File Descriptor panel for the open file limit.
  2. Configure a higher value (e.g., fs.file-max = 600000) in the /etc/sysctl.conf file to increase the open file limit.
  3. Run sysctl -p to apply changes.
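To confirm the new limit outside the dashboard, the kernel reports system-wide file handle usage in /proc/sys/fs/file-nr as three numbers: allocated handles, allocated but unused handles, and the maximum (fs.file-max). The sketch below parses that format; the sample values are made up for illustration, and read_file_nr is a hypothetical helper that only works on Linux.

```python
from pathlib import Path


def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr into three integers."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def fd_utilization(text):
    """Fraction of the system-wide file handle limit currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum


def read_file_nr():
    """Read the live values (Linux only)."""
    return Path("/proc/sys/fs/file-nr").read_text()


# Illustrative sample: 9600 handles in use against a limit of 600000.
sample = "9600\t0\t600000\n"
print(f"{fd_utilization(sample):.1%} of file handles in use")
```

On the affected machine, fd_utilization(read_file_nr()) gives the current figure after sysctl -p has been applied.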

MySQL Overview Dashboard Alerts

Too Many DB Connections Alert

Description

DB connection pool utilization is high; new DB requests may fail with Too many connections errors soon

Action Items

  1. Check the Connections panel for the current connection limit.
  2. Increase the MySQL system variable max_connections (the default is 151) in my.cnf and restart the MySQL server. See the MySQL Server System Variables documentation for more information.
  3. For AWS RDS, you need to migrate to a larger RDS instance. See the RDS Instance Sizing section of the AWS Deployment Details for guidance.
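Before and after changing the limit, it can help to compare the number of open connections against max_connections directly. The sketch below shows the two standard MySQL statements and the arithmetic; it deliberately includes no database driver, and the rows are illustrative values in the (name, value) shape those statements return, so running the queries is left to whichever client you already use.

```python
# Standard MySQL statements that return one (name, value) row each.
QUERIES = (
    "SHOW GLOBAL VARIABLES LIKE 'max_connections'",
    "SHOW GLOBAL STATUS LIKE 'Threads_connected'",
)


def connection_utilization(rows):
    """Fraction of max_connections in use, given (name, value) rows."""
    values = {name.lower(): int(value) for name, value in rows}
    return values["threads_connected"] / values["max_connections"]


# Illustrative rows; the values are assumptions made for this example.
rows = [("max_connections", "151"), ("Threads_connected", "140")]
utilization = connection_utilization(rows)
print(f"{utilization:.0%} of max_connections in use")
```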

WebApp Overview Dashboard Alerts

HTTP Server High Pending Connections Alert

Description

Webapp internal HTTP server connection queue is close to full

Action Items

  1. Check the Business Client dashboard for unusual API traffic or high API request latency.
  2. Check the Webapp logs for more information.
  3. Check if the Webapp CPU utilization is high, and if so, find a more powerful machine for the Webapp.