Incident - Resolved, all systems operational
Postmortem

At about 7:30 pm on 12/05, internal issues in the time series database caused very long processing times for time series queries. Because the backend could no longer update the database, incoming MQTT messages were placed in the broker's error queue for later processing. Once that queue reached its size limit, the broker refused to process further incoming messages. At that point the DevOps team was alerted and began working on the issue.

The time series database issue was resolved, but the broker was still unresponsive because of its internal queue handling. The broker was restarted and all messages stored in the error queue were re-injected for further processing. At that point the cloud returned to a normal operating state. This process ended at around 2 am on 13/05.
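The queue behavior described above can be illustrated with a minimal sketch. It is not the broker's actual implementation: the class `ErrorQueue`, its `limit` parameter, and the methods `park` and `reinject` are hypothetical names chosen for illustration. It only shows how a bounded holding queue starts refusing messages once it is full, and how parked messages can be processed again afterwards.

```python
import queue


class ErrorQueue:
    """Hypothetical bounded holding area for messages the backend could not store."""

    def __init__(self, limit):
        # Assumption: the real broker enforces some size limit; the value is illustrative.
        self._items = queue.Queue(maxsize=limit)

    def park(self, message):
        """Try to keep a message for later processing; return False once the limit is hit."""
        try:
            self._items.put_nowait(message)
            return True
        except queue.Full:
            return False

    def reinject(self, process):
        """Hand every parked message to `process` in arrival order; return how many were handled."""
        count = 0
        while True:
            try:
                message = self._items.get_nowait()
            except queue.Empty:
                return count
            process(message)
            count += 1
```

In this sketch, a call to `park` that returns `False` corresponds to the broker refusing an incoming message, and `reinject` corresponds to the replay of the error queue after the restart.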

During the next working day (13/05), the consulting team reported that data had been lost. Investigating the site controllers, we discovered that because the AMQ had become slow or unresponsive, some messages from the site controller were marked internally as sent, but the system never received an acknowledgment from the AMQ. When the broker was restarted, the AMQ dropped these in-transit messages, so no acknowledgments were ever sent. As a result, the site controller waited indefinitely for them and never sent the next message to the external broker.
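To make the failure mode concrete, the following minimal sketch contrasts it with a bounded wait. It is an illustration under assumptions, not the site controller's code and not a description of the fix that was applied: `PendingMessage`, `send_with_ack_timeout`, `transmit`, `local_store`, and `ack_timeout` are hypothetical names. The sender waits for an acknowledgment for a limited time only, and puts the payload back into a local store if none arrives.

```python
import threading


class PendingMessage:
    """Hypothetical in-transit message; `acked` is set when the broker acknowledges it."""

    def __init__(self, payload):
        self.payload = payload
        self.acked = threading.Event()


def send_with_ack_timeout(message, transmit, local_store, ack_timeout):
    """Send a message and wait at most `ack_timeout` seconds for its acknowledgment.

    Returns True if the acknowledgment arrived. Otherwise the payload is appended
    to `local_store` (an assumption standing in for a local database) so that it
    can be sent again later, and False is returned.
    """
    transmit(message)
    if message.acked.wait(timeout=ack_timeout):
        return True
    local_store.append(message.payload)
    return False
```

Calling `message.acked.wait()` without a timeout would reproduce the situation described above: if the acknowledgment is never sent, the call never returns.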

Under normal circumstances this scenario would not cause data loss, because the site controllers store messages in a local database when the connection to the broker is lost and send them as soon as the connection is back online. Unfortunately, a second complication occurred on the affected site controllers. The site controller sends an event to the local broker each time the connection state of the external broker changes, and it does so directly inside a callback from the external broker. The reverse path, towards the external broker, works in a similar way. In this case each client held a lock on the connection to its respective broker, so the internal client blocked the external one and the external one blocked the internal one. In this deadlock some of the messages were dropped, causing the data loss. Unfortunately, the affected site controllers have to be restarted through the console to return to normal operation.
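The mutual blocking described above can be reproduced in a few lines. The sketch below is a simplified model, not the site controller's code: `BrokerClient`, `run_callback`, and `simulate` are hypothetical names, and the 0.2 second timeout exists only so that the example terminates. Without the timeout, both threads would wait on each other forever.

```python
import threading


class BrokerClient:
    """Hypothetical client that guards its broker connection with a single lock."""

    def __init__(self, name):
        self.name = name
        self.connection_lock = threading.Lock()
        self.received = []

    def publish(self, message, timeout):
        # Publishing needs the connection lock; give up after `timeout` seconds.
        if not self.connection_lock.acquire(timeout=timeout):
            return False
        try:
            self.received.append(message)
            return True
        finally:
            self.connection_lock.release()


def run_callback(own, other, barrier, results):
    # The callback runs while the client's own connection lock is held ...
    with own.connection_lock:
        barrier.wait()  # both callbacks now hold their own lock
        # ... and publishes through the other client from inside the callback.
        results[own.name] = other.publish(f"state change from {own.name}", timeout=0.2)
        barrier.wait()  # keep holding the lock until both attempts have finished


def simulate():
    """Run both callbacks concurrently and report whether each publish succeeded."""
    internal, external = BrokerClient("internal"), BrokerClient("external")
    barrier = threading.Barrier(2)
    results = {}
    threads = [
        threading.Thread(target=run_callback, args=(internal, external, barrier, results)),
        threading.Thread(target=run_callback, args=(external, internal, barrier, results)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

In this model both publish attempts fail, because each thread holds the lock the other one needs. A common way to avoid this pattern is to hand the event to a separate worker instead of publishing from inside the callback; whether that fits the site controller is outside the scope of this sketch.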

We want to assure you that we have taken extensive measures to mitigate these problems: even more extensive monitoring is in place, a dedicated developer is debugging the site controller behavior, and high-priority bug reports related to this issue are in our backlog. Updates related to this issue will be made available as soon as possible. Please feel free to send us any further inquiries.

Posted May 14, 2020 - 20:41 CEST

Resolved
Technical issues in our systems have been mitigated and monitored during an incubation period. All services are back to normal and fully operational.

Please excuse the degraded operations, and thank you very much for your patience. Our team is working hard on the mitigation and on investigating the exact root causes, and the necessary changes will be applied swiftly.

Feel free to contact us if you've got questions.
E-Mail: consulting@azeti.net
Web: https://customer.azeti.net

Sincerely,
Your azeti Team
Posted May 13, 2020 - 12:52 CEST
This incident affected: Frontend, Backend, MQTT Broker, and APIs.