Momentary disconnections can lock streams in a zombie state #9

Closed
opened 2025-02-22 06:51:31 +00:00 by trysdyn · 3 comments
Owner

If a disconnection is short enough, the middleware can process the reconnection before the disconnection, resulting in the status endpoint not reporting the stream is live until the streamer disconnects and reconnects again.

This is likely exacerbated by setting `BlockDuplicateStreamName` to `false` in OvenMediaEngine, since a stream key re-use produces a disconnect and a reconnect at the same time, and the two may be processed out of order. Need to confirm.

One potential fix for this is to ask the API for a canonical stream list on every state change, but that has caused problems in the past too: OvenMediaEngine won't actually report a stream as live until it receives the first video frames, so a slow enough handshake can result in the middleware asking for the stream list and getting an incorrect one.

Another possibility is a regular heartbeat to keep the stream list correct. This has the bonus of being able to detect in the middleware if the stream server dies, and that can be used for alerting/status endpoint stuff.
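
A minimal sketch of the heartbeat idea, assuming a hypothetical `fetch_streams` callable that asks the stream server for its current stream names (the real API call is not shown here, and the class and attribute names are made up for illustration). Each beat replaces the cached list with the server's view, so a list that was wrong because of a slow handshake gets corrected on a later beat, and a failed fetch doubles as a "server is down" signal.

```python
import threading
from typing import Callable, Iterable, Optional


class StreamHeartbeat:
    """Periodically replace the cached stream list with the server's own view."""

    def __init__(self, fetch_streams: Callable[[], Iterable[str]], interval: float = 10.0):
        # fetch_streams is a hypothetical callable returning live stream names;
        # it is expected to raise if the stream server cannot be reached.
        self.fetch_streams = fetch_streams
        self.interval = interval
        self.streams = set()
        self.server_alive = False
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def beat(self) -> None:
        """Run a single reconciliation pass."""
        try:
            self.streams = set(self.fetch_streams())
            self.server_alive = True
        except Exception:
            # Keep the last known list, but flag the server as unreachable
            # so the status endpoint or alerting can act on it.
            self.server_alive = False

    def _run(self) -> None:
        while not self._stop.is_set():
            self.beat()
            # wait() returns early if stop() is called during the sleep
            self._stop.wait(self.interval)

    def start(self) -> None:
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

The interval is a trade-off: a shorter one corrects a stale list sooner but polls the stream server more often.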

A third option, if this is caused by stream key re-use, is to detect a stream being opened while it's already open, then apply some logic to anticipate the immediate disconnection that follows.
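
The third option could look roughly like this. It is only a sketch: `ReuseAwareTracker`, `on_open`, and `on_close` are hypothetical names, not part of the middleware, and it assumes every duplicate open is followed by exactly one close that should be ignored.

```python
class ReuseAwareTracker:
    """Track live streams, ignoring the close that follows a duplicate open."""

    def __init__(self):
        self.live = set()
        # stream name -> number of upcoming closes that should be ignored
        self.expected_closes = {}

    def on_open(self, stream: str) -> None:
        if stream in self.live:
            # Opened while already open: a close for one of the two
            # sessions is expected to follow, so plan to swallow it.
            self.expected_closes[stream] = self.expected_closes.get(stream, 0) + 1
        self.live.add(stream)

    def on_close(self, stream: str) -> None:
        if self.expected_closes.get(stream, 0) > 0:
            # This is the anticipated disconnection; the stream stays live.
            self.expected_closes[stream] -= 1
            return
        self.live.discard(stream)
```

If the assumption fails and the anticipated close never arrives, the stream stays marked live after its real close, so this would probably need a timeout or a periodic resync alongside it.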

Author
Owner

Part of this might be [an upstream issue](https://github.com/AirenSoft/OvenMediaEngine/issues/1749).

Going to grab 0.18.0 and try to replicate against that.
Author
Owner

Still seeing this in 0.18.0

So I need to dive into this.

Author
Owner

Found the issue. The admission webhook that manages the stream list in cache has a bug here:

[Lines 90 to 95 in a0b5b11](https://git.voidfox.com/trysdyn/ovenemprex/src/commit/a0b5b11e4aaa87f190c7cc3904d0eb294544e180/admission.py#L90-L95)

    # Remove or add the changed stream, as appropriate
    if cherrypy.request.update_opening and cherrypy.request.update_stream not in stream_list:
        stream_list.append(cherrypy.request.update_stream)
    if not cherrypy.request.update_opening and cherrypy.request.update_stream in stream_list:
        stream_list.remove(cherrypy.request.update_stream)

We check for dupes on add, but not on remove. As a result we'll only ever have one entry for a stream in the list at a time, and will remove it on any closure, including the closure of a duplicate connection while the original is still live.

Any connection attempt to OME causes an open, then a subsequent close if it fails. So we need to store multiple opens and closes, then collapse this list to a set when evaluating it.
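
A rough sketch of that approach, not the actual change made to `admission.py`: the function names and the use of `collections.Counter` in place of a plain list are assumptions for illustration. Every open is counted, every close removes one count, and the live set is derived only when the state is read.

```python
from collections import Counter


def apply_update(open_counts: Counter, stream: str, opening: bool) -> None:
    """Record one open or one close for a stream name."""
    if opening:
        # Count every open, even if the stream is already tracked.
        open_counts[stream] += 1
    elif open_counts[stream] > 0:
        # A close only cancels a single earlier open.
        open_counts[stream] -= 1


def live_streams(open_counts: Counter) -> set:
    """Collapse the per-stream counts into the set of live stream names."""
    return {name for name, count in open_counts.items() if count > 0}


# Stream key re-use: two opens, then the close of one of the two sessions.
counts = Counter()
apply_update(counts, "example", opening=True)
apply_update(counts, "example", opening=True)
apply_update(counts, "example", opening=False)
print(live_streams(counts))
```

With the original dedupe-on-add logic the same three events would leave the list empty. Here one open is still outstanding, so the stream is still reported as live.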

Reference: trysdyn/ovenemprex#9