...
We have a working implementation, but are still unclear on some failure modes and what we want as error handling strategies:
Producers:
Compression is off by default. We could turn it on, but it might not be worth it if we’re using linger.ms=0 and therefore probably have batches of 1 most of the time.
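For reference, the two settings involved look like this in librdkafka-style config (the names used by confluent-kafka-python). The values here are illustrative, not our current defaults:

```python
# Illustrative producer settings (librdkafka names). The bootstrap.servers
# value is a placeholder, not a real deployment value.
producer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder
    # With linger.ms=0 the producer sends as soon as possible, so batches
    # are usually size 1 and per-batch compression buys little.
    "linger.ms": 0,
    # If we did batch (linger.ms > 0), enabling compression could help:
    "compression.type": "snappy",  # one of gzip, snappy, lz4, zstd
}
```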
Consumer error handling:
Do we want to use a backoff (delay before next poll) when we encounter certain kinds of errors?
In event-bus-kafka 1.6.0 we did add EVENT_BUS_KAFKA_CONSUMER_POLL_FAILURE_SLEEP to prevent polling in a tight loop when the broker is failing.
Are there errors where we would want to push to a dead letter queue (DLQ)? What then?
Are there errors where we would want to not mark the offset as consumed and instead exit the consumer (forcing a restart)? This would require somehow knowing that the consumer is entirely broken.
What happens if there are multiple receivers and only one errors out? Would we want to retry just that one later?
event-bus-kafka 1.8.0 at least logs failing receivers
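The backoff and DLQ questions above can be sketched as control flow. Everything in this sketch is a stand-in: the consumer object, `process`, and `dlq_produce` are hypothetical, and only the setting name EVENT_BUS_KAFKA_CONSUMER_POLL_FAILURE_SLEEP comes from event-bus-kafka itself (represented here by a plain constant):

```python
import time

# Stand-in for EVENT_BUS_KAFKA_CONSUMER_POLL_FAILURE_SLEEP (seconds).
POLL_FAILURE_SLEEP = 1.0


def run_loop(consumer, process, dlq_produce, max_iterations=10):
    """Sketch of a consume loop: back off on poll errors, route bad events
    to a dead-letter queue.

    consumer.poll() returns a message (or None) or raises on broker trouble;
    process(msg) handles one event; dlq_produce(msg, err) would publish to a
    hypothetical dead-letter topic.
    """
    handled = []
    for _ in range(max_iterations):
        try:
            msg = consumer.poll()
        except Exception:
            # Broker trouble: sleep instead of polling in a tight loop.
            time.sleep(POLL_FAILURE_SLEEP)
            continue
        if msg is None:
            continue
        try:
            process(msg)
        except Exception as err:
            # Processing failed: park the event on a DLQ rather than
            # blocking the partition; the offset still gets committed.
            dlq_produce(msg, err)
        handled.append(msg)
        consumer.commit(msg)  # mark offset consumed either way
    return handled
```

This resolves the questions one particular way (commit-and-DLQ on processing failure); the exit-and-restart option would instead re-raise out of the loop without committing.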
Consumer transactionality:
Should we commit offsets only on-commit of the DB transaction? (Can we assume that all consumers will be on Django with a DB, though? Don't have general knowledge of side effects of signals.)
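The ordering being asked about can be made concrete. In Django this would hang off transaction.atomic() / transaction.on_commit(); here a plain callback list stands in for that machinery so the ordering is visible without a Django dependency:

```python
# Sketch of "commit Kafka offsets only after the DB transaction commits".
# FakeTransaction is a stand-in for Django's transaction handling; the
# `log` list just records the order in which things happen.

class FakeTransaction:
    def __init__(self):
        self._on_commit = []
        self.log = []

    def on_commit(self, fn):
        # Like django.db.transaction.on_commit: defer fn until commit.
        self._on_commit.append(fn)

    def commit(self):
        self.log.append("db-commit")
        for fn in self._on_commit:
            fn()  # offset commit runs only after the DB commit succeeds


def handle_event(txn, commit_offset):
    txn.log.append("db-writes")   # side effects inside the transaction
    txn.on_commit(commit_offset)  # defer the Kafka offset commit
```

If the DB transaction rolls back, the deferred offset commit never runs, so the event is redelivered — which is the at-least-once behavior this question is really about.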
Can we easily rely on re-running events after a fix? What would that take?
Recovery:
If we have a bug that causes us to drop/mis-process events, how can we re-run events?
Probably involves resetting consumer offsets. event-bus-kafka does include offsets in its error logs.
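Concretely, re-running events would mean seeking each affected partition back to an offset recovered from those error logs (Kafka's own kafka-consumer-groups.sh --reset-offsets tooling can also do this for a stopped group). A stub sketch of the seek step — the consumer object and topic name here are hypothetical:

```python
def rewind_partitions(consumer, logged_offsets):
    """Seek each (topic, partition) back to an offset recovered from error
    logs, so events from that point on are redelivered on the next poll.

    `consumer.seek(topic, partition, offset)` is a stand-in for the real
    client's seek/offset-reset API; `logged_offsets` maps
    (topic, partition) -> offset.
    """
    for (topic, partition), offset in sorted(logged_offsets.items()):
        consumer.seek(topic, partition, offset)
```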
Unexplored settings:
partition.assignment.strategy – may want to use CooperativeStickyAssignor since it vaguely sounds like it might give the best cache locality.
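If we try it, the setting would look something like this. Note the naming difference: the Java client takes the CooperativeStickyAssignor class name, while librdkafka-based clients (like confluent-kafka-python) take the string "cooperative-sticky"; the group.id here is a placeholder:

```python
# Illustrative consumer setting (librdkafka name, as used by
# confluent-kafka-python). group.id is a placeholder.
consumer_config = {
    "group.id": "example-group",
    "partition.assignment.strategy": "cooperative-sticky",
}
```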
Topics:
We should enable log compaction at some point.
delete.retention.ms seems low (1 day) and we may want to increase it, but we’re not even sending deletes yet.
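Both knobs are topic-level configs. A sketch of what we might set, where the 7-day value is just an example and not a decided number (Kafka's default delete.retention.ms is 86400000 ms, i.e. the 1 day noted above):

```python
# Illustrative topic-level settings; the retention value is an example only.
DAY_MS = 24 * 60 * 60 * 1000

topic_config = {
    "cleanup.policy": "compact",          # enable log compaction
    "delete.retention.ms": 7 * DAY_MS,    # e.g. raise from the 1-day default
}
```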
…and anything from below that is still relevant:
Older notes
Note: These are notes from July 2022 and earlier, and many of the notes and questions are outdated.
...