Event Bus Notes and Questions
This is a place for notes, resource links, etc. that may be fixed and moved as part of more permanent documentation in the future.
Kafka Uncertainties
We have a working implementation, but are still unclear on some failure modes and what we want as error handling strategies:
Producers:
Compression is off by default. We could turn it on, but it might not be worth it if we’re using linger.ms=0 and therefore probably have batches of 1 most of the time.
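To make the trade-off concrete, here is a minimal sketch of the two producer configurations in question, written as plain dicts of the kind a librdkafka-based client accepts. compression.type and linger.ms are standard librdkafka properties; the function name, the gzip choice, and the 50 ms linger value in the usage note are illustrative assumptions, not settings we have chosen.

```python
def build_producer_config(bootstrap_servers: str,
                          compress: bool = False,
                          linger_ms: int = 0) -> dict:
    """Build a producer config dict (hypothetical helper, for illustration).

    With linger_ms=0 the producer does not wait to fill a batch, so most
    batches hold one message and compression has little to work with.
    """
    config = {
        "bootstrap.servers": bootstrap_servers,
        "linger.ms": linger_ms,
    }
    if compress:
        # gzip is an arbitrary pick; snappy, lz4, and zstd are also valid values.
        config["compression.type"] = "gzip"
    return config
```

Calling build_producer_config("localhost:9092", compress=True, linger_ms=50) would describe the alternative: trade up to 50 ms of latency for larger batches that compress better. Whether that is worthwhile would need measuring against our real event sizes.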
Consumer error handling:
Do we want to use a backoff (delay before next poll) when we encounter certain kinds of errors?
In event-bus-kafka 1.6.0 we did add EVENT_BUS_KAFKA_CONSUMER_POLL_FAILURE_SLEEP to prevent polling in a tight loop when the broker is failing.
Are there errors where we would want to push to a dead letter queue (DLQ)? What then?
Are there errors where we would want to just not mark the offset as consumed, and just exit the consumer (restart)? This would require somehow knowing that the consumer is entirely broken.
What happens if there are multiple receivers and only one errors out? Would we want to retry just that one later?
event-bus-kafka 1.8.0 at least logs failing receivers
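The questions above can be explored with a small model of the consume loop. This is a sketch under stated assumptions, not how event-bus-kafka behaves: poll is any callable that returns an event dict, returns None, or raises ConnectionError as a stand-in for a broker failure; the exponential backoff is one possible policy (the existing setting is a fixed sleep, as far as these notes say); and dead_letters is a plain list standing in for a DLQ topic. All names are hypothetical.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


def poll_failure_delay(consecutive_failures: int,
                       base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait before the next poll: doubles per failure, capped."""
    if consecutive_failures <= 0:
        return 0.0
    return min(cap, base * 2 ** (consecutive_failures - 1))


def dispatch(event: dict,
             receivers: Dict[str, Callable[[dict], None]]) -> List[str]:
    """Call every receiver; return the names of the ones that raised."""
    failed = []
    for name, receiver in receivers.items():
        try:
            receiver(event)
        except Exception:
            # One failing receiver does not stop the others from running.
            failed.append(name)
    return failed


def consume(poll: Callable[[], Optional[dict]],
            receivers: Dict[str, Callable[[dict], None]],
            dead_letters: List[Tuple[str, dict]],
            max_polls: int,
            sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll up to max_polls times, backing off while polling fails."""
    failures = 0
    for _ in range(max_polls):
        try:
            event = poll()
        except ConnectionError:
            failures += 1
            sleep(poll_failure_delay(failures))
            continue
        failures = 0  # a successful poll resets the backoff
        if event is None:
            continue
        for name in dispatch(event, receivers):
            # Record (receiver, event) so only that receiver is retried later.
            dead_letters.append((name, event))
```

Recording the failing receiver's name alongside the event is what would allow retrying just that receiver later. It does not answer the harder question of when the consumer should give up and exit instead.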
Consumer transactionality:
Should we commit offsets only on-commit of the DB transaction? (Can we assume that all consumers will be on Django with a DB, though? We don't have general knowledge of the side effects of signals.)
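A minimal sketch of the idea, using the standard library's sqlite3 instead of Django so that it runs on its own: the offset is committed only after the receiver's DB transaction has committed. The function name and the handler and commit_offset callables are hypothetical. In a Django consumer the corresponding hook would presumably be transaction.on_commit, but we have not tried that.

```python
import sqlite3
from typing import Callable


def process_then_commit_offset(conn: sqlite3.Connection,
                               offset: int,
                               handler: Callable[[sqlite3.Connection], None],
                               commit_offset: Callable[[int], None]) -> bool:
    """Run handler in a DB transaction; commit the offset only if it commits."""
    try:
        # The connection context manager commits on success and rolls back
        # if the handler raises.
        with conn:
            handler(conn)
    except Exception:
        # DB work was rolled back, so leave the offset uncommitted and let
        # the event be redelivered.
        return False
    commit_offset(offset)
    return True
```

Even with this ordering, a crash between the DB commit and the offset commit would redeliver the event, so receivers would still need to tolerate duplicates.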
Recovery:
If we have a bug that causes us to drop/mis-process events, how can we re-run events?
Probably involves resetting consumer offsets. event-bus-kafka does include offsets in its error logs.
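As a first step toward a replay procedure, the sketch below turns a set of failed (partition, offset) pairs, collected by hand from the error logs, into the offset each partition would need to be rewound to. The helper name is hypothetical and no log format is assumed; actually applying the result (for example with the client's seek support or Kafka's consumer group tooling) is not shown.

```python
from typing import Dict, Iterable, Tuple


def rewind_targets(failed: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Map each partition to the earliest offset that failed on it."""
    targets: Dict[int, int] = {}
    for partition, offset in failed:
        if partition not in targets or offset < targets[partition]:
            targets[partition] = offset
    return targets
```

Rewinding to the earliest failed offset replays every later event on that partition too, including ones that were processed correctly, so this only works if receivers are idempotent.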
Unexplored settings:
partition.assignment.strategy: we may want to use CooperativeStickyAssignor, since it vaguely sounds like it might give the best cache locality.
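For reference, a sketch of what trying this setting might look like as a consumer config dict for a librdkafka-based client. In librdkafka the cooperative sticky assignor is selected with the value cooperative-sticky; the helper name is hypothetical, and "range,roundrobin" is what we understand the librdkafka default to be.

```python
def build_consumer_config(bootstrap_servers: str, group_id: str,
                          cooperative: bool = False) -> dict:
    """Build a consumer config dict (hypothetical helper, for illustration)."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # cooperative-sticky tries to keep partitions on the consumer that
        # already owns them across rebalances.
        "partition.assignment.strategy": (
            "cooperative-sticky" if cooperative else "range,roundrobin"
        ),
    }
```

Switching a running consumer group between eager and cooperative strategies may need a staged rollout, which we would have to look into before changing it.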
Topics:
We should enable log compaction at some point.
delete.retention.ms seems low (1 day) and we may want to increase it, but we’re not even sending deletes yet.
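A small sketch of the topic-level settings this would involve. cleanup.policy and delete.retention.ms are standard Kafka topic configs; the helper name is hypothetical and the 7-day figure in the usage note is only an example, not a recommendation.

```python
def compacted_topic_config(tombstone_retention_days: int) -> dict:
    """Topic config for log compaction, keeping delete markers for N days."""
    ms_per_day = 24 * 60 * 60 * 1000
    return {
        "cleanup.policy": "compact",
        # How long tombstones (deletes) remain readable after compaction.
        "delete.retention.ms": str(tombstone_retention_days * ms_per_day),
    }
```

compacted_topic_config(1) reproduces the current 1-day value (86400000 ms), and compacted_topic_config(7) would give consumers a week to see a delete before it is compacted away.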
…and anything from below that is still relevant:
Older notes
These are notes from July 2022 and earlier, and many of the notes and questions are outdated.
Discovery Questions and Notes
Requirements for Enabling Squads
Infrastructure ready
Onboarding Documentation ready
How to create a new event
Requirements for a new event
(Initially) Must have the feature using it behind a feature flag.
Must have a schema
How to consume an event
How can I learn what events exist?
How to not break the abstraction layer (if we have one)
Schemas
Do we start with FULL_TRANSITIVE as a compatibility level?
Do we need to determine what changes we can’t make, if any, if transitive?
Or does this just slow performance?
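To get a feel for which changes FULL_TRANSITIVE would rule out, here is a deliberately simplified model. A schema is reduced to a dict mapping field name to whether the field has a default; type changes, renames, and everything else a real schema registry checks are ignored. All names are hypothetical.

```python
from typing import Dict, List

# field name -> True if the field has a default value
Schema = Dict[str, bool]


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if data written with `writer` can be decoded with `reader`.

    Every field the reader expects but the writer lacks needs a default.
    """
    return all(has_default
               for name, has_default in reader.items()
               if name not in writer)


def is_full(new: Schema, old: Schema) -> bool:
    """Backward and forward compatible with one earlier version."""
    return can_read(new, old) and can_read(old, new)


def is_full_transitive(new: Schema, history: List[Schema]) -> bool:
    """Backward and forward compatible with every earlier version."""
    return all(is_full(new, old) for old in history)


v1 = {"user_id": False}
v2 = {"user_id": False, "email": True}   # adds an optional field
v3 = {"user_id": False, "email": False}  # drops the default on email
```

In this model v3 is FULL-compatible with v2 alone, but not with v1, so is_full_transitive(v3, [v1, v2]) is False. In other words, under the transitive level we could only ever add or remove fields that have defaults, and a field's default could not be dropped later.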
Producing Events
If we use an outbox:
Do we need to have back-pressure and fail when the outbox is too large?
Do we need to deserialize/serialize again when pulling from outbox?
Is outbox needed? Does it solve ensuring we get into outbox as part of db transaction in original request?
May we just have an emergency outbox for when the bus is down but just use synchronous send by default?
Abstraction Layer
We’ll need to keep this in mind and look into it further as we learn more about what we wish to abstract.
General
When updating OEP, also see Event Bus POC | Findings (archived)
Is it a problem that we are using the term Event Bus and Message Bus interchangeably?
Our Architecture Manifesto (WIP) more generally documents Asynchronous over Synchronous.
Our manifesto was loosely based on the Reactive Manifesto, which highlights Message Driven (in contrast to Event Driven).
Answer: No. Event Bus and Message Bus can be used interchangeably because the term Bus implies the pub/sub messaging pattern. Message Driven vs Event Driven may have different meanings, but it is unclear if that is universal or just according to the Reactive Manifesto.
Do we need to get more clear on use cases, or do we wish to find a one-size fits all technology?
Answer: We will be starting with pub/sub.
Documenting event definitions and event implementations (sending).
See annotations introduced in openedx-events for event definitions.
DE is collecting ownership information for events defined/emitted.
When and how do we handle back-filling data for a new event stream with legacy data?
When might we address questions around reliable data synchronization?
Where would this fall in terms of discovery work?
For example:
Ensuring event matches data committed to database.
Ensuring events are not lost.
Ultimately, will we want how-tos?
Ownership and rollout questions regarding this infrastructure work.
How to enable early wild-west-like learning with a small subset, and open more widely as the path gets better paved.
Reminder to get feedback early and often from anyone, as possible.
When is the right time for a Consumer Review? Is this inform only?
Potential Materials
Kafka documentation: Apache Kafka (very thorough overall)
librdkafka FAQ: FAQ (implementation we’re using, indirectly)
Types of Events at edX
Signal Events: GitHub - eduNEXT/openedx-events: Open edX events from the Hooks Extensions Framework
OEP-41: Asynchronous Server Event Message Format — Open edX Proposals 1.0 documentation
Other edX Initiatives
https://openedx.atlassian.net/wiki/pages/createpage.action?spaceKey=RS&title=Message%20Bus%20Discovery%3A%20Apache%20Kafka (Revenue Team)
Additional Notes
Kafka Discovery
See 3 Python Libraries for Kafka Compared
In addition to choosing a Python library, we need to determine if messages are sent synchronously.
Kafka Connect or Kafka Streams
Kafka Connect Deep Dive – Error Handling and Dead Letter Queues | Confluent
Note: Connectors are not managed by Amazon MSK
Kafka Hosting
Configuring to not lose data
Required discovery
Runbooks
Observability
Backups/disaster recovery
Future:
Zookeeper removal: https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
Addition of Tiered Storage
Schemas
Event-Driven Architecture
Medium: Benefits and challenges of Event-Driven architecture
“It is also important to have strong monitoring and observability solutions in place. You need to know which service sent which events, and who is subscribed to these events. Having good visibility into the flow of events will let you understand the system and troubleshoot it with more confidence and less guessing.”
Miscellaneous discovery for message bus:
Leading book club
Pulsar vs Kafka
Naming and name-spacing
Event observability
Types of events?
Log Events
New Relic Events
Tracking Events
Data Analysis
Segment Events
Realtime Events (Partners)
xAPI and Caliper
Missing unique ids?
Intraprocess Events
(Backend) Django Signals
What changes for inter/intra?
(Frontend) Javascript Events