Event Bus Notes and Questions

This is a place for notes, resource links, etc. that may be cleaned up and moved into more permanent documentation in the future.

Kafka Uncertainties

We have a working implementation, but are still unclear on some failure modes and what we want as error handling strategies:

  • Producers:

    • Compression is off by default. We could turn it on, but it might not be worth it if we’re using linger.ms=0 and therefore probably have batches of 1 most of the time.

  • Consumer error handling:

    • Do we want to use a backoff (delay before next poll) when we encounter certain kinds of errors?

    • Are there errors where we would want to push to a dead letter queue (DLQ)? What then?

    • Are there errors where we would want to leave the offset uncommitted and exit the consumer (restart)? This would require somehow knowing that the consumer is entirely broken.

    • What happens if there are multiple receivers and only one errors out? Would we want to retry just that one later?

  • Consumer transactionality:

    • Should we commit offsets only when the DB transaction commits? (Can we assume that all consumers will run on Django with a DB, though? We don't know in general what side effects the signals have.)

  • Recovery:

    • If we have a bug that causes us to drop/mis-process events, how can we re-run events?

      • Probably involves resetting consumer offsets. event-bus-kafka does include offsets in its error logs.

  • Unexplored settings:

    • partition.assignment.strategy – may want to use CooperativeStickyAssignor since it vaguely sounds like it might give the best cache locality

  • Topics:

    • We should enable log compaction at some point. delete.retention.ms seems low (1 day) and we may want to increase it, but we’re not even sending deletes yet.
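
One possible shape for the consumer error-handling questions above (backoff, DLQ, exit without committing, one receiver failing out of several) is sketched below. This is an illustration only, not how event-bus-kafka behaves: RetriableError, FatalConsumerError, handle_event, and the in-memory dead_letters list are all hypothetical names, and no Kafka client is involved.

```python
import time


class RetriableError(Exception):
    """Hypothetical: a transient failure worth retrying (e.g. a timeout)."""


class FatalConsumerError(Exception):
    """Hypothetical: the consumer itself is broken and should restart."""


def handle_event(event, receivers, dead_letters, max_attempts=3,
                 base_delay=0.5, sleep=time.sleep):
    """Run every receiver for one event; return True if the offset may be committed."""
    for receiver in receivers:
        name = getattr(receiver, "__name__", repr(receiver))
        for attempt in range(1, max_attempts + 1):
            try:
                receiver(event)
                break  # this receiver succeeded; move on to the next one
            except FatalConsumerError:
                # Propagate without committing, so the event is redelivered
                # after the process restarts.
                raise
            except RetriableError:
                if attempt == max_attempts:
                    # Retries exhausted: park the event for this receiver only.
                    dead_letters.append((name, event))
                else:
                    # Exponential backoff: 0.5s, 1s, 2s, ...
                    sleep(base_delay * 2 ** (attempt - 1))
            except Exception:
                # Unknown errors are not retried in this sketch.
                dead_letters.append((name, event))
                break
    return True
```

In this sketch a failing receiver does not block the others, and the caller would commit the offset only when handle_event returns. A real DLQ would more likely be a separate topic than a list, and whether unknown errors should be retried is one of the open questions.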

…and anything from below that is still relevant:

Older notes

These are notes from July 2022 and earlier, and many of the notes and questions are outdated.

Discovery Questions and Notes

Requirements for Enabling Squads

  • Infrastructure ready

  • Onboarding Documentation ready

    • How to create a new event

      • Requirements for a new event

        • (Initially) Must have the feature using it behind a feature flag.

        • Must have a schema

    • How to consume an event

    • How can I learn what events exist?

    • How to not break the abstraction layer (if we have one)

Schemas

  • Do we start with FULL_TRANSITIVE as a compatibility level?

    • Do we need to determine what changes we can’t make, if any, if transitive?

    • Or does this just slow performance?
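
To make "what changes we can’t make" more concrete, here is a deliberately simplified sketch of a FULL_TRANSITIVE-style check. It assumes schemas are plain dicts of field name to {"type", optional "default"}, and it ignores type promotion, unions, aliases, and everything else a real schema registry checks. The function names are hypothetical.

```python
def full_problems(old_fields, new_fields):
    """List reasons why new_fields is not fully compatible with old_fields."""
    problems = []
    for name, spec in new_fields.items():
        if name not in old_fields:
            # Backward: the new schema must read old data that lacks this field.
            if "default" not in spec:
                problems.append(f"added field {name!r} has no default")
        elif old_fields[name]["type"] != spec["type"]:
            problems.append(f"field {name!r} changed type")
    for name, spec in old_fields.items():
        # Forward: the old schema must read new data that lacks this field.
        if name not in new_fields and "default" not in spec:
            problems.append(f"removed field {name!r} had no default")
    return problems


def full_transitive_problems(history, candidate):
    """Check the candidate against every earlier version, not just the latest."""
    problems = []
    for version, old_fields in enumerate(history, start=1):
        for problem in full_problems(old_fields, candidate):
            problems.append(f"v{version}: {problem}")
    return problems


v1 = {"user_id": {"type": "long"}}
v2 = {"user_id": {"type": "long"}, "mode": {"type": "string", "default": ""}}
# Candidate v3 drops the default on "mode".
v3 = {"user_id": {"type": "long"}, "mode": {"type": "string"}}
```

Under these simplified rules, v3 passes when compared only with v2 but fails against v1, which is the kind of change a transitive level would reject and a non-transitive level would let through.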

Producing Events

  • If we use an outbox:

    • Do we need to have back-pressure and fail when the outbox is too large?

    • Do we need to deserialize/serialize again when pulling from outbox?

    • Is an outbox needed? Does it solve the problem of ensuring the event is written to the outbox as part of the DB transaction in the original request?

    • Could we just have an emergency outbox for when the bus is down, but use synchronous send by default?

Abstraction Layer

  • We’ll need to both keep this in mind and look into it more as we learn more about what we wish to abstract.

General

Potential Materials

Other edX Initiatives

Additional Notes

Miscellaneous discovery for message bus:

  • Leading book club

  • Pulsar vs Kafka

  • Naming and name-spacing

  • Event observability

  • Types of events?

    • Log Events

    • New Relic Events

    • Tracking Events

    • Data Analysis

    • Segment Events

    • Realtime Events (Partners)

      • xAPI and Caliper

      • Missing unique ids?

    • Intraprocess Events

      • (Backend) Django Signals

      • What changes between inter-process and intra-process events?

    • (Frontend) JavaScript Events