Info

This is a place for notes, resource links, etc. that may be fixed and moved as part of more permanent documentation in the future.

Kafka Uncertainties

We have a working implementation, but some failure modes, and the error-handling strategies we want for them, are still unclear:

  • Consumer error handling:

    • Do we want to use a backoff (delay before next poll) when we encounter certain kinds of errors?

    • Are there errors where we would want to push to a dead letter queue (DLQ)? What then?

    • Are there errors where we would want to leave the offset unmarked (not consumed) and exit the consumer (restart)? This would require somehow knowing that the consumer is entirely broken.

    • What happens if there are multiple receivers and only one errors out? Would we want to retry just that one later?

  • Consumer transactionality:

    • Should we commit offsets only when the DB transaction commits? (Can we assume that all consumers will be on Django with a DB, though? We don't have general knowledge of the side effects of signals.)

  • Can we easily rely on re-running events after a fix? What would that take?
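
To make the consumer error-handling options above concrete, here is a minimal sketch of one possible policy: retry with exponential backoff on transient errors, route unprocessable messages to a DLQ and move on, and stop without marking the offset when retries are exhausted. It uses only the standard library and plain `(offset, payload)` tuples rather than a real Kafka client. The names `TransientError`, `PoisonMessageError`, `ConsumeResult`, and `consume` are hypothetical, and how to classify real errors into these categories is exactly the open question.

```python
import time
from dataclasses import dataclass, field


class TransientError(Exception):
    """Hypothetical: an error worth retrying after a delay (e.g. a DB timeout)."""


class PoisonMessageError(Exception):
    """Hypothetical: the message itself can never be processed (e.g. bad payload)."""


@dataclass
class ConsumeResult:
    committed: list = field(default_factory=list)      # offsets marked as consumed
    dead_lettered: list = field(default_factory=list)  # offsets routed to the DLQ
    stopped_at: object = None                           # offset where the loop gave up


def consume(messages, handler, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Process (offset, payload) pairs in order under the policy described above."""
    result = ConsumeResult()
    for offset, payload in messages:
        attempt = 0
        while True:
            try:
                handler(payload)
            except TransientError:
                attempt += 1
                if attempt > max_retries:
                    # Treat the consumer as broken: leave the offset unmarked
                    # and exit, so a restart picks up at the same message.
                    result.stopped_at = offset
                    return result
                # Exponential backoff before the next attempt.
                sleep(base_delay * 2 ** (attempt - 1))
                continue
            except PoisonMessageError:
                # Record the message for later inspection, then move past it.
                result.dead_lettered.append(offset)
            result.committed.append(offset)
            break
    return result
```

In this sketch a dead-lettered message is still marked as consumed, so one bad message does not block the partition; re-running it after a fix would mean replaying from the DLQ rather than from the topic.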

Older notes

Note

These are notes from July 2022 and earlier, and many of the notes and questions are outdated.

Discovery Questions and Notes

...

  • Do we start with FULL_TRANSITIVE as the compatibility level?

    • Do we need to determine what changes we can’t make, if any, if transitive?

    • Or does this just slow performance?
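
As a rough illustration of which changes a transitive level rules out, the sketch below models a schema as a mapping from field name to whether the field has a default. It is a deliberate simplification (it ignores type changes, aliases, and everything else a real schema registry checks), and the helper names `fully_compatible` and `full_transitive_ok` are hypothetical. The rule it encodes: a field may be added only if the new schema gives it a default, and removed only if the old schema gave it one; the transitive variant checks the candidate against every earlier version rather than only the latest.

```python
def fully_compatible(old, new):
    """Both directions hold between two versions (simplified field-level model)."""
    added = new.keys() - old.keys()
    removed = old.keys() - new.keys()
    # Added fields need a default in the new schema (new readers, old data);
    # removed fields need a default in the old schema (old readers, new data).
    return all(new[f] for f in added) and all(old[f] for f in removed)


def full_transitive_ok(history, candidate):
    """The candidate must be fully compatible with every earlier version."""
    return all(fully_compatible(prev, candidate) for prev in history)


# True means "has a default", False means "required".
v1 = {"id": False}
v2 = {"id": False, "note": True}   # adds an optional field
v3 = {"id": False, "note": False}  # drops the default on "note"

print(fully_compatible(v2, v3))        # checked against the latest version only
print(full_transitive_ok([v1, v2], v3))  # checked against all versions
```

Here `v3` passes when compared with `v2` alone but fails against `v1`, because a reader using `v3` would have no value for `note` in `v1` data. That is the kind of change a transitive level rejects.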

Producing Events

  • When and how do we handle back-filling data for a new event stream with legacy data?

  • If we use an outbox:

    • Do we need to have back-pressure and fail when the outbox is too large?

    • Do we need to deserialize/serialize again when pulling from outbox?

    • Is an outbox needed? Does it ensure that the event is written to the outbox as part of the DB transaction in the original request?

    • Could we have an emergency outbox for when the bus is down, but use synchronous send by default?

...

  • When updating OEP, also see https://openedx.atlassian.net/wiki/spaces/AT/pages/3133407254/Event+Bus+POC#Findings

  • Is it a problem that we are using the term Event Bus and Message Bus interchangeably?

    • Our Architecture Manifesto (WIP) more generally documents Asynchronous over Synchronous.

    • Our manifesto was loosely based on the Reactive Manifesto, which highlights Message Driven (in contrast to Event Driven).

      • Answer: No. Event Bus and Message Bus can be used interchangeably because the term Bus implies the pub/sub messaging pattern. Message Driven vs Event Driven may have different meanings, but it is unclear if that is universal or just according to the Reactive Manifesto.

  • Do we need to get clearer on use cases, or do we wish to find a one-size-fits-all technology?

    • Answer: We will be starting with pub/sub.

  • Documenting event definitions and event implementations (sending).

  • When and how do we handle back-filling data for a new event stream with legacy data?

  • When might we address questions around reliable data synchronization?

    • Where would this fall in terms of discovery work?

    • For example:

      • Ensuring event matches data committed to database.

      • Ensuring events are not lost.

    • Ultimately, will we want how-tos?

  • Ownership and rollout questions regarding this infrastructure work.

    • How to enable early wild-west-like learning with a small subset, and open more widely as the path gets better paved.

    • Reminder to get feedback early and often, from anyone, wherever possible.

  • When is the right time for a Consumer Review? Is this inform only?

...