  • At scale, failure becomes statistically likely, so both failure handling and performance are concerns
  • Easy to over-optimize unless you know the requirements for:
    • response time/latency
    • availability
    • durability of data
  • Graceful degradation
  • Architectural safety measures
    • Antifragile organization (antifragility is Nassim Taleb's concept)
      • intentionally causing failures, e.g. Netflix's Chaos Monkey and Google's DiRT exercises
    • Timeouts
      • too long slows down the whole system
      • too short treats calls that would have succeeded as failures
      • choose defaults and log → monitor → adjust
    • Circuit Breakers
      • Fail fast after a certain number of failures
        • gracefully degrade or error
        • queue for later if async
      • After a cool-down period, let trial requests through and reset the breaker if they succeed
    • Bulkheads
      • Lose a part of the ship but rest remains intact
      • Separation of concerns via separate microservices
    • Timeouts and circuit breakers help free up resources when they become constrained
    • Bulkheads ensure they don't become constrained in the first place
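The circuit breaker behavior noted above (fail fast after repeated failures, then try again after a pause) can be sketched roughly as follows. This is a minimal illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `max_failures`, and `reset_after` are assumptions made for the example, and the sketch is single-threaded.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    # Hypothetical helper for illustration; parameter names are assumptions.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not tie up resources on a dependency presumed down.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller that catches `CircuitOpenError` can then degrade gracefully, return an error, or queue the request for later if the work is asynchronous.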
  • Idempotent operations allow repeatable/replayable messages
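One common way to make an operation idempotent is to attach a unique key to each message and remember which keys have already been applied. The sketch below assumes an in-memory store and a hypothetical `PointsAccount` class; a real service would persist the processed keys alongside the state change.

```python
class PointsAccount:
    # Hypothetical example class; names are assumptions for illustration.
    def __init__(self):
        self.balance = 0
        self._processed = {}  # idempotency key -> balance after the first attempt

    def credit(self, idempotency_key, amount):
        # A replayed message with a known key returns the earlier result
        # instead of applying the credit a second time.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        self.balance += amount
        self._processed[idempotency_key] = self.balance
        return self.balance


account = PointsAccount()
account.credit("order-123", 100)
account.credit("order-123", 100)  # replay of the same message: no extra credit
```

With this in place a sender that is unsure whether a message arrived can simply send it again.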
  • Scaling techniques
    • Vertical scaling with bigger boxes
    • Splitting workloads with microservices on their own hosts
    • Spreading risk
      • Multiple hosts
      • Across regions/data centers
    • Load balancing
      • Avoids single point of failure
      • Distributes calls to multiple instances to handle load
    • Worker-based systems
      • Message broker is resilient, allowing worker processes to scale to match load and handle failures without loss of work
    • Rewrite/redesign/reimplement as scaling needs change
      • "The need to change our systems to deal with scale isn't a sign of failure. It is a sign of success."
  • Scaling Databases
    • Availability vs Durability of data
    • Scaling for Reads - caching, read replica
    • Scaling for Writes - sharding, though querying across shards becomes difficult - may scale for write volume, but does not improve resiliency
    • CQRS pattern separates state modification from state querying
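To illustrate the sharding trade-off above, here is a small sketch that routes each key to a shard by hashing it. The `ShardedStore` class and `shard_for` function are hypothetical, and plain dictionaries stand in for separate database instances.

```python
import hashlib


def shard_for(key, shard_count):
    # Use a stable hash so the same key always maps to the same shard.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    # Hypothetical stand-in: each dict plays the role of one database shard.
    def __init__(self, shard_count=4):
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key, value):
        # A write touches exactly one shard, so write load is spread out.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def find(self, predicate):
        # A query that is not keyed must visit every shard and merge results.
        return [v for shard in self.shards for v in shard.values() if predicate(v)]
```

Lookups by key stay cheap, while `find` shows why cross-shard queries get harder: it has to fan out to all shards. Note that losing one shard still loses that shard's data, which is why sharding alone does not add resiliency.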
  • Caching
    • Eliminates round-trips to databases and other services for faster results.
    • Client-side caching with possible hints from the service
      • Reduces network calls
      • Invalidation is trickier
    • Proxy caching in between client and server (e.g., CDN)
      • Generic caching easier to add to existing system
    • Server-side caching via Redis or Memcached
      • Opaque to clients
      • Easier to invalidate, track, and optimize across client-types
    • Knowing requirements for load, freshness, and current metrics
    • HTTP caching with Cache-Control (max-age TTL) and ETags
    • Caching for writes, such as write-behind cache
    • Caching for resilience in case of failure by providing stale data
    • Hiding the origin: on a cache miss the request fails fast while the origin repopulates the cache asynchronously, which protects the origin from cascading load
    • Keep it simple: stick to a single cache, and add it only if needed
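The "caching for resilience" idea above can be sketched as a TTL cache that falls back to a stale entry when the origin fails. This is an illustrative sketch only: `StaleOnErrorCache`, `fetch`, and `ttl` are assumed names, and `fetch` stands for whatever call reaches the origin service.

```python
import time


class StaleOnErrorCache:
    # Hypothetical helper for illustration; not taken from any library.
    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self.fetch = fetch    # callable that loads a value from the origin
        self.ttl = ttl        # seconds an entry counts as fresh
        self.clock = clock    # injectable so tests can control time
        self._entries = {}    # key -> (value, stored_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl:
            return entry[0]  # fresh hit: no round-trip to the origin
        try:
            value = self.fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]  # origin failed: serve stale data instead
            raise  # nothing cached, so the failure has to surface
        self._entries[key] = (value, self.clock())
        return value
```

Whether stale data is acceptable depends on the freshness requirements; for some data an error is better than an out-of-date answer.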
  • Autoscaling
  • CAP theorem
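    • Consistency, Availability, Partition tolerance: when a network partition occurs, a distributed system must trade consistency against availability
    • Can mix and match: the choice between AP and CP can differ per service, or even per capability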
  • Service Discovery
    • DNS
    • Zookeeper, Consul, Eureka
  • Documenting Services
    • Swagger, HAL

Ch 12. Final

  • Summary
    • Model around business concepts - bounded contexts
    • Automation culture - automated tests, continuous delivery, environment defs, custom images
    • Hide internal implementation details - bounded contexts, hide databases, event data pumps, technology-agnostic APIs
    • Decentralize all things - self-service, team ownership of services, Conway's law, shared governance, choreography over orchestration, dumb middleware with smart endpoints
    • Deploy independently - coexisting versioned endpoints, one-service-per-host model, blue/green deployment, canary release, consumer-driven contracts, eliminate lock-step
    • Isolate failure - treat remote calls differently from local calls, anti-fragility, timeouts, bulkheads, circuit breakers, CAP
    • Highly observable - semantic monitoring, synthetic transactions, aggregate logs/stats, use correlation IDs
  • Advice
    • Go incrementally - make each decision small in scope.
    • Change is inevitable. Embrace it.