...

At scale, the main concerns are failures (statistically likely) and performance
Easy to over-optimize unless you know the requirements for:
- response time/latency
- availability
- durability of data
Graceful degradation
Architectural safety measures
- Antifragile organization (term from Nassim Taleb's book Antifragile)
  - intentionally causing failures, as done at Netflix and Google
- Timeouts
  - too long slows down whole system
  - too quick creates false negatives
  - choose defaults and log → monitor → adjust
- Circuit Breakers
  - Fail fast after a certain number of failures
    - gracefully degrade or error
    - queue for later if async
  - Reset after a timeout, once trial requests succeed again
- Bulkheads
  - Lose a part of the ship but rest remains intact
  - Separation of concerns via separate microservices
- Timeouts and circuit breakers help free up resources when they become constrained
- Bulkheads ensure they don't become constrained in the first place
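The circuit breaker behavior noted above (fail fast after a number of failures, reset later) can be sketched as follows. This is a minimal illustration, not a production implementation: the class name, the `max_failures` and `reset_after` parameters, and the single trial call after the reset window are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    # Hypothetical names and defaults; tune them from logs and monitoring.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of tying up resources on a sick service.
                raise CircuitOpenError("downstream presumed unhealthy")
            # Reset window elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            raise
        # A success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would catch `CircuitOpenError` and then degrade gracefully, return an error, or queue the work for later if the operation is asynchronous.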
Idempotent operations allow repeatable/replayable messages
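One common way to make an operation idempotent is to key each message by a unique ID and remember the result, so a replayed message has no additional effect. The sketch below assumes a hypothetical `Account` class and an in-memory record of processed IDs; a real service would persist that record.

```python
class Account:
    """Hypothetical example: crediting an account idempotently."""

    def __init__(self):
        self.balance = 0
        self.processed = {}  # message_id -> balance after that message

    def credit(self, message_id, amount):
        # A replayed message returns the earlier result and changes nothing.
        if message_id in self.processed:
            return self.processed[message_id]
        self.balance += amount
        self.processed[message_id] = self.balance
        return self.balance
```

Delivering the same message twice, for example after a timeout and retry, leaves the balance as if it had been delivered once.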
Scaling techniques
- Vertical scaling with bigger boxes
- Splitting workloads with microservices on their own hosts
- Spreading risk
  - Multiple hosts
  - Across regions/data centers
- Load balancing
  - Avoids single point of failure
  - Distributes calls to multiple instances to handle load
- Worker-based systems
  - Message broker is resilient, allowing worker processes to scale to match load and handle failures without loss of work
- Rewrite/redesign/reimplement as scaling needs change
  - "The need to change our systems to deal with scale isn't a sign of failure. It is a sign of success."
Scaling Databases
- Availability vs Durability of data
- Scaling for Reads - caching, read replica
- Scaling for Writes - sharding, though querying across shards becomes difficult; may scale for write volume without improving resiliency
- CQRS pattern separates state modification from state querying
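To illustrate the sharding trade-off noted above, here is a small sketch of hash-based shard routing. The `shard_for` function and `ShardedStore` class are hypothetical, and in-memory dicts stand in for separate database nodes. Single-key reads and writes touch one shard, while a query on anything other than the key has to visit every shard.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    def __init__(self, shard_count):
        # Each dict stands in for a separate database node.
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def find(self, predicate):
        # Cross-shard query: every shard must be scanned.
        return [v for shard in self.shards for v in shard.values() if predicate(v)]
```

Note that simple modulo routing like this remaps most keys when the shard count changes, which is one reason adding shards to a live system is hard.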
Caching
- Eliminates round-trips to databases and other services for faster results.
- Client-side caching with possible hints from the service
  - Reduces network calls
  - Invalidation is trickier
- Proxy caching in between client and server (e.g., CDN)
  - Generic caching easier to add to existing system
- Server-side caching via Redis or Memcached
  - Opaque to clients
  - Easier to invalidate, track, and optimize across client-types
- Knowing requirements for load, freshness, and current metrics
- HTTP caching with Cache-Control max-age (TTL) and ETags
- Caching for writes, such as write-behind cache
- Caching for resilience in case of failure by providing stale data
- Hiding the origin by having the origin populate the cache asynchronously - to protect the origin from cascading load
- Keep it simple - stick to one cache, and add it only if needed
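Two of the ideas above, TTL-based freshness and serving stale data when the origin fails, can be combined in a short sketch. The `StaleOnErrorCache` class, its `fetch` callback, and the `ttl` default are assumptions for illustration only.

```python
import time


class StaleOnErrorCache:
    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self.fetch = fetch      # hypothetical callable that hits the origin
        self.ttl = ttl          # seconds an entry counts as fresh
        self.clock = clock
        self.entries = {}       # key -> (value, stored_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl:
            return entry[0]  # fresh hit: no round-trip to the origin
        try:
            value = self.fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]  # origin failed: serve stale data instead
            raise  # nothing cached, so the failure surfaces
        self.entries[key] = (value, self.clock())
        return value
```

Whether stale data is acceptable depends on the freshness requirements of each piece of data, so this fallback would be a per-use-case decision.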
Autoscaling
CAP theorem
- Consistency vs Availability vs Partition Tolerance
- Can mix and match per service: some capabilities favor availability (AP), others consistency (CP)
Service Discovery
- DNS
- Zookeeper, Consul, Eureka
Documenting Services
- Swagger, HAL

Ch 12. Final

Summary
- Model around business concepts - bounded contexts
- Automation culture - automated tests, continuous delivery, environment defs, custom images
- Hide internal implementation details - bounded contexts, hide databases, event data pumps, technology-agnostic APIs
- Decentralize all things - self-service, team ownership of services, Conway's law, shared governance, choreography over orchestration, dumb middleware with smart endpoints
- Deploy independently - coexisting versioned endpoints, one-service-per-host model, blue/green deployment, canary release, consumer-driven contracts, eliminate lock-step
- Isolate failure - treat remote calls differently from local calls, anti-fragility, timeouts, bulkheads, circuit breakers, CAP
- Highly observable - semantic monitoring, synthetic transactions, aggregate logs/stats, use correlation IDs
Advice
- Go incrementally - make each decision small in scope.
- Change is inevitable. Embrace it.
