Building Microservices

Notes on Building Microservices, Designing Fine-grained Systems by Sam Newman

Ch 3. Bounded Contexts

  • Seams with Loose coupling and High cohesion
  • Beware of premature decomposition before domains and usages are solidified.

Ch 4. Integration

  • Avoid database integration at all costs
  • DRY and code-reuse can couple microservices
  • Sync vs Async
    • request/response vs event-based vs reactive (observe for results)
    • request/response can be sync or async - async by registering a callback
      • "orchestration" pattern leads to centralized authorities with anemic CRUD-based services
      • these systems are more brittle with a higher cost of change
      • technology choices
        • RPC
          • easy to use
          • watch out for: technology coupling, incorrectly treating remote calls as local calls, lock-step releases
        • REST over HTTP
          • more resilient to changes than RPC - sensible default choice
          • not suited for low latency and small message size (consider WebSockets) 
    • event-based - better decoupling; intelligence is distributed
      • "choreographed" pattern leads to implicit view of the business process (since no centralized workflow)
      • additional work to monitor/track - can create a monitoring system that matches the view of the business process - validates flowchart expectations
      • these systems are more loosely coupled, flexible, and amenable to change
      • technology choices
        • RabbitMQ
        • HTTP + ATOM
      • managing complexities
        • maximum retry limits
        • dead letter queue (for failed messages) - with UI to view and retry
        • good monitoring
        • correlation IDs to trace requests
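A minimal sketch of the complexity-management points above, using RabbitMQ via the pika client. The queue names, the retry-count header, and the process_order handler are illustrative assumptions, not something prescribed by the book:

```python
import json
import pika  # RabbitMQ client

MAX_RETRIES = 3  # maximum retry limit before parking the message

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="orders", durable=True)
channel.queue_declare(queue="orders.dlq", durable=True)  # dead letter queue (view/retry via a UI)

def process_order(order):
    ...  # hypothetical business logic

def handle(ch, method, properties, body):
    headers = properties.headers or {}
    correlation_id = headers.get("correlation-id", "unknown")  # propagated for tracing
    retries = headers.get("x-retries", 0)
    try:
        process_order(json.loads(body))
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Retry up to the limit, then move to the dead letter queue.
        target = "orders" if retries < MAX_RETRIES else "orders.dlq"
        ch.basic_publish(
            exchange="",
            routing_key=target,
            body=body,
            properties=pika.BasicProperties(
                headers={"x-retries": retries + 1,
                         "correlation-id": correlation_id}),
        )
        ch.basic_ack(delivery_tag=method.delivery_tag)  # original copy is handled either way

channel.basic_consume(queue="orders", on_message_callback=handle)
channel.start_consuming()
```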
  • Versioning
    • Postel's Law, a.k.a. Robustness principle: "Be conservative in what you do, be liberal in what you accept from others."
    • consumer-driven contracts to catch breaking changes early
    • semantic versioning - self-documented impact
    • expand and contract pattern when versioning breaking changes.
      • co-existing versions needed for blue-green deployments and canary releases
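A sketch of the expand/contract pattern (plus a tolerant reader for the Postel's Law point), assuming a hypothetical customer payload where a single name field is being split into first_name/last_name; the old shape is kept until consumers migrate, then removed in the contract step:

```python
from dataclasses import dataclass

@dataclass
class Customer:
    first_name: str
    last_name: str

def customer_representation(customer: Customer) -> dict:
    """Expand phase: serve the old and new field shapes side by side."""
    return {
        "first_name": customer.first_name,   # new, preferred fields
        "last_name": customer.last_name,
        "name": f"{customer.first_name} {customer.last_name}",  # deprecated; removed in the contract phase
    }

def consumer_reads_name(payload: dict) -> str:
    """Tolerant reader (Postel's Law): bind only to what we need, ignore the rest."""
    if "first_name" in payload:
        return f"{payload['first_name']} {payload['last_name']}"
    return payload["name"]  # still works against the old shape
```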
  • UI as the compositional layer
    • UI Fragment composition
      • server-side rendering of coarse-grained fragments works well when aligned with team ownership
      • problem: consistency of UX - mitigated with shared CSS/images/etc
      • problem: doesn't support native applications / thick clients
      • problem: doesn't work for cross-cutting features that don't fit into a widget/page
    • Backends for Frontends
      • aggregated backend layers, with dedicated backends serving UI/APIs for dedicated frontend experiences
      • danger: business logic creeping into the aggregation layer - keep it within the underlying services; aggregated backends should contain only frontend-specific behavior (see the sketch below)
    • Hybrid of both approaches above 
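A minimal Backend-for-Frontend sketch using Flask and requests. The downstream URLs and the mobile-specific trimming are assumptions; the point is that the BFF only aggregates and reshapes, while business logic stays in the downstream services:

```python
import requests
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical downstream services owned by other teams.
ORDERS_URL = "http://orders.internal/api/orders"
RECOMMENDATIONS_URL = "http://recs.internal/api/recommendations"

@app.route("/mobile/home/<customer_id>")
def mobile_home(customer_id):
    # Aggregate the calls the mobile home screen needs (each returns a JSON list).
    orders = requests.get(f"{ORDERS_URL}?customer={customer_id}", timeout=2).json()
    recs = requests.get(f"{RECOMMENDATIONS_URL}?customer={customer_id}", timeout=2).json()
    # Frontend-specific behaviour only: trim payloads for a small-screen view.
    return jsonify({
        "recent_orders": orders[:3],
        "recommendations": recs[:5],
    })
```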
  • Third-party software
    • Build, or buy commercial off-the-shelf (COTS)?
      • "Build if it is unique to what you do, and can be considered a strategic asset; buy if your use of the tool isn't that special" - Build if core to your business
    • Problems with COTS
      • lack of control
      • customization - avoid complex customizations - rather, change your organization's functions
      • integration spaghetti - with different protocols, etc
    • On your own terms
      • Hide COTS CMS behind your own web frontend, putting the COTS within your own service facade
    • Use Strangler Application Pattern to capture and intercept calls to the old system
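A sketch of the strangler pattern as a thin routing facade (Flask + requests), assuming hypothetical path prefixes and service URLs; migrated routes go to new services, everything else falls through to the legacy system:

```python
import requests
from flask import Flask, Response, request

app = Flask(__name__)

LEGACY_BASE = "http://legacy-app.internal"
NEW_SERVICES = {                      # routes already strangled out of the old system
    "/catalog": "http://catalog.internal",
    "/payments": "http://payments.internal",
}

@app.route("/", defaults={"path": ""}, methods=["GET", "POST", "PUT", "DELETE"])
@app.route("/<path:path>", methods=["GET", "POST", "PUT", "DELETE"])
def route(path):
    base = LEGACY_BASE
    for prefix, target in NEW_SERVICES.items():
        if ("/" + path).startswith(prefix):
            base = target
            break
    # Intercept and forward the call, returning the upstream response as-is.
    upstream = requests.request(
        method=request.method,
        url=f"{base}/{path}",
        params=request.args,
        data=request.get_data(),
        headers={k: v for k, v in request.headers if k.lower() != "host"},
        timeout=5,
    )
    return Response(upstream.content, status=upstream.status_code)
```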

Ch 5. Splitting the Monolith

  • Seams
    • Identify seams that can become service boundaries (not for the purpose of cleaning up the codebase).
    • Bounded contexts as seams, as they are cohesive and loosely coupled boundaries.
  • Reasons to split
    • Pace of change is faster with autonomous units.
    • Team autonomy
    • Replaceable with alternative implementation
  • Tangled dependencies - use a dependency analysis tool to view the seams as a DAG to find the least depended on seam.
  • Coupling at the database layer
    • Examples of database refactoring to separate schemas
    • Transactional boundaries - split across databases
      • Design patterns for failures (success in one db, but failure in another)
        • Eventually consistency - try again later
        • Compensating transaction - abort the entire operation (see the sketch after this list)
          • more complex to recover
          • need other compensating behavior to fix up inconsistencies
        • Distributed transaction using a transaction manager
          • 2-phase commit
          • Locks on resources can lead to contention, inhibiting scaling
      • Rather than requiring distributed transactions, actually create a higher-level concept that represents the transaction.
        • Gives a natural place to focus logic around the end-to-end process and to handle exceptions
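A sketch of the compensating-transaction option above; order_service and payment_service are hypothetical clients wrapping the two separately owned services and their databases:

```python
def place_order(order_service, payment_service, order):
    """Try both local commits; if the second fails, compensate the first."""
    order_id = order_service.create(order)                       # commit in database one
    try:
        payment_service.charge(order.customer_id, order.total)   # commit in database two
    except Exception:
        # Compensating transaction: unwind the work already committed.
        # If this also fails, record the inconsistency for later repair
        # (or fall back to eventual consistency and retry the charge later).
        order_service.cancel(order_id)
        raise
    return order_id
```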
  • Reporting Systems
    • Can use a read-replica to access the data - but couples database technology
    • Pulling data via service APIs - doesn't scale for large volumes of reporting data
    • Data pumps to push the data, rather than have reporting system pull the data
      • service owners write their own pump so not coupled to service schemas
      • reporting schema treated as a published API
      • aggregated view of all service pumps within the reporting system
      • need to deal with complexity of segmented schema
    • Event data pump
      • Reporting service just binds to the events emitted by upstream services
      • Looser coupling and fresher data
      • May not scale as well as data pumps though
    • Backup data pump
      • Variant of the data pump used by Netflix: Hadoop jobs run over Cassandra backups stored in S3
  • Cost of change
    • Make small, incremental changes to understand the impact of each alteration - mitigates cost of mistakes
    • Small cost: Moving code around within a codebase
    • Large cost: Splitting apart a database
    • Whiteboard
      • Make mistakes where the impact will be the lowest: on the whiteboard
      • Go through use cases
      • Class-responsibility cards (CRC) - borrowing from OOP - each card includes name of the class, its responsibility, and its collaborators

Ch 6. Deployment

  • Continuous Integration
    • CI server detects code is committed, verifies code and runs tests
    • Versioned Artifacts are also created for further validation and usage in downstream deployments
      • Confirms that the artifacts deployed are the ones tested
      • Reused without continual recreation
      • Traceability back to the commit
    • 3 questions from Jez Humble on whether you're really doing CI
      • Do you check in to mainline once per day?
        • Even if you are using short-lived branches, integrate frequently
      • Do you have a suite of tests to validate your changes?
      • When the build is broken, is it the #1 priority of the team to fix it?
    • Repo
      • Single repo and single CI build for all microservices
        • requires lock-step releases
        • Ok for early stage and short-period of time
        • Cycle time impacted - speed of moving a single change to being live
        • Ownership issues
      • Single repo with separate CI builds mapping to different parts of the source tree
        • Better than the above
        • Can get into the habit of slipping changes that couple services together.
      • Separate repo with separate CI builds for each microservice
        • Faster development
        • Clearer ownership
        • More difficult to make changes across repos - can be easier via command-line scripts
  • Continuous Delivery
    • Treat each check-in as a release candidate, getting constant feedback on its production readiness
    • Build pipeline
      • One stage for faster tests and another stage for slower tests
      • Feel more confident about the change as it goes through the pipeline
      • Fast tests → Slow tests → User acceptance tests → Performance tests → Production
  • One microservice per build
    • Is the goal
    • However, while service boundaries are still being defined, a single build for all services reduces the cost of cross-service changes
  • Deployable Artifacts
    • Platform-specific
    • OS-specific
    • Images
  • Environments for each pipeline stage and different deployments
    • different collection of configurations and hosts
    • Service configuration
      • Keep configuration decoupled from artifacts
      • Or use a dedicated config system
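A small sketch of keeping configuration out of the artifact by reading it from the environment at start-up; the variable names and defaults are assumptions:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceConfig:
    """Same artifact in every environment; only the environment variables change."""
    database_url: str
    downstream_timeout_seconds: float
    log_level: str

def load_config() -> ServiceConfig:
    return ServiceConfig(
        database_url=os.environ["DATABASE_URL"],  # required; differs per environment
        downstream_timeout_seconds=float(os.environ.get("DOWNSTREAM_TIMEOUT", "2.0")),
        log_level=os.environ.get("LOG_LEVEL", "INFO"),
    )
```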
  • Service to host mapping
    • Multiple services per host
      • Coupling, even with Application Containers
    • Single service per host
      • Monitoring and remediation much easier
      • Isolation of failure
      • Independent scaling
      • Automation mitigates amount of overhead
  • Virtualization
    • Vagrant - can be taxing on a developer machine when running a lot of VMs at once
    • Linux Containers
    • Docker - Vagrant can host a Docker instance

Ch 7. Testing

  • Brian Marick's Testing Quadrant from Agile Testing by Lisa Crispin and Janet Gregory
  • Mike Cohn's Test Pyramid from Succeeding with Agile
    • Going up the pyramid
      • scope increases, confidence increases, test time increases
      • when a test breaks, it is harder to diagnose the cause
      • each level should have an order of magnitude fewer tests than the one below
    • When broader-scoped test fails, write unit regression test
    • Test Snow cone - or inverted pyramid - doesn't work well with continuous integration due to slow test cycles
  • Unit tests
    • fast feedback on functionality
    • thousands run in less than a minute
    • limiting use of external files or network connections
    • outcomes of test-driven design and property-based testing
  • Service tests
    • bypass UI
    • At the API layer for web services
    • Can stub out external collaborators to decrease test times
    • Cover more scope than unit tests, but less brittle than larger-scoped tests
    • Mocking or stubbing?
      • Martin Fowler's Test Doubles
      • Stubbing is preferable
      • Mocking is brittle as it requires more coupling with the fake collaborators
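A sketch of preferring stubs over mocks in a service-level test: a hand-rolled stub of a hypothetical downstream loyalty-points collaborator returns canned answers instead of asserting on exact interactions:

```python
import unittest

class CustomerService:
    """Service under test; depends on a downstream loyalty-points collaborator."""
    def __init__(self, points_client):
        self.points_client = points_client

    def status(self, customer_id):
        points = self.points_client.balance(customer_id)
        return "gold" if points >= 1000 else "standard"

class StubPointsClient:
    """Stub: canned answers, no expectations about how it is called."""
    def __init__(self, balances):
        self.balances = balances

    def balance(self, customer_id):
        return self.balances.get(customer_id, 0)

class CustomerServiceTest(unittest.TestCase):
    def test_gold_status_at_1000_points(self):
        service = CustomerService(StubPointsClient({"c1": 1500}))
        self.assertEqual(service.status("c1"), "gold")

if __name__ == "__main__":
    unittest.main()
```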
  • End-to-end (E2E) journey tests
    • GUI/browser-driven tests
    • Higher degree of confidence when they pass, but slower and trickier - especially with microservices
    • Tricky
      • Which versions of the services to test against?
      • Duplication of effort by different service owners
      • => Single suite of end-to-end tests - fan-in from individual pipelines
    • Downsides
      • Flaky and brittle
        • => Remove flaky tests to retain faith in them
        • => See if they can be replaced with smaller-scoped tests
      • Lack of ownership
        • => Shared codebase with joint ownership
      • Long duration
        • Large end-to-end test suites take days to run
        • Take hours to diagnose
        • => Can run in parallel
        • => Remove unneeded tests
      • Pile-up of issues
        • => release small changes more frequently
      • Don't create a meta-version of all services that were tested together - results in coupling of services again
    • Establish agreed-upon core journeys that are to be tested
      • Each user story should NOT result in an end-to-end test
      • Very small number (low double digits even for complex systems)
  • Consumer-driven contract (CDC) tests
    • Interfaces/expectations by consumers are codified
    • Pact - open-source consumer-driven testing tool
    • Codifying discussions between consumers and producers - a test failure triggers a conversation (see the sketch after this list)
    • E2E tests are training wheels for CDC
      • E2E tests can be a useful safety net, trading off cycle time for decreased risk
      • Slowly reduce reliance on E2E tests so they are no longer needed
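A hand-rolled consumer-driven contract check in the spirit of Pact (this is not Pact's API): the consumer team codifies the fields it relies on, and the provider's pipeline runs the check against a locally running instance. The endpoint and fields are assumptions:

```python
import requests

# Expectations codified by the consumer team: only the fields it actually relies on.
CONSUMER_EXPECTATIONS = {
    "endpoint": "/customers/123",
    "required_fields": {"id": str, "first_name": str, "loyalty_points": int},
}

def test_customer_contract(provider_base_url="http://localhost:8080"):
    """Run in the provider's pipeline; a failure triggers a conversation."""
    response = requests.get(provider_base_url + CONSUMER_EXPECTATIONS["endpoint"], timeout=2)
    assert response.status_code == 200
    body = response.json()
    for field, field_type in CONSUMER_EXPECTATIONS["required_fields"].items():
        assert field in body, f"breaking change: '{field}' missing"
        assert isinstance(body[field], field_type), f"breaking change: '{field}' changed type"
```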
  • Post-deployment Testing
    • Acknowledge that we cannot find all errors before production
    • Smoke test suite runs once code is on prod
    • Blue/green deployment 
      • allows quick rollback if needed
      • zero downtime deployments
    • Canary releasing
      • to verify new version is performing as expected (error rates, response time, etc) under real traffic
      • versions co-exist for longer periods of time
      • either a subset of production traffic or shadow of full production traffic
    • Mean time between failures (MTBF) versus Mean time to repair (MTTR)
      • Optimizing for MTTR over MTBF allows for less time spent on functional test suites and faster time to customer value/testing an idea
  • Cross-functional Testing (a.k.a., nonfunctional requirements)
    • Fall into the Property Testing quadrant
    • Performance Tests
      • Follow pyramid as well - unit test level as well as e2e via load tests
      • Have a mix - of isolated tests as well as journey tests
      • load tests
        • gradually increase number of simulated customers
        • system matches prod as much as possible
        • may get false positives in addition to false negatives
      • Test runs
        • Run a subset of tests every day; larger set every week
        • As regularly as possible - so regressions can be isolated to specific commits
        • Have targets with clear call-to-action, otherwise results may be ignored

Ch 8. Monitoring

  • With microservices, multiple servers, multiple log files, etc.
  • Make log files centrally available/searchable
  • Metric tracking
    • aggregated sampling across time, across services
    • but still see data for individual services and individual instances
  • Synthetic transaction (a.k.a., semantic monitoring)
    • fake events to ensure behavior
  • Correlation IDs
    • unique identifier (e.g., GUID) used to track a transaction across services/boundaries
    • one path missing a correlation ID will break the monitoring
      • having a thin shared client wrapper library can help
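A sketch of a thin shared client wrapper that propagates a correlation ID on every outbound call; the X-Correlation-ID header name is a common convention rather than anything mandated here:

```python
import uuid
import requests

class CorrelatedClient:
    """Thin wrapper all services share so no call path drops the correlation ID."""
    HEADER = "X-Correlation-ID"

    def __init__(self, correlation_id=None):
        # Reuse the inbound ID when handling a request; generate one at the edge.
        self.correlation_id = correlation_id or str(uuid.uuid4())

    def get(self, url, **kwargs):
        headers = kwargs.pop("headers", {})
        headers[self.HEADER] = self.correlation_id
        return requests.get(url, headers=headers, **kwargs)

# Usage inside a service handling an inbound request (inbound_headers is hypothetical):
# client = CorrelatedClient(inbound_headers.get("X-Correlation-ID"))
# client.get("http://recommendations.internal/api/recs")
```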
  • Cascading failures
    • monitor integration points between systems 
  • Service
    • response time, error rates, application-level metrics
    • health of downstream services
    • monitor OS to track rogue processes and for capacity planning
  • System
    • aggregate host-level metrics with application-level metrics, but can drill down to individual hosts
    • maintain long-term data to measure trends
    • standardize on 
      • tools
      • log formats
      • correlation IDs
      • call to action with alerts and dashboards
  • Future - common, generic system for business metrics and operation metrics to monitor system in a more holistic way

Ch 9. Security

Ch 10. Conway's Law and System Design

  • Organizational structure is a strong influence on the structure of the system
  • Microservices are modeled after business domains, not technical ones
    • Teams aligned along bounded contexts, as services are.
  • Align service ownership to co-located teams, which are aligned around the same bounded contexts of the organization.

Ch 11. Microservices at Scale

  • At scale, concerns over failures (statistically likely) and performance
  • Can over-optimize unless you know the requirements for:
    • response time/latency
    • availability
    • durability of data
  • Graceful degradation
  • Architectural safety measures
    • Anti-fragile organization by Nassim Taleb
      • intentionally causing failures at Netflix and Google
    • Timeouts
      • too long slows down whole system
      • too quick treats calls that would have succeeded as failures
      • choose defaults and log → monitor → adjust
    • Circuit Breakers
      • Fail fast after a certain number of failures
        • gracefully degrade or error
        • queue for later if async
      • Reset after a cooldown period - let a trial request through and close the breaker again if it succeeds (see the sketch after this list)
    • Bulkheads
      • Lose a part of the ship but rest remains intact
      • Separation of concerns via separate microservices
    • Timeouts and circuit breakers help free up resources when they become constrained
    • Bulkheads ensure they don't become constrained in the first place
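A minimal in-process circuit breaker sketch for the bullet above; the threshold and cooldown values are arbitrary, and production systems would typically reach for an existing library:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures; allow a trial call after a cooldown."""
    def __init__(self, failure_threshold=5, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: failing fast")  # degrade, error, or queue instead
            self.opened_at = None  # cooldown elapsed: let a trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0  # success resets the breaker
        return result
```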
  • Idempotent operations allow repeatable/replayable messages
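A sketch of idempotent message handling: remember processed message IDs so redelivered or replayed messages have no additional effect (the in-memory set stands in for a durable store):

```python
class IdempotentHandler:
    """Replaying the same message more than once has no additional effect."""
    def __init__(self, apply_fn):
        self.apply_fn = apply_fn
        self.processed_ids = set()  # use a durable store (DB/Redis) in practice

    def handle(self, message_id, payload):
        if message_id in self.processed_ids:
            return "duplicate-ignored"
        self.apply_fn(payload)
        self.processed_ids.add(message_id)
        return "processed"
```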
  • Scaling techniques
    • Vertical scaling with bigger boxes
    • Splitting workloads with microservices on their own hosts
    • Spreading risk
      • Multiple hosts
      • Across regions/data centers
    • Load balancing
      • Avoids single point of failure
      • Distributes calls to multiple instances to handle load
    • Worker-based systems
      • Message broker is resilient, allowing worker processes to scale to match load and handle failures without loss of work
    • Rewrite/redesign/reimplement as scaling needs change
      • "The need to change our systems to deal with scale isn't a sign of failure. It is a sign of success."
  • Scaling Databases
    • Availability vs Durability of data
    • Scaling for Reads - caching, read replica
    • Scaling for Writes - sharding, though querying across shards becomes difficult - may scale for write volume, but not improve resiliency
    • CQRS pattern separates state modification from state querying
  • Caching
    • Eliminates round-trips to databases and other services for faster results.
    • Client-side caching with possible hints from the service
      • Reduces network calls
      • Invalidation is trickier
    • Proxy caching in between client and server (e.g., CDN)
      • Generic caching easier to add to existing system
    • Server-side caching via Redis or Memcache
      • Opaque to clients
      • Easier to invalidate, track, and optimize across client-types
    • Knowing requirements for load, freshness, and current metrics
    • HTTP caching with Cache-Control TTLs and ETags (see the sketch after this list)
    • Caching for writes, such as write-behind cache
    • Caching for resilience in case of failure by providing stale data
    • Hiding the origin by having the origin populate the cache asynchronously - to protect the origin from cascading load
    • Keep it simple - cache only where needed, and prefer a single form of caching
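A sketch of the HTTP caching item above using Flask: a Cache-Control TTL plus an ETag so clients can revalidate cheaply; the product lookup and the TTL value are assumptions:

```python
import hashlib
import json
from flask import Flask, Response, request

app = Flask(__name__)

def get_product(product_id):
    # Hypothetical lookup; in practice this hits the service's own data store.
    return {"id": product_id, "name": "widget", "price": 10}

@app.route("/products/<product_id>")
def product(product_id):
    body = json.dumps(get_product(product_id))
    etag = hashlib.sha256(body.encode()).hexdigest()
    if request.headers.get("If-None-Match") == etag:
        return Response(status=304)               # client copy is still valid
    resp = Response(body, mimetype="application/json")
    resp.headers["ETag"] = etag
    resp.headers["Cache-Control"] = "max-age=60"  # TTL: safe to reuse for 60 seconds
    return resp
```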
  • Autoscaling
  • CAP theorem
    • Consistency vs Availability vs Partition Tolerance
    • Can mix and match
  • Service Discovery
    • DNS
    • Zookeeper, Consul, Eureka
  • Documenting Services
    • Swagger, HAL

Ch 12. Final

  • Summary
    • Model around business concepts - bounded contexts
    • Automation culture - automated tests, continuous delivery, environment defs, custom images
    • Hide internal implementation details - bounded contexts, hide databases, event data pumps, technology-agnostic APIs
    • Decentralize all things - self-service, team ownership of services, Conway's law, shared governance, choreography over orchestration, dumb middleware with smart endpoints
    • Deploy independently - coexisting versioned endpoints, one-service-per-host model, blue/green deployment, canary release, consumer-driven contracts, eliminate lock-step
    • Isolate failure - treat remote calls differently from local calls, anti-fragility, timeouts, bulkheads, circuit breakers, CAP
    • Highly observable - semantic monitoring, synthetic transactions, aggregate logs/stats, use correlation IDs
  • Advice
    • Go incrementally - make each decision small in scope.
    • Change is inevitable. Embrace it.