Building Microservices

Notes on Building Microservices: Designing Fine-Grained Systems by Sam Newman

Ch 3. Bounded Contexts

  • Seams with Loose coupling and High cohesion

  • Beware of premature decomposition before domains and usages are solidified.

Ch 4. Integration

  • Avoid database integration at all costs

  • DRY and code-reuse can couple microservices

  • Sync vs Async

    • request/response vs event-based vs reactive (observe for results)

    • request/response can be sync or async - async by registering a callback

      • "orchestration" pattern leads to centralized authorities with anemic CRUD-based services

      • these systems are more brittle with a higher cost of change

      • technology choices

        • RPC

          • easy to use

          • watch out for: technology coupling, incorrectly treating remote calls as local calls, lock-step releases

        • REST over HTTP

          • more resilient to changes than RPC - sensible default choice

          • not suited for low latency and small message size (consider WebSockets) 

    • event-based - better decoupling; intelligence is distributed

      • "choreographed" pattern leads to implicit view of the business process (since no centralized workflow)

      • additional work to monitor/track - can create a monitoring system that matches the view of the business process - validates flowchart expectations

      • these systems are more loosely coupled, flexible, and amenable to change

      • technology choices

        • RabbitMQ

        • HTTP + Atom

      • managing complexities

        • maximum retry limits

        • dead letter queue (for failed messages) - with UI to view and retry

        • good monitoring

        • correlation IDs to trace requests
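
The retry limit, dead letter queue, and correlation ID items above can be sketched together. This is a minimal in-memory illustration, not a RabbitMQ client: the message shape, `MAX_RETRIES`, `new_message`, and `process_all` are all hypothetical names chosen for the example.

```python
import uuid
from collections import deque

MAX_RETRIES = 3  # assumed limit; tune per system


def new_message(body, correlation_id=None):
    """Wrap a payload with a correlation ID so it can be traced later."""
    return {"correlation_id": correlation_id or str(uuid.uuid4()), "body": body}


def process_all(messages, handler, max_retries=MAX_RETRIES):
    """Run handler on each message, retrying failures up to max_retries.

    Returns (processed, dead_letters). Messages that keep failing are
    parked in dead_letters instead of being retried forever.
    """
    queue = deque({"attempts": 0, **m} for m in messages)
    processed, dead_letters = [], []
    while queue:
        msg = queue.popleft()
        try:
            handler(msg["body"])
            processed.append(msg)
        except Exception as exc:
            msg["attempts"] += 1
            msg["last_error"] = str(exc)
            if msg["attempts"] >= max_retries:
                dead_letters.append(msg)  # keep for inspection and manual retry
            else:
                queue.append(msg)  # requeue for another attempt
    return processed, dead_letters


def sample_handler(body):
    """Hypothetical handler that rejects one kind of payload."""
    if body == "bad":
        raise ValueError("cannot parse")
```

Each dead-lettered message keeps its correlation ID and last error, which is what a UI for viewing and retrying failed messages would need to display.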

  • Versioning

    • Postel's Law, a.k.a. Robustness principle: "Be conservative in what you do, be liberal in what you accept from others."

    • consumer-driven contracts to catch breaking changes early

    • semantic versioning - self-documented impact

    • expand and contract pattern when versioning breaking changes.

      • co-existing versions needed for blue-green deployments and canary releases
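
One way to apply Postel's Law on the consumer side is a tolerant reader: pick out only the fields you need and ignore the rest, so that a provider can add fields without breaking you. A minimal sketch, assuming a JSON customer payload whose field names (`id`, `email`) are made up for illustration:

```python
import json


def read_customer(payload: str) -> dict:
    """Extract only the fields this consumer actually uses.

    Unknown fields are ignored, and an optional field gets a default,
    so additive changes on the provider side do not break the reader.
    """
    data = json.loads(payload)
    return {
        "id": data["id"],                # required by this consumer
        "email": data.get("email", ""),  # optional, tolerate absence
    }


# The same reader handles an old and an expanded version of the payload.
V1 = '{"id": 1, "email": "a@example.com"}'
V2 = '{"id": 1, "email": "a@example.com", "loyalty_tier": "gold", "phone": null}'
```

Reading `V1` and `V2` yields the same result, which is the property that lets the provider expand its payload first and contract it later.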

  • UI as the compositional layer

    • UI Fragment composition

      • server-side rendering of coarse-grained fragments works well when aligned with team ownership

      • problem: consistency of UX - mitigated with shared CSS/images/etc

      • problem: doesn't support native applications / thick clients

      • problem: doesn't work for cross-cutting features that don't fit into a widget/page

    • Backends for Frontends

      • aggregated backend layers, with dedicated backends serving UI/APIs for dedicated frontend experiences

      • danger: business logic leaking into the aggregation layer. Keep it within the underlying services; aggregated backends should contain only front-end specific behavior

    • Hybrid of both approaches above 

  • Third-party software

    • Build or buy commercial off-the-shelf (COTS)?

      • "Build if it is unique to what you do, and can be considered a strategic asset; buy if your use of the tool isn't that special" - Build if core to your business

    • Problems with COTS

      • lack of control

      • customization - avoid complex customizations - rather, change your organization's functions

      • integration spaghetti - with different protocols, etc

    • On your own terms

      • Hide COTS CMS behind your own web frontend, putting the COTS within your own service facade

    • Use Strangler Application Pattern to capture and intercept calls to the old system

Ch 5. Splitting the Monolith

  • Seams

    • Identify seams that can become service boundaries (not for the purpose of cleaning up the codebase).

    • Bounded contexts as seams, as they are cohesive and loosely coupled boundaries.

  • Reasons to split

    • Pace of change is faster with autonomous units.

    • Team autonomy

    • Replaceable with alternative implementation

  • Tangled dependencies - use a dependency analysis tool to view the seams as a DAG to find the least depended on seam.
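
As a rough illustration of the dependency analysis above, the sketch below counts inbound dependencies for each candidate seam and ranks the least depended on first. The module names and the `least_depended_on` helper are hypothetical; a real dependency analysis tool would derive the graph from the codebase.

```python
def least_depended_on(dependencies):
    """Rank modules by how many other modules depend on them.

    dependencies maps each module to the list of modules it depends on.
    Modules with the fewest inbound edges come first (ties sorted by name).
    """
    inbound = {module: 0 for module in dependencies}
    for targets in dependencies.values():
        for target in targets:
            inbound[target] = inbound.get(target, 0) + 1
    return sorted(inbound, key=lambda m: (inbound[m], m))


# Hypothetical seams in a monolith and what each one depends on.
SEAMS = {
    "catalog": [],
    "finance": ["catalog"],
    "warehouse": ["catalog"],
    "recommendation": ["catalog", "warehouse"],
}
```

Here `finance` and `recommendation` have no inbound dependencies, so either could be pulled out first, while `catalog` is the most entangled.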

  • Coupling at the database layer

    • Examples of database refactoring to separate schemas

    • Transactional boundaries - split across databases

      • Design patterns for failures (success in one db, but failure in another)

        • Eventual consistency - try again later

        • Compensating transaction - abort entire operation

          • more complex to recover

          • need other compensating behavior to fix up inconsistencies

        • Distributed transaction using a transaction manager

          • 2-phase commit

          • Locks on resources can lead to contention, inhibiting scaling

      • Rather than requiring distributed transactions, actually create a higher-level concept that represents the transaction.

        • Gives a natural place to focus logic around the end-to-end process and to handle exceptions
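
The compensating-transaction idea and the "higher-level concept" can be combined in a small sketch: an object that represents the whole operation, runs each step, and undoes completed steps if a later one fails. `InProcessOrder` and its step format are assumptions made for illustration, not something taken from the book's code.

```python
class InProcessOrder:
    """Hypothetical higher-level concept tracking one end-to-end operation."""

    def __init__(self):
        self.state = "pending"
        self._undo = []  # compensating actions for steps that completed

    def run(self, steps):
        """Run (action, compensation) pairs in order.

        If an action raises, the compensations of the steps that already
        completed are run in reverse order and the operation is aborted.
        """
        try:
            for action, compensation in steps:
                action()
                self._undo.append(compensation)
        except Exception:
            for compensation in reversed(self._undo):
                compensation()
            self.state = "aborted"
            return False
        self.state = "completed"
        return True


def demo():
    """Simulate success in one store followed by failure in another."""
    log = []

    def fail_picking():
        raise RuntimeError("warehouse database unavailable")

    order = InProcessOrder()
    ok = order.run([
        (lambda: log.append("order created"), lambda: log.append("order removed")),
        (fail_picking, lambda: log.append("picking cancelled")),
    ])
    return ok, order.state, log
```

The sketch does not handle a compensation that itself fails; as the notes say, that case needs further compensating behavior or a retry to fix up the inconsistency.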

  • Reporting Systems

    • Can use a read-replica to access the data - but couples database technology

    • APIs - pulling large volumes of reporting data through service calls doesn't scale well

    • Data pumps to push the data, rather than have reporting system pull the data

      • service owners write their own pump so not coupled to service schemas

      • reporting schema treated as a published API

      • aggregated view of all service pumps within the reporting system

      • need to deal with complexity of segmented schema

    • Event data pump

      • Reporting service just binds to the events emitted by upstream services

      • Looser coupling and fresher data

      • May not scale as well as data pumps though

    • Backup data pump

      • Variant of data pump used by Netflix using Hadoop off of S3-backed Cassandra data

  • Cost of change

    • Make small, incremental changes to understand the impact of each alteration - mitigates cost of mistakes

    • Small cost: Moving code around within a codebase

    • Large cost: Splitting apart a database

    • Whiteboard

      • Make mistakes where the impact will be the lowest: on the whiteboard

      • Go through use cases

      • Class-responsibility cards (CRC) - borrowing from OOP - each card includes name of the class, its responsibility, and its collaborators

Ch 6. Deployment

  • Continuous Integration

    • CI server detects code is committed, verifies code and runs tests

    • Versioned Artifacts are also created for further validation and usage in downstream deployments

      • Confirms that the artifacts deployed are the ones tested

      • Reused without continual recreation

      • Traceability back to the commit

    • 3 questions from Jez Humble on whether you're really doing it

      • Do you check in to mainline once per day?

        • Even if you are using short-lived branches, integrate frequently

      • Do you have a suite of tests to validate your changes?

      • When the build is broken, is it the #1 priority of the team to fix it?

    • Repo

      • Single repo and single CI build for all microservices

        • requires lock-step releases

        • OK at an early stage and for a short period of time

        • Cycle time impacted - speed of moving a single change to being live

        • Ownership issues

      • Single repo with separate CI builds mapping to different parts of the source tree

        • Better than the above

        • Can get into the habit of slipping changes that couple services together.

      • Separate repo with separate CI builds for each microservice

        • Faster development

        • Clearer ownership

        • More difficult to make changes across repos - can be easier via command-line scripts

  • Continuous Delivery

    • Treat each check-in as a release candidate, getting constant feedback on its production readiness

    • Build pipeline

      • One stage for faster tests and another stage for slower tests

      • Feel more confident about the change as it goes through the pipeline

      • Fast tests → Slow tests → User acceptance tests → Performance tests → Production

  • One microservice per build

    • Is the goal

    • However, while service boundaries are still being defined, a single build for all services reduces the cost of cross-service changes

  • Deployable Artifacts

    • Platform-specific

    • OS-specific

    • Images

  • Environments for each pipeline stage and different deployments

    • different collection of configurations and hosts

    • Service configuration

      • Keep configuration decoupled from artifacts

      • Or use a dedicated config system
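
A common way to keep configuration out of the artifact is to read it from the environment at startup, so the same build can run in every pipeline stage. A minimal sketch; the variable names `DATABASE_URL` and `LOG_LEVEL` and the `ServiceConfig` type are assumptions for the example:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    database_url: str
    log_level: str


def load_config(env=None):
    """Build the config from environment variables (or a supplied mapping).

    The artifact stays identical across environments; only the values
    injected at deploy time differ.
    """
    env = os.environ if env is None else env
    return ServiceConfig(
        database_url=env["DATABASE_URL"],      # required, fail fast if missing
        log_level=env.get("LOG_LEVEL", "INFO"),  # optional with a default
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, and a dedicated config system could supply the same mapping.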

  • Service to host mapping

    • Multiple services per host

      • Coupling, even with Application Containers

    • Single service per host

      • Monitoring and remediation much easier

      • Isolation of failure

      • Independent scaling

      • Automation mitigates amount of overhead

  • Virtualization

    • Vagrant - can be taxing on a developer machine when running a lot of VMs at once

    • Linux Containers

    • Docker - Vagrant can host a Docker instance

Ch 7. Testing

  • Brian Marick's Testing Quadrant from Agile Testing by Lisa Crispin and Janet Gregory

  • Mike Cohn's Test Pyramid from Succeeding with Agile

    • Going up the pyramid

      • scope increases, confidence increases, test time increases

      • when a test breaks, it is harder to diagnose the cause

      • each layer should have an order of magnitude fewer tests than the one below

    • When broader-scoped test fails, write unit regression test

    • Test Snow cone - or inverted pyramid - doesn't work well with continuous integration due to slow test cycles

  • Unit tests

    • fast feedback on functionality

    • thousands run in less than a minute

    • limiting use of external files or network connections

    • outcomes of test-driven design and property-based testing

  • Service tests

    • bypass UI

    • At the API layer for web services

    • Can stub out external collaborators to decrease test times

    • Cover more scope than unit tests, but less brittle than larger-scoped tests

    • Mocking or stubbing?

      • Martin Fowler's Test Doubles

      • Stubbing is preferable

      • Mocking is brittle as it requires more coupling with the fake collaborators
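
The difference between the two kinds of test double can be shown with Python's standard `unittest.mock`. In this sketch, `award_points` and its collaborators (`bank`, `notifier`) are hypothetical; the stub check only cares about the returned result, while the mock check also asserts on how the collaborator was called.

```python
from unittest.mock import Mock


def award_points(customer_id, bank, notifier):
    """Code under test: looks up a balance, then notifies the customer."""
    balance = bank.get_balance(customer_id)
    points = balance // 10
    notifier.send(customer_id, f"You earned {points} points")
    return points


def check_with_stub():
    """Stub: canned response only; the test does not care how it is called."""
    bank = Mock()
    bank.get_balance.return_value = 250
    return award_points("c1", bank, Mock())


def check_with_mock():
    """Mock: the test also verifies the exact interaction with the notifier."""
    bank = Mock()
    bank.get_balance.return_value = 250
    notifier = Mock()
    award_points("c1", bank, notifier)
    notifier.send.assert_called_once_with("c1", "You earned 25 points")
    return notifier.send.call_count
```

The mock-based check breaks if the message wording or call order changes, even when the behavior that matters is unchanged, which is the brittleness the notes warn about.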

  • End-to-end (E2E) journey tests

    • GUI/browser-driven tests

    • Higher degree of confidence when they pass, but slower and trickier - especially with microservices

    • Tricky

      • Which versions of the services to test against?

      • Duplication of effort by different service owners

      • => Single suite of end-to-end tests - fan-in from individual pipelines

    • Downsides

      • Flaky and brittle

        • => Remove flaky tests to retain faith in them

        • => See if they can be replaced with smaller-scoped tests

      • Lack of ownership

        • => Shared codebase with joint ownership

      • Long duration

        • Large end-to-end test suites take days to run

        • Take hours to diagnose

        • => Can run in parallel

        • => Remove unneeded tests

      • Pile-up of issues

        • => release small changes more frequently

      • Don't create a meta-version of all services that were tested together - results in coupling of services again

    • Establish agreed-upon core journeys that are to be tested

      • Each user story should NOT result in an end-to-end test

      • Very small number (low double digits even for complex systems)

  • Consumer-driven contract (CDC) tests

    • Interfaces/expectations by consumers are codified

    • Pact - open-source consumer-driven testing tool

    • Codifying discussions between consumers and producers - a test failure triggers a conversation

    • E2E tests are training wheels for CDC

      • E2E tests can be a useful safety net, trading off cycle time for decreased risk

      • Slowly reduce reliance on E2E tests so they are no longer needed
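
The core idea of a consumer-driven contract can be sketched without any tooling: each consumer records the fields and types it relies on, and the provider's build checks its response against every recorded expectation. This is a simplified illustration and not Pact's API; the consumer names, fields, and `broken_contracts` helper are hypothetical.

```python
# Expectations published by each (hypothetical) consumer of a customer service.
CONTRACTS = {
    "helpdesk": {"id": int, "name": str},
    "web_shop": {"id": int, "email": str},
}


def broken_contracts(response, contracts=CONTRACTS):
    """Return {consumer: [fields]} for every expectation the response violates.

    A field violates a contract if it is missing or has the wrong type.
    An empty result means no consumer would be broken by this response.
    """
    broken = {}
    for consumer, expected in contracts.items():
        bad = [
            field
            for field, field_type in expected.items()
            if not isinstance(response.get(field), field_type)
        ]
        if bad:
            broken[consumer] = bad
    return broken
```

Running this check in the provider's pipeline names the affected consumer when a field is dropped, which is the cue for the conversation between the two teams.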

  • Post-deployment Testing

    • Acknowledge that we cannot find all errors before production

    • Smoke test suite runs once code is on prod

    • Blue/green deployment 

      • allows quick rollback if needed

      • zero downtime deployments

    • Canary releasing

      • to verify new version is performing as expected (error rates, response time, etc) under real traffic

      • versions co-exist for longer periods of time

      • either a subset of production traffic or shadow of full production traffic

    • Mean time between failures (MTBF) versus Mean time to repair (MTTR)

      • Optimizing for MTTR over MTBF allows for less time spent on functional test suites and faster time to customer value/testing an idea

  • Cross-functional Testing (a.k.a., nonfunctional requirements)

    • Fall into the Property Testing quadrant

    • Performance Tests

      • Follow pyramid as well - unit test level as well as e2e via load tests

      • Have a mix - of isolated tests as well as journey tests

      • load tests

        • gradually increase number of simulated customers

        • system matches prod as much as possible

        • may get false positives in addition to false negatives

      • Test runs

        • Run a subset of tests every day; larger set every week

        • As regularly as possible - so regressions can be isolated to specific commits

        • Have targets with clear call-to-action, otherwise results may be ignored

Ch 8. Monitoring

  • With microservices, multiple servers, multiple log files, etc.

  • Make log files centrally available/searchable

  • Metric tracking

    • aggregated sampling across time, across services

    • but still see data for individual services and individual instances

  • Synthetic transaction (a.k.a., semantic monitoring)

    • fake events to ensure behavior

  • Correlation IDs

    • unique identifier (e.g., GUID) used to track a transaction across services/boundaries

    • one path missing a correlation ID will break the monitoring

      • having a thin shared client wrapper library can help
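
A thin shared wrapper for correlation IDs might look like the sketch below: it reuses the ID from an incoming request (or creates one) and attaches it to every outgoing call. The header name `X-Correlation-ID` and both function names are assumptions; no particular HTTP library is implied.

```python
import contextvars
import uuid

HEADER = "X-Correlation-ID"  # assumed header name; agree on one across services

# Holds the ID for the request currently being handled.
_correlation_id = contextvars.ContextVar("correlation_id", default=None)


def start_request(incoming_headers):
    """Reuse the caller's correlation ID, or create one at the system edge."""
    cid = incoming_headers.get(HEADER) or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid


def outgoing_headers(extra=None):
    """Headers for a downstream call, with the current correlation ID added."""
    headers = dict(extra or {})
    cid = _correlation_id.get()
    if cid is not None:
        headers[HEADER] = cid
    return headers
```

If every service builds its downstream requests through `outgoing_headers`, no code path can forget to pass the ID along, and the same value can be added to each log line.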

  • Cascading failures

    • monitor integration points between systems 

  • Service

    • response time, error rates, application-level metrics

    • health of downstream services

    • monitor OS to track rogue processes and for capacity planning

  • System

    • aggregate host-level metrics with application-level metrics, but can drill down to individual hosts

    • maintain long-term data to measure trends

    • standardize on 

      • tools

      • log formats

      • correlation IDs

      • call to action with alerts and dashboards

  • Future - common, generic system for business metrics and operation metrics to monitor system in a more holistic way

Ch 9. Security

Ch 10. Conway's Law and System Design

  • Organizational structure is a strong influence on the structure of the system

  • Microservices are modeled after business domains, not technical ones

    • Teams aligned along bounded contexts, as services are.

  • Align service ownership to co-located teams, which are aligned around the same bounded contexts of the organization.

Ch 11. Microservices at Scale

  • At scale, concerns over failures (statistically likely) and performance

  • Easy to over-optimize unless you know the requirements for:

    • response time/latency

    • availability

    • durability of data

  • Graceful degradation

  • Architectural safety measures

    • The antifragile organization (concept from Nassim Taleb's Antifragile)

      • intentionally causing failures at Netflix and Google

    • Timeouts

      • too long slows down whole system

      • too short treats calls that would have succeeded as failures

      • choose defaults and log → monitor → adjust

    • Circuit Breakers

      • Fail fast after a certain number of failures

        • gracefully degrade or error

        • queue for later if async

      • Reset after a certain period, once calls start succeeding again

    • Bulkheads

      • Lose a part of the ship but rest remains intact

      • Separation of concerns via separate microservices

    • Timeouts and circuit breakers help free up resources when they become constrained

    • Bulkheads ensure they don't become constrained in the first place
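
A circuit breaker along the lines described above can be sketched in a few lines. The thresholds, the `CircuitBreaker` class, and the injectable `clock` are assumptions for illustration; production systems usually rely on an existing library rather than hand-rolled code.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast, downstream looks unhealthy")
            # Reset period elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller that catches `CircuitOpenError` can degrade gracefully, return an error, or queue the work for later if the operation is asynchronous.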

  • Idempotent operations allow repeatable/replayable messages
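
One simple way to make an operation idempotent is to record the IDs of messages already applied and ignore repeats. A minimal in-memory sketch; `CreditService` and the message ID scheme are hypothetical, and a real service would persist the seen IDs alongside the data.

```python
class CreditService:
    """Applies each credit message at most once, keyed by message ID."""

    def __init__(self):
        self.balance = 0
        self._seen = set()  # IDs of messages that were already applied

    def apply_credit(self, message_id, amount):
        """Return True if applied, False if this message was a replay."""
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        self.balance += amount
        return True
```

With this in place, replaying a message after a timeout or redelivering it from a queue leaves the balance unchanged.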

  • Scaling techniques

    • Vertical scaling with bigger boxes

    • Splitting workloads with microservices on their own hosts

    • Spreading risk