Building Microservices
Notes on Building Microservices: Designing Fine-Grained Systems by Sam Newman
Ch 3. Bounded Contexts
- Seams with Loose coupling and High cohesion
- Beware of premature decomposition before domains and usages are solidified.
Ch 4. Integration
- Avoid database integration at all costs
- DRY and code-reuse can couple microservices
- Sync vs Async
- request/response vs event-based vs reactive (observe for results)
- request/response can be sync or async - async by registering a callback
- "orchestration" pattern leads to centralized authorities with anemic CRUD-based services
- these systems are more brittle with a higher cost of change
- technology choices
- RPC
- easy to use
- watch out for: technology coupling, incorrectly treating remote calls as local calls, lock-step releases
- REST over HTTP
- more resilient to changes than RPC - sensible default choice
- not suited for low latency and small message size (consider WebSockets)
- event-based - better decoupling; intelligence is distributed
- "choreographed" pattern leads to implicit view of the business process (since no centralized workflow)
- additional work to monitor/track - can create a monitoring system that matches the view of the business process - validates flowchart expectations
- these systems are more loosely coupled, flexible, and amenable to change
- technology choices
- RabbitMQ
- HTTP + ATOM
- managing complexities
- maximum retry limits
- dead letter queue (for failed messages) - with UI to view and retry
- good monitoring
- correlation IDs to trace requests
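A minimal sketch of the retry-limit and dead-letter-queue ideas above, using plain in-memory queues rather than a real broker such as RabbitMQ; `MAX_RETRIES` and the queue names are illustrative assumptions.

```python
from collections import deque

MAX_RETRIES = 3              # assumed limit before a message is parked
work_queue = deque()         # stands in for a broker queue (e.g., RabbitMQ)
dead_letter_queue = deque()  # failed messages end up here for inspection/manual retry

def consume(handler):
    """Drain the queue, retrying each message up to MAX_RETRIES before dead-lettering."""
    while work_queue:
        message = work_queue.popleft()
        try:
            handler(message["body"])
        except Exception as exc:
            message["attempts"] = message.get("attempts", 0) + 1
            if message["attempts"] >= MAX_RETRIES:
                message["error"] = str(exc)
                dead_letter_queue.append(message)  # a UI over this queue helps viewing/retrying
            else:
                work_queue.append(message)         # requeue for another attempt
```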
- Versioning
- Postel's Law, a.k.a. Robustness principle: "Be conservative in what you do, be liberal in what you accept from others."
- consumer-driven contracts to catch breaking changes early
- semantic versioning - self-documented impact
- expand and contract pattern when versioning breaking changes.
- co-existing versions needed for blue-green deployments and canary releases
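One way to apply Postel's law on the consumer side is a "tolerant reader" that picks out only the fields it needs and ignores everything else, so additive changes by the producer don't break it. A minimal sketch; the payload shape is an assumption.

```python
def read_customer(payload: dict) -> dict:
    """Tolerant reader: take only what we need, ignore unknown fields,
    and degrade gracefully when optional fields are missing."""
    return {
        "id": payload["id"],                               # required field
        "name": payload.get("name", ""),                   # tolerate absence
        "email": payload.get("contact", {}).get("email"),  # tolerate missing nesting
    }

# Works against both an old response and an expanded (v2) response:
old = {"id": 1, "name": "Ada"}
new = {"id": 1, "name": "Ada", "contact": {"email": "ada@example.com"}, "loyalty_tier": "gold"}
assert read_customer(old)["email"] is None
assert read_customer(new)["email"] == "ada@example.com"
```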
- UI as the compositional layer
- UI Fragment composition
- server-side rendering of coarse-grained fragments works well when aligned with team ownership
- problem: consistency of UX - mitigated with shared CSS/images/etc
- problem: doesn't support native applications / thick clients
- problem: doesn't work for cross-cutting features that don't fit into a widget/page
- Backends for Frontends
- aggregated backend layers, with dedicated backends serving UI/APIs for dedicated frontend experiences
- danger: keep business logic within the underlying services. Aggregated backends should contain only front-end specific behavior
- Hybrid of both approaches above
- Third-party software
- Build, or buy commercial off-the-shelf (COTS)?
- "Build if it is unique to what you do, and can be considered a strategic asset; buy if your use of the tool isn't that special" - Build if core to your business
- Problems with COTS
- lack of control
- customization - avoid complex customizations - rather, change your organization's functions
- integration spaghetti - with different protocols, etc
- On your own terms
- Hide COTS CMS behind your own web frontend, putting the COTS within your own service facade
- Use Strangler Application Pattern to capture and intercept calls to the old system
Ch 5. Splitting the Monolith
- Seams
- Identify seams that can become service boundaries (not for the purpose of cleaning up the codebase).
- Bounded contexts as seams, as they are cohesive and loosely coupled boundaries.
- Reasons to split
- Pace of change is faster with autonomous units.
- Team autonomy
- Replaceable with alternative implementation
- Tangled dependencies - use a dependency analysis tool to view the seams as a DAG to find the least depended on seam.
- Coupling at the database layer
- Examples of database refactoring to separate schemas
- Transactional boundaries - split across databases
- Design patterns for failures (success in one db, but failure in another)
- Eventual consistency - try again later
- Compensating transaction - abort entire operation
- more complex to recover
- need other compensating behavior to fix up inconsistencies
- Distributed transaction using a transaction manager
- 2-phase commit
- Locks on resources can lead to contention, inhibiting scaling
- Rather than requiring distributed transactions, actually create a higher-level concept that represents the transaction.
- Gives a natural place to focus logic around the end-to-end process and to handle exceptions
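A minimal sketch of the compensating-transaction idea above: if the second write fails, semantically undo the first rather than relying on a distributed transaction. The order/loyalty stores and their methods are hypothetical.

```python
def place_order(order_store, loyalty_store, order):
    """Two local writes in separate databases; compensate on partial failure."""
    order_id = order_store.insert(order)  # write 1 succeeds
    try:
        loyalty_store.add_points(order["customer_id"], order["points"])  # write 2
    except Exception:
        # Compensating action: semantically undo write 1.
        # Recovery is more complex than a rollback - the compensation itself can
        # fail, so retry/repair jobs may be needed to fix up inconsistencies.
        order_store.cancel(order_id)
        raise
    return order_id
```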
- Reporting Systems
- Can use a read-replica to access the data - but couples database technology
- APIs - pulling reporting data via service calls doesn't scale for large volumes
- Data pumps to push the data, rather than have reporting system pull the data
- service owners write their own pump so not coupled to service schemas
- reporting schema treated as a published API
- aggregated view of all service pumps within the reporting system
- need to deal with complexity of segmented schema
- Event data pump
- Reporting service just binds to the events emitted by upstream services
- Looser coupling and fresher data
- May not scale as well as data pumps though
- Backup data pump
- Variant of the data pump used by Netflix: Hadoop jobs run off Cassandra backup data stored in S3
- Cost of change
- Make small, incremental changes to understand the impact of each alteration - mitigates cost of mistakes
- Small cost: Moving code around within a codebase
- Large cost: Splitting apart a database
- Whiteboard
- Make mistakes where the impact will be the lowest: on the whiteboard
- Go through use cases
- Class-responsibility-collaboration (CRC) cards - borrowed from OOP - each card includes the name of the class, its responsibilities, and its collaborators
Ch 6. Deployment
- Continuous Integration
- CI server detects code is committed, verifies code and runs tests
- Versioned Artifacts are also created for further validation and usage in downstream deployments
- Confirms that the artifacts deployed are the ones tested
- Reused without continual recreation
- Traceability back to the commit
- 3 questions from Jez Humble on whether you're really doing it
- Do you check in to mainline once per day?
- Even if you are using short-lived branches, integrate frequently
- Do you have a suite of tests to validate your changes?
- When the build is broken, is it the #1 priority of the team to fix it?
- Repo
- Single repo and single CI build for all microservices
- requires lock-step releases
- Ok for early stage and short-period of time
- Cycle time impacted - speed of moving a single change to being live
- Ownership issues
- Single repo with separate CI builds mapping to different parts of the source tree
- Better than the above
- Can get into the habit of slipping changes that couple services together.
- Separate repo with separate CI builds for each microservice
- Faster development
- Clearer ownership
- More difficult to make changes across repos - can be easier via command-line scripts
- Continuous Delivery
- Treat each check-in as a release candidate, getting constant feedback on its production readiness
- Build pipeline
- One stage for faster tests and another stage for slower tests
- Feel more confident about the change as it goes through the pipeline
- Fast tests → Slow tests → User acceptance tests → Performance tests → Production
- One microservice per build
- Is the goal
- However, while service boundaries are still being defined, a single build for all services reduces the cost of cross-service changes
- Deployable Artifacts
- Platform-specific
- OS-specific
- Images
- Environments for each pipeline stage and different deployments
- different collection of configurations and hosts
- Service configuration
- Keep configuration decoupled from artifacts
- Or use a dedicated config system
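A minimal sketch of keeping configuration out of the artifact: the same build reads per-environment values from environment variables (or a dedicated config system) at startup. The variable names and defaults are assumptions.

```python
import os

# Same deployable artifact in every environment; only these values differ.
DATABASE_URL = os.environ.get("DATABASE_URL", "postgres://localhost/dev")
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
FEATURE_NEW_CHECKOUT = os.environ.get("FEATURE_NEW_CHECKOUT", "false") == "true"
```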
- Service to host mapping
- Multiple services per host
- Coupling, even with Application Containers
- Single service per host
- Monitoring and remediation much easier
- Isolation of failure
- Independent scaling
- Automation mitigates amount of overhead
- Virtualization
- Vagrant - can be taxing on a developer machine when running a lot of VMs at once
- Linux Containers
- Docker - Vagrant can host a Docker instance
Ch 7. Testing
- Brian Marick's Testing Quadrant from Agile Testing by Lisa Crispin and Janet Gregory
- Mike Cohn's Test Pyramid from Succeeding with Agile
- Go up
- scope increases, confidence increases, test-time increases;
- when a test breaks, it is harder to diagnose the cause
- each layer up should have an order of magnitude fewer tests than the layer below
- When broader-scoped test fails, write unit regression test
- Test Snow cone - or inverted pyramid - doesn't work well with continuous integration due to slow test cycles
- Unit tests
- fast feedback on functionality
- thousands run in less than a minute
- limiting use of external files or network connections
- outcomes of test-driven development and property-based testing
- Service tests
- bypass UI
- At the API layer for web services
- Can stub out external collaborators to decrease test times
- Cover more scope than unit tests, but less brittle than larger-scoped tests
- Mocking or stubbing?
- Martin Fowler's Test Doubles
- Stubbing is preferable
- Mocking is brittle as it requires more coupling with the fake collaborators
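A quick illustration of the stub-versus-mock distinction above, using Python's built-in unittest.mock; the `PointsClient` collaborator and the service under test are hypothetical.

```python
from unittest.mock import Mock

class LoyaltyService:
    def __init__(self, points_client):
        self.points_client = points_client

    def tier_for(self, customer_id):
        points = self.points_client.get_points(customer_id)
        return "gold" if points >= 1000 else "standard"

# Stub: canned answer, assert only on the outcome (less brittle).
stub = Mock()
stub.get_points.return_value = 1500
assert LoyaltyService(stub).tier_for(42) == "gold"

# Mock: also verifies *how* the collaborator was called (more coupling).
mock = Mock()
mock.get_points.return_value = 10
LoyaltyService(mock).tier_for(42)
mock.get_points.assert_called_once_with(42)
```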
- End-to-end (E2E) journey tests
- GUI/browser-driven tests
- Higher degree of confidence when they pass, but slower and trickier - especially with microservices
- Tricky
- Which versions of the services to test against?
- Duplicate of effort by different service owners
- => Single suite of end-to-end tests - fan-in from individual pipelines
- Downsides
- Flaky and brittle
- => Remove flaky tests to retain faith in them
- => See if they can be replaced with smaller-scoped tests
- Lack of ownership
- => Shared codebase with joint ownership
- Long duration
- Large end-to-end test suites take days to run
- Take hours to diagnose
- => Can run in parallel
- => Remove unneeded tests
- Pile-up of issues
- => release small changes more frequently
- Don't create a meta-version of all services that were tested together - results in coupling of services again
- Establish agreed-upon core journeys that are to be tested
- Each user story should NOT result in an end-to-end test
- Very small number (low double digits even for complex systems)
- Consumer-driven contract (CDC) tests
- Interfaces/expectations by consumers are codified
- Pact - open-source consumer-driven testing tool
- Codifying discussions between consumers and producers - a test failure triggers a conversation
- E2E tests are training wheels for CDC
- E2E tests can be a useful safety net, trading off cycle time for decreased risk
- Slowly reduce reliance on E2E tests so they are no longer needed
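A hand-rolled sketch of the consumer-driven contract idea above (tools like Pact automate this): the consumer codifies the fields it relies on, and the producer's build checks its real responses against those expectations, so a breaking change fails fast and starts a conversation. The contract fields are assumptions.

```python
# Contract published by the consumer team: the fields and types it relies on.
CUSTOMER_CONTRACT = {
    "id": int,
    "name": str,
    "email": str,
}

def check_contract(response: dict, contract: dict) -> list:
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    for field, expected_type in contract.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(f"wrong type for {field}: {type(response[field]).__name__}")
    return violations

# In the producer's CI, run against a real (or locally started) service response:
sample_response = {"id": 7, "name": "Ada", "email": "ada@example.com"}
assert check_contract(sample_response, CUSTOMER_CONTRACT) == []
```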
- Post-deployment Testing
- Acknowledge that we cannot find all errors before production
- Smoke test suite runs once code is on prod
- Blue/green deployment
- allows quick rollback if needed
- zero downtime deployments
- Canary releasing
- to verify new version is performing as expected (error rates, response time, etc) under real traffic
- versions co-exist for longer periods of time
- either a subset of production traffic or shadow of full production traffic
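A minimal sketch of sending a subset of production traffic to a canary version; the 5% split and the version URLs are assumptions.

```python
import random

CANARY_FRACTION = 0.05   # assumed: send ~5% of traffic to the new version

def pick_backend(stable_url: str, canary_url: str) -> str:
    """Route a small random slice of requests to the canary; compare its
    error rates and response times against the stable version."""
    return canary_url if random.random() < CANARY_FRACTION else stable_url

backend = pick_backend("http://orders-v1.internal", "http://orders-v2.internal")
```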
- Mean time between failures (MTBF) versus Mean time to repair (MTTR)
- Optimizing for MTTR over MTBF allows for less time spent on functional test suites and faster time to customer value/testing an idea
- Cross-functional Testing (a.k.a., nonfunctional requirements)
- Fall into the Property Testing quadrant
- Performance Tests
- Follow pyramid as well - unit test level as well as e2e via load tests
- Have a mix - of isolated tests as well as journey tests
- load tests
- gradually increase number of simulated customers
- system matches prod as much as possible
- may get false positives in addition to false negatives
- Test runs
- Run a subset of tests every day; larger set every week
- As regularly as possible, so the offending commit can be isolated
- Have targets with clear call-to-action, otherwise results may be ignored
Ch 8. Monitoring
- With microservices, multiple servers, multiple log files, etc.
- Make log files centrally available/searchable
- Metric tracking
- aggregated sampling across time, across services
- but still see data for individual services and individual instances
- Synthetic transaction (a.k.a., semantic monitoring)
- fake events to ensure behavior
- Correlation IDs
- unique identifier (e.g., GUID) used to track a transaction across services/boundaries
- one path missing a correlation ID will break the monitoring
- having a thin shared client wrapper library can help
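A minimal sketch of the thin shared client wrapper mentioned above: it forwards the caller's correlation ID (or mints one) on every downstream call so the whole transaction can be traced in the logs. Uses the requests library; the X-Correlation-ID header name is an assumption.

```python
import logging
import uuid

import requests

CORRELATION_HEADER = "X-Correlation-ID"   # assumed header name; standardize on one

def call_downstream(url, incoming_headers=None, **kwargs):
    """Thin wrapper: propagate the incoming correlation ID, or generate one."""
    headers = dict(incoming_headers or {})
    correlation_id = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    headers[CORRELATION_HEADER] = correlation_id
    logging.info("calling %s correlation_id=%s", url, correlation_id)
    return requests.get(url, headers=headers, **kwargs)
```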
- Cascading failures
- monitor integration points between systems
- Service
- response time, error rates, application-level metrics
- health of downstream services
- monitor OS to track rogue processes and for capacity planning
- System
- aggregate host-level metrics with application-level metrics, but can drill down to individual hosts
- maintain long-term data to measure trends
- standardize on
- tools
- log formats
- correlation IDs
- call to action with alerts and dashboards
- Future - common, generic system for business metrics and operation metrics to monitor system in a more holistic way
Ch 9. Security
Ch 10. Conway's Law and System Design
- Organizational structure is a strong influence on the structure of the system
- Microservices are modeled after business domains, not technical ones
- Teams aligned along bounded contexts, as services are.
- Align service ownership to co-located teams, which are aligned around the same bounded contexts of the organization.
Ch 11. Microservices at Scale
- At scale, concerns over failures (statistically likely) and performance
- Can over-optimize unless know requirements for:
- response time/latency
- availability
- durability of data
- Graceful degradation
- Architectural safety measures
- Anti-fragile organization by Nassim Taleb
- intentionally causing failures at Netflix and Google
- Timeouts
- too long slows down whole system
- too quick creates false negatives
- choose defaults and log → monitor → adjust
- Circuit Breakers
- Fail fast after a certain number of failures
- gracefully degrade or error
- queue for later if async
- Restart after certain threshold
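A minimal circuit breaker sketch along the lines above: after a threshold of consecutive failures the breaker opens and fails fast, then allows a retry once a cool-off period has passed. The threshold and reset window are assumptions to tune per dependency.

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures; allow a retry after a cool-off period."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # assumed default
        self.reset_timeout = reset_timeout          # seconds before a retry is allowed
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")  # degrade or queue here
            self.opened_at = None                                 # half-open: allow one try
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result
```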
- Bulkheads
- Lose a part of the ship but rest remains intact
- Separation of concerns via separate microservices
- Timeouts and circuit breakers help free up resources when they become constrained
- Bulkheads ensure they don't become constrained in the first place
- Idempotent operations allow repeatable/replayable messages
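A minimal sketch of making a message handler idempotent by remembering processed message IDs, so redelivered or replayed messages are safe. The in-memory set stands in for a durable store, and `apply_payment` is a hypothetical side effect.

```python
processed_ids = set()   # in production this would be a durable store, not memory

def apply_payment(account, amount):
    """Hypothetical side effect; stands in for the real business operation."""
    print(f"charged {amount} to account {account}")

def handle_payment(message):
    """Idempotent handler: a redelivered message with the same ID is a no-op."""
    if message["id"] in processed_ids:
        return                        # already applied; safe to replay
    apply_payment(message["account"], message["amount"])
    processed_ids.add(message["id"])

# The same message can be delivered twice without double-charging:
msg = {"id": "abc-123", "account": 42, "amount": 9.99}
handle_payment(msg)
handle_payment(msg)   # no-op on replay
```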
- Scaling techniques
- Vertical scaling with bigger boxes
- Splitting workloads with microservices on their own hosts
- Spreading risk
- Multiple hosts
- Across regions/data centers
- Load balancing
- Avoids single point of failure
- Distributes calls to multiple instances to handle load
- Worker-based systems
- Message broker is resilient, allowing worker processes to scale to match load and handle failures without loss of work
- Rewrite/redesign/reimplement as scaling needs change
- "The need to change our systems to deal with scale isn't a sign of failure. It is a sign of success."
- Scaling Databases
- Availability vs Durability of data
- Scaling for Reads - caching, read replica
- Scaling for Writes - sharding, though querying across shards becomes difficult - may scale for write volume, but not improve resiliency
- CQRS pattern separates state modification from state querying
- Caching
- Eliminates round-trips to databases and other services for faster results.
- Client-side caching with possible hints from the service
- Reduces network calls
- Invalidation is trickier
- Proxy caching in between client and server (e.g., CDN)
- Generic caching easier to add to existing system
- Server-side caching via Redis or Memcache
- Opaque to clients
- Easier to invalidate, track, and optimize across client-types
- Knowing requirements for load, freshness, and current metrics
- HTTP caching with cache-control TTL and ETags (see the sketch after this list)
- Caching for writes, such as write-behind cache
- Caching for resilience in case of failure by providing stale data
- Hiding the origin by having the origin populate the cache asynchronously - to protect the origin from cascading load
- Keep it simple - stick to one form of caching, and only cache where actually needed
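A minimal sketch of the HTTP caching bullet above: the server sets a Cache-Control TTL and an ETag, and a client revalidates with If-None-Match, getting a cheap 304 when nothing changed. Flask is used only for illustration; the resource and the one-minute TTL are assumptions.

```python
import hashlib

from flask import Flask, jsonify, request

app = Flask(__name__)

CATALOG = {"items": ["widget", "gadget"]}   # hypothetical resource

@app.route("/catalog")
def catalog():
    resp = jsonify(CATALOG)
    etag = hashlib.md5(resp.get_data()).hexdigest()
    # Client revalidated and nothing changed: cheap 304 with no body.
    if request.headers.get("If-None-Match") == etag:
        return "", 304, {"ETag": etag}
    resp.headers["Cache-Control"] = "max-age=60"  # clients/proxies may cache for 60s
    resp.headers["ETag"] = etag                   # validator for later conditional GETs
    return resp
```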
- Autoscaling
- CAP theorem
- Consistency vs Availability vs Partition Tolerance
- Can mix and match - individual services can make different consistency vs availability trade-offs
- Service Discovery
- DNS
- Zookeeper, Consul, Eureka
- Documenting Services
- Swagger, HAL
Ch 12. Final
- Summary
- Model around business concepts - bounded contexts
- Automation culture - automated tests, continuous delivery, environment defs, custom images
- Hide internal implementation details - bounded contexts, hide databases, event data pumps, technology-agnostic APIs
- Decentralize all things - self-service, team ownership of services, Conway's law, shared governance, choreography over orchestration, dumb middleware with smart endpoints
- Deploy independently - coexisting versioned endpoints, one-service-per-host model, blue/green deployment, canary release, consumer-driven contracts, eliminate lock-step
- Isolate failure - treat remote calls differently from local calls, anti-fragility, timeouts, bulkheads, circuit breakers, CAP
- Highly observable - semantic monitoring, synthetic transactions, aggregate logs/stats, use correlation IDs
- Advice
- Go incrementally - make each decision small in scope.
- Change is inevitable. Embrace it.