Building Microservices
Notes on Building Microservices: Designing Fine-Grained Systems by Sam Newman
Ch 3. Bounded Contexts
Seams with Loose coupling and High cohesion
Beware of premature decomposition before domains and usages are solidified.
Ch 4. Integration
Avoid database integration at all costs
DRY and code-reuse can couple microservices
Sync vs Async
request/response vs event-based vs reactive (observe for results)
request/response can be sync or async - async by registering a callback
"orchestration" pattern leads to centralized authorities with anemic CRUD-based services
these systems are more brittle with a higher cost of change
technology choices
RPC
easy to use
watch out for: technology coupling, incorrectly treating remote calls as local calls, lock-step releases
REST over HTTP
more resilient to changes than RPC - sensible default choice
not suited for low latency and small message size (consider WebSockets)
event-based - better decoupling; intelligence is distributed
"choreographed" pattern leads to implicit view of the business process (since no centralized workflow)
additional work to monitor/track - can build a monitoring system that matches the business-process view and validates flowchart expectations
these systems are more loosely coupled, flexible, and amenable to change
technology choices
RabbitMQ
HTTP + ATOM
managing complexities
maximum retry limits
dead letter queue (for failed messages) - with UI to view and retry; see the sketch after this list
good monitoring
correlation IDs to trace requests
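A minimal Python sketch of the retry-limit plus dead letter queue idea above (the `process` and `publish` callables and the queue names are hypothetical stand-ins, not a real broker API):

```python
MAX_RETRIES = 5  # hypothetical limit before giving up on a message

def handle_delivery(message: dict, process, publish):
    """Retry a failed message a bounded number of times, then park it on a
    dead letter queue where a human can inspect it (ideally via a UI) and
    replay it once the underlying problem is fixed."""
    try:
        process(message["body"])
    except Exception as exc:
        retries = message.get("retries", 0) + 1
        if retries > MAX_RETRIES:
            publish("orders.dead-letter", {**message, "error": str(exc)})
        else:
            publish("orders.retry", {**message, "retries": retries})
```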
Versioning
Postel's Law, a.k.a. Robustness principle: "Be conservative in what you do, be liberal in what you accept from others."
consumer-driven contracts to catch breaking changes early
semantic versioning - self-documented impact
expand and contract pattern when versioning breaking changes - see the sketch after this list
co-existing versions needed for blue-green deployments and canary releases
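A small Python sketch combining Postel's law (a tolerant reader) with the expand/contract pattern - the field names are invented for illustration:

```python
def read_customer(payload: dict) -> dict:
    """Tolerant reader: take only the fields this consumer needs and ignore
    the rest, so producers can add fields without breaking us."""
    return {
        "id": payload["id"],
        # Expand/contract: accept both the old flat field and the new nested
        # shape while producers migrate; delete the old branch (contract)
        # once every producer emits the new schema.
        "email": payload.get("email")
                 or payload.get("contact", {}).get("email"),
    }
```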
UI as the compositional layer
UI Fragment composition
server-side rendering of coarse-grained fragments works well when aligned with team ownership
problem: consistency of UX - mitigated with shared CSS/images/etc
problem: doesn't support native applications / thick clients
problem: doesn't work for cross-cutting features that don't fit into a widget/page
Backends for Frontends
aggregated backend layers, with dedicated backends serving UI/APIs for dedicated frontend experiences
danger: keep business logic within the underlying services; aggregated backends should contain only frontend-specific behavior
Hybrid of both approaches above
Third-party software
Build, or buy commercial off-the-shelf (COTS)?
"Build if it is unique to what you do, and can be considered a strategic asset; buy if your use of the tool isn't that special" - Build if core to your business
Problems with COTS
lack of control
customization - avoid complex customizations - rather, change your organization's functions
integration spaghetti - with different protocols, etc
On your own terms
Hide COTS CMS behind your own web frontend, putting the COTS within your own service facade
Use Strangler Application Pattern to capture and intercept calls to the old system
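A toy sketch of a strangler facade in Python - migrated routes go to the new service, everything else falls through to the legacy COTS system (the route table and hostnames are made up):

```python
MIGRATED_PREFIXES = ("/articles", "/search")  # capabilities already strangled

def route(path: str) -> str:
    """Intercept each call: serve it from the new service if that capability
    has been migrated, otherwise pass it through to the old system."""
    if path.startswith(MIGRATED_PREFIXES):
        return "http://new-cms-service.internal" + path
    return "http://legacy-cots.internal" + path
```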
Ch 5. Splitting the Monolith
Seams
Identify seams that can become service boundaries (not for the purpose of cleaning up the codebase).
Bounded contexts as seams, as they are cohesive and loosely coupled boundaries.
Reasons to split
Pace of change is faster with autonomous units.
Team autonomy
Replaceable with alternative implementation
Tangled dependencies - use a dependency analysis tool to view the seams as a DAG to find the least depended on seam.
Coupling at the database layer
Examples of database refactoring to separate schemas
Transactional boundaries - split across databases
Design patterns for failures (success in one db, but failure in another) - see the sketch at the end of this section
Eventual consistency - try again later
Compensating transaction - abort entire operation
more complex to recover
need other compensating behavior to fix up inconsistencies
Distributed transaction using a transaction manager
2-phase commit
Locks on resources can lead to contention, inhibiting scaling
Rather than requiring distributed transactions, actually create a higher-level concept that represents the transaction.
Gives a natural place to focus logic around the end-to-end process and to handle exceptions
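A sketch of the two failure strategies above, assuming hypothetical `order_db`, `warehouse_db`, and `retry_queue` objects:

```python
def place_order(order, order_db, warehouse_db, retry_queue):
    """The write spans two databases, so a partial failure is possible."""
    order_db.insert(order)
    try:
        warehouse_db.insert_pick_instruction(order)
    except Exception:
        # Eventual consistency: record the intent and try again later; the
        # system converges once the retry succeeds.
        retry_queue.enqueue("create-pick-instruction", order.id)
        # Alternative, a compensating transaction - unwind the first write
        # to abort the whole operation:
        #     order_db.delete(order.id)
        # Recovery is more complex: the compensation itself can fail and
        # then needs its own fix-up behavior.
```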
Reporting Systems
Can use a read-replica to access the data - but couples database technology
APIs - don't scale for bulk reporting data
Data pumps to push the data, rather than have reporting system pull the data
service owners write their own pump so not coupled to service schemas
reporting schema treated as a published API
aggregated view of all service pumps within the reporting system
need to deal with complexity of segmented schema
Event data pump
Reporting service just binds to the events emitted by upstream services (sketch after this list)
Looser coupling and fresher data
May not scale as well as data pumps though
Backup data pump
Variant of the data pump used by Netflix: Hadoop jobs run against S3-backed Cassandra backup data
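A minimal sketch of an event data pump in Python - the event shapes and reporting table are invented for illustration:

```python
def on_customer_event(event: dict, reporting_db) -> None:
    """The reporting service binds to events emitted by upstream services
    and applies them to its own reporting schema, staying fresh without
    coupling to the source service's database."""
    if event["type"] == "customer-created":
        reporting_db.execute(
            "INSERT INTO customers_report (id, signup_date) VALUES (?, ?)",
            (event["customer_id"], event["timestamp"]),
        )
    elif event["type"] == "customer-upgraded":
        reporting_db.execute(
            "UPDATE customers_report SET tier = ? WHERE id = ?",
            (event["tier"], event["customer_id"]),
        )
```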
Cost of change
Make small, incremental changes to understand the impact of each alteration - mitigates cost of mistakes
Small cost: Moving code around within a codebase
Large cost: Splitting apart a database
Whiteboard
Make mistakes where the impact will be the lowest: on the whiteboard
Go through use cases
Class-responsibility cards (CRC) - borrowing from OOP - each card includes name of the class, its responsibility, and its collaborators
Ch 6. Deployment
Continuous Integration
CI server detects code is committed, verifies code and runs tests
Versioned Artifacts are also created for further validation and usage in downstream deployments
Confirms that the artifacts deployed are the ones tested
Reused without continual recreation
Traceability back to the commit
3 questions from Jez Humble on whether you're really doing it
Do you check in to mainline at least once per day?
Even if you are using short-lived branches, integrate frequently
Do you have a suite of tests to validate your changes?
When the build is broken, is it the #1 priority of the team to fix it?
Repo
Single repo and single CI build for all microservices
requires lock-step releases
OK for early stages and for a short period of time
Cycle time impacted - speed of moving a single change to being live
Ownership issues
Single repo with separate CI builds mapping to different parts of the source tree
Better than the above
Can get into the habit of slipping changes that couple services together.
Separate repo with separate CI builds for each microservice
Faster development
Clearer ownership
More difficult to make changes across repos - can be easier via command-line scripts
Continuous Delivery
Treat each check-in as a release candidate, getting constant feedback on its production readiness
Build pipeline
One stage for faster tests and another stage for slower tests
Feel more confident about the change as it goes through the pipeline
Fast tests → Slow tests → User acceptance tests → Performance tests → Production
One microservice per build
Is the goal
However, while service boundaries are still being defined, a single build for all services reduces the cost of cross-service changes
Deployable Artifacts
Platform-specific
OS-specific
Images
Environments for each pipeline stage and different deployments
different collection of configurations and hosts
Service configuration
Keep configuration decoupled from artifacts
Or use a dedicated config system
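A sketch of configuration kept out of the artifact, twelve-factor style - the same build runs everywhere and each environment supplies its own values (the variable names are illustrative):

```python
import os

# The artifact never bakes these in; a dedicated config system would play
# the same role as the environment variables used here.
DATABASE_URL = os.environ["DATABASE_URL"]            # required per environment
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")      # sensible default
USE_NEW_CHECKOUT = os.environ.get("USE_NEW_CHECKOUT", "false") == "true"
```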
Service to host mapping
Multiple services per host
Coupling, even with Application Containers
Single service per host
Monitoring and remediation much easier
Isolation of failure
Independent scaling
Automation mitigates amount of overhead
Virtualization
Vagrant - can be taxing on a developer machine when running a lot of VMs at once
Linux Containers
Docker - Vagrant can host a Docker instance
Ch 7. Testing
Brian Marick's Testing Quadrant from Agile Testing by Lisa Crispin and Janet Gregory
Mike Cohn's Test Pyramid from Succeeding with Agile
Going up the pyramid
scope increases, confidence increases, test time increases
when a test breaks, it's harder to diagnose the cause
each layer should have an order of magnitude fewer tests than the one below
When broader-scoped test fails, write unit regression test
Test Snow cone - or inverted pyramid - doesn't work well with continuous integration due to slow test cycles
Unit tests
fast feedback on functionality
thousands run in less than a minute
limiting use of external files or network connections
outcomes of test-driven development and property-based testing
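A quick illustration of both styles in Python (the Hypothesis library supplies the property-based part; `apply_discount` is a toy function):

```python
from hypothesis import given, strategies as st

def apply_discount(price_cents: int, percent: int) -> int:
    return price_cents - (price_cents * percent) // 100

def test_ten_percent_off():
    # classic example-based unit test: fast, no files or network
    assert apply_discount(1000, 10) == 900

@given(st.integers(min_value=0, max_value=10_000_000),
       st.integers(min_value=0, max_value=100))
def test_discount_never_increases_price(price, percent):
    # property-based test: the invariant must hold for all generated inputs
    assert apply_discount(price, percent) <= price
```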
Service tests
bypass UI
At the API layer for web services
Can stub out external collaborators to decrease test times
Cover more scope than unit tests, but less brittle than larger-scoped tests
Mocking or stubbing?
Martin Fowler's Test Doubles
Stubbing is preferable
Mocking is brittle as it requires more coupling with the fake collaborators
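A small Python contrast of the two doubles (`SignupService` is a toy class invented for the example):

```python
from unittest import mock

class SignupService:
    def __init__(self, email):
        self.email = email
    def register(self, address):
        return self.email.send(to=address, body="welcome")

class StubEmailGateway:
    """Stub: canned answer; the test asserts only on the outcome."""
    def send(self, to, body):
        return True

def test_with_stub():
    assert SignupService(StubEmailGateway()).register("a@example.com") is True

def test_with_mock():
    gateway = mock.Mock()
    SignupService(gateway).register("a@example.com")
    # Mock: asserts on the interaction itself - coupled to *how* the
    # collaborator is called, which is what makes mocks more brittle.
    gateway.send.assert_called_once_with(to="a@example.com", body="welcome")
```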
End-to-end (E2E) journey tests
GUI/browser-driven tests
Higher degree of confidence when they pass, but slower and trickier - especially with microservices
Tricky
Which versions of the services to test against?
Duplication of effort by different service owners
=> Single suite of end-to-end tests - fan-in from individual pipelines
Downsides
Flaky and brittle
=> Remove flaky tests to retain faith in them
=> See if they can be replaced with smaller-scoped tests
Lack of ownership
=> Shared codebase with joint ownership
Long duration
Large end-to-end test suites take days to run
Failures take hours to diagnose
=> Can run in parallel
=> Remove unneeded tests
Pile-up of issues
=> release small changes more frequently
Don't create a meta-version of all services that were tested together - results in coupling of services again
Establish agreed-upon core journeys that are to be tested
Each user story should NOT result in an end-to-end test
Very small number (low double digits even for complex systems)
Consumer-driven contract (CDC) tests
Interfaces/expectations by consumers are codified
Pact - open-source consumer-driven testing tool
Codifying discussions between consumers and producers - a test failure triggers a conversation (toy sketch after this list)
E2E tests are training wheels for CDC
E2E tests can be a useful safety net, trading off cycle time for decreased risk
Slowly reduce reliance on E2E tests so they are no longer needed
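Pact is the real tool here; the toy Python test below only illustrates the idea of a codified consumer expectation that runs in the producer's pipeline (`client` is an assumed HTTP test fixture, not a real API):

```python
# The consumer writes down the minimal shape it depends on...
CONSUMER_CONTRACT = {
    "path": "/customers/42",
    "response_must_include": {"id": int, "email": str},
}

# ...and the producer runs this in its own build; a failure here starts a
# conversation instead of breaking the consumer in production.
def test_producer_honours_consumer_contract(client):
    body = client.get(CONSUMER_CONTRACT["path"]).json()
    for field, expected_type in CONSUMER_CONTRACT["response_must_include"].items():
        assert isinstance(body[field], expected_type)
```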
Post-deployment Testing
Acknowledge that we cannot find all errors before production
Smoke test suite runs once code is on prod
Blue/green deployment
allows quick rollback if needed
zero downtime deployments
Canary releasing
to verify new version is performing as expected (error rates, response time, etc) under real traffic
versions co-exist for longer periods of time
either a subset of production traffic or shadow of full production traffic
Mean time between failures (MTBF) versus Mean time to repair (MTTR)
Optimizing for MTTR over MTBF allows for less time spent on functional test suites and faster time to customer value/testing an idea
Cross-functional Testing (a.k.a., nonfunctional requirements)
Fall into the Property Testing quadrant
Performance Tests
Follow the pyramid as well - from unit-level performance tests up to e2e load tests
Have a mix - of isolated tests as well as journey tests
load tests
gradually increase number of simulated customers
test system should match prod as closely as possible
may get false positives in addition to false negatives
Test runs
Run a subset of tests every day; larger set every week
as regularly as possible, so failures can be isolated to specific commits
Have targets with clear call-to-action, otherwise results may be ignored
Ch 8. Monitoring
With microservices, multiple servers, multiple log files, etc.
Make log files centrally available/searchable
Metric tracking
aggregated sampling across time, across services
but still see data for individual services and individual instances
Synthetic transaction (a.k.a., semantic monitoring)
fake events to ensure behavior
Correlation IDs
unique identifier (e.g., GUID) used to track a transaction across services/boundaries
one path missing a correlation ID will break the monitoring
having a thin shared client wrapper library can help
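A sketch of what that thin wrapper might do (the header name is a common convention, not mandated):

```python
import uuid

def outgoing_headers(incoming_headers: dict) -> dict:
    """Propagate the correlation ID if the request arrived with one,
    otherwise mint a new one at the system boundary, so every downstream
    call and log line can carry the same ID."""
    cid = incoming_headers.get("X-Correlation-Id") or str(uuid.uuid4())
    return {"X-Correlation-Id": cid}
```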
Cascading failures
monitor integration points between systems
Service
response time, error rates, application-level metrics
health of downstream services
monitor OS to track rogue processes and for capacity planning
System
aggregate host-level metrics with application-level metrics, but can drill down to individual hosts
maintain long-term data to measure trends
standardize on
tools
log formats
correlation IDs
call to action with alerts and dashboards
Future - common, generic system for business metrics and operation metrics to monitor system in a more holistic way
Ch 9. Security
Ch 10. Conway's Law and System Design
Organizational structure is a strong influence on the structure of the system
Microservices are modeled after business domains, not technical ones
Teams aligned along bounded contexts, as services are.
Align service ownership to co-located teams, which are aligned around the same bounded contexts of the organization.
Ch 11. Microservices at Scale
At scale, concerns over failures (statistically likely) and performance
Can over-optimize unless requirements are known for:
response time/latency
availability
durability of data
Graceful degradation
Architectural safety measures
Antifragile organization - concept from Nassim Taleb
intentionally causing failures at Netflix and Google
Timeouts
too long slows down whole system
too quick creates false negatives
choose defaults and log → monitor → adjust
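A sketch of that loop using the `requests` library (the 2-second default is a placeholder to tune from monitoring, not a recommendation):

```python
import logging
import requests

DEFAULT_TIMEOUT = 2.0  # seconds; adjust based on observed latencies

def fetch_stock(url: str):
    """Always set an explicit timeout and log when it fires, so the default
    can be tuned: too long and a slow downstream stalls the whole system,
    too short and healthy-but-slow calls get treated as failures."""
    try:
        return requests.get(url, timeout=DEFAULT_TIMEOUT)
    except requests.Timeout:
        logging.warning("timeout after %.1fs calling %s", DEFAULT_TIMEOUT, url)
        raise
```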
Circuit Breakers
Fail fast after a certain number of failures
gracefully degrade or error
queue for later if async
Reset after a certain threshold, letting a request through to see if the downstream has recovered
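A minimal circuit breaker sketch in Python (real implementations add a proper half-open state and thread safety):

```python
import time

class CircuitBreaker:
    """Fail fast after `threshold` consecutive failures; after `reset_after`
    seconds, let one call through to probe whether the downstream recovered."""
    def __init__(self, threshold: int = 5, reset_after: float = 30.0):
        self.threshold, self.reset_after = threshold, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open - failing fast")
            self.opened_at = None  # probe the downstream again
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit
        return result
```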
Bulkheads
Lose a part of the ship, but the rest remains intact
Separation of concerns via separate microservices
Timeouts and circuit breakers help free up resources when they become constrained
Bulkheads ensure they don't become constrained in the first place
Idempotent operations allow repeatable/replayable messages
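A sketch of an idempotent receiver (in production the seen-IDs set would live in durable storage, not memory):

```python
PROCESSED: set[str] = set()  # IDs of messages already applied

def credit_account(message_id: str, account, amount_cents: int) -> None:
    """Applying the same message twice has no further effect, which makes
    at-least-once delivery and replays safe."""
    if message_id in PROCESSED:
        return  # duplicate delivery; already applied
    account.balance += amount_cents
    PROCESSED.add(message_id)
```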
Scaling techniques
Vertical scaling with bigger boxes
Splitting workloads with microservices on their own hosts
Spreading risk