Building Microservices
Notes on Building Microservices: Designing Fine-Grained Systems by Sam Newman
Ch 3. Bounded Contexts
- Seams with Loose coupling and High cohesion
- Beware of premature decomposition before domains and usages are solidified.
Ch 4. Integration
- Avoid database integration at all costs
- DRY and code-reuse can couple microservices
- Sync vs Async
- request/response vs event-based vs reactive (observe for results)
- request/response can be sync or async - async by registering a callback
- "orchestration" pattern leads to centralized authorities with anemic CRUD-based services
- these systems are more brittle with a higher cost of change
- technology choices
- RPC
- easy to use
- watch out for: technology coupling, incorrectly treating remote calls as local calls, lock-step releases
- REST over HTTP
- more resilient to changes than RPC - sensible default choice
- not suited for low latency and small message size (consider WebSockets)
- event-based - better decoupling; intelligence is distributed
- "choreographed" pattern leads to implicit view of the business process (since no centralized workflow)
- additional work to monitor/track - can create a monitoring system that matches the view of the business process - validates flowchart expectations
- these systems are more loosely coupled, flexible, and amenable to change
- technology choices
- RabbitMQ
- HTTP + ATOM
- managing complexities
- maximum retry limits
- dead letter queue (for failed messages) - with UI to view and retry
- good monitoring
- correlation IDs to trace requests
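A minimal sketch of the retry-limit and dead-letter-queue ideas above, using plain in-memory queues rather than a real broker such as RabbitMQ; `MAX_RETRIES` and the queue names are illustrative assumptions.

```python
from collections import deque

MAX_RETRIES = 3              # assumed limit before a message is parked
work_queue = deque()         # stands in for a broker queue (e.g., RabbitMQ)
dead_letter_queue = deque()  # failed messages end up here for inspection/manual retry

def consume(handler):
    """Drain the queue, retrying each message up to MAX_RETRIES before dead-lettering."""
    while work_queue:
        message = work_queue.popleft()
        try:
            handler(message["body"])
        except Exception as exc:
            message["attempts"] = message.get("attempts", 0) + 1
            if message["attempts"] >= MAX_RETRIES:
                message["error"] = str(exc)
                dead_letter_queue.append(message)  # a UI over this queue helps viewing/retrying
            else:
                work_queue.append(message)         # requeue for another attempt
```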
- Versioning
- Postel's Law, a.k.a. Robustness principle: "Be conservative in what you do, be liberal in what you accept from others."
- consumer-driven contracts to catch breaking changes early
- semantic versioning - self-documented impact
- expand and contract pattern when versioning breaking changes.
- co-existing versions needed for blue-green deployments and canary releases
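One way to apply Postel's law on the consumer side is a "tolerant reader" that picks out only the fields it needs and ignores everything else, so additive changes by the producer don't break it. A minimal sketch; the payload shape is an assumption.

```python
def read_customer(payload: dict) -> dict:
    """Tolerant reader: take only what we need, ignore unknown fields,
    and degrade gracefully when optional fields are missing."""
    return {
        "id": payload["id"],                               # required field
        "name": payload.get("name", ""),                   # tolerate absence
        "email": payload.get("contact", {}).get("email"),  # tolerate missing nesting
    }

# Works against both an old response and an expanded (v2) response:
old = {"id": 1, "name": "Ada"}
new = {"id": 1, "name": "Ada", "contact": {"email": "ada@example.com"}, "loyalty_tier": "gold"}
assert read_customer(old)["email"] is None
assert read_customer(new)["email"] == "ada@example.com"
```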
- UI as the compositional layer
- UI Fragment composition
- server-side rendering of coarse-grained fragments works well when aligned with team ownership
- problem: consistency of UX - mitigated with shared CSS/images/etc
- problem: doesn't support native applications / thick clients
- problem: doesn't work for cross-cutting features that don't fit into a widget/page
- Backends for Frontends
- aggregated backend layers, with dedicated backends serving UI/APIs for dedicated frontend experiences
- danger: keep business logic within the underlying services. Aggregated backends should contain only front-end specific behavior
- Hybrid of both approaches above
- Third-party software
- Build, or buy commercial off-the-shelf (COTS)?
- "Build if it is unique to what you do, and can be considered a strategic asset; buy if your use of the tool isn't that special" - Build if core to your business
- Problems with COTS
- lack of control
- customization - avoid complex customizations - rather, change your organization's functions
- integration spaghetti - with different protocols, etc
- On your own terms
- Hide COTS CMS behind your own web frontend, putting the COTS within your own service facade
- Use Strangler Application Pattern to capture and intercept calls to the old system
Ch 5. Splitting the Monolith
- Seams
- Identify seams that can become service boundaries (not for the purpose of cleaning up the codebase).
- Bounded contexts as seams, as they are cohesive and loosely coupled boundaries.
- Reasons to split
- Pace of change is faster with autonomous units.
- Team autonomy
- Replaceable with alternative implementation
- Tangled dependencies - use a dependency analysis tool to view the seams as a DAG to find the least depended on seam.
- Coupling at the database layer
- Examples of database refactoring to separate schemas
- Transactional boundaries - split across databases
- Design patterns for failures (success in one db, but failure in another)
- Eventual consistency - try again later
- Compensating transaction - abort entire operation
- more complex to recover
- need other compensating behavior to fix up inconsistencies
- Distributed transaction using a transaction manager
- 2-phase commit
- Locks on resources can lead to contention, inhibiting scaling
- Rather than requiring distributed transactions, actually create a higher-level concept that represents the transaction.
- Gives a natural place to focus logic around the end-to-end process and to handle exceptions
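A minimal sketch of the compensating-transaction idea above: if the second write fails, semantically undo the first rather than relying on a distributed transaction. The order/loyalty stores and their methods are hypothetical.

```python
def place_order(order_store, loyalty_store, order):
    """Two local writes in separate databases; compensate on partial failure."""
    order_id = order_store.insert(order)  # write 1 succeeds
    try:
        loyalty_store.add_points(order["customer_id"], order["points"])  # write 2
    except Exception:
        # Compensating action: semantically undo write 1.
        # Recovery is more complex than a rollback - the compensation itself can
        # fail, so retry/repair jobs may be needed to fix up inconsistencies.
        order_store.cancel(order_id)
        raise
    return order_id
```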
- Reporting Systems
- Can use a read-replica to access the data - but couples database technology
- APIs - pulling reporting data via service calls doesn't scale for large volumes
- Data pumps to push the data, rather than have reporting system pull the data
- service owners write their own pump so not coupled to service schemas
- reporting schema treated as a published API
- aggregated view of all service pumps within the reporting system
- need to deal with complexity of segmented schema
- Event data pump
- Reporting service just binds to the events emitted by upstream services
- Looser coupling and fresher data
- May not scale as well as data pumps though
- Backup data pump
- Variant of the data pump used by Netflix: Hadoop jobs run off Cassandra backup data stored in S3
- Cost of change
- Make small, incremental changes to understand the impact of each alteration - mitigates cost of mistakes
- Small cost: Moving code around within a codebase
- Large cost: Splitting apart a database
- Whiteboard
- Make mistakes where the impact will be the lowest: on the whiteboard
- Go through use cases
- Class-responsibility-collaboration (CRC) cards - borrowed from OOP - each card includes the name of the class, its responsibilities, and its collaborators
Ch 6. Deployment
- Continuous Integration
- CI server detects code is committed, verifies code and runs tests
- Versioned Artifacts are also created for further validation and usage in downstream deployments
- Confirms that the artifacts deployed are the ones tested
- Reused without continual recreation
- Traceability back to the commit
- 3 questions from Jez Humble on whether you're really doing it
- Do you check in to mainline once per day?
- Even if you are using short-lived branches, integrate frequently
- Do you have a suite of tests to validate your changes?
- When the build is broken, is it the #1 priority of the team to fix it?
- Repo
- Single repo and single CI build for all microservices
- requires lock-step releases
- Ok for early stage and short-period of time
- Cycle time impacted - speed of moving a single change to being live
- Ownership issues
- Single repo with separate CI builds mapping to different parts of the source tree
- Better than the above
- Can get into the habit of slipping changes that couple services together.
- Separate repo with separate CI builds for each microservice
- Faster development
- Clearer ownership
- More difficult to make changes across repos - can be easier via command-line scripts
- Continuous Delivery
- Treat each check-in as a release candidate, getting constant feedback on its production readiness
- Build pipeline
- One stage for faster tests and another stage for slower tests
- Feel more confident about the change as it goes through the pipeline
- Fast tests → Slow tests → User acceptance tests → Performance tests → Production
- One microservice per build
- Is the goal
- However, while service boundaries are still being defined, a single build for all services reduces the cost of cross-service changes
- Deployable Artifacts
- Platform-specific
- OS-specific
- Images
- Environments for each pipeline stage and different deployments
- different collection of configurations and hosts
- Service configuration
- Keep configuration decoupled from artifacts
- Or use a dedicated config system
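A minimal sketch of keeping configuration out of the artifact: the same build reads per-environment values from environment variables (or a dedicated config system) at startup. The variable names and defaults are assumptions.

```python
import os

# Same deployable artifact in every environment; only these values differ.
DATABASE_URL = os.environ.get("DATABASE_URL", "postgres://localhost/dev")
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
FEATURE_NEW_CHECKOUT = os.environ.get("FEATURE_NEW_CHECKOUT", "false") == "true"
```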
- Service to host mapping
- Multiple services per host
- Coupling, even with Application Containers
- Single service per host
- Monitoring and remediation much easier
- Isolation of failure
- Independent scaling
- Automation mitigates amount of overhead
- Virtualization
- Vagrant - can be taxing on a developer machine when running a lot of VMs at once
- Linux Containers
- Docker - Vagrant can host a Docker instance
Ch 7. Testing
- Brian Marick's Testing Quadrant from Agile Testing by Lisa Crispin and Janet Gregory
- Mike Cohn's Test Pyramid from Succeeding with Agile
- Go up
- scope increases, confidence increases, test-time increases;
- when a test breaks, it is harder to diagnose the cause
- each layer up should have an order of magnitude fewer tests than the layer below
- When broader-scoped test fails, write unit regression test
- Test Snow cone - or inverted pyramid - doesn't work well with continuous integration due to slow test cycles
- Unit tests
- fast feedback on functionality
- thousands run in less than a minute
- limiting use of external files or network connections
- outcomes of test-driven development and property-based testing
- Service tests
- bypass UI
- At the API layer for web services
- Can stub out external collaborators to decrease test times
- Cover more scope than unit tests, but less brittle than larger-scoped tests
- Mocking or stubbing?
- Martin Fowler's Test Doubles
- Stubbing is preferable
- Mocking is brittle as it requires more coupling with the fake collaborators
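A quick illustration of the stub-versus-mock distinction above, using Python's built-in unittest.mock; the `PointsClient` collaborator and the service under test are hypothetical.

```python
from unittest.mock import Mock

class LoyaltyService:
    def __init__(self, points_client):
        self.points_client = points_client

    def tier_for(self, customer_id):
        points = self.points_client.get_points(customer_id)
        return "gold" if points >= 1000 else "standard"

# Stub: canned answer, assert only on the outcome (less brittle).
stub = Mock()
stub.get_points.return_value = 1500
assert LoyaltyService(stub).tier_for(42) == "gold"

# Mock: also verifies *how* the collaborator was called (more coupling).
mock = Mock()
mock.get_points.return_value = 10
LoyaltyService(mock).tier_for(42)
mock.get_points.assert_called_once_with(42)
```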
- End-to-end (E2E) journey tests
- GUI/browser-driven tests
- Higher degree of confidence when they pass, but slower and trickier - especially with microservices
- Tricky
- Which versions of the services to test against?
- Duplicate of effort by different service owners
- => Single suite of end-to-end tests - fan-in from individual pipelines
- Downsides
- Flaky and brittle
- => Remove flaky tests to retain faith in them
- => See if they can be replaced with smaller-scoped tests
- Lack of ownership
- => Shared codebase with joint ownership
- Long duration
- Large end-to-end test suites take days to run
- Take hours to diagnose
- => Can run in parallel
- => Remove unneeded tests
- Pile-up of issues
- => release small changes more frequently
- Don't create a meta-version of all services that were tested together - results in coupling of services again
- Establish agreed-upon core journeys that are to be tested
- Each user story should NOT result in an end-to-end test
- Very small number (low double digits even for complex systems)
- Consumer-driven contract (CDC) tests
- Interfaces/expectations by consumers are codified
- Pact - open-source consumer-driven testing tool
- Codifying discussions between consumers and producers - a test failure triggers a conversation
- E2E tests are training wheels for CDC
- E2E tests can be a useful safety net, trading off cycle time for decreased risk
- Slowly reduce reliance on E2E tests so they are no longer needed
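A hand-rolled sketch of the consumer-driven contract idea above (tools like Pact automate this): the consumer codifies the fields it relies on, and the producer's build checks its real responses against those expectations, so a breaking change fails fast and starts a conversation. The contract fields are assumptions.

```python
# Contract published by the consumer team: the fields and types it relies on.
CUSTOMER_CONTRACT = {
    "id": int,
    "name": str,
    "email": str,
}

def check_contract(response: dict, contract: dict) -> list:
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    for field, expected_type in contract.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(f"wrong type for {field}: {type(response[field]).__name__}")
    return violations

# In the producer's CI, run against a real (or locally started) service response:
sample_response = {"id": 7, "name": "Ada", "email": "ada@example.com"}
assert check_contract(sample_response, CUSTOMER_CONTRACT) == []
```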
- Post-deployment Testing
- Acknowledge that we cannot find all errors before production
- Smoke test suite runs once code is on prod
- Blue/green deployment
- allows quick rollback if needed
- zero downtime deployments
- Canary releasing
- to verify new version is performing as expected (error rates, response time, etc) under real traffic
- versions co-exist for longer periods of time
- either a subset of production traffic or shadow of full production traffic
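A minimal sketch of sending a subset of production traffic to a canary version; the 5% split and the version URLs are assumptions.

```python
import random

CANARY_FRACTION = 0.05   # assumed: send ~5% of traffic to the new version

def pick_backend(stable_url: str, canary_url: str) -> str:
    """Route a small random slice of requests to the canary; compare its
    error rates and response times against the stable version."""
    return canary_url if random.random() < CANARY_FRACTION else stable_url

backend = pick_backend("http://orders-v1.internal", "http://orders-v2.internal")
```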
- Mean time between failures (MTBF) versus Mean time to repair (MTTR)
- Optimizing for MTTR over MTBF allows for less time spent on functional test suites and faster time to customer value/testing an idea
- Cross-functional Testing (a.k.a., nonfunctional requirements)
- Fall into the Property Testing quadrant
- Performance Tests
- Follow pyramid as well - unit test level as well as e2e via load tests
- Have a mix - of isolated tests as well as journey tests
- load tests
- gradually increase number of simulated customers
- system matches prod as much as possible
- may get false positives in addition to false negatives
- Test runs
- Run a subset of tests every day; larger set every week
- As regularly as possible, so the offending commit can be isolated
- Have targets with clear call-to-action, otherwise results may be ignored
Ch 8. Monitoring
- With microservices, multiple servers, multiple log files, etc.
- Make log files centrally available/searchable
- Metric tracking
- aggregated sampling across time, across services
- but still see data for individual services and individual instances
- Synthetic transaction (a.k.a., semantic monitoring)
- fake events to ensure behavior
- Correlation IDs
- unique identifier (e.g., GUID) used to track a transaction across services/boundaries
- one path missing a correlation ID will break the monitoring
- having a thin shared client wrapper library can help
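A minimal sketch of the thin shared client wrapper mentioned above: it forwards the caller's correlation ID (or mints one) on every downstream call so the whole transaction can be traced in the logs. Uses the requests library; the X-Correlation-ID header name is an assumption.

```python
import logging
import uuid

import requests

CORRELATION_HEADER = "X-Correlation-ID"   # assumed header name; standardize on one

def call_downstream(url, incoming_headers=None, **kwargs):
    """Thin wrapper: propagate the incoming correlation ID, or generate one."""
    headers = dict(incoming_headers or {})
    correlation_id = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    headers[CORRELATION_HEADER] = correlation_id
    logging.info("calling %s correlation_id=%s", url, correlation_id)
    return requests.get(url, headers=headers, **kwargs)
```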
- Cascading failures
- monitor integration points between systems
- Service
- response time, error rates, application-level metrics
- health of downstream services
- monitor OS to track rogue processes and for capacity planning
- System
- aggregate host-level metrics with application-level metrics, but can drill down to individual hosts
- maintain long-term data to measure trends
- standardize on
- tools
- log formats
- correlation IDs
- call to action with alerts and dashboards
- Future - common, generic system for business metrics and operation metrics to monitor system in a more holistic way
Ch 9. Security
Ch 10. Conway's Law and System Design
- Organizational structure is a strong influence on the structure of the system
- Microservices are modeled after business domains, not technical ones
- Teams aligned along bounded contexts, as services are.
- Align service ownership to co-located teams, which are aligned around the same bounded contexts of the organization.
Ch 11. Microservices at Scale
- At scale, concerns over failures (statistically likely) and performance
- Can over-optimize unless know requirements for:
- response time/latency
- availability
- durability of data
- Graceful degradation
- Architectural safety measures
- Anti-fragile organization by Nassim Taleb
- intentionally causing failures at Netflix and Google
- Timeouts
- too long slows down whole system
- too quick creates false negatives
- choose defaults and log → monitor → adjust
- Circuit Breakers
- Fail fast after a certain number of failures
- gracefully degrade or error
- queue for later if async
- Restart after certain threshold
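A minimal circuit breaker sketch along the lines above: after a threshold of consecutive failures the breaker opens and fails fast, then allows a retry once a cool-off period has passed. The threshold and reset window are assumptions to tune per dependency.

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures; allow a retry after a cool-off period."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # assumed default
        self.reset_timeout = reset_timeout          # seconds before a retry is allowed
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")  # degrade or queue here
            self.opened_at = None                                 # half-open: allow one try
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result
```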
- Bulkheads
- Lose a part of the ship but rest remains intact
- Separation of concerns via separate microservices
- Timeouts and circuit breakers help free up resources when they become constrained
- Bulkheads ensure they don't become constrained in the first place
- Idempotent operations allow repeatable/replayable messages
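A minimal sketch of making a message handler idempotent by remembering processed message IDs, so redelivered or replayed messages are safe. The in-memory set stands in for a durable store, and `apply_payment` is a hypothetical side effect.

```python
processed_ids = set()   # in production this would be a durable store, not memory

def apply_payment(account, amount):
    """Hypothetical side effect; stands in for the real business operation."""
    print(f"charged {amount} to account {account}")

def handle_payment(message):
    """Idempotent handler: a redelivered message with the same ID is a no-op."""
    if message["id"] in processed_ids:
        return                        # already applied; safe to replay
    apply_payment(message["account"], message["amount"])
    processed_ids.add(message["id"])

# The same message can be delivered twice without double-charging:
msg = {"id": "abc-123", "account": 42, "amount": 9.99}
handle_payment(msg)
handle_payment(msg)   # no-op on replay
```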
- Scaling techniques
- Vertical scaling with bigger boxes
- Splitting workloads with microservices on their own hosts
- Spreading risk
- Multiple hosts
- Across regions/data centers
- Load balancing
- Avoids single point of failure
- Distributes calls to multiple instances to handle load
- Worker-based systems
- Message broker is resilient, allowing worker processes to scale to match load and handle failures without loss of work
- Rewrite/redesign/reimplement as scaling needs change
- "The need to change our systems to deal with scale isn't a sign of failure. It is a sign of success."
- Scaling Databases
- Availability vs Durability of data
- Scaling for Reads - caching, read replica
- Scaling for Writes - sharding, though querying across shards becomes difficult - may scale for write volume, but not improve resiliency
- CQRS pattern separates state modification from state querying
- Caching
- Eliminates round-trips to databases and other services for faster results.
- Client-side caching with possible hints from the service
- Reduces network calls
- Invalidation is trickier
- Proxy caching in between client and server (e.g., CDN)
- Generic caching easier to add to existing system
- Server-side caching via Redis or Memcache
- Opaque to clients
- Easier to invalidate, track, and optimize across client-types
- Knowing requirements for load, freshness, and current metrics
- HTTP caching with cache-control TTL and ETags (see the sketch after this list)
- Caching for writes, such as write-behind cache
- Caching for resilience in case of failure by providing stale data
- Hiding the origin by having the origin populate the cache asynchronously - to protect the origin from cascading load
- Keep it simple - stick to one form of caching, and only cache where actually needed
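A minimal sketch of the HTTP caching bullet above: the server sets a Cache-Control TTL and an ETag, and a client revalidates with If-None-Match, getting a cheap 304 when nothing changed. Flask is used only for illustration; the resource and the one-minute TTL are assumptions.

```python
import hashlib

from flask import Flask, jsonify, request

app = Flask(__name__)

CATALOG = {"items": ["widget", "gadget"]}   # hypothetical resource

@app.route("/catalog")
def catalog():
    resp = jsonify(CATALOG)
    etag = hashlib.md5(resp.get_data()).hexdigest()
    # Client revalidated and nothing changed: cheap 304 with no body.
    if request.headers.get("If-None-Match") == etag:
        return "", 304, {"ETag": etag}
    resp.headers["Cache-Control"] = "max-age=60"  # clients/proxies may cache for 60s
    resp.headers["ETag"] = etag                   # validator for later conditional GETs
    return resp
```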
- Autoscaling
- CAP theorem
- Consistency vs Availability vs Partition Tolerance
- Can mix and match - individual services can make different consistency vs availability trade-offs
- Service Discovery
- DNS
- Zookeeper, Consul, Eureka
- Documenting Services
- Swagger, HAL
Ch 12. Final
- Summary
- Model around business concepts - bounded contexts
- Automation culture - automated tests, continuous delivery, environment defs, custom images
- Hide internal implementation details - bounded contexts, hide databases, event data pumps, technology-agnostic APIs
- Decentralize all things - self-service, team ownership of services, Conway's law, shared governance, choreography over orchestration, dumb middleware with smart endpoints
- Deploy independently - coexisting versioned endpoints, one-service-per-host model, blue/green deployment, canary release, consumer-driven contracts, eliminate lock-step
- Isolate failure - treat remote calls differently from local calls, anti-fragility, timeouts, bulkheads, circuit breakers, CAP
- Highly observable - semantic monitoring, synthetic transactions, aggregate logs/stats, use correlation IDs
- Advice
- Go incrementally - make each decision small in scope.
- Change is inevitable. Embrace it.