
7.2 Kafka Replication

This page covers bidirectional cross-cluster Kafka replication between the aws and proxmox environments using Strimzi MirrorMaker 2 (MM2). Each cluster runs its own MM2 instance responsible for one replication direction — proxmox's MM2 handles proxmox→aws, and aws's MM2 handles aws→proxmox. This avoids duplicate data and provides resilience: if one cluster goes down, only its outbound replication stops.


Key Decisions

  • Topology: Active-Active (bidirectional, aws ↔ proxmox)
  • Replication Tool: Strimzi KafkaMirrorMaker2 CRD (one per cluster, one direction each)
  • Replication Policy: DefaultReplicationPolicy with . separator (topics prefixed with source cluster alias)
  • Topics: .* (all topics, excluding internal topics)
  • Consumer Groups: All groups (offset synchronisation enabled)
  • Authentication: SCRAM-SHA-512 (matching existing Kafka listener configuration)
  • Networking: Kubernetes Multi-Cluster Services (MCS) API with ServiceExport and Cilium Cluster Mesh (clusterset.local DNS)

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                   AWS Cluster (Active — Kafka 4.1.0)                │
│                   1 broker, replication factor 1                    │
│                                                                     │
│  ┌──────────┐    ┌──────────────┐    ┌────────────────────────┐     │
│  │ ESB      │───>│ Kafka Broker │    │ mm2-aws-to-proxmox     │     │
│  │ Producer │    │ (local       │    │ (connectCluster:       │     │
│  └──────────┘    │  topics)     │    │  aws)                  │     │
│                  └──────┬───────┘    │                        │     │
│                         │            │ aws ──────────► proxmox│     │
│                         │ Port 9092  └───────────┬────────────┘     │
│                         │ (SCRAM-SHA-512)        │                  │
│  ┌──────────────────────┴────────────────────────┘                  │
│  │ kafka-bootstrap-aws (ServiceExport → clusterset.local)           │
│  └──────────────────────┬───────────────────────────────────────    │
└─────────────────────────┼───────────────────────────────────────────┘
                          │  Cilium Cluster Mesh
                          │  (clusterset.local DNS)
┌─────────────────────────┼───────────────────────────────────────────┐
│  ┌──────────────────────┴───────────────────────────────────────    │
│  │ kafka-bootstrap-proxmox (ServiceExport → clusterset.local)       │
│  └──────────────────────┬────────────────────────┐                  │
│                         │ Port 9092              │                  │
│                         │ (SCRAM-SHA-512)        │                  │
│  ┌──────────┐    ┌──────┴───────┐    ┌───────────┴────────────┐     │
│  │ ESB      │───>│ Kafka Broker │    │ mm2-proxmox-to-aws     │     │
│  │ Producer │    │ (local       │    │ (connectCluster:       │     │
│  └──────────┘    │  topics)     │    │  proxmox)              │     │
│                  └──────────────┘    │                        │     │
│                                      │ proxmox ──────────► aws│     │
│                                      └────────────────────────┘     │
│                   Proxmox Cluster (Active — Kafka 4.1.0)            │
│                   3 brokers, replication factor 3                   │
└─────────────────────────────────────────────────────────────────────┘

Cross-Cluster Service Discovery

Both clusters run Kafka as kafka-rciis-prod in namespace rciis-prod. To avoid naming collisions in Cluster Mesh, each cluster creates a uniquely-named Service with a corresponding ServiceExport (Kubernetes MCS API):

| Cluster | Service Name            | ServiceExport           | Cross-Cluster DNS                                        |
| ------- | ----------------------- | ----------------------- | -------------------------------------------------------- |
| AWS     | kafka-bootstrap-aws     | kafka-bootstrap-aws     | kafka-bootstrap-aws.rciis-prod.svc.clusterset.local      |
| Proxmox | kafka-bootstrap-proxmox | kafka-bootstrap-proxmox | kafka-bootstrap-proxmox.rciis-prod.svc.clusterset.local  |

These services select the same Kafka broker pods via Strimzi labels but are discoverable cross-cluster with distinct names.

MirrorMaker 2 Components

Each MM2 instance runs three connectors for its single replication direction:

| Component                 | Purpose                                                                                                                        | Internal Topic                     |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------- |
| MirrorSourceConnector     | Consumes from source, produces to target with DefaultReplicationPolicy (topics prefixed with source alias, e.g. proxmox.rect). | mm2-offset-syncs.<target>.internal |
| MirrorCheckpointConnector | Translates consumer group offsets from source to target. Emits checkpoints every 60 seconds.                                   | <source>.checkpoints.internal      |
| MirrorHeartbeatConnector  | Monitors replication health. Emits a heartbeat every 10 seconds.                                                               | heartbeats                         |
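The checkpoint connector's offset translation works off (source offset, target offset) pairs recorded in the offset-syncs topic. A simplified sketch of the idea (illustrative only, with hypothetical sync data — the real MM2 algorithm also bounds re-delivery):

```python
# Hypothetical offset-syncs entries for one partition:
# (offset on the source cluster, corresponding offset on the target cluster)
syncs = [(0, 0), (100, 95), (200, 190)]

def translate(committed_source_offset: int) -> int:
    """Map a consumer group's committed source offset to a safe target
    offset: resume from the latest sync point at or below it (this may
    re-deliver records committed after that sync point)."""
    return max((s for s in syncs if s[0] <= committed_source_offset),
               key=lambda s: s[0])[1]

print(translate(150))  # 95: resume from the sync recorded at source offset 100
print(translate(200))  # 190
```

Because the two clusters do not share offsets one-to-one, this translation is what lets a consumer group fail over to the other cluster and resume near where it left off.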

Topic Naming

With DefaultReplicationPolicy (separator: .), replicated topics are prefixed with the source cluster alias:

| Cluster | Local Topic | Replicated To Remote As       |
| ------- | ----------- | ----------------------------- |
| proxmox | rect        | proxmox.rect (on aws cluster) |
| aws     | rect        | aws.rect (on proxmox cluster) |
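The prefixing rule can be sketched in a few lines (a minimal illustration of DefaultReplicationPolicy's naming behaviour, not the actual Kafka code):

```python
SEPARATOR = "."  # replication.policy.separator from the MM2 config

def format_remote_topic(source_alias: str, topic: str) -> str:
    """Name a replicated topic by prefixing the source cluster alias."""
    return f"{source_alias}{SEPARATOR}{topic}"

print(format_remote_topic("proxmox", "rect"))  # proxmox.rect
print(format_remote_topic("aws", "rect"))      # aws.rect
```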

Infinite Loop Prevention

The topicsExcludePattern prevents re-replicating already-replicated topics:

topicsPattern: ".*"
topicsExcludePattern: ".*[\\-\\.]internal|__.*"

Since replicated topics contain the source alias (e.g. proxmox.rect), and DefaultReplicationPolicy only replicates topics that don't already carry a remote prefix, infinite loops are inherently prevented. The exclude pattern additionally filters out MM2 internal topics (mm2-offset-syncs.*.internal, *.checkpoints.internal) and Kafka system topics (__consumer_offsets, __transaction_state).
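The effect of the exclude pattern can be checked directly (an illustrative sketch; MM2 applies its topic patterns as full regex matches):

```python
import re

# The exclude pattern from the MM2 CRD, with YAML escaping removed
exclude = re.compile(r".*[\-\.]internal|__.*")

topics = ["rect", "proxmox.rect", "mm2-offset-syncs.aws.internal",
          "proxmox.checkpoints.internal", "__consumer_offsets", "heartbeats"]
for topic in topics:
    status = "excluded" if exclude.fullmatch(topic) else "eligible"
    print(f"{topic}: {status}")
```

Application topics (including already-prefixed ones like proxmox.rect, which the replication policy itself skips) remain eligible, while MM2 internal and Kafka system topics are filtered out.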


Prerequisites

Cilium Cluster Mesh

  • Cilium installed on both clusters (AWS: cluster id 2, Proxmox: cluster id 1)
  • Cilium Cluster Mesh connected between clusters (AWS via NLB, Proxmox via MetalLB)
  • KVStoreMesh enabled on both clusters
  • CoreDNS configured with multicluster clusterset.local on both clusters
  • Pod-to-pod connectivity verified across clusters
# Verify Cluster Mesh status
cilium clustermesh status

# Expected:
# Service "clustermesh-apiserver" of type "LoadBalancer" found
# Cluster access information is available:
#   - rciis-aws: available
#   - rciis-proxmox: available

Network Requirements

  • Cilium Cluster Mesh connectivity established between clusters
  • Firewall rules allow Kafka traffic (port 9092 TCP) across cluster mesh
  • AWS Security Groups allow inbound from Proxmox cluster mesh
  • Network latency <150ms (measured between clusters)

Strimzi Operator

  • Strimzi operator (v0.51.0) deployed in both clusters via FluxCD HelmRelease (flux/infra/base/strimzi.yaml)
  • One operator per cluster
  • Both clusters: Kafka 4.1.0 running in KRaft mode with KafkaNodePool CRDs

Storage and Resources

  • AWS cluster: gp3 storage class, 1 broker (KafkaNodePool: 2Gi memory limit)
  • Proxmox cluster: ceph-rbd-single storage class, 3 brokers (KafkaNodePool: 8Gi memory limit)
  • Sufficient CPU/Memory for MM2 pods (2 replicas x 4Gi memory per cluster)

Critical Fix: Resource Sizing

Fix Before Deploying MM2

The AWS Kafka configuration has a JVM max heap (-Xmx: 8192m) that exceeds the KafkaNodePool memory limit (2Gi). Brokers will be OOMKilled when the JVM attempts to allocate 8 GB heap. This must be fixed before proceeding.

Option A: scale the pod up to match the 8 GB heap.

# flux/apps/aws/kafka/kafka-nodepool.yaml
spec:
  resources:
    requests:
      memory: 16Gi
      cpu: "2"
    limits:
      memory: 16Gi
      cpu: "4"

# flux/apps/aws/kafka/kafka.yaml
jvmOptions:
  -Xms: 8192m       # Initial heap: 8 GB
  -Xmx: 8192m       # Max heap: 8 GB (50% of pod memory for OS page cache)
Option B: keep the 2Gi limit and shrink the heap to fit.

# flux/apps/aws/kafka/kafka-nodepool.yaml
spec:
  resources:
    requests:
      memory: 2Gi
      cpu: "200m"
    limits:
      memory: 2Gi
      cpu: "1000m"

# flux/apps/aws/kafka/kafka.yaml
jvmOptions:
  -Xms: 512m        # Initial heap: 512 MB
  -Xmx: 1536m       # Max heap: 1.5 GB (leave 512 MB for OS page cache)
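A quick sanity check of the sizing arithmetic (a hypothetical helper; the 512 MiB headroom figure comes from the heap comment above):

```python
def heap_fits(pod_limit_mib: int, xmx_mib: int, headroom_mib: int = 512) -> bool:
    """True if the JVM max heap leaves at least headroom_mib of the pod
    memory limit free for the OS page cache and JVM off-heap overhead."""
    return xmx_mib + headroom_mib <= pod_limit_mib

print(heap_fits(2048, 8192))    # broken AWS config: False (OOMKill on allocation)
print(heap_fits(2048, 1536))    # 2Gi limit with a 1536m heap: True
print(heap_fits(16384, 8192))   # 16Gi limit with an 8192m heap: True
```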

Apply one of the two configurations above to the AWS cluster (the proxmox cluster already has adequate values: 8Gi memory limit, 2048m–8192m heap):

  1. Edit flux/apps/aws/kafka/kafka-nodepool.yaml and flux/apps/aws/kafka/kafka.yaml
  2. Commit and push — FluxCD will reconcile automatically
  3. Wait for rolling restart to complete
  4. Verify brokers are healthy:
kubectl get pods -n rciis-prod --context rciis-aws -l strimzi.io/cluster=kafka-rciis-prod
kubectl get events -n rciis-prod --context rciis-aws --sort-by='.lastTimestamp' | grep OOMKilled

Implementation Phases

Phase 1: ServiceExport for Cross-Cluster Discovery

Each cluster already has a uniquely-named Service and ServiceExport that makes its Kafka brokers discoverable cross-cluster via clusterset.local DNS.

Files:

  • flux/apps/aws/kafka/service-export-aws.yaml
  • flux/apps/proxmox/kafka/service-export-proxmox.yaml

AWS cluster (service-export-aws.yaml):

---
apiVersion: v1
kind: Service
metadata:
  name: kafka-bootstrap-aws
  namespace: rciis-prod
spec:
  type: ClusterIP
  selector:
    strimzi.io/broker-role: "true"
    strimzi.io/cluster: kafka-rciis-prod
    strimzi.io/kind: Kafka
    strimzi.io/name: kafka-rciis-prod-kafka
  ports:
    - name: tcp-scram
      port: 9092
      targetPort: 9092
      protocol: TCP
    - name: tcp-plain
      port: 9093
      targetPort: 9093
      protocol: TCP
---
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: kafka-bootstrap-aws
  namespace: rciis-prod

Proxmox cluster (service-export-proxmox.yaml):

---
apiVersion: v1
kind: Service
metadata:
  name: kafka-bootstrap-proxmox
  namespace: rciis-prod
spec:
  type: ClusterIP
  selector:
    strimzi.io/broker-role: "true"
    strimzi.io/cluster: kafka-rciis-prod
    strimzi.io/kind: Kafka
    strimzi.io/name: kafka-rciis-prod-kafka
  ports:
    - name: tcp-scram
      port: 9092
      targetPort: 9092
      protocol: TCP
    - name: tcp-plain
      port: 9093
      targetPort: 9093
      protocol: TCP
---
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: kafka-bootstrap-proxmox
  namespace: rciis-prod

Verify cross-cluster DNS resolution:

# From AWS cluster pod, resolve proxmox Kafka bootstrap
kubectl run -it --rm debug --image=alpine --restart=Never -n rciis-prod --context rciis-aws -- \
  nslookup kafka-bootstrap-proxmox.rciis-prod.svc.clusterset.local

# From proxmox cluster pod, resolve AWS Kafka bootstrap
kubectl run -it --rm debug --image=alpine --restart=Never -n rciis-prod --context rciis-proxmox -- \
  nslookup kafka-bootstrap-aws.rciis-prod.svc.clusterset.local

# Verify Cilium Cluster Mesh health
cilium clustermesh status --context rciis-proxmox
cilium clustermesh status --context rciis-aws

Phase 2: KafkaUser Resources for MM2

Each cluster needs an mm2-user with SCRAM-SHA-512 authentication and wildcard ACLs for all topics, consumer groups, and MM2 internal topics.

Files (to create):

  • flux/apps/aws/kafka/kafka-mm2-user.yaml
  • flux/apps/proxmox/kafka/kafka-mm2-user.yaml

Both files are identical (both reference strimzi.io/cluster: kafka-rciis-prod):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: mm2-user
  namespace: rciis-prod
  labels:
    strimzi.io/cluster: kafka-rciis-prod
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      # Read/Write all application topics
      - resource:
          type: topic
          name: "*"
          patternType: literal
        operations: [Read, Write, Create, Describe]
      # Consumer groups
      - resource:
          type: group
          name: "*"
          patternType: literal
        operations: [Read, Write, Describe]
      # MM2 internal topics
      - resource:
          type: topic
          name: mm2-offset-syncs
          patternType: prefix
        operations: [Read, Write, Create, Describe]
      - resource:
          type: topic
          name: heartbeats
          patternType: literal
        operations: [Read, Write, Create, Describe]
      - resource:
          type: topic
          name: mirrormaker2-cluster
          patternType: prefix
        operations: [Read, Write, Create, Describe]
      # Kafka Connect internal topics
      - resource:
          type: topic
          name: connect-cluster
          patternType: prefix
        operations: [Read, Write, Create, Describe]

Apply and copy user secrets cross-cluster:

# FluxCD will create the KafkaUser resources automatically.
# Wait for Strimzi User Operator to generate SCRAM credentials:
kubectl get secret mm2-user -n rciis-prod --context rciis-proxmox
kubectl get secret mm2-user -n rciis-prod --context rciis-aws

# Copy AWS mm2-user secret → proxmox cluster (as mm2-user-aws)
kubectl get secret mm2-user -n rciis-prod --context rciis-aws -o json \
  | jq '.metadata.name = "mm2-user-aws" | del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp)' \
  | kubectl apply --context rciis-proxmox -f -

# Copy proxmox mm2-user secret → AWS cluster (as mm2-user-proxmox)
kubectl get secret mm2-user -n rciis-prod --context rciis-proxmox -o json \
  | jq '.metadata.name = "mm2-user-proxmox" | del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp)' \
  | kubectl apply --context rciis-aws -f -

# Verify all required secrets in proxmox namespace
kubectl get secrets -n rciis-prod --context rciis-proxmox | grep mm2
# Expected:
# mm2-user                 (local user credentials)
# mm2-user-aws             (remote user credentials)

# Verify all required secrets in AWS namespace
kubectl get secrets -n rciis-prod --context rciis-aws | grep mm2
# Expected:
# mm2-user                 (local user credentials)
# mm2-user-proxmox         (remote user credentials)

Phase 3: Deploy KafkaMirrorMaker2 CRD

Each cluster runs its own MM2 instance responsible for one replication direction only. This ensures no duplicate data and provides resilience — if one cluster goes down, only its outbound replication stops while inbound replication from the other cluster's MM2 continues unaffected.

Files (to create):

  • flux/apps/proxmox/kafka/kafka-mirrormaker2.yaml: mm2-proxmox-to-aws (proxmox→aws only)
  • flux/apps/aws/kafka/kafka-mirrormaker2.yaml: mm2-aws-to-proxmox (aws→proxmox only)

Example — proxmox cluster CRD (AWS cluster is symmetric with swapped aliases and direction reversed):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: mm2-proxmox-to-aws
  namespace: rciis-prod
  labels:
    app: kafka-mirrormaker2
    environment: proxmox
spec:
  version: 4.1.0
  replicas: 2
  connectCluster: "proxmox"

  clusters:
    - alias: "proxmox"
      bootstrapServers: kafka-rciis-prod-kafka-bootstrap.rciis-prod.svc:9092
      authentication:
        type: scram-sha-512
        username: mm2-user
        passwordSecret:
          secretName: mm2-user
          password: password
      config:
        config.storage.replication.factor: 3
        offset.storage.replication.factor: 3
        status.storage.replication.factor: 3

    - alias: "aws"
      bootstrapServers: kafka-bootstrap-aws.rciis-prod.svc.clusterset.local:9092
      authentication:
        type: scram-sha-512
        username: mm2-user
        passwordSecret:
          secretName: mm2-user-aws
          password: password

  mirrors:
    - sourceCluster: "proxmox"
      targetCluster: "aws"
      sourceConnector:
        tasksMax: 5
        autoRestart:
          enabled: true
        config:
          replication.factor: 1
          offset-syncs.topic.replication.factor: 1
          offset-syncs.topic.location: "target"
          refresh.topics.interval.seconds: 60
          replication.policy.class: "org.apache.kafka.connect.mirror.DefaultReplicationPolicy"
          replication.policy.separator: "."
          sync.topic.acls.enabled: "false"
          producer.override.acks: all
          producer.override.compression.type: lz4
          producer.override.batch.size: 327680
          producer.override.linger.ms: 100
      checkpointConnector:
        autoRestart:
          enabled: true
        config:
          checkpoints.topic.replication.factor: 1
          sync.group.offsets.enabled: true
          sync.group.offsets.interval.seconds: 60
          emit.checkpoints.interval.seconds: 60
          refresh.groups.interval.seconds: 600
          replication.policy.class: "org.apache.kafka.connect.mirror.DefaultReplicationPolicy"
      heartbeatConnector:
        config:
          heartbeats.topic.replication.factor: 1
          emit.heartbeats.interval.seconds: 10
      topicsPattern: ".*"
      topicsExcludePattern: ".*[\\-\\.]internal|__.*"
      groupsPattern: ".*"

  resources:
    requests:
      cpu: "500m"
      memory: 1Gi
    limits:
      cpu: "2000m"
      memory: 4Gi
  jvmOptions:
    -Xms: 1024m
    -Xmx: 3072m
  logging:
    type: inline
    loggers:
      rootLogger.level: INFO
      logger.mirror.name: org.apache.kafka.connect.mirror
      logger.mirror.level: DEBUG
  livenessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 5
  readinessProbe:
    initialDelaySeconds: 30
    timeoutSeconds: 5
  metricsConfig:
    type: jmxPrometheusExporter
    valueFrom:
      configMapKeyRef:
        name: mm2-jmx-metrics
        key: mm2-metrics-config.yml

Replication factors differ per target cluster

When replicating to AWS (1 broker), replication factors are set to 1. When replicating to proxmox (3 brokers), replication factors should be 3. The AWS cluster CRD (mm2-aws-to-proxmox) should use replication.factor: 3 in its mirror config.

Key differences between the two CRDs:

| Field | proxmox CRD | aws CRD |
| --- | --- | --- |
| metadata.name | mm2-proxmox-to-aws | mm2-aws-to-proxmox |
| spec.connectCluster | proxmox | aws |
| mirrors[0] direction | proxmox → aws | aws → proxmox |
| Local bootstrapServers | kafka-rciis-prod-kafka-bootstrap.rciis-prod.svc:9092 | kafka-rciis-prod-kafka-bootstrap.rciis-prod.svc:9092 |
| Remote bootstrapServers | kafka-bootstrap-aws.rciis-prod.svc.clusterset.local:9092 | kafka-bootstrap-proxmox.rciis-prod.svc.clusterset.local:9092 |
| Local user secret | mm2-user | mm2-user |
| Remote user secret | mm2-user-aws | mm2-user-proxmox |
| Target replication.factor | 1 (AWS has 1 broker) | 3 (Proxmox has 3 brokers) |
| Connect cluster *.replication.factor | 3 (local proxmox storage) | 1 (local AWS storage) |

Deploy:

Add the new files to the kustomizations and commit. FluxCD will reconcile automatically.

# Add to flux/apps/proxmox/kafka/kustomization.yaml:
  - kafka-mm2-user.yaml
  - kafka-mirrormaker2.yaml

# Add to flux/apps/aws/kafka/kustomization.yaml:
  - kafka-mm2-user.yaml
  - kafka-mirrormaker2.yaml

Verify:

# Check pods in both clusters (2 replicas each)
kubectl get pods -n rciis-prod --context rciis-proxmox -l app=kafka-mirrormaker2
kubectl get pods -n rciis-prod --context rciis-aws -l app=kafka-mirrormaker2

# Check connector status (proxmox cluster — 3 connectors for proxmox→aws)
kubectl exec -n rciis-prod --context rciis-proxmox mm2-proxmox-to-aws-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors | jq
# Expected:
# proxmox->aws.MirrorSourceConnector
# proxmox->aws.MirrorCheckpointConnector
# proxmox->aws.MirrorHeartbeatConnector

# Check connector status (AWS cluster — 3 connectors for aws→proxmox)
kubectl exec -n rciis-prod --context rciis-aws mm2-aws-to-proxmox-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors | jq
# Expected:
# aws->proxmox.MirrorSourceConnector
# aws->proxmox.MirrorCheckpointConnector
# aws->proxmox.MirrorHeartbeatConnector

# Check individual connector status
kubectl exec -n rciis-prod --context rciis-proxmox mm2-proxmox-to-aws-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors/proxmox->aws.MirrorSourceConnector/status | jq
# Expected: "state": "RUNNING"

Phase 4: JMX Metrics ConfigMap

Deploy the JMX metrics configuration for Prometheus scraping of MM2 connector metrics.

Files (to create):

  • flux/apps/proxmox/kafka/kafka-mm2-metrics-configmap.yaml
  • flux/apps/aws/kafka/kafka-mm2-metrics-configmap.yaml

Both files are identical:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mm2-jmx-metrics
  namespace: rciis-prod
data:
  mm2-metrics-config.yml: |
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    rules:
    - pattern: kafka.connect<type=connect-worker-metrics><>(connector-count|task-count|connector-startup-attempts-total|...)
      name: kafka_connect_$1
      type: GAUGE

    - pattern: kafka.connect.mirror<type=MirrorSourceConnector, target=(.+), topic=(.+), partition=([0-9]+)><>([a-z-]+)
      name: kafka_connect_mirror_source_connector_$4
      labels:
        target: "$1"
        topic: "$2"
        partition: "$3"
      type: GAUGE

    - pattern: kafka.connect.mirror<type=MirrorSourceConnector, target=(.+), topic=(.+), partition=([0-9]+)><>replication-latency-ms
      name: kafka_connect_mirror_replication_latency_ms
      labels:
        target: "$1"
        topic: "$2"
        partition: "$3"
      type: GAUGE

    - pattern: kafka.connect.mirror<type=MirrorCheckpointConnector, source=(.+), target=(.+)><>([a-z-]+)
      name: kafka_connect_mirror_checkpoint_connector_$3
      labels:
        source: "$1"
        target: "$2"
      type: GAUGE

Add to both kustomizations and commit — FluxCD will reconcile.


Phase 5: PrometheusRule Alerts

Deploy alerting rules for MM2 health and replication lag in both clusters.

Files (to create):

  • flux/apps/proxmox/kafka/kafka-mm2-prometheus-rules.yaml
  • flux/apps/aws/kafka/kafka-mm2-prometheus-rules.yaml

Both files are identical:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-mirrormaker2-alerts
  namespace: rciis-prod
  labels:
    release: prometheus
spec:
  groups:
    - name: mirrormaker2
      interval: 30s
      rules:
        - alert: MM2ConnectorDown
          expr: kafka_connect_connector_status{state="RUNNING"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "MirrorMaker 2 connector {{ $labels.connector }} is down"

        - alert: MM2ReplicationLagHigh
          expr: |
            (kafka_connect_source_task_source_record_poll_total -
             kafka_connect_source_task_source_record_write_total) > 10000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "MirrorMaker 2 replication lag exceeds 10,000 messages"

        - alert: MM2ReplicationLagCritical
          expr: |
            (kafka_connect_source_task_source_record_poll_total -
             kafka_connect_source_task_source_record_write_total) > 50000
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "MirrorMaker 2 replication lag critical (>50k messages)"

        - alert: MM2CheckpointStale
          expr: time() - kafka_connect_source_task_last_checkpoint_timestamp_seconds > 300
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "MirrorMaker 2 checkpoints are stale (no checkpoint in 5 minutes)"

Add to both kustomizations and commit — FluxCD will reconcile.

Grafana dashboards — import from the Strimzi GitHub repository:

  • strimzi-kafka.json (broker metrics)
  • strimzi-kafka-exporter.json (consumer lag)
  • strimzi-kafka-connect.json (MM2 connector metrics)

Operational Runbooks

Pause/Resume One Replication Direction

Since both clusters are active, you may need to temporarily pause replication in one direction (e.g. during maintenance on a single cluster).

# Pause proxmox→aws replication (on proxmox cluster MM2)
kubectl exec -n rciis-prod --context rciis-proxmox mm2-proxmox-to-aws-mirrormaker2-connect-0 -- \
  curl -X PUT http://localhost:8083/connectors/proxmox->aws.MirrorSourceConnector/pause

# Verify paused
kubectl exec -n rciis-prod --context rciis-proxmox mm2-proxmox-to-aws-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors/proxmox->aws.MirrorSourceConnector/status | jq '.connector.state'
# Expected: "PAUSED"

# ... perform maintenance ...

# Resume
kubectl exec -n rciis-prod --context rciis-proxmox mm2-proxmox-to-aws-mirrormaker2-connect-0 -- \
  curl -X PUT http://localhost:8083/connectors/proxmox->aws.MirrorSourceConnector/resume

To pause all replication from a cluster, scale down its MM2 instance:

# Scale down proxmox MM2 (pauses proxmox→aws replication)
kubectl scale kafkamirrormaker2/mm2-proxmox-to-aws --replicas=0 -n rciis-prod --context rciis-proxmox

# Scale back up
kubectl scale kafkamirrormaker2/mm2-proxmox-to-aws --replicas=2 -n rciis-prod --context rciis-proxmox

Network Partition Recovery

Symptoms: MM2ConnectorDown alert, logs showing "Connection reset by peer" or "Timed out waiting for".

# Diagnose — check ServiceExport and cross-cluster DNS
kubectl get serviceexport -n rciis-prod --context rciis-aws
kubectl get serviceexport -n rciis-prod --context rciis-proxmox

# Test cross-cluster DNS resolution
kubectl run -it --rm debug --image=alpine --restart=Never -n rciis-prod --context rciis-proxmox -- \
  nslookup kafka-bootstrap-aws.rciis-prod.svc.clusterset.local

kubectl run -it --rm debug --image=alpine --restart=Never -n rciis-prod --context rciis-aws -- \
  nslookup kafka-bootstrap-proxmox.rciis-prod.svc.clusterset.local

# Test cross-cluster connectivity
kubectl exec -it -n rciis-prod --context rciis-proxmox mm2-proxmox-to-aws-mirrormaker2-connect-0 -- \
  nc -zv kafka-bootstrap-aws.rciis-prod.svc.clusterset.local 9092

# Check Cilium Cluster Mesh health
cilium clustermesh status --context rciis-proxmox
cilium clustermesh status --context rciis-aws

# Resolve — if Cilium Cluster Mesh is down, restart API server
kubectl rollout restart deployment/clustermesh-apiserver -n kube-system

# MM2 will automatically resume once connectivity is restored (autoRestart: enabled)
# Monitor catch-up progress:
watch 'kubectl exec -n rciis-prod --context rciis-proxmox mm2-proxmox-to-aws-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors/proxmox->aws.MirrorSourceConnector/status | jq'

High Replication Lag Troubleshooting

Symptoms: MM2ReplicationLagHigh alert, lag >10,000 messages.

# Check connector status on both clusters
kubectl exec -n rciis-prod --context rciis-proxmox mm2-proxmox-to-aws-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors | jq
kubectl exec -n rciis-prod --context rciis-aws mm2-aws-to-proxmox-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors | jq

# Check MM2 logs for errors
kubectl logs -n rciis-prod --context rciis-proxmox -l app=kafka-mirrormaker2 --tail=100 | grep -i error
kubectl logs -n rciis-prod --context rciis-aws -l app=kafka-mirrormaker2 --tail=100 | grep -i error

Common causes:

| Cause | Resolution |
| --- | --- |
| Burst of messages on source | Wait for the burst to complete; optionally scale MM2 replicas to 3 |
| Network congestion | Check bandwidth utilisation; investigate Cilium Cluster Mesh health |
| Target broker failure | Check broker health: kubectl get pods -n rciis-prod --context rciis-aws -l strimzi.io/cluster=kafka-rciis-prod |
| Insufficient MM2 task parallelism | Increase tasksMax from 5 to 10 in the KafkaMirrorMaker2 CRD |

References


Key Files

| File | Description |
| --- | --- |
| flux/apps/aws/kafka/kafka.yaml | AWS cluster Kafka CRD (v4.1.0, 1 broker) |
| flux/apps/aws/kafka/kafka-nodepool.yaml | AWS cluster KafkaNodePool (gp3 storage) |
| flux/apps/aws/kafka/service-export-aws.yaml | AWS ServiceExport for cross-cluster discovery |
| flux/apps/aws/kafka/kafka-mirrormaker2.yaml | AWS MM2 CRD (aws → proxmox) |
| flux/apps/aws/kafka/kafka-mm2-user.yaml | AWS cluster KafkaUser for MM2 |
| flux/apps/aws/kafka/kafka-mm2-metrics-configmap.yaml | AWS JMX metrics ConfigMap |
| flux/apps/aws/kafka/kafka-mm2-prometheus-rules.yaml | AWS PrometheusRule for MM2 alerts |
| flux/apps/aws/kafka/kustomization.yaml | AWS Kafka Kustomization |
| flux/apps/proxmox/kafka/kafka.yaml | Proxmox cluster Kafka CRD (v4.1.0, 3 brokers) |
| flux/apps/proxmox/kafka/kafka-nodepool.yaml | Proxmox cluster KafkaNodePool (ceph-rbd-single) |
| flux/apps/proxmox/kafka/service-export-proxmox.yaml | Proxmox ServiceExport for cross-cluster discovery |
| flux/apps/proxmox/kafka/kafka-mirrormaker2.yaml | Proxmox MM2 CRD (proxmox → aws) |
| flux/apps/proxmox/kafka/kafka-mm2-user.yaml | Proxmox cluster KafkaUser for MM2 |
| flux/apps/proxmox/kafka/kafka-mm2-metrics-configmap.yaml | Proxmox JMX metrics ConfigMap |
| flux/apps/proxmox/kafka/kafka-mm2-prometheus-rules.yaml | Proxmox PrometheusRule for MM2 alerts |
| flux/apps/proxmox/kafka/kustomization.yaml | Proxmox Kafka Kustomization |
| flux/infra/base/strimzi.yaml | Base Strimzi HelmRelease (v0.51.0) |
| flux/infra/aws/cilium/patch.yaml | AWS Cilium Cluster Mesh config |
| flux/infra/proxmox/cilium/patch.yaml | Proxmox Cilium Cluster Mesh config |