Thursday, December 18, 2014

Kafka network utilization (in vs. out)

One thing that was confusing me as I looked into the metrics in my Kafka performance testing (as shown in any of the graphs in my previous blog post) was the approximately 2x factor of input network bytes vs. output network bytes. Given I was doing replication, shouldn't the number of bytes be the same since I have to exactly replicate the messages to a following broker?

For a simple example, let's assume one producer sending at 300 MB/sec, three brokers, three partitions, a replication factor of two, and no consumers.

A single broker (bid = 0) receives 100 MB/sec from the producer (eth0 in) for the partition (say pid = 0) for which it is the leader.

The single broker turns around and sends (eth0 out) 100 MB/sec of that traffic to the broker that is the in sync replica (say bid = 1).  To be accurate, it is actually the in sync replica (bid = 1) that pulls.

Next is the part I was missing when thinking about this problem quickly ...

The single broker (bid = 0) also gets sent (actually pulls) 100 MB/sec from another broker's (say bid = 2) partition (say pid = 2), for which it is not the leader but an in sync replica.

This means the traffic load for this broker is:
  • producer to broker (bid = 0) for partition (pid = 0)
    • eth0 in = 100 MB/sec
  • broker (bid = 0) for partition (pid = 0) to in sync replica broker (bid = 1)
    • eth0 out = 100 MB/sec
  • broker (bid = 2) for partition (pid = 2) to in sync replica broker (bid = 0)
    • eth0 in = 100 MB/sec
  • total:
    • eth0 in = 200 MB/sec
    • eth0 out = 100 MB/sec
Notice I originally said no consumers.  If I added consumers, they would start to add to the eth0 out figure and could balance in vs. out, but only if they consumed at the same rate as the producers.  If there were more consumers than producers, the output could easily overrun the input rate, which would be common for streams that are heavily fanned out to different consumer groups.
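
To make the bookkeeping concrete, here is a small Python sketch of this example (the variable names are my own; it assumes leaders and followers are spread evenly across the three brokers):

producer_rate = 300.0    # total MB/sec from the producer
partitions = 3
brokers = 3
rf = 2                   # replication factor

rate_per_partition = producer_rate / partitions               # 100 MB/sec
leaders_per_broker = partitions / brokers                     # 1 partition led
followers_per_broker = (rf - 1) * partitions / brokers        # 1 partition followed

eth0_in = (leaders_per_broker + followers_per_broker) * rate_per_partition   # 200 MB/sec
eth0_out = leaders_per_broker * (rf - 1) * rate_per_partition                # 100 MB/sec

print(eth0_in, eth0_out)   # 200.0 100.0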

Now, let's consider what happens when we make the configuration more interesting: more brokers, more partitions, and a larger replication factor.

Let's consider the case of 300 brokers with 3000 partitions and the same replication factor of 2.  Imagine a producer group that can send at 3000 MB/sec.  That means every partition will receive 1 MB/sec (eth0 in).  Every broker will be the leader for 10 of these partitions.  So broker 0 would receive 10 MB/sec of the producer traffic.  It would need to send that traffic to 10 other ISRs, giving 10 MB/sec of replication "out" traffic.  It would also be an ISR (non-leader) for 10 partitions, so it would receive 10 MB/sec of replication "in" traffic.  That would mean 20 MB/sec in and 10 MB/sec out.  Again a 2x factor.


Let's now consider 300 brokers with 3000 partitions and a replication factor of 6.  Again imagine a producer group that can send at 3000 MB/sec.  That means every partition will receive 1 MB/sec (eth0 in).  Every broker will be the leader for 10 of these partitions.  So broker 0 would receive 10 MB/sec of the producer traffic.  It would need to send that traffic to 50 other ISRs (5 for each of the 10 partitions), giving 50 MB/sec of replication "out" traffic.  It would also be an ISR (non-leader) for 50 partitions, so it would receive 50 MB/sec of replication "in" traffic.  That would mean 60 MB/sec in and 50 MB/sec out.  Now a 1.2x factor.  So increasing the replication factor decreases the ratio of in to out traffic.
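
Running the same arithmetic as a quick Python sketch for both of the larger configurations (again assuming an even spread of leaders and followers across the brokers):

total_rate = 3000.0      # MB/sec from the producer group
partitions = 3000
brokers = 300

for rf in (2, 6):
    rate_per_partition = total_rate / partitions               # 1 MB/sec
    leaders_per_broker = partitions / brokers                   # 10
    followers_per_broker = (rf - 1) * partitions / brokers
    eth0_in = (leaders_per_broker + followers_per_broker) * rate_per_partition
    eth0_out = leaders_per_broker * (rf - 1) * rate_per_partition
    print(rf, eth0_in, eth0_out, eth0_in / eth0_out)
# rf = 2: 20.0 in, 10.0 out, ratio 2.0
# rf = 6: 60.0 in, 50.0 out, ratio 1.2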

Let's do some algebra to generalize this:

Let:
np = number of partitions
nb = number of brokers
rf = replication factor
tr = transfer rate total

tr_p_p = transfer rate for partition = tr / np
nlp_p_b = number of leader partitions per broker = np / nb
f_p_p = number of followers per partition = rf - 1
nl_p_p = number of leaders per partition = 1
f_tot = total number of following partitions = f_p_p * np
f_tot_p_b = total number of following partitions per broker = f_tot / nb

Tr in from producers for a single broker =
    tr_p_p * nlp_p_b =
    tr / np * np / nb =
    tr / nb (let's call this transfer rate per broker = tr_p_b)

Tr out to followers for a single broker =
    nlp_p_b * tr_p_p * f_p_p = 
    np / nb * tr / np * (rf - 1) =
    tr / nb * (rf - 1) =
    tr_p_b * (rf - 1)

Tr in as a follower =
    f_tot_p_b * tr_p_p =
    f_tot / nb * tr / np = 
    f_p_p * np / nb * tr / np = 
    (rf - 1) * np / nb * tr / np = 
    (rf - 1) / nb * tr =
    tr / nb * (rf - 1) =
    tr_p_b * (rf - 1)

Total in for a single broker =
  tr_p_b + (rf - 1) * tr_p_b =
  tr_p_b * rf

Total out for a single broker =
  tr_p_b * (rf - 1)
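
Putting the two totals into a small Python function (a sketch using the variable names above; it assumes an even spread of leaders and followers and no consumers) makes the relationship easy to see: in / out = rf / (rf - 1), which matches the 2x factor for rf = 2 and the 1.2x factor for rf = 6 above.

def broker_in_out(tr, nb, rf):
    # Per-broker network totals from the generalization above.
    tr_p_b = tr / nb                        # each broker's share of producer traffic
    eth0_in = tr_p_b + (rf - 1) * tr_p_b    # = tr_p_b * rf
    eth0_out = tr_p_b * (rf - 1)
    return eth0_in, eth0_out

print(broker_in_out(3000.0, 300, 2))    # (20.0, 10.0) -> 2x
print(broker_in_out(3000.0, 300, 6))    # (60.0, 50.0) -> 1.2x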

The above generalization assumes a large enough number of brokers that the replication factor comes into play as it relates to networking.  If you run with only three brokers (as in my examples above), the replication factor is effectively capped at 3 regardless of how many partitions you have, so the difference between input and output would still be around 2x.  So it might be better to read rf above as the minimum of the replication factor and the number of brokers.  However, if you add more brokers to scale out, the generalization should apply.

In summary, in a large enough cluster, a single broker takes in its share of the front-end (producer) traffic plus, as a follower, (rf - 1) times that same share.  It also sends (rf - 1) times that share out to the followers of the partitions it leads.