A Benchmark Study on Realistic Non-IID Data Silos

Fundamentals and Experimental Analysis of Federated Learning Algorithms

University of Mohammed-V University of Sherbrooke
*Equal Contribution

🔥[NEW!] FedBench-0.1.3 enables efficient FL benchmarking, evaluating 9+ algorithms across 8+ datasets (images/tabular) on one A40 GPU in ~3 months. It offers precise control over client heterogeneity with minimal setup, providing comprehensive metrics for actionable algorithm selection.

This end-to-end framework sets a new reproducibility standard through production-ready implementations, enabling direct performance comparisons while reducing validation overhead.

Abstract

The efficacy of Federated Learning (FL), a prominent privacy-preserving machine learning paradigm, is critically challenged by the statistical heterogeneity that arises when data across client silos are not independent and identically distributed (Non-IID). While numerous algorithms have been proposed to address this issue, a comprehensive understanding of their relative performance under diverse conditions remains elusive. This paper presents a large-scale, systematic comparative study evaluating nine notable FL algorithms (FedAvg, FedProx, SCAFFOLD, FedNova, FedAdagrad, FedYogi, FedAdam, MOON, and FedBN) under a wide spectrum of Non-IID settings. Our methodology involves rigorous experimentation across eight benchmark datasets, simulating practical challenges such as label distribution skew, feature distribution skew, and data quantity imbalances. Our key finding is that no single algorithm is universally superior; optimal performance is highly contingent on the specific type of data heterogeneity. We find that server-side adaptive optimizers such as FedAdagrad excel under severe label skew (best in 8 of the 26 label-skew settings), whereas FedBN is most often the best choice under feature skew (best on 3 of the 7 datasets). Furthermore, we uncover the practical limitations of theoretically motivated methods like SCAFFOLD in highly heterogeneous environments. These insights culminate in a practical decision tree to guide algorithm selection, providing a clear roadmap for researchers and practitioners.
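The label distribution skew described above can be simulated in more than one way. The following is a minimal sketch, not the framework's own implementation, of two strategies that match the notation used in the results table: drawing per-class client proportions from a Dirichlet distribution (pk ~ Dir(0.5)) and restricting each client to a fixed number of classes (#C = k). The function names, the cyclic class assignment, and the use of NumPy are assumptions made for illustration.

```python
import numpy as np


def dirichlet_label_partition(labels, num_clients, beta, rng):
    """Spread each class over clients with proportions p_k ~ Dir(beta).

    Smaller beta concentrates a class on fewer clients (stronger skew).
    """
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Share of this class assigned to each client.
        proportions = rng.dirichlet(np.full(num_clients, beta))
        # Convert the shares into cut points along the shuffled indices.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, chunk in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(chunk.tolist())
    return client_indices


def k_class_partition(labels, num_clients, k, rng):
    """Give each client samples from exactly k classes (the #C = k setting).

    Assumption: client i owns k consecutive classes, wrapping around, and
    each class is split evenly among the clients that own it.
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    owners = {cls: [] for cls in classes}
    for client in range(num_clients):
        for j in range(k):
            owners[classes[(client * k + j) % len(classes)]].append(client)
    client_indices = [[] for _ in range(num_clients)]
    for cls, clients in owners.items():
        if not clients:
            continue  # no client owns this class
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        for client, chunk in zip(clients, np.array_split(idx, len(clients))):
            client_indices[client].extend(chunk.tolist())
    return client_indices
```

Both functions return one list of sample indices per client. Passing a seeded `np.random.default_rng(...)` keeps a partition repeatable across runs.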

Dataset Information

Dataset        #Training instances   #Test instances   #Features   #Classes
MNIST          60,000                10,000            784         10
FMNIST         60,000                10,000            784         10
CIFAR-10       50,000                10,000            3,072       10
SVHN           73,257                26,032            3,072       10
CINIC10        90,000                90,000            3,072       10
FED-ISIC2019   18,597                4,650             3,072       8
Adult          26,048                6,512             99          2
FCUBE          4,000                 1,000             3           2
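To make the training-set sizes concrete under quantity skew (written q ~ Dir(0.5) in the results table), the sketch below draws client shares from a Dirichlet distribution and converts them into per-client sample counts for the 60,000 MNIST training instances. The choice of 10 clients, the seed, and the function name are assumptions for illustration; the source does not state the number of clients.

```python
import numpy as np


def quantity_skew_sizes(num_samples, num_clients, beta, rng):
    """Draw q ~ Dir(beta) and turn it into per-client sample counts."""
    q = rng.dirichlet(np.full(num_clients, beta))
    # Cut points along the dataset; flooring keeps the total exact.
    cuts = (np.cumsum(q)[:-1] * num_samples).astype(int)
    bounds = np.concatenate(([0], cuts, [num_samples]))
    return np.diff(bounds)


rng = np.random.default_rng(0)  # seed chosen only for repeatability
sizes = quantity_skew_sizes(60_000, num_clients=10, beta=0.5, rng=rng)
print(sizes, sizes.sum())
```

The counts always sum to the dataset size, while individual clients can end up with very few samples when beta is small.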

Performance

Federated Learning Algorithms Comparison Table

CATEGORY DATASET PARTITIONING FedAvg FedProx SCAFFOLD FedNova FedAdagrad FedYogi FedAdam MOON FedBN
Label Distribution Skew MNIST pk ~ Dir(0.5) 98.85% ± 0.05% 98.91% ± 0.03% 61.99% ± 1.4% 98.97% ± 0.03% 98.81% ± 0.04% 98.86% ± 0.06% 98.46% ± 0.02% 98.80% ± 0.08% 98.91% ± 0.02%
#C = 1 40.34% ± 26.51% 40.24% ± 21.8% 10.99% ± 0.5% 54.66% ± 9.53% 68.85% ± 9.5% 34.20% ± 15.34% 18.86% ± 12.77% 9.92% ± 0.24% 60.73% ± 7.55%
#C = 2 97.53% ± 0.07% 97.42% ± 0.12% 21.81% ± 1.92% 97.37% ± 0.31% 95.92% ± 0.37% 97.78% ± 0.22% 61.91% ± 36.65% 55.75% ± 32.89% 97.50% ± 0.26%
#C = 3 98.53% ± 0.16% 98.65% ± 0.15% 29.95% ± 0.15% 98.43% ± 0.15% 98.22% ± 0.07% 98.69% ± 0.11% 95.8% ± 1.82% 95.06% ± 1.60% 98.56% ± 0.11%
FMNIST pk ~ Dir(0.5) 86.88% ± 0.38% 87.16% ± 0.18% 48.80% ± 1.36% 87.12% ± 0.32% 86.39% ± 0.19% 86.44% ± 0.15% 84.46% ± 1.72% 87.15% ± 0.38% 87.01% ± 0.28%
#C = 1 15.88% ± 4.6% 28.39% ± 16.68% 10% ± 0.0% 20.28% ± 14.53% 37% ± 4.9% 25.35% ± 10.93% 10% ± 0.0% 10% ± 0.0% 21.13% ± 10.30%
#C = 2 79.3% ± 2.37% 78.33% ± 0.7% 22.85% ± 2.60% 75.76% ± 3.94% 78.11% ± 1.03% 81.97% ± 0.78% 63.14% ± 15.59% 67.77% ± 2.61% 79.09% ± 1.18%
#C = 3 83.67% ± 0.55% 83.62% ± 0.85% 25.56% ± 0.16% 83.51% ± 0.67% 83.52% ± 0.68% 83.42% ± 0.13% 79.62% ± 0.72% 74.57% ± 3.46% 83.56% ± 0.76%
SVHN pk ~ Dir(0.5) 86.44% ± 0.31% 86.48% ± 0.1% 51.74% ± 1.20% 86.48% ± 0.13% 86.7% ± 0.35% 85.34% ± 0.21% 86.7% ± 0.35% 86.15% ± 0.46% 86.45% ± 0.51%
#C = 1 14.32% ± 2.3% 13.27% ± 4.53% 19.59% ± 0.0% 9.15% ± 1.82% 18.37% ± 1.72% 18.37% ± 1.72% 15.65% ± 5.58% 12.49% ± 3.03% 12.69% ± 2.30%
#C = 2 77.49% ± 1.34% 79.28% ± 0.17% 24.55% ± 0.86% 72.41% ± 2.15% 76.37% ± 1.18% 79.03% ± 0.26% 57.99% ± 27.15% 74.41% ± 1.76% 76.27% ± 2.53%
#C = 3 81.71% ± 1.08% 82.11% ± 1.14% 31.11% ± 0.51% 81.59% ± 0.91% 82.43% ± 0.14% 81.27% ± 0.64% 82.43% ± 0.14% 81.26% ± 1.06% 81.77% ± 0.49%
CINIC10 pk ~ Dir(0.5) 36.41% ± 0.18% 36.24% ± 0.3% 19.16% ± 0.32% 36.59% ± 0.69% 30.24% ± 1.90% 35.65% ± 0.27% 30.57% ± 1.46% 36.93% ± 0.94% 36.24% ± 0.23%
#C = 1 9.64% ± 0.54% 9.46% ± 0.77% 11.17% ± 1.51% 10% ± 0.0% 10% ± 0.0% 9.93% ± 0.33% 10% ± 0.0% 10% ± 0.0% 10% ± 0.0%
#C = 2 25.34% ± 1.33% 28.15% ± 1.2% 16.47% ± 1.53% 26.08% ± 1.31% 13.58% ± 5.06% 26.58% ± 0.92% 13.58% ± 5.06% 25.41% ± 0.68% 27.55% ± 0.79%
#C = 3 32.91% ± 0.44% 32.75% ± 0.42% 15.46% ± 1.17% 32.99% ± 0.41% 19% ± 6.36% 30.01% ± 0.91% 18.66% ± 6.13% 33.32% ± 0.86% 33.08% ± 0.23%
CIFAR10 pk ~ Dir(0.5) 65.61% ± 1.52% 64.16% ± 0.25% 25.74% ± 1.65% 64.66% ± 1.05% 60.62% ± 0.84% 59.41% ± 0.4% 38.95% ± 5.79% 64.69% ± 0.82% 63.43% ± 0.102%
#C = 1 9.64% ± 0.54% 9.46% ± 0.77% 10.11% ± 0.34% 11.66% ± 2.9% 19.4% ± 0.11% 22.79% ± 3.63% 10% ± 0.0% 10% ± 0.0% 11.20% ± 1.43%
#C = 2 47.83% ± 3.29% 49.82% ± 0.66% 17.44% ± 1.79% 46.27% ± 2.21% 46.63% ± 0.97% 47.38% ± 1.41% 10% ± 0.0% 42.23% ± 1.0% 48.36% ± 1.50%
#C = 3 62.65% ± 1.49% 63.57% ± 0.2% 20.11% ± 2.02% 61.7% ± 1.51% 59.9% ± 1.06% 59.02% ± 0.34% 38.34% ± 3.52% 61.21% ± 0.74% 61.93% ± 0.89%
FedISIC2019 pk ~ Dir(0.5) 56.46% ± 0.30% 56.23% ± 0.49% 28.30% ± 14.09% 33.61% ± 0.55% 53.80% ± 0.60% 54.14% ± 0.38% 48.27% ± 0.07% 55.90% ± 0.29% 56.24% ± 0.55%
#C = 1 18.34% ± 0.0% 28.3% ± 14.09% 48.22% ± 0.0% 22.56% ± 19.46% 48.55% ± 0.47% 48.22% ± 0.0% 48.22% ± 0.0% 37.02% ± 15.93% 27.02% ± 15.08%
#C = 2 40.09% ± 5.30% 38.52% ± 14.27% 48.22% ± 0.0% 28.95% ± 14.39% 52.18% ± 0.87% 51.84% ± 0.53% 48.22% ± 0.0% 48.43% ± 1.16% 48.10% ± 1.58%
#C = 3 47.93% ± 1.71% 48.02% ± 2.41% 48.22% ± 0.0% 40.68% ± 4.82% 52.25% ± 0.81% 51.92% ± 1.13% 38.26% ± 14.09% 50.52% ± 2.15% 49.18% ± 0.26%
Adult pk ~ Dir(0.5) 81.73% ± 2.62% 83.48% ± 2.17% 76.49% ± 0.0% 69.07% ± 2.05% 83.64% ± 1.71% 83.31% ± 1.29% 82.98% ± 0.92% 82.39% ± 2.40% 84.39% ± 0.49%
#C = 1 80.21% ± 3.35% 81.06% ± 2.98% 76.49% ± 0.0% 54.08% ± 0.29% 85.03% ± 0.23% 84.03% ± 1.48% 84.70% ± 0.57% 77.88% ± 4.82% 79.22% ± 5.57%
Number of times each algorithm performs best 3 4 2 1 8 4 2 2 1
Feature distribution skew MNIST x̂ ~ Gau(0.1) 99.23% ± 0.07% 99.26% ± 0.06% 98.19% ± 0.12% 99.24% ± 0.03% 99.28% ± 0.07% 99.22% ± 0.06% 99.05% ± 0.09% 99.24% ± 0.02% 99.28% ± 0.02%
FMNIST 84.35% ± 0.13% 84.65% ± 0.30% 72.68% ± 0.44% 84.43% ± 0.03% 83.99% ± 0.25% 84.57% ± 0.26% 82.64% ± 0.51% 84.57% ± 0.12% 84.72% ± 0.14%
SVHN 68.85% ± 0.67% 69.39% ± 0.40% 70.25% ± 0.41% 69.75% ± 1.20% 73.14% ± 1.03% 72.8% ± 0.56% 70.09% ± 0.92% 70.09% ± 0.73% 69.27% ± 1.75%
CINIC10 35.66% ± 1.05% 34.51% ± 1.01% 31.97% ± 0.95% 33.84% ± 0.70% 23.10% ± 9.29% 35.44% ± 0.73% 23.44% ± 9.50% 34.18% ± 0.5% 33.82% ± 0.88%
CIFAR10 64.02% ± 0.08% 63.59% ± 0.09% 47.04% ± 1.01% 63.16% ± 1.13% 63.48% ± 0.88% 63.36% ± 0.42% 51.72% ± 0.75% 64.93% ± 0.34% 63.8% ± 0.56%
FedISIC2019 54.75% ± 0.56% 55.11% ± 0.12% 49.56% ± 0.68% 54.78% ± 0.04% 52.36% ± 0.52% 52.7% ± 0.99% 48.87% ± 0.88% 54.96% ± 0.86% 55.3% ± 0.46%
FCUBE synthetic 99.57% ± 0.05% 99.67% ± 0.12% 97.07% ± 1.28% 99.87% ± 0.12% 99.53% ± 0.12% 99.67% ± 0.12% 99.67% ± 0.12% 99.7% ± 0.14% 99.67% ± 0.09%
Number of times each algorithm performs best 1 0 0 1 2 0 0 1 3
Quantity skew MNIST q ~ Dir(0.5) 99.02% ± 0.05% 99.01% ± 0.07% 97.3% ± 0.02% 98.95% ± 0.05% 99.08% ± 0.04% 98.89% ± 0.05% 98.61% ± 0.07% 99.07% ± 0.06% 99.08% ± 0.06%
FMNIST 88.80% ± 0.05% 88.46% ± 0.31% 80.53% ± 0.8% 88.99% ± 0.11% 88.95% ± 0.35% 88.75% ± 0.05% 86.41% ± 0.16% 89.09% ± 0.17% 88.08% ± 0.0%
SVHN 87.38% ± 1.27% 87.99% ± 0.38% 77.08% ± 0.97% 87.52% ± 0.44% 88.80% ± 0.06% 87.44% ± 0.05% 85.08% ± 0.28% 87.85% ± 0.82% 87.94% ± 0.13%
CINIC10 38.73% ± 0.55% 38.56% ± 0.39% 37.61% ± 0.40% 38.55% ± 0.29% 26.96% ± 12.0% 39.23% ± 0.48% 26.63% ± 11.76% 38.22% ± 0.67% 39.73% ± 0.45%
CIFAR10 71.13% ± 0.37% 71.01% ± 0.39% 46.14% ± 0.07% 71.72% ± 0.58% 69.40% ± 0.38% 68.24% ± 0.68% 25.90% ± 22.49% 71.6% ± 0.44% 71.23% ± 0.52%
FedISIC2019 57.73% ± 0.29% 58.14% ± 0.55% 49.96% ± 0.75% 57.04% ± 0.53% 56.18% ± 0.63% 56.52% ± 0.18% 48.66% ± 0.63% 58.51% ± 0.66% 57.93% ± 0.2%
Adult 84.26% ± 0.04% 84.48% ± 0.23% 85.55% ± 0.05% 83.98% ± 0.22% 83.83% ± 0.48% 83.5% ± 0.83% 83.83% ± 0.48% 83.6% ± 0.82% 83.93% ± 0.41%
Number of times each algorithm performs best 0 0 1 1 2 0 0 2 2
Homogeneous partition MNIST IID 99.11% ± 0.05% 99.10% ± 0.01% 98.27% ± 0.09% 99.15% ± 0.04% 99.11% ± 0.07% 98.99% ± 0.11% 98.78% ± 0.02% 99.19% ± 0.03% 99.09% ± 0.07%
FMNIST 89.27% ± 0.23% 89.21% ± 0.24% 83.69% ± 0.45% 89.29% ± 0.08% 89.65% ± 0.03% 89.01% ± 0.11% 87.54% ± 0.57% 89.35% ± 0.16% 89.16% ± 0.36%
SVHN 88.07% ± 0.36% 88.30% ± 0.19% 82.05% ± 0.24% 87.30% ± 1.51% 60.14% ± 28.69% 87.32% ± 0.34% 61.30% ± 29.50% 87.63% ± 1.65% 88.35% ± 0.21%
CINIC10 39.99% ± 0.88% 39.82% ± 0.26% 41.58% ± 0.48% 40.18% ± 0.41% 36.91% ± 1.39% 40.83% ± 0.56% 36.58% ± 0.93% 40.85% ± 1.13% 39.92% ± 0.92%
CIFAR10 72.59% ± 0.51% 74.14% ± 0.01% 51.88% ± 0.45% 72.23% ± 0.31% 69.53% ± 0.59% 68.47% ± 0.37% 65.79% ± 2.08% 69.16% ± 0.47% 72.36% ± 0.32%
FedISIC2019 59.15% ± 0.83% 60.25% ± 0.12% 52.59% ± 0.62% 58.91% ± 0.30% 51.82% ± 1.85% 51.49% ± 2.32% 51.49% ± 2.32% 59.23% ± 0.54% 58.9% ± 0.31%
Adult 84.24% ± 0.39% 84.10% ± 0.16% 85.59% ± 0.12% 83.97% ± 0.06% 83.9% ± 0.1% 84.23% ± 0.51% 83.90% ± 0.10% 84.42% ± 0.97% 84.09% ± 0.51%
Number of times each algorithm performs best 0 2 2 0 1 0 0 1 1
Legend: Best Result; High Performance (>80%); Medium Performance (30-80%); Low Performance (<30%)
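The "number of times" summary rows can be recomputed from the mean accuracies. As a sketch, the snippet below copies the mean test accuracies from the feature distribution skew rows above and counts, per algorithm, how often it reaches the row maximum. Crediting every algorithm that ties for the maximum is an assumption, although it is consistent with the counts reported in the table.

```python
import numpy as np

ALGORITHMS = ["FedAvg", "FedProx", "SCAFFOLD", "FedNova", "FedAdagrad",
              "FedYogi", "FedAdam", "MOON", "FedBN"]

# Mean test accuracy (%) per dataset, copied from the feature-skew rows.
FEATURE_SKEW = {
    "MNIST":       [99.23, 99.26, 98.19, 99.24, 99.28, 99.22, 99.05, 99.24, 99.28],
    "FMNIST":      [84.35, 84.65, 72.68, 84.43, 83.99, 84.57, 82.64, 84.57, 84.72],
    "SVHN":        [68.85, 69.39, 70.25, 69.75, 73.14, 72.80, 70.09, 70.09, 69.27],
    "CINIC10":     [35.66, 34.51, 31.97, 33.84, 23.10, 35.44, 23.44, 34.18, 33.82],
    "CIFAR10":     [64.02, 63.59, 47.04, 63.16, 63.48, 63.36, 51.72, 64.93, 63.80],
    "FedISIC2019": [54.75, 55.11, 49.56, 54.78, 52.36, 52.70, 48.87, 54.96, 55.30],
    "FCUBE":       [99.57, 99.67, 97.07, 99.87, 99.53, 99.67, 99.67, 99.70, 99.67],
}


def count_best(results):
    """Count, per algorithm, the rows in which it attains the best mean."""
    acc = np.array(list(results.values()))
    # Ties are credited to every algorithm that matches the row maximum.
    is_best = np.isclose(acc, acc.max(axis=1, keepdims=True))
    return dict(zip(ALGORITHMS, is_best.sum(axis=0).tolist()))


counts = count_best(FEATURE_SKEW)
print(counts)
# {'FedAvg': 1, 'FedProx': 0, 'SCAFFOLD': 0, 'FedNova': 1, 'FedAdagrad': 2,
#  'FedYogi': 0, 'FedAdam': 0, 'MOON': 1, 'FedBN': 3}
```

The counts sum to 8 over 7 datasets because FedAdagrad and FedBN tie on MNIST. Note that this tally ignores the reported standard deviations, so near-ties within the error bars still produce a single winner.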

Training Curves: Comparative Performance

[Figure] The training curves of different approaches on CIFAR-10. Panels: #C = 1; #C = 2; #C = 3; pk ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

[Figure] The training curves of different approaches on CINIC10. Panels: #C = 1; #C = 2; #C = 3; pk ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

[Figure] The training curves of different approaches on MNIST. Panels: #C = 1; #C = 2; #C = 3; pk ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

[Figure] The training curves of different approaches on SVHN. Panels: #C = 1; #C = 2; #C = 3; pk ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

[Figure] The training curves of different approaches on FEDISIC2019. Panels: #C = 1; #C = 2; #C = 3; pk ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).
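Training curves of this kind are typically read one communication round at a time; the axes of the original plots are not recoverable here, so that reading is an assumption. For reference, the server step of the FedAvg baseline, a sample-size-weighted average of client parameters, can be sketched as follows. The function name and the list-of-arrays model representation are hypothetical and are not taken from the framework.

```python
import numpy as np


def fedavg_aggregate(client_weights, client_sizes):
    """Average each parameter array across clients, weighted by the
    number of local training samples (the FedAvg server step)."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    num_layers = len(client_weights[0])
    return [
        sum(c * w[layer] for c, w in zip(coeffs, client_weights))
        for layer in range(num_layers)
    ]


# Two clients with one parameter array each; the second holds 3x the data.
merged = fedavg_aggregate(
    [[np.array([0.0, 0.0])], [np.array([4.0, 8.0])]],
    client_sizes=[1, 3],
)
print(merged[0])  # [3. 6.]
```

The other algorithms in the comparison change either the local objective or this server step, which is why their curves can diverge from FedAvg under the same partition.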

Local Epoch Sensitivity Analysis

[Figure] The test accuracy with different numbers of local epochs on FEDISIC2019. Panels: #C = 1; #C = 2; #C = 3; pk ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

[Figure] The test accuracy with different numbers of local epochs on CIFAR-10. Panels: #C = 1; #C = 2; #C = 3; pk ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

[Figure] The test accuracy with different numbers of local epochs on CINIC10. Panels: #C = 1; #C = 2; #C = 3; pk ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

[Figure] The test accuracy with different numbers of local epochs on MNIST. Panels: #C = 1; #C = 2; #C = 3; pk ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

[Figure] The test accuracy with different numbers of local epochs on SVHN. Panels: #C = 1; #C = 2; #C = 3; pk ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).
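To show where the number of local epochs enters a federated round, here is a toy sketch on a noiseless least-squares problem with two clients whose inputs have different means. Everything in it (the linear model, the learning rate, the 20 rounds, the two-client setup) is an assumption for illustration and does not reproduce the experiments summarized above.

```python
import numpy as np


def local_update(w, X, y, epochs, lr=0.1):
    """Run `epochs` full-batch gradient steps on a least-squares loss."""
    w = w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


def federated_round(w_global, clients, epochs):
    """One round: each client trains locally, the server averages by size."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    updates = [local_update(w_global, X, y, epochs) for X, y in clients]
    return sum(s * w for s, w in zip(sizes / sizes.sum(), updates))


rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
clients = []
for shift in (0.0, 1.0):  # different input means mimic heterogeneous clients
    X = rng.normal(loc=shift, size=(50, 2))
    clients.append((X, X @ w_true))

errors = {}
for epochs in (1, 5, 10):
    w = np.zeros(2)
    for _ in range(20):  # 20 communication rounds
        w = federated_round(w, clients, epochs)
    errors[epochs] = float(np.linalg.norm(w - w_true))
    print(epochs, errors[epochs])
```

In this toy case every client shares the same optimum, so extra local epochs only speed up convergence. Under label skew the local optima differ, and more local epochs can pull client models apart, which is the kind of sensitivity these figures examine.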
