A Benchmark Study on Realistic Non-IID Data Silos

Fundamentals and Experimental Analysis of Federated Learning Algorithms


Mohammed Nechba^*, Abdellatif El Afia, Bessam Abdulrazak

▶ University of Mohammed-V ▶ University of Sherbrooke

^*Equal Contribution

🔥[NEW!] FedBench-0.1.3 enables efficient FL benchmarking, evaluating 9+ algorithms across 8+ datasets (images/tabular) on one A40 GPU in ~3 months. It offers precise control over client heterogeneity with minimal setup, providing comprehensive metrics for actionable algorithm selection.

This end-to-end framework sets a new reproducibility standard through production-ready implementations, enabling direct performance comparisons while reducing validation overhead.

Abstract

The efficacy of Federated Learning (FL), a prominent privacy-preserving machine learning paradigm, is critically challenged by the statistical heterogeneity inherent in non-independent and identically distributed (Non-IID) data across client silos. While numerous algorithms have been proposed to address this issue, a comprehensive understanding of their relative performance under diverse conditions remains elusive. This paper presents a large-scale, systematic comparative study, evaluating nine notable FL algorithms (including FedAvg, FedProx, SCAFFOLD, MOON, FedBN, and server-side adaptive methods) under a wide spectrum of Non-IID settings. Our methodology involves rigorous experimentation across eight benchmark datasets, simulating practical challenges such as label distribution skew, feature distribution skew, and data quantity imbalances. Our key finding is that no single algorithm is universally superior; optimal performance is highly contingent on the specific type of data heterogeneity. We demonstrate that adaptive optimizers like FedAdagrad excel under severe label skew, whereas FedBN is the definitive choice for feature skew. Furthermore, we uncover the practical limitations of theoretically motivated methods like SCAFFOLD in highly heterogeneous environments. These insights culminate in a practical decision tree to guide algorithm selection, providing a clear roadmap for researchers and practitioners.
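
The label and quantity skews named in the abstract can be simulated in a few lines. The sketch below is not the FedBench implementation. It assumes the common reading of the notation used in the tables: p_k ~ Dir(0.5) draws, for each class k, a Dirichlet vector that sets the share of that class given to each client, and q ~ Dir(0.5) draws a single Dirichlet vector that sets each client's share of the whole dataset. The function names, the seed handling, and the reliance on Python's standard library are illustrative choices.

```python
import random
from collections import defaultdict


def sample_dirichlet(alpha, size, rng):
    """Sample a symmetric Dirichlet(alpha) vector by normalising Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def split_by_proportions(indices, proportions):
    """Cut `indices` into consecutive chunks sized according to `proportions`."""
    n = len(indices)
    chunks, start, cumulative = [], 0, 0.0
    for i, p in enumerate(proportions):
        cumulative += p
        # The last chunk takes the remainder so that no index is dropped.
        end = n if i == len(proportions) - 1 else min(n, round(cumulative * n))
        end = max(end, start)
        chunks.append(indices[start:end])
        start = end
    return chunks


def dirichlet_label_skew(labels, num_clients, alpha=0.5, seed=0):
    """Assumed reading of p_k ~ Dir(alpha): split each class across clients."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        shares = sample_dirichlet(alpha, num_clients, rng)
        for client, chunk in zip(clients, split_by_proportions(idxs, shares)):
            client.extend(chunk)
    return clients


def quantity_skew(num_samples, num_clients, alpha=0.5, seed=0):
    """Assumed reading of q ~ Dir(alpha): IID labels, unequal client sizes."""
    rng = random.Random(seed)
    idxs = list(range(num_samples))
    rng.shuffle(idxs)
    shares = sample_dirichlet(alpha, num_clients, rng)
    return split_by_proportions(idxs, shares)


if __name__ == "__main__":
    labels = [i % 10 for i in range(1000)]  # toy labels, 10 balanced classes
    for name, parts in [("label skew", dirichlet_label_skew(labels, 5)),
                        ("quantity skew", quantity_skew(len(labels), 5))]:
        print(name, [len(p) for p in parts])
```

Smaller concentration values make the sampled shares more uneven, so clients end up with more lopsided class mixes or dataset sizes; larger values approach an even split.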

Dataset Information

Datasets	#training instances	#test instances	#features	#classes
MNIST	60,000	10,000	784	10
FMNIST	60,000	10,000	784	10
CIFAR-10	50,000	10,000	3,072	10
SVHN	73,257	26,032	3,072	10
CINIC10	90,000	90,000	3,072	10
FED-ISIC2019	18,597	4,650	3,072	8
Adult	26,048	6,512	99	2
FCUBE	4,000	1,000	3	2

Performance

Federated Learning Algorithms Comparison Table

CATEGORY	DATASET	PARTITIONING	FedAvg	FedProx	SCAFFOLD	FedNova	FedAdagrad	FedYogi	FedAdam	MOON	FedBN
Label Distribution Skew	MNIST	p_k ~ Dir(0.5)	98.85% ± 0.05%	98.91% ± 0.03%	61.99% ± 1.4%	98.97% ± 0.03%	98.81% ± 0.04%	98.86% ± 0.06%	98.46% ± 0.02%	98.80% ± 0.08%	98.91% ± 0.02%
		#C = 1	40.34% ± 26.51%	40.24% ± 21.8%	10.99% ± 0.5%	54.66% ± 9.53%	68.85% ± 9.5%	34.20% ± 15.34%	18.86% ± 12.77%	9.92% ± 0.24%	60.73% ± 7.55%
		#C = 2	97.53% ± 0.07%	97.42% ± 0.12%	21.81% ± 1.92%	97.37% ± 0.31%	95.92% ± 0.37%	97.78% ± 0.22%	61.91% ± 36.65%	55.75% ± 32.89%	97.50% ± 0.26%
		#C = 3	98.53% ± 0.16%	98.65% ± 0.15%	29.95% ± 0.15%	98.43% ± 0.15%	98.22% ± 0.07%	98.69% ± 0.11%	95.8% ± 1.82%	95.06% ± 1.60%	98.56% ± 0.11%
	FMNIST	p_k ~ Dir(0.5)	86.88% ± 0.38%	87.16% ± 0.18%	48.80% ± 1.36%	87.12% ± 0.32%	86.39% ± 0.19%	86.44% ± 0.15%	84.46% ± 1.72%	87.15% ± 0.38%	87.01% ± 0.28%
		#C = 1	15.88% ± 4.6%	28.39% ± 16.68%	10% ± 0.0%	20.28% ± 14.53%	37% ± 4.9%	25.35% ± 10.93%	10% ± 0.0%	10% ± 0.0%	21.13% ± 10.30%
		#C = 2	79.3% ± 2.37%	78.33% ± 0.7%	22.85% ± 2.60%	75.76% ± 3.94%	78.11% ± 1.03%	81.97% ± 0.78%	63.14% ± 15.59%	67.77% ± 2.61%	79.09% ± 1.18%
		#C = 3	83.67% ± 0.55%	83.62% ± 0.85%	25.56% ± 0.16%	83.51% ± 0.67%	83.52% ± 0.68%	83.42% ± 0.13%	79.62% ± 0.72%	74.57% ± 3.46%	83.56% ± 0.76%
	SVHN	p_k ~ Dir(0.5)	86.44% ± 0.31%	86.48% ± 0.1%	51.74% ± 1.20%	86.48% ± 0.13%	86.7% ± 0.35%	85.34% ± 0.21%	86.7% ± 0.35%	86.15% ± 0.46%	86.45% ± 0.51%
		#C = 1	14.32% ± 2.3%	13.27% ± 4.53%	19.59% ± 0.0%	9.15% ± 1.82%	18.37% ± 1.72%	18.37% ± 1.72%	15.65% ± 5.58%	12.49% ± 3.03%	12.69% ± 2.30%
		#C = 2	77.49% ± 1.34%	79.28% ± 0.17%	24.55% ± 0.86%	72.41% ± 2.15%	76.37% ± 1.18%	79.03% ± 0.26%	57.99% ± 27.15%	74.41% ± 1.76%	76.27% ± 2.53%
		#C = 3	81.71% ± 1.08%	82.11% ± 1.14%	31.11% ± 0.51%	81.59% ± 0.91%	82.43% ± 0.14%	81.27% ± 0.64%	82.43% ± 0.14%	81.26% ± 1.06%	81.77% ± 0.49%
	CINIC10	p_k ~ Dir(0.5)	36.41% ± 0.18%	36.24% ± 0.3%	19.16% ± 0.32%	36.59% ± 0.69%	30.24% ± 1.90%	35.65% ± 0.27%	30.57% ± 1.46%	36.93% ± 0.94%	36.24% ± 0.23%
		#C = 1	9.64% ± 0.54%	9.46% ± 0.77%	11.17% ± 1.51%	10% ± 0.0%	10% ± 0.0%	9.93% ± 0.33%	10% ± 0.0%	10% ± 0.0%	10% ± 0.0%
		#C = 2	25.34% ± 1.33%	28.15% ± 1.2%	16.47% ± 1.53%	26.08% ± 1.31%	13.58% ± 5.06%	26.58% ± 0.92%	13.58% ± 5.06%	25.41% ± 0.68%	27.55% ± 0.79%
		#C = 3	32.91% ± 0.44%	32.75% ± 0.42%	15.46% ± 1.17%	32.99% ± 0.41%	19% ± 6.36%	30.01% ± 0.91%	18.66% ± 6.13%	33.32% ± 0.86%	33.08% ± 0.23%
	CIFAR10	p_k ~ Dir(0.5)	65.61% ± 1.52%	64.16% ± 0.25%	25.74% ± 1.65%	64.66% ± 1.05%	60.62% ± 0.84%	59.41% ± 0.4%	38.95% ± 5.79%	64.69% ± 0.82%	63.43% ± 0.102%
		#C = 1	9.64% ± 0.54%	9.46% ± 0.77%	10.11% ± 0.34%	11.66% ± 2.9%	19.4% ± 0.11%	22.79% ± 3.63%	10% ± 0.0%	10% ± 0.0%	11.20% ± 1.43%
		#C = 2	47.83% ± 3.29%	49.82% ± 0.66%	17.44% ± 1.79%	46.27% ± 2.21%	46.63% ± 0.97%	47.38% ± 1.41%	10% ± 0.0%	42.23% ± 1.0%	48.36% ± 1.50%
		#C = 3	62.65% ± 1.49%	63.57% ± 0.2%	20.11% ± 2.02%	61.7% ± 1.51%	59.9% ± 1.06%	59.02% ± 0.34%	38.34% ± 3.52%	61.21% ± 0.74%	61.93% ± 0.89%
	FedISIC2019	p_k ~ Dir(0.5)	56.46% ± 0.30%	56.23% ± 0.49%	28.30% ± 14.09%	33.61% ± 0.55%	53.80% ± 0.60%	54.14% ± 0.38%	48.27% ± 0.07%	55.90% ± 0.29%	56.24% ± 0.55%
		#C = 1	18.34% ± 0.0%	28.3% ± 14.09%	48.22% ± 0.0%	22.56% ± 19.46%	48.55% ± 0.47%	48.22% ± 0.0%	48.22% ± 0.0%	37.02% ± 15.93%	27.02% ± 15.08%
		#C = 2	40.09% ± 5.30%	38.52% ± 14.27%	48.22% ± 0.0%	28.95% ± 14.39%	52.18% ± 0.87%	51.84% ± 0.53%	48.22% ± 0.0%	48.43% ± 1.16%	48.10% ± 1.58%
		#C = 3	47.93% ± 1.71%	48.02% ± 2.41%	48.22% ± 0.0%	40.68% ± 4.82%	52.25% ± 0.81%	51.92% ± 1.13%	38.26% ± 14.09%	50.52% ± 2.15%	49.18% ± 0.26%
	Adult	p_k ~ Dir(0.5)	81.73% ± 2.62%	83.48% ± 2.17%	76.49% ± 0.0%	69.07% ± 2.05%	83.64% ± 1.71%	83.31% ± 1.29%	82.98% ± 0.92%	82.39% ± 2.40%	84.39% ± 0.49%
		#C = 1	80.21% ± 3.35%	81.06% ± 2.98%	76.49% ± 0.0%	54.08% ± 0.29%	85.03% ± 0.23%	84.03% ± 1.48%	84.70% ± 0.57%	77.88% ± 4.82%	79.22% ± 5.57%
Number of times each algorithm performs best			3	4	2	1	8	4	2	2	1

Feature distribution skew	MNIST	x̂ ~ Gau(0.1)	99.23% ± 0.07%	99.26% ± 0.06%	98.19% ± 0.12%	99.24% ± 0.03%	99.28% ± 0.07%	99.22% ± 0.06%	99.05% ± 0.09%	99.24% ± 0.02%	99.28% ± 0.02%
	FMNIST		84.35% ± 0.13%	84.65% ± 0.30%	72.68% ± 0.44%	84.43% ± 0.03%	83.99% ± 0.25%	84.57% ± 0.26%	82.64% ± 0.51%	84.57% ± 0.12%	84.72% ± 0.14%
	SVHN		68.85% ± 0.67%	69.39% ± 0.40%	70.25% ± 0.41%	69.75% ± 1.20%	73.14% ± 1.03%	72.8% ± 0.56%	70.09% ± 0.92%	70.09% ± 0.73%	69.27% ± 1.75%
	CINIC10		35.66% ± 1.05%	34.51% ± 1.01%	31.97% ± 0.95%	33.84% ± 0.70%	23.10% ± 9.29%	35.44% ± 0.73%	23.44% ± 9.50%	34.18% ± 0.5%	33.82% ± 0.88%
	CIFAR10		64.02% ± 0.08%	63.59% ± 0.09%	47.04% ± 1.01%	63.16% ± 1.13%	63.48% ± 0.88%	63.36% ± 0.42%	51.72% ± 0.75%	64.93% ± 0.34%	63.8% ± 0.56%
	FedISIC2019		54.75% ± 0.56%	55.11% ± 0.12%	49.56% ± 0.68%	54.78% ± 0.04%	52.36% ± 0.52%	52.7% ± 0.99%	48.87% ± 0.88%	54.96% ± 0.86%	55.3% ± 0.46%
	FCUBE	synthetic	99.57% ± 0.05%	99.67% ± 0.12%	97.07% ± 1.28%	99.87% ± 0.12%	99.53% ± 0.12%	99.67% ± 0.12%	99.67% ± 0.12%	99.7% ± 0.14%	99.67% ± 0.09%
Number of times each algorithm performs best			1	0	0	1	2	0	0	1	3

Quantity skew	MNIST	q ~ Dir(0.5)	99.02% ± 0.05%	99.01% ± 0.07%	97.3% ± 0.02%	98.95% ± 0.05%	99.08% ± 0.04%	98.89% ± 0.05%	98.61% ± 0.07%	99.07% ± 0.06%	99.08% ± 0.06%
	FMNIST		88.80% ± 0.05%	88.46% ± 0.31%	80.53% ± 0.8%	88.99% ± 0.11%	88.95% ± 0.35%	88.75% ± 0.05%	86.41% ± 0.16%	89.09% ± 0.17%	88.08% ± 0.0%
	SVHN		87.38% ± 1.27%	87.99% ± 0.38%	77.08% ± 0.97%	87.52% ± 0.44%	88.80% ± 0.06%	87.44% ± 0.05%	85.08% ± 0.28%	87.85% ± 0.82%	87.94% ± 0.13%
	CINIC10		38.73% ± 0.55%	38.56% ± 0.39%	37.61% ± 0.40%	38.55% ± 0.29%	26.96% ± 12.0%	39.23% ± 0.48%	26.63% ± 11.76%	38.22% ± 0.67%	39.73% ± 0.45%
	CIFAR10		71.13% ± 0.37%	71.01% ± 0.39%	46.14% ± 0.07%	71.72% ± 0.58%	69.40% ± 0.38%	68.24% ± 0.68%	25.90% ± 22.49%	71.6% ± 0.44%	71.23% ± 0.52%
	FedISIC2019		57.73% ± 0.29%	58.14% ± 0.55%	49.96% ± 0.75%	57.04% ± 0.53%	56.18% ± 0.63%	56.52% ± 0.18%	48.66% ± 0.63%	58.51% ± 0.66%	57.93% ± 0.2%
	Adult		84.26% ± 0.04%	84.48% ± 0.23%	85.55% ± 0.05%	83.98% ± 0.22%	83.83% ± 0.48%	83.5% ± 0.83%	83.83% ± 0.48%	83.6% ± 0.82%	83.93% ± 0.41%
Number of times each algorithm performs best			0	0	1	1	2	0	0	2	2

Homogeneous partition	MNIST	IID	99.11% ± 0.05%	99.10% ± 0.01%	98.27% ± 0.09%	99.15% ± 0.04%	99.11% ± 0.07%	98.99% ± 0.11%	98.78% ± 0.02%	99.19% ± 0.03%	99.09% ± 0.07%
	FMNIST		89.27% ± 0.23%	89.21% ± 0.24%	83.69% ± 0.45%	89.29% ± 0.08%	89.65% ± 0.03%	89.01% ± 0.11%	87.54% ± 0.57%	89.35% ± 0.16%	89.16% ± 0.36%
	SVHN		88.07% ± 0.36%	88.30% ± 0.19%	82.05% ± 0.24%	87.30% ± 1.51%	60.14% ± 28.69%	87.32% ± 0.34%	61.30% ± 29.50%	87.63% ± 1.65%	88.35% ± 0.21%
	CINIC10		39.99% ± 0.88%	39.82% ± 0.26%	41.58% ± 0.48%	40.18% ± 0.41%	36.91% ± 1.39%	40.83% ± 0.56%	36.58% ± 0.93%	40.85% ± 1.13%	39.92% ± 0.92%
	CIFAR10		72.59% ± 0.51%	74.14% ± 0.01%	51.88% ± 0.45%	72.23% ± 0.31%	69.53% ± 0.59%	68.47% ± 0.37%	65.79% ± 2.08%	69.16% ± 0.47%	72.36% ± 0.32%
	FedISIC2019		59.15% ± 0.83%	60.25% ± 0.12%	52.59% ± 0.62%	58.91% ± 0.30%	51.82% ± 1.85%	51.49% ± 2.32%	51.49% ± 2.32%	59.23% ± 0.54%	58.9% ± 0.31%
	Adult		84.24% ± 0.39%	84.10% ± 0.16%	85.59% ± 0.12%	83.97% ± 0.06%	83.9% ± 0.1%	84.23% ± 0.51%	83.90% ± 0.10%	84.42% ± 0.97%	84.09% ± 0.51%
Number of times each algorithm performs best			0	2	2	0	1	0	0	1	1

Legend: Best Result; High Performance (>80%); Medium Performance (30-80%); Low Performance (<30%)
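
The "number of times best" rows and the legend bands can be reproduced from the mean accuracies alone. The sketch below does this for the quantity-skew rows, with the means copied from the table. It assumes that ties credit every tied algorithm (which is consistent with the counts reported for that block) and that the band boundaries of 80% and 30% both fall in the medium band; the function names are illustrative.

```python
ALGORITHMS = ["FedAvg", "FedProx", "SCAFFOLD", "FedNova", "FedAdagrad",
              "FedYogi", "FedAdam", "MOON", "FedBN"]

# Mean test accuracy (%) per dataset, copied from the quantity-skew rows.
QUANTITY_SKEW = {
    "MNIST":       [99.02, 99.01, 97.30, 98.95, 99.08, 98.89, 98.61, 99.07, 99.08],
    "FMNIST":      [88.80, 88.46, 80.53, 88.99, 88.95, 88.75, 86.41, 89.09, 88.08],
    "SVHN":        [87.38, 87.99, 77.08, 87.52, 88.80, 87.44, 85.08, 87.85, 87.94],
    "CINIC10":     [38.73, 38.56, 37.61, 38.55, 26.96, 39.23, 26.63, 38.22, 39.73],
    "CIFAR10":     [71.13, 71.01, 46.14, 71.72, 69.40, 68.24, 25.90, 71.60, 71.23],
    "FedISIC2019": [57.73, 58.14, 49.96, 57.04, 56.18, 56.52, 48.66, 58.51, 57.93],
    "Adult":       [84.26, 84.48, 85.55, 83.98, 83.83, 83.50, 83.83, 83.60, 83.93],
}


def count_best(rows):
    """Count, per algorithm, how many rows it tops (ties credit all winners)."""
    counts = {name: 0 for name in ALGORITHMS}
    for accuracies in rows.values():
        best = max(accuracies)
        for name, acc in zip(ALGORITHMS, accuracies):
            if acc == best:
                counts[name] += 1
    return counts


def performance_band(acc):
    """Map a mean accuracy (%) to the legend's high / medium / low band."""
    if acc > 80:
        return "high"
    if acc >= 30:  # assumption: both boundaries belong to the medium band
        return "medium"
    return "low"


if __name__ == "__main__":
    print(count_best(QUANTITY_SKEW))
    print(performance_band(85.55), performance_band(39.73), performance_band(25.90))
```

Counting wins on means ignores the reported standard deviations, so two algorithms whose intervals overlap are still ranked strictly; the counts are best read alongside the ± values in each row.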

Training Curves: Comparative Performance

The training curves of different approaches on CIFAR-10. Panels: #C = 1; #C = 2; #C = 3; p_k ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

The training curves of different approaches on CINIC10. Panels: #C = 1; #C = 2; #C = 3; p_k ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

The training curves of different approaches on MNIST. Panels: #C = 1; #C = 2; #C = 3; p_k ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

The training curves of different approaches on SVHN. Panels: #C = 1; #C = 2; #C = 3; p_k ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

The training curves of different approaches on FedISIC2019. Panels: #C = 1; #C = 2; #C = 3; p_k ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

Local Epoch Sensitivity Analysis

Test accuracy with different numbers of local epochs on FedISIC2019. Panels: #C = 1; #C = 2; #C = 3; p_k ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

Test accuracy with different numbers of local epochs on CIFAR-10. Panels: #C = 1; #C = 2; #C = 3; p_k ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

Test accuracy with different numbers of local epochs on CINIC10. Panels: #C = 1; #C = 2; #C = 3; p_k ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

Test accuracy with different numbers of local epochs on MNIST. Panels: #C = 1; #C = 2; #C = 3; p_k ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).

Test accuracy with different numbers of local epochs on SVHN. Panels: #C = 1; #C = 2; #C = 3; p_k ~ Dir(0.5); x̂ ~ Gau(0.1); q ~ Dir(0.5).
