Benchmarking Robustness and Generalization in Multi-Agent Systems