[PyTorch] Notes from wrestling with FullyShardedDataParallel (FSDP)

TL;DR: FSDP does not seem ready for serious use yet.

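model.state_dict() is supported only in the nightly build (as of 1.11 being the stable release) [1]
- error case: when saving a RoBERTa model, only the current rank's flattened params are saved, not the full model (then again, saving every rank's shard and stitching them together carefully might actually work..)
- model._fsdp_wrapped_module._fpw_module.state_dict() behaves the same way
torch.cuda.amp is not supported [2, 3]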
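
The "save every rank's shard and stitch them together" idea from the error case above can be sketched roughly as follows. This is a toy model in plain Python with no torch dependency, and it is not FSDP's actual internal layout: the helper names flatten_and_shard and reassemble, the zero-padding scheme, and the (shape, values) representation are all assumptions made for illustration. The point is only that recombining per-rank flattened params needs the original names, shapes, and total element count saved alongside the shards.

```python
from math import prod


def flatten_and_shard(params, world_size):
    """Flatten all params into one list, pad it, and split it evenly per rank.

    params: dict mapping name -> (shape tuple, flat list of values).
    Returns (shards, meta, total). meta and total are what a real
    checkpoint would have to store in order to undo the flattening.
    """
    flat, meta = [], []
    for name, (shape, values) in params.items():
        numel = prod(shape)
        assert len(values) == numel, f"{name}: shape does not match data"
        flat.extend(values)
        meta.append((name, shape, numel))

    total = len(flat)
    chunk = -(-total // world_size)  # ceiling division
    flat += [0.0] * (chunk * world_size - total)  # pad so every rank gets equal size
    shards = [flat[r * chunk:(r + 1) * chunk] for r in range(world_size)]
    return shards, meta, total


def reassemble(shards, meta, total):
    """Concatenate the rank shards in rank order, drop padding, restore params."""
    flat = [v for shard in shards for v in shard][:total]
    restored, offset = {}, 0
    for name, shape, numel in meta:
        restored[name] = (shape, flat[offset:offset + numel])
        offset += numel
    return restored


# Hypothetical two-parameter "model" sharded across 3 ranks.
params = {
    "weight": ((2, 3), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
    "bias": ((2,), [0.1, 0.2]),
}
shards, meta, total = flatten_and_shard(params, world_size=3)
restored = reassemble(shards, meta, total)
```

In a real run each rank would write only its own shard to a rank-specific file, and a separate offline step would load all of them and do the reassembly. Whether the flattened params exposed by the 1.11-era state_dict() carry enough metadata for this is not something this sketch verifies.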

References

[1] https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html (2.4 Define a distributed train function that wraps the model in FSDP) (Note: this is official code, but it has a lot of typos, e.g. dist.barrier() written as dist_barrier(), a variable declared as States but used as states, and so on..)

[2] https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/ (Future Work Section)

[3] https://pytorch.org/docs/stable/notes/amp_examples.html (the Working with Multiple GPUs section has no FSDP entry)