Setting up data shuffling with DDP
When using DDP, you must pass a sampler to the DataLoader (torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, drop_last=False)). shuffle defaults to True, but look at how PyTorch's DistributedSampler is implemented:
def __iter__(self) -> Iterator[T_co]:
    if self.shuffle:
        # deterministically shuffle based on epoch and seed
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)
        indices = torch.randperm(len(self.dataset), generator=g).tolist()  # type: ignore
    else:
        indices = list(range(len(self.dataset)))  # type: ignore
The seed that generates the random indices depends on the current epoch, so you have to set the epoch manually during training to get a genuinely different shuffle each epoch:
for epoch in range(start_epoch, n_epochs):
    if is_distributed:
        sampler.set_epoch(epoch)
    train(loader)
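To see why this matters, here is a minimal sketch (the toy TensorDataset and the explicit num_replicas/rank values are assumptions for illustration; passing them avoids needing an initialized process group). Without set_epoch, the sampler keeps using seed + epoch 0 and yields the same permutation every epoch:

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# toy dataset of 8 items; num_replicas/rank given explicitly so no
# process group is required for this demonstration
dataset = TensorDataset(torch.arange(8))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=0)

print(list(iter(sampler)))  # indices for epoch 0
print(list(iter(sampler)))  # identical: self.epoch was never advanced
sampler.set_epoch(1)
print(list(iter(sampler)))  # a different permutation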
DDP: worse results when increasing the batch size
Large batch size:
Theoretical advantage:
the influence of noise in the data may shrink, which can make it easier to approach the optimum.
Drawbacks and problems:
it lowers the variance of the gradients (in theory, for convex optimization problems, lower gradient variance gives better optimization behavior; in practice, however, Keskar et al. showed that increasing the batch size leads to worse generalization);
for non-convex problems, the loss function contains many local optima; the gradient noise of a small batch size can help training jump out of a local optimum, whereas with a large batch size training may get stuck in one and never escape.
Remedies (see the sketch after this list):
increase the learning rate; but this can cause problems of its own, since a very large learning rate right from the start of training may keep the model from converging (https://arxiv.org/abs/1609.04836);
use warmup (https://arxiv.org/abs/1706.02677).
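The second reference also pairs warmup with the linear scaling rule: scale the base learning rate in proportion to the total batch size. A small illustrative calculation (the reference values 256 and 0.1 follow that paper's ImageNet setup; the 8x128 configuration is a made-up example):

reference_batch, reference_lr = 256, 0.1   # ImageNet baseline from the paper
total_batch = 8 * 128                      # e.g. 8 GPUs, per-GPU batch of 128
scaled_lr = reference_lr * total_batch / reference_batch
print(scaled_lr)                           # 0.4, reached gradually via warmup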
warmup
Using a very large learning rate right at the start of training can prevent convergence. The idea of warmup is to begin with a small learning rate, gradually raise it to the base learning rate as training progresses, and then continue with some other decay schedule (e.g. CosineAnnealingLR).
# copy from https://github.com/ildoonet/pytorch-gradual-warmup-lr/blob/master/warmup_scheduler/scheduler.py
from torch.optim.lr_scheduler import _LRScheduler
from torch.optim.lr_scheduler import ReduceLROnPlateau


class GradualWarmupScheduler(_LRScheduler):
    """ Gradually warm-up(increasing) learning rate in optimizer.
    Proposed in 'Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour'.

    Args:
        optimizer (Optimizer): Wrapped optimizer.
        multiplier: target learning rate = base lr * multiplier if multiplier > 1.0.
            if multiplier = 1.0, lr starts from 0 and ends up with the base_lr.
        total_epoch: target learning rate is reached at total_epoch, gradually
        after_scheduler: after target_epoch, use this scheduler(eg. ReduceLROnPlateau)
    """

    def __init__(self, optimizer, multiplier, total_epoch, after_scheduler=None):
        self.multiplier = multiplier
        if self.multiplier < 1.:
            raise ValueError('multiplier should be greater than or equal to 1.')
        self.total_epoch = total_epoch
        self.after_scheduler = after_scheduler
        self.finished = False
        super(GradualWarmupScheduler, self).__init__(optimizer)

    def get_lr(self):
        if self.last_epoch > self.total_epoch:
            if self.after_scheduler:
                if not self.finished:
                    self.after_scheduler.base_lrs = [base_lr * self.multiplier for base_lr in self.base_lrs]
                    self.finished = True
                return self.after_scheduler.get_last_lr()
            return [base_lr * self.multiplier for base_lr in self.base_lrs]

        if self.multiplier == 1.0:
            return [base_lr * (float(self.last_epoch) / self.total_epoch) for base_lr in self.base_lrs]
        else:
            return [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]

    def step_ReduceLROnPlateau(self, metrics, epoch=None):
        if epoch is None:
            epoch = self.last_epoch + 1
        # ReduceLROnPlateau is called at the end of epoch, whereas others are called at beginning
        self.last_epoch = epoch if epoch != 0 else 1
        if self.last_epoch <= self.total_epoch:
            warmup_lr = [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]
            for param_group, lr in zip(self.optimizer.param_groups, warmup_lr):
                param_group['lr'] = lr
        else:
            if epoch is None:
                self.after_scheduler.step(metrics, None)
            else:
                self.after_scheduler.step(metrics, epoch - self.total_epoch)

    def step(self, epoch=None, metrics=None):
        if type(self.after_scheduler) != ReduceLROnPlateau:
            if self.finished and self.after_scheduler:
                if epoch is None:
                    self.after_scheduler.step(None)
                else:
                    self.after_scheduler.step(epoch - self.total_epoch)
                self._last_lr = self.after_scheduler.get_last_lr()
            else:
                return super(GradualWarmupScheduler, self).step(epoch)
        else:
            self.step_ReduceLROnPlateau(metrics, epoch)
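A hedged usage sketch for the scheduler above, modeled on the README of the linked repo (the Linear stand-in model and the 5/95 epoch split are assumptions): warm up to the base learning rate over the first 5 epochs, then hand over to CosineAnnealingLR.

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 2)  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
cosine = CosineAnnealingLR(optimizer, T_max=95)
scheduler = GradualWarmupScheduler(optimizer, multiplier=1, total_epoch=5,
                                   after_scheduler=cosine)

for epoch in range(100):
    # train_one_epoch(...) and optimizer.step() would go here
    scheduler.step()
    print(epoch, optimizer.param_groups[0]['lr'])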
Pitfalls of multi-GPU distributed training with DistributedDataParallel
I recently spent some time looking into multi-GPU training. I expected it to be easy, but there were plenty of pitfalls to step through one by one. Distributed training generally comes in two flavors: single-machine multi-GPU and multi-machine multi-GPU.
There are two main ways to implement it:
1. DataParallel: parameter-server mode, with one GPU acting as the reducer; extremely simple to use, a single line of code.
DataParallel is based on the parameter-server algorithm and suffers from serious load imbalance; with large models (e.g. bert-large) the reducer GPU can end up using an extra 3-4 GB of memory.
2. DistributedDataParallel: the newer DDP, officially recommended, which uses the all-reduce algorithm. It was designed mainly for multi-machine multi-GPU training, but it works on a single machine too (a toy illustration of all-reduce follows).
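For intuition, all-reduce is the collective that DDP builds on to average gradients across processes: every rank contributes a tensor and every rank receives the combined result. A toy sketch (it assumes init_process_group has already been called, as in the code further below):

import torch
import torch.distributed as dist

# every rank holds its own value ...
t = torch.ones(1).cuda() * dist.get_rank()
# ... and after all_reduce every rank holds the sum over all ranks
dist.all_reduce(t, op=dist.ReduceOp.SUM)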
Why distributed training?
With multiple GPUs, training is faster overall.
It allows a larger batch size.
In some cases distributed training gives better results.
This post covers the following parts:
single-machine multi-GPU with DataParallel (most common, simplest)
single-machine multi-GPU with DistributedDataParallel (more advanced), and multi-machine multi-GPU with DistributedDataParallel (most advanced)
how to launch training
saving and loading models
caveats
1. Single-machine multi-GPU (DataParallel)
from torch.nn import DataParallel

device = torch.device("cuda")
# or: device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = MyModel()
model = model.to(device)
model = DataParallel(model)
# or: model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
This is the simple one: a single extra line, model = DataParallel(model), is all it takes.
2. Multi-machine multi-GPU and single-machine multi-GPU (DistributedDataParallel)
It is best to read the caveats section before changing your code, to avoid obscure bugs. Modify the training code as follows; note that opt.local_rank must be parsed as a command-line argument near the top of the script (see the caveats section below):
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch

# initialize the process group
dist_backend = 'nccl'
print('args.local_rank: ', opt.local_rank)
torch.cuda.set_device(opt.local_rank)
dist.init_process_group(backend=dist_backend)

model = yourModel()  # your own model
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # wrap the model
    # model = torch.nn.parallel.DistributedDataParallel(model,
    #                                                   device_ids=[opt.local_rank],
    #                                                   output_device=opt.local_rank)
    model = torch.nn.parallel.DistributedDataParallel(model.cuda(),
                                                      device_ids=[opt.local_rank])
device = torch.device(opt.local_rank)
model.to(device)

dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training,
                      img_size=opt.img_size, normalized_labels=True)  # your own dataset code
world_size = torch.cuda.device_count()
datasampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(),
                                 rank=opt.local_rank)

dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=opt.batch_size,
    shuffle=False,
    num_workers=opt.n_cpu,
    pin_memory=True,
    collate_fn=dataset.collate_fn,
    sampler=datasampler,
)  # just add the sampler argument to your original DataLoader

# ... during training, move the data to CUDA
imgs = imgs.to(device)
targets = targets.to(device)
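Tying this back to the first section: since shuffling is now handled by the sampler, remember to advance the sampler's epoch in the outer training loop (opt.epochs and the batch unpacking are assumptions matching the ListDataset example above):

for epoch in range(opt.epochs):
    datasampler.set_epoch(epoch)  # re-shuffle differently each epoch
    for batch_i, (paths, imgs, targets) in enumerate(dataloader):
        imgs = imgs.to(device)
        targets = targets.to(device)
        # forward / backward / optimizer.step() ...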
3. How to launch training
1. DataParallel
Launch training as usual:
python3 train.py
2. DistributedDataParallel
Launch through torch.distributed.launch. For a single node:
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 train.py
CUDA_VISIBLE_DEVICES selects which GPUs to use, and --nproc_per_node sets the number of processes per node; normally you use one process per available GPU.
For multiple nodes:
python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0 train.py  # two nodes; run on node 0
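For completeness, a hedged sketch of the full pair of commands for two nodes (MASTER_IP and port 29500 are placeholders; --master_addr and --master_port tell every node where rank 0 lives):

# on node 0:
python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0 --master_addr=MASTER_IP --master_port=29500 train.py
# on node 1:
python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=1 --master_addr=MASTER_IP --master_port=29500 train.py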
If training launches successfully, a startup message is printed for each process, one per GPU.
4. Saving and loading models
The a/b variants below match up: if you save with method a, load with method a.
1. Saving
a. Save only the parameters:
torch.save(model.module.state_dict(), path)
b. Save the parameters together with the network:
torch.save(model.module, path)
2. Loading
a. Loading pretrained weights when training on multiple GPUs:
model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights))
    else:
        model.load_darknet_weights(opt.pretrained_weights)
When loading on a single GPU, you must tell torch.load which device to map the model onto. The 'cuda:0' here must match the device the model was trained on, otherwise you get: RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device. Adjust it to your setup:
model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights, map_location="cuda:0"))
    else:
        model.load_darknet_weights(opt.pretrained_weights)
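One related gotcha, shown as a hedged sketch: if the checkpoint was saved from the wrapped model (torch.save(model.state_dict(), ...) rather than model.module.state_dict()), its keys carry a "module." prefix, and loading into a bare model fails until you strip it:

state_dict = torch.load(opt.pretrained_weights, map_location="cpu")
# strip the "module." prefix left by DataParallel/DistributedDataParallel
state_dict = {(k[len("module."):] if k.startswith("module.") else k): v
              for k, v in state_dict.items()}
model.load_state_dict(state_dict)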
b. Loading a whole model on a single GPU; again, specify the device to load onto.
model = torch.load(opt.weights_path, map_location="cuda:0")
I have not managed to load a pretrained model for multi-GPU training using method b.
5. Caveats
1. Add .module after model
After wrapping the model in a parallel wrapper and moving it and its parameters to the GPU, any access to a submodule or attribute of the model must go through model.module, otherwise you get an error. For example:
model.img_size  # must become model.module.img_size
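A tiny hedged illustration (net, local_rank and img_size are placeholders): the wrapper does not forward arbitrary attributes, so they must be read from the inner module.

model = torch.nn.parallel.DistributedDataParallel(net.cuda(), device_ids=[local_rank])
print(model.module.img_size)  # OK: attribute of the wrapped network
# print(model.img_size)       # AttributeError: the wrapper has no img_size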
2. .cuda vs .to(device) issues
The device is set by you; if a bare .cuda() call fails, move each of these to the appropriate device instead (a consolidated sketch follows this list):
the model (e.g. model.to(device))
the input (older code wraps it in Variable, e.g. input = Variable(input).to(device))
the target (likewise, traditionally wrapped in Variable)
nn.CrossEntropyLoss() (e.g. criterion = nn.CrossEntropyLoss().to(device))
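A consolidated hedged sketch (opt.local_rank as above; note that Variable is deprecated in modern PyTorch, so tensors can be moved directly):

device = torch.device("cuda", opt.local_rank)
model = model.to(device)
criterion = torch.nn.CrossEntropyLoss().to(device)
inputs, targets = inputs.to(device), targets.to(device)
loss = criterion(model(inputs), targets)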
3. The args.local_rank argument
When launching through torch.distributed.launch, the launcher passes each process an args.local_rank argument, so the training code must parse it. You can also obtain the process rank with torch.distributed.get_rank().
parser.add_argument("--local_rank", type=int, default=-1,
                    help="local rank of this process, set by torch.distributed.launch")
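And a hedged sketch of the programmatic alternative, handy for logging from only one process (it assumes the process group has been initialized):

import torch.distributed as dist

if dist.get_rank() == 0:
    print("only the rank-0 process prints this")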
The above is my personal experience; I hope it offers a useful reference.
Original post: https://blog.csdn.net/cuifan0814/article/details/115861782