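Data security is a major concern for anyone doing data analysis. The key data we analyze and the key scripts we use all need to be backed up regularly.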
scp
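The simplest way to back up is to use the cp (local disk) or scp (remote disk) command to make a copy of your result files, and to copy them again whenever something changes. The commands are: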
- cp -fur source_project project_bak
- scp -r source_project user@remote_server_ip:project_bak
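One detail worth checking before scheduling the cp command: once project_bak already exists, cp -r places a source_project directory inside it instead of refreshing its contents. The sketch below assumes GNU cp and works in a throwaway directory created with mktemp (the file names are illustrative, not from the original). It shows the difference and one way to copy the directory contents instead.

```shell
#!/bin/sh
# Illustrative sandbox; every path below is made up for this demo.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir source_project
echo "v1" > source_project/result.txt

# First run: project_bak does not exist yet, so it becomes a copy of source_project.
cp -fur source_project project_bak
ls project_bak

# Second run: project_bak exists, so cp creates project_bak/source_project inside it.
cp -fur source_project project_bak
ls project_bak

# Copying "source_project/." targets the directory contents, so repeated runs
# refresh the same files; -u skips files that are not newer than the existing copy.
rm -rf project_bak/source_project
sleep 1                                   # make sure the new file has a later mtime
echo "v2" > source_project/result.txt
cp -fur source_project/. project_bak/
cat project_bak/result.txt
```

With the trailing "/." form the same command can be run repeatedly without nesting directories, which is what a scheduled job needs.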
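To back up on a schedule, we can put the commands above into crontab and have them run at 23:00 every day. For backups to a remote server, we can set up password-free (key-based) login so that the backup can run unattended.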
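- # Crontab format
- # Minute Hour Day Month Week command
- # * means every minute/hour/day/month/week
- # Run cp at 23:00 every day
- 0 23 * * * cp -fur source_project project_bak
- # */2 means run every 2 minutes/hours/days/months/weeks
- # Run cp every 24 hours (hours range over 0-23, so */24 effectively means once a day at 00:00)
- 0 */24 * * * cp -fur source_project project_bak
- 0 0 */1 * * scp -r source_project user@remote_server_ip:project_bak
- # crontab also has a special time specifier
- # @reboot: run the given command at boot
- @reboot cmd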
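Cron typically runs jobs with a reduced environment and from the user's home directory, so relative paths such as source_project may not point where you expect. A small wrapper script with absolute paths and a log line is one way to handle this. The sketch below is an assumption-laden example: the script name backup.sh, the BACKUP_ROOT variable, and the log file are hypothetical, and the default root is a temp directory so the script can be tried safely.

```shell
#!/bin/sh
# backup.sh: hypothetical wrapper for cron to call; names and paths are illustrative.
# BACKUP_ROOT defaults to a temp directory so the sketch can be tried without risk.
BACKUP_ROOT=${BACKUP_ROOT:-$(mktemp -d)}
SRC="$BACKUP_ROOT/source_project"
DEST="$BACKUP_ROOT/project_bak"
LOG="$BACKUP_ROOT/backup.log"

mkdir -p "$SRC" "$DEST"
# Demo data only; a real source directory would already have content.
[ -e "$SRC/result.txt" ] || echo "demo data" > "$SRC/result.txt"

# Copy the directory contents and record the outcome with a timestamp.
if cp -fur "$SRC"/. "$DEST"/; then
    status="OK"
else
    status="FAILED"
fi
echo "$(date '+%Y-%m-%d %H:%M:%S') backup $status" >> "$LOG"

# A matching crontab entry (added with 'crontab -e') could look like:
# 0 23 * * * /home/user/bin/backup.sh
```

Checking the last line of the log each morning is a quick way to confirm that the scheduled job actually ran.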
rsync
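cp and scp are simple to use, but every run copies all the files again. When there is a lot to copy, repeated copying wastes both time and disk I/O.
rsync, by contrast, is an incremental backup tool: it synchronizes only the changed parts of the files that have changed, which greatly reduces the amount of data transferred and the transfer time. It is used as follows: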
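- # Back up the contents of the local project directory to /backup/project on the remote server
- # Note the trailing slash after the first project: it means "copy the contents of the directory" without creating a new project folder inside the target. Compare with the second command; both achieve the same result.
- # -a: archive mode, equals -rlptgoD
- # -r: sync recursively
- # -p: preserve the permissions of the original files
- # -u: skip files that are newer on the remote side, to avoid overwriting remote changes
- # -L: copy the file a symlink points to instead of the link itself, so links do not break on the remote server because of mismatched paths
- # -t: preserve modification times
- # -v: show what is being updated
- # -z: compress files during transfer, useful when the connection is slow
- rsync -aruLptvz --delete project/ user@remoteServer:/backup/project
- rsync -aruLptvz --delete project user@remoteServer:/backup/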
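The trailing-slash rule is easy to verify locally before pointing rsync at a real server. The sketch below is a minimal demo under stated assumptions: a temp directory stands in for the remote side, the backup1 and backup2 directory names are made up, and the demo is skipped when rsync is not installed.

```shell
#!/bin/sh
# Local demo of the trailing-slash rule; a temp directory stands in for the remote server.
rs_demo=$(mktemp -d)
mkdir -p "$rs_demo/project/sub" "$rs_demo/backup1" "$rs_demo/backup2"
echo "data" > "$rs_demo/project/sub/a.txt"

if command -v rsync >/dev/null 2>&1; then
    # Trailing slash: copy the contents of project/ into backup1/project
    rsync -a --delete "$rs_demo/project/" "$rs_demo/backup1/project"
    # No trailing slash: copy the directory itself into backup2/
    rsync -a --delete "$rs_demo/project" "$rs_demo/backup2/"
    # Both destinations now hold project/sub/a.txt; diff prints nothing when they match.
    diff -r "$rs_demo/backup1" "$rs_demo/backup2" && echo "identical layout"
else
    echo "rsync is not installed; skipping demo"
fi
```

Running the same check against a scratch directory is a cheap safeguard, because --delete removes anything in the target that is missing from the source.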
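What rsync produces is a mirror: it keeps the remote server identical to the local files. If the local files are fine, the remote copy is fine too. But if files are deleted by mistake, or damaged by a faulty program run, and you do not notice before the next sync, the remote backup loses its value because it is damaged as well. Accidental deletions are fairly easy to spot and can be corrected in time; damage caused by a misbehaving program is not always noticed.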
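The mirroring behaviour described above can be reproduced in a few lines. The following sketch assumes rsync is installed and uses a temp directory with made-up file names. It simulates an accidental deletion, previews the sync with -n (--dry-run), and then shows that the real run removes the file from the mirror too.

```shell
#!/bin/sh
# Demo of how a mirror propagates a deletion; all paths are illustrative.
mir=$(mktemp -d)
mkdir -p "$mir/project" "$mir/backup"
echo "keep" > "$mir/project/keep.txt"
echo "important" > "$mir/project/important.txt"

if command -v rsync >/dev/null 2>&1; then
    rsync -a --delete "$mir/project/" "$mir/backup/"     # initial mirror

    rm "$mir/project/important.txt"                      # simulate an accidental deletion

    # -n (--dry-run) reports what would change without touching the backup
    rsync -avn --delete "$mir/project/" "$mir/backup/"
    if [ -f "$mir/backup/important.txt" ]; then
        echo "dry run left the backup untouched"
    fi

    # The real run removes the file from the mirror as well
    rsync -a --delete "$mir/project/" "$mir/backup/"
    ls "$mir/backup"
else
    echo "rsync is not installed; skipping demo"
fi
```

A dry run only helps if someone reads its output, which is why a tool that keeps history is the safer choice for unattended backups.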
rdiff-backup
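Here I recommend rdiff-backup, a tool that not only does incremental backups but also keeps the state of every backup together with the differences between each new backup and the previous one, so you can easily go back to an earlier version. The only requirement is that the local and remote servers have the same version of rdiff-backup installed. Two other tools, duplicity and rsnapshot, can do similar work, but they take different approaches and use different amounts of disk space; see the comparisons in the links under References.
The installation and usage of rdiff-backup are described below (these notes were originally written in English):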
- Install rdiff-backup at both local and remote computers
- #install for ubuntu, debian
- sudo apt-get install python-dev librsync-dev
- #self compile
- #download librsync from https://sourceforge.net/project/showfiles.php?group_id=56125
- tar xvzf librsync-0.9.7.tar.gz
- cd librsync-0.9.7
- export CFLAGS="$CFLAGS -fPIC"
- ./configure --prefix=/home/user/rsync --with-pic
- make
- make install
- Install rdiff-backup
- #See Reference part for download link
- # http://www.nongnu.org/rdiff-backup/
- python setup.py install --prefix=/home/user/rdiff-backup
- #If you compiled librsync yourself, please specify its install location
- python setup.py --librsync-dir=/home/user/rsync install --prefix=/home/user/rdiff-backup
- Add executable files and Python modules to environment variables
- #Add the following lines to .bashrc, .bash_profile or any other shell config file
- export PATH=${PATH}:/home/user/rdiff-backup/bin
- export PYTHONPATH=${PYTHONPATH}:/home/user/rdiff-backup/lib/python2.x/site-packages
- #pay attention to the x in python2.x of above line which can be 6 or 7 depending on
- #the Python version used.
- Test environment variables when executing commands through ssh
- ssh user@host 'echo ${PATH}' #When I ran this command from my local computer,
- #I found that only the system environment variables were used
- #and none of my self-defined environment variables were set.
- #So I modified the following lines in the file 'SetConnections.py' in
- #/home/user/rdiff-backup/lib/python2.x/site-packages/rdiff_backup
- #to set the environment explicitly at login.
- #Pay attention to the single quotes used inside the double quotes.
- __cmd_schema = "ssh -C %s 'source ~/.bash_profile; rdiff-backup --server'"
- __cmd_schema_no_compress = "ssh %s 'source ~/.bash_profile; rdiff-backup --server'"
- #Source whichever of .bash_profile and .bashrc contains the environment variables for rdiff-backup.
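Editing SetConnections.py works, but the change is lost when rdiff-backup is reinstalled. A less invasive alternative may be the --remote-schema option, which the rdiff-backup manual describes for cases such as rdiff-backup not being in the PATH on the remote side. The sketch below is not the author's method. The run helper and the DRY_RUN variable are hypothetical additions so the command can be inspected without contacting a server, and user@host and the paths are the article's placeholders.

```shell
#!/bin/sh
# Sketch only: user@host and the paths are the placeholders used in this article.
# DRY_RUN=1 (the default here) prints the command instead of contacting the server.
REMOTE_SCHEMA="ssh -C %s 'source ~/.bash_profile; rdiff-backup --server'"

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run rdiff-backup --remote-schema "$REMOTE_SCHEMA" --print-statistics \
    user@host::/home/user/source_dir destination_dir
```

Setting DRY_RUN=0 executes the command for real; whether the schema string needs .bash_profile or .bashrc depends on where the remote PATH is set.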
Use rdiff-backup
- Start backup
rdiff-backup --no-compression --print-statistics user@host::/home/user/source_dir destination_dir
If destination_dir already exists, add --force, as in rdiff-backup --no-compression --force --print-statistics user@host::/home/user/source_dir destination_dir. Everything in the original destination_dir will be deleted.
To exclude or include particular files or directories, specify them with options such as --exclude '**trash' or --include /home/user/source_dir/important.
- Timely backup your data
Add the above command to crontab (run 'crontab -e' in a terminal to edit it) in a format like 5 22 */1 * * command, which executes the command at 22:05 every day.
- Restore data
Restore the latest data by running rdiff-backup -r now destination_dir user@host::/home/user/source_dir.restore. Add --force if you want to restore to source_dir.
Restore files 10 days ago by running rdiff-backup -r 10D destination_dir user@host::/home/user/source_dir.restore. Other acceptable time formats include 5m4s (5 minutes 4 seconds) and 2014-01-01 (January 1st, 2014).
Restore files from an increment file by running rdiff-backup destination_dir/rdiff-backup-data/increments/server_add.2014-02-21T09:22:45+08:00.missing user@host::/home/user/source_dir.restore/server_add. Increment files are stored under destination_dir/rdiff-backup-data/increments/, for example server_add.2014-02-21T09:22:45+08:00.missing.
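The backup and restore steps can be rehearsed locally before involving a remote host. The sketch below assumes rdiff-backup is installed and accepts the classic command-line syntax used in this article (newer releases may print deprecation warnings or require a different syntax). The temp directory and the note.txt file are made up for the demo, and 1B is the session-count time format, meaning one backup session back.

```shell
#!/bin/sh
# Local round trip in a temp directory; no remote host is involved.
rb=$(mktemp -d)
mkdir -p "$rb/source_dir"
echo "v1" > "$rb/source_dir/note.txt"

if command -v rdiff-backup >/dev/null 2>&1; then
    rdiff-backup "$rb/source_dir" "$rb/destination_dir"      # first session
    sleep 2                                                  # sessions need distinct timestamps
    echo "v2" > "$rb/source_dir/note.txt"
    rdiff-backup "$rb/source_dir" "$rb/destination_dir"      # second session

    # Show the sessions recorded so far
    rdiff-backup --list-increments "$rb/destination_dir"

    # Restore the latest state and the state one session earlier into new directories
    rdiff-backup -r now "$rb/destination_dir" "$rb/restore_now"
    rdiff-backup -r 1B "$rb/destination_dir" "$rb/restore_old"
    cat "$rb/restore_now/note.txt" "$rb/restore_old/note.txt"
else
    echo "rdiff-backup is not installed; skipping demo"
fi
```

Restoring into a fresh directory, as done here, avoids the need for --force and leaves the source untouched while you inspect the result.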
- Remove older records to save space
Delete all information about file versions that have not been current for 2 weeks by running rdiff-backup --remove-older-than 2W --force destination_dir. Note that an existing file which has not changed for a year will still be preserved, but a file which was deleted 15 days ago cannot be restored after this command. Normally you should add --force, because it is needed to delete multiple increments at the same time, which --remove-older-than refuses to do by default.
Keep only the last 20 rdiff-backup sessions by running rdiff-backup --remove-older-than 20B --force destination_dir.
- Statistics
List the increments in a given folder by running rdiff-backup --list-increments destination_dir/.
List the files changed in the last 5 days by running rdiff-backup --list-changed-since 5D destination_dir/.
Compare the source with the backup by running rdiff-backup --compare user@host::source-dir destination_dir.
Compare the source with the backup as it was two weeks ago by running rdiff-backup --compare-at-time 2W user@host::source-dir destination_dir.
A complete script (automatically sync using crontab)
- #!/bin/bash
- export PYTHONPATH=${PYTHONPATH}:/soft/rdiff_backup/lib/python2.7/site-packages/
- rdiff-backup --no-compression -v5 --exclude '**trash' user@server::source/ bak_dir/
- ret=$?
- if test $ret -ne 0; then
- echo "Wrong in bak" | mutt -s "Wrong in bak" [email protected]
- else
- echo "Right in bak" | mutt -s "Right in bak" [email protected]
- fi
- echo "Finish rdiff-backup $0 ---`date`---" >>bak.log 2>&1
- echo "`rdiff-backup --exclude '**trash' --compare-at-time 1D user@server::source/ bak_dir/`" | mutt -s "Lists of baked files" [email protected]
References
- rdiff-backup
- duplicity
- rsnapshot
- http://www.saltycrane.com/blog/2008/02/backup-on-linux-rsnapshot-vs-rdiff/
- http://james.lab6.com/2008/07/09/rdiff-backup-and-duplicity/
- http://bitflop.com/document/75
- http://askubuntu.com/questions/2596/comparison-of-backup-tools
- http://www.reddit.com/r/linux/comments/fgmbb/rdiffbackup_duplicity_or_rsnapshot_which_is/
- http://serverfault.com/questions/491341/optimize-space-rdiff-backup
- Another great post on usage of rdiff-backup
Original article: https://mp.weixin.qq.com/s/Ovl46SbnQLc5q6Rz3Iaczg