Huffman編碼處理的是字符以及字符對應的二進制的編碼配對問題,分為編碼和解碼,目的是壓縮字符對應的二進制數據長度。我們知道字符存貯和傳輸的時候都是二進制的(計算機只認識0/1),那么就有字符與二進制之間的mapping關系。字符屬于字符集(Charset), 字符需要通過編碼(encode)為二進制進行存貯和傳輸,顯示的時候需要解碼(decode)回字符,字符集與編碼方法是一對多關系(Unicode可以用UTF-8,UTF-16等編碼)。理解了字符集,編碼以及解碼,滿天飛的亂碼問題也就游刃而解了。以英文字母小寫a為例, ASCII編碼中,十進制為97,二進制為01100001。ASCII的每一個字符都用8個Bit(1Byte)編碼,假如有1000個字符要傳輸,那么就要傳輸8000個Bit。問題來了,英文中字母e的使用頻率為12.702%,而z為0.074%,前者是后者的100多倍,但是確使用相同位數的二進制。可以做得更好,方法就是可變長度編碼,指導原則就是頻率高的用較短的位數編碼,頻率低的用較長位數編碼。Huffman編碼算法就是處理這樣的問題。
Huffman編碼Java實現
Huffman編碼算法主要用到的數據結構是完全二叉樹(full binary tree)和優先級隊列。后者用的是Java.util.PriorityQueue,前者自己實現(都為內部類),代碼如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
|
static class Tree { private Node root; public Node getRoot() { return root; } public void setRoot(Node root) { this .root = root; } } static class Node implements Comparable<Node> { private String chars = "" ; private int frequence = 0 ; private Node parent; private Node leftNode; private Node rightNode; @Override public int compareTo(Node n) { return frequence - n.frequence; } public boolean isLeaf() { return chars.length() == 1 ; } public boolean isRoot() { return parent == null ; } public boolean isLeftChild() { return parent != null && this == parent.leftNode; } public int getFrequence() { return frequence; } public void setFrequence( int frequence) { this .frequence = frequence; } public String getChars() { return chars; } public void setChars(String chars) { this .chars = chars; } public Node getParent() { return parent; } public void setParent(Node parent) { this .parent = parent; } public Node getLeftNode() { return leftNode; } public void setLeftNode(Node leftNode) { this .leftNode = leftNode; } public Node getRightNode() { return rightNode; } public void setRightNode(Node rightNode) { this .rightNode = rightNode; } } |
統計數據
既然要按頻率來安排編碼表,那么首先當然得獲得頻率的統計信息。我實現了一個方法處理這樣的問題。如果已經有統計信息,那么轉為Map<Character,Integer>即可。如果你得到的信息是百分比,乘以100或1000,或10000。總是可以轉為整數。比如12.702%乘以1000為12702,Huffman編碼只關心大小問題。統計方法實現如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
|
public static Map<Character, Integer> statistics( char [] charArray) { Map<Character, Integer> map = new HashMap<Character, Integer>(); for ( char c : charArray) { Character character = new Character(c); if (map.containsKey(character)) { map.put(character, map.get(character) + 1 ); } else { map.put(character, 1 ); } } return map; } |
構建樹
構建樹是Huffman編碼算法的核心步驟。思想是把所有的字符掛到一顆完全二叉樹的葉子節點,任何一個非頁子節點的左節點出現頻率不大于右節點。算法為把統計信息轉為Node存放到一個優先級隊列里面,每一次從隊列里面彈出兩個最小頻率的節點,構建一個新的父Node(非葉子節點), 字符內容剛彈出來的兩個節點字符內容之和,頻率也是它們的和,最開始的彈出來的作為左子節點,后面一個作為右子節點,并且把剛構建的父節點放到隊列里面。重復以上的動作N-1次,N為不同字符的個數(每一次隊列里面個數減1)。結束以上步驟,隊列里面剩一個節點,彈出作為樹的根節點。代碼如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
|
private static Tree buildTree(Map<Character, Integer> statistics, List<Node> leafs) { Character[] keys = statistics.keySet().toArray( new Character[ 0 ]); PriorityQueue<Node> priorityQueue = new PriorityQueue<Node>(); for (Character character : keys) { Node node = new Node(); node.chars = character.toString(); node.frequence = statistics.get(character); priorityQueue.add(node); leafs.add(node); } int size = priorityQueue.size(); for ( int i = 1 ; i <= size - 1 ; i++) { Node node1 = priorityQueue.poll(); Node node2 = priorityQueue.poll(); Node sumNode = new Node(); sumNode.chars = node1.chars + node2.chars; sumNode.frequence = node1.frequence + node2.frequence; sumNode.leftNode = node1; sumNode.rightNode = node2; node1.parent = sumNode; node2.parent = sumNode; priorityQueue.add(sumNode); } Tree tree = new Tree(); tree.root = priorityQueue.poll(); return tree; } |
編碼
某個字符對應的編碼為,從該字符所在的葉子節點向上搜索,如果該字符節點是父節點的左節點,編碼字符之前加0,反之如果是右節點,加1,直到根節點。只要獲取了字符和二進制碼之間的mapping關系,編碼就非常簡單。代碼如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
|
public static String encode(String originalStr, Map<Character, Integer> statistics) { if (originalStr == null || originalStr.equals( "" )) { return "" ; } char [] charArray = originalStr.toCharArray(); List<Node> leafNodes = new ArrayList<Node>(); buildTree(statistics, leafNodes); Map<Character, String> encodInfo = buildEncodingInfo(leafNodes); StringBuffer buffer = new StringBuffer(); for ( char c : charArray) { Character character = new Character(c); buffer.append(encodInfo.get(character)); } return buffer.toString(); } private static Map<Character, String> buildEncodingInfo(List<Node> leafNodes) { Map<Character, String> codewords = new HashMap<Character, String>(); for (Node leafNode : leafNodes) { Character character = new Character(leafNode.getChars().charAt( 0 )); String codeword = "" ; Node currentNode = leafNode; do { if (currentNode.isLeftChild()) { codeword = "0" + codeword; } else { codeword = "1" + codeword; } currentNode = currentNode.parent; } while (currentNode.parent != null ); codewords.put(character, codeword); } return codewords; } |
解碼
因為Huffman編碼算法能夠保證任何的二進制碼都不會是另外一個碼的前綴,解碼非常簡單,依次取出二進制的每一位,從樹根向下搜索,1向右,0向左,到了葉子節點(命中),退回根節點繼續重復以上動作。代碼如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
|
public static String decode(String binaryStr, Map<Character, Integer> statistics) { if (binaryStr == null || binaryStr.equals( "" )) { return "" ; } char [] binaryCharArray = binaryStr.toCharArray(); LinkedList<Character> binaryList = new LinkedList<Character>(); int size = binaryCharArray.length; for ( int i = 0 ; i < size; i++) { binaryList.addLast( new Character(binaryCharArray[i])); } List<Node> leafNodes = new ArrayList<Node>(); Tree tree = buildTree(statistics, leafNodes); StringBuffer buffer = new StringBuffer(); while (binaryList.size() > 0 ) { Node node = tree.root; do { Character c = binaryList.removeFirst(); if (c.charValue() == '0' ) { node = node.leftNode; } else { node = node.rightNode; } } while (!node.isLeaf()); buffer.append(node.chars); } return buffer.toString(); } |
測試以及比較
以下測試Huffman編碼的正確性(先編碼,后解碼,包括中文),以及Huffman編碼與常見的字符編碼的二進制字符串比較。代碼如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
|
public static void main(String[] args) { String oriStr = "Huffman codes compress data very effectively: savings of 20% to 90% are typical, " + "depending on the characteristics of the data being compressed. 中華崛起" ; Map<Character, Integer> statistics = statistics(oriStr.toCharArray()); String encodedBinariStr = encode(oriStr, statistics); String decodedStr = decode(encodedBinariStr, statistics); System.out.println( "Original sstring: " + oriStr); System.out.println( "Huffman encoed binary string: " + encodedBinariStr); System.out.println( "decoded string from binariy string: " + decodedStr); System.out.println( "binary string of UTF-8: " + getStringOfByte(oriStr, Charset.forName( "UTF-8" ))); System.out.println( "binary string of UTF-16: " + getStringOfByte(oriStr, Charset.forName( "UTF-16" ))); System.out.println( "binary string of US-ASCII: " + getStringOfByte(oriStr, Charset.forName( "US-ASCII" ))); System.out.println( "binary string of GB2312: " + getStringOfByte(oriStr, Charset.forName( "GB2312" ))); } public static String getStringOfByte(String str, Charset charset) { if (str == null || str.equals( "" )) { return "" ; } byte [] byteArray = str.getBytes(charset); int size = byteArray.length; StringBuffer buffer = new StringBuffer(); for ( int i = 0 ; i < size; i++) { byte temp = byteArray[i]; buffer.append(getStringOfByte(temp)); } return buffer.toString(); } public static String getStringOfByte( byte b) { StringBuffer buffer = new StringBuffer(); for ( int i = 7 ; i >= 0 ; i--) { byte temp = ( byte ) ((b >> i) & 0x1 ); buffer.append(String.valueOf(temp)); } return buffer.toString(); } |
以上就是本文的全部內容,希望對大家的學習有所幫助,也希望大家多多支持服務器之家。
原文鏈接:http://blog.csdn.net/kimylrong/article/details/17022319