Friday, November 24, 2017

A faster way to convert 卡提諾 novels to epub

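The advantage of epub is that turning to the next page takes a single tap. You can't do that when reading on a web page, where scrolling is more of a hassle. On top of that, the web version can't remember where you stopped, so every time you open the page you have to find your place again. With novels that run past 1,000 chapters, that gets really annoying.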

And the biggest advantage of the epub version: you can read it without a network connection.

The steps:

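  1. First, download the novel with CSNovelCrawler. CSNovelCrawler is available here: http://rngmontoli.blogspot.tw/2013/06/csnovelcrawler.html
  2. The file CSNovelCrawler produces is plain text, so it has to be converted to either an HTML file or a DOC file.
    1. Convert to an HTML file
    2. Convert to a DOC file
  3. Use calibre to convert the HTML or DOC file to epub

The troublesome part is step 2. At first I used a DOC file, but every chapter title has to be turned into a Word heading, otherwise the resulting epub has no bookmarks. Inside Word, the only way to detect each chapter title is VBA, but its syntax is odd and Microsoft's documentation doesn't explain it in much detail. In the end I brute-forced a script that worked, but it was slow: converting a 1 MB file took nearly 10 minutes. That was still acceptable, since I could just leave it running.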

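What finally made me give up on DOC files is that not every computer I use has Word. Switching to OpenOffice would mean rewriting the script for OpenOffice. Forget it; let's try HTML instead.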

First I had to pick a language for the HTML conversion. C/C++, Java, and C# are out, because handling text and files in them is a real chore. So let's try Python.

It worked out great: it only took a few lines. Below is the Python code. What it does is simple:

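  1. Open the file
  2. Replace certain characters (mainly whitespace and Simplified Chinese characters)
  3. Scan line by line for text like "第x章" (chapter x). If a line has it, wrap the line in an <h2> tag; otherwise wrap it in a <p> tag.
  4. For <h2> lines, also convert the Chinese numerals to Arabic numerals. I don't like Chinese numerals: they are unintuitive and take up more characters. Between "一千一百一十八" and "1118", I prefer 1118. I simply used this converter found online (https://github.com/binux/binux-tools/blob/master/python/chinese_digit.py).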
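
As a rough illustration of steps 2 to 4, here is a compact sketch. It is not the script from this post: every name in it (`FULLWIDTH_TO_ASCII`, `small_chinese_to_int`, `tag_line`) is hypothetical, it uses a regular expression where the script uses `find`, and the numeral converter is a simplified one that only handles values below 10,000.

```python
import re

# Hypothetical helper names; a simplified sketch, not the script from the post.

# Step 2: map full-width spaces and punctuation to their half-width forms.
FULLWIDTH_TO_ASCII = str.maketrans({
    "\u3000": " ", "（": "(", "）": ")", "，": ",", "：": ":", "！": "!", "？": "?",
})

DIGITS = {"零": 0, "一": 1, "二": 2, "兩": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}

# Step 3: a chapter line starts with 第, then Chinese numerals, then 章.
CHAPTER_RE = re.compile(r"^\s*第([零一二兩三四五六七八九十百千]+)章\s*(.*)$")


def small_chinese_to_int(text):
    """Step 4 (simplified): convert a Chinese numeral below 10,000 to an int."""
    total, digit = 0, 0
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        else:
            # A unit with no digit in front of it (as in 十二) counts as 1 x unit.
            total += (digit or 1) * UNITS[ch]
            digit = 0
    return total + digit


def tag_line(raw):
    """Return the HTML for one line of text, or None for a blank line."""
    line = raw.translate(FULLWIDTH_TO_ASCII).strip()
    if not line:
        return None
    match = CHAPTER_RE.match(line)
    if match:
        number = small_chinese_to_int(match.group(1))
        heading = f"第{number}章 {match.group(2)}".rstrip()
        return f"<h2>{heading}</h2>"
    return f"<p>{line}</p>"
```

With these definitions, `tag_line("第一千一百一十八章 決戰")` gives `<h2>第1118章 決戰</h2>`, while a line such as `第二天早上` stays a `<p>` because no 章 follows the numeral.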

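One more thing to note: when converting the HTML file to epub with Calibre, by default it does not turn <h2> tags into epub bookmarks. Fortunately this can be set manually (set the level 1 TOC to //h:h1 and the level 2 TOC to //h:h2).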
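
The same TOC settings can probably also be passed on the command line instead of through the Calibre GUI. The sketch below is not part of the workflow in this post; it assumes calibre's `ebook-convert` tool is installed and on the PATH, and the function names and file names are made up for illustration.

```python
import subprocess


def build_ebook_convert_command(html_path, epub_path, author=None):
    """Build an ebook-convert command that maps <h1>/<h2> to TOC levels 1 and 2."""
    cmd = [
        "ebook-convert", html_path, epub_path,
        "--level1-toc", "//h:h1",
        "--level2-toc", "//h:h2",
    ]
    if author:
        cmd += ["--authors", author]
    return cmd


def convert_to_epub(html_path, epub_path, author=None):
    """Run the conversion; assumes ebook-convert is available on the PATH."""
    subprocess.run(build_ebook_convert_command(html_path, epub_path, author), check=True)
```

Calling `convert_to_epub("novel.html", "novel.epub", author="蝴蝶藍")` would then run the conversion in one step; the file names here are placeholders.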



import html
import os

# Lookup table based on chinese_digit.py (linked above): Chinese numerals in
# everyday and formal forms, plus ASCII and full-width digits.
CHINESE_DIGITS = {
    '零': 0, '一': 1, '二': 2, '三': 3, '四': 4, '五': 5, '六': 6, '七': 7, '八': 8, '九': 9,
    '十': 10, '百': 100, '千': 1000, '萬': 10000, '億': 100000000,
    '0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7, '8': 8, '9': 9,
    '０': 0, '１': 1, '２': 2, '３': 3, '４': 4, '５': 5, '６': 6, '７': 7, '８': 8, '９': 9,
    '壹': 1, '貳': 2, '參': 3, '肆': 4, '伍': 5, '陸': 6, '柒': 7, '捌': 8, '玖': 9,
    '拾': 10, '佰': 100, '仟': 1000,
    '兩': 2,
}


def GetResultForDigit(a):
    """Convert a string of Chinese numerals to a string of Arabic numerals."""
    result = 0    # running total below 億
    tmp = 0       # digit(s) still waiting for a unit such as 十, 百 or 千
    billion = 0   # accumulated value at or above 億
    for ch in a:
        num = CHINESE_DIGITS.get(ch, 0)  # unknown characters count as 0
        if num == 100000000:
            # 億: scale everything so far, park it in billion and reset result
            result = (result + tmp) * num
            billion = billion * 100000000 + result
            result = 0
            tmp = 0
        elif num == 10000:
            # 萬: scale everything so far by 10,000
            result = (result + tmp) * num
            tmp = 0
        elif num >= 10:
            # 十, 百 or 千: a missing leading digit (as in 十二) means 1
            if tmp == 0:
                tmp = 1
            result += num * tmp
            tmp = 0
        else:
            # a single digit
            tmp = tmp * 10 + num
    return str(result + tmp + billion)


def Katino_format(input_file_name, author):
    filename, file_extension = os.path.splitext(input_file_name)
    output_file_name = filename + '.html'

    with open(input_file_name, 'r', encoding='utf8') as input_file, \
         open(output_file_name, 'w', encoding='utf8') as output_file:
        print('file is opened.')
        output_file.write("<!DOCTYPE html>\n")
        output_file.write("<html>\n")
        output_file.write("<title>" + filename + "</title>\n")
        output_file.write("<meta name=Author content=\"" + author + "\">\n")
        output_file.write("<body>\n")
        output_file.write("<h1>" + filename + "</h1>\n")
        for line in input_file:
            line = line.rstrip("\n")

            # remove runs of 4 half-width spaces and of 2 full-width spaces
            line = line.replace("    ", "")
            line = line.replace("\u3000\u3000", "")
            # replace a remaining full-width space with a half-width space
            line = line.replace("\u3000", " ")
            # the original targets an unidentified space-like character here;
            # a non-breaking space (U+00A0) is assumed
            line = line.replace("\u00a0", " ")

            # replace full-width punctuation with half-width punctuation
            line = line.replace("（", "(")
            line = line.replace("）", ")")
            line = line.replace("，", ",")
            line = line.replace("：", ":")
            line = line.replace("！", "!")
            line = line.replace("？", "?")
            # fix characters that were mis-converted from Simplified Chinese
            line = line.replace("隻是", "只是")
            line = line.replace("麵對", "面對")

            # find 第x章 near the start of the line
            chapter_find = False
            keyword1_pos = line.find("章")
            if 0 <= keyword1_pos < 10:
                keyword2_pos = line.find("第")
                if 0 <= keyword2_pos < 5 and keyword2_pos < keyword1_pos:
                    chinese_number = line[keyword2_pos + 1:keyword1_pos].strip()
                    # only a heading when everything between 第 and 章 is a numeral,
                    # so a sentence like 第二天...文章 is not mistaken for a chapter
                    if chinese_number and all(ch in CHINESE_DIGITS for ch in chinese_number):
                        arabic_number = GetResultForDigit(chinese_number)
                        print(line + ": " + chinese_number + "-->" + arabic_number)
                        # collapse double spaces and keep one space after 章
                        rest = line[keyword1_pos + 1:].replace("  ", " ")
                        if rest and not rest.startswith(" "):
                            rest = " " + rest
                        # rebuild by slicing so only the chapter number is replaced
                        line = line[:keyword2_pos] + "第" + arabic_number + "章" + rest
                        chapter_find = True

            if chapter_find:
                output_file.write("<h2>" + html.escape(line, quote=False) + "</h2>\n")
            elif line.strip():
                # skip blank lines; everything else becomes a paragraph
                output_file.write("<p>" + html.escape(line, quote=False) + "</p>\n")

        output_file.write("</body>\n")
        output_file.write("</html>\n")


if __name__ == "__main__":
    author = '蝴蝶藍'
    Katino_format('[蝴蝶藍] 天醒之路[1].txt', author)
    print('done.')
