程式亂亂寫: A faster way to convert 卡提諾 (Katino) novels to epub

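The nice thing about epub is that going to the next page takes a single tap. You can't do that in a browser, where scrolling is more of a hassle. The web version also can't remember where you stopped, so you have to find your place again every time you open the page. With a novel of 1000-plus chapters, that gets really annoying.

And the biggest advantage of the epub version: you can read it without a network connection.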

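The steps:

1. Grab the novel with CSNovelCrawler, which can be downloaded here: http://rngmontoli.blogspot.tw/2013/06/csnovelcrawler.html
2. CSNovelCrawler produces a plain text file, which has to be converted to either an HTML file or a DOC file.
3. Use calibre to convert the HTML or DOC file to epub.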

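The troublesome part is step 2. At first I used a DOC file, but every chapter title has to be converted to a Word heading by hand, otherwise the resulting epub has no bookmarks. Inside Word, the only way to detect every chapter title is VBA, but its syntax is odd and Microsoft's documentation doesn't explain it in much detail. In the end I brute-forced a script that worked, but it was slow: converting a 1 MB file took close to 10 minutes. That was still acceptable, since I could just leave it running.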

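What finally made me give up on DOC files is that not every computer I use has Word. Switching to OpenOffice would mean rewriting the script for OpenOffice. Forget it, let's try an HTML file instead.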

First I had to pick a language for the HTML conversion. C/C++, Java and C# were out, because handling text and files in them is a real hassle. So let's try Python.

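It turned out great: it only took a few lines. The Python code is below. What it does is simple:

1. Open the file.
2. Replace some characters (mainly whitespace, plus a few artifacts of simplified-to-traditional Chinese conversion).
3. Go through the file line by line looking for text like "第x章" (chapter x). A matching line gets an <h2> tag; anything else gets a <p> tag.
4. For <h2> text, also convert the Chinese numerals to Arabic numerals, because I don't like Chinese numerals: they are unintuitive and take up more characters. Between "一千一百一十八" and "1118", I'll take 1118. I just used this converter I found online (https://github.com/binux/binux-tools/blob/master/python/chinese_digit.py).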
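The chapter-detection step can also be sketched with a regular expression. This is only a minimal alternative sketch, not the code this post uses: the names CHAPTER_RE and render_line are made up for illustration, and it leaves the numeral conversion out. It also escapes HTML special characters, on the assumption that the novel text may contain characters such as "<".

```python
import html
import re

# Hypothetical pattern: "第", a run of Chinese or Arabic numerals, "章",
# then the rest of the line as the chapter title.
CHAPTER_RE = re.compile(r"^第\s*([0-9０-９零一二兩三四五六七八九十百千萬]+)\s*章\s*(.*)$")


def render_line(line):
    """Return the HTML for one line of the text file, or None if it is blank."""
    text = line.strip()  # also strips full-width spaces (U+3000)
    if not text:
        return None
    match = CHAPTER_RE.match(text)
    if match:
        number, title = match.groups()
        # Normalize to exactly one space between "章" and the title.
        heading = ("第" + number + "章 " + title).strip()
        return "<h2>" + html.escape(heading) + "</h2>"
    return "<p>" + html.escape(text) + "</p>"
```

Because the numerals must sit directly between "第" and "章", a sentence such as "第一次見面的文章" is treated as an ordinary paragraph rather than a chapter title.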

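One more thing to note: when Calibre converts the HTML file to epub, by default it does not turn <h2> tags into epub bookmarks. Fortunately this can be set manually (set the level 1 TOC to //h:h1 and the level 2 TOC to //h:h2).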
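The same two XPath settings can presumably be passed to Calibre's command-line tool, ebook-convert, instead of being entered in the GUI. The sketch below assumes ebook-convert is installed and on the PATH; the function names are hypothetical, and the post itself only describes the manual setting.

```python
import shutil
import subprocess


def build_ebook_convert_command(html_path, epub_path, author):
    """Build the argument list for Calibre's ebook-convert tool."""
    return [
        "ebook-convert", html_path, epub_path,
        "--authors", author,
        "--level1-toc", "//h:h1",  # <h1> becomes a first-level TOC entry
        "--level2-toc", "//h:h2",  # <h2> becomes a second-level TOC entry
    ]


def convert_to_epub(html_path, epub_path, author):
    """Run ebook-convert, failing early if Calibre is not available."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed?")
    command = build_ebook_convert_command(html_path, epub_path, author)
    subprocess.run(command, check=True)
```

Keeping the command builder separate from the call that runs it makes it easy to print and inspect the command before converting a large novel.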

import os
# Value of each supported numeral character: everyday and formal Chinese
# numerals, full-width digits, and ASCII digits.
CHINESE_DIGITS = {
    '零': 0, '一': 1, '二': 2, '三': 3, '四': 4, '五': 5, '六': 6, '七': 7, '八': 8, '九': 9,
    '十': 10, '百': 100, '千': 1000, '萬': 10000, '億': 100000000,
    '０': 0, '１': 1, '２': 2, '３': 3, '４': 4, '５': 5, '６': 6, '７': 7, '８': 8, '９': 9,
    '0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7, '8': 8, '9': 9,
    '壹': 1, '貳': 2, '參': 3, '肆': 4, '伍': 5, '陸': 6, '柒': 7, '捌': 8, '玖': 9,
    '拾': 10, '佰': 100, '仟': 1000,
    '兩': 2,
}


def GetResultForDigit(a, encoding="utf-8"):
    """Convert a string of Chinese or Arabic numerals to a decimal string.

    The encoding argument is unused; it is kept so the signature stays the same.
    """
    result = 0   # value accumulated since the last 億
    tmp = 0      # pending digits not yet multiplied by a unit
    Billion = 0  # value accumulated from 億 (10^8) groups
    for tmpChr in a:
        # unknown characters are treated as the digit 0
        tmpNum = CHINESE_DIGITS.get(tmpChr, 0)
        if tmpNum == 100000000:
            # 億: scale what has been accumulated, bank it in Billion,
            # and clear result
            result = (result + tmp) * tmpNum
            Billion = Billion * 100000000 + result
            result = 0
            tmp = 0
        elif tmpNum == 10000:
            # 萬: scale everything accumulated so far by 10^4
            result = (result + tmp) * tmpNum
            tmp = 0
        elif tmpNum >= 10:
            # 十, 百 or 千: a bare unit such as 十 means 1 x 10
            if tmp == 0:
                tmp = 1
            result = result + tmpNum * tmp
            tmp = 0
        else:
            # single digit: append it to the pending digits
            tmp = tmp * 10 + tmpNum
    return str(result + tmp + Billion)


def Katino_format(input_file_name, author):
    filename, file_extension = os.path.splitext(input_file_name)
    output_file_name = filename + '.html'

    # Substitutions applied to every line, in this order.
    replacements = [
        ("    ", ""),       # remove 4 half-width spaces
        ("　　", ""),        # remove 2 full-width spaces
        ("　", " "),         # full-width space -> half-width space
        ("\u00a0", " "),    # presumably a non-breaking space; the original character was lost
        ("(", "（"), (")", "）"), (",", "，"),  # half-width -> full-width punctuation
        (":", "："), ("!", "！"), ("?", "？"),
        ("隻是", "只是"), ("麵對", "面對"),      # fix mis-converted traditional characters
    ]

    with open(input_file_name, 'r', encoding='utf8') as input_file, \
            open(output_file_name, 'w', encoding='utf8') as output_file:
        print('file is opened.')
        output_file.write("<!DOCTYPE html>\n")
        output_file.write("<html>\n")
        output_file.write("<head>\n")
        output_file.write("<meta charset=\"utf-8\">\n")
        output_file.write("<title>" + filename + "</title>\n")
        output_file.write("<meta name=\"Author\" content=\"" + author + "\">\n")
        output_file.write("</head>\n")
        output_file.write("<body>\n")
        output_file.write("<h1>" + filename + "</h1>\n")
        for line in input_file:
            line = line.rstrip("\n")
            for old, new in replacements:
                line = line.replace(old, new)

            # find 第x章: '第' within the first 5 characters, '章' after it
            # within the first 10, and at least one character in between
            keyword1_pos = line.find("章")
            keyword2_pos = line.find("第")
            if 0 <= keyword2_pos < 5 and keyword2_pos + 1 < keyword1_pos < 10:
                # convert the Chinese number to an Arabic number
                chinese_number = line[keyword2_pos+1:keyword1_pos]
                arabic_number = GetResultForDigit(chinese_number)
                print(line + ": " + chinese_number + "-->" + arabic_number)
                # rebuild the heading with exactly one space after '章';
                # only the chapter number is replaced, not the title text
                title = line[keyword1_pos+1:].strip().replace("  ", " ")
                line = line[:keyword2_pos+1] + arabic_number + "章 " + title
                # it is <h2>
                output_file.write("<h2>" + line + "</h2>\n")
            elif line.strip():
                # any other non-blank line is a paragraph
                output_file.write("<p>" + line + "</p>\n")

        output_file.write("</body>\n")
        output_file.write("</html>\n")


author = '蝴蝶藍'
Katino_format('[蝴蝶藍] 天醒之路[1].txt', author)
print('done.')

Friday, November 24, 2017
