Parsing Kindle Notes with beautifulsoup4

Tags: Variety, anki, kindle, 小书匠, note management

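I own a Kindle Paperwhite 3 (KPW3) and usually read e-books across several devices: the KPW3, an Android phone, an iPad, and a computer. While reading I highlight passages and take notes. Oddly, highlights and notes made on the KPW3 sync to the other devices, and those made elsewhere do sync back to the KPW3, but they are never recorded in My Clippings.txt, so the reading notes cannot be processed any further. To work around this, I use the note-export feature of the Kindle app on Android to export all the notes of a book as HTML, then parse that file, merge the notes into My Clippings.txt, and feed them into applications such as Variety and anki.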

2. System environment

# System environment
!lsb_release -a

No LSB modules are available.
Distributor ID:	LinuxMint
Description:	Linux Mint 19.3 Tricia
Release:	19.3
Codename:	tricia

# Versions of Python and the related libraries
!python --version
!python -m pip list --format=columns | grep beautifulsoup4
!python -m pip list --format=columns | grep lxml

Python 3.6.9
beautifulsoup4                    4.9.0
lxml                              4.5.0

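3. Exporting reading notes with the Kindle app

Make sure the notes in the Kindle app on Android are complete. A note can be complete on one phone while another phone shows only a single character of it, as in the figure below; in that case the only fix is to go to that location and highlight the passage again.

Notes inconsistent across synced devices

Once the notes are complete, export them with the "no format" option. The export steps are shown below:


Step 1	Step 2

After export, the result looks like this:

Exported Kindle notes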

4. Parsing the HTML notes

4.1 Parsing basic book information

# Import libraries
import re
from bs4 import BeautifulSoup
# Python 3 has no built-in unicode type; alias it to str instead of importing it from lxml.html.clean
unicode = str

# Create the BeautifulSoup object
with open('./demo.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, features='html.parser')

# Get the book title and author
bookname = soup.find_all('div', class_='bookTitle')[0].text.strip()
authors = soup.find_all('div', class_='authors')[0].text.strip()
print(bookname, authors)

拆掉思维里的墙:原来我还可以这样活 古典 

4.2 Parsing the book notes

# All note content
allcontents = soup.contents[3].contents[3].contents[1]
children = allcontents.contents

def newnote(section=''):
    # Empty note record; the section heading carries over from the previous note
    return {'sectionHeading': section,
            'noteHeading': {'markColor': '', 'markPosition': ''},
            'noteText': '',
            'takenote': {'takePosition': '', 'note': ''}}

def parsenode(node):
    # Re-parse one child node; lxml is named explicitly, since it is the parser
    # the bare BeautifulSoup(...) call selects when lxml is installed
    return BeautifulSoup(unicode(node), features='lxml')

def cleantext(node):
    # Text of a node with surrounding whitespace and line breaks removed
    return node.text.strip().replace('\n', '')

# Iterate over all note content (in this export the notes start at child index 11)
allnotes = []
takenoteflag = False
note = newnote()
for conind in range(11, len(children)):
    content = parsenode(children[conind])
    if len(content) == 0:
        continue
    # Distinguish content by CSS class
    div = content.select('div')
    if not div:
        continue
    divclass = div[0].get('class')[0]
    # Section the note belongs to
    if divclass == 'sectionHeading':
        note['sectionHeading'] = cleantext(content)
    # Note heading: highlight color and position
    elif divclass == 'noteHeading':
        markpos = re.findall(r'\d+', cleantext(content))[0]
        if takenoteflag:
            note['takenote']['takePosition'] = markpos
        else:
            note['noteHeading']['markColor'] = cleantext(content.span)
            note['noteHeading']['markPosition'] = markpos
    # A note of my own attached to the previous highlight
    elif divclass == 'noteText' and takenoteflag:
        note['takenote']['note'] = cleantext(content)
        takenoteflag = False
        allnotes.append(note)
        note = newnote(note['sectionHeading'])
    # A plain highlight
    elif divclass == 'noteText' and not takenoteflag:
        note['noteText'] = cleantext(content)
        # Look ahead to the next non-empty node to see whether a note follows
        nextnote = None
        for nxtind in range(conind + 1, len(children)):
            nextnote = parsenode(children[nxtind])
            if len(nextnote) > 0:
                break
        if nextnote is not None and '笔记' in cleantext(nextnote):
            takenoteflag = True
        else:
            allnotes.append(note)
            note = newnote(note['sectionHeading'])
# print(allnotes)
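The index chain `soup.contents[3].contents[3].contents[1]` and the start offset 11 depend on the exact layout of one particular export. As a hedged alternative, the sketch below picks out the same three kinds of `div` by class name. The sample markup is invented for illustration: only the class names, the `笔记` heading prefix, and the position number are taken from the code above. `parse_notes` is a hypothetical helper, not part of the original script.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for an exported notebook
SAMPLE = """
<div class="bodyContainer">
  <div class="bookTitle">Demo Book</div>
  <div class="authors">Demo Author</div>
  <div class="sectionHeading">Chapter 1</div>
  <div class="noteHeading">标注(<span class="highlight_yellow">黄色</span>) - 位置 120</div>
  <div class="noteText">First highlighted sentence.</div>
  <div class="noteHeading">笔记 - 位置 120</div>
  <div class="noteText">My comment on it.</div>
  <div class="noteHeading">标注(<span class="highlight_blue">蓝色</span>) - 位置 245</div>
  <div class="noteText">Second highlighted sentence.</div>
</div>
"""

WANTED = ("sectionHeading", "noteHeading", "noteText")


def parse_notes(html):
    """Return one dict per highlight, with the attached note if one follows."""
    soup = BeautifulSoup(html, "html.parser")
    notes, section, position, is_note = [], "", "", False
    # The class_ filter is called with each CSS class of a tag (or None)
    for div in soup.find_all("div", class_=lambda c: c in WANTED):
        cls = div.get("class")[0]
        text = div.get_text().strip().replace("\n", "")
        if cls == "sectionHeading":
            section = text
        elif cls == "noteHeading":
            # A heading starting with 笔记 announces a note, otherwise a highlight
            is_note = text.startswith("笔记")
            found = re.findall(r"\d+", text)
            position = found[0] if found else ""
        elif is_note and notes:
            # noteText after a 笔记 heading belongs to the previous highlight
            notes[-1]["note"] = text
        else:
            notes.append({"section": section, "position": position,
                          "text": text, "note": ""})
    return notes


notes = parse_notes(SAMPLE)
for n in notes:
    print(n)
```

Selecting by class avoids both the hard-coded offsets and the look-ahead loop, at the cost of assuming that a note always directly follows the highlight it belongs to.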

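5. Using the HTML notes

5.1 Appending to the Kindle clippings file My Clippings.txt

With the notes parsed, they are appended to My Clippings.txt in the same format that file uses for highlights and notes. Once merged, the notes can be managed with existing tools such as clippings.io and 书见.

Note: the exported notes carry no timestamps, so the current system time is used as the note time. It is not the time the note was actually taken.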

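# Get the current time in the format used by My Clippings.txt
import time

def Getnowdate():
    week_day_dict = {
        0: '星期一',
        1: '星期二',
        2: '星期三',
        3: '星期四',
        4: '星期五',
        5: '星期六',
        6: '星期天',
    }
    loctime = time.localtime()
    # Build the date from the struct fields rather than the glibc-only %-m / %-d directives
    years = '%d年%d月%d日' % (loctime.tm_year, loctime.tm_mon, loctime.tm_mday)
    weeks = week_day_dict[loctime.tm_wday]
    # 12-hour clock (assumed convention: midnight and noon are both shown as 12)
    ampm = '上午' if loctime.tm_hour < 12 else '下午'
    hour12 = loctime.tm_hour % 12 or 12
    # Minutes and seconds stay zero-padded
    times = ampm + str(hour12) + time.strftime(':%M:%S', loctime)
    nowdate = years + weeks + ' ' + times
    return nowdate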

# Read the existing clippings
with open('My Clippings.txt', 'r', encoding='utf-8') as fr:
    existnotes = fr.readlines()

# Append the new entries to the file
with open('My Clippings.txt', 'a', encoding='utf-8') as fw:
    for note in allnotes:
        markpos = note['noteHeading']['markPosition']
        marktext = note['noteText'].replace(' ', '')
        taketext = note['takenote']['note']
        # Highlight entry, skipped if its text is already in the file
        if (marktext + '\n') not in existnotes:
            fw.write(bookname + ' (' + authors + ')\n')
            fw.write('- 您在位置 #' + markpos + '-' + str(int(markpos) + 1) + ' 的标注' + ' | 添加于 ' + Getnowdate() + '\n\n')
            fw.write(marktext + '\n')
            fw.write('==========\n')
        # Note entry, written only when a note of my own is attached to the highlight;
        # the duplicate check uses exactly the text that gets written
        if taketext != '' and (taketext + '\n') not in existnotes:
            fw.write(bookname + ' (' + authors + ')\n')
            fw.write('- 您在位置 #' + markpos + ' 的笔记' + ' | 添加于 ' + Getnowdate() + '\n\n')
            fw.write(taketext + '\n')
            fw.write('==========\n')

Reading notes appended to My Clippings.txt
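The duplicate check above compares single lines returned by `readlines()`. A slightly more structured check, sketched here under assumptions, splits the file into entries on the `==========` separator that the code above writes and collects the body text of each entry. `read_clipping_bodies` and the sample text are hypothetical; a real My Clippings.txt may differ in details such as a byte-order mark or extra blank lines.

```python
SEPARATOR = "=========="

# Hypothetical sample in the layout written by the code above
SAMPLE = (
    "Demo Book (Demo Author)\n"
    "- 您在位置 #120-121 的标注 | 添加于 2020年4月5日星期天 下午3:30:12\n"
    "\n"
    "First highlighted sentence.\n"
    "==========\n"
    "Demo Book (Demo Author)\n"
    "- 您在位置 #120 的笔记 | 添加于 2020年4月5日星期天 下午3:30:12\n"
    "\n"
    "My comment on it.\n"
    "==========\n"
)


def read_clipping_bodies(text):
    """Return the set of body texts of all entries in a clippings string."""
    bodies = set()
    for entry in text.split(SEPARATOR):
        lines = [ln.strip() for ln in entry.strip().splitlines()]
        # Assumed layout: title line, "- ..." metadata line, blank line, body
        if len(lines) >= 3:
            body = "".join(lines[2:]).strip()
            if body:
                bodies.add(body)
    return bodies


existing = read_clipping_bodies(SAMPLE)
print("First highlighted sentence." in existing)
print("Something new." in existing)
```

Because only entry bodies end up in the set, a highlight whose text happens to equal a title or metadata line is not mistaken for a duplicate.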

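5.2 Adapting the notes as Variety quotes

Variety is a wallpaper manager for Linux that can display quotes taken from a local text file. Here the Kindle notes are converted into the format Variety recognizes and shown on the desktop for everyday viewing.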

# Read the quotes that were already added
# Keep only the quote text, i.e. everything before the '[' that opens the section name
def Delline(line):
    lastind = 0
    if '[' in line:
        lastind = line.index('[')
    return line[:lastind]

with open('/home/wu/.config/variety/pluginconfig/quotes/qotes.txt', 'r', encoding='utf-8') as fr:
    existnotes = list(map(Delline, fr.readlines()))

# Write the new quotes to a file
with open('qotes.txt', 'w', encoding='utf-8') as fw:
    for note in allnotes:
        quote = note['noteText'].replace(' ', '')
        comment = note['takenote']['note'].replace(' ', '')
        if quote not in existnotes:
            fw.write(quote + '[' + note['sectionHeading'].replace(' ', '') + ']' + '——' + bookname + ' (' + authors + ')\n')
            # My own note, if any, goes on a second line
            if comment != '':
                fw.write('#' + comment + '——@' + 'WuShaogui\n')
            # A line with a single dot closes each quote
            fw.write('.\n')


Parsed file	Variety configuration

Variety quote display
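The quote layout used above (quote text, section name in square brackets, `——`, book and author, an optional `#` line with my own note, then a line holding a single dot) can be isolated into small functions that work on strings, which makes the duplicate check easy to try out without touching the Variety configuration. This is a minimal sketch: `quote_text`, `format_quote`, and `new_quote_blocks` are hypothetical names, and the note dictionaries use simplified keys rather than the `allnotes` structure.

```python
def quote_text(line):
    """Text before the '[' that opens the section name, or '' if there is none."""
    return line.split("[", 1)[0] if "[" in line else ""


def format_quote(text, section, book, author, comment=""):
    """Build one quote block in the layout written by the code above."""
    block = "%s[%s]——%s (%s)\n" % (text, section, book, author)
    if comment:
        block += "#%s\n" % comment
    return block + ".\n"


def new_quote_blocks(existing_text, notes, book, author):
    """Return blocks only for notes whose text is not in existing_text yet."""
    seen = {quote_text(line) for line in existing_text.splitlines()}
    blocks = []
    for n in notes:
        if n["text"] not in seen:
            blocks.append(format_quote(n["text"], n["section"], book, author,
                                       n["comment"]))
            seen.add(n["text"])
    return blocks


existing = "Old quote[Ch1]——Book (A)\n.\n"
notes = [
    {"text": "Old quote", "section": "Ch1", "comment": ""},
    {"text": "New quote", "section": "Ch2", "comment": "mine"},
]
blocks = new_quote_blocks(existing, notes, "Book", "A")
print("".join(blocks), end="")
```

Adding each written quote to `seen` also keeps a highlight that appears twice in the same export from being written twice.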

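5.3 Adapting the notes for anki import

anki is a flashcard tool well suited to memorization. Importing the Kindle notes into anki makes it possible to review the notes of a book again and again and to deepen one's understanding of it.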

# Write the notes in anki's import format: one note per line, fields separated by tabs
with open('%s-%s.txt' % (bookname, authors), 'w', encoding='utf-8') as fw:
    for note in allnotes:
        fw.write(note['noteText'].replace(' ', '') + '\t'
                 + note['sectionHeading'].replace(' ', '') + '\t'
                 + bookname + '\t' + authors + '\t'
                 + note['takenote']['note'].replace(' ', '') + '\n')


Parsed file	Importing the parsed file into Anki


Result after import	Final result
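Joining fields with `'\t'` by hand breaks as soon as a highlight itself contains a tab or a line break. One way around this, sketched below, is to let Python's `csv` module write the tab-separated file, since it quotes such fields automatically. `write_anki_tsv` and the simplified note keys are hypothetical, and whether the importer honors quoted fields should be checked against the Anki version in use.

```python
import csv
import io


def write_anki_tsv(notes, book, author, fileobj):
    """Write one tab-separated row per note: text, section, book, author, note."""
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    for n in notes:
        writer.writerow([n["text"], n["section"], book, author, n["note"]])


# Hypothetical note whose text contains a tab
sample_notes = [{"text": "a\tb", "section": "Ch1", "note": ""}]

buf = io.StringIO()
write_anki_tsv(sample_notes, "Demo Book", "Demo Author", buf)

# Reading the data back shows that the embedded tab stayed inside one field
rows = list(csv.reader(io.StringIO(buf.getvalue()), delimiter="\t"))
print(rows)
```

To write a real file instead of the in-memory buffer, open it with `newline=''` as the `csv` documentation recommends and pass the file object to `write_anki_tsv`.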