How to parse HTML with Python

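As a long-time member of the Scribus documentation team, I keep up to date with the latest source code updates so that I can update and add to the documentation. When I recently did a checkout with Subversion on a computer I had just upgraded to Fedora 27, I was surprised at how long it took to download the documentation, which consists of HTML pages and associated images. I became concerned that the project's documentation seemed far larger than the project itself, and I suspected that some of the content was "zombie" documentation: HTML files that are no longer used, and images that can no longer be reached from any HTML.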

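I decided to create a project for myself to sort this out. One approach is to search for existing image files that are not used. If I could scan all the HTML files for image references and then compare that list with the actual image files, I would probably see the files that do not match.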

Here is a typical image tag:

<img src="images/edit_shapes.png" ALT="Edit examples" ALIGN=left>

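I am interested in the part between the first set of quotation marks after src=. After looking around for a solution, I found a Python module called BeautifulSoup. The core of the script looks like this: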

soup = BeautifulSoup(all_text, 'html.parser')
match = soup.findAll("img")
if len(match) > 0:
    for m in match:
        imagelist.append(str(m))

We can use this findAll method to pluck out the image tags. Here is a small piece of the output:

<img src="images/pdf-form-ht3.png"/><img src="images/pdf-form-ht4.png"/><img src="images/pdf-form-ht5.png"/><img src="images/pdf-form-ht6.png"/><img align="middle" alt="GSview - Advanced Options Panel" src="images/gsadv1.png" title="GSview - Advanced Options Panel"/><img align="middle" alt="Scribus External Tools Preferences" src="images/gsadv2.png" title="Scribus External Tools Preferences"/>

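So far, so good. I thought the next step would finish the job, but when I tried some string methods in the script, it returned errors about tags rather than strings. I saved the output to a file and edited it in KWrite. One nice thing about KWrite is that you can do a find-and-replace with regular expressions (regex), so I could replace <img with \n<img, which made the output much easier to read. Another nice thing about KWrite is that if you make an unwise choice with a regex, you can undo it.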

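But I thought there must be something better than this, so I turned to regular expressions, or more specifically Python's re module. The relevant part of this new script looks like this: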

match = re.findall(r'src="(.*)/>', all_text)
if len(match)>0:
    for m in match:
        imagelist.append(m)

A small piece of its output looks like this:

images/cmcanvas.png" title="Context Menu for the document canvas" alt="Context Menu for the document canvas" /></td></tr></table><br images/eps-imp1.png" title="EPS preview in a file dialog" alt="EPS preview in a file dialog" images/eps-imp5.png" title="Colors imported from an EPS file" alt="Colors imported from an EPS file" images/eps-imp4.png" title="EPS font substitution" alt="EPS font substitution" images/eps-imp2.png" title="EPS import progress" alt="EPS import progress" images/eps-imp3.png" title="Bitmap conversion failure" alt="Bitmap conversion failure"

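At first glance, it looks similar to the output above, with the added benefit of trimming away part of the image tag, but puzzlingly there are table tags and other content mixed in. I think this comes from the regex src="(.*)/>, which is termed greedy, meaning it does not necessarily stop at the first instance of /> it encounters. I should add that I also tried src="(.*)", which was really no better. I am not a regex expert (I just made this one up), and although I looked around for various ways to improve it, nothing helped.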
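For comparison, here is a minimal sketch of how the greedy pattern might be tamed, assuming every src value is wrapped in double quotes. The sample string in all_text is made up for illustration. The pattern [^"]* matches everything up to the next double quote, and .*? is the non-greedy form of .*.

```python
import re

# Hypothetical sample: two image tags on one line, similar to the Scribus docs.
all_text = ('<td><img src="images/eps-imp1.png" title="EPS preview" /></td>'
            '<td><img src="images/eps-imp5.png" alt="Colors" /></td>')

# Greedy: .* runs to the LAST "/>" on the line, swallowing the tags in between.
greedy = re.findall(r'src="(.*)/>', all_text)

# Negated character class: stop at the first closing double quote.
by_class = re.findall(r'src="([^"]*)"', all_text)

# Non-greedy quantifier: .*? matches as little as possible.
non_greedy = re.findall(r'src="(.*?)"', all_text)

print(len(greedy))
print(by_class)
print(non_greedy)
```

With this sample, the greedy pattern returns a single long match, while the other two each return the two image paths.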

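After a series of other attempts, even trying Perl's HTML::Parser module, I finally compared this problem with some scripts I had written for Scribus that analyze the contents of a text character by character and then take some action. For my purposes, what I finally came up with improves on all of these methods and requires no regex or HTML parser at all. Let's go back to the example img tag I showed.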

<img src="images/edit_shapes.png" ALT="Edit examples" ALIGN=left>

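I decided to home in on the src= piece. One way would be to wait for an occurrence of s, then see whether the next character is r, the next c, and the next =. If so, that is a match, and whatever lies between the next two double quotes is what I need. The problem with this approach is the structure needed to keep track of such a sequence. One way of looking through a string of characters representing a line of HTML text would be: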

for c in all_text:

But the logic was just too messy to keep hold of the previous c, and the character before that, and the one before that, and the one before that.

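In the end, I decided to focus on the = and to use an indexing method, so that I could easily refer to any previous or future character in the string. Here is the searching part: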

    index = 3
    while index < linelength:
        if (all_text[index] == '='):
            if (all_text[index-3] == 's') and (all_text[index-2] == 'r') and (all_text[index-1] == 'c'):
                imagefound(all_text, imagelist, index)
                index += 1
            else:
                index += 1
        else:
            index += 1

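I start the search at the fourth character (indexing starts at 0) so that I do not get an indexing error further down, and realistically there will not be an equal sign before the fourth character of a line. The first test is whether the current character is =. If it is not, we move on. If we do see an equal sign, we check whether the three previous characters are s, r, and c, in that order. If they all match, we call the function imagefound: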

def imagefound(all_text, imagelist, index):
    end = 0
    # index points at the '='; skip it and the opening double quote
    index += 2
    newimage = ''
    while end == 0:
        if (all_text[index] != '"'):
            newimage = newimage + all_text[index]
            index += 1
        else:
            # closing quote reached: store the path with a trailing newline
            newimage = newimage + '\n'
            imagelist.append(newimage)
            end = 1
            return

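We pass the function the current index, which points at the =. We know the next character will be ", so we jump ahead two characters and begin adding characters to a control string named newimage until we reach the following ", at which point we have a complete match. We append the string plus a newline character (\n) to the list imagelist and return. Keep in mind that there may be more image tags in the remaining HTML string, so we go right back into the search loop.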

Here is what our output looks like now:

images/text-frame-link.png
images/text-frame-unlink.png
images/gimpoptions1.png
images/gimpoptions3.png
images/gimpoptions2.png
images/fontpref3.png
images/font-subst.png
images/fontpref2.png
images/fontpref1.png
images/dtp-studio.png

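Ah, much cleaner, and this took only a few seconds to run. I could have advanced the index seven more positions to cut off the images/ part, but I prefer to keep it so I can be sure I have not chopped off the first letter of an image filename, and it is easy to edit out with KWrite; you do not even need regex. After doing that and saving the file, the next step was to run another script I wrote, sortlist.py: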

#!/usr/bin/env python
# -*- coding: utf-8  -*-
# sortlist.py

import os

imagelist = []
for line in open('/tmp/imagelist_parse4.txt'):
    imagelist.append(line)

imagelist.sort()

outfile = open('/tmp/imagelist_parse4_sorted.txt', 'w')
outfile.writelines(imagelist)
outfile.close()

This reads the file contents into a list, sorts it, and saves it as another file. After that, I could do the following:

ls /home/gregp/development/Scribus15x/doc/en/images/*.png > '/tmp/actual_images.txt'

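Then I need to run sortlist.py on that file too, because ls sorts differently from Python. I could have run a comparison script on these files, but I preferred to do it visually. In the end, I found 42 images that had no reference from the HTML of the documentation.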
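A comparison script of the kind mentioned above might look like the following sketch. It assumes one list of referenced paths such as images/foo.png and one list of full paths as produced by ls, and it compares base filenames using sets. The function name unreferenced_images and the sample data are made up for illustration and are not part of the original scripts.

```python
import os

def unreferenced_images(actual_paths, referenced_paths):
    """Return sorted base names that exist on disk but are never referenced."""
    # Strip newlines and reduce each path to its file name.
    actual = {os.path.basename(p.strip()) for p in actual_paths if p.strip()}
    referenced = {os.path.basename(p.strip()) for p in referenced_paths if p.strip()}
    # Set difference: files that exist but have no reference.
    return sorted(actual - referenced)

# Made-up sample data standing in for the two list files in /tmp.
actual = ['/docs/en/images/a.png\n', '/docs/en/images/b.png\n', '/docs/en/images/c.png\n']
referenced = ['images/a.png\n', 'images/c.png\n', 'images/a.png\n']
print(unreferenced_images(actual, referenced))
```

Because sets ignore order and duplicates, neither list would need to be sorted first with this approach.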

Here is my parsing script in its entirety:

#!/usr/bin/env python
# -*- coding: utf-8  -*-
# parseimg4.py

import os

def imagefound(all_text, imagelist, index):
    end = 0
    # index points at the '='; skip it and the opening double quote
    index += 2
    newimage = ''
    while end == 0:
        if (all_text[index] != '"'):
            newimage = newimage + all_text[index]
            index += 1
        else:
            # closing quote reached: store the path with a trailing newline
            newimage = newimage + '\n'
            imagelist.append(newimage)
            end = 1
            return

htmlnames = []
imagelist = []
tempstring = ''
filenames = os.listdir('/home/gregp/development/Scribus15x/doc/en/')
for name in filenames:
    if name.endswith('.html'):
        htmlnames.append(name)
#print htmlnames
for htmlfile in htmlnames:
    all_text = open('/home/gregp/development/Scribus15x/doc/en/' + htmlfile).read()
    linelength = len(all_text)
    index = 3
    while index < linelength:
        if (all_text[index] == '='):
            if (all_text[index-3] == 's') and (all_text[index-2] == 'r') and (all_text[index-1] == 'c'):
                imagefound(all_text, imagelist, index)
                index += 1
            else:
                index += 1
        else:
            index += 1

outfile = open('/tmp/imagelist_parse4.txt', 'w')
outfile.writelines(imagelist)
outfile.close()
imageno = len(imagelist)
print(str(imageno) + " images were found and saved")

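The script's name, parseimg4.py, does not really reflect the number of scripts I wrote along the way, including minor tweaks, major rewrites, and versions I discarded before starting over. Note that I have hardcoded the directory and filenames, but it would be easy to generalize the script by asking the user for this information. Also, because these were working scripts, I sent the output to /tmp, so the files disappear once I reboot the system.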
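One way the hardcoded paths could be generalized is sketched below, under a few assumptions: the directory and output file become command-line arguments handled by argparse, and the character-by-character search is swapped for str.find, which is a different technique from the index loop in parseimg4.py. The names find_src_values, parse_args, and main, the argument names, and the default output path are hypothetical.

```python
import argparse
import os

def find_src_values(all_text):
    """Return every value found between src=" and the next double quote."""
    values = []
    start = all_text.find('src="')
    while start != -1:
        begin = start + len('src="')
        end = all_text.find('"', begin)
        if end == -1:          # unterminated attribute: stop scanning
            break
        values.append(all_text[begin:end])
        start = all_text.find('src="', end + 1)
    return values

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description='List src= values in HTML files')
    parser.add_argument('docdir', help='directory containing the .html files')
    parser.add_argument('-o', '--output', default='/tmp/imagelist.txt',
                        help='file to write the list to')
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    imagelist = []
    for name in sorted(os.listdir(args.docdir)):
        if name.endswith('.html'):
            with open(os.path.join(args.docdir, name)) as infile:
                imagelist.extend(find_src_values(infile.read()))
    with open(args.output, 'w') as outfile:
        outfile.writelines(image + '\n' for image in imagelist)
    print(str(len(imagelist)) + " images were found and saved")

# Example call with a placeholder directory (not executed here):
# main(['/path/to/doc/en', '-o', '/tmp/imagelist.txt'])
```

Adding a call to main() at the bottom of the file would turn the sketch into a command-line script. Like the original, it picks up any src= attribute, not only those inside img tags.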

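This was not the end of the story, because the next question was: what about zombie HTML files? Any unused HTML file might still reference images, and those images would not be caught by the method above. We have a menu.xml file that serves as the table of contents (TOC) for the online manual, but I also had to consider that some files listed in the TOC might reference files that are not in the TOC, and yes, I did find some.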

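I will conclude by saying that this was a simpler task than the image search, and that it was greatly helped by the process I had already developed.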
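The zombie-HTML check described above could be approached along the following lines. This is only a sketch using html.parser from the Python 3 standard library, not the method actually used for Scribus. It assumes the TOC entries have already been extracted from menu.xml (whose format is not shown here), and the class LinkCollector, the function reachable_pages, and the sample pages are invented for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href targets of <a> tags that point at local .html files."""
    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for key, value in attrs:
            if key == 'href' and value:
                target = value.split('#')[0]   # drop any fragment
                if target.endswith('.html') and '://' not in target:
                    self.links.add(target)

def reachable_pages(pages, start):
    """pages maps file name -> HTML text; start lists the TOC entries."""
    seen = set()
    todo = [name for name in start if name in pages]
    while todo:
        name = todo.pop()
        if name in seen:
            continue
        seen.add(name)
        collector = LinkCollector()
        collector.feed(pages[name])
        # Follow only links to pages that actually exist in the set.
        todo.extend(link for link in collector.links if link in pages)
    return seen

# Made-up example: index.html is in the TOC, old.html is linked from nowhere.
pages = {
    'index.html': '<a href="tools.html#gs">Tools</a>',
    'tools.html': '<a href="index.html">Back</a> <a href="http://example.com/x.html">ext</a>',
    'old.html': '<a href="index.html">Home</a>',
}
zombies = sorted(set(pages) - reachable_pages(pages, ['index.html']))
print(zombies)
```

Starting from the TOC entries and following links also covers pages that are referenced by TOC pages without being listed in the TOC themselves.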

About the author

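Greg Pittman is a retired neurologist in Louisville, Kentucky, with a long-standing interest in computers and programming that began with Fortran IV in the 1960s. When Linux and open source software came along, he was inspired to learn more and, eventually, to contribute. He is a member of the Scribus team.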

via: https://opensource.com/article/18/1/parsing-html-python

Author: Greg Pittman; translator: Flowsnow; proofreader: wxy

This article was originally translated by LCTT and is presented by Linux China (Linux中國).

This article is reposted from Linux China: https://github.com/Linux-CN/archive
