Parsing HTML with Python

Image credit:

Jason Baker for Opensource.com.

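As a long-time member of the documentation team at Scribus, I keep up with the latest updates to the source so I can help update and add to the documentation. When I recently did a "checkout" using Subversion on a computer I had just upgraded to Fedora 27, I was amazed at how long it took to download the documentation, which consists of HTML pages and associated images. I became concerned that the project's documentation seemed much larger than it should be, and I suspected that some of the content was "zombie" documentation: HTML files that are no longer used, and images that have lost all references in the HTML that is currently used.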

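I decided to create a project for myself to figure this out. One way to do this is to search for existing image files that aren't used. If I could scan through all the HTML files for image references, then compare that list to the actual image files, chances are I would see a mismatch.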

Here is a typical image tag:

<img src="https://open-source.net.cn/images/edit_shapes.png" ALT="Edit examples" ALIGN=left>

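I'm interested in the part between the first set of quotation marks, after src=. After some searching for a solution, I found a Python module called BeautifulSoup. The useful part of the script I wrote looks like this: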

    soup = BeautifulSoup(all_text, 'html.parser')
    match = soup.findAll("img")
    if len(match) > 0:
        for m in match:
            imagelist.append(str(m))

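We can use this findAll method to pluck out the image tags. (In Beautiful Soup 4 the same method is spelled find_all; findAll remains as a legacy alias.) Here is a tiny piece of the output: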

<img src="https://open-source.net.cn/images/pdf-form-ht3.png"/><img src="https://open-source.net.cn/images/pdf-form-ht4.png"/><img src="https://open-source.net.cn/images/pdf-form-ht5.png"/><img src="https://open-source.net.cn/images/pdf-form-ht6.png"/><img align="middle" alt="GSview - Advanced Options Panel" src="https://open-source.net.cn/images/gsadv1.png" title="GSview - Advanced Options Panel"/><img align="middle" alt="Scribus External Tools Preferences" src="https://open-source.net.cn/images/gsadv2.png" title="Scribus External Tools Preferences"/>

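So far, so good. I thought that the next step might be to just carve this down, but when I tried some string methods in the script, it returned errors about these being tags and not strings. I saved the output to a file and went through the process of editing in KWrite. One nice thing about KWrite is that you can do a "find and replace" using regular expressions (regex), so I could replace <img with \n<img, which made it easier to see how to carve this down from there. Another nice thing about KWrite is that, if you make an injudicious choice with regex, you can undo it.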
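One way around the tag-versus-string problem, which is not the route this article ends up taking, is to ask a parser for the src attribute directly instead of trimming the printed tag. The sketch below uses html.parser from the Python 3 standard library rather than BeautifulSoup, so it is a swapped-in technique; the names ImgSrcCollector and collect_img_sources are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs. Self-closing <img .../> tags also end up here.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def collect_img_sources(html_text):
    """Return the src values of all img tags in html_text, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources


sample = '<p><img src="images/a.png" ALT="A" ALIGN=left><img alt="B" src="images/b.png"/></p>'
print(collect_img_sources(sample))
```

Because the parser hands over attribute values as plain strings, no further trimming of the tag text is needed.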

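But I thought, surely there is something better than this, so I turned to regex, or more specifically the re module for Python. The relevant part of this new script looks like this: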

    match = re.findall(r'src="(.*)/>', all_text)
    if len(match)>0:
        for m in match:
            imagelist.append(m)

And a tiny piece of its output looks like this:

images/cmcanvas.png" title="Context Menu for the document canvas" alt="Context Menu for the document canvas" /></td></tr></table><br images/eps-imp1.png" title="EPS preview in a file dialog" alt="EPS preview in a file dialog" images/eps-imp5.png" title="Colors imported from an EPS file" alt="Colors imported from an EPS file" images/eps-imp4.png" title="EPS font substitution" alt="EPS font substitution" images/eps-imp2.png" title="EPS import progress" alt="EPS import progress" images/eps-imp3.png" title="Bitmap conversion failure" alt="Bitmap conversion failure"

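At first glance, it looks similar to the output above and has the nice feature of trimming out parts of the image tag, but there are puzzling inclusions of table tags and other content. I think this relates to the regex src="(.*)/>, which is termed greedy: the .* matches as much as it can, so it does not necessarily stop at the first /> it encounters. I should add that I also tried src="(.*)", which was really no better. Not being a regexpert (I just made that up), my searching around for various ideas to improve this didn't help.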
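For readers curious about the greedy behavior, here is a small sketch of the usual remedy, which the article does not pursue: making the quantifier non-greedy with .*? so the match stops at the first closing quotation mark. The sample string is invented for illustration.

```python
import re

# Two image tags on one line, similar in shape to the documentation's HTML.
sample = '<img src="images/a.png" alt="A" /><img src="images/b.png" alt="B" />'

# Greedy: .* runs to the LAST quotation mark on the line, so one long,
# messy match swallows everything in between.
greedy = re.findall(r'src="(.*)"', sample)

# Non-greedy: .*? stops at the FIRST quotation mark after src=".
lazy = re.findall(r'src="(.*?)"', sample)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

An equivalent pattern is src="([^"]*)", which rules out quotation marks inside the captured group instead of relying on non-greedy matching.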

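After a series of other attempts, even trying out HTML::Parser with Perl, I finally compared this to the situation of some scripts I've written for Scribus that analyze the contents of a text frame, character by character, then take some action. For my purposes, what I finally came up with improves on all these methods and requires no regex or HTML parser at all. Let's go back to that example img tag I showed.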

<img src="https://open-source.net.cn/images/edit_shapes.png" ALT="Edit examples" ALIGN=left>

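I decided to home in on the src= piece. One way would be to wait for an occurrence of s, then see if the next character is r, the next c, and the next =. If so, bingo! Then what follows between the two double quotation marks is what I need. The problem with this is the structure it takes to hang onto these characters. One way of looking at a string of characters representing a line of HTML text would be: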

for c in all_text:

But the logic was just too messy to hang onto the previous c, and the one before that, the one before that, and the one before that.

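In the end, I decided to focus on the = and to use an indexing method whereby I could easily reference any prior or future character in the string. Here is the searching part: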

    index = 3
    while index < linelength:
        if (all_text[index] == '='):
            if (all_text[index-3] == 's') and (all_text[index-2] == 'r') and \
               (all_text[index-1] == 'c'):
                imagefound(all_text, imagelist, index)
                index += 1
            else:
                index += 1
        else:
            index += 1

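I start the search with the fourth character (indexing starts at 0), so I don't get an indexing error down below, and realistically, there will not be an equal sign before the fourth character of a line. The first test is to see if we find = as we're marching through the string, and if not, we march on. If we do see one, then we ask if the three previous characters were s, r, and c, in that order. If that happens, we call the function imagefound: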
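The three-character look-behind can also be written as a single slice comparison. The sketch below is a compact restatement of the same test, not the script's actual code; the function name find_src_equals is hypothetical, and it only reports where each qualifying = sits.

```python
def find_src_equals(text):
    """Return the index of every '=' that is directly preceded by 'src'."""
    hits = []
    # Start at index 3 so the slice text[index - 3:index] never starts
    # before the beginning of the string.
    for index in range(3, len(text)):
        if text[index] == '=' and text[index - 3:index] == 'src':
            hits.append(index)
    return hits


sample = '<img src="a.png" alt="x">'
print(find_src_equals(sample))
```

In the sample, the = after alt is skipped because the three characters before it are not s, r, and c.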

def imagefound(all_text, imagelist, index):
    end = 0
    index += 2
    newimage = ''
    while end == 0:
        if (all_text[index] != '"'):
            newimage = newimage + all_text[index]
            index += 1
        else:
            newimage = newimage + '\n'
            imagelist.append(newimage)
            end = 1
            return

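We're sending the function the current index, which represents the =. We know the next character will be ", so we jump two characters and begin adding characters to a holding string named newimage until we reach the following ", at which point we're done. We add the string plus a newline character to our list imagelist and return, keeping in mind that there may be more image tags in the remaining HTML string, so we're right back in the middle of our searching loop.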
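The same find-the-marker-then-read-to-the-closing-quote idea can be sketched with str.find, which also avoids running off the end of the string when a closing quotation mark is missing. This is an alternative sketch, not the script described here; the function name extract_src_values is hypothetical, and it assumes every value of interest is introduced by the exact text src=".

```python
def extract_src_values(text):
    """Return every value that follows src=" up to the next double quote."""
    marker = 'src="'
    values = []
    start = text.find(marker)
    while start != -1:
        value_start = start + len(marker)
        value_end = text.find('"', value_start)
        if value_end == -1:
            # No closing quotation mark: stop rather than read past the end.
            break
        values.append(text[value_start:value_end])
        # Keep looking in the rest of the string for more image tags.
        start = text.find(marker, value_end + 1)
    return values


sample = '<p><img src="images/a.png" alt="A"><img alt="B" src="images/b.png"></p>'
print(extract_src_values(sample))
```

Unlike imagefound, this version returns the bare values and leaves adding newline characters to whatever code writes the list to a file.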

Here's what our output looks like now:

images/text-frame-link.png
images/text-frame-unlink.png
images/gimpoptions1.png
images/gimpoptions3.png
images/gimpoptions2.png
images/fontpref3.png
images/font-subst.png
images/fontpref2.png
images/fontpref1.png
images/dtp-studio.png

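Ahhh, much cleaner, and this only took a few seconds to run. I could have jumped seven more index spots to cut out the images/ part, but I like having it there to make sure I haven't chopped off the first letter of the image filename, and this is so easy to edit out with KWrite that you don't even need regex. After doing that and saving the file, the next step was to run another script I wrote called sortlist.py: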

#!/usr/bin/env python
# -*- coding: utf-8  -*-
# sortlist.py

# Read the image list; each line keeps its trailing newline.
with open('/tmp/imagelist_parse4.txt') as infile:
    imagelist = infile.readlines()

imagelist.sort()

# Write the sorted list to a second file.
with open('/tmp/imagelist_parse4_sorted.txt', 'w') as outfile:
    outfile.writelines(imagelist)

This pulls in the file contents as a list, sorts it, then saves it as another file. After that I could just do the following:

ls /home/gregp/development/Scribus15x/doc/en/images/*.png > '/tmp/actual_images.txt'

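Then I need to run sortlist.py on that file too, since the method ls uses to sort is different from Python's. I could have run a comparison script on these files, but I preferred to do this visually. In the end, I ended up with 42 images that had no HTML reference from the documentation.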
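For anyone who would rather not compare the two lists by eye, a set difference can do the comparison. The sketch below is an assumption about how such a script might look, not something from the original workflow; the function name unreferenced_images is hypothetical, and it compares bare filenames so that the images/ prefix in one list and the full path in the other do not matter.

```python
import os


def unreferenced_images(referenced_lines, actual_paths):
    """Return sorted filenames found on disk but never referenced in HTML."""
    referenced = {os.path.basename(line.strip())
                  for line in referenced_lines if line.strip()}
    actual = {os.path.basename(path.strip())
              for path in actual_paths if path.strip()}
    # Set difference: everything on disk minus everything referenced.
    return sorted(actual - referenced)


# Invented sample data standing in for the two text files.
referenced_lines = ['images/a.png\n', 'images/b.png\n']
actual_paths = ['/docs/en/images/a.png\n', '/docs/en/images/c.png\n']
print(unreferenced_images(referenced_lines, actual_paths))
```

With real data, the two arguments could be the lines read from the parsed image list and from the ls output. Since sets ignore order, neither file would need to be sorted first.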

Here is my parsing script in its entirety:

#!/usr/bin/env python
# -*- coding: utf-8  -*-
# parseimg4.py

import os

def imagefound(all_text, imagelist, index):
    end = 0
    index += 2
    newimage = ''
    while end == 0:
        if (all_text[index] != '"'):
            newimage = newimage + all_text[index]
            index += 1
        else:
            newimage = newimage + '\n'
            imagelist.append(newimage)
            end = 1
            return
        
htmlnames = []
imagelist = []
tempstring = ''
filenames = os.listdir('/home/gregp/development/Scribus15x/doc/en/')
for name in filenames:
    if name.endswith('.html'):
        htmlnames.append(name)
#print htmlnames
for htmlfile in htmlnames:
    all_text = open('/home/gregp/development/Scribus15x/doc/en/' + htmlfile).read()
    linelength = len(all_text)
    index = 3
    while index < linelength:
        if (all_text[index] == '='):
            if (all_text[index-3] == 's') and (all_text[index-2] == 'r') and \
               (all_text[index-1] == 'c'):
                imagefound(all_text, imagelist, index)
                index += 1
            else:
                index += 1
        else:
            index += 1

outfile = open('/tmp/imagelist_parse4.txt', 'w')
outfile.writelines(imagelist)
outfile.close()
imageno = len(imagelist)
print(str(imageno) + " images were found and saved")

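Its name, parseimg4.py, doesn't really reflect the number of scripts I wrote along the way, with both minor and major rewrites, plus discards and starting over. Notice that I've hardcoded these directory and file names, but it would be easy enough to generalize by asking the user for this information. Also, because these were working scripts, I sent the output to /tmp, so the files disappear once I reboot my system.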
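One possible way to generalize the hardcoded paths is to take them from the command line. The sketch below uses argparse from the standard library; the argument names docdir and --output and the function build_parser are hypothetical choices for illustration, with the default output path borrowed from the script above.

```python
import argparse


def build_parser():
    """Build a command-line parser for the directory and output file."""
    parser = argparse.ArgumentParser(
        description="List image references found in a directory of HTML files.")
    parser.add_argument("docdir",
                        help="directory containing the .html files")
    parser.add_argument("-o", "--output",
                        default="/tmp/imagelist_parse4.txt",
                        help="file to write the image list to")
    return parser


# Parsing an explicit list keeps this sketch runnable anywhere; a real
# script would call build_parser().parse_args() with no arguments.
args = build_parser().parse_args(["docs/en", "-o", "imagelist.txt"])
print(args.docdir, args.output)
```

The script's os.listdir and open calls would then use args.docdir and args.output in place of the literal paths.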

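This wasn't the end of the story, since the next question was: What about zombie HTML files? Any of these unused files might reference images not picked up by the previous method. We have a menu.xml file that serves as the table of contents (TOC) for the online manual, but I also needed to consider that some files listed in the TOC might reference files not in the TOC, and yes, I did find some.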

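I'll conclude by saying that this was a simpler task than the image search, and it was greatly helped by the processes I had already developed.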
