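A Simple Guide to Web Scraping with Python and Beautiful Soup

Today we will look at how to use the Beautiful Soup library to extract content from an HTML page, and then convert that content into a Python list or dictionary.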

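What is web scraping, and why do I need it?

The answer is simple: not every website offers an API for its content. You might want to pull recipes from your favorite cooking site, or photos from a travel blog. Without an API, extracting the HTML (that is, scraping) may be the only way to get that content. I will show you how to do it in Python.

Not all websites welcome scraping, and some explicitly prohibit it. Check with the website owner that scraping is acceptable.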
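One partial, machine-readable hint about a site's wishes is its robots.txt file. The sketch below is an addition, not something the original article covers. It uses Python's standard urllib.robotparser against a made-up rule set: SAMPLE_ROBOTS_TXT, the my-scraper user agent, and the example.com URLs are all hypothetical. It does not replace asking the site owner or reading the site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A path outside /private/ is permitted by the sample rules.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/technology"))
    # A path under /private/ is disallowed by the sample rules.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/notes"))
```

For a real site, RobotFileParser can also load the live file with set_url() and read() instead of parsing an inline string.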

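How do I scrape a website with Python?

Scraping with Python involves three basic steps:

1. Get the HTML content using the requests library.
2. Analyze the HTML structure and identify the tags that contain the content we need.
3. Extract the tags using Beautiful Soup and put the data into a Python list.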

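Installing the libraries

First, install the libraries we need. The requests library fetches the HTML content from a website, and Beautiful Soup parses the HTML and converts it into Python objects. To install both for Python 3, run: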

pip3 install requests beautifulsoup4
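To confirm the install worked before going further, a small check like the following may help. It is only a sketch: missing_modules is a hypothetical helper name, and it relies on the standard importlib.util.find_spec to test whether a module can be found without importing it. Note that the beautifulsoup4 package is imported under the name bs4.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    # The beautifulsoup4 package installs a module named "bs4".
    missing = missing_modules(["requests", "bs4"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("requests and bs4 are both available.")
```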

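Fetching the HTML

For this example, I will scrape the Technology section of the site. If you go to that page, you will see a list of articles, each with a title, an excerpt, and a publication date. Our goal is to build a list of articles containing that information.

The full URL of the page is: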

https://notes.ayushsharma.in/technology

We can fetch the HTML content of this page using requests:

#!/usr/bin/python3
import requests

url = 'https://notes.ayushsharma.in/technology'

data = requests.get(url)

print(data.text)

The variable data holds the response, and data.text contains the HTML source of the page.
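The call to requests.get(url) above has no timeout and does not check the HTTP status, so a stalled server or an error page would go unnoticed. A slightly more defensive variant might look like the sketch below. The fetch_html name and the 10-second default are my own assumptions, not part of the original script; timeout= and raise_for_status() are standard requests features.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML body of url, raising an exception on failure."""
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into requests.HTTPError instead of returning an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://notes.ayushsharma.in/technology') would then return the same text as data.text, or raise a requests.exceptions.RequestException subclass if something went wrong.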

Extracting content from the HTML

To extract our data from the HTML in data, we need to identify which tags hold the content we need.

If you skim through the HTML, you will find this section near the top:

<div class="col">
  <a href="/2021/08/using-variables-in-jekyll-to-define-custom-content" class="post-card">
    <div class="card">
      <div class="card-body">
        <h5 class="card-title">Using variables in Jekyll to define custom content</h5>
        <small class="card-text text-muted">I recently discovered that Jekyll's config.yml can be used to define custom
          variables for reusing content. I feel like I've been living under a rock all this time. But to err over and
          over again is human.</small>
      </div>
      <div class="card-footer text-end">
        <small class="text-muted">Aug 2021</small>
      </div>
    </div>
  </a>
</div>

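This section repeats throughout the page, once for every article. We can see that .card-title contains the article title, .card-text contains the excerpt, and .card-footer > small contains the publication date.

Let's extract these using Beautiful Soup.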

#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://notes.ayushsharma.in/technology'
data = requests.get(url)

my_data = []

html = BeautifulSoup(data.text, 'html.parser')
articles = html.select('a.post-card')  # one match per article card

for article in articles:
    # select() returns a list, so take the first match for each field
    title = article.select('.card-title')[0].get_text()
    excerpt = article.select('.card-text')[0].get_text()
    pub_date = article.select('.card-footer small')[0].get_text()

    my_data.append({"title": title, "excerpt": excerpt, "pub_date": pub_date})

pprint(my_data)

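The code above extracts the article information and puts it in the my_data variable. I use pprint to pretty-print the output, but you can leave it out of your own code. Save the code above in a file named fetch.py, and then run it: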

python3 fetch.py

If everything went well, you should see this:

[{'excerpt': "I recently discovered that Jekyll's config.yml can be used to "
             "define custom variables for reusing content. I feel like I've "
             'been living under a rock all this time. But to err over and over '
             'again is human.',
  'pub_date': 'Aug 2021',
  'title': 'Using variables in Jekyll to define custom content'},
 {'excerpt': "In this article, I'll highlight some ideas for Jekyll "
             'collections, blog category pages, responsive web-design, and '
             'netlify.toml to make static website maintenance a breeze.',
  'pub_date': 'Jul 2021',
  'title': 'The evolution of ayushsharma.in: Jekyll, Bootstrap, Netlify, '
           'static websites, and responsive design.'},
 {'excerpt': "These are the top 5 lessons I've learned after 5 years of "
             'Terraform-ing.',
  'pub_date': 'Jul 2021',
  'title': '5 key best practices for sane and usable Terraform setups'},

... (truncated)

That's all there is to it! In these 22 lines of code, we built a web scraper in Python. You can find the source code in my example repository.
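One fragile spot in the script is the [0] indexing: if a card on the page is missing one of the elements, select() returns an empty list and the loop stops with an IndexError. The sketch below shows one possible way around that, using Beautiful Soup's select_one(), which returns None when nothing matches. The SAMPLE_HTML cards, text_or_none, and parse_articles are hypothetical names and data made up for illustration, modeled on the snippet shown earlier so the example runs offline.

```python
from bs4 import BeautifulSoup

# Two made-up cards modeled on the page's markup; the second one
# deliberately lacks a .card-text element to show the failure mode.
SAMPLE_HTML = """
<a href="/first" class="post-card">
  <h5 class="card-title">First post</h5>
  <small class="card-text text-muted">A short excerpt.</small>
  <div class="card-footer text-end"><small class="text-muted">Aug 2021</small></div>
</a>
<a href="/second" class="post-card">
  <h5 class="card-title">Second post</h5>
  <div class="card-footer text-end"><small class="text-muted">Jul 2021</small></div>
</a>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match for selector, or None."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


def parse_articles(html_text):
    """Build one dict per a.post-card element, tolerating missing fields."""
    soup = BeautifulSoup(html_text, "html.parser")
    return [
        {
            "title": text_or_none(card, ".card-title"),
            "excerpt": text_or_none(card, ".card-text"),
            "pub_date": text_or_none(card, ".card-footer small"),
        }
        for card in soup.select("a.post-card")
    ]


if __name__ == "__main__":
    for article in parse_articles(SAMPLE_HTML):
        print(article)
```

Whether a missing field should become None, an empty string, or cause the card to be skipped depends on what you plan to do with the data afterwards.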

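Summary

With the website content in a Python list, we can now do cool things with it. We could return it as JSON for another application, or convert it to HTML with custom styling. Feel free to copy and paste the code above and experiment with your favorite website.

Have fun, and keep coding.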
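Both ideas can be sketched with the standard library alone. The example below is one possible approach under stated assumptions, not the author's implementation: sample_data is a hand-written stand-in with the same shape as my_data (its excerpt is invented to show escaping), and to_json and to_html are hypothetical helper names. Scraped text is passed through html.escape so that markup in an excerpt cannot break the generated page.

```python
import json
from html import escape

# Hand-written stand-in with the same keys as my_data; the excerpt is made up.
sample_data = [
    {
        "title": "Using variables in Jekyll to define custom content",
        "excerpt": "An invented excerpt with <b>markup</b> & symbols.",
        "pub_date": "Aug 2021",
    },
]


def to_json(articles):
    """Serialize the list of article dicts to a JSON string."""
    return json.dumps(articles, indent=2, ensure_ascii=False)


def to_html(articles):
    """Render the articles as an HTML list, escaping all scraped text."""
    items = []
    for article in articles:
        items.append(
            "  <li><strong>{title}</strong> ({pub_date}): {excerpt}</li>".format(
                title=escape(article["title"]),
                pub_date=escape(article["pub_date"]),
                excerpt=escape(article["excerpt"]),
            )
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    print(to_json(sample_data))
    print(to_html(sample_data))
```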

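This article was originally published on the author's personal blog and has been adapted with permission.

via: https://opensource.com/article/21/9/web-scraping-python-beautiful-soup

Author: Ayush Sharma. Topic selection: lujun9972. Translator: MjSeven. Proofreader: wxy.

This article was originally compiled by LCTT and is proudly presented by Linux China.

This article is reposted from Linux China: https://github.com/Linux-CN/archive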
