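Parsing URLs with Python's urllib.parse Library

The urllib.parse module in Python provides a set of functions for parsing and building URLs.

Parsing URLs

The urlparse() function parses a URL into a ParseResult object. The object has six elements:

Scheme (scheme)
Network location (netloc), which may include user info, host, and port
Path (path)
Path parameters (params)
Query string (query)
Fragment (fragment)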

from urllib.parse import urlparse

url = 'http://user:pwd@domain:80/path;params?query=queryarg#fragment'

parsed_result = urlparse(url)

print('parsed_result contains', len(parsed_result), 'elements')
print(parsed_result)

Output:

parsed_result contains 6 elements
ParseResult(scheme='http', netloc='user:pwd@domain:80', path='/path', params='params', query='query=queryarg', fragment='fragment')

ParseResult is a namedtuple subclass, so each part of the URL can be read either by index or by attribute name.
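As a small illustration of that dual access (a sketch that reuses the made-up URL from above; the variable names are only for this example), a parsed result can be indexed or unpacked like any other tuple:

```python
from urllib.parse import urlparse

parsed = urlparse('http://user:pwd@domain:80/path;params?query=queryarg#fragment')

# Index access and attribute access return the same values.
assert parsed[0] == parsed.scheme == 'http'
assert parsed[2] == parsed.path == '/path'

# Being a tuple, the result can also be unpacked into six variables.
scheme, netloc, path, params, query, fragment = parsed
print(scheme, netloc, path)
```

Attribute access is usually easier to read; unpacking is handy when all six parts are needed at once.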

For convenience, ParseResult also provides the username, password, hostname, and port attributes, which break netloc down further.

print('scheme  :', parsed_result.scheme)
print('netloc  :', parsed_result.netloc)
print('path    :', parsed_result.path)
print('params  :', parsed_result.params)
print('query   :', parsed_result.query)
print('fragment:', parsed_result.fragment)
print('username:', parsed_result.username)
print('password:', parsed_result.password)
print('hostname:', parsed_result.hostname)
print('port    :', parsed_result.port)

Output:

scheme  : http
netloc  : user:pwd@domain:80
path    : /path
params  : params
query   : query=queryarg
fragment: fragment
username: user
password: pwd
hostname: domain
port    : 80
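The derived attributes are not simple substrings of netloc. The sketch below, using a hypothetical URL with mixed case and no credentials or port, shows how they behave when a part is missing:

```python
from urllib.parse import urlparse

parsed = urlparse('HTTP://Example.COM/index.html')

scheme = parsed.scheme      # the scheme is normalized to lower case
netloc = parsed.netloc      # netloc keeps the case used in the URL
hostname = parsed.hostname  # hostname is always lower-cased
port = parsed.port          # None, because the URL has no explicit port
username = parsed.username  # None, because the URL has no user info

print(scheme, netloc, hostname, port, username)
```

Code that needs a port number should therefore handle None rather than assume a default such as 80.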

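Besides urlparse(), there is a similar function, urlsplit(), that also splits a URL. The difference is that urlsplit() does not separate the path parameters (params) from the path.

When several path segments carry parameters, urlparse() may not give the result you expect, because it only splits off the parameters of the last path segment: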

url = 'http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment'

parsed_result = urlparse(url)

print(parsed_result)
print('parsed.path    :', parsed_result.path)
print('parsed.params  :', parsed_result.params)

Output:

ParseResult(scheme='http', netloc='user:pwd@domain:80', path='/path1;params1/path2', params='params2', query='query=queryarg', fragment='fragment')
parsed.path    : /path1;params1/path2
parsed.params  : params2

In that case, urlsplit() can be used instead:

from urllib.parse import urlsplit
split_result = urlsplit(url)

print(split_result)
print('split.path    :', split_result.path)
# SplitResult has no params attribute

Output:

SplitResult(scheme='http', netloc='user:pwd@domain:80', path='/path1;params1/path2;params2', query='query=queryarg', fragment='fragment')
split.path    : /path1;params1/path2;params2
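Since urlsplit() leaves the parameters inside the path, they can be separated per segment afterwards. The following is a minimal sketch of one way to do that; segment_params() is a hypothetical helper written for this example, not part of urllib.parse:

```python
from urllib.parse import urlsplit


def segment_params(url):
    """Return (segment, params) pairs for each path segment of url.

    Hypothetical helper for illustration; urllib.parse has no such function.
    """
    path = urlsplit(url).path
    pairs = []
    for segment in path.split('/'):
        if not segment:
            continue  # skip the empty piece before the leading '/'
        # Everything after the first ';' is treated as that segment's params.
        name, _, params = segment.partition(';')
        pairs.append((name, params))
    return pairs


url = 'http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment'
pairs = segment_params(url)
print(pairs)
```

For the URL above this yields one pair per segment, so both params1 and params2 are preserved.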

If you only need to split the fragment identifier off the end of a URL, use urldefrag():

from urllib.parse import urldefrag

url = 'http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment'

d = urldefrag(url)
print(d)
print('url     :', d.url)
print('fragment:', d.fragment)

Output:

DefragResult(url='http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg', fragment='fragment')
url     : http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg
fragment: fragment

Building URLs

Both ParseResult and SplitResult objects have a geturl() method that returns the complete URL as a string.

print(parsed_result.geturl())
print(split_result.geturl())

Output:

http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment
http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment

However, geturl() exists only on ParseResult and SplitResult objects. To assemble a URL from an ordinary six-element tuple, use urlunparse():

from urllib.parse import urlunparse
url_compos = ('http', 'user:pwd@domain:80', '/path1;params1/path2', 'params2', 'query=queryarg', 'fragment')
print(urlunparse(url_compos))

Output:

http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment
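A common reason to rebuild a URL is to change just one of its parts. Because the result objects are named tuples, the standard namedtuple method _replace() can do this. The sketch below uses a made-up URL and also shows urlunsplit(), the five-element counterpart of urlunparse():

```python
from urllib.parse import urlsplit, urlunsplit

parts = urlsplit('http://user:pwd@domain:80/path?query=queryarg#fragment')

# _replace() returns a new SplitResult; the original object is unchanged.
https_parts = parts._replace(scheme='https', fragment='')
new_url = https_parts.geturl()
print(new_url)

# urlunsplit() accepts any five-element sequence, including a SplitResult.
same_url = urlunsplit(https_parts)
print(same_url)
```

An empty fragment is simply omitted when the URL is reassembled, so no trailing # appears.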

Converting Relative Paths to Absolute URLs

In addition, urllib.parse provides urljoin(), which resolves a relative path against a base URL and returns an absolute URL.

from urllib.parse import urljoin

print(urljoin('http://www.example.com/path/file.html', 'anotherfile.html'))
print(urljoin('http://www.example.com/path/', 'anotherfile.html'))
print(urljoin('http://www.example.com/path/file.html', '../anotherfile.html'))
print(urljoin('http://www.example.com/path/file.html', '/anotherfile.html'))

Output:

http://www.example.com/path/anotherfile.html
http://www.example.com/path/anotherfile.html
http://www.example.com/anotherfile.html
http://www.example.com/anotherfile.html
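urljoin() also accepts references that are not plain relative paths. The following sketch, with hypothetical host names, covers three such cases that often come up when resolving links found in a page:

```python
from urllib.parse import urljoin

base = 'http://www.example.com/path/file.html'

# A full URL as the second argument replaces the base entirely.
absolute = urljoin(base, 'http://other.example.org/index.html')

# A scheme-relative reference keeps only the scheme of the base.
scheme_relative = urljoin(base, '//cdn.example.org/lib.js')

# A bare query string keeps the base path and sets the query.
query_only = urljoin(base, '?page=2')

print(absolute)
print(scheme_relative)
print(query_only)
```

The first case matters when the second argument comes from untrusted input, since the result may point at a different host than the base.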

Building and Parsing Query Strings

urlencode() converts a dict into a valid query string:

from urllib.parse import urlencode

query_args = {
    'name': 'dark sun',
    'country': '中國'
}

query_args = urlencode(query_args)
print(query_args)

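Output:

name=dark+sun&country=%E4%B8%AD%E5%9C%8B

As you can see, the special characters are escaped correctly: the space becomes + and the non-ASCII characters are percent-encoded as UTF-8 bytes.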
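urlencode() has a few options that are useful beyond the basic dict case. The sketch below, using made-up keys and values, shows the doseq and quote_via parameters and passing a list of pairs instead of a dict:

```python
from urllib.parse import urlencode, quote

# doseq=True expands a sequence value into one key=value pair per item.
multi = urlencode({'tag': ['python', 'url']}, doseq=True)

# quote_via=quote encodes spaces as %20 instead of '+'.
spaces = urlencode({'name': 'dark sun'}, quote_via=quote)

# A list of (key, value) pairs allows the same key to appear more than once.
pairs = urlencode([('a', 1), ('a', 2)])

print(multi)
print(spaces)
print(pairs)
```

Without doseq=True, a list value would be converted with str() and escaped as a single value, which is rarely what is wanted.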

Conversely, parse_qs() parses a query string into a dict:

from urllib.parse import parse_qs
print(parse_qs(query_args))

Output:

{'name': ['dark sun'], 'country': ['中國']}
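Each value in the parse_qs() result is a list, because a key may occur more than once in a query string. The sketch below, using a made-up query string, shows repeated keys, blank values, and parse_qsl(), which returns a list of pairs instead of a dict:

```python
from urllib.parse import parse_qs, parse_qsl

query = 'tag=python&tag=url&empty=&name=dark+sun'

# Repeated keys are collected into one list; blank values are dropped by default.
as_dict = parse_qs(query)

# keep_blank_values=True keeps keys whose value is empty.
with_blanks = parse_qs(query, keep_blank_values=True)

# parse_qsl() preserves the original order as (key, value) pairs.
as_pairs = parse_qsl(query)

print(as_dict)
print(with_blanks)
print(as_pairs)
```

parse_qsl() pairs well with urlencode(), which accepts the same list-of-pairs form.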

If you only need to escape special characters, use quote() or quote_plus(). Both percent-encode characters such as :, but quote() leaves / unescaped by default (its safe parameter defaults to '/'), while quote_plus() escapes / as well and encodes spaces as +.

from urllib.parse import quote, quote_plus, urlencode

url = 'http://localhost:1080/~hello!/'
print('urlencode :', urlencode({'url': url}))
print('quote     :', quote(url))
print('quote_plus:', quote_plus(url))

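Output (on Python 3.7 and later; earlier versions also escape ~ as %7E):

urlencode : url=http%3A%2F%2Flocalhost%3A1080%2F~hello%21%2F
quote     : http%3A//localhost%3A1080/~hello%21/
quote_plus: http%3A%2F%2Flocalhost%3A1080%2F~hello%21%2F

The urlencode() result matches quote_plus() because urlencode() uses quote_plus() for escaping by default.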
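The difference between the two functions is easiest to see on a string that contains both a space and a slash. The following sketch uses a made-up input and also shows the safe parameter of quote():

```python
from urllib.parse import quote, quote_plus

text = 'a b/c'

# quote() encodes the space as %20 and leaves '/' alone (safe='/').
quoted = quote(text)

# quote_plus() encodes the space as '+' and escapes '/' too.
quoted_plus = quote_plus(text)

# Passing safe='' makes quote() escape '/' while still using %20 for spaces.
quoted_strict = quote(text, safe='')

print(quoted)
print(quoted_plus)
print(quoted_strict)
```

As a rule of thumb, quote() suits path components and quote_plus() suits form-style query values.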

The reverse operation is done with unquote() or unquote_plus():

from urllib.parse import unquote, unquote_plus

encoded_url = 'http%3A%2F%2Flocalhost%3A1080%2F%7Ehello%21%2F'
print(unquote(encoded_url))
print(unquote_plus(encoded_url))

Output:

http://localhost:1080/~hello!/
http://localhost:1080/~hello!/

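Here unquote() also recovers the original string from quote_plus() output, but only because the string contains no +. unquote() leaves + as it is, whereas unquote_plus() turns it back into a space.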
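A short sketch with a made-up string that contains spaces makes the difference between the two decoding functions visible:

```python
from urllib.parse import quote_plus, unquote, unquote_plus

encoded = quote_plus('dark sun & moon')

# unquote() decodes %XX escapes but does not touch '+'.
plain = unquote(encoded)

# unquote_plus() additionally converts '+' back into a space.
plain_plus = unquote_plus(encoded)

print(encoded)
print(plain)
print(plain_plus)
```

So text encoded with quote_plus() or urlencode() should be decoded with unquote_plus(), or with parse_qs() for whole query strings.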

This article is reposted from Linux 中國 (Linux China): https://github.com/Linux-CN/archive
