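[Python Web Scraping Notes] BeautifulSoup Library - part 3

Thanks for clicking into this post! I'm LukeTseng, an unknown creator who loves information technology. My university recently opened a course on programming for big data analysis that covers web scraping concepts, which sparked my interest, so I put together this series of notes.

Disclaimer: these notes are for personal study only; use them at your own discretion.

These notes were written in a VSCode environment, and some modules (libraries) need to be installed separately.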

Installing the BeautifulSoup module

If you use Google Colab or Anaconda, no installation is needed.

Command:

pip install beautifulsoup4

Importing the BeautifulSoup module

from bs4 import BeautifulSoup

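Why do we use BeautifulSoup?

BeautifulSoup's main purpose is to parse HTML and XML, converting web page content into a structured tree that a program can work with.

In practice, that means it is mostly used to parse web pages and extract data from them.

In the world of web scraping, the two indispensable tools are the requests module and BeautifulSoup; with BeautifulSoup we can go on to extract and analyze the information we want.

For example, to get the total view count of every post on a personal blog, we could walk through the sitemap, open each post one by one, scrape its view count, and sum the counts up.

A first BeautifulSoup program

Using my blog as an example: https://luketsengtw.github.io/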

import requests
from bs4 import BeautifulSoup

url = 'https://luketsengtw.github.io/'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.title)

Output:

<title>Yaoの程式小窩 - 只想好好學程式</title>

To drop the <title></title> tags, append the .string attribute.

import requests
from bs4 import BeautifulSoup

url = 'https://luketsengtw.github.io/'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.title.string) # append .string

Output:

Yaoの程式小窩 - 只想好好學程式
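One caveat about .string is worth a quick sketch: it only returns text when the tag has a single child. When a tag mixes text with nested tags, .string is None, and get_text() is the safer choice. The HTML below is made up for illustration and is not taken from the blog.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this illustration only.
html = "<title>Demo page</title><p>This is <strong>important</strong> text.</p>"
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string  # single text child: returns the text itself
p_string = soup.p.string        # text mixed with <strong>: returns None
p_text = soup.p.get_text()      # joins every nested string

print(title_text)  # Demo page
print(p_string)    # None
print(p_text)      # This is important text.
```

So .string is fine for simple tags such as <title>, while get_text() handles tags with nested markup.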

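Parser

The parser is the second argument of BeautifulSoup(); it turns the HTML source into a tag tree that the program can operate on.

Python's built-in HTML parser is html.parser; any other parser has to be installed separately.

Two common ones are lxml and html5lib.

To install them, run: pip install lxml html5lib

The table below gives a quick overview of what each parser offers:

Parser	Speed	Accuracy	Fault tolerance
html.parser	Medium	Lowest	Lowest
lxml	Fastest	High	High
html5lib	Slowest	Highest	Highest

lxml is the usual choice of parser. If you are still learning and don't want to install anything extra, html.parser is enough for what follows.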
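To see the fault-tolerance difference in practice, here is a minimal sketch that feeds the same broken markup to each parser. The input string is made up for the test, and lxml and html5lib are optional: the loop records None for any parser that is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # deliberately invalid: a stray closing </p>

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        results[parser] = None

for parser, markup in results.items():
    shown = markup if markup is not None else "(not installed)"
    print(f"{parser:12} -> {shown}")
```

html.parser simply drops the stray </p> and yields <a></a>. The other two parsers, when installed, repair the markup into a fuller document, so the same selector can match different things depending on the parser.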

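Common BeautifulSoup methods

Search methods

find() and find_all() are the most frequently used methods in BeautifulSoup, so they are introduced first.

Both search the tree for matching tags: find() returns only the first match even when several exist, while find_all() returns every match.

Here is an example: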

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div class="container">
      <h2>標題</h2>
      <p class="content">第一段內容</p>
      <p class="content">第二段內容</p>
      <a href="https://example.com">連結一</a>
      <a href="https://google.com">連結二</a>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')

# Find the first <p> element whose class is 'content'
first_p = soup.find('p', class_='content')
print(first_p.text)

# Find every <p> element whose class is 'content'
all_p = soup.find_all('p', class_='content')
for p in all_p:
    print(p.text)

# Find every <a> link in the document that has an href attribute
links = soup.find_all('a', href=True)
for link in links:
    print(f"連結文字: {link.text}, 網址: {link['href']}")

Output:

第一段內容
第一段內容
第二段內容
連結文字: 連結一, 網址: https://example.com
連結文字: 連結二, 網址: https://google.com
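The example above always finds a match. As a small sketch of the other cases: find() returns None when nothing matches, find_all() returns an empty list-like result, and the limit argument caps the number of matches. The HTML here is invented for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this illustration only.
html = (
    '<div><p class="content">one</p>'
    '<p class="content">two</p>'
    '<p class="content">three</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

missing = soup.find("table")    # no <table> in the document: None
empty = soup.find_all("table")  # no match: empty list-like ResultSet
first_two = soup.find_all("p", class_="content", limit=2)  # stop after 2 matches

# Guard against None before touching attributes such as .text
table_text = missing.text if missing is not None else "(no table found)"

print(table_text)                   # (no table found)
print(len(empty))                   # 0
print([p.text for p in first_two])  # ['one', 'two']
```

Checking for None first avoids the AttributeError that missing.text would otherwise raise.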

CSS Selector

CSS selectors are used through the select() method.

Here is the test data:

from bs4 import BeautifulSoup

html_content = """
<!DOCTYPE html>
<html>
<head>
    <title>CSS 選擇器練習範例</title>
</head>
<body>
    <header id="main-header" class="site-header">
        <h1 class="title">網站標題</h1>
        <nav class="navigation">
            <a href="/home" class="nav-link active">首頁</a>
            <a href="/about" class="nav-link">關於我們</a>
            <a href="/contact" class="nav-link">聯絡我們</a>
        </nav>
    </header>
    
    <main class="container">
        <article id="article-1" class="post featured">
            <h2 class="post-title">第一篇文章</h2>
            <p class="post-content">這是第一篇文章的內容。</p>
            <div class="meta">
                <span class="author" data-name="A">作者：A</span>
                <span class="date" data-published="2025-01-01">日期：2025-01-01</span>
            </div>
        </article>
        
        <article id="article-2" class="post">
            <h2 class="post-title">第二篇文章</h2>
            <p class="post-content highlight">這是第二篇文章的重點內容。</p>
            <div class="meta">
                <span class="author" data-name="B">作者：B</span>
                <span class="date" data-published="2025-01-02">日期：2025-01-02</span>
            </div>
        </article>
        
        <aside class="sidebar">
            <div class="widget recent-posts">
                <h3>最新文章</h3>
                <ul>
                    <li><a href="/post1">文章一</a></li>
                    <li><a href="/post2">文章二</a></li>
                    <li><a href="/post3">文章三</a></li>
                </ul>
            </div>
        </aside>
    </main>
    
    <footer id="main-footer" class="site-footer">
        <p>&copy; 2025 練習網站</p>
    </footer>
</body>
</html>
"""

soup = BeautifulSoup(html_content, 'lxml')

Tag selector: the example below selects every h2 tag and every p paragraph tag in the HTML.

Note: select() returns a list-like object, so for h2 in h2_tags: below iterates over the elements in that list.

# Select all h2 tags
h2_tags = soup.select('h2')
print("所有 h2 標籤:")
for h2 in h2_tags:
    print(f"  {h2.text}")

# Select all paragraphs
paragraphs = soup.select('p')
print(f"\n找到 {len(paragraphs)} 個段落")

Output:

所有 h2 標籤:
  第一篇文章
  第二篇文章

找到 3 個段落
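When only the first match is needed, select_one() is a handy companion to select(). The sketch below uses a tiny made-up document rather than the test data above: select_one() returns a single tag, or None when nothing matches, instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this illustration only.
html = '<main><h2 class="post-title">First</h2><h2 class="post-title">Second</h2></main>'
soup = BeautifulSoup(html, "html.parser")

all_h2 = soup.select("h2")        # list-like: every match
first_h2 = soup.select_one("h2")  # a single Tag: the first match
no_h3 = soup.select_one("h3")     # None, because there is no <h3>

print(len(all_h2))    # 2
print(first_h2.text)  # First
print(no_h3)          # None
```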

ID selector: taking <header id="main-header" class="site-header"> as an example, the <header> tag can be selected through its id.

To select a specific ID, prefix it with a # (hash) sign.

# Select the element with a specific ID
header = soup.select('#main-header')
print(f"\n標頭內容: {header[0].find('h1').text}")

# Select a specific article by ID
article1 = soup.select('#article-1')
print(f"第一篇文章標題: {article1[0].find('h2').text}")

Output:

標頭內容: 網站標題
第一篇文章標題: 第一篇文章

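Class selector: as the name suggests, it selects by the value of class.

Unlike the ID selector, the class selector uses a . (half-width period) as its prefix.

Note: .get('href') in the code below is a BeautifulSoup method that returns the value of a tag's attribute.

An attribute value can also be read directly with link['href']. The difference is that this form is less safe (it raises a KeyError when the attribute is missing), whereas .get() simply returns None when the attribute is not found.

# Select every element that has the post class
posts = soup.select('.post')
print(f"\n找到 {len(posts)} 篇文章:")
for post in posts:
    title = post.find('h2').text
    print(f"  {title}")

# Select the navigation links
nav_links = soup.select('.nav-link')
print(f"\n找到 {len(nav_links)} 個導航連結:")
for link in nav_links:
    print(f"  {link.text} -> {link.get('href')}") # get() returns the attribute value

Output: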

找到 2 篇文章:
  第一篇文章
  第二篇文章

找到 3 個導航連結:
  首頁 -> /home
  關於我們 -> /about
  聯絡我們 -> /contact
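The difference between link['href'] and link.get('href') only shows up when the attribute is missing, which never happens in the test data above. A minimal sketch with a made-up link that has no href:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the second link deliberately has no href attribute.
html = '<a href="/home" class="nav-link">Home</a><a class="nav-link">No link</a>'
soup = BeautifulSoup(html, "html.parser")
with_href, without_href = soup.select(".nav-link")

safe_value = without_href.get("href")       # missing attribute: None
fallback = without_href.get("href", "#")    # optional default value

try:
    without_href["href"]                    # missing attribute: KeyError
    outcome = "no error"
except KeyError:
    outcome = "KeyError"

print(with_href["href"])  # /home
print(safe_value)         # None
print(fallback)           # #
print(outcome)            # KeyError
```

For pages you don't control, .get() with a default is usually the more forgiving choice.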

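Compound selectors: several selectors can be combined at once, for example a tag plus a class, or two classes that an element must carry at the same time. See the example:

# Tag + class selector
post_titles = soup.select('h2.post-title')
print("\n文章標題:")
for title in post_titles:
    print(f"  {title.text}")

# Multiple class selectors (the element carries both classes)
featured_posts = soup.select('.post.featured')
print(f"\n精選文章數量: {len(featured_posts)}")

Output: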

文章標題:
  第一篇文章
  第二篇文章

精選文章數量: 1

Descendant selector: selects tags nested anywhere under another tag. For example, to find the <p> inside <article>, this selector returns every <p> tag inside <article>.

To use it, put a single space between the two selectors.

Here is an example:

# Select all span tags inside main
meta_spans = soup.select('main span')
print("\n文章元資訊:")
for span in meta_spans:
    print(f"  {span.text}")

# Select the p tags inside article
article_paragraphs = soup.select('article p')
print(f"\n文章段落數: {len(article_paragraphs)}")

Output:

文章元資訊:
  作者：A
  日期：2025-01-01
  作者：B
  日期：2025-01-02

文章段落數: 2
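Two related selectors are not covered in these notes but follow the same pattern, so here is a hedged sketch: the child combinator (>) matches direct children only, and the attribute selector ([name="value"]) matches on attributes such as data-name. The HTML is a made-up fragment modeled loosely on the test data.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this illustration only.
html = """
<main>
  <article id="article-1">
    <p>direct child</p>
    <div class="meta">
      <p>nested deeper</p>
      <span class="author" data-name="A">Author A</span>
      <span class="author" data-name="B">Author B</span>
    </div>
  </article>
</main>
"""
soup = BeautifulSoup(html, "html.parser")

descendants = soup.select("article p")         # any depth below <article>
children = soup.select("article > p")          # direct children of <article> only
author_a = soup.select('span[data-name="A"]')  # match on an attribute value

print(len(descendants))             # 2
print([p.text for p in children])   # ['direct child']
print([s.text for s in author_a])   # ['Author A']
```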

Extracting plain text with BeautifulSoup

.get_text() returns the text inside a tag rather than the tag itself (for <p>123</p> it returns 123).

from bs4 import BeautifulSoup

html = """
<div class="article">
    <h1>文章標題</h1>
    <p>這是第一段落。</p>
    <p>這是<strong>重要</strong>的第二段落。</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Extract plain text (including the text of all child elements)
article = soup.find('div', class_='article')
print(article.get_text())

Output:

文章標題
這是第一段落。
這是重要的第二段落。
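The raw get_text() result keeps the newlines and indentation of the source HTML. As a small sketch, get_text() also accepts separator and strip arguments to tidy this up. The snippet is an English stand-in for the example above, and the "|" separator is an arbitrary choice made to keep the pieces visible.

```python
from bs4 import BeautifulSoup

# English stand-in for the example above, invented for this illustration.
html = """
<div class="article">
    <h1>Title</h1>
    <p>First paragraph.</p>
    <p>The <strong>important</strong> second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
article = soup.find("div", class_="article")

raw = article.get_text()                               # keeps newlines and indentation
compact = article.get_text(separator="|", strip=True)  # strips each piece, joins with "|"

print(repr(raw))
print(compact)  # Title|First paragraph.|The|important|second paragraph.
```

Note that the separator is also inserted around inline tags such as <strong>, so a sentence can be split into several pieces.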

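A small application: BeautifulSoup with Requests

Site to scrape: https://www.nptu.edu.tw/p/412-1000-2972.php?Lang=zh-tw

The goal is to scrape all academic units of National Pingtung University (國立屏東大學), including the colleges and the departments under them.

Note: this scraper is for learning purposes only and has no other use.

I recommend working in Colab or Jupyter Notebook; running it in a local environment may run into SSL certificate problems.

You can also right-click anywhere on the site and choose Inspect to open the developer tools. The arrow icon at the top left lets you pick any element, and after picking it the panel jumps to that element's line in the HTML source.

Sample code: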

import requests
from bs4 import BeautifulSoup

url = "https://www.nptu.edu.tw/p/412-1000-2972.php?Lang=zh-tw" # target URL
html = requests.get(url) # send a GET request to the target URL
html.encoding = "utf-8" # set the correct encoding
soup = BeautifulSoup(html.text, "lxml") # build the BeautifulSoup object with the lxml parser

main_content = soup.find("div", class_="main") # main content area
if main_content:
    # Get all the text content and split it into lines
    full_text = main_content.get_text()
    lines = [line.strip() for line in full_text.split('\n') if line.strip()]

    # Find where the list of academic units ("學術單位列表") starts
    start_idx = -1
    for i, line in enumerate(lines):
        if "學術單位列表" in line:
            start_idx = i
            break

    if start_idx != -1:
        relevant_lines = lines[start_idx+1:]  # skip the "學術單位列表" line itself

        current_college = None
        departments = []

        for line in relevant_lines:
            # A line containing "學院" (college) that does not end with "系" starts a new college
            if "學院" in line and not line.endswith("系"):
                # If a college was already collected, print it first
                if current_college and departments:
                    print(f"{current_college}:")
                    for dept in departments:
                        print(f"  - {dept}")
                    print()  # blank line as a separator

                # Start a new college
                current_college = line
                departments = []

            # A line containing "學系", "研究所", "中心" or "學程" is a department
            elif any(keyword in line for keyword in ["學系", "研究所", "中心", "學程"]):
                if current_college:  # make sure a college is currently open
                    departments.append(line)

            # Stop once the campus ("校區") information is reached
            elif "校區" in line:
                break

        # Handle the last college
        if current_college and departments:
            print(f"{current_college}:")
            for dept in departments:
                print(f"  - {dept}")

Output:

管理學院:
  - 商業大數據學系(含碩士班)
  - 行銷與流通管理學系(含碩士班)
  - 休閒事業經營學系(含碩士班)
  - 不動產經營學系(含碩士班)
  - 企業管理學系(含碩士班)
  - 國際經營與貿易學系(含碩士班)
  - 財務金融學系(含碩士班)
  - 會計學系
  - 大數據商務應用學士學位學程(113學年停招)

資訊學院:
  - 電腦與通訊學系
  - 資訊工程學系(含碩士班)
  - 電腦科學與人工智慧學系(含碩士班)
  - 資訊管理學系(含碩士班)
  - 智慧機器人學系
  - 國際資訊科技與應用碩士學位學程

教育學院:
  - 教育行政研究所(含博碩士班)
  - 教育心理與輔導學系(含碩士班)
  - 教育學系(含碩士班)
  - 特殊教育學系(含碩士班)
  - 幼兒教育學系(含碩士班)
  - STEM教育國際碩士學位學程
  - 特殊教育中心
  - 社區諮商中心
  - 文教事業經營碩士在職學位學程(110學年停招)

人文社會學院:
  - 視覺藝術學系(含碩士班)
  - 音樂學系(含碩士班)
  - 文化創意產業學系(含碩士班)
  - 社會發展學系(含碩士班)
  - 中國語文學系(含碩士班)
  - 應用日語學系
  - 應用英語學系
  - 英語學系(含碩士班)
  - 文化發展學士學位學程原住民專班
  - 文化事業發展碩士學位學程原住民專班
  - 客家文化產業碩士學位學程
  - 客家研究中心
  - 原住民族教育研究中心
  - 藝文中心

理學院:
  - 科學傳播學系(含科學傳播暨教育碩士班)
  - 應用化學系(含碩士班)
  - 應用數學系(含碩士班)
  - 體育學系(含碩士班)

國際學院:
  - 東南亞發展中心
  - 華語教學中心

大武山學院:
  - 共同教育中心
  - 博雅教育中心
  - 跨領域學程中心
  - EMI發展中心
  - 大武山社會實踐暨永續發展中心
  - 新媒體創意應用碩士學位學程
  - 大武山跨領域學士學位學程
  - 師資培育中心
  - 師資培育中心
  - 教育學程組
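The grouping logic above can also be tried without any network access. The sketch below pulls it into a hypothetical helper, group_departments() (a name introduced here, not part of the original script), and runs it on a few hand-written sample lines. Unlike the original, it returns a dict instead of printing, and it keeps colleges that end up with no departments.

```python
def group_departments(lines):
    """Group department lines under the most recent college line.

    Hypothetical helper mirroring the rules of the script above:
    "學院" opens a college, the department keywords add an entry,
    and "校區" (campus info) ends the scan.
    """
    keywords = ("學系", "研究所", "中心", "學程")
    colleges = {}
    current = None
    for line in lines:
        if "學院" in line and not line.endswith("系"):
            current = line
            colleges.setdefault(current, [])
        elif any(keyword in line for keyword in keywords):
            if current:  # ignore departments seen before any college
                colleges[current].append(line)
        elif "校區" in line:
            break
    return colleges


# Hand-written sample lines; the last two are made up to exercise the stop rule.
sample = [
    "理學院",
    "應用化學系(含碩士班)",
    "體育學系(含碩士班)",
    "國際學院",
    "華語教學中心",
    "某某校區",
    "不應出現的中心",
]

grouped = group_departments(sample)
for college, departments in grouped.items():
    print(f"{college}: {departments}")
```

Keeping the parsing rules in a pure function like this makes them easy to check against sample text before pointing the scraper at the live page.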

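Summary

BeautifulSoup is a Python library for parsing HTML and XML; it converts web page content into a tree structure that a program can work with. It is mainly used for web scraping and web data extraction.

A first BeautifulSoup example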

import requests
from bs4 import BeautifulSoup

url = 'https://luketsengtw.github.io/'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.title)

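from bs4 import BeautifulSoup imports BeautifulSoup for web page parsing and data extraction.

soup = BeautifulSoup(html.text, 'html.parser') creates the BeautifulSoup object.

Parser overview

Parser	Speed	Accuracy	Fault tolerance
html.parser	Medium	Lowest	Lowest
lxml	Fastest	High	High
html5lib	Slowest	Highest	Highest

Common BeautifulSoup methods

find() - finds the first matching element
find_all() - finds all matching elements

The code below finds the first <p> tag and all <p> tags with the content class, respectively.

first_p = soup.find('p', class_='content')
all_p = soup.find_all('p', class_='content')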

CSS selectors

Use the select() method:

Tag selector: soup.select('h2')
ID selector (prefix #): soup.select('#main-header')
Class selector (prefix .): soup.select('.nav-link')
Compound selector: soup.select('h2.post-title')
Descendant selector (space): soup.select('main span')

Text extraction

soup.title.string - the text inside a tag
element.get_text() - plain text content (tags removed)
link['href'] or link.get('href') - an attribute value

Basic scraper template

import requests
from bs4 import BeautifulSoup

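url = "TARGET_URL"  # placeholder: put the target URL here
html = requests.get(url)
html.encoding = "utf-8"  # set the encoding
soup = BeautifulSoup(html.text, "lxml")

# Find the main content area
main_content = soup.find("div", class_="main")
if main_content:  # find() returns None when nothing matches
    text = main_content.get_text()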