An alternative HTML parsing library to BeautifulSoup

by Ekky Armandi • 28 Aug 2024

![[images/selectolax-alternative-to-beautifulsoup/selectolax-0.3.21.png|selectolax-0.3.21]]

Hi there! I’m Ekky Armandi, a Python developer with experience in web scraping and web development. I’m also a freelancer; you can check out my portfolio website here.

In this post, we’ll dive into Selectolax, a promising alternative to BeautifulSoup for HTML parsing in data scraping projects.

If you’re new to data scraping, think of a scraper as a digital miner: just as a miner extracts valuable resources from the earth, a data scraper extracts valuable information from the vast expanse of the internet. This technique, also known as web scraping, uses software to extract data from websites. The collected data can serve various purposes, such as training machine learning models, building databases, or creating knowledge bases for artificial intelligence agents.

Selectolax: an alternative to the BeautifulSoup library

When exploring data scraping, BeautifulSoup is a library that you’ll undoubtedly encounter. It has been a trusted companion for developers since 2004, simplifying the process of extracting data from websites.

HTML parsing essentially means transforming raw HTML into a structured tree that can be searched easily. BeautifulSoup offers several ways to navigate that tree and extract data, including CSS selectors, regular expressions, and the find and find_all methods. It does not support XPath on its own; for XPath you would need a library such as lxml.
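To make those lookup styles concrete, here is a minimal sketch that runs against a small inline HTML string rather than a live page. The markup, the price, and the link paths are made up for illustration; only the BeautifulSoup calls themselves (find, select_one, find_all with a compiled regular expression) are the library's own.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of a live page.
SAMPLE_HTML = """
<html><body>
  <h1>Sapiens</h1>
  <p class="price_color">£10.00</p>
  <a href="/catalogue/page-1.html">Page 1</a>
  <a href="/about.html">About</a>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find(): first tag with the given name.
title = soup.find("h1").get_text()

# select_one(): first tag matching a CSS selector.
price = soup.select_one("p.price_color").get_text()

# find_all() with a regular expression: keep only links whose href
# starts with /catalogue/.
catalogue_links = [
    a["href"] for a in soup.find_all("a", href=re.compile(r"^/catalogue/"))
]

print(title, price, catalogue_links)
```

Which style to use is mostly a matter of taste; CSS selectors tend to be the shortest to write when the page already has meaningful class names.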

image from BeautifulSoup documentation site

There are other tools for parsing HTML, but in this post I will focus on how Selectolax compares to BeautifulSoup.

Let’s see what the code looks like in both BeautifulSoup and Selectolax when it comes to parsing HTML.

If you want to try it yourself, make sure you install the dependencies below.

# BeautifulSoup
pip install beautifulsoup4 requests
# Selectolax
pip install selectolax requests

BeautifulSoup code

from bs4 import BeautifulSoup
import requests
import time

url = "https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html"

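# Download and parse the page first; both steps happen before the timer starts.
response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Time only the CSS lookups and the print below.
start = time.perf_counter()

# select_one() returns the first matching tag (or None if nothing matches).
title = soup.select_one("h1").text
price = soup.select_one("p.price_color").text
image = soup.select_one("img")["src"]
book_info = {
    "title": title,
    "price": price,
    "image": image,
}
print(book_info)

end = time.perf_counter()

print(f"Time taken: {end - start} seconds")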

Selectolax code

from selectolax.parser import HTMLParser
import requests
import time

url = "https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html"

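# Download and parse the page first; both steps happen before the timer starts.
response = requests.get(url)
response.raise_for_status()
tree = HTMLParser(response.text)

# Time only the CSS lookups and the print below.
start = time.perf_counter()

# css_first() returns the first matching node (or None if nothing matches).
title = tree.css_first("h1").text()
price = tree.css_first("p.price_color").text()
image = tree.css_first("img").attrs["src"]
book_info = {
    "title": title,
    "price": price,
    "image": image,
}
print(book_info)

end = time.perf_counter()

print(f"Time taken: {end - start} seconds")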

Sample Benchmark

Extract title, links, scripts and a meta tag from main pages of top 754 domains.

| Package | Time |
| --- | --- |
| Beautiful Soup (html.parser) | 61.02 sec. |
| lxml | 9.09 sec. |
| html5_parser | 16.10 sec. |
| selectolax (Modest) | 2.94 sec. |
| selectolax (Lexbor) | 2.39 sec. |

source
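If you would like to run a small comparison of your own, a timing harness along these lines might be a starting point. It is only a sketch, not the method used for the table above: the sample document is generated in memory, the helper names (best_time, count_links_stdlib, LinkCounter) are hypothetical, and Python's built-in html.parser stands in as the single candidate so the script runs without extra packages. To compare BeautifulSoup or Selectolax, you would add your own functions to the candidates dictionary.

```python
import timeit
from html.parser import HTMLParser as StdlibHTMLParser

# Hypothetical sample document; a real benchmark would use saved real pages.
SAMPLE_HTML = (
    "<html><head><title>Demo</title></head><body>"
    + "<a href='/x'>x</a>" * 200
    + "</body></html>"
)


class LinkCounter(StdlibHTMLParser):
    """Counts <a> start tags while the document is being parsed."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1


def count_links_stdlib(html):
    """Parse the document from scratch and return the number of links."""
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return parser.links


def best_time(func, html, repeat=5, number=20):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(lambda: func(html), repeat=repeat, number=number)
    return min(timings) / number


# Add more entries here, one function per library you want to compare.
candidates = {"stdlib html.parser": count_links_stdlib}

results = {name: best_time(func, SAMPLE_HTML) for name, func in candidates.items()}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.1f} microseconds per parse")
```

Each candidate parses the document from scratch on every call, so the measurement includes parsing and not just the lookups, and no network request is involved.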

Conclusion

In conclusion, Selectolax could be a viable alternative to BeautifulSoup for parsing HTML. However, the choice between the two will ultimately depend on the specific requirements of your project. If speed and efficiency are a priority, Selectolax might be worth considering.
