mastering BeautifulSoup
tl;dr: BeautifulSoup selectors and code snippets
Once you've become familiar with scraping websites with Python, requests, and BeautifulSoup (if not, read this first), you'll want to start creating reusable components to speed up build time and improve data reliability.
Below I've included reference snippets for:
- extracting data
- cleaning data
- picking a parser
- handling links
- handling tables
- general tips
- general functions
- gotchas
extracting data
Notation | Type | Comments |
---|---|---|
.attrs | Macro | All attributes of a selected element (print this) |
.div | Element | The div element inside the currently selected element |
.a | Element | The a element inside the currently selected element. Note you'll have to use .get('href') to get the associated link |
.p | Element | Get the p element (paragraph) inside the currently selected element |
.span | Element | Get the span element inside the currently selected element |
.title | Element | Get the title element inside the currently selected element |
.svg | Element | Get the SVG inside the currently selected element |
.img | Element | Get the image inside the currently selected element |
.get('src') | Image Link | The link associated with an image |
.get('alt') | Image Link | The alt text associated with an image |
.text | Attribute | The text associated with an element (string) |
.string | Attribute | Similar to .text, but supports some navigation (e.g. with .children) |
.strong | Element | Get the bolded text, also sometimes the link display text |
.get_text("\n", strip=True) | Attribute | Get text that is broken up (e.g. by newlines) |
.get('href') | Attribute | Get the link value (e.g. of a parent a tag) |
.nextSibling | Navigation | Get the next item in the DOM tree (right below in the console) |
.contents | Attribute | The elements inside the current element (as a list) |
.contents[0] | Attribute | The first element inside the current element |
.contents[0].contents[1] | Navigation | You can navigate this way if the sitemap is consistent |
.a.strong.text | Navigation | You can stack dotwise queries if the sitemap is consistent |
[0] / [1] | Navigation | If your elements are a list (e.g. from a find_all), you can query them by index |
.extract() | Misc | Removes a tag from the tree and returns it, for when you need the tag but want to throw away the rest of the parsed document. Rare. (Explanation) |
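As a quick illustration of how a few of these read in practice, here's a minimal sketch against a made-up snippet (the markup and class names are placeholders):
from bs4 import BeautifulSoup

html = '<div class="card"><a href="/post/1"><strong>First post</strong></a><img src="/img/1.png" alt="thumbnail"></div>'
parsed = BeautifulSoup(html, "html.parser")
card = parsed.div
print(card.attrs)            # {'class': ['card']}
print(card.a.get('href'))    # /post/1
print(card.a.strong.text)    # First post
print(card.img.get('alt'))   # thumbnail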
cleaning data
Notation | Comments |
---|---|
.split("/", 1)[1].strip() | Split by a predictable substring, e.g. "/" |
", ".join(set(data_as_list)) | Deduplicate a list and convert it to a string |
.strip().replace("\n", "").replace("\r", "") | Remove whitespace and newlines |
' '.join(data_as_str.split()) | Replace blocks of whitespace with a single space |
" ".join() -> .replace(' ', ', ') -> .split(',, ')[1] | Split by spaces in a string (converting it to a list first) |
.rstrip(",") | Delete trailing commas or other delimiters |
str(HTMLSnippet) | Convert to a string, then parse with regex or a substring "in" check (generally inadvisable) |
True if parsed.find("div", {"class": "features"}) else False | Convert tag presence to a boolean |
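For example, a couple of these chained together on a made-up string (the input is a placeholder):
raw = "Price: $1,200 /  per month \r\n"
clean_unit = raw.split("/", 1)[1].strip()    # "per month"
collapsed = " ".join(raw.split())            # "Price: $1,200 / per month"
tags = ", ".join(set(["python", "scraping", "python"]))  # deduplicated; order not guaranteed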
picking a parser
the default is html.parser
from bs4 import BeautifulSoup
parsed = BeautifulSoup(response.content, "html.parser")
lxml
is faster, but you have to manage the dependency (which is 12 MB unzipped)
from bs4 import BeautifulSoup
import lxml
parsed = BeautifulSoup(response.content, "lxml")
Once you've created the parsed object, if you want all of the page's text as a single string (instead of HTML tags), you can use
text = parsed.get_text(separator=" ", strip=True)
handling links
to get all links on a page:
links = parsed.select('a[href]')
to get the text and href values from those links:
link_text = [x.get_text() for x in parsed.find_all("a")]
link_href = [x.get("href") for x in parsed.select('a[href]')]
if you only want internal subsite links:
all_links = parsed.find_all('a', href=True)
internal_links = []
for link in all_links:
    href = link['href']
    if href.startswith("/"):
        internal_links.append(f"https://example.com{href}")
    elif "example.com" in href:
        internal_links.append(href)
if you only want links to one external domain:
external_links = parsed.select('a[href^="http://othersite.com"]')
handling tables
iterate through each row in a table:
rows = parsed.body.find_all('tr')
get the text from every cell in a row of a table:
result_list = [td.text.strip() for row in rows for td in row.find_all('td')]
you may also run into tables created out of divs. to parse, find the parent tag, and iterate through child divs
table_parent = parsed.find("div", {'class': "parent-class"})
for table_cell_tag in table_parent.find_all("div", {"class": "cell-class"}):
    print(table_cell_tag.get_text())
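Putting the row and cell snippets above together, here's a rough sketch that assumes the table has a header row of th cells and turns each data row into a dict (that structure is an assumption about the page):
rows = parsed.body.find_all("tr")
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]  # assumes a header row
records = []
for row in rows[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == len(headers):
        records.append(dict(zip(headers, cells)))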
general tips
to get the next selector at the same level of the tree:
find_next_sibling("div")
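for example, if a page lays out label/value pairs as sibling divs, a sketch like this pairs them up (the class name is an assumption):
label = parsed.find("div", {"class": "spec-label"})
value = label.find_next_sibling("div") if label else None
if value:
    print(label.get_text(strip=True), value.get_text(strip=True))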
you can turn recursive off in a find_all:
all_divs_at_that_level = parsed.find_all("div", recursive=False)
you can also specify that only tags with a given attribute are included, for example all images with alt text (SO):
data_as_str = (", ".join([img['alt'] for img in parsed.find_all('img', alt=True)]))
to match an element whose class attribute has spaces in it (i.e. multiple classes), you can use select with periods instead of spaces (Docs)
element = parsed.select('div.container-lg.clearfix.px-3.mt-4')
alternately, you can take the most specific substring
element = parsed.find("div", {"class" : "clearfix"})
you can use regex on attribute values, but it will be hard to troubleshoot
import re
element = parsed.find("div", {"id":re.compile('foo|bar')})
get elements that have a non-standard attribute present:
currency_option_tags = parsed.select('option[data-currency]')
general functions
if you want to find an element by its text content (you can then chain .nextSibling off of it) (SO)
contacts = parsed.find(lambda elm: elm.name == "h2" and "Contact" in elm.text)
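A sketch of chaining that with sibling navigation, assuming the details live right after the heading:
contacts = parsed.find(lambda elm: elm.name == "h2" and "Contact" in elm.text)
if contacts:
    details = contacts.find_next_sibling()  # the next tag at the same level
    if details:
        print(details.get_text(" ", strip=True))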
find and concatenate all matching selectors (even if there are none present):
import logging

def flatten_enclosed_elements(enclosing_element, selector_type, **kwargs):
    if not enclosing_element:
        logging.warning('no enclosing element for flatten_enclosed_elements')
        return None
    text_list = []
    for ele in enclosing_element.find_all(selector_type):
        if ele and ele.get_text():
            text_list.append(ele.get_text().strip().replace("\n", "").replace("\r", ""))
    return ", ".join(text_list) if text_list and kwargs.get("output_str") else text_list
find and concatenate all neighboring sibling selectors (even if there are none present):
from bs4 import NavigableString

def flatten_neighboring_selectors(enclosing_element, selector_type):
    if not enclosing_element:
        logging.warning('no enclosing element for flatten_neighboring_selectors')
        return None
    text_list = []
    for ele in enclosing_element.find_all(selector_type):
        next_s = ele.nextSibling
        if not (next_s and isinstance(next_s, NavigableString)):
            continue
        text_list.append(str(next_s).strip().replace("\n", "").replace("\r", ""))
    return text_list
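And a quick (assumed) usage, grabbing the text that follows each strong label inside a details block:
details_block = parsed.find("div", {"class": "details"})
label_values = flatten_neighboring_selectors(details_block, "strong")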
find all tags whose text matches one of a list of strings (SO):
import re

def get_tags_with_name_in_list(parsed, tag_type, name_strings_list):
    re_pattern = re.compile("^(" + "|".join(name_strings_list) + ")$", re.I)
    found_tags = parsed.find_all(tag_type, string=re_pattern)
    found_strings = [x.get_text() for x in found_tags]
    return found_strings
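For instance (the tag type and strings are placeholders):
section_headings = get_tags_with_name_in_list(parsed, "h2", ["Contact", "About", "Team"])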
detect the site's language if text is found:
from langdetect import detect
def detect_language(text):
    try:
        return detect(text)
    except Exception:
        return None
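A quick way to wire that to the parsed page:
page_language = detect_language(parsed.get_text(separator=" ", strip=True))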
gotchas
NavigableStrings are just more annoying strings. You may get them sometimes when using .nextSibling. You can convert one to a regular string with
text_str = str(maybe_ns) if isinstance(maybe_ns, NavigableString) else maybe_ns
.get_text(), .getText(), and .text are the same thing
.get_text() returns the text of a given tag and all of its child tags. If you just want a given tag's own text, use .string
The contents of <script>, <style>, and <template> tags are not considered to be 'text', since they are not human visible. Use .string instead of the .text methods above (according to the docs)
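A small sketch of the difference, with a recent Beautiful Soup (4.9+) and html.parser (the markup is made up):
from bs4 import BeautifulSoup

snippet = BeautifulSoup("<div><script>var x = 1;</script><p>hello</p></div>", "html.parser")
print(snippet.div.get_text())   # "hello" -- the script contents are skipped
print(snippet.script.string)    # "var x = 1;" -- .string still returns them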