mastering BeautifulSoup
tl;dr: BeautifulSoup selectors and code snippets
Once you've become familiar with scraping websites with Python, requests, and BeautifulSoup (if not, read this first), you'll want to start creating reusable components to speed up build time and improve data reliability.
Below I've included reference snippets for:
- extracting data
- cleaning data
- picking a parser
- handling links
- handling tables
- general tips
- general functions
- gotchas
extracting data
Notation | Type | Comments |
---|---|---|
.attrs | Macro | All attributes of a selected element (print this) |
.div | Element | The div element inside the currently selected element |
.a | Element | The a element inside the currently selected element. Note you'll have to use .get('href') to get the associated link |
.p | Element | Get the p element (paragraph) inside the currently selected element |
.span | Element | Get the span element inside the currently selected element |
.title | Element | Get the title element inside the currently selected element |
.svg | Element | Get the SVG inside the currently selected element |
.img | Element | Get the image inside the currently selected element |
.get('src') | Image Link | The link associated with an image |
.get('alt') | Image Link | The alt text associated with an image |
.text | Attribute | The text associated with an element (string) |
.string | Attribute | Similar to .text, but supports some navigation (e.g. with .children) |
.strong | Element | Get the bolded text, also sometimes the link display text |
.get_text("\n", strip=True) | Attribute | Get text that is broken up (e.g. by newlines) |
.get('href') | Attribute | Get the link value (e.g. of a parent a tag) |
.nextSibling | Navigation | Get the next item in the DOM tree (right below in the console) |
.contents | Attribute | The elements inside the current element (as a list) |
.contents[0] | Attribute | The first element inside the current element |
.contents[0].contents[1] | Navigation | You can navigate this way if the sitemap is consistent |
.a.strong.text | Navigation | You can stack dotwise queries if the sitemap is consistent |
[0] / [1] | Navigation | If your elements are a list (e.g. from a find_all), you can query them by index |
.extract() | Misc | Removes a tag from the tree and returns it, for when you need the tag but want to throw away the rest of the parsed document. Rare. (Explanation) |
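As a quick illustration of how a few of these read in practice, here's a minimal sketch against a made-up snippet (the markup and class names are placeholders):
from bs4 import BeautifulSoup

html = '<div class="card"><a href="/post/1"><strong>First post</strong></a><img src="/img/1.png" alt="thumbnail"></div>'
parsed = BeautifulSoup(html, "html.parser")
card = parsed.div
print(card.attrs)            # {'class': ['card']}
print(card.a.get('href'))    # /post/1
print(card.a.strong.text)    # First post
print(card.img.get('alt'))   # thumbnail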
cleaning data
Notation | Comments |
---|---|
.split("/", 1)[1].strip() | Split by a predictable substring, e.g. "/" |
", ".join(set(data_as_list)) | Deduplicate a list and convert it to a string |
.strip().replace("\n", "").replace("\r", "") | Remove whitespace and newlines |
' '.join(data_as_str.split()) | Replace blocks of whitespace with a single space |
" ".join() -> .replace(' ', ', ') -> .split(',, ')[1] | Split by spaces in a string (converting it to a list first) |
.rstrip(",") | Delete trailing commas or other delimiters |
str(HTMLSnippet) | Convert to a string, then parse with regex or a substring "in" check (generally inadvisable) |
True if parsed.find("div", {"class": "features"}) else False | Convert tag presence to a boolean |
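For example, a couple of these chained together on a made-up string (the input is a placeholder):
raw = "Price: $1,200 /  per month \r\n"
clean_unit = raw.split("/", 1)[1].strip()    # "per month"
collapsed = " ".join(raw.split())            # "Price: $1,200 / per month"
tags = ", ".join(set(["python", "scraping", "python"]))  # deduplicated; order not guaranteed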
picking a parser
the default is html.parser
from bs4 import BeautifulSoup
parsed = BeautifulSoup(response.content, "html.parser")
lxml
is faster, but you have to manage the dependency (which is 12 MB unzipped)
from bs4 import BeautifulSoup
import lxml
parsed = BeautifulSoup(response.content, "lxml")
Once you've created the parsed object, if you want all of the page's text as a single string (instead of HTML tags), you can use
text = parsed.get_text(separator=" ", strip=True)
handling links
to get all links on a page:
links = parsed.select('a[href]')
to get the text and href values from those links:
link_text = [x.get_text() for x in parsed.find_all("a")]
link_href = [x.get("href") for x in parsed.select('a[href]')]
if you only want internal subsite links:
all_links = parsed.find_all('a', href=True)
internal_links = []
for link in all_links:
    href = link['href']
    if href.startswith("/"):
        internal_links.append(f"https://example.com{href}")
    elif "example.com" in href:
        internal_links.append(href)
if you only want links to one external domain:
external_links = parsed.select('a[href^="http://othersite.com"]')
handling tables
iterate through each row in a table:
rows = parsed.body.find_all('tr')
get the text from every cell in a row of a table:
result_list = [td.text.strip() for row in rows for td in row.find_all('td')]
you may also run into tables created out of divs. to parse, find the parent tag, and iterate through child divs
table_parent = parsed.find("div", {'class': "parent-class"})
for table_cell_tag in table_parent.find_all("div", {"class": "cell-class"}):
    print(table_cell_tag.get_text())
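Putting the row and cell snippets above together, here's a rough sketch that assumes the table has a header row of th cells and turns each data row into a dict (that structure is an assumption about the page):
rows = parsed.body.find_all("tr")
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]  # assumes a header row
records = []
for row in rows[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == len(headers):
        records.append(dict(zip(headers, cells)))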
general tips
to get the next selector at the same level of the tree:
find_next_sibling("div")
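for example, if a page lays out label/value pairs as sibling divs, a sketch like this pairs them up (the class name is an assumption):
label = parsed.find("div", {"class": "spec-label"})
value = label.find_next_sibling("div") if label else None
if value:
    print(label.get_text(strip=True), value.get_text(strip=True))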
you can turn recursive off in a find_all:
all_divs_at_that_level = parsed.find_all("div", recursive=False)
you can also specify that only tags with a given attribute are included, for example all images with alt text (SO):
data_as_str = (", ".join([img['alt'] for img in parsed.find_all('img', alt=True)]))
to match an element whose class attribute has spaces in it (i.e. multiple classes), you can use select with periods instead of spaces (Docs)
element = parsed.select('div.container-lg.clearfix.px-3.mt-4')
alternately, you can take the most specific substring
element = parsed.find("div", {"class" : "clearfix"})
you can use regex on attribute values, but it will be hard to troubleshoot
import re
element = parsed.find("div", {"id":re.compile('foo|bar')})
get elements that have a non-standard attribute present:
currency_option_tags = parsed.select('option[data-currency]')
general functions
if you want to find an element by its text content (you can then chain .nextSibling off of it) (SO)
contacts = parsed.find(lambda elm: elm.name == "h2" and "Contact" in elm.text)
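A sketch of chaining that with sibling navigation, assuming the details live right after the heading:
contacts = parsed.find(lambda elm: elm.name == "h2" and "Contact" in elm.text)
if contacts:
    details = contacts.find_next_sibling()  # the next tag at the same level
    if details:
        print(details.get_text(" ", strip=True))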
find and concatenate all matching selectors (even if there are none present):
import logging

def flatten_enclosed_elements(enclosing_element, selector_type, **kwargs):
    if not enclosing_element:
        logging.warning('no enclosing element for flatten_enclosed_elements')
        return None
    text_list = []
    for ele in enclosing_element.find_all(selector_type):
        if ele and ele.get_text():
            text_list.append(ele.get_text().strip().replace("\n", "").replace("\r", ""))
    return ", ".join(text_list) if text_list and kwargs.get("output_str") else text_list
find and concatenate all neighboring sibling selectors (even if there are none present):
from bs4 import NavigableString

def flatten_neighboring_selectors(enclosing_element, selector_type):
    if not enclosing_element:
        logging.warning('no enclosing element for flatten_neighboring_selectors')
        return None
    text_list = []
    for ele in enclosing_element.find_all(selector_type):
        next_s = ele.nextSibling
        if not (next_s and isinstance(next_s, NavigableString)):
            continue
        text_list.append(str(next_s).strip().replace("\n", "").replace("\r", ""))
    return text_list
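And a quick (assumed) usage, grabbing the text that follows each strong label inside a details block:
details_block = parsed.find("div", {"class": "details"})
label_values = flatten_neighboring_selectors(details_block, "strong")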
find all tags whose text matches one of a list of strings (SO):
import re

def get_tags_with_name_in_list(parsed, tag_type, name_strings_list):
    re_pattern = re.compile("^(" + "|".join(name_strings_list) + ")$", re.I)
    found_tags = parsed.find_all(tag_type, string=re_pattern)
    found_strings = [x.get_text() for x in found_tags]
    return found_strings
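For instance (the tag type and strings are placeholders):
section_headings = get_tags_with_name_in_list(parsed, "h2", ["Contact", "About", "Team"])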
detect the site's language if text is found:
from langdetect import detect
def detect_language(text):
    try:
        return detect(text)
    except Exception:
        return None
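A quick way to wire that to the parsed page:
page_language = detect_language(parsed.get_text(separator=" ", strip=True))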
gotchas
NavigableStrings are just more annoying strings. You may get them sometimes when using .nextSibling. You can convert one to a regular string with
text_str = str(maybe_ns) if isinstance(maybe_ns, NavigableString) else maybe_ns
.get_text(), .getText(), and .text are the same thing
.get_text() returns the text of a given tag and all of its child tags. If you just want a given tag's own text, use .string
The contents of <script>, <style>, and <template> tags are not considered to be 'text', since they are not human visible. Use .string instead of the .text methods above (according to the docs)
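A small sketch of the difference, with a recent Beautiful Soup (4.9+) and html.parser (the markup is made up):
from bs4 import BeautifulSoup

snippet = BeautifulSoup("<div><script>var x = 1;</script><p>hello</p></div>", "html.parser")
print(snippet.div.get_text())   # "hello" -- the script contents are skipped
print(snippet.script.string)    # "var x = 1;" -- .string still returns them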