Parsing HTML: why does this document have to be parsed as text and not by tags?

  beautifulsoup, html, python, python-3.x

I’m using a Python module that scrapes a site, and I noticed in the code below that it handles different tables differently:

from bs4 import BeautifulSoup, NavigableString

def player_stats(request, stat, numeric=False, s_index=False):
    """
    Locate the stats <table> for the given stat type on a player page.
    """

    supported_tables = ["totals", "per_minute", "per_poss", "advanced",
                        "playoffs_per_game", "playoffs_totals", "playoffs_per_minute",
                        "playoffs_per_poss", "playoffs_advanced"]

    if stat == "per_game":
        # The per-game table is found with an ordinary tag search.
        soup = BeautifulSoup(request.text, "html.parser")
        table = soup.find("table", id="per_game")
    elif stat in supported_tables:
        # For every other table, the module instead searches for a text node
        # (NavigableString) whose contents mention the stat name, then re-parses
        # that text as its own document and looks for the table inside it.
        soup = BeautifulSoup(request.text, "html.parser")
        comment_table = soup.find(text=lambda x: isinstance(x, NavigableString) and stat in x)
        soup = BeautifulSoup(comment_table, "html.parser")
        table = soup.find("table", id=stat)
    else:
        raise TableNonExistent  # custom exception defined elsewhere in the module

An example of a page this would be used on: https://www.basketball-reference.com/players/j/jamesle01.html
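Since the variable in that branch is called comment_table, here is the kind of quick check I have in mind for that page (a minimal sketch, assuming requests and bs4 are available; Comment is bs4’s comment node type, which I believe is a subclass of NavigableString):

import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.basketball-reference.com/players/j/jamesle01.html"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# an ordinary tag search over the whole page
print(len(soup.find_all("table")))

# the same style of string search the module uses, restricted to comment nodes
matches = soup.find_all(text=lambda s: isinstance(s, Comment) and "advanced" in s)
print(len(matches))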

If one does soup.find_all("table") on that page, only the first table is found. The code above seems to check for "comments" in the HTML and then runs BeautifulSoup over that text again. I have a few questions:

  1. Why aren’t the other tables found? They are also HTML tags (they don’t look commented out to me), so I’m struggling to understand the difference.

  2. What is the comment_table line really doing? To me, it looks like it’s searching for text nodes (NavigableStrings) whose contents include the requested stat name from supported_tables. Is that right?

  3. If I’m right about the above, how can BeautifulSoup simply parse that block of text? Is it "magic", or does the text have to be in a specific form, and we’re therefore just lucky in this case? (A tiny standalone example of what I mean follows this list.)
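To illustrate question 3, here is what I mean by handing BeautifulSoup a bare block of text (a minimal sketch; the fragment is made up and only stands in for whatever string the second BeautifulSoup call receives):

from bs4 import BeautifulSoup

# an arbitrary string of HTML, not a full page and not tied to any request;
# the contents are made up, the point is just that the input is plain text
fragment = "<table id='advanced'><tr><th>Season</th></tr><tr><td>2017-18</td></tr></table>"

soup = BeautifulSoup(fragment, "html.parser")
print(soup.find("table", id="advanced"))  # the table is found inside the bare fragment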

Let me know if you need more information to answer the questions. Thanks!

