Optimizing code for extracting information from many HTML files

I am trying to extract specific information from a directory of HTML files that I previously downloaded with Python's requests library. Downloading the pages was already slow because I built in a random wait timer, but now that I want to iterate over each retrieved HTML file, it seems my script is not very well optimized. This is a problem because I want to iterate over 42,000 HTML files, each more than 8,000 lines long, which would probably take a very long time.

Since I have never run into a problem this demanding for my computer, I do not know where to start learning how to optimize my code. My question: should I approach this problem differently, possibly in a more time-efficient way? Your suggestions would be very much appreciated.
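To get a feel for where the time goes, I imagine a minimal timing check would look something like the sketch below. It assumes the same open_html helper and file naming used in my code further down, and I have not actually profiled anything yet:

import time

from bs4 import BeautifulSoup

# rough timing of a single file, to estimate the total runtime over ~42,000 files
start = time.perf_counter()
html = open_html(r'C:\Users\Documents\file_p1.html')  # same helper as in the code below
soup = BeautifulSoup(html, 'html.parser')
title = soup.select_one('.object-header__title').text
elapsed = time.perf_counter() - start
print('one file took {:.3f} s, so roughly {:.0f} min for 42000 files'.format(
    elapsed, elapsed * 42000 / 60))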

Here is the code I am using (I have changed some sensitive information):

from bs4 import BeautifulSoup
import pandas as pd

# empty lists for the features of each house
link = []
name_list = []
agent_list = []
description_list = []
features_list = []

# link_list was retrieved in a previous step and holds the links to the original
# pages; file_p{i}.html is the HTML that was saved for link_list[i]
for i in range(1, len(link_list)):
    html = open_html(r'C:\Users\Documents\file_p{}.html'.format(i))
    soup = BeautifulSoup(html, 'html.parser')
    link.append(link_list[i])
    name_list.append(soup.select_one('.object-header__title').text)
    agent_list.append(soup.select_one('.object-contact-agent-link').text)
    description_list.append(soup.select_one('.object-description-body').text)
    features_list.append(soup.select_one('.object-features').text)

d = {'Link_P': link, 'Name_P': name_list, 'Features_P': features_list,
     'Description_P': description_list, 'Agent_P': agent_list}
df = pd.DataFrame(data=d)
df
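For reference, this is roughly the direction I was considering: parsing the files in parallel and collecting one dict per file instead of appending to separate lists. It is only a sketch and untested; it assumes the same open_html helper, link_list, and file naming as above, and the 'lxml' parser needs the lxml package installed.

from concurrent.futures import ProcessPoolExecutor

from bs4 import BeautifulSoup
import pandas as pd

def parse_one(args):
    # each worker parses one saved HTML file and returns a dict of its fields;
    # open_html is the same helper as above and must be defined at module level
    i, page_link = args
    html = open_html(r'C:\Users\Documents\file_p{}.html'.format(i))
    # 'lxml' is usually faster than 'html.parser', but requires the lxml package
    soup = BeautifulSoup(html, 'lxml')
    return {
        'Link_P': page_link,
        'Name_P': soup.select_one('.object-header__title').text,
        'Agent_P': soup.select_one('.object-contact-agent-link').text,
        'Description_P': soup.select_one('.object-description-body').text,
        'Features_P': soup.select_one('.object-features').text,
    }

if __name__ == '__main__':
    # same pairing of file number and link as in the loop above
    args = [(i, link_list[i]) for i in range(1, len(link_list))]
    with ProcessPoolExecutor() as executor:
        rows = list(executor.map(parse_one, args))
    df = pd.DataFrame(rows)

I do not know whether the parsing itself or the disk reads are the bottleneck, which is part of what I am asking about.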

