Category : parsing

i have a big file (2.5 GB, 50 million of lines) and from this file i need to extract different type of data. Lines are composed like these: ‘timestamp,id;i1,quantity,***,***;i2,quantity2,***,***n’, ‘o1,quantity1′ Every line could have different length (timestamp and id are always one, but i don’t know how many i or o are in every line) ..

Read more

I’m using a customization of the HTMLParser in order to remove HMTL tags and other things. The problem came out with this string Q&A from html.parser import HTMLParser from io import StringIO class MyParser(HTMLParser): def __init__(self): super().__init__() self.reset() self.strict = False self.convert_charrefs= True self.text = StringIO() def handle_data(self, d): self.text.write(d) def get_data(self): return self.text.getvalue() def ..

Read more

I’m trying to parse JSON of the site. Link to the page: https://street-beat.ru/cat/man/obuv/krossovki/nike/air-force/winters/?sort=create&order=desc& Found it in the HTML code of the page (in dev tools very first request): window.__INITIAL_STATE__ = JSON.parse(…). But when I try to execute the same request in code, I get a different respose and consequently window.__INITIAL_STATE__ = JSON.parse(…) is not there. ..

Read more

For some reason my code broke for this website https://www.scmagazine.com/ransomware where it parses with Code 200 and it does have info but when I try to parse it nothing gets returned. Please see my code below: import requests import lxml.html ransomware = "https://www.scmagazine.com/ransomware" page = requests.get(ransomware, headers={‘User-Agent’: ‘Mozilla/5.0’}) doc = lxml.html.fromstring(page.text) rm_title = doc.xpath(‘//h5[@class="ContentTeaser_title__3Gv3Q"]/text()’)[:3] rm_descrip ..

Read more

I am trying to parse this website: https://resources.register-iri.com/CorpEntity/Corporate/Search I tried fully imitate all get and post requests with headers and queries. Parsed viewState value to pass it throught the post request. Prepared Referrer value for the header. But still receiving error: The UIViewRoot is null. Fatal exception during PhaseId: RESTORE_VIEW 1. Could you please help ..

Read more

The code I used is from pyparsing import Word, hexnums, WordEnd, Optional, alphas, alphanums hex_integer = Word(hexnums) + WordEnd() # use WordEnd to avoid parsing leading a-f of non-hex numbers as a hex line = ".text:" + hex_integer + Optional((hex_integer*(1,))("instructions") + Word(alphas,alphanums)("opcode")) for source_line in source: result = line.parseString(source_line) if "opcode" in result: print(result.opcode, result.instructions.asList()) ..

Read more

I am trying to perfom parsing but when I send POST method to get searching results, getting page with error: The requested URL was rejected. Please consult with your administrator. Your support ID is: 6397656432730259025 Website: https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/Search.aspx I’ve collected data like viewstate, viewstategenerator etc.. to pass throught form but doesn’t work. What am I missing? ..

Read more