Category: scrapy

I am using Scrapy and I want to extract each topic that has at least 4 posts. I have two separate selectors: real_url_list to get the href of each topic, and nbpostsintopic_resp to get the number of posts.

    real_url_list = response.css("td.col-xs-8 a::attr(href)").getall()
    for topic in real_url_list:
        nbpostsintopic_resp = response.css("td.center ::text").get()
        nbpostsintopic = nbpostsintopic_resp[0]
        ..
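A likely cause, sketched below: calling response.css(...) on the whole page inside the loop returns the first match on the page every time, instead of the count belonging to the current topic. The helper function is hypothetical (not the asker's code); the commented selectors are the ones from the question, scoped per row.

```python
# Sketch: pure-Python filtering of (href, count_text) pairs gathered one
# table row at a time, keeping topics with at least `minimum` posts.

def topics_with_min_posts(rows, minimum=4):
    """Keep hrefs whose post count is at least `minimum`.

    `rows` is an iterable of (href, count_text) pairs, one per table row.
    """
    kept = []
    for href, count_text in rows:
        try:
            count = int(count_text.strip())
        except (AttributeError, ValueError):
            continue  # missing or non-numeric count cell
        if count >= minimum:
            kept.append(href)
    return kept

# In the spider, pair the two cells per row instead of re-querying the
# whole response inside the loop (selectors taken from the question):
#
#     for row in response.css("tr"):
#         href = row.css("td.col-xs-8 a::attr(href)").get()
#         count = row.css("td.center ::text").get()
```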


I'm currently trying to web-scrape specific data from this website, but when I run the crawl from the command line, the JSON and CSV output files end up blank. What am I doing wrong?

    import scrapy

    class RatesSpider(scrapy.Spider):
        name = 'rates'
        allowed_domains = ['https://www.ratehub.ca/best-mortgage-rates/5-year/fixed']
        start_urls = ['http://https://www.ratehub.ca/best-mortgage-rates/5-year/fixed/']

        def parse(self, response):
            for row in response.xpath('//*[@id="AllRatesTable_SpQFd"]//tr'):
                yield ..
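Two attribute values in the snippet look malformed: start_urls begins with "http://https://", and allowed_domains contains a full URL (scheme plus path), which makes Scrapy's offsite filtering drop every request. A hedged sketch of the corrected values:

```python
# Corrected attribute values (sketch): allowed_domains takes bare domain
# names, and each start URL must be a single well-formed URL.
allowed_domains = ["ratehub.ca"]  # no scheme, no path
start_urls = ["https://www.ratehub.ca/best-mortgage-rates/5-year/fixed/"]
```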


I'm a Python noob trying to write Scrapy results to Firebase Firestore using Python 3. The spider's results are logging correctly to the console, but I can't seem to write to my Firestore DB. Any help is greatly appreciated.

Error message:

    db = firestore.client()
    AttributeError: module 'google.cloud.firestore' has no attribute 'client'

Pipeline file:

    import firebase_admin
    from ..
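The AttributeError suggests the client is being taken from the google.cloud.firestore module, whose factory is the Client class (capital C); the lowercase firestore.client() does exist, but it lives in firebase_admin.firestore. A minimal sketch, assuming firebase_admin is installed and a service-account key file exists (the filename is hypothetical, and the _apps check uses a private but commonly used attribute):

```python
def get_firestore_client():
    # Lazy imports so the sketch reads without firebase_admin installed.
    import firebase_admin
    from firebase_admin import credentials, firestore

    # Initialise the app only once; repeated initialize_app calls raise.
    if not firebase_admin._apps:
        cred = credentials.Certificate("serviceAccountKey.json")  # hypothetical path
        firebase_admin.initialize_app(cred)

    # firebase_admin.firestore.client() is lowercase; the direct
    # google.cloud.firestore equivalent is firestore.Client() (capital C).
    return firestore.client()
```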


I'm trying to make a ContextFactory that changes the TLS method and ciphers when the server returns a 403 error. I thought it would work, but it doesn't: I pass the spider to the context so that when it gets bad responses the TLS settings change. Why doesn't it work?

    @implementer(IPolicyForHTTPS)
    class FutusContext(ClientContextFactory):
        METHODS = {
            'SSLv2': 1,
            'SSLv3': 2,
            'SSLv23': 3,
            ..
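Before wiring per-response TLS switching through a custom context factory, it may be worth noting that Scrapy exposes the TLS method and cipher string as plain settings. A settings.py sketch (the values are examples, not a recommendation):

```python
# settings.py sketch: Scrapy's built-in knobs for the downloader's TLS client.
DOWNLOADER_CLIENT_TLS_METHOD = "TLSv1.2"   # e.g. "TLS", "TLSv1.1", "TLSv1.2"
DOWNLOADER_CLIENT_TLS_CIPHERS = "DEFAULT"  # an OpenSSL cipher string
```

These settings apply to the whole crawl, so dynamic switching on a 403 still needs a custom factory, but they are a quick way to test whether a different TLS setup avoids the 403 at all.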


I have just started with Scrapy. I want to scrape all the job titles from this page and save them to a CSV file, but when I run the command

    scrapy crawl jobscraper -o file.csv

it produces an empty file. What am I doing wrong?

    import scrapy

    class JobScraper(scrapy.Spider):
        name = "jobscraper"
        start_urls = [
            'https://www.pracuj.pl/praca/it%20-%20rozw%c3%b3j%20oprogramowania;cc,5016/%c5%82%c3%b3dzkie;r,5?rd=0',
            ..
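A common cause of an empty output file is a parse method that extracts values but never yields them, or selectors that match nothing in the downloaded HTML (what the browser renders may differ from the raw response). A hedged method-body sketch; the CSS selector is a guess, not taken from the real page, and should be verified in scrapy shell first:

```python
def parse(self, response):
    # Yield one dict per job title; `-o file.csv` only writes yielded items.
    # The selector below is hypothetical; check it first in `scrapy shell`:
    #   scrapy shell "https://www.pracuj.pl/praca/..."
    #   >>> response.css("h2.offer-title::text").getall()
    for title in response.css("h2.offer-title::text").getall():
        yield {"job_title": title.strip()}
```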


I am trying to log in to chewy.com using Scrapy. Here is the code I am using to create new account details:

    from random import getrandbits
    import uuid

    def new_account():
        email = 'test+{}@gmail.com'.format(getrandbits(40))
        username = email.split('@')[0]
        password = uuid.uuid4().__str__()
        return {'email': email, 'username': username, 'password': password}

And then I tried to send the form data as follows:

    ac = new_account()
    formdata = ..
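Assuming the elided formdata maps the generated account onto the signup form's fields, a hedged sketch; the field names "email" and "password" are hypothetical and must be read from the real form HTML:

```python
def build_formdata(account):
    # Map the generated account dict onto hypothetical form field names.
    return {"email": account["email"], "password": account["password"]}

# In the spider callback, scrapy.FormRequest.from_response fills in hidden
# fields (CSRF tokens etc.) from the page's form automatically:
#
#     yield scrapy.FormRequest.from_response(
#         response,
#         formdata=build_formdata(ac),
#         callback=self.after_signup,  # hypothetical callback
#     )
```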


This spider is supposed to loop through https://lihkg.com/thread/`2169007 – i*10`/page/1, but for some reason it skips pages in the loop. I looked through the items scraped in Scrapy Cloud; the items with the following URLs were scraped:

    ...
    Item 10: https://lihkg.com/thread/2479941/page/1
    Item 11: https://lihkg.com/thread/2479981/page/1
    Item 12: https://lihkg.com/thread/2479971/page/1
    Item 13: https://lihkg.com/thread/2479931/page/1
    Item 14: https://lihkg.com/thread/2479751/page/1
    Item 15: ..
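The listed items are out of order rather than strictly missing, which usually points to Scrapy's concurrent scheduling: requests are issued in loop order, but responses arrive in whatever order the server answers, so "skipped" pages may simply appear later in the feed. A sketch of generating the URL sequence deterministically for comparison (the base id and step are from the question; the page count is an assumption):

```python
def thread_urls(base=2169007, pages=5, step=10):
    # One URL per iteration, descending by `step`, matching the question's
    # `2169007 - i*10` pattern.
    return [f"https://lihkg.com/thread/{base - i * step}/page/1"
            for i in range(pages)]
```

To confirm nothing is actually dropped, compare this list against the scraped items, or set CONCURRENT_REQUESTS = 1 to force near-sequential crawling while debugging.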


    # -*- coding: utf-8 -*-
    import scrapy
    from ..items import HomedepotItem
    import re
    import pandas as pd
    import requests
    import json
    from bs4 import BeautifulSoup

    class HomedepotSpider(scrapy.Spider):
        name = 'homeDepot'
        start_urls = ['https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560']

        def parse(self, response):
            for item in self.parseHomeDepot(response):
                yield item
            pass

        def parseHomeDepot(self, response):
            item = HomedepotItem()  # items from items.py
            soup = BeautifulSoup(requests.get(response.json()).content, ..
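Two things stand out in this snippet: requests.get(response.json()) passes a parsed JSON object where requests expects a URL, and no second download is needed at all, because the page Scrapy fetched is already in response.text. A sketch of the simpler pattern (assumes beautifulsoup4 is installed; the import is lazy so the sketch stands alone):

```python
def soup_from_response(response):
    # Reuse the body Scrapy already downloaded instead of fetching the
    # same page a second time with requests.
    from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed
    return BeautifulSoup(response.text, "html.parser")
```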
