I'm attempting to build an email scraper that takes in a CSV file of URLs and returns them with email addresses, including any additional URLs/addresses that get scraped in the process. I can't seem to get my spider to iterate through each row in the CSV file, even though the rows come back fine when I test the function I'm calling.
Here's the code, which I adapted from here:
import os, re, csv, scrapy, logging
import pandas as pd
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from googlesearch import search
from time import sleep

# Avoid getting too many logs and warnings when using Scrapy inside Jupyter Notebook.
logging.getLogger('scrapy').propagate = False

# Extract urls from file.
def get_urls():
    urls = pd.read_csv('food_urls.csv')
    url = list(urls)
    for i in url:
        return urls

# Test it.
# get_urls()
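For what it's worth, this is roughly how I check the function in the notebook (same food_urls.csv as above), and the rows print fine:

# Quick sanity check of get_urls() in the notebook
result = get_urls()
print(type(result))  # see what kind of object actually comes back
print(result)        # the rows from food_urls.csv all show up here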
# Create mail spider.
class MailSpider(scrapy.Spider):
    name = 'email'

    def parse(self, response):
        # Search for links inside URLs.
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        # Take in a list of URLs as input and read their source codes one by one.
        links = [str(link.url) for link in links]
        links.append(str(response.url))
        # Send links from one parse method to another.
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link)

    # Pass URLs to the parse_link method; this is where we apply re.findall to look for emails.
    def parse_link(self, response):
        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)
        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)
        df.to_csv(self.path, mode='a', header=False)
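In case it's relevant, here's how I understand the regex in parse_link to behave on a plain string (a toy example with made-up addresses, not output from my actual run):

import re

sample = 'Contact us at hello@example.com or admin@example.org.'
print(re.findall(r'\w+@\w+\.{1}\w+', sample))
# prints: ['hello@example.com', 'admin@example.org']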
# Save emails in a CSV file.
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False

def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return

    with open(path, 'wb') as file:
        file.close()
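These two helpers just (re)create the output file before the crawl starts; as I understand it, calling them on their own would look like this (hypothetical stand-alone use):

# Hypothetical stand-alone use of the helpers above
create_file('food_emails.csv')  # prompts before replacing an existing file
print(ask_user('Continue?'))    # True only if the answer typed is exactly "y"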
# Combine everything.
def get_info(root_file, path):
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)

    print('Collecting urls...')
    urls_list = get_urls()

    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=urls_list, path=path)
    process.start()

    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)

    return df
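My understanding from the Scrapy docs is that start_urls should end up as a plain list of URL strings by the time the spider runs, something along these lines (placeholder URLs, reusing the MailSpider and imports defined above):

# Hypothetical, hard-coded version of the crawl call for comparison
example_urls = [
    'https://example.com/',
    'https://example.org/contact/',
]
process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
process.crawl(MailSpider, start_urls=example_urls, path='food_emails.csv')
process.start()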
At the end, when I call df = get_info('food_urls.csv', 'food_emails.csv'), the scraper takes quite a while to run.
When it finished, I ran df.head() and got this:
               email                                        link
0                NaN                                         NaN
1  [email protected]   https://therecipecritic.com/food-blogger/
2  [email protected]          https://therecipecritic.com/terms/
So it's working, but it's only crawling the first URL in the list.
Does anyone know what I'm doing wrong?
Thanks!