I'm attempting to build an email scraper that takes in a CSV file of URLs and returns them with email addresses, including any additional URLs/addresses that get scraped in the process. I can't seem to get my spider to iterate through each row in the CSV file, even though the rows are returned fine when I test the function I'm calling.

Here's the code, which I adapted from here:

    import os, re, csv, scrapy, logging
    import pandas as pd
    from scrapy.crawler import CrawlerProcess
    from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
    from googlesearch import search
    from time import sleep

    # Avoid getting too many logs and warnings when using Scrapy inside Jupyter Notebook.
    logging.getLogger('scrapy').propagate = False

    # Extract urls from file.
    def get_urls():
        urls = pd.read_csv('food_urls.csv')
        url = list(urls)
        for i in url:
            return urls

    # Test it.
    # get_urls()

    # Create mail spider.
    class MailSpider(scrapy.Spider):
        name = 'email'

        def parse(self, response):
            # Search for links inside URLs.
            links = LxmlLinkExtractor(allow=()).extract_links(response)

            # Take in a list of URLs as input and read their source codes one by one.
            links = [str(link.url) for link in links]
            links.append(str(response.url))

            # Send links from one parse method to another.
            for link in links:
                yield scrapy.Request(url=link, callback=self.parse_link)

        # Pass URLs to the parse_link method; this is the method where we apply
        # our regex findall to look for emails.
        def parse_link(self, response):
            html_text = str(response.text)
            mail_list = re.findall('\w+@\w+\.{1}\w+', html_text)
            dic = {'email': mail_list, 'link': str(response.url)}
            df = pd.DataFrame(dic)
            df.to_csv(self.path, mode='a', header=False)
            df.to_csv(self.path, mode='a', header=False)

    # Save emails in a CSV file
    def ask_user(question):
        response = input(question + ' y/n' + '\n')
        if response == 'y':
            return True
        else:
            return False

    def create_file(path):
        response = False
        if os.path.exists(path):
            response = ask_user('File already exists, replace?')
            if response == False: return
        with open(path, 'wb') as file:
            file.close()

    # Combine everything
    def get_info(root_file, path):
        create_file(path)
        df = pd.DataFrame(columns=['email', 'link'], index=[0])
        df.to_csv(path, mode='w', header=True)

        print('Collecting urls...')
        urls_list = get_urls()

        print('Searching for emails...')
        process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
        process.crawl(MailSpider, start_urls=urls_list, path=path)
        process.start()

        print('Cleaning emails...')
        df = pd.read_csv(path, index_col=0)
        df.columns = ['email', 'link']
        df = df.drop_duplicates(subset='email')
        df = df.reset_index(drop=True)
        df.to_csv(path, mode='w', header=True)
        return df

At the end, when I call df = get_info('food_urls.csv', 'food_emails.csv'), the scraper takes quite a while to run. When it finished, I ran df.head() and got this:

       email              link
    0  NaN                NaN
    1  [email protected]  https://therecipecritic.com/food-blogger/
    2  [email protected]  https://therecipecritic.com/terms/

So it's working, but it's only crawling the first URL in the list. Does anyone know what I'm doing wrong? Thanks!

1 Answer

Created a Python dict with a nested list and imported it:

    from Base_URLS import URL_List

Then I called it with:

    def get_urls():
        urls = URL_List['urls']
        return urls

Worked like a charm! Thanks for the help @rodrigo-nader
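For anyone who would rather keep the URLs in the CSV file: the original get_urls() returns the DataFrame itself from inside the loop, and iterating over a DataFrame yields its column labels rather than its rows. Since pd.read_csv treats the first row as a header by default, the spider most likely only ever saw the first line of the file as a "URL". Below is a minimal sketch of a replacement that returns a plain list of strings using the standard csv module. It assumes the file has no header row and holds one URL per row in the first column; the `path` parameter is my own addition, not part of the original code.

```python
import csv

def get_urls(path='food_urls.csv'):
    """Return the first-column value of every non-empty row as a list of strings.

    Assumption: the CSV has no header row and one URL per row.
    """
    with open(path, newline='') as f:
        # csv.reader yields each row as a list of cell strings;
        # blank lines come back as empty lists and are skipped.
        return [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]
```

The returned list can be passed straight to process.crawl(MailSpider, start_urls=get_urls(), path=path), since Scrapy iterates over start_urls and expects each item to be a URL string.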
