I have a Nutch index crawled from a specific domain and I am using the solrindex command to push the crawled data to my Solr index. The problem is that only some of the crawled URLs seem to be indexed in Solr. I had the Nutch crawl output to a text file so I can see the URLs that it crawled, but when I search for some of the crawled URLs in Solr I get no results.

Command I am using to do the Nutch crawl:

    bin/nutch crawl urls -dir crawl -depth 20 -topN 2000000

This command completes successfully, and the output displays URLs that I cannot find in the resulting Solr index.

Command I am using to push the crawled data to Solr:

    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

The output for this command says it also completes successfully, so it does not seem to be an issue with the process terminating prematurely (which is what I initially thought it might be).

One final thing that I find strange is that the entire Nutch & Solr config is identical to a setup I used previously on a different server, and I had no problems that time. It is literally the same config files copied onto this new server.

TL;DR: I have a set of URLs successfully crawled in Nutch, but when I run the solrindex command only some of them are pushed to Solr. Please help.

UPDATE: I've re-run all these commands and the output still insists it's all working fine. I've looked into any blockers for indexing that I can think of, but still no luck. The URLs being passed to Solr are all active and publicly accessible, so that's not an issue. I'm really banging my head against a wall here, so I would love some help.

1 Answer

I can only guess what happened, based on my own experience.

There is a component called url-normalizer (with its configuration url-normalizer.xml) which truncates some URLs (removing URL parameters, session IDs, ...).

Additionally, Nutch applies a uniqueness constraint: by default each URL is only saved once. So, if the normalizer truncates two or more URLs ('foo.jsp?param=value', 'foo.jsp?param=value2', 'foo.jsp?param=value3', ...) to exactly the same one ('foo.jsp'), they only get saved once. As a result, Solr will only see a subset of all your crawled URLs.

Cheers
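To check whether this explains the missing documents, you could compare the list of crawled URLs against what they collapse to after normalization. The following is only a rough sketch in Python, not Nutch's actual normalizer: the function names strip_query and collapse, the example.com URLs, and the rule "drop the whole query string and fragment" are all assumptions made for illustration. The real rules depend on your normalizer configuration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Hypothetical normalization rule: drop the query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def collapse(urls):
    """Group the original URLs by their normalized form."""
    groups = {}
    for url in urls:
        groups.setdefault(strip_query(url), []).append(url)
    return groups


# Made-up sample data standing in for the URLs listed in the crawl output.
crawled = [
    "http://example.com/foo.jsp?param=value",
    "http://example.com/foo.jsp?param=value2",
    "http://example.com/foo.jsp?param=value3",
    "http://example.com/bar.jsp",
]

groups = collapse(crawled)

# Any normalized URL with more than one original would be stored only once.
for normalized, originals in groups.items():
    if len(originals) > 1:
        print(f"{len(originals)} crawled URLs collapse to {normalized}")
```

If the number of distinct normalized URLs is close to the number of documents you see in Solr, the normalizer is a likely cause; if not, the problem probably lies elsewhere.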
