
Web Crawler on all IPs


fredreload

Recommended Posts

So I've played around with web crawlers and Selenium before. I am currently looking for a way to download web pages to collect texts. I am using Python and am thinking of scraping through IPs. Which I believe ranges from 0.0.0.0 to 999.999.999.999? I think most people have port 80 open?

So I would have a loop from 0.0.0.0 to 999.999.999.999 and just use urllib to download the web pages. Let me know if this is correct or if there is a better way, thanks.
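Roughly, the loop I have in mind is something like this sketch (untested, Python 2 with urllib2; the timeout value is just a guess):

    # Rough sketch of the brute-force idea: walk IPv4 addresses and try to
    # fetch whatever answers on port 80 with urllib2 (Python 2).
    import itertools
    import urllib2

    def fetch(ip, timeout=3):
        """Try to download the default page served on port 80 of this IP."""
        try:
            return urllib2.urlopen('http://%s/' % ip, timeout=timeout).read()
        except Exception:
            return None  # nothing listening, connection refused, timed out, ...

    # Iterate over every a.b.c.d combination (range(256) gives octets 0-255).
    for a, b, c, d in itertools.product(range(256), repeat=4):
        ip = '%d.%d.%d.%d' % (a, b, c, d)
        page = fetch(ip)
        if page:
            print ip, len(page)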


Alright, I've managed to use urllib2 with Python to traverse from 0.0.0.0 to 256.256.256.256, but this does not go to the sub web pages, i.e. 0.0.0.0/subfolder/. So I would like to know if there's a tree traversal into all the contents listed by an IP, or does urllib2 already do that?
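In case urllib2 doesn't do it by itself, I guess the traversal would have to look something like this rough sketch: parse each downloaded page for links and visit those too (Python 2; the page limit and timeout are made-up values):

    # Sketch: breadth-first traversal of the pages reachable from one start URL,
    # staying on the same host. urllib2 itself does not follow links.
    import urllib2
    from HTMLParser import HTMLParser
    from urlparse import urljoin, urlparse

    class LinkParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=50):
        seen, queue, pages = set([start_url]), [start_url], {}
        host = urlparse(start_url).netloc
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            try:
                html = urllib2.urlopen(url, timeout=5).read()
            except Exception:
                continue                      # unreachable or not HTTP
            pages[url] = html
            parser = LinkParser()
            try:
                parser.feed(html)
            except Exception:
                continue                      # not parseable HTML
            for link in parser.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages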


9 hours ago, fredreload said:

Which I believe ranges from 0.0.0.0 to 999.999.999.999?

OMG, your computer knowledge is near zero...

If each of them is an unsigned byte, what will the range be? 0 ... 2^8 - 1 = 0 ... 255

0.0.0.0 .... 255.255.255.255

(NOT 256.256.256.256 !)

How about starting by reading the Wikipedia page about IPv4 addresses, for example?

You will learn which IP addresses must be skipped because they have special meaning.

 

Scanning all of IPv4 from the start is a silly idea. It's ~4.3 billion IPs. Visiting one per second would take 136 years.
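The arithmetic behind that estimate, in a quick Python session:

    >>> 2 ** 32                       # number of possible IPv4 addresses
    4294967296
    >>> 2 ** 32 / (3600 * 24 * 365)   # at one address per second, in years
    136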

The majority of them have no servers or even computers behind them, so you will just be wasting time.

One IP can correspond to hundreds or thousands of computers. Connecting to ports 80, 443, 8080 won't give you much. Virtual servers are typically configured so that they REQUIRE a host name to reveal their content. (Did you ever configure a virtual host in Apache? https://httpd.apache.org/docs/current/vhosts/examples.html )
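You can see this for yourself with urllib2 by overriding the Host header when talking to one IP (a rough sketch; 203.0.113.7 and example.com are placeholders, not real targets):

    # Sketch: ask the same IP for two different names by overriding the Host
    # header; a name-based virtual server returns different content per name.
    import urllib2

    def fetch_vhost(ip, hostname, timeout=5):
        request = urllib2.Request('http://%s/' % ip)
        request.add_header('Host', hostname)   # which virtual host we want
        return urllib2.urlopen(request, timeout=timeout).read()

    # fetch_vhost('203.0.113.7', 'example.com')   # placeholder values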

You should start by collecting host names. That's why web crawlers analyze web pages: to find A HREF HTML tags.

 

Google made a special technique for web admins to reveal which pages should or should not be visited by their bot. But it can also be used to examine which pages are hosted on a server, if you pretend to be the Google bot. If you have ever set up a website with optimization for Google, using their panel, the instructions on how to optimize and how to deal with 404 errors should be there.
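If that means robots.txt and the Sitemap protocol, here is a minimal sketch of reading them with Python 2 (example.com is a placeholder, and not every server publishes a sitemap at the usual location):

    # Sketch: read a site's robots.txt, then list page URLs from /sitemap.xml
    # if one exists. Both files are conventions, not guarantees.
    import urllib2
    import robotparser
    import xml.etree.ElementTree as ET

    SITE = 'http://example.com'                    # placeholder host name

    rp = robotparser.RobotFileParser()
    rp.set_url(SITE + '/robots.txt')
    rp.read()
    print rp.can_fetch('*', SITE + '/some/page')   # is this path open to bots?

    try:
        xml_data = urllib2.urlopen(SITE + '/sitemap.xml', timeout=5).read()
        ns = '{http://www.sitemaps.org/schemas/sitemap/0.9}'
        for loc in ET.fromstring(xml_data).iter(ns + 'loc'):
            print loc.text                         # one URL hosted on the site
    except Exception:
        print 'no sitemap at the usual location'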

Edited by Sensei

Ya well, you guys are right. I just need a text dump of science articles, and they need to be repeating. For instance, I expect 100 articles talking about the same feature of a lizard. I tried the Wikipedia dump file, but it is non-repeating.

So if any of you know of a huge text dump of science articles, let me know. Otherwise I'll have to scrape the IPs.


On 18/12/2017 at 10:53 AM, fredreload said:

I am currently looking for a way to download web pages to collect texts.

There are large text corpora available without you needing to write your own web crawler. For example: https://corpus.byu.edu or https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/index2.html

 

 


8 hours ago, Sensei said:

There was no AI in your idea from the beginning.

 

Ya well, I thought n-grams would be a good thing to apply to an AI, but based on the discussion it turns out they are not. So I am going with thought bubbles and a neural network now.

Edited by fredreload
