An interesting new search engine idea


Cap'n Refsmmat

I've spotted a very interesting new concept for a search engine.

 

Simply put, it uses distributed computing rather than one centralized cluster of servers to do the crawling. Because many volunteer crawlers run at once instead of a few big ones, the system discovers enormous numbers of new URLs every day, along with lots of new data. I think it's a rather nice idea.
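The thread doesn't show Majestic-12's actual client protocol, but the general idea - many independent clients crawling and reporting back to a central server - can be sketched roughly like this (all names, and the in-memory "web" standing in for real HTTP fetches, are hypothetical):

```python
import re

# Hypothetical in-memory "web" so the sketch runs without a network.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> hello',
    "http://b.example/": '<a href="http://a.example/">a</a> world',
}

def fetch(url):
    return FAKE_WEB.get(url, "")

def crawl(seed_urls, max_pages=10):
    """Each volunteer client crawls from its seeds, collecting
    (url, content) pairs to send back to the central server."""
    seen, queue, results = set(), list(seed_urls), []
    while queue and len(results) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        content = fetch(url)
        results.append((url, content))
        # Discover new URLs to feed back into the shared frontier.
        queue.extend(re.findall(r'href="([^"]+)"', content))
    return results

pages = crawl(["http://a.example/"])
```

With many clients each running a loop like this on different seed sets, URL discovery scales with the number of volunteers rather than with one company's server farm.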

 

Unfortunately, their alpha search component (the bit for actually searching what the crawlers have gathered) is a bit lacking. Many searches turn up Microsoft as the first result - no idea why - along with other irrelevant things. They have multiple ranking algorithms, however, so I think progress is being made there.

 

Thoughts? I think this approach could do a much better job of crawling as much of the internet's content as possible.

 

link: http://www.majestic12.co.uk/


Wow. Just to test, I searched "star wars" on both this and Google. This new site got me 1,345,560 results; Google gets me "about 144,000."

 

Granted, I would never search through all the results, but it suggests there's a better chance of finding the specific things you may be looking for under a subject.


No, the individual with the most URLs for today has 4,869,963.

 

There are multiple algorithms because you can fiddle with your own and see if you can make it better than the default one. The default is rather poor on relevancy, although that's what the owner plans to improve now that the crawler works well.
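The site's real ranking algorithms aren't described in the thread, but as a sketch of what "fiddling with your own algorithm" might mean, compare a naive term-count default with a length-normalized variant (both entirely hypothetical):

```python
def default_rank(docs, query):
    # Naive default: rank by raw occurrences of the query term,
    # which lets long, repetitive pages dominate.
    return sorted(docs, key=lambda d: -d["text"].lower().count(query))

def length_normalized_rank(docs, query):
    # A user-tweaked variant: normalize by document length so
    # sheer repetition on a long page no longer wins.
    def score(d):
        words = d["text"].lower().split()
        return words.count(query) / max(len(words), 1)
    return sorted(docs, key=lambda d: -score(d))

docs = [
    {"url": "long", "text": "star " * 3 + "filler " * 97},
    {"url": "short", "text": "star star"},
]
```

Under the default, the long page wins on raw counts; the normalized variant prefers the shorter, denser page - exactly the kind of tweak users could race against the default.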


Isn't that the future of computing? Distributed systems - even that BBC climate program (though rubbish at process management) seems to be hooking onto the idea that the world's computers together make up a bigger machine than the world's current most powerful one. By the way, what is the world's most powerful computer these days?

 

A few years ago I was told it was the mainframe being installed at the Met Office's new Exeter headquarters, though obviously that won't be true anymore!


Cap'n and I got different results because when I searched "star wars" I included the quotes, I believe. I get 174 million or so on Google when I don't.

 

Oddly enough, retesting that, Majestic-12 gives me 679,774 results for "star wars" this time...


The problem with delegating the construction of a search engine index is verifying the authenticity of the data returned. I think such a system would become immensely vulnerable to spam... think of how many spammers already control botnets with tens or hundreds of thousands of infected machines. How could you possibly protect a distributed search engine index from spam attacks from these systems?

 

I predict a search in such a system would yield results for porn and online gambling sites for virtually every search term.


> The problem with delegating the construction of a search engine index is verifying the authenticity of the data returned. I think such a system would become immensely vulnerable to spam... think of how many spammers already control botnets with tens or hundreds of thousands of infected machines. How could you possibly protect a distributed search engine index from spam attacks from these systems?
>
> I predict a search in such a system would yield results for porn and online gambling sites for virtually every search term.

The actual indexing is done on the server. All the client does is gather up URLs and their content. Only the server can decide what the content is, and what searches it will show up in.
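The thread doesn't say how (or whether) the server checks that clients submit honest content; one plausible defense - purely a sketch, with hypothetical names - is to re-fetch a random sample of each submitted batch and compare content hashes before accepting it:

```python
import hashlib
import random

def spot_check(submissions, refetch, sample_size=2, rng=None):
    """Accept a client's (url, content) submissions only if a random
    sample matches what the server sees when it re-fetches itself."""
    rng = rng or random.Random(0)  # seeded for reproducibility in this sketch
    sample = rng.sample(submissions, min(sample_size, len(submissions)))
    for url, content in sample:
        claimed = hashlib.sha256(content.encode()).hexdigest()
        actual = hashlib.sha256(refetch(url).encode()).hexdigest()
        if claimed != actual:
            return False  # client lied (or the page changed); reject the batch
    return True

# Tiny demo: the server's own view of two pages.
pages = {"u1": "hello", "u2": "world"}
honest = spot_check([("u1", "hello"), ("u2", "world")], lambda u: pages[u])
dishonest = spot_check([("u1", "SPAM"), ("u2", "world")], lambda u: pages[u])
```

Spot-checking only a sample keeps the server's bandwidth cost a fraction of the clients', while still making large-scale falsification risky for a botnet.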

I honestly believe this is placing the wrong emphasis on where search engines need to go. The number of sites trawled isn't that much of an issue whether you have lots of standalone machines dotted about or a smaller number of central clusters. The real issue is how the data is transformed into information that is actually usable, and Google is still by far the best on that front, albeit far from perfect.

 

But regardless, searching needs to become a great deal more personalised, and I have some designs in mind for how to achieve this over the next 10 years. Not a replacement for Google and the like, just an additional method.


> I honestly believe this is placing the wrong emphasis on where search engines need to go. The number of sites trawled isn't that much of an issue whether you have lots of standalone machines dotted about or a smaller number of central clusters. The real issue is how the data is transformed into information that is actually usable, and Google is still by far the best on that front, albeit far from perfect.

If a search engine could get a huge index and relevant results, it would be a world-beater.

 

> But regardless, searching needs to become a great deal more personalised, and I have some designs in mind for how to achieve this over the next 10 years. Not a replacement for Google and the like, just an additional method.

The problem with personalization is that sometimes people break out of the personalized "mold." If, for example, they always search for pet information, the engine might personalize to bring up more relevant results, but then their evil sibling gets on and tries to find instructions for nuclear weapons and only gets guinea pig feeding directions.
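One common way to get personalization without trapping users in the "mold" - purely a sketch, not anything the posters propose - is to blend a personal score with the global score instead of replacing it:

```python
def blended_score(global_score, personal_score, weight=0.3):
    # Keep most of the global relevance signal, so a query far
    # outside the user's profile still surfaces the right pages.
    return (1 - weight) * global_score + weight * personal_score

# The pet-lover's profile boosts pet pages a little, but a page that
# is globally far more relevant to an off-profile query still wins.
pet_page = blended_score(global_score=0.2, personal_score=0.9)
other_page = blended_score(global_score=0.8, personal_score=0.0)
```

With a modest weight like 0.3, the profile nudges rankings within the user's usual interests but cannot bury a globally dominant result for an unusual query.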

