Too much text?
Go right to SMIDER's experimental results.
Note that we recently had to shut down part of those results because of high traffic!
 

A Paper!

Here is a paper which contains some material about SMIDER, though it is mainly about "Living Documents".

SMIDER the SMart spIDER


What is it?

Buzzwords

Well, it is important to use the right buzzwords nowadays, so here are the ones which apply to SMIDER:
  • Internet Data Mining: That is, we search the internet for useful information.
  • Topology Analyzer: That is, the evaluation SMIDER produces is based not only on the content but (mostly, in fact) on the links found. Also: modelling the web as a graph, hypertext classification, mining link structures.
  • Focused Crawler: That is, evaluation and crawling are combined, so that SMIDER can immediately concentrate the search around pages which receive high scores. This aims at a high retrieval rate, i.e. about 50% of the pages SMIDER looks at have at least something to do with the searched theme. Also: topic-specific search.
  • Mining Communities: Since the search is guided by a certain goal (keywords), the found pages are thought to form a community of pages belonging to that particular theme. Also: resource discovery.
  • IBM's Project CLEVER: another approach which is fairly similar to SMIDER but much better funded. It differs in many details, and it currently seems to lack SMIDER's keyword identification feature.
You can find links below on this page.

This is a keyword-guided search engine, i.e. you give the engine a bunch of words and it tries to sift out the most relevant pages for those keywords from the internet.

Other authors would say that SMIDER is not really a search engine in the usual sense but rather an automated resource compiler. You give it a theme and it looks for what is relevant.

The strategy of SMIDER is meant to mimic human behaviour on the net (only somewhat dumber, but therefore fast and cheap). This method is also called focussed crawling.

The machine uses different sources to get information:
  1. Search engines (Google, Altavista and Lycos; unless they change their interfaces again, it's kind of a hack).
  2. Found pages and the links they contain.
  3. Pages which link to already found ones (found by means of Google relation and Altavista link searches).
While the search seeds itself from some search engines, the further progress is guided by rather complex priorities. The engine scores its results and keeps exploring at the same time, so there is no fixed order: each time a thread needs a new URL, a complex choice is made.
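Just to make the idea concrete, here is a rough Java sketch of such a priority-driven crawl loop: a shared priority queue of URLs from which the worker threads take the currently most promising one. All class and method names here are made up for illustration; this is not SMIDER's actual code.

    import java.util.List;
    import java.util.concurrent.PriorityBlockingQueue;

    // Minimal sketch of a priority-driven focussed crawl; all names are
    // illustrative only and do not reflect SMIDER's real implementation.
    class FocussedCrawlSketch {

        // A URL waiting to be fetched, ordered so the highest score comes out first.
        static class Candidate implements Comparable<Candidate> {
            final String url;
            final double priority;
            Candidate(String url, double priority) { this.url = url; this.priority = priority; }
            public int compareTo(Candidate o) { return Double.compare(o.priority, priority); }
        }

        // Shared frontier of URLs still to be visited.
        private final PriorityBlockingQueue<Candidate> frontier = new PriorityBlockingQueue<>();

        // Seed the frontier, e.g. with result URLs from a search engine query.
        void seed(List<String> searchEngineHits) {
            for (String url : searchEngineHits) frontier.add(new Candidate(url, 1.0));
        }

        // Body of each worker thread: always take the currently most promising URL,
        // fetch it, then score and enqueue the links found on it.
        void workerLoop() throws InterruptedException {
            while (true) {
                Candidate next = frontier.take();   // blocks until work is available
                for (String link : fetchAndExtractLinks(next.url)) {
                    frontier.add(new Candidate(link, scoreFor(link)));
                }
            }
        }

        // Placeholders: real fetching, parsing and scoring are left out of this sketch.
        List<String> fetchAndExtractLinks(String url) { return List.of(); }
        double scoreFor(String url) { return Math.random(); }
    }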

What results do you get?

The pages found are classified in three ways: every page gets a hub score, an authority score and a citation score.

The first two scores are calculated by one algorithm, the citation score by a different one. The most famous user of a citation score is currently probably Google. Unlike the original algorithms in the literature, SMIDER's authority scoring uses page content, but only to a very small extent.
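The hub and authority scores are apparently in the spirit of the hubs-and-authorities iteration known from the literature (Kleinberg's HITS). The sketch below shows that textbook iteration, not SMIDER's own variant, simply to illustrate how scores propagate purely along the links:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Textbook hubs-and-authorities iteration over a small link graph.
    // It only shows how scores flow along links; SMIDER's own scoring
    // differs in details (it also peeks a little at page content).
    class HubAuthoritySketch {

        // graph maps each page URL to the URLs it links to
        static void iterate(Map<String, List<String>> graph, int rounds) {
            Map<String, Double> hub = new HashMap<>();
            Map<String, Double> auth = new HashMap<>();
            graph.forEach((page, out) -> {
                hub.put(page, 1.0);
                auth.put(page, 1.0);
                out.forEach(t -> { hub.putIfAbsent(t, 1.0); auth.putIfAbsent(t, 1.0); });
            });

            for (int r = 0; r < rounds; r++) {
                // a page is a good authority if good hubs point to it
                Map<String, Double> newAuth = new HashMap<>();
                graph.forEach((page, out) ->
                        out.forEach(t -> newAuth.merge(t, hub.get(page), Double::sum)));
                // a page is a good hub if it points to good authorities
                Map<String, Double> newHub = new HashMap<>();
                graph.forEach((page, out) ->
                        out.forEach(t -> newHub.merge(page, newAuth.getOrDefault(t, 0.0), Double::sum)));

                normalize(newAuth);
                normalize(newHub);
                auth.clear(); auth.putAll(newAuth);
                hub.clear();  hub.putAll(newHub);
            }
            System.out.println("authorities: " + auth + "\nhubs: " + hub);
        }

        // scale scores so they sum to one and stay comparable across rounds
        static void normalize(Map<String, Double> scores) {
            double sum = scores.values().stream().mapToDouble(Double::doubleValue).sum();
            if (sum > 0) scores.replaceAll((k, v) -> v / sum);
        }
    }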

There is also one more score used internally to prioritize the search. For normal web pages it is derived from the other scores, but it can also come from very different sources, e.g. for search engine queries.
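The exact derivation of this priority is not spelled out here; as a purely hypothetical illustration, it might be pictured as a weighted combination along these lines (the weights and the special value for search engine queries are invented):

    // Purely hypothetical shape of the internal crawl priority; the weights and
    // the constant for search engine queries are invented for illustration only.
    class PrioritySketch {
        static final double SEARCH_ENGINE_PRIORITY = 10.0; // query URLs are ranked by other means

        static double crawlPriority(double hub, double authority, double citation,
                                    boolean isSearchEngineQuery) {
            return isSearchEngineQuery
                    ? SEARCH_ENGINE_PRIORITY
                    : 0.5 * authority + 0.3 * hub + 0.2 * citation;
        }
    }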

Note: Authorities are supposed to contain good content, but a page can get a large authority score from the algorithm although its content has not even been retrieved yet! This is rarely seen, though, since high authorities are ranked high on the algorithm's list of URLs to spider next. Current experimental features are

How good is it?

Well, it depends. The code is still not perfect and contains a few minor bugs. On the other hand, some searches I tried actually resulted in interesting findings. Remember it is a machine which surfs for you, looks at a few hundred pages as fast as bandwidth allows and then points you to a few recommended pages. Remember also that this research has been conducted on a rather conventional internet connection, not to be compared with the server farms real search engines or highly funded research projects have. You might want to see for yourself on the page of recent test runs. Note that this page changes automagically, and sometimes, while spidering, a link to a test result might not work or might behave erratically...

Mixed observations:

  1. In some sense SMIDER is very, very cheap: it runs on a standard PC with less than 64 MB main memory and much less on disk (MB, not GB!) with no special web connectivity, and it is able to produce a reasonable link list in a quarter of an hour (and I'll get it faster once I work around some of the synchronization issues). Recently I've seen a paper where they had the whole Japanese-language internet in local disc storage. Well, ...
  2. Commercial sites seem to have mechanisms to trap the search. I tried to include some methods to counter this, but not enough yet. It does seem to get better with every version of SMIDER.
  3. If the domain to search is too wide, the search sometimes loses focus. E.g. once I asked for fat reduction and ended up in the general health domain. Remember that the textual content of found pages is nearly ignored, so sometimes the search is led into completely differently themed areas. In such cases it often starts well but then suddenly shifts its focus. But maybe this is not a problem of the search but of the question asked?!
  4. Sometimes pages with very many links are rated as good "hubs"; it is not so hard to have 10 good links on a page with a few hundred. And afterwards the human user again has the problem of picking the good links out of the whole bunch.
  5. In narrower, more research-oriented areas I'm fairly satisfied with some of the results.
  6. The presentation of the final results is more appropriate for debugging than for guiding you into your area of interest.
  7. With the inclusion of automatically generated keywords, the generated result pages have finally achieved the ability to cheat Lycos into believing that those automated indices are highly valuable resources. Actually there are quite a few searches where SMIDER pages get pole position on Lycos.

Miscellaneous details about the engine:

  • One of the fun points of this method is that the textual content of the found pages is (nearly) ignored; the link structure is the main guide.
  • The engine tries to be host friendly, i.e. it delays hitting the same host twice by a configurable minimum gap (see the small sketch after this list). It does not respect robots.txt, since it should act like a human surfer rather than an engine :-)
  • The engine is configurably multithreaded.
  • It is written in Java. Actually I once started with Perl, but I didn't manage to cope with the complexities of the algorithm in such a script-like language, which is probably a sign of my lack of knowledge of Perl.
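The host-friendly delay mentioned in the list above can be pictured roughly like this. It is a simplified sketch with invented names; a real crawler would not hold one global lock while sleeping:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the "host friendly" gap: remember when each host was last hit
    // and wait until the configured minimum gap has passed before hitting it again.
    class PolitenessSketch {
        private final long minGapMillis;                 // configurable minimum gap
        private final Map<String, Long> lastHit = new HashMap<>();

        PolitenessSketch(long minGapMillis) { this.minGapMillis = minGapMillis; }

        // Block the calling thread until the given host may be contacted again.
        // (Simplification: a real crawler would not sleep while holding this lock.)
        synchronized void waitForHost(String host) throws InterruptedException {
            long last = lastHit.getOrDefault(host, 0L);
            long wait = last + minGapMillis - System.currentTimeMillis();
            if (wait > 0) Thread.sleep(wait);
            lastHit.put(host, System.currentTimeMillis());
        }
    }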

Applications

There are probably many; a few come to mind immediately:

Test Runs

Here is an automatically compiled list of test runs.

Future Plans, Goals and Ideas

Of course there are many paths left to improve this raw prototype.
  1. Include more search engines as starting points.
  2. Include user-supplied URLs/pages as starting points.
  3. It should be easily possible to focus the search on more recent pages by adjusting certain scores with time-dependent factors. Needs evaluation.
  4. Searching Usenet newsgroups would be a way to find the latest information about certain topics there.
  5. The content of found pages should have more influence on the result. How? (Text analysis based on words, n-grams?)
  6. A stopping criterion for the search should be derived and evaluated, e.g. a limit on the interest of the most interesting page to be searched next...
  7. The "commercial site" problem with largely interconnected sites should be solved.
  8. A PDF engine should be included, to allow indexing PDF files.
  9. The presentation of results should be improved. E.g. text snippets from found pages would help, as would a better layout of the result; the information in those score numbers should be humanized.
  10. The graphical representation (there is an easily overlooked applet link in the samples, way down) should be improved and could help visualize the result.
  11. The search should be interruptible, and the engine should have a low-bandwidth background mode. Currently it preferably runs for a short time at high bandwidth. A background database would help.
  12. Maybe it is possible to derive different terms for the search keys from the found pages.
  13. The search should be updatable, i.e. old results can be refreshed by later searches to maintain a recommendation instead of recalculating from scratch.
  14. A tighter integration with an indexing engine like the fabulous Lucene could help. Lucene could provide more information to evaluate pages. We need language-adapted keywords.
  15. A means of user interaction would be cool. E.g. while the machine recommends, the user surfs in parallel. If the user finds a certain page especially poor, they can override the machine's judgement about that page. Due to the interconnected evaluation of pages, a small change could cause very different pages to be explored by the engine.
  16. Another form of user interaction would be the "living theme portal", i.e. a page based on a very slowly running SMIDER (crawling only a few pages an hour) which allows visitors to rate the seen pages themselves and feeds those "peer ratings" back into SMIDER's evaluation mechanism.
  17. I'd like to replace the HTML output by an XML-based one. XSLT could then render whatever result one needs.
  18. ...
  19. Some ideas can also be found in the comments on the Versions page.

It seems that if I had a sponsor I could make quite a few research projects out of this.

Versions

There is a coarse Version Log of Changes to Smider.

Literature?

Well, there are many ideas in the same research direction to be found on the net. But I was too lazy to compile them (not really; in fact I've even read a few and believe I've understood a few). But here is what SMIDER compiled with an ad hoc theme about its own technology:
Actually this was a good run, and even if it does not list all the pages I know, it contains so many links that you will easily find them.

Remember, today the computer is only an aid; it's still our duty to think for ourselves. It's such a nice world.


