A Paper!
Here is a paper which contains some material about Smider,
though it is mainly in "Living Documents".
SMIDER the SMart spIDER
What is it?
Buzzwords
Well, it is so important to use the right buzzwords nowadays, so
here are the ones which apply to SMIDER:
- Internet Data Mining: we search the internet
for useful information.
- Topology Analyzer: the evaluation SMIDER produces
is based not only on the content but (mostly, in fact) on the links
found. Also: modelling the web as a graph, hypertext classification,
mining link structures.
- Focused Crawler: evaluation and crawling are
combined, so that SMIDER can immediately concentrate the search around
pages which receive high scores. This aims at a
high retrieval rate, i.e. about 50% of the pages SMIDER
looks at have at least something to do with the searched theme. Also:
topic-specific search.
- Mining Communities: since the search is guided by a certain
goal (keywords), the pages found are thought to form a community of
pages belonging to that particular theme. Also: resource discovery.
- IBM's Project CLEVER: another approach which is fairly
similar to SMIDER but much better funded. It differs in many
details, and it currently seems to lack the keyword identification
feature of SMIDER.
You can find links below on this page.
This is a keyword-guided search engine, i.e. you give the engine
a bunch of words, and the engine tries to sift out the most relevant
pages for those keywords from the internet.
Other authors would state that SMIDER
is not really a search engine in the usual
sense but rather an automated resource
compiler: you give it a theme and it looks for what is relevant.
The strategy of SMIDER is meant
to mimic human behaviour on the net (only slightly dumb,
but therefore fast and cheap). This
method is also called focused crawling.
The machine uses different sources to
get information:
- Search engines (Google, Altavista and Lycos, unless they change
their interface again; it's kind of a hack).
- Found pages and the links they contain.
- Pages which link to already found ones (found by means of
Google Relation and Altavista Link searches).
While the search seeds itself from some search engines, its further
progress is guided by rather complex priorities. The engine
simultaneously scores its results and continues exploring. So there is
no particular order, and each time a thread needs a new URL,
a complex choice is made.
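The scheduling described above can be pictured as a priority-ordered frontier shared by the worker threads. The following is only a minimal Java sketch under simplifying assumptions: every URL carries a single numeric priority that is fixed when it is queued, whereas SMIDER rescores while it explores and its real choice is more complex. The class and method names (Frontier, offer, next) are hypothetical and not taken from SMIDER's code.

```java
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical sketch of a crawl frontier; not SMIDER's actual code. */
class Frontier {

    /** A URL together with the priority it had when it was queued. */
    private static final class Entry {
        final String url;
        final double priority;

        Entry(String url, double priority) {
            this.url = url;
            this.priority = priority;
        }
    }

    // Highest priority first.
    private final PriorityQueue<Entry> queue =
            new PriorityQueue<>((a, b) -> Double.compare(b.priority, a.priority));

    // URLs that were offered before, so that no page is queued twice.
    private final Set<String> seen = new HashSet<>();

    /** Queues a URL unless it was offered before; returns true if it was added. */
    synchronized boolean offer(String url, double priority) {
        if (!seen.add(url)) {
            return false;
        }
        queue.add(new Entry(url, priority));
        return true;
    }

    /** Called by a worker thread that needs its next URL; returns null if none is waiting. */
    synchronized String next() {
        Entry e = queue.poll();
        return e == null ? null : e.url;
    }
}
```

Seeds from the search engines and links found later would both go through offer; the synchronized methods are there because several threads ask for URLs at the same time.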
What results do you get?
The pages found are classified in three ways. Every page
gets
- a Hub-score for linking to valuable pages,
- an Authority-score for being linked to by good pages and
- a Citation-score for being linked from pages which
themselves have a high Citation-score.
The first two scores
are calculated by one algorithm and the citation score by
a different one. The most famous use of a citation score is
currently probably Google. Unlike the original algorithms in the
literature, the Authority scoring of SMIDER uses page content,
but only to a very small extent.
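To make the three scores more concrete, here is a minimal Java sketch of the two kinds of algorithm in their plain textbook form: a mutual-reinforcement iteration for hub and authority scores, and a PageRank-style iteration for the citation score. It is not SMIDER's implementation. It ignores page content entirely, the damping factor is an assumption that the text above does not mention, and the names (LinkScores, hubsAndAuthorities, citation) are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of link-based scoring; not SMIDER's actual code. */
class LinkScores {

    /** Scales a vector so that its entries sum to 1 (left unchanged if the sum is 0). */
    static void normalize(double[] v) {
        double sum = Arrays.stream(v).sum();
        if (sum > 0) {
            for (int i = 0; i < v.length; i++) {
                v[i] /= sum;
            }
        }
    }

    /**
     * Hub and authority scores by mutual reinforcement.
     * links[i] lists the indices of the pages that page i links to.
     * Returns an array {hub, authority}.
     */
    static double[][] hubsAndAuthorities(int[][] links, int iterations) {
        int n = links.length;
        double[] hub = new double[n];
        double[] auth = new double[n];
        Arrays.fill(hub, 1.0);
        for (int it = 0; it < iterations; it++) {
            // Authority: sum of the hub scores of the pages linking in.
            Arrays.fill(auth, 0.0);
            for (int i = 0; i < n; i++) {
                for (int j : links[i]) {
                    auth[j] += hub[i];
                }
            }
            normalize(auth);
            // Hub: sum of the authority scores of the pages linked to.
            Arrays.fill(hub, 0.0);
            for (int i = 0; i < n; i++) {
                for (int j : links[i]) {
                    hub[i] += auth[j];
                }
            }
            normalize(hub);
        }
        return new double[][] {hub, auth};
    }

    /** Citation score: each page passes its own score on to the pages it links to. */
    static double[] citation(int[][] links, int iterations, double damping) {
        int n = links.length;
        double[] score = new double[n];
        Arrays.fill(score, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // A page without out-links spreads its score evenly over all pages.
                    for (int j = 0; j < n; j++) {
                        next[j] += damping * score[i] / n;
                    }
                } else {
                    for (int j : links[i]) {
                        next[j] += damping * score[i] / links[i].length;
                    }
                }
            }
            score = next;
        }
        return score;
    }
}
```

For a tiny graph in which page 0 links to pages 2 and 3 and page 1 links to page 2, page 2 ends up with the highest authority and citation scores, and page 0 is the best hub.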
There is also one more score, used internally to prioritize the search.
For normal web pages this score is derived from the other ones,
but for other items, e.g. search engine queries, it comes from very different sources.
Note: Authorities are supposed to contain good content, but a
page can get a large authority-score from the algorithm although the
content of the page has not even been retrieved yet! But this is
rarely seen, since high authorities get ranked high on the
algorithm's list of URLs to spider next.
Current experimental features are:
- Instead of giving three rankings (authority, hub and cite),
give a smart accumulation of all.
- Pure application of the citation score does not seem to be
too good, compared with the other two.
- Give a nicer, denser representation of the result.
- Try to find additional keywords belonging to the
searched theme.
How good is it?
Well, it depends. The code is still not perfect and still contains
a few minor bugs. On the other hand, some searches I tried actually
resulted in interesting findings. Remember, it is a machine which surfs
for you, looks at a few hundred pages as fast as bandwidth allows
and then points you to a few recommended pages. Remember also that
this research has been conducted on a rather conventional internet
connection, not to be compared with the super server farms that real
search engines or highly funded research projects have.
You might want to see for yourself on the
page of recent test runs.
Note that this page changes automagically, and sometimes, while spidering,
a link to a test result might not even work or might behave erratically...
Mixed observations:
- In some sense Smider is very, very cheap: it runs on a standard PC
with less than 64MB main memory and much less on disk (MB, not GB!),
with no special web connectivity,
and is able to find a reasonable link list in a quarter of an hour
(and I'll get it faster when I work around some of the synchronization
issues).
Recently I've seen a paper where they had the whole Japanese-language
internet on local disc storage. Well, ...
- Commercial sites seem to have mechanisms to trap
the search. I tried to include some methods to counter this, but
not enough yet. It seems to get better with every
version of Smider.
- If the domain to search is too wide, the search sometimes loses focus.
E.g. once I asked for fat reduction and ended up in the general health domain.
Remember, the textual content of found pages is nearly ignored. So sometimes
the search is led into completely differently themed areas. In this case
it often starts well but then suddenly shifts its focus. But maybe this is
not a problem of the search but of the question asked?!
- Sometimes pages with very many links are evaluated as good "hubs";
it is not so hard to have 10 good links in a page with a few hundred.
And the human user afterwards again has the problem of picking the
good links out of the whole bunch.
- In more research-based, narrow areas I'm fairly satisfied with some of
the results.
- The presentation of the final results is more appropriate for debugging than
for guiding you into your area of interest.
- With the inclusion of automatically generated keywords, the generated
result pages have finally achieved the ability to cheat Lycos into
believing that those automated indices are highly valuable resources.
Actually there are quite a few searches where Smider pages get the
pole position from Lycos.
- This leads to nice secondary effects: if you
start Smider on a previously researched theme, or on something close to
a previous one, then it finds its own earlier results, scores them, ranks
them, evaluates them and uses the links from them. Some AI researcher
said that feedback is a prerequisite to consciousness ...
Miscellaneous details about the engine:
- One of the fun points of this method
is that the textual content of the found pages is (nearly)
ignored; link structure is the main guide.
- The engine tries to be host friendly,
i.e. it delays hitting the same host twice by a configurable
minimum gap. It does not respect robots.txt, since it should
act like a human surfer rather than an engine :-)
- The engine is configurably multithreaded.
- It is written in Java.
Actually I once started with Perl, but
I didn't manage to cope with the complexities of the algorithm
in such a script-like language, which is probably a sign of my
lack of knowledge of Perl.
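The host-friendly delay mentioned in the list above could look roughly like the following Java sketch. It only illustrates the idea of a configurable minimum gap per host. The class name HostGate, the method reserve and the way the gap is configured are assumptions, not SMIDER's code.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a per-host delay; not SMIDER's actual code. */
class HostGate {

    private final long minGapMillis;

    // For each host, the earliest time at which the next request may start.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    HostGate(long minGapMillis) {
        this.minGapMillis = minGapMillis;
    }

    /**
     * Reserves the next request slot for the host of the given URL and returns
     * how many milliseconds the caller should wait before fetching (0 if none).
     */
    synchronized long reserve(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        nextAllowed.put(host, start + minGapMillis);
        return start - nowMillis;
    }
}
```

A worker thread would call reserve(url, System.currentTimeMillis()) and sleep for the returned time before fetching. The method is synchronized because several threads may ask about the same host at once.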
Applications
Probably there are many; a few which come to mind immediately:
- Provide a specialized search service to people.
- Create web directories for themes in a new, hopefully better way.
- Evaluate search engines, e.g. an engine is better when it
has better-ranking pages on many different themes. In my small
sample of tests Google was usually ahead of Lycos, which seemed
slightly better than Altavista.
Test Runs
Here is an automatically compiled list of test runs.
Future Plans, Goals and Ideas
Of course there are many paths left to improve this raw prototype.
- Include more search engines as starting points.
- Include user-supplied URLs/pages as starting points.
- It should be easily possible to focus the search on more recent
pages by adjusting certain scores with time-dependent factors.
Needs evaluation.
- Searching usenet newsgroups would be a way to find the latest
information about certain stuff.
- The content of found pages should have more influence on the result. How?
(Text analysis based on words, n-grams?)
- A stopping criterion for the search should be derived and evaluated,
e.g. a limit on the interest of the most interesting page to be
searched next...
- The "commercial site" problem with largely interconnected sites
should be solved.
- A PDF engine should be included, to allow indexing PDF.
- Presentation of results should be improved. E.g. text snippets from
found pages would help, as would a better layout of the result; the information
in those score numbers should be humanized.
- The graphical representation (there is an easily overlooked applet link in the
samples, way down) should be improved and could help visualize the
result.
- Search should be interruptible, and the engine should have a low-bandwidth
background mode. Currently it preferably runs for a short time at high bandwidth.
A background database would help.
- Maybe it is possible to find different notions for the search keys
from the found pages.
- Search should be updatable, e.g. old results can be updated by
later searches to maintain a recommendation instead of recalculating it.
- A tighter integration with an indexing machine
like the
fabulous Lucene could help. Lucene could be able to provide more
information to evaluate pages. We need language-adapted keywords.
- A means of user interaction would be cool. E.g. while the machine
recommends, the user surfs in parallel. If the user finds a certain
page especially bad or poor, they can override the machine's judgement
about that page. Due to the interconnected evaluation of pages, a
small change could cause very different pages to be explored by the
engine.
- Another form of user interaction would be the "living theme portal", i.e.
a page which is based on a very slow running SMIDER (i.e. crawling only
a few pages an hour), which allows visitors to rate the pages they have seen
and feeds those "peer ratings" back into SMIDER's
evaluation mechanism.
- I'd like to replace the HTML output by an XML-based one. XSLT could then
render whatever result one needs.
- ...
- Some ideas can also be found in the comments on the
Versions Page.
It seems that if I had a sponsor I could make quite a few research projects
from it.
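Two of the ideas in the list above, the time-dependent score factor and the stopping criterion, are simple enough to sketch in Java. Neither exists in SMIDER, and both still need the evaluation mentioned above. The exponential half-life decay, the fixed threshold and all names (SearchIdeas, recencyAdjusted, shouldStop) are assumptions chosen purely for illustration.

```java
/** Hypothetical sketch of two ideas from the list above; nothing here is implemented in SMIDER. */
class SearchIdeas {

    /**
     * Scales a score by an exponential decay: a page that is halfLifeDays old
     * keeps half of its score, a page twice as old keeps a quarter, and so on.
     */
    static double recencyAdjusted(double score, double ageDays, double halfLifeDays) {
        return score * Math.pow(0.5, ageDays / halfLifeDays);
    }

    /**
     * Stopping criterion: stop once the most interesting page still waiting
     * to be fetched has a priority below the given threshold.
     */
    static boolean shouldStop(double bestWaitingPriority, double threshold) {
        return bestWaitingPriority < threshold;
    }
}
```

Both the half-life and the threshold would have to be tuned by experiment; the sketch only shows where such parameters would enter.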
Versions
There is a coarse Version Log of Changes to Smider.
Literature?
Well, there are many ideas in the same research direction to be
found on the net. But I was too lazy to compile them (not really; in
fact I've even read a few and believe I have understood a few).
But here is what
SMIDER compiled with an ad hoc theme on its own technology:
Actually this was a good run, and even if it does not list all the pages
I know, it contains so many links that you will easily find them.
Remember, today the computer is only an aid; it's still our duty to think
for ourselves. It's such a nice world.