SMIDER versions
Here is a small log of the changes I made between versions.
Unfortunately the version was already up to 0.37 when I started this
log, so the earlier entries are reconstructed from memory.
0.39 (March 2002)
- Properly synchronized the HostInfo PriorityQueue wrapper class.
Having had poor first experiences with JDK 1.0.2 synchronization, I
had a tendency to "undersynchronize". But JDK 1.3+ synchronization
really helps.
- Citation Score turns out to be pretty poor. Using it to prioritize
pages actually increased the probability of losing focus during a search.
As a remedy, Citation Score weights were decreased and the Citation Score
calculation got additional damping.
- The Citation Score's share of the Total Score was decreased.
- Windowing of search scores was modified to fight commercial clusters
even more.
- The parsing of files needs improvement. Currently many URLs are fetched
in their x-www-form-urlencoded form and do not get decoded. In this
version a special-case fix for "Google related" pages is introduced.
- Cleaned up a few exceptions. (Running SMIDER still throws lots of
exceptions, but most are HTTP problems with the spidered servers.)
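The decoding problem described in the parsing entry can be illustrated with a small sketch. This is not SMIDER's parser: the class name UrlDecodeSketch is made up for illustration, UTF-8 is an assumed character set, and only the standard java.net.URLDecoder class is used.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for illustration only; not part of SMIDER.
final class UrlDecodeSketch {

    // Decodes an x-www-form-urlencoded string, e.g. "%3A" becomes ":"
    // and "+" becomes a space. UTF-8 is an assumption made here.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }
}
```

Note that URLDecoder.decode throws an IllegalArgumentException on malformed escapes such as a trailing "%", so a spider would probably want to catch that and keep the raw URL.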
0.38 (Early October 2001)
- Extended use of Lucene disabled due to a bug.
- Modified functions for determination of total score and search score
- Moved over to the new Apache-based Lucene.
0.37 (September 2001)
- Changed the definition of totalScore for output to give hubscore less
weight.
- Increased the damping in the hubscore calculation again. A fast-increasing
hubscore increases the risk of the algorithm getting trapped in a
commercial site early.
- Slightly increased the likelihood of search engines being searched, also
to avoid the commercial nest trap.
- Fixed a bug that caused the TotalScore list in the results to be sorted
only roughly correctly (two different but similar methods were used in
different places). Minor refactoring to put this into one method, and
moved some classes into more appropriate packages.
- The citation score now has a minor impact on search priorities, i.e.
pages with a large citation score gain priority. From now on all three
dynamic scoring methods influence the search balance.
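The idea of damping a score and blending several scores into a total can be sketched roughly as follows. The formulas, the weights and the class name DampedScoreSketch are all assumptions made for illustration; the log does not state SMIDER's actual formulas. It is also an assumption that the three dynamic scores are the search, hub and citation scores named in this log.

```java
// Hypothetical sketch for illustration only; not SMIDER's scoring code.
final class DampedScoreSketch {

    // Moves the previous score towards the newly computed raw score.
    // A damping value close to 1.0 makes the score change slowly.
    static double damp(double previous, double raw, double damping) {
        if (damping < 0.0 || damping > 1.0) {
            throw new IllegalArgumentException("damping must be in [0, 1]");
        }
        return damping * previous + (1.0 - damping) * raw;
    }

    // Weighted blend of the three scores. The weights are placeholders,
    // chosen only so that hub and citation scores count for less.
    static double totalScore(double searchScore, double hubScore, double citationScore) {
        final double wSearch = 0.6;
        final double wHub = 0.3;
        final double wCitation = 0.1;
        return wSearch * searchScore + wHub * hubScore + wCitation * citationScore;
    }
}
```

With stronger damping a page whose raw hub score jumps suddenly gains priority only gradually, which is one plausible way to reduce the risk of being pulled into a densely interlinked site early.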
0.36
There were quite a few changes with 0.36, not all at once...
- Removed the output of the internal search information into [i] pages.
Sooner or later these consumed all the CPU needed for search
iteration. Will move that to a separate thread. The search
now scales well to about 30 fetcher threads.
- Improved the regular expression used to fetch URLs from AltaVista pages.
- Some of those versions were fairly defunct due to incorrect calculation
of search scores.
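Extracting URLs from a result page with a regular expression, as mentioned for the AltaVista pages, might look roughly like the sketch below. The pattern and the class name HrefExtractorSketch are assumptions; the expression SMIDER actually used is not given in this log. The sketch uses java.util.regex, which only arrived with JDK 1.4, so the original code must have used something else.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch for illustration only; not SMIDER's extraction code.
final class HrefExtractorSketch {

    // Matches double-quoted absolute http(s) URLs in href attributes.
    // Relative links and unquoted attributes are deliberately ignored.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*\"(https?://[^\"\\s>]+)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns all matching URLs in the order they appear in the page.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```

A regular expression like this is brittle against unusual markup, which fits the repeated regular expression fixes recorded in this log.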
0.32
- Included a third scoring method, weighted citation score (that
is approximately what Google uses). This score currently does not
influence search order. A superficial test seems to confirm what
is stated in the literature: this score performs worse
than the other two in this particular environment. Google is a
different story of course.
- Cleaned up the logic for how backlinks found by search engines are used.
They now create actual links with (estimated) weights instead of
estimated score adjustments.
- Searched pages are now simultaneously indexed with Lucene. The only
use of the index is currently to create a list of "important"
keywords in the found pages. A future application might be to use
Lucene for improved static scoring, moving towards a
sophisticated blend of static (text-based) and dynamic (link-based)
scoring.
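A weighted citation score in the PageRank style, where a citation counts for more when the citing page itself scores well, can be sketched as below. This is a generic textbook-style iteration and only one reading of "approximately what Google uses"; it is not SMIDER's implementation. The class name CitationScoreSketch, the damping value and the tiny link graph in the usage note are assumptions.

```java
import java.util.Arrays;

// Hypothetical sketch for illustration only; not SMIDER's scoring code.
final class CitationScoreSketch {

    // links[i] holds the indices of the pages that page i links to.
    // Returns one score per page; the scores sum to 1.
    static double[] scores(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] score = new double[n];
        Arrays.fill(score, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            // Base share that every page receives regardless of links.
            Arrays.fill(next, (1.0 - damping) / n);
            for (int from = 0; from < n; from++) {
                if (links[from].length == 0) {
                    // A page without outgoing links spreads its score evenly.
                    for (int to = 0; to < n; to++) {
                        next[to] += damping * score[from] / n;
                    }
                } else {
                    // Each outgoing link passes on an equal part of the score.
                    for (int to : links[from]) {
                        next[to] += damping * score[from] / links[from].length;
                    }
                }
            }
            score = next;
        }
        return score;
    }
}
```

For example, with pages 0 and 1 both linking to page 2 and page 2 linking back to page 0, calling scores(new int[][]{{2}, {2}, {0}}, 0.85, 50) ranks page 2 highest and page 1, which nobody cites, lowest.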
0.30
0.29
0.25
- Corrected various regular expressions.
0.21
- First numbered version.
Previous versions
Some months of trying, testing and fiddling. Lots of tuning
to get the right balance between use of search engines and surfing
by the algorithm itself. Implemented primitive static scoring.