BibSort Internals
            
BibSort's main goal is to speed up the sorting of search results.
It does this by creating several buckets that are then loaded by
the search_engine and cached.
The following happens for each sorting method (which can be either
a ranking method or a sorting field/list of MARC tags):
    I. Rebalancing the buckets (reconstructing the buckets):
	- (1) the data for all the records is loaded into a dictionary
	- (2) the data is adjusted by the washer method, if one is
        defined in the cfg file (dates normalized, leading
        articles removed, etc.)
	- (3) the data dictionary (containing recid:value pairs) is
        then sorted by value
	- (4) after the sorting, each record receives a weight (it can
        be the index in the sorted list or a multiple of the index)
	- (5) based on the sorted list of recids, the buckets are created
        by splitting this list into CFG_BIBSORT_BUCKETS parts
	- (6) the maximum weight per bucket will also be stored in
        the database, along with each bucket
	- (7) everything is stored in the database (see the sketch of
        this whole pass just below)
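
    Here is a minimal, self-contained sketch of this rebalancing pass
    in Python (Invenio's implementation language), using plain dicts
    and sets where the real code uses intbitsets and persists to the
    database (I.7). The names wash_value, rebalance, SPACING and
    NBUCKETS are illustrative, not the actual BibSort identifiers;
    NBUCKETS stands in for CFG_BIBSORT_BUCKETS, and the washer shown
    is a toy one.

        NBUCKETS = 4    # stands in for CFG_BIBSORT_BUCKETS
        SPACING = 10    # free weights left between consecutive records (I.4)

        def wash_value(value):
            # Toy washer (I.2): lowercase and strip a leading English article.
            value = value.lower()
            for article in ('the ', 'a ', 'an '):
                if value.startswith(article):
                    return value[len(article):]
            return value

        def rebalance(data):
            # data is the dictionary of step I.1, mapping recid -> raw value.
            washed = dict((r, wash_value(v)) for r, v in data.items())       # I.2
            ordered = sorted(washed, key=washed.get)                         # I.3
            weights = dict((r, i * SPACING) for i, r in enumerate(ordered))  # I.4
            size = len(ordered) // NBUCKETS + 1
            buckets, max_weights = [], []
            for b in range(NBUCKETS):                                        # I.5
                chunk = ordered[b * size:(b + 1) * size]
                if not chunk:
                    break
                buckets.append(set(chunk))      # intbitset in the real code
                max_weights.append(weights[chunk[-1]])                       # I.6
            return washed, weights, buckets, max_weights

    For example, rebalance({1: 'The Higgs boson', 2: 'A search',
    3: 'Neutrinos'}) sorts the washed values ('higgs boson',
    'neutrinos', 'search') and assigns weights 1->0, 3->10, 2->20,
    leaving nine free weights between any two neighbours.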
    II. Updating the buckets (can run every 5 minutes, to update
    the buckets with the latest changes; it is probably best to run
    it after BibIndex, so that users see the new changes as soon as
    they are indexed)
	- (1) detect which methods should be updated: either a
        ranking method (from the rnkMETHOD tables) or a usual
        sorting method (by field or MARC tags)
	- (1i) if it is a RNK method -> run rebalancing, since we have
        to read all the data for all the records anyway
        (also check whether the method needs to be deleted
        rather than updated) - END
	- (1ii) if it sorts by field -> detect which recids have been
        modified since the last run of BibSort
	- (2) the data is adjusted by the washer method, if one is
        defined in the cfg file (dates normalized, leading
        articles removed, etc.)
	- (3) compute the records that need to be inserted, modified
        and deleted, with the help of the data dictionary
        (see I.2) that is loaded from the database
	- (3i) insert: new recids with their values are inserted into
        the data dictionary; binary search is used to compute
        the index of the new record in the sorted list of records;
        the new record then receives a weight based on this index
        and on the weights of its neighbours (e.g. recid3 needs to
        be inserted between recid1 with weight=4 and recid2 with
        weight=8 => recid3 will have weight=6); initially, the
        weights are assigned so that there are free weights between
        any two recids; when these gaps are filled, the weights are
        recomputed, to recreate the spaces between them (see the
        insertion sketch after this list)
	- (3ii) delete: the recids are deleted from all the data
        structures (except the buckets, since there is no actual
        need for that: the intersection with the search_engine
        output guarantees that the record will not be displayed)
	- (3iii) modify: a combination of insert and delete
	- (4) update the buckets to reflect the latest changes; here we
        take advantage of the maximum weight per bucket, i.e. the
        highest weight value in each bucket (I.6); since the buckets
        are not ordered (they are intbitsets, to take advantage of
        fast intersections), we add the new records based on their
        weight: the main idea is to compare the weight of a modified
        record with the per-bucket maximum weights and decide which
        bucket it should be added to or deleted from (see the
        bucket-selection sketch below)
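
    Here is a sketch of the insertion step (3i), reusing wash_value
    and SPACING from the rebalancing sketch in section I; insert_recid
    is again an illustrative name, not the real one. It finds the new
    record's position with Python's bisect module and takes the
    midpoint of the two neighbouring weights, returning None when the
    gap between them is exhausted and the weights must be recomputed
    first.

        import bisect

        def insert_recid(recid, value, washed, weights):
            # washed and weights are the dictionaries built by rebalance().
            new_value = wash_value(value)
            ordered = sorted(weights, key=weights.get)    # recids by weight
            values = [washed[r] for r in ordered]
            pos = bisect.bisect_left(values, new_value)   # binary search (3i)
            left = weights[ordered[pos - 1]] if pos > 0 else -SPACING
            if pos < len(ordered):
                right = weights[ordered[pos]]
            else:
                right = left + 2 * SPACING
            if right - left < 2:    # no free weight left between neighbours
                return None         # caller should rerun the rebalancing (I.)
            weights[recid] = (left + right) // 2  # e.g. between 4 and 8 -> 6
            washed[recid] = new_value
            return weights[recid]

    Re-sorting the whole weights dictionary on every call is only for
    brevity; an actual implementation would keep the sorted list
    between updates.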
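    Finally, a sketch of step (4), continuing from the sketch above:
    because the stored per-bucket maxima (I.6) form an ascending list,
    deciding which bucket a weight belongs to is another binary
    search; bucket_for_weight and its parameter names are
    illustrative.

        def bucket_for_weight(weight, max_weights):
            # max_weights: ascending list of the highest weight per bucket.
            b = bisect.bisect_left(max_weights, weight)
            return min(b, len(max_weights) - 1)  # heavier than all -> last

    With b = bucket_for_weight(weights[recid], max_weights), an insert
    becomes buckets[b].add(recid); for a modified record the old
    weight gives the bucket to discard from and the new weight the
    bucket to add to (plain deletes leave the buckets alone, as noted
    in 3ii).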