DocStore Architecture Design Issues

ARCHITECTURE

DOCSTORE ARCHITECTURE

Please see the OLE DocStore wiki for a more detailed description of the architecture design and of how data is modeled and organized in DocStore. Currently DocStore hosts the bib, instance, and license/agreement data.

COMMENTS ON DOCSTORE ARCHITECTURE

JOHN PILLANS thought we should not use Jackrabbit for storing Bib and Instance data, since we cannot fully use the features that Jackrabbit provides, and JCR performs poorly here. He suggested we could put Bib and Instance data into a database (blob field) instead of DocStore. The architecture would be much simpler, with much faster performance.

WOULD LIKE John to provide more detailed information about the IU library system architecture and a performance evaluation of the database and Solr.

COMMENTS ON UUID FOR BIB AND INSTANCE

JOHN PILLANS: The Library would like to keep a 16-digit identifier for new Bib and Instance records, not just a UUID. So, newly ingested records should have this 16-digit identifier generated automatically.
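As a rough sketch of what automatic generation could look like (the actual OLE numbering scheme, counter persistence, and class names below are assumptions, not anything specified on this page), a 16-digit identifier can be produced by zero-padding a monotonically increasing sequence:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: generate 16-digit identifiers for newly ingested
// Bib/Instance records by zero-padding an increasing sequence number.
// Real deployments would need to persist the counter and decide on
// number ranges; none of that is specified here.
public class LocalIdGenerator {

    private final AtomicLong sequence;

    public LocalIdGenerator(long start) {
        this.sequence = new AtomicLong(start);
    }

    /** Returns the next identifier as a 16-digit, zero-padded string. */
    public String nextId() {
        return String.format("%016d", sequence.incrementAndGet());
    }

    public static void main(String[] args) {
        LocalIdGenerator gen = new LocalIdGenerator(0);
        System.out.println(gen.nextId()); // 0000000000000001
    }
}
```

The record could then carry both this local 16-digit identifier and the UUID used internally by DocStore.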

More comments on the identifier

PERFORMANCE ISSUES

SLOW INGEST PERFORMANCE ON INSTANCE

Currently the development team is investigating the following:

  1. Apache Camel: whether OLE DocStore uses Camel in the right way
    1. Multi-threading in the Camel code: we are facing concurrency issues while updating DocStore nodes, so the feature is currently turned off during bulk ingest.
  2. Feeding docs to Solr: whether there is a more efficient way to feed the incoming records to Solr for indexing
    1. Pranitha has observed that converting POJOs to SolrInputDocuments takes very little time compared with converting XML into POJOs.
      The Solr commit operation has two flags, "waitFlush" and "waitSearcher". Turning these flags off can make the commit faster.
      Pranitha is looking into it.
  3. Where to keep the linkages
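On item 2, a commit with "waitFlush" and "waitSearcher" turned off returns to the ingest pipeline sooner because it does not block until the index is flushed to disk or a new searcher is opened. A minimal sketch of issuing such a commit through Solr's HTTP update handler (the base URL is a placeholder; with SolrJ the equivalent call is roughly server.commit(false, false)):

```java
// Sketch: building a Solr commit request with waitFlush/waitSearcher off.
// The URL below targets Solr's standard update handler; the host, port,
// and core name are placeholders for whatever the deployment uses.
public class SolrCommit {

    /** Builds the update-handler URL for a commit with the given flags. */
    static String commitUrl(String solrBase, boolean waitFlush, boolean waitSearcher) {
        return solrBase + "/update?commit=true"
                + "&waitFlush=" + waitFlush
                + "&waitSearcher=" + waitSearcher;
    }

    public static void main(String[] args) {
        // Issue this URL with an HTTP client after each bulk-ingest batch.
        System.out.println(commitUrl("http://localhost:8983/solr", false, false));
    }
}
```

The trade-off is that documents may not be immediately visible to searchers after the commit returns, which is usually acceptable during a bulk legacy load.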

CURRENT INGEST PERFORMANCE & REQUIREMENT

INGEST PERFORMANCE MINIMUM REQUIREMENT

From John Pillans: ingesting about 20 million legacy records (including bib, instance, etc.) needs to finish within one week!

CURRENT PERFORMANCE FOR INGESTING BIB DATA

Legacy data ingest:

Ingesting 6 million bib records took 60 hrs of processing time.

Incremental data ingest:

Measured on 2012-04-18 05:16:51,738 on DEV with batch size = 1000 (about 1.5 million records already ingested).

Bulk Ingest Process for 10,000 bib records:

Ingesting Time: 0:3:41.562 (H:M:S.ms)

Indexing Time: 0:1:10.201 (H:M:S.ms)

Total Process Time: 0:4:51.854 (H:M:S.ms)

CURRENT PERFORMANCE FOR INGESTING INSTANCE DATA

Legacy data ingest:

Ingesting 10 million instance records may take 42 days of processing time.

Detailed time breakdown for ingesting 1,000 instance records:

Ingesting Time: 3.33 minutes

Time for Linking to Bib records: 2.51 minutes

For a more detailed time breakdown, please see the spreadsheet.
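The per-1,000-record breakdown is consistent with the 42-day estimate; a quick cross-check (figures from this page):

```java
// Cross-check: does the per-1,000-record timing imply roughly the quoted
// 42 days for 10 million instance records? (Figures from this page.)
public class InstanceIngestEstimate {

    /** Days needed at the given per-batch time in minutes. */
    static double daysForRecords(long records, long batchSize, double minutesPerBatch) {
        double batches = (double) records / batchSize;
        return batches * minutesPerBatch / (60.0 * 24.0);
    }

    public static void main(String[] args) {
        // 3.33 min ingest + 2.51 min bib linking per 1,000 records
        double days = daysForRecords(10_000_000L, 1_000L, 3.33 + 2.51);
        System.out.printf("estimated: %.1f days%n", days); // ~40.6
    }
}
```

That works out to roughly 2.9 records/second, an order of magnitude below the ~33 records/second the one-week requirement implies, which is why instance ingest is the main performance concern.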

 

Operated as a Community Resource by the Open Library Foundation