Archive for the ‘Lucene’ Category

Using Lucene.Net with Microsoft Azure

Sunday, January 16th, 2011

Lucene indexes are usually stored on the file system and preferably on the local file system. In Azure there are additional types of storage with different capabilities, each with distinct benefits and drawbacks. The options for storing Lucene indexes in Azure are:

  • Azure CloudDrive
  • Azure Blob Storage

Azure CloudDrive

CloudDrive is the obvious solutions, as it is comparable to on premise file systems with mountable virtual hard drives (VHDs). CloudDrive is however not the optimal choice, as CloudDrive impose notable limitations. The most significant limitation is; only one web role, worker role or VM role can mount the CloudDrive at a time with read/write access. It is possible to mount multiple read-only snapshots of a CloudDrive, but you have to manage creation of new snapshots yourself depending on acceptable staleness of the Lucene indexes.

Azure Blob Storage

The alternative Lucene index storage solution is Blob Storage. Luckily a Lucene directory (Lucene index storage) implementation for Azure Blob Storage exists in the Azure library for Lucene.Net. It is called AzureDirectory and allows any role to modify the index, but only one role at a time. Furthermore each Lucene segment (See Lucene Index Segments) is stored in separate blobs, therefore utilizing many blobs at the same time. This allows the implementation to cache each segment locally and retrieve the blob from Blob Storage only when new segments are created. Consequently compound file format should not be used and optimization of the Lucene index is discouraged.

Code sample

Getting Lucene.Net up and running is simple, and using it with Azure library for Lucene.Net requires only the Lucene directory to be changes as highlighted below in Lucene index and search example. Most of it is Azure specific configuration pluming.

Lucene.Net.Util.Version version = Lucene.Net.Util.Version.LUCENE_29;

CloudStorageAccount.SetConfigurationSettingPublisher(
    (configName, configSetter) =>
        configSetter(RoleEnvironment
        .GetConfigurationSettingValue(configName)));

var cloudAccount = CloudStorageAccount
    .FromConfigurationSetting("LuceneBlobStorage");

var cacheDirectory = new RAMDirectory();

var indexName = "MyLuceneIndex";
var azureDirectory =
    new AzureDirectory(cloudAccount, indexName, cacheDirectory);

var analyzer = new StandardAnalyzer(version);

// Add content to the index
var indexWriter = new IndexWriter(azureDirectory, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);
indexWriter.SetUseCompoundFile(false);

foreach (var document in CreateDocuments())
{
    indexWriter.AddDocument(document);
}

indexWriter.Commit();
indexWriter.Close();

// Search for the content
var parser = new QueryParser(version, "text", analyzer);
Query q = parser.Parse("azure");

var searcher = new IndexSearcher(azureDirectory, true);

TopDocs hits = searcher.Search(q, null, 5, Sort.RELEVANCE);

foreach (ScoreDoc match in hits.scoreDocs)
{
    Document doc = searcher.Doc(match.doc);

    var id = doc.Get("id");
    var text = doc.Get("text");
}
searcher.Close();

Download the reference example which uses Azure SDK 1.3 and Lucene.Net 2.9 in a console application connecting either to Development Fabric or your Blob Storage account.

Lucene Index Segments (simplified)

Segments are the essential building block in Lucene. A Lucene index consists of one or more segments, each a standalone index. Segments are immutable and created when an IndexWriter flushes. Deletes or updates to an existing segment are therefore not removed stored in the original segment, but marked as deleted, and the new documents are stored in a new segment.

Optimizing an index reduces the number of segments, by creating a new segment with all the content and deleting the old ones.

Azure library for Lucene.Net facts

  • It is licensed under Ms-PL, so you do pretty much whatever you want to do with the code.
  • Based on Block Blobs (optimized for streaming) which is in tune with Lucene’s incremental indexing architecture (immutable segments) and the caching features of the AzureDirectory voids the need for random read/write of the Blob Storage.
  • Caches index segments locally in any Lucene directory (e.g. RAMDirectory) and by default in the volatile Local Storage.
  • Calling Optimize recreates the entire blob, because all Lucene segment combined into one segment. Consider not optimizing.
  • Do not use Lucene compound files, as index changes will recreate the entire blob. Also this stores the entire index in one blob (+metadata blobs).
  • Do use a VM role size (Small, Medium, Large or ExtraLarge) where the Local Resource size is larger than the Lucene index, as the Lucene segments are cached by default in Local Resource storage.

Azure CloudDrive facts

  • Only Fixed Size VHDs are supported.
  • Volatile Local Resources can be used to cache VHD content
  • Based on Page Blobs (optimized for random read/write).
  • Stores the entire VHS in one Page Blob and is therefore restricted to the Page Blob maximum limit of 1 TByte.
  • A role can mount up to 16 drives.
  • A CloudDrive can only be mounted to a single VM instance at a time for read/write access.
  • Snapshot CloudDrives are read-only and can be mounted as read-only drives by multiple different roles at the same time.

Additional Azure references

CNUG Lucene.Net presentation

Monday, January 10th, 2011

I have just held another presentation about Lucene.Net, this time in Copenhagen .Net user group. I hope everyone enjoyed the presentation and walked away with newfound knowledge how to implement full text search into their applications.

I love the presentations, like this one, where everyone participates in the discussion. It makes the experience so much enjoyable and everyone benefits of the collective knowledge sharing.

The presentation and code samples can be downloaded below:

I recommend the book “Lucene in Action” by Eric Hatcher. The samples in this book are all in Java, but they apply equally to Lucene.Net, as it is a 1:1 port of the Java implementation.

ANUG Solr/Lucene presentation

Wednesday, October 27th, 2010

Aarhus .NET user groupI am on the train to Copenhagen after a successful presentation of Solr/Lucene at the Aarhus .NET user group.

The presentation went very well judging by the number of questions during the almost 2½ hour long presentation and the feedback afterwards. Love it – thanks :-)

The presentation and code samples can be downloaded below:

Please do contact me if you have any further questions – I’ll love to help out.

Check for breaking changes in APIs

Tuesday, June 8th, 2010

Have you ever had the need to compare interfaces of two versions of the same framework?

If you have, then ApiChange is a tool for you. It’s open source, powerful and easy to use :-)

I gave it a spin comparing current trunk version 2.9.2 of Lucene.Net with the latest official release version 2.4.0.

I downloaded ApiChange and ran the following command in a command prompt:

ApiChange.exe -Diff -old C:\Lucene.Net_2_4_0\Lucene.Net.dll -new C:\trunk\Lucene.Net.dll

The output lists all the differences, but here is a summary:

  • 23 public types where removed
  • 96 public types where added
  • 158 public types where changed

Cool little tool with other features such as:

  • Diff public types for breaking changes.
  • Who uses a method?
  • Who uses a type?
  • Who uses implements an interface?
  • Who references me?
  • What format has the binary (32/64, Managed C++, Pure IL, Unmanaged)?
  • Search for all event subscribers and unsubscribers.

It’s based on Mono Cecil – a free IL parser, and not reflection as I initial thought. Go check it out…

Ageing pictogram

Wednesday, May 19th, 2010

I’m in Prague, Czech for the Apache Lucene EuroCon 2010; wandered around, where I saw this drawing on a house wall.

I find it hilarious – especially the natural shadow over the coffins. It’s just by pure coincidence that I was there, at the time of day where the doorway cast its shadow over the coffins :-)

Miracle Open World 2010 Lucene Presentation

Tuesday, April 20th, 2010

The conference is over and it was a great success. I meet a lot of new people and had lots of technical discussions about .Net, graph databases, freetext search, SQL Server, Oracle Service Bus, debugging with WinDbg and extensions.

The slides and demo code for my Lucene session is available here:

My session “Making freetext search with Lucene.Net work for you” abstract:

Lucene is an open source full-featured text search engine library, making searching in large amounts of text lightning fast. Lucene are in use by many large sites like Wikipedia, LinkedIn, MySpace etc.

It is easy to get started with Lucene, but there are many pitfalls… In this session you will learn about the do’s and don’t’s for indexing and searching, tools, scaling, new features in version 2.9 and some of the more advanced features.

This presentation will use the Microsoft .Net implementation of Lucene named Lucene.Net, but the content of this presentation applies for ported versions of Lucene.

Speaking about Lucene at Miracle Open World 2010

Sunday, April 4th, 2010

The conference Miracle Open World 2010 is soon upon us at Legoland (April 14.-16.) :-)

There will be four tracks this year: Oracle track, SQL Server track, .Net track and a workshop track.

The conference is legendary because time spend at the conference is divided between 80% technical stuff and 80% social networking. No kidding’ socializing is a big part of this conference with gala-dinner and the not-to-miss beach party at Lalandia Aquadome (including drinks).

This year I only have one session where I’ll be presenting Lucene.Net.

Session abstract:

Lucene is an open source full-featured text search engine library, making searching in large amounts of text lightning fast. Lucene are in use by many large sites like Wikipedia, LinkedIn, MySpace etc.

It is easy to get started with Lucene, but there are many pitfalls… In this session you will learn about the do’s and don’t's for indexing and searching, tools, scaling, new features in version 2.9 and some of the more advanced features.

This presentation will use the Microsoft .Net implementation of Lucene named Lucene.Net, but the content of this presentation applies for ported versions of Lucene.

At the time of writing, 207 participants have registered for the conference. You can still register – it’s not too late.

See more at the Miracle Open World 2010 site.

Lucene.Net and Transactions

Thursday, December 3rd, 2009

Lucene Search Engine Logo

Lucene.Net is an open source full text search engine library (a port from Java). It is stable and works like a charm – I’ve been using Lucene.Net for a couple of years now and implement a handful of solutions. Lucene is awesome.

If you want to try working with Lucene.Net, then the DimeCast.Net crew has recently made two short webcasts introducing Lucene.Net.

.Net 2.0 made it simple to use transactions with the System.Transactions namespace. Two of the great features are automatic elevation to distributed transactions (and utilize the Distributed Transaction Coordinator) and the other is the simplicity of creating your own transactional resource managers.

The .Net Framework defines a resource manager as a resource that can automatically enlist in a transaction managed by System.Transactions – which means that any object that implements any of the following interfaces can enlist in a transaction:

  • IEnlistmentNotification for the two-phase-commit protocol
  • IPromotableSinglePhaseNotification for the single-phase-commit protocol (non-distributed transactions)

To implement a resource manager for the Lucene.Net IndexWriter, and therefore make it transactional, all you have to do is the following:

public class TransactionalIndexWriter : IndexWriter, IEnlistmentNotification
{
    #region ctor
    public TransactionalIndexWriter(Directory d, Analyzer a, bool create, MaxFieldLength mfl)
        : base(d, a, create, mfl)
    {
        EnlistTransaction();
    }
    /* More constructors */
    #endregion

    public void EnlistTransaction()
    {
        // Enlist in transaction if ambient transaction exists
        Transaction tx = Transaction.Current;
        if (tx != null)
            tx.EnlistVolatile(this, EnlistmentOptions.None);
    }

    #region IEnlistmentNotification Members
    public void Commit(Enlistment enlistment)
    {
        base.Commit();
        enlistment.Done();
    }

    public void InDoubt(Enlistment enlistment)
    {
        // Do nothing.
        enlistment.Done();
    }

    public void Prepare(PreparingEnlistment preparingEnlistment)
    {
        base.PrepareCommit();
        preparingEnlistment.Prepared();
    }

    public void Rollback(Enlistment enlistment)
    {
        base.Rollback();
        enlistment.Done();
    }
    #endregion
}

You can use it like so:

IndexWriter indexWriter = null;
TransactionScope tx = null;

try
{
    tx = new TransactionScope();
    indexWriter = new TransactionalIndexWriter(...);

    // Perform transactional work
    indexWriter.AddDocument(new Document());
    indexWriter.AddDocument(new Document());
    indexWriter.AddDocument(new Document());

    // Connect to Database, MSMQ etc. to elevate to a distributed transaction

    // Commit transaction
    tx.Complete();
}
finally
{
    if (tx != null)
        tx.Dispose();

    if (indexWriter != null)
        indexWriter.Close();
}

Fairly simply uh? Just remember to instantiate the TransactionalIndexWriter or call the public method EnlistTransaction within the scope of an ambient transaction.
You might consider implementing IDisposable for TransactionalIndexWriter so you can take advantage of the using statement.

I will leave it to the reader to implement a TransactionalIndexReader.

Lucene.Net is an open source full text search engine library (a port from Java). It is stable and works like a charm – I’ve been using Lucene.Net for a couple of years now and implement a handful of solutions. Lucene is awesome.

If you want to try working with Lucene.Net, then the DimeCast.Net crew has recently made two 10 short webcast introducing Lucene.Net (http://dimecasts.net/Casts/ByTag/Lucene).