Posts Tagged ‘.Net’

Using Lucene.Net with Microsoft Azure

Sunday, January 16th, 2011

Lucene indexes are usually stored on the file system and preferably on the local file system. In Azure there are additional types of storage with different capabilities, each with distinct benefits and drawbacks. The options for storing Lucene indexes in Azure are:

  • Azure CloudDrive
  • Azure Blob Storage

Azure CloudDrive

CloudDrive is the obvious solution, as it is comparable to an on-premises file system with mountable virtual hard drives (VHDs). It is, however, not the optimal choice, as CloudDrive imposes notable limitations. The most significant one is that only one web role, worker role or VM role can mount the CloudDrive at a time with read/write access. It is possible to mount multiple read-only snapshots of a CloudDrive, but you have to manage the creation of new snapshots yourself, depending on the acceptable staleness of the Lucene indexes.

Azure Blob Storage

The alternative Lucene index storage solution is Blob Storage. Luckily, a Lucene directory (Lucene index storage) implementation for Azure Blob Storage exists in the Azure library for Lucene.Net. It is called AzureDirectory and allows any role to modify the index, but only one role at a time. Furthermore, each Lucene segment (see Lucene Index Segments) is stored in separate blobs, so the index is spread over many blobs. This allows the implementation to cache each segment locally and retrieve blobs from Blob Storage only when new segments are created. Consequently, the compound file format should not be used and optimization of the Lucene index is discouraged.

Code sample

Getting Lucene.Net up and running is simple, and using it with the Azure library for Lucene.Net requires only the Lucene directory to be changed to an AzureDirectory, as in the Lucene index and search example below. Most of the rest is Azure-specific configuration plumbing.

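// Namespaces used below (Lucene.Net 2.9, Azure SDK 1.3 and the
// Azure library for Lucene.Net)
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Store.Azure;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;

Lucene.Net.Util.Version version = Lucene.Net.Util.Version.LUCENE_29;

// Let CloudStorageAccount resolve settings from the role configuration
CloudStorageAccount.SetConfigurationSettingPublisher(
    (configName, configSetter) =>
        configSetter(RoleEnvironment
        .GetConfigurationSettingValue(configName)));

var cloudAccount = CloudStorageAccount
    .FromConfigurationSetting("LuceneBlobStorage");

// Local cache for the index segments
var cacheDirectory = new RAMDirectory();

// The only Lucene-related change: use AzureDirectory as the directory
var indexName = "MyLuceneIndex";
var azureDirectory =
    new AzureDirectory(cloudAccount, indexName, cacheDirectory);

var analyzer = new StandardAnalyzer(version);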

// Add content to the index
var indexWriter = new IndexWriter(azureDirectory, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);
indexWriter.SetUseCompoundFile(false);

foreach (var document in CreateDocuments())
{
    indexWriter.AddDocument(document);
}

indexWriter.Commit();
indexWriter.Close();

// Search for the content
var parser = new QueryParser(version, "text", analyzer);
Query q = parser.Parse("azure");

var searcher = new IndexSearcher(azureDirectory, true);

TopDocs hits = searcher.Search(q, null, 5, Sort.RELEVANCE);

foreach (ScoreDoc match in hits.scoreDocs)
{
    Document doc = searcher.Doc(match.doc);

    var id = doc.Get("id");
    var text = doc.Get("text");
}
searcher.Close();

Download the reference example which uses Azure SDK 1.3 and Lucene.Net 2.9 in a console application connecting either to Development Fabric or your Blob Storage account.
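
The sample above calls CreateDocuments() without showing it. A minimal sketch of such a helper could look like the following. Note that the class name SampleDocuments and the sample texts are my own assumptions for illustration; only the field names "id" and "text" come from the search code above.

```csharp
using System.Collections.Generic;
using Lucene.Net.Documents;

// Hypothetical helper: the original sample does not show CreateDocuments().
public static class SampleDocuments
{
    public static IEnumerable<Document> CreateDocuments()
    {
        // Made-up sample content, one entry per document
        var texts = new[]
        {
            "Lucene.Net running on Azure",
            "Azure Blob Storage holds the index segments",
            "CloudDrive is based on Page Blobs"
        };

        for (int i = 0; i < texts.Length; i++)
        {
            var doc = new Document();

            // "id" is stored and indexed as a single term (not analyzed)
            doc.Add(new Field("id", (i + 1).ToString(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));

            // "text" is stored and analyzed, so it is full-text searchable
            doc.Add(new Field("text", texts[i],
                Field.Store.YES, Field.Index.ANALYZED));

            yield return doc;
        }
    }
}
```

Both fields are stored (Field.Store.YES) because the search loop reads them back with doc.Get.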

Lucene Index Segments (simplified)

Segments are the essential building block in Lucene. A Lucene index consists of one or more segments, each a standalone index. Segments are immutable and created when an IndexWriter flushes. Deleted or updated documents are therefore not removed from the original segment, but only marked as deleted, and the new versions of updated documents are stored in a new segment.

Optimizing an index reduces the number of segments, by creating a new segment with all the content and deleting the old ones.
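
To make the segment life cycle more concrete, here is a toy model in C#. It is not Lucene.Net code: ToySegment and ToyIndex are made-up names, and a real segment holds an inverted index rather than a dictionary. The sketch only mirrors the rules described above: flushing creates a segment, updates mark the old copy as deleted, and optimizing rewrites everything into one segment.

```csharp
using System.Collections.Generic;
using System.Linq;

// Toy model only. ToySegment and ToyIndex are hypothetical, not Lucene.Net types.
public sealed class ToySegment
{
    // Document data is written once when the segment is created
    private readonly Dictionary<string, string> _docs;
    // Deletion marks are kept beside the data, the data itself is never rewritten
    private readonly HashSet<string> _deleted = new HashSet<string>();

    public ToySegment(IDictionary<string, string> docs)
    {
        _docs = new Dictionary<string, string>(docs);
    }

    public int StoredCount { get { return _docs.Count; } }

    public IEnumerable<KeyValuePair<string, string>> LiveDocs
    {
        get { return _docs.Where(d => !_deleted.Contains(d.Key)); }
    }

    public void MarkDeleted(string id)
    {
        if (_docs.ContainsKey(id)) _deleted.Add(id);
    }
}

public sealed class ToyIndex
{
    private List<ToySegment> _segments = new List<ToySegment>();
    private readonly Dictionary<string, string> _buffer =
        new Dictionary<string, string>();

    public int SegmentCount { get { return _segments.Count; } }
    public int StoredCount { get { return _segments.Sum(s => s.StoredCount); } }
    public int LiveCount { get { return _segments.Sum(s => s.LiveDocs.Count()); } }

    // An update is a delete of the old copy plus an add of the new one
    public void AddOrUpdate(string id, string text)
    {
        foreach (var segment in _segments) segment.MarkDeleted(id);
        _buffer[id] = text;
    }

    // Flushing turns the in-memory buffer into a new segment
    public void Flush()
    {
        if (_buffer.Count == 0) return;
        _segments.Add(new ToySegment(_buffer));
        _buffer.Clear();
    }

    // Optimizing copies all live documents into one new segment
    // and drops the old segments
    public void Optimize()
    {
        Flush();
        var live = _segments.SelectMany(s => s.LiveDocs)
                            .ToDictionary(d => d.Key, d => d.Value);
        _segments = new List<ToySegment> { new ToySegment(live) };
    }
}
```

Adding two documents and flushing gives one segment. Updating one of them and flushing again gives two segments holding three stored documents, of which two are live. Calling Optimize leaves a single segment with the two live documents.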

Azure library for Lucene.Net facts

  • It is licensed under Ms-PL, so you can do pretty much whatever you want with the code.
  • Based on Block Blobs (optimized for streaming), which is in tune with Lucene’s incremental indexing architecture (immutable segments). The caching features of AzureDirectory remove the need for random read/write access to Blob Storage.
  • Caches index segments locally in any Lucene directory (e.g. RAMDirectory) and by default in the volatile Local Storage.
  • Calling Optimize recreates the entire blob, because all Lucene segments are combined into one segment. Consider not optimizing.
  • Do not use Lucene compound files, as index changes will recreate the entire blob. Compound files also store the entire index in one blob (plus metadata blobs).
  • Do use a VM role size (Small, Medium, Large or ExtraLarge) where the Local Resource size is larger than the Lucene index, as the Lucene segments are cached by default in Local Resource storage.
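
The advice about Optimize and compound files comes down to how much data a role has to fetch again after an index change. The sketch below illustrates that reasoning under a simplifying assumption of mine: a blob is downloaded whenever its name is missing from the local cache. This is not how AzureDirectory is implemented internally, and SegmentCacheSketch is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustration only: assumes the cache is keyed purely by blob name.
public static class SegmentCacheSketch
{
    // Sums the size of every blob whose name is not in the local cache yet
    public static long BytesToDownload(
        ICollection<string> cachedNames,
        IDictionary<string, long> blobSizes)
    {
        return blobSizes
            .Where(blob => !cachedNames.Contains(blob.Key))
            .Sum(blob => blob.Value);
    }
}
```

With made-up numbers: if segments _0 (500 MB) and _1 (20 MB) are cached and a flush adds a 1 MB segment _2, only that 1 MB is fetched. After Optimize the index is a single new 521 MB segment, so every role has to download all 521 MB again.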

Azure CloudDrive facts

  • Only Fixed Size VHDs are supported.
  • Volatile Local Resources can be used to cache VHD content.
  • Based on Page Blobs (optimized for random read/write).
  • Stores the entire VHD in one Page Blob and is therefore restricted to the Page Blob maximum size of 1 TB.
  • A role can mount up to 16 drives.
  • A CloudDrive can only be mounted to a single VM instance at a time for read/write access.
  • Snapshot CloudDrives are read-only and can be mounted as read-only drives by multiple different roles at the same time.

CNUG Lucene.Net presentation

Monday, January 10th, 2011

I have just given another presentation about Lucene.Net, this time at the Copenhagen .Net user group. I hope everyone enjoyed the presentation and walked away with newfound knowledge of how to implement full-text search in their applications.

I love presentations like this one, where everyone participates in the discussion. It makes the experience so much more enjoyable, and everyone benefits from the collective knowledge sharing.

The presentation and code samples can be downloaded below:

I recommend the book “Lucene in Action” by Erik Hatcher. The samples in this book are all in Java, but they apply equally to Lucene.Net, as it is a 1:1 port of the Java implementation.

Microsoft Julekalender (Christmas calendar) door #7

Tuesday, December 7th, 2010

This post was originally written in Danish.

Today's challenge is about Windows Communication Foundation. WCF is complex because of the sheer amount of functionality and can therefore seem complicated. The complexity is also reflected in the size of the WCF assembly System.ServiceModel.dll, which is clearly the largest assembly in the entire .Net Framework Class Library (FCL) … even larger than mscorlib.dll.

The challenge:

Implement a client for the service below, which uses WSHttpBinding with default settings.

[ServiceContract(Namespace = "www.lybecker.com/blog/wcfriddle")]
public interface IMyService
{
    [OperationContract(ProtectionLevel =
        ProtectionLevel.EncryptAndSign)]
    string LooongRunningMethod(string name);
}

public class MyService : IMyService
{
    public string LooongRunningMethod(string name)
    {
        Console.WriteLine("{0} entered.", name);

        // Simulate work by random sleeping
        var rnd = new Random(
            name.Select(Convert.ToInt32).Sum() +
            Environment.TickCount);
        var sleepSeconds = rnd.Next(0, 100);
        System.Threading.Thread.Sleep(sleepSeconds * 1000);

        var message = string.Format(
            "{0} slept for {1} seconds in session {2}.",
            name,
            sleepSeconds,
            OperationContext.Current.SessionId);
        Console.WriteLine(message);

        return message;
    }
}

The client should preferably be beautifully structured and must:

  • Be implemented in .Net 3.x or .Net 4.0
  • Simulate a dozen different clients
  • Be as efficient as possible (think memory, CPU cycles, GC)

Briefly describe your choice of optimizations.

To make the challenge easier to solve, I have already solved it for you… though not optimally. Download my implementation.

Send your solution to anders at lybecker.com before midnight; the winner will be announced tomorrow and will become the happy owner of a remote-controlled helicopter with accessories, ready to fly. A cool office gadget. The helicopter is easy to fly and can take quite a beating. I know that from experience :-)

See the helicopter fly below.

ANUG Solr/Lucene presentation

Wednesday, October 27th, 2010

I am on the train to Copenhagen after a successful presentation of Solr/Lucene at the Aarhus .NET user group.

The presentation went very well judging by the number of questions during the almost 2½ hour long presentation and the feedback afterwards. Love it – thanks :-)

The presentation and code samples can be downloaded below:

Please do contact me if you have any further questions – I’d love to help out.

Java 4-ever

Sunday, July 4th, 2010

I find this video hilarious…

You should use the best tools at hand to solve the problem. That said, choosing between Java and .Net doesn’t really matter in most cases. There are, however, some areas where Java is the better choice and vice versa.

I can’t wait to see it in the cinema :-)

PS. I do develop with Java even though I do not blog much about it.

Update: YouTube removed the video due to copyright claims. You can still see it at JavaZone.

Removing SVN folders with PowerShell

Saturday, April 24th, 2010


I needed to remove the .svn folders from an existing Visual Studio solution that a customer had emailed me, so I could commit it to another SVN repository.

If I had access to the original SVN repository, I could have used the export function, as it does not include the .svn folders – but no, it should not be that easy.

What the heck, I have put off getting started with PowerShell for way too long. It should be a familiar environment, as it is object-oriented with a C#-like syntax and full access to the .Net Framework Base Class Libraries (BCL).

Here it goes – my first PowerShell script…

function RemoveSvnFolders([string]$path)
{
    Write-Host "Removing .svn folders in path $path recursively"

    Get-ChildItem $path -Include ".svn" -Force -Recurse |
        Where {$_.psIsContainer -eq $true} |
        Foreach {
            # The script block must start on the same line as Foreach
            Remove-Item $_.FullName -Force -Recurse
        }
}

The Write-Host Cmdlet just writes the content to the console window.

If you are a PowerShell novice like me, start with the Getting Started with Windows PowerShell article and use the free tool PowerGUI from Quest Software. It’s a PowerShell IDE with an integrated syntax-highlighting editor and debugger.

In line 5 the Get-ChildItem Cmdlet iterates the path recursively and filters the result to include only “.svn” files and folders. The Force parameter allows the Cmdlet to get items that cannot otherwise be accessed by the user, such as hidden or system files. The Get-ChildItem Cmdlet can also iterate the registry.

Afterwards the result from the Get-ChildItem Cmdlet is piped to the Where-Object Cmdlet (Where is an alias for Where-Object). PSIsContainer is a property that is true for folders, so only items where it equals true are passed on to the next stage of the pipeline. I could have written the following instead:

Where {$_.mode -match "d"}

Use the below statement to list all properties for the files and folders in the current folder:

Get-ChildItem | format-list -property *

Foreach (an alias for the ForEach-Object Cmdlet) iterates over every item and deletes the folder with the Remove-Item Cmdlet.

Calling the function is as simple as the line below. Note that PowerShell passes arguments separated by spaces, without parentheses:

RemoveSvnFolders "c:\svn\My Solution"
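
Since PowerShell sits on top of the .Net BCL, the same clean-up can also be written directly in C#. The following is a rough sketch for comparison, not part of the original script; the class name SvnCleaner is hypothetical. Clearing the file attributes before deleting is my approximation of what the Force parameter does for read-only files.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical C# counterpart to the RemoveSvnFolders PowerShell function.
public static class SvnCleaner
{
    // Deletes every folder named ".svn" below root and returns
    // the number of folders removed.
    public static int RemoveSvnFolders(string root)
    {
        // Materialize the matches first so nothing is deleted
        // while the directory tree is still being enumerated
        List<string> svnFolders = Directory
            .EnumerateDirectories(root, ".svn", SearchOption.AllDirectories)
            .ToList();

        int removed = 0;
        foreach (string folder in svnFolders)
        {
            // Skip folders already removed together with a parent
            if (!Directory.Exists(folder)) continue;

            // Clear read-only flags so the delete does not fail on them
            foreach (string file in Directory.EnumerateFiles(
                folder, "*", SearchOption.AllDirectories))
            {
                File.SetAttributes(file, FileAttributes.Normal);
            }

            Directory.Delete(folder, true);
            removed++;
        }
        return removed;
    }
}
```

It would be called as SvnCleaner.RemoveSvnFolders(@"c:\svn\My Solution"). The PowerShell version is clearly shorter, which is a fair argument for using the pipeline for this kind of task.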

On TechNet there is a myriad of articles, starting from the root Windows PowerShell Core and including more task-oriented ones like A Task-Based Guide to Windows PowerShell Cmdlets and Piping and the Pipeline in Windows PowerShell.

Remove SVN folders PowerShell Script.

Happy PowerShelling… :-)