Lucene MoreLikeThis example in Java

Lucene is an open source search engine written in Java and ported to C#.
The “MoreLikeThis” query quickly finds similar entries in the index.
Lucene can also find similar documents by comparing a string against its indexed fields.
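
To get a feel for what happens behind the query, here is a rough sketch in plain Java, with no Lucene involved. It assumes the usual description of MoreLikeThis: take the terms of the input text, drop those that are too rare in the text or in the index, rank the rest with a tf-idf style weight, and search for the best ones. The class `MltTermSketch`, its methods, the simplified idf formula and the hand-written document frequencies are all made up for this illustration; Lucene's real implementation differs in the details.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MltTermSketch {

    /** Counts how often each lowercase, whitespace-separated token occurs in the text. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /**
     * Returns the terms of the text that reach both thresholds, best first.
     * docFreq maps a term to the number of documents containing it;
     * numDocs is the total number of documents.
     */
    static List<String> interestingTerms(String text, Map<String, Integer> docFreq,
                                         int numDocs, int minTermFreq, int minDocFreq) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFrequencies(text).entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (e.getValue() < minTermFreq || df < minDocFreq || df == 0) {
                continue; // too rare in the text, or too rare (or absent) in the index
            }
            double idf = Math.log((double) numDocs / df) + 1.0; // simplified idf (assumption)
            scores.put(e.getKey(), e.getValue() * idf);
        }
        List<String> terms = new ArrayList<>(scores.keySet());
        // Highest score first; ties are broken alphabetically to keep the order stable
        terms.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return terms;
    }

    /** Hypothetical document frequencies for four small documents (title and content merged). */
    static Map<String, Integer> exampleDocFreq() {
        Map<String, Integer> df = new HashMap<>();
        df.put("doduck", 2);
        df.put("prototype", 2);
        df.put("love", 2);
        df.put("we", 2);
        df.put("your", 1);
        df.put("idea", 1);
        df.put("programming", 1);
        df.put("do", 1);
        df.put("challenge", 1);
        return df;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = exampleDocFreq();
        System.out.println(interestingTerms("doduck prototype", df, 4, 1, 1)); // [doduck, prototype]
        System.out.println(interestingTerms("doduck prototype", df, 4, 2, 5)); // []
    }
}
```

With both thresholds at 1 the two words survive. With 2 and 5, which as far as I know are Lucene's default minimum term and document frequencies, nothing is left. That is why the complete example at the end of this post calls `setMinTermFreq(1)` and `setMinDocFreq(1)`.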

Lucene MoreLikeThis from the indexed fields

On Stack Overflow, similar questions are suggested while you are writing a question:

Real-world MoreLikeThis example on Stack Overflow

For this prototype, our Lucene index is made of the fields “id”, “title” and “content”.
Let's search for the documents whose “title” and “content” fields are similar to the string “doduck prototype idea”.

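IndexReader reader = [...];
Analyzer analyzer = [...];
IndexSearcher indexSearcher = new IndexSearcher(reader);

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[]{"title", "content"});
mlt.setAnalyzer(analyzer);
// The input is short: also keep terms that occur only once in the text or in a single document
mlt.setMinTermFreq(1);
mlt.setMinDocFreq(1);

// Named sReader so that it does not clash with the IndexReader declared above
Reader sReader = new StringReader("doduck prototype idea");
Query query = mlt.like(sReader, null);

TopDocs topDocs = indexSearcher.search(query, 5);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document aSimilarDocument = indexSearcher.doc(scoreDoc.doc);
    // The most similar documents come first
}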

Lucene MoreLikeThis from a document ID

Often the document is already in the Lucene index. This is the case when the user is currently reading the document in question.
Lucene can search for similar documents starting from a document ID.

This is also the case on Stack Overflow:

Real-world MoreLikeThis example from a document ID

IndexReader reader = [...];
Analyzer analyzer = [...];
int currentDocId = [...]; // Lucene's internal number of the document being read
IndexSearcher indexSearcher = new IndexSearcher(reader);

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[]{"title", "content"});
mlt.setAnalyzer(analyzer);

Query query = mlt.like(currentDocId);

TopDocs topDocs = indexSearcher.search(query, 5);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document aSimilarDocument = indexSearcher.doc(scoreDoc.doc);
    // The most similar documents come first
}
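
The lookup by document ID can be pictured with a toy model in plain Java. This is only a sketch under simplifying assumptions: the class `SimilarByIdSketch`, its in-memory "index" and the scoring by number of shared terms are hypothetical and are not how Lucene ranks documents. The point it illustrates is practical: the document being read matches its own terms, so it generally has to be skipped when listing similar documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SimilarByIdSketch {

    // Toy stand-in for an index: document number -> set of lowercase terms
    private final Map<Integer, Set<String>> termsByDoc = new LinkedHashMap<>();

    void add(int docId, String text) {
        termsByDoc.put(docId, new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+"))));
    }

    /** Returns the documents sharing the most terms with docId, best first, without docId itself. */
    List<Integer> similarTo(int docId, int maxHits) {
        Set<String> source = termsByDoc.get(docId);
        if (source == null) {
            return new ArrayList<>(); // unknown document: nothing to compare with
        }
        Map<Integer, Integer> overlap = new HashMap<>();
        for (Map.Entry<Integer, Set<String>> e : termsByDoc.entrySet()) {
            if (e.getKey() == docId) {
                continue; // skip the document the user is currently reading
            }
            Set<String> shared = new HashSet<>(e.getValue());
            shared.retainAll(source);
            if (!shared.isEmpty()) {
                overlap.put(e.getKey(), shared.size());
            }
        }
        List<Integer> hits = new ArrayList<>(overlap.keySet());
        // Most shared terms first; ties are broken by document number
        hits.sort((a, b) -> {
            int byOverlap = Integer.compare(overlap.get(b), overlap.get(a));
            return byOverlap != 0 ? byOverlap : Integer.compare(a, b);
        });
        return new ArrayList<>(hits.subList(0, Math.min(maxHits, hits.size())));
    }

    /** Hypothetical four-document index, title and content merged into one text. */
    static SimilarByIdSketch exampleIndex() {
        SimilarByIdSketch index = new SimilarByIdSketch();
        index.add(0, "doduck prototype your idea");
        index.add(1, "doduck love programming");
        index.add(2, "we do prototype");
        index.add(3, "we love challenge");
        return index;
    }

    public static void main(String[] args) {
        System.out.println(exampleIndex().similarTo(0, 5)); // [1, 2]
    }
}
```

Document 0 shares "doduck" with document 1 and "prototype" with document 2, and nothing with document 3, so only the first two are returned.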

Complete Lucene MoreLikeThis example

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class Main {

	public static void main(String[] args) throws IOException {
		Main m = new Main();
		m.init();
		m.writeEntries();
		m.findSimilar("doduck prototype");
	}

	private Directory indexDir;
	private StandardAnalyzer analyzer;
	private IndexWriterConfig config;

	public void init() throws IOException {
		analyzer = new StandardAnalyzer(Version.LUCENE_42);
		config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
		config.setOpenMode(OpenMode.CREATE_OR_APPEND);

		indexDir = new RAMDirectory(); // in-memory index, nothing is written to disk
		//indexDir = FSDirectory.open(new File("/Path/to/luceneIndex/")); // write to disk
	}

	public void writeEntries() throws IOException {
		IndexWriter indexWriter = new IndexWriter(indexDir, config);
		indexWriter.commit();

		Document doc1 = createDocument("1", "doduck", "prototype your idea");
		Document doc2 = createDocument("2", "doduck", "love programming");
		Document doc3 = createDocument("3", "We do", "prototype");
		Document doc4 = createDocument("4", "We love", "challenge");
		indexWriter.addDocument(doc1);
		indexWriter.addDocument(doc2);
		indexWriter.addDocument(doc3);
		indexWriter.addDocument(doc4);

		indexWriter.commit();
		indexWriter.forceMerge(100, true);
		indexWriter.close();
	}

	private Document createDocument(String id, String title, String content) {
		FieldType type = new FieldType();
		type.setIndexed(true);
		type.setStored(true);
		// Term vectors let MoreLikeThis read a document's terms when searching from a document ID
		type.setStoreTermVectors(true);

		Document doc = new Document();
		doc.add(new StringField("id", id, Store.YES));
		doc.add(new Field("title", title, type));
		doc.add(new Field("content", content, type));
		return doc;
	}

	private void findSimilar(String searchForSimilar) throws IOException {
		IndexReader reader = DirectoryReader.open(indexDir);
		IndexSearcher indexSearcher = new IndexSearcher(reader);

		MoreLikeThis mlt = new MoreLikeThis(reader);
		// The texts are tiny: keep terms that occur once in the input or in a single document
		mlt.setMinTermFreq(1);
		mlt.setMinDocFreq(1);
		mlt.setFieldNames(new String[]{"title", "content"});
		mlt.setAnalyzer(analyzer);

		Reader sReader = new StringReader(searchForSimilar);
		Query query = mlt.like(sReader, null);

		TopDocs topDocs = indexSearcher.search(query, 10);

		// Hits are sorted by score, the most similar document first
		for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
			Document aSimilar = indexSearcher.doc(scoreDoc.doc);
			String similarTitle = aSimilar.get("title");
			String similarContent = aSimilar.get("content");

			System.out.println("====similar found====");
			System.out.println("title: " + similarTitle);
			System.out.println("content: " + similarContent);
		}

		reader.close();
	}
}

Here is the result:

====similar found====
title: doduck
content: prototype your idea
====similar found====
title: doduck
content: love programming
====similar found====
title: We do
content: prototype

The source code for this Lucene MoreLikeThis example is available on GitHub.
