For the last couple of weeks I have been working with Drupal and Apache Solr. The Apache Solr Search Integration module does a very good job of integrating Apache Solr with Drupal search. It provides a number of hooks to modify the Solr search index, alter a document before it is indexed, alter query parameters, alter results before they are displayed on the results page, and more. See also my other posts on adding a custom field to the Apache Solr index and adding a custom sort to the Apache Solr query parameters.
My requirement was to search within Drupal content as well as within any document attached to a node. If a user searches for a term that exists in the attached document but not in the node content itself, the search should still return that node.
There are existing modules for this: apachesolr_file and apachesolr_attachments. The apachesolr_attachments module came quite close to my requirement. Apache Solr creates one document in the index for each Drupal node, but this module creates a separate document for each file attached to a node. So when I searched for a keyword, it returned the attached file as a separate document rather than the node the file is attached to. I therefore had to work around that module to index the attached file's content with the node itself, not as a separate document.
Approach:
Before Solr indexes the Drupal content, I simply alter the document being indexed. I check whether the node being indexed has a file attached to it. If a file exists, I add a new custom field to the Solr index, grab the text of the attached file using the apachesolr_attachments module’s apachesolr_attachments_get_attachment_text() function, and store that text in the custom field. The apachesolr_attachments module uses the Tika library to extract the text from the attached document.
I created a custom module that implements hook_apachesolr_index_document_build() and hook_apachesolr_query_alter(). Both hooks are provided by the Apache Solr Search Integration module.
Let's say your node has a custom file field that holds the file (doc, pdf, etc.). In my case this field is field_download_file.
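<?php
/**
 * Implements hook_apachesolr_index_document_build().
 */
function apache_solr_attachments_custom_apachesolr_index_document_build(ApacheSolrDocument $document, $entity, $entity_type, $env_id) {
  // Only act on nodes that actually have a file in field_download_file.
  if ($entity_type == 'node' && !empty($entity->field_download_file[LANGUAGE_NONE][0])) {
    module_load_include('inc', 'apachesolr_attachments', 'apachesolr_attachments.index');
    // Extract the text of the first attached file.
    $text = apachesolr_attachments_get_attachment_text((object) $entity->field_download_file[LANGUAGE_NONE][0]);
    // Index the extracted text on the node's own document, with a boost of 3.
    $document->addField('ts_attachment_text', $text, 3);
  }
}

/**
 * Implements hook_apachesolr_query_alter().
 */
function apache_solr_attachments_custom_apachesolr_query_alter($query) {
  // Add the attachment text field to the fields Solr searches.
  $query->addParam('qf', 'ts_attachment_text');
}
?>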
To use the apachesolr_attachments_get_attachment_text() function you have to install and enable the apachesolr_attachments module.
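The hook above reads only the first file in field_download_file. If your field allows multiple files, one option is to gather the text of every attached file before adding it to the document. Here is a minimal sketch of that idea. The helper name collect_attachment_text() is my own (hypothetical, not part of any module), and the extractor is passed in as a callback so the helper can be tried outside Drupal with a stub.

```php
<?php
/**
 * Collects the extracted text of all file items of a field.
 *
 * Hypothetical helper for illustration; not provided by apachesolr or
 * apachesolr_attachments.
 *
 * @param array $items
 *   File field items, e.g. $entity->field_download_file[LANGUAGE_NONE].
 * @param callable $extract
 *   Callback that receives a file object and returns its text. Inside
 *   Drupal this would be 'apachesolr_attachments_get_attachment_text'.
 *
 * @return string
 *   The text of all files, joined with single spaces.
 */
function collect_attachment_text(array $items, $extract) {
  $chunks = array();
  foreach ($items as $item) {
    // File field items are arrays; the extractor expects an object.
    $text = trim((string) call_user_func($extract, (object) $item));
    // Skip files for which no text could be extracted.
    if ($text !== '') {
      $chunks[] = $text;
    }
  }
  return implode(' ', $chunks);
}
```

Inside the index hook, the single-file call could then become collect_attachment_text($entity->field_download_file[LANGUAGE_NONE], 'apachesolr_attachments_get_attachment_text'), with the result passed to addField() as before. I have not measured how this behaves with many large files, so treat it as a starting point.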