Glen Mazza's Weblog

https://glenmazza.net/blog/date/20181027 Saturday October 27, 2018

ElasticSearch Notes: Complex Date Filtering, Bulk Updates

Some things learned this past week with ElasticSearch:

Advanced Date Searches: A event search page my company provides for its Pro customers allows for filtering by start date and end date, however some events do not have an end date defined. We decided to have differing business rules on what the start and end dates will filter based on whether or not the event has an end date, specifically:

  • If an event has both start and end dates:
    1. The start date of the range filter, if provided, must be before the end date of the event
    2. The end date of the range filter, if provided, must be after the start date of the event
  • If an event does not have an end date:
    1. The start date of the range filter, if provided, must be before the start date of the event
    2. The end date of the range filter, if provided, must be after the start date of the event

The above business logic had to be implemented in Java but as an intermediate step I first worked out an ElasticSearch query out of it using Kibana. Creating the query first helps immensely in the subsequent conversion to code. For the ElasticSearch query, this is what I came up with (using arbitrary sample dates to test the queries):

GET events-index/_search
{
    "query": {
    "bool": {
        "should" : [
        {"bool" :
            {"must": [
                { "exists": { "field": "eventMeta.dateEnd" }},
                { "range" : { "eventMeta.dateStart": { "lte": "2018-09-01"}}},
                { "range" : { "eventMeta.dateEnd": { "gte": "2018-10-01"}}}
                ]
            }
        },
        {"bool" :
            {"must_not": { "exists": { "field": "eventMeta.dateEnd"}},
             "must": [
                { "range" : { "eventMeta.dateStart": { "gte": "2018-01-01", "lte": "2019-12-31"}}}
                ]
            }
        }
    ]
    }
}
}

As can be seen above, I first used a nested Bool query to separate the two main cases, namely events with and without and end date. The should at the top-level bool acts as an OR, indicating documents fitting either situation are desired. I then added the additional date requirements that need to hold for each specific case.

With the query now available, mapping to Java code using ElasticSearch's QueryBuilders (API) was very pleasantly straightforward, one can see the roughly 1-to-1 mapping of the code to the above query (the capitalized constants in the code refer to the relevant field names in the documents):

private QueryBuilder createEventDatesFilter(DateFilter filter) {

    BoolQueryBuilder mainQuery = QueryBuilders.boolQuery();

    // query modeled as a "should" (OR), divided by events with and without an end date,
    // with different filtering rules for each.
    BoolQueryBuilder hasEndDateBuilder = QueryBuilders.boolQuery();
    hasEndDateBuilder.must().add(QueryBuilders.existsQuery(EVENT_END_DATE));
    hasEndDateBuilder.must().add(fillDates(EVENT_START_DATE, null, filter.getStop()));
    hasEndDateBuilder.must().add(fillDates(EVENT_END_DATE, filter.getStart(), null));
    mainQuery.should().add(hasEndDateBuilder);

    BoolQueryBuilder noEndDateBuilder = QueryBuilders.boolQuery();
    noEndDateBuilder.mustNot().add(QueryBuilders.existsQuery(EVENT_END_DATE));
    noEndDateBuilder.must().add(fillDates(EVENT_START_DATE, filter.getStart(), filter.getStop()));
    mainQuery.should().add(noEndDateBuilder);

    return mainQuery;
}

Bulk Updates: We use a "sortDate" field to indicate the specific date front ends should use for sorting results (whether ascending or descending, and regardless of the actual source of the date used to populate that field). For our news stories we wanted to rely on the last update date for stories that have been updated since their original publish, the published date otherwise. For certain older records loaded it turned out that the sortDate was still at the publishedDate when it should have been set to the updateDate. For research I used the following query to determine the extent of the problem:

GET news-index/_search
{
   "query": {
   "bool": {
      "must": [
         { "exists": { "field": "meta.updateDate" }},
         {
            "script": {
               "script": "doc['meta.dates.sortDate'].value.getMillis() < doc['meta.updateDate'].value.getMillis()"
            }
         }
      ]
   }
    }
}

For the above query I used a two part Bool query, first checking for a non-null updateDate in the first clause and then a script clause to find sortDates preceding updateDates. (I found I needed to use .getMillis() for the inequality check to work.)

Next, I used ES' Update by Query API to do an all-at-once update of the records. The API has two parts, an optional query element to indicate the documents I wish to have updated (strictly speaking, in ES, to be replaced with a document with the requested changes) and a script element to indicate the modifications I want to have done to those documents. For my case:

POST news-index/_update_by_query
{
   "script": {
   "source": "ctx._source.meta.dates.sortDate = ctx._source.meta.updateDate",
   "lang": "painless"
},
   "query": {
      "bool": {
         "must": [
            { "exists": { "field": "meta.updateDate" }},
            {
               "script": {
                  "script": "doc['meta.dates.sortDate'].value.getMillis() < doc['meta.updateDate'].value.getMillis()"
               }
            }
         ]
      }
   }
}

For running your own updates, good to test first by making a do-nothing update in the script (e.g., set sortDate to sortDate) and specifying just one document to be so updated, which can be done by adding a document-specific match requirement to the filter query (e.g., { "match": { "id": "...." }},"). Kibana should report that just one document was "updated", if so switch to the desired update to confirm that single record was updated properly, and then finally remove the match filter to have all desired documents updated.

Posted by Glen Mazza in Programming at 03:00AM Oct 27, 2018 | Comments[0]

Post a Comment:

Calendar
« December 2024
Sun Mon Tue Wed Thu Fri Sat
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
Today
About Me
Java Software Engineer
TightBlog project maintainer
Arlington, Virginia USA
glen.mazza at pm dot me
GitHub profile for Glen Mazza at Stack Overflow, Q&A for professional and enthusiast programmers
Blog Search


Blog article index
Navigation
About Blog
Blog software: TightBlog 4.0.0
Application Server: Tomcat
Database: MySQL
Hosted on: Linode
SSL Certificate: Let's Encrypt
Installation Instructions