Published October 30, 2019
Leveraging Elastic for unstructured text analysis
Here are a few helpful hints to get you analysing unstructured text in Elasticsearch.
A few months ago, Aginic held a demo session with Brisbane-based executives featuring Elastic, the search engine powering Twitter, eBay, and many other web-based apps.
The demo had a focus on health analytics, and showcased Elasticsearch’s unstructured text analysis capability. But did you know the tool is also renowned for its powerful logging and monitoring capabilities? If that sounds right up your alley, there’s plenty more juicy material in our other blog post, Elasticsearch and IoT, or at the official Elastic site here.
To kick things off, we had to get our hands on some unstructured medical data. This proved to be quite challenging, as medical notes aren’t available for the general public to access (probably for the best).
Fortunately, we came across a website with solid sample notes from a variety of specialties. To download the samples, we used Python’s Beautiful Soup package to scrape MTSamples (with permission). Depending on the type of website, you can choose from a range of packages and/or extensions (e.g. pyquery, Selenium) to get the data.
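As a rough illustration of the scraping step, here is a minimal sketch of pulling the visible text out of one downloaded page. It is not the script we used: it relies on Python’s built-in html.parser module instead of Beautiful Soup so that it runs without extra installs, and the sample HTML and the names TextExtractor and extract_text are made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical HTML standing in for one downloaded sample page.
SAMPLE_HTML = """
<html><body>
  <h1>Medical Specialty: Surgery</h1>
  <div id="sampletext"><b>DESCRIPTION:</b> Laparoscopic appendectomy.
  <b>PROCEDURE:</b> The patient was taken to the operating room.</div>
</body></html>
"""


class TextExtractor(HTMLParser):
    """Collect the visible text of a page and drop every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip pure whitespace.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the page text as a single whitespace-normalised string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    print(extract_text(SAMPLE_HTML))
```

With Beautiful Soup the same idea is usually a one-liner on the parsed page, but the principle is identical: strip the markup and keep the note text for the cleaning step.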
So how do we analyse unstructured text in Elasticsearch?
Step One: Cleaning up the data
Although Python’s Pandas package is primarily used for numeric and time series analysis, it is also a very powerful general library for data manipulation and storage. This is why we chose to use a Pandas DataFrame to clean and give structure to the raw HTML files. Alternatively, Elasticsearch has its own data cleaning and manipulation capabilities, such as ingest pipelines and the Java-like Painless scripting language, for people who are comfortable with Java syntax.
Step Two: Ingesting the data using Python
To demonstrate Elasticsearch’s Python API, we used Python to ingest our text into Elasticsearch, reusing Michael’s script for his IoT project (check out his post on Elasticsearch and IoT). You can also complete this step using Logstash.
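To give a feel for what ingestion looks like, here is a minimal sketch (not Michael’s script) that turns cleaned records into bulk actions. The index name, field names and sample records are hypothetical. The action dictionaries follow the shape accepted by the bulk helper in the official Python client (elasticsearch.helpers.bulk), and to_bulk_ndjson shows the equivalent newline-delimited body for the _bulk REST endpoint, so the sketch itself runs without a cluster or the client installed.

```python
import json


def build_actions(records, index_name):
    """Yield one bulk action per record, using the row position as the id."""
    for doc_id, record in enumerate(records):
        yield {"_index": index_name, "_id": str(doc_id), "_source": record}


def to_bulk_ndjson(actions):
    """Serialise actions into the newline-delimited body of the _bulk API.

    Each document takes two lines: an action/metadata line, then the source.
    The body must end with a newline.
    """
    lines = []
    for action in actions:
        meta = {"index": {"_index": action["_index"], "_id": action["_id"]}}
        lines.append(json.dumps(meta))
        lines.append(json.dumps(action["_source"]))
    return "\n".join(lines) + "\n"


# Hypothetical cleaned records, e.g. rows exported from a DataFrame.
records = [
    {"medical_specialty": "Surgery", "transcription": "Laparoscopic appendectomy."},
    {"medical_specialty": "Cardiology", "transcription": "Normal sinus rhythm."},
]

if __name__ == "__main__":
    print(to_bulk_ndjson(build_actions(records, "medical-notes")))
```

In a real run, the generator from build_actions would be passed to the client’s bulk helper together with a connected Elasticsearch client.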
We used Elasticsearch’s dynamic mapping capability to ingest data without explicitly listing all field data types. This was a huge timesaver, as there turned out to be hundreds of fields. Last but not least, we turned on fielddata so that we could run analysis (such as aggregations) on text fields in addition to search. Fielddata is turned off by default because it can use a large amount of memory.
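As a sketch of how these two settings fit together, the index body below leaves dynamic mapping on for everything we do not list, and switches fielddata on only for the text fields we want to analyse. The field name transcription and the helper build_index_body are hypothetical, and the typeless mapping format of Elasticsearch 7 and later is assumed.

```python
import json


def build_index_body(fielddata_fields):
    """Build a create-index body with dynamic mapping plus fielddata text fields."""
    properties = {
        # Fielddata loads the field's terms into memory so they can be
        # aggregated, which is why it is off unless requested explicitly.
        name: {"type": "text", "fielddata": True}
        for name in fielddata_fields
    }
    return {
        "mappings": {
            # Fields not listed under "properties" are still mapped
            # automatically as documents arrive.
            "dynamic": True,
            "properties": properties,
        }
    }


index_body = build_index_body(["transcription"])

if __name__ == "__main__":
    print(json.dumps(index_body, indent=2))
```

Listing only the handful of fields that need fielddata keeps the memory cost contained while the remaining fields are still picked up dynamically.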
Step Three: Data Analysis
Kibana is the default data visualisation platform for Elasticsearch. In addition to standard visualisations, it also offers Vega visualisations, Elastic Maps and machine learning capabilities, to name a few. Below we have attached a simple Kibana dashboard for our medical sample notes, showing the number of documents ingested, the top 15 medical specialties, a word cloud and a saved search of the words associated with “surgery”.
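Behind the scenes, dashboard panels like these boil down to ordinary Elasticsearch queries. The sketch below shows roughly what a “top 15 specialties” panel and a “surgery” search could look like as request bodies. The field names are hypothetical, and the .keyword sub-field assumes the default dynamic mapping for strings.

```python
import json


def top_terms_query(field, size):
    """Terms aggregation returning the most frequent values of a field."""
    return {
        "size": 0,  # return only the aggregation buckets, not the documents
        "aggs": {"top_terms": {"terms": {"field": field, "size": size}}},
    }


def match_query(field, text):
    """Full-text search for documents whose field mentions the given text."""
    return {"query": {"match": {field: text}}}


# Hypothetical field names for the medical sample notes.
top_specialties = top_terms_query("medical_specialty.keyword", 15)
surgery_search = match_query("transcription", "surgery")

if __name__ == "__main__":
    print(json.dumps(top_specialties, indent=2))
    print(json.dumps(surgery_search, indent=2))
```

A word cloud is the same terms aggregation pointed at a text field, which is where having fielddata enabled comes in.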
One of the Kibana plugins is Graph, a tool for discovering and showing relevant relationships and connections between items in an Elasticsearch index. It has been used extensively in fraud detection and can be used as a recommendation engine. We have attached a graph which shows how “surgery” relates to other words across different related fields.
One thing we really wanted to do was to combine data-based diagnosis with real-time search capabilities. This could help doctors combine demographic data with their professional medical experience and make more informed diagnoses. Mayo Clinic are currently exploring this approach. Unfortunately, we weren’t able to experiment with this capability, as our dataset didn’t include a time or individual patient component.
Being able to combine search and text analytics in a single platform makes Elasticsearch ideal for unstructured text analysis. At the same time, Elasticsearch handles large volumes of data by storing it as denormalised documents in a flat structure, rather than in normalised relational tables.
It also comes as a stack that has a visualisation platform, ingestion pipelines and external APIs. For unstructured text, there are many inbuilt tools you can use to optimise your search results and text analytics. One of these is the custom analyzer, a powerful option that lets you write your own rules, combine parts of different analyzers (character filters, a tokenizer and token filters), and tailor functionality to your own dataset.
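To make the custom analyzer idea concrete, here is a minimal sketch of index settings that combine a built-in character filter, tokenizer and token filters into one analyzer. The analyzer name medical_notes, the filter name medical_stop and the stopword list are hypothetical choices for illustration, not what we used in the demo.

```python
import json

analysis_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Custom stop filter: drop words that appear in almost every
                # note and so carry little meaning (hypothetical list).
                "medical_stop": {
                    "type": "stop",
                    "stopwords": ["patient", "history", "noted"],
                }
            },
            "analyzer": {
                "medical_notes": {
                    "type": "custom",
                    "char_filter": ["html_strip"],  # remove leftover HTML tags
                    "tokenizer": "standard",        # split text into words
                    "filter": ["lowercase", "medical_stop"],
                }
            },
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(analysis_settings, indent=2))
```

These settings would be supplied when the index is created, and a text field would then reference the analyzer by name in its mapping.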
We hope this brief introduction to Elastic increased your understanding of the different capabilities within the Elastic stack. Feeling inspired to leverage the value of your own unstructured data?
Drop us a line and we’ll be happy to discuss any ideas!
Get in touch with Elynor Liu
Mathematical physics PhD turned data consultant! Love providing clients with data insights as well as challenging myself to understand their fields. Absolutely enjoy being immersed in a practical problem that can be tackled by new technologies! Love the supportive environment and the growth mindset promoted at Aginic.