Online integration into PubPsych

Everything is implemented with respect to Solr 6.6.5.

QueryFieldRewriter class

This is the main class of the query translation module; the actual translation takes place here. Because it is the most interesting class, it is described first. Details needed to fully understand how the other classes are used are given in their respective sections below.

Inner classes

public class QueryNodeComparator implements Comparator<QueryNode>

QueryNode objects are compared by their nodeId: a QueryNode A is greater than a QueryNode B if and only if A’s nodeId is greater than B’s nodeId.
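A minimal sketch of such a comparator is given here for illustration. The nested Node class is a hypothetical stand-in for QueryNode that only carries the ID, and sortedIds is a helper invented for this example; neither is part of the module.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ComparatorSketch {

    // Hypothetical stand-in for QueryNode: only the node ID matters here.
    static class Node {
        private final int nodeId;
        Node(int nodeId) { this.nodeId = nodeId; }
        int getId() { return nodeId; }
    }

    // Orders nodes by ascending node ID.
    static class NodeComparator implements Comparator<Node> {
        @Override
        public int compare(Node a, Node b) {
            // Integer.compare avoids the overflow that "a.getId() - b.getId()" can cause.
            return Integer.compare(a.getId(), b.getId());
        }
    }

    // Returns the IDs of the given nodes in comparator order.
    static List<Integer> sortedIds(List<Node> nodes) {
        List<Node> copy = new ArrayList<>(nodes);
        Collections.sort(copy, new NodeComparator());
        List<Integer> ids = new ArrayList<>();
        for (Node n : copy) {
            ids.add(n.getId());
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(new Node(3), new Node(1), new Node(2));
        System.out.println(sortedIds(nodes));
    }
}
```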

Fields

Note: In this section, only the fields used for the real (translation) functionality will be explained. There are other fields only used for gathering statistics for the evaluation of the system. The statistics will be explained in section 4.4.

private static final Logger LOG =
LoggerFactory.getLogger(QueryFieldRewriter.class);

The logger for this class.

private final Map<String, Map<String, String>> meshDict = new HashMap<>();

The high-quality (domain) MeSH dictionary is stored in meshDict. The first key is the string (token or phrase) to look up, the second key is a language code (“en”, “de”, “fr”, or “es”; these codes are used in the dictionaries), and the value is the respective translation string. Every inner map contains exactly three entries (one for each language except the source language).

private final Map<String, Map<String, String>> mixedDict = new HashMap<>();

The low-quality (out-of-domain) dictionary is stored in mixedDict. The structure is the same as in meshDict.

private final Map<String, Map<String, List<String>>> laFieldNames =
new TreeMap<>();

Some fields have language-specific versions in the PubPsych core schema, some do not (e.g. TI has language-specific versions like TI_D, whereas text does not have language-specific versions). Only those fields that have language-specific versions are stored in this map.
The first key is a field name, the second key is a language code (“E”, “D”, “F”, “S”; these codes are used in the PubPsych database schema) and the value is a list of language-specific field names. It is a list, because the exact number of language-specific field names varies (there is not only TI_D, but also TI_D_from_E, but there is only one German version of the field SH).

private int nodeIdCounter = 1;

Every QueryNode is assigned a unique node ID. This ID is used as a sorting criterion.

Methods

public QueryFieldRewriter(String pathToMeshDict, String pashToMixedDict,
            String pathToLaFieldNames) throws FileNotFoundException, IOException

In the constructor, the dictionaries and the mapping between fields and language-specific field names (see above) are read in.

private void readInFieldNames(String pathToFieldNames) throws FileNotFoundException, IOException

Reads in the mapping between fields and language-specific field names. The expected format of the file containing the information on the mapping is:

one line containing the field name
next line starts with a language code (D, E, F or S), followed by all language-specific names separated by whitespace
after this has been done for every language for which a language-specific version of the current field exists, a blank line follows; then the pattern repeats for the next field
comments: lines starting with //
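The file format above can be parsed along the following lines. This is only a sketch under assumptions: it works on a list of already-read lines instead of a file path, the class and method names are invented for the example, and the field names TI_E and SH_D in the sample data are illustrative guesses, not confirmed schema names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FieldNameFileSketch {

    // Parses the described format: a field name line, then one line per language
    // code, a blank line between fields, and "//" at the start of comment lines.
    static Map<String, Map<String, List<String>>> parse(List<String> lines) {
        Map<String, Map<String, List<String>>> result = new TreeMap<>();
        String currentField = null;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.startsWith("//")) {
                continue; // comment line
            }
            if (trimmed.isEmpty()) {
                currentField = null; // a blank line closes the current field block
                continue;
            }
            if (currentField == null) {
                currentField = trimmed; // first line of a block: the field name
                result.put(currentField, new TreeMap<String, List<String>>());
                continue;
            }
            // Language line: the code comes first, then the language-specific names.
            String[] parts = trimmed.split("\\s+");
            List<String> names = new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
            result.get(currentField).put(parts[0], names);
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample content; TI_E and SH_D are illustrative names only.
        List<String> sample = Arrays.asList(
                "// sample mapping", "TI", "D TI_D TI_D_from_E", "E TI_E", "", "SH", "D SH_D");
        System.out.println(parse(sample));
    }
}
```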

private void readInDictionary(String pathToDictionary, boolean mesh) throws FileNotFoundException, IOException

Reads in the MeSH dictionary or the low-quality dictionary. The boolean value mesh indicates in which map the dictionary is stored (meshDict or mixedDict).

private void splitEntry(String entry, Map<String, String> translationMap, boolean mesh)

A method used in readInDictionary to split an entry like “en::translation” and add it to the translationMap with the language code as its key and the translation as its value.
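As an illustration, splitting such an entry could look like the sketch below. It deviates from the real signature on purpose: the mesh flag is left out, and a boolean is returned so that malformed entries are visible. Both choices are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

class SplitEntrySketch {

    // Splits "code::translation" at the first "::" and stores the pair.
    // Returns false and leaves the map untouched if the separator is missing
    // or the language code is empty.
    static boolean splitEntry(String entry, Map<String, String> translationMap) {
        int sep = entry.indexOf("::");
        if (sep <= 0) {
            return false;
        }
        String languageCode = entry.substring(0, sep);
        String translation = entry.substring(sep + 2);
        translationMap.put(languageCode, translation);
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> translationMap = new HashMap<>();
        splitEntry("de::Angst", translationMap);
        System.out.println(translationMap);
    }
}
```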

private Translation matchFieldNamesAndTranslations(Map<String, String> translationMap, String field)

Takes a map with language codes as keys and translations of a whole phrase as values. Returns a Translation object that maps language-specific field names to the corresponding translations. The method does not check whether a translation equals the original string, since more fields can be searched this way. If a combination of field and language has no specific field name (like SH and French), a dummy name marked by a tilde is used (e.g. SH~F). This sets apart language-specific field names that exist in the schema from those that are only needed in this code to distinguish the languages.

private Boolean gatherTransInfo(String fieldName, String laFieldCode,
            String laTransCode, Map<String, String> fieldsAndTranslations,
            Map<String, Boolean> fieldNamesLASpecific,
            Map<String, String> translationMap)

Returns true if a language-specific name for fieldName and the language encoded by laFieldCode exists. If yes, each language-specific field name is used as a key to create a new entry of fieldsAndTranslations, each with translationMap’s value for the key laTransCode as its value. If no, false is returned and nothing else happens.
This method is used in matchFieldNamesAndTranslations.

private void addDummyField(Map<String, String> translationMap,
            String field, Map<String, String> fieldsAndTranslations,
            Map<String, Boolean> fieldNamesLASpecific, String laTransCode,
            String laFieldCode)

Builds a dummy language-specific field name following the pattern field + “~” + laFieldCode and uses this as a key to create a new entry of fieldsAndTranslations with translationMap’s value for the key laTransCode as its value.
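The interplay of gatherTransInfo and addDummyField can be sketched in a single method. The sketch assumes that the dictionary codes en, de, fr and es correspond to the schema codes E, D, F and S; the method name match, the helper schemaCode and the sample data are invented for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldMatchingSketch {

    // Assumed mapping from dictionary language codes to schema language codes.
    static String schemaCode(String dictCode) {
        switch (dictCode) {
            case "en": return "E";
            case "de": return "D";
            case "fr": return "F";
            case "es": return "S";
            default: return null;
        }
    }

    // Fills fieldNamesLASpecific as a side effect and returns field name -> translation.
    static Map<String, String> match(Map<String, Map<String, List<String>>> laFieldNames,
            Map<String, String> translationMap, String field,
            Map<String, Boolean> fieldNamesLASpecific) {
        Map<String, String> fieldsAndTranslations = new LinkedHashMap<>();
        Map<String, List<String>> byLanguage = laFieldNames.get(field);
        for (Map.Entry<String, String> e : translationMap.entrySet()) {
            String laFieldCode = schemaCode(e.getKey());
            if (laFieldCode == null) {
                continue; // unknown language code: skipped in this sketch
            }
            List<String> names = (byLanguage == null) ? null : byLanguage.get(laFieldCode);
            if (names != null && !names.isEmpty()) {
                // Real language-specific field names exist: one entry per name.
                for (String name : names) {
                    fieldsAndTranslations.put(name, e.getValue());
                    fieldNamesLASpecific.put(name, Boolean.TRUE);
                }
            } else {
                // No such field in the schema: use a dummy key like "SH~F".
                String dummy = field + "~" + laFieldCode;
                fieldsAndTranslations.put(dummy, e.getValue());
                fieldNamesLASpecific.put(dummy, Boolean.FALSE);
            }
        }
        return fieldsAndTranslations;
    }

    // Illustrative data only: SH has a German version but no French one.
    static Map<String, Map<String, List<String>>> sampleFieldNames() {
        Map<String, Map<String, List<String>>> m = new HashMap<>();
        Map<String, List<String>> sh = new HashMap<>();
        sh.put("D", Arrays.asList("SH_D"));
        m.put("SH", sh);
        return m;
    }

    public static void main(String[] args) {
        Map<String, String> translationMap = new LinkedHashMap<>();
        translationMap.put("de", "Angst");
        translationMap.put("fr", "peur");
        Map<String, Boolean> flags = new HashMap<>();
        System.out.println(match(sampleFieldNames(), translationMap, "SH", flags));
        System.out.println(flags);
    }
}
```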

private String removePluralEnding(String plural, String language)

Following simple rules that depend on the language indicated by language (codes: en, de, fr, and es), the method tries to remove plural endings. The implementation follows tradQueries.py in DBTranslator/scripts/utils.

private Translation translateWholeString(String original, String field)

Tries to translate original as a whole. The method first looks it up in the MeSH dictionary. If this is not successful, it tries to find the string in the low-quality dictionary. Returns null if neither dictionary contains the string original.

private boolean checkSingular(String possibleSingular, String laCode,
            String token, boolean countInStats, boolean isEntireCopy,
            boolean mesh, Map<String, Map<String, String>> translationByToken)

Looks up a possible singular form derived with the method removePluralEnding in both dictionaries. If one of the dictionaries contains the possible singular form, the method additionally checks whether the entry found has the same source language as the one that was assumed when deriving the singular form. This avoids mistakes where a possible singular form of one language looks like a word in another language (e.g. according to our simple rules, a possible singular form of German Weber would be web, which would induce strange translations due to the English word web).
The method returns true if an entry with the correct source language could be found.
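The source-language check can be sketched as follows. The sketch assumes that the source language of an entry is the one language code missing from its inner map (each entry holds translations for all languages except its source language). It checks a single dictionary and drops the statistics parameters; the translations in the sample dictionary are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CheckSingularSketch {

    static final List<String> LANGUAGES = Arrays.asList("en", "de", "fr", "es");

    // Assumption: the source language is the only code without a translation.
    static String sourceLanguageOf(Map<String, String> translations) {
        for (String code : LANGUAGES) {
            if (!translations.containsKey(code)) {
                return code;
            }
        }
        return null;
    }

    // True only if the candidate exists and its entry has laCode as source language.
    static boolean checkSingular(String possibleSingular, String laCode,
            Map<String, Map<String, String>> dict) {
        Map<String, String> entry = dict.get(possibleSingular);
        return entry != null && laCode.equals(sourceLanguageOf(entry));
    }

    // Made-up sample entry for the English word "web".
    static Map<String, Map<String, String>> sampleDict() {
        Map<String, String> web = new HashMap<>();
        web.put("de", "Netz");
        web.put("fr", "toile");
        web.put("es", "red");
        Map<String, Map<String, String>> dict = new HashMap<>();
        dict.put("web", web);
        return dict;
    }

    public static void main(String[] args) {
        // "web" derived from German "Weber" must be rejected: the entry is English.
        System.out.println(checkSingular("web", "de", sampleDict()));
        System.out.println(checkSingular("web", "en", sampleDict()));
    }
}
```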

private Translation translate(String original, String field)

First tries to translate original as a whole (using translateWholeString). If there is no translation, the string is split into tokens. Each token is looked up separately. If a token cannot be found in any of the dictionaries, removePluralEnding is used to derive a possible singular form. If no singular form with correct source language (see checkSingular) is found in any of the dictionaries, the respective token simply gets copied.
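The lookup cascade can be sketched as below. Several simplifications are assumed: the sketch returns a plain map from language code to translated string instead of a Translation object, the caller passes the target languages explicitly, tokens are split at whitespace, and the singular fallback via removePluralEnding and checkSingular is omitted. The helper dict and the sample entries are invented for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TranslateSketch {

    // Looks in the high-quality dictionary first, then in the low-quality one.
    static Map<String, String> lookup(String s, Map<String, Map<String, String>> meshDict,
            Map<String, Map<String, String>> mixedDict) {
        Map<String, String> hit = meshDict.get(s);
        return (hit != null) ? hit : mixedDict.get(s);
    }

    static Map<String, String> translate(String original, List<String> targets,
            Map<String, Map<String, String>> meshDict,
            Map<String, Map<String, String>> mixedDict) {
        // Step 1: try the whole string.
        Map<String, String> whole = lookup(original, meshDict, mixedDict);
        if (whole != null) {
            return whole;
        }
        // Step 2: translate token by token; unknown tokens are copied unchanged.
        Map<String, String> result = new LinkedHashMap<>();
        for (String lang : targets) {
            StringBuilder sb = new StringBuilder();
            for (String token : original.trim().split("\\s+")) {
                Map<String, String> t = lookup(token, meshDict, mixedDict);
                String piece = (t != null && t.containsKey(lang)) ? t.get(lang) : token;
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(piece);
            }
            result.put(lang, sb.toString());
        }
        return result;
    }

    // Builds a one-entry dictionary; used only for the sample data.
    static Map<String, Map<String, String>> dict(String key, String lang, String translation) {
        Map<String, String> inner = new HashMap<>();
        inner.put(lang, translation);
        Map<String, Map<String, String>> d = new HashMap<>();
        d.put(key, inner);
        return d;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> mesh = dict("fear", "de", "Angst");
        Map<String, Map<String, String>> mixed = dict("social", "de", "sozial");
        System.out.println(translate("social fear xyz", Arrays.asList("de"), mesh, mixed));
    }
}
```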

public QueryNode buildQueryNode(Query query, BooleanClause.Occur occLvl)

Builds a tree representing the structure of query and its subqueries if they exist. If a new class is added to the subclasses of Query that the ZpidQParser uses, one needs to implement a mapping into the QueryNode structure here. Otherwise, queries of this new class just get copied and will never be translated. For details on the QueryNode class, see below.

public QueryNode translateQueryNode(QueryNode old)

Returns old after all structural changes with respect to the translations have been made. First, it is checked whether old has to be translated at all. If not, nothing is changed. If yes, the queryType of old determines what happens. In the following, I use expressions like “BooleanQueries” as shorthand for the technically correct “QueryNodes representing a BooleanQuery”. It should be clear from the context whether I refer to the actual Solr BooleanQuery class or my QueryNode class.

BooleanQueries:
- Children that have occurrence level SHOULD are replaced with the result of calling translateQueryNode on them.
- The search strings of all children that are TermQueries or PhraseQueries and that query the same field and have occurrence level MUST (which means they are combined with AND) are concatenated. The result of this concatenation is looked up as a whole. If a dictionary entry has been found, the original nodes are removed from old’s list of children and a new intermediate node inBetween with occurrence level MUST is created. The QueryNodes that are translations are added as inBetween’s children, each with occurrence level SHOULD. The original nodes are grouped under another new node named parentOfOldNodes. This node is then added as another child of inBetween (also with occurrence level SHOULD). inBetween is then added as a child of old (with occurrence level MUST). If the concatenated string is not contained in any of the dictionaries, every child node is replaced with the result of translateQueryNode called on it.
- Children that are DisjunctionMaxQueries are collected separately. If there is more than one child that is a DisjunctionMaxQuery, we try to concatenate the search strings of disjuncts of different DisjunctionMaxQueries. These disjuncts must query the same field. If they are BoostQueries, they must also have the same boost factor. If such disjuncts are found and there is an entry for the concatenated string in one of the dictionaries, the structure of the respective DisjunctionMaxQueries is changed using the method translateGroup. In any case, all DisjunctionMaxQueries are added to nodesToBeTranslatedAlone (a list of nodes that have to be translated alone). If some disjuncts could be translated as a group, each of these nodes is marked as a node that does not have to be translated (anymore).
- Children that are BoostQueries or BooleanQueries are added to nodesToBeTranslatedAlone.
- In the list of children of old, each element n of nodesToBeTranslatedAlone is replaced with the result of translateQueryNode(n). old is returned.
DisjunctionMaxQueries:
- Children that are BooleanQueries or DisjunctionMaxQueries are translated alone.
- For each child that is a BoostQuery, translateWrappedQueryOfBoostQuery is called to create a list of new nodes that will be added to old’s list of children.
- For each child that is a TermQuery or a BoostQuery, buildNewChildren is called to create a list of new nodes that will be added to old’s list of children.
- old is returned.
BoostQueries: translateWrappedQueryOfBoostQuery is called to create a list of new nodes. If this list is empty, the wrapped query was a BooleanQuery or a DisjunctionMaxQuery and has then directly been replaced by its translation. Otherwise, old and its translations (new BoostQueries) are turned into children of a new BooleanQuery inBetween. All of them get the occurrence level SHOULD, whereas inBetween’s occurrence level is set to the original occurrence level of old. inBetween is returned.
PhraseQueries and TermQueries: A Translation object is created using the method translate on old’s text and field. Based on this object, a list of new (translated) nodes with occurrence level SHOULD is created and added to the children list of a new node inBetween, which is also made the parent of old. The occurrence level of inBetween is set to the occurrence level of old, whereas old’s occurrence level is set to SHOULD. inBetween is returned.
All other subclasses of Query: A warning is logged that there is an unhandled QueryType and old is returned with no changes at all.
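The restructuring for TermQueries and PhraseQueries can be sketched with a small stand-in tree. Node and Occur are hypothetical replacements for QueryNode and BooleanClause.Occur, and the method name wrapWithTranslations is invented. Resolving a dummy key such as SH~F back to the original field name is a simplification of what the Translation class is used for.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WrapTermSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    // Hypothetical stand-in for QueryNode.
    static class Node {
        final String type;
        final String field;
        final String text;
        Occur occ;
        final List<Node> children = new ArrayList<>();

        Node(String type, String field, String text, Occur occ) {
            this.type = type;
            this.field = field;
            this.text = text;
            this.occ = occ;
        }
    }

    // Wraps old and its translations in a new intermediate node.
    static Node wrapWithTranslations(Node old, Map<String, String> fieldsAndTranslations) {
        // inBetween takes over the occurrence level of the original node.
        Node inBetween = new Node("BOOLEAN", null, null, old.occ);
        old.occ = Occur.SHOULD;
        inBetween.children.add(old);
        for (Map.Entry<String, String> e : fieldsAndTranslations.entrySet()) {
            // Simplification: dummy keys ("SH~F") are queried on the original field.
            String field = e.getKey().contains("~") ? old.field : e.getKey();
            inBetween.children.add(new Node(old.type, field, e.getValue(), Occur.SHOULD));
        }
        return inBetween;
    }

    public static void main(String[] args) {
        Node old = new Node("TERM", "TI", "fear", Occur.MUST);
        Map<String, String> translations = new LinkedHashMap<>();
        translations.put("TI_D", "Angst");
        Node result = wrapWithTranslations(old, translations);
        for (Node child : result.children) {
            System.out.println(child.field + ":" + child.text + " " + child.occ);
        }
    }
}
```

The original node stays in the tree as one alternative among the translations, so a document matching either the original or any translation satisfies the clause at the original occurrence level.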

private void translateGroup(QueryNode representative, List<QueryNode> group,
            Map<String, String> translationMap, List<QueryNode> disjunctionMaxQueries, List<Integer> idsOfDMQs)

This method is used to translate disjuncts of different DisjunctionMaxQueries that have been matched together (cf. translateQueryNode). The translation of the concatenated tokens is split into single tokens for which new QueryNodes are created and distributed as disjuncts among the DisjunctionMaxQueries to which the original nodes belonged. All original nodes are then marked as “does not have to be translated” (thus avoiding double translations).

private Map<QueryNode, List<QueryNode>> matchDMQsWithSameField(List<QueryNode> disjunctionMaxQueries)

This method is used to find disjuncts of different DisjunctionMaxQueries that query the same field (and have the same boost factor if applicable). In the returned map, the key of each entry is a representative QueryNode and the value is a list of all the QueryNodes that query the same field (and have the same boost factor if applicable) as the representative QueryNode.

private List<QueryNode> buildNewChildren(Translation trans,
            QueryNode.QueryType typeOfChildren, boolean moreThanOneOriginalNode,
            float boost, QueryNode.QueryType typeOfWrappedQuery)

Returns a list of QueryNodes representing the translations. If typeOfChildren is TERM and moreThanOneOriginalNode is true (so the translation is based on the concatenation of at least two TermQueries or PhraseQueries), then the translation string of each target language is split into tokens. If there is more than one token, then a new QueryNode with occurrence level MUST is created for every token. These new nodes are then made children of a new intermediate node with occurrence level SHOULD which is added to the returned list. If there is only one token or the translation is based on just one TermQuery or PhraseQuery, a new node created with buildNewChild is added to the returned list. In case of BoostQueries, a warning is logged if moreThanOneOriginalNode is true and a new node of type BOOST is built whose wrapped query is created with buildNewChild.
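The token-splitting rule for the TERM case can be sketched for a single target language as follows. Node and Occur are hypothetical stand-ins, the method name buildForLanguage is invented, and the BOOST case is left out. The sketch also assumes that a node built for a single translation receives occurrence level SHOULD.

```java
import java.util.ArrayList;
import java.util.List;

class BuildChildrenSketch {

    enum Occur { MUST, SHOULD }

    // Hypothetical stand-in for QueryNode.
    static class Node {
        final String type;
        final String field;
        final String text;
        final Occur occ;
        final List<Node> children = new ArrayList<>();

        Node(String type, String field, String text, Occur occ) {
            this.type = type;
            this.field = field;
            this.text = text;
            this.occ = occ;
        }
    }

    // Builds the node for one target language and its translation string.
    static Node buildForLanguage(String field, String translation,
            boolean moreThanOneOriginalNode) {
        String[] tokens = translation.trim().split("\\s+");
        if (moreThanOneOriginalNode && tokens.length > 1) {
            // Several original nodes and several tokens: every token must match,
            // and the group as a whole is one optional alternative.
            Node group = new Node("BOOLEAN", null, null, Occur.SHOULD);
            for (String token : tokens) {
                group.children.add(new Node("TERM", field, token, Occur.MUST));
            }
            return group;
        }
        // Single token or single original node: one plain node.
        return new Node("TERM", field, translation, Occur.SHOULD);
    }

    public static void main(String[] args) {
        Node group = buildForLanguage("TI_D", "soziale Phobie", true);
        System.out.println(group.type + " with " + group.children.size() + " children");
    }
}
```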

private QueryNode buildNewChild(QueryNode.QueryType typeOfChild, String fieldName, String text, BooleanClause.Occur occLvl)

Returns a new QueryNode of type typeOfChild with field set to fieldName, text set to text and the occurrence level set to occLvl.

private boolean hasToBeTranslated(String fieldName)

Returns true if one of the following conditions holds:

fieldName is a key of laFieldNames
fieldName equals “text”
fieldName equals “SW”.
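Expressed as code, the check is a simple disjunction. In this sketch the key set of laFieldNames is passed in as a parameter, which differs from the real method; the sample key TI is taken from the field description above.

```java
import java.util.Collections;
import java.util.Set;

class HasToBeTranslatedSketch {

    // True for fields with language-specific versions and for "text" and "SW".
    static boolean hasToBeTranslated(String fieldName, Set<String> laFieldNameKeys) {
        return laFieldNameKeys.contains(fieldName)
                || "text".equals(fieldName)
                || "SW".equals(fieldName);
    }

    public static void main(String[] args) {
        Set<String> keys = Collections.singleton("TI");
        System.out.println(hasToBeTranslated("TI", keys));
        System.out.println(hasToBeTranslated("other", keys));
    }
}
```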

public QueryNode simplify(QueryNode toSimplify)

Simplifies QueryNodes with redundant structure: If the child c of a BooleanQuery is a BooleanQuery and has occurrence level MUST, then all its children (grandchildren of toSimplify) that also have occurrence level MUST are removed from c’s list of children and added to toSimplify’s list of children. If all children of c have occurrence level MUST, c itself is removed from toSimplify’s list of children.
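A one-level sketch of this flattening is shown below. Node, Occur and the helper node are hypothetical stand-ins for the real classes. Whether the real method also descends recursively into deeper levels is not stated here, so the sketch handles only the direct children of toSimplify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SimplifySketch {

    enum Occur { MUST, SHOULD }

    // Hypothetical stand-in for QueryNode.
    static class Node {
        final String type;
        final Occur occ;
        final List<Node> children = new ArrayList<>();

        Node(String type, Occur occ) {
            this.type = type;
            this.occ = occ;
        }
    }

    // Convenience factory used for building sample trees.
    static Node node(String type, Occur occ, Node... children) {
        Node n = new Node(type, occ);
        n.children.addAll(Arrays.asList(children));
        return n;
    }

    static Node simplify(Node toSimplify) {
        if (!"BOOLEAN".equals(toSimplify.type)) {
            return toSimplify;
        }
        List<Node> hoisted = new ArrayList<>();
        Iterator<Node> it = toSimplify.children.iterator();
        while (it.hasNext()) {
            Node c = it.next();
            if (!"BOOLEAN".equals(c.type) || c.occ != Occur.MUST) {
                continue;
            }
            // Move every MUST grandchild one level up.
            Iterator<Node> inner = c.children.iterator();
            while (inner.hasNext()) {
                Node grandchild = inner.next();
                if (grandchild.occ == Occur.MUST) {
                    inner.remove();
                    hoisted.add(grandchild);
                }
            }
            // If nothing is left, the intermediate node is redundant.
            if (c.children.isEmpty()) {
                it.remove();
            }
        }
        toSimplify.children.addAll(hoisted);
        return toSimplify;
    }

    public static void main(String[] args) {
        Node a = node("TERM", Occur.MUST);
        Node b = node("TERM", Occur.MUST);
        Node root = simplify(node("BOOLEAN", null, node("BOOLEAN", Occur.MUST, a, b)));
        System.out.println(root.children.size());
    }
}
```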

public Query manipulateQuery(Query query)

Maps query to a QueryNode object, simplifies this object with simplify and then runs translateQueryNode on it. The result of translateQueryNode is mapped back to a Query object which is then returned.

public List<QueryNode> translateWrappedQueryOfBoostQuery(QueryNode wrappedQuery, QueryNode.QueryType wqType, QueryNode bq)

If wqType (type of wrappedQuery) is BOOLEAN or DISJUNCTIONMAX, the wrapped query of bq is replaced with translateQueryNode(wrappedQuery) and an empty list is returned. Otherwise, new nodes are created with buildNewChildren.

QueryNode class

The purpose of the QueryNode class is to provide an interface for manipulating queries more easily than with Solr’s Query class. The main advantage is that one only has to deal with one class for all query objects (and not with the many subclasses of Query that all provide different methods and fields). This way, one doesn’t have to build a new Query object for each change; instead, one changes the respective QueryNode object and maps it back to a Query object once all changes due to translation have been performed.

Inner classes

public enum QueryType {
   BOOLEAN, TERM, DISJUNCTIONMAX, PHRASE, BOOST, DUMMY
}

These are the subclasses of Query that the ZpidQParser uses (version as of November 2018).

BOOLEAN: BooleanQuery
TERM: TermQuery
DISJUNCTIONMAX: DisjunctionMaxQuery
PHRASE: PhraseQuery
BOOST: BoostQuery
DUMMY: Represents Query subclasses for which there is no implementation yet. This is important in case the ZpidQParser implementation starts using new or different classes. Someone would then have to decide how to translate these classes.

Fields

private final int nodeId;

Every QueryNode object is assigned a unique ID in the constructor. This is needed e.g. for sorting purposes.

private final List<QueryNode> children;

This list contains QueryNode objects that are somehow subordinate to the given QueryNode. For instance, they can be subqueries of a BooleanQuery or disjuncts of a DisjunctionMaxQuery. TermQuery and PhraseQuery objects cannot have children.

private final QueryType queryType;

The type of the Query that the given QueryNode object represents.

private String fieldName;

This field is only set for TermQuery and PhraseQuery objects and contains the name of the corresponding field in the PubPsych core schema (e.g. TI).

private String text;

This field is only set for TermQuery and PhraseQuery objects and contains the actual text associated with that Query object.

private BooleanClause.Occur occLvl;

The occurrence level of the given QueryNode object is set if and only if the QueryNode object is the child of another QueryNode object with type BOOLEAN.

private Boolean toTranslate;

This field is set to true if the QueryNode object (or the associated representative, cf. method translateQueryNode of the QueryFieldRewriter class) has not been translated yet and one of the following conditions holds:

The QueryNode object represents a TermQuery or a PhraseQuery and fieldName is one of the translatable fields.
The QueryNode object has children.

private float tieBreakerMultiplier;

This field is set only for QueryNode objects representing a DisjunctionMaxQuery. It is a tie breaker value for multiple matches. For more details, see https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/DisjunctionMaxQuery.html.

private float boost;

This field is set only for QueryNode objects representing a BoostQuery.
The higher the boost, the more important the wrapped query will be to the computed score. See https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/BoostQuery.html.

private Query originalQuery;

The original Query object is stored in case it is an instance of an unhandled subclass of Query. When mapping QueryNode objects with type DUMMY back to Query objects, originalQuery will be returned.

private int slop;

This field is set only for QueryNode objects representing a PhraseQuery. It is the maximum edit distance for which terms in the PhraseQuery will still be matched to strings in the documents (see https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/PhraseQuery.html). In the current implementation, the slop of the translated PhraseQuery objects is simply copied from the original Query.

private int parentId;

The node ID of the parent node if the latter one exists. It is needed for matching disjuncts querying the same field but belonging to different DisjunctionMaxQuery objects.

private static final Logger LOG = LoggerFactory.getLogger(QueryNode.class);

The logger object.

Methods

public QueryNode(QueryType type, int id)

Every QueryNode object needs a type and a unique ID. These two cannot be changed afterwards.

public int getId()

Returns the unique node ID.

public QueryType getType()

Returns the QueryType.

public void addChild(QueryNode child)

Adds child to children if this is an allowed operation for the given queryType.

public void addChildren(List<QueryNode> newChildren)

Adds all elements of newChildren to children if this is an allowed operation for the given queryType.

public List<QueryNode> getChildren()

Returns the children list.

public QueryNode getChildAtIndex(int index)

Returns the element of children that is at position index.

public void removeChildAtIndex(int index)

Removes the element of children that is at position index.

public void replaceChildAtIndex(int index, QueryNode newChild)

Replaces the element of children that is at position index with newChild.

public void setFieldName(String name)

Sets fieldName to name if the given QueryNode is of type TERM or PHRASE.

public String getFieldName()

Returns fieldName.

public void setText(String text)

Sets this.text to text if the given QueryNode is of type TERM or PHRASE.

public String getText()

Returns text.

public void setOccurrenceLevel(BooleanClause.Occur lvl)

Sets occLvl to lvl.

public BooleanClause.Occur getOccurrenceLevel()

Returns occLvl.

public void markAsToTranslate()

Sets toTranslate to true.

public void unmarkToTranslate()

Sets toTranslate to false.

public boolean hasToBeTranslated()

Returns toTranslate.

public void setTieBreakerMultiplier(float tbm)

Sets tieBreakerMultiplier to tbm.

public float getTieBreakerMultiplier()

Returns tieBreakerMultiplier.

public void setOriginalQuery(Query query)

Sets originalQuery to query.

public Query getOriginalQuery()

Returns originalQuery.

public void setBoost(float boost)

Sets this.boost to boost.

public float getBoost()

Returns boost.

public void setParentId(int id)

Sets parentId to id.

public void setSlop(int sl)

Sets slop to sl.

public int getSlop()

Returns slop.

public Query buildQuery()

Maps the given QueryNode object back to an object of one of the subclasses of Solr’s Query class. Given the QueryType, the mapping works as follows:

BOOLEAN: Each element child of children recursively calls buildQuery(). The resulting Query objects are added as subqueries with child.getOccurrenceLevel() to the new BooleanQuery object. If a child does not have an occurrence level, a warning is logged and the childQuery (respective Query object) is added with the occurrence level SHOULD.
DISJUNCTIONMAX: Each element child of children recursively calls buildQuery(). The resulting Query objects are all added to a list of disjuncts which is then used to build the new DisjunctionMaxQuery object.
PHRASE: text is split into tokens. Each token is then used (together with fieldName) to build a Term for the new PhraseQuery.
TERM: text and fieldName are used to construct a new Term object, with which the new TermQuery is then built.
BOOST: A new BoostQuery with the element of children that is at position 0 as its wrapped query and a boost with value boost is returned.
DUMMY: originalQuery is returned.

Translation class

The Translation class is used to store information about the translation of a specific field and string.

Fields

private static final Logger LOG = LoggerFactory.getLogger(QueryNode.class);

The logger object.

private String originalFieldName;

The original field name.

private Map<String, String> fieldsAndTranslations;

The keys of this map are the language-specific versions of originalFieldName, the values are the translations.

private Map<String, Boolean> fieldNamesLanguageSpecific;

The keys of this map are the language-specific versions of originalFieldName. The values indicate whether these versions also exist in the PubPsych core schema (true) or whether they are just dummy values to maintain the distinction between the different languages (false). We need these dummy values because using originalFieldName as the key (it is the name that will later be used in the query itself in this case) would overwrite existing translations in the map.

Methods

public Translation(Map<String, String> translations,
Map<String, Boolean> fieldNamesLASpecific, String fieldName)

Initializes the class fields with the given values.

public boolean isLanguageSpecificFieldName(String fieldName)

Returns true if fieldName is a language-specific field name in the PubPsych core schema, false otherwise. Logs a warning if fieldName is not a key in fieldNamesLanguageSpecific.

public Map<String, Boolean> getFieldNamesLASpecificMap()

Returns fieldNamesLanguageSpecific.

public Map<String, String> getTranslations()

Returns fieldsAndTranslations.

public String getOriginalFieldName()

Returns originalFieldName.

PreTranslationInfo class

Objects of the PreTranslationInfo class store information that is gathered before the actual translation happens. It is used for the translation of BooleanQuery objects. The texts of all children of type TermQuery or PhraseQuery are concatenated and then stored in a PreTranslationInfo object to be able to look up the whole text related to the BooleanQuery in the dictionaries.
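The gathering step can be sketched as follows. The sketch rests on several assumptions that are not confirmed by the description: one info object is kept per field name, the texts are joined with a single space, and only children with occurrence level MUST are collected. Node and Info are hypothetical stand-ins for QueryNode and PreTranslationInfo.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PreTranslationSketch {

    // Hypothetical stand-in for QueryNode.
    static class Node {
        final String type;
        final String field;
        final String text;
        final boolean must;

        Node(String type, String field, String text, boolean must) {
            this.type = type;
            this.field = field;
            this.text = text;
            this.must = must;
        }
    }

    // Hypothetical stand-in for PreTranslationInfo.
    static class Info {
        final StringBuilder sb = new StringBuilder();
        final Map<Integer, Node> originalNodes = new TreeMap<>();
    }

    // Groups the MUST term and phrase children of a BooleanQuery by field.
    static Map<String, Info> gather(List<Node> children) {
        Map<String, Info> byField = new LinkedHashMap<>();
        for (int i = 0; i < children.size(); i++) {
            Node c = children.get(i);
            boolean termOrPhrase = "TERM".equals(c.type) || "PHRASE".equals(c.type);
            if (!termOrPhrase || !c.must) {
                continue;
            }
            Info info = byField.get(c.field);
            if (info == null) {
                info = new Info();
                byField.put(c.field, info);
            }
            if (info.sb.length() > 0) {
                info.sb.append(' '); // assumed separator between search strings
            }
            info.sb.append(c.text);
            info.originalNodes.put(i, c); // remember the index in the children list
        }
        return byField;
    }

    public static void main(String[] args) {
        List<Node> children = Arrays.asList(
                new Node("TERM", "TI", "social", true),
                new Node("BOOLEAN", null, null, true),
                new Node("TERM", "TI", "phobia", true));
        System.out.println(gather(children).get("TI").sb);
    }
}
```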

Fields

private StringBuilder sb;

The StringBuilder is used to concatenate the text strings of all subqueries of the BooleanQuery for which the PreTranslationInfo is gathered.

private Map<Integer, QueryNode> originalNodes;

The keys are the indices in the children list (list of subqueries) of the parent node (BooleanQuery), the values are the respective subqueries.

Methods

public PreTranslationInfo()

Initializes sb and originalNodes with an empty StringBuilder and an empty map, respectively.

public void addString(String s)

Appends s to sb.

public String buildString()

Builds the string that has been gathered by sb.

public void addOriginalNode(int index, QueryNode node)

Adds node with key index to the originalNodes map.

public Set<Integer> getIntegers()

Returns all keys of originalNodes.

public Collection<QueryNode> getNodes()

Returns all values of originalNodes.

public Map<Integer, QueryNode> getOriginalNodesMap()

Returns originalNodes.