What you didn’t know about searching in Vault
From time to time, our customers ask us why some of their searches return unexpected results. This typically happens when they search without wildcards (I’ll explain later). In this blog post, I hope to shed some light on what can be a confusing experience for some Vault users.
The search engine in Vault operates on a general computer science principle called tokenization. This process chops the indexed properties into chunks called tokens. When a user searches in Vault (either quick search or advanced find), the search engine attempts to match the tokens in the search string to the tokens in the appropriate properties. Before going further, I’ll explain how Vault does the slicing and dicing.
First, there are three categories of characters (for our purposes, at least): alpha [a-z, A-Z], numeric [0-9], and special [#^$, blank space, etc.]. Vault parses the string and sniffs out groups of consecutive characters belonging to the same category. For instance, ABC123$@# would be tokenized into three individual tokens:
- ABC
- 123
- $@#
Again, what happened is that Vault saw the first character, A, and understood it to be an alpha character. Vault then asked “Is the next character an alpha, too?” The answer was yes, so the token became AB. C was then added to the same token, as it too was an alpha character. When it came to the character 1, however, the answer was “No.” Vault finished its first token and began the next one, now that it sensed a different category of character. Vault continued this line of questioning with the subsequent characters.
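To make that line of questioning concrete, here is a minimal Python sketch of the same walk. This is only an illustration of the idea, not Vault's actual code; the function names category and split_by_category are made up for this post.

```python
def category(ch: str) -> str:
    """Classify a character as alpha, numeric, or special."""
    if ch.isascii() and ch.isalpha():
        return "alpha"
    if ch.isascii() and ch.isdigit():
        return "numeric"
    return "special"


def split_by_category(text: str) -> list[str]:
    """Walk the string and start a new token whenever the category changes."""
    tokens: list[str] = []
    current = ""
    for ch in text:
        # "Is the next character in the same category?" If not, close the token.
        if current and category(ch) != category(current[-1]):
            tokens.append(current)
            current = ""
        current += ch
    if current:
        tokens.append(current)
    return tokens


print(split_by_category("ABC123$@#"))  # ['ABC', '123', '$@#']
```

The sketch only groups characters by category; it does not yet decide which special characters are worth keeping.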
Another example might be a file name like SS Bearing Plate-6x6.ipt. Here, we have eight tokens:
- SS
- Bearing
- Plate
- -
- 6
- x
- 6
- ipt
Now, you may have caught the missing period. Vault only tokenizes six special characters; all others, including the period and the blank space, are ignored. These special special characters (sorry, had to do it) are:
- $ (dollar sign)
- - (dash)
- _ (underscore)
- @ (at symbol)
- + (plus)
- # (octothorpe, aka number sign)
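Putting the three categories and the six special characters together, the rules can be approximated with a single regular expression. This is a rough sketch in Python, not how Vault is implemented; TOKEN_PATTERN and tokenize are names invented here, and I am assuming that ignored characters (spaces, periods, and so on) simply act as separators between tokens.

```python
import re

# A token is a run of letters, a run of digits, or a run of the six
# special characters that get tokenized: $ - _ @ + #
# Anything else is skipped (assumption: it just separates tokens).
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+|[$\-_@+#]+")


def tokenize(text: str) -> list[str]:
    """Return the tokens of `text` in order of appearance."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("SS Bearing Plate-6x6.ipt"))
# ['SS', 'Bearing', 'Plate', '-', '6', 'x', '6', 'ipt']
```

Notice that the period never shows up in the output, which matches the eight tokens listed for the file name above.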
So now where do the unexpected results come in? This usually happens when an incomplete token is used without wildcards. For example, a user wants to find a specific mounting bracket. This user types in “mount” and expects that to be enough. In our hypothetical Vault environment, the search would return “Fan mount.ipt” but not the “Mounting bracket.ipt” they were after. Why? Remember that Vault is trying to match exact tokens (again, without wildcards): “mount” is a complete token in “Fan mount.ipt”, but only part of the token “Mounting”.
If the user had entered mount*, the search would have returned “Mounting bracket.ipt” as intended.
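Here is a small Python sketch of that difference between exact-token and wildcard matching. It is a toy model of the behavior described in this post, not Vault's search engine: the tokenize and matches functions are hypothetical, the query is assumed to be a single term, matching is assumed to be case-insensitive, and the standard library's fnmatchcase stands in for Vault's wildcard handling.

```python
import re
from fnmatch import fnmatchcase

# Same tokenizing rule as described earlier: runs of letters, runs of digits,
# or runs of the six tokenized special characters.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+|[$\-_@+#]+")


def tokenize(text: str) -> list[str]:
    """Split text into lowercase tokens (case-insensitivity is an assumption)."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


def matches(query: str, file_name: str) -> bool:
    """True if any token of file_name matches the single-term query.

    Without a wildcard the whole token must match; with * it may be a prefix.
    """
    pattern = query.lower()
    return any(fnmatchcase(token, pattern) for token in tokenize(file_name))


files = ["Fan mount.ipt", "Mounting bracket.ipt"]
print([f for f in files if matches("mount", f)])
# ['Fan mount.ipt']
print([f for f in files if matches("mount*", f)])
# ['Fan mount.ipt', 'Mounting bracket.ipt']
```

In this toy model the wildcard search returns both files, since “mount” also satisfies mount*.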
Moral of the story? Always use wildcards…always. No, really, all the time. For everything.