The Kusto Detective Agency 2022 has been an immersive and entertaining gamified way to learn how to use KQL to analyze big data. Here I will summarize my takeaways from this brilliant game.
Kusto Query Language (KQL) is a powerful tool to explore your data and discover patterns, identify anomalies and outliers, create statistical modeling, and more. KQL is used in Azure Log Analytics and Application Insights and is an important tool for advanced hunting in Azure.
The Kusto Detective Agency game is a free offering from Microsoft where new cases were released every two weeks on Sundays. The promotion period for the game ran from the 18th of September 2022 to the 31st of March 2023, and there were a total of five cases to solve, with a limited number of prizes for those who solved each case first. Every player who correctly solves a case gets a digital badge.
This blog post has been brewing for quite some time. Now that all challenges have been released, the big final challenge is over, and several others have shared their thoughts, it seems to be a good time for me to share my experiences and learn even more!
The game can be accessed by everybody with a Microsoft Account or an AAD identity at the address https://detective.kusto.io/.
If you haven’t tried it yet, I recommend you put in some effort to solve the cases yourself BEFORE reading this blog post!
Table of Contents
- The onboarding
- Challenge #0 – Initial entrance task
- Challenge #1 – The rarest book is missing!
- Challenge #2 – Election fraud?
- Challenge #3 – Bank robbery
- Challenge #4 – Part I: Play with prime numbers
- Challenge #4 – Part II: The trees of NY
- Challenge #4 – Part III: The real detective!
- Challenge #5
- The Appendix
- The END!
The first task as a rookie is to create a free Kusto cluster at https://aka.ms/kustofree, which will be the primary investigation tool throughout the game.
Challenge #0 – Initial entrance task
The initial task includes importing a data table with data from an Azure blob storage. The code for importing this data is given in your detective inbox.
The data will import easily, but be aware that you should only run this command once. If you run it several times, the final result will be wrong. If you suspect you have run it several times, you can always run “.drop table Onboarding” and try once more.
The entrance exam is to sum up the data that has been imported.
This is my saved code from the initial onboarding task where the output will make up the answer:
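A minimal sketch of what such a sum could look like – the column name `Score` is my assumption here; adjust it to whatever the imported `Onboarding` table actually contains:

```kusto
// Sum up the imported onboarding data to produce the answer.
// "Score" is an assumed column name – check the table schema first.
Onboarding
| summarize Answer = sum(Score)
```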
As soon as you have delivered a correct answer to your first message in the Inbox, you will receive your first badge proving you have passed the Rookie test!
Challenge #1 – The rarest book is missing!
Once the release date for the next challenge has passed, you will find a new message in your inbox. Now it’s time for the first real challenge! The following text was found in the inbox:
As always when dealing with assignments, it is smart to read the text properly to get a good understanding of the task. In this challenge it is a good idea to picture the library and the system they have implemented.
This time we are dealing with a pretty cool automated digital library system where each book has a set of well-selected parameters stored in the database. Weight per book is part of this information. This information is tied to the RFID identification sticker on each book. The library shelves automatically register the RFID stickers of books inserted into them. The shelves even have a mechanism registering the total amount of weight on each shelf.
The problem we are about to solve is related to an RFID sticker found on the museum floor. The rarest book has been placed on a shelf without its sticker. This means we can’t locate the book through the automatic RFID registration system in the shelves. We need to analyze big data to locate the book!
We have been given some lines of code to import the dataset from the library into Azure Data Explorer:
I started this challenge by examining the dataset. The data in the “Shelves” table shows three fields, where I find the shelf number and the total measured weight in separate fields. There is also a field containing all the self-registered books on each and every shelf – I guess this is where we have to find a missing book!
I expand the original “rf_ids” field, resulting in one line per book.
Early investigations of the imported library data showed me that each book data was available in the “Books” table.
I now combine this information with a KQL “join”. This gives me the weight of every single book mapped to its registered shelf, and I can summarize the weight of the registered books on each shelf.
By calculating a difference column, I can compare the value from the weight sensor in each shelf with the calculated sum of the checked-in books on that shelf.
The result shown above made me doubt my own train of thought, as none of the shelves had a match between the measured weight and the calculated weight of all the books. I mean, in mathematics I should trust the equal sign.
I then realized this is a detective game, not a mathematical exam. By filtering out all the small differences, I got a single shelf to stand out!
After spending hours studying the problem, I got the answer in 0.86 seconds from the following KQL code:
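The steps above can be sketched roughly as follows – note that the column names (`rf_ids`, `shelf`, `total_weight`, `rf_id`, `weight_gram`) and the noise threshold are my assumptions about the challenge dataset:

```kusto
// Expand the book list per shelf, attach each book's weight,
// and compare the measured shelf weight with the calculated sum.
Shelves
| mv-expand rf_id = rf_ids to typeof(string)        // one row per registered book
| join kind=inner Books on rf_id                     // bring in per-book metadata incl. weight
| summarize CalculatedWeight = sum(weight_gram),
            MeasuredWeight   = take_any(total_weight)
          by shelf
| extend Diff = MeasuredWeight - CalculatedWeight    // sensor value vs. sum of checked-in books
| where Diff > 100                                   // filter out small sensor noise (assumed threshold)
```

The interesting shelf is the one where the difference is roughly the weight of one extra book, while all other shelves only show small measurement noise.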
I am now a proud “Detective 1” of the great Kusto Detective Agency 2022!
Challenge #2 – Election fraud?
The second challenge has a different approach, related to a suspicious election result. The challenge has a lot of text and a big dataset focusing on time stamps.
Importing data could be a challenge in the first cases, but at this point it is easy-peasy with the provided code.
I did spend quite some time studying the large data set containing votes from more than 5 million people. I had to figure out how to find irregularities in the voting patterns.
I ended up using the KQL “bin” function to count votes per candidate per IP within one-second time slices. Examining the query result, I noticed some suspiciously high vote counts in the second-sliced data.
The challenge was now how to sort these anomalies out of the final result. I tried several pretty advanced KQL data analysis methods without getting a valid answer.
After stepping away from the problem for a while and giving it some offline thought as a detective, I realized I could test a much simpler approach – I could just drop every vote count per candidate that was over a given limit per second per IP. The simple line “where votes <= 9” gave me a valid result.
By running this through the provided code for percentage calculation, I got a final accepted answer. For the fun of it, I rendered the answer as a piechart.
My complete set of code for this challenge looked as follows:
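A sketch of the approach described above – table and column names (`Votes`, `Timestamp`, `vote`, `via_ip`) are assumptions about the challenge dataset:

```kusto
// Count votes per candidate per IP in one-second bins,
// drop bot-like bursts, then compute each candidate's share.
let Cleaned =
    Votes
    | summarize votes = count() by vote, via_ip, bin(Timestamp, 1s)
    | where votes <= 9;                       // drop suspiciously high per-second bursts
let Total = toscalar(Cleaned | summarize sum(votes));
Cleaned
| summarize Count = sum(votes) by vote
| extend Percentage = round(Count * 100.0 / Total, 1)
| render piechart
```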
This result allowed me to rise in the ranks of the Kusto Detective Agency to “Detective II” – Hurray!
Challenge #3 – Bank robbery
The third challenge dropped into the inbox, describing a bank robbery where we need to assist in locating the thieves from a huge dataset of traffic camera recordings.
The dataset with traffic information is imported with the code provided in the challenge. Loading this dataset took a while because of the huge amount of data.
Checking the content of the data in the Traffic table gave me this simple dataset: timestamped data with Vehicle Identification Number (VIN) and corresponding Avenue and Street for each timestamp.
This is a real detective challenge. How can we narrow this data down to pinpoint where the thieves are hiding? I found my way through it while cooking dinner for my family. My thoughts were scribbled down on a piece of paper as pseudo code to work further on from my computer.
In short, I made the following plan to filter my way to the thieves:
- Limit the search to the streets closest to the bank
- Limit the timeframe to around 08:31, when the thieves left the bank and got into their separately parked cars
- Put these distinct cars in a variable and use this as a filter on the complete Traffic dataset
- Filter on timestamps after the robbery
At my computer I made progress from my offline detective mind work. The challenge was to track where all the vehicles drove – where they passed each other, and so on. I put a lot of effort into the dataset until my detective mind told me that moving cars were not of interest. I needed to find parked cars. I filtered out the latest timestamped registration for each car and then summarized the number of cars at each position by avenue and street. By filtering this result to locations where the number of cars equaled the number of bank robbers (3), I got a unique address!
This query took 1.122 seconds to run – I have no idea how many brain hours were invested before this. This is the complete set of KQL commands used to solve this case:
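The plan can be sketched as the query below – the timestamp, avenue, and street values near the bank are made-up placeholders, and the column names (`Traffic`, `Timestamp`, `VIN`, `Ave`, `Street`) are assumptions:

```kusto
// Step 1: find the cars that left the streets around the bank right after 08:31.
let SuspectCars =
    Traffic
    | where Timestamp between (datetime(2022-10-16 08:31:00) .. 1m)  // date/time assumed
    | where Ave between (250 .. 252) and Street between (120 .. 122) // made-up area around the bank
    | distinct VIN;
// Step 2: for each suspect car, keep only its LAST registration – a parked car –
// and find the one location where exactly three of them ended up.
Traffic
| where VIN in (SuspectCars)
| summarize arg_max(Timestamp, Ave, Street) by VIN
| summarize Cars = count() by Ave, Street
| where Cars == 3
```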
At this point, I got promoted to “Detective III” of the Kusto Detective Agency 2022 – Proud!
Challenge #4 – Part I: Play with prime numbers
Challenge #4 got released, and where the previous assignments had a lot of text, this one was rather sparse on information about what we were looking for. We got an encrypted message, a URL to all prime numbers, and a shortcut to information related to Special Prime Numbers.
This time there was no code available for importing the data. I had to build this myself.
After importing the data, I evaluated the content and found all the prime numbers – simple as that.
The detective brain had to study the logic behind the “special prime number” term from the provided URL. I found that a special prime number is where the sum of two consecutive prime numbers and the number 1 becomes a new prime number.
I had to figure out how to turn this into a KQL query, and stumbled across the KQL “next()” function. The dataset was therefore expanded with a new column holding the next value from the prime number table. I then did a calculation per line in the dataset, giving me a potential special prime number.
To narrow down the scope, I filtered to keep only the large numbers. The task was to find the largest special prime number below 100M, remember?
Next up was filtering the complete dataset on my calculated special prime candidates and grabbing the largest of these numbers.
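A sketch of this search – the table name `Primes` and column `p` are assumptions about the imported data:

```kusto
// For each consecutive pair of primes, compute p + next(p) + 1 and
// keep only candidates that are themselves prime and below 100M.
Primes
| order by p asc                       // sorted input so next() sees the following prime
| extend candidate = p + next(p) + 1   // definition of a special prime number
| where candidate < 100000000
| join kind=inner (Primes | project candidate = p) on candidate  // candidate must be prime
| summarize LargestSpecialPrime = max(candidate)
```

The `order by` serializes the rows, which is what allows the `next()` window function to look at the following prime in the table.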
This value was combined in a URL given by the challenge:
Challenge #4 – Part II: The trees of NY
Instead of a mission complete message, I got directed to the following message, which actually is a completely new and different challenge! Challenge #4 is a multi-stage rocket! Not as much information this time, but the words seem to be carefully chosen!
Some code was provided this time. First a regular dataset import as seen earlier, but also the creation of two functions!
The dataset contains a massive number of attributes for all the trees found around NY! The following two screenshots show some of the attributes per registered tree. It is a massive dataset which takes some time to get even a minimum of familiarity with.
The challenge text tells me to find a specific location in NY. Locations are often referred to by latitude and longitude. These are values I find for each tree in the dataset.
The text tells me that the location we are looking for has some set of flora regarding the trees:
“…there is a special place with Turkish Hazelnut and four Schubert Chokecherries within 66-meters radius area“
I start by searching for the Schubert Chokecherries. After some searching, the detective in me finds that these have a slightly different spelling in the dataset compared to the challenge text.
The text also mentions isolating the trees within a 66-meter radius area. Googling for this, combined with some information from the hints section in the challenge, led me to the KQL function “geo_point_to_h3cell()”, which calculates an H3 Cell token string value for a geographic location. This is a unique hexagon shape covering a part of a map. Using level 10 for this calculation results in an average hexagon edge length of 66 m.
The next part is to summarize the number of trees in each H3 cell and filter out cells containing 4 instances of the Schubert Chokecherry.
Now I need to do the same for the other specified tree species and match those H3 cells against the previous results. The Turkish Hazelnut has no count specified in the challenge. The American Linden should be the smallest one in the area. This leads me to one single location in 0.369 seconds.
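The core of the H3-cell matching can be sketched like this – the table name `nyc_trees`, the column names, and the exact species spellings are assumptions (as noted above, the dataset spellings differ slightly from the challenge text):

```kusto
// Bucket every tree into an H3 cell at resolution 10 (~66 m hexagon edges),
// find cells with exactly four Schubert Chokecherries, then check for a
// Turkish Hazelnut in the same cell.
let TreesWithCell =
    nyc_trees
    | extend h3 = geo_point_to_h3cell(longitude, latitude, 10);
let ChokecherryCells =
    TreesWithCell
    | where spc_common == "'Schubert' chokecherry"     // spelling assumed from the dataset
    | summarize n = count() by h3
    | where n == 4;
TreesWithCell
| where spc_common == "Turkish hazelnut"               // spelling assumed from the dataset
| where h3 in ((ChokecherryCells | project h3))
| project latitude, longitude
```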
Challenge #4 – Part III: The real detective!
Once more – the specific values from a query don’t seem to fulfill the challenge for the detective. This is not the final result for Challenge #4 – it leads to a new part – this is a real detective part.
By supplying the lat/lon information to the VirtualTourLink function imported earlier, I receive a URL to a Google Maps street view.
Following this URL, I end up with the following view – how can I proceed with KQL!?
I have to return to the text, which this time seems to be particularly pointed and articulate. I have to “go ‘out’ and look for me there, near the smallest American Linden tree”.
I am not that sure what an American Linden tree looks like, but I assume it’s the tree to the right of me in the picture above. By moving out to that tree, my attention is drawn towards the amazing street art on the wall. After studying the wall, I recognize the name “El Puente”. This is the same as the signature name in the challenge – “look for me” – could I have found the correct spot? The instructions tell me to find the bottom line, and the line below says “ASHES to ASHES”.
Let’s put on the David Bowie track Ashes to Ashes while figuring out the rest of the challenge!
I assume the text from the wall will be the key used to decrypt the initial message from the challenge. I stumbled a bit over the encrypted message, since it contained the escape character \ which broke my commands. This was solved by wrapping the message string with @, as mentioned in this Microsoft Learn article on the string data type.
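A tiny illustration of the @-wrapping trick – verbatim strings keep backslashes literal instead of treating them as escape sequences:

```kusto
// "\n" and "\t" are interpreted as newline/tab in a regular string,
// while the @-prefixed verbatim string keeps the backslashes as-is.
print plain    = "C:\new\table",
      verbatim = @"C:\new\table"
```

Pasting the challenge's encrypted message into an `@""` string is what keeps it intact for the decrypt function.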
This is the complete set of KQL commands used solving this case:
Finally! After several stages of this challenge, I got promoted as “Senior Detective” of the Kusto Detective Agency 2022!
The Lord’s ways are unsearchable – When one challenge is finished, all we had to do was to wait for the next assignment to reveal itself!
Challenge #5 is released, and I took a sneak peek at the text in the middle of the night while walking my dog.
The data was imported and examined. The structure seems pretty simple at first glance: 3,803,816 lines of log from a chat server.
My first take was to extend the content in Message into separate columns.
Now I could figure out all the different actions performed inside the chatlogs:
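A sketch of that first split – the table name `ChatLogs` and the exact message format (“&lt;user&gt; joined the channel &lt;channel&gt;” and similar) are assumptions about the dataset:

```kusto
// Split the free-text Message column into User and Channel for one of the
// action types, using the parse operator with a literal pattern.
ChatLogs
| parse Message with User " joined the channel " Channel
| where isnotempty(Channel)      // keep only rows that matched the pattern
```

The same pattern can be repeated (or combined with a union) for the other actions found in the logs, such as leaving a channel or sending a message.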
Now the detective work started. How should I figure out who the criminals are? The first hint didn’t tell me anything I hadn’t realized myself, but it did reinforce the thoughts I had about the dataset.
From this point I made a lot of attempts at finding the criminals. This is a list of some of my failed attempts:
- Narrow down the frequency of “joined the channel” to see if some users stood out.
- Find users not joined to any channel and see if these sent personal messages to a group of 4
- Find users having direct dialogues with only 3 other users.
- Find four users talking in only one channel
These attempts amounted to something like 300–400 lines of test code without giving any result. Time well wasted!
The second hint was kind of supportive for all my tests on the dataset.
I had some small conversations about this problem with my colleagues at Cloudway, who have long experience with KQL. They had a really good approach based on the following:
- Find users sending messages to a channel with only 4 members joined at a given time.
This was kind of high-level to me at that point of the day, and I felt I was close to an answer with some of my queries. I therefore wanted to follow some of my own ideas before throwing them away.
This was the track I felt had the most potential, where I listed all channels with exactly 4 connections from users within one hour. I realize this approach has its weaknesses, with users joining and leaving outside the hour and so on, but I had to pursue it further.
After fighting this for “some” time, I finally found the criminals and the corresponding IP addresses!
Here you have my complete query for this:
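The core idea – channels with exactly four joiners within an hour – can be sketched like this, with the `ChatLogs` table, the `Timestamp` column, and the message format all being assumptions:

```kusto
// Extract channel joins from the log, then bucket them into one-hour bins
// and keep only channels that exactly four distinct users joined in that hour.
let Joins =
    ChatLogs
    | parse Message with User " joined the channel " Channel
    | where isnotempty(Channel);
Joins
| summarize Members = dcount(User), Users = make_set(User)
          by Channel, bin(Timestamp, 1h)
| where Members == 4
```

As noted above, hour-bin boundaries are a weakness of this sketch: a gang joining across a bin edge would be missed.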
This was my take on solving this part of the challenge. I later learned some alternative ways of making this code cleaner and more efficient. I will return to this in the appendix.
The different URLs from the IP addresses gave me some fragmented information, and I understood there were some heavy detective tasks ahead of me.
There was a utils.txt file on one of the devices containing two clues:
One PDF gave me a marked image of the location:
This picture was found on several of the IP address URLs:
By downloading this image and using it at the URL found in the TXT file, I got the LAT and LON location information needed as part of the answer.
I was surprised how fast and easily I got to this information. 2/3 of the details figured out – only the date left, and I had found information on how this could be located as well, in an email on one of the IP URLs. Self-confidence was at its peak, but it was about to take a hit.
The first part of this information was related to finding the date the picture of the target was taken. The picture had been used for finding the geolocation, and the date was listed on the webpage where the picture was found:
ehh… wait. All pictures and files have the same date. Suspicious. Checking the picture properties gave me the correct date for when the picture was actually taken.
The second part of the information from the email talks about a nonsense event happening, where I need the year of the happening. This is where I really stumbled around for a while, knowing I was missing something but not knowing what. In a PDF on one of the URLs, I found the year when Microsoft launched the datacenter which was the target, and thought that was the “historical nonsense event” in an ironic tone.
Based on this information I headed on to the third part of the text found in the email sketching out a formula for calculating the date of the big heist. This gave me a formula based on the “datetime_add” function.
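As a toy illustration of what such a formula boils down to – the period, offset, and start date below are made-up placeholders, not the actual values from the challenge:

```kusto
// datetime_add(period, amount, datetime) shifts a date by a given amount.
// All three arguments here are illustrative placeholders.
print HeistDate = datetime_add("day", 200, datetime(2023-01-01))
```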
With my current variables, this gave me a date in the past. This had to be wrong. I had to dig some more, but where?
It bothered me that I hadn’t used the code generator found in utils.txt. When running it, I got a message that didn’t give me any specific information:
The third hint in this case gave me an idea:
I got the solution code from challenge #4 and used that as a key in the function. This actually gave me a new message:
“it reveals read find headache”
The problem was that I didn’t know what this should lead to, except for headaches! The email found had all its clues outlined in bold – and so were the words Kastor the Elephant.
Kastor was part of challenge #2 about the election fraud, and I spent a lot of time looking for something in that case to be used now. Luckily my colleague Sandy Zeng told me to read the text for case #5 once more, and more, and more, and more… The goal was to see if there was some information provided which I hadn’t made use of.
I did see some information I had missed, but that also pointed towards Kastor and Case #2. Time for a break – remember, breaks are also part of the music!
The breakthrough was to replace the “PS” sentence in the Kusto decoder with the “PS” sentence found in the task and combine this with the code from challenge #4 – this gave a URL to a Bing search! This is so clever!
El Puente gave me a Bing search about an elephant involved in a historical nonsense event – I had a new year for my formula – how cool is that!
This gave me the fifth badge in the Kusto Detective Agency 2022!
But – the story did not end here. The date I got out of the case was naturally ahead of time, and I would not know if I had made a correct assumption until that date. Did I figure out the correct date, or not…? Time would show!
The date was entered as an alarm on my phone with the relevant time zone difference calculated. At my time, 05:00:01 AM, I resubmitted my answer. The effort paid off. I got myself a gift!
And I got a new badge followed by a certificate of gratitude with hints of new challenges coming in 2023!
The Kusto Detective Certificate of Gratitude hints of new challenges coming in 2023!
The challenge has been completed. It has been a truly joyful experience for me, and for the community, where commitment has been great and new connections have been made.
To wrap up, it can be fun to look at some alternative takes on the tasks and datasets.
Challenge #1 and Benford’s law
Basel Salam had an interesting take on the initial dataset, where he suspects it to be a “real dataset” after doing a KQL analysis related to Benford’s law. An interesting analysis by Basel, which confirms the quality work of the Kusto Detective Agency already at the initial task!
Basel’s work looks like this:
Here you have the code behind this graph:
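A sketch of what such a Benford check could look like – the `Onboarding` table and `Score` column are assumptions, as with the onboarding query earlier:

```kusto
// Benford's law predicts that the leading digit 1 appears in ~30% of
// naturally occurring numbers, with decreasing frequency for higher digits.
// Count leading-digit occurrences and plot the distribution.
Onboarding
| extend FirstDigit = toint(substring(tostring(Score), 0, 1))
| summarize Occurrences = count() by FirstDigit
| order by FirstDigit asc
| render columnchart
```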
Challenge #3 and how to map the escape route
I found a great idea from Warren Kahn long after I finished the Kusto Detective Agency, which I felt I needed to investigate further. On his blog, Warren has published an idea on how a scatterchart can be used to track the cars in the streets.
This made me curious about how the robbers in challenge #3 traversed the city in their cars before they met at the final destination. After tweaking my solution to challenge #3, I found the VIN numbers of the cars, which I could use to draw the routes for the three cars. This was an extra fun challenge!
My code for doing this map looks like this:
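The idea can be sketched as below – the three VIN values are placeholders for the cars identified in challenge #3, and the `Traffic` column names are assumptions:

```kusto
// Plot each camera registration as a point on an Ave/Street grid,
// with one series per car, to visualize the escape routes.
Traffic
| where VIN in ("VIN-1", "VIN-2", "VIN-3")   // placeholder VINs for the three robbers
| project Ave, Street, VIN
| render scatterchart with (xcolumn=Ave, ycolumns=Street, series=VIN)
```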
A scatterchart can also be rendered on an actual geographical map with the parameter “kind=map”. This would however require latitude and longitude or GeoJSON points as output of the query.
I am intrigued by this thought, but I haven’t followed up on the challenge of getting latitude and longitude information from the known AVE and STREET info.
Challenge #4 and the wrong geo location
Personally, I had a misleading geolocation in challenge #4 which I couldn’t understand was wrong. I had found “you”, and next to “the bottom line” there was a “key message” to me. How could THIS ever be wrong!?
It took me some time (and dialogue with my colleagues) to figure out I had used the wrong command, which led me to this wrong place 😜.
Challenge #5 and more efficient coding
Through the community I learned some more efficient ways of coding my take on this challenge.
From David Brook I got a tip about using Kusto regex to extract information from the Message part of the dataset, and he also pointed me to regex101.com to help build my query. This is a really good approach which I will take note of for later. It could look like this as an alternative to my approach:
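A sketch of the regex variant – the `ChatLogs` table name and the patterns below are assumptions about the message format, so treat them as a starting point to refine on regex101.com:

```kusto
// Pull User, Action, and Channel out of the free-text Message column
// with one extract() per target column. @-strings keep the regex readable.
ChatLogs
| extend User    = extract(@"^(\S+) ", 1, Message),
         Action  = extract(@"^\S+ (joined|left|sent)", 1, Message),
         Channel = extract(@"channel (\S+)$", 1, Message)
```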
This gives a clean outcome with data in separate columns!
This gives much shorter and cleaner code and made me learn even more KQL. Sandy’s economical version of my approach looks as cool as this:
The Visual Studio Code extension
If you, like me, prefer to do your coding in Visual Studio Code, be aware of the Azure Log Analytics (Kusto) Syntax Highlighting extension. Syntax highlighting truly helps you read the code.
Other blogposts covering the #kustodetectiveagency
- Kusto Detective Agency: Hints and my experience – OpsMan
- Kusto Detective Agency 2022. If you want to learn Kusto Query… | by exploit_daily | Nov, 2022 | Medium
- pthoor/KustoDetectiveAgencyHints: Hints for the Kusto Detective Agency (github.com)
This has been a truly positive experience leading me to embrace the joy of KUSTO, and it has strengthened my KQL skills that will benefit me in my further work. Kudos to the people behind this project!
Program Manager: Avner Aharoni
Program Manager: Drew Furgiuele
Designer: Oleksandr Lotarev
Program Manager: Iain Haywood
Software Developer: Batu Yıldız
Agency partner: The Competition Agency
Prof Smoke: Henning Rauch
If you are curious about the Kusto Query Language and want a more detailed introduction, I can recommend some sources:
- Microsoft Learn: Data analyses in Azure Data Explorer with Kusto Query Language
- The Amazing KQL – The Amazing KQL
- John Savill: Kusto Query Language (KQL) Overview
- Azure Data Explorer KQL cheat sheets – Microsoft Community Hub