CodeNewbie Community 🌱

Tristan

Parsing non-Latin based Twitch usernames in Kotlin

Table of contents

  1. Introduction
  2. The Problem
  3. The Solution

The code

My app on the Google play store

Introduction

  • Here is a quick reminder: whenever you parse usernames or other user-generated content, check whether your parser can handle non-Latin text.

The Problem

  • I recently ran into an issue where the regex in my parsing code simply did not work on non-Latin alphabets. For example, say I want to parse the display-name from this string: display-name=CoalTheTroll;emotes=;flags=;id=3ceab6bd-de3f-4d05-8038-5cebdb2af1c7; :tmi.twitch.tv USERNOTICE #cohhcarnage
  • The typical code would look like this:
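fun userNoticeParsing(text: String): String {
    // Only matches ASCII letters, digits, and underscores after "display-name="
    val displayNamePattern = "display-name=([a-zA-Z0-9_]+)".toRegex()
    val displayNameMatch = displayNamePattern.find(text)
    // !! throws a NullPointerException when no match was found
    return displayNameMatch?.groupValues?.get(1)!!
}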

  • The code above works for Latin display names. However, there is a problem when the display name is not Latin based. For example, a Mandarin display name such as 不橋小硐 contains no characters from [a-zA-Z0-9_], so find returns null and the !! operator crashes the code with a NullPointerException.
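  • To see the failure without crashing, here is a minimal sketch. parseLatinOnly is a hypothetical helper (not part of my app) that uses the same Latin-only pattern but returns null instead of using !!, and the input strings are shortened versions of the example above:

```kotlin
// Hypothetical helper: same Latin-only pattern as above, but nullable instead of !!
fun parseLatinOnly(text: String): String? {
    val pattern = "display-name=([a-zA-Z0-9_]+)".toRegex()
    return pattern.find(text)?.groupValues?.get(1)
}

fun main() {
    val latin = "display-name=CoalTheTroll;emotes=;flags=;"
    val mandarin = "display-name=不橋小硐;emotes=;flags=;"

    println(parseLatinOnly(latin))    // CoalTheTroll
    println(parseLatinOnly(mandarin)) // null: no character after "=" is in [a-zA-Z0-9_]
}
```

  • That null is exactly what the !! in the original function turns into a crash.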

The Solution

  • A simple solution (some might say lazy) is to stop worrying about which character set the name uses. With regex, we simply say: match every character after display-name= until the next ;. The code would look like this:
fun userNoticeParsing(text: String): String {
    val displayNamePattern = "display-name=([^;]+)".toRegex()
    val displayNameMatch = displayNamePattern.find(text)
    return displayNameMatch?.groupValues?.get(1) ?: "username"
}
  • With the regex above, display-name=([^;]+), we are saying: match display-name= followed by one or more characters that are not a ;, and stop matching once you reach a ;. The () brackets split the regex into groups, which makes it easy to retrieve just the part we actually want (group 1, the name itself). Lastly, we use the ?: operator to say: if no match is found, return "username".
  • Now our code works even with display names written in non-Latin scripts, such as Mandarin:
val text = "display-name=不橋小硐;emotes=;flags=;id=3ceab6bd-de3f-4d05-8038-5cebdb2af1c7; :tmi.twitch.tv USERNOTICE #cohhcarnage"

fun userNoticeParsing(text: String): String {
    val displayNamePattern = "display-name=([^;]+)".toRegex()
    val displayNameMatch = displayNamePattern.find(text)
    return displayNameMatch?.groupValues?.get(1) ?: "username"
}

fun main() {
    val expectedUsername = "不橋小硐"
    val actualUsername = userNoticeParsing(text)
    println(expectedUsername == actualUsername) // prints true
}

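  • To make the capture group and the ?: fallback described above more concrete, here is a small sketch. It reuses userNoticeParsing unchanged; the shortened input strings are made up for illustration and are not real Twitch messages:

```kotlin
fun userNoticeParsing(text: String): String {
    val displayNamePattern = "display-name=([^;]+)".toRegex()
    val displayNameMatch = displayNamePattern.find(text)
    return displayNameMatch?.groupValues?.get(1) ?: "username"
}

fun main() {
    val match = "display-name=([^;]+)".toRegex().find("display-name=不橋小硐;emotes=;")
    println(match?.groupValues?.get(0)) // display-name=不橋小硐 (group 0 is the whole match)
    println(match?.groupValues?.get(1)) // 不橋小硐 (group 1 is only what the brackets captured)

    // No display-name tag at all: find returns null, so ?: supplies the fallback
    println(userNoticeParsing("emotes=;flags=;"))               // username
    // Empty value: [^;]+ needs at least one character, so this also falls back
    println(userNoticeParsing("display-name=;emotes=;flags=;")) // username
}
```

  • One thing to keep in mind with this approach: the fallback "username" is indistinguishable from a user who is really called "username", so returning a nullable String might be worth considering if that matters in your app.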

Conclusion

  • Thank you for taking the time out of your day to read this blog post of mine. If you have any questions or concerns please comment below or reach out to me on Twitter.
