A quick way to decode text in an unknown encoding

Sometimes a web browser shows something like this instead of normal text:

— Выполните вход или зарегистрируйтесь

that is, completely unreadable characters.

Or English characters are displayed normally, while every other character is replaced by a percent sign followed by letters and digits:

mat2%3A%20%D0%BD%D0%BE%D0%B2%D0%B0%D1%8F%20%D0%B2%D0%B5%D1%80%D1%81%D0%B8%D1%8F%20%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%BC%D1%8B%20%D0%B4%D0%BB%D1%8F%20%D1%83%D0%B4%D0%B0%D0%BB%D0%B5%D0%BD%D0%B8%D1%8F%20%D0%BC%D0%B5%D1%82%D0%B0%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D1%85

There are also strings made up of uppercase and lowercase letters and digits, sometimes ending with one or two equals signs:

0J/QviDQstCw0YjQtdC80YMg0L3QvtC80LXRgNGDINGN0LvQtdC60YLRgNC+0L3QvdC+0LPQviDQutC+0YjQtdC70YzQutCwwqDQt9Cw0L/Rg9GJ0LXQvdC+wqDRgNCw0YHRgdC70LXQtNC+0LLQsNC90LjQtSEg0JLQviDQuNC30LHQtdC20LDQvdC40LUg0LjRgdGH0LXQt9C90L7QstC10L3QuNGPwqDQvNCw0YLQtdGA0LjQsNC70YzQvdGL0YXCoNGB0YDQtdC00YHRgtCyLCDQv9GA0L7RgdC40LzCoNGB0YDQvtGH0L3QvsKg0LfQsNCy0LXRgNGI0LjRgtGMwqDQuNC00LXQvdGC0LjRhNC40LrQsNGG0LjRjg0KDQrQl9Cw0LLQtdGA0YjQuNGC0YzCoNCy0LXRgNC40YTQuNC60LDRhtC40Y4=

Sometimes you come across text in which a backslash and an x, followed by two hexadecimal digits, repeat over and over:

\xE2\x80\x94\x20\xD0\x92\xD1\x8B\xD0\xBF\xD0\xBE\xD0\xBB\xD0\xBD\xD0\xB8\xD1\x82\xD0\xB5\x20\xD0\xB2\xD1\x85\xD0\xBE\xD0\xB4\x20\xD0\xB8\xD0\xBB\xD0\xB8\x20\xD0\xB7\xD0\xB0\xD1\x80\xD0\xB5\xD0\xB3\xD0\xB8\xD1\x81\xD1\x82\xD1\x80\xD0\xB8\xD1\x80\xD1\x83\xD0\xB9\xD1\x82\xD0\xB5\xD1\x81\xD1\x8C

To decode such a string quickly, even when you do not know how it is encoded, use a free online service for detecting and converting encodings. This service is a copy of http://0xcc.net/jsescape/.

The principle of operation is very simple – you insert an unreadable string into the window, and the service tries to convert it into each of the encodings it supports. That is, if you see readable text in the Plain Text field, then your string has been successfully decoded. I’ll try to understand the meaning of — Выполните вход или зарегистрируйтесь:

Got it! This line means:

— Выполните вход или зарегистрируйтесь

Now let's figure out the line:

mat2%3A%20%D0%BD%D0%BE%D0%B2%D0%B0%D1%8F%20%D0%B2%D0%B5%D1%80%D1%81%D0%B8%D1%8F%20%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%BC%D1%8B%20%D0%B4%D0%BB%D1%8F%20%D1%83%D0%B4%D0%B0%D0%BB%D0%B5%D0%BD%D0%B8%D1%8F%20%D0%BC%D0%B5%D1%82%D0%B0%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D1%85

Its meaning turned out to be:

mat2: новая версия программы для удаления метаданных
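The same decoding can be reproduced offline. Here is a minimal sketch in Python, assuming the percent-encoded bytes are UTF-8 (which is the case for this string). It uses `unquote` from the standard library's `urllib.parse`:

```python
from urllib.parse import unquote

# The percent-encoded string from the example above, split for readability.
encoded = (
    "mat2%3A%20%D0%BD%D0%BE%D0%B2%D0%B0%D1%8F%20"
    "%D0%B2%D0%B5%D1%80%D1%81%D0%B8%D1%8F%20"
    "%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%BC%D1%8B%20"
    "%D0%B4%D0%BB%D1%8F%20"
    "%D1%83%D0%B4%D0%B0%D0%BB%D0%B5%D0%BD%D0%B8%D1%8F%20"
    "%D0%BC%D0%B5%D1%82%D0%B0%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D1%85"
)

# unquote() turns each %XX into one byte and decodes the bytes as UTF-8.
decoded = unquote(encoded, encoding="utf-8")
print(decoded)
```

If the bytes were in another encoding (for example Windows-1251), the `encoding` argument would have to be changed accordingly.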

Now look at the message from the scam email:

0J/QviDQstCw0YjQtdC80YMg0L3QvtC80LXRgNGDINGN0LvQtdC60YLRgNC+0L3QvdC+0LPQviDQutC+0YjQtdC70YzQutCwwqDQt9Cw0L/Rg9GJ0LXQvdC+wqDRgNCw0YHRgdC70LXQtNC+0LLQsNC90LjQtSEg0JLQviDQuNC30LHQtdC20LDQvdC40LUg0LjRgdGH0LXQt9C90L7QstC10L3QuNGPwqDQvNCw0YLQtdGA0LjQsNC70YzQvdGL0YXCoNGB0YDQtdC00YHRgtCyLCDQv9GA0L7RgdC40LzCoNGB0YDQvtGH0L3QvsKg0LfQsNCy0LXRgNGI0LjRgtGMwqDQuNC00LXQvdGC0LjRhNC40LrQsNGG0LjRjg0KDQrQl9Cw0LLQtdGA0YjQuNGC0YzCoNCy0LXRgNC40YTQuNC60LDRhtC40Y4=

How to determine the encoding

Some frequently encountered encodings are easy to recognize by eye. Recognizing the encoding at a glance can greatly speed up decoding a string, or quickly explain why the text is displayed this way.

URL encoding

Let's start with an encoding everyone has seen. In the browser address bar or on websites, you may have noticed addresses like this one: https://kali.org.ru/%d0%b4%d1%80%d1%83%d0%b3%d0%b8%d0%b5-it-%d1%82%d0%b5%d0%bc%d1%8b/%d0%ba%d0%b0%d0%ba-%d0%bd%d0%b0%d1%87%d0%b0%d1%82%d1%8c-%d0%b7%d0%bd%d0%b0%d0%ba%d0%be%d0%bc%d1%81%d1%82%d0%b2%d0%be-%d1%81-%d0%ba%d0%be%d0%bc%d0%b0%d0%bd%d0%b4%d0%b0%d0%bc%d0%b8-linux-cygwin

The URL standard uses the US-ASCII character set. This is a serious limitation, since only Latin letters, digits and a few punctuation marks are allowed. All other characters must be encoded: for example, Cyrillic letters, letters with diacritics, ligatures and hieroglyphs. Each such character is converted to bytes (normally its UTF-8 representation), and every byte is written as a percent sign followed by two hexadecimal digits. The encoding is described in RFC 3986 and is called URL encoding, URL-encoded, or percent-encoding.

Data from web forms when Content-Type is specified as application/x-www-form-urlencoded is also passed in URL encoding.
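To make the rule concrete, here is a small Python sketch that percent-encodes a string by hand and compares the result with the standard library. The helper name `percent_encode` and the form field name `q` are made up for illustration. The set of characters left unescaped is the "unreserved" set from RFC 3986.

```python
import string
from urllib.parse import parse_qs, quote

# RFC 3986 "unreserved" characters, which never need escaping.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def percent_encode(text: str) -> str:
    """Percent-encode every byte of the UTF-8 form that is not unreserved."""
    out = []
    for byte in text.encode("utf-8"):
        char = chr(byte)
        out.append(char if char in UNRESERVED else f"%{byte:02X}")
    return "".join(out)

phrase = "новая версия"
manual = percent_encode(phrase)
library = quote(phrase, safe="")   # standard-library equivalent
print(manual)

# Form bodies use the same scheme, except that a space may be sent as "+".
form_body = "q=" + manual.replace("%20", "+")
fields = parse_qs(form_body)
print(fields)
```

`parse_qs` handles both the `%XX` sequences and the `+` convention, which is why form data is better decoded with it (or with `unquote_plus`) than with plain `unquote`.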

Base64

You have almost certainly seen messages in this encoding: they consist of uppercase and lowercase Latin letters and digits. At the end, there can be one or two equals signs:

0J7QtNC90LDQttC00YssINCyINGB0YLRg9C00ZHQvdGD0Y4g0LfQuNC80L3RjtGOINC/0L7RgNGDLCDRjyDQuNC3INC70LXRgdGDINCy0YvRiNC10LsuINCR0YvQuyDRgdC40LvRjNC90YvQuSDQvNC+0YDQvtC3Lg==

In any case, you almost certainly use this encoding nearly every day without knowing it, since Base64 is very often used in e-mail messages, especially those with attached files (photos, documents, etc.).

Base64 is a standard for encoding binary data using only 64 ASCII characters. The alphabet consists of the alphanumeric characters A-Z, a-z and 0-9 (62 characters) plus 2 additional characters that depend on the variant (the standard alphabet uses + and /). Every 3 source bytes are encoded as 4 characters, so the data grows by one third. The equals signs at the end are padding, used when the input length is not a multiple of 3.
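The 3-bytes-to-4-characters rule can be sketched in a few lines of Python. This is a hand-written illustration (the function name `base64_encode` is made up), assuming the standard alphabet with + and /. In practice you would call `base64.b64encode` from the standard library.

```python
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, then "+" and "/".
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def base64_encode(data: bytes) -> str:
    """Encode bytes by hand: 3 bytes (24 bits) become 4 six-bit characters."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Pad the last chunk with zero bytes so it always holds 24 bits.
        bits = int.from_bytes(chunk + b"\x00" * (3 - len(chunk)), "big")
        chars = [ALPHABET[(bits >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Only len(chunk) + 1 characters carry data; the rest become "=".
        keep = len(chunk) + 1
        out.append("".join(chars[:keep]) + "=" * (4 - keep))
    return "".join(out)

sample = "Привет".encode("utf-8")          # 12 bytes
encoded = base64_encode(sample)
print(encoded, len(sample), len(encoded))  # 12 bytes in, 16 characters out
```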

It is widely used in e-mail to represent binary files in the text of a message (as a content transfer encoding).

The service mentioned above can decode from Base64 as well as encode to Base64, but there is a catch: a long Base64 string in an e-mail is often split into lines of equal length (for convenience). For this service, you need to remove those line breaks so that the input is on a single line; otherwise everything after the first newline character is decoded incorrectly.
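When decoding locally, the line breaks can be stripped in code. Here is a minimal Python sketch, assuming the decoded bytes are UTF-8 text; the function name `decode_base64_text` is made up for illustration:

```python
import base64

def decode_base64_text(text: str, encoding: str = "utf-8") -> str:
    """Join wrapped Base64 lines, then decode the bytes as text."""
    compact = "".join(text.split())          # drop newlines and other whitespace
    raw = base64.b64decode(compact, validate=True)
    return raw.decode(encoding)

# The Base64 sample from this section, split over two lines the way
# a mail client might wrap it.
wrapped = """0J7QtNC90LDQttC00YssINCyINGB0YLRg9C00ZHQvdGD0Y4g0LfQuNC80L3RjtGOINC/0L7RgNGDLCDR
jyDQuNC3INC70LXRgdGDINCy0YvRiNC10LsuINCR0YvQuyDRgdC40LvRjNC90YvQuSDQvNC+0YDQvtC3Lg=="""
print(decode_base64_text(wrapped))
```

With its default settings, Python's `b64decode` silently skips characters outside the Base64 alphabet, so it tolerates line breaks on its own. `validate=True` is used here to make the clean-up step explicit and to reject any other stray characters.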

UTF-8 encoding

Incorrectly displayed UTF-8 text looks like the capital letters N and D with extra strokes (Ñ and Ð), with fractions such as ¾ mixed in:

— Выполните вход или зарегистрируйтесь

In this case, UTF-8 text has been interpreted as ISO-8859-1 or CP1258. With the service mentioned above, such strings can be decoded by pasting them into the Quoted-printable or URL field.

UTF-8 text interpreted as ANSI (on a Russian-language Windows system, Windows-1251) looks like a string of the capital letters Р and С, each followed by another letter or symbol:

добавить чёрный список

Escape sequences

Escape sequences are especially common in program source code. If you want to know what a string written this way means, paste it into the field that matches its format:

  • \uXXXX - backslash and u followed by four hexadecimal digits
  • \UXXXXXXXX - backslash and capital U followed by eight hexadecimal digits
  • &#DDDD; - ampersand and number sign followed by a decimal number and a semicolon
  • &#xXXXX; - ampersand, number sign and x followed by a hexadecimal number and a semicolon
  • \xXX - backslash and x followed by two hexadecimal digits
  • \OOO - backslash followed by an octal number of up to three digits (OOO stands for the octal digits, not the letter O)

Such strings are used where text in a national alphabet might otherwise be garbled (for example, if the browser misdetects the encoding of the web page):

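<script>
	// In JavaScript, \xXX names a single code point, not a UTF-8 byte,
	// so Cyrillic text is written with \uXXXX escapes: this shows "Привет".
	alert("\u041F\u0440\u0438\u0432\u0435\u0442");
</script>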
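The formats listed above can also be decoded without the web page. Here is a Python sketch with made-up helper names, assuming that `\xXX` sequences are UTF-8 bytes (as in the examples at the top of this article). HTML references are handled by `html.unescape` from the standard library.

```python
import html
import re

def decode_hex_escapes(text: str, encoding: str = "utf-8") -> str:
    r"""Treat each run of \xXX escapes as raw bytes in the given encoding."""
    def run_to_text(match):
        raw = bytes.fromhex(match.group(0).replace("\\x", ""))
        return raw.decode(encoding)
    return re.sub(r"(?:\\x[0-9A-Fa-f]{2})+", run_to_text, text)

def decode_unicode_escapes(text: str) -> str:
    r"""Replace \uXXXX and \UXXXXXXXX escapes with the code points they name."""
    pattern = r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})"
    return re.sub(pattern, lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

print(decode_hex_escapes(r"\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82"))
print(decode_unicode_escapes(r"\u041F\u0440\u0438\u0432\u0435\u0442"))
print(html.unescape("&#1055;&#x440;&#x438;&#1074;&#1077;&#x442;"))
```

All three calls print the same word, Привет. The `\uXXXX` helper is deliberately simple: it does not join UTF-16 surrogate pairs, so characters outside the Basic Multilingual Plane need the `\UXXXXXXXX` form.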

How to convert to escape sequences

As you might have guessed, the same page can also convert text into escape sequences.
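For completeness, here is the opposite direction in Python. This is a sketch with made-up helper names; the `\xXX` form assumes the bytes should be UTF-8, and ASCII characters are left as they are in the other two forms.

```python
def to_hex_escapes(text: str, encoding: str = "utf-8") -> str:
    r"""Write every byte of the encoded text as \xXX."""
    return "".join(f"\\x{byte:02X}" for byte in text.encode(encoding))

def to_unicode_escapes(text: str) -> str:
    r"""Write non-ASCII characters as \uXXXX (or \UXXXXXXXX beyond U+FFFF)."""
    out = []
    for char in text:
        code = ord(char)
        if code < 0x80:
            out.append(char)
        elif code <= 0xFFFF:
            out.append(f"\\u{code:04X}")
        else:
            out.append(f"\\U{code:08X}")
    return "".join(out)

def to_html_entities(text: str) -> str:
    """Write non-ASCII characters as decimal &#DDDD; references."""
    return "".join(c if ord(c) < 0x80 else f"&#{ord(c)};" for c in text)

word = "Привет"
print(to_hex_escapes(word))
print(to_unicode_escapes(word))
print(to_html_entities(word))
```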

How to change the encoding of a string or document without third-party services

The service shown above does NOT send the entered data to a server: all conversions are done by JavaScript running in your browser. Even so, you may prefer to change encodings without using any website.

In Double Commander, you can change the encoding of a text file after opening it in the viewer (select the file and press F3) or in the editor (F4), and you can also save the file in a different encoding.

Another option, for Linux users, is the command line. With it, you can find out the encoding of an unreadable string and convert it to the correct one.

To detect the encoding of a file:

enca mypoem_draft.txt

To detect the encoding of a string with chardet:

echo $'\xed\xe5 \xed\xe0 \xe9\xe4\xe5\xed\xf3\xea \xe0\xe7\xe0\xed\xed\xfb\xe9\xec\xee\xe4\xf3\xeb\xfc' | chardet
<stdin>: windows-1251 with confidence 0.970067019236

To detect the encoding of a string with enca:

echo $'\xed\xe5 \xed\xe0 \xe9\xe4\xe5\xed\xf3\xea \xe0\xe7\xe0\xed\xed\xfb\xe9\xec\xee\xe4\xf3\xeb\xfc' | enca -L ru
MS-Windows code page 1251
LF line terminators

To detect the encoding of a string with uchardet:

echo $'\xed\xe5 \xed\xe0 \xe9\xe4\xe5\xed\xf3\xea \xe0\xe7\xe0\xed\xed\xfb\xe9\xec\xee\xe4\xf3\xeb\xfc' | uchardet
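The idea behind such detectors can be illustrated with a toy Python sketch: decode the bytes with each candidate encoding and keep the result that looks most like real text. This is NOT how chardet, enca or uchardet work internally; they use far more elaborate statistics. The candidate list, the letter set and the function names are assumptions chosen for short Russian strings.

```python
# Candidate encodings to try; an assumption suited to Russian text.
CANDIDATES = ("utf-8", "cp1251", "koi8-r", "cp866")
# Frequent lowercase Russian letters (a rough, hand-picked set).
COMMON_LETTERS = set("оеаинтсрвл")

def plausibility(text: str) -> float:
    """Share of alphabetic characters that are common Russian letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in COMMON_LETTERS for ch in letters) / len(letters)

def guess_encoding(raw: bytes, candidates=CANDIDATES):
    """Return (encoding, score) for the most plausible candidate."""
    best = (None, -1.0)
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        score = plausibility(text)
        if score > best[1]:
            best = (name, score)
    return best

raw = (b"\xed\xe5 \xed\xe0\xe9\xe4\xe5\xed "
       b"\xf3\xea\xe0\xe7\xe0\xed\xed\xfb\xe9 \xec\xee\xe4\xf3\xeb\xfc")
encoding, score = guess_encoding(raw)
print(encoding, raw.decode(encoding))
```

A heuristic this crude is easily fooled by short or unusual input, which is why the dedicated tools report a confidence value and why enca asks for the language (`-L ru`).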

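To change the encoding of a file with iconv:

enca -i mypoem_draft.txt    # print the detected encoding under its iconv name
cat mypoem_draft.txt        # shows garbled text if the terminal expects UTF-8
iconv -f CP1251 -t UTF-8//TRANSLIT mypoem_draft.txt -o poem.txt
cat poem.txt                # the converted copy should now be readable
enca -i poem.txt            # check the encoding of the converted file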

To change the encoding of a file with enca (the file is converted in place):

enca -x UTF-8 mypoem_draft.txt

Converting a string to the correct encoding with iconv:

echo $'\xed\xe5 \xed\xe0\xe9\xe4\xe5\xed \xf3\xea\xe0\xe7\xe0\xed\xed\xfb\xe9 \xec\xee\xe4\xf3\xeb\xfc' | iconv -f 'Windows-1251'
не найден указанный модуль
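The same conversion can be sketched in Python with nothing but the standard library. The function names are made up for illustration, and the file names are placeholders created in a temporary directory.

```python
import tempfile
from pathlib import Path

def convert_text(raw: bytes, source: str, target: str = "utf-8") -> bytes:
    """Re-encode bytes from the source encoding into the target encoding."""
    return raw.decode(source).encode(target)

def convert_file(src: Path, dst: Path, source: str, target: str = "utf-8") -> None:
    """Read src as source-encoded text and write it to dst in the target encoding."""
    dst.write_bytes(convert_text(src.read_bytes(), source, target))

# The same Windows-1251 bytes as in the iconv example above.
raw = (b"\xed\xe5 \xed\xe0\xe9\xe4\xe5\xed "
       b"\xf3\xea\xe0\xe7\xe0\xed\xed\xfb\xe9 \xec\xee\xe4\xf3\xeb\xfc")
print(convert_text(raw, "cp1251").decode("utf-8"))

# File round trip in a throwaway directory (file names are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "mypoem_draft.txt"
    dst = Path(tmp) / "poem.txt"
    src.write_bytes(raw)
    convert_file(src, dst, "cp1251")
    converted = dst.read_text(encoding="utf-8")
print(converted)
```

Unlike iconv with //TRANSLIT, this sketch raises an error instead of approximating characters that the target encoding cannot represent.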
