Those who know me personally are aware that my background is in mathematics, as I did a degree in Mathematics Honours in 1993-95. During my early days with COBOL and Pascal, I never dealt with Unicode or played with bytes. Even when I moved to C, I was still nowhere near it. Those with a degree in Computer Science tend to stay ahead in understanding these low-level aspects of programming.
Fortunately or unfortunately, I never got the opportunity to work with Unicode. Having said that, it was always there in the back of my mind that one day I would win this battle. A few weeks ago, I had a conversation with a senior member of the Perl community about an issue in one of the CPAN modules that I currently maintain. I was pleasantly surprised to see how comfortable he was playing with Unicode and debugging with hexdump. I decided to get to grips with it rather than delay any further.
But the fact of the matter is that my plate is always full at any given point in time, so adding anything to it is going to tip over something already in the pipeline. As is always the case, I went back to my Twitter handle and asked for help. I did get some nice suggestions.
In this post, I am going to share my experience with you all.
So what exactly is the problem?
My initial blocker was that I was unable to decode the output of hexdump.
So for the purpose of this blog post, I created a sample plain text file, sample.txt.
Now it was time to get the hexdump, which dumped some garbage (to me at least).
Here comes the trouble: how does the output relate to the actual text in the file?
My Twitter friends again helped me with the decoding.
6548 eH
6c6c ll
206f <s>o
6f57 oW
6c72 lr
2064 <s>d
2121 !!
0a21 <l>!
<s> means space and <l> is linefeed. That was all I needed.
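The decoding in the table can be reproduced in a few lines of Perl. This is only a minimal sketch: the list of words is copied from the table above, and `word_to_chars` is a hypothetical helper name I made up for illustration. It splits each four-digit word into its two bytes and emits the low byte first, which is the swap my Twitter friends did by hand.

```perl
#!/usr/bin/perl
use v5.36;

# The 16-bit words exactly as they appear in the hexdump table above.
my @words = qw(6548 6c6c 206f 6f57 6c72 2064 2121 0a21);

# Hypothetical helper: turn one 4-digit hex word into the two characters
# it represents in file order (low byte first, then high byte).
sub word_to_chars ($word) {
    my ($high, $low) = $word =~ /^([0-9a-f]{2})([0-9a-f]{2})$/i
        or die "not a 16-bit hex word: $word";
    return chr(hex $low) . chr(hex $high);
}

my $text = join '', map { word_to_chars($_) } @words;
print $text;
```

Running it prints the text that the table spells out once each pair is swapped back.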
But then why is it the other way around?
My Twitter friends told me, once again, that endianness is behind the order.
For me, this was another blocker that I had to deal with.
With a little searching on Google, I found this post that explains the subject in detail.
In summary, Big Endian (BE) stores data MSbyte first, whereas Little Endian (LE) stores data MSbyte last.
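To see the two layouts side by side, here is a small sketch using Perl's `pack`. The value `0x0A0B0C0D` is just an arbitrary number I picked so that each byte is easy to recognise. The `N` template packs a 32-bit integer in big-endian order and `V` packs it in little-endian order, regardless of the machine the script runs on.

```perl
#!/usr/bin/perl
use v5.36;

# Arbitrary 32-bit value chosen so that every byte is distinct.
my $value = 0x0A0B0C0D;

# 'N' = 32-bit unsigned, big-endian; 'V' = 32-bit unsigned, little-endian.
# unpack 'H*' shows the packed bytes as hex, in storage order.
my $be = unpack 'H*', pack('N', $value);
my $le = unpack 'H*', pack('V', $value);

say "BE: $be";   # 0a0b0c0d (MSbyte 0a comes first)
say "LE: $le";   # 0d0c0b0a (MSbyte 0a comes last)
```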
Now what is MSbyte?
Endianness is most commonly defined in terms of the Most Significant Byte (MSbyte).
The byte holding the highest-order position, the one that contributes most to the value, is called the MSbyte. Similarly, the bit holding the highest-order position is called the MSbit.
The byte holding the lowest-order position is called the LSbyte. Similarly, the bit holding the lowest-order position is called the LSbit.
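As a small illustration of these terms, the sketch below takes the first word from the hexdump table, `0x6548`, and pulls out its MSbyte, LSbyte, MSbit and LSbit with shifts and masks. The variable names are mine and chosen only for this example.

```perl
#!/usr/bin/perl
use v5.36;

# First 16-bit word from the hexdump table.
my $word = 0x6548;

my $msbyte = ($word >> 8) & 0xFF;   # highest-order byte: 0x65 ('e')
my $lsbyte = $word & 0xFF;          # lowest-order byte:  0x48 ('H')
my $msbit  = ($word >> 15) & 1;     # highest-order bit of the 16-bit word
my $lsbit  = $word & 1;             # lowest-order bit

printf "MSbyte: 0x%02x (%s)\n", $msbyte, chr $msbyte;
printf "LSbyte: 0x%02x (%s)\n", $lsbyte, chr $lsbyte;
say "MSbit: $msbit, LSbit: $lsbit";
```

On a little-endian machine the LSbyte (`H`) is the one stored first, which is why the file starts with `H` even though the word reads `6548`.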
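Going back to the original question: why is the hexdump output not in the expected order?
The answer is that my machine is built as LE. By default, hexdump displays the file as two-byte words in the machine's own byte order, so on an LE machine the first byte of each pair in the file shows up as the second half of the word.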
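The effect can be simulated in Perl without depending on the machine at all. In this sketch the string in `$bytes` is my reconstruction of the sample text from the decoding table, so treat it as an assumption. The `v` template reads the bytes as 16-bit little-endian words and `n` reads the very same bytes as 16-bit big-endian words.

```perl
#!/usr/bin/perl
use v5.36;

# Assumed file content, reconstructed from the decoding table.
my $bytes = "Hello World !!!\n";

# 'v' = 16-bit little-endian words, 'n' = 16-bit big-endian words.
my @le = map { sprintf '%04x', $_ } unpack 'v*', $bytes;
my @be = map { sprintf '%04x', $_ } unpack 'n*', $bytes;

say "LE words: @le";
say "BE words: @be";
```

The LE line matches the words in the table, while the BE line reads in the same order as the text.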
So how do I know which endianness my system is built on?
I started looking for a Perl way of figuring it out.
In no time, I found this solution. The original had a typo, which I fixed here in detect-endian-ness.pl.
#!/usr/bin/perl
use v5.36;

# Number of bytes in a native unsigned int ('I'), typically 4.
my @b = unpack('C*', pack('I', 0));
my $sizeof_int = scalar(@b);

# Byte values 1..N; for N = 4 the integer below is 0x04030201.
my @c = (1..$sizeof_int);
my $i = pack('I', hex('0x0'.join('0', reverse @c)));

# Expected byte layouts: MSbyte first (big) and LSbyte first (little).
my $big = pack('C'.$sizeof_int, reverse @c);
my $lit = pack('C'.$sizeof_int, @c);

if ( substr($i, 0, $sizeof_int) eq $big ) {
    say 'big';
}
elsif ( substr($i, 0, $sizeof_int) eq $lit ) {
    say 'little';
}
else {
    say 'strange';
}
Time to find out the endianness of my current system.
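For comparison, here is a shorter sketch of the same check. It is not the solution I found, just an alternative under the assumption that comparing a native 16-bit pack (`S`) against the explicitly little-endian (`v`) and big-endian (`n`) templates is enough. The sub name `endianness` is hypothetical. It also prints the `byteorder` entry from the core `Config` module, which records the byte order Perl was built with.

```perl
#!/usr/bin/perl
use v5.36;
use Config;

# Hypothetical helper: compare the native layout of the 16-bit value 1
# with its explicit little-endian and big-endian layouts.
sub endianness () {
    my $native = pack('S', 1);
    return 'little' if $native eq pack('v', 1);
    return 'big'    if $native eq pack('n', 1);
    return 'strange';
}

say endianness();

# Byte order recorded when this perl was built, e.g. a string that
# starts with 1234 on little-endian builds and 4321 or 8765 on big-endian.
say "byteorder: $Config{byteorder}";
```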
Before I end this discussion, I would like to share another post that explains the Byte Order Mark (BOM), if you are interested.
Last but not least, I would like to thank everyone who helped with this.
That’s it for now.