The thing with LaTeX, Unicode, PDF, umlauts and OS-X

It is a well-known criticism of (PDF)LaTeX in the German-speaking world that LaTeX tends to mess up umlauts (äöü) by composing them from the base letter (aou) followed by a diacritic character. The causes are buried somewhere deep in the guts of LaTeX/TeX (at least for me as a dumb user). However, this can be avoided by using the "T1" font encoding. So it is not a general problem, and I kept promoting LaTeX to my colleagues as the way to go for scientific writing (theses, papers, etc.).

Then I was struck when I found that copying and pasting text with diacritics (in fact, umlauts in the German title of a reference) from my thesis PDF, opened in Preview on my Mac, into my text editor (Sublime Text) produced this:

Clearly wrongly encoded diacritics, which can become pretty annoying if they somehow end up in your BibTeX library… Apparently the critics were right and my reputation was damaged. So I began to investigate the problem and searched the web for hours for a solution, but everything seemed to be completely right in my LaTeX source. It should simply have worked.

Then I asked a colleague to compile a test document on Windows with MiKTeX instead of TeX Live; perhaps something in my TeX distribution was broken. Surprisingly, under Windows everything was OK, even in the document compiled by me. So I opened the PDF with Acrobat Reader and got this result:

The problem is somehow connected to the PDF reader, not the PDF document itself. Further investigation revealed that Preview and Skim under OS-X mess up all three umlauts (äöü), while Acrobat Reader destroys only the ü. Under Windows, none of the characters are affected. I looked into the bytes of the copied text and, after refreshing my knowledge of UTF-8 (thanks, Wikipedia), I was able to deduce that the text copied under Windows in fact contains the correct Unicode characters (C3BC for "ü" in UTF-8), while the text copied under OS-X contains the composite characters.
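To make the byte comparison concrete, here is a small Python sketch of what such an inspection might look like. It assumes the "composite" form is the base letter followed by the combining diaeresis U+0308; I did not record the exact code points from my clipboard, so treat that as an assumption. The helper name utf8_hex is made up for this example.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as a lowercase hex string."""
    return text.encode("utf-8").hex()

# "ü" as one precomposed code point (U+00FC), as pasted under Windows.
precomposed = "\u00fc"

# "u" followed by COMBINING DIAERESIS (U+0308): an assumed stand-in
# for the composite form pasted under OS-X.
decomposed = "u\u0308"

print(utf8_hex(precomposed))       # c3bc
print(utf8_hex(decomposed))        # 75cc88
print(precomposed == decomposed)   # False: same glyph, different code points
```

Both strings render as the same glyph in most editors, which is why the difference only shows up once you look at the bytes.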

Under OS-X the exact same phenomenon occurs in PDFs produced by Word or LibreOffice, which makes it pretty clear that this is a general problem under OS-X and not connected to LaTeX. I assume it has something to do with Unicode normalization, the exact data type (plain text or rich text) in the OS-X clipboard, and how the different applications handle this data. For example, if the text is copied to Word and then to Sublime, it is not broken.
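If Unicode normalization really is the culprit, pasted text could be repaired before it lands in a BibTeX file. The following is a minimal sketch using Python's standard unicodedata module, not something I have wired into my own workflow. The function names and the sample title are hypothetical, and it only helps if the pasted text uses combining marks such as U+0308; a spacing diaeresis (U+00A8) next to the letter would not be composed by this.

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Compose base letter + combining mark pairs into single code points (NFC)."""
    return unicodedata.normalize("NFC", text)

def has_combining_marks(text: str) -> bool:
    """Return True if any code point in text is a combining mark."""
    return any(unicodedata.combining(ch) for ch in text)

# Hypothetical title as it might arrive from the clipboard, with decomposed umlauts.
pasted = "Einfu\u0308hrung in die Stro\u0308mungslehre"
cleaned = to_nfc(pasted)

print(has_combining_marks(pasted))   # decomposed input
print(has_combining_marks(cleaned))  # after NFC normalization
print(len(pasted), len(cleaned))     # cleaned is shorter by two code points
```

Running a check like has_combining_marks over a .bib file before committing it would at least flag entries that were pasted in decomposed form.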

However, LaTeX is not to blame in this case.