Friday, August 30, 2013

Turkish kinship vocabulary: Amca, Dayı, Enişte, Hala, Teyze, Yenge

Turkish has three commonly used words for “uncle”: Amca, Dayı, Enişte; and three for “aunt”: Hala, Teyze, Yenge. Here is how to use them.

Turkish words for father's (“Baba”) and mother's (“Anne”) siblings and their spouses.

Father's brother is “Amca”, mother's brother is “Dayı”. An uncle not related by blood is “Enişte”. Father's sister is “Hala”, mother's sister is “Teyze”. An aunt not related by blood is “Yenge”.

In all but formal circumstances, it is common to address considerably older strangers as “Amca” (uncle) or “Teyze” (aunt). However, I was once referred to as “Enişte” — probably due to the fact that I'm a foreigner and therefore I belong to the large family of all Turks by marriage, not by blood.

Friday, August 9, 2013

Tagging CS journals and conferences with arXiv subject areas

I have recently launched DBLPfeeds, a simple service providing RSS feeds with the latest papers from over a thousand DBLP-indexed computer science journals and conferences (one RSS feed per journal/conference). I had some positive feedback, which is of course very nice. Paul Groth suggested to group feeds into categories, e.g. journals and conferences related to AI. I liked this idea (a great feature), but I lacked the required data, i.e., the tags (e.g. “Information Processing and Management is about Digital Libraries”, or “ACM Transactions on (Office) Information Systems is about Information Retrieval”). Here is how I solved the problem.

Data at hand

DBLPfeeds are generated using XML dumps of DBLP, which are available under ODC-BY 1.0 license. For each article indexed in DBLP I have its title, authors, link to full text, publication year and publication venue (conference or journal). This is exactly what I need to generate RSS feeds for each venue.

There is another valuable source of information: arXiv. Many computer scientists deposit their preprints there, before the papers appear in journals or conference proceedings. Articles deposited in arXiv are classify using tags such as cs.AI (for Artificial Intelligence), cs.CC (for Computational Complexity), or cs.DL (for Digital Libraries). For a comprehensive description of the tags go here. Everyone has convenient access to metadata of articles deposited in arXiv via OAI-PMH protocol.

To sum up, there are two openly accessible sources of data (DBLP and arXiv), which – combined – contain the information I need.

Merging

I have harvested arXiv using OAI-PMH (metadataPrefix = arXiv, set = cs), which produced approx. 58,000 records. From each record I took the title and the categories starting with “cs.”  Next, I combined that with title and venue fields in approx. 2,100,000 records taken from DBLP.

To match the titles, I lowercased the strings and removed all non-alphanumeric characters (I also removed whitespace characters). Thus for example “Proof-Pattern Recognition in ACL2” became “proofpatternrecognitioninacl2”. I calculated the number of times when a given venue co-occurs with a given arXiv category. The table is available at figshare (CC-0 license).

Finally, I had to select the most representative venues (journals, conferences) for a given tag. I have arbitrarily chosen the following criterion: a given tag will bee assigned to a given venue if at least 30% and at least 5 of the papers at the venue have the tag.

Summary

Firstly, a trivial observation: Open Access is great. By combining two publicly available data sources I was able to add a nice feature to DBLPfeeds.

Now about methodology: I took a quick and dirty approach, which leaves a lot of room for improvement. The papers are joined by looking at titles only (author names are ignored), so one can easily imagine both false positives and false negatives. I used totally arbitrary criteria for assigning tags, but the complete data set it there, so feel encouraged to find better heuristic.

One more obvious shortcoming: if a journal does not permit self-archiving, preprints of its papers will not appear on arXiv and, consequently, it will not be tagged with arXiv subject area codes.  Oh well, that's another small reason to go green, I guess ;)

Open code, open data

The code is publicly available at GitHub on BSD license (take a look at code/tags.sh) while the table of co-occurrences is publicly available at figshare on CC-0 license.

Tuesday, November 13, 2012

Writing research papers can be a tiny bit easier

Recently I came across two useful web services that make it a bit easier to write research papers: Netspeak and Detexify².

As a non-native English speaker, I often have problems with choosing the right words and I used to ask Google to help me.  For example, I would formulate a query "our research * that", look for the most frequent words in the search results, and issue additional queries like "our research indicates that" and "our research shows that" to count hits.
With Netspeak, it is easier, I simply write: our research ? that and I instantly get the most popular phrases with their counts. Netspeak can also find the most popular synonyms of a given word in a given context, or find the most frequent order of given words:



Detexify² solves another small inconvenience: when I didn't remember the LaTeX instruction for a less-common math. symbol, I needed to consult looong lists of symbols and corresponding instructions.  Now I can simply draw the symbol and Detexify² will tell me the instruction and the package which I need to use!


Interestingly, the back-end is written in Haskell, and its source code is available on GitHub.

Wednesday, January 25, 2012

Typoglycemia in Haskell

Can you decipher the following sentences?
All hmuan biegns are bron fere and euqal in dgiinty and rgiths. Tehy are ednwoed wtih raeosn and cnocseicne and sohlud act tworads one aonhter in a sipirt of borhtreohod.
It's the first article of the Universal Declaration of Human Rights, with letters in each word rearranged in a certain way.

Typoglycemia

In the above sentences the first and the last letter of each word stay in their places, while the remaining letters are scrambled. Such text is surprisingly easy to read, especially when reading at a high speed. It seems that our brains need the first and the last letter as anchors, but order of the letters in between is not so important. This feature is dubbed typoglycemia.

Code in Haskell

To play with the above observation, I wrote a short Haskell program which obfuscates an input text as explained above.

import Data.Char (isLetter)

mapBlocks :: (a -> Bool) -> ([a] -> [a]) -> [a] -> [a]
mapBlocks _ _ [] = []
mapBlocks p f xs = f as ++ bs ++ mapBlocks p f zs where
  (as, ys) = span p xs
  (bs, zs) = break p ys

obfuscateSwap :: [a] -> [a]
obfuscateSwap [] = []
obfuscateSwap (c:cs) = c : ob cs where
  ob (a:b:c:ds) = b : a : ob (c : ds)
  ob xs         = xs

main = interact $ mapBlocks isLetter obfuscateSwap
The first function, mapBlocks, takes:
  • a function a -> Bool which classifies each element in a list as either "interesting" or not,
  • another function [a] -> [a] which does something with a contiguous block of "interesting" elements,
  • and a list of elements [a].
The result is a list of elements [a] in which all the blocks of "interesting" elements are transformed using the above-mentioned function.

The second function, obfuscateSwap, takes a list of elements and shuffles all the elements except the first and the last by swapping adjacent elements. For example: obfuscateSwap "towards" == "tworads". Here's a different implementation, which reverses the internal elements (e.g. obfuscateReverse "towards" == "tdrawos"):

obfuscateReverse :: [a] -> [a]
obfuscateReverse xs | length xs < 4 = xs
obfuscateReverse xs | otherwise     = [head xs]
    ++ (reverse . init . tail $ xs) ++ [last xs]

Notes

A few things to notice:
  • Both obfuscateSwap . obfuscateSwap and obfuscateReverse . obfuscateReverse are equivalent to id, so we can use the same function to "encrypt" and "decrypt" a message.
  • I like the type of mapBlocks: there is a nice separation of work, mapBlocks processes an input list but outsources obfuscation to another function. Also, it's not limited to strings. Only after partial application of isLetter we get a function operating on strings.
  • I don't like the implementation of mapBlocks: it's not tail recursive and we will run out of heap for larger inputs. It should definitely be rewritten. A straightforward tail recursive version would feature an additional accumulator.
  • Results of obfuscateReverse are harder to decipher than results of obfuscateSwap.
  • I'm thinking about (ab)using Parsec for an alternative implementation of mapBlocks.

Appendix

To see the difference between the two versions of obfuscate, read a bit longer text, the second article of UDHR. Here's scrambled using obfuscateSwap:
Eevyrnoe is etntield to all the rgiths and ferdemos set froth in tihs Dcealariton, wtiohut dsiitcniton of any knid, scuh as rcae, clouor, sex, lnauggae, rlegioin, ploticial or ohter oipinon, ntaoianl or scoail oirign, poreptry, brith or ohter satuts. Fruhtreomre, no dsiitcniton sahll be mdae on the bsais of the ploticial, jrusiidtcoianl or itnreanitnoal satuts of the cuotnry or treirotry to wihch a preosn blenogs, wehhter it be idnpeneednt, turst, non-slef-gvoreinng or udner any ohter lmititaoin of svoreiengty.
Here's scrambled using obfuscateReverse:
Enoyreve is eeltitnd to all the rthgis and fmodeers set ftroh in tihs Doitaralcen, wuohtit doitcnitsin of any knid, scuh as rcae, cuolor, sex, lgaugnae, roigilen, pacitilol or oehtr ooinipn, nanoital or saicol oigirn, ptrepory, btrih or oehtr sutats. Fromrehtrue, no doitcnitsin slahl be mdae on the bisas of the pacitilol, janoitcidsirul or ianoitanretnl sutats of the crtnuoy or trotirrey to wcihh a posren bgnoles, wehtehr it be inednepednt, tsurt, non-slef-gninrevog or uednr any oehtr loitatimin of stngierevoy.

Wednesday, January 11, 2012

Assorted curiosities: Geography

Fun facts learned while clicking through Wikipedia:
  • Treasure Island in Ontario, Canada is probably the largest island in a lake in an island in a lake.
  • Liechtenstein and Uzbekistan are doubly landlocked countries, i.e., all the neighbouring countries are landlocked.
  • Republic of Kalmykia is the only predominantly Buddhist region of Europe.
  • The Door to Hell is a 70 m (230 ft) wide hole near the village of Derweze in Turkmenistan, filled with natural gas, which has been burning since 1971.
The Door to Hell near Derweze, Turkmenistan

Wednesday, December 28, 2011

My first use of the state monad

While learning Haskell, I was looking for a concise implementation of a function which "reshapes" a list into a matrix.  Given the number of rows r, the number of columns c, and a list vs, the function should take r*c values from the list and create a r by c matrix out of them.  Here's the type:
toMatrix :: Int -> Int -> [a] -> [[a]]

First solution

First I wrote  a simpler function that would split a list into chunks of a given size, like this:
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf c vs = h : (chunksOf c t)
    where (h, t) = splitAt c vs
Using the above, toMatrix could be implemented this way:
toMatrix r c = chunksOf c . take (r*c)
I had a feeling that a function like chunksOf should be already present somewhere in the standard library, so I asked Hoogle, but to no avail. There was chunksOf in Data.Text, but it operated on Text only (I retroactively named my function after the one in Data.Text).  However, Hoogle returned replicateM as well…

Second solution

… and I realized I could use it with the state monad to implement toMatrix.  The state could contain the list of values yet to be consumed, and the action to be replicated could be chopping off c values from the list:
splitOnce :: Int -> State [a] [a]
splitOnce c = do
    s <- get
    let (h, t) = splitAt c s
    put t
    return h
After a while I realized that the same function could be written in a much more concise form:
splitOnce' :: Int -> State [a] [a]
splitOnce' = state . splitAt
The solution was, therefore:
toMatrix r = evalState . replicateM r . state . splitAt

Summary

The second version is good enough for me and as a bonus it helped me understand the state monad. Note that the two implementations of toMatrix are not equivalent, as they handle lists shorter than r*c in different ways.  Future work: find a concise and preferably point-free implementation of chunksOf.

Update (2013-01-08)

This answer on StackOverflow contains a very nice implementation of chunksOf.