Bitcoin Seeder DoS Vulnerability

The DNS parser of the Bitcoin Seeder was vulnerable to a denial-of-service attack: a specially crafted DNS request could trigger infinitely recursive function calls, leading to a stack overflow. See the exploit. The vulnerability was fixed in commit 11e935b.
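The bug class is a classic one for DNS parsers: name decompression that follows compression pointers recursively without a loop or depth check. A minimal, deliberately vulnerable sketch in Python (illustrative only, not the seeder's actual C++ code):

```python
def parse_name(packet: bytes, offset: int) -> str:
    """Parse a DNS name, naively following compression pointers.

    Deliberately vulnerable sketch: a pointer chain that loops back on
    itself causes unbounded recursion and eventually a stack overflow.
    """
    labels = []
    while True:
        length = packet[offset]
        if length == 0:
            return ".".join(labels)
        if length & 0xC0 == 0xC0:  # compression pointer (RFC 1035, 4.1.4)
            target = ((length & 0x3F) << 8) | packet[offset + 1]
            # Missing loop/depth check: a pointer that (directly or
            # indirectly) points back here recurses forever.
            return ".".join(labels + [parse_name(packet, target)])
        labels.append(packet[offset + 1 : offset + 1 + length].decode())
        offset += 1 + length
```

A two-byte packet whose pointer targets its own offset is enough to exhaust the stack.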

Guessing Bitcoin’s P2P Connections

The paper Deanonymisation of clients in Bitcoin P2P network (2014) by Biryukov, Khovratovich and Pustogarov (BKP), which describes an attack on Bitcoin Core clients, has recently sparked some discussion. The main idea of the paper is to first obtain the set of nodes $E_v$ to which a victim $v$ is directly connected (“entry nodes”). Second, for each transaction $t$, record the $10$ nodes $P_t$ that first propagate $t$. The authors show experimentally that if $|E_v \cap P_t|$ is larger than $3$, there is a high chance that $t$ actually originated from $v$. However, both attack stages essentially require a Sybil attack: the attacker has to connect to many nodes many times. In their experiments, 'many' meant 50 connections to each of the ~250 full nodes in the test network. Such an attack is therefore powerful, but certainly won't go undetected.
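The paper's attribution criterion reduces to a set intersection; a one-line sketch (the function name is mine, the threshold of 3 and the 10 first propagators come from the paper):

```python
def likely_origin(entry_nodes: set, first_propagators: set,
                  threshold: int = 3) -> bool:
    """BKP heuristic: attribute transaction t to victim v when more than
    `threshold` of t's 10 first propagators are among v's entry nodes."""
    return len(entry_nodes & first_propagators) > threshold
```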

In this post I show that the first stage of the attack, namely learning the nodes a victim is directly connected to, can be carried out with a single connection to the victim. Beyond BKP's attack, knowing all outbound peers of a client could significantly increase the success probability of a double spend. Note that all experiments are based on Bitcoin Core 0.9.4, but 0.10.0 shows the same behavior.

TLDR The attacker can reliably guess all outbound connections of a victim by selecting from the victim's known addresses based on the addresses' timestamps.
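A sketch of the selection step, assuming the attacker has already harvested the victim's known addresses and their timestamps via `getaddr`/`addr` messages (function and parameter names are illustrative; the default of 8 reflects Bitcoin Core's outbound connection count):

```python
import time

def guess_outbound_peers(known_addrs, window=100 * 60, n=8):
    """Guess a victim's outbound peers from its advertised addresses.

    `known_addrs` maps address -> last-known timestamp, as collected from
    the victim's `addr` messages. Pre-0.10.1 bitcoind kept refreshing the
    timestamp of a connected peer's address, so the freshest entries tend
    to be the victim's live connections.
    """
    now = time.time()
    fresh = [a for a, ts in known_addrs.items() if now - ts < window]
    # Most recently refreshed first.
    return sorted(fresh, key=lambda a: known_addrs[a], reverse=True)[:n]
```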

Update A fix has been merged into bitcoind: the timestamp is no longer updated when a message is received from a connected peer, but only when the peer disconnects. The fix was released in Bitcoin Core 0.10.1.

Privacy in BitcoinJ

As part of my epic quest to apply supervised machine learning to the blockchain in order to discover transaction patterns, I reviewed various wallet implementations in the hope of finding privacy leaks.

tl;dr If you are using a wallet built upon BitcoinJ, such as Android Wallet, MultiBit or Hive Wallet, you have almost zero wire privacy. An attacker who manages to connect to your wallet can easily figure out all addresses you control. This is unlikely to be fixed in the near future.

Update: Mike Hearn’s reply addresses additional problems and improvements. There was also accompanying discussion on reddit.

FCW Kernel

A Feature Coinciding Walk Kernel classifies nodes in a graph.

This project is a Python port of Coinciding Walk Kernels (CWK) [1] and introduces an extension of the model called Feature-CWK (FCWK). If you want to jump right into some code, see the benchmark.

CW-Kernels deal with the problem of node classification (aka link-based classification), in which a set of features and labels for items is given just as in regular classification. In addition, a node classification algorithm accepts a graph of items and item-item links. It has been shown that the additional information inherent in the network structure improves performance for certain algorithms and datasets.
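As a rough intuition for why walks help (this is not the CWK computation itself, just the underlying idea), propagating label mass along random-walk steps lets unlabelled nodes pick up the labels of their graph neighbourhood:

```python
import numpy as np

def walk_label_probs(adj: np.ndarray, labels: np.ndarray,
                     steps: int = 5) -> np.ndarray:
    """Toy illustration of the intuition behind CWKs.

    `labels` holds a one-hot row per labelled node and zeros for
    unlabelled ones; `adj` is the adjacency matrix (no isolated nodes
    assumed). Label mass diffuses along random-walk transitions.
    """
    # Row-normalise the adjacency matrix into a transition matrix.
    P = adj / adj.sum(axis=1, keepdims=True)
    probs = labels.astype(float)
    for _ in range(steps):
        probs = P @ probs
    return probs
```

On a path graph 0-1-2 with node 0 labelled class A and node 2 labelled class B, the middle node ends up with equal mass for both classes, as its neighbourhood suggests.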

Read More

Reading Noun-Noun Compounds

Early Influences of Compound Frequency and Semantic Transparency

My bachelor thesis in Cognitive Science. Unfortunately, I am currently not allowed to release the data or the analysis scripts, because the dataset is still under active research.

Abstract: This thesis evaluates psycholinguistic theories about the cognitive processing of words. Consequently, the time-course of compound reading is analyzed using generalized additive models in a dataset of eye movements. The theories to be contrasted are sublexical (Taft and Forster, 1975), supralexical (Giraudo and Grainger, 2001) vs. dual route processing (Schreuder and Baayen, 1995) and form-then-meaning (e.g. Rastle and Davis, 2008) vs. form-and-meaning (e.g. Feldman et al., 2009) processing.

As the goal is to find the best model given various predictors, some general mechanisms of eye movements will be demonstrated: e.g. the position in the line has substantial effects, and single fixations last longer, land on shorter words and more toward the center of the word, and are influenced differently by frequency measures.

Inspired by Kuperman et al. (2009), it is shown that even the early eye fixations on words are guided by first-constituent and compound frequency, providing evidence for parallel dual-route models.

Similar to Baayen et al. (2013), Latent Semantic Analysis (LSA) similarity scores (Landauer and Dumais, 1997) permit investigating the time point of semantic processing. The effect of LSA similarity not only shows up in the earliest word fixations; the data reveals that semantics plays a role even before a word is fixated. In particular, the fixation position in the word is further to the right when the semantic transparency, i.e. the similarity between compound and second constituent, is high. This evidence of parafoveal semantic processing challenges opposing findings obtained with the eye-contingent boundary paradigm (Rayner et al., 1986). In the framework of naive discriminative learning (Baayen et al., 2011), the effect of transparency on fixation position reflects optimization of the landing position for accessing the orthographic information that is most discriminative for the compound.

Keywords: reading, eye-movements, compounds, semantic similarity, morphological processing, generalized additive model

Read PDF

Influence of Reputation on Gricean Maxims

As part of the course “Game Theory and Pragmatics” by Jason Quinley at the Institute of Linguistics, I wanted to explore the influence of reputation on obeying the Gricean Maxims using data from the Q&A website Stackoverflow.

Pragmatics is a subfield of linguistics, defined as “dealing with the origins, uses, and effects of signs within the total behavior of the interpreters of signs” (Morris, 1946). Pragmatics tries to explain why a simple sentence like “It’s raining.” has a lot of different interpretations, for example (Franke, 2009):

  • The speaker advises we should take an umbrella.
  • The speaker declares the picnic cancelled.
  • The speaker is sick of living in Amsterdam.

Regression Models and Parameter Estimation Methods

An introduction, written as a report for the seminar “Maschinelles Lernen” (Machine Learning) at the University of Tübingen. It covers the basics of linear, nonlinear, logistic and Bayesian regression, and presents methods for estimating the model parameters from statistical assumptions. Finally, logistic regression is applied to the Titanic dataset, showing, among other things, that the motto “women and children first” held true during the disaster.

Every plot is produced with the open-source statistics software R inside the $\LaTeX$ file (Sweave). Code is here.
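The report's analysis is done in R/Sweave; as a minimal re-sketch in Python, the logistic-regression step looks as follows (the synthetic stand-in data and all names are illustrative, not the actual Titanic analysis):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Logistic regression via gradient ascent on the log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w += lr * X.T @ (y - p) / len(y)
    return w

# Synthetic stand-in for the Titanic data: survival odds increase for
# women and children ("women and children first").
rng = np.random.default_rng(0)
n = 500
female = rng.integers(0, 2, n)
child = rng.integers(0, 2, n)
y = (rng.random(n) < sigmoid(-1 + 2 * female + 1.5 * child)).astype(float)
X = np.column_stack([np.ones(n), female, child])
w = fit_logistic(X, y)  # positive coefficients for `female` and `child`
```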

Characteristic of supervised machine learning, to which regression belongs, is describing the relationship between a target variable and explanatory variables from the available data, i.e. realizations of random variables.

The regression model represents $y$ as the sum of a hypothesis applied to $x$ and an error term $\epsilon$.

$$ y = h(x) + \epsilon $$
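For a linear hypothesis $h(x) = a + bx$, the parameters can be estimated by ordinary least squares; a short simulation sketch (the true parameter values are chosen arbitrarily for illustration):

```python
import numpy as np

# Simulate y = h(x) + eps with the linear hypothesis h(x) = a + b*x
# (true values a=2.0, b=0.5), then recover the parameters by
# ordinary least squares.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 200)  # the error term eps

X = np.column_stack([np.ones_like(x), x])    # design matrix
a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```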

Read PDF