All right. I said several times earlier that Bitcoin is only pseudonymous, and so all of your transactions or addresses could get linked together. Let's now go in and see how that might actually happen. Let's in fact start from WikiLeaks again. I showed you a quote from them saying Bitcoin is a secure and anonymous digital currency, and this is actually the page that quote was taken from. This is their donations page, and here you'll see that in addition to this blurb about Bitcoin being secure and anonymous, they have a donation address over here. This is, of course, the hash of a public key. You've seen things like this in previous lectures, but they also have this interesting refresh button right next to it. What do you imagine this refresh button might do? Well, as you might expect, if you click on that refresh button, it'll give you an entirely new donation address. Let's go in and take a look at that. So, a totally new address popped up on the page. So what is going on here? What WikiLeaks is doing is making sure that each time a person visits the page and wants to make a donation, they send that donation to a totally new public key that WikiLeaks creates just for that purpose. So here WikiLeaks is taking maximum advantage of the ability to create new pseudonyms, new public keys. Every single transaction that they receive, they want to receive at a new address. And in fact, this is the Bitcoin best practice for anonymity: always receive new transactions at a fresh address. So you might look at this and think, surely then these different addresses must be unlinkable. You receive a transaction over here and then much later you spend it by sending it to someone else. You receive another transaction at this address and then you send it to someone else over there. So how might somebody link them? Well, here's the key. Let's imagine a scenario. Alice, a customer, goes to a big box store and wants to buy a teapot.
So in the scenario Alice has a few Bitcoins lying around in these different denominations, and the store lists the teapot for a price of eight Bitcoins. That's a pretty expensive teapot at today's exchange rate, so imagine those are centiBitcoins or something, if you like. At any rate, Alice has these different addresses and wants to pay for the teapot. How is she going to accomplish this? She doesn't actually have an address with eight Bitcoins sitting in it. And so what she's going to do is combine several different inputs into a single transaction, in order to pay eight Bitcoins to the store. So this reveals something. Somebody who's looking at this transaction, which gets recorded permanently in the blockchain, is going to think: aha, two different inputs to this transaction. That could only happen because both of these input addresses are under the control of the same user, who was able to use their wallet software to create a transaction that combined both of them into one. In other words, shared spending is evidence of joint control of two different addresses, and it doesn't stop there. This is not just about linking two different addresses that are inputs to a transaction. You can do that transitively: whenever Alice has a whole cluster of addresses that have been linked, and she then creates a new transaction that combines one of those addresses with a new address, you can add this new address to the cluster. So this is the first insight behind being able to link transactions together. And we'll see later on that an anonymity technique called CoinJoin works by violating exactly this assumption. But if you assume that people are just using regular Bitcoin wallet software, not doing anything special on top of it, then this technique tends to be pretty robust, and this has been explored in a variety of research papers.
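To make the shared-spending idea concrete, here is a minimal sketch in Python of how an observer might cluster addresses transitively. It is an illustration under stated assumptions, not the method of any particular paper: the transaction format (a dict with an `inputs` list of address strings), the address names, and the function name `cluster_by_shared_inputs` are all hypothetical, and a union-find structure is just one convenient way to merge clusters.

```python
from collections import defaultdict


def cluster_by_shared_inputs(transactions):
    """Group addresses that ever appear together as inputs of one transaction.

    `transactions` is a hypothetical simplified format: each item is a dict
    with an "inputs" list of address strings. Real blockchain data would need
    parsing first.
    """
    parent = {}

    def find(addr):
        # Register unseen addresses as their own cluster, then walk to the root.
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]  # path halving
            addr = parent[addr]
        return addr

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for tx in transactions:
        inputs = tx["inputs"]
        for addr in inputs:
            find(addr)  # make sure single-input addresses are recorded too
        for other in inputs[1:]:
            # Shared spending is taken as evidence of joint control.
            union(inputs[0], other)

    clusters = defaultdict(set)
    for addr in parent:
        clusters[find(addr)].add(addr)
    return sorted(sorted(members) for members in clusters.values())


# Made-up example: A1 and A2 are spent together, later A2 and A3 are spent
# together, so all three end up in one cluster. B1 is never co-spent.
example_txs = [
    {"inputs": ["A1", "A2"]},
    {"inputs": ["A2", "A3"]},
    {"inputs": ["B1"]},
]
print(cluster_by_shared_inputs(example_txs))
# prints [['A1', 'A2', 'A3'], ['B1']]
```

Note that the sketch inherits the assumption stated in the lecture: if inputs from different users are deliberately combined in one transaction, the clusters it produces will be wrong.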
And just a note about this lecture: a lot of what we're going to be discussing today gets into the frontier of research knowledge. So for a lot of this, the state of the art may advance in a few months or a few years. So every time I talk about a technique that we know from a particular research paper, I'll give you a reference to that paper, so that you can look it up. You can look up papers that cite it, and you can build up that knowledge on your own. Now in particular, one of the papers that used this technique used it for a particular purpose. There was a well-publicized Bitcoin theft a few years ago, and the authors of this paper decided to see how the thief had been moving Bitcoins around between multiple addresses of his own. And so this is the paper in question; it's called An Analysis of Anonymity in the Bitcoin System. This is one of the first major research efforts that did what we call transaction graph analysis. So, you can use the techniques that I showed you in previous slides, and you can draw a lot of these pretty graphs and deduce that this represents the thief moving money around between his own different addresses, this is the thief sending money to someone else, and various things like that. I haven't yet shown you anything that allows you to link any of these clusters to real-world identity, but let's defer that question for a bit and go back to the scenario of Alice and the teapot. So let's look at it again. Maybe the teapot has gone up in price to 8.5 centiBitcoins. So what is Alice going to do now? She can't combine any subset of her transactions or her addresses to produce the exact change necessary for purchasing this teapot. So instead what she's going to do is exploit the fact that transactions can have any number of inputs and outputs, and create a single transaction that looks like this.
It combines these two inputs to produce this output that goes over here and another output that goes to an address that she herself owns, and this is called a change address, which you saw in a previous lecture. This presents a conundrum for an adversary who's looking at this. The adversary might be able to deduce that these two input addresses belong to the same user. They might suspect that one of these output addresses also belongs to that same user, but they have no way of knowing which one that is. In this particular example, the change is a small amount, but it doesn't have to be that way at all. Alice might own an address that has 10,000 Bitcoins, and might spend a little bit on the teapot, and might send most of the rest of it back to herself at her own change address. And these transaction outputs don't have any particular ordering in the blockchain. That order is not meaningful at all. So it's not clear how the adversary might determine which address is the change address in a multi-output transaction. So what is the adversary to do? There's another pretty cool technique for this, again from a research paper, which I'll tell you about, but the technique is this. The authors call this idioms of use, and they exploit idiosyncratic features of different wallet software. For example, one thing they found is that most wallet software uses an address as a change address only once. This in fact follows Bitcoin best practice for anonymity, in a sense: if you have a new transaction where you need a change address, don't use an address that you've already used before as a change address. Create a new address and use it for this purpose. Now, not all addresses that are outputs of transactions might have this property.
Going back to the example of the big box store, the store might advertise a long-term address at which it wants to receive Bitcoins, instead of receiving Bitcoins at a different address every time. So not every non-change address has the property of being used only once, but every change address does. So they used this, and they found that it works pretty well. On the other hand, this has some limitations. It just happens to be a feature of wallet software, and so a lot of false positives might creep into these clustering techniques if you use heuristics like this. So it required a lot of manual intervention. Nevertheless, they were able to use the technique that I showed you before, which is clustering shared inputs together, as well as a few heuristics for change address detection. And then they were able to look at the entire Bitcoin transaction graph and create some giant clusters that they hypothesized belong to various major service providers. And here's what that graph looks like after applying these two heuristics, and here's the paper in question. This is by Sarah Meiklejohn and others; there's a whole bunch of authors on this paper. Now, this graph looks very interesting. The sizes of these circles represent the amount of money flowing into those clusters, and the number of edges going out of a cluster represents the number of transactions. Let's just stay with this for a second, and see if we can guess what some of these major service providers and other clusters of nodes might be. This huge one here dominates in transaction volume compared to any other cluster; given that this paper was written in 2013, we might guess that it's Mt. Gox, which was a very prominent exchange at the time and later went under.
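The change-address idiom discussed a moment ago can be sketched roughly as follows. This is a deliberately simplified reading of the idea, not the paper's exact heuristic: it walks the transactions in chain order and flags an output as likely change only when it is the single output address that has never appeared before. The transaction format, the `id` field, the address names, and the function name are hypothetical.

```python
def guess_change_addresses(transactions):
    """Return {tx_id: address} for outputs that look like one-time change.

    Assumption: `transactions` is in blockchain order, and each item is a
    dict with "id", "inputs" and "outputs" (lists of address strings).
    An output counts as "fresh" if the address has never been seen before.
    """
    seen = set()
    guesses = {}
    for tx in transactions:
        fresh = [addr for addr in tx["outputs"] if addr not in seen]
        # Only multi-output transactions can contain change, and we only
        # guess when exactly one output is fresh, otherwise it is ambiguous.
        if len(tx["outputs"]) > 1 and len(fresh) == 1:
            guesses[tx["id"]] = fresh[0]
        seen.update(tx["inputs"])
        seen.update(tx["outputs"])
    return guesses


# Made-up example: the store reuses a long-term address, Alice's change
# address A3 is brand new, and in t2 both outputs are new (no guess).
example_txs = [
    {"id": "t0", "inputs": ["C1"], "outputs": ["STORE"]},
    {"id": "t1", "inputs": ["A1", "A2"], "outputs": ["STORE", "A3"]},
    {"id": "t2", "inputs": ["B1"], "outputs": ["N1", "N2"]},
]
print(guess_change_addresses(example_txs))
# prints {'t1': 'A3'}
```

As the lecture points out, a rule like this depends on wallet behavior, so its guesses can be false positives; a guessed change address would typically be merged into the cluster of the transaction's inputs only with some caution.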
Now, we might also guess that this little one here, which only has a little bit of transaction volume in spite of having a very large number of transactions, corresponds to the profile of the gambling service Satoshi Dice. Because the way that it works is you send a tiny amount of Bitcoins and you either win that bet or you lose that bet, and so you might get double the Bitcoins, or none of the Bitcoins. So that's the gambling service Satoshi Dice. We might guess that it's this one here, we might guess that this one is Mt. Gox, and so on, but this kind of guessing is suboptimal. The authors wanted some reliable way of identifying which service providers correspond to each of these clusters. How did they do that? Well, one idea you might have is: why not go to the Mt. Gox website and see what address they advertise for receiving Bitcoins? Well, that doesn't quite work, because they're going to advertise a new address for every single transaction, and if you just go to the website, look at the address, and don't actually complete that transaction, you don't send Bitcoins there, then they're simply going to discard that address. They're not going to reuse that address for another customer. In other words, that address will never get used. You simply won't find it in the blockchain. So what's the way around this? Well, the only way to reliably infer addresses that are associated with a service provider is to actually transact with that service provider, which is exactly what the authors did. They went ahead and bought a variety of things and interacted in a variety of other ways with a bunch of service providers, comprising 344 transactions in all: mining pools, wallet services, exchanges, various merchants, even gambling sites and so on. They got a bunch of cool things to show for their efforts, and Meiklejohn informs me that in fact the cupcakes were really good.
At any rate, the authors used this very clever technique to go ahead and label the major clusters in the graph that I showed you on the previous slide, and so this is what the labeled graph looks like. In fact, this was Mt. Gox, as we might have guessed. This was Satoshi Dice, but a lot of the others would have been very difficult to guess, and by actually transacting with these services, they were able to identify most of these service providers. So already now, we've seen something pretty interesting beyond just clustering: being able to put labels on the clusters. So the next question is, sure, you can apply these labels to major service providers. Can you put labels on individuals? In other words, can you connect little clusters corresponding to individuals to their real-life identities? Well, there are at least a couple of different ways in which that can happen. One is, intuitively, what I told you right at the beginning: you could simply interact at a coffee shop, or with some other merchant. So they learn some transaction or some address that corresponds to you, and they might use that to tag your cluster. There are at least a couple of other ways in which this might happen, and one is that there's high centralization in these service providers. So the intuition here is that most users, in the course of normal usage of Bitcoin over a period of months or years, are going to interact with at least one of those major service providers that were labeled in the previous graph. So if somebody wants to identify a cluster corresponding to a particular user, there's a very high chance that they're going to be able to identify a transaction that ties that cluster to a known, labeled cluster. And then they can go to that service provider and, if they have the appropriate authority, subpoena that service provider, or if they're a hacker, try to hack into that service provider, and so on.
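The labeling step, where a single known address tags a whole cluster, can be sketched in a few lines. This is only an illustration: the cluster contents, the tagged addresses, and the function name `label_clusters` are made up, and the "conflict" handling is an assumption of this sketch rather than anything from the paper.

```python
def label_clusters(clusters, tagged_addresses):
    """Attach a service name to every cluster containing a tagged address.

    `clusters` is a list of sets of address strings (for example, the result
    of a clustering heuristic). `tagged_addresses` maps an address learned by
    actually transacting with a service to that service's name.
    Returns {cluster_index: label}; untagged clusters are omitted.
    """
    labels = {}
    for index, cluster in enumerate(clusters):
        names = {tagged_addresses[a] for a in cluster if a in tagged_addresses}
        if len(names) == 1:
            labels[index] = names.pop()
        elif len(names) > 1:
            # Two services in one cluster suggests a clustering false positive.
            labels[index] = "conflict: " + ", ".join(sorted(names))
    return labels


# Made-up example: one observed deposit address is enough to label the
# whole cluster it belongs to.
example_clusters = [{"G1", "G2", "G3"}, {"D1", "D2"}, {"U1"}]
example_tags = {"G2": "Mt. Gox", "D1": "Satoshi Dice"}
print(label_clusters(example_clusters, example_tags))
# prints {0: 'Mt. Gox', 1: 'Satoshi Dice'}
```

The same lookup works for individuals: a merchant who learns one of your addresses can tag your entire cluster with it.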
So, this is one major avenue by which regular users can get de-anonymized: they eventually, inevitably interact with one of these major, easily identified service providers. Another one is simply carelessness. A lot of users end up posting address information in forums. They might post one of the Bitcoin addresses that they own, for example, to receive donations when they're posting comments on forums. Now that might be because these users are not worried about getting de-anonymized. It could also be because they don't realize that posting one of their addresses is almost inevitably going to allow somebody to connect all of their different addresses together. Okay, so hopefully I've convinced you there are clever ways that an attacker might utilize in order to not only link different addresses or transactions belonging to a user, but go from there to real-world identity. And our experience, the history of these de-anonymization algorithms, shows that they only get more powerful with time, as more auxiliary information, as we call it, becomes available for attackers to link together to get to users' identities. So this is something to worry about if you care about privacy. Before we look at how to make things better for anonymity, let's look at a completely different way in which users can get de-anonymized. So far what we've looked at is all based on what is available to the attacker in the blockchain, the part that is permanently and publicly recorded. But recall that that's not the only part of Bitcoin. There is also a peer-to-peer network, in which a lot of messages are sent around that don't necessarily get permanently recorded in the blockchain. So the blockchain, in networking terminology, is called the application layer, and the peer-to-peer network is, of course, the network layer. And so de-anonymization can happen at this totally different layer, the network layer. Well, how could that happen? Here is an example.
This was first pointed out by Dan Kaminsky a few years ago in a talk at Black Hat. Here's the peer-to-peer network. What he noticed is that when a node creates a transaction and wants to broadcast it, it's going to connect to a lot of nodes at once and broadcast that transaction. And so, if a few nodes on the network put their heads together, they can figure out that, hey, this new transaction, this is the first we've heard of it, and all of us first heard of it from this particular node. So this must be the node, this must be the IP address corresponding to the user who created this transaction. So here you have a linkage not between a cluster and a real-world identity. Instead, you have a linkage between a transaction and an IP address, and of course, an IP address is something that's very close to real-world identity. There are a lot of ways to go from there to the next level of finding identity. So, this is already a serious problem. Luckily though, this is not a very hard problem to solve. Why? Because this is now a problem of communications anonymity, and communicating anonymously is a problem that has received a lot of attention from the research community. And as we already saw in the introduction, there is a good system called Tor that you can use for communicating anonymously. Now, there's one little caveat. Tor is intended for what is called low-latency activity, such as web browsing, where there is a large volume of traffic and you don't want to sit around waiting for too long; you want the response immediately. So it makes some compromises in anonymity in order to achieve low latency. Bitcoin is inherently a high-latency system, because it takes a while for transactions to propagate through the network, and especially to get confirmed in the blockchain. So we don't have this low-latency constraint. So it's possible that we could come up with a more specific, fine-tuned anonymity network for this particular purpose.
And there are such things, called mix nets. The only problem is that Tor is the system that's most widely deployed, analyzed, robust, and functional today. But it's possible that somebody might develop a mix net solution for anonymizing your Bitcoin communications, and if that happens, that would be something to switch to. So let's summarize what we've learned so far. We've seen that based on the information in the blockchain, different addresses could get linked together, and could also get linked to identity. We've also seen that based on the information at the network layer, a transaction or address could get linked to your IP address. Luckily, this latter problem is simple to solve. If you care about your anonymity and privacy when using Bitcoin, it's a good idea to use it through Tor. But the former problem is much trickier, and that's what we're going to spend the rest of this lecture talking about.
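The network-layer linkage in this summary, where a few cooperating nodes notice that they all first heard a transaction from the same peer, can be sketched as below. This is a toy model under heavy assumptions, not a description of any real attack tool: the observation records, their field names, the `min_observers` threshold, and the IP addresses (taken from a documentation-reserved range) are all hypothetical, and real network propagation is far noisier.

```python
def guess_origin(observations, min_observers=2):
    """Guess the originating peer of each transaction from colluding nodes.

    `observations` is a list of dicts with hypothetical fields:
    "tx" (transaction id), "observer" (colluding node), "peer" (IP address
    the announcement came from) and "time" (arrival time).
    A guess is made only if at least `min_observers` nodes saw the
    transaction and every one of them first heard it from the same peer.
    """
    first_heard = {}  # (tx, observer) -> (time, peer) of earliest announcement
    for obs in observations:
        key = (obs["tx"], obs["observer"])
        if key not in first_heard or obs["time"] < first_heard[key][0]:
            first_heard[key] = (obs["time"], obs["peer"])

    peers_by_tx = {}
    for (tx, _observer), (_time, peer) in first_heard.items():
        peers_by_tx.setdefault(tx, []).append(peer)

    guesses = {}
    for tx, peers in peers_by_tx.items():
        if len(peers) >= min_observers and len(set(peers)) == 1:
            guesses[tx] = peers[0]
    return guesses


# Made-up example: both observers first hear t1 from the same peer, while
# for t2 they disagree, so no guess is made.
example_observations = [
    {"tx": "t1", "observer": "n1", "peer": "198.51.100.5", "time": 1.0},
    {"tx": "t1", "observer": "n2", "peer": "198.51.100.5", "time": 1.1},
    {"tx": "t1", "observer": "n1", "peer": "198.51.100.9", "time": 1.4},
    {"tx": "t2", "observer": "n1", "peer": "198.51.100.7", "time": 2.0},
    {"tx": "t2", "observer": "n2", "peer": "198.51.100.8", "time": 2.1},
]
print(guess_origin(example_observations))
# prints {'t1': '198.51.100.5'}
```

If the user broadcasts through an anonymity network such as Tor, the peer address the observers record would be that of a relay rather than the user, which is why the lecture treats this problem as the easier one to address.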