February 24, 2010

GC, Continuations, Ruby Internals

When I was about 6, I told my mom I wanted to be a garbageman when I grew up. I mostly just liked the truck, and hanging from the side of it while it drove from house to house.

Garbage collection still fascinates me. I found this talk on Ruby's garbage collector of great interest. While we program with abstractions to make application development easier, the sad fact is that knowing how your language's features are implemented still helps you avoid shooting yourself in the foot. Even if it didn't, wouldn't you be curious?

This is also a great example of a poor (or simply inflexible) design choice having major consequences down the road. In the talk they explain how Ruby's conservative, stop-the-world mark-and-sweep collector can't really be replaced by other, more efficient collectors due to the representation of objects. Personally, I'm partial to generational stop-and-copy collectors, but those only showed up in the talk as an impossibility. The best they could do was patch up their lame mark-and-sweep (also, Python uses reference counting, lol).

I dropped a lot of jargon there; for any curious folk, I can explain what they mean in a later post ^_^ I'll spend the remainder of this one talking about another Ruby feature whose underlying implementation destroyed my algorithm.

---

Ruby is one of the few languages to support continuations out-of-the-box (in Ruby 1.8, anyway; in 1.9 you have to 'require continuation'). The presence of continuations is a sign of flair: when a designer has worked to put them in the language (Matz called continuations the hardest Ruby feature to implement), they are pretty much telling you they are committed to writing a flexible, powerful language that lets the programmer do whatever they want. Naturally, when I see a use for them, I use them.
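If you haven't seen one: a continuation is a snapshot of 'the rest of the program' that you can jump back into whenever you like. A minimal (and contrived) sketch with Ruby's callcc:

require 'continuation' # Ruby 1.9; on 1.8 callcc is built in, so skip this line

$count = 0
cont = nil
callcc { |k| cont = k }  # execution resumes here each time cont is called
$count += 1
puts "passed this point #{$count} time(s)"
cont.call if $count < 3  # jump back: three lines print in all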

Kent Dybvig shows us that continuations can be implemented very efficiently. Unfortunately, in Ruby they aren't: like the GC, continuations are implemented in about the most bare-bones way possible. Ruby creates a continuation by copying the entire program stack in its current state and storing it elsewhere. When you call the continuation, it copies the old stack back over the current one.

Here's the setup: I wrote a program that pushed the current continuation onto a stack before every recursive call. The idea was to use the continuations to keep track of backtracking over several parameters, and the stack meant you would only ever jump back as far as you needed.

The problem: every recursive call increased the size of the program stack, since Ruby doesn't support tail-call optimization. So at every recursive call, you would copy the entire, ever-growing stack somewhere in memory, augment it, and recur. After n calls that's n copies of a stack that is itself O(n) deep: quadratic work. DEATH!
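For flavor, here's a minimal sketch of the shape of that algorithm (not my original code; choose and fail_choice are names invented for illustration): an amb-style operator that saves a continuation at every decision point and pops one to backtrack.

require 'continuation' # Ruby 1.9; built in on 1.8

$branch_points = []  # stack of saved continuations, one per decision point

def choose(options)
  options.each do |opt|
    callcc do |k|
      $branch_points.push(k)  # remember this decision point...
      return opt              # ...and tentatively commit to opt
    end
  end
  fail_choice  # every option here failed; back up to an earlier choice
end

def fail_choice
  raise 'no solution' if $branch_points.empty?
  $branch_points.pop.call  # jump back and try the next option there
end

x = choose([1, 2, 3, 4])
y = choose([1, 2, 3, 4])
fail_choice unless x + y == 7
puts "#{x} + #{y} == 7"

Every callcc in there copies the whole program stack; in my recursive program that stack kept growing, which is where the death came from.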

Needless to say, I found another solution. But this was another example of how the implementation of a feature can make the feature usable or not. Had I implemented the same algorithm in Dybvig's Chez Scheme, with both tail calls and efficient continuations, the algorithm would have sailed.

---

Coming back full circle, there was a time when I considered peeking into the Ruby source and forking it to support Dybvig's stack frame model (gutting the whole language primarily to support... continuations?). Looking at the object representation from the GC talk though, it's probably much harder than I imagined ^_^.

February 22, 2010

m4d H4X

For two years I was a UTA for our department's Introduction to Security course, and my roommate is the current Head TA. So when a friend was looking for someone to perform a security audit on his web application, he called my roommate, who called me in as his surgeon's assistant. Here's what we found:

Dictionary Attack


Anytime you have a problem in computing, there's always a 'dumb way to do it,' which normally involves checking every possibility. Remember being a kid and asking someone to guess your birthday? The first thing they ask is 'What month is it in?' Suppose you say 'August.'

A dictionary attack is the kid who closes his eyes and says "August 1st August 2nd August 3rd August 4th August 5th..." (and ruins the game).

The idea is this: if you want to guess someone's password, try every value it could be. You do this by trying to log in as them with every password, and you stop when one of them works.

Sounds dumb? It is, but never underestimate a fast, dumb computer. After all, it worked against Twitter.

The attack is called a dictionary attack because the idea is that you try someone's e-mail address with every word in the dictionary. A simple dictionary (like the one I used for this) consists of the 500 most common passwords, a couple hundred first names, and an actual dictionary. Since most people use real words as their passwords, there's a good chance you'll stumble upon the correct one.
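The core loop really is that dumb. Here's a minimal Ruby sketch; the URL, form fields, and failure check are all invented for illustration:

require 'net/http'
require 'uri'

# Try each word in a wordlist against a (hypothetical) login form.
def dictionary_attack(email, wordlist_path)
  uri = URI('http://example.com/login')
  File.foreach(wordlist_path) do |line|
    password = line.chomp
    response = Net::HTTP.post_form(uri, 'email' => email, 'password' => password)
    return password unless response.body.include?('Invalid')  # invented failure marker
  end
  nil  # exhausted the dictionary without a hit
end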

To stop this, you have a few options:

  • Create a delay after some failed attempts, and report the behavior to an admin. So if someone messes up their password 3 times, make them wait 15 minutes. Another 3, make them wait an hour, etc. This slows down your opponent, and makes you aware of suspicious activity (there's a small sketch of this after the list).

  • Demand strong passwords. We all get annoyed having to mix numbers and letters (one of the most common passwords is 'password1', the most common is '123456'), but it helps your security, since you won't find 'h4ll0MRP3ANut' in a dictionary.

  • Keep track of your requests, and stop trolls. This is similar to how you'd mitigate a DoS: keep track of where people are logging in from. If you see 100 failed logins in 1 minute from IP address 113.154.2.110, stop letting that address connect (again, at least for a day or two).
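As promised, a sketch of the lockout idea from the first bullet. It's in-memory Ruby (a real application would persist this), and the schedule is just the one from the bullet:

# Track failed logins per user; refuse logins during a cooldown window.
FAILURES = Hash.new { |h, user| h[user] = { :count => 0, :locked_until => Time.at(0) } }

def login_allowed?(user)
  Time.now >= FAILURES[user][:locked_until]
end

def record_failed_login(user)
  entry = FAILURES[user]
  entry[:count] += 1
  if entry[:count] % 3 == 0
    # 15 minutes after 3 failures, an hour after 6, and so on
    minutes = 15 * 4 ** (entry[:count] / 3 - 1)
    entry[:locked_until] = Time.now + minutes * 60
    # this is also the moment to report the account to an admin
  end
end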


File Upload


Most web applications let you upload files to share, or view online. There was once an artist who bound his book with sandpaper so that shelving and re-shelving it would destroy the books next to it. That bookshelf is your application, and that book is the other exploit we found.

The site in question had a file upload feature, so we uploaded an executable file that would run whatever command we fed it on the computer where it resided (in this case, the company server). As soon as we 'viewed' the document, it would execute. So a command like

find ../../ -name "config.rb" -exec grep password '{}' \;

will find the configuration files and print every line in them containing 'password' (most web frameworks keep such a file).

The fix to this one is simple: don't let users upload any type of file your server might execute (unless, of course, you're a code hosting site, in which case you don't need to be told about this).
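In Ruby, the first line of defense can be as small as an extension whitelist (the list below is illustrative; you'd also want to serve uploads from a directory your server never executes from):

# Reject any upload whose extension isn't explicitly allowed.
ALLOWED_EXTENSIONS = %w[.png .jpg .gif .pdf .txt]

def safe_to_store?(filename)
  ALLOWED_EXTENSIONS.include?(File.extname(filename).downcase)
end

safe_to_store?('resume.pdf')  # => true
safe_to_store?('shell.rb')    # => false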

---

Security isn't the way movies make it out to be: most hacks aren't mathematical or cryptographic breaks, and they're never as dramatic. The brilliance in the best ones is that they're so simple. Most security holes are little leaks in the way software gets written or, more usually (like weak passwords), flaws in predictable human behavior.

February 19, 2010

Fruits

I made a video game! Czech it out ^_^



Follow my progress, as well as the rest of my games group, at our blog!

February 18, 2010

A short history of my belief

One of the (few) things I used to blog about was my atheism. Since I'm starting fresh in this new blog, an introduction is in order: my relationship to belief has varied widely over the years, and religious belief is something I care passionately about.

I 'came out' as atheist almost as soon as I embraced the label, about 2 years ago, in my junior year at Brown. I'd been fighting doubts in my mind for a few years prior, but only really challenged and criticized my faith with any vigor or curiosity around the time I came out.

I was raised culturally Catholic: technically my mother was a Catholic and my father a Protestant. I say 'was' because my mother isn't technically allowed to practice anymore: she is my father's second wife, and since the Catholic Church doesn't recognize divorce, she could no longer 'officially' practice. When they got married, there wasn't a Catholic priest in town who would officiate the ceremony. They now attend a Methodist church.

We didn't grow up going to church, but I was still raised with a significant religious bent, in spirit more than practice. My parents always emphasized Doing the Right Thing that Christ Would Want as more important than saying any number of Hail Marys. In middle school especially I would pray every night before going to bed. I would wear a cross, and if I took it off before showering, would apologize to God (no joke).

As I got older, questions started forming. They began with the simple facts-on-the-ground ones ("why does the church hate gay people?") and moved to the higher-level ones ("If God knows what I'm going to do, and even controls it, why does s/he care?"). They mounted over time, and after a while I just couldn't think about it anymore.

When Junior year rolls around, I read a few books, and realize the answer to all those questions is a very simple one: S/He doesn't exist. Suddenly, almost all the issues go away.

No more fighting my brain; no more defending stupidity manifesting itself in religious institutions. No more tainting my observations with false expectations of a Watcher or Creator. What happens on Earth is our own doing. It remains the most liberating moment of my life.

It can be depressing, however. Dawkins had a cute essay called Gerin Oil, where he compares religion to drug use (paraphrasing: "in light doses, it's rather harmless and often used as a social lubricant. In heavier doses...").

I think this is apt. Just like consuming substances can 'turn off' a part of your brain and allow you certain associated comforts, going atheist is like being the sober guy at a party. While other people are comfortable embracing blatant contradictions and living by hollow aphorisms, you get a far less sugar-coated view. Preventable injustices happen (no, God didn't want it). You aren't that special. When you die, it's over.

All things considered, however, I'll take it, because a) if you look at it from another angle, it's not really bad at all, and b) I can believe it ^_^

So I might post about religion from time to time. If you agree, great! If you don't, better! Let's have a conversation.

February 17, 2010

This made me laugh today.

Yes he was a traitor, but I don't mind taking orders from Lando Calrissian when he plays it so cool.



Gotta love the cooool lovemakin' music they score it with.

February 15, 2010

Hiatus - And a voyage of vim

Sorry for the lack of updates. I hate when blogs say that, but publishing less than once a month defeats the purpose of a blog. I blame what I always have: my insistence that what I publish be Significant (the last two posts were more than epic enough), even when it goes against the interests of a) myself, and b) my readers (are there any out there?).

But let's continue. What am I up to? Oh, yes! These are things I'm working on in school, and am likely to write about (if I ever do, in fact, write):

Solving Hard Problems with Combinatorial Optimization is one of the 5-6 legendary courses everyone at Brown CS is told to take, something like a rite of passage. I missed the chance to take one of those classes (Operating Systems) in favor of another (Programming Languages Seminar), and while I loved loved loved PL, I won't allow myself to graduate without taking this course.

I'm also taking an independent study on Android Game Development, with some pretty awesome partners. It's very different from what I'm used to: I normally write programs that run in a terminal, performing computation or purely manipulating data, whereas Android is framework programming with GUIs and graphics. The data in this case is usually mundane plumbing, and the easy part of your application.

Altogether, good fun. Hopefully I'll still have time (and this is dubious) to continue pursuing my lovely programming puzzles.

In the meantime, I'll bloviate some. Today I'll talk a bit about my tools with a focus on text editing.

I feel like refining and expanding upon your coding environment is something like building your own lightsaber: you can use someone else's, but you really only work productively (and more importantly, happily) when you've found the tools that fit your needs yourself, through much experimentation. For me, that environment is usually all-out vim + Makefiles (or in Java, vim + ant), and as much from the Terminal as possible.

A lot of junior hackers feel this way, but most 'enterprise developers' (those who know everything there is to know about the JVM and its classloading process but have no notion of first-class functions, or final vs. const in C#) love to stick it to me about their IDEs. Similarly, when I witnessed an emacs vs. vim brawl (one of our professors left their terminal exposed, revealing which side they were on), the emacs user felt under attack before I said my first words.

Here is why I use vim instead of emacs: because I learned it first. Alternatively, you can use any of the equally compelling reasons in the list below:

  • Because I prefer to map Caps Lock to Esc instead of Ctrl.

  • Because my brilliant hacker friends next to me use vim exceptionally well and inspired me (they could have just as easily been emacs users).

  • Because a friend put up a photo of Richard Stallman in the lab and joked about the fact that he used Ctrl^H for Help.


None of these are real reasons, folks. They're all true, and honest-to-Baal the only things that pushed me to vim. Since I have a soft spot for Lisp, I hope one day to pick up emacs, and remain open to the possibility that it may suit me better.

What I learned in the aforementioned editor brawl (and I hope we as an industry have outgrown the lame fighting*): if you see someone warring over their editor, don't feed the animals. It's clear that very smart and productive people use one or the other; that you don't like it is no reason for indignation. Further, who cares what someone else uses?

I would extend this to Tabs vs. Spaces as well, or where to put your open braces in if/else blocks. There are tools to convert to your preferred style. Change it, work on it, change it back.

---

For now, I use vim because I know it, and only started using it last year. Before that? Kate on the department filesystems, SubEthaEdit on my home computer, and nano when I ssh-ed from the terminal. I edited Java in Eclipse.

Do I recommend switching as I did? Yes. Keeping it general, if you are a coder who uses a primarily graphical environment, I highly recommend you become proficient with a serious, terminal-based text editor. You don't have to love it, you don't have to keep it, but try it, and give it a fair shot. Why?

  • Learning a Terminal-based editor lets you know you have a comfortable editor with you anytime, anywhere. When I used SubEthaEdit, it was not only Mac-specific, but commercial. If I wasn't editing on my laptop, I wasn't editing in my preferred environment.

  • Both major players encourage keyboard use. You won't know what a shackle frequent mouse usage is until you've been set free of it. Doing everything on the keyboard makes using the computer for 6-7+ hour stretches much more tolerable.

  • If you are in a resource-scarce environment (say, ssh over a shady connection), you can bet that software written in the '70s or early '90s will load quickly and leave a small footprint on your resources.

  • They are scriptable. Many editors are programmable, but there's a difference between writing a tool in ELisp or Vimscript and writing an Eclipse plugin.

  • Most of all, you look like a badass.


Of course, you lose some conveniences as well (especially in the case of Eclipse), but so far they haven't outweighed the benefits.

Of course, if you are an IDE user, there's a very (very) good chance you're a more premium coder than I am. I'm not making a value judgement on you; I simply suggest you try loving your Mother Terminal, who paved the way for the sexier IDE's.

*= Open question: What is our generation's equivalent? Ruby vs. Python? More generally, static typers vs. dynamic typers? Chip in; as a former actor in training, the only thing I love more than Computer Science is Drama ^_^

December 28, 2009

On-Demand Prime Generation

Many Project Euler problems are made easier if you have a good facility for generating and detecting prime numbers. Usually when I see such a problem I immediately switch to Haskell, but the other day a problem took about a minute to solve (way too long), so I ported the functionality to C.

In both, I'm doing a simple trial division algorithm: a number is prime if no number other than 1 and itself divides it. A simple glance at Wikipedia shows you only have to test divisors up to and including the square root of the number you're testing, and the nature of primes means you really only need to test divisibility by other primes.

---

In Haskell, I adapted an old exercise from my CS173 class that used this problem to highlight laziness as a language feature. First, we declare our list of primes:


primes :: [Int]
primes = filter isPrime [1..]


And then we define the predicate to determine if a number is prime or not, on which our list depends:


isPrime :: Int -> Bool
isPrime 1 = False
isPrime 2 = True
-- test divisibility by every prime up to and including sqrt n
isPrime n = checkFactors (takeWhile (\ p -> p <= (floor $ sqrt $ fromIntegral n)) primes) n

-- True if num is divisible by none of the given primes
checkFactors :: [Int] -> Int -> Bool
checkFactors [] _ = True
checkFactors (x:xs) num = (num `mod` x) /= 0 && checkFactors xs num


For those who aren't Haskellites (even those who are, since my Haskell probably isn't very pretty or idiomatic), the code is doing this:

  • The list primes is the list of all integers beginning at 1, with all the non-primes filtered out, where non-primes are determined by the function isPrime.

  • The function isPrime tests a number n for primality by testing divisibility by all primes up to and including the square root of n, where those primes come from the list primes.


Huh? You might have read over that a few times trying to piece it together and couldn't, as primes depends on isPrime and isPrime depends on primes. But not to worry, this is laziness at work :)

Laziness means that Haskell will only perform the computations it needs as it needs them, so as long as isPrime only needs as many numbers as primes has already computed, this works. Alternatively, so long as primes only needs to generate numbers isPrime can check, we're golden.

Imagine a snake that is three feet long eating a foot of itself from its tail, but growing a foot and a half longer as a result. It can do this continuously and grow to indefinite size!

More concretely, suppose I want to check if 9 is prime. A call to isPrime 9 causes the following to happen:

  • The takeWhile in isPrime will take all numbers from primes that are less than or equal to 3 (the floor of the square root of 9).

  • The language will compute values for primes by running each integer through isPrime, and will do this while those values are less than or equal to 3.
  • isPrime 1 will fail, isPrime 2 will pass, and isPrime 3...

  • ... we're not sure, so we repeat the steps above for all primes at most the square root of 3, i.e. at most 1. This gives us the empty list, which by the definition of checkFactors tells us 3 is prime.

  • So takeWhile hands checkFactors the list [2, 3] and stops there, since the next prime would exceed 3. 9 mod 2 is nonzero, but 9 mod 3 is 0: the predicate fails, so 9 is not prime.


Next to shell languages, Haskell is the only language in wide use I can think of (if you consider Haskell in wide use...) that employs laziness by default, and it buys you a powerful abstraction like this. Laziness lets us define the list of all primes and use it freely in the code like any other list; the program will only calculate as many primes as it needs, yet can calculate arbitrarily many. Furthermore, the primes are comparatively fast to compute, since there are no 'wasted' divisibility tests.

---

And so I solved a number of Project Euler problems in Haskell just so I could use this facility. That being said, the speed of the language itself left me wanting, so I ported a somewhat similar abstraction to C. We'll do this in the opposite order, first defining an isPrime:

BOOL
isPrime(int num)
{
  int curr_prime, index = 0;
  double limit = sqrt(num);  /* only need to test primes <= sqrt(num) */

  if (num < 2) return FALSE; /* 0 and 1 are not prime */
  while ((curr_prime = takePrime(&index)) <= limit) {
    if (num % curr_prime == 0) return FALSE;
  }
  return TRUE;
}

This relies on a takePrime function, which can be called repeatedly to fetch successive primes given an index value. It looks like:

unsigned int
takePrime(int* indx)
{
  unsigned int val = prime_ptr[*indx];
  if (val != UNCOMPUTED) {
    /* already computed this prime; return it and advance the index */
    ++(*indx);
    return val;
  }
  else {
    /* walk upward from the last known prime until we find the next one */
    unsigned int last_prime = prime_ptr[(*indx)-1];
    unsigned int next_prime;
    for (next_prime = last_prime + 1; ; ++next_prime) {
      if (isPrime(next_prime)) break;
    }
    /* memoize it for future calls */
    prime_ptr[*indx] = next_prime;
    ++(*indx);
    return next_prime;
  }
}

Where prime_ptr is an array of integers, memset to some value UNCOMPUTED (I've left out most of the header information, as well as the init() and finished() calls that make this all work).

This takePrime memoizes previously computed values (that's the purpose of the first condition), so if you ask for an index whose prime you've already computed, you simply return it. Otherwise, you take the last prime you computed and test every integer above it until you find the next prime; when you do, you store it in the array and return it.

The C abstraction has a number of shortcomings compared to the Haskell version; namely, you can't pass in an arbitrary index and receive a prime number (last_prime only looks at the prime immediately behind the one you are trying to compute, so if you've only computed 6 primes and you ask for the 9th, you segfault). You also lose the list abstraction (prime_ptr is static and this set of functions is #included, so I chose not to 'export' it), and you can't calculate arbitrarily many primes (the array has a fixed size).

All that being said, changes to correct those are pretty easy to implement; I've never needed them. But this allowed me to brute-force a few problems that might have had more elegant solutions. While I hate on C pretty frequently, gcc/g++ produce some pretty slick executables.