aketa

Monday, October 29, 2018

Linear time suffix tree construction

I needed (well... really really wanted) a linear-time online suffix tree for a project (linear time construction, specifically), and since that sort of thing isn't off-the-shelf it took some time and wrangling to put one together. I eventually started from scratch three times before I got the solution I wanted (i.e. one which is actualfax linear time construction and which always builds a correct suffix tree).

I implement advanced and specialized data structures and algorithms for fun, but I found Ukkonen's algorithm for online linear time suffix tree construction challenging to implement correctly, and it took some hours. It's not too hard to get an implementation to *usually* work, but if you want it to *always* work, there are a number of cases you have to consider carefully. I tend to try to work from original papers - read what the designers had to say and go from there. This is usually very effective, and I get a strong grasp of the problem and the author's solution, but with this paper a number of steps and cases were omitted. I was disappointed to discover this and it took time to determine what was missing and fill in the gaps in a manner consistent with the algorithm designer's intent. Here's Ukkonen's paper: https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

Eventually I looked around the net to see how other people have dealt with this, and sure enough it does seem other people have encountered this before. Some explanations are out there on the net, and some implementation are available, some of which may work. Others definitely don't. I'm pretty confident in mine - certainly, it did everything I needed it to do, and passes a variety of tests.

This is a fast algorithm, though it dances around memory a lot - there's no cache awareness, no effort towards memory locality, that kind of thing. And the expansion over space is pretty dramatic: the in-memory suffix tree is likely to be many times larger than the original document, depending on its entropy. I'm noodling on ways to improve this, arranging the tree to reduce cache misses so it's even quicker to build and perhaps a bit more compact in memory overall.

The linear time suffix tree is here on GitHub, along with some other things I've had fun building: a union find, a burst true, burst sorter, tries, a cuckoo hash, etc. Enjoy.

Wednesday, April 5, 2017

Challenge

Challenges can be big and architectural or big and in the details. Here's an example of one sort of thing I like to do, which is here - that's a C# implementation of a Cuckoo Hash. I used a three-table approach, which allows my drop-in replacement for .NET's default hashmap to be substantially more memory efficient - using about half as much memory to hold the same amount of data with the same access patterns. That means a single server can store twice as much data, so for many applications you'll need half as many servers. At scale, that's a big savings, all for focusing some effort on one little data structure. Also in that project there's an implementation of a burst sort, the only one for .NET last I looked, which is the fastest known way to sort lots of strings in memory - my implementation is about 50% faster than quicksort, the default string sorting algorithm. For certain applications that's a gigantic savings. For most purposes, the default is good enough. I care about when it's not. To get these kinds of results you have to worry about things most programmers ignore, if they know about them at all, like cache lines and instruction pipelining. But I've always felt that to accomplish great things, you have to know how things work all the way down.

Friday, June 5, 2015

Burst Sorter in C#

I wrote a preliminary Burst Trie implementation for Burst Sorting strings. It's available via MIT license on GitHub. At this point it's running about 30% faster than Array.Sort() for sorting arrays of strings representing all English words in Shakespeare's plays; its relative performance will vary depending on the number of strings you're sorting, but generally the larger the set the better off you are using a burst sort than a more traditional sort.

The built-in System.Array.Sort() method in .NET uses quicksort, heapsort, or insertionsort depending on the data; burst sort can beat these due to its cache-locality awareness.

I used these materials to learn about burst tries and burst sorting; I am indebted to their authors:

Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries
Burst Tries, A Fast, Efficient Data Structure For String Keys
Engineering BurstSort: Towards Fast, In-Place String Sorting

Please note that this code may not be exactly production ready for you: your mileage may vary and I'm continuing to work on it. I did not consult any existing source code; this is written from scratch. If you want to use it, make sure it behaves the way you expect. Feel free to reach out to me at pmcculler@gmail.com to ask questions or provide feedback, and of course, you can do the fork/pull thing on GitHub.

Up next for this project are some more correctness tests (yes, there are tests now!) and exploring various performance improvement possibilities, including sub-bucketing as described by Sinha and Wirth in their Engineering BurstSort paper, use of unsafe code, and implementing a Patricia burst trie (as far as I know, nobody has done that), and a number of minor todos scattered throughout the code.

There may be a pretty obvious question which is something like "if you're that concerned about performance, why are you using C# and not C/C++?"

Most performance improvement comes from algorithmic improvement, not point-optimization. Example: running a loop N times instead of N^2 times is better and better the bigger N gets, even if you make each iteration substantially more efficient in the N^2 version.
After the .NET JIT compiler gets done with the IL, similar code is approximately as fast in C# as in C++. You're free to argue with me on this but it's not going to be fruitful; we're talking a few percentage points, not 3x differences. I'm reasonably cautious about avoiding obvious problems here.
People use C#. A lot. C++ is great but it can be a bit of a bear for some problems, so it's handy to have this.
There doesn't appear to be any C# implementation available anywhere else, so there is novelty value.
I get to do whatever I want with my time. ;)

Patrick

Sunday, April 12, 2015

Posting a couple of basic text tools: suffix array, trie

I extracted a couple of data structures I made for some personal projects and put them on GitHub.

One is a basic suffix array, another is a basic trie. Maybe they will be useful to you.

I'll add more over time or if asked - I have various other things here, like a patricia trie, a fractal trie compressor, a shortest common superstring / supersequence calculator. The SCS algorithm was actually kind of fun to build; getting it to run in a reasonable period of time was tricky. I wrote it to create a superstring for an extended English dictionary of 240K words, and it takes about 20 minutes to do that (unparalleled).

My greedy algorithm superstring of a 2,457,521 character dictionary is 1,858,802 characters long, about 75% of the simple concatenation. Greedy algorithm results are usually within a small constant multiple of the optimal shortest superstring, according to widespread research, so the optimal might be 1/3rd as large, but it won't be 1/9th. Don't really have np-complete time on whole dictionaries to make sure though.

The superstring was so I could compress tries a little more than patricia and fractal will allow alone; instead of holding a string of characters in each node, there's a pointer into a dictionary superstring and a length. There's a tradeoff in cache locality of course. In practice it didn't actually save much memory - the patricia node character runs were generally too short. I think real space savings would appear if I created a superstring for all node contents instead of all original words. There's not a lot of room for improvement with 240K-word tries, so I haven't done that yet. All English words from the Google 2012 corpus is going to be another story.

Tuesday, March 17, 2015

"...I have not done interviews for quite a while was hoping to get some guidance/advice from you on the best approach."

I am well and hope you are too!

Here's my general approach when I'm looking.

First thing I do is pull out a copy of Programming Interviews Exposed and work through the examples. Makes my mind sharp for interviews and helps substantially when I'm asked coding questions.
Update LinkedIn & resume with current position, including all the latest appropriate buzzwords.
Update past work in resumes & LinkedIn so it uses cutting edge words that weren't in use at the time, like 'big data' in 2005 or something like that.
Update LinkedIn every day in some way so it comes up in recruiter searches (LinkedIn is actually a first stop for most recruiters at this point).
Look at what contacts are doing and ask about opportunities. Probably have to ask a LOT of people, and honestly the careers page of their company's website is probably a better source of information. May be able to get a referral in though, getting through the first step of the gauntlet automatically.
Look at the start-up scene in the Seattle area, several ways to do this.
Check out the major companies with local presence such as Facebook, Twitter, etc and see what they are up to, submit to anything looking good.
Contact any recruiter who's written to me or I've worked with before and let them know I am available starting soon or now (they love immediacy).
Look at the boards i.e. http://careers.stackoverflow.com/
Sign up with TheLadders. See if they still have a free resume assessment service, that was useful (cursory, not the whole assessment, but a gloss that I found useful).
Consider getting a premium LinkedIn account. Seeing everyone who visited me and the like was nice when I was looking.
Stay relaxed and focused during interviews. People are there to judge you, yes, but you never have to see them again. Who cares what they think. You're getting interview practice, free coffee, and the possibility of a new job. All good things, no downsides that you don't bring to the table yourself.
Consider practicing an interview with someone, even a family member.

I always consider the first interview I have when i'm looking to be a loss - it's rarely a good outcome. Just getting into the groove of it. So maybe interview at a place you're not terribly interested in, for a job you're overqualified for, if you can get such an interview.

People look for 'passion' as much as for skills. I don't know why or what this means, except showing some excitement for new techniques and talking about ways you innovated or at least really cared about working on some product. Highlight all the best things, there's no reason not to. Think about this ahead of time.

That's all that comes to mind. I wish you luck! There's a killer market for great devs out there, so as long as people recognize that you have the skills and a passion for making products awesome, I suspect it will work out well.

Monday, May 19, 2014

A new book: epistemology and metaphysics

In case you wondered what I do with my spare time, I wrote a new book. It's called "What Am I?" and it explores essential ideas in epistemology and metaphysics through a simple fantastical parable. It's lighthearted and should be a quick, fun read that will make you laugh and think.

What Am I? - on Amazon

Feedback graciously welcomed.

Patrick

Thursday, August 22, 2013

A better and easier work life

Many people are unhappy at work. It doesn't have to be that way; a few key behaviors and perspectives will make your work life better and easier.

Take pride in your work.

It's sad when you don't. Those of us who care notice.
If you can't bring yourself to care about your work, quit.

Fulfill your responsibilities.

Among other things, it hurts other people when you don't. Not just you.

Only sign up for work you are sure you can do.

If you feel obliged to sign up for things you can't do, quit.

Do not take responsibility for anything that is not in your purview or the responsibility of a direct report.

If you are held accountable for others' failures, quit.

Do not feel responsible for others' failure to do their work unless they report to you. You must be able to assume that other people will do their jobs.

If this is not practical in your role, quit.

Do not ever lie.

Lying is: intentionally creating or persisting in another person a view of the world that does not match reality. This conceptual error may be harmful to them, and you should not presume to know what will hurt others and what will not.
It's disgusting. You're not a four year old. (If you are a four year old, write me at pmcculler@gmail.com and we'll take it up offline).
A poorly considered work estimate is usually a lie.
If you feel obliged to lie, quit.

As a manager, your team is your work product.

Make them powerful.
You categorically cannot be successful when your team is not.
If you are not rewarded for just and only this, quit.

Tell people what you are doing and how it is going.

If you're not regular and explicit, you are misleading them.

Expect and reciprocate cooperation.

Cooperation is expected and necessary and should be natural and easy.
If cooperation is painful, quit.

Coordinate with your broad team to make sure all scenarios and processes actually work.

If no one cares, quit.

Do beat yourself up over failures, exactly until you get the message.

You are the only one who can do this effectively.

Get better at your work over time.

The straight odds of being in the top 1% of your field are 100:1. Do you work 100x harder than 1% of your colleagues? 50x harder than 2%?

Use your curiosity.

Things you don't know are always the ones that wind up hurting you.
... and helping you.

Look for a better way.

The odds you're doing it the best way you can with the resources you have are miniscule.
If your manager insists that you do it badly anyway, quit.

Prompt for and accept feedback.

Most of the feedback you get will be clandestine, implied, or third hand. Find it.
If you never receive any positive feedback, quit.

You've noticed many "if... , quit." imperatives. Yes. You can be successful, and life doesn't have to be horrible. Work to improve your situation, but remember that you are in an environment that is also a self-perpetuating system. Not all environments will respond to your sincere, thoughtful, and persistent attempts to improve them. Ultimately, you can control yourself, not your environment, but you can and should look for better environments to inhabit, thrive in, and improve.