MongoDB and $exists Queries

March 12, 2014 rob other

Since Mongo is a schema-less database, you’re not guaranteed to have the same field in all your documents. Often you’ll find yourself querying a collection to find only the documents that contain a particular field:

db.myCollection.find({myField: {$exists: true}})

As it turns out, the $exists query does not use any of the indexes you’ve provided. If you’re working with a large dataset, this type of query can take quite a long time.

One solution that we’ve found that works great and is much much faster is to use a sparse index on the field, and then query on a degenerate regex:

db.myCollection.ensureIndex({myField: 1}, {sparse: true})
db.myCollection.find({myField: /.*/})

Obviously this won’t work if myField is not a string, but there are definitely other ways to avoid the $exists query that does not use an index.

Why We Ditched Socket.IO

December 10, 2013 rob javascript

Last summer when building sweetiQ, we had a need to use WebSockets for real-time communication with the client. The obvious choice was to use socket.io, which is the library that everybody uses since it simplifies a lot of WebSocket details, and lets your code work on browsers that don’t yet support WebSockets (this was more important a year ago than it is today, but meh).

I’m going to tell you why it was a bad choice.

Our first major issue was that when clients left the browser open for a long time, the socket.io connection would die. Not the end of the world, since socket.io has a reconnect capability. Unfortunately we had to do a bit of logic on socket reconnects, and the documented “reconnect_failed” event was never firing. I discovered this issue on GitHub, which showed that other people were having the same problem, so it wasn’t just me. I posted a tiny code change on the issue so that other people could use the event properly. I can’t remember at this point where I saw the fix, whether in an existing pull request or on Stack Overflow; whatever the case, I didn’t make my own pull request.

People continued to have the same problem, and many people commented on the issue saying that the little fix I posted works. There are now three outstanding pull requests that supposedly fix the issue, yet none of them have been merged. It has now been over two years since this issue was opened, and the two-line fix is still not in the core socket.io.

We started seeing some other issues: socket.io couldn’t get through firewalls, certain antivirus software would block the connection, etc. I’m not 100% certain what the issue was since the clients we had were on large corporate networks somewhere out there on the Internet.

One of my coworkers discovered SockJS, a library that provides a WebSocket-like system using the exact same interface as a native WebSocket (onopen, onclose, onmessage) and has server-side implementations in many different languages. That was handy: we periodically have issues with Node.js, whose libraries are less mature than those of the other language we use daily (Python), so we had an option in case we wanted to switch.
Since we had no real idea how to solve the corporate firewall problem, we ended up giving SockJS a go. The migration was pretty simple since the way you write code with SockJS and the way you write code with socket.io aren’t that different.

It worked beautifully. Every single one of our clients was able to use the WebSocket-based features without any problems at all, including those behind corporate firewalls. To date we have not found a single bug with it, and it continues to work great a year into production without any updates.

If you’re building a web app that uses any sort of WebSocket functionality, I would recommend using SockJS. If you’re not using any sort of library then you’re even better off, as it is a drop-in replacement: everything works exactly the same as in the normal WebSocket specification.
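
To illustrate what “drop-in” means here, a minimal sketch: client code written only against the onopen/onmessage/onclose handlers doesn’t care which constructor produced the socket. The `connect` helper, the `FakeSocket` stand-in, and the “/echo” URL are all made up for illustration; none of this is from our codebase.

```javascript
// Hypothetical helper: works with any constructor that exposes the
// WebSocket-style interface (onopen, onmessage, onclose, send).
function connect(SocketCtor, url, handlers) {
  const sock = new SocketCtor(url);
  sock.onopen = () => handlers.open && handlers.open(sock);
  sock.onmessage = (event) => handlers.message && handlers.message(event.data);
  sock.onclose = () => handlers.close && handlers.close();
  return sock;
}

// Stand-in socket so the sketch runs without a server or a browser.
class FakeSocket {
  constructor(url) {
    this.url = url;
    this.sent = [];
  }
  send(data) {
    this.sent.push(data);
  }
  // Hooks that simulate what the network would normally trigger.
  simulateOpen() { this.onopen(); }
  simulateMessage(data) { this.onmessage({ data }); }
  simulateClose() { this.onclose(); }
}

const log = [];
const sock = connect(FakeSocket, "/echo", {
  open: (s) => s.send("hello"),
  message: (data) => log.push(data),
  close: () => log.push("closed"),
});
sock.simulateOpen();
sock.simulateMessage("hello back");
sock.simulateClose();
```

In a browser the same helper could be handed the native `WebSocket` constructor or, with the SockJS client script loaded, the `SockJS` constructor, with an appropriate URL for each.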

.NET and Startups

December 6, 2013 rob other

I’ve been reading through the comments on this post that talks about how not many startups are using .NET and how the .NET conferences and meetups are largely older people.

A bit of a disclaimer before I get started: I worked with .NET for about 2 years. I used C# and VB.NET, later converting the VB.NET code to IronRuby because I was able to express complicated domain-specific logic (in this case it was stock trading logic) very easily using IronRuby in ways that were just not possible in VB.NET. As an outsider to the .NET world I was impressed that C# actually keeps up-to-date with modern programming techniques (type inference, async/await, anonymous delegates), unlike other C#-like languages that I’ve used in the past that seem to start thinking about adding them maybe 5-10 years after they’ve been mainstream in other languages.

So the question: why are the cool jobs not using .NET? Since “cool jobs” often is synonymous with “jobs at a startup or small company” I will use the two interchangeably.

My answer comes from this quote:

Why? I’d rather spend the time honing my skills in one area than learning languages I’m unlikely to ever use. Don’t want to be a ‘Jack of all trades, master of none’.

This is what I believe is the general mentality of the .NET world: the idea that what you are doing right now is unlikely to change quickly, so the best thing you can do is to master a single thing. In the enterprise world this is a virtue: your job is unlikely to go away, management is not likely to adopt a fancy new technology (how many places are still using COBOL?), and you have the budget to be able to hire specialists in the odd chance your project needs expertise that is not on hand.

On the other hand the startup/small business world that does the “cool things” is the complete opposite – the forces of natural selection are at work. Companies that embrace technologies that get you to market faster, that allow you to pivot faster, and that allow you to prototype-and-fail faster will be the ones that survive to become big. I don’t have valid evidence that technologies like Node.js, Ruby, or Python satisfy any of these better than C# or VB.NET, but from my anecdotal experience this is certainly the case. This effect will feed on itself: .NET programmers either adapt to the startup environment, or they quit/fail and head back to the enterprise; we end up with a startup community full of people who either never did .NET, or people who came from it to “cool” technologies after seeing it fail them. Couple this with the fact that in the .NET world it is customary to pay for everything – imagine having to shell out cash every time you did an npm install – and the .NET ecosystem just does not lend itself well to startups.

To answer the question, “why are the cool jobs not using .NET?”
It may very well be that there are cool companies out there using .NET. Maybe they just aren’t surviving long enough for any of us to hear about them.

How To Get to the First Page of Hacker News

November 26, 2013 rob other

Here are some strategies for getting to the first page of HN:

  • Talk about some government spy story.
  • Talk about how you had trouble getting into the United States.
  • Talk about how you hate MongoDB, Node.js, or Ruby (or X, where X is some language or technology that is not cool according to Hacker News).
  • Talk about how you love Go.
  • Talk about how you quit your job and launched a startup.

Here’s how you might get to the first page:

  • Post something genuinely interesting.

Here’s how to get downvoted immediately:

  • Talk about the pros of MongoDB, Node.js, or Ruby (or X, as defined above).
  • Talk about how you left your startup to get a salaried job with good benefits and sane working conditions.

LED Grow Lamp

November 15, 2013 rob hardware

After several hours of snipping wires and components, soldering them together, and the occasional drilling, I’ve completed a prototype for a new project that I’ve been pondering for a while: an LED grow lamp. The project is to experiment with growing plants without the use of sunlight and using low-power electronics. This is perfect for my apartment since it doesn’t get a lot of sunlight in some rooms, and during the winter months the day doesn’t last very long so even when there is sunlight coming into a room, it doesn’t stay.

Here’s the prototyped board:

[Image: board view]

The circuits are fairly simple. As shown in the image, the board has 6 clusters, each a mini-circuit containing 9 red LEDs, 1 blue LED, and 3 resistors. Each cluster has two rows of 3 red LEDs wired in series, along with a row down the middle that has the remaining 3 red LEDs and the blue one, also wired in series.
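
For anyone sizing resistors for a similar board: each series row needs one current-limiting resistor, and the usual rule of thumb is R = (supply voltage - sum of LED forward voltages) / current. Here is a rough sketch of that arithmetic. The forward voltages (1.8V red, 3.0V blue) and the 20mA target are ballpark figures assumed purely for illustration, not measurements from this board, so check the datasheets for your own LEDs. `seriesResistor` is a made-up helper name.

```javascript
// Series resistor for one row of LEDs: R = (Vsupply - sum(Vf)) / I.
function seriesResistor(supplyVolts, forwardVolts, currentAmps) {
  const drop = forwardVolts.reduce((sum, v) => sum + v, 0);
  const headroom = supplyVolts - drop;
  if (headroom <= 0) {
    // The LEDs would not light reliably: the supply cannot cover their drop.
    throw new Error("LED forward voltages exceed the supply voltage");
  }
  return headroom / currentAmps; // ohms
}

// Assumed values, not measured: 1.8 V per red LED, 3.0 V for the blue one,
// 20 mA per row, 9 V supply.
const RED = 1.8, BLUE = 3.0, CURRENT = 0.02, SUPPLY = 9;
const redRow = seriesResistor(SUPPLY, [RED, RED, RED], CURRENT);
const mixedRow = seriesResistor(SUPPLY, [RED, RED, RED, BLUE], CURRENT);
console.log(Math.round(redRow) + " ohms, " + Math.round(mixedRow) + " ohms");
```

The check on headroom matters most for the middle row: four LEDs in series leave very little of the 9V to spare, so slightly higher forward voltages than assumed here would push it over.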

I’ve mounted the board on a beer capper:

[Image: the board mounted on the beer capper]

It works surprisingly well because the beer capper is adjustable, so I can put the board at whatever height I like. It also has a handle that more-or-less locks the board in place. This will serve well until I have a chance to build something a bit more proper to hold it up.

The pot contains spinach, chosen because it grows well in colder temperatures and because it acts somewhat as a control for the experiment: when grown inside it tends to be all spindly and overreach towards the sunlight. If that doesn’t happen here, then I know the lamp is working well.
Another unknown is the battery: I’m using a 9V battery from the dollar store to power the system, and I have no idea how long it will last.
I’m also wondering how the angle will affect the growth of the plants. It might be better to grow them with the board mounted directly above instead.

Quixercize

October 12, 2013 rob other

I just launched a little app called Quixercize, which is an automated implementation of the 7-minute workout. It’s really simple: just click “Start” and follow the instructions. It helps to know the workout beforehand, so I would quickly look at the pictures in the PDF here before you try it out.

It works well in Firefox and Chrome; I haven’t tested other browsers yet. It should work fine in any browser that supports HTML5 audio.

If you have any suggestions, don’t hesitate to let me know.

MongoDB Disk Usage and Compacting

October 10, 2013 rob programming

An interesting thing I noticed today about MongoDB. We have a collection called “scheduler” which stores various tasks that need to be done. Here’s the disk footprint, scaled to be in GB:

> db.scheduler.stats().storageSize / 1024 / 1024 / 1024
247.55508625507355

Here’s the amount of storage actually being used, scaled to MB:

> db.scheduler.stats().avgObjSize * db.scheduler.count() / 1024 / 1024
784.5822143554688

The percentage of data in use:

> db.scheduler.stats().avgObjSize * db.scheduler.count() / db.scheduler.stats().storageSize * 100
0.24546323483519988

So the collection has allocated 248GB of space but is only using 785MB of space, which is 0.25%. Seems a bit wasteful, no?
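
If you want to run the same check on more than one collection, the arithmetic fits in a small helper. This is just a sketch: `usagePercent` is my own name for it, it assumes a stats object carrying the avgObjSize, count, and storageSize fields, and the sample numbers below are made up.

```javascript
// Percentage of allocated storage that actually holds documents.
// `stats` is assumed to look like the output of db.collection.stats().
function usagePercent(stats) {
  if (!stats.storageSize) return 0; // empty collection: avoid dividing by zero
  return (stats.avgObjSize * stats.count) / stats.storageSize * 100;
}

// Made-up numbers for illustration: 1,000 documents of 512 bytes each
// sitting in 2 MB of allocated storage.
const sample = { avgObjSize: 512, count: 1000, storageSize: 2 * 1024 * 1024 };
const pct = usagePercent(sample);
console.log(pct.toFixed(2) + "%"); // prints "24.41%"
```

Pasted into the mongo shell, the function could be called as `usagePercent(db.scheduler.stats())`.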

Fortunately, there is a simple solution: compacting. Run the repairDatabase command (this can take quite a while – especially when you’re at 0.25% usage):

> db.repairDatabase()
{ "ok" : 1 }

When done, it looks like this:

> db.scheduler.stats().avgObjSize * db.scheduler.count() / db.scheduler.stats().storageSize * 100
95.75198140092786

Unfortunately repairDatabase requires a lot of free disk space to run. If your DB server doesn’t have the space, then you can’t run a repair.

The trick to get around this is to import all the data onto a machine that does have space. My workstation had about 1.7TB free, so I just did a mongodump to my local machine and ran the repair there:

mongodump --host server  # download data
mongorestore --host localhost  # load it into my local mongo
echo 'db.repairDatabase()' | mongo localhost/scheduler   # do the repair
mv dump dump.bak   # back up the data (just in case)
mongodump --host localhost  # dump my local data
echo 'db.dropDatabase()' | mongo server/scheduler  # Drop the DB on the server
mongorestore dump --host server --drop  # restore the DB on the server

Now you should be using much less space on your server, and in theory your DB server should be running faster!

STEM Shortages Report

September 12, 2013 rob economics

I just read this report that claims that the constant media hype about “a shortage of STEM workers” is actually false, and has been false for many years. The authors take a data-driven approach to demonstrate that there are in fact easily enough STEM workers available to fill the jobs that are being created. They claim that STEM companies are constantly cutting salaries and benefits like pensions, when they would be doing the opposite if there were too few workers.

I respect data-driven approaches. The world needs more people doing things this way (or perhaps the world just needs more people paying attention to people using data-driven approaches). But I’m going to go ahead and point out a few problems with this post anyway, based entirely on my anecdotal evidence – so take it as you will.
I’ll go over two problems in detail: the first is that they are assuming that all STEM graduates are created equal; the second is they seem to be cherry-picking a few companies who are collapsing and claiming that it is true across the entire industry.

Part of my job is as a technical recruiter. I get many applicants every day: new graduates, old-timers, immigrants, students, etc., the majority of whom have some sort of STEM education (many have a Master’s degree). I’ve seen a wide variety of people, and there are two main problems with most of them.

  • They lack the right skills. STEM jobs are highly specialized, and just because someone has a STEM degree doesn’t mean they can fit in any STEM job. I’ve interviewed mechanical engineers and telecom guys who seem to know their stuff in their specific domain, but don’t know much about programming or software engineering beyond for loops. While it is fine that they are smart, there is a lot more to building software than simply learning a programming language – concerns like architecture, optimization, or even just getting unstuck from a difficult problem are things that take months to years to learn how to do well.
    Now, you can go into a big talk about how smart people are capable of picking up whatever language you need them to pick up, and I completely agree. If I were to get an application from an experienced programmer who didn’t know any of the languages we used, I would probably still give considerable thought to hiring them. The problem that I’ve had with hiring people that are smart but have no experience is that they seem to have a number of bad habits: global variables, breaking widely-used interfaces, not writing tests (or worse: not bothering to run the tests that are already written). These are all things that can be kept in check with good quality standards and supervision, but unfortunately a startup does not have the manpower to constantly be watching the newbies.
    In addition to lacking proper technical skills, many programmers lack important soft skills: acceptance of responsibility, communication, desire to do a good job, ability to deliver, etc., most of which are not exercised at all in school. Even just the simple process of creating resumes seems to be difficult – some people don’t use spellcheck, or they blatantly lie about their knowledge and I find out during the interview or during the first week of their employment.
  • They don’t seem to be able to do the job. I’ve interviewed hordes of computer science graduates (some of them Master’s graduates) who can’t code whatsoever. I give coding problems during interviews that are based on some things I’ve seen while coding (for example: dealing with recursive data structures like folder hierarchies) and many of the people who supposedly know how to code cannot solve them – even given the option to complete it at home, since I know some people are nervous during an interview. When I’m trying to hire a person, I want people who will be able to come up with solutions to problems.

So yes, there are a lot of people who have STEM degrees. That does not mean that there are enough people qualified for the available STEM jobs.

The second problem with the article is they are claiming that companies are downsizing; that they aren’t interested in hiring more STEM workers. While this may be true for some of the older firms that are no longer competitive, it is definitely not the case for newer companies. Here are a few examples of companies that are giving a wide variety of extra perks to people who work for them. Not a week goes by without a recruiter contacting me for a position at some job in town (and this is in Montreal, which is not exactly a booming tech town) – and it is the same for most of the people I have ever worked with.

The issue is not that there are too few STEM workers. The issue is that there are too few qualified STEM workers. The old days when somebody with some basic programming skills could land a job are gone: libraries and readily available APIs have automated most of the grunt work. The only work left is what computers can’t do: coming up with creative solutions to difficult problems. That is what we have trouble finding in the people we look at, and it is what the “crisis” is about.

Distributed Coding at sweetiQ

April 16, 2013 rob programming

It’s been a long time since I really wrote anything on this blog, so it is time for a little update.

Just over a year ago I joined another startup called sweetiQ. It’s a step up from the other startups that I’ve worked for in that it has actually gone live and it is actually making money. It’s also interesting because this is the longest amount of time I’ve spent working for a single company.

The system uses RabbitMQ for driving a distributed computing cluster that essentially implements the Actor model of computation, except rather than working with threads within a single machine we’re using multiple processes on multiple machines (100+). Each process has a set of endpoints that listen on different queues, and RabbitMQ manages dispatching messages to the different worker nodes.

One of my major contributions is what I’m calling a “distributed job control system.” There are many steps in each “job” that the system must do, and each step of the job may be handled by a different process across different machines. As a particular computing job becomes more complex, several problems arise:

  • managing the dependency structure between various components of each job. In a simple sequential system you can do A, then B, then C; the dependency structure is very simple: if x depends on y, do x after y. In a distributed system B may not depend on A, so you can do them in parallel, but C depends on the output of both A and B so the system cannot start C until both A and B are done. It is not possible to just have the node handling C wait for a response from both A and B, because the messages from A and B may not be delivered to the same node that can handle C. Even more complex, in certain cases C may not need the computation from A but at other times it does – for example if we’re aggregating social data from different networks but the user hasn’t linked their Twitter account, we don’t need to fetch data from Twitter.
  • the possibility of failure increases – sometimes a node will lose its database connection. Sometimes a node will die (example: we use Amazon spot instances that at any time can just shut down). Sometimes a node that attempts to fetch something on the Internet may fail for whatever reason (the API version of Twitter’s fail whale is something that has happened relatively frequently). In this case the process needs some sort of elegant failure handling – often the solution is just to try again later. Other times we need to send a message to the user informing them that they need to take some sort of action (if a user changes their Facebook password, it will invalidate the stored OAuth token).
  • need to rate-limit certain jobs – some jobs can have many sub-components done in parallel, and we have encountered some problems sending out too many messages in parallel. The first and most obvious problem that we hit is that the database will choke; however, once we learned how to use MongoDB properly this became a non-issue (having scaled both MySQL databases and MongoDB databases, I can tell you I am fully sold on MongoDB at this point). The bigger issue was a problem of process starvation: at peak times jobs will begin spawning at an enormous rate, and if we keep sending parallel messages for stage 1, the computation nodes spend most of their time processing messages for stage 1 (Rabbit has no concept of priority). There is a need for the control system to detect starvation and alleviate it (there are a variety of different ways we can do this).
  • recursive job types – our data is hierarchical. A user can categorize components of what they “own” in groups and sub-groups, and may want to be able to break down aggregated information at each level. Each job may need to perform itself separately on a child node, which may in turn need to perform itself on its own children, etc. Having some sort of way to easily specify recursive sub-jobs saves a ton of time.

What I ended up building is a system that takes a high-level description of a distributed job and controls that job across the different nodes. Rather than having each endpoint communicate directly with one another, they communicate with the central controller that tracks their progress and makes decisions on what control points to activate. The huge benefit of this is that it is a separation of concerns: the endpoints can focus on doing their particular computation well, while the “bigger picture” concerns like starvation and node failure can be handled by the control system.
The system can handle recursive job structures and in fact it can handle any type of sub-job: any job can trigger any other job as a child, including itself. This makes it trivially easy to run one component of a job so that you don’t need to go through everything in order to do what needs to be done. It also allows you to remain DRY: you can abstract out certain commonly-used components and use them as a “library” of sorts to compose more complex jobs.
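
To make the dependency-tracking idea concrete, here is a toy sketch of a controller that knows each step’s dependencies and only releases a step once everything it depends on has reported back. This is not our implementation: the `JobController` class, its method names, and the A/B/C job description are made up for illustration, and it leaves out RabbitMQ, failure handling, rate limiting, and sub-jobs entirely.

```javascript
// Toy controller: tracks which steps are done and releases a step only
// once all of its dependencies have completed.
class JobController {
  // `deps` maps each step name to the list of steps it depends on,
  // e.g. { A: [], B: [], C: ["A", "B"] }.
  constructor(deps) {
    this.deps = deps;
    this.done = new Set();
    this.dispatched = new Set();
  }

  // Steps whose dependencies are all done and that have not been sent yet.
  // Each step is returned only once; a real system would publish it to a queue.
  nextSteps() {
    const ready = Object.keys(this.deps).filter(
      (step) =>
        !this.dispatched.has(step) &&
        this.deps[step].every((dep) => this.done.has(dep))
    );
    ready.forEach((step) => this.dispatched.add(step));
    return ready;
  }

  // Called when a worker reports a finished step; returns newly ready steps.
  complete(step) {
    this.done.add(step);
    return this.nextSteps();
  }

  // True once every step in the job description has completed.
  finished() {
    return Object.keys(this.deps).every((step) => this.done.has(step));
  }
}

const job = new JobController({ A: [], B: [], C: ["A", "B"] });
const first = job.nextSteps();    // A and B can run in parallel
const afterA = job.complete("A"); // nothing new: C still waits on B
const afterB = job.complete("B"); // now C is released
const afterC = job.complete("C"); // nothing left to dispatch
```

Because the workers only ever report back to the controller, it doesn’t matter which node happens to receive the results of A and B. In a sketch like this, the optional-dependency case (no linked Twitter account) would amount to building the job description without that step.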

The code is not currently available; however, we are trying to figure out the legal implications of open-sourcing the software. Ideally we’ll figure all this out in the near future, and I’ll be happy to release it for everyone to play with.

Shameless company promotion: If this type of work interests you, send me a message. Like most dev shops, we’re always happy to bring in smart folks.

Typing as seen by Fanboys

February 21, 2013 rob programming

Thought this was kind of funny:
