Avi Das

Home for my code, thoughts and everything else.

Nodeconf 2015: Unconf With the Right Intentions

Conferences can be a great way to get the creative juices flowing, meet people in the community and share stories and problems. They offer great opportunities to learn from core developers building the frameworks that your software depends on.

Nodeconf managed to achieve all this in the rather unusual form of an unconference. At an unconference, the structure, sessions and talks are decided by the community rather than by a committee. That format makes Nodeconf a conference that is not for everyone, so it is important to understand how it works before you make the hike to Walnut Creek Ranch next year.

I wanted to distill the reasons why you might or might not be interested in attending Nodeconf, and how to get the most out of it. You might be interested in Nodeconf if you:

  1. Build for the web: For many attendees, Nodeconf feels like living in the future, because the people in the room are making the decisions and tradeoffs that will shape the future of the web. The discussions around packaging and parceling front-end assets on npm (Modular UI) were especially interesting, as was Isomorphic JS, which covered the challenges of writing identical client- and server-side code. The JavaScript landscape evolves quickly, and Nodeconf offers a fantastic perspective on how that decision making can work.

  2. Publish on npm/GitHub: As someone who maintains projects on npm and GitHub, I found the discussions around distributing node modules very insightful. Issues such as broadening adoption, attracting contributors to GitHub modules and standards for publishing on npm came up, and maintainers of hugely popular modules shared their experiences. Picking a good module scope, having really good examples for beginners to start with and publishing concise yet searchable package descriptions were all emphasized.

Building Realtime User Monitoring and Targeting Platform With Node, Express and Socket.io

Being able to target users and send them targeted notifications can be key to turning visitors into conversions and tightening your funnel. Offerings such as Mailchimp and Mixpanel provide ways to reach out to users, but in most cases you can only do so in post-processing. However, there are situations where it would be really powerful to track users as they navigate your website and send them targeted notifications in real time.

Use Cases

Imagine a buyer who is looking for a car and is interested in vehicles of a particular model and brand. It is very likely that he/she will visit several sites to compare prices. If the buyer has already looked at a few results, there may be an item that fits this user’s profile. If you can prompt and reach out while the user is browsing through the results, it could make the difference between a sale and the user buying from a different site. This is particularly useful for high-price, high-option scenarios, e.g. real estate, car or electronics purchases. For use cases where the price is low or the options are few, e.g. a SaaS offering with three tiers, this level of fine-grained tracking may not be necessary. However, if you have a fledgling SaaS startup, you may want to do this in the spirit of doing things that don’t scale.

Prerequisites

This article assumes that you have node and npm installed on your system. It would also be useful to get familiar with Express.js, the de facto web framework on top of Node.js. Socket.io is a Node.js module that abstracts WebSocket, JSON polling and other protocols to enable simultaneous bidirectional communication between connected parties. This article makes heavy use of Socket.io terminology, so it helps to be familiar with sending and receiving events, broadcasts, namespaces and rooms.
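If those primitives are new to you, here is a minimal server-side sketch of the ones this article relies on; the event names ping and pong and the room name are hypothetical.

var app = require('express')();
var http = require('http').Server(app);
var io = require('socket.io')(http);

// A namespace partitions connections; a room groups sockets inside one
var admins = io.of('/adminchannel');

io.on('connection', function (socket) {
  socket.join('some-room');               // add this socket to a room
  socket.on('ping', function (data) {     // receive an event from this socket
    socket.emit('pong', data);            // reply to this socket only
    socket.broadcast.emit('pong', data);  // everyone except this socket
    admins.emit('pong', data);            // everyone on the namespace
  });
});

http.listen(8080);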

Install and run

Start by cloning the repo, installing the dependencies and running the app.

git clone git@github.com:avidas/socketio-monitoring.git
cd socketio-monitoring
npm install
npm start

By default this will start the server on port 8080. Navigate to localhost:8080/admin in one browser, e.g. Chrome. Now, in a different browser, e.g. Firefox, navigate to localhost:8080 and browse around. You will see the admin page update with the URL endpoints as you navigate through the website in Firefox. You can even send an alert to the user on Firefox by pressing the Send Offer button in Chrome!

Walkthrough

Let’s get into how this works. When an admin visits localhost:8080/admin, she joins a Socket.io namespace called adminchannel.

var adminchannel = io.of('/adminchannel');

When a new user visits a page, we get the express sessionID of the user by calling req.sessionID and pass it to the templating engine for rendering. The session id ensures that we can identify a user across pages and browser tabs.

res.render('index', {a:req.sessionID});

The template sets the value of sessionID as a hidden input field on the page, with the id “user_session_id”.

<body>
  <input type="hidden" id="user_session_id" value="<%= a %>" />
  <div id="device" style="font-size: 45px;">2015 Tesla Cars</div>
  <a href="/about">About</a>
  <br />
  <a href="/">Home</a>
</body>

After the page has loaded, it emits a pageChange Socket.io event. Accompanying the event are the URL endpoint for the current page and the sessionID.

  var userSID = document.getElementById('user_session_id').value;
  var currentURL = window.location.pathname; // the current page's URL endpoint
  var socket = io();

  var userData = {
    page: currentURL,
    sid: userSID
  };
  socket.emit('pageChange', userData);

On the server side, when pageChange is received, a Socket.io event called alertAdmin is sent to the adminchannel namespace. This ensures that only the admins are alerted that the user with a particular session ID and socket ID has navigated to a different page. Since anyone with access to the /admin endpoint joins the adminchannel namespace, this easily scales to multiple admins.

  socket.on('pageChange', function(userData){
    userData.socketID = socket.id;
    userData.clientIDs = clientIDs; // clientIDs is maintained elsewhere in the server
    console.log('user with sid ' + userData.sid + ' and socket id ' + userData.socketID + ' changed page ' + userData.page);
    adminchannel.emit('alertAdmin', userData);
  });

When alertAdmin is received on the client side, the UI dashboard is updated so that the admins have a realtime view of users navigating the site. This is done via jQuery, which appends each new page change to an HTML list as users navigate through the site.

  adminsocket.on('alertAdmin', function(userData){
    var val = " User with session id " + userData.sid + " and with socket id " + userData.socketID + " has navigated to " + userData.page;
    userDataGlob = userData;
    //Dynamic display of users interacting on your website;
    //create the list once, then append a row per page change
    if ($('#panel ul').length === 0) {
      $('<ul/>').appendTo('#panel');
    }
    $("#panel ul").append('<li> ' + val + ' <button type="button" class="offerClass" id="' + userData.socketID + '">Send Offer</button></li>');
  });

Now the admin may choose to send a notification to that particular user. When the admin clicks the “Send Offer” button, a Socket.io event called adminMessage is sent to the general namespace on the server with the user-specific data.

  //Allow admin to send adminMessage
  $('body').on('click', '.offerClass', function () {
    socket.emit('adminMessage', userDataGlob);
  });

When adminMessage is received on the server side, we broadcast the message to that specific user. Since every user automatically joins a room identified by their socket ID, we can send a notification to only that user using socket.broadcast.to(userData.socketID), emitting an event called adminBroadcast with the data.

You could instead broadcast the message to all users, or to a particular room that a subset of users has joined, so you can fine-tune how you reach out as well; a sketch of these alternatives follows the handler below.

  socket.on('adminMessage', function(userData) {
    socket.broadcast.to(userData.socketID).emit('adminBroadcast', userData);
  });
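For comparison, here is a minimal sketch of those alternatives. The room name bargain-hunters is hypothetical and would require users to have joined it earlier via socket.join('bargain-hunters').

  socket.on('adminMessage', function(userData) {
    // To every connected user except the sender:
    socket.broadcast.emit('adminBroadcast', userData);

    // To only the subset of users in a hypothetical room:
    socket.broadcast.to('bargain-hunters').emit('adminBroadcast', userData);
  });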

Finally, on the user’s client side, when adminBroadcast is received the user is alerted with a notification. However, by setting up event listeners you can easily use this for more complex use cases, such as dynamically updating the page results, updating an ads section to show offers, and so on.

  socket.on('adminBroadcast', function(userData){
    alert('Howdy there ' + userData.sid + ' ' + userData.socketID + ' ' + userData.page);
  });

There you have an end-to-end way for a set of admins to track a set of users on a website and send them notifications. This system can be particularly valuable when the user’s primary reason for visiting carries purchasing intent. E-commerce and SaaS platforms have recognized the importance of user segmentation and targeted outreach; this system lets you minimize the latency of that outreach. As a bonus, you get to rely on fully open source tools with broad user bases and support.

This particular example used URL endpoints as part of the data payload, but you can really stretch it to any user event. For example, you can easily track where the user’s cursor is and send that information back in real time. One can imagine high frequency trading firms using this technique in bots to track real time user behavior, e.g. a user’s cursor hovering over a buy button for a ticker, as information gathered for their trades. How much you want to track and react to is an exercise in finding the boundary between being responsive and being creepy toward users.
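As a sketch of the cursor idea, reusing the socket and userSID variables from the earlier snippets (the cursorMove event name is hypothetical):

  // Report cursor position, throttled so we do not emit per pixel of movement
  var lastSent = 0;
  document.addEventListener('mousemove', function(e) {
    var now = Date.now();
    if (now - lastSent > 250) { // at most four updates per second
      lastSent = now;
      socket.emit('cursorMove', { sid: userSID, x: e.pageX, y: e.pageY });
    }
  });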

Props to my friend Shah for working with me on this. If you are doing some level of realtime tracking on your site, I would love to hear about it. Please feel free to send over any other feedback as well.

Bug Hunting With Git Bisect

In large Git projects, feature development often happens in separate branches before they are ready to merge. However, once the merge happens and tests break, it is often challenging to figure out the commit that introduced the bug. Git bisect is an excellent tool for triaging that commit. It works like a binary search, marking commits as good or bad and cutting the search space of commits in half every time.
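The manual flow looks roughly like this (the tag v1.2.0 is a placeholder for any known good commit):

git bisect start
git bisect bad HEAD     # the current commit is broken
git bisect good v1.2.0  # the last known good commit
# git checks out a midpoint commit; test it, then mark it:
git bisect good         # or: git bisect bad
# repeat until git prints the first bad commit, then clean up:
git bisect reset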

However, this process can be quite manual, so git bisect has a run command. You give it a testing script, and based on the script’s exit code git automatically picks the middle commits and continues searching until it finds the breaking commit.
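For example, using a test suite as the oracle (npm test here is a stand-in for whatever command fails on broken commits):

git bisect start
git bisect bad HEAD
git bisect good v1.2.0
git bisect run npm test  # exit 0 marks a commit good, 1-127 (except 125) bad
git bisect reset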

Another neat feature is its ability to log its output, and to record and replay the bisect for further debugging. The git-scm book has some excellent documentation on the complete API and technical details.
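For instance, a session can be saved and replayed later:

git bisect log > bisect.log   # save the good/bad decisions made so far
git bisect reset
git bisect replay bisect.log  # pick up the same bisect again later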

There are still a few manual steps: you want to stash to save and later recover the state of uncommitted work, return to HEAD, and view the log available for record and replay.

For reusability, I wrote the following script to wrap git bisect and its setup into a handy bash function.

# Stash current work and git bisect between the given good and
# bad commit ids, running the given script. git bisect run expects
# the script to exit with 0 if the current commit is good and
# 1-127 (except 125) if it is bad.
gbisect() {
    if [ "$#" -lt 3 ]; then
        echo "usage: gbisect good-commit-id bad-commit-id script <arguments>"
    else
        git stash                # stash current state
        git bisect start         # initialize git bisect
        git bisect good "$1"
        shift
        git bisect bad "$1"
        shift
        git bisect run "$@"      # bisect automatically using the script

        git bisect log           # keep a record for replay
        git bisect reset         # return to the original HEAD

        git stash list
        git stash apply          # restore uncommitted work
    fi
}

If you are using mocha as a test runner, you could use the script as follows (mocha exits non-zero when tests fail):

gbisect 23df33 56dg23 mocha -t 15000

Git is like an iceberg, in a good way. Instead of perusing heavy books on a topic, I generally like learning as I run into challenges. Once something clicks, though, it pays off N times over in your workflow if you use git for work and personal projects.

Scipy 2014: Python as Expression to Push Boundaries in Science

It’s not everyday that the person sitting next to you interacts with Mars Rovers everyday or is trying to build a data pipeline to handle petabyte-scale genomics data. But that was perhaps the key takeway from my first Python conference: a large number of people pushing the boundaries in scientific disciplines and using Python as their means of expression.

I have been using Python for a while now, both at work and for hobby projects, but until recently I had mostly stayed on the periphery of contributions to open source projects. When I learned that the Scientific Python conference was happening right near me in Austin, I was immediately interested. If you buy that there is such a thing as language wars, scientific computing has been one of Python’s key wins. With libraries such as NumPy, Matplotlib and Pandas (and of course IPython), Python has dominated the scientific computing landscape alongside R and Julia.

When such a strong ecosystem is matched by a very welcoming community, you have a recipe for a conference worth being at. Well, if you can get past the imposter syndrome of being at the place with the highest density of PhDs I have ever seen.

Takeaways

  1. Python catching up in areas where it lacked: Performance, distribution, scalability and reproducibility were some of the main themes at the conference, addressing some of the language’s historic weaknesses. Sometimes this happens via adoption of new tools, such as Docker for containerizing work environments for remote co-working researchers. Dependencies on other languages have been one of the major pain points in working with the scientific Python libraries, so it is great to see Conda and HashDist (which I just discovered) take that head on. Interoperability and scalability are two of the main problems Blaze is solving, and Bokeh and Plotly take on the problems of publishing and sharing interactive visualizations in Python.

  2. New tools for my workflow: There are many tools that deserve a space here, but I was primarily excited to discover pyspark, yt, plotly, sumatra/vistrails, hashdist and airspeed velocity. Version control and workflow control are familiar territories for software engineers, but the idea of event control was new to me, something explored in a Birds of a Feather discussion.

  3. Birds of a Feather talks are revealing: Birds of a Feather discussions were sometimes my favorite, with candid sharing of pain points and their solutions from community members. It was also good to learn what the open problems in various areas are, as they often indicate valuable areas to focus on.

Best Chrome Productivity Extensions

How much time during the day do you spend in an internet browser? As a software engineer, my answer is a scary amount. Something I have personally struggled with is avoiding distracting sites: there is too much good content on the internet, and it drains my rather finite attention.

Chrome has been my weapon of choice for browsing the web for a while now. Intuitive, clean and fast, Chrome’s ascent over the past few years has been amazing to watch. Extensions let third-party developers add new features to Chrome, and lately there are a few extensions I simply cannot do without. When you need to buckle up and get something done, these extensions will help you get there.

  1. Pocket: There are enough interesting things to read on the internet for many lifetimes. Just as YouTube’s Watch Later does for videos, Pocket lets me save articles for future reading. With clients available for iOS and Android, saved articles can be accessed anywhere. At any time on social websites, be it Facebook, Quora or Hacker News, there is a lot I would like to read. Pocket turns articles from around the internet into my own customized feed or magazine via its Flipboard integration.

  2. Session Buddy: If you are like me, you context switch a lot between projects during the day. Chrome deals with tabs beautifully, and yet I often find myself with 20 or more tabs, usually researching a particular topic. Tabs tend to follow a train of thought and fit within a particular context. Session Buddy solves exactly this problem, allowing me to save a set of tabs and windows that can be resumed later. For example, at the end of the day at work, I save a bunch of tabs named after the project I was working on. This lets me pick up the chain of thought from where I left off. Switching between contexts is expensive, and Session Buddy is the best solution I have found.

  3. RescueTime: Holding myself accountable for how I spend time on the web has been eye-opening. RescueTime not only tracks how much time you spend on distracting sites during the day, but its desktop application also tracks your non-browser time. Over time, it subtly nudges you to improve your productivity score by avoiding distraction (bravo on the gamification, RescueTime team!). I have also discovered patterns in my work through the tool, such as the benefits of taking regular breaks and how much a standing desk really helps productivity. It cannot, however, track that even on productive sites you can find ways to procrastinate (I am looking at you, StackOverflow).

  4. Block Site: As much as tracking my time on the web helps, sometimes I need to simply block distracting sites. Block Site makes a great complement to RescueTime, allowing me to either visit whitelisted sites only or block blacklisted sites. I have found my productivity levels on RescueTime spike when using Block Site. Often it’s a reflex for me to go to Facebook or Twitter without much thinking; Block Site counts such attempts and shames you with clever 404 comments. I love that with the latest update, you can flip one switch to turn off the distracting websites and focus on the task at hand.

  5. LastPass: Despite the rise of alternative logins, passwords are still an integral part of the web. With a million sites, having passwords lying around can be rather dangerous, and remembering them is time consuming and frustrating. LastPass securely stores your passwords for various sites, syncing them across devices. This saves you from digging through memory for a password or, worse, storing passwords in flat files. It can also log you in automatically on your chosen sites. One of my favorite features is that it warns you if you are using duplicate passwords across sites, making your web experience more secure.

  6. Adblock Plus: OK, this one is controversial. Adblock stops advertisements on websites, keeping you focused and saving time, particularly on video ads, and it can make certain sites load faster. Of course, ads power much of the web, and any website with ads as its primary revenue stream has to reckon with Adblock. I particularly enjoy the cat and mouse chase between YouTube and Adblock: YouTube refuses to play videos while Adblock is enabled, Adblock gets smarter, YouTube catches up.

Connected Car: Meticulous.io @Angelhack

Angelhack looks for startups to come out of hackathons rather than just weekend hacks. As a result, it tends to get very product-driven as teams compete for the Angelpad recognition. The theme this year was Apphack, with mobile apps as the key focus. Not really a surprise, as more and more startups build mobile first.

We worked on the Verizon telematics data. Verizon telematics was something I did not know about before: car companies have been working with Verizon to install a device in cars that sends data to the cloud about speed, location, RPM, the condition of the car in terms of time since the last fill-up, the particular model, and wear and tear. I expect Verizon is able to collect a lot more than the columns we saw. Needless to say, this is a very powerful dataset, and it is surprising that there isn’t more going on in this space.

IPython and Pandas are built for analyzing and visualizing tabular data like this. Questions such as which brand of car gave the best mileage per gallon, or which brand was driven more on the highway versus which was more popular on city roads, were fairly easy to answer, as sketched below. It was not clear whether the data given to us was a uniform sample, but the results were interesting. They are hardly hackathon-worthy ideas, however.
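A sketch of that kind of analysis; the file name and column names here are hypothetical, as the actual Verizon schema differed:

import pandas as pd

# Hypothetical file and columns standing in for the telematics dump
trips = pd.read_csv('telematics_trips.csv')

# Average fuel efficiency per brand, best first
mpg_by_brand = trips.groupby('brand')['miles_per_gallon'].mean()
print(mpg_by_brand.sort_values(ascending=False))

# Share of highway driving per brand, assuming a boolean highway flag
print(trips.groupby('brand')['is_highway'].mean())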

We needed an idea. One dataset had trip IDs tied to driver IDs; the other had per-trip latitudes and longitudes. Aggregating them somehow would be interesting. Since trips could be tied to drivers, a geographical history of most visited places could be derived. Historical driving records could be built from individual trips and the overall history of a driver measured, and from global driving records the best drivers could be found.

Having noted those features, the next step was to think about how this would make sense as a product. Context-sensitive software is hugely in demand, and this is a great dataset for revealing driver patterns. Businesses aware of customers frequenting a particular location can push deals and coupons to those individuals to drive business. The timing of visits can be used to fine-tune the deals even more, e.g. lunch deals around the Grand Central area in NYC would make sense for a busy professional. The brand of car can be used to push vehicle-specific ads. This would also be a treasure trove of information for car insurance companies figuring out insurance rates. Finally, it is quite possible that driving patterns could be used to trigger alerts that detect an accident ahead of time.

It’s hard to argue against driving patterns being valuable data. However, as with any user-specific information used by businesses to personalize experiences, privacy would be a big concern. It was not entirely clear who would own the data: the driver, the provider, or someone else. The driver should be told very clearly what kind of tracking is going on and how the data may be used. Since data like this could easily be used by a malicious third party to monitor you in real time while you are driving, there is inherent danger in it being available without the driver being informed. Tracking is an issue that users of any cloud-aware system need to be aware of.

We did win the Verizon prize at the hackathon, and it was nice to have our eleventh-hour idea so well received. Angelhack put on a good show, and I must mention the apps “Make it Rain” (literally, on the phone screen) and “Kanye” (alerts you if you are near Kanye West) for the hilarity during the presentations!

Hack the Trackers

Online tracking has been a peak topic of debate this year, and by the look of things it will continue to be. The NSA programs, Snowden and the reactions from top tech companies brought more attention to tracking than ever before. It was hence very timely for Evidon/Ghostery to organize Hack the Trackers in early November.

Ghostery is a Chrome extension that displays the trackers on a particular webpage. It also lets you block particular trackers and includes detailed information about them. The emphasis seems to be on making web users aware of tracking and letting them make the choice.

We built Falcon, which we thought would complement Ghostery’s offering quite well. Falcon is a Chrome extension that displays the overall time lost to trackers and which trackers are the most resource intensive. Our hypothesis was that to increase awareness of online tracking, we needed to show tangible ways in which tracking affects the browsing experience. Ghostery provided us with a dataset full of interesting information, in which the average load time per tracker was a key indicator. Even with caching, users can lose the time it takes to load trackers if the trackers load synchronously. This matters particularly on mobile and in locations with poor Wi-Fi, where poorly built trackers slow down the browsing experience.

Falcon Demo

Building a Chrome extension for the first time was not too complicated, as it is very similar to building a web page, and Chrome is about as reliable as platforms get. We ended up being one of the two semi-finalists, the other being a cool way to link trackers with their public figures as a fun way to draw attention to tracking.

This hackathon was a particularly great learning experience. Companies are taking highly innovative routes to glean more information about users. Cookies have always existed, but two other forms of tracking I learned about were autocomplete field tracking and browser fingerprinting. Browser fingerprinting pulls information from the user agent, the OS, installed extensions and other configuration to bind a particular browser to a user, and it can happen completely on the server side. I had only fairly recently learned about the bidding platforms behind display advertising, and it was pretty interesting to see dictionary.com revealing bids in its console in real time as they happened.

Computer security is a dynamic and fast-changing field, and this hackathon drew an interesting mix of people from different niches of the industry. Tracking will continue to be an issue, and it was good to see Ghostery taking the initiative to search for innovations in this space.

Prioritized Date Interval Merge

I ran into this interesting problem lately and wanted to code up a recursive solution in Python. It is essentially an extension of the merge step from merge sort, but for intervals. There is something very satisfying about coding up a recursive solution, as it tends to come out clean, despite the parenthesized line continuations needed here to make the list concatenation work.

Merge date intervals by priority
def merge_interval(low_priority_lst, high_priority_lst):
    '''
    Given two lists of sorted date ranges, return the merged list with
    high_priority_lst ranges preferred over low_priority_lst ranges in
    case of intersection. Partial intervals are allowed.
    '''
    if not low_priority_lst:  # covers both [] and None
        return high_priority_lst
    if not high_priority_lst:
        return low_priority_lst

    # case :               |-------|
    #        |-------|
    if low_priority_lst[0][0] > high_priority_lst[0][1]:
        return ([high_priority_lst[0]] +
                merge_interval(low_priority_lst, high_priority_lst[1:]))
    # case :   |-------|
    #                     |-------|
    elif low_priority_lst[0][1] < high_priority_lst[0][0]:
        return ([low_priority_lst[0]] +
                merge_interval(low_priority_lst[1:], high_priority_lst))
    # case : |-------|
    #            |-------|
    elif low_priority_lst[0][0] < high_priority_lst[0][0]:
        # emit the non-overlapping prefix, push the remainder back for merging
        return ([[low_priority_lst[0][0], high_priority_lst[0][0]]] +
                merge_interval([[high_priority_lst[0][0], low_priority_lst[0][1]]] +
                               low_priority_lst[1:], high_priority_lst))
    # case :      |-------|
    #        |-------|
    elif low_priority_lst[0][1] > high_priority_lst[0][1]:
        # keep the high priority interval, carry over the trailing piece
        return ([high_priority_lst[0]] +
                merge_interval([[high_priority_lst[0][1], low_priority_lst[0][1]]] +
                               low_priority_lst[1:], high_priority_lst[1:]))
    # case :  |-------| |---| |----|
    #        |-----------------|
    else:
        # the low interval is fully covered by the high priority one: drop it
        return merge_interval(low_priority_lst[1:], high_priority_lst)
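A quick usage sketch with made-up day-number intervals (any comparable endpoints, such as datetime objects, work the same way):

low = [[1, 5], [8, 10]]   # e.g. tentative bookings
high = [[3, 6]]           # e.g. confirmed bookings

# The overlap [3, 5] goes to the high priority list:
print(merge_interval(low, high))
# [[1, 3], [3, 6], [8, 10]]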

Complexity: each recursive call either consumes an interval or splits one at an existing endpoint, so there are O(n + m) calls for inputs of length n and m. The list slicing and concatenation copy lists on every call, though, making this implementation quadratic in the worst case; an index-based iterative version would bring it back to linear.