Making Everyone Go Away (bennjordan.com)
267 points by tshadwell on June 2, 2014 | 45 comments


The traditional way to do this is taking a stack of aligned images and applying a median filter: http://www.jnack.com/adobe/photoshop/fountain/ This relies on each pixel, in the median case, not having a tourist in front of it. There might be tourist destinations so busy that, over a set of photos, certain pixels contain a tourist more often than not.
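For anyone wanting to try it, a minimal sketch of that median-stack approach in Python (numpy + Pillow; the directory of pre-aligned frames is an assumption):

    # Median-stack a set of aligned photos: transient objects (tourists)
    # drop out as long as each pixel is unobstructed in most frames.
    import glob
    import numpy as np
    from PIL import Image

    paths = sorted(glob.glob("aligned/*.jpg"))   # hypothetical input directory
    stack = np.stack([np.asarray(Image.open(p)) for p in paths])  # (N, H, W, 3)
    background = np.median(stack, axis=0).astype(np.uint8)
    Image.fromarray(background).save("background.png")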

Or you can just do it on Christmas Day: http://www.ianvisits.co.uk/blog/2008/12/25/deserted-london/


Just so it's clear: the median-filter way of doing this doesn't rely on finding two images with non-overlapping people; it uses the open spaces from all the images. As such, you don't need an exponentially growing number of images.

You'll still run into issues with two things:

1. Someone napping or being otherwise still throughout your photos will show up in the finished product.

2. Systems which are stationary but put out a lot of internal movement (trees, video screens) will likely show up as random-colored pixels within the range of their colors. For trees this would look like a blur. For TV screens it would probably end up gray-ish or staticky.


This reminds me of how the first photograph of a human was taken. Louis Daguerre took pictures of busy streets in Paris, but because the photographs had an exposure time of around 15 minutes, the streets looked empty. The exception was a man having his shoes shined, since he had been standing still for the duration of the exposure.

http://petapixel.com/2010/10/27/first-ever-photograph-of-a-h...

This can be considered a kind of analog filter (though wouldn't this be an average rather than a median filter?)


I think the best solution for tree blur would be to (after applying the median stack mode) add any picture in the stack (that has clear trees) and use a layer mask to just paint the trees in.


This is a reasonably well known problem in video processing, often referred to as "background extraction." It mostly amounts to running local outlier rejection on the video frames then generating a composite image. There are better and worse algorithms for this, but it's just noise rejection. Start with a median filter, tweak window size and number of frames, exploit color if desired.

Key trick is LOCAL outlier rejection. You don't take the median of the global dataset, you take the median of a subset of frames. Then you do it to a subset of results, and so forth until you get a pretty image. Then you can highlight problem areas and go back and try to sample them from different data, depending on how much you care about that. An incidental benefit of this is that it lets you dramatically speed up the job by throwing CPU cores at it, if that's something you care about.

(Lots of relevant academic papers for after that.)
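A hedged sketch of that subset-then-combine idea (chunk size, worker count, and the use of a plain median at both levels are my assumptions, not taken from any particular paper):

    # Median each small chunk of frames on its own core, then take the
    # median of the chunk results; the outlier rejection stays local in time.
    import numpy as np
    from multiprocessing import Pool

    def chunk_median(chunk):
        # chunk: (k, H, W, 3) uint8 array of consecutive frames
        return np.median(chunk, axis=0)

    def background(frames, chunk_size=16, workers=4):
        chunks = [np.stack(frames[i:i + chunk_size])
                  for i in range(0, len(frames), chunk_size)]
        with Pool(workers) as pool:
            partials = pool.map(chunk_median, chunks)
        return np.median(np.stack(partials), axis=0).astype(np.uint8)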

The problem encountered in the article is that 2,100 frames at 22.4 MB per frame napkins out to 5.8 gigapixels uncompressed. For reference, at 30 FPS that would amount to 70 seconds of continuous video. Using high-res stills is going to balloon your storage cost and processing time, which has nothing to do with the underlying problem.

A good workaround if you want a high-resolution result would be to do processing at a reduced resolution, then upscale from there. E.g. if you drop resolution by 1/4 for processing, you could take the output and for each pixel find the best matches in the source data and sample a larger window from those to get a full-resolution result.
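A rough sketch of one reading of that (my interpretation: build the background at reduced resolution, then fill each full-resolution pixel from whichever source frame agrees best with the upscaled low-res background):

    import numpy as np
    import cv2  # OpenCV, assumed available

    def upscaled_background(frames, scale=4):
        # frames: list of full-resolution (H, W, 3) uint8 images
        small = [cv2.resize(f, (0, 0), fx=1 / scale, fy=1 / scale,
                            interpolation=cv2.INTER_AREA) for f in frames]
        bg_small = np.median(np.stack(small), axis=0).astype(np.uint8)
        h, w = frames[0].shape[:2]
        target = cv2.resize(bg_small, (w, h),
                            interpolation=cv2.INTER_LINEAR).astype(np.int16)
        stack = np.stack(frames).astype(np.int16)       # (N, H, W, 3)
        err = np.abs(stack - target).sum(axis=-1)       # per-pixel mismatch
        best = err.argmin(axis=0)                       # best frame per pixel
        return np.take_along_axis(stack, best[None, ..., None],
                                  axis=0)[0].astype(np.uint8)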

Or you could use one of those nifty video super-resolution algorithms that have been popular in recent computer vision papers, depending on what you chose to do when you captured the data and what you feel like implementing.

Times Square is still problematic, mainly because it has persistent crowds during much of the day. People move out of the way; crowds don't unless there are gaps (which may be created by e.g. stop lights or transit arrival times). Best advice there is to catch it when the traffic is less dense, or when it's disrupted (movie filming, an accident, random variation).


Talk of 'video' makes me wonder if there's a way to 'hack' an encoder (h264/AVC?) to use P- and B-frame information to mark moving areas (that's what they're supposed to be good at), so that "only" still data (the I-frame minus what moved) remains. Now you've got a lot less data to churn.

Also, in this process, could marking pixels with "stillness" probabilities produce a blending factor (alpha channel?) to apply before flattening the I-frames?
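One way to approximate that without touching the codec at all (my sketch; it uses per-frame deviation from the temporal median as a stand-in for the motion information an encoder would give you):

    # Blend frames with per-pixel weights: pixels that sit close to the
    # temporal median count as "still" and dominate the result.
    import numpy as np

    def stillness_blend(stack, sigma=10.0):
        # stack: (N, H, W, 3) uint8 aligned frames
        stack = stack.astype(np.float32)
        med = np.median(stack, axis=0, keepdims=True)
        dev = np.abs(stack - med).mean(axis=-1, keepdims=True)  # (N, H, W, 1)
        weights = np.exp(-(dev / sigma) ** 2)                   # stillness score
        return ((weights * stack).sum(axis=0) /
                weights.sum(axis=0)).astype(np.uint8)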


I may be completely off-track here, but this seems to happen when VLC drops frames in an h264 source: you get green/black blocks except for the moving parts of the image. Maybe libvlc/libav could help you here?


Speaking of video, I wonder how much grain came from the camera sensor - I'd start with some denoising, just to try to knock that out.


What is believed to be the first human being captured in a photograph, by Louis Daguerre in 1838, was a shoe-shine boy in Paris.

The rest of the people around him were not captured by the long exposure, but he stayed still long enough to register.

http://www.dailymail.co.uk/news/article-1326767/Louis-Daguer...

Technology marches on, but the song remains the same.


Peter Funch has a series of compositions like this, only he is adding rather than removing. http://www.v1gallery.com/artist/show/3 Scroll to the middle for examples of everyone holding a folder, wearing red, yawning, etc.

Pretty intensive work to get the photos, shadows and composition together.

Edit: a link directly to an example http://petapixel.com/assets/uploads/2009/12/babel11.jpg


Hugin can do some of this, too. I use it mostly for stitching panoramas together but it is possible to hold the camera still and catch several 'instances' of a moving object. For example, here is someone pier jumping into the sea, and you can see four instances of her while the other people (and of course all stationary objects) are perfectly normal: https://imgur.com/h5Ysb6b

There was a news story I saw recently with a photographer doing this near an airport and had dozens of planes in the final image, like a big traffic jam in the sky.



Actually, the update below says that it was basically made in Photoshop, because it includes planes that apparently don't serve that airport and the relative sizes of the planes were off. Still a neat idea, but probably hard to capture precisely like that without an automated stationary camera.


Google+ Auto Awesome has a feature that does this for you. If you take multiple pictures of a subject with your phone it will automatically give you a photo with all the moving people/cars/etc from the scene removed.

The second feature listed on this page: http://googlesystem.blogspot.ca/2013/10/auto-awesome-action-...


Yep. And as much as I like the latest version of Google+, sometimes it is funny:

Removing the kids and their dad playing soccer so I can get a better view of the toolshed. ;-)

(Tbh: AutoAwesome also created an "action version". I also guess they are data mining which AutoAwesomes we keep and which ones we delete.)


Unless I'm entirely wrong, this is the same Benn Jordan of Flashbulb fame. A hilariously talented individual, if you haven't already listened to his music, you really should.

http://theflashbulb.bandcamp.com/album/nothing-is-real


I too was surprised to find that he was into tech stuff.

And yes he's done some really cool records.

(It does say on his homepage that he is The Flashbulb)


He posted some videos on YouTube a while ago of a live visualization system he made:

https://www.youtube.com/watch?v=t9jW3SYTtho

Have to agree with existencebox, he is hilariously talented. His blend of live guitar, drill and bass and acid is awesome (slightly annoying video editing here):

https://www.youtube.com/watch?v=4_SxlRQhHOA


Ah, I had clicked around briefly and must have missed it. Just saw a "does music etc." statement. It certainly made me chuckle to make the connection; one hell of a motivator to "do more stuff" too.


Nice find, thanks for dropping the link.


This is technological brute force. Behold a more low-tech approach[0]:

> how do you photograph one of the biggest, most populous cities in the world without having any people in your pictures? With a lot of patience and an alarm clock!

[0]: http://blog.roberttimothy.com/2013/05/Deserted-empty-London-...

[Also]: http://humanless.org/


I wonder how much of the difference in pictures with lots of people is down to things like shadows rather than people obscuring the view. You might get a 'good enough' result by approaching the problem using averages of the pixel values after applying a high-pass filter: take a few hundred pictures, build an array of the same pixel from each image (e.g. 0,0), sort them by some factor such as luminance, and take an average of the most common values within a small tolerance.
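A sketch of that per-pixel idea (sorting the samples by luminance and averaging the densest cluster; the tolerance and the luma coefficients are my assumptions):

    import numpy as np

    def densest_average(pixels, tol=12.0):
        # pixels: (N, 3) samples of one pixel location across the photos
        luma = pixels @ np.array([0.299, 0.587, 0.114])
        order = np.argsort(luma)
        luma, pixels = luma[order], pixels[order]
        best_lo, best_count = 0, 0
        # Slide a luminance window of width `tol` and keep the fullest one.
        for lo in range(len(luma)):
            count = np.searchsorted(luma, luma[lo] + tol) - lo
            if count > best_count:
                best_lo, best_count = lo, count
        return pixels[best_lo:best_lo + best_count].mean(axis=0)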


Agreed, the original method seems very inefficient to me. (I think you would want a low-pass filter though? And probably a mode-filter rather than a mean filter?)


It would have been nice to see the initial unsuccessful results using Photoshop, since it looks like a median filter can potentially work quite well at removing tourists: http://nifty.stanford.edu/2014/nicholson-the-pesky-tourist/

The alternative approach is just to use a really long exposure time, like "Silent World": http://www.popsci.com/technology/article/2012-04/artists-use...


It's interesting that the clouds overhead (in the Chinatown before and after pictures) are averaged out, but the clouds on the horizon are reasonably fixed. I'm guessing this has to do with the apparent rate of movement since the clouds directly over your head always seem much faster than those elsewhere.


This research paper has a very similar idea: http://people.mpi-inf.mpg.de/~granados/projects/bgest/index....

They basically stitch together a set of images of the same place such that the result does not contain occlusions. This is done by always selecting the input image with the most frequent color, similar to OP's idea, assuming that occlusions will cause 'outlier' color values. They use a Markov Random Field to cleverly control where the stitching seams should be and use Poisson Matting to create a smoothly stitched result.


The trick from the analog photography era seems easier: take many shots with very low exposure on the same image frame from a fixed viewpoint. Over time, the things that are static make a stronger imprint on the film, whereas the moving things are too volatile to get recorded (in more than one shot). No need for averaging a stack of digital images in Photoshop...


I wonder if the process can be improved by using long exposure pictures. You might know the effect from long exposures at night: the lights of the cars are in the frame, but the cars themselves have vanished. During daylight a similar effect could be achieved - apply a strong neutral density filter to the lens, so that the exposure can be increased to a couple of minutes maybe. The moving people, cars and so on are not still enough to make the sensor react.

I was astonished when I saw this video (at approx 4:57 minutes in) https://www.youtube.com/watch?v=T24_uq0AY6o where the stream of people has almost vanished.

I wonder if the results could be the same but with less data, when you just take a couple of long exposures and merge them, removing all things that are too soft (-> movements) in the process.

However this is a really cool idea for a photo project :)


2,100 images doesn't pass the sniff test for me, but I might be totally off. I'd have thought that if you took 100 photos spaced a minute apart, you could take the mode pixel and be pretty close. Am I massively under-estimating the complexity?


Depends on your tolerance for failure. Take a heavily trafficked area like Times Square: what is the probability that your 100 pictures would capture a clean pixel for a spot across the square even once, let alone often enough for it to "win"? Say lots of people are wearing black coats; black might accidentally "win" a few pixels here or there.


I'd actually quite like to see a series of progressively enhanced images made in this way. Like 1 pic every second, take the mode pixels; repeat for 10 pics, 100, 1000, ... might be interesting. There's probably a range across the number of images that, combined with the busyness of an area, creates some interesting visual effects. Like, is there a point where an image starts to appear out of the noise? Could that be an effective measurement technique (counting insects/cataloguing their level of activity)?


Same kind of thing can be done to satellite imagery:

http://www.wired.com/2013/05/a-cloudless-atlas/


You want mode, not median. There's no reason to think that the background is close to the median colour; that's kinda the whole point of the article...

(Median works well if you have symmetrically distributed noise, which is true when denoising astronomy photos, for example, but not here.)


I was thinking the same thing, but for sets of images where the occluding items are a relatively small percentage of the image area (and are moving around enough between frames), taking the median pixel value is effectively the same thing as the mode, but faster. (E.g. if you take 10 frames, the 1-2 frames where a pixel is occluded will almost certainly be outliers to the 8-9 frames where the pixel is almost exactly the same, and the median will take the value in the middle of those 8-9 frames.)

(Another problem with mode is that you'd have to posterize the image to ensure that shifting light, noise and camera movement don't cause the values of background pixels to vary slightly. Mode is much more brittle in that respect.)

If I recall correctly, Lightroom / Photoshop has a handy "median" filter, but no mode filter, which is why the popular method uses median.
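For the curious, a rough sketch of a posterized per-pixel mode (the bucket size is a guess; without the quantization step, noise spreads the votes so thin that almost no value ever repeats):

    import numpy as np

    def mode_stack(stack, bucket=16):
        # stack: (N, H, W, 3) uint8 aligned frames
        levels = 256 // bucket
        q = (stack // bucket).astype(np.int64)
        codes = (q[..., 0] * levels + q[..., 1]) * levels + q[..., 2]  # (N, H, W)
        n, h, w = codes.shape
        flat = codes.reshape(n, -1)
        # Most frequent posterized colour at each pixel (slow but clear).
        winners = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, flat)
        r, rem = np.divmod(winners.reshape(h, w), levels * levels)
        g, b = np.divmod(rem, levels)
        centres = np.stack([r, g, b], axis=-1) * bucket + bucket // 2
        return centres.astype(np.uint8)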


This is where 'crowdsourcing' could possibly be taken at its most literal. All those Eiffel Tower pictures that every tourist just has to take might contain enough data to allow a complex adaptation of this technique to produce an image without those very tourists.


This is Microsoft Photosynth's function, IIRC


See Figure 2 in http://statweb.stanford.edu/~candes/papers/RobustPCA.pdf

There's lots of work on background estimation. This is just one approach using nuclear norm minimisation.
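For anyone who wants to play with it, a rough sketch of principal component pursuit via the usual alternating thresholding (parameter heuristics follow common defaults; stack one vectorized frame per column of M, then the low-rank part L recovers the background and the sparse part S picks up the moving people):

    import numpy as np

    def shrink(X, tau):                      # soft-thresholding
        return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

    def svt(X, tau):                         # singular-value thresholding
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U @ np.diag(shrink(s, tau)) @ Vt

    def rpca(M, n_iter=100):
        # M: (pixels, frames) float matrix, one vectorized frame per column
        lam = 1.0 / np.sqrt(max(M.shape))
        mu = M.size / (4.0 * np.abs(M).sum())
        L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
        for _ in range(n_iter):
            L = svt(M - S + Y / mu, 1.0 / mu)
            S = shrink(M - L + Y / mu, lam / mu)
            Y = Y + mu * (M - L - S)
        return L, S                          # background, moving foreground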


Also worth watching the painstakingly done Empty America series of videos by Ross Ching: http://rossching.com/empty-america


If you go in the early morning in summer, it's surprisingly easy to get shots of major landmarks with no one around. (I've done this in London.)


Yeah, also 8am on January 1st is a good time to feel alone on Earth.


28 Days Later and 28 Weeks Later were filmed at silly o'clock in the morning in London to get an abandoned feel. They just had to digitally edit out some early-morning commuter traffic on a bypass that was visible in a couple of the scenes.


What would happen to some things that aren't people, like a flag?


How does it not filter out the sun and other sky-related elements?


It does filter out clouds. Look at the before and after photos. http://i.imgur.com/NFNjBZD.jpg


You can do this with ONE photo.

Pick an urban scene that doesn't have an electronic goods store in sight, then wait for the release of the next iPhone.



