The traditional way to do this is to take a stack of aligned images and apply a median filter: http://www.jnack.com/adobe/photoshop/fountain/ This relies on each pixel, in the median case, not having a tourist in front of it. There might be tourist destinations so busy that, over a set of photos, certain pixels contain a tourist more often than not.
Just so it's clear: the median filter way of doing this doesn't rely on finding two images with non-overlapping people; it uses the open spaces from all the images. As such, you don't need an exponentially growing number of images.
You'll still run into issues with two things:
1. Someone napping or being otherwise still throughout your photos will show up in the finished product.
2. Systems which are stationary but put out a lot of internal movement (trees, video screens) will likely show up as random-colored pixels within the range of their colors. For trees this would look like a blur. For TV screens it would probably end up gray-ish or staticky.
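For concreteness, a minimal numpy sketch of the median-stack approach, assuming the frames are already aligned and sit in a made-up aligned/ directory:

    import glob
    import numpy as np
    from PIL import Image

    # Load the aligned frames into one (num_frames, H, W, 3) array.
    paths = sorted(glob.glob("aligned/*.jpg"))
    frames = np.stack([np.asarray(Image.open(p).convert("RGB")) for p in paths])

    # The per-pixel median over the frame axis keeps whatever each pixel
    # shows most of the time, i.e. the background if occlusions are rare.
    background = np.median(frames, axis=0).astype(np.uint8)
    Image.fromarray(background).save("background.png")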
This reminds me of how the first photograph of a human was taken. Louis Daguerre took pictures of busy streets in Paris, but because the photographs had an exposure time of around 15 minutes, the streets looked empty. The exception was a man having his shoes shined, since he had been standing still for the duration of the exposure.
I think the best fix for the tree blur would be to apply the median stack mode, then take any picture in the stack that has sharp trees and use a layer mask to paint just the trees back in.
This is a reasonably well-known problem in video processing, often referred to as "background extraction." It mostly amounts to running local outlier rejection on the video frames and then generating a composite image. There are better and worse algorithms for this, but it's just noise rejection. Start with a median filter, tweak the window size and number of frames, and exploit color if desired.
Key trick is LOCAL outlier rejection. You don't take the median of the global dataset, you take the median of a subset of frames. Then you do it to a subset of results, and so forth until you get a pretty image. Then you can highlight problem areas and go back and try to sample them from different data, depending on how much you care about that. An incidental benefit of this is that it lets you dramatically speed up the job by throwing CPU cores at it, if that's something you care about.
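Roughly what I mean, as a sketch; the chunk size and worker count are arbitrary knobs, and frames is assumed to be an (N, H, W, 3) numpy array of aligned frames:

    import numpy as np
    from multiprocessing import Pool

    def chunk_median(chunk):
        # chunk: a few consecutive frames, shape (k, H, W, 3)
        return np.median(chunk, axis=0)

    def background(frames, chunk_size=8, workers=4):
        # First pass: local outlier rejection inside small groups of frames.
        chunks = [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]
        with Pool(workers) as pool:          # each chunk can go to its own core
            partials = pool.map(chunk_median, chunks)
        # Second pass: reject outliers among the partial composites.
        return np.median(np.stack(partials), axis=0).astype(np.uint8)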
(Lots of relevant academic papers for after that.)
The problem encountered in the article is that 2,100 frames at 22.4 MB per frame napkins out to about 5.8 gigapixels uncompressed. For reference, at 30 FPS that would be only 70 seconds of continuous video. Using high-res stills is going to balloon your storage cost and processing time, which has nothing to do with the underlying problem.
A good workaround if you want a high-resolution result would be to do the processing at a reduced resolution, then upscale from there. E.g. if you drop the resolution to 1/4 for processing, you could take the output and, for each pixel, find the best matches in the source data and sample a larger window from those to get a full-resolution result.
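Something along those lines, as a rough sketch; the 1/4 scale factor and the per-pixel distance metric are arbitrary choices, and H and W are assumed to be divisible by the scale:

    import numpy as np

    def upscale_from_sources(frames, scale=4):
        # frames: (N, H, W, 3) uint8, already aligned
        n, h, w, _ = frames.shape
        # Box-downsample every frame and build the clean plate at low resolution.
        small = frames.reshape(n, h // scale, scale, w // scale, scale, 3).mean(axis=(2, 4))
        lowres_bg = np.median(small, axis=0)
        # For each low-res pixel, pick the source frame that agrees best
        # with the low-res background...
        dist = np.linalg.norm(small - lowres_bg, axis=-1)   # (N, H/scale, W/scale)
        best = dist.argmin(axis=0)
        # ...and copy that frame's full-resolution block into the output.
        out = np.empty((h, w, 3), dtype=np.uint8)
        for by in range(h // scale):
            for bx in range(w // scale):
                f = best[by, bx]
                out[by*scale:(by+1)*scale, bx*scale:(bx+1)*scale] = \
                    frames[f, by*scale:(by+1)*scale, bx*scale:(bx+1)*scale]
        return out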
Or you could use one of those nifty video super-resolution algorithms that have been popular in recent computer vision papers. Depending on what you chose to do when you captured the data, and what you feel like implementing.
Times Square is still problematic, mainly because it has persistent crowds at many times of the day. Individual people move out of the way; crowds don't unless there are gaps (which may appear thanks to e.g. stop lights or transit arrival times). Best advice there is to catch it when traffic is less dense, or when it's disrupted (movie filming, accident, random variation).
'video' makes me wonder if there's a way to 'hack' an encoder (h264/avc?) to use P- and B-frame information to mark moving areas (that's what they're supposed to be good at) so that "only" still data (I-frame minus what moved) remains. Now you've got a lot less data to churn.
Also, in this process, could marking pixels with "stillness" probabilities and turning those into a blending factor (alpha channel?) before flattening the I-frames work?
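I haven't tried pulling motion vectors out of an encoder, but as a stand-in you can get per-pixel "stillness" weights from plain frame differencing and use them as blending factors, roughly like this (sigma is an arbitrary knob):

    import numpy as np

    def stillness_blend(frames, sigma=10.0):
        # frames: (N, H, W, 3), any integer or float dtype
        frames = frames.astype(np.float32)
        # Per-pixel motion estimate: absolute difference to the previous frame.
        # (Frame differencing stands in for the codec's motion information here.)
        diff = np.abs(np.diff(frames, axis=0)).mean(axis=-1)   # (N-1, H, W)
        motion = np.concatenate([diff[:1], diff], axis=0)      # pad back to N frames
        # Still pixels get weights near 1, moving pixels near 0.
        weights = np.exp(-(motion / sigma) ** 2)[..., None]
        blended = (frames * weights).sum(axis=0) / weights.sum(axis=0)
        return blended.astype(np.uint8)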
I may be completely off-track here, but this seems to happen when VLC drops frames in an h264 source - you get green/black blocks except for the moving parts of the image. Maybe libvlc/libav could help you here?
Peter Funch has a series of compositions like this, only he is adding, not removing. http://www.v1gallery.com/artist/show/3 Scroll to the middle for examples of everyone holding a folder, wearing red, yawning, etc.
Pretty intensive work to get the photos, shadows and composition together.
Hugin can do some of this, too. I use it mostly for stitching panoramas together but it is possible to hold the camera still and catch several 'instances' of a moving object. For example, here is someone pier jumping into the sea, and you can see four instances of her while the other people (and of course all stationary objects) are perfectly normal: https://imgur.com/h5Ysb6b
There was a news story I saw recently about a photographer doing this near an airport who had dozens of planes in the final image, like a big traffic jam in the sky.
Actually, the update below says that it was basically made in Photoshop, because there were planes that apparently don't serve that airport, and the relative sizes of planes were off. Still a neat idea, but probably hard to capture precisely like that without an automated stationary camera.
Google+ Auto Awesome has a feature that does this for you. If you take multiple pictures of a subject with your phone it will automatically give you a photo with all the moving people/cars/etc from the scene removed.
Unless I'm entirely wrong, this is the same Benn Jordan of Flashbulb fame. A hilariously talented individual, if you haven't already listened to his music, you really should.
Have to agree with existencebox, he is hilariously talented. His blend of live guitar, drill and bass and acid is awesome (slightly annoying video editing here):
Ah, I had clicked around briefly and must have missed it. Just saw a "does music etc." statement. It certainly made me chuckle to make the connection; one hell of a motivator to "do more stuff" too.
This is technological brute force. Behold a more low-tech approach[0]:
> how do you photograph one of the biggest, most populous cities in the world without having any people in your pictures? With a lot of patience and an alarm clock!
I wonder how much of the difference in pictures with lots of people is down to things like shadows rather than people obscuring the view. You might get a 'good enough' result by averaging the pixel values after applying a high pass filter - take a few hundred pictures, build an array of all the samples of the same pixel in each image (e.g. (0,0)), sort them by some factor such as luminance, and take an average of the most common within a small tolerance.
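Roughly what I have in mind, per pixel; the bin width is the "small tolerance" and is an arbitrary choice:

    import numpy as np

    def densest_bin_pixel(samples, bin_width=8):
        # samples: (N, 3) values of one pixel position across all frames
        luma = samples @ np.array([0.299, 0.587, 0.114])
        bins = (luma // bin_width).astype(int)
        # Keep only the samples in the most populated luminance bin,
        # then average those to smooth out noise.
        keep = bins == np.bincount(bins).argmax()
        return samples[keep].mean(axis=0)

    def densest_bin_image(frames, bin_width=8):
        # frames: (N, H, W, 3); slow pure-Python loop, but it shows the idea
        n, h, w, _ = frames.shape
        out = np.empty((h, w, 3))
        for y in range(h):
            for x in range(w):
                out[y, x] = densest_bin_pixel(frames[:, y, x].astype(float), bin_width)
        return out.astype(np.uint8)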
Agreed, the original method seems very inefficient to me. (I think you would want a low-pass filter though? And probably a mode-filter rather than a mean filter?)
It's interesting that the clouds overhead (in the Chinatown before and after pictures) are averaged out, but the clouds on the horizon are reasonably fixed. I'm guessing this has to do with the apparent rate of movement since the clouds directly over your head always seem much faster than those elsewhere.
They basically stitch together a set of images of the same place such that the result does not contain occlusions. This is done by always selecting the input image with the most frequent color, similar to OP's idea, assuming that occlusions will cause 'outlier' color values. They use a Markov Random Field to cleverly control where the stitching seams should be and use Poisson Matting to create a smoothly stitched result.
The trick from the analog photography era seems easier: take many heavily underexposed shots onto the same frame of film from a fixed viewpoint. Over time, the things that are static make a stronger imprint on the film, whereas the moving things are too transient to get recorded (in more than one shot). No need for averaging a stack of digital images in Photoshop...
I wonder if the process can be improved by using long exposure pictures. You might know the effect from long exposures at night: the lights of the cars are in the frame, but the cars themselves have vanished. During daylight a similar effect could be achieved - apply a strong neutral density filter to the lens, so that the exposure can be increased to a couple of minutes maybe. The moving people, cars and so on are not still enough to make the sensor react.
I wonder if the results could be the same but with less data, if you just take a couple of long exposures and merge them, removing everything that is too soft (-> movement) in the process.
However this is a really cool idea for a photo project :)
2,100 images doesn't pass the sniff test for me, but I might be way off. I'd have thought that if you spaced out 100 photos a minute apart, you could take the mode pixel and be pretty close. Am I massively under-estimating the complexity?
Depends on your tolerance for failure. Take a heavily trafficked area like Times Square: what is the probability that your 100 pictures would capture a clean pixel for a spot across the square even once, let alone enough times for it to "win"? If lots of people are wearing black coats, black might accidentally "win" a few pixels here and there.
I'd actually quite like to see a series of progressively enhanced images made in this way. Like 1 pic every second, take the mode pixels; repeat for 10 pics, 100, 1,000, ... might be interesting. There's probably a range across the number of images, tied to the busy-ness of an area, that creates some interesting visual effects. Is there a point where an image starts to appear out of the noise? Could that be an effective measurement technique (counting insects / cataloguing their level of activity)?
you want mode, not median. there's no reason to think that the background is close to the median colour. that's kinda the whole point of the article...
(median works well if you have symmetrically distributed noise, which is true when denoising astronomy photos, for example, but not here).
I was thinking the same thing, but for sets of images where the occluding items are a relatively small percentage of the image area (and are moving around enough between frames), taking the median pixel value is effectively the same thing as the mode, but faster. (E.g. if you take 10 frames, the 1-2 frames where a pixel is occluded will almost certainly be outliers relative to the 8-9 frames where the pixel is almost exactly the same, and the median will land in the middle of those 8-9 frames.)
(Another problem with mode is that you'd have to posterize the image to ensure that shifting light, noise and camera movement don't cause the values of background pixels to vary slightly. Mode is much more brittle in that respect.)
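To make the comparison concrete, a sketch of both, with the mode computed on posterized values (the quantization step is arbitrary, and treating channels independently is itself a simplification):

    import numpy as np

    def median_stack(frames):
        # Works directly on the raw values; no tuning needed.
        return np.median(frames, axis=0).astype(np.uint8)

    def mode_stack(frames, step=16):
        # Posterize first so noise doesn't split the vote across nearby values.
        q = (frames // step).astype(np.uint8)
        n, h, w, c = q.shape
        flat = q.reshape(n, -1)                        # (N, H*W*C)
        levels = 256 // step
        counts = np.zeros((levels, flat.shape[1]), dtype=np.int32)
        for lvl in range(levels):
            counts[lvl] = (flat == lvl).sum(axis=0)
        winner = counts.argmax(axis=0)                 # most frequent bin per position
        # Map the winning bin back to the centre of its value range.
        return (winner.reshape(h, w, c) * step + step // 2).astype(np.uint8)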
If I recall correctly, Lightroom / Photoshop has a handy "median" filter, but no mode filter, which is why the popular method uses median.
This is where 'crowdsourcing' could possibly be taken at its most literal. All those Eiffel Tower pictures that every tourist just has to take might contain enough data to let a more complex adaptation of this technique produce an image without those very tourists.
28 days later and 28 weeks later were filmed at silly o'clock in the morning in London to get an abandoned feel. They just had to digitally edit out some early morning commuter traffic on a bypass that was visible in a couple of the scenes.
Or you can just do it on christmas day: http://www.ianvisits.co.uk/blog/2008/12/25/deserted-london/