Archive for the ‘Education and Support’ Category

The de-dupe problem

Thursday, August 5th, 2010

If you collect a lot of stuff in electronic form, the issue of how many copies to maintain becomes a concern. Ideally, you’d only store one copy of each photograph, album, document, or whatever. Metadata can then be used to find the same item through various routes such as file path, search words, or description.
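One common way to spot extra copies is to compare a fingerprint of the content rather than the file names. The sketch below is only an illustration of that idea: the `group_duplicates` helper and the sample collection are made up for the example, and it only catches byte-for-byte identical copies, not the fuzzier "same book, different record" duplicates a catalog has to deal with.

```python
import hashlib
from collections import defaultdict

def group_duplicates(items):
    """Group item names by the SHA-256 digest of their content.

    `items` maps a name (for example a file path) to its raw bytes.
    Returns only the groups that hold more than one name.
    """
    by_digest = defaultdict(list)
    for name, content in items.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]

# Hypothetical collection: two paths holding the same bytes, one unique item.
collection = {
    "photos/2009/beach.jpg": b"same bytes",
    "backup/beach-copy.jpg": b"same bytes",
    "docs/notes.txt": b"different bytes",
}
duplicates = group_duplicates(collection)
```

Once the duplicates are grouped, one copy can be kept and the other paths recorded as metadata pointing at it.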

Google encounters this problem in spades when it comes to their project to scan the world’s books. Books of the world, stand up and be counted! All 129,864,880 of you describes the problem. The problems run the gamut from the basic question of defining what is to be considered as a book through all of the classification and identification schemes being used to manage book collections.

So what does Google do? We collect metadata from many providers (more than 150 and counting) that include libraries, WorldCat, national union catalogs and commercial providers. At the moment we have close to a billion unique raw records. We then further analyze these records to reduce the level of duplication within each provider, bringing us down to close to 600 million records.

That gets boiled down in various ways, such as finding errors or April Fools’ jokes like the turkey probe given an ISBN book number, to where Google figures there are about 130 million books in the world.

Take a gander through the essay. Things that seem simple, like creating a catalog of all the world’s books, can – and usually do – have complexities you might not imagine.

Two days in May, 11,500 miles of track

Monday, May 31st, 2010

It was May 31, 1886 that a two-day rail gauge conformance effort began. Southern railroads changed the distance between the rails from five feet to four feet nine in order to be compatible with the Pennsylvania Railroad. PrawfsBlawg suggests Happy Uniform Gauge Day! How a 3-inch nudge destroyed American federalism.

Today is the 124th anniversary of one of our most important yet most unrecognized constitutional events: On May 31st, 1886, the operators of Southern railroads began their famous two-day conversion of all southern railroad tracks

Check the link for the story. This was just one very big step towards conformity to enhance and enable commerce. The problem was widespread. “In 1871 no less than 23 different gauges existed in the United States, ranging in width from three to six feet.”

We face similar standards development processes today. Since computing technology has become a consumer good and service, protocols for communications, data storage, and service descriptions have followed the railroad gauge uniformity history. Back in the eighties, there were many different ways of connecting computers together both in terms of the wires and also in terms of the methods. Now the methods are overwhelmingly TCP/IP and the wire is twisted pair ethernet. The process continues in the wireless regime, however, as cell phones and wifi and other approaches compete.

There will always be custom solutions for niche markets but the economic advantages of standardization are usually overwhelming. From cargo containers to rail gauges to electrical power delivery to clothing sizes, much of our prosperity comes from being able to talk a common language and share products and services easily.

As to whether this standardization and conformity is an attack on federalism, I don’t know. I think Rick has headed out to hyperbole with the title of his post. States, towns, and counties still exist and have not been shoved off the map. They can just communicate better and profit more from each other’s efforts.

NIST Handbook of Mathematical Functions

Friday, May 14th, 2010

When it’s time to get your math on, NIST has the Digital Library of Mathematical Functions up. The goal is to provide

a reference tool for researchers and other users in applied mathematics, the physical sciences, engineering, and elsewhere who encounter special functions in the course of their everyday work.

This is a reference work. That means you need to know what you are doing to be able to make best use of it. The online version has links to papers and other documents that will provide background but otherwise this is like a dictionary of mathematics.

Dictionary of Algorithms and Data Structures

Thursday, January 21st, 2010

If you want to look up the definition of a software development term that you won’t be able to understand unless you already know the meaning, the NIST Dictionary of Algorithms and Data Structures might be a resource. Each definition seems to depend upon many other words in the dictionary and the links to those words can make for an interesting journey. I wonder if they get circular.

This is a dictionary of algorithms, algorithmic techniques, data structures, archetypal problems, and related definitions. Algorithms include common functions, such as Ackermann’s function. Problems include traveling salesman and Byzantine generals. Some entries have links to implementations and more information. Index pages list entries by area and by type. The two-level index has a total download 1/20 as big as this page.

If you don’t have a clue, the dictionary can be an interesting look at the terms and names and ideas that define software development and how they fit together. There are also links you can use to escape the dictionary into more complete treatments of the concepts, some with code.

pi calculator and FOSS hero

Thursday, January 7th, 2010

You may have heard (PhysOrg):

A computer scientist in France has broken all previous records for calculating Pi, using only a personal computer. The previous record was approximately 2.6 trillion digits, but the new record, set by Fabrice Bellard, now stands at almost 2.7 trillion decimal places.

Which is impressive but there is more in the story. For one, “Bellard has been following the records for calculating Pi to the maximum number of decimal places since he received his first book about Pi at the age of 14.” Then there are other projects Bellard has done.

M Bellard is perhaps best known as the writer of the open source project FFmpeg and processor emulator QEMU. He said he has no immediate plans to calculate Pi to further digits in the future, but may do, depending on his motivation and the availability of larger and faster storage. He intends to release open-source versions of his software for Linux and Windows to enable anyone who is interested in furthering the calculation to beat him to it.

If you’ve been doing video transcoding, then you’ve appreciated Bellard’s work. QEMU is a system emulator that creates an artificial hardware environment to allow running systems within systems.

The mathematical ideas behind pi are fascinating in themselves but then there is the math behind arbitrary precision numbers and the infinite series calculations and algorithms used to calculate values such as pi out to a precision that would take an entire bleeding edge hard drive to store in decimal notation.
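To get a feel for how an infinite series and arbitrary precision integers fit together, here is a small sketch. It uses Machin's formula (pi/4 = 4 arctan(1/5) - arctan(1/239)), which is a swapped-in classic chosen for brevity and not the method behind the record run. The function names and the ten guard digits are choices made for this example.

```python
def arctan_inverse(x, scale):
    """Return arctan(1/x) * scale from the Taylor series, using only integers."""
    term = scale // x          # first term of the series: 1/x
    total = term
    x_squared = x * x
    n = 1
    sign = -1
    while term:
        term //= x_squared     # next odd power of 1/x
        n += 2
        total += sign * (term // n)
        sign = -sign
    return total

def pi_digits(digits):
    """Return pi truncated to `digits` decimal places, as one big integer."""
    guard = 10                 # extra digits to absorb rounding in the series
    scale = 10 ** (digits + guard)
    pi_scaled = 4 * (4 * arctan_inverse(5, scale) - arctan_inverse(239, scale))
    return pi_scaled // 10 ** guard

print(pi_digits(20))  # 3 followed by the first 20 decimal places
```

Python integers grow as large as memory allows, so the same code runs for thousands of digits, just slowly. Getting to trillions takes far faster series and multiplication methods plus a lot of disk.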

And then to think that the computers that can do this sort of calculation in weeks are readily available to anyone.

Modem history

Sunday, December 27th, 2009

SSB and echo cancellation were two technologies that allowed modems to up the speed on POTS landlines. TechRadar has the story in Getting connected: a history of modems.

It took 14 years, from 1980 until 1994, for the speed of the modem to develop from 14.4Kbps to 28.8Kbps but it was only two years later, in 1996, that Brent Townshend came up with the technology for the 56k modem.

A lot of things happened in the 90s. The I’net move from nonprofit only to commerce capable provided a social impetus to go along with data transfer rate improvements at the same time as operating systems could take advantage of new computer architectures and video systems to provide WIMP (windows, icons, menus, pointer) interfaces. Two years to double modem speeds and then another two years before cable and DSL wideband started to up that by an order of magnitude.

An article like this is a good way to sit down and marvel at just how much things have changed in such a little time.

Understanding what’s going on with climatology

Saturday, December 12th, 2009

There are two good posts today to help understand the brouhaha in climatology. Iowahawk has Fables of the Reconstruction (Or, How to Make Your Own Hockey Stick) that has a step by step tutorial on how to replicate Mann’s famous hockey stick graph complete with OOo calc spreadsheets.

My goal was to provide interested people with a hands-on DIY example of the basic statistical methodology underlying temperature reconstruction, at least as practiced by the leading lights of “Climate Science.” … Is there anything wrong with this methodology? Not in principle. In fact there’s a lot to recommend it. … The devil, as they say is in the details. In each of the steps there is some leeway for, shall we say, intervention.

That deals with what is called “homogenized” data where temperature records have been ‘adjusted’ to smooth out variability and compensate for known errors. The reason to do this has much to do with such things as the accuracy of measure, where surfacestations.org notes the error is considered to be greater than 2C for 2/3 of US surface stations, as well as the fact that temperature is measured at a point in space and time and what you are really after is atmospheric heat content.

Basil Copeland takes on the temperature measures problem in Would You Like Your Temperature Data Homogenized, or Pasteurized?. His point is that you don’t want to remove inhomogeneities but rather to just clean up the data as it is the lumps and their distribution that have meaning, too.

with temperature data, I want very much to see the natural variability in the data. And I cannot see that with linear trends fitted through homogenized data. It may be a hokey analogy, but I want my data pasteurized – as clean as it can be – but not homogenized so that I cannot see the true and full range of natural climate variability.

The problem with any of these approaches to cleaning raw temperature data gets back to what Iowahawk illustrated. It is fundamental to understanding the accuracy of climate studies. When you step through data manipulations, you bring in numerous assumptions and sources of error.

That is where ideas such as noted in Models for gravity and heat become interesting. Instead of temperatures being a starting point for analysis and calculation, they become an end point or verification of a model. The model describes the variables involved in determining temperature at a given point in time and space. The problem with current models is that they are very good for current conditions but their veracity rapidly degrades the farther away from current you go. That means that they are nearly useless for climatology. That, in turn, is both why climatology turns elsewhere and an illustration of just how tough the problem is to predict climate change.

MiB vs MB; GiB vs GB and IEEE 1541

Friday, December 11th, 2009

Tech Report has a poll going – Poll: Gigabytes vs. gibibytes.

You may not know it, but the world of technology is split into two camps right now. For one camp, megabytes, gigabytes, and terabytes are all powers of 10. Hard drive makers are part of that clique, as is Apple, at least since the release of Mac OS X 10.6 Snow Leopard. For the other camp—the traditionalists, if you will—those same units are powers of two. On that side of the ring, we find Microsoft, memory vendors, and just about anyone who doesn’t know the difference.

The issue is whether you are counting in base 2 like a computer or base 10 like a human. Engineering has always used thousands as a convenient grouping so measures are reflected with kilo, mega, and giga prefixes to indicate thousands, millions, or billions. It is convenient, yet maybe a bit deceiving, that 2 to the 10th power is very close to ten to the third power. So ten with a 3, 6, or 9 exponent is sorta’ like 2 with a 10, 20, and 30 exponent.

This really wasn’t an issue when data storage was measured by thousands. Whether a thousand was 1024 or 1000 didn’t make much difference. When storage went up to a million, the difference between power of 2 round off and base ten became that of 1,048,576 or 1,000,000 and when talking billions 1,073,741,824 versus 1,000,000,000. That means an error of 2.4% for thousands, 4.9% for millions, or 7.4% for billions. A couple of percent is a ‘reasonable’ error but getting towards 5% or more can be an issue. For terabytes, the error starts getting close to ten percent.
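The percentages above are easy to check. This short sketch just compares each power of two with the matching power of ten; the `prefix_gap_percent` name and the prefix table are made up for the example.

```python
# Each step up the prefix ladder is another factor of 2**10 (binary)
# or 10**3 (decimal).
PREFIXES = [("kilo", 1), ("mega", 2), ("giga", 3), ("tera", 4)]

def prefix_gap_percent(step):
    """Percent by which the binary unit exceeds the decimal unit."""
    binary = 2 ** (10 * step)
    decimal = 10 ** (3 * step)
    return (binary - decimal) / decimal * 100

for name, step in PREFIXES:
    print(f"{name}: {prefix_gap_percent(step):.1f}%")
```

The gap compounds at each step because every extra prefix multiplies in another 1024/1000.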

Another factor is that devices like hard drives are becoming less connected to binary addressing schemes like other memory devices. Hard drives are also the most common (cheapest) for terabyte storage, too, and where people are most likely to be trying to purchase storage space rather than addressing convenience (i.e. speed). So sales efforts have to be a bit less deceiving and that may be why it matters whether a terabyte drive is labeled TB or TiB, so a buyer knows he’s getting 1,000,000,000,000 bytes (1 TB) rather than 1,099,511,627,776 bytes (1 TiB) of storage.

So if you’re used to MB and GB and see something like MiB or GiB, you’ll know someone is following standards and trying to be honest with you.

Models for gravity and heat

Thursday, December 10th, 2009

AJ Strata nails the core issue in How Not To Create A Historic Global Temp Index. His reference is that of keeping track of satellites. He notes the many factors that make orbit predictions longer than a week difficult. As more is learned about what influences satellite position, the model used for predicting position is improved.

This approach is in contrast to global temperature determinations where the raw data is adjusted (homogenized) in order to fit the current desired paradigm.

Basically what alarmists needed to do was not adjust data, they needed to create a thermal atmosphere model which would take into account siting characteristics both local and large. This would include distance from large bodies of water, altitude, latitude, etc. A three dimensional model that would explain why various stations have their unique siting profiles and temperature records. It would explain why temperatures near oceans fluctuate less than stations inland 100-200 miles. It would show how a global average increase of 1°C would result in a .6°C increase at high latitudes or altitudes. It would EXPLAIN the data variations in the measurements.

What this means is that the effort should not be trying to adjust raw meteorological data to create a homogenized data set that could then be subject to various statistical manipulations. Instead, create a model where you could input factors about weather station equipment and locale and have it predict a temperature that could be compared to the actual measured temperature.

In some respects, the model AJ Strata seeks is what is being used for short term weather forecasting. As far as I know, such models are not being used to qualify temperature readings in determining a global temperature index. Also, those models usually use as input an adjusted and gridded input from raw measures so they can get rather circular in their methods.

And just what is the value or meaning of a global temperature index, anyway? The earth is so large and so varied and weather phenomena so local and so small that it seems as if the effort is an attempt to average apples and oranges to be able to describe both as one thing.

Handbrake nine four

Monday, December 7th, 2009

HandBrake has just released version 0.9.4. It is one of the easiest ways to transcode recorded video or DVD’s to modern compressed multimedia files. This version supports 64 bit processors and narrows the focus somewhat towards m4v output with constant quality.

If you are copying your DVD collection to electronic storage for convenient use with a media player, you might want to change the default ‘normal’ preset to ‘high profile’ to take advantage of some of the extra goodies in the m4v format. For modern (post mid 90s) movies with 5.1 channel surround sound, hit the audio tab and select the proper track for pass through. That often means a 400+ bit rate rather than a 200- rate. If there is a commentary track, you can add that as well. If you need subtitles, click on that tab and add them, too. These options do mean that you need a media player that can handle selecting these options. VLC and the Gnome Movie Player seem to meet this requirement.

Programs like HandBrake are good ways to learn about modern video technology. For instance, the ‘normal’ preset uses strict anamorphic scaling while ‘high profile’ uses loose. This seems backwards. It may be a bit of a trade-off. The idea is to take the original DVD 720×480 resolution coupled with the intended display resolution tied in with actual video aspect ratio to determine an output resolution that is adjusted slightly for optimum compression. The result may be something that is a bit better resolution than the original 480p often found on DVD’s due to pixel squashing.
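The pixel squashing arithmetic can be sketched in a few lines. This is not HandBrake's own algorithm and it ignores the strict versus loose distinction. It only shows how a stored frame size and an intended display aspect ratio imply non-square pixels. The function names are made up for the example and the 16:9 ratio is an assumption for a widescreen title.

```python
from fractions import Fraction

def pixel_aspect_ratio(stored_width, stored_height, display_aspect):
    """Shape of one stored pixel (width:height) needed to fill the display."""
    return display_aspect / Fraction(stored_width, stored_height)

def display_width(stored_height, display_aspect):
    """Width in square pixels when the stored height is left alone."""
    return round(stored_height * display_aspect)

widescreen = Fraction(16, 9)   # assumed display aspect ratio
par = pixel_aspect_ratio(720, 480, widescreen)
width = display_width(480, widescreen)
print(f"pixel aspect ratio {par}, shown as {width}x480")
```

An anamorphic encode keeps the stored pixels and records the pixel shape so the player does the stretching, which is why the result can look wider than the 720 stored columns suggest.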

When you transcode a DVD using the VIDEO_TS option, you may find titles not in the DVD menu. The Tolkien trilogy extended edition from New Line, for instance, has a couple of MTV items that really don’t fit into the tone expressed on the rest of the DVD set. They are available as Easter Eggs if you run the Windows software on the DVD’s. You’ll also find that some lead in and drop out video is on the disk as a short title, usually well under a minute in length. For the extended Tolkien, the lead in is the usual rating and license and the drop out is a fade to black with notice that you need to insert the next DVD for the rest of the movie.

It should also be noted that there is a bit of a discord in the matter of copying DVD’s for personal use. The recording industry seems to want to restrict your viewing of DVD’s to only when it comes off the original media. Some in the industry even want to allow only a certain number of viewings as well. The issue is playing out in the courts and in treaties and the fallout creates hoops and hurdles, and possibly penalties, for those who want to enjoy published media their own way.