Open:FactSet Forum

Normalizing and concording raw data feeds

(Mikolaj Franaszczuk) #1

One of the key benefits of getting data through the Open:Factset Marketplace is the Symbology concordance, which makes it easy to work with content from a variety of sources. But there are many additional standardization and normalization benefits that Factset provides. Our Data Feed teams take raw data and apply a slate of parsing and modeling rules to conform to a standard that makes ingestion and analysis less tedious.

Add a comment here if you have any questions about our data normalizing and symbol mapping process. Or if you think we can help with your own data standardizing process

To illustrate some of the transformations that go into standardizing a feed, lets use this hypothetical example of data about fireworks displays. Suppose there are two major providers of data about firework companies and their shows: Sparkledata and WorksRater. The raw feeds contain a lot of useful information about individual firework shows and their attributes and ratings, but as you can see from the file samples below, they use very different formats and paradigms. Making use of the raw feeds would require a lot of prep work and custom parsing dedicated to each provider.



123,“NorthEast Fireworks”,2018\10\1,Boston-MA,12,34,51,0.79



123,“NorthEast Fireworks”,2018\12\1,New York City-NY,67,99,40,0.54



ABC|Northeast Fireworks Inc.|20181001|BOSTON|MA|colorfulness|45|70%|MUSIC

ABC|Northeast Fireworks Inc.|20181001|BOSTON|MA|novelty|56|70%|MUSIC

ABC|Northeast Fireworks Inc.|20181001|BOSTON|MA|venue|76|70%|MUSIC

DEF|Dazzleworks Ltd.|20181110|PROVIDENCE|RI|colorfuleness|11|81%|NO_MUSIC

DEF|Dazzleworks Ltd.|20181110|PROVIDENCE|RI|novelty|80|81%|NO_MUSIC

DEF|Dazzleworks Ltd.|20181110|PROVIDENCE|RI|venue|33|81%|NO_MUSIC

XYZ|Firetainment Co.|20190101|BALTIMORE|MD|colorfulness|43|73%|MUSIC

XYZ|Firetainment Co.|20190101|BALTIMORE|MD|novelty|99|73%|MUSIC

XYZ|Firetainment Co.|20190101|BALTIMORE|MD|venue|21|73%|MUSIC

With Open:Factset feeds, much of the parsing and modeling work is already done for you. The data from each provider would be exposed in standardized tables after undergoing a slate of transformation, such as:

  • Normalization of the flat files into separate tables

  • Concordance with Factset entity symbology

  • Standardization of column names

  • Standardization of date formats (YYYY-MM-DD)

  • Standardization of how percentage values are presented

  • Standardization of text fields (e.g. removal of wrapping quotes)

  • Standardization of boolean flags (e.g. MUSIC\NO_MUSIC becomes 1\0)

  • Transposing of relevant data (e.g. Sparkledata’s attributes)

  • Clear separation of data into separate columns (e.g. city & state)

The final tables files\tables might look like this:

Sparkledata – sd_company_map

sd_id sd_company_name fs_entity_id
123 NorthEast Fireworks HIJ789-E
444 Dazzleworks YYYZZZ-E
872 Skyshow KLM432-E

Sparkledata – sd_scores

sd_id date sd_attribute sd_score
123 2018-10-01 colorfulness 12
123 2018-10-01 brightness 34
123 2018-10-01 novelty 51
444 2018-11-10 colorfulness 15
444 2018-11-10 brightneess 50
444 2018-11-10 novelty 71
872 2018-11-15 colorfulness 85
872 2018-11-15 brightness 12
872 2018-11-15 novelty 90
123 2018-12-01 colorfulness 67
123 2018-12-01 brightness 99
123 2018-12-01 novelty 40

Sparkledata – sd_show_info

sd_id date city state sd_audience_approval_pct
123 2018-10-01 Boston MA 79
444 2018-11-10 Providence RI 85
872 2018-11-10 Phoenix AZ 12
123 2018-12-01 New York City NY 54

WorksRater – wr_company_info

wr_id wr_company_name fs_entity_id
ABC Northeast Fireworks Inc. HIJ789-E
DEF Dazzleworks Ltd. YYYZZZ-E
XYZ Firetainment Co. BMD987-E

WorksRater – wr_scores

wr_id date wr_attribute wr_score
ABC 2018-10-01 colorfulness 45
ABC 2018-10-01 novelty 56
ABC 2018-10-01 venue 76
DEF 2018-11-10 colorfulness 11
DEF 2018-11-10 novelty 80
DEF 2018-11-10 venue 33
XYZ 2019-01-01 colorfulness 43
XYZ 2019-01-01 novelty 99
XYZ 2019-01-01 venue 21

WorksRater – wr_show_info

wr_id date city state wr_audience_approval_pct music_flag
ABC 2018-10-01 Boston MA 70 1
DEF 2018-11-10 Providence RI 81 0
XYZ 2019-01-01 Baltimore MD 73 1