Originally published November 7, 2020 @ 1:49 pm

Just a scripting exercise because I need to do something important, but I am procrastinating. The idea is simple: grab some URL with text containing somewhat structured data and convert it into a spreadsheet. I know, exciting…

I will use the “The World’s Most Nutritious Foods” article from Pocket and try to make a spreadsheet out of it. The first step below does these few things:

  1. Downloads the URL and dumps it to text format
  2. Removes leading spaces
  3. Grabs the title line for each food item and the seven subsequent lines
  4. Removes the number list prefix in the title line
  5. Removes any single lower-case letter in parentheses, i.e. “(x)”
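# ${url} is assumed to already hold the address of the article
tmpdir="$(mktemp -d)"
# tmpfile was never defined in the original listing; a plain mktemp file is assumed here
tmpfile="$(mktemp)"
lynx --dump "${url}" | \
sed -r 's/^\s+//g' | \
grep -P '^[0-9]{1,3}\.\s' -A7 --group-separator=$'ooo' | \
sed -r 's/^[0-9]{1,3}\. //g' | \
sed -r 's/ \([a-z]\)//g' > "${tmpfile}"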
Sample output
[root@ncc1711:/tmp/tmp.2oktVowKGn] # cat "${tmpfile}"

86kcal, $0.21, per 100g

A bright orange tuber, sweet potatoes are only distantly related to
potatoes. They are rich in beta-carotene.


249kcal, $0.81, per 100g

Figs have been cultivated since ancient times. Eaten fresh or dried,
they are rich in the mineral manganese.


80kcal, $0.85, per 100g

Ginger contains high levels of antioxidants. In medicine, it is used as
a digestive stimulant and to treat colds.


All these steps, of course, are specific to the text you’re parsing. There is no universal approach to a task such as this.
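To see the first step work without downloading anything, here is a small sketch that runs the same pipeline on a made-up sample. The sample text, its layout, and the `sample` and `cleaned` variable names are assumptions for illustration only, and `-A3` replaces `-A7` because the fake items are shorter. It needs GNU grep and GNU sed.

```shell
# Hypothetical stand-in for the output of: lynx --dump "${url}"
sample='   Intro text that should be ignored
   100. SWEET POTATO (v)
   86kcal, $0.21, per 100g
   A bright orange tuber.
   NUTRITIONAL SCORE: 49
   [image]
   99. FIGS (v)
   249kcal, $0.81, per 100g
   Eaten fresh or dried.
   NUTRITIONAL SCORE: 49'

cleaned="$(printf '%s\n' "${sample}" | \
  sed -r 's/^\s+//' | \
  grep -P '^[0-9]{1,3}\.\s' -A3 --group-separator='ooo' | \
  sed -r 's/^[0-9]{1,3}\. //' | \
  sed -r 's/ \([a-z]\)//')"

# Two items of four lines each, separated by a single "ooo" line
printf '%s\n' "${cleaned}"
```

Note that grep only prints the separator between groups that are not adjacent, so there has to be at least one skipped line between one item's context and the next title.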

The second step splits the file on the “ooo” separator and removes the trailing split file (that contains nothing of interest).

cd "${tmpdir}" && csplit -k "${tmpfile}" '/^ooo/' "{$(grep -c 'ooo' "${tmpfile}")}" 2>/dev/null
/bin/rm -f "$(ls | sort -V | tail -1)"
Sample output
[root@ncc1711:/tmp/tmp.2oktVowKGn] # ls
xx00  xx04  xx08  xx12  xx16  xx20  xx24  xx28  xx32  xx36  xx40  xx44  xx48  xx52  xx56  xx60  xx64  xx68  xx72  xx76  xx80  xx84  xx88  xx92  xx96
xx01  xx05  xx09  xx13  xx17  xx21  xx25  xx29  xx33  xx37  xx41  xx45  xx49  xx53  xx57  xx61  xx65  xx69  xx73  xx77  xx81  xx85  xx89  xx93  xx97
xx02  xx06  xx10  xx14  xx18  xx22  xx26  xx30  xx34  xx38  xx42  xx46  xx50  xx54  xx58  xx62  xx66  xx70  xx74  xx78  xx82  xx86  xx90  xx94  xx98
xx03  xx07  xx11  xx15  xx19  xx23  xx27  xx31  xx35  xx39  xx43  xx47  xx51  xx55  xx59  xx63  xx67  xx71  xx75  xx79  xx83  xx87  xx91  xx95  xx99

[root@ncc1711:/tmp/tmp.2oktVowKGn] # cat xx04

72kcal, $1.98, per 100g

Used in folk medicine and as a vegetable, studies suggest burdock can
aid fat loss and limit inflammation.
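If you want to play with the splitting on its own, the sketch below does it on a tiny made-up file. It is a variation rather than the command above: instead of counting the separators with grep, it uses GNU csplit's `{*}` repeat count, which repeats the pattern for as long as it keeps matching. The file name `items.txt` and the variable names are assumptions.

```shell
workdir="$(mktemp -d)"
cd "${workdir}" || exit 1

# Hypothetical three-item file using the same "ooo" separator
printf '%s\n' 'SWEET POTATO' '86kcal' 'ooo' 'FIGS' '249kcal' 'ooo' 'GINGER' '80kcal' > items.txt

# Start a new xxNN file at every line beginning with "ooo"
# -s: do not print byte counts, -z: skip empty output files
csplit -s -z items.txt '/^ooo/' '{*}'

# Count the pieces without parsing ls
set -- xx*
chunk_count=$#

# Every piece after the first begins with the separator line itself
second_item="$(grep -v '^ooo' xx01 | head -1)"
echo "chunks: ${chunk_count}, first line of xx01: ${second_item}"

cd - >/dev/null || exit 1
rm -rf "${workdir}"
```

Since every piece except xx00 starts with the “ooo” line, anything that reads the pieces has to filter that line out.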


Finally, we read each xx* file and extract the five elements: item name, nutrition score, calories, unit price, and description. The primary approach is to use grep -P (Perl-compatible regex) with lookahead and lookbehind assertions, which check the surrounding text without including it in the match. You could also head/tail specific line numbers, since their position is constant, but then you would still need to parse those lines. We also prepend the column headers to the output.
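Here is a minimal sketch of the lookaround idea in isolation, run against one line from the sample output. The variable names are mine, and it assumes GNU grep built with PCRE support. It also shows a quoting detail worth knowing: inside double quotes the shell strips the backslash from `\$`, and a bare `$` in PCRE is an end-of-line anchor, not a dollar sign.

```shell
line='86kcal, $0.21, per 100g'

# Lookahead: digits at the start of the line, only if "kcal" follows
calories="$(printf '%s\n' "${line}" | grep -oP '^[0-9]+(?=kcal)')"

# Lookbehind + lookahead: a decimal number preceded by "$" and followed by ","
price="$(printf '%s\n' "${line}" | grep -oP '(?<=\$)[0-9]+\.[0-9]+(?=,)')"

echo "calories=${calories} price=${price}"

# Quoting matters: compare what grep would actually receive
dq="(?<=\$)"   # double quotes: the backslash is removed by the shell
sq='(?<=\$)'   # single quotes: passed through unchanged
echo "double-quoted: ${dq}"
echo "single-quoted: ${sq}"
```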

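ls xx* | while read -r f; do
# Item name: first line of the chunk, converted to title case (GNU sed)
item="$(grep -vi ooo "${f}" | head -1 | sed 's/.*/\L&/; s/[a-z]*/\u&/g')"
# Score: last field of the last line
score="$(tail -1 "${f}" | awk '{print $NF}')"
# Calories: leading digits immediately followed by "kcal"
calories="$(grep -oP '^[0-9]+(?=kcal)' "${f}")"
# Price: decimal number between "$" and ","; single quotes keep \$ literal
price="$(grep -oP '(?<=\$)[0-9]+\.[0-9]+(?=,)' "${f}")"
# Description: lines ending in a lower-case letter (plus optional "." or ","), joined with spaces
description="$(grep -Evi 'ooo|kcal' "${f}" | grep -oP '^.*[a-z][.,]?$' | paste -sd' ' -)"
echo "\"${item}\",\"${score}\",\"${calories}\",\"${price}\",\"${description}\""
done | \
(echo "\"ITEM\",\"NUTRITIONAL SCORE\",\"CALORIES/100G\",\"USD/100G\",\"DESCRIPTION\"" && cat) > ~/nutrition.csv
cd ~ && ls -als nutrition.csv
/bin/rm -rf "${tmpdir:-x}" "${tmpfile:-x}"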
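One thing the loop above does not handle is a double quote inside a field, which would break the CSV. A possible guard is sketched below: `csv_field` is a hypothetical helper (not part of the original script) that doubles any embedded quotes, which is the usual CSV convention. The sample description is made up to include straight quotes.

```shell
# csv_field: hypothetical helper that wraps its argument in double quotes
# and doubles any double quotes found inside it
csv_field() {
  printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# Made-up values for illustration
row="$(csv_field 'Apricots'),$(csv_field 'A "stone" fruit, high in sugar')"
echo "${row}"
```

Commas inside a field are already safe as long as the field is quoted, so only the quotes themselves need this treatment.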

The optional step is to convert the CSV to XLSX with unoconv.

unoconv -f xlsx -d spreadsheet -o ~/nutrition.xlsx ~/nutrition.csv
file ~/nutrition.xlsx

And here’s the end result.


ITEM | NUTRITIONAL SCORE | CALORIES/100G | USD/100G | DESCRIPTION
Sweet Potato | 49 | 86 | 0.21 | A bright orange tuber, sweet potatoes are only distantly related to potatoes. They are rich in beta-carotene.
Figs | 49 | 249 | 0.81 | Figs have been cultivated since ancient times. Eaten fresh or dried, they are rich in the mineral manganese.
Ginger | 49 | 80 | 0.85 | Ginger contains high levels of antioxidants. In medicine, it is used as a digestive stimulant and to treat colds.
Pumpkin | 50 | 26 | 0.2 | Pumpkins are rich in yellow and orange pigments. Especially xanthophyll esters and beta-carotene.
Burdock Root | 50 | 72 | 1.98 | Used in folk medicine and as a vegetable, studies suggest burdock can aid fat loss and limit inflammation.
Brussels Sprouts | 50 | 43 | 0.35 | A type of cabbage. Brussels sprouts originated in Brussels in the 1500s. They are rich in calcium and vitamin C.
Broccoli | 50 | 34 | 0.42 | Broccoli heads consist of immature flower buds and stems. US consumption has risen five-fold in 50 years.
Cauliflower | 50 | 31 | 0.44 | Unlike broccoli, cauliflower heads are degenerate shoot tips that are frequently white, lacking green chlorophyll.
Water Chestnuts | 50 | 97 | 1.5 | The water chestnut is not a nut at all, but an aquatic vegetable that grows in mud underwater within marshes.
Cantaloupe Melons | 50 | 34 | 0.27 | One of the foods richest in glutathione, an antioxidant that protects cells from toxins including free radicals.
Prunes | 50 | 240 | 0.44 | Dried plums are very rich in health-promoting nutrients such as antioxidants and anthocyanins.
Common Octopus | 50 | 82 | 1.5 | Though nutritious, recent evidence suggests octopus can carry harmful shellfish toxins and allergens.
Carrots | 51 | 36 | 0.4 | Carrots first appeared in Afghanistan 1,100 years ago. Orange carrots were grown in Europe in the 1500s.
Winter Squash | 51 | 34 | 0.24 | Unlike summer squashes, winter squashes are eaten in the mature fruit stage. The hard rind is usually not eaten.
Jalapeno Peppers | 51 | 29 | 0.66 | The same species as other peppers. Carotenoid levels are 35 times higher in red jalapenos that have ripened.
Rhubarb | 51 | 21 | 1.47 | Rhubarb is rich in minerals, vitamins, fibre and natural phytochemicals that have a role in maintaining health.
Pomegranates | 51 | 83 | 1.31 | Their red and purple colour is produced by anthocyanins that have antioxidant and anti-inflammatory properties.
Red Currants | 51 | 56 | 0.44 | Red currants are also rich in anthocyanins. White currants are the same species as red, whereas black currants differ.
Oranges | 51 | 46 | 0.37 | Most citrus fruits grown worldwide are oranges. In many varieties, acidity declines with fruit ripeness.
Carp | 51 | 127 | 1.4 | A high proportion of carp is protein, around 18%. Just under 6% is fat, and the fish contains zero sugar.
Hubbard Squash | 52 | 40 | 8.77 | A variety of the species Cucurbita maxim. Tear-drop shaped, they are often cooked in lieu of pumpkins.
Kumquats | 52 | 71 | 0.69 | An unusual citrus fruit, kumquats lack a pith inside and their tender rind is not separate like an orange peel.
Pompano | 52 | 164 | 1.44 | Often called jacks, Florida pompanos are frequently-caught western Atlantic fish usually weighing under 2kg.
Pink Salmon | 52 | 127 | 1.19 | These fish are rich in long-chain fatty acids, such as omega-3s, that improve blood cholesterol levels.
Sour Cherries | 53 | 50 | 0.58 | Sour cherries (Prunus cerasus) are a different species to sweet cherries (P. avium). Usually processed or frozen.
Rainbow Trout | 53 | 141 | 3.08 | Closely related to salmon, rainbow trout are medium-sized Pacific fish also rich in omega-3s.
Perch | 53 | 91 | 1.54 | Pregnant and lactating women are advised not to eat perch. Though nutritious, it may contain traces of mercury.
Green Beans | 54 | 31 | 0.28 | Green beans, known as string, snap or French beans, are rich in saponins, thought to reduce cholesterol levels.
Red Leaf Lettuce | 54 | 16 | 1.55 | Evidence suggests lettuce was cultivated before 4500 BC. It contains almost no fat or sugar and is high in calcium.
Leeks | 54 | 61 | 1.83 | Leeks are closely related to onions, shallots, chives and garlic. Their wild ancestor grows around the Mediterranean basin.
Cayenne Pepper | 54 | 318 | 22.19 | Powdered cayenne pepper is produced from a unique cultivar of the pepper species Capsicum annuum.
Green Kiwifruit | 54 | 61 | 0.22 | Kiwifruit are native to China. Missionaries took them to New Zealand in the early 1900s, where they were domesticated.
Golden Kiwifruit | 54 | 63 | 0.22 | Kiwifruits are edible berries rich in potassium and magnesium. Some golden kiwifruits have a red centre.
Grapefruit | 54 | 32 | 0.27 | Grapefruits (Citrus paradisi) originated in the West Indies as a hybrid of the larger pomelo fruit.
Mackerel | 54 | 139 | 2.94 | An oily fish, one serving can provide over 10 times more beneficial fatty acids than a serving of a lean fish such as cod.
Sockeye Salmon | 54 | 131 | 3.51 | Another oily fish, rich in cholesterol-lowering fatty acids. Canned salmon with bones is a source of calcium.
Arugula | 55 | 25 | 0.48 | A salad leaf, known as rocket. High levels of glucosinolates protect against cancer and cardiovascular disease.
Chives | 55 | 25 | 0.22 | Though low in energy, chives are high in vitamins A and K. The green leaves contain a range of beneficial antioxidants.
Paprika | 55 | 282 | 1.54 | Also extracted from the pepper species Capsicum annuum. A spice rich in ascorbic acid, an antioxidant.
Red Tomatoes | 56 | 18 | 0.15 | A low-energy, nutrient-dense food that are an excellent source of folate, potassium and vitamins A, C and E.
Green Tomatoes | 56 | 23 | 0.33 | Fruit that has not yet ripened or turned red. Consumption of tomatoes is associated with a decreased cancer risk.
Green Lettuce | 56 | 15 | 1.55 | The cultivated lettuce (Lactuca sativa) is related to wild lettuce (L. serriola), a common weed in the US.
Taro Leaves | 56 | 42 | 2.19 | Young taro leaves are relatively high in protein, containing more than the commonly eaten taro root.
Lima Beans | 56 | 106 | 0.5 | Also known as butter beans, lima beans are high in carbohydrate, protein and manganese, while low in fat.
Eel | 56 | 184 | 2.43 | A good source of riboflavin (vitamin B2), though the skin mucus of eels can contain harmful marine toxins.
Bluefin Tuna | 56 | 144 | 2.13 | A large fish, rich in omega-3s. Pregnant women are advised to limit their intake, due to mercury contamination.
Coho Salmon | 56 | 146 | 0.86 | A Pacific species also known as silver salmon. Relatively high levels of fat, as well as long-chain fatty acids.
Summer Squash | 57 | 17 | 0.22 | Harvested when immature, while the rind is still tender and edible. Its name refers to its short storage life.
Navy Beans | 57 | 337 | 0.49 | Also known as haricot or pea beans. The fibre in navy beans has been correlated with the reduction of colon cancer.
Plantain | 57 | 122 | 0.38 | Banana fruits with a variety of antioxidant, antimicrobial, hypoglycaemic and anti-diabetic properties.
Podded Peas | 58 | 42 | 0.62 | Peas are an excellent source of protein, carbohydrates, dietary fibre, minerals and water-soluble vitamins.
Cowpeas | 58 | 44 | 0.68 | Also called black-eyed peas. As with many legumes, high in carbohydrate, containing more protein than cereals.
Butter Lettuce | 58 | 13 | 0.39 | Also known as butterhead lettuce, and including Boston and bib varieties. Few calories. Popular in Europe.
Red Cherries | 58 | 50 | 0.33 | A raw, unprocessed and unfrozen variety of sour cherries (Prunus cerasus). Native to Europe and Asia.
Walnuts | 58 | 619 | 3.08 | Walnuts contain sizeable proportions of a-linolenic acid, the healthy omega-3 fatty acid made by plants.
Fresh Spinach | 59 | 23 | 0.52 | Contains more minerals and vitamins (especially vitamin A, calcium, phosphorus and iron) than many salad crops. Spinach appears twice in the list (45 and 24) because the way it is prepared affects its nutritional value. Fresh spinach can lose nutritional value if stored at room temperature, and ranks lower than eating spinach that has been frozen, for instance.
Parsley | 59 | 36 | 0.26 | A relative of celery, parsley was popular in Greek and Roman times. High levels of a range of beneficial minerals.
Herring | 59 | 158 | 0.65 | An Atlantic fish, among the top five most caught of all species. Rich in omega-3s, long-chain fatty acids.
Sea Bass | 59 | 97 | 1.98 | A generic name for a number of related medium-sized oily fish species. Popular in the Mediterranean area.
Chinese Cabbage | 60 | 13 | 0.11 | Variants of the cabbage species Brassica rapa, often called pak-choi or Chinese mustard. Low calorie.
Cress | 60 | 32 | 4.49 | The brassica Lepidium sativum, not to be confused with watercress Nasturtium officinale. High in iron.
Apricots | 60 | 48 | 0.36 | A ’stone’ fruit relatively high in sugar, phytoestrogens and antioxidants, including the carotenoid beta-carotene.
Fish Roe | 60 | 134 | 0.17 | Fish eggs (roe) contain high levels of vitamin B-12 and omega-3 fatty acids. Caviar often refers to sturgeon roe.
Whitefish | 60 | 134 | 3.67 | Species of oily freshwater fish related to salmon. Common in the northern hemisphere. Rich in omega-3s.
Coriander | 61 | 23 | 7.63 | A herb rich in carotenoids, used to treat ills including digestive complaints, coughs, chest pains and fever.
Romaine Lettuce | 61 | 17 | 1.55 | Also known as cos lettuce, another variety of Lactuca sativa. The fresher the leaves, the more nutritious they are.
Mustard Leaves | 61 | 27 | 0.29 | One of the oldest recorded spices. Contains sinigrin, a chemical thought to protect against inflammation.
Atlantic Cod | 61 | 82 | 3.18 | A large white, low fat, protein-rich fish. Cod livers are a source of fish oil rich in fatty acids and vitamin D.
Whiting | 61 | 90 | 0.6 | Various species, but often referring to the North Atlantic fish Merlangius merlangus that is related to cod.
Kale | 62 | 49 | 0.62 | A leafy salad plant, rich in the minerals phosphorous, iron and calcium, and vitamins such as A and C.
Broccoli Raab | 62 | 22 | 0.66 | Not to be confused with broccoli. It has thinner stems and smaller flowers, and is related to turnips.
Chili Peppers | 62 | 324 | 1.2 | The pungent fruits of the Capsicum plant. Rich in capsaicinoid, carotenoid and ascorbic acid antioxidants.
Clams | 62 | 86 | 1.78 | Lean, protein-rich shellfish. Often eaten lightly cooked, though care must be taken to avoid food poisoning.
Collards | 63 | 32 | 0.74 | Another salad leaf belonging to the Brassica genus of plants. A headless cabbage closely related to kale.
Basil | 63 | 23 | 2.31 | A spicy, sweet herb traditionally used to protect the heart. Thought to be an antifungal and antibacterial.
Chili Powder | 63 | 282 | 5.63 | A source of phytochemicals such as vitamin C, E and A, as well as phenolic compounds and carotenoids.
Frozen Spinach | 64 | 29 | 1.35 | A salad crop especially high in magnesium, folate, vitamin A and the carotenoids beta carotene and zeazanthin. Freezing spinach helps prevent the nutrients within from degrading, which is why frozen spinach ranks higher than fresh spinach.
Dandelion Greens | 64 | 45 | 0.27 | The word dandelion means lion’s tooth. The leaves are an excellent source of vitamin A, vitamin C and calcium.
Pink Grapefruit | 64 | 42 | 0.27 | The red flesh of pink varieties is due to the accumulation of carotenoid and lycopene pigments.
Scallops | 64 | 69 | 4.19 | A shellfish low in fat, high in protein, fatty acids, potassium and sodium.
Pacific Cod | 64 | 72 | 3.18 | Closely related to Atlantic cod. Its livers are a significant source of fish oil rich in fatty acids and vitamin D.
Red Cabbage | 65 | 31 | 0.12 | Rich in vitamins. Its wild cabbage ancestor was a seaside plant of European or Mediterranean origin.
Green Onion | 65 | 27 | 0.51 | Known as spring onions. High in copper, phosphorous and magnesium. One of the richest sources of vitamin K.
Alaska Pollock | 65 | 92 | 3.67 | Also called walleye pollock, the species Gadus chalcogrammus is usually caught in the Bering Sea and Gulf of Alaska. A low fat content of less than 1%.
Pike | 65 | 88 | 3.67 | A fast freshwater predatory fish. Nutritious but pregnant women must avoid, due to mercury contamination.
Green Peas | 67 | 77 | 1.39 | Individual green peas contain high levels of phosphorous, magnesium, iron, zinc, copper and dietary fibre.
Tangerines | 67 | 53 | 0.29 | An oblate orange citrus fruit. High in sugar and the carotenoid cryptoxanthin, a precursor to vitamin A.
Watercress | 68 | 11 | 3.47 | Unique among vegetables, it grows in flowing water as a wild plant. Traditionally eaten to treat mineral deficiency.
Celery Flakes | 68 | 319 | 6.1 | Celery that is dried and flaked to use as a condiment. An important source of vitamins, minerals and amino acids.
Dried Parsley | 69 | 292 | 12.46 | Parsley that is dried and ground to use as a spice. High in boron, fluoride and calcium for healthy bones and teeth.
Snapper | 69 | 100 | 3.75 | A family of mainly marine fish, with red snapper the best known. Nutritious but can carry dangerous toxins.
Beet Greens | 70 | 22 | 0.48 | The leaves of beetroot vegetables. High in calcium, iron, vitamin K and B group vitamins (especially riboflavin).
Pork Fat | 73 | 632 | 0.95 | A good source of B vitamins and minerals. Pork fat is more unsaturated and healthier than lamb or beef fat.
Swiss Chard | 78 | 19 | 0.29 | A very rare dietary source of betalains, phytochemicals thought to have antioxidant and other health properties.
Pumpkin Seeds | 84 | 559 | 1.6 | Including the seeds of other squashes. One of the richest plant-based sources of iron and manganese.
Chia Seeds | 85 | 486 | 1.76 | Tiny black seeds that contain high amounts of dietary fibre, protein, a-linolenic acid, phenolic acid and vitamins.
Flatfish | 88 | 70 | 1.15 | Sole and flounder species. Generally free from mercury and a good source of the essential nutrient vitamin B1.
Ocean Perch | 89 | 79 | 0.82 | The Atlantic species. A deep-water fish sometimes called rockfish. High in protein, low in saturated fats.
Cherimoya | 96 | 75 | 1.84 | Cherimoya fruit is fleshy and sweet with a white pulp. Rich in sugar and vitamins A, C, B1, B2 and potassium.
Almonds | 97 | 579 | 0.91 | Rich in mono-unsaturated fatty acids. Promote cardiovascular health and may help with diabetes.